Re:source

Mastering Large Models 01 - Can Large Models Hit the Cache? From KV Cache to Prompt Cache

Introduction

When using large models, besides evaluating the model’s performance, price (cost) is also a crucial factor. Have you noticed that in the pricing of large model APIs, input costs are divided into two categories, cache hit and cache miss? When the cache is hit, the price is lower. Moreover, hitting the cache also reduces the overall latency. How does this work?

How to Hit the Cache?

The following uses OpenAI’s mechanism as an example. Different vendors may have different implementations, but they are generally similar.

Cache hits are based on exact prefix matching of prompts. Therefore, to hit the cache, you need to place static content, such as instructions and examples, at the beginning of the prompt.

There are also some conditions and limitations on the cache. The prompt must contain at least 1024 tokens, and cache hits are counted in increments of 128 tokens. A cached prefix typically stays valid for only 5-10 minutes of inactivity during peak hours, but during off-peak hours it can persist for up to an hour.
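
To make the 1024-token threshold and the 128-token increments concrete, here is a minimal sketch that estimates how much of a prompt could be served from the cache. It uses tiktoken's cl100k_base encoding as a stand-in for the model's real tokenizer, and the helper names are purely illustrative, not part of any SDK.

```python
# Sketch: estimate the cacheable share of a prompt under the rules above
# (exact token-prefix match, at least 1024 tokens, 128-token increments).
# Helper names are illustrative; cl100k_base stands in for the real tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def common_token_prefix(prompt_a: str, prompt_b: str) -> int:
    """Length of the shared token prefix between two prompts."""
    a, b = enc.encode(prompt_a), enc.encode(prompt_b)
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cacheable_tokens(shared_prefix_len: int) -> int:
    """Tokens eligible for a cache hit: 0 below 1024, otherwise the
    prefix length rounded down to a multiple of 128."""
    if shared_prefix_len < 1024:
        return 0
    return (shared_prefix_len // 128) * 128

static_part = "You are a support assistant. Follow the style guide below.\n" * 100
req_1 = static_part + "User question: how do I reset my password?"
req_2 = static_part + "User question: how do I change my email address?"

shared = common_token_prefix(req_1, req_2)
print(f"shared prefix: {shared} tokens, cacheable: {cacheable_tokens(shared)} tokens")
```

Because both requests share the same long static prefix, the second one should have most of its input tokens billed at the cache-hit rate.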

To take advantage of the prompt cache, follow these guidelines when writing prompts: place static content at the beginning (role specifications, fixed output formats, instructions, few-shot examples, and so on) and dynamic content at the end (such as the user's question). On the application side, monitor metrics such as the cache hit rate to keep optimizing your prompts.
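
As a hedged example of what this looks like with the official openai Python SDK (the model name and prompt text are placeholders): the static system message goes first so that repeated requests share the same prefix, and the usage block of the response is inspected to track cached input tokens. The prompt_tokens_details.cached_tokens field reflects current OpenAI responses; other vendors and older SDK versions may expose it differently.

```python
# Sketch: static instructions first, dynamic question last, then read the
# usage details to monitor cache hits. The model name is just an example.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = (
    "You are a customer-support assistant.\n"
    "Always answer in JSON with the keys 'answer' and 'confidence'.\n"
    "Here are examples of good answers:\n"
    "..."  # long, unchanging few-shot examples go here
)

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # static, first
            {"role": "user", "content": question},                # dynamic, last
        ],
    )
    usage = resp.usage
    # Cached input tokens are reported under prompt_tokens_details in
    # recent API versions; log them to track your cache hit rate.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {usage.prompt_tokens}, cached: {cached}")
    return resp.choices[0].message.content
```

The first call after any change to STATIC_SYSTEM_PROMPT will miss the cache; subsequent calls within the cache lifetime should report a non-zero cached count.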

How Does Prompt Cache Work?

Note: This section requires some understanding of attention mechanisms.

Prompt Cache was first proposed in the 2023 paper Prompt Cache: Modular Attention Reuse for Low-Latency Inference. Essentially, it trades space for time: by spending memory to store intermediate results, it improves inference speed and reduces latency without compromising output quality.

In short, Prompt Cache stores the attention states (the key and value tensors) computed for frequently occurring text segments during inference. The next time the same segment appears, the stored attention states are reused directly instead of being recomputed.

As is well known, LLMs generate text autoregressively: each newly generated token is appended to the input, and the model then predicts the next token, repeating until completion. Naively, every step would recompute the attention states for all previous tokens. Could these attention states be stored once and reused when generating the next token? Yes, this is the idea behind the KV Cache.
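
To make this concrete, here is a toy NumPy sketch of one attention head during decoding (made-up dimensions, no batching or masking): each step computes the key and value only for the newest token and appends them to the cache, while the keys and values of earlier tokens are simply reused.

```python
# Sketch: single-head attention decoding with a KV cache (NumPy, toy sizes).
# Without the cache, every step would recompute K and V for all previous
# tokens; with it, each step only computes K and V for the newest token.
import numpy as np

d = 16                       # head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []    # the KV cache: one entry per past token

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # only the new token's K and V are computed
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)    # (t, d): reused from all previous steps
    V = np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V          # attention output for the new token

for _ in range(5):           # pretend these are five decoding steps
    out = decode_step(rng.standard_normal(d))
print("output shape:", out.shape)   # (16,)
```

Without the cache, every step would have to rebuild K and V for the whole sequence, which is what makes long prompts expensive to re-process.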

However, the limitation of the KV Cache is that it only applies within a single sequence. Prompt Cache goes a step further: it decouples the attention states from any one sequence so that they can be reused by other sequences.
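
A minimal sketch of that idea, under heavy simplification: requests are keyed by their exact token prefix, the states for a shared prefix are computed once, and later requests only prefill the suffix. The whitespace tokenizer and the numeric "states" here are stand-ins; a real inference engine would store per-layer key/value tensors.

```python
# Sketch: a cross-request prefix cache. Attention states for a shared token
# prefix are computed once and reused by later requests, so only the suffix
# needs a fresh prefill. All values here are toy placeholders.
import hashlib

prefix_store = {}                        # prefix fingerprint -> attention states

def fingerprint(tokens):
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def compute_states(tokens):
    """Stand-in for the expensive prefill over `tokens`."""
    return [hash(t) % 997 for t in tokens]   # toy per-token "attention state"

def prefill(tokens, static_prefix_len):
    key = fingerprint(tokens[:static_prefix_len])
    cached = prefix_store.get(key)
    if cached is None:                                  # cache miss: compute and store
        cached = compute_states(tokens[:static_prefix_len])
        prefix_store[key] = cached
    suffix_states = compute_states(tokens[static_prefix_len:])  # only the new part
    return cached + suffix_states

# Two requests that share the same static instructions:
instructions = "You are a helpful assistant . Answer briefly .".split()
req_a = instructions + "What is a KV cache ?".split()
req_b = instructions + "What is a prompt cache ?".split()

prefill(req_a, len(instructions))        # computes and stores the prefix states
prefill(req_b, len(instructions))        # reuses them; only the suffix is prefilled
print("cached prefixes stored:", len(prefix_store))   # 1
```

Changing even one token inside the shared prefix changes the fingerprint and forces a full recomputation.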

It is precisely because of this working mechanism that prompt caching requires exact prefix matching: with causal attention, a token's keys and values depend only on the tokens that precede it, so stored states remain valid only as long as the prefix is identical token for token.

References

OpenAI Platform

DeepSeek API Innovates with Disk Cache, Reducing Prices by an Order of Magnitude | DeepSeek API Docs

AI Inference Acceleration Tool: Prompt Cache Technology Explained

How does Prompt Caching work?

How Prompt caching works?