Prompt Caching Best Practices: Cut Token Costs Without Sacrificing Quality
Prompt caching reduces LLM API costs by serving repeated prompt prefixes from a pre-computed KV cache instead of re-processing them on every request. For applications with large, stable system prompts — a common architecture in RAG systems, document analysis tools, and AI assistants — caching can reduce costs by 60-80% on the cacheable portion of requests.
How Prompt Caching Works
Modern transformers process text by computing key-value pairs at each layer of the attention mechanism. This computation is the primary cost of prompt processing. Caching stores these KV pairs for a prompt prefix and reuses them when the same prefix appears in subsequent requests, eliminating the recomputation cost. The model still generates the completion fresh — only the fixed prefix computation is cached.
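The mechanics can be illustrated with a toy prefix cache. This is purely a conceptual sketch in application-level Python, with invented names throughout; real KV caches live inside the inference engine and operate on attention tensors, not strings.

```python
class ToyPrefixCache:
    def __init__(self):
        self.store = {}        # prompt prefix (tuple of tokens) -> KV list
        self.compute_calls = 0

    def _compute_kv(self, token):
        self.compute_calls += 1            # stands in for the attention math
        return f"kv({token})"

    def process(self, tokens):
        # Reuse the longest already-cached prefix of this prompt.
        kv, done = [], 0
        for j in range(len(tokens), 0, -1):
            if tuple(tokens[:j]) in self.store:
                kv, done = list(self.store[tuple(tokens[:j])]), j
                break
        # Only the uncached suffix pays the computation cost.
        for token in tokens[done:]:
            kv.append(self._compute_kv(token))
        # Cache every prefix so later prompts can share the stable part.
        for j in range(1, len(tokens) + 1):
            self.store[tuple(tokens[:j])] = kv[:j]
        return kv

cache = ToyPrefixCache()
cache.process(["system", "tools", "question-1"])  # cold: computes 3 tokens
cache.process(["system", "tools", "question-2"])  # warm: computes only 1
print(cache.compute_calls)  # 4
```

The second request pays only for the token that differs from the first, which is exactly the economics the providers expose through discounted cache-read pricing.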
Structuring Prompts for Caching
The most important rule is to put the stable content first. System instructions, tool definitions, RAG context, and any other content that remains constant across requests should appear at the beginning of the prompt, before any dynamic content like the user message. This maximizes the length of the cacheable prefix. Dynamic content at the beginning invalidates the cache for every request.
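The stable-first rule can be sketched as a request builder. The shape below follows Anthropic's Messages API (a system block carrying a cache_control marker); verify field names and the model identifier against current documentation, as both are assumptions here.

```python
# Request body structured for caching: the stable system prompt comes
# first and carries the cache_control marker; only the user message
# varies per request. Field names follow Anthropic's Messages API shape.

STABLE_SYSTEM = "You are a contract-review assistant. ..."  # large, fixed text

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",      # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM,
                # marks the end of the cacheable prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # dynamic content goes last so it never invalidates the prefix
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Summarize clause 4.2 of the attached contract.")
```

Note what is deliberately absent from the cached portion: timestamps, request IDs, or anything else that changes per request, since a single differing byte in the prefix forces a cache write instead of a read.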
Cache TTL Considerations
Anthropic's cache has a 5-minute TTL that refreshes each time the cached prefix is read, with an extended 1-hour option available at a higher cache-write cost. OpenAI's prompt caching is automatic, and cached prefixes are typically evicted after a few minutes of inactivity. Understanding the TTL is critical for measuring cache effectiveness: if the interval between requests sharing a prefix exceeds the TTL, you will see lower hit rates than expected. The solution is either to use the longer TTL where available, or to design workloads that batch requests within the TTL window.
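A back-of-the-envelope check makes the interval-versus-TTL trade-off concrete. The sketch below assumes a fixed-interval request stream and a TTL that refreshes on each cache read; both the function and its simplifying assumptions are illustrative, not a provider guarantee.

```python
# Estimate cache hits for a stream of requests sharing one prefix,
# assuming evenly spaced requests and a refresh-on-read TTL.

def expected_hits(request_interval_s: float, ttl_s: float, n_requests: int) -> int:
    """Count cache hits for n_requests spaced request_interval_s apart."""
    if n_requests <= 0:
        return 0
    # The first request always writes the cache. With a refresh-on-read
    # TTL, every later request hits iff the interval is within the TTL.
    return (n_requests - 1) if request_interval_s <= ttl_s else 0

print(expected_hits(60, 300, 10))   # 1-minute interval, 5-minute TTL: 9 hits
print(expected_hits(600, 300, 10))  # 10-minute interval: every request misses
```

The cliff at the TTL boundary is why batching helps: ten requests fired within five minutes get nine cache reads, while the same ten requests spread over an hour and a half get none.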
Measuring Cache Effectiveness
Both Anthropic and OpenAI report cache statistics in the usage field of the completion response, though the field names differ. Anthropic returns cache_creation_input_tokens and cache_read_input_tokens; OpenAI reports cached tokens under prompt_tokens_details.cached_tokens. A healthy caching configuration shows reads significantly exceeding writes over time (for Anthropic, cache_read_input_tokens well above cache_creation_input_tokens), meaning the cache is reused far more often than it is rebuilt.
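A hit-rate metric can be aggregated from those fields directly. The sketch below uses Anthropic's field names on plain dicts; the example usage records are fabricated for illustration, and other providers need a different accessor.

```python
# Aggregate a cache hit rate from Anthropic-style usage records:
# the fraction of cache-eligible input tokens served from cache.

def cache_hit_rate(usages: list[dict]) -> float:
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = created + read
    return read / total if total else 0.0

# Example: one cache write followed by four reads of a 2,000-token prefix.
usages = [{"cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0}] + \
         [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}] * 4

print(f"{cache_hit_rate(usages):.0%}")  # 80%
```

Tracking this ratio over time also surfaces regressions: a sudden drop usually means something dynamic crept into the prefix or request intervals drifted past the TTL.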