How Prompt Caching Can Cut Your AI API Costs by 90%
A deep dive into prompt caching from Anthropic, OpenAI, and Google — how it works, how much it saves, and best practices for maximizing cache hit rates.
What is prompt caching?
Prompt caching is a technique that avoids re-processing the same token sequences on every API call. When you send a prompt with a prefix that matches a recently cached prompt, the provider skips processing those tokens and charges a reduced rate — typically 75–90% less.
For applications with long system prompts, few-shot examples, or repeated context blocks, caching can dramatically reduce your input token costs without any changes to your application logic.
Which providers support prompt caching?
Not all providers offer built-in caching, and the implementations differ:
- Anthropic — Native prompt caching with a 90% discount on cached input tokens (cache writes are billed at 1.25× the standard input rate, a small overhead recouped after the first hit). Minimum cacheable prefix is 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku). Cache TTL is 5 minutes, refreshed on each hit.
- OpenAI — Automatic prompt caching on prompts longer than 1,024 tokens. Cached tokens are charged at 50% of the standard input rate. No manual configuration needed.
- Google Gemini — Context caching for prompts above a 32,768-token minimum. Useful for very long contexts such as entire documents or codebases, but less practical for typical API calls.
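To make the Anthropic case concrete, here is a minimal sketch of how a request marks its system prompt as cacheable. The structure shown (a `cache_control` block of type `ephemeral` on a system-content block) matches Anthropic's Messages API; the model name and prompt text are placeholders, and the actual client call is left commented out.

```python
# Sketch: marking a long system prompt as cacheable with Anthropic's
# Messages API. "claude-3-5-sonnet-latest" and the prompt text are
# placeholders; the client call is commented out for illustration.
LONG_SYSTEM_PROMPT = "You are a support assistant for Acme Inc. " * 200
# (repetition just to clear the 1,024-token minimum in this sketch)

request = {
    "model": "claude-3-5-sonnet-latest",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block becomes the
            # cached prefix; subsequent identical requests hit the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
# response = anthropic.Anthropic().messages.create(**request)
```

OpenAI's caching, by contrast, requires no request changes at all: prompts past the 1,024-token threshold are cached automatically on the server side.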
How much can caching save you?
The savings depend on your prompt structure and request volume. Here's a realistic example:
Say you have a 3,000-token system prompt and send 10,000 requests per day using Claude 3.5 Sonnet ($3 per million input tokens). Without caching, that's 30M input tokens just for system prompts, or $90/day. With Anthropic's 90% cache-read discount, the same workload costs $9/day. That's roughly $2,430 saved per month from one change, assuming the request rate keeps the cache warm so nearly every request is a hit.
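The arithmetic above is easy to adapt to your own numbers. A small sketch, using the same figures (rates are per million input tokens; the 90% figure is Anthropic's cache-read discount):

```python
# Worked cost comparison for a cached vs. uncached system prompt.
PRICE_PER_MTOK = 3.00        # Claude 3.5 Sonnet input rate, $/1M tokens
CACHE_READ_DISCOUNT = 0.90   # cached tokens cost 10% of the base rate

system_tokens = 3_000
requests_per_day = 10_000

daily_tokens = system_tokens * requests_per_day             # 30M tokens/day
cost_uncached = daily_tokens / 1_000_000 * PRICE_PER_MTOK   # $90.00/day
cost_cached = cost_uncached * (1 - CACHE_READ_DISCOUNT)     # $9.00/day
monthly_savings = (cost_uncached - cost_cached) * 30        # $2,430.00

print(f"${cost_uncached:.2f}/day uncached, ${cost_cached:.2f}/day cached")
print(f"${monthly_savings:,.2f} saved per month")
```

Swap in your own prompt length, request volume, and provider rates to estimate the effect before restructuring any prompts.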
Best practices for maximizing cache hits
Caching works on prefix matching, so prompt structure matters:
- Put static content first — System prompt, instructions, and examples should come before the variable user input.
- Keep the cacheable prefix stable — Any change to the prefix (even whitespace) invalidates the cache.
- Maintain request frequency — Cache entries expire after a TTL (5 minutes for Anthropic). Low-volume endpoints may not benefit.
- Use consistent prompt templates — Templating ensures every request shares the same prefix structure.
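The practices above can be sketched as a simple template function: static content is concatenated identically on every request, and only the user input varies at the end. The prompt text here is a hypothetical example.

```python
# Sketch: a prompt template that keeps the cacheable prefix stable.
# Everything in STATIC_PREFIX is byte-identical across requests, so a
# provider's prefix match covers it; only the tail varies.
STATIC_PREFIX = (
    "You are a customer-support assistant for Acme Inc.\n"
    "Always answer in two sentences or fewer.\n\n"
    "Example:\n"
    "Q: How do I reset my password?\n"
    "A: Use the 'Forgot password' link on the sign-in page.\n"
)

def build_prompt(user_input: str) -> str:
    # Static content first, variable input last. Note even a stray
    # space or newline in STATIC_PREFIX would invalidate the cache.
    return f"{STATIC_PREFIX}\nUser: {user_input.strip()}"

p1 = build_prompt("Where is my order?")
p2 = build_prompt("Cancel my subscription")
assert p1.startswith(STATIC_PREFIX) and p2.startswith(STATIC_PREFIX)
```

Anything that changes per request (timestamps, request IDs, retrieved documents) belongs after the static prefix, or it will silently defeat the cache.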
Application-level caching for other providers
For providers without native caching, implement your own at the application level:
- Hash the full prompt (system + user input) as a cache key
- Store the response in Redis or a similar fast store with a TTL
- Return cached responses for identical prompts without making an API call
Because a cache hit skips the API call entirely, this approach eliminates both input and output token costs for repeat queries, not just the input portion, and it works across all providers. The tradeoff is that only exact-match prompts benefit.
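The recipe above can be sketched in a few lines. A plain dict stands in for Redis here to keep the example self-contained, and `call_llm` is a hypothetical stand-in for your provider call; in production you would use a shared store with a real TTL (e.g., Redis `SET` with `EX`).

```python
import hashlib
import time

# Sketch of application-level response caching. An in-memory dict
# substitutes for Redis; entries map key -> (stored_at, response).
CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300

def cache_key(system: str, user: str) -> str:
    # Hash the full prompt so keys stay small and uniform.
    return hashlib.sha256(f"{system}\x00{user}".encode()).hexdigest()

def cached_completion(system: str, user: str, call_llm) -> str:
    key = cache_key(system, user)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: no API call at all
    response = call_llm(system, user)      # cache miss: pay for the call
    CACHE[key] = (time.time(), response)
    return response

# Demo with a fake LLM that records how often it is actually called.
calls = []
fake_llm = lambda s, u: calls.append(u) or f"answer to {u}"
cached_completion("sys", "q1", fake_llm)
cached_completion("sys", "q1", fake_llm)   # identical prompt: served from cache
assert len(calls) == 1
```

Note that this only pays off when identical prompts recur; for per-user variable prompts, provider-side prefix caching remains the better fit.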
Track your cache performance
Caching is only valuable if it's actually hitting. Use MeterFox to monitor your per-model input token costs over time. A successful caching implementation should show a noticeable drop in input costs without any change in request volume. If input costs aren't dropping, your cache hit rate may be low and your prompt structure needs adjustment.
Start monitoring your API costs for free
Track spending across 15+ providers in one dashboard. No credit card required.
Get Started Free