Glossary

Prompt Caching

A technique where LLM providers store and reuse processed prompt prefixes to reduce both latency and costs for repeated or similar API requests.

Prompt caching is a powerful cost optimization technique that allows LLM providers to store and reuse the computed representations of prompt prefixes. When subsequent API requests share the same prefix, the provider can skip reprocessing those tokens, passing the savings on to you through significantly reduced input token pricing.

How Prompt Caching Works

When you send a prompt to an LLM API, the model processes each input token through its transformer layers to build internal representations (key-value caches). This computation is a major part of what you're paying for with input tokens.

Prompt caching works by:

  • First request: The full prompt is processed normally at standard input pricing. The provider stores the computed key-value cache for the prompt prefix.
  • Subsequent requests: If a new request shares the same prefix (system prompt, examples, etc.), the provider reuses the stored cache. Only the new, unique portion of the prompt needs processing.
  • Cache expiration: Cached prefixes expire after a provider-specific TTL (typically 5-60 minutes) if not refreshed by new requests.
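
The flow above can be sketched as a toy prefix cache with a TTL. This is purely illustrative: the function names are invented, token counts are faked with `len()`, and the 300-second TTL mirrors a typical 5-minute provider TTL rather than any specific API.

```python
import time

CACHE_TTL = 300  # seconds; mirrors a typical 5-minute provider TTL

_cache = {}  # prompt prefix -> timestamp of last use

def tokens_to_process(prefix, suffix, now=None):
    """Return how many tokens must be processed at full computation.

    A hit means the stored prefix is reused and only the suffix is
    processed; every request (hit or miss) refreshes the prefix's TTL.
    Token counts are faked with len() for illustration.
    """
    now = time.time() if now is None else now
    hit = prefix in _cache and now - _cache[prefix] < CACHE_TTL
    _cache[prefix] = now  # the first request writes the cache; later ones refresh it
    return len(suffix) if hit else len(prefix) + len(suffix)
```

With a 3,000-character prefix standing in for a 3,000-token system prompt, the first call processes everything, a call seconds later processes only the suffix, and a call after the TTL has lapsed pays full price again.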

Provider-Specific Implementations

    Anthropic Prompt Caching

    Anthropic offers the most aggressive caching discount at 90% off cached input tokens:

  • Regular input: $3.00/1M tokens (Sonnet), $0.80/1M (Haiku)
  • Cached input: $0.30/1M tokens (Sonnet), $0.08/1M (Haiku)
  • Cache write: $3.75/1M tokens (Sonnet) — a small premium on the first write
  • Cache TTL: 5 minutes (refreshed on each use)
  • Minimum cacheable: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku)

You explicitly control caching with cache_control breakpoints in your messages.
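
A sketch of the request shape with a cache breakpoint. Field names follow Anthropic's Messages API, but treat the details as illustrative and check the current documentation; LONG_SYSTEM_PROMPT is a stand-in.

```python
# Illustrative request body for Anthropic's Messages API.
# LONG_SYSTEM_PROMPT is a stand-in: a real prompt must meet the
# 1,024-token minimum (Sonnet/Opus) before it becomes cacheable.
LONG_SYSTEM_PROMPT = "You are a contract-review assistant. " * 200

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache breakpoint: everything up to and including this
            # block is cached; later calls sharing this prefix pay
            # the discounted cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize clause 4.2."}],
}
```

Placing the breakpoint at the end of the static system block means every subsequent call that reuses that block within the TTL reads it from cache.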

    OpenAI Automatic Caching

    OpenAI implements automatic prompt caching with a 50% discount:

  • Regular input: $2.50/1M tokens (GPT-4o)
  • Cached input: $1.25/1M tokens (GPT-4o)
  • Cache TTL: Automatic, managed by OpenAI
  • Minimum cacheable: 1,024 tokens

OpenAI's caching is automatic: no code changes are needed. The system detects repeated prefixes and caches them transparently.
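
Even though no code changes are needed, you can confirm hits by inspecting the usage block of each response. The cached_tokens field below follows OpenAI's documented chat completions response shape; verify against the current API reference.

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of this request's prompt tokens that hit the cache."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

# Example usage block, shaped like a chat.completions response:
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
```

Here half the prompt tokens were billed at the cached rate; a value of 0 on every request is a sign your prefixes are not being reused.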

    Google Gemini Context Caching

    Google offers explicit context caching for Gemini models with different pricing mechanics — a per-hour storage fee plus discounted input pricing.

    When to Use Prompt Caching

    Prompt caching is most effective when:

    1. System Prompts Are Long

    If your system prompt is 1,000+ tokens (common for agents with detailed instructions), caching it saves those tokens on every subsequent call within the TTL window.

    2. Few-Shot Examples Are Consistent

    Applications that include the same few-shot examples in every prompt can cache the example section.

    3. RAG with Repeated Context

    If multiple user queries reference the same documents, the document context can be cached.

    4. Agent Loops

    AI agents make multiple LLM calls per task, often with the same system prompt and growing context. Caching the static portions saves significantly.

    5. Multi-Turn Conversations

    Each new message in a conversation resends the full history. The prior messages (which don't change) can be cached.

    Calculating Prompt Caching Savings

    Scenario: An AI agent with a 3,000-token system prompt makes 20 LLM calls per task using Claude 3.5 Sonnet.

    Without caching:

  • System prompt cost per task: 20 calls × 3,000 tokens × $3.00/1M = $0.18

With caching:

  • First call (cache write): 3,000 × $3.75/1M = $0.011
  • Remaining 19 calls (cached): 19 × 3,000 × $0.30/1M = $0.017
  • Total: $0.028
  • Savings: 84% on system prompt tokens

At 10,000 tasks per month, that's $1,800 vs $280, saving $1,520 monthly just on system prompt tokens.
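
The arithmetic above, spelled out using the Sonnet rates quoted earlier in this article (the monthly figure differs slightly from $1,520 because the per-task costs above are rounded):

```python
CALLS = 20
PROMPT_TOKENS = 3_000
INPUT_RATE, WRITE_RATE, READ_RATE = 3.00, 3.75, 0.30  # $ per 1M tokens

def cost(tokens, rate_per_million):
    return tokens * rate_per_million / 1_000_000

without = cost(CALLS * PROMPT_TOKENS, INPUT_RATE)              # $0.18
with_cache = (cost(PROMPT_TOKENS, WRITE_RATE)                  # 1 cache write
              + cost((CALLS - 1) * PROMPT_TOKENS, READ_RATE))  # 19 cached reads
savings = 1 - with_cache / without                             # ~0.84
monthly = 10_000 * (without - with_cache)                      # ~$1,517 per month
```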

    Best Practices for Prompt Caching

    Structure Prompts for Cacheability

    Place static content at the beginning of prompts:

  • System instructions (static, cacheable)
  • Few-shot examples (static, cacheable)
  • Retrieved context (semi-static, potentially cacheable)
  • User query (dynamic, not cached)

Monitor Cache Hit Rates

    Track what percentage of input tokens are hitting cache. Low hit rates indicate:

  • Prompts aren't structured optimally
  • TTL is expiring between requests (low traffic)
  • Dynamic content is mixed into cacheable sections
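
A minimal aggregator for this metric, assuming each request log carries total and cached input-token counts (the field names here are hypothetical; map them to whatever your provider or logging layer reports):

```python
def fleet_cache_hit_rate(request_logs):
    """Share of all input tokens across requests that were served from cache."""
    total = sum(r["input_tokens"] for r in request_logs)
    cached = sum(r["cached_tokens"] for r in request_logs)
    return cached / total if total else 0.0

logs = [
    {"input_tokens": 3_200, "cached_tokens": 0},      # first call: cache write
    {"input_tokens": 3_200, "cached_tokens": 3_000},  # warm calls hit the prefix
    {"input_tokens": 3_200, "cached_tokens": 3_000},
]
```

Tracking this over time makes the failure modes above visible: a rate near zero during quiet hours suggests TTL expiry, while a persistently low rate points at prompt structure.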

Combine with Other Optimizations

    Prompt caching stacks with other cost optimizations:

  • Cached input + model routing = compound savings
  • Cached prompts reduce latency too (faster time-to-first-token)
  • Batch API discounts apply on top of cache discounts
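
As a rough illustration of stacking, assuming the batch discount applies multiplicatively to the already-discounted cached rate (check each provider's billing rules before relying on this):

```python
input_rate = 3.00                  # $/1M input tokens (Sonnet list price)
cached_rate = input_rate * 0.10    # 90% cache discount -> $0.30/1M
stacked_rate = cached_rate * 0.50  # 50% batch discount on top -> $0.15/1M
effective_discount = 1 - stacked_rate / input_rate  # 95% off list price
```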

🦞 How ClawHQ Helps

    ClawHQ tracks your prompt caching performance in real-time: cache hit rates, tokens saved, cost savings, and TTL utilization across all providers. Identify which prompts aren't being cached effectively and get recommendations for restructuring them. ClawHQ's analytics show you exactly how much caching saves compared to non-cached requests, helping you maximize this powerful optimization.

    Frequently Asked Questions

    What is prompt caching and how does it save money?

    Prompt caching stores the processed representations of repeated prompt prefixes so they don't need to be recomputed on subsequent requests. Anthropic offers 90% off cached tokens; OpenAI offers 50% off. This is most impactful for applications with long system prompts, few-shot examples, or agent loops.

    How much can prompt caching save?

    Savings depend on your prompt structure and request patterns. For applications with significant static prompt content (system prompts, examples), caching typically saves 40-85% on input token costs. AI agents with 20+ calls per task see the largest savings.

    Does prompt caching affect response quality?

    No — prompt caching is purely an optimization of the computation process. The model receives exactly the same input and produces the same quality output. It's a behind-the-scenes efficiency improvement with no impact on results.

    Which provider has the best prompt caching?

    Anthropic offers the most aggressive discount at 90% off cached input tokens (vs OpenAI's 50%). However, Anthropic requires explicit cache control markers while OpenAI's caching is automatic. The best choice depends on your willingness to modify code and the volume of cacheable content.
