Glossary

Prompt Caching

A technique where LLM providers store and reuse processed prompt prefixes to reduce both latency and costs for repeated or similar API requests.

Prompt caching is a powerful cost optimization technique that allows LLM providers to store and reuse the computed representations of prompt prefixes. When subsequent API requests share the same prefix, the provider can skip reprocessing those tokens, passing the savings on to you through significantly reduced input token pricing.

How Prompt Caching Works

When you send a prompt to an LLM API, the model processes each input token through its transformer layers to build internal representations (key-value caches). This computation is a major part of what you're paying for with input tokens.

Prompt caching works by:

  • First request: The full prompt is processed normally at standard input pricing. The provider stores the computed key-value cache for the prompt prefix.
  • Subsequent requests: If a new request shares the same prefix (system prompt, examples, etc.), the provider reuses the stored cache. Only the new, unique portion of the prompt needs processing.
  • Cache expiration: Cached prefixes expire after a provider-specific TTL (typically 5-60 minutes) if not refreshed by new requests.
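
The flow above can be sketched as a toy prefix cache with a TTL. This is purely illustrative: the function names are invented, token counts are faked with `len()`, and the 300-second TTL mirrors a typical 5-minute provider TTL rather than any specific API.

```python
import time

CACHE_TTL = 300  # seconds; mirrors a typical 5-minute provider TTL

_cache = {}  # prompt prefix -> timestamp of last use

def tokens_to_process(prefix, suffix, now=None):
    """Return how many tokens must be processed at full computation.

    A hit means the stored prefix is reused and only the suffix is
    processed; every request (hit or miss) refreshes the prefix's TTL.
    Token counts are faked with len() for illustration.
    """
    now = time.time() if now is None else now
    hit = prefix in _cache and now - _cache[prefix] < CACHE_TTL
    _cache[prefix] = now  # the first request writes the cache; later ones refresh it
    return len(suffix) if hit else len(prefix) + len(suffix)
```

With a 3,000-character prefix standing in for a 3,000-token system prompt, the first call processes everything, a call seconds later processes only the suffix, and a call after the TTL has lapsed pays full price again.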

Provider-Specific Implementations

    Anthropic Prompt Caching

    Anthropic offers the most aggressive caching discount at 90% off cached input tokens:

  • Regular input: $3.00/1M tokens (Sonnet), $0.80/1M (Haiku)
  • Cached input: $0.30/1M tokens (Sonnet), $0.08/1M (Haiku)
  • Cache write: $3.75/1M tokens (Sonnet) — a small premium on the first write
  • Cache TTL: 5 minutes (refreshed on each use)
  • Minimum cacheable: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku)

You explicitly control caching with cache_control breakpoints in your messages.
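
A sketch of the request shape with a cache breakpoint. Field names follow Anthropic's Messages API, but treat the details as illustrative and check the current documentation; LONG_SYSTEM_PROMPT is a stand-in.

```python
# Illustrative request body for Anthropic's Messages API.
# LONG_SYSTEM_PROMPT is a stand-in: a real prompt must meet the
# 1,024-token minimum (Sonnet/Opus) before it becomes cacheable.
LONG_SYSTEM_PROMPT = "You are a contract-review assistant. " * 200

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache breakpoint: everything up to and including this
            # block is cached; later calls sharing this prefix pay
            # the discounted cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize clause 4.2."}],
}
```

Placing the breakpoint at the end of the static system block means every subsequent call that reuses that block within the TTL reads it from cache.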

    OpenAI Automatic Caching

    OpenAI implements automatic prompt caching with a 50% discount:

  • Regular input: $2.50/1M tokens (GPT-4o)
  • Cached input: $1.25/1M tokens (GPT-4o)
  • Cache TTL: Automatic, managed by OpenAI
  • Minimum cacheable: 1,024 tokens

OpenAI's caching is automatic: no code changes are needed. The system detects repeated prefixes and caches them transparently.
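
Even though no code changes are needed, you can confirm hits by inspecting the usage block of each response. The cached_tokens field below follows OpenAI's documented chat completions response shape; verify against the current API reference.

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of this request's prompt tokens that hit the cache."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

# Example usage block, shaped like a chat.completions response:
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
```

Here half the prompt tokens were billed at the cached rate; a value of 0 on every request is a sign your prefixes are not being reused.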

    Google Gemini Context Caching

    Google offers explicit context caching for Gemini models with different pricing mechanics — a per-hour storage fee plus discounted input pricing.

    When to Use Prompt Caching

    Prompt caching is most effective when:

    1. System Prompts Are Long

    If your system prompt is 1,000+ tokens (common for agents with detailed instructions), caching it saves those tokens on every subsequent call within the TTL window.

    2. Few-Shot Examples Are Consistent

    Applications that include the same few-shot examples in every prompt can cache the example section.

    3. RAG with Repeated Context

    If multiple user queries reference the same documents, the document context can be cached.

    4. Agent Loops

    AI agents make multiple LLM calls per task, often with the same system prompt and growing context. Caching the static portions saves significantly.

    5. Multi-Turn Conversations

    Each new message in a conversation resends the full history. The prior messages (which don't change) can be cached.

    Calculating Prompt Caching Savings

    Scenario: An AI agent with a 3,000-token system prompt makes 20 LLM calls per task using Claude 3.5 Sonnet.

    Without caching:

  • System prompt cost per task: 20 calls × 3,000 tokens × $3.00/1M = $0.18

With caching:

  • First call (cache write): 3,000 × $3.75/1M = $0.011
  • Remaining 19 calls (cached): 19 × 3,000 × $0.30/1M = $0.017
  • Total: $0.028
  • Savings: 84% on system prompt tokens

At 10,000 tasks per month, that's $1,800 vs $280, saving $1,520 monthly just on system prompt tokens.
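
The arithmetic above, spelled out using the Sonnet rates quoted earlier in this article (the monthly figure differs slightly from $1,520 because the per-task costs above are rounded):

```python
CALLS = 20
PROMPT_TOKENS = 3_000
INPUT_RATE, WRITE_RATE, READ_RATE = 3.00, 3.75, 0.30  # $ per 1M tokens

def cost(tokens, rate_per_million):
    return tokens * rate_per_million / 1_000_000

without = cost(CALLS * PROMPT_TOKENS, INPUT_RATE)              # $0.18
with_cache = (cost(PROMPT_TOKENS, WRITE_RATE)                  # 1 cache write
              + cost((CALLS - 1) * PROMPT_TOKENS, READ_RATE))  # 19 cached reads
savings = 1 - with_cache / without                             # ~0.84
monthly = 10_000 * (without - with_cache)                      # ~$1,517 per month
```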

    Best Practices for Prompt Caching

    Structure Prompts for Cacheability

    Place static content at the beginning of prompts:

  • System instructions (static, cacheable)
  • Few-shot examples (static, cacheable)
  • Retrieved context (semi-static, potentially cacheable)
  • User query (dynamic, not cached)

Monitor Cache Hit Rates

    Track what percentage of input tokens are hitting cache. Low hit rates indicate:

  • Prompts aren't structured optimally
  • TTL is expiring between requests (low traffic)
  • Dynamic content is mixed into cacheable sections
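
A minimal aggregator for this metric, assuming each request log carries total and cached input-token counts (the field names here are hypothetical; map them to whatever your provider or logging layer reports):

```python
def fleet_cache_hit_rate(request_logs):
    """Share of all input tokens across requests that were served from cache."""
    total = sum(r["input_tokens"] for r in request_logs)
    cached = sum(r["cached_tokens"] for r in request_logs)
    return cached / total if total else 0.0

logs = [
    {"input_tokens": 3_200, "cached_tokens": 0},      # first call: cache write
    {"input_tokens": 3_200, "cached_tokens": 3_000},  # warm calls hit the prefix
    {"input_tokens": 3_200, "cached_tokens": 3_000},
]
```

Tracking this over time makes the failure modes above visible: a rate near zero during quiet hours suggests TTL expiry, while a persistently low rate points at prompt structure.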

Combine with Other Optimizations

    Prompt caching stacks with other cost optimizations:

  • Cached input + model routing = compound savings
  • Cached prompts reduce latency too (faster time-to-first-token)
  • Batch API discounts apply on top of cache discounts
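
As a rough illustration of stacking, assuming the batch discount applies multiplicatively to the already-discounted cached rate (check each provider's billing rules before relying on this):

```python
input_rate = 3.00                  # $/1M input tokens (Sonnet list price)
cached_rate = input_rate * 0.10    # 90% cache discount -> $0.30/1M
stacked_rate = cached_rate * 0.50  # 50% batch discount on top -> $0.15/1M
effective_discount = 1 - stacked_rate / input_rate  # 95% off list price
```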

🦞 How ClawHQ Helps

    ClawHQ tracks your prompt caching performance in real-time: cache hit rates, tokens saved, cost savings, and TTL utilization across all providers. Identify which prompts aren't being cached effectively and get recommendations for restructuring them. ClawHQ's analytics show you exactly how much caching saves compared to non-cached requests, helping you maximize this powerful optimization.

    Frequently Asked Questions

    What is prompt caching and how does it save money?

    Prompt caching stores the processed representations of repeated prompt prefixes so they don't need to be recomputed on subsequent requests. Anthropic offers 90% off cached tokens; OpenAI offers 50% off. This is most impactful for applications with long system prompts, few-shot examples, or agent loops.

    How much can prompt caching save?

    Savings depend on your prompt structure and request patterns. For applications with significant static prompt content (system prompts, examples), caching typically saves 40-85% on input token costs. AI agents with 20+ calls per task see the largest savings.

    Does prompt caching affect response quality?

    No — prompt caching is purely an optimization of the computation process. The model receives exactly the same input and produces the same quality output. It's a behind-the-scenes efficiency improvement with no impact on results.

    Which provider has the best prompt caching?

    Anthropic offers the most aggressive discount at 90% off cached input tokens (vs OpenAI's 50%). However, Anthropic requires explicit cache control markers while OpenAI's caching is automatic. The best choice depends on your willingness to modify code and the volume of cacheable content.
