LLM Prompt Caching 2026

Prompt caching is the single most underused cost lever in LLM applications. By 2026 every major frontier vendor exposes some form of cache for repeated prompt prefixes, and for production workloads that send the same system prompt, the same retrieved-document context, or the same tool catalog across many requests, the cost reductions land squarely in the 50 to 90 percent range on the cached portion. The reason teams leave this on the table is that the engineering work is small but unfamiliar, and the prompt design changes are not obvious until somebody walks you through them. This post walks through what each provider offers, the patterns that actually deliver the savings, and where the gotchas live.

LLM prompt caching

What prompt caching actually is#

When a model processes a prompt, the expensive step is computing attention over the input tokens. Prompt caching stores the intermediate key-value state for a prefix of the prompt so that the next request reusing the same prefix can skip that work. The provider charges a reduced rate for those cached input tokens, typically 10 percent of the normal input price on Anthropic and Google, and 50 percent on OpenAI. Latency to first token also drops meaningfully because the prefill stage is shorter.

The mental model that matters: cache hits only happen on exact prefix matches. If you change one character at position zero of the system prompt, the entire cache for that prefix is gone. The cache is keyed on the byte-exact prefix from the start of the request, including system message, tools, and any messages that precede the first non-cached content.

Anthropic Claude: explicit cache_control#

Anthropic exposes prompt caching through explicit cache_control breakpoints on message and system blocks. The default time to live is five minutes, refreshed on every hit; an extended one-hour TTL is available at a higher write cost. Cached reads are priced at 10 percent of the base input rate, cached writes at 125 percent of base, so the break-even is roughly two reads per write at the five-minute tier. Up to four cache breakpoints per request are allowed, which means you can layer cache scopes: tools cached for a day-class workload, system prompt cached at the conversation level, and a long document cached for the duration of a chat.

OpenAI: automatic prefix caching#

OpenAI’s prompt caching activates automatically on prompts of more than 1024 tokens. There is no API knob; the platform detects repeated prefixes and serves the cached portion at 50 percent of the base input rate. Cache lifetime is typically five to ten minutes of inactivity, longer during off-peak. The trade-off versus Anthropic’s explicit model is convenience against control: you cannot force a cache, you cannot inspect cache state, and the savings ceiling is lower per token, but the integration cost is zero.

Google Gemini and Bedrock cross-provider#

Gemini exposes Context Caching for content of more than 32 thousand tokens, with explicit cache creation, billable storage by the hour, and reads priced at roughly 25 percent of base input. The pattern fits long-document workloads better than chat. AWS Bedrock now exposes prompt caching across Anthropic and Amazon Nova models with cache_control semantics similar to Anthropic’s native API, which means you can adopt the same patterns under a single AWS contract. Azure OpenAI mirrors OpenAI’s automatic behavior.

The design patterns that deliver the savings#

A few patterns reliably move the cost line.

First, lock the prompt structure: system prompt, then tool definitions, then long static context (retrieved documents, schemas, examples), then the dynamic user turn. The first three blocks are the cache target; the user turn is the cache miss. Anything dynamic, timestamps, request IDs, personalization tokens, must live after the cache breakpoint, not before it.

Second, cache the tool catalog. For agent workflows with twenty or more tool definitions, the tool block can easily exceed 4 thousand tokens. Caching it across a session turns a recurring cost into a one-time write.

Third, cache the RAG context for the duration of a chat. If the user asks five follow-up questions against the same retrieved chunk set, caching the retrieved context delivers cache hits on turns two through five.

Fourth, do not interleave dynamic content. Putting the current date inside the system prompt is the single most common mistake; it invalidates the cache on every request that crosses a minute boundary depending on how the date is formatted.

The cost impact in practice#

For a representative production RAG workload with a 6 thousand-token system prompt plus tool catalog and an average 12 thousand-token retrieved context, we routinely see 60 to 85 percent reductions in input-token cost once caching is wired correctly. Agent workflows with long tool definitions sit at the high end of that range. Single-shot zero-context calls obviously see nothing.

Latency improvements are real but smaller: typically 30 to 50 percent off time to first token on cache hits.

Where pdpspectra fits#

Our AI integration practice wires prompt caching into production LLM deployments as part of broader cost-optimization and gateway work.

Prompt caching saves real money. Talk to our team about your AI cost program.

What prompt caching actually is#

Anthropic Claude: explicit cache_control#

OpenAI: automatic prefix caching#

Google Gemini and Bedrock cross-provider#

The design patterns that deliver the savings#

The cost impact in practice#

Where pdpspectra fits#

Related posts.

LLM Cost Optimization in 2026: Patterns That Actually Save Money

Building Reliable AI for In-House Legal Teams

Engineering an LLM Pipeline for Fraud and Waste Detection in Audit Reports