Production Embedding Caches: Strategies That Hit 60% Hit Rate
Caching embeddings cuts costs dramatically — when the cache strategy fits the workload. Three patterns that produce real hit rates.
Embedding generation is one of the biggest cost lines for production RAG and search applications. At scale, generating embeddings for every query is wasteful — many queries are similar or identical to previous queries, and the underlying content being embedded changes slowly. Properly designed caches hit 50-70% on production traffic, cutting embedding costs proportionally.
This post walks through three caching patterns that produce real hit rates.
The cost math
A typical RAG application generates embeddings for two purposes: indexing documents (one-time per document) and embedding user queries (per query). For document indexing, caching is straightforward — generate once, store, reference forever. For query embedding, caching is more nuanced.
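Query embedding cost depends on the embedding model and on how many tokens each query carries. OpenAI’s text-embedding-3-small is ~$0.02 per million tokens; text-embedding-3-large is ~$0.13 per million tokens. Monthly spend is queries × tokens per query × price per token. For a high-traffic application processing 10 million queries/month, short search queries of ~20 tokens cost only about $4 to $26 per month in API fees, while long inputs of ~1,000 tokens per request (pasted passages, conversation context) cost roughly $200 to $1,300 per month. A cache with a 50%+ hit rate cuts that spend by about the same fraction.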
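The arithmetic can be sketched as follows. The prices are the ones quoted above; the function name, the 20- and 1,000-token query lengths, and the 60% hit rate are illustrative assumptions, not measurements.

```python
def monthly_embedding_cost(queries_per_month, tokens_per_query,
                           usd_per_million_tokens, hit_rate=0.0):
    """Estimated monthly API spend on query embeddings, in USD.

    Only cache misses reach the embedding API, so spend scales with (1 - hit_rate).
    """
    billed_tokens = queries_per_month * tokens_per_query * (1.0 - hit_rate)
    return billed_tokens / 1_000_000 * usd_per_million_tokens


QUERIES = 10_000_000  # monthly volume from the example above
PRICES = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

for model, price in PRICES.items():
    for tokens in (20, 1_000):  # assumed short and long query lengths
        uncached = monthly_embedding_cost(QUERIES, tokens, price)
        cached = monthly_embedding_cost(QUERIES, tokens, price, hit_rate=0.6)
        print(f"{model}, {tokens} tokens/query: ${uncached:,.2f} -> ${cached:,.2f}")
```

Plugging in your own token counts matters more than the model choice: tokens per query moves the result by orders of magnitude.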
For self-hosted embedding models, the cost is GPU time rather than API cost, but the savings are comparable.
Pattern 1: Exact-match cache
The simplest pattern: cache embeddings keyed by the exact query string.
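import hashlib

def get_embedding(query):
    # Use a stable digest: Python's built-in hash() is salted per process for
    # strings, so its values cannot be shared across workers or restarts.
    cache_key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    # `cache` and `embed_api` are assumed to be configured elsewhere
    if (cached := cache.get(cache_key)) is not None:
        return cached
    embedding = embed_api.embed(query)  # cache miss: call the embedding model
    cache.set(cache_key, embedding)
    return embedding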
Where this works: Applications with high query repetition. Search applications often have substantial query repetition — popular queries get asked thousands of times.
Hit rate: Typically 30-50% for search applications, lower for conversational applications where queries are more unique.
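Storage: Modest, but larger than a naive estimate. At 4 bytes per float32 dimension, an embedding is ~3 KB at 768 dimensions, ~6 KB at 1,536 (the text-embedding-3-small default), and ~12 KB at 3,072 (the text-embedding-3-large default). A million unique cached queries needs roughly 3-12 GB of Redis or similar, plus key and bookkeeping overhead.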
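Storage follows directly from dimensions times bytes per value. A minimal sketch, assuming float32 values (4 bytes each) and ignoring keys and store overhead; the function name is hypothetical, and 1,536 and 3,072 are the default dimensions of text-embedding-3-small and text-embedding-3-large.

```python
def cache_size_gb(num_entries, dimensions, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), ignoring keys and store overhead."""
    return num_entries * dimensions * bytes_per_value / 1e9


for dims in (768, 1536, 3072):
    size = cache_size_gb(1_000_000, dims)
    print(f"{dims} dims: {size:.2f} GB per million cached queries")
```

Storing vectors as float16 (bytes_per_value=2) halves these figures, at the price of some precision.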
Cache TTL: Depends on whether the underlying embedding model changes. With stable models, indefinite TTL is fine. With evolving models, periodic cache invalidation is needed.
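One way to handle model changes without a manual flush is to put the model identifier into the cache key, so entries written for an old model are simply never read again. This is a sketch under that assumption; `make_cache_key` and the "emb:" prefix are hypothetical names, not part of any library.

```python
import hashlib


def make_cache_key(query, model_name):
    """Cache key that changes whenever the embedding model identifier changes."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding
    payload = f"{model_name}\x00{query}".encode("utf-8")
    return "emb:" + hashlib.sha256(payload).hexdigest()


old_key = make_cache_key("reset password", "text-embedding-3-small")
new_key = make_cache_key("reset password", "text-embedding-3-large")
print(old_key != new_key)  # switching models yields a different key
```

Stale entries still occupy memory until they expire, so a finite TTL or an eviction policy remains useful alongside versioned keys.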
Pattern 2: Normalized exact-match cache
The improvement: normalize queries before hashing.
Common normalizations:
- Lowercase
- Strip leading/trailing whitespace
- Collapse internal whitespace
- Remove specific punctuation
- Stem or lemmatize
- Strip stop words (carefully)
The normalization should be lossless for retrieval purposes. “Find the CEO of Microsoft” and “find the ceo of microsoft” produce the same intent; normalizing the case before caching produces cache hits where the exact-match cache would miss.
Hit rate: Typically 50-70% for search applications. The normalization captures the surface variation in how users phrase the same intent.
Trade-offs: Aggressive normalization can produce false cache hits — different queries that normalize to the same string but actually have different intent. Conservative normalization (case, whitespace) is usually safe; more aggressive normalization (stemming, stop-word removal) needs careful evaluation.
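A conservative normalizer covering only the safe steps (case, whitespace, trailing punctuation) might look like the sketch below. The function name and the choice of which punctuation to strip are assumptions; stemming and stop-word removal are deliberately left out because they need evaluation first.

```python
import re

_WHITESPACE = re.compile(r"\s+")
_TRAILING_PUNCT = re.compile(r"[?!.]+$")


def normalize_query(query):
    """Conservative normalization: case, whitespace, trailing punctuation."""
    text = query.lower()
    text = _WHITESPACE.sub(" ", text).strip()  # collapse runs, trim the ends
    text = _TRAILING_PUNCT.sub("", text).strip()  # drop only end-of-query marks
    return text


print(normalize_query("  Find the CEO   of Microsoft? "))
print(normalize_query("find the ceo of microsoft"))
```

Both calls print the same string, so both queries map to one cache entry. Punctuation inside the query (for example "c++" or "v1.2") is left alone because it can carry meaning.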
Pattern 3: Semantic cache
The sophisticated pattern: cache based on semantic similarity, not exact match.
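def get_embedding(query):
    # The lookup vector is assumed to come from a cheaper model (e.g. a small
    # local one); using the expensive model here would erase the savings.
    lookup_vector = embed(query)
    # Find semantically similar cached queries
    similar = vector_db.similarity_search(lookup_vector, threshold=0.97)
    if similar:
        return similar[0].embedding  # embedding cached for the nearest query
    embedding = embed_api.embed(query)
    # Index by the cheap lookup vector; store the expensive embedding as payload
    vector_db.upsert(lookup_vector, embedding)
    return embedding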
Where this works: Conversational and natural-language interfaces where users phrase the same intent many different ways.
Hit rate: Can reach 60-80% with proper threshold tuning. Different queries that mean the same thing share embeddings; the cache catches more than exact-match would.
Trade-offs: The similarity threshold is critical. Too high (e.g., 0.99) produces few cache hits. Too low (e.g., 0.90) produces false hits where the cached embedding is actually for a different intent. The right threshold depends on the embedding model and the application; typical values are 0.95-0.98.
Implementation: Often built on the same vector database used for the RAG retrieval. The cache lookup is essentially a vector similarity query with a strict threshold.
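The thresholded lookup can be illustrated with a brute-force, in-memory stand-in for the vector database. Everything here is a hypothetical sketch: `SemanticCache` is not a real library class, the vectors are toy 2-D values rather than real embeddings, and a production system would use an approximate nearest-neighbor index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Brute-force in-memory stand-in for a vector database (illustration only)."""

    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self._entries = []  # list of (lookup_vector, cached_value)

    def lookup(self, lookup_vector):
        """Return the value of the most similar entry at or above the threshold, else None."""
        best_score, best_value = -1.0, None
        for vector, value in self._entries:
            score = cosine_similarity(lookup_vector, vector)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self.threshold else None

    def add(self, lookup_vector, value):
        self._entries.append((list(lookup_vector), value))


cache = SemanticCache(threshold=0.97)
cache.add([1.0, 0.0], "embedding-A")
near = cache.lookup([0.99, 0.05])  # cosine similarity about 0.999: a hit
far = cache.lookup([0.8, 0.6])     # cosine similarity 0.8: a miss
print(near, far)
```

Lowering the threshold toward 0.8 would turn the second lookup into a hit, which is exactly the false-hit risk described above.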
The hybrid pattern
Most production deployments use hybrid approaches:
- Exact-match cache first. Fastest, cheapest, no false hits.
- Normalized exact-match second. Catches simple variations.
- Semantic similarity third. Catches deeper variations but with cost and complexity.
The first two layers handle most traffic efficiently; the semantic layer catches the remaining matches.
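The layering above can be sketched as a single lookup function that tries each layer in order of cost. The function and parameter names are hypothetical, and the dictionaries and helper functions below are toy stand-ins for a real key-value store, normalizer, and semantic cache.

```python
def layered_lookup(query, exact_cache, normalized_cache, normalize, semantic_lookup):
    """Try each layer in order of cost; return (embedding, layer) or (None, "miss")."""
    if query in exact_cache:
        return exact_cache[query], "exact"
    normalized = normalize(query)
    if normalized in normalized_cache:
        return normalized_cache[normalized], "normalized"
    embedding = semantic_lookup(query)  # expected to return None on a miss
    if embedding is not None:
        return embedding, "semantic"
    return None, "miss"


# Toy stand-ins for the three layers
exact = {"reset password": [0.1, 0.2]}
normalized = {"reset password": [0.1, 0.2]}


def simple_normalize(q):
    return " ".join(q.lower().split())


def fake_semantic_lookup(q):
    return [0.1, 0.2] if "password" in q.lower() else None


for q in ("reset password", "  Reset   PASSWORD ", "forgot my password", "pricing"):
    _, layer = layered_lookup(q, exact, normalized, simple_normalize, fake_semantic_lookup)
    print(repr(q), "->", layer)
```

Returning the layer name alongside the embedding makes it easy to report hit rate per layer and to see whether the semantic layer earns its extra cost.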
What we typically see at clients
Common patterns in client engagements:
- No caching. Every query generates a fresh embedding. The bill grows linearly with traffic.
- Exact-match only. Reasonable for some applications but leaves substantial value on the table.
- Aggressive semantic caching without validation. False hits produce wrong retrievals, which produce wrong answers, which produce user complaints.
- Cache without monitoring. Hit rate not measured; nobody knows whether the cache is working.
The fix is usually instrumentation — measure hit rate, validate against quality metrics, tune as you learn.
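A minimal sketch of that instrumentation, assuming each lookup reports an outcome label such as "exact", "normalized", "semantic", or "miss". The class name and labels are hypothetical; in production the counts would typically be exported to a metrics system rather than held in memory.

```python
from collections import Counter


class CacheStats:
    """Counts lookup outcomes per layer so hit rate can be reported."""

    def __init__(self):
        self.outcomes = Counter()

    def record(self, outcome):
        self.outcomes[outcome] += 1

    def hit_rate(self):
        total = sum(self.outcomes.values())
        if total == 0:
            return 0.0
        return 1.0 - self.outcomes["miss"] / total


stats = CacheStats()
for outcome in ["exact", "miss", "normalized", "exact", "miss"]:
    stats.record(outcome)
print(f"hit rate: {stats.hit_rate():.0%}", dict(stats.outcomes))
```

Hit rate alone does not reveal false hits, so it should be tracked next to a retrieval quality metric, as the paragraph above suggests.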
Where pdpspectra fits
Our AI engineering practice builds RAG and search applications with appropriate caching architecture. The cache work is often high-leverage — small engineering investment, substantial ongoing cost reduction.
Related reading: the RAG architecture patterns post, the LLM cost optimization post, and the prompt caching production economics post.
Embedding cache discipline pays for itself. Talk to our team about your AI infrastructure.