Production Embedding Caches: Strategies That Reach a 60% Hit Rate

Embedding generation is one of the biggest cost lines for production RAG and search applications. At scale, generating embeddings for every query is wasteful — many queries are similar or identical to previous queries, and the underlying content being embedded changes slowly. Properly designed caches hit 50-70% on production traffic, cutting embedding costs proportionally.

This post walks through three caching patterns that produce real hit rates.

The cost math

A typical RAG application generates embeddings for two purposes: indexing documents (one-time per document) and embedding user queries (per query). For document indexing, caching is straightforward — generate once, store, reference forever. For query embedding, caching is more nuanced.

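Query embedding cost depends on the embedding model. OpenAI’s text-embedding-3-small is ~$0.02 per million tokens; text-embedding-3-large is ~$0.13 per million tokens. Cost scales with total tokens rather than query count: 10 million queries/month at an assumed 20 tokens each is 200 million tokens, or about $4/month on text-embedding-3-small and $26/month on text-embedding-3-large. Bills in the hundreds or thousands of dollars per month require billions of tokens, which means much higher query volume or much longer inputs. Whatever the baseline, a cache cuts it roughly in proportion to its hit rate, so a 50% hit rate removes about half the spend.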
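The arithmetic can be sketched as a small function. The prices are the approximate per-million-token figures quoted above; the 20-tokens-per-query average and the 60% hit rate are assumptions to replace with your own measurements, and `monthly_embedding_cost` is a name made up for this sketch.

```python
def monthly_embedding_cost(queries_per_month, avg_tokens_per_query,
                           price_per_million_tokens, hit_rate=0.0):
    """Estimated monthly API spend on query embeddings, in dollars."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    tokens = queries_per_month * avg_tokens_per_query
    billable_tokens = tokens * (1.0 - hit_rate)  # only cache misses reach the API
    return billable_tokens / 1_000_000 * price_per_million_tokens


# 10M queries/month; 20 tokens/query and a 60% hit rate are assumptions.
for model, price in [("text-embedding-3-small", 0.02),
                     ("text-embedding-3-large", 0.13)]:
    uncached = monthly_embedding_cost(10_000_000, 20, price)
    cached = monthly_embedding_cost(10_000_000, 20, price, hit_rate=0.6)
    print(f"{model}: ${uncached:.2f}/month uncached, ${cached:.2f}/month cached")
```

Plugging in your own token counts matters more than the model choice: average tokens per query varies widely between keyword search and long conversational inputs.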

For self-hosted embedding models, the cost is GPU time rather than API cost, but the savings are comparable.

Pattern 1: Exact-match cache

The simplest pattern: cache embeddings keyed by the exact query string.

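import hashlib

def get_embedding(query):
    # Built-in hash() is salted per process for strings, so use a stable
    # digest that survives restarts and matches across workers.
    cache_key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cached = cache.get(cache_key)  # cache: any key-value store (placeholder)
    if cached is not None:
        return cached
    embedding = embed_api.embed(query)  # embed_api: your embedding client (placeholder)
    cache.set(cache_key, embedding)
    return embedding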

Where this works: Applications with high query repetition. Search applications often have substantial query repetition — popular queries get asked thousands of times.

Hit rate: Typically 30-50% for search applications, lower for conversational applications, where queries vary more.

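Storage: Modest. A float32 embedding takes 4 bytes per dimension: about 6 KB at 1536 dimensions and 12 KB at 3072, or half that as float16. A million unique cached queries therefore needs roughly 6-12 GB of Redis or similar before key and store overhead, and less with reduced dimensions or float16.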
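A rough way to size the cache before provisioning, assuming vectors are stored as raw floats with no compression. `cache_size_gb` is a hypothetical helper, and the figure excludes key and store overhead.

```python
def cache_size_gb(num_entries, dimensions, bytes_per_value=4):
    """Raw vector payload in GB (10**9 bytes); excludes keys and store overhead."""
    return num_entries * dimensions * bytes_per_value / 1e9


print(cache_size_gb(1_000_000, 1536))     # float32 at 1536 dimensions
print(cache_size_gb(1_000_000, 1536, 2))  # float16 at 1536 dimensions
print(cache_size_gb(1_000_000, 256))      # float32 at 256 dimensions
```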

Cache TTL: Depends on whether the underlying embedding model changes. With stable models, indefinite TTL is fine. With evolving models, periodic cache invalidation is needed.
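One way to avoid explicit invalidation is to put the model identity in the cache key, so a model change simply misses the old entries and they can age out on their own. This is a sketch of that idea, not the only option; `make_cache_key` and the `emb:` prefix are illustrative names.

```python
import hashlib


def make_cache_key(query, model, dimensions):
    """Stable key that changes whenever the model or output size changes."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"emb:{model}:{dimensions}:{digest}"
```

Embeddings from different models are not interchangeable, so mixing them under one key would silently corrupt retrieval. Versioned keys make that mistake impossible at the cost of a cold cache after each model change.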

Pattern 2: Normalized exact-match cache

The improvement: normalize queries before hashing.

Common normalizations:

Lowercase
Strip leading/trailing whitespace
Collapse internal whitespace
Remove specific punctuation
Stem or lemmatize
Strip stop words (carefully)

The normalization should be lossless for retrieval purposes. “Find the CEO of Microsoft” and “find the ceo of microsoft” produce the same intent; normalizing the case before caching produces cache hits where the exact-match cache would miss.
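A minimal sketch of the conservative end of that list: case and whitespace only. The Unicode NFKC step is an addition not mentioned above, included on the assumption that full-width and compatibility characters should match their plain forms; drop it if that does not hold for your traffic.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def normalize_query(query):
    """Conservative normalization: Unicode form, case, and whitespace only."""
    text = unicodedata.normalize("NFKC", query)  # assumption: compatibility forms should match
    text = text.casefold()                       # case-insensitive comparison form
    return _WHITESPACE.sub(" ", text).strip()    # collapse runs, trim the ends
```

Punctuation is deliberately left alone here, so queries such as "C++ jobs" and "C jobs" stay distinct. The normalized string then replaces the raw query as the input to the cache key.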

Hit rate: Typically 50-70% for search applications. The normalization captures the surface variation in how users phrase the same intent.

Trade-offs: Aggressive normalization can produce false cache hits — different queries that normalize to the same string but actually have different intent. Conservative normalization (case, whitespace) is usually safe; more aggressive normalization (stemming, stop-word removal) needs careful evaluation.
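One way to evaluate a candidate normalizer is to run it over logged queries and review every group of distinct raw queries that collapse to the same key. The sketch below assumes you have such a log; `collision_groups` and `aggressive` are hypothetical names, and the two sample queries are purely illustrative.

```python
import re
from collections import defaultdict


def collision_groups(queries, normalize):
    """Map each normalized key to the distinct raw queries that collapse into it."""
    groups = defaultdict(set)
    for query in queries:
        groups[normalize(query)].add(query)
    # Only groups with more than one raw form need human review.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


def aggressive(query):
    """Example of an over-eager normalizer: lowercases and strips all punctuation."""
    return re.sub(r"[^a-z0-9 ]", "", query.lower())


print(collision_groups(["C++ jobs", "C jobs"], aggressive))
```

Here the punctuation stripping merges two different intents under one key, which is exactly the kind of false hit to look for before enabling a normalization step.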

Pattern 3: Semantic cache

The sophisticated pattern: cache based on semantic similarity, not exact match.

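def get_embedding(query):
    # The lookup needs a vector, and calling the paid model to get one would
    # spend the cost the cache exists to avoid. Use a cheaper model for matching.
    lookup_vec = cheap_embed(query)  # placeholder: small local model
    similar = vector_db.similarity_search(
        lookup_vec,
        threshold=0.97
    )
    if similar:
        # Reuse the full embedding stored with the nearest cached query
        return similar[0].embedding
    embedding = embed_api.embed(query)
    # Index by the cheap vector; store the full embedding as the payload
    vector_db.upsert(lookup_vec, embedding)
    return embedding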

Where this works: Conversational and natural-language interfaces where users phrase the same intent many different ways.

Hit rate: Can reach 60-80% with proper threshold tuning. Different queries that mean the same thing share embeddings; the cache catches more than exact-match would.

Trade-offs: The similarity threshold is critical. Too high (e.g., 0.99) produces few cache hits. Too low (e.g., 0.90) produces false hits where the cached embedding is actually for a different intent. The right threshold depends on the embedding model and the application; typical values are 0.95-0.98.
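Threshold tuning can be grounded in a small hand-labeled sample: pairs of queries scored by cosine similarity and marked as same intent or not. The sketch below assumes such a sample exists; `sweep_thresholds` is a hypothetical helper and the five pairs are made-up illustrative values, not measurements.

```python
def sweep_thresholds(labeled_pairs, thresholds):
    """labeled_pairs: (cosine_similarity, same_intent) tuples from a labeled sample."""
    rows = []
    for threshold in thresholds:
        # Every pair at or above the threshold would be served from the cache.
        hits = [same for sim, same in labeled_pairs if sim >= threshold]
        false_hits = sum(1 for same in hits if not same)
        rows.append({
            "threshold": threshold,
            "hit_rate": len(hits) / len(labeled_pairs),
            "false_hit_rate": false_hits / len(hits) if hits else 0.0,
        })
    return rows


# Illustrative values only; replace with pairs labeled from your own traffic.
sample = [(0.99, True), (0.97, True), (0.96, False), (0.93, True), (0.91, False)]
for row in sweep_thresholds(sample, [0.90, 0.95, 0.98]):
    print(row)
```

The useful output is the trade-off curve: pick the lowest threshold whose false-hit rate your retrieval quality can tolerate.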

Implementation: Often built on the same vector database used for the RAG retrieval. The cache lookup is essentially a vector similarity query with a strict threshold.
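A self-contained, in-memory sketch of the pattern, with a linear scan standing in for the vector database. It assumes the lookup vector comes from a model that is cheaper than the one being cached, since otherwise a lookup costs as much as a miss. `SemanticCache`, `lookup_embed`, and `expensive_embed` are hypothetical names, and both embedding functions are supplied by the caller.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Brute-force semantic cache; a vector database replaces the scan at scale."""

    def __init__(self, lookup_embed, expensive_embed, threshold=0.97):
        self.lookup_embed = lookup_embed        # cheap model, used only for matching
        self.expensive_embed = expensive_embed  # the model whose calls we want to avoid
        self.threshold = threshold
        self.entries = []                       # (lookup_vector, embedding) pairs

    def get(self, query):
        """Return (embedding, hit). On a miss, compute and store the embedding."""
        vec = self.lookup_embed(query)
        best_score, best_embedding = 0.0, None
        for cached_vec, embedding in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_embedding = score, embedding
        if best_embedding is not None and best_score >= self.threshold:
            return best_embedding, True
        embedding = self.expensive_embed(query)
        self.entries.append((vec, embedding))
        return embedding, False
```

A hit returns the embedding of the nearest cached query, not of the query itself, so the threshold directly controls how much approximation enters retrieval.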

The hybrid pattern

Most production deployments use hybrid approaches:

Exact-match cache first. Fastest, cheapest, no false hits.
Normalized exact-match second. Catches simple variations.
Semantic similarity third. Catches deeper variations but with cost and complexity.

The first two layers handle most traffic efficiently; the semantic layer catches the remaining matches.
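The layering can be sketched as a single lookup function that falls through the tiers in order. Plain dicts stand in for the key-value store here, and `semantic_get` is assumed to be a caller-supplied function returning `(embedding, hit)` that computes and stores the embedding itself on a miss. All names are hypothetical.

```python
def tiered_lookup(query, exact_cache, normalized_cache, normalize, semantic_get):
    """Return (embedding, tier), where tier names the layer that answered."""
    # Tier 1: exact string match.
    if query in exact_cache:
        return exact_cache[query], "exact"

    # Tier 2: match after normalization; promote so the next lookup hits tier 1.
    norm = normalize(query)
    if norm in normalized_cache:
        embedding = normalized_cache[norm]
        exact_cache[query] = embedding
        return embedding, "normalized"

    # Tier 3: semantic match. Approximate hits are not written back to the
    # exact tiers, so a false hit cannot become permanent there.
    embedding, hit = semantic_get(query)
    if hit:
        return embedding, "semantic"

    # Full miss: the embedding is fresh, so it is safe to populate both tiers.
    exact_cache[query] = embedding
    normalized_cache[norm] = embedding
    return embedding, "miss"
```

Returning the tier name alongside the embedding makes per-layer hit rates easy to record at the call site.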

What we typically see at clients

Common patterns in client engagements:

No caching. Every query generates a fresh embedding. The bill grows linearly with traffic.

Exact-match only. Reasonable for some applications but leaves substantial value on the table.

Aggressive semantic caching without validation. False hits produce wrong retrievals, which produce wrong answers, which produce user complaints.

Cache without monitoring. Hit rate not measured; nobody knows whether the cache is working.

The fix is usually instrumentation — measure hit rate, validate against quality metrics, tune as you learn.
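A minimal sketch of that instrumentation: count lookups by outcome and derive the hit rate from the counts. `HitRateTracker` and the tier labels are illustrative; in production these counters would more likely be emitted to an existing metrics system.

```python
from collections import Counter


class HitRateTracker:
    """Counts cache outcomes by tier label; the label "miss" marks a cache miss."""

    def __init__(self):
        self.counts = Counter()

    def record(self, tier):
        self.counts[tier] += 1

    def hit_rate(self):
        """Fraction of lookups served by any cache tier; 0.0 before any lookups."""
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return 1.0 - self.counts["miss"] / total

    def breakdown(self):
        """Share of lookups per tier, including misses."""
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {tier: count / total for tier, count in self.counts.items()}
```

Hit rate alone says nothing about false hits, so it is worth tracking alongside a retrieval quality metric on a sampled set of cached responses.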

Where pdpspectra fits

Our AI engineering practice builds RAG and search applications with appropriate caching architecture. The cache work is often high-leverage — small engineering investment, substantial ongoing cost reduction.

Embedding cache discipline pays for itself. Talk to our team about your AI infrastructure.

