Embedding Model Selection 2026

Choosing an embedding model used to be easy because there were three. In 2026 there are dozens — OpenAI text-embedding-3, Cohere v3, Voyage 3, Mistral embed, Google Gemini Embedding, the BGE family, NV-Embed, Stella, and a long tail of fine-tuned variants. MTEB leaderboard rankings shift weekly. Half the models on it would never survive in your stack.

Four criteria we actually apply.

1. Retrieval quality on your corpus

The only benchmark that matters is your own. Build a small evaluation set: 50–100 queries that look like real user queries, paired with the documents that should be retrieved. Score recall@K and MRR for each candidate model.
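The scoring step can be sketched in a few lines of Python. This is a minimal illustration, not a full harness: the function names, the `results` mapping (query to ranked document IDs returned by a candidate model) and the `gold` mapping (query to the set of documents that should be retrieved) are all assumptions made for the example.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(results: Dict[str, List[str]],
             gold: Dict[str, Set[str]],
             k: int = 10) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the gold set."""
    if not gold:
        raise ValueError("gold set is empty")
    n = len(gold)
    recall = sum(recall_at_k(results.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(results.get(q, []), rel) for q, rel in gold.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Toy data, hypothetical: two queries, ranked results from one candidate model.
gold = {"q1": {"d1", "d2"}, "q2": {"d9"}}
results = {"q1": ["d1", "d5", "d2"], "q2": ["d3", "d9"]}
scores = evaluate(results, gold, k=2)
```

Run the same `evaluate` call once per candidate model against the same gold set, so the numbers are directly comparable.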

MTEB and BEIR are useful for triage, not for selection. We’ve seen models that score 5 points higher on MTEB perform 10 points worse on a specific domain corpus. Domain shift is real.

2. Cost at your scale

Embedding cost is a per-token charge paid twice: once at index time, once per query. For a million-document corpus refreshed monthly plus 10k queries per day, the total cost varies 20x across providers. Run the math before selecting (prices are per 1M tokens; the initial index is a one-time cost, so it is kept out of the monthly figure):

one_time_cost ≈ corpus_tokens / 1M × index_price

monthly_cost ≈ (queries_per_day × 30 × avg_query_tokens / 1M × query_price)
             + (reindexes_per_month × corpus_tokens / 1M × index_price)
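A minimal sketch of that arithmetic, keeping the initial index as a one-time cost separate from the recurring monthly terms. The `CostInputs` fields, the 500-token average document length, and the $0.10 per 1M token prices are placeholders chosen for illustration, not any provider's actual pricing.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostInputs:
    corpus_tokens: float        # total tokens across the corpus
    index_price: float          # USD per 1M tokens at index time
    queries_per_day: float
    avg_query_tokens: float
    query_price: float          # USD per 1M tokens at query time
    reindexes_per_month: float  # 1.0 means a full refresh every month


def embedding_costs(c: CostInputs, days_per_month: int = 30) -> Dict[str, float]:
    """Split embedding spend into one-time and recurring monthly parts."""
    initial_index = c.corpus_tokens / 1e6 * c.index_price
    monthly_queries = (c.queries_per_day * days_per_month
                       * c.avg_query_tokens / 1e6 * c.query_price)
    monthly_reindex = c.reindexes_per_month * initial_index
    return {
        "initial_index": initial_index,
        "monthly_queries": monthly_queries,
        "monthly_reindex": monthly_reindex,
        "monthly_recurring": monthly_queries + monthly_reindex,
    }


# Hypothetical inputs: 1M docs at ~500 tokens each, refreshed monthly,
# 10k queries/day at ~20 tokens each, placeholder price of $0.10 per 1M tokens.
example = embedding_costs(CostInputs(
    corpus_tokens=1_000_000 * 500,
    index_price=0.10,
    queries_per_day=10_000,
    avg_query_tokens=20,
    query_price=0.10,
    reindexes_per_month=1.0,
))
```

With these placeholder numbers the refresh dominates the bill and query traffic is a rounding error, which is why refresh cadence deserves as much scrutiny as the per-token price.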

Self-hosted open-source (BGE, NV-Embed, Stella via TEI or vLLM) wins on cost above ~5M docs and 100k+ queries/day. Below that, hosted is usually cheaper after operational overhead.

3. Dimension and storage cost

Higher dimensions = better retrieval up to a point, then it’s just storage. A 3072-dim model isn’t 2x better than 1536-dim; it’s marginally better and 2x the storage and memory.
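The storage side of that trade-off is simple arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and counts raw vector bytes only; any overhead added by the vector index itself is ignored, and `index_size_gb` is a name made up for the example.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


# One million vectors at the two dimensions discussed above.
size_1536 = index_size_gb(1_000_000, 1536)  # about 6.1 GB
size_3072 = index_size_gb(1_000_000, 3072)  # about 12.3 GB
```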

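Matryoshka embeddings, which stay useful when truncated to their leading dimensions, are now standard. Pick a model that supports them; store at full dimension, and when latency matters search over a truncated prefix, cutting both stored and query vectors to the same length.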
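Truncation itself is a small operation: keep the leading dimensions and rescale to unit length so cosine or dot-product scores stay comparable. The sketch below is pure Python for clarity and assumes the model was trained to support this; with a model that was not, the truncated vectors lose most of their quality. The function names are illustrative.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy vectors, hypothetical: the same truncation is applied to document and query.
doc = truncate_embedding([3.0, 4.0, 12.0], dims=2)
query = truncate_embedding([6.0, 8.0, -1.0], dims=2)
score = cosine(doc, query)
```

The important detail is symmetry: a query truncated to 256 dimensions can only be compared against document vectors truncated to the same 256.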

4. Latency at your traffic

Hosted embedding APIs have p99 latencies that move under load. Measure during your peak, not during midnight tests. Self-hosted gives you control of latency but requires GPU capacity planning.

For agent workflows where embeddings are on the hot path of a real-time response, the difference between 50ms and 500ms p99 is felt by users.
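A rough way to get p50 and p99 under concurrent load is to time each call from a pool of worker threads. In this sketch `embed` is a stand-in for whatever client call you actually use (hosted API or self-hosted server); the concurrency level and the nearest-rank percentile method are assumptions to adjust for your traffic.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(embed: Callable[[str], object],
                    queries: List[str],
                    concurrency: int = 8) -> Dict[str, float]:
    """Call `embed` on every query from `concurrency` threads and report latency in ms."""
    def timed(query: str) -> float:
        start = time.perf_counter()
        embed(query)
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, queries))
    return {
        "n": len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embedding call; sleeps ~1 ms."""
    time.sleep(0.001)
    return [0.0]


stats = measure_latency(fake_embed, ["example query"] * 50, concurrency=8)
```

A few hundred samples is the practical minimum for a meaningful p99; with fewer, the figure is effectively the slowest single call.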

Recommendations by use case

Default for English RAG on hosted infrastructure. Voyage 3 Large or Cohere v3.5. Strong quality, predictable cost, decent latency.

Multilingual or cross-lingual retrieval. Cohere multilingual or BGE-M3. Test on your specific language pairs — performance varies wildly.

Cost-sensitive at scale. Self-host BGE-large or NV-Embed via TEI. GPU cost amortizes well above 5M docs.

Code search. Voyage-code or specialized fine-tunes. Generic text embedders underperform on code by a wide margin.

Multimodal (text + image). Cohere Embed v3 multimodal or Google’s multimodal embedding. Cheaper than running text and image embedders separately.

Where teams go wrong

Picking by leaderboard. MTEB shifts and is gameable. Build a local eval.

Skipping re-embedding when switching. Vectors from different models live in different spaces and cannot be compared, so when you change the embedding model you must re-embed the entire corpus. Plan the migration; don’t run dual models in production for “convenience.”

Ignoring drift over time. As your corpus grows, the model that worked at 100k docs may underperform at 5M. Refresh the eval annually.

Treating embeddings as commodity. They aren’t. Embedding choice often moves retrieval quality more than the LLM choice moves answer quality.

What we ship by default

For RAG engagements via our AI & LLM integration service:

Local eval set built from real user queries on day one
2–3 candidate models compared on that set
Cost projection at 12-month volumes
Latency measurement under simulated load
Re-eval cadence in the runbook (annual at minimum)

Embedding selection is a one-week decision that shapes 12 months of retrieval quality. Don’t skip the eval.

The right embedding model is the one that wins on your corpus, not the leaderboard. Our team runs structured embedding bake-offs as the first step of every RAG engagement. Tell us about the corpus.

