Speculative Decoding in Production

Speculative decoding is the inference optimization that actually delivers in 2026. A small “draft” model proposes the next few tokens; the large “target” model verifies them in a single forward pass. When the draft is right, you get multiple tokens per target-model call. When it’s wrong, the target’s own token replaces the first rejected draft token, so you still get one token from that call and lose no quality.
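A minimal sketch of the draft-then-verify loop under greedy decoding may make this concrete. The "models" here are hypothetical stand-ins (plain functions that return the next token for a sequence), and the names speculative_step and generate are ours, not any framework's API. A real engine scores all drafted positions in one batched target pass; the loop below only reproduces the result of that pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the single
# next token it would pick under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly emitted tokens."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each drafted position in order.
    emitted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(proposed[i])

    # 3. Every draft token matched: the target also supplies one bonus token.
    emitted.append(target(prefix + proposed))
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]
```

Every emitted token is either a draft token the target agreed with or the target's own choice, which is why the greedy output matches plain target-only decoding.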

Below: the numbers we measure, the cases where it helps, and what running it involves.

The headline numbers we see

For chat-style workloads with a target model in the 70B–123B range and a draft model in the 3B–14B range:

p50 throughput: 1.8–2.5x faster
p99 latency: 1.5–2x faster
Quality: identical (token outputs match by construction)

For code generation specifically:

p50 throughput: 2.5–3.5x faster (code is more predictable; draft accuracy higher)
Quality: identical

For long-form creative output:

p50 throughput: 1.3–1.7x faster (less predictable token sequences)

These are real, measured numbers. They are not the 5–10x sometimes claimed in papers — those assume idealized batching or specific narrow workloads.

Why it’s not free

The catch is that the draft model runs every time. So you’re paying for:

N draft-model forward passes per verification step, whether or not those tokens are accepted
Plus target verification compute
Plus VRAM for both models

When draft acceptance is high (code, predictable text), the math is great. When it’s low (open-ended creative, multilingual edge cases), you may be paying for draft compute without much savings.
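The trade-off can be sketched with the idealized cost model commonly used in the speculative decoding literature. It assumes each draft token is accepted independently with probability alpha, that k tokens are drafted per step, and that one draft pass costs cost_ratio times one target pass. These parameter names and the function names are ours. The model ignores batching, memory bandwidth, and scheduling overhead, so treat its output as an upper bound on what you will measure, not a prediction.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes independent per-token acceptance with probability alpha and
    k drafted tokens; the target always contributes at least one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    cost_ratio is the cost of one draft forward pass divided by the cost of
    one target forward pass (an assumption you have to measure yourself).
    """
    step_cost = k * cost_ratio + 1.0  # k draft passes plus one target pass
    return expected_tokens_per_step(alpha, k) / step_cost
```

With zero acceptance the model returns a speedup below 1, which is the "paying for draft compute without savings" case. Real systems carry overheads this model leaves out, so a measured break-even point can sit well above the one it implies.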

Acceptance rates we typically measure:

Code: 65–80%
Structured output (JSON, SQL): 60–75%
English chat: 50–65%
Multilingual or creative text: 35–55%

Below roughly 40% acceptance, speculative decoding can cost more than it saves.

The draft model choice

The draft model needs to be:

Much smaller than the target (4–10x ratio)
From a similar model family ideally (tokenizer alignment, distribution alignment)
Fine-tuned on similar data when possible

Same-family drafts (Llama draft for Llama target, Qwen draft for Qwen target) work better than cross-family. Tokenizer mismatches kill the technique.
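A quick pre-flight check for tokenizer alignment might look like the sketch below. It assumes you can export each tokenizer's vocabulary as a token-to-id mapping (for example, the dict returned by a Hugging Face tokenizer's get_vocab()); the function name vocab_mismatches is hypothetical. Requiring identical vocabularies is a strict check, and some engines may tolerate less.

```python
from typing import Dict, List


def vocab_mismatches(draft_vocab: Dict[str, int],
                     target_vocab: Dict[str, int],
                     limit: int = 10) -> List[str]:
    """Return up to `limit` readable differences between two vocabularies.

    An empty list means every token maps to the same id in both models.
    """
    problems: List[str] = []
    for token in sorted(set(draft_vocab) | set(target_vocab)):
        draft_id = draft_vocab.get(token)    # None if the draft lacks the token
        target_id = target_vocab.get(token)  # None if the target lacks it
        if draft_id != target_id:
            problems.append(
                f"{token!r}: draft id={draft_id}, target id={target_id}"
            )
            if len(problems) >= limit:
                break
    return problems
```

Running this once when pairing a draft with a target is cheap, and it catches the cross-family case before any throughput measurement is wasted on it.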

Operational realities

Memory. Both models live in GPU memory. The draft is small, so the overhead is modest, but not zero. For tight memory budgets, this matters.
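For a rough sense of the weight overhead, a back-of-the-envelope calculation is enough. The sketch below assumes both models are stored at the same precision (2 bytes per parameter, i.e. 16-bit weights, by default) and counts weights only; the draft's KV cache and activation memory come on top. The function names are ours.

```python
def weight_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def draft_overhead_fraction(target_billions: float, draft_billions: float,
                            bytes_per_param: float = 2.0) -> float:
    """Draft weight memory as a fraction of target weight memory."""
    return (weight_gb(draft_billions, bytes_per_param)
            / weight_gb(target_billions, bytes_per_param))
```

For example, weight_gb(70) gives 140 GB for a 70B target at 16-bit precision, and a 7B draft adds about a tenth of that.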

Tail latency. Speculative decoding helps median latency more than tail latency. In the worst case (no draft tokens accepted), latency is roughly that of standard decoding plus the draft overhead.

Batching. Speculative decoding’s interaction with batching is complex. At high batch sizes, the simple advantages diminish. Frameworks (vLLM, TensorRT-LLM, SGLang) handle this differently — measure on your workload.

What we deploy by default

For self-hosted inference engagements via our AI & LLM integration service:

Default to enabled for code-heavy and structured-output workloads
Measure acceptance rate per workload class; disable per-route if acceptance < 40%
Same-family draft models, sized 5–8x smaller than target
Throughput and tail-latency dashboards before and after enabling
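The per-route rule above can be sketched as a small counter that tracks proposed and accepted draft tokens and gates speculative decoding per route. The class and method names are hypothetical, the 40% threshold comes from the rule above, and the minimum sample count is an assumption added so a route is not disabled on a handful of requests.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RouteStats:
    proposed: int = 0  # draft tokens proposed on this route
    accepted: int = 0  # draft tokens the target accepted


class AcceptanceGate:
    """Tracks draft acceptance per route and decides whether to speculate."""

    def __init__(self, threshold: float = 0.40, min_samples: int = 10_000):
        self.threshold = threshold
        self.min_samples = min_samples  # assumed warm-up size, tune per workload
        self.stats: Dict[str, RouteStats] = {}

    def record(self, route: str, proposed: int, accepted: int) -> None:
        s = self.stats.setdefault(route, RouteStats())
        s.proposed += proposed
        s.accepted += accepted

    def acceptance_rate(self, route: str) -> Optional[float]:
        s = self.stats.get(route)
        if s is None or s.proposed == 0:
            return None
        return s.accepted / s.proposed

    def enabled(self, route: str, default: bool = True) -> bool:
        """Use `default` until enough tokens are observed, then apply the threshold."""
        s = self.stats.get(route)
        if s is None or s.proposed < self.min_samples:
            return default
        return s.accepted / s.proposed >= self.threshold
```

How the counts are obtained depends on the serving framework; most expose some form of acceptance metric, but the exact name varies, so check your stack.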

Where to be careful

Speculative decoding does not change quality. If you’re seeing different outputs with vs without it, your implementation has a bug.

Determinism is preserved under greedy decoding. Under sampling, the accept/reject step needs a rejection-sampling correction so the output distribution still matches the target model’s. Most modern frameworks handle this; verify on your stack.
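The standard correction under sampling is a rejection-sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's, and on rejection resample from the normalized positive part of p - q. A minimal sketch for a single position, using plain lists as distributions (the function names are ours, not a framework API):

```python
import random
from typing import List


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; return p for safety.
        return list(p)
    return [d / total for d in diff]


def verify_token(token: int, p: List[float], q: List[float],
                 rng: random.Random) -> int:
    """Return the token to emit at one position.

    `token` must have been sampled from q, which guarantees q[token] > 0.
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return token
    # Rejected: resample from the residual so the overall output follows p.
    weights = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=weights, k=1)[0]
```

Sampling a token from q and passing it through verify_token yields tokens distributed according to p, which is the property to check for when verifying a framework's implementation.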

Eval before deploying. Run your full eval suite with and without spec decoding. Outputs should be identical for deterministic decoding. If they’re not, debug before shipping.
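One way to run that comparison is a small harness that generates each prompt twice and reports where the token sequences first diverge. In this sketch, baseline and speculative are hypothetical callables that wrap your serving stack with deterministic (greedy) settings and return token ids.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical wrapper type: prompt in, generated token ids out.
Generate = Callable[[str], List[int]]


def first_divergence(a: Sequence[int], b: Sequence[int]) -> int:
    """Index of the first differing token, or -1 if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Same tokens so far: they differ only if one sequence is longer.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def compare_outputs(prompts: Sequence[str], baseline: Generate,
                    speculative: Generate) -> List[Tuple[str, int]]:
    """Return (prompt, divergence index) for every prompt whose outputs differ."""
    failures: List[Tuple[str, int]] = []
    for prompt in prompts:
        idx = first_divergence(baseline(prompt), speculative(prompt))
        if idx != -1:
            failures.append((prompt, idx))
    return failures
```

An empty result is the expected outcome. A non-empty one gives you the prompt and token position to start debugging from.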

The bottom line

Speculative decoding is one of the few “free lunch” optimizations in inference. The catch is that it’s only free when draft acceptance is high. Measure your acceptance rate before assuming the headline number applies to your workload.

For most production AI features we ship, it’s worth enabling. For some, it costs more than it saves. The measurement is cheap; do it before committing.

