Inference Benchmarks in 2026: Reading the Numbers Honestly
Artificial Analysis, MLPerf, vLLM throughput, Cerebras and Groq numbers — how to read 2026 inference benchmarks without getting fooled by the marketing.
There is more honest data on inference performance in 2026 than at any prior point — and more dishonest marketing too. Artificial Analysis publishes daily numbers across most hosted providers. MLPerf Inference v4.1 and v5.0 have ironed out most of the comparability holes. vLLM, SGLang, and TensorRT-LLM all publish reproducible throughput numbers on canonical workloads. And yet most enterprise procurement decisions about inference still get made off provider-supplied slides that pick the benchmark that flatters the vendor.
This post is the senior-engineer version of how to actually read inference numbers in 2026.

The four metrics that actually matter
You will see a lot of numbers thrown around. Most of them reduce to four:
- Tokens per second per user, sometimes called per-stream throughput or output speed. The single number that determines whether the interactive experience feels fast. 80 to 200 tokens per second per user is the comfortable interactive range for most chat workloads; below 40 the experience drags.
- Time to first token (TTFT). The latency between request submitted and first token emitted. Dominated by prompt-processing cost for long contexts. Sub-second is good; 200 to 500ms is excellent; above 2 seconds for short prompts is a problem.
- Aggregate throughput (total tokens per second across all concurrent users on the deployment). The number that determines unit economics. A single GPU pushing 200 tokens per second per user at concurrency 16 is doing 3,200 aggregate tokens per second.
- Dollar per million tokens at the deployment’s steady-state utilization. The number that determines whether the deployment can be sold or used economically.
Any benchmark that does not let you reconstruct these four is a marketing artifact. Most provider-published numbers showcase exactly one of them.
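To make the four definitions concrete, here is a minimal Python sketch that derives each one from per-request timestamps. The `RequestTrace` record, its field names, and the synthetic traffic are our own illustration, not any provider's schema or data.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds and output token count for one streamed request."""
    submitted_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def ttft_seconds(t: RequestTrace) -> float:
    # Time to first token: request submitted -> first token emitted.
    return t.first_token_at - t.submitted_at


def per_user_tokens_per_second(t: RequestTrace) -> float:
    # Output speed of one stream, measured over the decode phase only.
    return t.output_tokens / (t.finished_at - t.first_token_at)


def aggregate_tokens_per_second(traces: list[RequestTrace]) -> float:
    # All output tokens divided by the wall-clock window covering every request.
    window = max(t.finished_at for t in traces) - min(t.submitted_at for t in traces)
    return sum(t.output_tokens for t in traces) / window


def dollars_per_million_tokens(hourly_cost: float, aggregate_tps: float) -> float:
    # One hour of deployment cost spread over the tokens produced in that hour.
    return hourly_cost / (aggregate_tps * 3600) * 1_000_000


# Synthetic example: 16 concurrent streams, each with a 0.5 s TTFT followed by
# 2,000 tokens emitted over 10 s.
traces = [RequestTrace(0.0, 0.5, 10.5, 2000) for _ in range(16)]
print(ttft_seconds(traces[0]))
print(per_user_tokens_per_second(traces[0]))
print(aggregate_tokens_per_second(traces))
```

In this toy run the aggregate figure comes out a little below 16 times the per-user speed, because the wall-clock window also contains the prefill time. That gap is one reason a published number should say which window it was measured over.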
Artificial Analysis: what it is and is not
Artificial Analysis became the de facto industry scoreboard for hosted inference in 2024 and 2025. It runs synthetic traffic against every major provider and publishes output speed, latency, and price per million tokens on a continuously updated dashboard.
What it is good for: comparing hosted providers’ real-world serving performance on common models (Llama 3.3 70B, Llama 4, Mixtral, Claude, GPT, Gemini) under approximately similar load. It catches when a provider’s serving stack regresses or improves. It exposes the variance between providers running nominally the same OSS model.
What it is not: a workload-specific benchmark. The synthetic prompts are not your prompts. The concurrency profile is not your concurrency profile. If your real workload is 4k-token in, 200-token out at concurrency 64, the AA numbers measured at a different input/output shape may not translate.
In our procurement work, AA numbers are the right starting point — never the ending point. We always re-benchmark with workload-shaped prompts before signing a contract.
MLPerf Inference: the closer-to-rigorous option
MLPerf Inference v5.0 covers data-center inference for image classification, recommendation, GPT-J, Llama 2 70B, Mixtral 8x7B, Stable Diffusion XL, and an emerging Llama 3.1 405B track. It enforces specific accuracy targets and queue-discipline rules, so submissions are genuinely comparable.
What it is good for: comparing hardware. The MLPerf Llama 2 70B server-scenario numbers across NVIDIA H100, H200, B200, AMD MI300X, Intel Gaudi 3, and AWS Inferentia are the closest you will get to apples-to-apples on the iron.
What it is not: a model-quality benchmark. It does not tell you whether Llama 4 8B is good enough for your task. And the workloads are still synthetic — your prompt distribution will differ.
vLLM, SGLang, TensorRT-LLM: the reproducible self-hosted numbers
For self-hosted inference, the serving framework’s own canonical benchmarks are usually the most honest data point. The vLLM project publishes throughput numbers per GPU per model with reproducible scripts. SGLang publishes a similar suite. TensorRT-LLM has perf guides with documented configs.
Rough late-Q1 2026 numbers for Llama 3.3 70B FP8 on 2x H100 SXM5 under vLLM 0.7 with continuous batching and prefix caching at concurrency 32:
- Output speed per user: 90 to 140 tokens per second.
- Aggregate output throughput: 2,800 to 3,800 tokens per second.
- TTFT for 2k-token prompts: 600ms to 1.1s.
- At an amortized GPU rate of 1.99 dollars per hour, this lands at roughly 0.55 to 0.75 dollars per million output tokens.
On 2x H200 the same workload sees roughly 1.4x to 1.7x throughput at roughly 1.2x cost — net favorable. On B200 the improvement is larger and the cost ratio better still on serving stacks that have been tuned. SGLang’s structured-output throughput beats vLLM by 20 to 40 percent on JSON-shaped responses, which matters for tool-using agent workloads.
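The cost arithmetic behind figures like these is worth doing by hand. The sketch below is a rough Python version under two assumptions of ours that the text does not state: that the 1.99 hourly rate is per GPU, and that utilization is the fraction of paid hours spent serving at the measured throughput. Under that reading the full-utilization figure comes out below the quoted range, so the quoted range presumably already bakes in some idle time.

```python
def dollars_per_million_tokens(gpu_count, rate_per_gpu_hour, aggregate_tps,
                               utilization=1.0):
    """Amortized cost per million output tokens.

    utilization: fraction of paid hours spent serving at aggregate_tps
    (an assumption of this sketch, not a figure from the benchmarks).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hourly_cost = gpu_count * rate_per_gpu_hour
    tokens_per_hour = aggregate_tps * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


def relative_cost_per_token(throughput_ratio, cost_ratio):
    # Cost per token scales with hourly cost and inversely with throughput.
    return cost_ratio / throughput_ratio


# 2x H100 at the low and high ends of the quoted aggregate throughput.
for tps in (2800, 3800):
    for utilization in (1.0, 0.7, 0.5):
        cost = dollars_per_million_tokens(2, 1.99, tps, utilization)
        print(f"{tps} tok/s at {utilization:.0%} utilization: {cost:.2f} per million")

# H200 figures from the text: 1.4x to 1.7x throughput at roughly 1.2x cost.
print(f"{relative_cost_per_token(1.4, 1.2):.2f}")
print(f"{relative_cost_per_token(1.7, 1.2):.2f}")
```

A relative cost below 1.0 is what "net favorable" means here: the hardware costs more per hour but less per token.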
The hosted providers, ranked by what they actually do well
Together AI. Strong serverless tokens-per-second for OSS models; competitive on price. Good fit for production OSS inference without operating vLLM.
Fireworks AI. Speculative decoding rolled out broadly through 2025; their Llama 3.3 70B and Llama 4 numbers are consistently near the top of the AA leaderboard. Strong for low-latency interactive workloads.
DeepInfra. Aggressive on price per million tokens; throughput is good but rarely best-in-class. Strong for cost-sensitive batch and high-volume non-interactive workloads.
Groq. The headline LPU numbers are real — 500 to 1,200 tokens per second per user on the right models. The catch is the model menu is constrained, context windows are limited compared to GPU stacks, and the cost-per-token economics are sometimes less favorable than the speed implies. Strong fit when interactive latency is the dominant requirement.
Cerebras Inference (WSE-3). Similar story to Groq — extraordinary tokens-per-second-per-user on a specific menu of models, with a different cost shape. The wafer-scale story is genuinely differentiated for latency-critical workloads.
SambaNova. The RDU-based inference stack continues to compete in the same low-latency tier. Strong technical story; smaller market footprint than Groq or Cerebras.
Bedrock, OpenAI, Anthropic, Google. First-party providers serve their own models with quality that the OSS-on-neocloud providers cannot match for the same models. You pay a premium for it; usually worth it for the frontier models.
The token economics across providers
Late-Q1 2026 rough dollar-per-million-output-token rates for the Llama 3.3 70B class:
| Provider | Output cost (dollars per million tokens) |
|---|---|
| Together AI serverless | 0.79 to 0.99 |
| Fireworks AI standard | 0.85 to 1.10 |
| DeepInfra | 0.40 to 0.70 |
| AWS Bedrock (Llama 70B) | 0.99 to 1.20 |
| Self-hosted vLLM on 2x H100 reserved | 0.50 to 0.80 amortized at 70 percent utilization |
The honest read: at moderate volume, hosted serverless is usually cheaper than self-hosted, because hosted providers run at 90+ percent utilization across thousands of customers while a single team's deployment often falls short of the 70 percent assumed in the self-hosted row. Self-hosted only undercuts the market price when your utilization is consistently high and your operational maturity is real.
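One way to frame the hosted-versus-self-hosted question is as a break-even utilization. The Python sketch below is illustrative only: the 3.98 hourly cost assumes two GPUs at 1.99 each, the 3,300 tokens per second is a hypothetical midpoint of the throughput range quoted earlier, and engineering and on-call time are left out entirely.

```python
def break_even_utilization(hourly_cost, peak_aggregate_tps, hosted_price_per_million):
    """Utilization at which self-hosted cost per token equals the hosted price.

    A result above 1.0 means self-hosting cannot match the hosted price at any
    utilization under these inputs.
    """
    millions_per_hour_at_peak = peak_aggregate_tps * 3600 / 1_000_000
    cost_per_million_at_full_load = hourly_cost / millions_per_hour_at_peak
    return cost_per_million_at_full_load / hosted_price_per_million


hourly_cost = 2 * 1.99   # assumption: the 1.99 rate is per GPU
peak_tps = 3300          # hypothetical midpoint of the 2,800 to 3,800 range

for hosted_price in (0.79, 0.40):
    u = break_even_utilization(hourly_cost, peak_tps, hosted_price)
    print(f"hosted at {hosted_price:.2f}: break-even utilization {u:.2f}")
```

With these inputs the break-even is roughly 0.42 against a 0.79 hosted price and roughly 0.84 against a 0.40 hosted price. The cheaper the hosted alternative, the less idle time a self-hosted deployment can afford.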
For frontier models the comparison is different. There is no self-hosted equivalent of Claude Opus or GPT-5; you pay the first-party rate, period.
Speculative decoding and the throughput shift
Speculative decoding, in which a small draft model proposes several tokens ahead and a larger target model verifies them together, accepting or rejecting each proposal, landed in production at Fireworks, Together, DeepInfra, and increasingly in self-hosted vLLM through 2025. The net effect is typically 1.5x to 3x throughput uplift on suitable models with no quality regression. By mid-2026 it is a default expectation for any high-throughput serving stack.
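For readers who have not seen the mechanism, here is a toy Python sketch of the draft-and-verify loop in its greedy form. The `target` and `draft` functions are made-up deterministic stand-ins, not real models, and this is not a description of how any named provider implements it. Production systems typically use a probabilistic accept/reject rule and also emit a bonus token when every proposal is accepted, both of which this sketch omits.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (sequence, target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position. A real engine does this
        #    in one batched forward pass, so it is counted as one pass here.
        target_passes += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            if token == expected:
                accepted.append(token)
            else:
                # 3. First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        seq.extend(accepted)
    return seq, target_passes


def target(seq: List[int]) -> int:
    # Toy "large model": an arbitrary deterministic rule.
    return (3 * seq[-1] + len(seq)) % 11


def draft(seq: List[int]) -> int:
    # Toy "small model": agrees with the target except at every fifth position.
    guess = target(seq)
    return guess if len(seq) % 5 else (guess + 1) % 11


output, passes = speculative_decode(target, draft, [1], n_new=20, k=4)
print(output == greedy_decode(target, [1], 20))
print(passes)
```

The point of the toy is the invariant: the output matches what the target alone would have produced, while the target runs fewer verification passes than there are new tokens. How much that helps in practice depends on how often the draft agrees with the target.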
What this means for benchmarks: numbers published before mid-2025 do not include this uplift. Numbers published recently do. Always check the timestamp before comparing.
How to benchmark for your own workload
The mistake we see most often is teams accepting a provider-published number and skipping the workload-shaped re-benchmark. The pattern that actually works:
- Capture a representative sample of real production traffic — say 1,000 prompts with realistic input/output length distribution and concurrency profile.
- Replay the sample against each candidate provider or stack with the same concurrency. Measure tokens per second per user, TTFT, aggregate throughput, and dollar-per-million.
- Track p50, p95, and p99 — averages hide pathological tail latencies that wreck the user experience.
- Re-benchmark monthly. Provider stacks change, model versions roll, and the leader on the same workload shifts.
This is dull work. It also routinely flips a procurement decision that the slides had already settled.
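The replay loop above can be sketched in a few dozen lines of Python. Everything provider-specific here is a placeholder: `fake_provider_stream` is a hypothetical stand-in that simulates a streaming client with sleeps, and the sample is synthetic rather than captured traffic. Swap in your real client and your real prompts.

```python
import asyncio
import math
import random
import time


async def fake_provider_stream(prompt: str, max_tokens: int):
    """Hypothetical stand-in for a real streaming client. Replace with yours."""
    await asyncio.sleep(random.uniform(0.005, 0.02))  # simulated prefill
    for _ in range(max_tokens):
        await asyncio.sleep(0.0005)                   # simulated decode step
        yield "tok"


async def replay_one(prompt, max_tokens, sem, results):
    async with sem:  # hold concurrency at the level under test
        start = time.perf_counter()
        first = None
        tokens = 0
        async for _ in fake_provider_stream(prompt, max_tokens):
            if first is None:
                first = time.perf_counter()
            tokens += 1
        end = time.perf_counter()
    results.append({
        "start": start,
        "end": end,
        "tokens": tokens,
        "ttft": first - start,
        # Per-user output speed over the decode phase; assumes >= 2 tokens.
        "tps": (tokens - 1) / (end - first),
    })


def percentile(values, p):
    """Nearest-rank percentile, no interpolation."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results):
    window = max(r["end"] for r in results) - min(r["start"] for r in results)
    ttfts = [r["ttft"] for r in results]
    speeds = [r["tps"] for r in results]
    return {
        "aggregate_tps": sum(r["tokens"] for r in results) / window,
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p95": percentile(ttfts, 95),
        "ttft_p99": percentile(ttfts, 99),
        "tps_p50": percentile(speeds, 50),
        "tps_p05": percentile(speeds, 5),  # the slow tail for speed is the low end
    }


async def replay(sample, concurrency):
    sem = asyncio.Semaphore(concurrency)
    results = []
    await asyncio.gather(*(replay_one(p, n, sem, results) for p, n in sample))
    return results


# Synthetic sample for illustration; a real run replays captured production prompts.
sample = [(f"prompt-{i}", random.randint(5, 20)) for i in range(40)]
results = asyncio.run(replay(sample, concurrency=8))
for name, value in summarize(results).items():
    print(f"{name}: {value:.4f}")
```

Two design choices are worth flagging. The timer starts after the semaphore is acquired, so client-side queueing is excluded from TTFT; move the timer outside the `async with` if you want it included. And the semaphore fixes concurrency at a constant level, which is a simplification of a real concurrency profile.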

The benchmarks that lie
A short list of benchmark patterns to treat with suspicion:
- Tokens-per-second numbers with no concurrency or batch-size disclosed. At concurrency 1 the number is always flattering and almost never matches production.
- Cost-per-token numbers that do not include cold-start or idle time. A serverless price of 0.20 dollars per million tokens that comes with a 4-second cold start is not what it looks like.
- Throughput numbers measured at a model size and quantization chosen to flatter the deployment, not the ones you would actually run in production.
- TTFT numbers measured on short prompts and quoted as the typical latency for a long-context workload.
If the vendor will not show you the reproduction script, treat the number as marketing.
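If it helps to make the checklist mechanical, the Python sketch below records what a published number discloses and lists what is missing. The `BenchmarkClaim` fields are our own shorthand for the items above, not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkClaim:
    """Disclosures a published number should carry. None means undisclosed."""
    concurrency: Optional[int] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    quantization: Optional[str] = None
    includes_cold_start: Optional[bool] = None
    reproduction_script: Optional[str] = None


def missing_disclosures(claim: BenchmarkClaim) -> list[str]:
    # Every field left as None is something the vendor did not tell you.
    return [f.name for f in fields(claim) if getattr(claim, f.name) is None]


def red_flags(claim: BenchmarkClaim) -> list[str]:
    flags = [f"undisclosed: {name}" for name in missing_disclosures(claim)]
    if claim.concurrency == 1:
        flags.append("measured at concurrency 1")
    if claim.includes_cold_start is False:
        flags.append("cold start excluded from the number")
    return flags


# A hypothetical vendor slide that states only concurrency and quantization.
slide = BenchmarkClaim(concurrency=1, quantization="fp8")
for flag in red_flags(slide):
    print(flag)
```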
How we use the numbers
For our own procurement and recommendation work, the priority order is: workload-shaped re-benchmark first, then MLPerf for hardware-level comparison, then Artificial Analysis for tracking hosted-provider drift, and last vendor benchmarks for narrowing the candidate list. Anyone doing it in reverse is going to overpay or underdeliver.
For the broader topic of where inference sits in the stack, see our posts on sub-100ms inference with vLLM, Triton, and TGI, on the AI gateway pattern, and on open-source LLMs in production.
Where pdpspectra fits
Our AI and LLM integration practice runs workload-shaped inference benchmarks before signing contracts or rolling new serving stacks. We are happy to do this work for clients — it is unglamorous, repetitive, and reliably saves real money.
Related reading: Sub-100ms inference, the AI gateway pattern, and Mixture-of-Experts inference economics.
Inference benchmarks only matter when they predict your workload. Talk to our team about benchmarking your real traffic.