The Economics of Inference: What Baseten's $1.5B Round Tells Engineers
Serving, not training, is where most AI cost and latency live. A field guide to inference economics: GPU use, cold starts, autoscaling, build-vs-buy.
On June 18, 2026, TechCrunch reported that Baseten, an AI inference company, is finalizing a roughly $1.5 billion round at a valuation reported as high as $13 billion, months after a $300 million round that valued it near $5 billion. The detail that matters is not the number; it is what the number is for. Baseten does not train frontier models. It runs inference (what a model does after a user submits a prompt) and, per The Next Web, it does so by renting capacity from around 20 cloud providers and layering its own serving software on top, routing requests to competent, cheaper open-source models where it can.
The capital is chasing a quiet truth that most AI roadmaps still get backwards: training makes the headlines, but serving is where your money and your latency live. If you are building anything real on top of LLMs, the economics of inference are your economics. Here is the engineer’s version of why.
Training is a project. Serving is a bill that never stops
Training a model is a capital event with a beginning and an end. You provision a cluster, you burn it for weeks, you get weights, you stop paying. Most teams reading this will never train a foundation model and shouldn’t — the build-vs-buy math is settled.
Inference is the opposite shape. It is a per-request cost that scales with adoption, runs every hour your product is live, and gets worse as you succeed. Every prompt is a GPU doing forward passes; every token streamed back is compute you are paying for in real time. A feature that delights users is a feature whose serving bill grows linearly with use. This is why the inference layer is where the venture money is going — and why it should be where your engineering attention goes too.
The mental model we push in every AI implementation: treat inference like a database, not like a build step. It is an always-on, latency-sensitive, cost-metered system that needs capacity planning, monitoring, and a unit-economics dashboard. Nobody ships a product without knowing what a database query costs them. Far too many ship without knowing their cost per thousand tokens.
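As a rough illustration of that unit-economics number, here is a minimal sketch that converts a GPU's hourly price and its sustained throughput into cost per thousand tokens. The function name and the example inputs (a $4.00/hour GPU, 2,000 tokens per second in aggregate) are hypothetical, chosen only to show the arithmetic.

```python
def cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Dollars per 1,000 generated tokens for one GPU at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000


# Hypothetical inputs, not measured figures.
example_cost = cost_per_1k_tokens(gpu_hourly_usd=4.00, tokens_per_second=2000)
print(f"${example_cost:.6f} per 1K tokens")
```

The throughput term is the one that moves: the same GPU at half the aggregate tokens per second costs twice as much per token, which is why the next section is about utilization.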
GPU utilization is the whole game
A modern accelerator is the most expensive idle asset in your stack. The economics of serving reduce, almost entirely, to one question: what fraction of the time is that silicon actually doing useful work?
The enemies of utilization are specific and measurable:
- Memory-bound generation. Token generation is dominated by reading model weights out of high-bandwidth memory, not by arithmetic. A naive serving loop leaves the compute units starved, waiting on memory. This is why techniques like continuous (in-flight) batching exist — packing many concurrent requests through the model together so the expensive weight reads are amortized across users instead of wasted on one.
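- Padding waste. Static batching pads every request to the longest sequence in the batch, so variable prompt lengths mean you pay to multiply by zeros. Continuous batching reclaims that waste by scheduling at the level of individual decoding steps, so a finished sequence frees its slot immediately. Paged attention (the idea behind vLLM) attacks the related memory waste by allocating the KV cache in small fixed-size blocks instead of reserving worst-case space per request.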
- Poor bin-packing. One model per GPU when the model only needs 40% of the memory is 60% of a very expensive card set on fire.
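To make the padding point concrete, here is a small sketch that measures what fraction of a statically padded batch is padding. The prompt lengths are hypothetical; the function only assumes that every request is padded to the longest one in its batch, as described above.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of token slots in a statically padded batch that hold padding."""
    if not lengths or min(lengths) <= 0:
        raise ValueError("lengths must be a non-empty list of positive ints")
    padded_slots = max(lengths) * len(lengths)  # every row padded to the longest
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_slots


# Hypothetical prompt lengths in tokens: one long request next to three short ones.
batch = [12, 40, 380, 25]
waste = padding_waste(batch)
print(f"{waste:.0%} of the batch is padding")
```

A single long prompt in a batch of short ones is enough to make most of the computed slots padding, which is the case static batching handles worst.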
A serving fleet running at 30% utilization costs roughly three times as much per unit of useful work as the same fleet at 90%, because you pay for every GPU-hour whether or not it is busy. That multiplier is the difference between a margin and a crater. When a vendor like Baseten claims customers cut costs sharply, this is mechanically where it comes from: higher utilization plus routing cheaper open-source models where quality permits.
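The three-times figure is simple division, sketched below. The hourly price is a hypothetical placeholder; it cancels out of the ratio, so any price gives the same multiplier.

```python
def cost_per_useful_gpu_hour(hourly_usd: float, utilization: float) -> float:
    """Effective price of one hour of useful GPU work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_usd / utilization


HOURLY_USD = 4.00  # hypothetical on-demand price; cancels out of the ratio below
low = cost_per_useful_gpu_hour(HOURLY_USD, 0.30)
high = cost_per_useful_gpu_hour(HOURLY_USD, 0.90)
multiplier = low / high
print(f"30% vs 90% utilization: {multiplier:.1f}x the cost per useful hour")
```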
Cold starts are a tax you pay in latency and dollars
Autoscaling inference is harder than autoscaling a web service, and the reason is the cold start. Spinning up a new replica means scheduling a GPU, pulling a container, loading tens of gigabytes of weights into memory, and warming the runtime. That can take from tens of seconds to minutes. A traffic spike is exactly when you have no spare capacity and exactly when users are waiting.
You are left with an unpleasant trilemma:
- Over-provision and eat idle GPU cost to keep warm replicas on standby.
- Scale reactively and serve a wave of slow or failed requests every time load jumps.
- Engineer the cold start down with snapshotting, faster weight loading, and pre-warmed pools.
Most of what specialized serving platforms sell is the third option done well — fast model loading, scale-to-zero that doesn’t punish the next user, and replica pools sized against real traffic shape. If you build it yourself, this is the hard part, and it is far more engineering than “wrap the model in an API.”
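One way to put numbers on the first two options is a back-of-the-envelope comparison like the sketch below. It is a deliberately crude model, not a capacity planner: it assumes the spike arrives as a step increase in requests per second, that each replica has a fixed throughput, and that reactive scaling leaves the whole spike unserved until the cold start finishes. All names and inputs are hypothetical.

```python
import math


def warm_pool_tradeoff(spike_rps: float, replica_rps: float,
                       cold_start_s: float, gpu_hourly_usd: float) -> dict:
    """Compare keeping warm spares against scaling reactively (crude step model)."""
    if replica_rps <= 0:
        raise ValueError("replica_rps must be positive")
    spares = math.ceil(spike_rps / replica_rps)  # replicas needed to absorb the spike
    return {
        "warm_spares": spares,
        "standby_usd_per_hour": spares * gpu_hourly_usd,        # cost of option 1
        "requests_delayed_if_reactive": spike_rps * cold_start_s,  # cost of option 2
    }


# Hypothetical: a 30 req/s jump, 8 req/s per replica, 90 s cold start, $4/hour GPUs.
tradeoff = warm_pool_tradeoff(spike_rps=30, replica_rps=8,
                              cold_start_s=90, gpu_hourly_usd=4.00)
print(tradeoff)
```

Cutting the cold start, the third option, shrinks the delayed-request term directly, which is why it changes the tradeoff rather than just picking a side of it.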
Latency budgets are a data problem, not a model problem
Here is the call we’ll defend: most production latency is not in the model. It is in everything you wired around it.
Decompose a real request. There is the network hop, the authentication and rate-limit check, the retrieval step that pulls context out of a vector store or a warehouse, the prompt assembly, the model forward pass, and the post-processing or validation on the way out. In a retrieval-augmented system serving an enterprise knowledge base — say, a clinical lookup in a Hospital Management System or a records query in a School ERP — the retrieval and data-shaping stages routinely eat more wall-clock time than generation does, especially when someone is issuing an unindexed query against a transactional database at request time.
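A decomposition like this is cheap to measure. The sketch below times each stage of a request with a context manager and reports each stage's share of the total. The `StageTimer` class and the stage names are illustrative, and the `time.sleep` calls are stand-ins for real retrieval and generation work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a single request."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def shares(self) -> dict[str, float]:
        """Each stage's fraction of the total measured time."""
        total = sum(self.durations.values())
        return {n: d / total for n, d in self.durations.items()} if total else {}


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.02)  # stand-in for a vector-store or warehouse query
with timer.stage("generation"):
    time.sleep(0.01)  # stand-in for the model forward pass

for name, share in timer.shares().items():
    print(f"{name}: {share:.0%} of request time")
```

Emitting these per-stage numbers on every request is what lets you check the claim above against your own system instead of taking it on faith.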
This is the heart of pdpspectra’s plumbing-first thesis. An AI feature is mostly data engineering with a model on top. The latency budget is a data-pipeline budget. Our default operational engine — ClickHouse for sub-second analytical retrieval, Airflow for orchestration, and dbt for transformation — exists so that the context the model needs is precomputed, materialized, and millisecond-cheap to fetch, instead of being assembled with a slow join while a user watches a spinner. Get retrieval to sub-second and the model becomes the latency you optimize last, not first.
The build-vs-buy decision for inference
So should you run your own serving stack or rent one? The honest answer depends on where you sit on two axes: volume and uniformity.
Buy (managed serving) when your traffic is spiky, your models are standard open-source checkpoints, and your team’s comparative advantage is the product, not the kernel. You are paying someone to keep utilization high and cold starts low so you don’t have to staff that. Below a certain scale, you will not beat a platform’s bin-packing and you shouldn’t try.
Build (or go closer to the metal) when volume is high and steady enough that the platform margin exceeds the fully loaded cost of an infra team, when you have hard data-residency or compliance constraints, or when a custom model or routing logic is itself the moat. At sustained scale, the per-token markup on managed inference is real money, and reserved or owned capacity wins.
What you should not do is make this decision by vibes. Instrument cost per request and cost per thousand tokens as first-class metrics, the same way you track p99 latency. Track utilization. Then the build-vs-buy line is arithmetic, and you can revisit it as volume grows — most teams correctly buy early and selectively build later.
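Once those metrics exist, the break-even point is a short calculation. The sketch below assumes a simple model: managed serving has a per-token price and no fixed cost, while self-hosting has a lower per-token cost plus a fixed monthly cost for the team and tooling. Every number is a hypothetical placeholder.

```python
from typing import Optional


def breakeven_tokens_per_month(managed_usd_per_1k: float,
                               self_hosted_usd_per_1k: float,
                               fixed_monthly_usd: float) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper, or None if never."""
    saving_per_1k = managed_usd_per_1k - self_hosted_usd_per_1k
    if saving_per_1k <= 0:
        return None  # self-hosting never pays back its fixed cost
    return fixed_monthly_usd / saving_per_1k * 1000


# Hypothetical: $0.0020 vs $0.0008 per 1K tokens, $60,000/month for an infra team.
breakeven = breakeven_tokens_per_month(0.0020, 0.0008, 60_000)
print(f"Break-even at about {breakeven / 1e9:.0f} billion tokens per month")
```

Below the break-even volume, buying wins under these assumptions; re-run the calculation as your measured per-token costs and volume change.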
There is also a middle path that too few teams use: rent the serving control plane but supply your own reserved capacity. You let a platform handle batching, routing, and cold-start engineering while you bring committed GPU instances at a negotiated rate underneath. You get the operational leverage without paying the full on-demand markup on every token. The point is that build-vs-buy is not binary — it is a dial, and the right setting moves as your traffic becomes predictable enough to commit against.
One more cost lever sits upstream of all of this: model choice. Routing an easy classification or extraction request to a small open-source model, and reserving a frontier model for the genuinely hard prompts, is often the single largest cost reduction available — frequently larger than any utilization tuning. This is exactly the routing behavior Baseten is built around, and it is the kind of decision your evals must gate, because “cheaper model, same quality” is a claim, not a given.
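A routing rule gated by evals can be as small as the sketch below. It is an illustration of the idea, not a description of how Baseten or any platform routes: the model names, prices, task categories, and the `small_model_eval_pass` lookup are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    usd_per_1k_tokens: float


# Hypothetical models and prices, for illustration only.
SMALL = Route("small-open-model", 0.0002)
FRONTIER = Route("frontier-model", 0.0100)

EASY_TASKS = {"classification", "extraction"}


def choose_route(task_type: str, small_model_eval_pass: dict[str, bool]) -> Route:
    """Use the small model only for easy tasks whose evals it currently passes."""
    if task_type in EASY_TASKS and small_model_eval_pass.get(task_type, False):
        return SMALL
    return FRONTIER  # default to the stronger model when quality is unproven


eval_results = {"classification": True}  # extraction has not been evaluated yet
print(choose_route("classification", eval_results).model)
print(choose_route("extraction", eval_results).model)
```

The default matters: a task with no eval result falls through to the frontier model, so the cheaper route has to be earned by a passing test rather than assumed.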
The non-negotiables, whichever way you go
Whether you rent inference or run it, three things are not optional, and they are the parts teams skip first:
- Evals. You cannot reason about a cost-quality tradeoff — routing to a cheaper open-source model, quantizing, shrinking context — without a test suite that tells you whether quality held. No evals means every cost optimization is a blind gamble on user trust.
- Observability. Per-request traces with token counts, latency by stage, cache-hit rates, and utilization. If you cannot see where the time and money go, you cannot cut either.
- Cost tracking. Tokens-to-dollars attributed by feature and by customer. This is what turns “AI is expensive” into “this one endpoint is 60% of the bill,” which is the sentence that actually changes a roadmap.
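Cost attribution of this kind is a small aggregation over request logs. The sketch below assumes each log record carries a feature name and token counts, and it uses a single blended price per thousand tokens for simplicity (real pricing often differs for prompt and completion tokens). The record fields and values are hypothetical.

```python
from collections import defaultdict


def cost_share_by_feature(requests: list[dict],
                          usd_per_1k_tokens: float) -> dict[str, float]:
    """Each feature's fraction of total token spend, from per-request records."""
    totals: dict[str, float] = defaultdict(float)
    for record in requests:
        tokens = record["prompt_tokens"] + record["completion_tokens"]
        totals[record["feature"]] += tokens / 1000 * usd_per_1k_tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {feature: cost / grand_total for feature, cost in totals.items()}


# Hypothetical request log.
request_log = [
    {"feature": "summarize", "prompt_tokens": 5000, "completion_tokens": 1000},
    {"feature": "chat", "prompt_tokens": 3000, "completion_tokens": 1000},
]
feature_shares = cost_share_by_feature(request_log, usd_per_1k_tokens=0.002)
for feature, share in feature_shares.items():
    print(f"{feature}: {share:.0%} of the bill")
```

Adding a customer field to the grouping key gives the per-customer view with the same code.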
Baseten’s round is a signal worth reading correctly. The market has decided inference is its own discipline — a serving, utilization, and data-pipeline problem distinct from model building. Teams that internalize that and treat inference cost as an engineering metric will ship AI that is both fast and affordable. Teams that treat the model as the whole system will keep being surprised by the bill.
Inference is plumbing before it’s intelligence. We build the pipeline, the evals, and the cost dashboard before we ship the model. Talk to pdpspectra.