Custom AI Silicon and Inference Cost

Two stories landed in the same week of June 2026, and read together they redraw the map for anyone building on large language models. First, TechCrunch reported that Amazon is in early talks to sell its own AI chips — the Trainium line — directly to outside data centers, reversing a long-standing policy of keeping that silicon inside AWS. CEO Andy Jassy noted that if the chip business stood alone, its run rate would be near $50 billion. Second, CNBC reported that Anthropic is in early talks with Microsoft to run some Claude inference on Microsoft’s custom Maia 200 chips via Azure — silicon Microsoft says delivers roughly 30% better performance per dollar than the prior generation in its own fleet.

Neither deal is signed, and the perf-per-dollar figures are vendor benchmarks, not gospel. But the direction is unambiguous: the era of “one accelerator runs everything” is ending. For teams that build on these models rather than make the chips, that shift is an opportunity and a trap. The opportunity is cheaper inference. The trap is coupling your stack to silicon that may not be the cheapest option in eighteen months.

What’s actually happening: the accelerator monoculture is breaking

For most of the current AI cycle, the hardware story was simple — one vendor’s GPUs, one programming model, one set of assumptions. That monoculture made early decisions easy and made everyone’s cost structure identical.

That is now fracturing along two lines. First, hyperscalers have spent years building in-house AI accelerators (Amazon’s Trainium and Inferentia, Google’s TPUs, Microsoft’s Maia), and they are increasingly willing to point them at outside workloads. Amazon contemplating selling chips outside its own cloud is the clearest signal yet that custom silicon wants to compete in the open market, not just trim internal AWS bills. Second, a frontier lab like Anthropic exploring Claude inference on a hyperscaler’s custom accelerator suggests that non-incumbent hardware is now credible enough to be considered for serious work.

The strategic takeaway for builders is not “switch to chip X.” It is that the number of viable places to run the same model is going up, and the price spread between them is widening. That is good for your bill — but only if your architecture can actually move.

It helps to see why hyperscalers are doing this. Custom inference silicon lets them sidestep the GPU supply crunch, control their own roadmap, and undercut the margin a third-party chip vendor charges. Those incentives are durable, which means this is not a one-season fad: there will be more credible accelerators, not fewer, and the competition among them is precisely what drives the price per token down. The builders who benefit are the ones positioned to take each new option as it clears the quality bar.

Why coupling your stack to one accelerator is a mistake

It is tempting to optimize hard for whatever silicon you start on. Resist it past a certain depth. Hardware-specific lock-in shows up in places that are expensive to unwind later:

Custom kernels and ops hand-tuned for one vendor’s architecture that have no equivalent elsewhere.
Quantization and compilation formats baked for one runtime, so moving means re-validating numerics from scratch.
Orchestration and autoscaling wired to one cloud’s specific instance types and scheduler quirks.
Contract gravity — committed-spend agreements that quietly make the “free” choice the only economic one.

The cost of this coupling is not abstract. When a new chip appears claiming better performance per dollar — exactly what Maia 200 and Trainium are claiming — the team that abstracted its inference layer runs a benchmark and a migration. The team that didn’t runs a rewrite, or simply pays the premium because moving is too expensive to justify. Lock-in is rarely a single decision; it is a hundred small conveniences that add up to a stack that can only run in one place.

Treat inference as a portable workload

The discipline that survives the silicon wave is hardware abstraction — designing so the model is a portable artifact and the accelerator is a swappable backend. Concretely, that means a few deliberate choices.

Keep a clean serving interface. Your application talks to an inference service through a stable contract — request in, tokens out — and never reaches around it to touch hardware-specific APIs. Whether the backend is a GPU, a Trainium instance, or a Maia chip on Azure is an implementation detail behind that boundary, not a fact your product code knows.
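One way to picture that boundary is the minimal sketch below. Every name in it (`InferenceRequest`, `InferenceBackend`, `EchoBackend`, `InferenceService`) is hypothetical and invented for illustration, and the echo backend is a stand-in for a real GPU, Trainium, or Maia endpoint. The point is only that product code calls `complete()` and never learns which backend served it.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str
    max_tokens: int = 256


class InferenceBackend(Protocol):
    """The stable contract: request in, tokens out."""

    name: str

    def generate(self, request: InferenceRequest) -> Iterator[str]:
        ...


class EchoBackend:
    """Stand-in backend for illustration only; a real one would wrap an accelerator endpoint."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, request: InferenceRequest) -> Iterator[str]:
        # Yield whitespace-separated "tokens", capped at max_tokens.
        for token in request.prompt.split()[: request.max_tokens]:
            yield token


class InferenceService:
    """Product code depends on this class only, never on a backend's SDK."""

    def __init__(self, backends: dict[str, InferenceBackend], active: str) -> None:
        self._backends = backends
        self._active = active

    def switch_backend(self, name: str) -> None:
        # Swapping hardware is a configuration change, not a product change.
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        self._active = name

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        backend = self._backends[self._active]
        return " ".join(backend.generate(InferenceRequest(prompt, max_tokens)))


service = InferenceService(
    {"gpu": EchoBackend("gpu"), "maia": EchoBackend("maia")}, active="gpu"
)
before = service.complete("hello portable world", max_tokens=2)
service.switch_backend("maia")
after = service.complete("hello portable world", max_tokens=2)
```

The caller's code is identical before and after `switch_backend`, which is the property worth protecting when a vendor SDK offers a shortcut around the interface.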

Prefer portable intermediate representations. Standards like ONNX and compiler stacks such as OpenXLA exist precisely so a model graph can target multiple backends without a per-vendor rewrite. They are not free and not perfect, but they convert “port the model” from a quarter-long project into a benchmark-and-validate exercise.

Hold your evals at the boundary. This is the non-negotiable. The moment you move a model to new silicon — or quantize it to fit a cheaper chip — numerics shift subtly. Without a regression suite that proves output quality held, a hardware swap is a silent quality gamble. Evals are what make portable inference safe, not just possible. We say it on every AI Implementation: no eval harness, no migration.
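A migration gate of that kind might look like the sketch below. It assumes exact-match scoring and a 1% tolerance, both simplifications chosen for illustration; real suites usually need task-specific scoring. The names `EvalCase`, `exact_match_rate`, and `migration_allowed` are hypothetical, and the two "backends" are lookup tables standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the backend's output equals the expected answer."""
    if not cases:
        raise ValueError("eval suite is empty")
    hits = sum(1 for case in cases if generate(case.prompt).strip() == case.expected)
    return hits / len(cases)


def migration_allowed(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: list[EvalCase],
    max_drop: float = 0.01,  # assumed tolerance; set it from your own quality bar
) -> bool:
    """Allow the swap only if the candidate's score drops by at most max_drop."""
    baseline_score = exact_match_rate(baseline, cases)
    candidate_score = exact_match_rate(candidate, cases)
    return baseline_score - candidate_score <= max_drop


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]

# Stand-ins for real model calls: the baseline answers everything correctly,
# the "quantized" candidate gets one arithmetic case wrong.
BASELINE_ANSWERS = {case.prompt: case.expected for case in CASES}
QUANTIZED_ANSWERS = {**BASELINE_ANSWERS, "3*3": "6"}

same_backend_ok = migration_allowed(BASELINE_ANSWERS.get, BASELINE_ANSWERS.get, CASES)
quantized_ok = migration_allowed(BASELINE_ANSWERS.get, QUANTIZED_ANSWERS.get, CASES)
```

Here the quantized candidate loses one case in four, so the gate blocks the migration even though the new backend might be cheaper per token.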

Be realistic about the limits, too. Portability is a spectrum, not a switch. Standard transformer architectures running through mainstream runtimes move across backends with modest effort; exotic custom ops, bleeding-edge attention variants, and aggressive vendor-specific optimizations do not. The discipline is to keep the common case portable and to treat any hardware-specific optimization as a conscious, documented trade — something you accept for a measured performance win, with a known cost to unwind, rather than a default you slide into because the vendor’s SDK made it the path of least resistance.

Cost per token is a first-class engineering metric

The custom-silicon wave only pays off if you are measuring the thing it changes. Most teams cannot answer “what does one inference request cost us?” with a real number — and you cannot shop for cheaper silicon if you don’t know your current price.

Make cost per token, and cost per request, observable the same way you track p99 latency:

Attribute tokens to features and customers. “AI is expensive” is useless; “this endpoint is 60% of inference spend” changes a roadmap.
Track perf-per-dollar per backend. When you can run the same eval-passing workload on two accelerators, the cheaper one is an arithmetic result, not a sales pitch. A vendor’s “30% better perf-per-dollar” is a hypothesis you test on your traffic shape, not a fact you adopt.
Watch for cost-quality cliffs. A chip that is cheaper per token but forces aggressive quantization can cost more in failed requests and retries. Net cost, not sticker cost, is the metric.
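The arithmetic behind those three points is small enough to sketch. The example below uses made-up prices and logs, and assumes that failed requests still consume billable tokens. `RequestLog`, `spend_share_by_feature`, and `net_cost_per_success` are hypothetical names, not part of any library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    feature: str
    backend: str
    tokens: int
    succeeded: bool


def spend_share_by_feature(
    logs: list[RequestLog], price_per_1k_tokens: dict[str, float]
) -> dict[str, float]:
    """Share of total inference spend attributed to each feature."""
    spend: dict[str, float] = defaultdict(float)
    for log in logs:
        spend[log.feature] += log.tokens / 1000 * price_per_1k_tokens[log.backend]
    total = sum(spend.values())
    if total == 0:
        return {}
    return {feature: cost / total for feature, cost in spend.items()}


def net_cost_per_success(
    logs: list[RequestLog], price_per_1k_tokens: dict[str, float]
) -> dict[str, float]:
    """Total spend per backend divided by successful requests (net, not sticker, cost)."""
    cost: dict[str, float] = defaultdict(float)
    successes: dict[str, int] = defaultdict(int)
    for log in logs:
        # Assumption: failed requests are billed like successful ones.
        cost[log.backend] += log.tokens / 1000 * price_per_1k_tokens[log.backend]
        if log.succeeded:
            successes[log.backend] += 1
    return {b: cost[b] / successes[b] for b in cost if successes[b]}


# Hypothetical prices in USD per 1k tokens: "cheap" has the lower sticker price.
PRICES = {"gpu": 0.002, "cheap": 0.0015}

LOGS = [
    RequestLog("search", "gpu", 1000, True),
    RequestLog("search", "gpu", 1000, True),
    RequestLog("search", "gpu", 1000, True),
    RequestLog("search", "gpu", 1000, True),
    RequestLog("chat", "cheap", 1000, True),
    RequestLog("chat", "cheap", 1000, False),
    RequestLog("chat", "cheap", 1000, True),
    RequestLog("chat", "cheap", 1000, False),
]

shares = spend_share_by_feature(LOGS, PRICES)
net = net_cost_per_success(LOGS, PRICES)
```

With these invented numbers the cheaper-per-token backend fails half its requests, so its cost per successful request comes out higher than the pricier backend's. That is the cost-quality cliff in miniature.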

The reason most teams fly blind here is that the data is genuinely awkward to collect. Token counts live in API responses, latency lives in traces, prices live in a contract spreadsheet, and the backend a request landed on lives in infrastructure metadata. Nobody owns the join. The fix is not heroics; it is treating inference telemetry as a data-engineering problem and building the pipeline that stitches those streams into one queryable table. Once that table exists, perf-per-dollar comparisons stop being arguments and start being SQL.
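As a rough sketch of that join, the example below uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. The four source tables mirror the four streams named above; the table names, column names, and sample rows are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create four source tables and one joined view over them (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE token_usage (request_id TEXT PRIMARY KEY, tokens INTEGER);
        CREATE TABLE traces      (request_id TEXT PRIMARY KEY, latency_ms REAL);
        CREATE TABLE placements  (request_id TEXT PRIMARY KEY, backend TEXT);
        CREATE TABLE prices      (backend TEXT PRIMARY KEY, usd_per_1k_tokens REAL);

        CREATE VIEW inference_events AS
        SELECT u.request_id,
               p.backend,
               u.tokens,
               t.latency_ms,
               u.tokens / 1000.0 * pr.usd_per_1k_tokens AS cost_usd
        FROM token_usage u
        JOIN traces t     ON t.request_id = u.request_id
        JOIN placements p ON p.request_id = u.request_id
        JOIN prices pr    ON pr.backend = p.backend;
        """
    )
    # Made-up sample rows, one insert per source stream.
    conn.executemany(
        "INSERT INTO token_usage VALUES (?, ?)",
        [("r1", 1000), ("r2", 3000), ("r3", 2000)],
    )
    conn.executemany(
        "INSERT INTO traces VALUES (?, ?)",
        [("r1", 200.0), ("r2", 400.0), ("r3", 300.0)],
    )
    conn.executemany(
        "INSERT INTO placements VALUES (?, ?)",
        [("r1", "backend_a"), ("r2", "backend_a"), ("r3", "backend_b")],
    )
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?)",
        [("backend_a", 0.002), ("backend_b", 0.001)],
    )
    return conn


def cost_by_backend(conn: sqlite3.Connection) -> list[tuple]:
    """Per backend: total tokens, mean latency in ms, total cost in USD."""
    return conn.execute(
        """
        SELECT backend, SUM(tokens), AVG(latency_ms), SUM(cost_usd)
        FROM inference_events
        GROUP BY backend
        ORDER BY backend
        """
    ).fetchall()


rows = cost_by_backend(build_demo_db())
```

In production the view would be a modeled table fed by a pipeline rather than an in-memory database, but the shape of the final query stays the same.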

This is where the warehouse earns its keep. Inference logs — tokens, latency by stage, backend, cost — are just events, and events belong in a system built to aggregate them in sub-second time. Our default operational engine — ClickHouse for the analytics, Airflow for the pipelines, dbt for the modeling — turns raw inference telemetry into a live per-token cost dashboard. The same Data Platforms pattern that powers Operational Automation for a Hospital Management System or a School ERP applies cleanly here: instrument the events, warehouse them, and let the numbers drive the decision. You don’t pick silicon by reading a press release; you pick it by querying your own data.

The move: from “trust one vendor” to portable inference

The old posture was singular: pick the dominant accelerator, optimize hard, and accept the coupling as the price of performance. The custom-silicon wave makes that posture a liability. Amazon weighing outside chip sales, Anthropic exploring Claude inference on Maia, Google’s TPUs in the open market: each new credible backend is leverage you can only use if your inference layer is portable and your costs are measured.

So the call is plain. Abstract the hardware behind a stable serving interface. Keep your model in portable formats. Guard every move with evals. Measure cost per token like it’s an SLO. Do that, and the next chip that claims better economics is a benchmark you run on a Tuesday, not a vendor you’re married to. Skip it, and you will spend the next cycle paying a premium to avoid a rewrite you could have designed away.

Portable inference is a design decision, not a migration project. We build the abstraction and the cost telemetry before the lock-in sets in. Talk to pdpspectra.
