Custom AI Silicon: TPU, Trainium, MTIA, and Maia
Every hyperscaler now builds its own AI accelerator. The real story is the build-vs-buy economics against NVIDIA — and what it means for your workloads.
For most of the deep-learning era there was one answer to “what do we train and serve on,” and it came from NVIDIA. In 2026 that is no longer true, and the change is structural rather than cosmetic. Every hyperscaler now designs and deploys its own AI accelerator: Google has the TPU, AWS has Trainium and Inferentia, Meta has MTIA, and Microsoft has Maia. These are not science projects. They run production traffic at enormous scale, and custom ASIC shipments are now growing faster than merchant GPU shipments for the first time.
The interesting question is not whether custom silicon is happening — it plainly is — but why the most software-centric companies on earth decided to become chip designers, and what that means for anyone whose workloads land on this hardware. The answer is mostly economics, and the economics are specific.
Why a software company builds a chip
The case rests on a single observation: a hyperscaler runs a small number of workloads at a colossal, predictable volume. When you serve the same model architecture billions of times a day, you no longer need a general-purpose accelerator that does everything adequately. You need one that does your operations efficiently and skips everything else. That is the definition of an ASIC — application-specific integrated circuit — and it is exactly the regime where custom silicon wins.
A merchant GPU is a marvel of flexibility. It runs anyone’s model, any framework, any precision, any new architecture invented next month. You pay for that flexibility in silicon area and power: transistors spent on generality you may not use, and a vendor margin stacked on top. For a fixed, high-volume inference workload, that flexibility is overhead. Custom ASICs are reported to deliver several times better performance per watt on the specific model classes they target, and at the scale of a hyperscaler’s inference fleet, performance per watt is the line item that dominates the bill.
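To see why performance per watt dominates at fleet scale, it helps to write the energy arithmetic down. The sketch below is illustrative only: the fleet power, electricity price, facility overhead (PUE), and the 3x efficiency gain are assumed placeholders, not figures from any vendor.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost_usd(fleet_power_kw: float,
                           price_per_kwh_usd: float,
                           pue: float = 1.0) -> float:
    """Yearly electricity cost of a fleet drawing fleet_power_kw continuously.

    pue is the data-centre overhead multiplier (cooling, power delivery).
    """
    return fleet_power_kw * pue * HOURS_PER_YEAR * price_per_kwh_usd


def power_for_same_throughput_kw(baseline_power_kw: float,
                                 perf_per_watt_gain: float) -> float:
    """Power needed to serve the same load on a chip that is
    perf_per_watt_gain times more efficient than the baseline."""
    if perf_per_watt_gain <= 0:
        raise ValueError("perf_per_watt_gain must be positive")
    return baseline_power_kw / perf_per_watt_gain


# Hypothetical numbers, chosen only to make the arithmetic visible.
baseline_kw = 50_000.0   # assumed GPU fleet power for a fixed inference load
price = 0.08             # assumed USD per kWh
pue = 1.2                # assumed facility overhead
gain = 3.0               # assumed perf-per-watt advantage of the ASIC

gpu_bill = annual_energy_cost_usd(baseline_kw, price, pue)
asic_bill = annual_energy_cost_usd(
    power_for_same_throughput_kw(baseline_kw, gain), price, pue)
print(f"GPU fleet:  ${gpu_bill:,.0f} per year")
print(f"ASIC fleet: ${asic_bill:,.0f} per year")
```

For a fixed load, the energy bill falls in inverse proportion to the efficiency gain, so the absolute saving grows with the size of the fleet.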
There is a second motive that is less about cost and more about control. Building your own accelerator decouples your roadmap from a single supplier’s allocation queue and pricing. When demand for the dominant GPU outstrips supply, owning a credible in-house alternative is leverage — both a fallback and a negotiating position. The hyperscalers are not trying to eliminate NVIDIA from their fleets. They are trying to stop being entirely dependent on it.
The four chips, and what each one is for
The designs are not interchangeable, and their differences tell you what each company optimised for.
Google TPU
The TPU is the oldest and most mature of the bunch: Google has been building tensor processing units for more than a decade, and the latest generation, Ironwood, is squarely aimed at the inference workloads that now make up the majority of AI compute. The most telling signal is external adoption: Anthropic committed to running Claude inference on Google TPUs at very large scale, a deal that put TPUs on the map as a serious destination for a frontier model house that is not Google itself. A custom accelerator that an outside frontier lab is willing to bet its inference fleet on is no longer just an internal cost optimisation.
AWS Trainium and Inferentia
AWS runs a two-chip strategy: Inferentia for inference, Trainium for training, both offered as instances anyone can rent. The newest Trainium generation moved to a leading-edge TSMC process and is assembled into very large interconnected clusters. AWS’s pitch is blunt and entirely about price: it positions Trainium and Inferentia as materially cheaper than equivalent GPU instances for the right workloads. That framing matters because, unlike a purely internal chip, these are products you can put on your own bill and reason about directly.
Meta MTIA
Meta’s MTIA — Meta Training and Inference Accelerator — exists to serve the workloads that actually dominate Meta’s compute: recommendation and ranking models, the things that decide what shows up in a feed. Those models have very different memory and compute characteristics from a large language model, and a chip tuned for them looks different from one tuned for transformer inference. Meta has laid out one of the most aggressive multi-generation custom-chip roadmaps in the industry, which is what you would expect from a company whose core business is itself a recommendation engine running at planetary scale.
Microsoft Maia
Maia is Microsoft’s accelerator, developed in close concert with its foundry and aimed at the inference traffic flowing through its cloud and its AI products. It is the newest of the four and the one most clearly built to serve a specific software stack rather than a rentable general market. Like the others, it is fabricated on a leading-edge TSMC node, a reminder that for all the talk of independence from NVIDIA, every one of these chips depends utterly on the same foundry.
The software stack is the real moat
It is tempting to read custom silicon as a hardware story, but the hard part is the software. NVIDIA’s durable advantage was never purely the GPU. It was CUDA and the more than a decade of compilers, libraries, and kernels built around it, which let researchers run almost anything without thinking about the metal underneath. Every hyperscaler building a chip has had to build a matching software stack: a compiler that maps models onto the new hardware, a runtime, and an optimised kernel for every operation that matters. A chip with no mature compiler is a space heater.
This is why the accelerators that succeed are the ones whose owners control both the model and the silicon. When the same organisation defines the workload and designs the chip, it can co-design the two — shaping the model to the hardware and the hardware to the model — and skip the impossible task of supporting everyone’s arbitrary code. It is also why a custom chip is far less useful to an outside customer than its raw specifications suggest: you inherit the owner’s software stack and its sharp edges, not just the FLOPS. The benchmark that matters is your model on their toolchain, not the peak number on a slide.
The economics, stated honestly
The build-vs-buy calculation is not subtle once you write it down. Designing a leading-edge accelerator costs an enormous amount in engineering, IP, mask sets, and software — easily hundreds of millions of dollars before a single chip ships, and most of it sunk regardless of how many you build. That number is absurd for almost everyone. It is rational only when you can amortise it across a fleet large enough that a few percentage points of efficiency dwarf the entire design cost.
That is precisely the hyperscaler situation and precisely not yours. The lesson for everyone below that scale is the inverse of what the headlines suggest: the fact that Google and AWS build chips is a strong argument that you should not. The break-even volume is astronomical. What you can do is consume the result — rent the custom silicon when it fits, rent GPUs when it does not, and let the hyperscalers carry the design risk.
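Written down, the break-even is a single division: the one-off design cost over the saving per deployed chip. The following is a minimal sketch, and every dollar figure in it is an assumed placeholder rather than a quoted cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildVsBuy:
    """Illustrative build-vs-buy inputs; all values are hypothetical."""
    design_cost_usd: float         # one-off: engineering, IP, mask sets, software
    gpu_cost_per_unit_usd: float   # lifetime cost of one merchant accelerator
    asic_cost_per_unit_usd: float  # lifetime cost of an in-house chip doing the same work

    def saving_per_unit(self) -> float:
        return self.gpu_cost_per_unit_usd - self.asic_cost_per_unit_usd

    def break_even_units(self) -> float:
        """Units that must be deployed before the design cost is recovered."""
        saving = self.saving_per_unit()
        if saving <= 0:
            return float("inf")  # the custom chip never pays for itself
        return self.design_cost_usd / saving

    def net_saving(self, units_deployed: int) -> float:
        """Positive once the fleet is past break-even, negative before it."""
        return units_deployed * self.saving_per_unit() - self.design_cost_usd


# Placeholder inputs: a 500M USD design effort and a 10k USD saving per chip.
case = BuildVsBuy(design_cost_usd=500e6,
                  gpu_cost_per_unit_usd=40_000,
                  asic_cost_per_unit_usd=30_000)

print(f"break-even fleet size: {case.break_even_units():,.0f} units")
for fleet in (1_000, 50_000, 500_000):
    print(f"{fleet:>7,} units -> net {case.net_saving(fleet):>16,.0f} USD")
```

Under these assumptions the design cost is only recovered at a fleet of 50,000 accelerators, and a 1,000-chip deployment is still deep in the red. The inputs are invented, but the shape of the curve is the point of the argument above.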
The pragmatic posture that has emerged, and the one we recommend, is dual-track. Custom ASICs handle steady, high-volume, well-understood inference where the workload is stable enough to justify a less flexible chip. Merchant GPUs handle training, experimentation, and anything where the model architecture is still moving. This split — ASICs for predictable inference, GPUs for flexible training — is not a hedge born of indecision. It maps cleanly onto the actual economics of each workload type.
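The dual-track split can be expressed as a simple placement rule. The sketch below is one possible encoding; the `Workload` fields and the example workload names are hypothetical, and a real policy would rest on measured traffic rather than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    CUSTOM_ASIC = "custom_asic"
    MERCHANT_GPU = "merchant_gpu"


@dataclass(frozen=True)
class Workload:
    name: str
    is_inference: bool         # serving, as opposed to training or research
    architecture_stable: bool  # the model architecture is no longer changing
    steady_volume: bool        # traffic is high and predictable


def place(workload: Workload) -> Target:
    """ASICs for steady, stable inference; GPUs for everything else."""
    if (workload.is_inference
            and workload.architecture_stable
            and workload.steady_volume):
        return Target.CUSTOM_ASIC
    return Target.MERCHANT_GPU


# Hypothetical workloads, for illustration only.
examples = [
    Workload("ranking-serving", True, True, True),
    Workload("new-architecture-experiments", False, False, False),
    Workload("pilot-chatbot", True, False, False),
]
for w in examples:
    print(f"{w.name}: {place(w).value}")
```

The rule defaults to the GPU whenever any condition fails, which mirrors the argument: flexibility is the safe choice until the workload has proven itself stable.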
What this means for your AI implementation
The proliferation of accelerators is, for a buyer, mostly good news with a sharp edge. The good news is choice and downward price pressure: more credible silicon options mean less monopoly rent baked into your compute bill. The sharp edge is portability. Each accelerator family has its own compiler, its own kernels, and its own quirks, and code tuned for one does not automatically run well — or at all — on another.
The defensive move is to keep your AI implementation a layer above the silicon. Target a portable interface, keep model-serving code free of hardware-specific assumptions, and benchmark on the accelerator you will actually deploy on rather than the one in the marketing deck. For a steady inference workload — the entity-extraction service behind a Hospital Management System, the forecasting model inside a School ERP — custom silicon can cut cost sharply, but only if you measured your real traffic on it first. The hyperscalers built their chips by knowing their workloads cold. Capturing the savings on your side starts the same way: know yours.
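One way to keep serving code a layer above the silicon is to depend on a narrow interface and to measure each candidate accelerator behind it with the same replayed traffic. The sketch below assumes a hypothetical `InferenceBackend` interface and a stand-in `EchoBackend`; neither corresponds to any vendor SDK, and the hourly price is a placeholder.

```python
import time
from typing import Callable, Iterable, Protocol, Sequence


class InferenceBackend(Protocol):
    """Hypothetical hardware-neutral interface the serving code depends on."""
    name: str

    def infer(self, batch: Sequence[str]) -> Sequence[str]: ...


def measure_throughput(backend: InferenceBackend,
                       batches: Iterable[Sequence[str]],
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Replay recorded batches and return requests served per second."""
    start = clock()
    served = 0
    for batch in batches:
        served += len(backend.infer(batch))
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; replay more traffic")
    return served / elapsed


def cost_per_million_requests(requests_per_second: float,
                              hourly_price_usd: float) -> float:
    """Convert measured throughput and an instance price into a unit cost."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return hourly_price_usd / (requests_per_second * 3600) * 1_000_000


class EchoBackend:
    """Stand-in backend so the sketch runs; a real one would wrap a device."""
    name = "echo"

    def infer(self, batch: Sequence[str]) -> Sequence[str]:
        return [text.upper() for text in batch]


if __name__ == "__main__":
    # Placeholder traffic and price; substitute recorded requests and real rates.
    traffic = [["patient admitted", "invoice overdue"]] * 1000
    rps = measure_throughput(EchoBackend(), traffic)
    print(f"{rps:,.0f} req/s, "
          f"{cost_per_million_requests(rps, hourly_price_usd=4.0):.4f} USD per 1M")
```

Running the same harness against each backend yields a cost per million requests that can be compared directly, which is a more useful figure than peak FLOPS.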
Picking the right accelerator is a workload question, not a brochure question. We benchmark your real traffic across GPU and custom silicon and build AI implementations that port cleanly between them. Talk to our engineers.