The NPU in Your Pocket: On-Device AI

For a decade the default answer to “where does the model run?” was: someone else’s GPU, behind an API, across a network you don’t control. That answer is quietly losing its monopoly. The interesting inference is moving onto the device in your hand — and the engineering that makes it work has very little to do with the parts that get press releases.

This is a piece about the plumbing: NPUs, quantization, runtimes, and the one bottleneck that decides what actually ships. If you build consumer hardware, healthcare endpoints, or field tools, this is the layer where latency budgets stop being a slogan and become a data problem you have to solve in silicon.

Why inference moved to the edge

Four forces pushed it there, and none of them are about novelty.

Latency. A round trip to a data center is a hard floor you cannot optimize past. For anything interactive — live captioning, camera effects, a keyboard that finishes your sentence — the network is the slow part. On-device inference removes the round trip entirely. The model can’t be faster than the bandwidth between its weights and its compute, but at least it isn’t waiting on a cell tower.

Privacy. Data that never leaves the device is data you never have to govern in transit. For a point-of-care device in a hospital management system, or a tablet running a school ERP offline in a classroom, “the inference happened locally and the bytes stayed on the chip” is a stronger compliance story than any encryption-in-flight diagram.

Cost. Cloud inference is a per-token meter that never stops running. Edge inference is a fixed silicon cost you already paid when the customer bought the device. At consumer scale, moving even a fraction of inference off the meter changes the unit economics of a feature.

Offline. The network is not a given. A field device, a rural clinic, a plane, a basement — these are not edge cases, they are most of the world some of the time. A feature that degrades gracefully without connectivity is simply a better feature.

The accelerator landscape

The NPU is not a GPU. It’s a fixed-function matrix engine tuned for the low-precision multiply-accumulate that dominates neural network inference, and it trades flexibility for performance-per-watt. Every major SoC vendor now ships one.

Apple’s Neural Engine sits next to the CPU and GPU on Apple silicon and is reached through Core ML; recent work has people running quantized one-to-three-billion-parameter language models against it (Apple’s own Llama-on-Core-ML writeup is the canonical reference). Qualcomm’s Hexagon NPU anchors the Snapdragon line and its heterogeneous compute story — the pitch is that the NPU, CPU, and GPU share the generative-AI workload rather than any one block carrying it (Qualcomm’s Hexagon page). Google’s Tensor SoC carries an Edge-TPU-derived block that powers the Gemini Nano features on Pixel. For the broader embedded world, Arm’s Ethos-U family targets microcontroller-class IoT silicon and now explicitly supports transformer networks, not just CNNs. MediaTek’s APU in Dimensity rounds out the phone market, and notably pairs with Google’s LiteRT through a NeuroPilot accelerator.

[Figure: Exploded view of a smartphone SoC with the NPU block highlighted]

The strategic read: the hardware is converging faster than the software. You can buy capable NPUs from five vendors. The hard part is targeting them all without rewriting your model five times.

The engineering of fitting a model on a phone

A model that trains happily on a cluster does not fit on a phone by accident. Getting it there is a sequence of deliberate, lossy compromises — each one a tradeoff between size, speed, and accuracy you have to measure, not assume.

Quantization

This is the lever that moves the most. Training happens in FP16 or BF16; inference does not need that range. Dropping weights from 16 bits to INT8 roughly halves the memory footprint and lets the NPU’s integer units do the work they were built for. INT4 halves it again, to about a quarter of the 16-bit size. Recent Core ML versions support 4-bit block-wise quantization and group-wise palettization; a common pattern keeps activations in FP16 while pushing weights to INT4 with per-channel granularity. The accuracy cost is real and model-dependent: you do not get the compression for free, and the only honest way to know the damage is to evaluate the quantized model on your own task, not on a leaderboard.
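To make the tradeoff concrete, here is a minimal sketch of symmetric per-channel weight quantization in plain Python. It is an illustration, not how Core ML or any other toolchain implements it: the function names, the tiny weight matrix, and the choice to treat each row as one output channel are all assumptions made for the example.

```python
# Symmetric per-channel weight quantization, pure Python for clarity.
# Assumption: each row of `weights` is one output channel.

def quantize_per_channel(weights, bits=8):
    """Return (integer rows, per-row scales) for symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the signed range.
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Map integers back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

def max_abs_error(a, b):
    """Largest element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical weights: two channels with very different ranges.
weights = [[0.50, -0.25, 0.10, 0.00],
           [2.00, -1.50, 0.75, 0.30]]

for bits in (8, 4):
    q, s = quantize_per_channel(weights, bits)
    err = max_abs_error(weights, dequantize(q, s))
    print(f"INT{bits}: max abs round-trip error = {err:.4f}")
```

The round-trip error grows as the bit width shrinks, which is the lossy part of the compromise. Weight error is only a proxy, though; the number that matters is still task accuracy on the quantized model.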

Pruning and distillation

Quantization shrinks each weight. Pruning removes weights entirely — the structured kind, which drops whole channels or heads, is the kind hardware can actually exploit. Distillation takes a different route: train a smaller student model to imitate a larger teacher, so you ship the student. Both are upstream of the runtime and both are how a model that was never going to fit becomes one that does.
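A rough sketch of the structured kind of pruning, assuming a layer stored as a list of rows where each row is one output channel. Ranking channels by L1 norm is one common heuristic and is an assumption here, as are the function name and the toy numbers; distillation is not shown.

```python
def prune_channels(weights, keep_ratio):
    """Drop whole output channels (rows) with the smallest L1 norm.

    Returns the smaller dense matrix and the indices of the kept rows,
    so the next layer's input columns can be trimmed to match.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(weights) * keep_ratio))
    norms = [sum(abs(w) for w in row) for row in weights]
    # Rank channel indices from largest to smallest norm.
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])        # preserve original channel order
    return [weights[i] for i in kept], kept

# Hypothetical layer with two strong and two near-zero channels.
layer = [[0.90, -0.80, 0.70],
         [0.01, 0.02, -0.01],
         [0.50, 0.40, -0.60],
         [0.00, 0.03, 0.00]]

pruned, kept = prune_channels(layer, keep_ratio=0.5)
print("kept channels:", kept)
print("rows before/after:", len(layer), len(pruned))
```

The result is a smaller dense matrix rather than a same-sized matrix with zeros in it, which is why hardware can exploit it without any sparse-kernel support.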

Operator fusion

A graph of separate operations means a separate trip to memory for each intermediate result. Fusion collapses adjacent operations — a convolution, a bias add, an activation — into a single kernel that keeps intermediates in registers. The compute didn’t change; the memory traffic did. That distinction is the whole game.
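The difference is easy to see in a toy version. The sketch below uses an element-wise scale as a stand-in for the convolution (an assumption to keep it short), followed by a bias add and a ReLU, and simply counts how many intermediate values each variant writes out. Python lists are standing in for memory and local variables for registers.

```python
def unfused(x, scale, bias):
    """Three separate passes; each one materializes a full intermediate."""
    scaled = [v * scale for v in x]        # intermediate 1, written out
    biased = [v + bias for v in scaled]    # intermediate 2, written out
    out = [max(0.0, v) for v in biased]    # final result
    intermediates_written = len(scaled) + len(biased)
    return out, intermediates_written

def fused(x, scale, bias):
    """One pass; intermediate values live only in a local variable."""
    out = []
    for v in x:
        t = v * scale + bias               # never stored as a full array
        out.append(t if t > 0.0 else 0.0)
    return out, 0

x = [-2.0, -0.5, 0.0, 1.5, 3.0]
out_a, traffic_a = unfused(x, scale=2.0, bias=1.0)
out_b, traffic_b = fused(x, scale=2.0, bias=1.0)
print("same result:", out_a == out_b)
print("intermediate values written:", traffic_a, "vs", traffic_b)
```

Both versions do the same multiplies and adds. Only the number of round trips through memory changes.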

Memory bandwidth is the real bottleneck

Here is the thing the TOPS marketing number hides: most on-device inference is not compute-bound. It’s memory-bound. The NPU can multiply faster than the SoC can feed it weights from DRAM.

[Figure: A bandwidth bottleneck between a large weight reservoir and a small compute core]

For a language model generating tokens one at a time, every token requires streaming essentially all of the model’s weights from memory through the accelerator. The arithmetic per token is trivial; the bytes moved are not. This is precisely why quantization buys speed and not just space: INT4 weights are a quarter the bytes of FP16 weights, so the bandwidth-bound decode step can run up to roughly four times faster. Disaggregation strategies that run prefill on the NPU and decode on the GPU exist because the two phases stress different parts of the system: prefill processes the whole prompt in one batch and leans on compute, while decode re-reads the weights for every token and leans on bandwidth.
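The decode argument reduces to one division. The sketch below computes an upper bound on tokens per second as bandwidth divided by weight bytes. The 3-billion-parameter model and the 50 GB/s bandwidth figure are hypothetical numbers chosen for illustration, and the bound ignores quantization metadata, activations, and any cache traffic.

```python
# Bytes per stored weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_bytes(n_params, precision):
    """Total bytes of weights that must be streamed per generated token."""
    return n_params * BYTES_PER_WEIGHT[precision]

def decode_tokens_per_second_bound(n_params, precision, bandwidth_bytes_per_s):
    """Upper bound on decode rate if each token streams all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, precision)

N_PARAMS = 3e9       # hypothetical 3B-parameter model
BANDWIDTH = 50e9     # hypothetical 50 GB/s of usable DRAM bandwidth

for precision in ("fp16", "int8", "int4"):
    gb = weight_bytes(N_PARAMS, precision) / 1e9
    tps = decode_tokens_per_second_bound(N_PARAMS, precision, BANDWIDTH)
    print(f"{precision}: {gb:.1f} GB of weights, at most {tps:.1f} tokens/s")
```

Nothing about the compute units appears in the formula, which is the point: under this model the TOPS figure does not move the decode rate at all.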

If you take one thing from this section: when you profile on-device inference and it’s slow, look at bytes moved before you look at FLOPs. The bottleneck is almost always the channel, not the core. Latency budgets are data problems.
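One way to apply that advice before reaching for a profiler is a back-of-the-envelope roofline check: compare the operations a workload performs per byte of weights read against the operations the chip can perform per byte it can deliver. The sketch below assumes a dense model where each weight contributes one multiply and one add per token, and the 40 TOPS and 50 GB/s figures are hypothetical.

```python
def arithmetic_intensity(tokens_in_flight, bytes_per_weight):
    """Ops per byte of weights read: 2 ops (multiply + add) per weight
    per token, with each weight read from memory once per pass."""
    return 2.0 * tokens_in_flight / bytes_per_weight

def bound_by(tokens_in_flight, bytes_per_weight,
             peak_ops_per_s, bandwidth_bytes_per_s):
    """Classify a pass as 'compute' or 'memory' bound (simple roofline)."""
    # Machine balance: ops the core can execute per byte the memory delivers.
    machine_balance = peak_ops_per_s / bandwidth_bytes_per_s
    intensity = arithmetic_intensity(tokens_in_flight, bytes_per_weight)
    return "compute" if intensity >= machine_balance else "memory"

PEAK_OPS = 40e12     # hypothetical 40 TOPS accelerator
BANDWIDTH = 50e9     # hypothetical 50 GB/s of DRAM bandwidth
INT4 = 0.5           # bytes per weight

print("decode, 1 token:    ", bound_by(1, INT4, PEAK_OPS, BANDWIDTH))
print("prefill, 512 tokens:", bound_by(512, INT4, PEAK_OPS, BANDWIDTH))
```

With these made-up numbers, single-token decode lands far below the machine balance and a 512-token prefill lands above it, which matches the prefill-versus-decode split described earlier. Real hardware will differ; the shape of the comparison is what carries over.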

Small language models and on-device generative AI

The generative-AI-on-device story runs on small language models — SLMs — not shrunken frontier models. Phi-class, Gemma-class, and Llama-class models in the one-to-three-billion-parameter range are the sweet spot: large enough to be useful for summarization, rewriting, extraction, and structured assistance; small enough that, once quantized, their weights fit in the memory budget and stream fast enough to generate at a usable rate. The work here is in getting Gemma-class models running well on phone NPUs — not in chasing a benchmark a 70-billion-parameter cloud model will always win.

Runtimes: cut the trendy framework, keep the load-bearing tools

The runtime is where portability and performance fight. The honest landscape, vendor-neutral:

Core ML — the path to Apple’s Neural Engine. Non-negotiable on Apple platforms.
LiteRT — Google’s successor to TFLite, tuned for Gemma-class models and the path to MediaTek’s APU through NeuroPilot.
ONNX Runtime — the cross-platform abstraction layer when you need one model to target many backends.
ExecuTorch — Meta’s PyTorch-native on-device runtime, which reached v1.0 in late 2025 with a small base footprint and a dozen hardware backends, and runs in their own apps at billions-of-users scale.
llama.cpp — the pragmatic choice for local LLM inference, GGUF weights, and getting a model running on commodity hardware tonight without a vendor toolchain.

The temptation is to standardize on the newest, most-starred framework. Resist it. The right call is the boring one: pick the runtime that owns the silicon you’re shipping on, and add an abstraction layer only when you genuinely target multiple backends. A framework earns its place by being load-bearing under real traffic, not by trending. Boring tools win when they earn it — and on the edge, the ones that own the hardware path have earned it.

The power and thermal budget is non-negotiable

A phone has no fan. Every milliwatt the NPU burns becomes heat the chassis has to shed, and sustained inference that warms the device gets throttled by the OS — which means your carefully benchmarked token rate is a number you hit for thirty seconds and never again. The performance that matters is sustained performance inside the thermal envelope, not the peak you can quote. This is why performance-per-watt, not raw TOPS, is the figure of merit. Design for the steady state, then measure the device hot, not cold.
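A sketch of what measuring the device hot can look like: record throughput in fixed windows over a long run, then report the best window and the average of the late windows separately. Everything here is an assumption for illustration, including the function names and the `SimulatedDevice`, which stands in for real hardware with a made-up throttle so the example runs instantly. On a real device you would pass your runtime's own token-generation call and leave the default clock.

```python
import time

def sustained_tokens_per_second(generate_token, duration_s, window_s,
                                clock=time.monotonic):
    """Call generate_token() for duration_s seconds; return tokens/s per window."""
    if window_s <= 0 or duration_s < window_s:
        raise ValueError("need 0 < window_s <= duration_s")
    rates = []
    start = clock()
    window_start, count = start, 0
    while True:
        generate_token()
        count += 1
        now = clock()
        if now - window_start >= window_s:
            rates.append(count / (now - window_start))
            window_start, count = now, 0
        if now - start >= duration_s:
            return rates

def summarize(rates, steady_fraction=0.5):
    """Peak = best single window; sustained = mean of the final windows."""
    tail = rates[-max(1, int(len(rates) * steady_fraction)):]
    return {"peak": max(rates), "sustained": sum(tail) / len(tail)}

class SimulatedDevice:
    """Stand-in for hardware: token time doubles once 'hot' (made-up numbers)."""
    def __init__(self):
        self.t = 0.0
    def clock(self):
        return self.t
    def generate_token(self):
        self.t += 0.0625 if self.t < 32.0 else 0.125

device = SimulatedDevice()
rates = sustained_tokens_per_second(device.generate_token, duration_s=64.0,
                                    window_s=1.0, clock=device.clock)
summary = summarize(rates)
print(summary)
```

The two numbers diverge as soon as throttling kicks in, and the sustained one is the figure a feature should be designed around.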

Hybrid routing: the honest architecture

The mature answer is not “everything on device.” It’s a router. Cheap, latency-sensitive, privacy-bound work runs locally on the SLM. Hard reasoning, long context, and anything the small model can’t do with confidence routes to the cloud. The engineering is in the routing decision, knowing when the on-device model is good enough, and in degrading gracefully to local-only when the network is gone. Done well, the user never sees the seam. That is an AI implementation discipline and an operational automation problem before it is a modeling problem, and it lives or dies on the same data platform thinking that governs everything else: know your inputs, measure your path, instrument the handoff.
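A minimal sketch of such a router, under several assumptions: the local model returns a confidence score alongside its answer, a fixed threshold decides what counts as good enough, and the `path` field is the only instrumentation. The names `route`, `Route`, `fake_local`, and `fake_cloud` are hypothetical; a production router would need a far better confidence signal than the prompt-length stub used here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Route:
    answer: str
    path: str    # "local", "cloud", or "local-degraded"; log this at the handoff

def route(prompt: str,
          run_local: Callable[[str], Tuple[str, float]],
          run_cloud: Callable[[str], str],
          network_up: Callable[[], bool],
          must_stay_local: bool = False,
          min_confidence: float = 0.8) -> Route:
    """Try the on-device model first; escalate only when allowed and needed."""
    answer, confidence = run_local(prompt)
    if must_stay_local or confidence >= min_confidence:
        return Route(answer, "local")
    if not network_up():
        return Route(answer, "local-degraded")    # offline: best local effort
    try:
        return Route(run_cloud(prompt), "cloud")
    except (ConnectionError, TimeoutError):
        return Route(answer, "local-degraded")    # network failed mid-request

# Hypothetical stand-ins for the two models.
def fake_local(prompt):
    return ("short local answer", 0.9 if len(prompt) < 40 else 0.3)

def fake_cloud(prompt):
    return "long cloud answer"

print(route("fix this typo", fake_local, fake_cloud, lambda: True).path)
print(route("x" * 80, fake_local, fake_cloud, lambda: True).path)
print(route("x" * 80, fake_local, fake_cloud, lambda: False).path)
```

Running the local model first costs a little latency on requests that end up escalating, but it means there is always an answer in hand when the network disappears.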

This is the plumbing-first worldview applied to silicon. The model is the part everyone talks about. The bytes, the budget, and the routing are the parts that decide whether it ships.

Stop quoting peak TOPS and start profiling bytes moved on a hot device — that’s where your latency budget actually lives. Talk to pdpspectra about engineering edge AI that survives the thermal envelope.