Analog In-Memory Compute: Escaping von Neumann

Compute-in-memory does matrix-multiply inside the memory array. The promise, the precision/noise tradeoff, and what Mythic and IBM actually shipped.

Every digital AI accelerator fights the same enemy: moving data. A multiply-accumulate is cheap; fetching the operand from DRAM and writing the result back is not. On a modern inference workload the energy spent shuttling weights and activations across the memory bus dwarfs the energy of the arithmetic itself. This is the von Neumann bottleneck, and you cannot quantize your way out of it. Analog in-memory compute proposes a different escape: stop moving the weights and do the math where they already sit.

The physics of compute-in-memory

The core idea is almost embarrassingly simple, and it falls straight out of two laws every engineer learned in school. Arrange memory cells in a grid — a crossbar — where each cell stores a weight as a physical conductance. Apply your input vector as voltages on the rows. By Ohm’s law, the current through each cell is voltage times conductance, which is input times weight. By Kirchhoff’s current law, the currents summing down each column give you the dot product of the input with that column of weights.

That means a full matrix-vector multiply — the workhorse operation of every neural network layer — happens in a single step, in physics, the instant you apply the voltages. No fetch, no per-element multiply, no accumulate loop. The array is the matmul. A good overview of the principles and applications of memristive in-memory computing walks through the mechanics in detail.
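To make the Ohm-plus-Kirchhoff argument concrete, here is a minimal Python sketch of an ideal crossbar read. The function name `crossbar_mvm` and the voltage and conductance values are illustrative assumptions, not figures from any real device, and the model ignores wire resistance, noise, and the data converters.

```python
def crossbar_mvm(voltages, conductances):
    """Ideal crossbar read: column currents produced by a set of row voltages.

    voltages:     row voltages V_i in volts, one per row
    conductances: rows x cols grid of cell conductances G_ij in siemens
    returns:      column currents I_j in amps
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    assert len(voltages) == n_rows, "one voltage per row"
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Ohm's law gives the cell current; Kirchhoff's law sums it per column.
            currents[j] += voltages[i] * conductances[i][j]
    return currents


# Hypothetical 3x2 array, values chosen only for illustration.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
V = [0.1, 0.2, 0.3]
I = crossbar_mvm(V, G)
print(I)
```

The loop is only there to spell out the arithmetic; in the array all of these products and sums happen at once. Conductances cannot be negative, so practical designs commonly encode a signed weight as the difference between two cells, which this sketch leaves out.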

The storage element varies by camp. Some use ReRAM (memristors), where a metal-oxide layer changes resistance. Some use phase-change memory (PCM), which switches a chalcogenide glass between amorphous and crystalline states with different conductances. Some use plain NOR flash, storing weights as trapped charge on a floating gate. The principle is identical; the device tradeoffs differ.

[Image: extreme macro of a memory crossbar die showing a grid of intersecting metal lines]

Who actually shipped something

Mythic

Mythic is the clearest commercial example of analog compute-in-memory at the edge. Their Analog Matrix Processor stores weights directly in NOR flash arrays and performs the matrix multiply in the analog domain inside those arrays. The M1076 part integrates 76 analog tiles, holds up to 80 million weight parameters on-chip, and delivers up to 25 TOPS without any external memory — in form factors down to an M.2 card. The company shipped commercial products and raised further funding to scale, after an earlier near-death restructuring that is itself a lesson in how hard this path is.

The pitch is power and latency at the edge: because the weights never leave the array, a Mythic part runs vision models in a thermal and power envelope that a comparable digital accelerator struggles to hit. The constraint is the same one that defines the whole field — weights are baked into analog conductances, so this is an inference part, and reprogramming is not free.

IBM

IBM has pushed the research frontier hardest on PCM-based analog AI. The HERMES project chip is a fully integrated mixed-signal part: 64 analog in-memory cores built on phase-change memory, plus digital processing units and an on-chip communication network. Published in Nature Electronics, it demonstrated that you can keep deep-learning accuracy while doing the multiply-accumulate inside the memory and cutting both compute time and energy. IBM’s broader argument that analog in-memory computing is “coming of age” is worth reading as the optimistic case from the people closest to the silicon.

It is worth separating IBM’s two efforts. HERMES is true analog compute-in-memory. NorthPole, which often gets lumped in, is digital near-memory — it intertwines SRAM with compute but runs ordinary digital arithmetic. Both attack the data-movement problem; only one does the math in the analog domain. Conflating them muddies the engineering picture.

The tradeoff that defines everything: precision and noise

Here is the catch, and it is not a footnote. Analog computation is noisy, and that noise is structural, not a bug you patch in a later stepping.

When you store a weight as a physical conductance, you cannot place it exactly. Device-to-device variability means two cells programmed to the same target differ. Reading the cell introduces random fluctuation — read noise. Over time conductances drift, especially with PCM. The DACs that drive the rows and the ADCs that digitize the column currents have finite resolution and add their own quantization error. A survey of these non-idealities in memristor crossbars catalogs the full list: programming error, read noise, conductance drift, and endurance limits.
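The three device-level effects above can be sketched as a toy model of a single cell. This is a sketch under stated assumptions: Gaussian programming error and read noise, and a power-law drift of the form G(t) = G0 * (t / t0) ** -nu, which is a commonly used description of PCM drift. Every function name and every numeric value here (`sigma_prog`, `sigma_read`, `nu`, the 10 microsiemens target) is hypothetical and chosen only for illustration.

```python
import random


def program_cell(target_g, sigma_prog, rng):
    """Programming error: the stored conductance misses its target."""
    return max(0.0, rng.gauss(target_g, sigma_prog))


def drift(g0, t, t0=1.0, nu=0.05):
    """Assumed power-law drift: G(t) = G0 * (t / t0) ** -nu, with t in seconds."""
    return g0 * (t / t0) ** (-nu)


def read_cell(g, sigma_read, rng):
    """Read noise: every read sees a fresh random fluctuation."""
    return max(0.0, rng.gauss(g, sigma_read))


rng = random.Random(0)           # seeded so the run is repeatable
target = 10e-6                   # siemens, illustrative
stored = program_cell(target, sigma_prog=0.3e-6, rng=rng)
aged = drift(stored, t=1e4)      # same cell a few hours later
reads = [read_cell(aged, sigma_read=0.1e-6, rng=rng) for _ in range(5)]

print(f"target {target:.3e} S, stored {stored:.3e} S, after drift {aged:.3e} S")
print("five reads:", [f"{r:.3e}" for r in reads])
```

Even in this simplified model, no two reads agree and none of them returns the programmed target, which is the sense in which the noise is structural.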

The practical consequence: native analog precision is low — think a handful of equivalent bits, not the 16 or 32 a GPU gives you for free. For many inference workloads that is fine, because neural networks tolerate noise and you were going to quantize to 8-bit or 4-bit anyway. For anything needing exact arithmetic, it is disqualifying.
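The article does not say how "equivalent bits" is measured. One common convention from data-converter practice is the effective number of bits (ENOB), which converts a measured signal-to-noise-and-distortion ratio (SINAD) into the resolution of an ideal converter with the same error. The sketch below uses that convention as an assumption; the SINAD values are hypothetical inputs, not measurements of any chip.

```python
def effective_bits(sinad_db):
    """ENOB = (SINAD - 1.76 dB) / 6.02 dB per bit, for a full-scale sine input."""
    return (sinad_db - 1.76) / 6.02


# Hypothetical SINAD figures, from a noisy analog path to a clean digital one.
for sinad_db in (20.0, 26.0, 50.0, 98.0):
    print(f"{sinad_db:5.1f} dB -> {effective_bits(sinad_db):4.1f} effective bits")
```

Each additional effective bit needs about 6 dB more signal-to-noise, so halving the noise buys only one bit.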

The field’s answer is engineering around the physics. Bit-slicing spreads a high-precision weight across multiple crossbars, each holding a slice of the value, then recombines the partial results digitally. Closed-loop programming iteratively nudges each conductance toward its target and has pushed effective precision higher than people once thought possible. Error detection and correction schemes operate directly on the analog crossbar. None of this is free: every recovery technique adds area, energy, or latency, and the ADCs alone can dominate the power budget of an in-memory tile. The honest summary: you trade some of the theoretical energy win back to claw precision into a usable range.
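Bit-slicing is easiest to see in integer arithmetic. The sketch below splits 8-bit unsigned weights into four 2-bit slices, computes one partial dot product per slice (standing in for one crossbar column each), and recombines them with a digital shift-and-add. The slice width, slice count, and function names are assumptions for illustration; real designs choose them per device and also have to handle signed weights and noisy partial sums.

```python
def slice_weight(w, bits_per_slice, n_slices):
    """Split an unsigned integer weight into slices, least significant first."""
    mask = (1 << bits_per_slice) - 1
    return [(w >> (k * bits_per_slice)) & mask for k in range(n_slices)]


def bit_sliced_dot(inputs, weights, bits_per_slice=2, n_slices=4):
    """Dot product computed slice by slice, then recombined digitally."""
    sliced = [slice_weight(w, bits_per_slice, n_slices) for w in weights]
    total = 0
    for k in range(n_slices):
        # Slice k of every weight sits in its own low-precision column.
        partial = sum(x * s[k] for x, s in zip(inputs, sliced))
        # Digital shift-and-add restores the slice's place value.
        total += partial << (k * bits_per_slice)
    return total


inputs = [3, 1, 2]
weights = [200, 17, 255]   # each fits in 8 bits
direct = sum(x * w for x, w in zip(inputs, weights))
print(bit_sliced_dot(inputs, weights), direct)
```

Four columns and four readouts now do the work of one, which is where the extra area and energy come from.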

[Image: precision lab bench with a source-measure unit and a wafer under focused light]

Why the ADC is the part nobody talks about

It is worth dwelling on one number that decides whether an in-memory design is actually efficient: the cost of getting the answer out. The crossbar produces analog currents; before you can use them you must digitize each column with an analog-to-digital converter. ADCs are expensive in area and power, and their cost scales steeply with resolution: every extra bit of precision doubles the number of levels the converter must resolve, and as a rule of thumb its area and power grow roughly in step. In a real in-memory tile the ADCs and the surrounding peripheral circuitry frequently consume more energy than the multiply they are reading out.
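A minimal sketch of the readout step, assuming an idealized uniform ADC with no noise or nonlinearity of its own. The function name, the 4 microamp full-scale range, and the 2.2 microamp column current are hypothetical. The point is only that the level count is 2 ** bits, so each added bit halves the step size by doubling what the converter has to resolve.

```python
def adc_quantize(current, full_scale, bits):
    """Ideal uniform ADC: map a current in [0, full_scale) to an integer code."""
    levels = 2 ** bits
    lsb = full_scale / levels                  # size of one step, in amps
    code = int(current / lsb)                  # truncate to the step below
    code = min(max(code, 0), levels - 1)       # clip to the valid code range
    return code, code * lsb


column_current = 2.2e-6   # amps, illustrative
full_scale = 4e-6         # amps, illustrative

for bits in (4, 6, 8):
    code, value = adc_quantize(column_current, full_scale, bits)
    error = column_current - value
    print(f"{bits} bits: {2 ** bits:3d} levels, code {code:3d}, error {error:.2e} A")
```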

This creates a perverse incentive that shapes the whole field. The crossbar wants to be large, because a bigger array amortizes the fixed peripheral cost over more multiply-accumulate operations. But a larger array accumulates more noise down each column and needs higher ADC resolution to resolve the result, which drives the peripheral cost back up. Designers live on this tradeoff curve, and where a given product lands on it explains most of the difference between architectures. When you read an in-memory efficiency figure, check whether it accounts for the data converters or quietly quotes only the array. The honest figures include the converters; the flattering ones do not.
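The array-size tension can be put in rough numbers. The sketch below makes two simplifying assumptions that are not from the article: inputs and weights are unsigned integers of a given width, and per-cell noise is independent, so it adds in quadrature down a column. Function names and parameter values are hypothetical.

```python
import math


def full_precision_bits(rows, input_bits, weight_bits):
    """Bits needed to represent the largest possible column sum exactly."""
    max_sum = rows * (2 ** input_bits - 1) * (2 ** weight_bits - 1)
    return max_sum.bit_length()


def column_noise_sigma(rows, sigma_cell):
    """Standard deviation of the summed noise, assuming independent cells."""
    return sigma_cell * math.sqrt(rows)


for rows in (64, 256, 1024):
    bits = full_precision_bits(rows, input_bits=1, weight_bits=2)
    noise = column_noise_sigma(rows, sigma_cell=1.0)
    print(f"rows={rows:5d}  exact-readout bits={bits:2d}  noise sigma={noise:5.1f}")
```

Under these assumptions, quadrupling the rows adds two bits to an exact readout and doubles the accumulated noise. Designs that use fewer ADC bits than this are accepting truncation error in exchange for a cheaper converter.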

There is a second cost that is easy to forget: the weights have to get into the array. Programming analog conductances is slow and, for some device types, wears the cells out. That is fine for a model you write once and run a billion times. It is a dealbreaker for anything that updates weights frequently, which is the structural reason this technology is an inference story, not a training one.

Where it fits, and where it doesn’t

Good fit: fixed-weight inference at the edge where power and latency dominate and the model tolerates low precision. Vision pipelines, keyword spotting, sensor classification, anything you would already run at 8-bit or below. If the weights rarely change and the thermal budget is tight, compute-in-memory is one of the few architectures that meaningfully moves the energy-per-inference number.

Poor fit: training, frequently updated weights, or workloads needing exact results. Writing analog cells is slow and wears them out, so you do not train in place. And the precision ceiling rules out scientific computing that needs deterministic, reproducible arithmetic.

The open question is large generative models. On-array capacity is finite, the same wall NorthPole hits, so today’s parts target models that fit. Whether analog scales to multi-billion-parameter LLMs without the ADC and bit-slicing overhead eating the advantage is unproven. A good overview of compute-in-memory for LLM inference lays out why this is still an active research problem rather than a settled win.
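A back-of-envelope capacity check shows why fit is the first question. The 80 million weights per chip is the Mythic M1076 figure quoted earlier in this article; the model sizes and the `slices_per_weight` parameter are hypothetical. The estimate counts weight storage only and ignores activations, array utilization, and chip-to-chip communication, so treat it as a lower bound.

```python
def chips_needed(model_params, weights_per_chip, slices_per_weight=1):
    """Minimum chip count to hold every weight on-array (ceiling division)."""
    cells = model_params * slices_per_weight
    return -(-cells // weights_per_chip)


WEIGHTS_PER_CHIP = 80_000_000   # figure quoted above for the M1076

# Hypothetical model sizes: a small vision network and a 7B-parameter LLM.
print(chips_needed(25_000_000, WEIGHTS_PER_CHIP))
print(chips_needed(7_000_000_000, WEIGHTS_PER_CHIP))
print(chips_needed(7_000_000_000, WEIGHTS_PER_CHIP, slices_per_weight=2))
```

The small model fits on one part with room to spare, while the large one needs dozens before any bit-slicing overhead is counted.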

The engineering read

Analog in-memory compute is not a general-purpose replacement for digital accelerators, and anyone selling it that way is overreaching. It is a specialist tool for a real and growing class of problems: low-precision, fixed-weight inference where data movement is the binding constraint. The von Neumann bottleneck is genuine, and crossbars genuinely sidestep it — but the precision and noise tax is the price of admission, and you have to budget for it up front.

For the AI implementation and Operational Automation work we do, the calculus is the same one we apply to any exotic substrate: does the power or latency constraint actually bind, and is the model tolerant of analog noise? When both are true — an always-on vision trigger, an edge classifier in a remote sensor feeding a School ERP attendance system or a clinical-monitoring front end on a Hospital Management System — compute-in-memory earns a serious look. When they are not, a well-quantized digital part wins on cost and tooling every time.


Got an edge inference workload where data movement, not math, is your power budget? We’ll tell you honestly whether analog buys you anything. Talk to our architecture team.