Mechanistic Interpretability: Reading the Weights

Circuits, features, superposition, and sparse autoencoders. What mechanistic interpretability gives you when a model has to be trusted in production.

For most of the deep learning era we treated trained networks as sealed boxes. You measured them at the edges (eval sets in, predictions out) and inferred behavior from statistics. That worked while models recommended products. It stops working when a model approves a loan, triages a patient, or drafts the clinical note in a hospital management system. The question shifts from “how often is it right” to “what is it actually doing, and will it keep doing that on inputs we never tested.” Mechanistic interpretability is the attempt to answer that question by reading the weights directly instead of inferring behavior from outputs.

This is not explainability in the SHAP-values sense. Feature attribution tells you which input tokens correlated with an output. Mechanistic interpretability tries to recover the internal algorithm: the intermediate representations a network computes and the computational pathways that connect them. The ambition is closer to reverse-engineering a compiled binary than to fitting a surrogate model.

Features, not neurons

The first instinct is to read individual neurons. Find the neuron that fires on “Python code,” find the one that fires on “anger,” and you have a dictionary. It does not work, because neurons are polysemantic. A single neuron in a language model can fire on academic citations, HTTP headers, and Korean text, with no obvious common thread. The unit of meaning is not the neuron.

The explanation that holds up is superposition. A model needs to represent far more concepts than it has dimensions, so it packs many features into overlapping directions in activation space, accepting a little interference in exchange for capacity. Chris Olah’s team at Anthropic laid this out concretely in Toy Models of Superposition, showing that when features are sparse, a network will reliably store more of them than it has neurons, arranging them as near-orthogonal directions rather than clean per-neuron codes.
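
The geometry is easy to see in a toy setting. The sketch below is not a trained model and not the setup from Toy Models of Superposition; it assumes random unit vectors as stand-ins for feature directions, with hypothetical sizes (64 features in 16 dimensions). It measures how much the directions overlap and checks that a single active feature can still be read back despite the interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 64, 16  # hypothetical sizes: 4x more features than dimensions

# One random unit-length direction per feature (an assumption, not learned weights).
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Off-diagonal cosine similarities measure interference between features.
cosines = directions @ directions.T
off_diagonal = cosines[~np.eye(n_features, dtype=bool)]
max_interference = np.abs(off_diagonal).max()
mean_interference = np.abs(off_diagonal).mean()

# Sparse input: only feature 3 is active, so the activation is just its direction.
activation = 1.0 * directions[3]
readout = directions @ activation      # project onto every feature direction
recovered = int(np.argmax(readout))    # the active feature has the largest readout

print(f"max |cos| = {max_interference:.2f}, mean |cos| = {mean_interference:.2f}")
print("recovered feature:", recovered)
```

The readout works here because only one feature is active. As more features fire at once, their interference terms add up, which is why sparsity is the condition that makes superposition viable.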

The practical consequence: the right object to study is a feature, a direction in activation space, not a neuron. And recovering those directions is a dictionary-learning problem.

Sparse autoencoders

The tool that broke this open is the sparse autoencoder (SAE). The idea is almost embarrassingly simple. Take the residual stream activations at some layer, train a wide autoencoder with an aggressive sparsity penalty so that only a handful of its many hidden units fire on any given input, and force it to reconstruct the original activation. The hidden units that emerge tend to be monosemantic, each corresponding to a single human-interpretable concept.
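
As a rough illustration of the recipe above, here is a minimal sketch of an SAE forward pass and loss in NumPy. The sizes, the weight initialization, the L1 coefficient, and the random stand-in activations are all assumptions for illustration. A real SAE is trained by gradient descent on activations captured from the model, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 32, 256  # hypothetical: dictionary is 8x wider than the activation

# Randomly initialized parameters; training would update these.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative feature activations, then decode."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    x_hat = features @ W_dec + b_dec                         # linear decoder
    return features, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    features, x_hat = sae_forward(x)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return reconstruction + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))  # stand-in for residual stream activations
features, x_hat = sae_forward(x)
loss = sae_loss(x)
```

The L1 term is one common way to impose sparsity. The coefficient controls the trade-off: raise it and fewer features fire per input, at the cost of worse reconstruction.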

Anthropic’s Towards Monosemanticity demonstrated this on a small model. Scaling Monosemanticity then pushed it onto a production-scale model, Claude 3 Sonnet, extracting on the order of tens of millions of features, ranging from concrete ones like “the Golden Gate Bridge” to abstract ones like “code with a security vulnerability” and “sycophantic praise.” OpenAI published parallel work scaling SAEs to their frontier models. A broader survey of sparse autoencoders now catalogs the variants (gated SAEs, top-k SAEs, transcoders) that trade off reconstruction fidelity against how clean the resulting features are.
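
To make one of those variants concrete: a top-k SAE replaces the sparsity penalty with a hard rule that keeps only the k largest encoder pre-activations per input. The sketch below shows just that activation function, with a hypothetical helper name and made-up input values. It is not any particular library’s implementation.

```python
import numpy as np

def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations in each row and zero the rest."""
    # argpartition places the indices of the k largest values in the last k slots.
    top_idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    sparse = np.zeros_like(pre_acts)
    top_vals = np.take_along_axis(pre_acts, top_idx, axis=-1)
    np.put_along_axis(sparse, top_idx, top_vals, axis=-1)
    return sparse

# Two made-up rows of encoder pre-activations over a 4-unit dictionary.
pre_acts = np.array([[0.1, 2.0, -1.0, 0.5],
                     [3.0, -2.0, 1.0, 0.2]])
codes = topk_activation(pre_acts, k=2)
```

With k fixed, the number of active features per input is set directly instead of being tuned indirectly through a penalty coefficient.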

Two properties of features matter for anyone deploying this. First, they are causal, not merely correlational. Clamp the “Golden Gate Bridge” feature high and the model starts insisting it is the bridge; this is feature steering, and it is the cleanest evidence that you have found a real computational handle rather than a post-hoc story. Second, some of the features are safety-relevant. The same technique surfaces directions corresponding to deception, to unsafe content, to the model agreeing with a user regardless of truth. If you can find the deception direction, you can monitor whether it activates during a given generation.
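
Mechanically, clamping a feature amounts to shifting the activation along that feature’s direction until it reads the value you want. The sketch below assumes a unit-norm direction and uses a plain projection as the feature readout. Both are simplifications: in a real SAE the readout comes from the encoder, and the function name, sizes, and target value here are hypothetical.

```python
import numpy as np

def clamp_feature(x, feature_act, direction, target):
    """Shift activation x so the feature reads `target` instead of `feature_act`."""
    return x + (target - feature_act) * direction

rng = np.random.default_rng(0)
d_model = 32  # hypothetical activation width

# Hypothetical feature direction, normalized to unit length.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

x = rng.normal(size=d_model)          # stand-in for a residual stream activation
feature_act = float(x @ direction)    # toy readout: projection onto the direction
steered = clamp_feature(x, feature_act, direction, target=10.0)

print("before:", round(feature_act, 3), "after:", round(float(steered @ direction), 3))
```

Only the component along the chosen direction changes. Everything orthogonal to it passes through untouched, which is what makes the intervention targeted.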

From features to circuits

Features are the nouns. Circuits are the verbs, the connected subgraphs of features and attention pathways that implement a specific behavior. In 2025 Anthropic released open-source circuit-tracing tools that build attribution graphs over a transformer, letting you trace how an answer is assembled feature by feature and then perturb individual nodes to confirm the causal path. Their accompanying work on tracing model internals showed the model doing genuinely surprising things: planning a rhyme several words ahead before writing the line, or computing arithmetic through a set of parallel approximation features rather than the algorithm it describes when you ask it to show its work.

That last point is the one to sit with. The model’s stated reasoning and its actual mechanism can diverge. A chain-of-thought trace is itself a generated artifact; it is not a log of the computation. Anyone building governance on top of “the model explained its reasoning” is building on sand. The July 2025 circuits update and the steady cadence of work on this site are worth tracking if you want to follow where the field actually is rather than where vendors say it is.

What a circuit buys you

When you have a circuit for a behavior, you can do three things you cannot do from the outside. You can localize it: identify which layers and which features implement it. You can ablate it and watch the behavior disappear, confirming you have the right mechanism. And you can monitor it at inference time, checking whether the circuit fires on traffic you never had in your eval set. That third capability is the one that turns interpretability from a research curiosity into an operational control.
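
The ablation step can be sketched in its simplest form as projecting a single direction out of the activations. This is a minimal illustration under stated assumptions: a real circuit usually spans several features and layers, and the direction, sizes, and random activations below are hypothetical.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation row that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))      # stand-in activations: 5 tokens, width 16
direction = rng.normal(size=16)      # hypothetical feature direction
unit = direction / np.linalg.norm(direction)

ablated = ablate_direction(acts, direction)

print("projection before:", np.round(acts @ unit, 3))
print("projection after: ", np.round(ablated @ unit, 3))
```

The confirmation comes from running the model forward with the ablated activations and checking that the behavior disappears. If it persists, the direction was not the whole mechanism.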

Why this matters for production trust

Here is the honest version of the value proposition, because the hype version is easy to sell and wrong.

Interpretability does not yet give you a complete account of any frontier model. SAEs reconstruct activations imperfectly; there is always a residual the dictionary does not explain. Feature splitting means the same concept can fragment across many features at higher dictionary sizes. Coverage is partial, and nobody credibly claims otherwise. If a vendor tells you their model is “fully interpretable,” walk.
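
One common way to quantify that residual is the fraction of variance unexplained: the squared reconstruction error divided by the variance of the original activations. The sketch below uses random stand-in data and a hypothetical noisy reconstruction in place of a real SAE’s output.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """Squared reconstruction error relative to the variance of x (0 is perfect)."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(residual / total)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 32))                     # stand-in activations
noisy_recon = x + 0.1 * rng.normal(size=x.shape)    # hypothetical imperfect reconstruction
mean_only = np.broadcast_to(x.mean(axis=0), x.shape)

fvu_perfect = fraction_variance_unexplained(x, x)
fvu_noisy = fraction_variance_unexplained(x, noisy_recon)
fvu_mean = fraction_variance_unexplained(x, mean_only)

print(fvu_perfect, round(fvu_noisy, 4), fvu_mean)
```

A reconstruction that only predicts the mean scores 1, and a perfect one scores 0. Whatever sits in between is computation the dictionary does not account for.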

What it does give you, today, is leverage on specific, high-stakes questions. Three are worth the investment for AI implementation in regulated settings.

Auditing for a known failure mode. If you are worried about a specific behavior, such as demographic bias in a credit decision or a particular category of unsafe medical advice in a hospital management system, you can often find the relevant features and check whether they activate on your traffic. This is narrow but real, and it is more rigorous than red-teaming alone.

Steering instead of retraining. Feature steering lets you nudge a model’s behavior by clamping directions, without a fine-tuning run. It is blunt and it has side effects (push a feature too hard and coherence degrades), but for some guardrails it is faster and more inspectable than collecting a preference dataset.

Monitoring as a runtime control. The most durable use is treating safety-relevant features as detectors. Wire the deception or unsafe-content feature into your inference path as a signal, alongside your output classifiers. It catches a different class of failure than a content filter on the output text, because it reads the model’s internal activations rather than pattern-matching the words it emits.
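
The auditing and monitoring uses share one mechanic: read a feature’s activation on each token and compare it to a threshold. The sketch below is a minimal version under stated assumptions. The encoder direction, the threshold of 2.0, and the toy activations are hypothetical, and a real threshold would need calibration against labeled traffic.

```python
import numpy as np

def feature_activation(acts, encoder_dir, bias=0.0):
    """ReLU readout of one feature per token; acts has shape (tokens, d_model)."""
    return np.maximum(0.0, acts @ encoder_dir + bias)

def flag_generation(acts, encoder_dir, threshold):
    """Return (flagged, peak), where peak is the max activation over all tokens."""
    peak = float(feature_activation(acts, encoder_dir).max())
    return peak > threshold, peak

d_model = 8
encoder_dir = np.zeros(d_model)
encoder_dir[0] = 1.0                  # hypothetical "unsafe content" feature direction

benign = np.zeros((4, d_model))
benign[:, 1] = 1.0                    # activity only in an unrelated dimension
suspicious = benign.copy()
suspicious[2, 0] = 5.0                # one token drives the monitored feature

threshold = 2.0                       # hypothetical; calibrate on real traffic
flagged_benign, peak_benign = flag_generation(benign, encoder_dir, threshold)
flagged_suspicious, peak_suspicious = flag_generation(suspicious, encoder_dir, threshold)

# Audit view: share of a batch of generations that trips the detector.
traffic = [benign, suspicious, benign, benign]
flag_rate = float(np.mean([flag_generation(a, encoder_dir, threshold)[0] for a in traffic]))
```

At runtime the flag is one more signal beside the output classifier. In an audit, the same function run over logged traffic gives a firing rate you can track over time.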

The cost, stated honestly

None of this is cheap. Training SAEs at production scale is a serious compute line item: the activation tensors of a large model are enormous, and you train a separate dictionary per layer you care about. Then there is the labor. A raw SAE gives you millions of features with no names attached, and turning a feature index into “this is the deception direction” requires automated interpretability pipelines plus human verification, both imperfect. Budget for this as a research function, not a weekend integration. The teams that get value treat it the way they treat a security program (ongoing, staffed, and measured) rather than a one-time audit that produces a certificate and gets filed.
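
A back-of-the-envelope calculation shows where the size comes from. Every number below is a hypothetical assumption chosen for round arithmetic, not a figure from any real deployment: one billion training tokens, a 4096-wide residual stream, fp16 storage, three layers of interest, and a one-million-feature dictionary per layer.

```python
# All values are hypothetical, chosen only to illustrate the arithmetic.
tokens = 1_000_000_000     # activations collected for SAE training
d_model = 4096             # residual stream width
bytes_per_value = 2        # fp16
layers = 3                 # layers you train a dictionary for
d_dict = 1_000_000         # features per dictionary

# Storage if activations are cached to disk rather than streamed from the model.
activation_bytes_per_layer = tokens * d_model * bytes_per_value
total_activation_tb = activation_bytes_per_layer * layers / 1e12

# Encoder and decoder weight matrices plus their bias vectors.
sae_params_per_layer = 2 * d_model * d_dict + d_model + d_dict

print(f"activations: {total_activation_tb:.3f} TB across {layers} layers")
print(f"SAE parameters per layer: {sae_params_per_layer:,}")
```

Under these assumptions, the cached activations come to about 8 TB per layer and each dictionary has roughly 8 billion parameters, before counting optimizer state or the forward passes needed to generate the activations in the first place.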

The corollary is that you should be ruthless about scope. You do not interpret “the whole model.” You pick the two or three behaviors whose failure carries real liability, such as the bias direction in a credit model or the unsafe-advice direction in a clinical assistant inside a hospital management system, and you invest there. Everything else stays in the conventional eval-and-filter stack. Interpretability is a scalpel, and treating it as a floodlight is how budgets evaporate with nothing operational to show for it.

Where the field actually is

Be precise about maturity. Mechanistic interpretability has gone from toy models to real features on production-scale systems in roughly three years, which is fast. It has not gone to a full, faithful decompilation of any large model, and it may never get there for the largest ones. The realistic 2026 posture is: use it where the question is specific and the stakes justify the cost, and do not market it as a blanket guarantee.

For an enterprise building governance, the takeaway is structural. Treat interpretability as one layer in a defense-in-depth stack, sitting underneath your evals, your output filters, and your human review, reading signals the other layers cannot see. It is the only layer that looks at the mechanism rather than the behavior, and for the decisions that carry real liability, that difference is the whole point. The teams who will trust models with consequential decisions are the ones who stop treating the weights as unreadable and start, carefully and partially, reading them.

