AI Connectomics: Mapping the Brain's Wiring

How automated EM segmentation produced the FlyWire fly connectome and the H01 human-cortex fragment — and why connectomics is a petabyte data problem.

Connectomics is the project of tracing every neuron and every synapse in a chunk of brain — the wiring diagram, not the activity. For decades it was hand work, a graduate student tracing neurons across electron-microscope images for years to map a few hundred cells. That era ended. The reason it ended is not a microscope; it is machine learning, specifically automated segmentation of enormous image volumes, backed by data infrastructure that would be at home at a hyperscaler. Connectomics in 2026 is, structurally, a Data Platforms problem with a neuroscience payload. This post explains the pipeline, the two landmark datasets that prove it works, and why the petabytes are the actual hard part.

The pipeline is an image-segmentation factory

The raw input is unforgiving. You take a block of brain tissue, slice it into sections tens of nanometers thick, image each one in an electron microscope at nanometer-scale resolution, and stack the images back into a 3D volume. A single cubic millimeter of cortex produces on the order of a petabyte of image data. Inside that volume, every neuron is a thin, branching wire that weaves for long distances and has to be followed across thousands of consecutive slices without ever being confused with its neighbors. Lose the thread on one slice and you merge two neurons or sever one — both fatal to the wiring diagram.
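The petabyte figure can be sanity-checked with a back-of-envelope voxel count. The sketch below assumes a 4 nm in-plane pixel size and 30 nm sections (the H01 figures quoted later in this post) and one uncompressed byte per voxel; the constant and function names are illustrative, and real acquisitions differ in overlap, compression, and sample shape.

```python
# Back-of-envelope raw data size for a cube of tissue.
# All three parameters below are assumptions for illustration.
NM_PER_MM = 1_000_000
LATERAL_NM = 4        # assumed in-plane pixel size, in nanometers
SECTION_NM = 30       # assumed section thickness, in nanometers
BYTES_PER_VOXEL = 1   # assumed 8-bit grayscale, uncompressed


def raw_volume_bytes(side_mm: float = 1.0) -> float:
    """Uncompressed size in bytes of a cube with the given side length."""
    side_nm = side_mm * NM_PER_MM
    voxels_per_section = (side_nm / LATERAL_NM) ** 2
    sections = side_nm / SECTION_NM
    return voxels_per_section * sections * BYTES_PER_VOXEL


petabytes = raw_volume_bytes() / 1e15
print(f"{petabytes:.2f} PB")  # 2.08 PB
```

Under these assumptions a cubic millimeter comes out at about two petabytes, the same order of magnitude as the published datasets.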

This is a 3D instance-segmentation problem at a scale that breaks naive approaches. One method that cracked it, the flood-filling network, treats segmentation as iterative region growing: a network starts from a seed inside one neuron and repeatedly predicts which nearby voxels belong to the same object, growing the segment outward until it has the whole cell. Run that across the volume and you get millions of candidate neuron fragments. The model is good but not perfect, and at this scale even a tiny per-voxel error rate produces a large absolute number of mistakes. So the architecture that actually ships pairs automated segmentation with human proofreading, and the proofreading tooling is as much a part of the system as the network.
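The control flow of seed-and-grow segmentation can be sketched in a few lines. This is a toy, not the published flood-filling network: the learned model is replaced by a hypothetical `toy_predictor` that thresholds brightness, and growth proceeds one face-adjacent voxel at a time, whereas a real network predicts a mask over a whole field of view.

```python
from collections import deque

# The six face-adjacent neighbors of a voxel in a 3D grid.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def flood_fill_segment(volume, seed, same_object):
    """Grow one segment outward from `seed`, asking `same_object`
    whether each unvisited neighbor belongs to the current object."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    segment = {seed}
    frontier = deque([seed])
    while frontier:
        z, y, x = frontier.popleft()
        for dz, dy, dx in NEIGHBORS:
            nz, ny, nx = z + dz, y + dy, x + dx
            if (nz, ny, nx) in segment:
                continue
            if not (0 <= nz < depth and 0 <= ny < height and 0 <= nx < width):
                continue
            if same_object(volume, (z, y, x), (nz, ny, nx)):
                segment.add((nz, ny, nx))
                frontier.append((nz, ny, nx))
    return segment


def toy_predictor(volume, current, neighbor):
    """Hypothetical stand-in for the trained network:
    bright voxels are cell interior, dark voxels are membrane."""
    z, y, x = neighbor
    return volume[z][y][x] >= 0.5


# A 1 x 3 x 5 volume: two "neurons" separated by a dark membrane at x == 2.
volume = [[
    [0.9, 0.8, 0.1, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.8, 0.9],
    [0.8, 0.9, 0.1, 0.9, 0.7],
]]
left = flood_fill_segment(volume, (0, 0, 0), toy_predictor)
right = flood_fill_segment(volume, (0, 0, 4), toy_predictor)
print(len(left), len(right), left.isdisjoint(right))  # 6 6 True
```

If a single membrane voxel were misread as bright, the two segments would fuse, which is how a small per-voxel error becomes a wrong wiring diagram.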

The two error modes are not symmetric, and that asymmetry shapes the whole workflow. A merge fuses two neurons into one object and silently corrupts the wiring diagram, because now a connection appears between cells that never touched. A split breaks one neuron into pieces, which is annoying but recoverable — a human can stitch the fragments back together. Good pipelines deliberately tune the model to prefer splits over merges, accepting more cleanup work in exchange for fewer poisonous false connections. That single design choice tells you the field has internalized something every data engineer learns the hard way: the cost of an error depends entirely on which direction it points, and you tune for the failure you can afford.
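One way to picture "tune for the failure you can afford" is to pick the merge threshold by weighted cost rather than raw error count. Everything in this sketch is hypothetical: the cost weights, the validation numbers, and the convention that a higher threshold means more conservative merging.

```python
# Assumed relative costs: a false merge corrupts the graph, a false split
# only costs proofreader time. The 50:1 ratio is made up for illustration.
MERGE_COST = 50.0
SPLIT_COST = 1.0

# Hypothetical validation results: (threshold, false merges, false splits).
candidates = [
    (0.3, 40, 100),
    (0.5, 12, 300),
    (0.7, 3, 700),
    (0.9, 1, 2000),
]


def weighted_cost(merges: int, splits: int) -> float:
    """Total cost when the two error types are priced differently."""
    return MERGE_COST * merges + SPLIT_COST * splits


def pick_threshold(rows):
    """Return the row with the lowest asymmetric cost."""
    return min(rows, key=lambda row: weighted_cost(row[1], row[2]))


best = pick_threshold(candidates)
symmetric_best = min(candidates, key=lambda row: row[1] + row[2])
print(best[0], symmetric_best[0])  # 0.7 0.3
```

Counting every error equally favors the permissive threshold; pricing merges higher moves the choice to a setting with seven times the splits and a fraction of the merges.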

FlyWire: a complete brain, proofread by a crowd

The proof that the full pipeline works end to end is the fly. In October 2024 a nine-paper package in Nature delivered the FlyWire connectome, the first complete wiring map of an adult fruit fly brain. The numbers set the scale: roughly 140,000 neurons and more than 50 million synaptic connections, annotated into over 8,000 distinct cell types, close to 50% more cell types than had been proposed for the far larger mouse brain.

The methodology is the lesson. Electron-microscope images of the whole brain were segmented automatically to identify the neurons, and then — because automated segmentation is not foolproof — a global consortium of fly labs proofread the segments, corrected the merge-and-split errors, and annotated cell types in a community-driven effort. This is the template: AI does the brutal first pass across a volume no human could trace by hand, and distributed human review converts a good-enough segmentation into a trustworthy graph. The output is not a picture; it is a directed graph with typed nodes and weighted edges, which is to say a queryable dataset that any neuroscientist can now interrogate without touching the raw images.
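To make "directed graph with typed nodes and weighted edges" concrete, here is a minimal in-memory sketch. The `Connectome` class, its method names, and the neuron IDs and type labels are all invented for illustration; this is not the FlyWire schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Connectome:
    """Toy connectome: typed nodes, directed edges weighted by synapse count."""
    cell_type: dict = field(default_factory=dict)                      # neuron id -> type label
    synapses: dict = field(default_factory=lambda: defaultdict(int))   # (pre, post) -> count

    def add_neuron(self, neuron_id, type_label):
        self.cell_type[neuron_id] = type_label

    def add_synapses(self, pre, post, count=1):
        self.synapses[(pre, post)] += count

    def downstream(self, pre, min_synapses=1):
        """Partners that `pre` projects to, strongest connection first."""
        hits = [(post, n) for (p, post), n in self.synapses.items()
                if p == pre and n >= min_synapses]
        return sorted(hits, key=lambda item: -item[1])

    def type_to_type(self):
        """Collapse neuron-level edges into type-level synapse totals."""
        totals = defaultdict(int)
        for (pre, post), n in self.synapses.items():
            totals[(self.cell_type[pre], self.cell_type[post])] += n
        return dict(totals)


graph = Connectome()
graph.add_neuron(1, "typeA")
graph.add_neuron(2, "typeB")
graph.add_neuron(3, "typeB")
graph.add_synapses(1, 2, 5)
graph.add_synapses(1, 3, 2)
graph.add_synapses(2, 1, 1)

print(graph.downstream(1, min_synapses=3))  # [(2, 5)]
print(graph.type_to_type())
```

Both queries run against the graph alone, which is the sense in which the raw images never need to be touched.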

[Image: Ultramicrotome cutting an ultrathin tissue section onto a water boat]

H01: a human fragment, and a petabyte wake-up call

The fly is a complete brain because a fly brain is small. The human cortex is a different universe of scale, and the landmark dataset there makes no claim to completeness. H01, from the Lichtman lab at Harvard and the Connectomics team at Google, is a 1.4-petabyte reconstruction of roughly one cubic millimeter of human temporal-lobe cortex. The sample was cut into about 5,300 sections around 30 nanometers thick and imaged down to roughly 4-nanometer resolution, then reconstructed and annotated by automated segmentation. The result includes tens of thousands of reconstructed cells and on the order of 150 million annotated synapses within that single cubic millimeter.

Sit with the ratio. One cubic millimeter, a speck you would lose on a fingertip, is 1.4 petabytes and 150 million synapses. A whole human brain is on the order of a million cubic millimeters. The arithmetic is not subtle: a million times 1.4 petabytes is about 1.4 zettabytes, so a complete human connectome at this resolution is a zettabyte-class undertaking, and that is before you account for the proofreading, which for the fly already consumed a global community. H01 is rightly celebrated, but the most useful thing it did for engineers was make the scaling wall impossible to ignore.
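The extrapolation is simple enough to run directly. The sketch below uses only the two figures quoted above (1.4 petabytes per cubic millimeter, about a million cubic millimeters per brain) and decimal unit prefixes; the variable names are illustrative, and the result is an order-of-magnitude estimate.

```python
PB_PER_MM3 = 1.4               # H01: petabytes per cubic millimeter
HUMAN_BRAIN_MM3 = 1_000_000    # order-of-magnitude brain volume in mm^3

total_pb = PB_PER_MM3 * HUMAN_BRAIN_MM3
total_eb = total_pb / 1_000    # 1 exabyte = 1,000 petabytes
total_zb = total_eb / 1_000    # 1 zettabyte = 1,000 exabytes

print(f"{total_pb:,.0f} PB = {total_eb:,.0f} EB = {total_zb:.1f} ZB")
# 1,400,000 PB = 1,400 EB = 1.4 ZB
```

That is roughly a thousand exabytes of raw imagery, before any segmentation output or proofreading history is stored alongside it.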

Why this is really a data-platform problem

Here is the opinionated core. The segmentation models are largely solved well enough to be useful; the part that determines whether a connectomics project succeeds is the infrastructure around them. You cannot load a petabyte volume into memory. You cannot ask a human to scroll through it. Everything has to be chunked, served, and reasoned about lazily, and that is straight-up Data Platforms engineering.

Concretely, the volumes are stored as multi-resolution chunked arrays so a client can stream just the region and zoom level it needs, the way a maps application serves tiles instead of the whole planet. The segmentation is versioned, because proofreading edits the graph continuously and you need to know which version of the connectome a given analysis ran against — the same reproducibility discipline you would demand of any production dataset. Browsable, on-demand access to H01 and FlyWire exists precisely because the teams treated serving the data as a first-class engineering problem, not an afterthought to the science.
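The tile-serving analogy comes down to a small piece of index arithmetic: given a bounding box and a zoom level, which chunks does the client need to fetch? The sketch below assumes 64-voxel cubic chunks and a pyramid that halves every axis per level. Both are illustrative choices (real datasets often downsample anisotropically), and the function does not reproduce any particular storage format.

```python
from itertools import product

CHUNK = (64, 64, 64)  # assumed chunk shape in voxels, identical at every level


def chunks_for_region(lo, hi, level):
    """Chunk grid indices covering the voxel box [lo, hi), given in
    full-resolution coordinates, when read at pyramid level `level`.
    Assumes each level halves the resolution along every axis."""
    scale = 2 ** level
    axis_ranges = []
    for start, stop, chunk in zip(lo, hi, CHUNK):
        start_l = start // scale          # floor to the coarser grid
        stop_l = -(-stop // scale)        # ceiling division
        first = start_l // chunk
        last = (stop_l - 1) // chunk
        axis_ranges.append(range(first, last + 1))
    return list(product(*axis_ranges))


region_lo, region_hi = (0, 0, 0), (1024, 1024, 128)
full = chunks_for_region(region_lo, region_hi, level=0)
overview = chunks_for_region(region_lo, region_hi, level=3)
print(len(full), len(overview))  # 512 4
```

The same region costs 512 chunk reads at full resolution and 4 at level 3, which is why a viewer can pan across a petabyte volume without ever downloading it.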

This is the part of connectomics that looks most like the work we do every day. Petabyte-scale chunked storage, versioned mutable datasets, lazy tile serving, distributed compute that runs the model over billions of voxels without melting the cluster, and a proofreading layer that is, underneath, a collaborative annotation tool with conflict resolution. Swap the neurons for sensor traces or financial events and it is the same Operational Automation pattern: a model proposes at scale, humans correct at the margins, and the platform makes both tractable. The neuroscience is the easy part to get excited about; the data platform is the part that decides whether you ship.

There is a subtler infrastructure problem hiding inside the proofreading layer that anyone who has built a multi-user system will recognize. The segmentation graph is mutable and many people edit it at once, so two proofreaders can touch overlapping regions and produce conflicting corrections — a classic concurrent-write problem dressed in neuroscience clothing. You need an edit log, a way to attribute and review changes, and a mechanism to roll back a bad merge without unwinding everything downstream of it. The teams behind these datasets effectively built a version-control system for a graph with hundreds of millions of edges. That is not incidental tooling; it is the difference between a dataset the community trusts and a heap of contested annotations. The same logic governs a production School ERP or any system where many hands edit shared state: the audit trail and the conflict-resolution policy are the product, not the paperwork.
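The edit-log idea can be sketched as an append-only list of operations that is replayed to materialize any version. This is a deliberately small model under stated assumptions: the `Edit` and `EditLog` names are invented, segments are connected components of fragment links, a rollback is itself a logged `revert` entry, and conflict detection and reverting a revert are out of scope. It is not how any of the real proofreading systems are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    edit_id: int
    author: str
    op: str            # "merge", "split", or "revert"
    a: int = 0         # first fragment id (merge/split)
    b: int = 0         # second fragment id (merge/split)
    target: int = -1   # edit_id being undone (revert only)


class EditLog:
    """Append-only history; nothing is ever rewritten in place."""

    def __init__(self):
        self.entries = []

    def append(self, author, op, a=0, b=0, target=-1):
        edit = Edit(len(self.entries), author, op, a, b, target)
        self.entries.append(edit)
        return edit.edit_id

    def edges_at(self, version=None):
        """Replay the log up to `version` (inclusive), skipping reverted edits."""
        history = self.entries if version is None else self.entries[: version + 1]
        reverted = {e.target for e in history if e.op == "revert"}
        edges = set()
        for e in history:
            if e.op == "revert" or e.edit_id in reverted:
                continue
            link = frozenset((e.a, e.b))
            if e.op == "merge":
                edges.add(link)
            elif e.op == "split":
                edges.discard(link)
        return edges


def segments(fragments, edges):
    """Connected components of fragments under the given links (union-find)."""
    parent = {f: f for f in fragments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for link in edges:
        parent[find(min(link))] = find(max(link))
    groups = {}
    for f in fragments:
        groups.setdefault(find(f), set()).add(f)
    return sorted(sorted(group) for group in groups.values())


log = EditLog()
log.append("alice", "merge", 1, 2)
bad = log.append("bob", "merge", 2, 3)      # later judged to be a false merge
log.append("alice", "merge", 4, 5)
log.append("carol", "revert", target=bad)   # undo only bob's edit

fragments = [1, 2, 3, 4, 5]
print(segments(fragments, log.edges_at()))           # [[1, 2], [3], [4, 5]]
print(segments(fragments, log.edges_at(version=2)))  # [[1, 2, 3], [4, 5]]
```

Because the revert is an entry rather than a deletion, the bad merge is undone without disturbing the later edit, every change stays attributed to its author, and an analysis can pin itself to a version number and remain reproducible.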

[Image: Silicon wafers holding rows of ultrathin brain tissue sections]

What scales next, and what doesn’t

The direction from here is not in doubt, only the cost. A whole mouse brain, a long-stated goal for the field, is hundreds of times the volume of these cubic-millimeter samples, which puts it close to exabyte scale with proofreading effort to match. Getting there will come from two places, and neither is a fundamentally new kind of microscope. Faster imaging shortens the years it takes to acquire a volume, and better segmentation models reduce the proofreading burden, because the binding constraint is increasingly human review time, not compute. Every percentage point of merge-and-split error you remove automatically is human-years you do not have to spend.

What does not scale is treating each project as a bespoke artifact. The reusable asset coming out of FlyWire and H01 is not the wiring diagram of a fly or a speck of human cortex; it is the pipeline — segmentation, versioned chunked storage, browsable serving, crowd proofreading — that turns raw electron-microscope volumes into queryable graphs. That pipeline is general. The same shape handles any domain where you have to extract structured relationships from an image volume too large to fit anywhere, and that is a much larger set of problems than connectomics. The brain maps are the proof of concept. The platform is the product.

