Virtual Cell Models: Simulating Biology with AI
Arc Institute and CZI build AI virtual cells to predict perturbation responses in silico. The promise is real; data and benchmark gaps are the hard part.
The pitch for a virtual cell is simple enough that it sells itself: build a model that, given a cell’s current state and a perturbation — a drug, a cytokine, a gene knockout — predicts the cell’s new transcriptional state, accurately enough that you can screen millions of hypotheses on a GPU before you touch a pipette. If that worked, it would change the economics of biology. Most of what a discovery lab spends money and months on is finding out, the hard way, which of its thousands of guesses are wrong.
In 2026 this is no longer a thought experiment. There are real models, real training corpora, and real benchmarks. There are also real gaps, and the gaps are more interesting than the press releases. This post is about both.
What “virtual cell” actually means right now
The phrase gets used loosely, so pin it down. The concrete, fundable version of the virtual cell is in-silico perturbation prediction: a function that takes a baseline single-cell expression profile plus a specified intervention and returns the predicted post-perturbation profile. Not a full mechanistic simulation of a cell — nobody has that — but a learned map from state-plus-action to next-state, trained on enough measured perturbations to generalize.
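The "state-plus-action to next-state" shape can be sketched as a small interface. This is a minimal illustration, not any published model's API: `PerturbationPredictor` and `MeanShiftBaseline` are hypothetical names, and the toy baseline simply adds the average shift it observed for each perturbation.

```python
from collections import defaultdict
from typing import Protocol, Sequence

Profile = Sequence[float]  # one expression value per gene, in a fixed gene order


class PerturbationPredictor(Protocol):
    """Hypothetical interface: (baseline state, perturbation) -> predicted state."""

    def predict(self, baseline: Profile, perturbation: str) -> list[float]: ...


class MeanShiftBaseline:
    """Toy predictor: adds the average observed shift for each perturbation.

    It can only interpolate among perturbations it has seen; for an unseen
    perturbation it falls back to predicting no change.
    """

    def __init__(self) -> None:
        self._sum: dict[str, list[float]] = {}
        self._count: dict[str, int] = defaultdict(int)

    def fit(self, baselines, perturbations, outcomes) -> "MeanShiftBaseline":
        for base, pert, out in zip(baselines, perturbations, outcomes):
            delta = [o - b for o, b in zip(out, base)]
            if pert not in self._sum:
                self._sum[pert] = [0.0] * len(delta)
            # Accumulate per-gene shifts so predict() can average them.
            self._sum[pert] = [s + d for s, d in zip(self._sum[pert], delta)]
            self._count[pert] += 1
        return self

    def predict(self, baseline, perturbation):
        if perturbation not in self._sum:
            return list(baseline)  # unseen perturbation: predict no change
        n = self._count[perturbation]
        return [b + s / n for b, s in zip(baseline, self._sum[perturbation])]


# Made-up three-gene profiles, purely for illustration.
model = MeanShiftBaseline().fit(
    baselines=[[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]],
    perturbations=["drug_A", "drug_A"],
    outcomes=[[1.0, 4.0, 3.0], [2.0, 3.0, 2.0]],
)
seen = model.predict([0.0, 0.0, 0.0], "drug_A")
unseen = model.predict([0.0, 0.0, 0.0], "drug_B")
```

A real model replaces the lookup with a learned function, but the contract is the same. The fallback for `drug_B` also shows the limit of this kind of baseline: it has nothing to say about a perturbation it never measured.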
Arc Institute’s first virtual cell model, State, is the cleanest example of the shape. State is two interlocking pieces: a State Embedding model that learns a representation of a cell’s transcriptome, and a State Transition model that, given that embedding and a perturbation, predicts the resulting shift in expression. It was trained on observational data from roughly 170 million cells and perturbational data from over 100 million cells spanning 70 cell lines. The whole point is to let a lab run large numbers of in-silico perturbations to narrow a hypothesis space, then spend its limited wet-lab budget validating the survivors.
CZI took a different architectural bet with rBio, a reasoning model trained to answer cellular-biology questions using virtual simulations as a training signal rather than only experimental labels. rBio uses what CZI calls “soft verification” — it distills predictions from a separate virtual-cell model (TranscriptFormer, itself trained on 112 million cells across 12 species) into a model you can query conversationally. Two very different designs, one shared goal: substitute computed predictions for some fraction of bench work.
Data is the moat, and the moat is being filled
Foundation models are downstream of data, and for years the single-cell perturbation data simply did not exist at the scale these models need. That changed fast. Vevo Therapeutics released Tahoe-100M, a single-cell dataset of 100 million cells mapping roughly 60,000 drug-cell interactions across 50 cancer cell lines and 1,200 drug perturbations, as the inaugural contribution to Arc’s Virtual Cell Atlas. The framing they used is the one that matters: this single release is on the order of 50x larger than all previously public drug-perturbed single-cell data combined. The Atlas it seeds spans over 300 million cells.
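A quick back-of-the-envelope check on those figures, assuming every drug was profiled in every cell line and cells are spread evenly across pairs. Both are simplifying assumptions made here for illustration; the real dataset may include controls, multiple doses, and uneven sampling.

```python
cell_lines = 50
drugs = 1_200
total_cells = 100_000_000

# Full cross of drugs and cell lines (an assumption, not a dataset spec).
pairs = cell_lines * drugs
# Average depth per drug-cell-line pair if cells were spread evenly.
cells_per_pair = total_cells / pairs

print(pairs, round(cells_per_pair))
```

On those assumptions the average pair gets on the order of a couple of thousand cells: plenty to estimate a mean shift, less generous once you start slicing for rare responding subpopulations.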
That is the real story of the last eighteen months. The model architectures get the attention, but the rate-limiting input was perturbation-labeled data, and the field has gone from data-starved to data-rich in a way that genuinely shifts what is trainable. CZI’s expanded collaboration with NVIDIA, which aims to scale biological data processing to petabytes of data spanning billions of cellular observations, is the infrastructure bet that follows from the same realization.
But more cells is not the same as more coverage, and this is where honesty about what the data actually contains has to come in. Tahoe-100M is deep on cancer cell lines and drug perturbations. It is not a representative sample of human biology. Cancer lines are immortalized, karyotypically abnormal, and often respond very differently from primary cells in tissue. A model trained mostly on cell-line responses learns cell-line pharmacology, and the moment you ask it about a primary T cell in an inflamed gut you are extrapolating off the edge of the training distribution. The data explosion is real. The data is also lopsided, and the undersampled side, primary cells in their tissue context, is exactly where most of the clinically interesting questions live.
The benchmark problem nobody can wave away
Here is the uncomfortable part. To trust an in-silico perturbation, you need a benchmark that tells you when the model is right and, more importantly, when it is confidently wrong. The field knows this, which is why Arc launched the Virtual Cell Challenge, framed explicitly as working toward a Turing test for the virtual cell. The inaugural challenge drew over 5,000 registrants across 114 countries and more than 1,200 submitting teams. That level of participation tells you the problem is taken seriously. It also tells you the problem is unsolved, because you do not run a global competition to evaluate something you already know how to evaluate.
The technical difficulty is that the obvious metrics are easy to game. Predicting a perturbed expression profile and scoring it by correlation against the truth sounds rigorous until you notice that most genes barely move under most perturbations. A model that predicts “almost no change” scores deceptively well on aggregate correlation while being useless for the thing you actually care about — the specific genes that did move. Naive baselines that predict the dataset mean are stubbornly hard to beat on the wrong metric, and a model that cannot beat the mean on the right metric has learned nothing worth deploying. Any team evaluating a virtual-cell model has to fight this: separate differentially expressed genes from the inert majority, hold out genuinely unseen perturbations and unseen cell types, and report performance on the hard split, not the flattering one.
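The failure mode is easy to reproduce on synthetic data. The sketch below uses made-up numbers (2,000 genes, 40 of which shift by 2 units, the rest jittering slightly) and scores a "no change" prediction two ways. The gene counts, shift sizes, and noise level are all assumptions chosen for illustration, not measurements.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_genes, n_de = 2000, 40

baseline = [rng.gauss(5.0, 2.0) for _ in range(n_genes)]
# The first n_de genes respond strongly (shift of +/-2); the rest only jitter.
true_delta = [
    rng.choice([-2.0, 2.0]) if g < n_de else rng.gauss(0.0, 0.05)
    for g in range(n_genes)
]
truth = [b + d for b, d in zip(baseline, true_delta)]

# The "model": predict that the perturbation changes nothing.
no_change = list(baseline)

# Flattering metric: correlation of full profiles across all genes.
corr_all = pearson(no_change, truth)
# Honest metric: error restricted to the genes that actually moved.
mae_de = fmean(abs(truth[g] - no_change[g]) for g in range(n_de))
mae_all = fmean(abs(t - p) for t, p in zip(truth, no_change))

print(f"corr(all genes)={corr_all:.3f}  MAE(all)={mae_all:.3f}  MAE(DE genes)={mae_de:.3f}")
```

The whole-profile correlation comes out very high because baseline expression dominates the variance, while the error on the differentially expressed genes equals the full size of the effect the model missed. Scoring on the predicted change, restricted to responding genes, is one way to keep a do-nothing model from looking competent.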
Why this is hard in ways the hype skips
Three structural problems sit under the whole enterprise, and none of them is solved by a bigger model.
Generalization across perturbations is the actual test, and it is brutal. Interpolating among perturbations you have measured is tractable. The value is in predicting the response to a perturbation you have never seen — a new compound, a combination, a gene you never knocked out. That is a far stronger claim, and the evidence that current models do it reliably, across cell contexts, is thin. Be specific when a vendor says “perturbation prediction”: prediction of seen perturbations in unseen cells, or unseen perturbations entirely? The gap between those two is most of the difficulty.
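The two claims correspond to two different held-out splits, and it helps to make them explicit in code. A minimal sketch, assuming each record is a `(cell_type, perturbation)` pair; `split_records` and the split names are hypothetical, and the cell-line and drug labels are placeholders.

```python
def split_records(records, held_out_perts, held_out_cells):
    """Partition (cell_type, perturbation) records into train and two test splits."""
    splits = {"train": [], "seen_pert_unseen_cell": [], "unseen_pert": []}
    for cell_type, pert in records:
        if pert in held_out_perts:
            # The perturbation never appears in training, in any cell type.
            splits["unseen_pert"].append((cell_type, pert))
        elif cell_type in held_out_cells:
            # The perturbation is trainable elsewhere, but this cell type is new.
            splits["seen_pert_unseen_cell"].append((cell_type, pert))
        else:
            splits["train"].append((cell_type, pert))
    return splits


# Placeholder design: every drug measured in every cell line.
records = [
    (cell, drug)
    for cell in ["line_A", "line_B", "line_C"]
    for drug in ["drug_1", "drug_2", "drug_3"]
]
splits = split_records(records, held_out_perts={"drug_3"}, held_out_cells={"line_C"})
train_perts = {pert for _, pert in splits["train"]}
```

Reporting a separate score for each test split answers the vendor question directly. A single pooled number hides which kind of generalization the model actually achieved.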
Context dependence is not noise; it is the biology. The same drug does different things to different cell types, in different tissue microenvironments, at different baseline states. A model that ignores context will be right on average and wrong exactly where the decision matters. The whole reason single-cell resolution is worth the cost is that population averages hide the responding subset, and a virtual cell that collapses back to a population-average prediction has thrown away the signal it was built to capture.
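A toy calculation shows how much an average can hide. The numbers here are invented for illustration: 10% of cells respond with a shift of 3 units for some gene, and the rest do not respond at all.

```python
from statistics import fmean

n_cells = 1000
responder_fraction = 0.10
responder_shift = 3.0  # illustrative log-scale shift in responding cells

n_resp = round(n_cells * responder_fraction)
shifts = [responder_shift] * n_resp + [0.0] * (n_cells - n_resp)

population_mean = fmean(shifts)          # what a population-average view reports
responder_mean = fmean(shifts[:n_resp])  # what the responding subset actually does

print(population_mean, responder_mean)
```

The population average reports a small shift that describes no cell in the dish: nine in ten cells did nothing and one in ten moved tenfold more than the average suggests.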
Batch effects and technical confounds leak into everything. Single-cell data carries the fingerprint of the lab, the kit, the day, the sequencer. Models can and do learn these artifacts and present them as biology. A perturbation “signal” that is really a batch signal is worse than no model, because it is confident and plausible. Disentangling technical from biological variation is a standing problem in the single-cell field, and virtual-cell models inherit all of it.
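One cheap guard is a design check before any modeling: flag perturbations that never share a batch with control cells, since their apparent effect cannot be separated from the batch they ran in. This is a minimal sketch with a hypothetical `confounded_perturbations` helper and made-up batch labels. It is a sanity check on experimental design, not a batch-correction method.

```python
from collections import defaultdict


def confounded_perturbations(records, control="control"):
    """Return perturbations that never share a batch with control cells.

    For these, a perturbation effect and a batch effect are indistinguishable.
    """
    batches_by_pert = defaultdict(set)
    for batch, pert in records:
        batches_by_pert[pert].add(batch)
    control_batches = batches_by_pert.get(control, set())
    return sorted(
        pert
        for pert, batches in batches_by_pert.items()
        if pert != control and not (batches & control_batches)
    )


# Made-up (batch, perturbation) labels for illustration.
records = [
    ("b1", "control"), ("b1", "drug_1"),
    ("b2", "drug_2"), ("b2", "drug_1"),
    ("b3", "control"), ("b3", "drug_3"),
]
flagged = confounded_perturbations(records)
print(flagged)
```

Here `drug_2` only ever ran in a batch with no controls, so any "response" a model learns for it could be the fingerprint of that batch.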
The engineering view, and where we fit
Strip away the biology and a virtual-cell program is a recognizable AI implementation: a foundation model is one stage in an instrumented loop, the value is in the loop’s discipline, and the hardest part is the Data Platform underneath. The cell atlases that feed these models are not tidy tables. They are heterogeneous, multi-source, ontology-tagged, and riddled with batch structure. CZI’s CELLxGENE corpus alone holds on the order of 170 million cells across more than 1,500 datasets, each contributed with its own annotation conventions and data quality. Making that trainable is a data-engineering problem before it is a modeling problem: versioned datasets, reproducible splits, provenance from raw reads to embedding, and a feedback path that pushes confirmed wet-lab results back into the next training round.
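Two of those requirements, reproducible splits and dataset versioning, reduce to small, boring primitives. The sketch below is one possible approach rather than a description of any particular platform: `assign_split` and `manifest_hash` are hypothetical helpers, and the salt and manifest fields are placeholders.

```python
import hashlib
import json


def assign_split(key: str, salt: str = "split-v1", test_fraction: float = 0.2) -> str:
    """Deterministically map a key (e.g. a perturbation ID) to train or test.

    The assignment depends only on the key and the salt, never on row order
    or on which other keys exist, so it survives dataset growth and reshuffles.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def manifest_hash(manifest: dict) -> str:
    """Content hash of a dataset manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


# Placeholder manifest: the fields are illustrative, not a real schema.
manifest = {"datasets": ["atlas_part_1", "atlas_part_2"], "gene_set": "v3"}
version = manifest_hash(manifest)
split = assign_split("drug_1")
```

Hashing by perturbation ID rather than by row keeps every cell from a held-out perturbation on the same side of the split. Recording `version` alongside each trained model gives a cheap provenance link back to the exact inputs.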
This is the same backbone we build for any serious Data Platform, and the instincts transfer cleanly. The traceability that makes a Hospital Management System trustworthy — every record auditable from input to outcome — is exactly what makes a virtual-cell pipeline trustworthy. Operational Automation of the predict-validate-retrain cycle, with tracked plates and structured assay capture rather than spreadsheets, is what lets the loop turn fast enough to compound. The exotic part is the cells. The part that determines whether the program works is the boring, rigorous data engineering.
The honest 2026 read: virtual cell models have crossed from vaporware into real, useful tools for narrowing hypotheses, and the data scale-up is genuine. They are not yet trustworthy oracles for unseen perturbations in unseen contexts, and the benchmarks that would prove otherwise are still being built. Treat them as a way to spend wet-lab capacity wisely, keep the wet lab as the arbiter, and judge any model by its performance on the hard split — not the demo.
Building the data platform under a virtual-cell or perturbation-prediction program — versioned atlases, reproducible splits, an auditable predict-to-validate loop? Talk to our team. We engineer the infrastructure that makes computational biology reproducible.