AI and Single-Cell Genomics at Scale
Geneformer and scGPT promise atlas-scale biology. The reality: batch effects, weak zero-shot results, and a hard data-platform problem underneath.
Single-cell RNA sequencing turned biology into a data problem. A single experiment now profiles the transcriptomes of hundreds of thousands of individual cells, and public atlases aggregate hundreds of millions. The obvious move — the one the field has spent three years on — is to train foundation models on that corpus and hope they learn a general representation of what a cell is, transferable across tissues, diseases, and labs. Geneformer and scGPT are the two best-known attempts. They are genuinely useful, genuinely overhyped, and a good lens on what does and does not transfer from the language-model playbook to biology.
This is an engineer’s read on where the models help, where they quietly fail, and why the data platform underneath them is the part that actually decides whether a single-cell program works.
Two models, two bets on tokenizing a cell
The interesting design question in single-cell foundation models is how you turn a cell into a sequence a transformer can chew on. A cell is not a sentence. It is a vector of expression counts over roughly twenty thousand genes, most of them zero in any given cell. The two flagship models answer the tokenization question differently, and the difference shows up downstream.
Geneformer ranks the genes in each cell by expression, after scaling each gene against its typical level across the pretraining corpus, and feeds the ordered ranking as tokens. The representation is rank-based: what matters is which genes are most highly expressed relative to each other, not their absolute counts. scGPT takes a generative-pretrained-transformer approach over highly variable genes with value-aware, regression-style pretraining, and is explicitly designed to be fine-tuned for downstream tasks: batch correction, multi-omics integration, cell-type annotation, gene-network inference, and perturbation prediction.
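To make the rank-based idea concrete, here is a minimal sketch of turning one cell's counts into an ordered token list. It is an illustration, not Geneformer's actual tokenizer: the function name `rank_encode`, the `max_len` cutoff, and the toy counts are all assumptions. The optional `gene_medians` argument stands in for the corpus-level scaling the published model applies before ranking.

```python
from typing import Dict, List, Optional


def rank_encode(
    counts: Dict[str, float],
    gene_medians: Optional[Dict[str, float]] = None,
    max_len: int = 2048,  # hypothetical context-length cutoff
) -> List[str]:
    """Return gene names ordered from highest to lowest (scaled) expression."""
    scaled = {}
    for gene, value in counts.items():
        if value <= 0:
            continue  # unexpressed genes carry no rank and are dropped
        # Optional scaling by a per-gene reference level (hypothetical input).
        ref = gene_medians.get(gene, 1.0) if gene_medians else 1.0
        scaled[gene] = value / ref
    # Sort by descending scaled value; break ties by gene name for determinism.
    ordered = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ordered[:max_len]


# Toy cell with made-up counts.
cell = {"CD3E": 12, "ACTB": 300, "GZMB": 0, "CD8A": 6}
tokens = rank_encode(cell)

# Doubling every count (for example, deeper sequencing) leaves the ranking unchanged.
deeper = {gene: 2 * value for gene, value in cell.items()}
same_tokens = rank_encode(deeper)

# Scaling by a per-gene reference demotes a ubiquitous housekeeping gene.
scaled_tokens = rank_encode(cell, {"ACTB": 300, "CD3E": 2, "CD8A": 2})
```

The design consequence is visible in the toy data: absolute depth drops out of the representation entirely, which helps across protocols but also discards the magnitude information that a value-aware scheme keeps.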
Those choices are not cosmetic. A careful 2025 benchmark found that Geneformer reaches higher overall accuracy yet lower macro-F1 than scGPT on the Human Lung Cell Atlas and Tabula Sapiens, while scGPT does better when training data for a cell type is limited. Read that carefully: Geneformer’s rank encoding captures dominant co-expression patterns in abundant cell types, which inflates accuracy while it quietly underperforms on the rare cell types that drag down macro-F1. If your scientific question is about a rare population — a small tumor-infiltrating subset, a transitional cell state — the headline accuracy number is measuring the wrong thing.
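The accuracy versus macro-F1 gap is easy to reproduce on paper. The sketch below uses entirely synthetic labels (95 abundant cells, 5 rare ones, with illustrative cell-type names) and two hypothetical classifiers; it is not drawn from the benchmark itself, only meant to show how the two metrics can disagree.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of cells labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so every cell type counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic ground truth: one abundant type, one rare type.
truth = ["abundant"] * 95 + ["rare"] * 5

# Hypothetical model A: calls every cell the abundant type.
pred_a = ["abundant"] * 100

# Hypothetical model B: a few mistakes on abundant cells, but finds 4 of 5 rare cells.
pred_b = ["abundant"] * 90 + ["rare"] * 5 + ["rare"] * 4 + ["abundant"] * 1

acc_a, f1_a = accuracy(truth, pred_a), macro_f1(truth, pred_a)
acc_b, f1_b = accuracy(truth, pred_b), macro_f1(truth, pred_b)
print(f"A: accuracy={acc_a:.3f} macro-F1={f1_a:.3f}")
print(f"B: accuracy={acc_b:.3f} macro-F1={f1_b:.3f}")
```

Model A wins on accuracy (0.95 against 0.94) while never identifying a single rare cell, and its macro-F1 is roughly 0.49 against roughly 0.77 for model B.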

Cell-type annotation: the task everyone wants, and the trap in it
Cell-type annotation is the workhorse application. You have a fresh dataset of unlabeled cells and you want to know what each one is. Done by hand, it is slow, subjective, and inconsistent between labs. A foundation model that annotates reliably would save real time, and fine-tuned on a labeled reference, these models do a competent job.
The trap is the gap between fine-tuned and zero-shot. Fine-tuning requires labels for the cell types you care about, in data resembling yours — which means you have already solved most of the problem the model was supposed to solve. The genuinely valuable case is zero-shot: drop in an unlabeled dataset from a new tissue or disease and get useful structure with no task-specific training. That is the discovery setting, where the labels are unknown by definition.
And that is exactly where the models stumble. A pointed Genome Biology study evaluating Geneformer and scGPT in zero-shot settings found that neither consistently beats much simpler, older baselines such as scVI and Harmony, methods the field has used for years. Microsoft Research, summarizing the same line of work, reported that scGPT embeddings can underperform a naive baseline of just predicting the mean on some tasks. The proposed explanation is uncomfortable and worth sitting with: the models may not be learning their masked-gene-expression pretraining task well enough to build a deep representation of cellular state. They learn something. It is not obviously the thing the marketing implies.
The operational takeaway is blunt. If you are evaluating a single-cell foundation model, benchmark it zero-shot against scVI and Harmony on your data, on rare cell types, with macro-F1 and not just accuracy. If it cannot clear those baselines, you are paying transformer inference costs for a worse answer than a 2018 method gives you for free.
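One way to run that comparison is to hold the evaluation fixed and swap only the embedding. The sketch below scores any set of cell embeddings by leave-one-out k-nearest-neighbour label transfer and macro-F1. It is one plausible protocol, not the method used in the published benchmarks, and the vectors are tiny synthetic stand-ins: in practice "baseline" would be embeddings produced by a tool such as scVI or Harmony and "candidate" would come from the foundation model. All function names here are hypothetical.

```python
import math
from collections import Counter


def knn_predict(embeddings, labels, k=3):
    """Leave-one-out k-nearest-neighbour label vote for every cell."""
    preds = []
    for i, query in enumerate(embeddings):
        dists = sorted(
            (math.dist(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        preds.append(votes.most_common(1)[0][0])
    return preds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare(embedding_sets, labels, k=3):
    """Score each named embedding with the same protocol and the same labels."""
    return {
        name: macro_f1(labels, knn_predict(emb, labels, k))
        for name, emb in embedding_sets.items()
    }


# Synthetic labels: two abundant types and one rare type.
labels = ["T"] * 4 + ["B"] * 4 + ["rare"] * 3
t_cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
b_cells = [(10, 0), (10, 1), (11, 0), (11, 1)]

embedding_sets = {
    # Stand-in for a simple baseline: the rare cells form their own cluster.
    "baseline": t_cells + b_cells + [(0, 10), (0, 11), (1, 10)],
    # Stand-in for a candidate model: the rare cells are scattered among T cells.
    "candidate": t_cells + b_cells + [(0.5, 0.5), (0.5, 2), (2, 0.5)],
}

results = compare(embedding_sets, labels)
print(results)
```

In this toy case the candidate embedding gets every abundant cell right and still loses badly, because all three rare cells are absorbed into the T-cell neighbourhood. Real evaluations would add held-out donors and more than one value of `k`.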
There is a deeper structural reason to be careful here. Annotation is not one task; it is a family of them at different resolutions. Calling a cell a “T cell” is easy and most methods agree. Calling it a specific exhausted CD8 subset in a particular activation state is where the science lives, and it is also where label noise, reference disagreement, and class imbalance compound. A model that posts a strong coarse-grained accuracy can be quietly useless at the fine-grained calls that drive a biological conclusion. When you report annotation performance, report it at the resolution your decision actually depends on, and show the confusion matrix — the off-diagonal mistakes between adjacent cell states are the ones that will mislead a downstream analysis.
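A small sketch of what resolution-aware reporting can look like: score the same predictions at the fine and coarse level, then list the off-diagonal confusions. The two-level hierarchy, the label names, and the predictions are all made up for illustration.

```python
from collections import Counter

# Hypothetical two-level hierarchy: fine label -> coarse label.
PARENT = {
    "CD8 exhausted": "T cell",
    "CD8 effector": "T cell",
    "CD4 naive": "T cell",
    "B naive": "B cell",
}


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion(y_true, y_pred):
    """Counts of (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def off_diagonal(conf):
    """Mistakes only, largest first, as (count, true, predicted)."""
    return sorted(
        ((n, t, p) for (t, p), n in conf.items() if t != p),
        reverse=True,
    )


# Synthetic example: most exhausted CD8 cells are called effector cells.
true_fine = ["CD8 exhausted"] * 4 + ["CD8 effector"] * 4 + ["CD4 naive"] * 2 + ["B naive"] * 2
pred_fine = (
    ["CD8 effector"] * 3 + ["CD8 exhausted"] * 1
    + ["CD8 effector"] * 4 + ["CD4 naive"] * 2 + ["B naive"] * 2
)

fine_acc = accuracy(true_fine, pred_fine)
coarse_acc = accuracy(
    [PARENT[t] for t in true_fine],
    [PARENT[p] for p in pred_fine],
)
mistakes = off_diagonal(confusion(true_fine, pred_fine))

print(f"fine accuracy:   {fine_acc:.2f}")
print(f"coarse accuracy: {coarse_acc:.2f}")
print("largest confusions:", mistakes)
```

Coarse accuracy here is perfect while fine accuracy is 0.75, and every error is the same adjacent-state confusion, which is exactly the kind of mistake a headline number hides.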
Batch effects are the whole game
Every single-cell dataset carries a technical fingerprint: the kit, the sequencer, the operator, the day. Two biologically identical cells profiled in different runs look different, and the difference can swamp the biology. Correcting it — batch integration — is the unglamorous core of single-cell analysis, and it is where foundation models most need to prove themselves, because the entire premise of pretraining across many datasets is that the model learns biology robust to the lab it came from.
The evidence here is mixed at best. The same benchmarks that exposed the zero-shot weakness also showed these models struggling with batch effects relative to dedicated integration methods. This is not surprising once you think about it as a data-engineering problem rather than a modeling one: a model trained on data where batch and biology are confounded will happily learn the confound. If every Alzheimer’s sample in your corpus was run on one platform and every control on another, the model can “predict disease” by reading the platform. That is not a model failing; it is a model succeeding at the wrong objective because the data let it.
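A confound like that can be caught before any model is trained, simply by cross-tabulating condition against batch. The check below is a deliberately crude sketch: it reports the fraction of cells that sit in batches containing only one condition. The function names and the toy metadata are assumptions, and a real audit would look at partial confounding as well.

```python
from collections import defaultdict


def crosstab(conditions, batches):
    """Count cells per (batch, condition) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for cond, batch in zip(conditions, batches):
        table[batch][cond] += 1
    return {batch: dict(counts) for batch, counts in table.items()}


def confounded_fraction(conditions, batches):
    """Fraction of cells in batches where only a single condition appears."""
    table = crosstab(conditions, batches)
    total = sum(sum(counts.values()) for counts in table.values())
    stuck = sum(
        sum(counts.values()) for counts in table.values() if len(counts) == 1
    )
    return stuck / total


# Synthetic metadata for six cells.
conditions = ["AD"] * 3 + ["control"] * 3

# Design 1: every disease sample on one platform, every control on another.
confounded_batches = ["platform_A"] * 3 + ["platform_B"] * 3

# Design 2: both conditions appear on both platforms.
balanced_batches = ["platform_A", "platform_B"] * 3

print(crosstab(conditions, confounded_batches))
print(confounded_fraction(conditions, confounded_batches))
print(confounded_fraction(conditions, balanced_batches))
```

A value near 1.0 means batch and biology cannot be separated from this data alone, whatever model is used downstream.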
The defense is held-out evaluation on data the model has never touched, with batch structure orthogonal to the biology. The AIDA v2 dataset, released in April 2025 with roughly 201,000 immune cells from healthy donors across Singapore, Thailand, and India, is valuable precisely because it postdates the models’ pretraining, so none of them could have memorized it. International, multi-site reference data of that kind is the only honest way to know whether a model learned biology or learned the bench.

The real problem is atlas-scale data engineering
Step back and the modeling is the smaller half of the work. The corpus is the larger half. CZ CELLxGENE Discover, the platform most of this research draws on, hosts on the order of 170 million cells across more than 1,500 datasets, each contributed by a different lab with its own annotations and quality. Crucially, CELLxGENE does not impose a single alignment, clustering, or annotation pipeline — cell-type labels come from each contributor’s own analysis, standardized to a shared ontology on upload but not re-derived. That is a sane design choice for a public repository and a serious headache for anyone training on it, because “the same” cell type can mean subtly different things across two datasets.
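The granularity mismatch can be made concrete with a toy hierarchy. The sketch below finds the most specific label two annotations agree on by walking up a parent map. The hierarchy is invented for illustration and uses plain names rather than real ontology identifiers; actual cell ontologies are graphs in which a term can have several parents, which this single-parent version ignores.

```python
from typing import Dict, List, Optional

# Hypothetical single-parent hierarchy: child label -> parent label.
PARENT: Dict[str, str] = {
    "CD8 T cell": "T cell",
    "CD4 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "cell",
}


def ancestors(label: str) -> List[str]:
    """The label itself followed by each ancestor up to the root."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def common_label(a: str, b: str) -> Optional[str]:
    """Most specific label that both annotations fall under, if any."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None


# Two labs annotating the same population at different resolutions.
print(common_label("CD8 T cell", "T cell"))
# Two genuinely different populations only agree far up the tree.
print(common_label("CD8 T cell", "B cell"))
# A label outside the hierarchy cannot be reconciled at all.
print(common_label("CD8 T cell", "unmapped label"))
```

How far up the tree two labels have to travel before they agree is a usable signal in itself: a short hop suggests a resolution difference, a long one suggests real disagreement.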
So the atlas-scale data platform is the actual deliverable, and it looks like every other serious Data Platform we build, just with stranger payloads:
- Ontology reconciliation. Mapping heterogeneous, contributor-supplied labels onto a consistent hierarchy so a model is not penalized for two labs naming the same cell differently. Atlas-scale annotation work is now explicitly turning to hierarchy-aware training to handle this.
- Provenance and versioning. Every embedding traceable to raw reads, kit, and reference version, so a result is reproducible and a regression is debuggable.
- Honest splits. Holdouts defined by donor and by batch, never by random cell, because random splits leak batch structure and inflate every metric.
- Pipeline reproducibility. The same instinct that makes a School ERP trustworthy — auditable records, deterministic processing — applied to a corpus of hundreds of millions of cells.
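The honest-splits item is the easiest of these to show in code. The following sketch holds out whole donors and whole batches, and drops cells that straddle the boundary so that neither a donor nor a batch appears on both sides. Field names, the `split_cells` helper, and the six toy cells are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Cell = Dict[str, str]


def split_cells(
    cells: List[Cell],
    heldout_donors: Set[str],
    heldout_batches: Set[str],
) -> Tuple[List[Cell], List[Cell], List[Cell]]:
    """Return (train, test, dropped) with no donor or batch shared across splits."""
    train, test, dropped = [], [], []
    for cell in cells:
        donor_out = cell["donor"] in heldout_donors
        batch_out = cell["batch"] in heldout_batches
        if donor_out and batch_out:
            test.append(cell)
        elif not donor_out and not batch_out:
            train.append(cell)
        else:
            # Held-out donor in a training batch, or the reverse: would leak.
            dropped.append(cell)
    return train, test, dropped


def shared(train: List[Cell], test: List[Cell], key: str) -> Set[str]:
    """Values of `key` that appear on both sides of a split."""
    return {c[key] for c in train} & {c[key] for c in test}


# Synthetic metadata for six cells.
cells = [
    {"id": "c1", "donor": "D1", "batch": "B1"},
    {"id": "c2", "donor": "D1", "batch": "B1"},
    {"id": "c3", "donor": "D2", "batch": "B1"},
    {"id": "c4", "donor": "D2", "batch": "B2"},
    {"id": "c5", "donor": "D3", "batch": "B2"},
    {"id": "c6", "donor": "D3", "batch": "B2"},
]

train, test, dropped = split_cells(cells, {"D3"}, {"B2"})

# For contrast: a per-cell split (alternating cells) shares every donor.
naive_train, naive_test = cells[::2], cells[1::2]
naive_leak = shared(naive_train, naive_test, "donor")

print([c["id"] for c in train], [c["id"] for c in test], [c["id"] for c in dropped])
print("donors leaked by per-cell split:", sorted(naive_leak))
```

Dropping the straddling cells costs data, which is the price of a clean holdout; the alternative is a metric that partly measures how well the model recognizes a donor or a run it has already seen.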
This is unglamorous, and it is where single-cell programs live or die. A team that nails the data platform and uses a modest model will out-discover a team with a giant model trained on a leaky, unversioned corpus. Operational Automation of the ingest-standardize-validate pipeline is what makes the atlas usable; the model is what you put on top once the foundation is sound.
The 2026 position is not anti-foundation-model. These models are real tools, strong when fine-tuned on good references, and improving. The position is that the field’s own rigorous benchmarks show they do not yet earn blanket trust in the zero-shot, cross-batch settings that matter most, and that the data engineering underneath them is the part most teams underinvest in and most regret. Build the platform first. Benchmark against the old baselines honestly. Treat the transformer as a component, not an oracle.
Standing up an atlas-scale single-cell platform — ontology reconciliation, batch-aware splits, reproducible pipelines under your models? Talk to our team. We engineer the data infrastructure that makes genomics at scale trustworthy.