DNA Foundation Models: Reading the Genome with AI
Genomic language models like Evo 2, Nucleotide Transformer, and Enformer learn the grammar of DNA. What they predict well, and where they still fall short.
For a decade the dominant way to predict what a stretch of DNA does was to hand-build a model around a single task: a classifier for splice sites, a separate one for promoters, another for chromatin accessibility. Each was trained from scratch on a curated dataset, and each broke the moment you moved it to a new cell type or a new species. The genomic foundation model flips that. You train one large network on raw sequence, let it learn the statistics of the genome, then read predictions off the representations it has already built. It is the same move that pretrained language models made in NLP, applied to a four-letter alphabet.
The results are real, and so are the limits. This is an engineer’s look at what these models actually do, which ones matter, and where the wet lab still has the last word.
What “language model for DNA” actually means
A genomic language model treats a sequence of nucleotides the way a text model treats tokens. It tokenizes the input — sometimes single nucleotides, sometimes k-mers — and trains on a self-supervised objective: predict a masked base, or predict the next base, given context. No labels. The genome is its own supervision signal, and there is a lot of it.
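To make the objective concrete, here is a minimal Python sketch of the two tokenization schemes and the two self-supervised setups described above. The function names, the `[MASK]` symbol, and the 15% masking rate are illustrative assumptions, not the settings of any particular model.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def tokenize_single(seq):
    """Single-nucleotide tokens: one token per base."""
    return list(seq.upper())


def tokenize_kmers(seq, k=6, stride=None):
    """k-mer tokens. stride=k (default) is non-overlapping; stride=1 overlaps."""
    stride = k if stride is None else stride
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Masked objective: hide some tokens, keep the originals as targets.

    Returns (masked_tokens, targets) where targets maps position -> true token.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            masked[i] = MASK
    return masked, targets


def next_token_pairs(tokens):
    """Autoregressive objective: (context so far, next token) pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

Either way the labels come from the sequence itself: `targets` and the second element of each pair are just bases that were already in the input.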
The payoff is the representation. Once the network has seen enough sequence, its internal embeddings encode structure no one annotated by hand. The Nucleotide Transformer, built by InstaDeep and trained on 3,202 human genomes plus 850 genomes from other species, learns features about regulatory elements that surface directly in its attention maps. On a benchmark of 18 downstream prediction tasks it matched or beat specialized methods on 11 of them out of the box, and on more after light fine-tuning — which is the whole pitch. You stop building a new model per question and start probing one model with many questions.

Tokenization is not a detail. Early models like DNABERT used overlapping k-mers, which inflates the vocabulary and smears positional information. Single-nucleotide resolution is cleaner but blows up the sequence length you have to attend over, and attention cost grows with the square of length. That tension — resolution versus context — defines the whole design space.
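The trade-off is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic only: it ignores special tokens and ambiguous bases, and the helper names are made up for illustration.

```python
def vocab_size(k, alphabet_size=4):
    """Number of distinct k-mers over A/C/G/T (special tokens ignored)."""
    return alphabet_size ** k


def token_count(seq_len, k=1, stride=1):
    """Tokens produced from seq_len bases with k-mers taken every `stride` bases."""
    if seq_len < k:
        return 0
    return (seq_len - k) // stride + 1


def kmers_covering(pos, seq_len, k):
    """How many overlapping (stride-1) k-mers contain the base at index `pos`."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, seq_len - k)
    return max(0, last_start - first_start + 1)


window = 100_000  # bases of context, an arbitrary example size
single = token_count(window, k=1, stride=1)       # one token per base
overlapping = token_count(window, k=6, stride=1)  # overlapping 6-mers
blocked = token_count(window, k=6, stride=6)      # non-overlapping 6-mers
```

With overlapping 6-mers the vocabulary grows from 4 to 4,096 entries while the sequence stays almost as long as the single-nucleotide one, and `kmers_covering` shows that an interior base is spread across six different tokens. Non-overlapping 6-mers cut the length about sixfold, at the price of no longer addressing individual bases.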
Long context is the hard part
Biology does not respect short windows. An enhancer can sit a hundred kilobases from the gene it regulates. A model that only sees a few hundred bases around a site is structurally blind to that interaction, no matter how good its architecture.
Enformer was the model that pushed this. It pairs convolutional layers with a transformer backbone to take in roughly 200 kilobases of sequence and predict gene expression and chromatin state across many cell types directly from DNA. The convolutions compress the raw sequence into a manageable number of bins; the transformer then models long-range relationships between them. It was a clear step up on enhancer–promoter prediction precisely because it could finally see both ends of the interaction.
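The compression step can be sketched in a few lines. This is a deliberately crude stand-in: plain average pooling over one-hot bases instead of learned convolutions. The 128-base bin width and the 196,608-base input length are assumptions used for illustration, not figures taken from the text above.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-vector; anything outside ACGT becomes all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def pool_bins(encoded, bin_width):
    """Average per-base vectors inside fixed-width bins.

    A stand-in for learned convolution + pooling layers: the point is only
    that the transformer sees len(seq) / bin_width positions, not len(seq).
    """
    bins = []
    for start in range(0, len(encoded) - bin_width + 1, bin_width):
        window = encoded[start:start + bin_width]
        bins.append([sum(col) / bin_width for col in zip(*window)])
    return bins


def num_bins(seq_len, bin_width):
    """Number of positions left for the attention layers after pooling."""
    return seq_len // bin_width


# Assumed sizes: roughly 200 kb of input pooled into 128-base bins.
positions_for_attention = num_bins(196_608, 128)
```

Attention over a couple of thousand bins is cheap; attention over two hundred thousand individual bases is not. That is the whole reason for the convolutional front end.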
The current frontier on context is Evo 2 from Arc Institute. It models DNA at single-nucleotide resolution with a context window of one million tokens, trained on more than 9 trillion nucleotides spanning over 128,000 genomes across bacteria, archaea, and eukaryotes. A naive transformer cannot reach a megabase of context — the quadratic cost is fatal — so Evo 2 uses a convolution-and-attention hybrid architecture that scales close to linearly with length. That is the engineering unlock. Context length here is not a vanity metric; it is what lets a single model reason about a gene, its regulatory neighborhood, and the structural context around them at once.
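To see why the quadratic term is fatal at this scale, compare how the two cost curves grow. The sketch counts abstract operations with all constants dropped, and it uses `n * log2(n)` as an assumed stand-in for "close to linear"; it is not a model of Evo 2's actual compute.

```python
import math


def quadratic_cost(n_tokens):
    """Pairwise interactions in full self-attention (constants dropped)."""
    return n_tokens * n_tokens


def near_linear_cost(n_tokens):
    """Assumed n * log2(n) growth for a long-convolution style operator."""
    return n_tokens * math.log2(n_tokens)


def cost_ratio(n_tokens):
    """How many times more work the quadratic operator does at this length."""
    return quadratic_cost(n_tokens) / near_linear_cost(n_tokens)


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens: quadratic is ~{cost_ratio(n):,.0f}x the near-linear cost")
```

The gap widens with every extra token, so a design that is tolerable at a few thousand tokens becomes unusable long before a megabase.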
From representation to prediction
The most useful thing these models do today is variant effect prediction: given a mutation, is it likely benign or likely to break something? This matters clinically because many of the rare variants that turn up in a patient’s genome are classified as “variants of uncertain significance,” meaning there is not yet enough evidence to call them benign or pathogenic.
Evo 2 predicts the functional impact of variants from sequence alone, with no task-specific fine-tuning. In the published evaluation it separated benign from likely pathogenic variants in BRCA1, a breast-cancer gene where the clinical stakes are concrete, with over 90% accuracy. The model never saw a label saying “this mutation causes disease.” It learned that the mutated sequence is improbable under the distribution of real, functional genomes, and improbable sequence is a decent proxy for broken function.
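The scoring idea itself is simple enough to sketch: compare the model's log-likelihood of the sequence with and without the variant. In the Python below a tiny smoothed k-mer model stands in for the genomic language model, purely so the example runs on its own; `KmerModel`, `apply_snv`, and `variant_score` are hypothetical names, and nothing here reflects Evo 2's real interface.

```python
import math
from collections import defaultdict


class KmerModel:
    """Toy stand-in for a genomic LM: P(next base | previous `order` bases)."""

    def __init__(self, order=3, alphabet="ACGT"):
        self.order = order
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of log-probabilities with add-one smoothing for unseen events."""
        total = 0.0
        for i in range(self.order, len(seq)):
            ctx = self.counts.get(seq[i - self.order:i], {})
            numerator = ctx.get(seq[i], 0) + 1
            denominator = sum(ctx.values()) + len(self.alphabet)
            total += math.log(numerator / denominator)
        return total


def apply_snv(seq, pos, alt):
    """Return the sequence with the base at index `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_score(model, ref_seq, pos, alt):
    """log P(alt sequence) - log P(ref sequence). Negative = less probable."""
    return model.log_likelihood(apply_snv(ref_seq, pos, alt)) - model.log_likelihood(ref_seq)


# Toy "genome": a perfectly regular repeat, so any substitution looks unusual.
model = KmerModel(order=3).fit(["ACGT" * 50])
score = variant_score(model, "ACGTACGTACGT", pos=5, alt="T")
```

A strongly negative score says the variant makes the sequence less like what the model was trained on. With a real model the same subtraction applies; only the likelihood function changes.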
What the model learned without being told
The interpretability work is the part worth dwelling on. Using sparse autoencoders on Evo 2’s activations, researchers pulled out interpretable features that correspond to real biology: intron–exon boundaries, transcription-factor binding motifs, even features that track protein secondary structure. Nobody supervised these. They fell out of next-token prediction on raw DNA, the same way a text model learns syntax it was never explicitly taught. For an AI implementation team, that is the signal that the representation is capturing mechanism, not just surface correlation.
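For readers who have not met sparse autoencoders, the core of the technique fits in a short sketch: map an activation vector to a wider set of non-negative features, reconstruct the activation from them, and penalize the features for being active. What follows is the generic textbook formulation in plain Python, with made-up names and an arbitrary penalty weight; it is not a claim about the exact variant used in the Evo 2 work.

```python
def relu(x):
    return x if x > 0 else 0.0


def encode(activation, w_enc, b_enc):
    """One feature per row of w_enc; ReLU keeps features non-negative."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for f, row in zip(features, w_dec):
        for j, w in enumerate(row):
            out[j] += f * w
    return out


def sae_loss(activation, w_enc, b_enc, w_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    features = encode(activation, w_enc, b_enc)
    recon = decode(features, w_dec, b_dec)
    mse = sum((r - a) ** 2 for r, a in zip(recon, activation)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity, features
```

After training, each row of `w_dec` is a candidate "feature direction". The interpretability step is then empirical: look at which stretches of sequence make a given feature fire and check whether they line up with known annotations.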

Where it breaks
Be precise about the failure modes, because the hype around these models outruns them.
First, these are correlational engines. A model can tell you a sequence is improbable or that a variant shifts a predicted expression track. It cannot tell you the causal mechanism, and it will be confidently wrong on edge cases that are underrepresented in training data — rare populations, unusual structural variants, anything far from the distribution of sequenced genomes, which still skews heavily toward a few well-studied species.
Second, predicted is not measured. A high pathogenicity score is a hypothesis, not a diagnosis. Every prediction that touches a patient has to be confirmed in an assay. The honest framing is that these models triage: they rank thousands of candidate variants so the wet lab spends its limited capacity on the ones most likely to matter. That is genuinely valuable and a long way from “the AI read the genome.”
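In code, triage is little more than a sort and a cut. The sketch assumes a score convention where more negative means more disruptive under the model, and an assay budget expressed as a simple count; both are illustrative choices, as are the names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    variant_id: str
    score: float  # assumed convention: more negative = more disruptive


def triage(variants, assay_capacity):
    """Split variants into (send to wet lab, defer), most disruptive first.

    The output is a work queue, not a set of calls: nothing here is
    labeled pathogenic or benign.
    """
    ranked = sorted(variants, key=lambda v: v.score)
    return ranked[:assay_capacity], ranked[assay_capacity:]
```

The deferred list matters as much as the selected one. Those variants have not been cleared, only postponed.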
Third, evaluation is treacherous. Genomic benchmarks leak. If your test sequences are evolutionarily close to training sequences, you measure memorization of conserved regions, not generalization. Reported accuracy numbers mean little without knowing how the split was constructed, and many published figures are not directly comparable across papers.
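One common guard is to split by group rather than by individual sequence, so that related sequences never straddle the train/test boundary. The sketch below assumes each record carries a `group` key, which could be a chromosome or, better, a homology cluster ID computed upstream; the record layout and function names are hypothetical.

```python
def split_by_holdout(records, holdout_groups):
    """Put every record from a held-out group in test, everything else in train."""
    holdout = set(holdout_groups)
    train = [r for r in records if r["group"] not in holdout]
    test = [r for r in records if r["group"] in holdout]
    return train, test


def leaked_groups(train, test):
    """Groups that appear on both sides of the split (should be empty)."""
    return {r["group"] for r in train} & {r["group"] for r in test}


records = [
    {"id": "seq1", "group": "chr1"},
    {"id": "seq2", "group": "chr1"},
    {"id": "seq3", "group": "chr8"},
    {"id": "seq4", "group": "chr9"},
]
train, test = split_by_holdout(records, holdout_groups=["chr8", "chr9"])
```

Holding out whole chromosomes is only a partial fix, because paralogs and conserved elements recur across chromosomes. Grouping by sequence-similarity cluster is the stricter version of the same idea.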
Fourth, the long-context wins are real but expensive. A megabase context model is heavy to train and to serve. For most practical questions — score these 10,000 variants in one gene — a smaller, cheaper model gets you most of the way, and the frontier context only pays off when long-range regulatory interactions genuinely drive the answer.
Not just DNA, and not just one species
Two more shifts are worth naming because they change what these models are good for. The first is multi-modality. The newest genomic models do not stop at DNA — Evo 2 spans DNA, RNA, and protein, because in a real cell those layers are coupled and a model that sees only one of them is missing the dependencies that drive function. A regulatory mutation matters because of what it does to transcription and ultimately to protein; a model trained across all three can, in principle, follow that chain. That breadth is also what makes a single backbone reusable across tasks a lab would previously have built separate pipelines for.
The second is the species question. Training data is wildly imbalanced: human and a handful of model organisms are sequenced to death, while most of the tree of life is sparse. Models trained across more than a hundred thousand genomes spanning bacteria, archaea, and eukaryotes generalize better than human-only models, and that breadth is part of why they pick up deep, conserved signals. But the flip side is real: performance on an obscure clade, or on a structural-variant class that barely appears in training, will be worse than the headline numbers suggest, and the model rarely tells you when you have wandered off the map. Knowing where your sequence sits relative to the training distribution is not optional context; it is the difference between a usable prediction and a confident hallucination.
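A crude first check on "where does my sequence sit" is to compare its k-mer composition with a reference profile built from sequence resembling the training data. The sketch below is a heuristic under loud assumptions: composition is a weak proxy for the training distribution, the reference profile has to come from somewhere you trust, and the choice of 3-mers and any threshold you apply are arbitrary.

```python
import math
from collections import Counter


def kmer_profile(seq, k=3):
    """Relative frequency of each k-mer in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_distance(p, q):
    """0.0 = same composition, 1.0 = no k-mers in common."""
    dot = sum(p.get(key, 0.0) * q.get(key, 0.0) for key in set(p) | set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def looks_out_of_distribution(query_seq, reference_profile, threshold=0.5, k=3):
    """Flag a query whose composition is far from the reference profile."""
    return cosine_distance(kmer_profile(query_seq, k), reference_profile) > threshold
```

A flag from a check like this does not mean the prediction is wrong, and a pass does not mean it is right. It is a cheap prompt to look harder before trusting the score.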
How to actually use them
If you are standing up genomics workloads, treat foundation models as a feature layer, not an oracle. The pattern that works: take embeddings or variant scores from a pretrained model, feed them into a lightweight task-specific head trained on your own labeled data, and keep a wet-lab validation loop downstream of every prediction that drives a decision. The same discipline we apply to any Data Platforms project applies here — version your models and your training splits, log every inference, and never let a score reach a clinician without provenance.
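As a rough sketch of that pattern: the embeddings are assumed to come from some pretrained backbone (not shown), the head is a plain logistic regression trained by gradient descent, and every score is wrapped in a record that carries its provenance. All names, fields, and hyperparameters are hypothetical.

```python
import hashlib
import math
from dataclasses import dataclass


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on frozen embeddings with plain SGD."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            grad = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def split_fingerprint(train_ids):
    """Short, order-independent hash identifying the training split."""
    return hashlib.sha256("\n".join(sorted(train_ids)).encode()).hexdigest()[:16]


@dataclass
class InferenceRecord:
    """What travels with a score: which models and which split produced it."""
    variant_id: str
    score: float
    backbone_version: str
    head_version: str
    split_hash: str
    validated_in_assay: bool = False  # flipped only after wet-lab confirmation


def score_with_provenance(variant_id, embedding, w, b, backbone_version, head_version, split_hash):
    return InferenceRecord(
        variant_id=variant_id,
        score=predict(w, b, embedding),
        backbone_version=backbone_version,
        head_version=head_version,
        split_hash=split_hash,
    )
```

The head is small on purpose. It can be retrained in seconds when labels change, while the expensive backbone stays frozen and versioned.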
The reason to care is consolidation. One pretrained backbone replaces a drawer full of brittle single-task models, and it generalizes across cell types and species in a way the old approach never did. That is a real shift in how genomic prediction gets built. It is just not a shortcut around biology.