Climate Foundation Models: AI for Earth Systems

ML weather and climate models now beat physics on real benchmarks. GraphCast, AIFS, NeuralGCM — and the data-assimilation and trust questions underneath.

For seventy years, forecasting the atmosphere meant solving the equations of fluid dynamics on the biggest computer you could afford. Numerical weather prediction (NWP) discretizes the globe into a grid, integrates the primitive equations forward in time, and consumes a meaningful fraction of a national supercomputer’s budget to produce a ten-day forecast. It works remarkably well. It is also slow, expensive, and largely closed to anyone without a public-sector HPC center. Over the last three years, machine-learned models have matched or beaten that physics-based pipeline on standard scoring metrics while running in minutes on a single accelerator. This is not a tweak to the workflow. It is a different way to model the Earth system, and it brings a new set of engineering and trust problems that the old pipeline did not have.

The three models worth understanding

The phrase “AI weather model” hides real architectural differences. Three systems define the current landscape, and they sit at different points on the physics-to-data spectrum.

GraphCast: pure learning on a graph

GraphCast, from DeepMind, is a graph neural network trained on ECMWF’s ERA5 reanalysis — roughly four decades of assimilated atmospheric state. It represents the globe as a multi-mesh graph and learns to step the atmosphere forward in six-hour increments, autoregressively, with no explicit fluid dynamics in the loop. On published benchmarks it outscored ECMWF’s high-resolution deterministic forecast on a large majority of variables and lead times, and it produces a ten-day global forecast in roughly a minute on a single TPU. The catch is that it learned the dynamics implicitly from data; it has no built-in guarantee of physical consistency, and it remains primarily a research artifact rather than an operational system.
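The autoregressive stepping described above can be sketched in a few lines. This is a minimal illustration, not GraphCast's code: `toy_step` is a hypothetical stand-in for the trained network, the state is a short list of numbers rather than a global grid, and the only detail taken from the text is the six-hour step.

```python
from typing import Callable, List

State = List[float]  # toy stand-in for a gridded atmospheric state

STEP_HOURS = 6  # step length named in the text


def toy_step(state: State) -> State:
    """Hypothetical stand-in for a learned one-step model (not GraphCast)."""
    n = len(state)
    # Mix each cell with its neighbour on a periodic ring.
    return [0.9 * state[i] + 0.1 * state[(i - 1) % n] for i in range(n)]


def rollout(step: Callable[[State], State], initial: State, lead_hours: int) -> List[State]:
    """Feed each prediction back in as the next input (autoregressive rollout)."""
    if lead_hours % STEP_HOURS != 0:
        raise ValueError("lead time must be a multiple of the step length")
    states = [initial]
    for _ in range(lead_hours // STEP_HOURS):
        states.append(step(states[-1]))
    return states


forecast = rollout(toy_step, [1.0, 0.0, 0.0, 0.0], lead_hours=10 * 24)
n_steps = len(forecast) - 1  # number of six-hour steps in a ten-day forecast
```

Because every step consumes the previous output, any error made early in the rollout is carried into all later steps, which is why long lead times are the hard part for this design.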

ECMWF AIFS: a learned model in operations

The more consequential story is ECMWF putting a machine-learned model into actual operational service. The Artificial Intelligence Forecasting System (AIFS) went operational in February 2025, with an ensemble version following in mid-2025 carrying 51 members. Architecturally it is a graph neural network encoder and decoder wrapped around a sliding-window transformer processor, trained on ERA5 and ECMWF’s operational analyses. ECMWF’s own verification showed AIFS beating its physics-based ensemble on several mid-latitude variables, with day-five surface-temperature error dropping by close to twenty percent and tropical-cyclone track error narrowing by a comparable margin. A later update added physical-consistency constraints through bounding layers, which notably improved precipitation — a direct admission that pure learning needed physical guardrails to be trustworthy on the variables that matter most operationally.
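To make the idea of a bounding layer concrete, here is a minimal sketch, assuming the simplest possible form: each raw network output is passed through a function that maps it into its physically allowed range. The variable names and the choice of bounding functions are illustrative assumptions, not the AIFS implementation.

```python
import math
from typing import Callable, Dict


def bound_nonnegative(x: float) -> float:
    """ReLU-style bound for quantities that cannot be negative (e.g. precipitation)."""
    return max(0.0, x)


def bound_fraction(x: float) -> float:
    """Squash an unbounded value into the range 0 to 1 (e.g. cloud cover)."""
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical mapping from output variable to its bounding function.
BOUNDS: Dict[str, Callable[[float], float]] = {
    "total_precipitation": bound_nonnegative,
    "cloud_cover": bound_fraction,
}


def apply_bounds(raw: Dict[str, float]) -> Dict[str, float]:
    """Apply the bound for each variable; unlisted variables pass through unchanged."""
    return {name: BOUNDS.get(name, lambda v: v)(value) for name, value in raw.items()}


raw_output = {"total_precipitation": -0.3, "cloud_cover": 2.0, "temperature": 281.5}
bounded = apply_bounds(raw_output)
```

The design point is that the constraint is part of the model's output path, so the network cannot emit negative rainfall no matter what the loss rewards.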

NeuralGCM: the hybrid

NeuralGCM, from Google Research with ECMWF, is the most interesting from an architecture standpoint. It is a fully differentiable hybrid: a conventional dynamical core solving the large-scale fluid equations, coupled to learned neural components that handle the small-scale physics — clouds, convection, turbulence — that traditional models approximate with hand-built parameterizations. Because the whole model is differentiable, the learned parts are trained end to end through the physics. It runs orders of magnitude faster than a conventional general circulation model, stays competitive with the best models on one-to-ten-day forecasts, and, unlike the pure-ML systems, can run stable multi-decade climate simulations. That last property is the dividing line between a weather model and something you can call an Earth-system model.
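The hybrid structure can be illustrated with a toy one-dimensional model. This is a sketch under heavy assumptions, not NeuralGCM: the "dynamical core" is first-order upwind advection on a periodic grid, and `learned_tendency` is a hand-written placeholder where a trained network would sit. It shows only how the two tendencies are combined in a time step, not the end-to-end training through the solver.

```python
from typing import Callable, List

Field = List[float]


def dynamics_tendency(u: Field, speed: float = 1.0, dx: float = 1.0) -> Field:
    """Toy 'dynamical core': first-order upwind advection on a periodic 1-D grid."""
    n = len(u)
    return [-speed * (u[i] - u[(i - 1) % n]) / dx for i in range(n)]


def learned_tendency(u: Field) -> Field:
    """Hypothetical placeholder for a trained network supplying unresolved physics.
    Here it is a weak relaxation toward the mean, chosen only for illustration."""
    mean = sum(u) / len(u)
    return [-0.05 * (x - mean) for x in u]


def hybrid_step(
    u: Field,
    dt: float,
    core: Callable[[Field], Field] = dynamics_tendency,
    physics: Callable[[Field], Field] = learned_tendency,
) -> Field:
    """One explicit Euler step using the sum of resolved and learned tendencies."""
    resolved = core(u)
    unresolved = physics(u)
    return [x + dt * (a + b) for x, a, b in zip(u, resolved, unresolved)]


state = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(10):
    state = hybrid_step(state, dt=0.5)
```

In this toy both tendencies sum to zero over the grid, so the total is preserved at every step. That is the kind of structural guarantee a physics core can provide and a free-form network does not.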

Skill versus physics: what the benchmarks actually say

It is tempting to read “AI beats NWP” as a clean win. The reality is narrower and more interesting. The ML models win decisively on standard deterministic scores like root-mean-square error and anomaly correlation, at a fraction of the compute. But RMSE rewards a particular behavior: when uncertain, predict something close to the average. ML models, trained to minimize exactly that loss, learn to blur. They produce smooth, low-error fields that systematically underrepresent the sharp gradients and extreme values — the intense rain band, the rapidly deepening cyclone — that forecasts exist to warn about. A model can win on average error while being worse at the one percent of cases anyone cares about.
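The incentive to blur can be seen in a toy example. The numbers below are invented for illustration: a single sharp rain peak, a forecast that gets the peak right but one cell out of place, a smoothed forecast, and a forecast of no event at all.

```python
import math
from typing import Sequence


def rmse(forecast: Sequence[float], truth: Sequence[float]) -> float:
    """Root-mean-square error between two equally sized fields."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))


# Invented 1-D rainfall fields (mm); the event is a single sharp peak.
truth         = [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0]
sharp_shifted = [0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]  # right intensity, one cell off
blurred       = [0.0, 0.0, 0.0, 2.5, 2.5, 0.0, 0.0, 0.0]   # smeared, low intensity
no_event      = [0.0] * 8                                   # predicts nothing at all

scores = {
    "sharp_shifted": rmse(sharp_shifted, truth),
    "blurred": rmse(blurred, truth),
    "no_event": rmse(no_event, truth),
}
ranking = sorted(scores, key=scores.get)  # best (lowest RMSE) first
```

Under RMSE the blurred forecast ranks first and the sharp, nearly correct forecast ranks last, behind predicting no rain at all, even though it is the only one that gets the intensity of the event right.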

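This blurring is the rational response to what forecast verification calls the “double penalty” problem: a sharp feature predicted slightly out of place is penalized twice under a point-wise score, once for missing the feature where it occurred and once for predicting it where it did not, so a smoothed forecast scores better than a sharp one that is nearly right. It is why precipitation and extremes are where physics-based models still hold ground, and why ECMWF bolted physical constraints onto AIFS rather than letting it optimize freely. The honest framing is not “ML replaced physics.” It is this: ML is dramatically more efficient and competitive on smooth, large-scale fields; physics is still essential for sharp local extremes and for any guarantee that mass and energy are conserved over long runs; and the productive frontier is hybrid models that keep the physics where it earns its place.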
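A conservation guarantee is something that can be checked mechanically on model output. The sketch below assumes a regular latitude-longitude grid and a hypothetical column-integrated quantity with no sources or sinks; the function names and the tolerance are illustrative. It compares the area-weighted global mean before and after a step, using the cosine of latitude as the area weight.

```python
import math
from typing import Sequence

Grid = Sequence[Sequence[float]]  # indexed as field[lat][lon]


def global_mean(field: Grid, lats_deg: Sequence[float]) -> float:
    """Area-weighted mean on a regular lat-lon grid (weight = cos(latitude))."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    weighted = sum(w * sum(row) / len(row) for w, row in zip(weights, field))
    return weighted / sum(weights)


def relative_drift(before: Grid, after: Grid, lats_deg: Sequence[float]) -> float:
    """Fractional change in the global mean between two states."""
    ref = global_mean(before, lats_deg)
    return abs(global_mean(after, lats_deg) - ref) / abs(ref)


def conserves(before: Grid, after: Grid, lats_deg: Sequence[float],
              tolerance: float = 1e-3) -> bool:
    """True if the global mean changed by no more than the (assumed) tolerance."""
    return relative_drift(before, after, lats_deg) <= tolerance


lats = [-60.0, 0.0, 60.0]
before = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
moved = [[1.5, 0.5], [1.0, 1.0], [1.0, 1.0]]    # redistributed along a latitude circle
created = [[1.0, 1.0], [1.2, 1.2], [1.0, 1.0]]  # extra material appears at the equator

ok_moved = conserves(before, moved, lats)
ok_created = conserves(before, created, lats)
```

A real budget check would also have to account for sources and sinks such as precipitation and evaporation, so this is the shape of the test rather than a complete one.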

There is a deeper reason to care about how the skill is achieved, not just the score. A physics-based forecast is wrong in interpretable ways — you can trace an error to a misrepresented front or a coarse grid cell, and a forecaster can reason about it. A pure-ML forecast that wins on RMSE can be wrong in ways that have no physical narrative at all, because the model never committed to physics in the first place. For a research benchmark that is fine. For an operational center that has to stand behind a warning and explain it afterward, the interpretability of the error matters as much as its magnitude. That is part of why the operational systems are converging on hybrids and constraint layers rather than chasing the lowest possible benchmark number with an unconstrained network.

The part nobody markets: data assimilation

Here is the dependency that gets buried under the benchmark numbers. Every one of these models (GraphCast, AIFS, NeuralGCM) was trained on ERA5, and every one of them still needs an initial condition to forecast from. That initial condition comes from data assimilation: the slow, expensive, deeply unglamorous process of fusing millions of irregular observations from satellites, balloons, aircraft, and surface stations into a single physically consistent estimate of the atmosphere’s current state. Data assimilation is itself a massive inverse problem, and today it still largely runs on the conventional physics-based system. The flashy ML forecast is riding on a pipeline it did not replace.
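The core idea of assimilation, blending a prior model estimate with observations according to how much each is trusted, fits in a few lines for a single scalar. This is the textbook minimum-variance update, shown as a sketch with invented numbers; operational systems solve a vastly larger version of the same problem over the full model state.

```python
from typing import Iterable, Tuple


def assimilate(background: float, background_var: float,
               obs: float, obs_var: float) -> Tuple[float, float]:
    """Blend one observation into the background, weighting by error variance."""
    gain = background_var / (background_var + obs_var)  # trust placed in the observation
    analysis = background + gain * (obs - background)
    analysis_var = (1.0 - gain) * background_var        # uncertainty shrinks after the update
    return analysis, analysis_var


def assimilate_all(background: float, background_var: float,
                   observations: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Assimilate (value, error variance) pairs one after another."""
    state, var = background, background_var
    for value, obs_var in observations:
        state, var = assimilate(state, var, value, obs_var)
    return state, var


# Invented example: a 280.0 K background (variance 4.0), one accurate
# observation and one noisy observation of the same temperature.
analysis, analysis_var = assimilate_all(280.0, 4.0, [(282.0, 1.0), (279.0, 4.0)])
```

The analysis lands closest to the accurate observation, and its variance is lower than that of any single input, which is the point of fusing sources rather than picking one.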

Which means the genuinely hard, genuinely valuable work over the next few years is learned data assimilation: building ML systems that ingest raw observations and produce the analysis state directly, closing the loop end to end. Several groups are working on it. Until it is solved, “AI weather models” are an accelerated forecast step bolted onto a traditional ingestion and assimilation backbone. For anyone building on top of these systems, that backbone (the observational data platforms, the quality control, the bias correction, the assimilation) is where most of the real engineering and most of the cost actually live.

The reanalysis dependency cuts deeper than convenience. ERA5 is not raw observation; it is itself the output of a physics-based assimilation system, a model’s best guess at the past stitched together from sparse measurements. Training the next generation of forecast models on it means inheriting whatever biases that system carries, and it means these “data-driven” models are, at one remove, still anchored to the physics they appear to replace. Break the reanalysis pipeline and the ML models have nothing to learn from and nothing to initialize from. This is the kind of hidden coupling that looks like a footnote until it becomes the thing that takes the whole system down, and it is exactly what a serious Earth-system data platform has to make explicit rather than assume away.

Trust, and what this means if you build on it

The trust questions are not abstract. A purely learned model can produce a forecast that is locally smooth, plausible, and physically impossible: a field that quietly violates conservation laws in a way a meteorologist would catch but a downstream automated system would not. For low-stakes use that is tolerable. For flood response, grid balancing, agricultural planning, reinsurance pricing, or any operational automation that triggers real-world action, it is not. The mitigations are concrete: physical-constraint layers, ensembles that expose uncertainty instead of hiding it behind a single smooth number, and rigorous out-of-distribution evaluation, because a model trained on the last forty years has, by construction, never seen the climate of the next forty.
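What it means for an ensemble to expose uncertainty can be shown with a small example. The member values and the threshold below are invented; the sketch contrasts a decision rule driven by the ensemble mean with one driven by the fraction of members exceeding a threshold.

```python
from statistics import mean, pstdev
from typing import Sequence


def exceedance_probability(members: Sequence[float], threshold: float) -> float:
    """Fraction of ensemble members above the threshold."""
    return sum(1 for m in members if m > threshold) / len(members)


# Invented 10-member ensemble of 24-hour rainfall (mm) at one location.
members = [2.0, 3.0, 1.0, 4.0, 2.0, 55.0, 3.0, 61.0, 2.0, 48.0]
FLOOD_THRESHOLD_MM = 40.0  # hypothetical action threshold

ens_mean = mean(members)
ens_spread = pstdev(members)  # large spread signals that members disagree
p_flood = exceedance_probability(members, FLOOD_THRESHOLD_MM)

mean_triggers = ens_mean > FLOOD_THRESHOLD_MM  # rule based on the single smooth number
prob_triggers = p_flood >= 0.2                 # rule based on the distribution (assumed cutoff)
```

Here the mean sits well below the threshold and would trigger nothing, while three of the ten members predict flooding. An automated system that sees only the mean never learns that the risk exists.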

The architectural lesson generalizes well past meteorology. These are foundation models in the real sense: pretrained on enormous historical archives, increasingly fine-tuned to regional and downstream tasks, and only as trustworthy as the data pipeline feeding them and the physical constraints fencing them in. That is the same pattern behind every serious AI implementation we ship, whether a hospital management system ranking clinical risk or a model forecasting demand across a logistics network. The model is the visible part. The data platform underneath, and the guardrails around it, are what decide whether you can trust the output enough to act on it.

The takeaway is not that physics-based modeling is obsolete. It is that Earth-system modeling has become a hybrid discipline — learned where data is abundant and dynamics are smooth, physical where conservation and extremes matter — and that the value, as usual, is concentrated less in the model and more in the data infrastructure and the engineering discipline around it.

