Multimodal AI Diagnosis: Fusing Scans, Notes, and Labs
Multimodal clinical foundation models fuse imaging, notes, and EHR labs. Why fusion is hard, how missing modalities break it, and how the HMS anchors it all.
A clinician diagnosing a patient does not read one thing. They read the chest CT, the radiologist’s note, the trend in the white-cell count, the medication list, and the three lines of history that say the patient was short of breath last Tuesday. The diagnosis lives in the combination. For most of the history of clinical AI, models did the opposite — one modality, one task: a classifier for the scan, a separate model for the notes, nothing that reasoned across them. Multimodal clinical foundation models are the attempt to close that gap, and they are genuinely promising. They are also harder to deploy than the demos suggest, and the difficulty is not the model.
What a multimodal clinical model is
The clearest public example is Google’s Med-Gemini family. Med-Gemini is a set of multimodal models built on the Gemini architecture to process text, medical images, genomics, and EHR data in support of diagnosis, radiology interpretation, EHR summarization, and treatment planning. On a battery of multimodal benchmarks, including the NEJM Image Challenges and multimodal exam-style questions, it posted strong results, and in some evaluations its drafts for tasks like text summarization and referral letters were preferred to clinician-written ones.
Benchmark wins are necessary and badly insufficient. The NEJM Image Challenge is a clean, curated puzzle with exactly the modalities you need, present and aligned. Real clinical data is none of those things. The gap between “tops the leaderboard” and “useful at the bedside” is where the actual work lives, and a scoping review of multimodal AI in medicine is blunt that most multimodal models remain confined to research rather than routine clinical use.
Why fusion is genuinely hard
The instinct is that more data is better, so fusing modalities must improve everything. In practice fusion introduces problems that single-modality systems never had.
The modalities do not align
A CT volume, a paragraph of free text, and a time series of lab values have nothing in common structurally. They differ in dimensionality, in sampling rate, in noise, in how they are encoded. Bringing them into a shared representation where the model can reason jointly is the core technical challenge, and where you do it defines your architecture.
Early fusion combines low-level features before modeling. It can capture fine cross-modal interactions but is brittle: it usually demands that every modality be present and aligned, and a single missing input can break the whole pipeline. Late fusion, by contrast, keeps separate per-modality models and combines their outputs at the decision layer via ensembling, stacking, or mixture-of-experts. It captures less cross-modal nuance but it is modular, easier to integrate into clinical workflows, and relatively tolerant of missing inputs. That tolerance is why, in real deployments, late fusion via stacking is often the more compelling choice — each specialist model can be trained, validated, and monitored independently, then combined by a lightweight meta-learner.
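To make the late-fusion idea concrete, here is a minimal sketch, assuming each specialist model emits a probability and the combiner is a weighted average in log-odds space. The function names and the weights are hypothetical; in a stacked deployment the weights would be fit by the meta-learner on held-out predictions instead of being set by hand.

```python
import math
from typing import Dict, Optional


def logit(p: float) -> float:
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))


def late_fusion(probs: Dict[str, Optional[float]],
                weights: Dict[str, float],
                bias: float = 0.0) -> Optional[float]:
    """Combine per-modality probabilities; None marks a missing modality.

    Weights of the modalities that are present are renormalized so the
    fused score stays on a comparable scale when inputs are absent.
    """
    present = {m: p for m, p in probs.items() if p is not None}
    if not present:
        return None  # nothing to fuse; the caller should abstain
    total = sum(weights[m] for m in present)
    z = bias + sum(weights[m] / total * logit(p) for m, p in present.items())
    return 1 / (1 + math.exp(-z))


# Hypothetical weights, for illustration only.
WEIGHTS = {"imaging": 0.5, "notes": 0.3, "labs": 0.2}

full = late_fusion({"imaging": 0.9, "notes": 0.7, "labs": 0.6}, WEIGHTS)
partial = late_fusion({"imaging": 0.9, "notes": None, "labs": 0.6}, WEIGHTS)
```

The point of the sketch is the `None` handling: a missing note changes the weighting but does not stop the pipeline, which is the tolerance the paragraph above attributes to late fusion.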

Missing modalities are the normal case
This deserves its own heading because it is the reality most demos quietly ignore. In a real hospital, the patient who has a CT may not have an MRI; the labs may be three days stale; the outside imaging may never have arrived; the notes may be a scanned fax. A multimodal model that only performs when all inputs are present and current is a research artifact, not a clinical tool.
The literature treats robustness to absent modalities as a first-order requirement. Approaches range from generating learnable embeddings to stand in for a missing modality, to encoders that accept variable-length input so the model degrades gracefully instead of failing. The engineering consequence is concrete: you design for partial inputs from day one, and you measure performance under realistic missingness, not on the complete-case subset that flatters your numbers.
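One way to picture the placeholder-embedding approach is the sketch below. It assumes toy four-dimensional embeddings and uses constant vectors as stand-ins for what would, in a trained model, be learned parameters. The names `MISSING_TOKEN` and `assemble_inputs` are illustrative, not taken from any particular system.

```python
from typing import Dict, List, Optional, Tuple

MODALITIES = ("imaging", "notes", "labs")
DIM = 4  # toy embedding width, chosen for illustration

# Stand-ins for learned "missing" vectors. In a trained model these would be
# parameters updated during training, not constants.
MISSING_TOKEN: Dict[str, List[float]] = {m: [0.0] * DIM for m in MODALITIES}


def assemble_inputs(
    embeddings: Dict[str, Optional[List[float]]]
) -> Tuple[List[float], List[int]]:
    """Concatenate per-modality embeddings in a fixed order.

    An absent modality is replaced by its placeholder vector, and a
    presence mask is returned so downstream layers can tell the difference.
    """
    fused: List[float] = []
    mask: List[int] = []
    for name in MODALITIES:
        vec = embeddings.get(name)
        if vec is None:
            fused.extend(MISSING_TOKEN[name])
            mask.append(0)
        else:
            if len(vec) != DIM:
                raise ValueError(f"{name} embedding must have length {DIM}")
            fused.extend(vec)
            mask.append(1)
    return fused, mask


# The notes embedding never arrived; the input is still well-formed.
fused, mask = assemble_inputs(
    {"imaging": [0.1, 0.2, 0.3, 0.4], "labs": [1.0, 0.0, 0.5, 0.2]}
)
```

Returning the mask alongside the fused vector matters: it lets the model, and later the monitoring, distinguish "the note said nothing relevant" from "there was no note".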
Aligning to the patient and the moment
Even when modalities are present, they have to be joined to the same patient at the same clinically relevant time. A lab value from a prior admission paired with today’s scan is different evidence from a lab and a scan both taken this morning. Temporal alignment and correct patient linkage are prerequisites the model cannot fix; they have to be solved in the data layer.
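As a rough illustration of what alignment in the data layer can look like, the sketch below picks the most recent lab for the right patient within a freshness window before a scan. The `LabResult` record, its field names, and the 24-hour window are assumptions made for the example, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class LabResult:
    patient_id: str
    code: str
    value: float
    collected_at: datetime


def latest_lab_before(labs: List[LabResult], patient_id: str, code: str,
                      anchor: datetime, max_age: timedelta) -> Optional[LabResult]:
    """Most recent matching lab collected within max_age before the anchor time."""
    earliest = anchor - max_age
    candidates = [
        lab for lab in labs
        if lab.patient_id == patient_id
        and lab.code == code
        and earliest <= lab.collected_at <= anchor
    ]
    return max(candidates, key=lambda lab: lab.collected_at, default=None)


scan_time = datetime(2024, 5, 14, 9, 30)
labs = [
    LabResult("p1", "WBC", 14.2, datetime(2024, 5, 14, 6, 0)),
    LabResult("p1", "WBC", 9.8, datetime(2024, 3, 2, 8, 0)),   # prior admission
    LabResult("p2", "WBC", 7.1, datetime(2024, 5, 14, 7, 0)),  # different patient
]
match = latest_lab_before(labs, "p1", "WBC", scan_time, timedelta(hours=24))
```

When nothing falls inside the window the function returns `None`, so a stale or misattributed value is never silently joined to today’s scan.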
The Hospital Management System is the data backbone
This is the part that gets skipped in every multimodal-AI keynote, and it is the part that decides whether any of this ships.
Every modality a clinical model consumes originates somewhere in the hospital’s operational systems. Orders, admissions, medications, labs, demographics, and the patient timeline live in the Hospital Management System. Imaging lives in PACS but is indexed by identifiers the HMS owns. Notes are written into the HMS. The structured EHR signal a multimodal model needs — the lab trends, the problem list, the medication history — is HMS data. The HMS is not a peripheral source; it is the spine the other modalities hang off.
That has a hard implication. The quality ceiling of a multimodal clinical model is set by the HMS, not by the model. If the HMS cannot emit a clean, time-stamped, correctly attributed event stream — this lab, for this patient, at this time, linked to this encounter — then fusion is built on sand. The most common reason a promising model never reaches the ward is not that it was inaccurate. It is that the data backbone could not reliably deliver aligned, identified, multimodal inputs at inference time.
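A minimal sketch of the kind of check this implies at the point of ingestion follows. It assumes events arrive as dictionaries with the hypothetical fields `patient_id`, `encounter_id`, `event_type`, and `occurred_at`; a real HMS feed will have its own schema and more rules.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Hypothetical field names; adapt to the actual HMS event schema.
REQUIRED_FIELDS = ("patient_id", "encounter_id", "event_type", "occurred_at")


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event is usable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not event.get(name)]
    ts = event.get("occurred_at")
    if ts is not None:
        if not isinstance(ts, datetime):
            problems.append("occurred_at is not a datetime")
        elif ts.tzinfo is None or ts.utcoffset() is None:
            # Naive timestamps cannot be ordered reliably across systems.
            problems.append("occurred_at has no timezone")
        elif ts > datetime.now(timezone.utc):
            problems.append("occurred_at is in the future")
    return problems


good = {
    "patient_id": "p1",
    "encounter_id": "e9",
    "event_type": "lab_result",
    "occurred_at": datetime(2024, 5, 14, 6, 0, tzinfo=timezone.utc),
}
bad = {
    "patient_id": "p1",
    "event_type": "lab_result",
    "occurred_at": datetime(2024, 5, 14, 6, 0),  # naive timestamp, no encounter
}
```

Here `validate_event(good)` returns an empty list, while `validate_event(bad)` reports the missing encounter and the timezone-less timestamp, which are exactly the gaps that make fusion unreliable at inference time.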
So the real work of AI implementation here is Data Platforms work: integration with the HMS, an identity and timeline model that ties imaging, notes, and labs to the same patient-encounter, and Operational Automation that assembles the multimodal record reliably and on time. Get that right and a relatively standard fusion architecture performs well. Get it wrong and the most sophisticated foundation model in the world has nothing trustworthy to reason over.

The encoder problem behind the architecture
There is a subtler reason fusion is hard, and it sits below the early-versus-late choice. Each modality needs an encoder that turns raw input into a representation the fusion layer can use, and those encoders are not equally mature. Imaging encoders are strong because medical imaging has large, relatively standardized datasets. Clinical-text encoders are improving fast on the back of language models. Structured EHR data — irregular time series of labs, vitals, and events — is the awkward one: it is sparse, irregularly sampled, and full of implicit meaning, and a missing value can itself be informative (a test not ordered is a clinical signal). A multimodal model is only as strong as its weakest encoder, and in clinical settings that weak link is frequently the structured EHR stream, not the images. Teams that pour effort into the imaging branch and treat labs as a bag of numbers leave most of the available signal on the floor. Med-Gemini’s reported gains came in part from customized encoders adapted to novel data types such as electrocardiograms, which is the right instinct: invest in the encoder for the modality that is hardest to represent, not the one that is easiest.
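To show what treating labs as more than a bag of numbers might mean in practice, here is a hedged sketch of a simple featurization that keeps the missingness signal. For each lab in a panel it emits the latest value, a measured flag, and the hours since measurement. The function name, the tuple layout, and the 168-hour cap are illustrative assumptions; production encoders for irregular time series are considerably richer.

```python
from datetime import datetime
from typing import Dict, List, Sequence, Tuple

Observation = Tuple[str, float, datetime]  # (lab code, value, collection time)


def encode_labs(observations: Sequence[Observation], panel: Sequence[str],
                now: datetime, max_gap_hours: float = 168.0) -> List[float]:
    """Encode each panel lab as [latest value, measured flag, hours since]."""
    latest: Dict[str, Tuple[float, datetime]] = {}
    for code, value, taken_at in observations:
        # Keep only the most recent observation that is not in the future.
        if taken_at <= now and (code not in latest or taken_at > latest[code][1]):
            latest[code] = (value, taken_at)

    features: List[float] = []
    for code in panel:
        if code in latest:
            value, taken_at = latest[code]
            gap = min((now - taken_at).total_seconds() / 3600.0, max_gap_hours)
            features.extend([value, 1.0, gap])
        else:
            # Not ordered or not resulted: the flag carries that signal.
            features.extend([0.0, 0.0, max_gap_hours])
    return features


now = datetime(2024, 5, 14, 12, 0)
observations = [
    ("WBC", 14.2, datetime(2024, 5, 14, 6, 0)),
    ("WBC", 9.8, datetime(2024, 5, 13, 6, 0)),
]
features = encode_labs(observations, ["WBC", "LACTATE"], now)
```

The explicit measured flag is the design choice worth noting: it lets a downstream model learn that an unordered lactate is different from a normal one, instead of confusing the two through a zero-filled value.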
Deployment and governance
Suppose the fusion works and the backbone is solid. Putting a multimodal model into clinical use brings its own constraints, and skipping them is how good models cause harm.
Decision-support, with a human in the loop. A multimodal model surfaces findings, ranks differentials, drafts summaries. It does not diagnose autonomously, and it should not be built as if it could. The clinician owns the decision; the model’s job is to make sure relevant evidence is not missed and to compress the synthesis work. This is both the ethical position and, in most jurisdictions, the regulatory one.
Per-modality monitoring. A late-fusion design pays off in operations precisely because you can watch each component independently. When the chest-imaging model drifts after a scanner upgrade, you want to catch it at the component level, not as an unexplained dip in the combined output. Monitor the parts, not just the whole.
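Component-level monitoring can start very simply. The sketch below, which assumes you log each specialist model’s output scores, flags a component when the mean of its recent scores moves more than a chosen number of baseline standard deviations. This is a crude heuristic picked for brevity, not a recommended drift statistic, and the threshold of 3.0 is an arbitrary assumption.

```python
import statistics
from typing import Dict, List, Sequence


def drift_score(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Shift in mean score, in units of the baseline standard deviation."""
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    shift = abs(statistics.fmean(recent) - mu)
    if sigma == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / sigma


def drifting_components(baselines: Dict[str, Sequence[float]],
                        recents: Dict[str, Sequence[float]],
                        threshold: float = 3.0) -> List[str]:
    """Names of per-modality models whose recent scores moved past the threshold."""
    return [
        name for name in baselines
        if drift_score(baselines[name], recents[name]) > threshold
    ]


baselines = {
    "imaging": [0.20, 0.30, 0.25, 0.35, 0.30],
    "notes": [0.20, 0.30, 0.25, 0.35, 0.30],
}
recents = {
    "imaging": [0.60, 0.70, 0.65],  # e.g. after a scanner upgrade
    "notes": [0.28, 0.30, 0.26],
}
flagged = drifting_components(baselines, recents)
```

In this toy data only the imaging component is flagged, which is the behavior you want: the alert names the part that moved, not just the combined output.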
Provenance and explainability. When a model integrates a scan, a note, and a lab trend, a clinician must be able to ask which evidence drove the output. Per-modality attribution is not a nicety; it is what makes the recommendation auditable and trustworthy at the point of care.
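For a late-fusion design with a linear combiner over log-odds, per-modality attribution is nearly free: each modality’s contribution is its weight times its log-odds. The sketch below assumes exactly that combiner, with hypothetical weights; it does not apply to early-fusion or nonlinear models, which need dedicated attribution methods.

```python
import math
from typing import Dict, List, Optional, Tuple


def logit(p: float) -> float:
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))


def modality_contributions(probs: Dict[str, Optional[float]],
                           weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Signed contribution of each present modality, largest magnitude first.

    Positive values push the fused score up, negative values push it down.
    Missing modalities (None) contribute nothing and are omitted.
    """
    contributions = [
        (name, weights[name] * logit(p))
        for name, p in probs.items()
        if p is not None
    ]
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical weights and scores, for illustration only.
weights = {"imaging": 0.5, "notes": 0.3, "labs": 0.2}
ranked = modality_contributions(
    {"imaging": 0.9, "notes": 0.7, "labs": 0.4}, weights
)
```

In this example the imaging score dominates and the lab score pulls slightly the other way, which is the kind of breakdown a clinician can check against the chart.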
Validation under real conditions and across populations. Performance has to be measured with realistic missingness, on the populations and sites where the model will run. A model validated only on complete, single-site, curated data tells you little about how it behaves on a Tuesday-night admission with stale labs and an outside scan.
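Measuring under realistic missingness can be as simple as refusing to report one pooled number. The sketch below, using made-up cases and a hypothetical tuple layout, reports accuracy separately for each pattern of available modalities; the same grouping works for any metric and can be extended to site or population.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Case = Tuple[FrozenSet[str], int, int]  # (modalities present, predicted, actual)


def accuracy_by_pattern(cases: Iterable[Case]) -> Dict[FrozenSet[str], float]:
    """Accuracy computed separately for each set of available modalities."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [correct, total]
    for present, predicted, actual in cases:
        counts[present][0] += int(predicted == actual)
        counts[present][1] += 1
    return {pattern: correct / total for pattern, (correct, total) in counts.items()}


complete = frozenset({"imaging", "notes", "labs"})
no_notes = frozenset({"imaging", "labs"})

# Toy cases, invented for illustration.
cases = [
    (complete, 1, 1),
    (complete, 0, 0),
    (no_notes, 1, 0),
    (no_notes, 1, 1),
]
report = accuracy_by_pattern(cases)
```

A report broken out this way makes the complete-case subset visible as one stratum among several, so it cannot quietly stand in for overall performance.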
The honest position
Multimodal clinical AI is the right direction. Reasoning across scans, notes, and labs together matches how medicine is actually practiced, and the foundation-model results show the approach has real capability. But the framing in the marketplace inverts the difficulty. The model is increasingly the easy part — capable, often available, well-benchmarked. The hard part is everything around it: fusing misaligned modalities, surviving the missing-modality reality of real care, aligning everything to the patient and the moment, and standing it all on a Hospital Management System clean enough to feed it.
For health systems and the teams building for them, the lesson is to invest where the constraint actually binds. Spend on the data backbone, the integration, and the governance — the unglamorous Data Platforms and Operational Automation layer — and the multimodal model becomes a tractable, high-value tool. Skip that layer and you have a leaderboard score with nowhere to land.