10,000 Critical Flaws and Counting: What Project Glasswing's Expansion Means for Critical-Infrastructure Operators
Anthropic expanded Project Glasswing to 150 organizations across power, water, and healthcare. The hard part of AI security review is the data pipeline.
Around June 2, 2026, Anthropic expanded Project Glasswing to roughly 150 new organizations across more than fifteen countries, opening Claude Mythos Preview to sectors it had largely skipped in the first cohort: power, water, healthcare, communications, and hardware. The headline number from the existing partners is the one worth sitting with. Collectively, they have already found more than 10,000 high- or critical-severity security flaws using the model.
Ten thousand critical flaws is the kind of figure that sounds like a win and reads like a warning, depending on which side of the patch you sit on. For anyone running infrastructure that a nation depends on, both readings are correct, and the gap between them is exactly where the engineering work lives.
What Glasswing actually does
Strip away the program branding and Glasswing is a controlled distribution of a vulnerability-discovery capability to organizations that can put it to defensive use. Partners do not just run Mythos as a one-shot scanner. They use it to write patches for flaws it surfaces, to vet software for defects before release, for penetration testing, for threat detection, and even to translate code into memory-safe languages. Anthropic has paired this with Claude Security, a product that points its frontier models at codebases to scan and suggest fixes.
The sectors added in this round are not arbitrary. Power, water, healthcare, communications, and hardware are the load-bearing systems of a functioning society. Anthropic’s own framing is blunt: for most of these partners, a major attack could affect more than 100 million people. That is the difference between a security program and a security program that matters at the level of national resilience.

AI for defense, finally ahead of AI for offense (for now)
Most of the public anxiety about AI and security has run the other way: models that help attackers write exploits faster than defenders can keep up. Glasswing is the counterweight. A model that can read a large codebase and reliably surface high-severity flaws is a defensive multiplier, and 10,000 critical findings across the partner base is evidence that the multiplier is real, not a demo.
But the same capability resets where the bottleneck sits. When discovery was the hard part, finding the flaw was the win. When a Mythos-class model can surface flaws faster than any team has historically been able to, the constraint moves downstream — to verifying, disclosing, and patching what the model found. Anthropic says this plainly, and it is the most operationally important sentence in the whole announcement. A pile of 10,000 unverified findings is not security. It is a backlog with a deadline that attackers also get to read.
There is a sharper edge here that operators should sit with. The same generation of model that Glasswing partners point at their own code is, in principle, available to the people attacking them. The defensive lead Anthropic has built is a head start, not a moat. The “for now” in the framing is load-bearing. Whatever asymmetry exists today — controlled access, partner vetting, a discovery-to-patch pipeline only defenders have built — is the window. Operators who treat this as a one-time scan that produced a tidy list have misread the moment. The advantage belongs to whoever can run this class of review continuously and close findings faster than an adversary running the same class of model can weaponize them.
Critical infrastructure includes the systems you do not think of as critical
When operators hear “critical infrastructure” they picture grids and pipelines. The category is wider than that in practice, and healthcare being named in this round is the tell. A Hospital Management System is critical infrastructure in every sense that matters: it holds patient records, it gates clinical workflows, and an outage or breach maps directly to harm. The same is increasingly true of a School ERP that holds the records of every minor in a district. These are not back-office conveniences. They are systems whose failure has a body count or a privacy catastrophe attached.
This is where the legacy ERP vendors in those markets stop being a mere point of comparison and become a liability. Their software is old, monolithic, and, critically, opaque. You cannot run a Mythos-class review against a codebase you cannot read, instrument, or rebuild. The data and the logic are trapped behind a vendor who ships on a multi-year cadence. A modern, data-centric Hospital Management System built on systems you actually control is not just easier to operate. It is the precondition for applying this generation of AI security tooling to it at all.
Put the two pictures together and the strategic case for a modern stack stops being about features and starts being about exposure. If a hospital’s clinical system is a black box you license, your security posture is entirely a function of how fast that vendor patches — and you have already seen how slowly that tends to go. If it is software you own, instrument, and can pipe through a model-driven review, you set your own patch cadence. The same logic holds for a School ERP, for a payments switch, for any system where the failure mode is regulatory rather than merely annoying. The lesson Glasswing teaches indirectly is that owning your critical software is now a security control, not just an operational preference.
The unglamorous part: the data pipeline behind model-driven security review
Here is the thing the announcement headline obscures. Running a model-driven security program in production is mostly data engineering with a model on top, exactly like every other serious AI implementation. The model surfaces candidate flaws. Everything that makes those findings actionable is pipeline.
Consider what it takes to operate this responsibly. Every finding needs to be deduplicated against what you already know, enriched with the affected component and blast radius, scored, routed to an owner, and tracked from disclosure through patch and verification. That is an ingestion-and-transform problem before it is anything else. The operational engine we reach for here is the same one we use for any high-volume event domain: ClickHouse to store findings and the full trace of every model run, Airflow to orchestrate the triage pipeline, dbt to keep the scoring and dedup logic versioned and tested. Boring, load-bearing, and the reason a 10,000-finding firehose becomes a managed queue instead of a panic.
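The triage steps above (dedup, enrich, score, route, track) can be sketched in a few lines. This is a minimal in-memory illustration, not how any Glasswing partner actually does it: the `Finding` fields, the fingerprint rule, the `OWNERS` and `BLAST_RADIUS` tables, and the scoring formula are all hypothetical stand-ins for logic that would live in versioned models and warehouse tables in a real deployment.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One candidate flaw as reported by the model (hypothetical schema)."""
    component: str   # affected service or package
    file_path: str
    cwe: str         # weakness class, e.g. "CWE-89"
    severity: str    # "critical", "high", "medium", "low"
    summary: str


def fingerprint(f: Finding) -> str:
    """Stable dedup key: same component, file, and weakness class."""
    raw = f"{f.component}|{f.file_path}|{f.cwe}".lower()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


# Hypothetical enrichment tables; in practice these would be maintained data.
OWNERS = {"billing-api": "payments-team", "patient-portal": "clinical-apps"}
BLAST_RADIUS = {"billing-api": 3, "patient-portal": 5}
SEVERITY_WEIGHT = {"critical": 10, "high": 7, "medium": 4, "low": 1}


def triage(findings, known_fingerprints):
    """Drop duplicates, enrich, score, and route; return highest score first."""
    known = set(known_fingerprints)
    queue = []
    for f in findings:
        fp = fingerprint(f)
        if fp in known:
            continue  # already tracked, or repeated within this batch
        known.add(fp)
        score = SEVERITY_WEIGHT.get(f.severity, 0) * BLAST_RADIUS.get(f.component, 1)
        queue.append({
            "fingerprint": fp,
            "component": f.component,
            "owner": OWNERS.get(f.component, "security-triage"),
            "score": score,
            "status": "needs_verification",  # first state in the tracked lifecycle
        })
    return sorted(queue, key=lambda row: row["score"], reverse=True)
```

The design choice worth noting is that the fingerprint deliberately ignores the model's free-text summary, so two runs that describe the same flaw in different words still collapse into one queue entry.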

Evals are not optional when the model is the scanner
If a model is finding your vulnerabilities, you have to know how good it is at it — continuously, not once. A security review model that quietly regresses after an upgrade and starts missing a class of flaw is worse than no scanner, because it manufactures false confidence. So the evals, observability, and cost-tracking discipline we treat as non-negotiable everywhere becomes existential here. You need a labelled corpus of known flaws to measure recall against, you need to watch false-positive rates so triage does not drown, and you need to track cost per codebase scanned so the program survives a budget review. None of this is exotic. All of it is the difference between a security capability and a science project.
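As a rough sketch of those three measurements, the function below scores one scan run against a labelled corpus. Everything in it is an assumption made for illustration: the function name, the flaw IDs and classes, and in particular the premise that the corpus is exhaustively labelled, so any reported finding outside it counts as a false positive. "False-positive share" here means the fraction of reported findings that are not real (one minus precision).

```python
from collections import defaultdict


def evaluate_run(labelled, reported, cost_usd, codebases_scanned):
    """Score one scan run.

    labelled: dict mapping known flaw id -> flaw class (the labelled corpus)
    reported: set of flaw ids the model flagged on that corpus
    """
    labelled_ids = set(labelled)
    hits = reported & labelled_ids
    false_positives = reported - labelled_ids

    recall = len(hits) / len(labelled_ids) if labelled_ids else 0.0
    fp_share = len(false_positives) / len(reported) if reported else 0.0

    # Per-class recall exposes a model that silently stops finding one class.
    totals, found = defaultdict(int), defaultdict(int)
    for flaw_id, flaw_class in labelled.items():
        totals[flaw_class] += 1
        if flaw_id in hits:
            found[flaw_class] += 1
    recall_by_class = {c: found[c] / n for c, n in totals.items()}

    return {
        "recall": recall,
        "false_positive_share": fp_share,
        "recall_by_class": recall_by_class,
        "cost_per_codebase_usd": cost_usd / codebases_scanned,
    }
```

Comparing `recall_by_class` between the current and the candidate model version before an upgrade is one way to catch the quiet regression described above; an aggregate recall number alone can hide a class that dropped to zero.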
The false-positive number deserves its own paragraph because it is where most model-driven security programs quietly die. A scanner that surfaces 10,000 findings is only useful if a high fraction of them are real; if a third are noise, you have not bought security, you have bought a denial-of-service attack on your own engineers. Triage capacity is finite and expensive, and human reviewers burn out fast on a queue that cries wolf. This is why precision has to be a tracked, dashboarded metric with a threshold that triggers a rollback, not a vibe. The teams that succeed with Mythos-class tooling are the ones who treat the model’s output as a noisy signal to be measured and calibrated, not a verdict to be trusted — and who built the eval harness to do that measuring before they pointed the model at anything that mattered.
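One way to turn "a threshold that triggers a rollback" into something concrete is a rolling precision gate fed by human triage verdicts. The sketch below is illustrative only: the class name, the 0.8 threshold, the window size, and the minimum sample count are hypothetical defaults, not values from Anthropic or any partner.

```python
from collections import deque


class PrecisionGate:
    """Track precision over the most recent triage verdicts."""

    def __init__(self, threshold=0.8, window=200, min_samples=50):
        self.threshold = threshold
        self.min_samples = min_samples
        self.verdicts = deque(maxlen=window)  # oldest verdicts fall off

    def record(self, is_real: bool) -> None:
        """Store one reviewer verdict: True if the finding was a real flaw."""
        self.verdicts.append(is_real)

    def precision(self):
        """Fraction of recent findings confirmed real, or None with no data."""
        if not self.verdicts:
            return None
        return sum(self.verdicts) / len(self.verdicts)

    def should_roll_back(self) -> bool:
        """True once enough verdicts exist and precision is below threshold."""
        if len(self.verdicts) < self.min_samples:
            return False  # too few verdicts to judge
        return self.precision() < self.threshold
```

The `min_samples` guard is there so a handful of early false positives does not trigger a rollback before the estimate means anything; the bounded window lets the gate recover once a fixed model starts producing real findings again.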
What operators should take from this
If you run critical infrastructure — and a hospital or a district school system counts — Glasswing’s expansion is a preview of the bar you will be held to. Model-driven vulnerability discovery is moving from frontier-lab privilege to operating expectation. The teams that benefit will not be the ones with the cleverest prompt. They will be the ones who built the data platform to verify, route, and patch what the model surfaces, and who instrumented the whole thing well enough to trust it. The flaws are already in your code. The only open question is whether you find them before someone else runs the same class of model against you.
The primary coverage is worth reading: Cybersecurity Dive, SiliconANGLE, and TechCrunch each add useful detail on the sectors and the patching bottleneck.
A model that finds 10,000 flaws is only useful if your pipeline can verify and patch them. We build that pipeline. Talk to us about model-driven security operations.