Formal Verification for AI Systems

On June 17, 2026, TechCrunch reported that Pramaana Labs raised a $27M seed round led by Khosla Ventures to bring formal verification to AI. The pitch, from co-founder and CEO Ranjan Rajagopalan, is that the world’s hardest problems are not unsolvable; they are unformalized. The company pairs an LLM with deterministic verification built on Lean, the proof assistant best known for machine-checking mathematics, and aims it at law, tax, and drug discovery. As Rajagopalan put it: “Once you have a codified version of it, the reasoning on top of it starts becoming deterministic.”

That sentence is the whole debate in one line. So before the term “formal verification for AI” gets flattened into a marketing checkbox, it is worth a clear-eyed engineering read: what can actually be verified about an LLM system, what cannot, and what you should build instead of waiting for proofs that may never come.

What “formal verification” actually means

Formal verification is not testing. Testing samples behavior; verification proves it. In classical software you write a specification, a precise statement of what must always hold, and then a tool mechanically checks that the implementation satisfies it for every possible input, not just the ones you happened to try. Model checkers explore state spaces exhaustively. SMT solvers discharge logical constraints. Proof assistants like Lean let you build a machine-checked argument that a property holds, full stop.
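To make the sampling-versus-proving gap concrete, here is a minimal sketch in Python. The function `saturating_add_u8` and its planted corner-case bug are hypothetical, invented for illustration. Because the input domain is tiny (256 × 256 pairs), brute-force enumeration can stand in for what a model checker does; real tools reason symbolically instead of enumerating.

```python
import random


def saturating_add_u8(a: int, b: int) -> int:
    """Intended spec: return min(a + b, 255) for 8-bit unsigned inputs."""
    # Hypothetical planted bug: one input pair wraps around instead of saturating.
    if a == 200 and b == 100:
        return (a + b) % 256
    return min(a + b, 255)


def spec_holds(a: int, b: int) -> bool:
    """The specification, stated as a checkable property of one input pair."""
    return saturating_add_u8(a, b) == min(a + b, 255)


def sampled_test(trials: int, seed: int) -> bool:
    """Testing: check the property on randomly sampled inputs only."""
    rng = random.Random(seed)
    return all(
        spec_holds(rng.randrange(256), rng.randrange(256)) for _ in range(trials)
    )


def exhaustive_check() -> list[tuple[int, int]]:
    """Verification by enumeration: check every input, return all counterexamples."""
    return [(a, b) for a in range(256) for b in range(256) if not spec_holds(a, b)]
```

A run of `sampled_test` with a hundred trials has well under a 1% chance of hitting the one bad pair, so it will usually report success. `exhaustive_check()` cannot miss it, because it covers all 65,536 inputs.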

This works beautifully for things that are deterministic and well-specified: avionics, cryptographic protocols, chip design, distributed consensus. These are domains where “usually correct” is a synonym for “broken,” and where the rules are themselves formalizable. Pramaana’s insight is shrewd: tax codes, clinical guidelines, and statutes are already rule systems. France’s Catala project has formalized parts of French tax and benefit law as executable logic. If you can codify the domain, you can check an answer against it deterministically, and that check is real verification, independent of how the answer was generated.

The probabilistic problem

Here is the wall. A large language model defines a probability distribution over next tokens, and its output is sampled from that distribution. Keep the prompt and the weights fixed, change the sampling seed, and you can get a different output. There is no compact specification of “what GPT-class model M does” to feed a solver, and even if there were, the state space is astronomically large and the function is not the kind of thing proof tools were built to reason about. You cannot, today, formally verify that a model will never produce a harmful, wrong, or out-of-policy output. Anyone selling you that is selling you the determinism the model does not have.
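A toy illustration of that seed dependence, using only Python's standard library. The three-token vocabulary and its probabilities are made up; a real model recomputes the distribution after every token, which this sketch deliberately ignores.

```python
import random

# Hypothetical next-token distribution for one fixed prompt and fixed weights.
NEXT_TOKEN_PROBS = {"approve": 0.6, "deny": 0.3, "escalate": 0.1}


def sample_completion(seed: int, length: int = 5) -> list[str]:
    """Draw `length` tokens from the fixed distribution using the given seed."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=length)
```

Calling `sample_completion` twice with the same seed returns the same tokens. Calling it across a range of seeds returns a spread of different sequences, even though the "prompt" and "weights" never changed.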

So when a headline says “formal verification for AI,” read it carefully. What Pramaana is verifying is not the model. It is the answer, against a formalized domain, after the model produces it. The LLM is a fast, fuzzy proposal engine; the formal layer is a deterministic judge. That distinction, verifying the model versus verifying the system around the model, is the single most useful frame an engineering team can carry into an AI implementation.

Verify the system, not the model

Once you stop trying to prove the model and start hardening the system around it, the problem becomes tractable and, frankly, familiar. The model is one probabilistic component inside an otherwise ordinary piece of software. Everything around it is scaffolding you fully control, and most of that scaffolding can be verified or constrained with techniques that already exist.

Consider what actually surrounds an LLM in production:

Output contracts. Force the model to emit structured output against a strict schema. Then validate it deterministically — types, ranges, enums, referential integrity. A response that does not parse never reaches a user. This is not verification of the model; it is a verified boundary the model’s output must cross.
Tool calls. When the model invokes a function, the arguments are checked, authorized, and bounded before execution. The model can request a database write; it cannot perform one outside the permissions, rate limits, and validation you wrote.
Retrieval. Ground answers in a known corpus and require citations. Now the claim “the model said X” becomes “the model said X, sourced from document Y,” and Y is auditable.
Bounded failure modes. Define what the system does when confidence is low or validation fails: deflect, escalate, return a safe default. The failure is designed, not emergent.
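As a rough sketch of the first and last items together, here is an output contract with a designed fallback, written against the Python standard library only. The `Decision` fields, the allowed actions, the amount cap, and the document IDs are all hypothetical; a production system would more likely use a schema library than hand-written checks.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"refund", "replace", "escalate"}  # hypothetical enum


@dataclass(frozen=True)
class Decision:
    action: str
    amount_cents: int
    source_doc_id: str


# Designed failure mode: anything that fails validation becomes an escalation.
SAFE_DEFAULT = Decision(action="escalate", amount_cents=0, source_doc_id="")


def parse_decision(raw: str, known_doc_ids: set[str], max_cents: int = 50_000) -> Decision:
    """Validate raw model output against the contract; fall back on any failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return SAFE_DEFAULT
    if not isinstance(data, dict) or set(data) != {"action", "amount_cents", "source_doc_id"}:
        return SAFE_DEFAULT
    action, amount, doc_id = data["action"], data["amount_cents"], data["source_doc_id"]
    # Enum check.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return SAFE_DEFAULT
    # Type and range check (bool is a subclass of int in Python, so reject it explicitly).
    if isinstance(amount, bool) or not isinstance(amount, int) or not 0 <= amount <= max_cents:
        return SAFE_DEFAULT
    # Referential integrity: the cited document must exist in the known corpus.
    if not isinstance(doc_id, str) or doc_id not in known_doc_ids:
        return SAFE_DEFAULT
    return Decision(action, amount, doc_id)
```

Whatever the model emits, the caller only ever receives a well-formed `Decision`, and the worst case is an escalation to a human.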

None of that proves the model is correct. All of it proves the system cannot take certain actions, cannot return malformed data, and cannot fail in unbounded ways. For most teams, that is the verification that actually matters — and you can have it now, without a proof assistant.
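The "cannot take certain actions" guarantee comes from a gate that sits in front of every tool call. The sketch below is one possible shape for it, assuming a hypothetical `ToolGate` class with a role-to-tool permission map, a per-session call budget, and one argument validator per tool; none of these names come from a real framework.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a requested tool call fails a pre-execution check."""


@dataclass
class ToolGate:
    # role -> names of the tools that role may call (hypothetical policy shape)
    permissions: dict[str, set[str]]
    # tool name -> (argument validator, implementation)
    registry: dict[str, tuple[Callable[[dict], bool], Callable[..., Any]]]
    max_calls: int
    calls_made: int = 0

    def execute(self, role: str, tool: str, args: dict) -> Any:
        """Run the tool only if it is registered, permitted, within budget, and valid."""
        if tool not in self.registry or tool not in self.permissions.get(role, set()):
            raise ToolCallRejected(f"{role!r} may not call {tool!r}")
        if self.calls_made >= self.max_calls:
            raise ToolCallRejected("call budget exhausted")
        validate, run = self.registry[tool]
        if not validate(args):
            raise ToolCallRejected(f"invalid arguments for {tool!r}")
        self.calls_made += 1
        return run(**args)
```

The model only ever produces a request of the form (tool, args). Whether that request runs is decided entirely by code you wrote and can test.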

Where formal methods help, and where they’re overkill

Be opinionated about this, because the cost asymmetry is enormous. Formal methods are expensive: they demand a formalized spec, specialist skills, and tooling that resists casual iteration. Spend that budget where being wrong is catastrophic and the domain is genuinely codifiable.

Formal verification earns its keep on the deterministic spine of a high-stakes system: a tax calculation, a drug-interaction rule, a benefits-eligibility decision, an access-control policy. Codify the rules, check the model’s proposed answer against them, and reject anything that fails. That is exactly Pramaana’s bet, and in regulated verticals it is the right one.
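In miniature, "codify the rules, check the proposal, reject on failure" might look like the sketch below. The brackets and rates are invented for illustration and do not correspond to any real tax code; amounts are integer cents so the check is exact, and rounding down to the cent is an assumption. A plain Python function is also only an executable rule, not a machine-checked proof; a Lean or Catala formalization would add guarantees about the rule itself.

```python
# Hypothetical progressive brackets: (upper bound in cents or None, rate in percent).
BRACKETS = [(1_000_000, 0), (4_000_000, 10), (None, 30)]


def tax_due_cents(income_cents: int) -> int:
    """Deterministic reference computation over the codified brackets."""
    if income_cents < 0:
        raise ValueError("income must be non-negative")
    total_hundredths = 0  # accumulates cents * percent
    lower = 0
    for upper, rate_percent in BRACKETS:
        top = income_cents if upper is None else min(income_cents, upper)
        if top > lower:
            total_hundredths += (top - lower) * rate_percent
        if upper is None or income_cents <= upper:
            break
        lower = upper
    return total_hundredths // 100  # assumption: round down to the cent


def accept_proposal(income_cents: int, proposed_tax_cents: int) -> bool:
    """Judge a model-proposed answer: accept only an exact match with the rules."""
    return proposed_tax_cents == tax_due_cents(income_cents)
```

For an income of 50,000.00, the rules give 10% on the 30,000.00 in the second bracket plus 30% on the 10,000.00 above it, so only a proposal of exactly 6,000.00 is accepted. How the model arrived at its number never enters the check.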

Formal verification is overkill for the open-ended parts: summarizing a document, drafting an email, answering a general question. There is no spec for “good summary.” Reaching for a proof assistant here is a category error. What you want instead is a strong eval harness: a curated set of cases, graded automatically and by humans, run on every change, tracking regression over time. Evals are to LLM systems what unit tests are to ordinary code. They are not proof, but they are the empirical discipline that keeps quality from quietly drifting.
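A bare-bones version of such a harness, covering only the automatic-grading half, could look like this. `EvalCase`, `run_evals`, and `regressions` are hypothetical names, and `system` stands for any callable that maps a prompt to the pipeline's final output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    grade: Callable[[str], bool]  # automatic grader: True means the output passes


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the system and record pass/fail per case ID."""
    return {case.case_id: case.grade(system(case.prompt)) for case in cases}


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0
```

Storing each run's results next to the commit or prompt version that produced them is what turns this from a one-off check into regression tracking.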

The mature stance holds both at once. Verify the codifiable core deterministically. Eval the fuzzy edges continuously. Wrap the whole thing in contracts so failures are bounded. Treat “formal verification” and “evals” as different tools for different layers, not competing religions.

Plumbing-first AI

This is the part teams skip, and it is the part that decides whether AI survives contact with production. The demo is the model. The product is the plumbing.

You cannot ship an LLM into a hospital management system on vibes. A clinician needs the system to refuse confidently when it should, to cite where a recommendation came from, to never silently corrupt a record, and to leave an audit trail when something goes wrong. None of that comes from a better prompt. It comes from contracts, evals, observability, and bounded failure modes: the unglamorous layer that turns a probabilistic component into a dependable one. The same is true of a school ERP touching minors’ data: the model is the easy part; the controls around it are the product.

Plumbing-first means three things are non-negotiable from day one, not bolted on after launch:

Evals. A versioned test suite of real cases, graded on every change, so you know whether today’s tweak made the system better or worse — empirically, not anecdotally.
Observability. Every model call traced: inputs, outputs, latency, the retrieved context, the tool calls, the validation result. When something misbehaves at 2 a.m., you can reconstruct exactly what happened. We build this on a ClickHouse + Airflow + dbt operational engine — events land in ClickHouse, Airflow orchestrates the pipelines, dbt models the eval and cost tables — so reliability questions get answered against data, not guesses.
Cost tracking. Per-request token and dollar accounting, attributed to features and customers. An LLM feature with no cost telemetry is a budget incident waiting to happen.
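For the observability and cost items, the core is one structured record per model call plus a roll-up. The sketch below assumes a hypothetical `CallTrace` shape and made-up per-token prices; it serializes each trace to a JSON line that an event store could ingest and does not use any ClickHouse, Airflow, or dbt API.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


@dataclass(frozen=True)
class CallTrace:
    request_id: str
    feature: str
    customer_id: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    validation_passed: bool


def cost_usd(trace: CallTrace) -> float:
    """Dollar cost of one call under the assumed price table."""
    return (
        trace.input_tokens * PRICE_PER_MTOK["input"]
        + trace.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def to_event(trace: CallTrace) -> str:
    """Serialize the trace, with its cost attached, as one JSON line."""
    return json.dumps({**asdict(trace), "cost_usd": cost_usd(trace)})


def cost_by_feature(traces: list[CallTrace]) -> dict[str, float]:
    """Attribute spend to features; the same roll-up works per customer."""
    totals: dict[str, float] = defaultdict(float)
    for trace in traces:
        totals[trace.feature] += cost_usd(trace)
    return dict(totals)
```

In practice the roll-up would be a query over the stored events instead of an in-memory loop, but the record shape is the part worth settling on day one.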

This is the same discipline that makes any data platform trustworthy, applied to a new kind of component. The output contracts, the schema validation, the deflect-on-low-confidence logic: that is just good operational automation around a probabilistic core. Engineers who already build reliable distributed systems have most of the instincts; the LLM does not excuse you from them, it raises the stakes.

The honest read

Formal verification coming to AI is real and worth taking seriously — for the codifiable, high-stakes spine of regulated systems, where deterministic proof against a formalized domain is a genuine unlock. It will not make models deterministic, and it will not verify the open-ended behavior that makes LLMs useful in the first place. Treat the Pramaana news as a signal of where the industry is heading: away from “trust the model” and toward “constrain and verify the system.”

For the team shipping next quarter, the practical takeaway is unglamorous and immediate. You probably do not need a proof assistant yet. You absolutely need contracts, evals, observability, cost tracking, and bounded failure modes. Build the plumbing first. The verification frontier will meet you there.

Shipping an LLM into a system where being wrong has a cost? That is a plumbing problem before it is a model problem — and it is exactly what we build. Let’s talk.

