Engineering an LLM Pipeline for Fraud and Waste Detection in Audit Reports

HHS is running ChatGPT over all 50 states' audit reports to flag fraud and waste. Here's how to actually build that pipeline — RAG, schema-bound extraction, citation grounding, and why a flag is a lead, not a verdict.

In May 2026, the US Department of Health and Human Services announced it would run ChatGPT and other large language models over the annual audit reports of all 50 states, on a rolling basis, to flag fraud, waste, and abuse in federal health spending. The program, AERO (the Audit Enforcement and Risk Oversight initiative), is led by Assistant Secretary Gustav Chiarello and has already put every governor and treasurer on notice. It is one of the largest live deployments of LLMs against unstructured government documents to date, and it forces a question every data team now faces: how do you engineer an LLM pipeline that reads dense audit reports and surfaces real anomalies without drowning investigators in false positives or, worse, manufacturing accusations out of a hallucination?

This is a build guide, and it is honest about the limits. The single most important design principle up front: an LLM flag is a lead for a human investigator, never a verdict. Everything below exists to make that lead high-quality, traceable, and reviewable.

The shape of the problem

Single Audit reports, the artifacts HHS is ingesting, are long, semi-structured PDFs: financial statements, schedules of expenditures of federal awards (SEFA), auditor findings, corrective action plans. Across 50 states and five-plus years of history, that is at least 250 statewide reports (50 states × 5 years), each running to hundreds of pages, and more if agency-level audits are in scope. Chiarello has put the prize at an estimated $100–200 billion in annual wasteful or fraudulent spending. The engineering job is to turn that pile into a ranked, evidence-backed queue of leads.

Two failure modes dominate, and they pull in opposite directions:

  • Missing real fraud (false negatives) — the program fails its purpose.
  • Crying wolf (false positives) — investigators lose trust and stop acting on flags, which also fails the program.

Every design decision below is, ultimately, a knob on that tradeoff.

Stage 1: Retrieval over long documents (RAG done carefully)

You cannot stuff a 400-page audit into a prompt and expect reliable reasoning across all of it. Even with long context windows, accuracy degrades and cost explodes. The standard answer is retrieval-augmented generation (RAG): chunk the documents, embed them, and retrieve only the passages relevant to a given question.

For audit documents specifically, naive chunking is a trap. A finding’s severity lives in one section, its dollar amount in a schedule, and its resolution status in a corrective action plan pages away. Things that help:

  • Structure-aware chunking. Split on the document’s real boundaries — finding numbers, award schedules, sections — not on a fixed token count that slices a finding in half.
  • Metadata on every chunk. Tag each chunk with state, fiscal year, program (Assistance Listing number, formerly called the CFDA number), section type, and, for findings, finding type. Most of your highest-value queries are filters (state = X AND finding_type = material_weakness), and metadata filtering is cheaper and more precise than semantic search alone.
  • Hybrid retrieval. Combine dense embeddings with keyword/BM25 search. Audit language is full of exact terms — “questioned costs,” “material weakness,” specific program names — that semantic search alone can miss.
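
The first two points can be sketched in a few lines. This is a minimal illustration, not a production parser: it assumes the text has already been extracted from the PDF and that each finding begins with a heading such as "Finding 2025-001" at the start of a line. The `Chunk` type, `chunk_findings`, and `filter_chunks` are hypothetical names, and real reports will need more heading patterns than this one.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: findings start with a line like "Finding 2025-001".
FINDING_HEADING = re.compile(r"^Finding\s+(\d{4}-\d{3})\b", re.MULTILINE)


@dataclass
class Chunk:
    text: str
    state: str
    fiscal_year: int
    section_type: str
    finding_id: Optional[str] = None


def chunk_findings(section_text: str, state: str, fiscal_year: int) -> List[Chunk]:
    """Split a findings section so each chunk holds exactly one whole finding."""
    matches = list(FINDING_HEADING.finditer(section_text))
    chunks = []
    for i, match in enumerate(matches):
        # A finding runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(section_text)
        chunks.append(Chunk(
            text=section_text[match.start():end].strip(),
            state=state,
            fiscal_year=fiscal_year,
            section_type="finding",
            finding_id=match.group(1),
        ))
    return chunks


def filter_chunks(chunks: List[Chunk], **criteria) -> List[Chunk]:
    """Exact-match metadata filter, applied before any semantic search."""
    return [c for c in chunks
            if all(getattr(c, key) == value for key, value in criteria.items())]
```

Filtering on metadata first shrinks the candidate set that the more expensive embedding search has to cover.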

The goal of this stage is not answers. It is to put the right evidence in front of the model so that the next stage has something real to extract from.
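
One common way to merge the dense and keyword result lists from hybrid retrieval is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of chunk IDs. The function name is illustrative, and `k = 60` is a conventional default for RRF, not a value tuned for audit text.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains, so a
    chunk ranked well by both retrievers beats one ranked well by only one.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c1", "c2", "c3"]    # from embedding search
keyword_hits = ["c3", "c1", "c4"]  # from BM25 / keyword search
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

RRF works on ranks alone, so it sidesteps the problem that BM25 scores and cosine similarities are not on comparable scales.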

Stage 2: Structured extraction with a schema, not prose

A common mistake is asking the LLM “is there fraud here?” and getting back a paragraph. Paragraphs are not auditable, not aggregable, and not comparable across 50 states. Instead, force the model to populate a strict schema for each candidate finding. Use the provider’s structured-output / JSON-schema mode so the shape is enforced. A template of the target record:

{
  "finding_id": "string",
  "state": "string",
  "fiscal_year": 2025,
  "program": "string",
  "anomaly_type": "questioned_costs | repeat_finding | delinquent_audit | material_weakness | other",
  "amount_usd": 0,
  "severity": "low | medium | high",
  "rationale": "string",
  "source_citations": ["doc_id:page:span"],
  "confidence": 0.0
}

Schema-bound extraction does three things at once. It makes outputs machine-comparable (you can now rank, filter, and aggregate across states). It constrains the model to the questions you actually care about instead of free-associating. And it gives you a natural place to demand the single most important field: source_citations.
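
Even with a structured-output mode, it is worth validating each record before it goes downstream. The sketch below checks the template above using only the standard library. The `validate_finding` helper is hypothetical, and the rules that `confidence` lies in [0, 1] and that at least one citation is present are assumptions this example adds. They are not something the provider enforces.

```python
import json

ANOMALY_TYPES = {"questioned_costs", "repeat_finding", "delinquent_audit",
                 "material_weakness", "other"}
SEVERITIES = {"low", "medium", "high"}
FIELD_TYPES = {
    "finding_id": str, "state": str, "fiscal_year": int, "program": str,
    "anomaly_type": str, "amount_usd": (int, float), "severity": str,
    "rationale": str, "source_citations": list, "confidence": (int, float),
}


def validate_finding(raw_json: str) -> dict:
    """Parse one model output and raise ValueError if it breaks the schema."""
    try:
        finding = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(finding, dict):
        raise ValueError("expected a JSON object")

    errors = []
    for field, expected in FIELD_TYPES.items():
        if field not in finding:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(finding[field], bool) or not isinstance(finding[field], expected):
            errors.append(f"wrong type for {field}")

    if not errors:
        if finding["anomaly_type"] not in ANOMALY_TYPES:
            errors.append("unknown anomaly_type")
        if finding["severity"] not in SEVERITIES:
            errors.append("unknown severity")
        if not 0.0 <= finding["confidence"] <= 1.0:  # assumed range
            errors.append("confidence outside [0, 1]")
        if not finding["source_citations"]:          # assumed: at least one
            errors.append("no source_citations")

    if errors:
        raise ValueError("; ".join(errors))
    return finding
```

Records that fail validation are better retried or logged than silently repaired, since a repaired record no longer reflects what the model actually produced.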

Grounding: every flag points to a source span

A flag the model can’t tie to a specific page and span is not evidence. It is an assertion, and in a fraud context an unsourced assertion is a liability. Require every structured claim to carry a citation together with the verbatim text it relies on, and then verify it programmatically: check that the quoted text actually appears in the cited chunk before the finding is allowed into the queue. If the citation doesn’t resolve, drop or down-rank the finding regardless of how confident the model sounded. This single check is the cheapest, highest-leverage defense against hallucinated accusations. It converts “the model thinks” into “the model points here, and here is the page.” A resolving citation proves only that the text exists, not that it supports the claim; that judgment stays with the human reviewer.
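
A minimal sketch of that check follows. It assumes each citation is a dict with a `chunk_id` and a verbatim `quote`, which is an illustrative shape rather than a fixed format, and that retrieved chunks are available in a `chunks_by_id` mapping. Whitespace and case are normalized because PDF extraction rarely preserves line breaks exactly.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_resolves(quote: str, chunk_text: str) -> bool:
    """True only if the non-empty quote appears verbatim in the chunk."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(chunk_text)


def verify_citations(citations: List[dict], chunks_by_id: Dict[str, str]) -> bool:
    """A finding passes only if it has citations and every one of them resolves."""
    if not citations:
        return False
    return all(
        c.get("chunk_id") in chunks_by_id
        and citation_resolves(c.get("quote", ""), chunks_by_id[c["chunk_id"]])
        for c in citations
    )
```

Exact matching after normalization is deliberately strict. A fuzzy match would admit more findings but would also let paraphrased, and possibly invented, "quotes" through.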

Stage 3: Anomaly logic the LLM should not own

It is tempting to ask the model to judge whether a number is anomalous. Don’t, for anything quantitative. LLMs are weak at arithmetic and comparison across large tables, and they will confidently miscompare figures. Split the labor:

  • The LLM extracts and normalizes — pulling structured findings, statuses, and amounts out of messy prose and tables.
  • Deterministic code does the anomaly detection — year-over-year deltas, repeat-finding detection across fiscal years, outlier dollar amounts, delinquency flags. This is SQL and statistics over the extracted schema, and it is reproducible, explainable, and free of hallucination.

This division is the heart of a trustworthy pipeline. Use the LLM for what it is uniquely good at — reading unstructured language at scale — and hand the judgment that needs to be exact and defensible to plain code. A repeat material weakness across three fiscal years is a deterministic GROUP BY, not a vibe.
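
To make the GROUP BY concrete, here is a sketch using an in-memory SQLite table. The `findings` table layout, the helper name, and the three-year threshold are assumptions for illustration. Note that it counts distinct fiscal years, which need not be consecutive.

```python
import sqlite3
from typing import List, Tuple

REPEAT_QUERY = """
SELECT state, program, COUNT(DISTINCT fiscal_year) AS years_flagged
FROM findings
WHERE anomaly_type = 'material_weakness'
GROUP BY state, program
HAVING COUNT(DISTINCT fiscal_year) >= ?
ORDER BY years_flagged DESC, state, program
"""


def repeat_material_weaknesses(rows: List[Tuple[str, str, str, int]],
                               min_years: int = 3) -> List[Tuple[str, str, int]]:
    """Return (state, program, years_flagged) for repeat material weaknesses.

    rows are (state, program, anomaly_type, fiscal_year) tuples taken from
    the extracted findings.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE findings "
            "(state TEXT, program TEXT, anomaly_type TEXT, fiscal_year INTEGER)"
        )
        conn.executemany("INSERT INTO findings VALUES (?, ?, ?, ?)", rows)
        return conn.execute(REPEAT_QUERY, (min_years,)).fetchall()
    finally:
        conn.close()
```

Because the rule is plain SQL over extracted rows, the same input always produces the same flags, and the query itself is the explanation.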

It also gives you two independent signals to combine. A finding the LLM rated high severity that also trips a deterministic rule — say, questioned costs above a dollar threshold that recur year over year — is a far stronger lead than either signal alone. Treat the final ranking as an ensemble: the language model’s reading and the deterministic anomaly logic each get a vote, and the cases where they agree rise to the top of the investigator queue. Disagreements are interesting too, but they belong in a lower triage tier, not in front of an enforcement official.
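
The two-vote idea can be sketched as a small routing function. The tier names, the choice to count only "high" severity as an LLM vote, and the tie-break on dollar amount are all illustrative assumptions. A real system would tune them against reviewer feedback.

```python
from typing import Dict, List

# Hypothetical tier names, ordered from most to least urgent.
TIER_ORDER = {"investigator": 0, "analyst_triage": 1, "archive": 2}


def triage_tier(llm_severity: str, rules_tripped: List[str]) -> str:
    """Route a finding based on agreement between the two signals."""
    llm_vote = llm_severity == "high"
    rule_vote = bool(rules_tripped)
    if llm_vote and rule_vote:
        return "investigator"    # both signals agree: strongest lead
    if llm_vote or rule_vote:
        return "analyst_triage"  # signals disagree: lower tier, not discarded
    return "archive"


def rank_queue(findings: List[Dict]) -> List[Dict]:
    """Sort by tier first, then by dollar amount (largest first) within a tier."""
    return sorted(
        findings,
        key=lambda f: (TIER_ORDER[triage_tier(f["severity"], f["rules_tripped"])],
                       -f["amount_usd"]),
    )
```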

Stage 4: Human-in-the-loop and the precision/recall dial

No flag becomes an action without a human. The pipeline’s job is to make the investigator’s review fast and well-evidenced: present the structured finding, the source citations rendered in context, and the deterministic anomaly signals side by side. The reviewer confirms, dismisses, or escalates — and those decisions become labeled data to tune the system.

That feedback loop is where you control precision versus recall:

  • Tune for high precision at the top of the queue — the flags you route straight to investigators should be overwhelmingly real, or you burn human trust fast.
  • Keep a lower-confidence tier for analyst triage rather than discarding it, so you don’t silently drop true positives.
  • Track the operating point with honest metrics (precision, recall, and false positives per investigator-hour) and report them. “The AI found fraud” is a press release; “at this threshold, 7 of 10 top-tier flags were confirmed actionable” is an engineering result.
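
Those metrics fall straight out of the reviewer dispositions. The sketch below assumes each reviewed finding is reduced to a `(score, confirmed)` pair, where `score` is whatever ranking score the pipeline assigned. The function name and the `review_hours` parameter are hypothetical. One caveat: recall here is measured only against findings a human has reviewed, so it says nothing about fraud the pipeline never surfaced.

```python
from typing import Dict, List, Tuple


def operating_point(reviewed: List[Tuple[float, bool]],
                    threshold: float,
                    review_hours: float) -> Dict[str, float]:
    """Precision, recall, and false positives per hour at a score threshold.

    reviewed: (score, confirmed) pairs, where confirmed is the human verdict.
    review_hours: investigator time spent on the flags above the threshold.
    """
    flagged = [confirmed for score, confirmed in reviewed if score >= threshold]
    true_pos = sum(flagged)
    false_pos = len(flagged) - true_pos
    # Only positives that humans have confirmed are known; see caveat above.
    known_pos = sum(confirmed for _, confirmed in reviewed)
    return {
        "precision": true_pos / len(flagged) if flagged else 0.0,
        "recall": true_pos / known_pos if known_pos else 0.0,
        "false_pos_per_hour": false_pos / review_hours if review_hours else 0.0,
    }
```

Running this across a sweep of thresholds gives the curve from which to pick the top-of-queue cutoff.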

A workable rollout treats the LLM tier as a ranking and routing layer over a queue that humans already work, not a replacement for the humans. That is also the legally and ethically defensible posture: the consequence — a funding action against a state — is taken by an accountable official looking at sourced evidence, not by a model.

Auditability and governance

A system that accuses states of misspending federal money has to be able to explain itself months later. Build for that from day one:

  • Log everything: the retrieved chunks, the exact prompt, the model and version, the raw structured output, and the human disposition. You need to reconstruct why a flag fired long after the fact.
  • Pin model versions. A silent model update can shift your operating point overnight; treat the model like any other dependency with a version and a changelog.
  • Hold a human-decision boundary between any model output and any real-world consequence, and make that boundary an explicit, logged step.
  • Red-team for bias and adversarial text. Test whether the pipeline systematically over-flags certain states or report formats, and whether crafted wording can suppress a true finding.
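
As a sketch of the logging and decision-boundary points, the record below captures what you would need to reconstruct a flag later. The field names, the disposition values ("confirmed", "dismissed", "escalated"), and both helper functions are assumptions for illustration. In practice each record would be appended to write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_audit_record(prompt: str, chunk_ids: List[str], model_version: str,
                       raw_output: str, disposition: Optional[str] = None) -> str:
    """Serialize one flag's full provenance as a single JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,  # pinned, never "latest"
        "prompt": prompt,
        # The hash lets you detect prompt drift without diffing full text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "retrieved_chunk_ids": list(chunk_ids),
        "raw_output": raw_output,
        "human_disposition": disposition,  # None until a reviewer decides
    }
    return json.dumps(record, sort_keys=True)


def may_take_action(record_json: str) -> bool:
    """Human-decision boundary: only an explicit confirmation unlocks action."""
    return json.loads(record_json).get("human_disposition") == "confirmed"
```

Making the boundary a function that reads the log, rather than a convention, means no downstream step can act on a flag without a recorded human decision.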

The honest takeaway

Running ChatGPT over audit reports is not magic, and it is not, by itself, fraud detection. What it is — done right — is a force multiplier on retrieval and reading: a way to turn thousands of impenetrable PDFs into a ranked, sourced, reviewable queue so that scarce human investigators spend their time on the cases most likely to be real. The engineering that matters is unglamorous: structure-aware retrieval, schema-bound extraction, programmatic citation checks, deterministic anomaly logic, and a hard human-decision boundary with measured precision and recall. Get those right and an LLM meaningfully shrinks the haystack. Skip them and you have built a very expensive, very confident generator of accusations no one should act on. The line between the two is entirely in how you wire the pipeline — and it is the same line, for the same reasons, whether the documents are state audits, insurance claims, or procurement records.