Building Reliable AI for In-House Legal Teams
In-house legal AI is hot, but the hard part is reliability: how to ground answers in citations, run RAG over contracts, and keep a lawyer in the loop where it counts.
In-house legal teams are suddenly a funded category. On June 9, 2026, Sandstone raised a $30M Series A led by Lightspeed to bring AI workflow automation to corporate legal departments — six months after a $10M seed. The pitch lands because the work is real: in-house counsel drown in contract review, policy questions, and intake triage. But the engineering problem behind a credible legal AI is not “call an LLM.” It is reliability under a constraint most products quietly ignore: a wrong answer that sounds right is worse than no answer. This is a field guide to building that system — what works, what to measure, and where a human stays non-negotiable.
Why legal is a retrieval problem, not a generation problem
The instinct is to reach for a bigger model. The actual bottleneck is grounding. A general LLM will happily summarize an indemnification clause it has never seen, blending plausible boilerplate with the specific terms of your contract. For a sales team that is a minor annoyance. For legal it is malpractice-adjacent.
So the architecture is retrieval-augmented generation (RAG) with a hard rule layered on top: every assertion the model makes must point to a span in a real source document, and if it cannot cite, it must say so. The generation step is almost the easy part. The work is in retrieval quality over documents that are long, structured, cross-referenced, and full of defined terms.
The documents fight back
Legal corpora break naive RAG in specific ways:
- Length. A master services agreement plus its order forms and amendments can run hundreds of pages. Fixed-size chunking splits a single obligation across two chunks, and the retriever returns half of it.
- Defined terms. “Confidential Information” means whatever Section 1.4 says it means. A chunk that uses the term without the definition is a trap — the model fills the gap with the generic meaning.
- Cross-references. “Subject to Section 9.2” is load-bearing. If retrieval doesn’t pull 9.2 alongside the clause that points to it, the answer is wrong by omission.
- Amendments and precedence. The operative term may live in a later amendment that supersedes the original. Retrieval has to understand recency and an explicit precedence order, not just semantic similarity.
The fix is structure-aware ingestion. Parse the document into its clause hierarchy first, chunk on clause and sub-clause boundaries rather than token counts, and attach metadata to every chunk — document ID, effective date, section number, party, defined-terms-in-scope. Resolve cross-references at index time so a retrieved clause carries its dependencies. None of this is glamorous, and all of it matters more than the choice of base model.
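As a minimal sketch of clause-boundary chunking with cross-references resolved at index time: the heading pattern (a line that starts with a dotted section number), the `ClauseChunk` fields, and the `chunk_by_clause` helper are all assumptions made for illustration. Real contracts need a sturdier parser, since a continuation line that happens to start with a number would be misread as a heading here.

```python
import re
from dataclasses import dataclass, field

# Assumed heading convention: "9.2 Liability Cap. ..." at the start of a line.
SECTION_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")
# Assumed cross-reference convention: "Section 9.2".
CROSS_REF = re.compile(r"Section\s+(\d+(?:\.\d+)*)")


@dataclass
class ClauseChunk:
    doc_id: str
    effective_date: str
    section: str
    text: str
    depends_on: list = field(default_factory=list)  # sections this clause points to


def chunk_by_clause(doc_id, effective_date, raw_text):
    """Split on section headings (not token counts) and record cross-references."""
    chunks = []
    current = None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SECTION_HEADING.match(line)
        if match:
            current = ClauseChunk(doc_id, effective_date, match.group(1), match.group(2))
            chunks.append(current)
        elif current is not None:
            # Continuation of the current clause: keep the obligation in one chunk.
            current.text += " " + line

    known_sections = {c.section for c in chunks}
    for c in chunks:
        refs = set(CROSS_REF.findall(c.text))
        # Keep only references that resolve to a clause in this document.
        c.depends_on = sorted(r for r in refs if r in known_sections and r != c.section)
    return chunks


sample = """
9.1 Indemnification. Subject to Section 9.2, Supplier shall indemnify Customer
against third-party claims.
9.2 Liability Cap. Liability under this Agreement shall not exceed the fees paid.
"""
for chunk in chunk_by_clause("msa-001", "2024-01-15", sample):
    print(chunk.section, chunk.depends_on)
```

At query time, a retriever that returns section 9.1 can follow `depends_on` and pull 9.2 into the context alongside it.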
Retrieval that survives contact with real corpora
A single dense-vector lookup is not enough on this data. Defined terms and section numbers are exact tokens, and pure semantic search drifts past them. The retrieval stack that holds up in practice is hybrid: combine dense embeddings with a keyword/BM25 pass so a query mentioning “Section 12.3” or “Net 60” actually finds it, then rerank the merged candidates with a cross-encoder before they reach the model. Over-retrieve deliberately — pull more candidates than you think you need — because in legal work the cost of a missing clause dwarfs the cost of a few extra tokens in the context window. Where the corpus has a clean entity graph (contracts, parties, obligations, dates), a small graph or metadata-filtered retrieval layer on top of vectors lets you answer “show me every NDA with this counterparty expiring this quarter” deterministically, which embeddings alone will never do reliably.
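The article does not say how the dense and keyword result lists get merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than a prescribed method. The chunk IDs are made up, and `k=60` is a conventional default, not a tuned value. The fused list is what you would then hand to the cross-encoder reranker.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) per chunk, so a chunk found by both
    the dense pass and the keyword pass outranks one found by only one of them.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for a query that mentions "Section 12.3".
dense_hits = ["msa-001#9.1", "msa-001#4.2", "msa-001#12.3"]
keyword_hits = ["msa-001#12.3", "msa-001#9.1", "nda-007#3.1"]

merged = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(merged)
```

Because fusion uses ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores are not on comparable scales.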
Grounding every answer to a citation
The single most important design decision is to make citations a first-class output, not a nice-to-have footnote. Concretely:
- Retrieve candidate spans with metadata.
- Prompt the model to answer only from the retrieved spans, and to emit, for each claim, the document and section it came from.
- Run a post-generation verification pass that checks each cited span actually supports the claim — a separate, cheap model call or a string/semantic-overlap check. If a sentence has no supporting span, drop it or flag it.
- Render the answer with the citation inline, linking back to the exact clause in the source viewer so a lawyer can verify in one click.
This is the difference between a demo and a tool people trust. The hallucination risk in legal AI is well documented — courts have sanctioned lawyers for filings built on fabricated citations from general chatbots, and the lesson generalizes. A system that can only speak from retrieved text, and that proves it every time, changes the failure mode from “confidently wrong” to “honestly incomplete.” The second is survivable.
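The post-generation verification step can be sketched with the cheapest of the options mentioned, a string-overlap check. Everything here is illustrative: the `Claim` shape, the stopword list, and the 0.6 threshold are assumptions. Lexical overlap is also a crude proxy, since it will not notice a negation or a swapped party, so a production system would likely pair it with a model-based check.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    cited_span_id: Optional[str]  # None when the model emitted no citation


# Small illustrative stopword list; a real one would be larger.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "or", "at", "by", "for", "on"}


def content_tokens(text):
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def verify_claims(claims, spans, min_overlap=0.6):
    """Split claims into (supported, flagged) by token overlap with the cited span."""
    supported, flagged = [], []
    for claim in claims:
        span_text = spans.get(claim.cited_span_id) if claim.cited_span_id else None
        tokens = content_tokens(claim.text)
        if span_text is None or not tokens:
            flagged.append(claim)  # no citation, or citation points nowhere
            continue
        overlap = len(tokens & content_tokens(span_text)) / len(tokens)
        (supported if overlap >= min_overlap else flagged).append(claim)
    return supported, flagged


spans = {
    "msa-001#9.2": "Liability under this Agreement shall not exceed "
                   "the fees paid in the prior twelve months."
}
claims = [
    Claim("Liability is capped at the fees paid in the prior twelve months.", "msa-001#9.2"),
    Claim("The supplier must carry cyber insurance.", "msa-001#9.2"),
    Claim("Termination requires 30 days notice.", None),
]
supported, flagged = verify_claims(claims, spans)
print(len(supported), "supported;", len(flagged), "flagged for review or removal")
```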
A useful discipline: treat “I couldn’t find that in your documents” as a success state, not a failure. Most products are tuned to always produce an answer. A legal tool should be tuned to abstain when retrieval comes up empty, and to say which documents it searched.
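A sketch of abstention as an explicit result type rather than an error path. The `Answer` fields, the `min_score` cutoff, and the `generate` callable (standing in for whatever LLM call you use) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    status: str  # "answered" or "abstained"
    text: str
    citations: list = field(default_factory=list)
    searched_docs: list = field(default_factory=list)


def answer_or_abstain(question, hits, searched_docs, generate, min_score=0.5):
    """Answer only from sufficiently relevant spans; otherwise abstain and say where we looked."""
    usable = [h for h in hits if h["score"] >= min_score]
    if not usable:
        return Answer(
            status="abstained",
            text="I couldn't find that in your documents.",
            searched_docs=list(searched_docs),
        )
    return Answer(
        status="answered",
        text=generate(question, usable),
        citations=[h["span_id"] for h in usable],
        searched_docs=list(searched_docs),
    )


def fake_generate(question, spans):
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "Per " + spans[0]["span_id"] + ": " + spans[0]["text"]


weak_hits = [{"span_id": "msa-001#4.2", "score": 0.12, "text": "Fees are due Net 60."}]
result = answer_or_abstain("Is there a non-compete?", weak_hits, ["msa-001", "nda-007"], fake_generate)
print(result.status, result.searched_docs)
```

Returning `searched_docs` on both paths lets the interface tell the lawyer which documents were covered, whether or not an answer came back.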
The workflows that actually save hours
Once retrieval and grounding are solid, the high-value use cases are narrow and well-defined — which is exactly why they work.
Clause extraction and comparison
Pull every limitation-of-liability, termination, auto-renewal, governing-law, and data-processing clause across a contract portfolio into a structured table. Compare an incoming third-party paper against your playbook of approved positions and flag where it deviates. This is first-pass review: the model surfaces the five clauses worth a lawyer’s attention out of forty, with the deviation highlighted and cited. The lawyer makes the call; the AI did the reading.
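Once clauses are extracted into structured rows, the playbook comparison itself can be plain deterministic code. The sketch below assumes an upstream extraction step has already produced typed values with section citations. The playbook positions shown are invented examples, not recommended terms.

```python
# Hypothetical playbook: clause type -> predicate that is True for approved values.
PLAYBOOK = {
    "governing_law": lambda v: v in {"Delaware", "New York"},
    "liability_cap_months": lambda v: v >= 12,
    "auto_renewal": lambda v: v is False,
}


def flag_deviations(extracted_clauses):
    """Return only the clauses a lawyer should look at, each with a reason."""
    flagged = []
    for clause in extracted_clauses:
        rule = PLAYBOOK.get(clause["type"])
        if rule is None:
            # No approved position on file: surface it rather than pass it silently.
            flagged.append({**clause, "reason": "no playbook position"})
        elif not rule(clause["value"]):
            flagged.append({**clause, "reason": "outside approved position"})
    return flagged


# Assumed output of the extraction step for one incoming contract.
incoming = [
    {"type": "governing_law", "value": "Delaware", "section": "14.1"},
    {"type": "liability_cap_months", "value": 6, "section": "9.2"},
    {"type": "auto_renewal", "value": True, "section": "11.3"},
    {"type": "exclusivity", "value": "worldwide", "section": "3.4"},
]

for item in flag_deviations(incoming):
    print(item["section"], item["type"], "->", item["reason"])
```

The model does the reading and extraction. The comparison is a lookup, which keeps the flagging step reproducible and easy to audit.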
Policy and precedent search
In-house counsel field a constant stream of “are we allowed to…” questions from sales, HR, and product. RAG over internal policies, prior approvals, and standard positions answers the routine ones with a citation to the governing policy — and routes the genuinely novel ones to a human. The win is deflecting the 70% that are lookups so counsel can spend time on the 30% that are judgment.
Intake triage and summarization
New matters arrive as email threads and attachments. Summarize the request, classify it (contract review, employment, IP, dispute), extract parties and deadlines, and draft a structured intake record. Summarization grounded in the source thread is one of the lowest-risk, highest-leverage tasks an LLM does.
The pattern across all three: AI handles search, extraction, first-pass review, and summarization. It does not give advice, make risk calls, or sign off. That line is the product.
Privilege, confidentiality, and data governance
Legal data is among the most sensitive in an enterprise, and the governance constraints shape the architecture as much as the retrieval logic does.
- Privilege is contagious and fragile. Attorney-client privileged material cannot leak into contexts where it loses protection. Tenancy isolation, per-matter access controls, and strict separation between clients (for outside counsel) or business units (for in-house) are baseline requirements, not enterprise upsells.
- No training on client data without explicit, scoped consent. The safe default is retrieval-only: documents live in a vector store the customer controls, and nothing flows back into model weights.
- Auditability. Every answer should be reconstructable — which documents were in scope, which spans were retrieved, which model version ran. When a lawyer relies on an output, they need to show their work.
- Residency and deletion. International teams have to honor data-residency rules and contractual deletion obligations. For a firm operating across Boston, London, Sydney, and Kathmandu, “where does this document physically sit and who can reach it” is a design question answered before the first line of retrieval code.
These are not compliance theater. They are the reason a general-purpose chatbot is a non-starter for this work and a purpose-built system can win.
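Two of the requirements above, per-matter access control and auditability, translate fairly directly into code. This is a sketch under assumptions: the field names, the `matter_id` tag on each span, and the use of a SHA-256 digest as a tamper-evidence check are illustrative choices, not a compliance standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def spans_user_may_see(spans, user_matter_ids):
    """Filter BEFORE retrieval results reach the model, so privileged text never enters the prompt."""
    return [s for s in spans if s["matter_id"] in user_matter_ids]


@dataclass(frozen=True)
class AuditRecord:
    question: str
    user_id: str
    docs_in_scope: tuple
    retrieved_span_ids: tuple
    model_version: str
    answer: str
    timestamp: str  # ISO 8601, supplied by the caller

    def digest(self):
        # Stable hash of the full record, usable to detect later edits.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


all_spans = [
    {"span_id": "msa-001#9.2", "matter_id": "M-100"},
    {"span_id": "inv-044#2.1", "matter_id": "M-200"},
]
visible = spans_user_may_see(all_spans, {"M-100"})

record = AuditRecord(
    question="What is the liability cap?",
    user_id="u-17",
    docs_in_scope=("msa-001",),
    retrieved_span_ids=tuple(s["span_id"] for s in visible),
    model_version="example-model-v1",  # placeholder identifier
    answer="Liability is capped at fees paid.",
    timestamp="2026-06-09T12:00:00Z",
)
print(record.digest())
```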
Measure it like you mean it
You cannot ship legal AI on vibes. Build an evaluation set the way you would for any safety-relevant system:
- A golden set of real questions with known-correct answers and the exact source spans that justify them, curated by lawyers.
- Retrieval metrics — did the right clause land in the top-k? Recall on the supporting span matters more than ranking elegance, because a missed cross-reference produces a wrong answer.
- Citation faithfulness — does every claim trace to a span that genuinely supports it? This catches the subtle failure where the answer is right but the cited source doesn’t actually say it.
- Abstention accuracy — when the answer isn’t in the corpus, does the system correctly decline instead of inventing one?
Re-run the suite on every model, prompt, and chunking change. Regressions in legal reliability are silent until someone relies on a bad answer in front of a counterparty.
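The three metrics above reduce to small functions. In this sketch the input shapes are assumptions: gold span IDs come from the lawyer-curated set, and the per-claim faithfulness judgments come from a reviewer or a verifier pass.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of the gold supporting spans that appear in the top-k retrieved."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("recall is undefined without gold spans")
    return len(set(retrieved_ids[:k]) & gold) / len(gold)


def citation_faithfulness(judgments):
    """judgments: one bool per claim, True if the cited span supports the claim."""
    if not judgments:
        raise ValueError("no claims to score")
    return sum(judgments) / len(judgments)


def abstention_accuracy(cases):
    """cases: (should_abstain, did_abstain) pairs across the golden set."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(should == did for should, did in cases) / len(cases)


# Toy numbers to show the call shapes, not real evaluation results.
print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))
print(citation_faithfulness([True, True, False, True]))
print(abstention_accuracy([(True, True), (False, False), (True, False), (False, False)]))
```

Wiring these into a test suite with fixed thresholds turns a silent regression into a failed build.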
The honest bottom line
The funding flowing into in-house legal AI in 2026 is chasing a genuine inefficiency, and the technology is good enough to remove real drudgery — if it is built as a grounded retrieval system with citations, evals, and abstention, rather than a chat box with a law-firm logo. The durable products will be the ones honest about the boundary: AI reads, extracts, compares, and summarizes; a lawyer interprets, advises, and decides. Teams that respect that line will quietly hand hours back to their counsel. Teams that blur it are building a liability with a nice UI. The engineering challenge for the rest of the year is not a smarter model — it is retrieval that never loses the cross-reference and a system that would rather say “I don’t know” than guess.