How law firms can truly know if AI works

By Sam Grange
Beneath the glossy promises of ‘99% accuracy,’ real progress demands rigorous evaluation, clear metrics, and continuous testing anchored in legal reality
Law firms are racing to adopt AI tools. Each week brings another vendor promising to transform legal work, another headline boasting record accuracy, another demo showcasing dazzling results. Yet behind the polished marketing lies a harder question: How do we actually know these systems work?
The answer lies not in sales copy or benchmark scores. It lies in rigorous, systematic evaluation anchored in real legal tasks. Without it, firms risk two equally costly mistakes: over-trusting weak tools or under-using strong ones.
Beyond the Headline Number
Claims like ‘99% accurate’ sound impressive, but accurate at what? A benchmark filled with simple questions flatters any model but says little about performance on real legal tasks.
A lawyer recently said to me: "Even if the system is 99% accurate, I still have to check every answer as I don't know which will be the 1%." This is the accuracy trap. If every output still needs review, the system hasn't sped up the work; it has likely slowed it down.
This comment reframes the task entirely. We need to stop asking ‘How does this perform on benchmarks?’ and start asking ‘How does this change someone's workflow?’
When evaluation shifts towards usefulness and verifiability (the ease with which a user can trust and confirm an output), both productivity and adoption improve.
Evaluation, then, is not about squeezing the last decimals of accuracy. It's about understanding how well an AI system fits the process and helps get the job done.
Building Your Evaluation Framework
If you don’t know how to measure whether an AI system truly works for your use cases, you can’t govern it, improve it, or defend it, and you certainly can’t use it responsibly.
A solid evaluation framework begins with four questions:
What tasks will this handle? Be specific. Don't stop at ‘contract review’. Define every discrete task, for example ‘identifying parties in commercial agreements’ or ‘extracting payment terms from supplier contracts’. High-level categories hide dozens of subtasks, each requiring its own evaluation method.
What does success look like? Accuracy might be convenient for your technical teams, but it is not necessarily the metric that matters most to your business. It may make more sense to track the number of review iterations or the rate of missed clauses. Anchor every enhancement to a business outcome, not just a technical benchmark.
How will we test it? Design evaluations that reflect real workflows, not synthetic benchmarks. For a contract review tool, test it against your firm's actual playbooks and specific cognitive tasks. Simple retrieval, such as locating a governing law clause that states: ‘This agreement shall be governed by New York law,’ can be handled effectively by basic systems. But when determining governing law depends on contextual clues such as billing addresses, defined terms, or conditional cross-references, the task becomes much harder. Don't assume consistency within tasks. Use genuine documents of varying quality. Build small, curated datasets where the ground truth is known and verifiable. Once you've defined what constitutes a correct answer or action, much of the evaluation can be automated, as the sketch after these four questions illustrates.
How do we review the outputs? Different tasks require different evaluation methods. Data-point extraction can be verified straightforwardly with decades-old techniques, while complex reasoning may require expert review or structured scoring rubrics. Whatever the method, tailor it to the task and account for verification difficulty and subjectivity.
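To make the last two questions concrete, here is a minimal sketch in Python of an automated check against a small, curated ground-truth set. The documents, expected answers, and the naive regex extractor are hypothetical placeholders for whatever system is actually under test; the point is that once the ground truth is written down, scoring it becomes a few lines of code.

```python
# A minimal sketch of automated evaluation against a small, curated
# ground-truth dataset. The naive regex extractor below is a stand-in
# for whatever AI system is actually under test.

import re
from dataclasses import dataclass


@dataclass
class Case:
    document: str   # contract text (or a pointer to it)
    expected: str   # ground truth agreed by the legal team


# Hand-built test set: each entry pairs a document with a verified answer.
GOLD_SET = [
    Case("This agreement shall be governed by New York law.", "New York"),
    Case("Governing law: England and Wales.", "England and Wales"),
]


def naive_extractor(document: str) -> str:
    """Toy baseline: grab whatever follows 'governed by' or 'Governing law:'."""
    match = re.search(r"governed by (.+?) law", document, re.IGNORECASE)
    if match:
        return match.group(1)
    match = re.search(r"governing law:\s*(.+?)[.\n]", document, re.IGNORECASE)
    return match.group(1) if match else ""


def evaluate(extractor) -> float:
    """Fraction of cases where the extractor matches the verified answer."""
    correct = sum(
        1 for case in GOLD_SET
        if extractor(case.document).strip().lower() == case.expected.lower()
    )
    return correct / len(GOLD_SET)


print(f"Accuracy on curated set: {evaluate(naive_extractor):.0%}")
```

Running the same harness against each new tool or configuration turns vendor claims into numbers you can compare side by side.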
This framework isn't bureaucracy; it's infrastructure. It gives you a shared language for understanding performance, a basis for governance, and a way to transform marketing claims into measurable evidence.
The Accuracy Spectrum
Not every use case demands the same level of accuracy. Understanding where your task sits on this spectrum is essential.
At one end are high-risk, high-precision tasks, where even a small error carries significant consequences. When comparing versions in a redline or citing authorities in court, a single missed change or misquoted precedent creates real exposure. Here, AI delivers value only when accuracy approaches perfection. Until that threshold is met, human review remains indispensable.
At the other end are exploratory or interpretive tasks, where accuracy is important but not absolute. In semantic search, the goal is relevance: to surface potentially useful material, not to guarantee a single correct answer. If an occasional irrelevant document appears but is easily ignored, the overall value is unaffected. In fact, overly strict precision can reduce discovery by narrowing the field too much.
Between these extremes lies a wide middle ground: tasks whose required accuracy shifts with context and purpose. Generating a chronology might tolerate looser thresholds for internal use, but demand near-perfect precision if those materials are bound for court.
Thresholds are directional, not absolute. Every firm should calibrate its own standards based on the risk profile of each task, regulatory obligations, and internal quality expectations. What matters is aligning accuracy requirements with business risk and user trust.
What to Evaluate? A Practical Framework
Effective evaluation goes beyond measuring raw accuracy. Test across multiple dimensions, each addressing different aspects of reliability, usability, and trust.
Retrieval
For systems that search or retrieve information, structured relevance assessments are essential. Traditional human grading remains the gold standard for nuanced, context-driven evaluation, such as distinguishing between documents that are authoritative and those that are merely relevant, a distinction that is critical in law. The most robust approach combines expert judgment with automated scoring. Build human-graded datasets and reuse them across configurations to enable consistent testing without relabelling each time.
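As an illustration, the sketch below shows how a small set of expert relevance grades, recorded once, can be reused to score any retrieval configuration. The query, document IDs, grades, and the precision-at-k metric are hypothetical; in practice the grades come from lawyers labelling a fixed query set, and richer measures can be layered on top.

```python
# A minimal sketch of reusing expert relevance grades across retrieval
# configurations. Query, document IDs, and grades are hypothetical;
# 2 = authoritative, 1 = merely relevant, 0 = not relevant.

QUERY = "governing law in cross-border supplier contracts"

RELEVANCE_GRADES: dict[tuple[str, str], int] = {
    (QUERY, "doc_104"): 2,
    (QUERY, "doc_221"): 1,
    (QUERY, "doc_317"): 0,
}


def precision_at_k(query: str, retrieved_ids: list[str], k: int = 5) -> float:
    """Fraction of the top-k results that experts graded relevant (grade >= 1)."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = sum(
        1 for doc_id in top_k if RELEVANCE_GRADES.get((query, doc_id), 0) >= 1
    )
    return relevant / len(top_k)


# The same graded set scores any engine or configuration without relabelling:
print(precision_at_k(QUERY, ["doc_104", "doc_317", "doc_221"], k=3))  # ~0.67
```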
Generation
When AI generates text, evaluate against expert-authored required elements tailored to the task, then use them to measure precision (what proportion of the output is accurate) and recall (what proportion of the required information appears). Tempting as it is, avoid reference-free LLM judging without these required elements; such methods often suffer from inconsistency, bias, and narcissism, where a model-as-judge favours outputs that resemble its own.
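Here is a minimal sketch of this kind of scoring, under the simplifying assumption that a required element counts as covered if its wording appears in the output. The checklist, the sentence-level claim splitting, and the substring matching are illustrative; in practice each element is usually judged by a reviewer or a tightly constrained grading prompt rather than an exact text match.

```python
# A minimal sketch of precision/recall against expert-authored required
# elements. The checklist, the sentence-level claim splitting, and the
# substring matching are deliberate simplifications.

REQUIRED_ELEMENTS = [
    "governed by New York law",
    "payment is due within 30 days",
    "either party may terminate on 90 days' notice",
]


def split_into_claims(output: str) -> list[str]:
    """Crude stand-in: treat each sentence of the output as one claim."""
    return [s.strip() for s in output.split(".") if s.strip()]


def score(output: str) -> tuple[float, float]:
    """Return (precision, recall) of the output against the checklist."""
    text = output.lower()
    claims = split_into_claims(output)
    # Recall: what proportion of required elements appears in the output?
    covered = sum(1 for e in REQUIRED_ELEMENTS if e.lower() in text)
    recall = covered / len(REQUIRED_ELEMENTS)
    # Precision: what proportion of the output's claims is supported?
    supported = sum(
        1 for c in claims if any(e.lower() in c.lower() for e in REQUIRED_ELEMENTS)
    )
    precision = supported / len(claims) if claims else 0.0
    return precision, recall


summary = (
    "The contract is governed by New York law. Payment is due within 30 days. "
    "It renews automatically each year."
)
print(score(summary))  # roughly (0.67, 0.67): one unsupported claim, one missed element
```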
Robustness
Real users don't write perfect prompts. Evaluate whether systems can handle synonyms, misspellings, and formatting variations that leave the intent unchanged. These invariance tests reveal whether a system truly understands a query or is simply pattern-matching.
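A sketch of what such an invariance test might look like. The base query, its variants, the answer() call, and the equivalence check are all placeholders for the system under test and whatever comparison (exact match, rubric, expert review) suits the task.

```python
# A minimal sketch of an invariance test: variants of the same question
# should produce equivalent answers.

BASE_QUERY = "What is the notice period for termination?"
VARIANTS = [
    "what is the notice period for terminaton?",  # misspelling
    "How much notice is required to terminate?",  # rephrasing
    "NOTICE PERIOD FOR TERMINATION??",            # formatting noise
]


def answer(query: str) -> str:
    """Placeholder for the system under test."""
    raise NotImplementedError


def answers_equivalent(a: str, b: str) -> bool:
    """Placeholder comparison; often a rubric or human judgement in practice."""
    return a.strip().lower() == b.strip().lower()


def run_invariance_test() -> list[str]:
    """Return the variants whose answers diverge from the baseline answer."""
    baseline = answer(BASE_QUERY)
    return [v for v in VARIANTS if not answers_equivalent(answer(v), baseline)]
```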
Bias and fairness
Test for differences in response quality when surface-level, legally irrelevant details vary, such as references to race, gender, or socioeconomic status, none of which should affect the answer. A fair system must produce equivalent results regardless of these cues.
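One common way to operationalise this is a paired counterfactual test: the same scenario is posed with a legally irrelevant detail swapped, and the answers are compared. The template, the swapped descriptors, and the comparison function in the sketch below are illustrative placeholders.

```python
# A minimal sketch of a paired counterfactual test: the same scenario is
# posed with a legally irrelevant detail swapped, and the answers compared.

TEMPLATE = (
    "My client, a {descriptor} shop owner, received a notice of lease "
    "termination without cause. What remedies are available?"
)
DESCRIPTORS = ["female", "male", "elderly", "young"]


def answer(query: str) -> str:
    """Placeholder for the system under test."""
    raise NotImplementedError


def substantively_equal(a: str, b: str) -> bool:
    """Placeholder; in practice a reviewer or rubric compares the advice given."""
    return a.strip().lower() == b.strip().lower()


def run_fairness_check() -> list[tuple[str, str]]:
    """Return pairs of descriptors whose answers differ in substance."""
    answers = {d: answer(TEMPLATE.format(descriptor=d)) for d in DESCRIPTORS}
    failures = []
    for i, first in enumerate(DESCRIPTORS):
        for second in DESCRIPTORS[i + 1:]:
            if not substantively_equal(answers[first], answers[second]):
                failures.append((first, second))
    return failures
```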
Knowing when to say, ‘I don’t know.’
A system's ability to recognise its own limits is as important as its ability to generate answers. Legal frameworks vary dramatically across jurisdictions, yet retrieval augmented generation (RAG) systems can generate hybrid answers that mix the rules of multiple jurisdictions. Evaluate whether systems recognise this risk and acknowledge when they cannot provide a reliable answer. Use directional tests to confirm that changing jurisdiction triggers appropriately different responses.
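A sketch of a directional test along these lines: the same question asked under different supported jurisdictions should produce different answers, and a jurisdiction outside the system's coverage should trigger an explicit abstention. The question, jurisdictions, answer() call, and refusal check are hypothetical.

```python
# A minimal sketch of a directional jurisdiction test. Changing jurisdiction
# should change the answer, and jurisdictions outside the system's coverage
# should trigger an explicit abstention.

QUESTION = "What is the limitation period for a breach of contract claim?"
SUPPORTED = ["England and Wales", "New York"]
UNSUPPORTED = ["Ruritania"]  # deliberately outside the system's coverage


def answer(question: str, jurisdiction: str) -> str:
    """Placeholder for the system under test."""
    raise NotImplementedError


def is_abstention(response: str) -> bool:
    """Crude placeholder: does the system say it cannot answer reliably?"""
    return "cannot" in response.lower() or "not able" in response.lower()


def run_directional_test() -> dict[str, bool]:
    """Check that answers differ by jurisdiction and abstain when unsupported."""
    answers = {j: answer(QUESTION, j) for j in SUPPORTED}
    return {
        "answers_differ_by_jurisdiction": len(set(answers.values())) == len(SUPPORTED),
        "abstains_when_unsupported": all(
            is_abstention(answer(QUESTION, j)) for j in UNSUPPORTED
        ),
    }
```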
Continuous Evaluation: The Gardener's Approach
Musician Brian Eno has a beautiful saying about working with generative systems: ‘You don’t want to be an architect, you want to be a gardener.’
The architect envisions every detail in advance, from the skyline to the doorknob, and a builder creates exactly what was imagined. The gardener plants seeds, observes what thrives, and adapts accordingly. That distinction captures what modern AI evaluation requires.
We can't design the perfect test suite and walk away. Use cases change, edge cases emerge, and large language models evolve; versions shift, behaviours drift, and yesterday's assumptions expire. Evaluation should be ongoing, not episodic. It's a living process: less about certification, more about cultivation. The firms that thrive will be those that stay close to real use, watching how systems behave in practice and adjusting as they grow.
Establish regression suites to catch performance degradation before deployment. Use pre-release gates to verify major updates, and post-update smoke tests to confirm continued reliability. Track metrics on a living dashboard so that change stays visible, not anecdotal.
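At its simplest, a pre-release gate can be a comparison of the current run's metrics against a stored baseline with an agreed tolerance, as in the sketch below. The metric names, baseline values, and tolerance are illustrative.

```python
# A minimal sketch of a pre-release regression gate: compare the current
# evaluation run against a stored baseline and flag any metric that has
# degraded beyond an agreed tolerance.

BASELINE = {"extraction_accuracy": 0.96, "retrieval_precision_at_5": 0.88}
TOLERANCE = 0.02  # maximum acceptable drop per metric


def regression_gate(current: dict[str, float]) -> list[str]:
    """Return the metrics that regressed beyond tolerance; empty means pass."""
    return [
        name
        for name, baseline_value in BASELINE.items()
        if current.get(name, 0.0) < baseline_value - TOLERANCE
    ]


# Example: run after each model or prompt update, before deployment.
current_run = {"extraction_accuracy": 0.97, "retrieval_precision_at_5": 0.83}
failures = regression_gate(current_run)
if failures:
    print(f"Release blocked; regressed metrics: {failures}")
else:
    print("All metrics within tolerance; release can proceed.")
```

The same check, run automatically after every model or prompt update, is what turns the dashboard from a reporting artefact into a genuine quality gate.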
Continuous evaluation turns governance into growth. Like a well-tended garden, your AI systems remain healthy not because they were perfectly planted at the beginning, but because you kept watching, measuring, and adapting.
Sooner or later, your clients will ask how you know your AI is working properly. They'll want evidence, not anecdotes. They'll ask what you did if it failed, how you caught it, and what you changed. If you have clear documentation and regular evaluations, you'll have good answers. If you don't, you'll end up backfilling them later, under pressure, at ten times the cost.
Start now because the future of legal practice isn't about having AI. That's already a given. It's about having AI you can understand, trust, measure, and continually improve. And that begins with one simple discipline: knowing what questions to ask.