Abstract
Research aim
This study examines how legal knowledge can be measured inside automated reasoning systems without treating legal prediction as a black box. Dr Ciarán O'Donnell frames the problem narrowly: identify objective dimensions of legal knowledge, formalise them into machine-readable structures, and test whether those structures improve procedural outcome prediction.
The initial drafting phase considered a broad multi-jurisdictional corpus. The team rejected that design and used a strictly bounded US-only dataset instead, chiefly to avoid cross-border semantic drift in procedural language. That choice narrowed the evidential base, but it made the measurement task cleaner.
Metrics and aggregate result
The evaluation centered on three families of measures: precision in precedent citation, recall for mandatory procedural steps, and calibration of predicted confidence against observed outcomes. Training logs show a baseline prediction accuracy of roughly 87% after a data collection period ranging from 14 to 19 months.
Bottom Line: The core finding is not that automated legal reasoning has solved procedural law. It is that certain dimensions of legal knowledge can be measured with enough discipline to compare one representation scheme against another.
Introduction
Why objective metrics matter
Legal AI often speaks in broad terms: prediction, retrieval, summarisation, risk. Those labels hide the real technical question. What exactly does the system know about law, and how does that knowledge change the result?
For procedural domains, the answer cannot rest on surface language alone. A case syllabus may describe a dismissal, a motion, and a ruling on the merits in adjacent sentences. An automated system that treats those terms as loosely interchangeable can misclassify a procedural dismissal as a substantive ruling. That is not a small drafting error. It changes the legal meaning of the case.
Observation data supports the need for separable metrics because procedural reasoning contains several distinct tasks: identifying the rule, locating the deadline, determining whether a condition has been triggered, and deciding whether prior authority constrains the outcome. A single accuracy score can make those tasks look more stable than they are.
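A small sketch shows how a single score can hide that instability. The task names follow the four tasks listed above; the helper names and all counts are invented for illustration and are not study data.

```python
from collections import defaultdict

def make_rows(task, n_correct, n_total):
    """Build (task, was_correct) rows; the first n_correct are marked correct."""
    return [(task, i < n_correct) for i in range(n_total)]

# Hypothetical counts for illustration only; these are not study data.
rows = (
    make_rows("identify_rule", 5, 5)
    + make_rows("locate_deadline", 5, 5)
    + make_rows("condition_triggered", 4, 5)
    + make_rows("precedent_constrains", 2, 5)
)

def accuracy_by_task(rows):
    """Return overall accuracy and a per-task accuracy breakdown."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, ok in rows:
        totals[task] += 1
        correct[task] += int(ok)
    overall = sum(correct.values()) / sum(totals.values())
    per_task = {task: correct[task] / totals[task] for task in totals}
    return overall, per_task

overall, per_task = accuracy_by_task(rows)
print(f"overall: {overall:.2f}")
for task, acc in per_task.items():
    print(f"{task}: {acc:.2f}")
```

With these invented counts the overall figure looks respectable, while the precedent task sits far below it. That gap is what a separable metric is meant to expose.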
Scope of the inquiry
The scope is limited to US civil and criminal procedure. Researchers debated whether to include tort law, then discarded it after early readings showed high variance in state-level common law interpretation. In the initial literature review, which ran for about 45 to 62 days, procedural rule application showed close to 12% variance. That was already enough complexity for a controlled study.
The project sits within a Cambridge-origin line of automated reasoning research concerned with proof structure, defeasible logic, and reproducible inference. The point is not to replace legal judgment. It is to test whether a system can represent the legal materials in a way that makes its reasoning inspectable.
Important: A bounded domain is not a weakness here. It is the condition that allows measurement to mean something.
Methodology
Dataset construction
The dataset was constructed from anonymized US court records within the civil and criminal procedure scope. The practical work was slower than the design memo suggested. Each annotation batch required roughly 182 to 217 hours of manual review, because the relevant legal signal often appeared in conditional phrases rather than headline outcomes.
The team first attempted automated extraction using standard natural language processing pipelines. That route was abandoned after the tools failed to capture nested conditional logic in procedural reasoning. A phrase such as filing after a statutory period, unless tolling applies, is easy to read and hard to encode. It carries a rule, an exception, and a timing condition in one compact unit.
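To make the encoding difficulty concrete, the following is a minimal sketch of how such a phrase might be represented as a rule with a trigger and a defeating exception. The class names, fact keys, and day counts are hypothetical; the study does not publish its schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Facts = Dict[str, Any]

@dataclass
class Condition:
    """A named test over the facts of a record (hypothetical structure)."""
    label: str
    test: Callable[[Facts], bool]

@dataclass
class Rule:
    """A rule applies when its trigger holds and no exception defeats it."""
    label: str
    trigger: Condition
    exceptions: List[Condition] = field(default_factory=list)

    def applies(self, facts: Facts) -> bool:
        if not self.trigger.test(facts):
            return False
        return not any(exc.test(facts) for exc in self.exceptions)

# Timing condition: the filing came after the statutory period ran.
late_filing = Condition(
    "filed after statutory period",
    lambda f: f["days_since_accrual"] > f["statutory_period_days"],
)
# Exception: tolling applies (absent key is treated as "no tolling").
tolling = Condition("tolling applies", lambda f: bool(f.get("tolling_applies", False)))

untimely = Rule("filing is untimely", trigger=late_filing, exceptions=[tolling])

facts_late = {"days_since_accrual": 400, "statutory_period_days": 365}
facts_tolled = {**facts_late, "tolling_applies": True}

print(untimely.applies(facts_late))    # True
print(untimely.applies(facts_tolled))  # False
```

The same facts yield opposite results once the exception is present, which is the behaviour a flat keyword pipeline tends to lose.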
Field experience revealed that the annotation protocol needed to separate case facts, procedural posture, rule triggers, exceptions, precedent references, and temporal constraints. The inter-annotator agreement threshold was set at about 94%. Below that point, disagreement signalled that the rule formalisation itself required revision, not merely that annotators needed more training.
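The text does not say which agreement statistic the threshold refers to. As a sketch under that uncertainty, the code below computes raw percent agreement and compares it with the threshold, and also computes Cohen's kappa as a chance-corrected comparison. The labels and function names are illustrative assumptions.

```python
from collections import Counter

# "About 94%" in the text; applying it to raw percent agreement is an assumption.
AGREEMENT_THRESHOLD = 0.94

def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels from two annotators on ten items.
annotator_1 = ["rule"] * 5 + ["exception"] * 5
annotator_2 = ["rule"] * 5 + ["exception"] * 4 + ["rule"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
needs_revision = agreement < AGREEMENT_THRESHOLD
print(f"agreement={agreement:.2f} kappa={kappa:.2f} revise_formalisation={needs_revision}")
```

In this invented batch, one disagreement in ten falls below the threshold, so under the protocol described above the formalisation would be flagged for revision.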
Formalisation and benchmarks
Case facts were converted into machine-readable structures through a staged protocol. First, annotators identified the procedural event. Next, they marked the governing rule or precedent reference. Then they encoded timing, exception, and burden-related features where present. This produced a representation that could be tested for both outcome prediction and reasoning trace consistency.
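The staged protocol can be sketched as a record type whose fields follow the three stages. The field names and example values are hypothetical placeholders, not the study's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class AnnotatedRecord:
    # Stage 1: the procedural event identified by the annotator.
    procedural_event: str
    # Stage 2: the governing rule or precedent reference.
    governing_authority: str
    # Stage 3: features encoded only where present in the record.
    timing: Optional[str] = None
    exception: Optional[str] = None
    burden: Optional[str] = None

    def missing_features(self) -> List[str]:
        """Names of stage-3 features that the record does not supply."""
        optional = ("timing", "exception", "burden")
        return [name for name in optional if getattr(self, name) is None]

# Hypothetical example record.
record = AnnotatedRecord(
    procedural_event="motion to dismiss",
    governing_authority="hypothetical rule reference",
    timing="filed within response period",
)

serialised = json.dumps(asdict(record), sort_keys=True)
print(serialised)
print(record.missing_features())  # ['exception', 'burden']
```

Keeping absent features explicit, rather than omitting them, lets a downstream evaluator distinguish "no exception applies" from "the record does not say".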
The benchmark sheet used three main measurement categories:
- Precision: share of precedent citations that are correct, so that false positive citations count against the score, with a target threshold above 91%.
- Recall: capture rate of mandatory procedural steps, with a target threshold around 89%.
- Calibration: alignment between predicted confidence and observed procedural outcomes.
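The first two categories can be sketched with set arithmetic. Precision is computed here in the standard way, as the share of cited precedents that are relevant, so each false positive citation lowers it. The case and step identifiers are invented, and the numeric targets are taken loosely from the list above.

```python
PRECISION_TARGET = 0.91  # "above 91%" in the benchmark sheet
RECALL_TARGET = 0.89     # "around 89%" in the benchmark sheet

def precision(cited, relevant):
    """Share of cited precedents that are actually relevant."""
    cited, relevant = set(cited), set(relevant)
    if not cited:
        return 0.0
    return len(cited & relevant) / len(cited)

def recall(captured, mandatory):
    """Share of mandatory procedural steps that the system captured."""
    captured, mandatory = set(captured), set(mandatory)
    if not mandatory:
        return 1.0
    return len(captured & mandatory) / len(mandatory)

# Hypothetical identifiers for illustration.
cited_cases = {"case_a", "case_b", "case_c", "case_d"}
controlling_cases = {"case_a", "case_b", "case_c"}
mandatory_steps = {"service", "answer", "notice", "hearing"}
captured_steps = {"service", "answer", "notice"}

p = precision(cited_cases, controlling_cases)
r = recall(captured_steps, mandatory_steps)
print(f"precision={p:.2f} meets_target={p > PRECISION_TARGET}")
print(f"recall={r:.2f} meets_target={r >= RECALL_TARGET}")
```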
The calibration measure mattered because a legally useful system should not merely be right by accident. It should express lower confidence when the rule structure is incomplete, ambiguous, or dependent on facts that the record does not settle. That said, the formalisation protocols degrade significantly when applied to appellate cases featuring split decisions or dissenting opinions that introduce competing legal frameworks.
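The study does not name its calibration statistic. One common choice is expected calibration error with equal-width confidence bins, sketched below as an assumption rather than as the benchmark's actual measure. The confidences and outcomes are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1]
    correct: 1 where the prediction matched the observed outcome, else 0
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: high confidence that is right three times in four,
# and moderate confidence that is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE={ece:.3f}")
```

A lower value means stated confidence tracks observed outcomes more closely. A system that is confidently wrong on incomplete rule structures would show up as a large gap in the high-confidence bins.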
Field Note: The hardest records were not the longest ones. They were short orders using familiar procedural vocabulary in an unusual posture.
Key Findings
Where representation changed performance
The clearest performance difference appeared when temporal reasoning was represented as its own legal dimension rather than folded into general fact description. In the team's testing, systems that carried timing as a structured component resolved temporal logic around 40% better during continuous benchmarking periods lasting 3 to 5 weeks.
This finding is unsurprising to procedural lawyers, but it is important for system design. Deadlines do not behave like ordinary facts. A missed filing date, a tolling event, and a jurisdiction-specific deadline rule can reverse the expected result even when the factual narrative looks stable.
Context-dependent variation was especially visible where jurisdictions differed between strict statutory deadlines and equitable tolling principles. In strict-deadline settings, the temporal feature often dominated. In tolling settings, the system needed to represent exceptions and judicial discretion with more care.
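The contrast between strict-deadline and tolling settings can be sketched with simple date arithmetic. This is a deliberately simplified model: it counts calendar days, assumes tolling intervals do not overlap, and treats tolling as a plain extension of the deadline. Real counting rules vary by jurisdiction, and the function names and dates are hypothetical.

```python
from datetime import date, timedelta

def effective_deadline(trigger_date, period_days, tolling_intervals=(), allow_tolling=True):
    """Trigger date plus the period, extended by tolled days where tolling is allowed."""
    deadline = trigger_date + timedelta(days=period_days)
    if allow_tolling:
        # Assumes non-overlapping (start, end) intervals.
        tolled_days = sum((end - start).days for start, end in tolling_intervals)
        deadline += timedelta(days=tolled_days)
    return deadline

def is_timely(filing_date, deadline):
    """A filing on or before the deadline counts as timely in this sketch."""
    return filing_date <= deadline

trigger = date(2023, 1, 1)
tolling = [(date(2023, 1, 10), date(2023, 1, 20))]  # ten tolled days
filing = date(2023, 2, 5)

strict = effective_deadline(trigger, 30, tolling, allow_tolling=False)
tolled = effective_deadline(trigger, 30, tolling, allow_tolling=True)

print(strict, is_timely(filing, strict))  # 2023-01-31 False
print(tolled, is_timely(filing, tolled))  # 2023-02-10 True
```

The factual narrative is identical in both runs. Only the regime flag changes, and the outcome reverses, which is why timing is worth carrying as a structured component.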
Precedent weighting and granularity
The evaluators decided against using standard F1 scores as the sole metric. Group feedback indicates that F1 masked false positives in precedent weighting, particularly where the system cited procedurally similar cases that did not actually control the dispute.
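A worked example shows how F1 can mask the problem. The counts below are invented: two hypothetical systems reach the same F1 score, yet one of them produces nine times as many false positive citations.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts over the same 90 truly relevant precedents.
system_a = precision_recall_f1(tp=54, fp=6, fn=36)   # cautious citer
system_b = precision_recall_f1(tp=81, fp=54, fn=9)   # over-citer

for name, (p, r, f1) in (("A", system_a), ("B", system_b)):
    print(f"system {name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

System A has precision 0.90 and recall 0.60; system B has precision 0.60 and recall 0.90. Both have an F1 of 0.72, so a report that listed F1 alone would treat them as equivalent even though B cites 54 non-controlling cases against A's 6.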
Granularity changed reliability. Coarse representations could often predict the broad outcome, but they were more likely to confuse a procedural dismissal with a decision on the merits when overlapping terminology appeared in the case syllabus. More detailed structures handled that failure case better because they preserved the distinction between posture, issue, and legal effect.
This is where the study becomes less glamorous and more useful. The improvement did not come from a larger vocabulary. It came from forcing the system to distinguish legal roles that human readers often infer silently.
- Temporal reasoning improved when deadlines, tolling events, and sequence were encoded separately.
- Precedent weighting improved when citation relevance was evaluated against procedural posture, not surface similarity alone.
- Outcome reliability improved when the system retained fine-grained distinctions between fact, rule, exception, and holding.
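The second point can be sketched as a weighting function that discounts surface similarity when the procedural posture differs. The scheme, the penalty value, and the case identifiers are illustrative assumptions; the study does not specify its weighting formula.

```python
def precedent_weight(surface_similarity, same_posture, mismatch_penalty=0.25):
    """Scale surface similarity down when procedural posture differs.

    mismatch_penalty is an arbitrary illustrative value, not a study parameter.
    """
    if not 0.0 <= surface_similarity <= 1.0:
        raise ValueError("surface_similarity must lie in [0, 1]")
    if same_posture:
        return surface_similarity
    return surface_similarity * mismatch_penalty

def rank_precedents(candidates, dispute_posture):
    """Rank (case_id, similarity, posture) tuples by posture-aware weight."""
    weighted = [
        (case_id, precedent_weight(similarity, posture == dispute_posture))
        for case_id, similarity, posture in candidates
    ]
    return sorted(weighted, key=lambda item: item[1], reverse=True)

# Hypothetical candidates: case_x reads more similarly but arose in a different posture.
candidates = [
    ("case_x", 0.92, "summary_judgment"),
    ("case_y", 0.70, "motion_to_dismiss"),
]
ranking = rank_precedents(candidates, dispute_posture="motion_to_dismiss")
print([case_id for case_id, _ in ranking])  # ['case_y', 'case_x']
```

Ranking by surface similarity alone would put case_x first. Under the posture-aware weight the procedurally matching case leads.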
Limitations
Jurisdiction and discretion
The dataset is jurisdiction-specific. Its US civil and criminal procedure focus supports cleaner evaluation, but it also limits transfer. A model that handles federal procedural timing well may still struggle when a state doctrine shifts the practical weight of an exception.
Discretionary judicial factors remain difficult to formalise. The authors considered proposing a universal patch for those factors, then discarded the idea after recognising that latent variables in sentencing guidelines could not be mathematically captured with sufficient fidelity. That restraint matters. Some legal judgments depend on institutional context, not just rule syntax.
When applied to novel legal questions, reliability dropped by about 22%. The projected future validation work spans 6 to 8 months, with attention to whether the existing dimensions can be extended without erasing legally meaningful uncertainty.
Boundary of current formalisation
The current approach works best where the procedural issue has a stable rule structure and the record supplies enough facts to test conditions. It is weaker where the legal question itself is new, where appellate opinions split on rationale, or where dissenting opinions introduce rival frameworks that later courts may adopt.
That does not make the work merely academic. It identifies where automated legal reasoning can be audited: temporal logic, precedent weighting, procedural posture, and calibrated confidence. It also identifies where human legal judgment remains central.
The practical lesson is modest but firm. Objective dimensions of legal knowledge are possible, provided the domain is bounded, the annotation protocol is strict, and the evaluation does not let a single aggregate score hide the structure of the legal error.


