Why Paired Evaluation Beats Before/After Benchmarks
A model-edit benchmark number is only as strong as the comparison behind it. Pairing makes the comparison inspectable.
Research Note: comparison quality is part of the result
Highlights
- A before/after benchmark delta is weak if the two numbers were not measured on the same windows.
- InvarLock treats pairing as a contract, not a reporting preference.
- The report matters because it exposes whether the comparison was exact, non-overlapping, and reproducible.
A model-edit benchmark can look precise while still being structurally weak. The usual failure mode is simple: a reader sees one baseline score, one edited-model score, and a difference between them, but the underlying comparison was not held fixed tightly enough to justify that difference.
That is why pairing matters.
In InvarLock, the point is not only to compare a baseline and an edited subject. It is to compare them on the same deterministic windows, with overlap and count checks surfaced in the report and enforced by the verifier. That turns "before/after" from a loose benchmark habit into a constrained measurement procedure.
The Problem With Naive Before/After Numbers
The phrase "before and after" sounds stronger than it is.
If the baseline and edited runs do not reuse the same windows, then some of the observed difference may come from the schedule itself rather than from the edit. Different slices of text, different coverage, silent overlap, or mismatched counts can all contaminate the comparison. The resulting number may still be interesting, but it is no longer the clean statement many readers assume it is.
This is exactly why InvarLock's public docs treat pairing as part of the method surface. The README does not describe evaluation as "run two benchmarks and compare them later." It describes evaluation as a paired comparison against a fixed baseline with deterministic windows and explicit guard contracts.
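The contamination argument above can be made concrete with a toy simulation. This is an illustrative sketch, not InvarLock code: the window-difficulty model, the `loss` function, and the edit-effect constant are all invented for the example. The point is that when the edited run reuses the baseline's exact window IDs, per-window difficulty cancels out of the delta; when the two runs sample different windows, the schedule itself leaks into the number.

```python
import random

random.seed(0)

# Hypothetical per-window difficulty: some text windows are simply harder.
difficulty = [random.gauss(0.0, 0.5) for _ in range(1000)]

EDIT_EFFECT = 0.10  # the true regression introduced by the (simulated) edit

def loss(window_id, edited):
    """Toy per-window loss: base difficulty plus the edit effect."""
    base = 4.0 + difficulty[window_id]
    return base + (EDIT_EFFECT if edited else 0.0)

# Naive before/after: baseline and edited runs draw different window schedules.
naive_base = [loss(w, False) for w in random.sample(range(1000), 200)]
naive_edit = [loss(w, True) for w in random.sample(range(1000), 200)]
naive_delta = sum(naive_edit) / len(naive_edit) - sum(naive_base) / len(naive_base)

# Paired: the edited run reuses the baseline's exact window IDs,
# so per-window difficulty cancels and only the edit effect remains.
windows = random.sample(range(1000), 200)
paired_delta = sum(loss(w, True) - loss(w, False) for w in windows) / len(windows)

print(f"paired delta: {paired_delta:.4f}")
print(f"naive  delta: {naive_delta:.4f}")  # contaminated by schedule differences
```

Under this toy model the paired delta recovers the edit effect almost exactly, while the naive delta carries an extra term from the difficulty gap between the two sampled schedules.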
What Pairing Changes
Pairing does three important things.
First, it holds the measurement surface fixed. The edited subject is evaluated on the same preview and final windows as the baseline. That removes one large class of ambiguity from the comparison.
Second, it makes the primary metric math interpretable. The assurance case ties paired primary metrics to log-space comparison and bootstrap confidence intervals. That is a stronger setup than treating two aggregate benchmark numbers as if they were automatically comparable.
Third, it makes the comparison auditable. InvarLock does not merely say that pairing happened. It records pairing statistics, counts, and overlap behavior in the report surface so another person can inspect them.
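The second point, log-space comparison with bootstrap confidence intervals over paired windows, can be sketched as follows. The synthetic losses and the `bootstrap_ci` helper are assumptions for illustration; they show the shape of the math (per-window deltas of log-losses, percentile bootstrap over the mean), not InvarLock's actual implementation.

```python
import random

random.seed(7)

# Hypothetical paired per-window log-losses for baseline and edited subject.
baseline = [2.0 + random.gauss(0.0, 0.3) for _ in range(400)]
edited = [b + random.gauss(0.05, 0.05) for b in baseline]  # small regression

# Paired comparison in log space: one delta per shared window.
deltas = [e - b for e, b in zip(edited, baseline)]

def bootstrap_ci(xs, iters=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of paired deltas."""
    n = len(xs)
    means = sorted(
        sum(random.choice(xs) for _ in range(n)) / n for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

point = sum(deltas) / len(deltas)
lo, hi = bootstrap_ci(deltas)
print(f"mean log-loss delta {point:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Because the deltas are computed per shared window before aggregation, the interval describes the edit's effect rather than the difference between two independently scheduled benchmark runs.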
Why InvarLock Treats Pairing As A Runtime Contract
The most important thing in the pairing docs is not the definition. It is the fact that pairing is enforced.
The public coverage-and-pairing note is explicit: valid schedules use fixed seeds, non-overlapping windows, and exact reuse of baseline window IDs for edited runs. CI and Release profiles fail closed if pairing is insufficient, overlap is present, or window counts do not meet the tier floor.
The datasets reference makes the same posture concrete through invariants. window_match_fraction must be 1.0. window_overlap_fraction must be 0.0. paired_windows must be greater than zero. Missing or invalid baseline evidence is not supposed to produce a hand-wavy comparison. It is supposed to stop the stronger claim from being made.
That distinction matters. A serious evaluation system should not rely on author intent to keep comparisons clean. It should make the bad comparison harder to ship.
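A fail-closed check over those invariants might look like the sketch below. The field names (`window_match_fraction`, `window_overlap_fraction`, `paired_windows`) come from the datasets reference; the function, exception type, and stats dictionary are hypothetical scaffolding, not InvarLock's API.

```python
# Sketch of fail-closed pairing invariants, after the datasets-reference
# fields. check_pairing and PairingContractError are illustrative names.

class PairingContractError(Exception):
    pass

def check_pairing(stats: dict, tier_floor: int) -> None:
    """Raise (fail closed) rather than downgrade to a weaker comparison."""
    if stats.get("window_match_fraction") != 1.0:
        raise PairingContractError("edited run must reuse baseline window IDs exactly")
    if stats.get("window_overlap_fraction") != 0.0:
        raise PairingContractError("windows must be non-overlapping")
    if stats.get("paired_windows", 0) <= 0:
        raise PairingContractError("no paired windows; baseline evidence missing or invalid")
    if stats["paired_windows"] < tier_floor:
        raise PairingContractError("window count below tier floor")

ok = {
    "window_match_fraction": 1.0,
    "window_overlap_fraction": 0.0,
    "paired_windows": 200,
}
check_pairing(ok, tier_floor=100)  # passes silently

bad = dict(ok, window_overlap_fraction=0.02)
try:
    check_pairing(bad, tier_floor=100)
except PairingContractError as e:
    print("blocked:", e)
```

The design choice worth noting is that every branch raises instead of emitting a warning: a violated invariant stops the stronger claim rather than annotating it.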
What The Report Makes Visible
The reports reference is useful here because it shows how pairing leaves a trail.
evaluation.report.json carries dataset and window statistics, primary-metric fields, and the validation surface the verifier uses. Pairing is not buried in a side note. It appears in the evidence flow, in the dataset windows stats, and in the verify path that checks pairing, count logic, and ratio math.
That is the real difference between "paired evaluation" as a slogan and pairing as an evidence system. In the first case, readers are asked to trust that the comparison was careful. In the second case, they can inspect whether the schedule actually matched.
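To make "inspectable" concrete, here is what an auditor's read of the pairing trail could look like. The JSON fragment is an invented, simplified stand-in for `evaluation.report.json`; the real report's layout and field nesting may differ, but the checkable surface is the same idea.

```python
import json

# Illustrative report fragment; the real evaluation.report.json schema
# may nest or name these fields differently.
report_text = """
{
  "dataset": {
    "windows": {
      "paired_windows": 200,
      "window_match_fraction": 1.0,
      "window_overlap_fraction": 0.0
    }
  },
  "primary_metric": {"baseline": 2.01, "subject": 2.06}
}
"""

report = json.loads(report_text)
windows = report["dataset"]["windows"]

# The auditor reads the pairing trail directly instead of trusting prose.
audit = {
    "exact_reuse": windows["window_match_fraction"] == 1.0,
    "no_overlap": windows["window_overlap_fraction"] == 0.0,
    "paired": windows["paired_windows"] > 0,
}
print(audit)
```

If any of these booleans is false, the comparison behind the headline delta did not hold its schedule fixed, and a reader can see that without rerunning anything.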
What Pairing Still Does Not Solve
Pairing is strong, but it is not magic.
It does not tell you whether the dataset was a good choice. It does not rescue a tokenizer mismatch. It does not solve task validity, deployment realism, or broader questions about whether the benchmark reflects the behavior you actually care about. Pairing improves the fairness of the comparison. It does not automatically make the comparison globally important.
This is exactly why the argument should stay narrow. The right claim is not "paired evaluation makes benchmarking objective." The right claim is smaller: if you want to reason about regression from a weight edit, exact window reuse is a much stronger comparison surface than a loose before/after number.
Claim Surface
The practical structure is:
- baseline and subject reuse the same deterministic windows
- the report records counts, pairing, and overlap statistics
- the verifier checks those contracts in stronger profiles
- the resulting primary-metric comparison is more interpretable than a naive before/after benchmark
That does not make every result definitive. It does make the comparison more defensible.
Limitations
- This post explains why pairing is methodologically stronger; it does not add a new experiment.
- Pairing does not solve dataset choice or external validity.
- The companion figure carries part of the contrast between exact pairing and naive before/after comparison, but it is still only a method sketch.