Why Paired Evaluation Beats Before/After Benchmarks
A model-edit benchmark number is only as strong as the comparison behind it. Pairing makes the comparison inspectable.
Research Note: comparison quality is part of the result
Highlights
- A before/after benchmark delta is weak if the two numbers were not measured on the same windows.
- InvarLock treats pairing as a contract, not a reporting preference.
- The report matters because it exposes whether the comparison was exact, non-overlapping, and reproducible.
A model-edit benchmark can look precise while still being structurally weak. The usual failure mode is simple: a reader sees one baseline score, one edited-model score, and a difference between them, but the underlying comparison was not held fixed tightly enough to justify that difference.
That is why pairing matters.
In InvarLock, the point is not only to compare a baseline and an edited subject. It is to compare them on the same deterministic windows, with overlap and count checks surfaced in the report and enforced by the verifier. That turns "before/after" from a loose benchmark habit into a constrained measurement procedure.
The Problem With Naive Before/After Numbers
The phrase "before and after" sounds stronger than it is.
If the baseline and edited runs do not reuse the same windows, then some of the observed difference may come from the schedule itself rather than from the edit. Different slices of text, different coverage, silent overlap, or mismatched counts can all contaminate the comparison. The resulting number may still be interesting, but it is no longer the clean statement many readers assume it is.
This is exactly why InvarLock's public docs treat pairing as part of the method surface. The README does not describe evaluation as "run two benchmarks and compare them later." It describes evaluation as a paired comparison against a fixed baseline with deterministic windows and explicit guard contracts.
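The contamination argument above can be made concrete with a toy simulation. This is an illustrative sketch, not InvarLock code: the window-difficulty model, the `loss` function, and the edit-effect constant are all invented for the example. The point is that when the edited run reuses the baseline's exact window IDs, per-window difficulty cancels out of the delta; when the two runs sample different windows, the schedule itself leaks into the number.

```python
import random

random.seed(0)

# Hypothetical per-window difficulty: some text windows are simply harder.
difficulty = [random.gauss(0.0, 0.5) for _ in range(1000)]

EDIT_EFFECT = 0.10  # the true regression introduced by the (simulated) edit

def loss(window_id, edited):
    """Toy per-window loss: base difficulty plus the edit effect."""
    base = 4.0 + difficulty[window_id]
    return base + (EDIT_EFFECT if edited else 0.0)

# Naive before/after: baseline and edited runs draw different window schedules.
naive_base = [loss(w, False) for w in random.sample(range(1000), 200)]
naive_edit = [loss(w, True) for w in random.sample(range(1000), 200)]
naive_delta = sum(naive_edit) / len(naive_edit) - sum(naive_base) / len(naive_base)

# Paired: the edited run reuses the baseline's exact window IDs,
# so per-window difficulty cancels and only the edit effect remains.
windows = random.sample(range(1000), 200)
paired_delta = sum(loss(w, True) - loss(w, False) for w in windows) / len(windows)

print(f"paired delta: {paired_delta:.4f}")
print(f"naive  delta: {naive_delta:.4f}")  # contaminated by schedule differences
```

Under this toy model the paired delta recovers the edit effect almost exactly, while the naive delta carries an extra term from the difficulty gap between the two sampled schedules.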
What Pairing Changes
Pairing does three important things.
First, it holds the measurement surface fixed. The edited subject is evaluated on the same preview and final windows as the baseline. That removes one large class of ambiguity from the comparison.
Second, it makes the primary metric math interpretable. The assurance case ties paired primary metrics to log-space comparison and bootstrap confidence intervals. That is a stronger setup than treating two aggregate benchmark numbers as if they were automatically comparable.
Third, it makes the comparison auditable. InvarLock does not merely say that pairing happened. It records pairing statistics, counts, and overlap behavior in the report surface so another person can inspect them.
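The second point, log-space comparison with bootstrap confidence intervals over paired windows, can be sketched as follows. The synthetic losses and the `bootstrap_ci` helper are assumptions for illustration; they show the shape of the math (per-window deltas of log-losses, percentile bootstrap over the mean), not InvarLock's actual implementation.

```python
import random

random.seed(7)

# Hypothetical paired per-window log-losses for baseline and edited subject.
baseline = [2.0 + random.gauss(0.0, 0.3) for _ in range(400)]
edited = [b + random.gauss(0.05, 0.05) for b in baseline]  # small regression

# Paired comparison in log space: one delta per shared window.
deltas = [e - b for e, b in zip(edited, baseline)]

def bootstrap_ci(xs, iters=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of paired deltas."""
    n = len(xs)
    means = sorted(
        sum(random.choice(xs) for _ in range(n)) / n for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

point = sum(deltas) / len(deltas)
lo, hi = bootstrap_ci(deltas)
print(f"mean log-loss delta {point:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

Because the deltas are computed per shared window before aggregation, the interval describes the edit's effect rather than the difference between two independently scheduled benchmark runs.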
Why InvarLock Treats Pairing As A Runtime Contract
The most important thing in the pairing docs is not the definition. It is the fact that pairing is enforced.
The public coverage-and-pairing note is explicit: valid schedules use fixed seeds, non-overlapping windows, and exact reuse of baseline window IDs for edited runs. CI and Release profiles fail closed if pairing is insufficient, overlap is present, or window counts do not meet the tier floor.
The datasets reference makes the same posture concrete through invariants. window_match_fraction must be 1.0. window_overlap_fraction must be 0.0. paired_windows must be greater than zero. Missing or invalid baseline evidence is not supposed to produce a hand-wavy comparison. It is supposed to stop the stronger claim from being made.
That distinction matters. A serious evaluation system should not rely on author intent to keep comparisons clean. It should make the bad comparison harder to ship.
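A fail-closed check over those invariants might look like the sketch below. The field names (`window_match_fraction`, `window_overlap_fraction`, `paired_windows`) come from the datasets reference; the function, exception type, and stats dictionary are hypothetical scaffolding, not InvarLock's API.

```python
# Sketch of fail-closed pairing invariants, after the datasets-reference
# fields. check_pairing and PairingContractError are illustrative names.

class PairingContractError(Exception):
    pass

def check_pairing(stats: dict, tier_floor: int) -> None:
    """Raise (fail closed) rather than downgrade to a weaker comparison."""
    if stats.get("window_match_fraction") != 1.0:
        raise PairingContractError("edited run must reuse baseline window IDs exactly")
    if stats.get("window_overlap_fraction") != 0.0:
        raise PairingContractError("windows must be non-overlapping")
    if stats.get("paired_windows", 0) <= 0:
        raise PairingContractError("no paired windows; baseline evidence missing or invalid")
    if stats["paired_windows"] < tier_floor:
        raise PairingContractError("window count below tier floor")

ok = {
    "window_match_fraction": 1.0,
    "window_overlap_fraction": 0.0,
    "paired_windows": 200,
}
check_pairing(ok, tier_floor=100)  # passes silently

bad = dict(ok, window_overlap_fraction=0.02)
try:
    check_pairing(bad, tier_floor=100)
except PairingContractError as e:
    print("blocked:", e)
```

The design choice worth noting is that every branch raises instead of emitting a warning: a violated invariant stops the stronger claim rather than annotating it.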
What The Report Makes Visible
The reports reference is useful here because it shows how pairing leaves a trail.
evaluation.report.json carries dataset and window statistics, primary-metric fields, and the validation surface the verifier uses. Pairing is not buried in a side note. It appears in the evidence flow, in the dataset windows stats, and in the verify path that checks pairing, count logic, and ratio math.
That is the real difference between "paired evaluation" as a slogan and pairing as an evidence system. In the first case, readers are asked to trust that the comparison was careful. In the second case, they can inspect whether the schedule actually matched.
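To make "inspectable" concrete, here is what an auditor's read of the pairing trail could look like. The JSON fragment is an invented, simplified stand-in for `evaluation.report.json`; the real report's layout and field nesting may differ, but the checkable surface is the same idea.

```python
import json

# Illustrative report fragment; the real evaluation.report.json schema
# may nest or name these fields differently.
report_text = """
{
  "dataset": {
    "windows": {
      "paired_windows": 200,
      "window_match_fraction": 1.0,
      "window_overlap_fraction": 0.0
    }
  },
  "primary_metric": {"baseline": 2.01, "subject": 2.06}
}
"""

report = json.loads(report_text)
windows = report["dataset"]["windows"]

# The auditor reads the pairing trail directly instead of trusting prose.
audit = {
    "exact_reuse": windows["window_match_fraction"] == 1.0,
    "no_overlap": windows["window_overlap_fraction"] == 0.0,
    "paired": windows["paired_windows"] > 0,
}
print(audit)
```

If any of these booleans is false, the comparison behind the headline delta did not hold its schedule fixed, and a reader can see that without rerunning anything.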
What Pairing Still Does Not Solve
Pairing is strong, but it is not magic.
It does not tell you whether the dataset was a good choice. It does not rescue a tokenizer mismatch. It does not solve task validity, deployment realism, or broader questions about whether the benchmark reflects the behavior you actually care about. Pairing improves the fairness of the comparison. It does not automatically make the comparison globally important.
This is exactly why the argument should stay narrow. The right claim is not "paired evaluation makes benchmarking objective." The right claim is smaller: if you want to reason about regression from a weight edit, exact window reuse is a much stronger comparison surface than a loose before/after number.
Claim Surface
The practical structure is:
- baseline and subject reuse the same deterministic windows
- the report records counts, pairing, and overlap statistics
- the verifier checks those contracts in stronger profiles
- the resulting primary-metric comparison is more interpretable than a naive before/after benchmark
That does not make every result definitive. It does make the comparison more defensible.
Limitations
- This post explains why pairing is methodologically stronger; it does not add a new experiment.
- Pairing does not solve dataset choice or external validity.
- The companion figure carries part of the contrast between exact pairing and naive before/after comparison, but it is still only a method sketch.