Research Note · Evaluation · Assurance

Why Paired Evaluation Beats Before/After Benchmarks

Ink/charcoal doodle: exact paired baseline and subject windows flow into report and verify, while a looser before-and-after lane sits below as weaker evidence.

A model-edit benchmark number is only as strong as the comparison behind it. Pairing makes the comparison inspectable.

5 min read
InvarLock Team

Research Note: comparison quality is part of the result

Highlights

  • A before/after benchmark delta is weak if the two numbers were not measured on the same windows.
  • InvarLock treats pairing as a contract, not a reporting preference.
  • The report matters because it exposes whether the comparison was exact, non-overlapping, and reproducible.

A model-edit benchmark can look precise while still being structurally weak. The usual failure mode is simple: a reader sees one baseline score, one edited-model score, and a difference between them, but the underlying comparison was not held fixed tightly enough to justify that difference.

That is why pairing matters.

In InvarLock, the point is not only to compare a baseline and an edited subject. It is to compare them on the same deterministic windows, with overlap and count checks surfaced in the report and enforced by the verifier. That turns "before/after" from a loose benchmark habit into a constrained measurement procedure.
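The "same deterministic windows" idea can be sketched in a few lines. This is an illustrative toy, not InvarLock's actual scheduler: given a fixed seed and the same parameters, the schedule is reproducible, so an edited run can reuse the baseline's windows exactly.

```python
import random

def schedule_windows(corpus_len: int, window_len: int, n_windows: int, seed: int):
    """Deterministically pick non-overlapping windows from a corpus.

    A minimal sketch of the pairing idea: the same (seed, corpus_len,
    window_len, n_windows) always yields the same schedule, so the edited
    subject can be evaluated on exactly the baseline's windows.
    """
    rng = random.Random(seed)
    # Stride by window_len so sampled windows can never overlap.
    starts = rng.sample(range(0, corpus_len - window_len, window_len), n_windows)
    return sorted((start, start + window_len) for start in starts)

baseline_windows = schedule_windows(corpus_len=100_000, window_len=512, n_windows=20, seed=7)
subject_windows = schedule_windows(corpus_len=100_000, window_len=512, n_windows=20, seed=7)

assert baseline_windows == subject_windows  # exact reuse, by construction
```

The point of the sketch is that exact reuse is a property you can enforce and check, not a habit you hope both runs followed.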

The Problem With Naive Before/After Numbers

The phrase "before and after" sounds stronger than it is.

If the baseline and edited runs do not reuse the same windows, then some of the observed difference may come from the schedule itself rather than from the edit. Different slices of text, different coverage, silent overlap, or mismatched counts can all contaminate the comparison. The resulting number may still be interesting, but it is no longer the clean statement many readers assume it is.

This is exactly why InvarLock's public docs treat pairing as part of the method surface. The README does not describe evaluation as "run two benchmarks and compare them later." It describes evaluation as a paired comparison against a fixed baseline with deterministic windows and explicit guard contracts.

What Pairing Changes

Pairing does three important things.

First, it holds the measurement surface fixed. The edited subject is evaluated on the same preview and final windows as the baseline. That removes one large class of ambiguity from the comparison.

Second, it makes the primary-metric math interpretable. The assurance case ties paired primary metrics to log-space comparison and bootstrap confidence intervals. That is a stronger setup than treating two aggregate benchmark numbers as if they were automatically comparable.
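The log-space-plus-bootstrap setup can be illustrated with a small sketch. This is not InvarLock's implementation; it assumes positive per-window scores (for example, perplexity) measured on identical windows, so each pair shares its window-level noise.

```python
import math
import random

def paired_log_ratio_ci(baseline, subject, n_boot=2000, seed=0, alpha=0.05):
    """Point estimate and bootstrap CI for the mean log-ratio of paired metrics.

    Because the windows are identical, each per-window log-ratio cancels
    window difficulty, isolating the effect of the edit.
    """
    assert len(baseline) == len(subject)
    deltas = [math.log(s) - math.log(b) for b, s in zip(baseline, subject)]
    point = sum(deltas) / len(deltas)

    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(deltas) for _ in deltas]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)
```

If the bootstrap interval excludes zero, the paired evidence supports a real shift; with unpaired aggregates, the same delta could be an artifact of the schedule.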

Third, it makes the comparison auditable. InvarLock does not merely say that pairing happened. It records pairing statistics, counts, and overlap behavior in the report surface so another person can inspect them.

Why InvarLock Treats Pairing As A Runtime Contract

The most important thing in the pairing docs is not the definition. It is the fact that pairing is enforced.

The public coverage-and-pairing note is explicit: valid schedules use fixed seeds, non-overlapping windows, and exact reuse of baseline window IDs for edited runs. CI and Release profiles fail closed if pairing is insufficient, overlap is present, or window counts do not meet the tier floor.

The datasets reference makes the same posture concrete through invariants. window_match_fraction must be 1.0. window_overlap_fraction must be 0.0. paired_windows must be greater than zero. Missing or invalid baseline evidence is not supposed to produce a hand-wavy comparison. It is supposed to stop the stronger claim from being made.
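The fail-closed posture is easy to express as code. The field names below mirror the invariants named in the datasets reference (window_match_fraction, window_overlap_fraction, paired_windows), but the function itself is a sketch, not InvarLock's verifier.

```python
def check_pairing(stats: dict) -> None:
    """Fail closed if pairing evidence is missing or any invariant is violated."""
    required = ("window_match_fraction", "window_overlap_fraction", "paired_windows")
    for key in required:
        if key not in stats:
            raise ValueError(f"missing pairing evidence: {key}")
    if stats["window_match_fraction"] != 1.0:
        raise ValueError("edited run did not reuse baseline windows exactly")
    if stats["window_overlap_fraction"] != 0.0:
        raise ValueError("windows overlap; the schedule is contaminated")
    if stats["paired_windows"] <= 0:
        raise ValueError("no paired windows; the stronger claim cannot be made")

# A clean schedule passes silently; anything else raises instead of degrading.
check_pairing({
    "window_match_fraction": 1.0,
    "window_overlap_fraction": 0.0,
    "paired_windows": 24,
})
```

Raising instead of warning is the whole design point: a missing or violated invariant stops the claim rather than softening it.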

That distinction matters. A serious evaluation system should not rely on author intent to keep comparisons clean. It should make the bad comparison harder to ship.

What The Report Makes Visible

The reports reference is useful here because it shows how pairing leaves a trail.

evaluation.report.json carries dataset and window statistics, primary-metric fields, and the validation surface the verifier uses. Pairing is not buried in a side note. It appears in the evidence flow, in the dataset windows stats, and in the verify path that checks pairing, count logic, and ratio math.
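To make the verify idea concrete, here is a toy recomputation check over a report-shaped dict. The field names (windows, baseline_nll, subject_nll, mean_log_ratio) are invented for illustration and are not InvarLock's actual schema; the point is that per-window evidence lets a reader recheck the aggregate arithmetic.

```python
# Minimal stand-in for the kind of evidence a paired report could carry.
report = {
    "windows": [
        {"id": "w-000", "baseline_nll": 2.31, "subject_nll": 2.35},
        {"id": "w-001", "baseline_nll": 2.18, "subject_nll": 2.20},
    ],
    "primary_metric": {"mean_log_ratio": 0.03},
}

# Recompute the aggregate from per-window evidence and compare to the
# reported value; a mismatch means the headline number is not backed
# by the windows in the report.
recomputed = sum(
    w["subject_nll"] - w["baseline_nll"] for w in report["windows"]
) / len(report["windows"])

if abs(recomputed - report["primary_metric"]["mean_log_ratio"]) > 1e-6:
    raise ValueError("reported ratio does not match per-window evidence")
```

This is the "auditable" property in miniature: the report carries enough evidence that the ratio math can be redone by someone who was not in the room.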

That is the real difference between "paired evaluation" as a slogan and pairing as an evidence system. In the first case, readers are asked to trust that the comparison was careful. In the second case, they can inspect whether the schedule actually matched.

What Pairing Still Does Not Solve

Pairing is strong, but it is not magic.

It does not tell you whether the dataset was a good choice. It does not rescue a tokenizer mismatch. It does not solve task validity, deployment realism, or broader questions about whether the benchmark reflects the behavior you actually care about. Pairing improves the fairness of the comparison. It does not automatically make the comparison globally important.

This is exactly why the argument should stay narrow. The right claim is not "paired evaluation makes benchmarking objective." The right claim is smaller: if you want to reason about regression from a weight edit, exact window reuse is a much stronger comparison surface than a loose before/after number.

Claim Surface

The practical structure is:

  • baseline and subject reuse the same deterministic windows
  • the report records counts, pairing, and overlap statistics
  • the verifier checks those contracts in stronger profiles
  • the resulting primary-metric comparison is more interpretable than a naive before/after benchmark

That does not make every result definitive. It does make the comparison more defensible.

Limitations

  • This post explains why pairing is methodologically stronger; it does not add a new experiment.
  • Pairing does not solve dataset choice or external validity.
  • The companion figure carries part of the contrast between exact pairing and naive before/after comparison, but it is still only a method sketch.
