Coverage & Pairing Plan

Plain language: Use non‑overlapping, paired windows with fixed seeds. Baseline and edited runs reuse the exact same windows. Tier‑based minima are validated at runtime and surfaced in the report.

Overview

AspectDetails
PurposeDefine the pairing, non-overlap, seed, and tier-floor requirements for evaluation windows.
AudienceEvaluation pipeline maintainers, release reviewers, and operators preparing paired evidence.
Contract scopeBaseline/subject window reuse, pairing statistics, coverage floors, and report-verifier checks.
Source of truthsrc/invarlock/core/runner_pairing.py, src/invarlock/eval/window_planning.py, and report pairing tests.

Claim

A valid evaluation schedule uses non‑overlapping, paired windows with fixed seeds and reuses the baseline window IDs for edited runs. The runner enforces tier-based minima. CI/Release runs hard-fail pairing/count drift when a baseline pairing context exists, and report verification rejects invalid pairing.

Window Selection (assumptions)

  • Non‑overlap: set seq_len == stride so windows do not overlap.
  • Deterministic: record and reuse the seed bundle (python, numpy, torch) and bootstrap seed (when applicable).
  • Dedupe: deduplication is allowed for pilots/probes; release evidence uses strict non‑overlap on the full plan.
  • Exact pairing: preview/final counts must match and the edited run must reuse baseline window IDs; mixing schedules invalidates the paired Δlog assumptions.

Pairing Reuse (baseline → edited)

  • The edited run pins windows via the baseline report.
  • report lints pairing and overlap:
    • dataset.windows.stats.window_match_fraction == 1.0
    • dataset.windows.stats.window_overlap_fraction == 0.0
  • CI/Release abort if counts differ, pairing < 1.0, or overlap > 0.0 when a baseline pairing context exists.

Tier Minima (runner defaults)

Sane defaults enforced by the runner per tier (guard-rail floors; profiles may request higher counts):

TierPreview WindowsFinal WindowsBootstrap Replicates
Conservative2202201,500
Balanced1801801,200
Aggressive140140800

These minima are derived from half‑width targets on paired Δlog‑loss (see Tier Policy v1 Calibration). CI/Release profiles treat shortfalls as hard errors; dev flows surface warnings but also record coverage in the generated report bundle.

Runtime Contract (report)

  • Window plan: dataset.windows.stats.{requested_preview,requested_final,actual_preview,actual_final}
  • Pairing/overlap: dataset.windows.stats.{window_match_fraction,window_overlap_fraction,paired_windows}
  • Coverage floors: dataset.windows.stats.coverage.{preview,final} meets/exceeds the window tier floor (profiles may request higher counts)
  • Bootstrap metadata: dataset.windows.stats.bootstrap.{method,alpha,replicates,seed} records the interval method, replicate count, and RNG seed

Observability

  • Pairing and coverage appear in both the Markdown report and the JSON report, enabling auditors to verify schedule integrity.

Assumptions & Scope

  • Applies to evaluation (inference) schedules; training/edit algorithms may alter data flow and are out of scope here.
  • Dataset or tokenizer changes that affect tokenization invalidate recorded pairing schedules.
  • Window pairing must be exact (ID reuse) and non‑overlapping; mixing schedules invalidate paired Δlog assumptions.
  • This plan is calibrated for Linux/macOS environments and the tier profiles documented in Tier Policy v1 Calibration.