Coverage & Pairing Plan
Plain language: Use non‑overlapping, paired windows with fixed seeds. Baseline and edited runs reuse the exact same windows. Tier‑based minima are validated at runtime and surfaced in the report.
Overview
| Aspect | Details |
|---|---|
| Purpose | Define the pairing, non-overlap, seed, and tier-floor requirements for evaluation windows. |
| Audience | Evaluation pipeline maintainers, release reviewers, and operators preparing paired evidence. |
| Contract scope | Baseline/subject window reuse, pairing statistics, coverage floors, and report-verifier checks. |
| Source of truth | src/invarlock/core/runner_pairing.py, src/invarlock/eval/window_planning.py, and report pairing tests. |
Claim
A valid evaluation schedule uses non‑overlapping, paired windows with fixed seeds and reuses the baseline window IDs for edited runs. The runner enforces tier-based minima. CI/Release runs hard-fail pairing/count drift when a baseline pairing context exists, and report verification rejects invalid pairing.
Window Selection (assumptions)
- Non‑overlap: set
seq_len == strideso windows do not overlap. - Deterministic: record and reuse the seed bundle (
python,numpy,torch) and bootstrap seed (when applicable). - Dedupe: deduplication is allowed for pilots/probes; release evidence uses strict non‑overlap on the full plan.
- Exact pairing: preview/final counts must match and the edited run must reuse baseline window IDs; mixing schedules invalidates the paired Δlog assumptions.
Pairing Reuse (baseline → edited)
- The edited run pins windows via the baseline report.
- report lints pairing and overlap:
dataset.windows.stats.window_match_fraction == 1.0dataset.windows.stats.window_overlap_fraction == 0.0
- CI/Release abort if counts differ, pairing < 1.0, or overlap > 0.0 when a baseline pairing context exists.
Tier Minima (runner defaults)
Sane defaults enforced by the runner per tier (guard-rail floors; profiles may request higher counts):
| Tier | Preview Windows | Final Windows | Bootstrap Replicates |
|---|---|---|---|
| Conservative | 220 | 220 | 1,500 |
| Balanced | 180 | 180 | 1,200 |
| Aggressive | 140 | 140 | 800 |
These minima are derived from half‑width targets on paired Δlog‑loss (see Tier Policy v1 Calibration). CI/Release profiles treat shortfalls as hard errors; dev flows surface warnings but also record coverage in the generated report bundle.
Runtime Contract (report)
- Window plan:
dataset.windows.stats.{requested_preview,requested_final,actual_preview,actual_final} - Pairing/overlap:
dataset.windows.stats.{window_match_fraction,window_overlap_fraction,paired_windows} - Coverage floors:
dataset.windows.stats.coverage.{preview,final}meets/exceeds the window tier floor (profiles may request higher counts) - Bootstrap metadata:
dataset.windows.stats.bootstrap.{method,alpha,replicates,seed}records the interval method, replicate count, and RNG seed
Observability
- Pairing and coverage appear in both the Markdown report and the JSON report, enabling auditors to verify schedule integrity.
Assumptions & Scope
- Applies to evaluation (inference) schedules; training/edit algorithms may alter data flow and are out of scope here.
- Dataset or tokenizer changes that affect tokenization invalidate recorded pairing schedules.
- Window pairing must be exact (ID reuse) and non‑overlapping; mixing schedules invalidate paired Δlog assumptions.
- This plan is calibrated for Linux/macOS environments and the tier profiles documented in Tier Policy v1 Calibration.