# Compare & evaluate (BYOE)

## Overview
| Aspect | Details |
|---|---|
| Purpose | Evaluate two checkpoints (baseline vs subject) with deterministic pairing. |
| Audience | Teams and researchers with existing model-edit workflows who want paired evaluation without coupling to a specific edit stack. |
| Workflow | Baseline run → Subject run → report with paired windows. |
| Network | Offline by default; use `evaluate --allow-network` when a run needs model downloads. |
| Output | `evaluation.report.json` + `evaluation_report.md` (+ `runtime.manifest.json` for container-backed outputs). |
InvarLock's primary, most stable path is Compare & evaluate (BYOE): you provide the baseline and the subject checkpoints, and InvarLock produces a deterministic report. This avoids coupling to any particular edit stack and keeps your existing tooling intact whether you are validating quantization, pruning, fine-tuning, or other checkpoint-edit workflows.
## TL;DR

- Produce your baseline and edited checkpoints (any external tool).
- Ensure both use the same tokenizer (`invarlock verify` lints the tokenizer hash when present).
- Run `invarlock evaluate --baseline <baseline> --subject <subject> --adapter auto`.

By default, `evaluate` runs inside the runtime container. Use `--execution-mode host`
only for host-side workflows that intentionally run model loading on the
host. If you choose that host-side path, verify the resulting report with
`invarlock verify --runtime-provenance host ...`.
Example (wheel-first, GPT‑2, CPU/MPS friendly; requires `invarlock[hf]` or an equivalent HF extra):

```shell
INVARLOCK_DEDUP_TEXTS=1 invarlock evaluate --allow-network \
  --baseline sshleifer/tiny-gpt2 \
  --subject /path/to/your/edited-model \
  --adapter auto \
  --profile ci \
  --out runs/eval_smoke \
  --report-out reports/eval_smoke
```
Repo maintainers who want a repo-owned preset can replace the flag-only example
above with `--preset configs/presets/...`, but that preset path is not shipped
in wheel installs.
Outputs:

- JSON report: `reports/eval_smoke/evaluation.report.json`
- Markdown report: `reports/eval_smoke/evaluation_report.md`
- Runtime provenance: `reports/eval_smoke/runtime.manifest.json`
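Once a run completes, the JSON report can be inspected programmatically. A minimal sketch, assuming only that `evaluation.report.json` is standard UTF‑8 JSON; the stand-in file and its keys below are made up for the demo and are not the real report schema:

```python
import json
import os
import tempfile

def load_report(path):
    """Load an emitted JSON report from disk (plain JSON, UTF-8)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)

# Demo with a stand-in file; in practice, point load_report at
# reports/eval_smoke/evaluation.report.json.
with tempfile.TemporaryDirectory() as tmp:
    fake_path = os.path.join(tmp, "evaluation.report.json")
    with open(fake_path, "w", encoding="utf-8") as fh:
        json.dump({"edit": {"name": "noop"}, "profile": "ci"}, fh)
    report = load_report(fake_path)
    print(sorted(report))  # top-level keys of the stand-in report
```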
## Reuse a baseline report (skip baseline evaluation)

When evaluating many subjects against the same baseline, you can reuse a single
baseline `report.json` file and skip the baseline evaluation portion by passing
`--baseline-report` with the exact emitted report path.
Requirements:

- The baseline report must be from a no-op run (`edit.name == "noop"`).
- The baseline report must include stored evaluation windows (set `INVARLOCK_STORE_EVAL_WINDOWS=1` when generating it).
- The baseline report must match the intended `--profile`, `--tier`, and adapter family.
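The requirements above can be sketched as a small pre-flight function. The field names used here (`edit.name`, `eval_windows`, `profile`, `tier`) are assumptions for illustration only; consult the actual report schema before relying on them:

```python
def baseline_reusable(report, profile, tier):
    """Return (ok, reason) for reusing `report` as a stored baseline.

    Field names are illustrative assumptions, not the real report schema.
    """
    if report.get("edit", {}).get("name") != "noop":
        return False, "baseline must come from a no-op run"
    if not report.get("eval_windows"):
        return False, "baseline lacks stored evaluation windows"
    if report.get("profile") != profile or report.get("tier") != tier:
        return False, "profile/tier mismatch"
    return True, "ok"

candidate = {
    "edit": {"name": "noop"},
    "eval_windows": [{"id": 0}],
    "profile": "ci",
    "tier": "balanced",
}
ok, reason = baseline_reusable(candidate, "ci", "balanced")
print(ok, reason)
```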
Example:

```shell
# 1) Produce a reusable baseline report once
#    (writes runs/baseline_once/source/<timestamp>/report.json)
INVARLOCK_STORE_EVAL_WINDOWS=1 INVARLOCK_DEDUP_TEXTS=1 invarlock evaluate --allow-network \
  --baseline sshleifer/tiny-gpt2 \
  --subject sshleifer/tiny-gpt2 \
  --adapter auto \
  --profile ci \
  --tier balanced \
  --out runs/baseline_once \
  --report-out reports/eval_baseline_once

# 2) Reuse it for many subjects (skips baseline evaluation)
#    Use the exact report path from step 1, e.g. runs/baseline_once/source/<timestamp>/report.json
INVARLOCK_DEDUP_TEXTS=1 invarlock evaluate --allow-network \
  --baseline-report runs/baseline_once/source/<timestamp>/report.json \
  --baseline sshleifer/tiny-gpt2 \
  --subject /path/to/your/edited-model \
  --adapter auto \
  --profile ci \
  --tier balanced \
  --out runs/eval_subject_1 \
  --report-out reports/eval_subject_1
```
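When scripting many subjects, step 2 can be wrapped in a helper that builds the command line per subject. A sketch using only the flags shown above; the paths are placeholders, the command is constructed but not executed, and you must substitute the real `<timestamp>` directory from step 1:

```python
import shlex

def build_eval_cmd(subject, baseline_report, out_dir, report_dir):
    """Build the evaluate command for one subject, reusing a stored baseline."""
    return [
        "invarlock", "evaluate", "--allow-network",
        "--baseline-report", baseline_report,
        "--baseline", "sshleifer/tiny-gpt2",
        "--subject", subject,
        "--adapter", "auto",
        "--profile", "ci",
        "--tier", "balanced",
        "--out", out_dir,
        "--report-out", report_dir,
    ]

cmd = build_eval_cmd(
    subject="/path/to/your/edited-model",
    # Placeholder: use the exact report path emitted by step 1.
    baseline_report="runs/baseline_once/source/TIMESTAMP/report.json",
    out_dir="runs/eval_subject_1",
    report_dir="reports/eval_subject_1",
)
print(shlex.join(cmd))
```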
## Pairing invariants

InvarLock pairs windows from the baseline run and the edited run. For comparability:

- Sequence length and stride must match.
- Window counts (preview/final) must match.
- The tokenizer hash should match; the `verify` command fails if both hashes are present and differ.

Use the same dataset/evaluation configuration on both sides, whether that means
repeating the same explicit flags or reusing the same preset (`--preset`), and
keep `seq_len = stride` for deterministic non-overlapping windows.
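The window-count invariant follows from simple arithmetic: for a fixed token stream, the number of windows is fully determined by `seq_len` and `stride`. A sketch of the standard sliding-window count (not InvarLock's internal code):

```python
def window_count(n_tokens, seq_len, stride):
    """Number of full windows a token stream yields under a sliding window."""
    if n_tokens < seq_len:
        return 0
    return (n_tokens - seq_len) // stride + 1

# Non-overlapping case (seq_len == stride): 1024 tokens tile into 8 windows.
print(window_count(1024, 128, 128))  # → 8
# Overlapping windows (stride < seq_len) change the count, another reason
# to keep seq_len and stride identical on both sides.
print(window_count(1024, 128, 64))   # → 15
```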
## Why Compare & evaluate?
- Stable: your edit stack remains yours; InvarLock focuses on gates and evidence.
- Portable: reports are self-contained artifacts with provenance.
- Low maintenance: you can update your edit tools without waiting for InvarLock updates.
## When to use built-in edits

InvarLock ships a single built-in edit tuned for portable smokes:

- `quant_rtn` — 8‑bit (attention‑only mode available), CPU/MPS friendly

Use it for quick local checks. For production edits, prefer Compare & evaluate (BYOE).
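For intuition, round-to-nearest (RTN) quantization maps each weight to the nearest representable 8‑bit level under a symmetric scale. A generic sketch of that idea only; InvarLock's actual `quant_rtn` implementation, including its attention-only mode, is not shown here:

```python
def rtn_quantize(weights, bits=8):
    """Generic round-to-nearest quantization with a symmetric per-list scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    peak = max((abs(w) for w in weights), default=0.0)
    if peak == 0.0:
        return [0] * len(weights), 1.0
    scale = peak / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def rtn_dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003]
q, scale = rtn_quantize(weights)
restored = rtn_dequantize(q, scale)
# Each restored weight is within one quantization step of the original.
```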
## Determinism & pairing

InvarLock pairs windows deterministically between baseline and subject runs. This ensures reproducible ratios and CI across machines and re‑runs. Keep these in mind:

- Match `seq_len` and `stride` between runs; prefer `seq_len = stride` for non‑overlapping windows.
- Keep `preview_n` and `final_n` equal across baseline and subject.
- Use the same tokenizer; `invarlock verify` lints tokenizer hash mismatches when present.

Determinism, pairing math, provenance, and runtime provenance are surfaced in
reports and `runtime.manifest.json` and revalidated by `invarlock verify`.
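These invariants can be pre-checked before a long run with a small comparison helper. The flat dict layout here is an assumption for the sketch, not the report schema; a key counts as a mismatch only when it is present on both sides and differs:

```python
PAIRED_KEYS = ("seq_len", "stride", "preview_n", "final_n", "tokenizer_hash")

def pairing_mismatches(baseline, subject):
    """Return keys present on both sides that disagree."""
    return [
        key
        for key in PAIRED_KEYS
        if baseline.get(key) is not None
        and subject.get(key) is not None
        and baseline.get(key) != subject.get(key)
    ]

base = {"seq_len": 128, "stride": 128, "preview_n": 4, "final_n": 16}
subj = {"seq_len": 128, "stride": 64, "preview_n": 4, "final_n": 16}
print(pairing_mismatches(base, subj))  # → ['stride']
```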
## Related Documentation

- CLI Reference — Full `evaluate` command options
- Reading a report — Understanding output reports
- Coverage & Pairing (Assurance) — Window pairing invariants
- Determinism Contracts (Assurance) — Seed and reproducibility guarantees