Token-weighted paired statistics and stricter release gates
Token-weighted paired bootstrap lands across the pipeline, strictness toggles expand, and CI/release pairing expectations become explicit and enforceable.
Release: InvarLock 0.3.3 - Paired bootstrap, strictness toggles, and clearer failures
Highlights
- Token-weighted paired Δlog-loss bootstrap support (core + primary metric + variance guard).
- Window pairing enforcement becomes more explicit (overlap/duplicates/mismatch detection).
- Strictness toggles and report metadata improvements for clearer evaluation outcomes.
0.3.3 tightens the statistical backbone of paired evaluation. The paired Δlog-loss bootstrap work is more than a “numbers” change: it makes drift conclusions more faithful to what was actually evaluated (token-weighted and paired, not loosely aggregated).
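To make "token-weighted and paired" concrete, here is a minimal sketch of the general technique in plain Python. It is not InvarLock's implementation: the function name `paired_bootstrap_delta`, its parameters, and the percentile interval are assumptions chosen for illustration.

```python
import random


def paired_bootstrap_delta(baseline, edited, tokens,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Token-weighted paired bootstrap for mean delta log-loss (edited - baseline).

    Hypothetical helper, for illustration only.
    baseline, edited: per-window mean log-loss, listed in the same window order.
    tokens: number of scored tokens in each window (used as weights).
    Returns (point_estimate, ci_low, ci_high) using a percentile interval.
    """
    if not (len(baseline) == len(edited) == len(tokens)):
        raise ValueError("windows must be paired one-to-one")
    if not tokens or any(t <= 0 for t in tokens):
        raise ValueError("every window needs a positive token count")

    n = len(tokens)
    # Pairing: take the difference per window before any aggregation.
    deltas = [e - b for b, e in zip(baseline, edited)]

    def weighted_mean(indices):
        # Token weighting: longer windows contribute proportionally more.
        total = sum(tokens[i] for i in indices)
        return sum(tokens[i] * deltas[i] for i in indices) / total

    point = weighted_mean(range(n))

    # Resample whole windows with replacement, so each delta keeps its weight.
    rng = random.Random(seed)
    stats = sorted(
        weighted_mean([rng.randrange(n) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return point, stats[lo_idx], stats[hi_idx]


point, low, high = paired_bootstrap_delta([1.0, 2.0], [1.1, 2.4], [300, 100])
```

In this toy input the per-window deltas are 0.1 and 0.4 with 300 and 100 tokens, so the weighted point estimate is 0.175, whereas an unweighted mean over windows would give 0.25. That gap is what a loose aggregate hides.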
It also makes CI/release expectations blunt and explicit: perfect pairing, non-overlapping windows, and coverage floors are no longer “best effort”; they are enforced. That is a theme in this release: fewer fuzzy edges, more things you can confidently point to.
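As a rough illustration of what enforcing these expectations can look like, the sketch below checks two runs for duplicate windows, overlapping windows, mismatched window sets, and a coverage floor. The function `check_pairing`, the `(window_id, start, end)` layout, and the `min_coverage` parameter are all hypothetical and do not reflect InvarLock's actual data model or API.

```python
def check_pairing(baseline, edited, min_coverage=1.0):
    """Collect pairing problems between two runs; an empty list means the gate passes.

    Hypothetical layout, for illustration only: each run is a list of
    (window_id, start, end) tuples with half-open token spans [start, end).
    """
    problems = []

    for name, windows in (("baseline", baseline), ("edited", edited)):
        ids = [w[0] for w in windows]
        if len(ids) != len(set(ids)):
            problems.append(f"{name}: duplicate window ids")

        # Sort by start; a window overlaps if it begins before the furthest end so far.
        furthest_end = None
        for _, start, end in sorted(windows, key=lambda w: (w[1], w[2])):
            if furthest_end is not None and start < furthest_end:
                problems.append(f"{name}: overlapping windows")
                break
            furthest_end = end if furthest_end is None else max(furthest_end, end)

    base_ids = {w[0] for w in baseline}
    edit_ids = {w[0] for w in edited}
    if base_ids != edit_ids:
        problems.append("mismatch: runs do not cover the same window ids")

    # Coverage: fraction of baseline windows that have a partner in the edited run.
    coverage = len(base_ids & edit_ids) / len(base_ids) if base_ids else 0.0
    if coverage < min_coverage:
        problems.append(f"coverage {coverage:.2f} below floor {min_coverage:.2f}")

    return problems


windows = [("w0", 0, 128), ("w1", 128, 256)]
problems = check_pairing(windows, windows)  # identical, clean runs: no problems
```

A strict gate would treat any non-empty result as a hard failure rather than a warning, which is the difference between "best effort" and enforced.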
And when things do go wrong, reports carry better context (including evaluation soft-fail metadata), which helps turn failures into something you can diagnose instead of something you just re-run blindly.
For the immutable release record, read the tagged CHANGELOG.md for v0.3.3.