Research Note · Calibration · Variance
Variance Enablement Should Be Evidence-Gated
Variance equalization is stronger when it must earn enablement through predictive evidence, explicit tier knobs, and report-visible provenance.
Research Note: a correction path should have to earn its own activation
Highlights
- Variance equalization (VE) is not presented publicly as an always-on correction. It is gated.
- The gate depends on predictive paired evidence, tier-sidedness, and a declared minimum effect.
- The result is visible in reports and can be carried forward as policy rather than intuition.
Many adaptive mechanisms sound better in prose than they look under review. A correction path is described as "smart," "dynamic," or "self-tuning," and the actual enablement rule disappears behind that language.
InvarLock's public variance-equalization surface does the opposite: it states the enablement rule in the open.
The assurance note for VE says plainly that VE should only propose scales when predictive paired evidence clears a tier-specific gate. That means the interesting question is not whether VE exists. It is whether VE has earned the right to turn on.
What The Predictive Gate Is Actually Testing
The VE note is concrete about the contract. Predictive A/B runs compare an edited model without VE against a virtual VE path on the same windows, then inspect paired Delta logNLL behavior under the tier's sidedness rule.
That is a much narrower and better claim than "VE tends to help." The gate is testing whether there is evidence for improvement on the predictive comparison surface chosen for calibration.
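The paired comparison can be sketched in a few lines. This is an illustrative sketch, not InvarLock's implementation: the function name, the dictionary layout keyed by window ID, and the sign convention (VE minus no-VE, so negative means improvement) are all assumptions made for the example.

```python
from statistics import mean


def paired_deltas(lognll_no_ve, lognll_virtual_ve):
    """Per-window Delta logNLL, computed as virtual-VE minus no-VE.

    Both arguments map a window ID to that window's logNLL (hypothetical layout).
    Assumed sign convention: a negative delta means VE improved the window.
    """
    # Pairing only makes sense if both arms saw exactly the same windows.
    if lognll_no_ve.keys() != lognll_virtual_ve.keys():
        raise ValueError("both arms must cover the same windows")
    return {
        window: lognll_virtual_ve[window] - lognll_no_ve[window]
        for window in sorted(lognll_no_ve)
    }


# Toy numbers, invented for illustration only.
no_ve = {"w0": 2.0, "w1": 3.0, "w2": 2.5}
virtual_ve = {"w0": 1.5, "w1": 2.5, "w2": 2.5}

deltas = paired_deltas(no_ve, virtual_ve)
mean_delta = mean(list(deltas.values()))
```

The sketch refuses to compute anything when the two arms disagree on windows, since an unpaired comparison would answer a different question than the one the gate asks.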
The calibration CLI reference reinforces the same story operationally. VE sweeps exist to recommend min_effect_lognll values, not to bless VE as a permanent default.
Why Sidedness And Minimum Effect Matter
The most important detail in the public note is that enablement does not follow from just any movement in the desired direction. The mean effect has to clear min_effect_lognll, and the predictive interval has to clear it under the tier's sidedness rule.
That matters because it prevents two common failures.
The first is wishful enablement: a tiny directional improvement gets treated as enough. The second is statistical ambiguity: the interval is still compatible with no meaningful gain, but the mechanism turns on anyway.
The public tier split is concrete here. Balanced uses a one-sided improvement test, while Conservative uses a two-sided interval and still enables only on improvement. Those are different evidence thresholds, not different marketing names.
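A minimal sketch of such a gate follows. It is a guess at the shape of the rule, not the documented algorithm: the normal-approximation interval, the alpha value, the "one"/"two" argument, and the function name ve_gate are assumptions. Only min_effect_lognll and the one-sided versus two-sided split come from the public note. Deltas use an assumed sign convention of VE minus no-VE, so improvement is negative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ve_gate(deltas, min_effect_lognll, sided, alpha=0.05):
    """Return True only when paired evidence clears the minimum effect.

    deltas: per-window Delta logNLL, VE minus no-VE (negative = improvement).
    sided:  "one" for a one-sided bound, "two" for a two-sided interval.
    The interval is a normal approximation, chosen here for simplicity.
    """
    if sided not in ("one", "two"):
        raise ValueError("sided must be 'one' or 'two'")
    n = len(deltas)
    if n < 2:
        return False  # no spread estimate, so no evidence to act on

    m = mean(deltas)
    se = stdev(deltas) / sqrt(n)
    # A two-sided interval splits alpha across both tails, so its upper
    # bound sits further from the mean than the one-sided bound does.
    quantile = 1 - alpha if sided == "one" else 1 - alpha / 2
    upper = m + NormalDist().inv_cdf(quantile) * se

    threshold = -min_effect_lognll
    # Both the mean and the pessimistic end of the interval must clear it.
    return m <= threshold and upper <= threshold


# Toy deltas, invented for illustration: mean improvement of 0.1 with spread.
toy = [-0.2, 0.0] * 8
one_sided_result = ve_gate(toy, min_effect_lognll=0.05, sided="one")
two_sided_result = ve_gate(toy, min_effect_lognll=0.05, sided="two")
tiny_effect_result = ve_gate([-0.001] * 8, min_effect_lognll=0.05, sided="one")
```

On these toy numbers the same evidence passes the one-sided rule and fails the two-sided one, which is the sense in which the two tiers are different evidence thresholds. The last call shows the wishful-enablement case: a consistent but tiny improvement never clears the minimum effect.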
The tier-policy catalog makes the distinction operational. Balanced and Conservative do not merely have different labels. They carry different VE policy knobs, and those knobs show up in resolved_policy.variance.*.
Why Provenance And Report Visibility Matter
The VE note is also explicit that predictive A/B provenance has to travel with the decision. Window identity, seed, and tap alignment are part of the contract.
That is important because an enablement decision without provenance is hard to audit and easy to overstate. If VE stayed on, a serious reader should be able to see why. If it stayed off, the report should make that legible too.
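One way to picture provenance traveling with the decision is a record attached to each arm and checked before the comparison is trusted. Everything named here is hypothetical: the ArmProvenance class, its fields, and the representation of tap alignment as a tuple of tap names are illustrative guesses, not the report schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ArmProvenance:
    """Hypothetical provenance record for one arm of a predictive A/B run."""
    window_ids: Tuple[str, ...]
    seed: int
    taps: Tuple[str, ...]  # assumed stand-in for "tap alignment"


def provenance_mismatches(no_ve: ArmProvenance, virtual_ve: ArmProvenance) -> List[str]:
    """List the provenance fields on which the two arms disagree."""
    problems = []
    if no_ve.window_ids != virtual_ve.window_ids:
        problems.append("window_ids")
    if no_ve.seed != virtual_ve.seed:
        problems.append("seed")
    if no_ve.taps != virtual_ve.taps:
        problems.append("taps")
    return problems


# Toy records, invented for illustration.
arm_a = ArmProvenance(window_ids=("w0", "w1"), seed=7, taps=("tap_a",))
arm_b = ArmProvenance(window_ids=("w0", "w1"), seed=7, taps=("tap_a",))
arm_c = ArmProvenance(window_ids=("w0", "w2"), seed=8, taps=("tap_a",))

aligned = provenance_mismatches(arm_a, arm_b)
misaligned = provenance_mismatches(arm_a, arm_c)
```

An empty list means the arms are comparable on these three fields. A non-empty list names exactly what a reader would need to question before accepting the gate outcome.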
This is where the guards reference matters. VE is not only a calibration-side story. Its gate outcome and resolved policy knobs are observable in the report surface.
Why The Patch Output Matters
Like null sweeps, VE sweeps do not end in commentary. They emit machine-readable outputs and a tiers_patch_variance_ve.yaml recommendation.
That patch is what turns the statistical argument into a reviewable policy change. A reader does not have to trust that someone interpreted the sweep correctly. They can inspect the recommended key, see where it lands in tier policy, and decide whether the proposed enablement boundary deserves to move.
This is the value of evidence-gated VE. It constrains activation through visible policy and visible evidence.
What VE Calibration Still Does Not Prove
The claim should stay narrow.
Public VE calibration does not prove that VE is always beneficial. It does not prove that one pilot window budget transfers everywhere. And it does not eliminate the need to recalibrate when model families, profiles, or evaluation budgets change.
What it does do is force VE to justify enablement through predictive evidence instead of narrative optimism.
Claim Map
The practical path is:
- run predictive A/B sweeps under a declared tier
- measure paired Delta logNLL behavior with sidedness and minimum-effect rules
- emit the tiers patch recommendation
- expose the resulting knobs and gate outcome in report evidence
That is a better story than treating VE as a hidden default.
Limitations
- This post explains the public VE gate and calibration workflow; it does not contribute a fresh sweep result.
- VE enablement is stronger when it is evidence-gated, but that does not make it universally beneficial.
- The companion doodle is a simplified predictive-gate sketch, not a replacement for the VE calibration docs.