
Research Note · Calibration · Variance

Variance Enablement Should Be Evidence-Gated

Ink/charcoal doodle: paired evidence windows flow through a variance enablement gate with on and off outcomes.

Variance equalization is stronger when it must earn enablement through predictive evidence, explicit tier knobs, and report-visible provenance.

4 min read
InvarLock Team

Research Note: a correction path should have to earn its own activation

Highlights

  • Variance equalization (VE) is not presented publicly as an always-on correction. It is gated.
  • The gate depends on predictive paired evidence, tier-sidedness, and a declared minimum effect.
  • The result is visible in reports and can be carried forward as policy rather than intuition.

Many adaptive mechanisms sound better in prose than they look under review. A correction path is described as "smart," "dynamic," or "self-tuning," and the actual enablement rule disappears behind that language.

InvarLock's public variance-equalization surface does the opposite: it states the enablement rule outright.

The assurance note for VE says plainly that VE should propose scales only when predictive paired evidence clears a tier-specific gate. That means the interesting question is not whether VE exists. It is whether VE has earned the right to turn on.

What The Predictive Gate Is Actually Testing

The VE note is concrete about the contract. Predictive A/B runs compare an edited model without VE against a virtual VE path on the same windows, then inspect paired Delta logNLL behavior under the tier's sidedness rule.

That is a much narrower and better claim than "VE tends to help." The gate is testing whether there is evidence for improvement on the predictive comparison surface chosen for calibration.
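The pairing itself can be sketched in a few lines. This is an illustration rather than InvarLock code: the function name, the dict-of-windows layout, and the sign convention (virtual-VE logNLL minus no-VE logNLL, so a negative delta means improvement) are all assumptions made for the example.

```python
def paired_deltas(base_lognll, ve_lognll):
    """Per-window Delta logNLL for a paired predictive A/B comparison.

    base_lognll: window id -> logNLL of the edited model without VE.
    ve_lognll:   window id -> logNLL of the virtual VE path.
    Assumed convention: delta = VE - base, so negative means VE improved.
    """
    # Pairing only makes sense if both arms saw exactly the same windows.
    if base_lognll.keys() != ve_lognll.keys():
        raise ValueError("both arms must be evaluated on the same windows")
    return {w: ve_lognll[w] - base_lognll[w] for w in sorted(base_lognll)}


if __name__ == "__main__":
    base = {"w0": 2.5, "w1": 3.0}
    ve = {"w0": 2.25, "w1": 3.0}
    print(paired_deltas(base, ve))
```

Refusing mismatched window sets up front keeps the comparison honest: a delta computed across different windows is not paired evidence.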

The calibration CLI reference reinforces the same story operationally. VE sweeps exist to recommend min_effect_lognll values, not to bless VE as a permanent default.

Why Sidedness And Minimum Effect Matter

The most important detail in the public note is that movement in the desired direction is not, by itself, enough to enable VE. The mean effect has to clear min_effect_lognll, and the predictive interval has to clear it under the tier's sidedness rule.

That matters because it prevents two common failures.

The first is wishful enablement: a tiny directional improvement gets treated as enough. The second is statistical ambiguity: the interval is still compatible with no meaningful gain, but the mechanism turns on anyway.

The public tier split is concrete here. Balanced uses a one-sided improvement test, while Conservative uses a two-sided interval and still enables only on improvement. Those are different evidence thresholds, not different marketing names.
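One possible reading of that rule, as a minimal sketch. It is not the InvarLock implementation. It assumes that negative deltas mean improvement, that the interval is a normal-approximation interval on the mean paired delta, and that alpha is 0.05. The names `ve_gate` and `GateResult` are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, stdev

# Sidedness per tier, as described in the public tier split.
TIER_SIDEDNESS = {"balanced": "one-sided", "conservative": "two-sided"}


@dataclass(frozen=True)
class GateResult:
    enabled: bool
    mean_delta: float
    upper_bound: float


def ve_gate(deltas, min_effect_lognll, tier, alpha=0.05):
    """Enable only if mean and interval bound both clear the minimum effect.

    deltas: paired Delta logNLL values (assumed: negative = improvement).
    alpha:  assumed significance level, not taken from the source.
    """
    if len(deltas) < 2:
        raise ValueError("need at least two paired windows")
    m = mean(deltas)
    se = stdev(deltas) / sqrt(len(deltas))
    # One-sided spends all of alpha on the upper tail; two-sided splits it.
    tail = alpha if TIER_SIDEDNESS[tier] == "one-sided" else alpha / 2
    upper = m + NormalDist().inv_cdf(1 - tail) * se
    # The bound nearest "no gain" must still show at least the minimum
    # improvement. The mean check is implied by the bound check here, but
    # is kept explicit to mirror the two conditions in the prose.
    enabled = m <= -min_effect_lognll and upper <= -min_effect_lognll
    return GateResult(enabled, m, upper)


if __name__ == "__main__":
    sample = [-0.10, -0.12, -0.08, -0.11, -0.09, -0.10, -0.12, -0.08]
    for tier in TIER_SIDEDNESS:
        print(tier, ve_gate(sample, 0.09, tier))
```

On this sample, with a threshold of 0.09, the one-sided bound clears the threshold and the wider two-sided bound does not, so the same evidence enables VE under Balanced and leaves it off under Conservative.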

The tier-policy catalog makes the distinction operational. Balanced and Conservative do not merely have different labels. They carry different VE policy knobs, and those knobs show up in resolved_policy.variance.*.

Why Provenance And Report Visibility Matter

The VE note is also explicit that predictive A/B provenance has to travel with the decision. Window identity, seed, and tap alignment are part of the contract.

That is important because an enablement decision without provenance is hard to audit and easy to overstate. If VE stayed on, a serious reader should be able to see why. If it stayed off, the report should make that legible too.
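A provenance record of that kind could look like the sketch below. The field names and the digest scheme are hypothetical, chosen only to show how window identity, seed, and tap alignment might travel together with a gate decision. They are not the report schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VEProvenance:
    """Hypothetical provenance bundle for one predictive A/B decision."""
    window_ids: tuple   # identity of the paired evaluation windows
    seed: int           # seed used for the predictive run
    tap: str            # label for the tap alignment (assumed to be a string)

    def digest(self):
        # A stable hash lets a reviewer confirm that two reports refer to
        # the same evidence without comparing every field by hand.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    prov = VEProvenance(window_ids=("w0", "w1"), seed=7, tap="example-tap")
    print(prov.digest())
```

Any change to the windows, the seed, or the tap label changes the digest, which makes a silently swapped evidence set easier to spot.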

This is where the guards reference matters. VE is not only a calibration-side story. Its gate outcome and resolved policy knobs are observable in the report surface.

Why The Patch Output Matters

Like null sweeps, VE sweeps do not end in commentary. They emit machine-readable outputs and a tiers_patch_variance_ve.yaml recommendation.

That patch is what turns the statistical argument into a reviewable policy change. A reader does not have to trust that someone interpreted the sweep correctly. They can inspect the recommended key, see where it lands in tier policy, and decide whether the proposed enablement boundary deserves to move.
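Reviewing such a patch amounts to diffing it against the current tier policy. The sketch below works on already-parsed dictionaries. The nested layout (tier, then variance, then min_effect_lognll) is an assumption for illustration, since the actual structure of tiers_patch_variance_ve.yaml is not shown here, and `review_patch` is a hypothetical helper.

```python
def flatten(tree, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. 'balanced.variance.x'."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def review_patch(current_policy, patch):
    """List (path, old, new) for every key the patch would actually change."""
    current = flatten(current_policy)
    return [
        (path, current.get(path), new)
        for path, new in sorted(flatten(patch).items())
        if current.get(path) != new
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    policy = {"balanced": {"variance": {"min_effect_lognll": 0.0}}}
    patch = {"balanced": {"variance": {"min_effect_lognll": 0.02}}}
    for path, old, new in review_patch(policy, patch):
        print(f"{path}: {old} -> {new}")
```

Listing only the keys that change keeps the review focused on the proposed enablement boundary rather than on the whole policy file.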

This is the value of evidence-gated VE. It constrains activation through visible policy and visible evidence.

What VE Calibration Still Does Not Prove

The claim should stay narrow.

Public VE calibration does not prove that VE is always beneficial. It does not prove that one pilot window budget transfers everywhere. And it does not eliminate the need to recalibrate when model families, profiles, or evaluation budgets change.

What it does do is force VE to justify enablement through predictive evidence instead of narrative optimism.

Claim Map

The practical path is:

  • run predictive A/B sweeps under a declared tier
  • measure paired Delta logNLL behavior with sidedness and minimum-effect rules
  • emit the tiers patch recommendation
  • expose the resulting knobs and gate outcome in report evidence

That is a better story than treating VE as a hidden default.

Limitations

  • This post explains the public VE gate and calibration workflow; it does not contribute a fresh sweep result.
  • VE enablement is stronger when it is evidence-gated, but that does not make it universally beneficial.
  • The companion doodle is a simplified predictive-gate sketch, not a replacement for the VE calibration docs.
