Variance Enablement Should Be Evidence-Gated

Variance equalization (VE) is stronger when it must earn enablement through predictive evidence, explicit tier knobs, and report-visible provenance.

May 11, 2026

InvarLock Team

Research Note: a correction path should have to earn its own activation

Highlights

VE enablement is decided by paired Delta logNLL predictive A/B runs measured against min_effect_lognll.
Tier sidedness sets the test shape: Balanced uses a one-sided improvement test; Conservative uses a two-sided interval and still enables only on improvement.
The decision and the resolved knobs (resolved_policy.variance.min_effect_lognll, predictive_one_sided) are visible in the report, and recommended values are emitted as tiers_patch_variance_ve.yaml.

Many adaptive mechanisms sound better in prose than they look under review. A correction path is described as "smart," "dynamic," or "self-tuning," and the actual enablement rule disappears behind that language.

InvarLock's public variance-equalization surface does not hide the rule that way. It states the enablement condition outright.

The assurance note for VE says plainly that VE should only propose scales when predictive paired evidence clears a tier-specific gate. That means the interesting question is not whether VE exists. It is whether VE has earned the right to turn on.

What The Predictive Gate Is Actually Testing

The VE note is concrete about the contract. Predictive A/B runs compare an edited model without VE against a virtual VE path on the same windows, then inspect paired Delta logNLL behavior under the tier's sidedness rule.
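As a rough illustration of what "paired on the same windows" means, the sketch below computes per-window Delta logNLL values from two arms. It is not InvarLock's implementation: the function name, the dictionary inputs, and the sample numbers are hypothetical, and it assumes that each arm yields one logNLL value per window and that a negative delta means the VE arm scored better.

```python
from statistics import mean


def paired_delta_lognll(lognll_without_ve, lognll_with_ve):
    """Per-window delta: (with VE) minus (without VE).

    Assumed sign convention: negative values mean the VE arm improved.
    Both inputs map a window id to that window's logNLL.
    """
    if lognll_without_ve.keys() != lognll_with_ve.keys():
        # Pairing only makes sense when both arms saw identical windows.
        raise ValueError("both arms must be scored on the same windows")
    return {
        window: lognll_with_ve[window] - lognll_without_ve[window]
        for window in sorted(lognll_without_ve)
    }


# Illustrative numbers only, not measured results.
baseline = {"w0": 2.310, "w1": 2.455, "w2": 2.198}
virtual_ve = {"w0": 2.290, "w1": 2.430, "w2": 2.181}

deltas = paired_delta_lognll(baseline, virtual_ve)
mean_delta = mean(deltas.values())
```

Refusing to pair mismatched window sets is a deliberate choice in this sketch: an unpaired comparison would answer a different question from the one the gate asks.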

That is a much narrower and better claim than "VE tends to help." The gate is testing whether there is evidence for improvement on the predictive comparison surface chosen for calibration.

The calibration CLI reference reinforces the same story operationally. VE sweeps exist to recommend min_effect_lognll values, not to bless VE as a permanent default.

Why Sidedness And Minimum Effect Matter

The most important detail in the public note is that enablement is not triggered by just any movement in the desired direction. The mean effect has to clear min_effect_lognll, and the predictive interval has to clear it under the tier's sidedness rule.

That matters because it prevents two common failures.

The first is wishful enablement: a tiny directional improvement gets treated as enough. The second is statistical ambiguity: the interval is still compatible with no meaningful gain, but the mechanism turns on anyway.

The public tier split is concrete here. Balanced uses a one-sided improvement test, while Conservative uses a two-sided interval and still enables only on improvement. Those are different evidence thresholds, not different marketing names.
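One way to picture the rule is the minimal sketch below. It is an assumption-heavy illustration, not the documented gate: the source does not say how the predictive interval is built, so this version uses a normal-approximation interval on the mean paired delta with a hypothetical alpha of 0.05, and it assumes negative deltas mean improvement. The function name ve_gate is made up; only min_effect_lognll and predictive_one_sided come from the public knobs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ve_gate(deltas, min_effect_lognll, predictive_one_sided, alpha=0.05):
    """Return True only if paired evidence shows improvement beyond the minimum effect.

    deltas: per-window (with VE - without VE) values; negative is improvement.
    The interval construction and alpha are assumptions made for this sketch.
    """
    n = len(deltas)
    if n < 2:
        return False  # too few windows to form an interval
    m = mean(deltas)
    se = stdev(deltas) / sqrt(n)
    # One-sided puts all of alpha in one tail; two-sided splits it in half,
    # which widens the bound and makes the gate harder to pass.
    tail = alpha if predictive_one_sided else alpha / 2
    z = NormalDist().inv_cdf(1 - tail)
    upper = m + z * se  # upper confidence bound on the mean delta
    # The mean check is implied by the bound check; it is kept explicit
    # to mirror the two conditions described in the text.
    return m <= -min_effect_lognll and upper <= -min_effect_lognll


# Illustrative deltas only, not measured results.
sample = [-0.030, -0.026, -0.034, -0.028, -0.032, -0.029, -0.031, -0.027]

balanced_style = ve_gate(sample, 0.0279, predictive_one_sided=True)
conservative_style = ve_gate(sample, 0.0279, predictive_one_sided=False)
```

With these made-up numbers and the same minimum effect, the one-sided test passes while the two-sided one does not, which is the sense in which the tiers are different evidence thresholds. Both variants enable only on improvement; the two-sided version never turns VE on because of a significant change in the wrong direction.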

The tier-policy catalog makes the distinction operational. Balanced and Conservative do not merely have different labels. They carry different VE policy knobs, and those knobs show up in resolved_policy.variance.*.

Why Provenance And Report Visibility Matter

The VE note is also explicit that predictive A/B provenance has to travel with the decision. Window identity, seed, and tap alignment are part of the contract.
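A minimal sketch of what "provenance travels with the decision" could look like in practice is shown below. The record type, its field names, and the sample values are hypothetical; the source only says that window identity, seed, and tap alignment are part of the contract, so the tap is treated here as an opaque identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmProvenance:
    """Hypothetical provenance record for one arm of a predictive A/B run."""

    window_ids: tuple  # identity of the evaluation windows, in order
    seed: int          # seed used for the run
    tap: str           # opaque identifier for the tap alignment


def provenance_mismatches(baseline, virtual_ve):
    """Name every provenance field on which the two arms disagree."""
    problems = []
    if baseline.window_ids != virtual_ve.window_ids:
        problems.append("window_ids")
    if baseline.seed != virtual_ve.seed:
        problems.append("seed")
    if baseline.tap != virtual_ve.tap:
        problems.append("tap")
    return problems


arm_a = ArmProvenance(window_ids=("w0", "w1", "w2"), seed=7, tap="tap-A")
arm_b = ArmProvenance(window_ids=("w0", "w1", "w2"), seed=7, tap="tap-A")
arm_c = ArmProvenance(window_ids=("w0", "w1", "w3"), seed=8, tap="tap-A")

aligned = provenance_mismatches(arm_a, arm_b)      # no disagreements
misaligned = provenance_mismatches(arm_a, arm_c)   # windows and seed differ
```

Returning the names of the mismatched fields, rather than a bare boolean, keeps the audit trail readable when a comparison is rejected.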

That is important because an enablement decision without provenance is hard to audit and easy to overstate. If VE stayed on, a serious reader should be able to see why. If it stayed off, the report should make that legible too.

This is where the guards reference matters. VE is not only a calibration-side story. Its gate outcome and resolved policy knobs are observable in the report surface.

The report-visible decision has a shape readers can inspect. A representative excerpt looks like this:

{
  "variance": {
    "enabled": true,
    "predictive_gate": {
      "passed": true,
      "mean_delta": -0.021
    }
  },
  "resolved_policy": {
    "variance": {
      "min_effect_lognll": 0.016,
      "predictive_one_sided": false
    }
  }
}
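A reader could sanity-check an excerpt like the one above with a few lines of Python. The sketch below uses only the field names that appear in the excerpt; the helper name is hypothetical, and it assumes that a negative mean_delta means improvement, so the improvement magnitude is compared against min_effect_lognll.

```python
import json

REPORT_EXCERPT = """
{
  "variance": {
    "enabled": true,
    "predictive_gate": {"passed": true, "mean_delta": -0.021}
  },
  "resolved_policy": {
    "variance": {"min_effect_lognll": 0.016, "predictive_one_sided": false}
  }
}
"""


def summarize_ve_decision(report):
    """Collect the report fields a reviewer would want side by side."""
    variance = report["variance"]
    gate = variance["predictive_gate"]
    knobs = report["resolved_policy"]["variance"]
    improvement = -gate["mean_delta"]  # assumed: negative delta is improvement
    return {
        "enabled": variance["enabled"],
        "gate_passed": gate["passed"],
        "test_shape": "one-sided" if knobs["predictive_one_sided"] else "two-sided",
        "mean_clears_min_effect": improvement >= knobs["min_effect_lognll"],
    }


summary = summarize_ve_decision(json.loads(REPORT_EXCERPT))
```

The mean check is only half of the rule described earlier; the interval check cannot be reproduced from this excerpt alone, because the excerpt does not carry the interval bounds.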

Why The Patch Output Matters

Like null sweeps, VE sweeps do not end in commentary. They emit machine-readable outputs and a tiers_patch_variance_ve.yaml recommendation.

That patch is what turns the statistical argument into a reviewable policy change. A reader does not have to trust that someone interpreted the sweep correctly. They can inspect the recommended key, see where it lands in tier policy, and decide whether the proposed enablement boundary deserves to move.
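To make the review step concrete, here is a small sketch that lists where a recommendation would move the knob. The layout is an assumption: the source does not describe the structure of tiers_patch_variance_ve.yaml, so the sketch works on already-loaded dictionaries shaped as tier, then variance, then min_effect_lognll. The tier keys and all numeric values are illustrative.

```python
def diff_min_effect(current_policy, patch, key="min_effect_lognll"):
    """List (tier, old, new) for each tier where the patch would change the knob.

    Both arguments are plain dicts in an assumed tier -> "variance" -> key layout.
    """
    changes = []
    for tier, sections in patch.items():
        new_value = sections.get("variance", {}).get(key)
        if new_value is None:
            continue  # the patch does not touch this knob for this tier
        old_value = current_policy.get(tier, {}).get("variance", {}).get(key)
        if old_value != new_value:
            changes.append((tier, old_value, new_value))
    return changes


# Illustrative values only, not recommended settings.
current = {
    "balanced": {"variance": {"min_effect_lognll": 0.010}},
    "conservative": {"variance": {"min_effect_lognll": 0.016}},
}
recommended = {"balanced": {"variance": {"min_effect_lognll": 0.012}}}

changes = diff_min_effect(current, recommended)
```

A reviewer would then see one proposed move per tier and could accept or reject each on its own, which is the point of shipping a patch instead of a narrative summary.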

This is the value of evidence-gated VE. It constrains activation through visible policy and visible evidence.

What VE Calibration Still Does Not Prove

Public VE calibration makes a deliberately bounded promise. It does not prove that VE is always beneficial. It does not prove that one pilot window budget transfers everywhere. And it does not eliminate the need to recalibrate when model families, profiles, or evaluation budgets change.

What it does do is force VE to justify enablement through predictive evidence instead of narrative optimism.

Claim Map

The practical path is:

run predictive A/B sweeps under a declared tier
measure paired Delta logNLL behavior with sidedness and minimum-effect rules
emit the tiers patch recommendation
expose the resulting knobs and gate outcome in report evidence

That is a better story than treating VE as a hidden default.

Limitations

VE gating is one calibration story; window-budget and family-transfer questions remain open and need their own evidence.
An evidence-gated enablement decision is auditable, not optimal; operators still own the call when assumptions shift.
Gate criteria, tier knobs, and report-field names live in the VE assurance and calibration references; this post explains why the gate exists, not how to drive it.
