
Research Note · Calibration · Policy

Null Sweeps as Threshold Derivation, Not Tuning Folklore

Ink/charcoal doodle: null no-op runs flow into sweep artifacts and a reviewable tier policy patch.

Thresholds are stronger when they come from measured null behavior and end in a policy patch, not from knob-tuning folklore.

5 min read
InvarLock Team

Research Note: thresholds should be measured before they are defended

Highlights

  • Null sweeps turn spectral threshold setting into an empirical derivation problem instead of a tuning habit.
  • The important output is not only a summary table. It is a machine-consumable tiers_patch_*.yaml recommendation.
  • Calibration is stronger when clean baseline behavior is measured explicitly and carried forward as policy evidence.

A lot of threshold tuning in machine learning is socially legible but methodologically weak. Someone adjusts a knob until the warnings "look about right," perhaps on a small pilot, then the result hardens into a default without a clear derivation story.

InvarLock's null-sweep surface goes further than that.

The public calibration docs describe a specific workflow: run no-op sweeps, collect empirical behavior under the null, summarize the resulting distribution, and emit a policy patch that can be reviewed before it is merged into tier defaults. That is a better story than folklore because it makes threshold setting observable and contestable.

What A Null Sweep Actually Measures

The calibration CLI reference is concrete here. A null sweep is not a vague "baseline test." It is a no-op edit sweep intended to measure baseline spectral behavior and derive false-positive-controlled kappa caps and alpha settings.

That distinction matters. The goal is not to show that nothing ever warns under clean conditions. The goal is to measure what warning behavior looks like under clean conditions, then choose policy values that keep false positives under a declared budget.

The CLI makes that budget visible too. Null sweeps expose a target run-level warning rate instead of forcing readers to infer the desired false-positive behavior from a chart after the fact.
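
To make that concrete, here is a minimal sketch of the derivation idea, assuming a sweep yields one spectral statistic per null run: the cap is the empirical quantile of those null values that keeps the observed exceedance rate within the declared run-level budget. The function name, the lognormal stand-in data, and the 5% budget are illustrative assumptions, not InvarLock's CLI or internals.

```python
import numpy as np

def derive_cap_from_null(null_stats: np.ndarray, target_warning_rate: float) -> float:
    """Choose a cap so that, on the measured null runs, at most
    target_warning_rate of runs would have exceeded it."""
    # The (1 - budget) empirical quantile keeps the observed exceedance
    # rate at or below the declared budget on the calibration data.
    return float(np.quantile(null_stats, 1.0 - target_warning_rate))

# Stand-in for measured per-run null statistics (e.g. a kappa-style ratio).
rng = np.random.default_rng(0)
null_stats = rng.lognormal(mean=0.0, sigma=0.15, size=200)

cap = derive_cap_from_null(null_stats, target_warning_rate=0.05)
print(f"proposed kappa cap: {cap:.3f}")
```

The shape of the computation is the point: the cap is a function of measured null behavior plus a declared budget, not a constant someone liked.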

This is why the calibration note focuses on null runs and target warning rates instead of only on downstream edited-model outcomes.

Why Null Behavior Is The Right Source For Spectral Thresholds

The tier-calibration assurance note makes the logic explicit: keep the warning budget fixed, measure clean baseline behavior, and derive per-family kappa thresholds against that budget. In other words, the null is not decorative. It is the reference surface for deciding how much spectral instability should count as unusual.

That is a much more defensible posture than "these values felt reasonable on a few runs." It connects the threshold to an empirical error budget.
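
One way to picture the per-family version, as a rough sketch rather than the documented procedure: hold the run-level warning budget fixed, split it across families (evenly here, a conservative Bonferroni-style choice), and take each family's null quantile against its share. The family names, the stand-in data, and the allocation rule are assumptions for illustration.

```python
import numpy as np

def per_family_caps(null_by_family: dict[str, np.ndarray],
                    run_warning_budget: float) -> dict[str, float]:
    """Derive one cap per family from null-run statistics, assuming a run
    warns if any family exceeds its cap, so the run-level budget is split
    evenly across families (a conservative, Bonferroni-style allocation)."""
    per_family_rate = run_warning_budget / max(len(null_by_family), 1)
    return {
        family: float(np.quantile(stats, 1.0 - per_family_rate))
        for family, stats in null_by_family.items()
    }

rng = np.random.default_rng(1)
null_by_family = {
    "attention": rng.lognormal(0.0, 0.12, size=200),  # illustrative families
    "mlp": rng.lognormal(0.0, 0.20, size=200),
}
print(per_family_caps(null_by_family, run_warning_budget=0.05))
```

The specific allocation is exactly the kind of choice the derivation record should expose, so a reviewer can argue with it rather than discover it later.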

This is also where the post should stay narrow. A null sweep gives evidence about guard behavior under clean conditions. It does not prove the entire acceptance policy is globally optimal.

Why The Output Patch Matters

The strongest part of the calibration surface is that it does not end in prose.

The public reference says each sweep emits JSON, CSV, Markdown, and a tiers_patch_*.yaml file. That patch is the key artifact because it connects the measured result to the actual policy surface operators use. The tier-policy catalog and guards reference explain why that matters: resolved guard behavior shows up in resolved_policy.* and downstream report evidence, so calibration is not just analysis. It is policy derivation.

That machine-consumable patch is what keeps the story from collapsing into "trust the calibration author." Readers can inspect the proposed keys, understand where they land, and decide whether the recommendation deserves to be adopted.
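
A reviewer-side sketch of what that inspection can look like, assuming the proposed patch and the current tier file are YAML mappings readable by a generic loader; the tiers.yaml name and the flat key layout are hypothetical stand-ins, not the documented schema (only the tiers_patch_*.yaml naming comes from the public reference).

```python
import yaml  # PyYAML; any YAML loader would do

def patch_diff(current_path: str, patch_path: str) -> list[str]:
    """List key-level changes a proposed patch would make to current policy.
    Both files are treated as flat mappings here; real patch files may nest."""
    with open(current_path) as fh:
        current = yaml.safe_load(fh) or {}
    with open(patch_path) as fh:
        proposed = yaml.safe_load(fh) or {}
    changes = []
    for key, new_value in sorted(proposed.items()):
        old_value = current.get(key, "<unset>")
        if old_value != new_value:
            changes.append(f"{key}: {old_value} -> {new_value}")
    return changes

# Review before merge: see exactly which policy keys the sweep proposes to move.
for change in patch_diff("tiers.yaml", "tiers_patch_spectral_null.yaml"):
    print(change)
```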

Why This Is Better Than Threshold Folklore

Null sweeps do not make calibration perfect. They do make it legible.

A folklore threshold often has no stable artifact, no target error budget, and no clean way to revisit the recommendation later. A null-sweep threshold has a run recipe, explicit outputs, and a patch-shaped recommendation. That means someone else can argue with it in the right place: the evidence and the policy diff, not the author's intuition.

This is the value of calibration work. It is not that thresholds become unquestionable. It is that they become reviewable.

What Null Sweeps Still Do Not Tell You

The limitations matter.

Null sweeps are one part of the policy story. They help derive spectral thresholds from clean behavior, but they do not by themselves settle every guard choice, every family transfer, or every future window budget. The public docs are careful about this: the published assurance basis is still narrower than the full runnable surface, and teams are encouraged to recalibrate on their own models, data, and hardware.

So the correct claim is not "null sweeps discover the right thresholds once and for all." It is smaller: null sweeps are a disciplined way to turn clean empirical behavior into reviewable threshold recommendations.

Claim Map

The practical path is:

  • run null sweeps under a declared profile and tier
  • collect machine-readable and human-readable calibration artifacts
  • emit tiers_patch_spectral_null.yaml
  • review the patch before merging it into tier policy (a minimal emit-and-review sketch follows this list)
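
The last two steps, sketched with placeholder key names and values rather than InvarLock's actual patch schema; only the tiers_patch_spectral_null.yaml filename comes from the calibration docs.

```python
import yaml  # PyYAML

# Caps derived from the null sweep (see the quantile sketches above);
# key names and values here are placeholders, not a documented schema.
patch = {
    "spectral": {
        "kappa_caps": {"attention": 1.31, "mlp": 1.58},
        "target_run_warning_rate": 0.05,
    }
}

# Write the recommendation as a small, reviewable file...
with open("tiers_patch_spectral_null.yaml", "w") as fh:
    yaml.safe_dump(patch, fh, sort_keys=True)

# ...and print it so the proposal can be read before anyone merges it.
print(yaml.safe_dump(patch, sort_keys=True))
```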

That is a far stronger calibration story than hand-tuning values until the warnings feel acceptable.

Limitations

  • This post explains the public calibration workflow; it does not contribute a fresh sweep result.
  • Null sweeps help derive spectral thresholds without proving the rest of the policy surface is solved.
  • The companion doodle is a simplified flow from null runs to artifacts to policy review, not a replacement for the calibration docs.

