Calibration Is the Product Surface, Not a Side Utility
Calibration is not just analysis around the product. It changes how thresholds are derived, when correction paths may turn on, and which policy values later govern reports.
Synthesis: what May implies about where calibration really lives
Highlights
- Threshold derivation: null sweeps emit tiers_patch_spectral_null.yaml with calibrated family_caps.* keys. See the null-sweeps note.
- Enablement discipline: variance equalization only turns on when paired Delta logNLL clears min_effect_lognll under the tier's sidedness rule. See the variance-enablement note.
- Policy continuity: patched keys surface as resolved_policy.* in later reports, giving the sweep -> patch -> policy -> report path a traceable identity. See the tier-policy note.
The May posts all start from calibration, but they do not stay there in the narrow sense. Null sweeps explain how thresholds are derived from null behavior. Variance enablement explains how a correction path has to earn enablement. Tier policy explains how sweep outputs become reviewable policy patches and later show up in reports.
Taken together, those posts imply a stronger and more practical conclusion: calibration is part of the product surface.
That claim should be read carefully. It does not mean calibration is the whole product. It means calibration changes the live boundary that operators actually interact with and later audit.
Here, "product surface" means the operator-visible thresholds, gates, patch paths, and report fields that actually govern evaluation and review.
1. Calibration Sets The Derivation Story
The first May post matters because it changes how threshold selection should be interpreted. The calibration CLI reference describes null sweeps as a workflow that measures clean no-op behavior and emits a patch-shaped recommendation. Threshold setting is no longer just taste plus experience. It becomes part of the formal operational story.
That is already product-surface behavior. A system whose thresholds are empirically derived and patchable exposes a different review interface than one whose thresholds live mostly in inherited defaults.
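To make the derivation story concrete, here is a minimal sketch of turning null-run statistics into a patch-shaped recommendation. It is not the InvarLock calibration CLI. The function name, the family names, the nearest-rank quantile, and the safety margin are all assumptions chosen for illustration; only the family_caps key comes from the posts.

```python
import math


def derive_family_caps(null_stats, quantile=0.95, margin=1.10):
    """Derive per-family caps from clean no-op (null) runs.

    Hypothetical helper: `null_stats` maps a family name to the statistic
    values observed on null runs. The quantile and margin are illustrative
    defaults, not documented InvarLock values.
    """
    caps = {}
    for family, values in null_stats.items():
        if not values:
            raise ValueError(f"no null observations for family {family!r}")
        ordered = sorted(values)
        # Nearest-rank quantile: the smallest observation with at least
        # `quantile` of the null data at or below it.
        rank = max(1, math.ceil(quantile * len(ordered)))
        caps[family] = round(ordered[rank - 1] * margin, 4)
    # Patch-shaped output: a nested mapping that could be serialized to YAML.
    return {"family_caps": caps}


# Hypothetical null observations for two made-up families.
null_observations = {"attn": [1.0, 1.2, 0.9, 1.1], "ffn": [2.0, 2.5]}
patch = derive_family_caps(null_observations)
```

The point of the sketch is the shape of the output: a threshold recommendation that carries its derivation inputs (observations, quantile, margin) and can be reviewed as a patch rather than accepted as an inherited default.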
2. Calibration Sets The Enablement Story
The second May post adds a different kind of boundary. Variance equalization is not simply available because it exists in the implementation. Publicly, it is supposed to turn on only when predictive evidence clears tier-specific sidedness and minimum-effect rules.
That means calibration is not only about where thresholds come from. It is also about when a correction path may become active at all. The variance guard predictive gate is therefore a product-facing gate, not just an analysis note.
Again, that is not peripheral analysis. It is control over live runtime behavior.
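One way to picture an evidence gate of this kind is sketched below. It is an illustration under stated assumptions, not the InvarLock implementation: the function name, the sign convention (negative paired Delta logNLL means the correction helped), the normal-approximation interval, and the reading of sidedness as a choice of critical value are all hypothetical. Only min_effect_lognll and the idea of a tier-specific sidedness rule come from the posts.

```python
import math
from statistics import NormalDist, mean, stdev


def predictive_gate(paired_deltas, min_effect_lognll,
                    sidedness="one_sided", alpha=0.05):
    """Return True when the correction has earned enablement.

    Assumptions (not from the InvarLock docs): each delta is
    logNLL(with correction) - logNLL(without), so negative is better;
    the interval uses a normal approximation; sidedness only changes
    the critical value.
    """
    n = len(paired_deltas)
    if n < 2:
        raise ValueError("need at least two paired observations")
    if sidedness == "one_sided":
        tail = alpha
    elif sidedness == "two_sided":
        tail = alpha / 2  # stricter: wider interval
    else:
        raise ValueError(f"unknown sidedness: {sidedness!r}")
    z = NormalDist().inv_cdf(1 - tail)
    half_width = z * stdev(paired_deltas) / math.sqrt(n)
    upper_bound = mean(paired_deltas) + half_width
    # Enable only if even the pessimistic end of the interval
    # still shows an improvement of at least the minimum effect.
    return upper_bound <= -min_effect_lognll


# Hypothetical paired deltas from a calibration run.
deltas = [-0.05, -0.06, -0.04, -0.05, -0.055, -0.045]
enabled_lenient = predictive_gate(deltas, 0.045, "one_sided")
enabled_strict = predictive_gate(deltas, 0.045, "two_sided")
```

With these made-up numbers the same evidence passes the one-sided rule and fails the two-sided one, which is the kind of tier-dependent outcome an operator would want to see recorded rather than inferred.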
3. Calibration Sets The Policy Story
The third May post completes the path. Sweep outputs do not end in a human summary. They end in reviewable tiers_patch_*.yaml files that can be merged into runtime tier policy, then exposed later as resolved_policy.* in reports.
This is the clearest reason calibration belongs on the product surface. Its outputs survive. They do not disappear after an experiment review meeting. They become part of the policy that future evaluations inherit.
A system that exposes policy this way is telling you something important: calibration is part of how the product defines itself operationally.
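The sweep -> patch -> policy -> report path can be sketched with plain dictionaries. This is a hedged illustration, not InvarLock's merge logic: the helper names, the example keys under "spectral", and the numeric values are hypothetical, and a real workflow would load the patch from a reviewed tiers_patch_*.yaml file rather than an inline dict.

```python
import copy


def merge_patch(policy, patch):
    """Recursively overlay `patch` on `policy` without mutating either."""
    merged = copy.deepcopy(policy)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def flatten_policy(policy, prefix="resolved_policy"):
    """Flatten nested policy into dotted report keys (resolved_policy.*)."""
    flat = {}
    for key, value in policy.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_policy(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical base tier policy and a sweep-derived patch.
base_policy = {"spectral": {"family_caps": {"attn": 3.0, "ffn": 3.5}}}
sweep_patch = {"spectral": {"family_caps": {"attn": 2.8}}}

runtime_policy = merge_patch(base_policy, sweep_patch)
report_fields = flatten_policy(runtime_policy)
```

Keeping the merge non-mutating means the base policy, the patch, and the resolved result can all be diffed against each other, which is what gives the path a traceable identity.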
Why This Framing Matters
Calling calibration a side utility encourages the wrong habits. It suggests that calibration can be deferred, hidden, or treated as a private research detail while the "real" product lives elsewhere.
The public InvarLock docs point in the opposite direction. The tier-policy catalog distinguishes calibrated values from explicit policy choices, and the guards reference makes resolved policy visible in the report surface. Calibration determines part of the threshold surface, part of the enablement surface, and part of the policy surface. That means it belongs inside serious operator review, release review, and evidence interpretation.
This framing is also useful for readers. It tells them where to look when they want to understand why a threshold exists, why a correction path stayed off, or why a report resolved to a given policy value.
What Calibration Still Does Not Solve
This synthesis should stay conservative.
Calibration does not turn every decision into empiricism. The tier-policy catalog is explicit that some values are calibrated and others are policy choices. Calibration also does not eliminate the need for review, transfer checks, or recalibration when window budgets, hardware, or model families change. The tier v1 calibration note is still local to a specific published evidence surface.
So the right conclusion is not "calibration solves the whole system." It is smaller and more useful: calibration is a real part of the system boundary that operators need to understand and govern.
The Month-Two Checklist
For the current public surface, calibration belongs on the product boundary when it provides:
- an empirical derivation story rather than threshold folklore
- explicit enablement rules rather than hidden adaptive defaults
- a reviewable policy patch rather than summary-only interpretation
- downstream visibility through resolved_policy.*
If those pieces are missing, calibration may still exist. It is simply weaker as an operational surface.
Limitations
- This is a synthesis of the May calibration series; it adds framing, not data.
- "Product surface" here is the operator-visible threshold, gate, patch, and report set, narrower than "entire product" by design.
- Window-budget, hardware, and family-transfer questions still need their own evidence; this post argues calibration belongs on the product surface, not that any specific calibrated value generalizes.