BCa Bootstrap for Paired Δlog‑Loss (with Fallbacks)

Plain language: Confidence intervals come from a paired, token‑weighted BCa bootstrap on Δlog‑loss; the ratio CI is just the exponentiated Δlog CI. When Δ is degenerate, or BCa’s acceleration term is undefined, we fall back safely.

Claim

Paired, token‑weighted BCa on Δlog‑loss yields a ratio CI by exponentiation. When Δ is degenerate or acceleration is undefined, the implementation falls back safely (e.g., percentile CI or a collapsed interval).

Method (paired, token‑weighted)

Samples: paired Δlog‑loss per window, token‑weighted (t_i).
Bootstrap: bias‑corrected and accelerated (BCa) with N replicates; α = 0.05 by default, i.e., a two‑sided 95% CI.
Seeding: dataset.windows.stats.bootstrap.seed recorded for reproducibility.

Given per‑window token counts t_i and log‑losses ℓ_i^A, ℓ_i^B, define

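Δℓ_i = ℓ_i^B − ℓ_i^A (one paired difference per window)
Compute the CI [L, U] via BCa on {Δℓ_i}, resampling windows with probability ∝ t_i (this is where token weighting enters).
The perplexity ratio CI is [exp(L), exp(U)]; exp is monotone increasing, so the endpoints keep their order.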
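The procedure above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the project's implementation: the names paired_bca_ci, token_weighted_mean and ratio_ci are hypothetical, the point estimate is assumed to be the token-weighted mean of Δℓ_i, the quantile lookup is a simple nearest-rank rule, and the input is assumed to be non-degenerate (see Fallbacks).

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def token_weighted_mean(deltas, tokens):
    """Token-weighted mean: sum(t_i * d_i) / sum(t_i)."""
    return sum(t * d for d, t in zip(deltas, tokens)) / sum(tokens)

def paired_bca_ci(loss_a, loss_b, tokens, replicates=1200, alpha=0.05, seed=0):
    """Sketch: BCa interval [L, U] on paired delta log-loss (non-degenerate input)."""
    deltas = [b - a for a, b in zip(loss_a, loss_b)]
    n = len(deltas)
    theta_hat = token_weighted_mean(deltas, tokens)

    # Resample windows with probability proportional to t_i; the plain mean of
    # each resample then targets the token-weighted mean.
    rng = random.Random(seed)
    boots = sorted(
        sum(rng.choices(deltas, weights=tokens, k=n)) / n
        for _ in range(replicates)
    )

    # Bias correction z0 from the share of replicates below the point estimate,
    # clamped away from 0 and 1 so inv_cdf stays finite.
    share = sum(b < theta_hat for b in boots) / replicates
    share = min(max(share, 1.0 / (replicates + 1)), replicates / (replicates + 1.0))
    z0 = _STD_NORMAL.inv_cdf(share)

    # Acceleration from leave-one-out (jackknife) token-weighted means.
    total_t = sum(tokens)
    total_td = sum(t * d for d, t in zip(deltas, tokens))
    jack = [(total_td - t * d) / (total_t - t) for d, t in zip(deltas, tokens)]
    center = sum(jack) / n
    accel = sum((center - j) ** 3 for j in jack) / (
        6.0 * sum((center - j) ** 2 for j in jack) ** 1.5
    )

    def adjusted_quantile(q):
        # Standard BCa adjustment of the nominal quantile q.
        z = _STD_NORMAL.inv_cdf(q)
        return _STD_NORMAL.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def pick(q):
        # Nearest-rank lookup in the sorted replicates (an assumed convention).
        return boots[min(replicates - 1, max(0, round(q * (replicates - 1))))]

    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))

def ratio_ci(log_ci):
    """Map a log-space CI [L, U] to the ratio CI [exp(L), exp(U)]."""
    lo, hi = log_ci
    return math.exp(lo), math.exp(hi)
```

Calling ratio_ci(paired_bca_ci(loss_a, loss_b, tokens, seed=s)) with the same seed s returns the same interval on every run, which is why the seed is recorded.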

Fallbacks

Degenerate Δ (all values equal, no pairs, or a single pair): mark degenerate=true; the CI collapses to [μ, μ] with μ = mean(Δ).
Undefined acceleration (the jackknife leave‑one‑out estimates have zero variance, so the acceleration's denominator is zero): fall back to a percentile bootstrap CI.
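The fallback order can be illustrated with a small dispatcher. This is a sketch under assumptions, not the actual code path: select_ci_method, collapsed_ci and percentile_ci are hypothetical names, the exact zero test on the acceleration denominator 6·(Σ(θ̄ − θ_(i))²)^1.5 is an assumed criterion (a real implementation may use a tolerance), and returning NaN for the no-pairs case is a placeholder, since the mean of an empty Δ is not defined.

```python
import math
import random

def jackknife_spread(deltas, tokens):
    """Sum of squared deviations of leave-one-out token-weighted means."""
    total_t = sum(tokens)
    total_td = sum(t * d for d, t in zip(deltas, tokens))
    jack = [(total_td - t * d) / (total_t - t) for d, t in zip(deltas, tokens)]
    center = sum(jack) / len(jack)
    return sum((center - j) ** 2 for j in jack)

def select_ci_method(deltas, tokens):
    """Return 'degenerate', 'percentile', or 'bca' for the given paired deltas."""
    if len(deltas) < 2 or max(deltas) == min(deltas):
        return "degenerate"  # no pairs, single pair, or all values equal
    denom = 6.0 * jackknife_spread(deltas, tokens) ** 1.5
    if denom == 0.0 or not math.isfinite(denom):
        return "percentile"  # acceleration undefined
    return "bca"

def collapsed_ci(deltas):
    """Collapsed interval [mu, mu]; NaN for empty input is an assumption."""
    mu = sum(deltas) / len(deltas) if deltas else float("nan")
    return mu, mu

def percentile_ci(deltas, tokens, replicates=1200, alpha=0.05, seed=0):
    """Plain percentile bootstrap CI with token-proportional resampling."""
    rng = random.Random(seed)
    n = len(deltas)
    boots = sorted(
        sum(rng.choices(deltas, weights=tokens, k=n)) / n
        for _ in range(replicates)
    )
    lo = boots[int((alpha / 2) * (replicates - 1))]
    hi = boots[math.ceil((1 - alpha / 2) * (replicates - 1))]
    return lo, hi
```

With positive token counts, the leave-one-out means can only coincide exactly when all Δ values are equal, so in this sketch the percentile branch is reached mainly through floating-point underflow of very small differences.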

Runtime Contract (report)

primary_metric.ci: Δlog‑loss CI in log space (for perplexity‑like metric kinds)
primary_metric.display_ci: ratio CI = exp(primary_metric.ci), applied to each endpoint
Identity checks:
- primary_metric.display_ci == exp(primary_metric.ci)
- preview/final drift ratio is computed from primary_metric.{preview,final}
dataset.windows.stats.bootstrap.{replicates,seed,method}
dataset.windows.stats.paired_delta_summary.{mean,std,degenerate}
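A consumer of the report could verify the first identity along these lines. This is a hedged sketch: it assumes the report is available as a nested dict mirroring the dotted field names above and that each CI is a two-element sequence; check_display_ci and the tolerance are hypothetical.

```python
import math

def check_display_ci(report, rel_tol=1e-9):
    """True if display_ci equals exp(ci) endpoint by endpoint, within rel_tol."""
    metric = report["primary_metric"]
    expected = [math.exp(x) for x in metric["ci"]]
    shown = metric["display_ci"]
    return len(shown) == len(expected) and all(
        math.isclose(e, s, rel_tol=rel_tol) for e, s in zip(expected, shown)
    )

# Illustrative report fragment (values are made up for the example).
example_report = {
    "primary_metric": {
        "ci": [-0.02, 0.01],
        "display_ci": [math.exp(-0.02), math.exp(0.01)],
    }
}
```

A tolerance-based comparison is used because the stored endpoints may have been rounded or serialized before the check runs.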

Defaults & Tuning (tiers)

Balanced: ≈ 180×180 windows, BCa replicates ≈ 1.2k.
Conservative: ≈ 220×220 windows, BCa replicates ≈ 1.5k.

Adjust only with an audit note; always record the seed. The CI/Release run profiles (CI here meaning continuous integration, not confidence interval) enforce the minima strictly once pairing is established.

Notes

Pairing and non‑overlap are required; see Coverage & Pairing Plan.
BCa is numerically stable under typical window counts; for extremely small n, expect more frequent fallbacks.

Assumptions & Scope

Paired windows and token weighting are required for the log‑space identities to hold.
Degenerate Δ cases are rare in practice at tier coverage; when they occur, the report records the fallback explicitly.

References

Efron, B. (1987). “Better Bootstrap Confidence Intervals.” Journal of the American Statistical Association, 82(397), 171–185. https://doi.org/10.1080/01621459.1987.10478410
DiCiccio, T. J., & Efron, B. (1996). “Bootstrap Confidence Intervals.” Statistical Science, 11(3), 189–228. https://projecteuclid.org/journals/statistical-science/volume-11/issue-3/Bootstrap-Confidence-Intervals/10.1214/ss/1032280214.full
Efron, B., & Narasimhan, B. (2021). “bcaboot: Bias Corrected Bootstrap Confidence Intervals.” R package vignette. https://cran.r-project.org/web/packages/bcaboot/bcaboot.pdf
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.