BCa Bootstrap for Paired Δlog‑Loss (with Fallbacks)

Plain language: Confidence intervals come from a paired, token‑weighted BCa bootstrap on Δlog‑loss; the ratio CI is just the exponentiated Δlog CI. When Δ is degenerate, or BCa’s acceleration term is undefined, we fall back safely.

Claim

Paired, token‑weighted BCa on Δlog‑loss yields a ratio CI by exponentiation. When Δ is degenerate or acceleration is undefined, the implementation falls back safely (e.g., percentile CI or a collapsed interval).

Method (paired, token‑weighted)

  • Samples: paired Δlog‑loss per window, token‑weighted (t_i).
  • Bootstrap: Bias‑Corrected and Accelerated (BCa), replicates N (α = 0.05 by default).
  • Seeding: dataset.windows.stats.bootstrap.seed recorded for reproducibility.

Given per‑window token counts t_i and log‑losses ℓ_i^A, ℓ_i^B, define

  • Δℓ_i = ℓ_i^B − ℓ_i^A (paired, token‑weighted)
  • Compute CI [L, U] via BCa on {Δℓ_i} (resampled with probability ∝ t_i).
  • Perplexity ratio CI is exp([L, U]).

Fallbacks

  • Degenerate Δ (all equal, no pairs, or single pair): mark degenerate=true; CI collapses to [μ, μ] with μ = mean(Δ).
  • Undefined acceleration (jackknife variance is zero): fall back to a percentile bootstrap CI.

Runtime Contract (report)

  • primary_metric.ci — Δlog‑loss CI (log space, ppl-like kinds)
  • primary_metric.display_ci — ratio CI = exp(primary_metric.ci)
  • Identity checks:
    • primary_metric.display_ci == exp(primary_metric.ci)
    • preview/final drift ratio is computed from primary_metric.{preview,final}
  • dataset.windows.stats.bootstrap.{replicates,seed,method}
  • dataset.windows.stats.paired_delta_summary.{mean,std,degenerate}

Defaults & Tuning (tiers)

  • Balanced: ≈ 180×180 windows, BCa replicates ≈ 1.2k.
  • Conservative: ≈ 220×220 windows, BCa replicates ≈ 1.5k.

Adjust only with an audit note; always record the seed. CI/Release profiles enforce minima strictly when pairing is established.

Notes

  • Pairing and non‑overlap are required; see Coverage & Pairing Plan.
  • BCa is numerically stable under typical window counts; for extreme small‑n, expect more frequent fallbacks.

Assumptions & Scope

  • Paired windows and token weighting are required for the log‑space identities to hold.
  • Degenerate Δ cases are rare in practice at tier coverage; when they occur, the report records the fallback explicitly.

References