Report Outline

This page defines the renderer-neutral structure for current InvarLock evaluation reports. It connects the canonical evaluation.report.json payload to human-readable renderers such as Markdown, HTML, evidence-pack summaries, and benchmark comparison pages through the same section model.

The outline is implemented by invarlock.reporting.report_outline.build_evaluation_report_outline.

Purpose

The outline keeps report renderers aligned around the same information architecture. Reports can include:

  • policy failures, warning-mode guard movement, and strict warning policies
  • causal, MLM, seq2seq, image-text, and MoE evidence lanes
  • primary-metric tail checks and measured accuracy floors
  • public assurance-basis reports with runtime manifests and model revisions
  • guard-value evidence and benchmark-style bare-vs-guarded comparisons

Renderers should use this shared outline for visible section order.

Canonical Section Order

Section | Purpose | Typical source blocks
Decision | Overall verdict, evidence mode, model/edit identity, warning count. | validation, assurance, meta, primary_metric, guard_warnings
Primary Metric | Task metric, final value, baseline-relative comparison, CI, tail gate. | primary_metric, primary_metric_tail, validation
Policy Gates | Hard verify gates and thresholds. | validation, policy_digest, resolved_policy
Guard Signals | Guard observations and warnings separate from hard failures. | guard_warnings, invariants, spectral, rmt, variance, moe
Benchmark Comparison | Optional bare-vs-guarded scenario deltas. | benchmark_comparison, benchmark, guard_effect_benchmark
Evidence And Provenance | Dataset, windows, runtime/policy/provider digests, device, seed. | dataset, provenance, policy_digest, meta, artifacts
Technical Appendix | Verbose raw measurements, resolved policy, plugins, artifacts. | plugins, resolved_policy, policy_provenance, system_overhead, classification, structure, artifacts

The Benchmark Comparison section is omitted when no benchmark block is present.
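The ordering and omission rule can be sketched as follows. This is a minimal illustration, not the code behind build_evaluation_report_outline: the names SECTION_SOURCES and visible_sections are hypothetical, and the sketch assumes that "a benchmark block is present" means at least one of the Benchmark Comparison source keys exists at the top level of the report payload.

```python
from typing import Any, Mapping

# Canonical section order and typical source blocks, copied from the outline.
SECTION_SOURCES: list[tuple[str, tuple[str, ...]]] = [
    ("Decision", ("validation", "assurance", "meta", "primary_metric", "guard_warnings")),
    ("Primary Metric", ("primary_metric", "primary_metric_tail", "validation")),
    ("Policy Gates", ("validation", "policy_digest", "resolved_policy")),
    ("Guard Signals", ("guard_warnings", "invariants", "spectral", "rmt", "variance", "moe")),
    ("Benchmark Comparison", ("benchmark_comparison", "benchmark", "guard_effect_benchmark")),
    ("Evidence And Provenance", ("dataset", "provenance", "policy_digest", "meta", "artifacts")),
    ("Technical Appendix", ("plugins", "resolved_policy", "policy_provenance",
                            "system_overhead", "classification", "structure", "artifacts")),
]

OPTIONAL_SECTION = "Benchmark Comparison"


def visible_sections(report: Mapping[str, Any]) -> list[str]:
    """Return section titles in canonical order for one report payload.

    Assumption: the optional section is shown only when at least one of its
    source keys is present in the payload; all other sections always appear.
    """
    titles = []
    for title, blocks in SECTION_SOURCES:
        if title == OPTIONAL_SECTION and not any(key in report for key in blocks):
            continue  # no benchmark block, so the section is omitted
        titles.append(title)
    return titles
```

A renderer built this way iterates over visible_sections(report) rather than hard-coding its own order, so Markdown and HTML output cannot drift apart.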

Renderer Rules

  • Keep policy failures, guard warnings, and guard-value evidence distinct.
  • Keep primary metric interpretation task-aware: perplexity-like (ppl) metrics are compared to the baseline as ratios; accuracy is compared as percentage-point deltas.
  • Put benchmark deltas after guard signals, not in provenance or appendix.
  • Keep verbose policy YAML, plugin provenance, and raw artifacts in the technical appendix unless they are needed to explain the verdict.
  • Treat the outline as the source for visible section order in future Markdown and HTML renderers.
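The task-aware rule for the primary metric can be illustrated with a small helper. This is a sketch under stated assumptions, not the project's implementation: the function name describe_primary_metric and the kind strings are hypothetical, and accuracy values are assumed to be stored as fractions between 0 and 1.

```python
def describe_primary_metric(kind: str, final: float, baseline: float) -> str:
    """Format a baseline-relative comparison for the Primary Metric section.

    Assumptions (not taken from the report schema): `kind` is "accuracy" for
    accuracy metrics and anything else is treated as perplexity-like; accuracy
    values are fractions in [0, 1].
    """
    if kind == "accuracy":
        # Accuracy: difference in percentage points, not a relative change.
        delta_pp = (final - baseline) * 100.0
        return f"{delta_pp:+.2f} pp vs baseline"
    if baseline == 0:
        raise ValueError("baseline must be non-zero to form a ratio")
    # Perplexity-like: ratio of final to baseline (1.0 means unchanged).
    ratio = final / baseline
    return f"{ratio:.3f}x baseline"


print(describe_primary_metric("accuracy", 0.81, 0.80))  # +1.00 pp vs baseline
print(describe_primary_metric("ppl", 10.5, 10.0))       # 1.050x baseline
```

Keeping the two branches separate avoids reporting an accuracy change as a ratio, which would make a one-point drop on a low-accuracy task look far larger than the same drop on a high-accuracy task.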