Artifact Layout

Overview

  • Purpose: explain where evaluation outputs and reports live.
  • Audience: operators archiving evidence and CI outputs.
  • Scope: runs/ (scratch outputs) and reports/ (long-lived evidence).
  • Source of truth: src/invarlock/core/evaluate_contract.py, src/invarlock/core/evaluate_plan.py, src/invarlock/reporting/report_make.py, src/invarlock/reporting/report_bundle.py, src/invarlock/reporting/report_console.py, src/invarlock/reporting/report_files.py, src/invarlock/cli/commands/evaluate.py.

Quick Start

# Compare baseline and subject on the default runtime-container path
invarlock evaluate --allow-network \
  --baseline gpt2 \
  --subject gpt2 \
  --report-out reports/eval

# Render HTML from the emitted evaluation bundle
invarlock report html -i reports/eval/evaluation.report.json -o reports/eval/evaluation.html
invarlock report explain --evaluation-report reports/eval/evaluation.report.json

Model-loading commands use the runtime container by default; host-side workflows bypass it explicitly with invarlock evaluate --execution-mode host.
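
For example, a host-side run might look like the following (the report path is illustrative, and whether --allow-network is also needed depends on your environment):

# Bypass the runtime container for a host-side run
invarlock evaluate --execution-mode host \
  --baseline gpt2 \
  --subject gpt2 \
  --report-out reports/eval-host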

Repo-owned presets under configs/ remain available for maintainers, but the quick-start path above stays wheel-compatible by using direct flags only.

Concepts

  • runs/ is scratch space: evaluate emits baseline/subject working artifacts there.
  • reports/ is evidence: archive evaluation.report.json and runtime.manifest.json for audit, plus any HTML or evidence-pack outputs you distribute.
  • evaluation bundles reference baseline/subject report artifacts; keep them together to preserve pairing and make later review easier.

Command outputs

  • invarlock evaluate
    Writes: runs/ scratch artifacts, plus reports/<name>/evaluation.report.json and runtime.manifest.json.
    Archive: the evaluation report bundle plus runtime provenance for container-backed runs.
  • invarlock report html
    Writes: reports/<name>/evaluation.html.
    Archive: optional; the HTML can be rebuilt from the report JSON.

Reference

Evaluate scratch outputs (runs/)

evaluate writes each run to a timestamped directory under runs/, containing per-run report and event outputs.
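
A representative layout follows; only events.jsonl and the baseline/subject split are named elsewhere on this page, so the other entries are illustrative:

runs/
  <run-id>/                  # timestamped run directory
    baseline/                # baseline working artifacts
    subject/                 # subject working artifacts
    events.jsonl             # per-run event timeline (debugging)
    ...                      # other per-run report outputs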

Evaluation reports (reports/)

reports/ holds the long-lived evidence for each evaluation: baseline outputs, the evaluation report bundle, and the runtime manifest.
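
A representative layout, using the Quick Start's --report-out path (the baseline/ entry is illustrative; keep referenced artifacts wherever the bundle expects them):

reports/
  eval/                      # reports/<name>/, named via --report-out
    evaluation.report.json   # canonical evaluation report bundle
    runtime.manifest.json    # runtime provenance for container-backed runs
    evaluation.html          # optional HTML render (rebuildable)
    baseline/                # co-located baseline/subject artifacts, if kept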

Archive checklist

  • Keep evaluation.report.json with runtime.manifest.json.
  • Retain HTML exports only when you need reviewer-friendly artifacts.
  • Retain scratch runs/ only if debugging or rebuilding derived artifacts.
  • Prune timestamped runs/ once evidence is archived.

Artifact                 Why archive                                       Required for verify
evaluation.report.json   Evaluation report snapshot                        Yes
runtime.manifest.json    Runtime provenance for container-backed outputs   Yes
events.jsonl             Debugging timeline                                No
evaluation.html          Human review                                      No
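
A minimal packaging sketch for the two verify-required artifacts above (the archive name and paths follow the Quick Start and are illustrative):

# Bundle the verify-required evidence into a single archive
tar -czf reports/eval-evidence.tar.gz \
  -C reports/eval evaluation.report.json runtime.manifest.json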

Seeds, hashes, and policy digests

  • report.meta.seeds includes Python/NumPy/Torch seeds.
  • report.meta.tokenizer_hash and dataset digests support pairing verification.
  • reports record policy_digest and resolved tier policy snapshots.
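
To spot-check these fields, a jq sketch like the one below can help; the key paths assume the field names above live under meta, and the policy_digest location may differ in your report schema:

# Print seeds, tokenizer hash, and policy digest from an evaluation report
jq '{seeds: .meta.seeds, tokenizer_hash: .meta.tokenizer_hash, policy_digest: .policy_digest}' \
  reports/eval/evaluation.report.json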

Cleanup checklist

  1. Copy evaluation.report.json and runtime.manifest.json into reports/ for retention.
  2. Keep any referenced baseline/subject artifacts alongside derived reports for pairing checks and report explain.
  3. Remove stale timestamped runs once evidence is archived.
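
If a run's report artifacts landed in scratch space, the retention steps might look like the following sketch (the run ID is a placeholder; skip the copy when evaluate already wrote the files under reports/):

# Retain evidence, then prune the scratch run
RUN=runs/<run-id>
mkdir -p reports/eval
cp "$RUN"/evaluation.report.json "$RUN"/runtime.manifest.json reports/eval/
rm -rf "$RUN"   # only after the evidence is archived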

Troubleshooting

  • Missing pairing artifacts: report explain and some advanced workflows need the baseline/subject artifacts referenced by the evaluation bundle.
  • Large run dirs: prune old timestamped runs after archiving reports.

Observability

  • evaluation.report.json is the canonical distribution artifact.
  • scratch run artifacts provide per-phase logs for debugging when needed.