Artifact Layout

Overview

| Aspect | Details |
|---|---|
| Purpose | Explain where evaluation outputs and reports live. |
| Audience | Operators archiving evidence and CI outputs. |
| Scope | runs/ scratch outputs and reports/ long-lived evidence. |
| Source of truth | src/invarlock/core/evaluate_contract.py, src/invarlock/core/evaluate_plan.py, src/invarlock/reporting/report_make.py, src/invarlock/reporting/report_bundle.py, src/invarlock/reporting/report_summary.py, src/invarlock/cli/commands/evaluate.py |

Quick Start

# Compare baseline and subject on the default runtime-container path
invarlock evaluate --allow-network \
  --baseline gpt2 \
  --subject gpt2 \
  --report-out reports/eval

# Render HTML, then explain and export, from the emitted evaluation report
invarlock report html -i reports/eval/evaluation.report.json -o reports/eval/evaluation.html
invarlock report explain --evaluation-report reports/eval/evaluation.report.json
invarlock report export -i reports/eval/evaluation.report.json --format release-review-md

Model-loading commands run in the runtime container by default. The exception is a host-side workflow that explicitly bypasses the container with invarlock evaluate --execution-mode host.

Repo-owned presets under configs/ remain available for maintainers, but the quick-start path above stays wheel-compatible by using direct flags only.

Concepts

  • runs/ is scratch space: evaluate emits baseline/subject working artifacts there.
  • reports/ is evidence: archive evaluation.report.json and runtime.manifest.json for audit, plus any HTML or evidence-pack outputs you distribute.
  • Evaluation bundles may reference baseline/subject report artifacts. Keep them together when you want regeneration, deeper provenance review, or low-level run telemetry. Even so, evaluation.report.json remains the canonical portable artifact for verification, rendering, validation, and explanation.

Command outputs

| Command | Writes | What to archive |
|---|---|---|
| invarlock evaluate | runs/, reports/<name>/evaluation.report.json, runtime.manifest.json | Evaluation report bundle plus runtime provenance for container-backed runs. |
| invarlock report html | reports/<name>/evaluation.html | Optional (can be rebuilt). |
| invarlock report export | Optional output path for mlflow-tags.json, model-card-invarlock.md, or release-review.md | Optional reviewer/registry convenience output (can be rebuilt). |

Reference

Evaluate scratch outputs (runs/)

Figure: Run artifact directory layout with per-run report and event outputs.

Evaluation reports (reports/)

Figure: Report artifact directory layout with evaluation report, runtime manifest, HTML, export, and verify sidecar outputs.

Archive checklist

  • Keep evaluation.report.json with runtime.manifest.json.
  • Retain HTML exports only when you need reviewer-friendly artifacts.
  • Retain scratch runs/ only if debugging or rebuilding derived artifacts.
  • Prune timestamped runs/ once evidence is archived.

| Artifact | Why archive | Required for verify |
|---|---|---|
| evaluation.report.json | Evaluation report snapshot | Yes |
| runtime.manifest.json | Runtime provenance for container-backed outputs | Yes |
| events.jsonl | Debugging timeline | No |
| evaluation.html | Human review | No |
| mlflow-tags.json | Registry tag handoff | No |
| model-card-invarlock.md | Model-card evidence block | No |
| release-review.md | Reviewer packet | No |
| invarlock-verify.json | Stored CI verify output | No |
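
The archive table can be turned into a quick pre-archive check. The following is a minimal sketch, not part of InvarLock: check_archive is a hypothetical helper, and it assumes every listed file sits directly inside one report directory such as reports/eval. events.jsonl is left out because it is a per-run debugging output rather than a report-directory file.

```python
from pathlib import Path

# File names come from the archive table; REQUIRED mirrors its
# "Required for verify" column. The flat directory layout is an assumption.
REQUIRED = ("evaluation.report.json", "runtime.manifest.json")
OPTIONAL = (
    "evaluation.html",
    "mlflow-tags.json",
    "model-card-invarlock.md",
    "release-review.md",
    "invarlock-verify.json",
)


def check_archive(report_dir):
    """Return (missing_required, present_optional) for one report directory."""
    root = Path(report_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    present = [name for name in OPTIONAL if (root / name).is_file()]
    return missing, present


if __name__ == "__main__":
    missing, present = check_archive("reports/eval")
    print("missing required:", missing)
    print("optional present:", present)
```

An empty missing list means both files needed for verification are in place; the optional list only tells you which rebuildable outputs happen to be stored alongside them.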

Seeds, hashes, and policy digests

  • report.meta.seeds includes Python/NumPy/Torch seeds.
  • report.meta.tokenizer_hash and dataset digests support pairing verification.
  • reports record policy_digest and resolved tier policy snapshots.
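
To spot-check these fields before archiving, a report file can be read with the standard json module. This is a minimal sketch under stated assumptions: provenance_summary is a hypothetical helper, the file is assumed to expose a top-level meta object holding seeds and tokenizer_hash, and the location of policy_digest is a guess (top level first, then meta). Adjust the lookups to match the actual report schema.

```python
import json
from pathlib import Path


def provenance_summary(report_path):
    """Collect seed, tokenizer-hash, and policy-digest fields from a report file."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    meta = report.get("meta", {})
    return {
        "seeds": meta.get("seeds"),
        "tokenizer_hash": meta.get("tokenizer_hash"),
        # Assumption: policy_digest is looked up at the top level, then under meta.
        "policy_digest": report.get("policy_digest", meta.get("policy_digest")),
    }


if __name__ == "__main__":
    summary = provenance_summary("reports/eval/evaluation.report.json")
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Fields that are absent come back as None, which makes a missing seed or digest easy to flag in CI.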

Cleanup checklist

  1. Copy evaluation.report.json and runtime.manifest.json into reports/ for retention.
  2. Keep any referenced baseline/subject artifacts alongside derived reports when you need regeneration or low-level run telemetry.
  3. Remove stale timestamped runs once evidence is archived.
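
The cleanup steps can be scripted. The sketch below is illustrative only: archive_evidence, stale_runs, and prune_runs are hypothetical helpers, and it assumes each run is a direct subdirectory of runs/ whose modification time is a reasonable proxy for its age. Pruning defaults to a dry run so nothing is deleted until the evidence copy has been confirmed.

```python
import shutil
import time
from pathlib import Path

# The two files the checklist says to retain.
EVIDENCE = ("evaluation.report.json", "runtime.manifest.json")


def archive_evidence(src_dir, dest_dir):
    """Copy the two evidence files into dest_dir; raise if either is missing."""
    src, dest = Path(src_dir), Path(dest_dir)
    missing = [name for name in EVIDENCE if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing evidence files: {missing}")
    dest.mkdir(parents=True, exist_ok=True)
    for name in EVIDENCE:
        shutil.copy2(src / name, dest / name)
    return [dest / name for name in EVIDENCE]


def stale_runs(runs_dir, max_age_days, now=None):
    """List run subdirectories whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(runs_dir).iterdir()
        if p.is_dir() and p.stat().st_mtime < cutoff
    )


def prune_runs(runs_dir, max_age_days, dry_run=True):
    """Remove stale run directories; with dry_run=True only report them."""
    victims = stale_runs(runs_dir, max_age_days)
    if not dry_run:
        for path in victims:
            shutil.rmtree(path)
    return victims
```

A typical sequence would call archive_evidence first, review the list returned by prune_runs("runs", 14), and only then repeat the call with dry_run=False. Step 2 of the checklist still applies: keep referenced baseline/subject artifacts if you expect to regenerate reports.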

Troubleshooting

  • Missing pairing artifacts: report explain --evaluation-report works from evaluation.report.json; use explicit --subject-report/--baseline-report only when you need to rebuild the explanation from raw run artifacts.
  • Large run dirs: prune old timestamped runs after archiving reports.
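
When deciding which runs to prune first, it helps to see which ones take the most space. This is a small sketch using only the standard library; largest_runs and dir_size_bytes are hypothetical helpers, and the layout of one subdirectory per run under runs/ is an assumption.

```python
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def largest_runs(runs_dir, top=5):
    """Return up to `top` (size_in_bytes, run_name) pairs, largest first."""
    sizes = [
        (dir_size_bytes(p), p.name)
        for p in Path(runs_dir).iterdir()
        if p.is_dir()
    ]
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    for size, name in largest_runs("runs"):
        print(f"{size / 1e6:10.1f} MB  {name}")
```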

Observability

  • evaluation.report.json is the canonical distribution artifact.
  • scratch run artifacts provide per-phase logs for debugging when needed.