Public Evidence Walkthrough
Purpose
This walkthrough shows the shipped public evidence floor, which reviewers can verify without downloading model weights. It is deliberately BYOE-oriented: InvarLock validates baseline/subject comparison artifacts for externally materialized subjects, and producing deployable quantized checkpoints is outside this public evidence floor.
public_evidence/README.md defines the evidence taxonomy. In short, fixture
artifacts validate verifier contracts, while real-run artifacts are produced by
invarlock evaluate against materialized baseline and subject checkpoints.
Every public evidence artifact carries evidence.meta.json so reviewers can see
whether they are looking at a fixture or a real run.
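A reviewer scripting over many artifact directories might read evidence.meta.json first to triage them. The following is a minimal sketch of that idea, not the project's own tooling. The key name `evidence_kind` and the values `fixture` and `real_run` are assumptions made for illustration; the actual schema is whatever public_evidence/README.md defines.

```python
import json
from pathlib import Path

# Hypothetical key and values: the real schema is defined in
# public_evidence/README.md and may differ.
KIND_KEY = "evidence_kind"
KNOWN_KINDS = {"fixture", "real_run"}


def evidence_kind(artifact_dir: Path) -> str:
    """Return the declared evidence kind for one artifact directory."""
    meta_path = artifact_dir / "evidence.meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    kind = meta.get(KIND_KEY)
    if kind not in KNOWN_KINDS:
        # Fail loudly rather than guess when the metadata is unfamiliar.
        raise ValueError(f"{meta_path}: unexpected {KIND_KEY}={kind!r}")
    return kind
```

Raising on an unknown value keeps an unfamiliar or malformed metadata file from being silently counted as either category.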
Published-basis pass
The repository ships strict-pass public-basis examples for GPT-2-style causal LM and BERT-style masked LM lanes:
invarlock verify --profile release --assurance strict \
public_evidence/published_basis/gpt2/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/published_basis/bert/evaluation.report.json
Each directory includes:
| File | Role |
|---|---|
| evaluation.report.json | Canonical verifier input with primary metric, guard evidence, policy digest, and assurance section. |
| runtime.manifest.json | Container runtime provenance manifest bound to the report by SHA-256. |
| evidence_pack_recipe.json | Recipe pointer for rebuilding a full validation evidence pack. |
| artifact_package/ | Checkpoint references, report/runtime paths, signed-pack path, and exact verifier commands. |
| evidence_pack/ | Signed, checksum-bound GPT-2 public evidence pack that verifies under strict release policy. |
The support matrix records these paths under
contracts/support_matrix.json as the published_basis evidence floor.
The GPT-2 artifact_package/ is intentionally a checkpoint-reference package,
not a weight dump. It names the baseline and subject checkpoint references, binds
them to the report, runtime manifest, and signed pack, and keeps the exact
verification commands in artifact_package/artifact_package.json. Large model
weights remain external to the repository; the rebuild recipe is the source of
truth for materializing a fresh BYOE evidence drop.
The GPT-2 lane also ships a small signed pack so reviewers can exercise the full offline evidence-pack verifier without rebuilding the suite:
FPR=$(python - <<'PY'
import json
from pathlib import Path
manifest = json.loads(
    Path("public_evidence/published_basis/gpt2/evidence_pack/manifest.json")
    .read_text(encoding="utf-8")
)
print(manifest["signing_key_fingerprint"])
PY
)
invarlock advanced evidence-pack verify \
public_evidence/published_basis/gpt2/evidence_pack \
--strict \
--profile release \
--report-assurance strict \
--expected-fingerprint "$FPR"
The expected pack result is ok=true with authenticity=pinned. Without
--expected-fingerprint, the signature still confirms integrity but not signer
authenticity.
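The difference between integrity and signer authenticity can be illustrated with a small sketch of fingerprint pinning. This is not InvarLock's implementation: it assumes the signature has already verified, and the label `unpinned` is a hypothetical name for the case where no expected fingerprint was supplied. Only `pinned` comes from the text above.

```python
import hmac
from typing import Optional


def authenticity_label(manifest_fingerprint: str, expected: Optional[str]) -> str:
    """Classify signer authenticity once the signature itself has verified.

    Illustrative only: "unpinned" is a hypothetical label for the case where
    no expected fingerprint was supplied.
    """
    if expected is None:
        # The signature is valid, but any key could have produced it.
        return "unpinned"
    # Constant-time comparison of the two (ASCII) fingerprint strings.
    if not hmac.compare_digest(manifest_fingerprint.lower(), expected.lower()):
        raise ValueError("signing key fingerprint does not match the pinned value")
    return "pinned"
```

In general, pinning is most meaningful when the expected fingerprint comes from a source other than the pack being checked.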
Real model runs
The repository includes small real runs generated by the CLI on GPT-2-family checkpoints. They verify under the release profile with strict assurance and ship signed evidence packs.
The first run uses sshleifer/tiny-gpt2 as both the baseline and the starting point for the subject, then
applies the built-in quant_rtn edit, a round-to-nearest (RTN) dequantized weight-edit simulation:
uv run invarlock verify \
public_evidence/real_runs/tiny_gpt2_quant_rtn/evaluation.report.json \
--profile release \
--assurance strict
uv run invarlock advanced evidence-pack verify \
public_evidence/real_runs/tiny_gpt2_quant_rtn/evidence_pack \
--strict \
--profile release \
--report-assurance strict \
--expected-fingerprint sha256:cc17b2af6579f5de01e74d91e93528b04670ff89f907ec3ba786a69065435605
The exact invarlock evaluate command is in
public_evidence/real_runs/tiny_gpt2_quant_rtn/run_command.txt. The pack remains
small enough for the repo because it references model checkpoints rather than
vendoring weights. The built-in edit remains a demo/smoke edit, not a deployable
quantization backend.
The second run is a real external BYOE path. The subject checkpoint is
materialized outside InvarLock by
public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/external_edit_recipe.py,
then consumed by invarlock evaluate with --edit-label custom:
uv run invarlock verify \
public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/evaluation.report.json \
--profile release \
--assurance strict
uv run invarlock advanced evidence-pack verify \
public_evidence/real_runs/tiny_gpt2_external_magnitude_prune/evidence_pack \
--strict \
--profile release \
--report-assurance strict \
--expected-fingerprint sha256:e01c40a94c89b22306a2670b032f623aa5428351d06e18f9b3e9e6a39b42c41b
That artifact is the concrete real-run evidence for BYOE/custom subjects: the
checkpoint weights are not vendored, checkpoint_refs.json records the external
edit type and file hashes, and the report records edit_name = custom.
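Because the weights are not vendored, a reviewer who materializes the subject checkpoint locally may want to confirm that the files match the recorded hashes. The sketch below assumes the hashes can be reduced to a mapping from relative file name to hex SHA-256 digest. That mapping shape is an assumption; the actual layout of checkpoint_refs.json may differ.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(checkpoint_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return names whose local file is missing or hashes differently.

    `expected` is a hypothetical {relative_name: hex_sha256} mapping.
    """
    bad = []
    for name, want in sorted(expected.items()):
        path = checkpoint_dir / name
        if not path.is_file() or sha256_file(path) != want.lower():
            bad.append(name)
    return bad
```

An empty result means every listed file is present and matches; anything else names the files to re-materialize.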
BYOE edit examples
The repository also ships small strict-verifiable BYOE examples for multiple
external edit workflows. These fixtures make the verifier boundary explicit:
the subject checkpoint is an external reference, plugins.edits is empty, and
the report is verified as a baseline-vs-subject comparison rather than as output
from a built-in edit plugin.
invarlock verify --profile release --assurance strict \
public_evidence/byoe_examples/magnitude_prune_byoe/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/byoe_examples/lora_merge_byoe/evaluation.report.json
Each example includes checkpoint_refs.json beside the report. The pruning
fixture is a dense magnitude-pruned subject reference, and the LoRA fixture is a
merged-adapter/fine-tune style subject reference. Both are validation-subject
fixtures only; sparse runtime speedups, packed quantized storage, and deployable
optimized backend behavior are outside their scope.
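The verifier boundary described in this section (an empty plugins.edits) could be pre-checked before handing a report to the verifier. This is a rough sketch: it assumes `plugins.edits` is a dotted path reachable from the report root, which the walkthrough does not confirm, and the helper names are hypothetical.

```python
from typing import Any


def lookup(doc: dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "plugins.edits"; return None if absent."""
    node: Any = doc
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def looks_like_byoe(report: dict[str, Any]) -> bool:
    """True when the report records an empty set of built-in edit plugins.

    Assumes `plugins.edits` sits at the report root (not confirmed by the
    walkthrough). A missing key is treated as "unknown", so it returns False.
    """
    edits = lookup(report, "plugins.edits")
    return isinstance(edits, (list, dict)) and not edits
```

Such a check is only a convenience; invarlock verify remains the source of the actual strict result.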
Caught regressions
The caught-regression fixtures keep the naive primary metric acceptable
(ratio_vs_baseline = 1.0) while one guard fails. They cover the three
non-primary guard families exposed in strict verification:
invarlock verify --profile release --assurance strict \
public_evidence/caught_regressions/spectral_guard_failure/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/caught_regressions/rmt_guard_failure/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/caught_regressions/variance_guard_failure/evaluation.report.json
Expected outcome: verification fails. The failure is not a perplexity failure; it is a guard/policy failure. For the spectral case, the verifier reports:
Release verification requires validation.spectral_stable == true
spectral did not pass
That is the intended strict-verification behavior: guard stability is required even when the summary metric is clean.
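The logic these fixtures exercise is a conjunction: the primary metric and every guard must pass. The sketch below illustrates that shape only and is not the verifier's code. `spectral_stable` is taken from the verifier message above, while `rmt_stable`, `variance_stable`, and the `max_ratio` parameter are hypothetical stand-ins.

```python
from typing import Mapping

# "spectral_stable" appears in the verifier message above; the other two key
# names are hypothetical stand-ins for the RMT and variance guard families.
GUARD_FLAGS = ("spectral_stable", "rmt_stable", "variance_stable")


def strict_gate(ratio_vs_baseline: float, max_ratio: float,
                validation: Mapping[str, bool]) -> list[str]:
    """Return the reasons a strict release would be blocked (empty means pass)."""
    reasons = []
    if ratio_vs_baseline > max_ratio:
        reasons.append("primary metric ratio exceeds the allowed bound")
    for flag in GUARD_FLAGS:
        # A missing flag is treated as a failure, not as an implicit pass.
        if validation.get(flag) is not True:
            reasons.append(f"validation.{flag} is not true")
    return reasons
```

With a clean ratio of 1.0, an illustrative bound of 1.1, and only the spectral flag false, the function returns a single reason naming `validation.spectral_stable`, mirroring the spectral fixture.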
Policy failures
The policy-failure fixtures show non-guard and provenance predicates that can block a strict release:
invarlock verify --profile release --assurance strict \
public_evidence/policy_failures/invariants_failure/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/policy_failures/primary_metric_failure/evaluation.report.json
invarlock verify --profile release --assurance strict \
public_evidence/policy_failures/runtime_provenance_failure/evaluation.report.json
Expected outcome: each verification fails for its named policy predicate: invariant evidence, primary-metric acceptance, or container runtime provenance.
Applying this to your checkpoint
Start from your own edited checkpoint, produced by a quantization, pruning, distillation, or
fine-tuning pipeline. Then either run invarlock evaluate directly or generate an
evaluation.report.json from paired run reports:
invarlock report generate \
--run runs/subject/report.json \
--baseline-run-report runs/baseline/report.json \
--format report \
-o reports/eval
invarlock verify --profile release --assurance strict \
reports/eval/evaluation.report.json
Keep evaluation.report.json and runtime.manifest.json together. Use
invarlock advanced runtime-verify only when you specifically want to inspect
the manifest/report binding; use invarlock verify for the full strict
verification result.
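When wiring this into automation, a small wrapper can refuse to run if the report and manifest have been separated. The sketch below only builds the command line shown earlier in this section after checking that runtime.manifest.json sits beside the report. The function name is hypothetical, and the flags are copied from the commands above.

```python
from pathlib import Path


def strict_verify_argv(report: Path) -> list[str]:
    """Build the strict verify command, refusing if the manifest is not alongside."""
    manifest = report.with_name("runtime.manifest.json")
    if not report.is_file():
        raise FileNotFoundError(report)
    if not manifest.is_file():
        raise FileNotFoundError(f"{manifest} should sit beside {report.name}")
    return ["invarlock", "verify", "--profile", "release",
            "--assurance", "strict", str(report)]
```

The returned list can be passed to `subprocess.run`. This walkthrough does not state how failures are signalled, so a caller should inspect both the exit status and the printed output.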