Evidence Pack Internals

This guide explains how the evidence pack suite is wired internally: entrypoints, task graph, scheduling, and artifact generation. It complements Evidence Packs, which focuses on how to run a suite.

Scope note: in this guide, CALIBRATION_RUN -> GENERATE_PRESET is called Preset Derivation. It produces run-scoped calibrated_preset_<model>.yaml/json files and does not directly modify global runtime/tiers.yaml.

Overview

| Aspect | Details |
| --- | --- |
| Purpose | Hardware-agnostic Phase 0 validation harness for edit detection |
| Harness version | evidence-packs-v1 |
| Manifest format | evidence-pack-v1 |
| Hardware | NVIDIA GPUs where models fit VRAM; multi-GPU recommended for full |
| Models | subset (1 model), showcase/workshop3 (3 models), or full (6 models); all ungated public |
| Edits | Scenario-driven; default suites use 4 clean + 4 stress edit scenarios per model, and filtered manifests may select any subset |
| Preset Derivation | CALIBRATION_RUN + GENERATE_PRESET create run-scoped calibrated presets |
| Scheduling | Dynamic work-stealing, small_first priority strategy |
| Multi-GPU | Profile-based; required_gpus grows only when memory requires it |
| Output | Evidence pack with manifest.json, checksums.sha256, and report bundles (--layout v2 nests results + metadata) |
| Source of truth | scripts/evidence_packs/run_suite.sh, scripts/evidence_packs/run_pack.sh, src/invarlock/evidence_pack.py, src/invarlock/cli/commands/evidence_pack.py |

Quick Start (Context)

The suite wrappers fail closed unless remote model code has been explicitly approved. The examples below assume INVARLOCK_ALLOW_REMOTE_CODE=1 is exported for the shell that launches run_suite.sh or run_pack.sh.

# Run the subset suite (offline by default)
export INVARLOCK_ALLOW_REMOTE_CODE=1
./scripts/evidence_packs/run_suite.sh --suite subset

# Run the full suite and build an evidence pack
./scripts/evidence_packs/run_pack.sh --suite full --net 1 --release-review

# Verify an existing evidence pack
invarlock advanced evidence-pack verify ./evidence_pack_runs/subset_20250101_000000/evidence_pack --strict --report-assurance strict

# Verify and pin the expected package-native signer
invarlock advanced evidence-pack verify ./evidence_pack_runs/subset_20250101_000000/evidence_pack --strict --report-assurance strict --expected-fingerprint sha256:<64-hex-chars>

Hardware Target

  • Hardware-agnostic by design; run on any NVIDIA GPU topology where the models fit in VRAM.
  • Multi-GPU scheduling is enabled automatically when a task’s memory plan exceeds per-device capacity.
  • Set GPU_MEMORY_GB or GPU_MEMORY_PER_DEVICE to match your hardware when running on GPUs with unusual memory sizes.

Entrypoints and modules

Entrypoints

  • scripts/evidence_packs/run_suite.sh runs a suite and sets PACK_* runtime flags before calling the main orchestrator.
  • scripts/evidence_packs/run_pack.sh runs a suite, then packages artifacts into a portable evidence pack (manifest + checksums + reports).
  • scripts/evidence_packs/verify_pack.sh validates an evidence pack in repo workflows.
  • invarlock advanced evidence-pack verify provides the package-native verifier path for installed wheels.
  • scripts/evidence_packs/run_suite.sh defines the model suites and allows overrides via MODEL_1 through MODEL_8.
  • scripts/evidence_packs/lib/validation/validation_suite.sh orchestrates the run: preflight, queue creation, worker launch, and monitoring.

Library modules

  • lib/task_serialization.sh: task schema, JSON helpers, GPU planning.
  • lib/queue/queue_manager.sh: compatibility facade for queue state, dependency, and task-generation modules.
  • lib/queue/queue_core.sh: queue setup, locking, summaries, and terminal-state helpers.
  • lib/queue/queue_lifecycle.sh: task state transitions and orphan reclamation.
  • lib/queue/queue_dependencies.sh: dependency resolution and dependent promotion.
  • lib/queue/queue_memory_plan.sh: profile-based memory refresh and memory-plan export.
  • lib/queue/queue_generation.sh: progress state, task search, and task graph generation.
  • lib/queue/scheduler.sh: compatibility facade for scheduler modules.
  • lib/queue/scheduler_core.sh: GPU ID/cache helpers, scheduler lock, and GPU-count policy.
  • lib/queue/scheduler_gpu_runtime.sh: OOM checks, memory probes, utilization, and purge helpers.
  • lib/queue/scheduler_reservations.sh: GPU reservation and availability helpers.
  • lib/queue/scheduler_selection.sh: task priority, selection, work stealing, and scheduling metrics.
  • lib/gpu_worker.sh: worker loop, heartbeats, task execution glue.
  • lib/tasks/task_functions.sh: compatibility facade and execute_task dispatcher.
  • lib/tasks/task_common.sh: shared scheduling, model, preset, and reusable baseline-report helpers.
  • lib/tasks/task_baseline.sh: baseline setup, calibration, preset generation, and shared baseline report preparation.
  • lib/tasks/task_edit_lifecycle.sh: edit creation, batch edit creation, evaluation, and cleanup.
  • lib/tasks/task_error_lifecycle.sh: error-model creation, evaluation probes, structural-failure reports, and cleanup.
  • lib/tasks/model_creation.sh: edit and error-model creation helpers (create_model_variant dispatcher).
  • lib/config/config_generator.sh: InvarLock config generation and wrapper helpers.
  • lib/validation/validation_suite.sh: validation orchestration, analysis setup, and verdict compilation.
  • lib/core/fault_tolerance.sh: error classification and retry/backoff logic.
  • scripts/evidence_packs/python/manifest_writer.py: evidence pack manifest.json writer.
  • scripts/evidence_packs/python/preset_generator.py: preset derivation + edit-type variants.

Module dependency graph

[Figure: Evidence pack module dependency graph from entrypoints into queueing, execution, and packaging helpers.]

Troubleshooting decision tree

[Figure: Troubleshooting decision tree for missing evidence-pack outputs and common guard failures.]

Model Suite

Model suites are defined and applied by run_suite.sh.

| Suite | Models | Notes |
| --- | --- | --- |
| subset | mistralai/Mistral-7B-v0.1 | Single-GPU friendly |
| showcase | mistralai/Mistral-7B-v0.1, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-32B | Multi-GPU recommended; guard-focused scenarios |
| workshop3 | mistralai/Mistral-7B-v0.1, mistralai/Mixtral-8x7B-v0.1, 01-ai/Yi-34B | Workshop-friendly 3-model suite (architecture diversity) |
| full | mistralai/Mistral-7B-v0.1, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-32B, 01-ai/Yi-34B, mistralai/Mixtral-8x7B-v0.1, Qwen/Qwen3-8B | Multi-GPU recommended |

Default full-suite model sizes (weights-only, approximate):

| Model | VRAM | Category | Notes |
| --- | --- | --- | --- |
| mistralai/Mistral-7B-v0.1 | ~14 GB | Small | Flash Attention 2 compatible |
| Qwen/Qwen2.5-14B | ~28 GB | Small | Flash Attention 2 compatible |
| Qwen/Qwen2.5-32B | ~64 GB | Medium | Flash Attention 2 compatible |
| 01-ai/Yi-34B | ~68 GB | Medium | Flash Attention 2 compatible |
| mistralai/Mixtral-8x7B-v0.1 | ~90 GB | MoE | MoE architecture |
| Qwen/Qwen3-8B | ~16 GB | Small | Flash Attention 2 compatible |

Notes:

  • Override models via MODEL_1 through MODEL_8; set a slot to an empty string to disable it.
  • validation_suite.sh includes a fallback list of large causal models if it is run directly without separate suite plumbing.

Edit Types

Each model runs 8 validation-subject edit experiments (4 types × 2 versions) plus optional error-injection tests. scripts/evidence_packs/scenarios.json is the source of truth for each scenario's artifact_class.

Evidence-pack quant_rtn scenarios are generated as external subject artifacts: the helper applies RTN quantize/dequantize simulation and saves floating-point dequantized weights. Those scenarios may use bits and group_size to shape the generated artifact, but they are not the built-in edit.name: quant_rtn plugin plan and they do not produce packed runtime quantization.

The same distinction applies to the other current edits: FP8 round-trips tensors through FP8 or float16 and saves ordinary floating-point weights; magnitude pruning writes zeros into dense tensors; low-rank SVD writes dense approximated tensors instead of factorized low-rank modules.

Clean edits (tuned)

Clean edits use tuned parameters supplied via PACK_TUNED_EDIT_PARAMS_FILE. The suite uses :clean: as a sentinel in the edit spec and resolves concrete parameters at runtime.

| Edit Type | Artifact class | Parameters | Scope |
| --- | --- | --- | --- |
| RTN dequantized external-subject simulation | validation subject checkpoint | tuned (bits, group_size) from tuned params file | FFN only |
| FP8 dequantized external-subject simulation | validation subject checkpoint | tuned (format) from tuned params file | FFN only |
| Dense magnitude-pruned checkpoint | validation subject checkpoint | tuned (prune_level) from tuned params file | FFN only |
| Dense low-rank-SVD approximated checkpoint | validation subject checkpoint | tuned (rank) from tuned params file | FFN only |

Stress edits

Stress edits are split into required-fail (catastrophic) and informational scenarios. Required-fail scenarios are gating in the final verdict; informational scenarios are tracked as detection-quality signals and are validated by a minimum signal-fraction criterion.

Important nuance: some guards remediate without flipping a boolean validation gate. For example, Spectral can remain validation.spectral_stable=true while applying caps (spectral.caps_applied > 0). Raw remediation events can be useful diagnostics, but public guard-value evidence must be baseline-relative: a stock cap already present in the noop basis does not count. Published guard-value cases must show PM acceptance plus new guard movement relative to the matching baseline, then reproduce that movement in a clean confirmation rerun.

| Edit Type | Artifact class | Parameters | Scope |
| --- | --- | --- | --- |
| RTN dequantized external-subject simulation | validation subject checkpoint | quant_rtn:8:all (8-bit) | All layers |
| FP8 dequantized external-subject simulation | validation subject checkpoint | fp8_quant:e5m2:all | All layers |
| Dense magnitude-pruned checkpoint | validation subject checkpoint | magnitude_prune:0.5:all (50% sparsity) | All layers |
| Dense low-rank-SVD approximated checkpoint | validation subject checkpoint | lowrank_svd:32:all (rank 32) | All layers |
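
The stress parameters above read as colon-separated specs. The following is a minimal sketch of how such a spec could be split, assuming a three-field layout of edit type, parameter, and scope. The `EditSpec` type and `parse_edit_spec` helper are hypothetical names for illustration, not harness functions.

```python
from typing import NamedTuple


class EditSpec(NamedTuple):
    edit_type: str  # e.g. "quant_rtn"
    parameter: str  # e.g. "8", "e5m2", "0.5", "32"
    scope: str      # e.g. "all"


def parse_edit_spec(spec: str) -> EditSpec:
    """Split '<edit_type>:<parameter>:<scope>' (assumed layout, not confirmed by the harness)."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected '<edit_type>:<parameter>:<scope>', got {spec!r}")
    return EditSpec(*parts)


# The four stress specs listed in the table above.
stress_specs = [
    "quant_rtn:8:all",
    "fp8_quant:e5m2:all",
    "magnitude_prune:0.5:all",
    "lowrank_svd:32:all",
]
parsed = [parse_edit_spec(s) for s in stress_specs]
```

Parameters are kept as strings here because their meaning differs by edit type (bit width, FP8 format, sparsity, rank).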

Deployable edits

Deployable scenarios use artifact_class: deployable_optimized_subject and are not part of the default validation lanes. A deployable scenario must carry a backend contract (packed_quantized_storage, reload smoke, inference smoke, and memory/storage evidence) and package the corresponding sidecars into the evidence pack. The OSS suite does not ship a runnable deployable scenario; add one only when the backend can produce or load an artifact and provide all required sidecars on the target GPU stack.

Error injection tests

Enabled when RUN_ERROR_INJECTION=true (default):

  • Required detection (must_detect): nan_injection, inf_injection, shape_mismatch, missing_tensors, extreme_quant, scale_explosion, rank_collapse, norm_collapse, weight_tying_break
  • Informational detection: rmt_norm_noise, spectral_moderate_scale, ve_mlp_scale_skew

rmt_norm_noise additionally emits an rmt_probe.json sidecar next to the error report. This runs an explicit cross-model RMT probe on shared calibration windows (stored in the baseline report) so the evidence pack can demonstrate RMT’s delta policy even when compare-mode evaluation keeps validation.rmt_stable=true.

ve_mlp_scale_skew additionally emits a ve_probe.json sidecar next to the error report. Variance (DD-VE) is a remediation guard and compare-mode evaluation runs the subject model with a no-op edit, which can mute VE’s in-report evidence. The VE probe runs VE calibration directly on shared windows and records whether VE proposes scales and produces a meaningful primary-metric improvement.

Source of truth: scripts/evidence_packs/scenarios.json strictness + intent + primary_guard metadata.

The reference public package for this stricter interpretation is public_evidence/published_basis/mistral_7b/guard_value_demo/, whose artifact_package/reports/guard_value_all_guard_probe_sweep.json records baseline-relative spectral, RMT, and variance/VE cases.

Scheduling

The suite uses dynamic work-stealing scheduling with a file-backed task queue. validation_suite.sh seeds the queue and launches one worker per GPU; workers claim tasks under a scheduler lock with GPU reservation files.

small_first priority strategy

Base task priorities (queue manager) are combined with dynamic boosts in scheduler.sh (model size, blocked dependents, age, and fairness penalties).

[Figure: Priority bands mapping evidence-pack task types to base scheduler priority values.]

Dynamic boosts (scheduler):

  • Model size boosts: <30GB (+30), <70GB (+20), <100GB (+10).
  • Critical tasks: SETUP_BASELINE (+50), CALIBRATION_RUN (+20).
  • Unblock boost: +2 per dependent task (capped).
  • Age boost: +1 per 5 minutes in the queue (capped).
  • Fairness penalty: -3 per running task for the same model (capped).
  • Work-stealing boost: raises priority for lagging models.
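
As a rough illustration of how the boosts listed above might combine, here is a minimal sketch. The function names and the cap values (`unblock_cap`, `age_cap`, `fairness_cap`) are assumptions, since the guide only says the boosts are capped. The work-stealing boost is left out because its size is not specified.

```python
def size_boost(model_size_gb: float) -> int:
    """Model size boost: <30GB (+30), <70GB (+20), <100GB (+10)."""
    if model_size_gb < 30:
        return 30
    if model_size_gb < 70:
        return 20
    if model_size_gb < 100:
        return 10
    return 0


CRITICAL_BOOST = {"SETUP_BASELINE": 50, "CALIBRATION_RUN": 20}


def effective_priority(
    base: int,
    task_type: str,
    model_size_gb: float,
    dependents: int,
    age_seconds: float,
    running_same_model: int,
    *,
    unblock_cap: int = 20,   # assumed cap; the real value is not documented here
    age_cap: int = 20,       # assumed cap
    fairness_cap: int = 15,  # assumed cap
) -> int:
    score = base
    score += size_boost(model_size_gb)
    score += CRITICAL_BOOST.get(task_type, 0)
    score += min(2 * dependents, unblock_cap)                # +2 per dependent task
    score += min(int(age_seconds // 300), age_cap)           # +1 per 5 minutes queued
    score -= min(3 * running_same_model, fairness_cap)       # -3 per running task, same model
    return score


# A 14 GB baseline setup with 3 dependents, queued 10 minutes, 1 sibling task running.
example = effective_priority(50, "SETUP_BASELINE", 14, 3, 600, 1)
```

With a base priority of 50 the example evaluates to 50 + 30 + 50 + 6 + 2 - 3 = 135.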

Dynamic scheduling diagram

[Figure: Scheduler flow from run_pack through run_suite, queue initialization, worker launch, and monitor loop.]

Work-stealing timeline (illustrative)

[Figure: GPU work-stealing timeline showing smaller jobs finishing early and helping with larger jobs.]

Illustrative only; actual scheduling depends on queue state and memory.

Multi-GPU Model Distribution

After baseline setup, the suite writes model_profile.json and updates per-task memory estimates. task_serialization.sh calculates required_gpus based on GPU_MEMORY_PER_DEVICE and NUM_GPUS:

  • Tasks reserve multiple GPUs only when memory exceeds per-device capacity.
  • Adaptive under-allocation is disabled by default (get_minimum_gpus matches required_gpus) to avoid OOM.
  • Set GPU_MEMORY_PER_DEVICE explicitly for non-80/180GB hardware.
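
The GPU-count rule described above can be sketched as a ceiling division. This is a minimal illustration, not the logic in task_serialization.sh. In particular, raising an error when a task needs more GPUs than the pool provides is an assumption about the failure mode.

```python
import math


def required_gpus(task_memory_gb: float, gpu_memory_per_device_gb: float, num_gpus: int) -> int:
    """Smallest GPU count whose combined memory covers the task estimate."""
    if gpu_memory_per_device_gb <= 0:
        raise ValueError("GPU_MEMORY_PER_DEVICE must be positive")
    needed = max(1, math.ceil(task_memory_gb / gpu_memory_per_device_gb))
    if needed > num_gpus:
        # Assumed behavior: refuse rather than under-allocate and risk OOM.
        raise ValueError(f"task needs {needed} GPUs but only {num_gpus} are available")
    return needed


small = required_gpus(14, 80, 4)   # fits on one device
large = required_gpus(90, 80, 4)   # exceeds one device, so two are reserved
```

Lowering the per-device figure makes the same task request more GPUs, which is why the OOM troubleshooting advice suggests reducing GPU_MEMORY_PER_DEVICE.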

Memory-aware selection example

[Figure: Memory-fit decision example showing ready-queue scanning against free GPU memory.]

GPU reservation protection

Reservations are stored under OUTPUT_DIR/workers/gpu_reservations/ and guarded by a queue/scheduler.lock (mkdir-based). The scheduler also expires stale reservations by TTL (GPU_RESERVATION_TTL).
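
A mkdir-based lock works because directory creation is atomic: only one process can create a given directory. The sketch below shows the general technique in Python rather than the harness's shell code. The helper names are hypothetical, and the defaults mirror GPU_RESERVATION_LOCK_TIMEOUT (5 s) and GPU_RESERVATION_TTL (60 s) from the tuning reference.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def dir_lock(lock_dir: str, timeout_s: float = 5.0, poll_s: float = 0.05):
    """Hold a lock by owning a directory; os.mkdir succeeds for exactly one creator."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            os.mkdir(lock_dir)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_dir}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.rmdir(lock_dir)  # release the lock


def reservation_is_stale(reserved_at: float, now: float, ttl_s: float = 60.0) -> bool:
    """A reservation older than the TTL may be expired by the scheduler."""
    return now - reserved_at > ttl_s


with tempfile.TemporaryDirectory() as root:
    lock_path = os.path.join(root, "scheduler.lock")
    with dir_lock(lock_path):
        held = os.path.isdir(lock_path)       # lock directory exists while held
    released = not os.path.exists(lock_path)  # and is gone after release
```

The TTL check matters because a worker that dies while holding a reservation never releases it; expiry lets other workers reclaim the GPU.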

Reservation state example

[Figure: GPU reservation state showing free and reserved devices for multi-GPU task claims.]
[Figure: Reservation file layout for scheduler locks and GPU reservation metadata.]

Task lifecycle

[Figure: Task lifecycle state machine from pending through ready, running, completed, and failed.]

GPU worker loop

[Figure: GPU worker loop from shutdown checks through claim, execute, and release.]

Batch optimizations

Small/medium models default to batch edit creation:

  • Batch edit creation: CREATE_EDITS_BATCH parses and schedules all edit specs together, but reloads the baseline per edit by default to avoid deep-copying large loaded models. Set PACK_BATCH_EDIT_STRATEGY=deepcopy only for small models where single-load throughput is more important than peak memory.
  • Deferred optional report rendering: PACK_DEFER_REPORT_RENDERING=1 keeps evaluation.report.json, runtime.manifest.json, and JSON evidence sidecars in the hot path while skipping markdown/reviewer bundle rendering. Pack verification does not require those optional rendered files. Release review mode enables this by default so evaluation workers spend less time on report-heavy filesystem writes; pack-level HTML export still runs unless PACK_SKIP_HTML=1 is set.
  • Evaluation-loop telemetry: each evaluate_timing.json records top-level evaluate timings plus nested baseline/subject run timings when reports expose them. The pack-level evaluation_optimization_summary.json aggregates those timings so reviewers can separate process startup savings from model load, dataset preparation, guard/eval, and report-generation costs.

Large or MoE models can still disable batch edit tasks automatically (or via PACK_USE_BATCH_EDITS=false) and fall back to per-edit tasks (CREATE_EDIT -> evaluate_EDIT).

Task dependency graphs

Batch (default):

[Figure: Batch dependency graph from baseline setup into calibration, preset generation, edit, and error evaluations.]

Notes:

  • Error injection tasks (CREATE_ERROR -> evaluate_ERROR) branch off SETUP_BASELINE and require the preset for evaluation.
  • Grouped edit evaluation is opt-in because it trades some multi-GPU task parallelism for lower per-scenario process startup overhead. It is most useful for many short edit evaluations on one model.

Per-edit path (large/MoE or PACK_USE_BATCH_EDITS=false):

[Figure: Per-edit dependency graph from baseline setup into edit and error evaluation tasks.]

Task breakdown per model (defaults)

Defaults: DRIFT_CALIBRATION_RUNS=5, CLEAN_EDIT_RUNS=3, STRESS_EDIT_RUNS=2, RUN_ERROR_INJECTION=true, and PACK_USE_BATCH_EDITS=auto.

The task graph is manifest-driven and may add eager baseline, reusable-baseline, and cleanup tasks. Use the generated queue summary as the authoritative count. At a high level, each model includes:

  • Setup baseline: 1 task
  • Preset-derivation runs + preset generation: 6 tasks
  • Edit creation: one batch task when grouped edit creation is explicitly enabled, otherwise one creation task per edit
  • Edit evaluation: edit count × clean/stress run counts
  • Error injection: suite manifest scenarios (12 for subset, 15 for showcase, workshop3, and full by default)
  • Cleanup tasks when cleanup mode is enabled
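
The per-model breakdown above can be worked through as simple arithmetic. This is a rough estimate only: it assumes the default 4 clean and 4 stress edits, reads "edit count × clean/stress run counts" as clean edits × CLEAN_EDIT_RUNS plus stress edits × STRESS_EDIT_RUNS, and counts error-injection scenarios rather than the tasks each scenario expands into. It also ignores eager baseline, reusable-baseline, and cleanup tasks. The function name is hypothetical.

```python
def estimate_tasks_per_model(
    calibration_runs: int = 5,    # DRIFT_CALIBRATION_RUNS
    clean_edits: int = 4,
    stress_edits: int = 4,
    clean_runs: int = 3,          # CLEAN_EDIT_RUNS
    stress_runs: int = 2,         # STRESS_EDIT_RUNS
    error_scenarios: int = 12,    # 12 for subset, 15 for the other default suites
    batch_edit_creation: bool = False,
) -> dict:
    return {
        "setup_baseline": 1,
        # N preset-derivation runs plus one GENERATE_PRESET task
        "preset_derivation": calibration_runs + 1,
        "edit_creation": 1 if batch_edit_creation else clean_edits + stress_edits,
        "edit_evaluation": clean_edits * clean_runs + stress_edits * stress_runs,
        "error_scenarios": error_scenarios,
    }


counts = estimate_tasks_per_model()
batched = estimate_tasks_per_model(batch_edit_creation=True)
```

With the defaults, preset derivation comes to 5 + 1 = 6 tasks and edit evaluation to 4 × 3 + 4 × 2 = 20 tasks. The generated queue summary remains the authoritative count.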

PACK_USE_BATCH_EDITS=auto favors the per-edit path for cleanup-safe default runs. Set PACK_USE_BATCH_EDITS=true only when the reduced startup overhead is worth the lower task-level parallelism for that campaign.

Execution phases

[Figure: Execution phases from environment setup through queue initialization, worker launch, and final reports.]

Run directory layout

[Figure: Output directory layout for evidence-pack analysis, reports, and final verdict artifacts.]

Some scenarios emit additional sidecar artifacts alongside evaluation.report.json (for example reports/errors/rmt_norm_noise/rmt_probe.json or reports/errors/ve_mlp_scale_skew/ve_probe.json). When present, run_pack.sh copies these sidecars into the packaged evidence pack under reports/**/.

Run modes

  • --calibrate-only / PACK_SUITE_MODE=calibrate-only
    • Runs preset derivation only.
    • Only promotes SETUP_BASELINE, CALIBRATION_RUN, and GENERATE_PRESET tasks.
    • The monitor exits after all GENERATE_PRESET tasks complete.
  • --run-only
    • Continue a prior run after preset derivation. This is effectively --resume with PACK_SUITE_MODE=full.
  • --resume
    • Reuses an existing queue and continues from where the run stopped.

Determinism vs throughput

PACK_DETERMINISM controls harness-level determinism:

# Throughput (default)
export INVARLOCK_ALLOW_REMOTE_CODE=1
PACK_DETERMINISM=throughput ./scripts/evidence_packs/run_suite.sh --suite subset

# Strict
PACK_DETERMINISM=strict ./scripts/evidence_packs/run_suite.sh --suite subset

  • Throughput: NVIDIA_TF32_OVERRIDE=1, CUDNN_BENCHMARK=1.
  • Strict: NVIDIA_TF32_OVERRIDE=0, CUDNN_BENCHMARK=0, CUBLAS_WORKSPACE_CONFIG=:4096:8.
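
The two modes amount to a small lookup from mode name to environment settings. A minimal sketch of that mapping follows. The dictionary and helper names are hypothetical, and rejecting unknown modes is an assumption about how the harness behaves.

```python
DETERMINISM_ENV = {
    "throughput": {
        "NVIDIA_TF32_OVERRIDE": "1",
        "CUDNN_BENCHMARK": "1",
    },
    "strict": {
        "NVIDIA_TF32_OVERRIDE": "0",
        "CUDNN_BENCHMARK": "0",
        "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
    },
}


def determinism_env(mode: str = "throughput") -> dict:
    """Return the environment variables listed above for a PACK_DETERMINISM mode."""
    try:
        return dict(DETERMINISM_ENV[mode])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"unknown PACK_DETERMINISM mode: {mode!r}") from None


strict_env = determinism_env("strict")
```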

Network mode and model revisions

Evidence packs are offline by default:

  • PACK_NET=0 sets INVARLOCK_ALLOW_NETWORK=0 and enables HF offline modes.
  • PACK_NET=1 enables downloads and writes state/model_revisions.json (ungated models only).
  • Offline runs require model_revisions.json; missing revisions trigger a hard error during SETUP_BASELINE.

Use PACK_MODEL_REVISIONS_FILE to override the revisions path.

Disk and cache behavior

Large runs can be storage-heavy (baseline + edits + error models):

  • Disk preflight estimates required storage and aborts early when insufficient.
    • Override with PACK_SKIP_DISK_PREFLIGHT=1 (not recommended).
    • The minimum free space guard is MIN_FREE_DISK_GB (default 200).
  • PACK_BASELINE_STORAGE_MODE=snapshot_symlink builds a local symlink tree that points into the Hugging Face cache snapshot. This avoids a second baseline copy under OUTPUT_DIR, but it requires one full model copy in HF_HUB_CACHE when that cache shares the output filesystem.
  • PACK_BASELINE_STORAGE_MODE=snapshot_copy materializes a full baseline copy under OUTPUT_DIR/models/<model>/baseline.
  • Baseline downloads prefer one weight format only. When both .safetensors and .bin weights are published, evidence packs download the safetensors set and ignore the .bin copy.
  • HF caches default to OUTPUT_DIR/.hf (override with HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE).

For the default subset suite (mistralai/Mistral-7B-v0.1), the model-weight budget is roughly:

  • ~42 GB on the output filesystem with snapshot_symlink when HF_HUB_CACHE lives on the same filesystem as OUTPUT_DIR (one cached baseline + one clean edit peak + one error-model peak under cleanup mode).
  • ~28 GB on the output filesystem with snapshot_symlink when HF_HUB_CACHE is on a separate volume.
  • ~56 GB on the output filesystem with snapshot_copy on the same filesystem.

Those figures are for model weights only; the default preflight also requires MIN_FREE_DISK_GB=200 headroom.
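
The three figures above are consistent with counting full-size weight copies of a ~14 GB model on the output filesystem. The sketch below reproduces them under that reading. The decomposition into copies (in particular, that snapshot_copy adds one baseline copy on top of the cache) is an inference from the numbers, and the function name is hypothetical.

```python
def weight_budget_gb(model_gb: float, storage_mode: str, cache_on_output_fs: bool) -> float:
    """Peak model-weight footprint on the output filesystem under cleanup mode."""
    copies = 2  # one clean edit peak + one error-model peak
    if cache_on_output_fs:
        copies += 1  # the Hugging Face cache snapshot lives on the same filesystem
    if storage_mode == "snapshot_copy":
        copies += 1  # full baseline copy under OUTPUT_DIR/models/<model>/baseline
    elif storage_mode != "snapshot_symlink":
        raise ValueError(f"unsupported storage mode for this sketch: {storage_mode!r}")
    return copies * model_gb


MISTRAL_7B_GB = 14  # approximate weights-only size from the model table

symlink_same_fs = weight_budget_gb(MISTRAL_7B_GB, "snapshot_symlink", True)
symlink_separate = weight_budget_gb(MISTRAL_7B_GB, "snapshot_symlink", False)
copy_same_fs = weight_budget_gb(MISTRAL_7B_GB, "snapshot_copy", True)
```

These evaluate to 42, 28, and 56 GB respectively, before the MIN_FREE_DISK_GB headroom is added.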

Evidence pack packaging and verification

run_pack.sh builds a portable pack:

  • Copies reports/final_verdict.{txt,json} plus verdict sidecars (category_summary, guard_signal_summary, scenario_signal_summary) and key analysis/* artifacts.
  • Collects all reports into evidence_pack/reports/....
  • Generates manifest.json, checksums.sha256, optional manifest.signature.json.
  • Writes pack-contained provenance metadata such as metadata/source_repo.json and metadata/environment.json before sealing the pack.
  • Stages the pack in a hidden sibling temporary directory and renames it into place only after sealing succeeds, so failed builds do not leave partial evidence_pack/ output behind.
  • Optional HTML export can be disabled with PACK_SKIP_HTML=1.
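
The stage-then-rename step above is a common pattern for all-or-nothing directory output. Here is a minimal sketch of the technique in Python, not the run_pack.sh implementation. The function name, the staging prefix, and the `populate` callback are hypothetical.

```python
import os
import shutil
import tempfile


def build_pack_atomically(final_dir: str, populate) -> None:
    """Build into a hidden sibling directory, then rename into place on success."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    staging = tempfile.mkdtemp(prefix=".evidence_pack.tmp.", dir=parent)
    try:
        populate(staging)              # copy reports, write manifest, seal, ...
        os.rename(staging, final_dir)  # same filesystem, so the rename is atomic
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)  # leave no partial output
        raise


def write_manifest(path: str) -> None:
    with open(os.path.join(path, "manifest.json"), "w") as fh:
        fh.write("{}")


def failing_build(path: str) -> None:
    raise RuntimeError("sealing failed")


with tempfile.TemporaryDirectory() as root:
    ok_dir = os.path.join(root, "evidence_pack")
    build_pack_atomically(ok_dir, write_manifest)
    ok = os.path.isfile(os.path.join(ok_dir, "manifest.json"))
    try:
        build_pack_atomically(os.path.join(root, "evidence_pack_failed"), failing_build)
    except RuntimeError:
        pass
    leftovers = sorted(os.listdir(root))  # only the successful pack remains
```

Keeping the staging directory next to the destination ensures both are on the same filesystem, which is what makes the final rename a single atomic step.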

Packaging flow

[Figure: Packaging flow from run_suite outputs into collected reports and sidecars, manifests, checksums, html, and package-native signatures.]

invarlock advanced evidence-pack verify checks the pack:

  • Verifies manifest.json binds checksums.sha256 via checksums_sha256_digest.
  • Verifies digest-backed manifest references (subject, invocation.config_source, environment, and materials) against on-pack files.
  • Verifies checksums.sha256 (and thus all hashed artifacts).
  • Verifies the package-native Ed25519 signature bundle; a missing manifest.signature.json is a signature failure by default.
  • Enforces signer authenticity when --expected-fingerprint or --trust-store is supplied.
  • Enforces “no extra files” semantics in --strict mode.
  • Runs invarlock verify across all bundled reports (JSON output optional) with runtime-manifest enforcement on; each packaged evaluation.report.json carries an adjacent runtime.manifest.json.
  • Returns structured exit codes so callers can distinguish usage, missing-file, manifest-format, signature, integrity, and report-verification failures.
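
The first and third checks above (manifest binds the checksum file, checksum file covers the artifacts) can be illustrated with a short sketch. It is not the package verifier. It assumes checksums.sha256 uses the conventional `<hex digest>  <relative path>` line format and that checksums_sha256_digest is a bare hex SHA-256 at the top level of manifest.json. Signature, strict-mode, and report checks are out of scope here.

```python
import hashlib
import json
import os
import tempfile


def sha256_file(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_pack_integrity(pack_dir: str) -> list:
    """Return a list of problems; an empty list means the hashes are consistent."""
    problems = []
    checksums_path = os.path.join(pack_dir, "checksums.sha256")
    with open(os.path.join(pack_dir, "manifest.json")) as fh:
        manifest = json.load(fh)
    # Assumed field location and encoding (bare hex digest).
    if manifest.get("checksums_sha256_digest") != sha256_file(checksums_path):
        problems.append("manifest does not bind checksums.sha256")
    with open(checksums_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, rel = line.rstrip("\n").split(None, 1)
            rel = rel.lstrip("*")  # tolerate the binary-mode marker
            target = os.path.join(pack_dir, rel)
            if not os.path.isfile(target):
                problems.append(f"missing: {rel}")
            elif sha256_file(target) != expected:
                problems.append(f"digest mismatch: {rel}")
    return problems


with tempfile.TemporaryDirectory() as pack:
    os.makedirs(os.path.join(pack, "reports"))
    report = os.path.join(pack, "reports", "evaluation.report.json")
    with open(report, "w") as fh:
        fh.write("{}")
    checksums = os.path.join(pack, "checksums.sha256")
    with open(checksums, "w") as fh:
        fh.write(f"{sha256_file(report)}  reports/evaluation.report.json\n")
    with open(os.path.join(pack, "manifest.json"), "w") as fh:
        json.dump({"checksums_sha256_digest": sha256_file(checksums)}, fh)
    clean_result = check_pack_integrity(pack)
    with open(report, "w") as fh:
        fh.write('{"tampered": true}')  # modify a hashed artifact
    tampered_result = check_pack_integrity(pack)
```

Chaining the hashes this way means that verifying one manifest digest transitively covers every file listed in checksums.sha256.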

The installed-wheel package-native CLI is self-contained:

  • invarlock advanced evidence-pack keygen generates Ed25519 signing keys.
  • invarlock advanced evidence-pack build --signing-key ... emits manifest.signature.json.
  • invarlock advanced evidence-pack verify validates the signature bundle in-process and does not depend on external signature binaries.

The repo shell harness remains a separate maintainer path, but it uses the same package-native Ed25519 manifest-signature format as the installed CLI.

Maintainer evidence-pack packaging also treats source provenance as fail-closed:

  • run_pack.sh writes metadata/source_repo.json from the active Git checkout.
  • If git is unavailable or the repository metadata cannot be collected, pack creation stops rather than silently emitting partial provenance, unless an explicit snapshot marker is present.
  • Detached artifact-tree runs may provide GPU_RUN_SOURCE.txt in the repository root, or point INVARLOCK_SOURCE_REPO_MARKER at an equivalent key-value file. The marker must include source_commit=<40-hex-commit> and may include source_branch, source_describe, source_uri, and source_dirty.
  • If you need to package from a detached artifact tree, write a complete metadata/source_repo.json first or provide that explicit snapshot marker rather than relying on fallback inference.
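
To make the marker requirements concrete, here is a minimal sketch of parsing and validating a key-value marker. The parser name is hypothetical. Skipping `#` comment lines and accepting upper-case hex are assumptions, and the commit value in the example is a placeholder, not a real commit.

```python
import re

_COMMIT_RE = re.compile(r"[0-9a-fA-F]{40}")


def parse_source_marker(text: str) -> dict:
    """Parse key=value lines and require a 40-hex source_commit."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed: blank and comment lines are ignored
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        fields[key.strip()] = value.strip()
    if not _COMMIT_RE.fullmatch(fields.get("source_commit", "")):
        raise ValueError("source_commit must be a 40-character hex commit")
    return fields


marker_text = "\n".join(
    [
        "# placeholder contents for illustration only",
        "source_commit=" + "ab" * 20,
        "source_branch=main",
        "source_dirty=false",
    ]
)
marker = parse_source_marker(marker_text)
```

Rejecting short or missing commits keeps the fail-closed behavior: an incomplete marker is treated the same as no marker at all.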

Remote setup helper

scripts/evidence_packs/lib/core/setup_remote.sh is an optional bootstrap script for fresh GPU hosts. It clones the repo, creates a venv, installs PyTorch and InvarLock, and leaves the host ready to run run_pack.sh.

Operational guidance for remote evidence-pack work:

  • Prefer a fresh clone or work tree per campaign instead of reusing an older editable-install checkout.
  • If you intentionally run from a work tree that is not the editable install behind .venv, either reinstall that work tree or export PYTHONPATH=src so invarlock resolves to the intended source tree.
  • run_suite.sh and run_pack.sh default to SKIP_FLASH_ATTN=true and PACK_BASELINE_STORAGE_MODE=snapshot_copy for bulk default runtime-container runs.
  • Bulk evidence-pack runs fail fast unless INVARLOCK_ALLOW_REMOTE_CODE=1 is set.
  • Export non-default runtime roots before launching the suite when you expect them inside delegated container jobs: INVARLOCK_CONFIG_ROOT, HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE, TRANSFORMERS_CACHE, TMPDIR, TMP.
  • If a staged preset or profile uses !include outside its config directory, set INVARLOCK_ALLOW_CONFIG_INCLUDE_OUTSIDE=1 on the remote host before the evidence-pack entrypoint; the default runtime-container launcher rejects that config graph before container start when the override is missing.
  • After Qwen2.5-14B campaigns run with PACK_CLEANUP_MODELS=0, run scripts/evidence_packs/run_qwen14_sentinels.sh from the same fresh work tree to validate saved-model direct evaluate and the public quant smoke. The sentinel helper reloads retained edit subject directories, so default cleanup mode is not sufficient for this follow-up check. The helper defaults to --profile dev --assurance off and requires evaluate/verify subprocesses to exit zero.

Recommended remote validation checklist after security-default changes:

  1. Run an evidence-pack subset lane with explicit external HF_HOME and INVARLOCK_CONFIG_ROOT overrides.
  2. Run one delegated invarlock evaluate with external --edit-config, TMPDIR, and INVARLOCK_EXPORT_DIR roots.
  3. Run one scripts/model_evidence/model_evidence_sweep.py --execution-mode container lane with an external output root and confirm the published report path is populated.

Common knobs for the setup script:

  • REPO_DIR, REPO_URL, BRANCH, PYTHON_BIN, VENV_DIR.
  • TORCH_INDEX_URL, TORCH_PACKAGES, PACK_SKIP_TORCH_CHECK.
  • HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE.

Tuning reference

Core configuration

| Variable | Default | Description |
| --- | --- | --- |
| PACK_SUITE | subset | Suite name (subset, showcase, workshop3, or full) |
| PACK_NET | 0 | Enable network preflight/downloads |
| PACK_OUTPUT_DIR | unset | Sets OUTPUT_DIR when provided |
| OUTPUT_DIR | auto | ./evidence_pack_runs/<suite>_<timestamp> via entrypoint |
| PACK_OUTPUT_DIR_ABSOLUTE | false | Normalize OUTPUT_DIR to absolute path |
| PACK_SUITE_MODE | full | full, calibrate-only, or run-only |
| PACK_DETERMINISM | throughput | Harness determinism mode |
| PACK_REPEATS | 0 | Determinism repeat metadata |
| PACK_MODEL_REVISIONS_FILE | OUTPUT_DIR/state/model_revisions.json | Revisions path |
| PACK_USE_BATCH_EDITS | auto | Force/disable batch edit creation |
| PACK_DEFER_REPORT_RENDERING | 0 (1 under --release-review) | Skip optional markdown/reviewer bundle rendering during evaluation |
| PACK_RUNTIME_IMAGE_FLAVOR | default | Remote setup runtime image flavor; use quant on CUDA hosts for optional quant adapter container evidence, including hf_bnb, hf_awq, hf_gptq, hf_torchao, hf_hqq, hf_quanto, and hf_ct |
| RESUME_MODE | true | Skip completed steps when outputs exist |

Hardware selection

| Variable | Default | Description |
| --- | --- | --- |
| CUDA_VISIBLE_DEVICES | unset | Explicit GPU pool (comma-separated) |
| GPU_ID_LIST | unset | Alternate GPU pool list |
| NUM_GPUS | auto | Number of GPUs to use (clamped to pool) |
| GPU_MEMORY_GB | auto | Per-GPU memory hint for planning |
| GPU_MEMORY_PER_DEVICE | GPU_MEMORY_GB | Per-device memory for required_gpus |
| GPU_MIN_FREE_GB | 10 | Minimum free VRAM for eligibility |
| GPU_REQUIRE_IDLE | true | Require GPUs with no compute processes |
| GPU_CACHE_TTL | 5 | GPU cache TTL (seconds) |
| GPU_RESERVATION_TTL | 60 | Reservation TTL (seconds) |
| GPU_RESERVATION_LOCK_TIMEOUT | 5 | Reservation lock timeout (seconds) |

Model overrides

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_1 through MODEL_8 | suite-defined | Override model slots; empty disables |

InvarLock settings

| Variable | Default | Description |
| --- | --- | --- |
| INVARLOCK_DATASET | wikitext2 | Dataset provider |
| INVARLOCK_DATASET_PROVIDER_YAML | unset | Raw YAML mapping for dataset.provider (advanced; overrides provider kind + args) |
| INVARLOCK_DATASET_PROVIDER_JSON | unset | Raw JSON object for dataset.provider (advanced; overrides provider kind + args) |
| INVARLOCK_HF_DATASET_NAME | allenai/c4 | HF dataset name when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_CONFIG_NAME | en (for allenai/c4) | HF dataset config when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_TEXT_FIELD | text | Text field when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_MAX_SAMPLES | 2000 | Max rows consumed when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_TRUST_REMOTE_CODE | unset | Pass trust_remote_code to HF load_dataset (not needed for allenai/c4 Parquet) |
| INVARLOCK_HF_CACHE_DIR | unset | datasets cache_dir override when INVARLOCK_DATASET=hf_text |
| INVARLOCK_LOCAL_JSONL_FILE | unset | JSONL file path when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_PATH | unset | JSONL file/dir path when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_DATA_FILES | unset | JSONL glob/list when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_TEXT_FIELD | text | Text field when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_MAX_SAMPLES | 2000 | Max rows consumed when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_TIER | balanced | Guard tier preset |
| INVARLOCK_PREVIEW_WINDOWS | 32 | Preview windows |
| INVARLOCK_FINAL_WINDOWS | 32 | Final windows |
| INVARLOCK_SEQ_LEN | 512 | Sequence length |
| INVARLOCK_STRIDE | 256 | Stride |
| INVARLOCK_EVAL_BATCH | 32 | InvarLock batch size |
| PACK_GUARDS_ORDER | invariants,spectral,rmt,variance,invariants | Guards included in preset derivation and generated presets |

Primary metric acceptance/drift gates should be configured via profile/config (primary_metric.acceptance_range, primary_metric.drift_band), not env vars.

Tuned edit presets

| Variable | Default | Description |
| --- | --- | --- |
| PACK_TUNED_EDIT_PARAMS_FILE | unset | JSON file with tuned clean edit params (required when CLEAN_EDIT_RUNS>0). |

Preset derivation reuse

| Variable | Default | Description |
| --- | --- | --- |
| PACK_CALIBRATION_PRESET_DIR | unset | Directory containing calibrated_preset_<model>.yaml/json to reuse; skips preset-derivation runs. |
| PACK_CALIBRATION_PRESET_FILE | unset | Single preset file applied to all models (advanced). |

Experiment controls

| Variable | Default | Description |
| --- | --- | --- |
| DRIFT_CALIBRATION_RUNS | 5 | Preset-derivation run count |
| CLEAN_EDIT_RUNS | 3 | Clean edit evaluate runs |
| STRESS_EDIT_RUNS | 2 | Stress edit evaluate runs |
| RUN_ERROR_INJECTION | true | Enable error injection |

Storage and memory planning

| Variable | Default | Description |
| --- | --- | --- |
| PACK_BASELINE_STORAGE_MODE | snapshot_copy | Baseline storage mode for run_suite.sh and run_pack.sh (snapshot_symlink, snapshot_copy, or save_pretrained) |
| MIN_FREE_DISK_GB | 200 | Disk pressure threshold |
| PACK_SKIP_DISK_PREFLIGHT | 0 | Skip storage preflight |
| CUDA_MEMORY_FRACTION | 0.92 | Target GPU memory fraction |
| MODEL_LOAD_OVERHEAD_GB | 4 | Load overhead for planning |
| EDIT_OVERHEAD_GB | 8 | Per-edit overhead for planning |
| BATCH_EDIT_OVERHEAD_GB | 8 | Batch edit overhead |
| INVARLOCK_OVERHEAD_GB | 6 | InvarLock overhead |

Worker + reliability controls

| Variable | Default | Description |
| --- | --- | --- |
| WORKER_HEARTBEAT_INTERVAL | 30 | Heartbeat interval (seconds) |
| WORKER_IDLE_SLEEP | 5 | Sleep when idle (seconds) |
| WORKER_MAX_FAILURES | 10 | Stop worker after N failures |
| WORKER_TIMEOUT | 2700 | Worker heartbeat timeout (seconds) |
| CANCEL_BLOCKED_TASKS_GRACE_SECONDS | 90 | Fail blocked tasks after grace |
| TASK_TIMEOUT_DEFAULT | 21600 | Default task timeout (seconds) |
| TASK_TIMEOUT_<TASKTYPE> | unset | Per-task timeout override |

Packaging and verification

| Variable | Default | Description |
| --- | --- | --- |
| PACK_DIR | OUTPUT_DIR/evidence_pack | Evidence pack output dir |
| PACK_SIGN_MANIFEST | 1 | Sign manifest.json with a package-native Ed25519 key (auto-generated if PACK_SIGNING_KEY is unset) |
| PACK_SIGNING_KEY | unset | Optional Ed25519 private key PEM for deterministic signer identity |
| PACK_SKIP_HTML | 0 | Skip HTML rendering |
| PACK_VERIFY_PROFILE | dev | Profile for invarlock verify |
| PACK_REPORT_ASSURANCE | report | Nested report assurance mode passed to report verification (report, strict, or off) |
| PACK_REQUIRE_PASS | 0 | Fail pack generation unless final_verdict.json is PASS |
| PACK_REQUIRE_RUNTIME_MANIFESTS | 0 | Require report-adjacent runtime manifests during hardened pack generation |
| PACK_RELEASE_REVIEW | 0 | Set by run_pack.sh --release-review; requires PASS verdicts, signed manifests, runtime manifests, CI profile, and strict report assurance |

Troubleshooting

Missing model revisions (offline)

If an offline run fails with “requires model revisions”, run once with network access enabled so the suite can record them:

export INVARLOCK_ALLOW_REMOTE_CODE=1
./scripts/evidence_packs/run_suite.sh --suite subset --net 1

Or point to an existing revisions file with PACK_MODEL_REVISIONS_FILE.

OOM on large models

  • Lower GPU_MEMORY_PER_DEVICE so the planner requests more GPUs.
  • Disable batch edits: PACK_USE_BATCH_EDITS=false.
  • Reduce InvarLock batch/seq_len (e.g., INVARLOCK_EVAL_BATCH=16 INVARLOCK_SEQ_LEN=256).
  • Increase memory overhead knobs (MODEL_LOAD_OVERHEAD_GB, EDIT_OVERHEAD_GB).

Disk pressure / preflight failures

Check state/disk_pressure.json and make sure the output filesystem has headroom. Use MIN_FREE_DISK_GB=0 or PACK_SKIP_DISK_PREFLIGHT=1 only if you accept the risk of partial artifacts.

Task timeouts

Increase the default or per-task timeout:

export INVARLOCK_ALLOW_REMOTE_CODE=1
TASK_TIMEOUT_DEFAULT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset
TASK_TIMEOUT_CREATE_EDIT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset
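
The override order implied by these variables can be sketched as a lookup that prefers the per-task variable and falls back to the default. This is a minimal illustration with a hypothetical helper name. The precedence shown, and treating an empty value as unset, are assumptions about the harness.

```python
import os


def resolve_task_timeout(task_type: str, env=None) -> int:
    """Prefer TASK_TIMEOUT_<TASKTYPE>, then TASK_TIMEOUT_DEFAULT, then 21600 seconds."""
    env = os.environ if env is None else env
    raw = (
        env.get(f"TASK_TIMEOUT_{task_type.upper()}")
        or env.get("TASK_TIMEOUT_DEFAULT")
        or "21600"  # documented default for TASK_TIMEOUT_DEFAULT
    )
    return int(raw)


per_task = resolve_task_timeout("CREATE_EDIT", {"TASK_TIMEOUT_CREATE_EDIT": "28800"})
from_default = resolve_task_timeout("CREATE_EDIT", {"TASK_TIMEOUT_DEFAULT": "14400"})
built_in = resolve_task_timeout("CREATE_EDIT", {})
```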

Stuck queues or dead workers

  • Inspect state/progress.json and workers/gpu_<id>.status.
  • Check worker logs: logs/gpu_<id>.log and logs/tasks/<task_id>.log.
  • Re-run with --resume to recover from a crash.