Evidence Pack Internals

This guide explains how the evidence pack suite is wired internally: entrypoints, task graph, scheduling, and artifact generation. It complements Evidence Packs, which focuses on how to run a suite.

Scope note: in this guide, CALIBRATION_RUN -> GENERATE_PRESET is called Preset Derivation. It produces run-scoped calibrated_preset_<model>.yaml/json files and does not directly modify global runtime/tiers.yaml.

Overview

| Aspect | Details |
| --- | --- |
| Purpose | Hardware-agnostic Phase 0 validation harness for edit detection |
| Version | evidence-packs-v1 |
| Hardware | NVIDIA GPUs where models fit in VRAM; multi-GPU recommended for full |
| Models | subset (1 model), showcase/workshop3 (3 models), or full (6 models); all ungated public |
| Edits | Scenario-driven; default suites use 4 clean + 4 stress edit scenarios per model, and filtered manifests may select any subset |
| Preset Derivation | CALIBRATION_RUN + GENERATE_PRESET create run-scoped calibrated presets |
| Scheduling | Dynamic work-stealing, small_first priority strategy |
| Multi-GPU | Profile-based; required_gpus grows only when memory requires it |
| Output | Evidence pack with manifest.json, checksums.sha256, and report bundles (--layout v2 nests results + metadata) |
| Source of truth | scripts/evidence_packs/run_suite.sh, scripts/evidence_packs/run_pack.sh, src/invarlock/evidence_pack.py, src/invarlock/cli/commands/evidence_pack.py |

Quick Start (Context)

# Run the subset suite (offline by default)
./scripts/evidence_packs/run_suite.sh --suite subset

# Run the full suite and build an evidence pack
./scripts/evidence_packs/run_pack.sh --suite full --net 1

# Verify an existing evidence pack
invarlock advanced evidence-pack verify ./evidence_pack_runs/subset_20250101_000000/evidence_pack --strict

Hardware Target

  • Hardware-agnostic by design; run on any NVIDIA GPU topology where the models fit in VRAM.
  • Multi-GPU scheduling is enabled automatically when a task’s memory plan exceeds per-device capacity.
  • Set GPU_MEMORY_GB or GPU_MEMORY_PER_DEVICE to match your hardware when running on GPUs with unusual memory sizes.

Entrypoints and modules

Entrypoints

  • scripts/evidence_packs/run_suite.sh runs a suite and sets PACK_* runtime flags before calling the main orchestrator.
  • scripts/evidence_packs/run_pack.sh runs a suite, then packages artifacts into a portable evidence pack (manifest + checksums + reports).
  • scripts/evidence_packs/verify_pack.sh validates an evidence pack in repo workflows.
  • invarlock advanced evidence-pack verify provides the package-native verifier path for installed wheels.
  • scripts/evidence_packs/suites.sh defines the model suites and allows MODEL_1…MODEL_8 overrides.
  • scripts/evidence_packs/lib/validation_suite.sh orchestrates the run: preflight, queue creation, worker launch, and monitoring.

Library modules

  • lib/task_serialization.sh: task schema, JSON helpers, GPU planning.
  • lib/queue_manager.sh: queue states, dependency resolution, task generation.
  • lib/scheduler.sh: dynamic priority, memory gating, reservations.
  • lib/gpu_worker.sh: worker loop, heartbeats, task execution glue.
  • lib/task_functions.sh: implementations for each task type.
  • lib/model_creation.sh: edit and error-model creation helpers (create_model_variant dispatcher).
  • lib/config_generator.sh: InvarLock config generation and wrapper helpers.
  • lib/result_compiler.sh: analysis and verdict compilation.
  • lib/fault_tolerance.sh: error classification and retry/backoff logic.
  • scripts/evidence_packs/python/manifest_writer.py: evidence pack manifest.json writer.
  • scripts/evidence_packs/python/preset_generator.py: preset derivation + edit-type variants.

Module dependency graph

Evidence pack module dependency graph from entrypoints into queueing, execution, and packaging helpers.

Troubleshooting decision tree

Troubleshooting decision tree for missing evidence-pack outputs and common guard failures.

Model Suite

Model suites are defined in scripts/evidence_packs/suites.sh and applied by run_suite.sh.

| Suite | Models | Notes |
| --- | --- | --- |
| subset | mistralai/Mistral-7B-v0.1 | Single-GPU friendly |
| showcase | mistralai/Mistral-7B-v0.1, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-32B | Multi-GPU recommended; guard-focused scenarios |
| workshop3 | mistralai/Mistral-7B-v0.1, mistralai/Mixtral-8x7B-v0.1, 01-ai/Yi-34B | Workshop-friendly 3-model suite (architecture diversity) |
| full | mistralai/Mistral-7B-v0.1, Qwen/Qwen2.5-14B, Qwen/Qwen2.5-32B, 01-ai/Yi-34B, mistralai/Mixtral-8x7B-v0.1, Qwen/Qwen1.5-72B | Multi-GPU recommended |

Default full-suite model sizes (weights-only, approximate):

| Model | VRAM | Category | Notes |
| --- | --- | --- | --- |
| mistralai/Mistral-7B-v0.1 | ~14 GB | Small | Flash Attention 2 compatible |
| Qwen/Qwen2.5-14B | ~28 GB | Small | Flash Attention 2 compatible |
| Qwen/Qwen2.5-32B | ~64 GB | Medium | Flash Attention 2 compatible |
| 01-ai/Yi-34B | ~68 GB | Medium | Flash Attention 2 compatible |
| mistralai/Mixtral-8x7B-v0.1 | ~90 GB | MoE | MoE architecture |
| Qwen/Qwen1.5-72B | ~144 GB | Large | Flash Attention 2 compatible |

Notes:

  • Override models via MODEL_1…MODEL_8; set an empty string to disable a slot.
  • validation_suite.sh includes a fallback list of large causal models if it is run directly without suites.sh.
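As an illustrative sketch of the override mechanism described above (the model IDs come from the suite tables; the run_suite.sh invocation is shown commented out):

```shell
# Override the first two suite slots and disable the third.
# Slot variables you leave unset keep their suite-defined values.
export MODEL_1="mistralai/Mistral-7B-v0.1"
export MODEL_2="Qwen/Qwen2.5-14B"
export MODEL_3=""   # empty string disables this slot
# ./scripts/evidence_packs/run_suite.sh --suite full
```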

Edit Types

Each model runs 8 edit experiments (4 types × 2 versions) plus optional error injection tests.

Clean edits (tuned)

Clean edits use tuned parameters supplied via PACK_TUNED_EDIT_PARAMS_FILE. The suite uses :clean: as a sentinel in the edit spec and resolves concrete parameters at runtime.

| Edit Type | Parameters | Scope |
| --- | --- | --- |
| Quantization RTN | tuned (bitwidth, group_size) from tuned params file | FFN only |
| FP8 Quantization | tuned (format) from tuned params file | FFN only |
| Magnitude Pruning | tuned (prune_level) from tuned params file | FFN only |
| Low-Rank SVD | tuned (rank) from tuned params file | FFN only |
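The authoritative schema for the tuned params file lives in the suite scripts; the sketch below is a hypothetical shape only, reusing the edit-type identifiers from the stress table and the parameter names listed above (bitwidth, group_size, format, prune_level, rank). The nesting and the concrete values are assumptions.

```shell
# Hypothetical PACK_TUNED_EDIT_PARAMS_FILE contents -- shape and values
# are illustrative assumptions, not the authoritative schema.
cat > /tmp/tuned_edit_params.json <<'EOF'
{
  "quant_rtn": {"bitwidth": 4, "group_size": 128},
  "fp8_quant": {"format": "e4m3"},
  "magnitude_prune": {"prune_level": 0.2},
  "lowrank_svd": {"rank": 256}
}
EOF
export PACK_TUNED_EDIT_PARAMS_FILE=/tmp/tuned_edit_params.json
```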

Stress edits

Stress edits are split into required-fail (catastrophic) and informational scenarios. Required-fail scenarios are gating in the final verdict; informational scenarios are tracked as detection-quality signals and are validated by a minimum signal-fraction criterion.

Important nuance: some guards remediate without flipping a boolean validation gate. For example, Spectral can remain validation.spectral_stable=true while applying caps (spectral.caps_applied > 0). Informational stress scenarios treat both hard gate flips and remediation events (caps applied) as a “signal” so the suite measures guard activity without manufacturing clean false positives.

| Edit Type | Parameters | Scope |
| --- | --- | --- |
| Quantization RTN | quant_rtn:8:all (8-bit) | All layers |
| FP8 Quantization | fp8_quant:e5m2:all | All layers |
| Magnitude Pruning | magnitude_prune:0.5:all (50% sparsity) | All layers |
| Low-Rank SVD | lowrank_svd:32:all (rank 32) | All layers |

Error injection tests

Enabled when RUN_ERROR_INJECTION=true (default):

  • Required detection (must_detect): nan_injection, inf_injection, shape_mismatch, missing_tensors, extreme_quant, scale_explosion, rank_collapse, norm_collapse, weight_tying_break
  • Informational detection: rmt_norm_noise, spectral_moderate_scale, ve_mlp_scale_skew

rmt_norm_noise additionally emits an rmt_probe.json sidecar next to the error report. This runs an explicit cross-model RMT probe on shared calibration windows (stored in the baseline report) so the evidence pack can demonstrate RMT’s delta policy even when compare-mode evaluation keeps validation.rmt_stable=true.

ve_mlp_scale_skew additionally emits a ve_probe.json sidecar next to the error report. Variance (DD-VE) is a remediation guard and compare-mode evaluation runs the subject model with a no-op edit, which can mute VE’s in-report evidence. The VE probe runs VE calibration directly on shared windows and records whether VE proposes scales and produces a meaningful primary-metric improvement.

Source of truth: scripts/evidence_packs/scenarios.json strictness + intent + primary_guard metadata.

Scheduling

The suite uses dynamic work-stealing scheduling with a file-backed task queue. validation_suite.sh seeds the queue and launches one worker per GPU; workers claim tasks under a scheduler lock with GPU reservation files.

small_first priority strategy

Base task priorities (queue manager) are combined with dynamic boosts in scheduler.sh (model size, blocked dependents, age, and fairness penalties).

Priority bands mapping evidence-pack task types to base scheduler priority values.

Dynamic boosts (scheduler):

  • Model size boosts: <30GB (+30), <70GB (+20), <100GB (+10).
  • Critical tasks: SETUP_BASELINE (+50), CALIBRATION_RUN (+20).
  • Unblock boost: +2 per dependent task (capped).
  • Age boost: +1 per 5 minutes in the queue (capped).
  • Fairness penalty: -3 per running task for the same model (capped).
  • Work-stealing boost: raises priority for lagging models.
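A minimal sketch of how these boosts might combine into an effective priority. The function below is illustrative only; the real arithmetic, caps, and the work-stealing boost live in lib/scheduler.sh.

```shell
# Illustrative effective-priority calculation: base priority comes from the
# queue manager; size/unblock/age boosts and the fairness penalty mirror the
# bullets above (caps omitted here for brevity).
effective_priority() {
  local base=$1 model_gb=$2 dependents=$3 age_min=$4 running_same_model=$5
  local p=$base
  if   [ "$model_gb" -lt 30 ];  then p=$((p + 30))
  elif [ "$model_gb" -lt 70 ];  then p=$((p + 20))
  elif [ "$model_gb" -lt 100 ]; then p=$((p + 10))
  fi
  p=$((p + dependents * 2))          # unblock boost: +2 per dependent
  p=$((p + age_min / 5))             # age boost: +1 per 5 minutes queued
  p=$((p - running_same_model * 3))  # fairness penalty: -3 per running task
  echo "$p"
}

effective_priority 50 14 4 10 1   # small model, 4 dependents, 10 min old -> prints 87
```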

Dynamic scheduling diagram

Scheduler flow from run_pack through run_suite, queue initialization, worker launch, and monitor loop.

Work-stealing timeline (illustrative)

GPU work-stealing timeline showing smaller jobs finishing early and helping with larger jobs.

Illustrative only; actual scheduling depends on queue state and memory.

Multi-GPU Model Distribution

After baseline setup, the suite writes model_profile.json and updates per-task memory estimates. task_serialization.sh calculates required_gpus based on GPU_MEMORY_PER_DEVICE and NUM_GPUS:

  • Tasks reserve multiple GPUs only when memory exceeds per-device capacity.
  • Adaptive under-allocation is disabled by default (get_minimum_gpus matches required_gpus) to avoid OOM.
  • Set GPU_MEMORY_PER_DEVICE explicitly for non-80/180GB hardware.
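The required_gpus computation reduces to a ceiling division of a task's memory plan by per-device capacity, clamped to the pool size. A sketch under that reading (the real logic, including profile updates, is in lib/task_serialization.sh):

```shell
# Sketch: ceil(task_mem / per_device_mem), clamped to NUM_GPUS.
required_gpus() {
  local task_mem_gb=$1 per_device_gb=$2 num_gpus=$3
  local n=$(( (task_mem_gb + per_device_gb - 1) / per_device_gb ))  # integer ceil
  if [ "$n" -gt "$num_gpus" ]; then n=$num_gpus; fi
  echo "$n"
}

required_gpus 144 80 4   # a ~144 GB plan on 80 GB devices -> prints 2
```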

Memory-aware selection example

Memory-fit decision example showing ready-queue scanning against free GPU memory.

GPU reservation protection

Reservations are stored under OUTPUT_DIR/workers/gpu_reservations/ and guarded by a queue/scheduler.lock (mkdir-based). The scheduler also expires stale reservations by TTL (GPU_RESERVATION_TTL).
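mkdir-based locking works because mkdir is atomic: exactly one caller can create the directory, and everyone else fails until it is removed. A minimal sketch of the pattern (paths and timings here are illustrative, not the suite's exact implementation):

```shell
# Illustrative mkdir lock: atomic create-or-fail, release via rmdir.
LOCK_DIR="${TMPDIR:-/tmp}/scheduler.lock.demo"

acquire_lock() {
  local tries=0
  until mkdir "$LOCK_DIR" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge 50 ]; then return 1; fi   # give up after ~5s
    sleep 0.1
  done
}

release_lock() { rmdir "$LOCK_DIR"; }

acquire_lock
echo "critical section: update reservations here"
release_lock
```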

Reservation state example

GPU reservation state showing free and reserved devices for multi-GPU task claims.
Reservation file layout for scheduler locks and GPU reservation metadata.

Task lifecycle

Task lifecycle state machine from pending through ready, running, completed, and failed.

GPU worker loop

GPU worker loop from shutdown checks through claim, execute, and release.

Batch optimizations

Small/medium models default to batch edit creation:

  • Batch edit creation: CREATE_EDITS_BATCH loads a model once and creates all 8 edits (cuts repeated model loads).

Large or MoE models disable batch edits automatically (or via PACK_USE_BATCH_EDITS=false) and fall back to per-edit tasks (CREATE_EDIT → evaluate_EDIT).

Task dependency graphs

Batch (default):

Batch dependency graph from baseline setup into calibration, preset generation, edit, and error evaluations.

Notes:

  • Error injection tasks (CREATE_ERROR → evaluate_ERROR) branch off SETUP_BASELINE and require the preset for evaluation.

Per-edit path (large/MoE or PACK_USE_BATCH_EDITS=false):

Per-edit dependency graph from baseline setup into edit and error evaluation tasks.

Task breakdown per model (defaults)

Defaults: DRIFT_CALIBRATION_RUNS=5, CLEAN_EDIT_RUNS=3, STRESS_EDIT_RUNS=2, RUN_ERROR_INJECTION=true.

Batch path (default for small/medium):

  • Setup baseline: 1 task
  • Preset-derivation runs + preset generation: 6 tasks
  • Batch edits: 1 task
  • evaluate edits: 20 tasks
  • Error injection: 10 tasks

Total: ~38 tasks/model (varies with overrides).

Per-edit path (large/MoE or PACK_USE_BATCH_EDITS=false):

  • Setup baseline: 1 task
  • Preset-derivation runs + preset generation: 6 tasks
  • Create edits: 8 tasks
  • evaluate edits: 20 tasks
  • Error injection: 10 tasks

Total: ~45 tasks/model (varies with overrides).
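With the defaults above, the 20 evaluate-edit tasks come from 4 clean scenarios × 3 runs plus 4 stress scenarios × 2 runs, and the 6 preset tasks from 5 calibration runs plus one GENERATE_PRESET. A quick arithmetic check of both totals:

```shell
# Task-count arithmetic for the defaults (CLEAN_EDIT_RUNS=3, STRESS_EDIT_RUNS=2,
# DRIFT_CALIBRATION_RUNS=5; 10 error-injection tasks per the breakdown above).
evaluate_edits=$(( 4 * 3 + 4 * 2 ))   # 12 clean + 8 stress = 20
preset_tasks=$(( 5 + 1 ))             # 5 calibration runs + GENERATE_PRESET
batch_total=$(( 1 + preset_tasks + 1 + evaluate_edits + 10 ))
per_edit_total=$(( 1 + preset_tasks + 8 + evaluate_edits + 10 ))
echo "batch=$batch_total per_edit=$per_edit_total"   # batch=38 per_edit=45
```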

Execution phases

Execution phases from environment setup through queue initialization, worker launch, and final reports.

Run directory layout

Output directory layout for evidence-pack analysis, reports, and final verdict artifacts.

Some scenarios emit additional sidecar artifacts alongside evaluation.report.json (for example reports/errors/rmt_norm_noise/rmt_probe.json or reports/errors/ve_mlp_scale_skew/ve_probe.json). When present, run_pack.sh copies these sidecars into the packaged evidence pack under reports/**/.

Run modes

  • --calibrate-only / PACK_SUITE_MODE=calibrate-only
    • Preset derivation only mode.
    • Only promotes SETUP_BASELINE, CALIBRATION_RUN, and GENERATE_PRESET tasks.
    • The monitor exits after all GENERATE_PRESET tasks complete.
  • --run-only
    • Continue a prior run after preset derivation. This is effectively --resume with PACK_SUITE_MODE=full.
  • --resume
    • Reuses an existing queue and continues from where the run stopped.

Determinism vs throughput

PACK_DETERMINISM controls harness-level determinism:

# Throughput (default)
PACK_DETERMINISM=throughput ./scripts/evidence_packs/run_suite.sh --suite subset

# Strict
PACK_DETERMINISM=strict ./scripts/evidence_packs/run_suite.sh --suite subset

  • Throughput: NVIDIA_TF32_OVERRIDE=1, CUDNN_BENCHMARK=1.
  • Strict: NVIDIA_TF32_OVERRIDE=0, CUDNN_BENCHMARK=0, CUBLAS_WORKSPACE_CONFIG=:4096:8.

Network mode and model revisions

Evidence packs are offline by default:

  • PACK_NET=0 sets INVARLOCK_ALLOW_NETWORK=0 and enables HF offline modes.
  • PACK_NET=1 enables downloads and writes state/model_revisions.json (ungated models only).
  • Offline runs require model_revisions.json; missing revisions trigger a hard error during SETUP_BASELINE.

Use PACK_MODEL_REVISIONS_FILE to override the revisions path.

Disk and cache behavior

Large runs can be storage-heavy (baseline + edits + error models):

  • Disk preflight estimates required storage and aborts early when insufficient.
    • Override with PACK_SKIP_DISK_PREFLIGHT=1 (not recommended).
    • The minimum free space guard is MIN_FREE_DISK_GB (default 200).
  • PACK_BASELINE_STORAGE_MODE=snapshot_symlink builds a local symlink tree that points into the Hugging Face cache snapshot. This avoids a second baseline copy under OUTPUT_DIR, but it requires one full model copy in HF_HUB_CACHE when that cache shares the output filesystem.
  • PACK_BASELINE_STORAGE_MODE=snapshot_copy materializes a full baseline copy under OUTPUT_DIR/models/<model>/baseline.
  • Baseline downloads prefer one weight format only. When both .safetensors and .bin weights are published, evidence packs download the safetensors set and ignore the .bin copy.
  • HF caches default to OUTPUT_DIR/.hf (override with HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE).

For the default subset suite (mistralai/Mistral-7B-v0.1), the model-weight budget is roughly:

  • ~42 GB on the output filesystem with snapshot_symlink when HF_HUB_CACHE lives on the same filesystem as OUTPUT_DIR (one cached baseline + one clean edit peak + one error-model peak under cleanup mode).
  • ~28 GB on the output filesystem with snapshot_symlink when HF_HUB_CACHE is on a separate volume.
  • ~56 GB on the output filesystem with snapshot_copy on the same filesystem.

Those figures are for model weights only; the default preflight also requires MIN_FREE_DISK_GB=200 headroom.
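One way to read those figures, assuming roughly 14 GB for the Mistral-7B weight set (per the size table earlier): each budget is a multiple of one weight-set copy. The breakdown for the 42 GB case is stated above; the per-copy readings for the 28 GB and 56 GB cases are inferred, not authoritative.

```shell
weights_gb=14   # approximate Mistral-7B-v0.1 weight-set size
echo "symlink, shared fs:   $(( 3 * weights_gb )) GB"  # cached baseline + edit peak + error peak
echo "symlink, separate fs: $(( 2 * weights_gb )) GB"  # inferred: edit peak + error peak (cache elsewhere)
echo "copy, shared fs:      $(( 4 * weights_gb )) GB"  # inferred: adds a materialized baseline copy
```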

Evidence pack packaging and verification

run_pack.sh builds a portable pack:

  • Copies reports/final_verdict.{txt,json} plus verdict sidecars (category_summary, guard_signal_summary, scenario_signal_summary) and key analysis/* artifacts.
  • Collects all reports into evidence_pack/reports/....
  • Generates manifest.json, checksums.sha256, optional manifest.signature.json.
  • Writes pack-contained provenance metadata such as metadata/source_repo.json and metadata/environment.json before sealing the pack.
  • Stages the pack in a hidden sibling temporary directory and renames it into place only after sealing succeeds, so failed builds do not leave partial evidence_pack/ output behind.
  • Optional HTML export can be disabled with PACK_SKIP_HTML=1.

Packaging flow

Packaging flow from run_suite outputs into collected reports and sidecars, manifests, checksums, html, and package-native signatures.

invarlock advanced evidence-pack verify checks the pack:

  • Verifies manifest.json binds checksums.sha256 via checksums_sha256_digest.
  • Verifies digest-backed manifest references (subject, invocation.config_source, environment, and materials) against on-pack files.
  • Verifies checksums.sha256 (and thus all hashed artifacts).
  • Verifies the package-native Ed25519 signature bundle when present; --strict requires it.
  • Enforces “no extra files” semantics in --strict mode.
  • Runs invarlock verify across all bundled reports (JSON output optional) with runtime-manifest enforcement on; each packaged evaluation.report.json carries an adjacent runtime.manifest.json.
  • Returns structured exit codes so callers can distinguish usage, missing-file, manifest-format, signature, integrity, and report-verification failures.

The installed-wheel package-native CLI is self-contained:

  • invarlock advanced evidence-pack keygen generates Ed25519 signing keys.
  • invarlock advanced evidence-pack build --signing-key ... emits manifest.signature.json.
  • invarlock advanced evidence-pack verify validates the signature bundle in-process and does not depend on external signature binaries.

The repo shell harness remains a separate maintainer path, but it uses the same package-native Ed25519 manifest-signature format as the installed CLI.

Maintainer evidence-pack packaging also treats source provenance as fail-closed:

  • run_pack.sh writes metadata/source_repo.json from the active Git checkout.
  • If git is unavailable or the repository metadata cannot be collected, pack creation stops instead of silently emitting partial provenance.
  • If you need to package from a detached artifact tree, write a complete metadata/source_repo.json first rather than relying on fallback inference.

Remote setup helper

scripts/evidence_packs/lib/setup_remote.sh is an optional bootstrap script for fresh GPU hosts. It clones the repo, creates a venv, installs PyTorch and InvarLock, and leaves the host ready to run run_pack.sh.

Operational guidance for remote evidence-pack work:

  • Prefer a fresh clone or work tree per campaign instead of reusing an older editable-install checkout.
  • If you intentionally run from a work tree that is not the editable install behind .venv, either reinstall that work tree or export PYTHONPATH=src so invarlock resolves to the intended source tree.
  • run_suite.sh and run_pack.sh default to SKIP_FLASH_ATTN=true and PACK_BASELINE_STORAGE_MODE=snapshot_copy for bulk runs in the default runtime container.
  • Bulk evidence-pack runs fail fast unless INVARLOCK_ALLOW_REMOTE_CODE=1 is set.
  • Export non-default runtime roots before launching the suite when you expect them inside delegated container jobs: INVARLOCK_CONFIG_ROOT, HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE, TRANSFORMERS_CACHE, TMPDIR, TMP.
  • If a staged preset or profile uses !include outside its config directory, set INVARLOCK_ALLOW_CONFIG_INCLUDE_OUTSIDE=1 on the remote host before the evidence-pack entrypoint; the default runtime-container launcher rejects that config graph before container start when the override is missing.
  • After Qwen2.5-14B campaigns, run scripts/evidence_packs/run_qwen14_sentinels.sh from the same fresh work tree to validate saved-model direct evaluate and the public quant smoke.

Recommended remote validation checklist after security-default changes:

  1. Run an evidence-pack subset lane with explicit external HF_HOME and INVARLOCK_CONFIG_ROOT overrides.
  2. Run one delegated invarlock evaluate with external --edit-config, TMPDIR, and INVARLOCK_EXPORT_DIR roots.
  3. Run one scripts/model_evidence_sweep.py --execution-mode container lane with an external output root and confirm the published report path is populated.

Common knobs for the setup script:

  • REPO_DIR, REPO_URL, BRANCH, PYTHON_BIN, VENV_DIR.
  • TORCH_INDEX_URL, TORCH_PACKAGES, PACK_SKIP_TORCH_CHECK.
  • HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE.

Tuning reference

Core configuration

| Variable | Default | Description |
| --- | --- | --- |
| PACK_SUITE | subset | Suite name (subset or full) |
| PACK_NET | 0 | Enable network preflight/downloads |
| PACK_OUTPUT_DIR | unset | Sets OUTPUT_DIR when provided |
| OUTPUT_DIR | auto | ./evidence_pack_runs/<suite>_<timestamp> via entrypoint |
| PACK_OUTPUT_DIR_ABSOLUTE | false | Normalize OUTPUT_DIR to absolute path |
| PACK_SUITE_MODE | full | full, calibrate-only, or run-only |
| PACK_DETERMINISM | throughput | Harness determinism mode |
| PACK_REPEATS | 0 | Determinism repeat metadata |
| PACK_MODEL_REVISIONS_FILE | OUTPUT_DIR/state/model_revisions.json | Revisions path |
| PACK_USE_BATCH_EDITS | auto | Force/disable batch edit creation |
| RESUME_MODE | true | Skip completed steps when outputs exist |

Hardware selection

| Variable | Default | Description |
| --- | --- | --- |
| CUDA_VISIBLE_DEVICES | unset | Explicit GPU pool (comma-separated) |
| GPU_ID_LIST | unset | Alternate GPU pool list |
| NUM_GPUS | auto | Number of GPUs to use (clamped to pool) |
| GPU_MEMORY_GB | auto | Per-GPU memory hint for planning |
| GPU_MEMORY_PER_DEVICE | GPU_MEMORY_GB | Per-device memory for required_gpus |
| GPU_MIN_FREE_GB | 10 | Minimum free VRAM for eligibility |
| GPU_REQUIRE_IDLE | true | Require GPUs with no compute processes |
| GPU_CACHE_TTL | 5 | GPU cache TTL (seconds) |
| GPU_RESERVATION_TTL | 60 | Reservation TTL (seconds) |
| GPU_RESERVATION_LOCK_TIMEOUT | 5 | Reservation lock timeout (seconds) |

Model overrides

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_1…MODEL_8 | suite-defined | Override model slots; empty disables |

InvarLock settings

| Variable | Default | Description |
| --- | --- | --- |
| INVARLOCK_DATASET | wikitext2 | Dataset provider |
| INVARLOCK_DATASET_PROVIDER_YAML | unset | Raw YAML mapping for dataset.provider (advanced; overrides provider kind + args) |
| INVARLOCK_DATASET_PROVIDER_JSON | unset | Raw JSON object for dataset.provider (advanced; overrides provider kind + args) |
| INVARLOCK_HF_DATASET_NAME | allenai/c4 | HF dataset name when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_CONFIG_NAME | en (for allenai/c4) | HF dataset config when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_TEXT_FIELD | text | Text field when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_MAX_SAMPLES | 2000 | Max rows consumed when INVARLOCK_DATASET=hf_text |
| INVARLOCK_HF_TRUST_REMOTE_CODE | unset | Pass trust_remote_code to HF load_dataset (not needed for allenai/c4 Parquet) |
| INVARLOCK_HF_CACHE_DIR | unset | datasets cache_dir override when INVARLOCK_DATASET=hf_text |
| INVARLOCK_LOCAL_JSONL_FILE | unset | JSONL file path when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_PATH | unset | JSONL file/dir path when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_DATA_FILES | unset | JSONL glob/list when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_TEXT_FIELD | text | Text field when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_LOCAL_JSONL_MAX_SAMPLES | 2000 | Max rows consumed when INVARLOCK_DATASET=local_jsonl |
| INVARLOCK_TIER | balanced | Guard tier preset |
| INVARLOCK_PREVIEW_WINDOWS | 32 | Preview windows |
| INVARLOCK_FINAL_WINDOWS | 32 | Final windows |
| INVARLOCK_SEQ_LEN | 512 | Sequence length |
| INVARLOCK_STRIDE | 256 | Stride |
| INVARLOCK_EVAL_BATCH | 32 | InvarLock batch size |
| PACK_GUARDS_ORDER | invariants,spectral,rmt,variance,invariants | Guards included in preset derivation and generated presets |

Primary metric acceptance/drift gates should be configured via profile/config (primary_metric.acceptance_range, primary_metric.drift_band), not env vars.
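The only key paths named in this guide are primary_metric.acceptance_range and primary_metric.drift_band; the fragment below is a hypothetical shape for such a profile/config override, with illustrative nesting and values that are assumptions, not the documented schema:

```shell
# Hypothetical profile fragment; only the two key paths named above are
# documented here -- nesting and values are illustrative assumptions.
cat > /tmp/primary_metric_override.yaml <<'EOF'
primary_metric:
  acceptance_range: [0.0, 1.05]
  drift_band: 0.02
EOF
```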

Tuned edit presets

| Variable | Default | Description |
| --- | --- | --- |
| PACK_TUNED_EDIT_PARAMS_FILE | unset | JSON file with tuned clean edit params (required when CLEAN_EDIT_RUNS>0) |

Preset derivation reuse

| Variable | Default | Description |
| --- | --- | --- |
| PACK_CALIBRATION_PRESET_DIR | unset | Directory containing calibrated_preset_<model>.yaml/json to reuse; skips preset-derivation runs |
| PACK_CALIBRATION_PRESET_FILE | unset | Single preset file applied to all models (advanced) |

Experiment controls

| Variable | Default | Description |
| --- | --- | --- |
| DRIFT_CALIBRATION_RUNS | 5 | Preset-derivation run count |
| CLEAN_EDIT_RUNS | 3 | Clean edit evaluate runs |
| STRESS_EDIT_RUNS | 2 | Stress edit evaluate runs |
| RUN_ERROR_INJECTION | true | Enable error injection |

Storage and memory planning

| Variable | Default | Description |
| --- | --- | --- |
| PACK_BASELINE_STORAGE_MODE | snapshot_symlink | Baseline storage mode (snapshot_symlink, snapshot_copy, or save_pretrained) |
| MIN_FREE_DISK_GB | 200 | Disk pressure threshold |
| PACK_SKIP_DISK_PREFLIGHT | 0 | Skip storage preflight |
| CUDA_MEMORY_FRACTION | 0.92 | Target GPU memory fraction |
| MODEL_LOAD_OVERHEAD_GB | 4 | Load overhead for planning |
| EDIT_OVERHEAD_GB | 8 | Per-edit overhead for planning |
| BATCH_EDIT_OVERHEAD_GB | 8 | Batch edit overhead |
| INVARLOCK_OVERHEAD_GB | 6 | InvarLock overhead |

Worker + reliability controls

| Variable | Default | Description |
| --- | --- | --- |
| WORKER_HEARTBEAT_INTERVAL | 30 | Heartbeat interval (seconds) |
| WORKER_IDLE_SLEEP | 5 | Sleep when idle (seconds) |
| WORKER_MAX_FAILURES | 10 | Stop worker after N failures |
| WORKER_TIMEOUT | 2700 | Worker heartbeat timeout (seconds) |
| CANCEL_BLOCKED_TASKS_GRACE_SECONDS | 90 | Fail blocked tasks after grace |
| TASK_TIMEOUT_DEFAULT | 21600 | Default task timeout (seconds) |
| TASK_TIMEOUT_<TASKTYPE> | unset | Per-task timeout override |

Packaging and verification

| Variable | Default | Description |
| --- | --- | --- |
| PACK_DIR | OUTPUT_DIR/evidence_pack | Evidence pack output dir |
| PACK_SIGN_MANIFEST | 1 | Sign manifest.json with a package-native Ed25519 key (auto-generated if PACK_SIGNING_KEY is unset) |
| PACK_SIGNING_KEY | unset | Optional Ed25519 private key PEM for deterministic signer identity |
| PACK_SKIP_HTML | 0 | Skip HTML rendering |
| PACK_VERIFY_PROFILE | dev | Profile for invarlock verify |

Troubleshooting

Missing model revisions (offline)

If offline runs fail with “requires model revisions”, run a preflight:

./scripts/evidence_packs/run_suite.sh --suite subset --net 1

Or point to an existing revisions file with PACK_MODEL_REVISIONS_FILE.

OOM on large models

  • Lower GPU_MEMORY_PER_DEVICE so the planner requests more GPUs.
  • Disable batch edits: PACK_USE_BATCH_EDITS=false.
  • Reduce InvarLock batch/seq_len (e.g., INVARLOCK_EVAL_BATCH=16 INVARLOCK_SEQ_LEN=256).
  • Increase memory overhead knobs (MODEL_LOAD_OVERHEAD_GB, EDIT_OVERHEAD_GB).

Disk pressure / preflight failures

Check state/disk_pressure.json and ensure the output filesystem has headroom. Use MIN_FREE_DISK_GB=0 or PACK_SKIP_DISK_PREFLIGHT=1 only if you accept the risk of partial artifacts.

Task timeouts

Increase the default or per-task timeout:

TASK_TIMEOUT_DEFAULT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset
TASK_TIMEOUT_CREATE_EDIT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset

Stuck queues or dead workers

  • Inspect state/progress.json and workers/gpu_<id>.status.
  • Check worker logs: logs/gpu_<id>.log and logs/tasks/<task_id>.log.
  • Re-run with --resume to recover from a crash.