Everything you need for model change evidence

InvarLock provides baseline-vs-subject evaluation with paired windows, guard checks, evaluation reports, verification, and proof-pack-ready artifacts.

Why this is different from ordinary eval tooling

The difference is not just another score. InvarLock makes the evaluation output verifiable, portable, and usable in release review.

Verification step
  • Ordinary evals: scores and charts without a standard verification step.
  • InvarLock: an evaluation report plus a verify command for schema, paired-window math, and release checks.

Provenance
  • Ordinary evals: evidence is often scattered across CI logs, notebooks, and screenshots.
  • InvarLock: pairing data, policy digests, and report metadata travel with the artifact.

Review handoff
  • Ordinary evals: reviewers struggle to re-check pairing and release gates.
  • InvarLock: proof-pack-ready evidence moves between teams, environments, and release workflows.

Evaluate edited checkpoints

Establish a deterministic baseline-vs-subject comparison before you look at a release verdict.

Guard Pipeline

Multi-layered validation beyond perplexity

Every edited model runs through the guard pipeline before report output is finalized.

What this gives you

  • Structural invariant checks ensure model architecture integrity
  • Spectral analysis guards detect stability issues in weight matrices
  • Random Matrix Theory (RMT) identifies statistical outliers
  • Variance enforcement catches harmful distribution shifts

Guard output

# Guard pipeline layers
1. Invariants    -> Architecture checks
2. Spectral      -> Stability analysis
3. RMT           -> Outlier detection
4. Variance      -> Distribution guards
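
As a simplified illustration of the spectral layer, a guard can compare the largest singular value (the spectral norm) of each weight matrix between baseline and subject and flag large relative shifts. This is a minimal sketch under stated assumptions, not InvarLock's implementation: the `spectral_shift` helper and the flag threshold are illustrative.

```python
import numpy as np

def spectral_shift(w_base: np.ndarray, w_subj: np.ndarray) -> float:
    """Relative change in the largest singular value between two weight matrices."""
    s_base = np.linalg.norm(w_base, ord=2)  # ord=2 on a matrix = spectral norm
    s_subj = np.linalg.norm(w_subj, ord=2)
    return abs(s_subj - s_base) / s_base

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))          # baseline weight matrix
w_edited = w + 0.01 * rng.standard_normal((64, 64))  # small edit (e.g. quantization noise)

shift = spectral_shift(w, w_edited)
# A guard might flag the layer when shift exceeds some stability budget.
stable = shift < 0.05
```

A real guard would run this per layer and combine it with the other checks above; a large shift in one layer's spectrum is a cheap early warning that an edit changed more than intended.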

Paired Evaluation Windows

Deterministic, reproducible comparisons

InvarLock uses paired evaluation windows with deterministic pairing metadata to ensure statistically valid, reproducible results.

What this gives you

  • Token-weighted comparisons between baseline and subject
  • Deterministic pairing for reproducible runs
  • Calibrated evaluation windows for accuracy
  • Paired delta-log-loss metrics with confidence intervals

Pairing record

Baseline: gpt2 -> Subject: gpt2-q4
Windows: <paired windows>
Metric: delta-log-loss (paired)
CI: <confidence interval>
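
The paired statistic above can be sketched as follows: given per-window mean log-losses for baseline and subject over the same windows, the token-weighted delta-log-loss is the token-weighted mean of per-window differences. This is a sketch of the statistic itself, not InvarLock's code; the window losses and token counts are illustrative numbers.

```python
import numpy as np

# Per-window mean log-loss over the same paired windows (illustrative values).
base_loss = np.array([2.10, 2.35, 2.02, 2.48])
subj_loss = np.array([2.14, 2.37, 2.05, 2.51])
tokens    = np.array([512, 480, 512, 300])   # tokens per window, used as weights

delta = subj_loss - base_loss                       # per-window paired deltas
weighted_delta = np.average(delta, weights=tokens)  # token-weighted delta-log-loss
# weighted_delta ≈ 0.030
```

Because windows are paired, window-to-window difficulty cancels out of `delta`, which is what makes the comparison reproducible across runs that use the same pairing metadata.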

Understand the decision

Move from raw measurements to bounded claims, portable artifacts, and verifiable handoff evidence.

Statistical Boundaries

Quantify degradation without overstating the claim

Get confidence intervals on primary metrics plus explicit decision boundaries so reviewers know what the evidence supports and where it stops.

What this gives you

  • BCa bootstrap confidence intervals
  • Regression budget enforcement
  • Plain-language guarantees and non-guarantees
  • Primary metric ratio tracking
  • Policy digest included in evaluation report metadata

Decision summary

INVARLOCK v<version> · EVALUATE
Status: PASS · Gates: <passed>/<total> passed
Primary metric ratio: <ratio>
Output: reports/eval/evaluation.report.json
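
The decision boundary can be sketched as a bootstrap confidence interval on the paired deltas, compared against a regression budget. InvarLock reports BCa intervals; for brevity this sketch uses a plain percentile bootstrap, and the synthetic deltas and budget value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic per-window paired delta-log-loss values (stand-in for real data).
deltas = rng.normal(loc=0.01, scale=0.02, size=200)

# Percentile bootstrap CI on the mean delta (BCa is the stronger variant).
boot = np.array([
    rng.choice(deltas, size=deltas.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])

budget = 0.05  # illustrative regression budget
status = "PASS" if hi <= budget else "FAIL"  # gate on the CI's upper bound
```

Gating on the interval's upper bound rather than the point estimate is what keeps the claim bounded: the evidence supports "degradation is at most the budget," not "degradation is exactly the mean."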

Evaluation Reports + Verification

Validate artifacts before release

Use evaluation.report.json as a standard handoff artifact, run invarlock verify, and carry the same evidence model into proof-pack workflows.

What this gives you

  • Machine-readable report schema
  • Pairing and policy metadata included in output
  • invarlock verify checks in CI and release profiles
  • Proof-pack workflows bundle portable evidence for handoff
  • HTML rendering available with invarlock report html

Artifact handoff

# Validate an evaluation report
invarlock verify reports/eval/evaluation.report.json

# Render report for sharing
invarlock report html \
  -i reports/eval/evaluation.report.json \
  -o reports/eval/evaluation.html

Ship inside your environment

Carry the same evidence model into CI and private infrastructure without changing how teams review release risk.

CI/CD Integration

Gate deployments on robustness checks

Seamlessly integrate InvarLock into your deployment pipelines with stable exit codes and structured JSON outputs.

What this gives you

  • Exit code 0 for passing evaluations (safe to deploy)
  • Machine-readable evaluation.report.json outputs
  • CI/CD pipeline examples for GitHub Actions and Jenkins
  • Automated quality gates for release review

Pipeline gate

# In your CI pipeline
invarlock evaluate \
  --baseline base.pt \
  --subject quantized.pt

# Exit code 0 means the evaluation passed its gates
if [ $? -eq 0 ]; then
  echo "Safe to deploy"
  ./deploy.sh
fi

Offline-First Design

Your models stay private

Network access is disabled by default, so model weights never leave your infrastructure; downloads must be enabled explicitly, per command.

What this gives you

  • Network disabled by default
  • Explicit opt-in for downloads with INVARLOCK_ALLOW_NETWORK=1
  • On-premise deployment ready
  • Air-gapped environment support

Network policy

# Network disabled (default)
invarlock evaluate --baseline ./local.pt

# Explicit network enable
INVARLOCK_ALLOW_NETWORK=1 \
  invarlock evaluate --baseline gpt2

Use the CLI now, or join the pilot waitlist

The open-source CLI, docs, and GitHub repo are available now. Use the waitlist below if you want pilot announcements or product updates beyond the current open-source path.

Prefer the open-source path right away?