Dataset Providers

Overview

| Aspect | Details |
| --- | --- |
| Purpose | Deterministic dataset providers for preview/final evaluation windows. |
| Audience | CLI users configuring dataset blocks and Python callers building evaluation windows. |
| Supported providers | wikitext2, synthetic, hf_text, local_jsonl, vision_text, hf_seq2seq, local_jsonl_pairs, seq2seq. |
| Requires | invarlock[eval] or invarlock[hf] for providers backed by Hugging Face datasets. |
| Network | Offline by default; CLI runs use evaluate --allow-network for first download, while programmatic callers can set INVARLOCK_ALLOW_NETWORK=1. |
| Inputs | Dataset provider name plus provider-specific fields. |
| Outputs / Artifacts | Evaluation windows stored in report.evaluation_windows and dataset metadata in report.data.*. vision_text persists example records instead of token windows. |
| Source of truth | src/invarlock/eval/data.py, src/invarlock/eval/data_support.py, src/invarlock/eval/data_tokenization.py, and src/invarlock/eval/data_providers.py. |

Quick Start

dataset:
  provider: wikitext2
  split: validation
  seq_len: 512
  stride: 512
  preview_n: 64
  final_n: 64
  seed: 42

For Compare & evaluate workflows, reuse the same dataset block in the baseline and subject runs so that their windows can be paired.
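To make the roles of seq_len and stride in the block above concrete, here is a minimal sketch of fixed-length windowing over a token stream. It is illustrative only: make_windows is a hypothetical helper, not InvarLock's implementation, and the real providers layer stratification and sampling on top of whatever windowing they use.

```python
from typing import List, Sequence


def make_windows(token_ids: Sequence[int], seq_len: int, stride: int) -> List[List[int]]:
    """Cut a token stream into full-length windows, advancing by `stride` tokens.

    Hypothetical helper for illustration; trailing tokens that do not fill a
    whole window are dropped.
    """
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    windows = []
    for start in range(0, len(token_ids) - seq_len + 1, stride):
        windows.append(list(token_ids[start:start + seq_len]))
    return windows


tokens = list(range(10))
# stride == seq_len: consecutive windows share no tokens.
non_overlapping = make_windows(tokens, seq_len=4, stride=4)
# stride < seq_len: consecutive windows overlap by seq_len - stride tokens.
overlapping = make_windows(tokens, seq_len=4, stride=2)
```

With stride equal to seq_len, as in the Quick Start block, consecutive windows in this sketch share no tokens.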

Concepts

  • Preview vs final windows: the runner computes the primary metric on two deterministic splits; counts are recorded in run reports and evaluation reports.
  • Pairing: invarlock evaluate requires baseline window evidence to pair windows. Missing/invalid evidence fails closed in CI/Release profiles.
  • Offline-first: downloads are opt-in. CLI runs use evaluate --allow-network; programmatic callers can set INVARLOCK_ALLOW_NETWORK=1. Cached datasets can be enforced via HF_DATASETS_OFFLINE=1.
  • Vision-text manifests: vision_text is local-files-only and expects JSONL records with id, image_path, prompt, and either answer or answers. Records are single-image examples; provider batching can still group multiple records when callers request batch_size > 1.
  • Public image-text datasets: public Hugging Face datasets can be used by materializing them first into a local vision_text manifest. The model evidence workflow uses scripts/model_evidence/materialize_vision_text_dataset.py for this pattern, so evaluation remains offline/hashable after the download step.
  • Image-text primary metric: vision_text uses answer accuracy as the primary metric because the evidence claim is whether the generated answer matches the image question. Token log loss is still recorded as supporting telemetry, but perplexity is not the public VQA gate.
  • Tokenizer contract: dataset providers expect either a callable tokenizer that returns input_ids plus optional attention_mask, or an encode(...) method that accepts truncation=True, max_length=..., and padding="max_length".
  • Default runtime-container execution: dataset-backed model-loading commands run in the runtime container by default; public host-side execution uses invarlock evaluate --execution-mode host.
  • Dedupe & capacity: INVARLOCK_DEDUP_TEXTS=1 removes exact duplicates; INVARLOCK_CAPACITY_FAST=1 speeds up capacity checks for quick runs.
  • HF cache fallback: if a local rerun hits a Hugging Face datasets shared-cache lock/permission error, InvarLock retries with its own writable datasets cache. Set INVARLOCK_HF_DATASETS_CACHE to choose that fallback location explicitly.
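The tokenizer contract described above can be illustrated with a toy object that satisfies both accepted shapes. This is a sketch under assumptions: ToyTokenizer is hypothetical, the whitespace vocabulary is a stand-in for a real tokenizer, and passing the same keyword arguments to the callable form as to encode(...) is an assumption rather than documented behavior.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: id 0 is reserved for padding in this toy example


class ToyTokenizer:
    """Hypothetical whitespace tokenizer exposing encode(...) and __call__."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def _ids(self, text: str) -> List[int]:
        # Assign ids on first sight, starting at 1 so PAD_ID stays unused.
        return [self.vocab.setdefault(tok, len(self.vocab) + 1) for tok in text.split()]

    def encode(self, text: str, truncation: bool = True, max_length: int = 8,
               padding: str = "max_length") -> List[int]:
        ids = self._ids(text)
        if truncation:
            ids = ids[:max_length]
        if padding == "max_length":
            # Right-pad up to max_length; a no-op when ids is already long enough.
            ids = ids + [PAD_ID] * (max_length - len(ids))
        return ids

    def __call__(self, text: str, **kwargs) -> Dict[str, List[int]]:
        ids = self.encode(text, **kwargs)
        mask = [0 if i == PAD_ID else 1 for i in ids]
        return {"input_ids": ids, "attention_mask": mask}


tokenizer = ToyTokenizer()
batch = tokenizer("the cat sat", max_length=5)
```

In practice a Hugging Face tokenizer would normally fill this role; the toy class only shows which return keys and keyword arguments the contract names.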

Pairing invariants (E001)

| Invariant | Requirement (any other value fails) |
| --- | --- |
| window_pairing_reason | Must be empty / None. |
| paired_windows | Must be > 0. |
| window_match_fraction | Must be 1.0. |
| window_overlap_fraction | Must be 0.0. |

Count mismatches are checked against coverage.preview.used, coverage.final.used, and paired_windows in dataset.windows.stats.
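The four pairing invariants can be expressed as a small pre-flight check. This is a sketch, not InvarLock's validator: pairing_violations is a hypothetical helper, and it assumes the four fields are available together in one stats mapping.

```python
from typing import Any, Dict, List


def pairing_violations(stats: Dict[str, Any]) -> List[str]:
    """Return a human-readable list of violated pairing invariants (hypothetical helper)."""
    problems = []
    if stats.get("window_pairing_reason"):
        problems.append(f"window_pairing_reason is set: {stats['window_pairing_reason']!r}")
    if int(stats.get("paired_windows") or 0) <= 0:
        problems.append("paired_windows must be > 0")
    if stats.get("window_match_fraction") != 1.0:
        problems.append("window_match_fraction must be 1.0")
    if stats.get("window_overlap_fraction") != 0.0:
        problems.append("window_overlap_fraction must be 0.0")
    return problems


# Assumed shape of the stats mapping, for illustration only.
stats = {
    "window_pairing_reason": None,
    "paired_windows": 64,
    "window_match_fraction": 1.0,
    "window_overlap_fraction": 0.0,
}
violations = pairing_violations(stats)
```

An empty list means all four invariants hold; any entry corresponds to a condition that would surface as E001.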

Reference

Provider matrix

| Provider | Kind | Network | Required keys | Notes |
| --- | --- | --- | --- | --- |
| wikitext2 | text | Cache/Net | provider, seq_len, stride, preview_n, final_n | Deterministic n-gram stratification; requires datasets. |
| synthetic | text | Offline | provider, seq_len, preview_n, final_n | Generated text; good for smoke tests. |
| hf_text | text | Cache/Net | dataset_name, text_field | Generic HF dataset loader; uses the first N rows. |
| local_jsonl | text | Offline | file/path/data_files, text_field | Reads JSONL from disk; default text_field: text. |
| vision_text | image-text | Offline | file/path/data_files | Local JSONL manifest of single-image VQA-style examples; stride is ignored. |
| hf_seq2seq | seq2seq | Cache/Net | dataset_name, src_field, tgt_field | Provides encoder ids + decoder labels; supports pinned dataset revision and source/target prefixes. |
| local_jsonl_pairs | seq2seq | Offline | file/path/data_files, src_field, tgt_field | Paired JSONL for seq2seq. |
| seq2seq | seq2seq | Offline | optional n, src_len, tgt_len | Synthetic seq2seq generator. |

Provider field map

| Provider | Required keys | Evidence fields (run report / evaluation report) |
| --- | --- | --- |
| wikitext2 | provider, seq_len, stride, preview_n, final_n | report.data.* + report.dataset.windows.stats |
| synthetic | provider, seq_len, preview_n, final_n | report.data.* + report.dataset.windows.stats |
| hf_text | dataset_name, text_field | report.data.* + report.dataset.windows.stats |
| local_jsonl | file/path/data_files, text_field | report.data.* + report.dataset.windows.stats |
| vision_text | file/path/data_files | report.data.* + report.evaluation_windows.{preview,final}.records |
| hf_seq2seq | dataset_name, src_field, tgt_field | report.data.* + report.dataset.windows.stats |
| local_jsonl_pairs | file/path/data_files, src_field, tgt_field | report.data.* + report.dataset.windows.stats |
| seq2seq | optional n, src_len, tgt_len | report.data.* + report.dataset.windows.stats |

Provider-specific config fields (dataset name, paths, fields) are recorded under report.data when available.

Pairing evidence matrix

| Config keys | Run report fields | Evaluation report fields | Verify gate |
| --- | --- | --- | --- |
| dataset.provider, seq_len, stride, split | report.data.{dataset,seq_len,stride,split} | report.dataset.{provider,seq_len,windows} | Schema + pairing context. |
| dataset.preview_n/final_n | report.data.{preview_n,final_n}, report.evaluation_windows | report.dataset.windows.{preview,final} | Pairing + count checks. |
| Pairing stats (derived) | report.dataset.windows.stats | report.dataset.windows.stats | _validate_pairing + _validate_counts. |
| Provider digest | report.provenance.provider_digest | report.provenance.provider_digest | Required in CI/Release. |

HF text provider example

dataset:
  provider: hf_text
  dataset_name: Salesforce/wikitext
  config_name: wikitext-2-raw-v1
  text_field: text
  split: validation
  preview_n: 64
  final_n: 64

Local JSONL provider example

dataset:
  provider: local_jsonl
  path: /data/my_corpus
  text_field: text
  preview_n: 64
  final_n: 64
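A corpus for the local_jsonl block above is one JSON object per line, with the text stored under the configured text_field. The sketch below writes such a file; write_jsonl and the temporary output location are hypothetical and only illustrate the file shape.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl(path: Path, texts: Iterable[str], text_field: str = "text") -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for text in texts:
            fh.write(json.dumps({text_field: text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Temporary directory used for the example; point this at your own corpus path.
corpus_path = Path(tempfile.mkdtemp()) / "corpus.jsonl"
written = write_jsonl(corpus_path, ["first document", "second document"])
```

The field name passed as text_field here should match the text_field value in the dataset block.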

Vision-text provider example

dataset:
  provider:
    kind: vision_text
    path: tests/fixtures/vision_text/demo_manifest.jsonl
  split: validation
  seq_len: 256
  preview_n: 1
  final_n: 1
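Before pointing vision_text at a manifest, it can help to check each record against the documented fields (id, image_path, prompt, and either answer or answers) and to confirm that image paths resolve relative to the JSONL file. The checker below is a sketch: check_manifest is a hypothetical helper, not part of InvarLock, and the demo manifest is generated on the fly.

```python
import json
import tempfile
from pathlib import Path
from typing import List

REQUIRED_FIELDS = ("id", "image_path", "prompt")


def check_manifest(manifest: Path) -> List[str]:
    """Return a list of problems found in a vision_text-style JSONL manifest."""
    problems = []
    lines = manifest.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            problems.append(f"line {line_no}: missing fields {missing}")
        if "answer" not in record and "answers" not in record:
            problems.append(f"line {line_no}: needs 'answer' or 'answers'")
        image_path = record.get("image_path")
        # Image paths are resolved relative to the manifest's directory.
        if image_path and not (manifest.parent / image_path).is_file():
            problems.append(f"line {line_no}: image not found: {image_path}")
    return problems


# Build a tiny demo manifest: record 1 is valid, record 2 has two problems.
root = Path(tempfile.mkdtemp())
(root / "images").mkdir()
(root / "images" / "a.png").write_bytes(b"placeholder bytes, not a real image")
records = [
    {"id": "1", "image_path": "images/a.png", "prompt": "What color?", "answer": "red"},
    {"id": "2", "image_path": "images/missing.png", "prompt": "How many?"},
]
manifest_path = root / "manifest.jsonl"
manifest_path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
problems = check_manifest(manifest_path)
```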

Public VQA materialization example

python scripts/model_evidence/materialize_vision_text_dataset.py \
  --dataset Multimodal-Fatima/VQAv2_sample_validation \
  --split validation \
  --revision 99487d2651df3799002b2fb3e455741744514a02 \
  --output-dir artifacts/model-evidence/public_datasets/vqav2_sample_validation_800 \
  --max-samples 800 \
  --image-field image \
  --prompt-field question \
  --answer-field multiple_choice_answer \
  --answers-field answers \
  --id-field question_id \
  --prompt-template '{question}
Return exactly one JSON object like {{"answer":"short phrase"}}. Use a short phrase only. Do not explain.' \
  --overwrite

The generated manifest.jsonl, images/, and materialization_summary.json are then consumed by vision_text. For evidence promotion, pin the dataset revision and keep the materialization summary with the run artifacts. Public VQA evidence prompts should prefer a structured answer field such as {"answer":"..."}; the evaluator extracts that field before exact-answer scoring and falls back to the raw generation when no JSON answer is present.
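The extract-then-fall-back behavior described above can be sketched as follows. This is not the evaluator's code: extract_answer and exact_match are hypothetical helpers, the regular expression only handles flat (non-nested) JSON objects, and the case-insensitive comparison is an assumption about normalization.

```python
import json
import re
from typing import Iterable


def extract_answer(generation: str) -> str:
    """Return the "answer" field of the first flat JSON object found, else the raw text."""
    for match in re.finditer(r"\{.*?\}", generation, flags=re.DOTALL):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not valid JSON; try the next candidate
        if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
            return obj["answer"].strip()
    # Fallback: no structured answer present, score the raw generation.
    return generation.strip()


def exact_match(generation: str, references: Iterable[str]) -> bool:
    """Compare the extracted answer against reference answers (case-insensitive by assumption)."""
    predicted = extract_answer(generation).lower()
    return any(predicted == ref.strip().lower() for ref in references)


structured = extract_answer('Sure. {"answer": "red bus"}')
fallback = extract_answer("  red bus ")
```

Prompting for a structured field keeps scoring robust when a model wraps its answer in extra text, since only the extracted phrase is compared.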

Seq2seq provider example (HF)

dataset:
  provider:
    kind: hf_seq2seq
    dataset_name: abisee/cnn_dailymail
    config_name: 3.0.0
    revision: 96df5e686bee6baa90b8bee7c28b81fa3fa6223d
    src_field: article
    tgt_field: highlights
    src_prefix: "summarize: "
    max_samples: 1024
  split: validation
  seq_len: 256
  preview_n: 32
  final_n: 32

The FLAN-T5 public seq2seq basis uses this provider shape with google/flan-t5-base pinned to model revision 7bcac572ce56db69c1ea7c8af255c5d7c9672fc2.

Environment variables

  • INVARLOCK_ALLOW_NETWORK=1 — allow dataset downloads.
  • HF_DATASETS_OFFLINE=1 — force cached-only datasets.
  • INVARLOCK_DEDUP_TEXTS=1 — exact-text dedupe before tokenization.
  • INVARLOCK_CAPACITY_FAST=1 — approximate capacity estimation for quick runs.
  • INVARLOCK_HF_DATASETS_CACHE=/path/to/cache — override the writable fallback cache used after shared-cache lock/permission failures.
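For programmatic callers, these variables can be set from Python before the first dataset load. The helper below is a sketch: configure_dataset_env is hypothetical, and it assumes the variables are read when a provider is invoked rather than cached earlier, so setting them as early as possible is the safer choice.

```python
import os
from typing import MutableMapping, Optional


def configure_dataset_env(
    allow_network: bool,
    cache_dir: Optional[str] = None,
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set the network and cache variables on `env` (defaults to os.environ)."""
    target = os.environ if env is None else env
    if allow_network:
        target["INVARLOCK_ALLOW_NETWORK"] = "1"
        # Cached-only mode would contradict an explicit download opt-in.
        target.pop("HF_DATASETS_OFFLINE", None)
    else:
        target.pop("INVARLOCK_ALLOW_NETWORK", None)
        target["HF_DATASETS_OFFLINE"] = "1"
    if cache_dir:
        target["INVARLOCK_HF_DATASETS_CACHE"] = str(cache_dir)
    return target


# Demonstrated on a plain dict so the real process environment is untouched.
offline_env = configure_dataset_env(allow_network=False, env={})
online_env = configure_dataset_env(
    allow_network=True,
    cache_dir="/tmp/invarlock-datasets",  # hypothetical writable directory
    env={"HF_DATASETS_OFFLINE": "1"},
)
```

Calling the helper without env applies the same settings to os.environ for the current process.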

Troubleshooting

  • DEPENDENCY-MISSING: datasets: install invarlock[eval] or invarlock[hf].
  • NO-SAMPLES / NO-PAIRS errors: verify dataset fields and split names.
  • HF cache .lock / permission errors on local reruns: rerun as-is to use the automatic writable-cache fallback, or set INVARLOCK_HF_DATASETS_CACHE to a writable directory you control.
  • vision_text image file is missing: ensure manifest image_path values resolve relative to the JSONL file and point to readable local files.
  • Pairing failures (E001): ensure baseline report.json contains evaluation_windows and was produced with matching dataset settings.
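When diagnosing E001, a quick look at the baseline report can confirm whether window evidence is present at all. The sketch below assumes evaluation_windows is a top-level key in report.json with preview and final entries; missing_window_evidence is a hypothetical helper, and the exact report layout should be confirmed against a real report.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def missing_window_evidence(report: Dict[str, Any]) -> List[str]:
    """List the evaluation-window entries that are absent or empty in a report dict."""
    windows = report.get("evaluation_windows")
    if not isinstance(windows, dict):
        return ["evaluation_windows"]
    return [
        f"evaluation_windows.{name}"
        for name in ("preview", "final")
        if not windows.get(name)
    ]


def load_report(path: Path) -> Dict[str, Any]:
    """Read a report.json file from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


# Inline example dict; with a real run, use load_report(Path("report.json")).
example_report = {"evaluation_windows": {"preview": {"records": [1]}, "final": {}}}
gaps = missing_window_evidence(example_report)
```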

Observability

  • report.data.* stores provider name, split, and window counts.
  • report.evaluation_windows stores preview/final token windows (example records for vision_text).
  • Reports preserve dataset metadata and window pairing stats under dataset.*.