Dataset Providers

Overview

| Aspect | Details |
| --- | --- |
| Purpose | Deterministic dataset providers for preview/final evaluation windows. |
| Audience | CLI users configuring dataset blocks and Python callers building evaluation windows. |
| Supported providers | wikitext2, synthetic, hf_text, local_jsonl, hf_seq2seq, local_jsonl_pairs, seq2seq. |
| Requires | invarlock[eval] or invarlock[hf] for Hugging Face datasets providers. |
| Network | Offline by default; HF-backed providers need INVARLOCK_ALLOW_NETWORK=1 for first download. |
| Inputs | Dataset provider name plus provider-specific fields. |
| Outputs / Artifacts | Evaluation windows stored in report.evaluation_windows and dataset metadata in report.data.*. |
| Source of truth | src/invarlock/eval/data.py and src/invarlock/eval/providers/*. |

Quick Start

dataset:
  provider: wikitext2
  split: validation
  seq_len: 512
  stride: 512
  preview_n: 64
  final_n: 64
  seed: 42

For Compare & evaluate, reuse the same dataset block in baseline and subject runs.

Concepts

  • Preview vs final windows: the runner computes the primary metric on two deterministic splits; counts are recorded in run reports and evaluation reports.
  • Pairing: invarlock evaluate requires baseline window evidence to pair windows. Missing/invalid evidence fails closed in CI/Release profiles.
  • Offline-first: downloads are opt-in via INVARLOCK_ALLOW_NETWORK=1. Cached datasets can be enforced via HF_DATASETS_OFFLINE=1.
  • Dedupe & capacity: INVARLOCK_DEDUP_TEXTS=1 removes exact duplicates; INVARLOCK_CAPACITY_FAST=1 speeds up capacity checks for quick runs.
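The preview/final idea above can be illustrated with a small sketch. This is not InvarLock's implementation: the helper names `make_windows` and `split_preview_final` are hypothetical, and the seeded shuffle is only one plausible way to obtain two deterministic, disjoint sets of windows from `seq_len`, `stride`, `preview_n`, `final_n`, and `seed`.

```python
import random


def make_windows(token_ids, seq_len, stride):
    """Slice a flat token list into fixed-length windows, dropping the ragged tail."""
    return [
        token_ids[start:start + seq_len]
        for start in range(0, len(token_ids) - seq_len + 1, stride)
    ]


def split_preview_final(windows, preview_n, final_n, seed):
    """Pick disjoint preview and final windows in a seed-determined order (hypothetical helper)."""
    if preview_n + final_n > len(windows):
        raise ValueError("not enough windows for the requested preview_n + final_n")
    order = list(range(len(windows)))
    random.Random(seed).shuffle(order)  # same seed, same order, same split
    preview_idx = sorted(order[:preview_n])
    final_idx = sorted(order[preview_n:preview_n + final_n])
    return [windows[i] for i in preview_idx], [windows[i] for i in final_idx]


tokens = list(range(40))  # stand-in for real token ids
windows = make_windows(tokens, seq_len=8, stride=8)
preview, final = split_preview_final(windows, preview_n=2, final_n=2, seed=42)
```

With `stride` equal to `seq_len`, as in the Quick Start block, the windows do not share tokens, so the two sets do not overlap.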

Pairing invariants (E001)

| Invariant | Required value |
| --- | --- |
| window_pairing_reason | Must be empty / None. |
| paired_windows | Must be > 0. |
| window_match_fraction | Must be 1.0. |
| window_overlap_fraction | Must be 0.0. |

Count mismatches are caught using coverage.preview.used, coverage.final.used, and paired_windows from dataset.windows.stats.
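As a rough illustration, the four invariants in the table can be checked against a stats dictionary as follows. The function name `pairing_violations` is hypothetical, and the sketch assumes the stats are available as a flat dict with the key names listed above. It does not reproduce the count checks, whose exact comparison rules are not described here.

```python
def pairing_violations(stats):
    """Return human-readable violations; an empty list means all four invariants hold."""
    problems = []
    if stats.get("window_pairing_reason"):
        problems.append("window_pairing_reason must be empty / None")
    if not stats.get("paired_windows", 0) > 0:
        problems.append("paired_windows must be > 0")
    if stats.get("window_match_fraction") != 1.0:
        problems.append("window_match_fraction must be 1.0")
    if stats.get("window_overlap_fraction") != 0.0:
        problems.append("window_overlap_fraction must be 0.0")
    return problems


good = {
    "window_pairing_reason": None,
    "paired_windows": 64,
    "window_match_fraction": 1.0,
    "window_overlap_fraction": 0.0,
}
bad = dict(good, paired_windows=0, window_overlap_fraction=0.25)
```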

Reference

Provider matrix

| Provider | Kind | Network | Required keys | Notes |
| --- | --- | --- | --- | --- |
| wikitext2 | text | Cache/Net | provider, seq_len, stride, preview_n, final_n | Deterministic n-gram stratification; requires datasets. |
| synthetic | text | Offline | provider, seq_len, preview_n, final_n | Generated text; good for smoke tests. |
| hf_text | text | Cache/Net | dataset_name, text_field | Generic HF dataset loader; uses first N rows. |
| local_jsonl | text | Offline | file/path/data_files, text_field | Reads JSONL from disk; default text_field: text. |
| hf_seq2seq | seq2seq | Cache/Net | dataset_name, src_field, tgt_field | Provides encoder ids + decoder labels. |
| local_jsonl_pairs | seq2seq | Offline | file/path/data_files, src_field, tgt_field | Paired JSONL for seq2seq. |
| seq2seq | seq2seq | Offline | optional n, src_len, tgt_len | Synthetic seq2seq generator. |
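A dataset block can be checked against the "Required keys" column before a run. The sketch below simply transcribes that column into a lookup table. `REQUIRED_KEYS` and `missing_keys` are hypothetical names, not part of InvarLock, and treating file/path/data_files as "any one of these" is an assumption.

```python
# Each tuple lists interchangeable keys; at least one of them must be present.
REQUIRED_KEYS = {
    "wikitext2": [("provider",), ("seq_len",), ("stride",), ("preview_n",), ("final_n",)],
    "synthetic": [("provider",), ("seq_len",), ("preview_n",), ("final_n",)],
    "hf_text": [("dataset_name",), ("text_field",)],
    "local_jsonl": [("file", "path", "data_files"), ("text_field",)],
    "hf_seq2seq": [("dataset_name",), ("src_field",), ("tgt_field",)],
    "local_jsonl_pairs": [("file", "path", "data_files"), ("src_field",), ("tgt_field",)],
    "seq2seq": [],  # n, src_len, tgt_len are optional
}


def missing_keys(dataset_cfg):
    """Return the required keys (or key groups) absent from a dataset block."""
    provider = dataset_cfg.get("provider")
    if provider not in REQUIRED_KEYS:
        raise ValueError(f"unknown provider: {provider!r}")
    return [
        "/".join(alternatives)
        for alternatives in REQUIRED_KEYS[provider]
        if not any(key in dataset_cfg for key in alternatives)
    ]


quick_start = {
    "provider": "wikitext2", "split": "validation", "seq_len": 512,
    "stride": 512, "preview_n": 64, "final_n": 64, "seed": 42,
}
```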

Provider field map

| Provider | Required keys | Evidence fields (run report / evaluation report) |
| --- | --- | --- |
| wikitext2 | provider, seq_len, stride, preview_n, final_n | report.data.* + report.dataset.windows.stats |
| synthetic | provider, seq_len, preview_n, final_n | report.data.* + report.dataset.windows.stats |
| hf_text | dataset_name, text_field | report.data.* + report.dataset.windows.stats |
| local_jsonl | file/path/data_files, text_field | report.data.* + report.dataset.windows.stats |
| hf_seq2seq | dataset_name, src_field, tgt_field | report.data.* + report.dataset.windows.stats |
| local_jsonl_pairs | file/path/data_files, src_field, tgt_field | report.data.* + report.dataset.windows.stats |
| seq2seq | optional n, src_len, tgt_len | report.data.* + report.dataset.windows.stats |

Provider-specific config fields (dataset name, paths, fields) are recorded under report.data when available.

Pairing evidence matrix

| Config keys | Run report fields | Evaluation report fields | Verify gate |
| --- | --- | --- | --- |
| dataset.provider, seq_len, stride, split | report.data.{dataset,seq_len,stride,split} | report.dataset.{provider,seq_len,windows} | Schema + pairing context. |
| dataset.preview_n/final_n | report.data.{preview_n,final_n}, report.evaluation_windows | report.dataset.windows.{preview,final} | Pairing + count checks. |
| Pairing stats (derived) | report.dataset.windows.stats | report.dataset.windows.stats | _validate_pairing + _validate_counts. |
| Provider digest | report.provenance.provider_digest | report.provenance.provider_digest | Required in CI/Release. |

HF text provider example

dataset:
  provider: hf_text
  dataset_name: wikitext
  config_name: wikitext-2-raw-v1
  text_field: text
  split: validation
  preview_n: 64
  final_n: 64

Local JSONL provider example

dataset:
  provider: local_jsonl
  path: /data/my_corpus
  text_field: text
  preview_n: 64
  final_n: 64
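
To show what a text_field lookup over JSONL amounts to, here is a minimal sketch that parses JSONL lines and keeps the non-empty string values of one field. `parse_jsonl_texts` is a hypothetical helper, not the provider's actual code. Skipping blank lines and records without the field is an assumption about reasonable behavior.

```python
import json


def parse_jsonl_texts(lines, text_field="text"):
    """Collect non-empty string values of `text_field` from an iterable of JSONL lines."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        value = record.get(text_field)
        if isinstance(value, str) and value:
            texts.append(value)
    return texts


sample_lines = [
    '{"text": "first document"}',
    "",
    '{"title": "no text field here"}',
    '{"text": "second document"}',
]
texts = parse_jsonl_texts(sample_lines)

# Reading from disk works the same way, since an open file iterates over lines:
# with open("corpus.jsonl", encoding="utf-8") as handle:
#     texts = parse_jsonl_texts(handle, text_field="text")
```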

Seq2seq provider example (HF)

dataset:
  provider: hf_seq2seq
  dataset_name: wmt14
  src_field: translation.en
  tgt_field: translation.de
  preview_n: 32
  final_n: 32
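
The dotted values translation.en and translation.de suggest a nested lookup inside each dataset row. The sketch below shows one way such a path could be resolved. It is an assumption about the field syntax, and `resolve_field` is a hypothetical helper rather than the provider's real resolver.

```python
def resolve_field(row, dotted_path):
    """Walk a nested dict along a dot-separated path; return None if any step is missing."""
    value = row
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Illustrative row shaped like a translation-pair record.
row = {"translation": {"en": "Hello", "de": "Hallo"}}
src = resolve_field(row, "translation.en")
tgt = resolve_field(row, "translation.de")
```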

Environment variables

  • INVARLOCK_ALLOW_NETWORK=1 — allow dataset downloads.
  • HF_DATASETS_OFFLINE=1 — force cached-only datasets.
  • INVARLOCK_DEDUP_TEXTS=1 — exact-text dedupe before tokenization.
  • INVARLOCK_CAPACITY_FAST=1 — approximate capacity estimation for quick runs.
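For Python callers, the flags above can be read with ordinary environment lookups. The sketch below is illustrative only: `flag_enabled` and `dedupe_exact` are hypothetical helpers, and treating exactly "1" as enabled is an assumption based on how the variables are written in this section.

```python
import os


def flag_enabled(name, environ=None):
    """Return True when the variable is set to exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get(name, "") == "1"


def dedupe_exact(texts):
    """Drop exact duplicate strings while keeping first-seen order."""
    return list(dict.fromkeys(texts))


env = {"INVARLOCK_ALLOW_NETWORK": "1"}
allow_network = flag_enabled("INVARLOCK_ALLOW_NETWORK", env)
dedup = flag_enabled("INVARLOCK_DEDUP_TEXTS", env)
unique = dedupe_exact(["a", "b", "a", "c", "b"])
```

Passing an explicit mapping instead of os.environ keeps the helper easy to test without mutating the process environment.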

Troubleshooting

  • DEPENDENCY-MISSING: datasets: install invarlock[eval] or invarlock[hf].
  • NO-SAMPLES / NO-PAIRS errors: verify dataset fields and split names.
  • Pairing failures (E001): ensure baseline report.json contains evaluation_windows and was produced with matching dataset settings.
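When debugging a pairing failure, a quick pre-check of the baseline report can save a run. The sketch below assumes, without confirmation from the source, that evaluation_windows is a mapping with non-empty "preview" and "final" entries. `baseline_has_windows` is a hypothetical helper.

```python
def baseline_has_windows(report):
    """Return True if the report dict carries non-empty preview and final window evidence."""
    windows = report.get("evaluation_windows")
    if not isinstance(windows, dict):
        return False
    # Assumed layout: one entry per split, each non-empty.
    return all(windows.get(name) for name in ("preview", "final"))


complete = {"evaluation_windows": {"preview": {"window_ids": [0, 1]}, "final": {"window_ids": [2, 3]}}}
incomplete = {"evaluation_windows": {"preview": {"window_ids": [0, 1]}}}

# Typical use against a file on disk:
# import json
# with open("report.json", encoding="utf-8") as handle:
#     ok = baseline_has_windows(json.load(handle))
```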

Observability

  • report.data.* stores provider name, split, and window counts.
  • report.evaluation_windows stores preview/final token windows.
  • Reports preserve dataset metadata and window pairing stats under dataset.*.