Observability

Overview

Aspect             Details
-----------------  -------
Purpose            Runtime monitoring, health checking, and telemetry collection for InvarLock operations.
Audience           Operators running production workloads and developers debugging performance issues.
Supported surface  MonitoringManager, HealthChecker, TelemetryCollector, MetricsRegistry.
Requires           psutil (included in base install); torch for GPU monitoring.
Network            Fully offline; no external telemetry is sent.
Source of truth    src/invarlock/observability/*.py

Quick Start

from invarlock.observability import MonitoringManager, MonitoringConfig

# Start monitoring with default config
config = MonitoringConfig(
    metrics_interval=10.0,
    health_check_interval=30.0,
)
monitor = MonitoringManager(config)
monitor.start()

# Record an operation
monitor.record_operation("evaluate", duration=45.2, model_id="gpt2")

# Get current status
status = monitor.get_status()
print(status["health_status"])

# Stop when done
monitor.stop()

Concepts

  • Metrics: counters, gauges, and histograms tracked during runs.
  • Health checks: periodic checks for system resources (CPU, memory, disk, GPU).
  • Telemetry: operation tracking with start/end times and metadata.
  • Alerting: configurable thresholds for warnings (not external notifications).

Component Architecture

Component           Responsibility
------------------  --------------
MonitoringManager   Central coordinator; starts/stops monitoring threads.
MetricsRegistry     Stores counters, gauges, and histograms by name.
HealthChecker       Runs periodic health checks on system components.
TelemetryCollector  Tracks operation lifecycles (start → end).
ResourceMonitor     Collects CPU, memory, disk, and GPU usage.
PerformanceMonitor  Tracks operation durations and percentiles.
AlertManager        Evaluates threshold rules and records violations.

Reference

MonitoringConfig

from invarlock.observability import MonitoringConfig

config = MonitoringConfig(
    # Collection intervals (seconds)
    metrics_interval=10.0,
    health_check_interval=30.0,
    resource_check_interval=5.0,

    # Data retention
    metrics_retention_hours=24,
    max_events=10000,

    # Alerting
    enable_alerting=True,
    alert_channels=[],

    # Export settings
    prometheus_enabled=False,
    prometheus_port=9090,
    json_export_enabled=True,
    json_export_path="./monitoring",

    # Resource thresholds (percent)
    cpu_threshold=80.0,
    memory_threshold=85.0,
    gpu_memory_threshold=90.0,

    # Performance monitoring
    latency_percentiles=[50, 90, 95, 99],
    slow_request_threshold=30.0,
)
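As an example, a hypothetical production-leaning configuration that exposes metrics for Prometheus scraping instead of JSON export, using only the fields listed above:

```python
from invarlock.observability import MonitoringConfig

# Hypothetical production profile: Prometheus scraping, week-long retention.
prod_config = MonitoringConfig(
    metrics_interval=5.0,
    metrics_retention_hours=168,
    prometheus_enabled=True,
    prometheus_port=9090,
    json_export_enabled=False,
)
```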

Default Metrics

InvarLock registers these metrics automatically:

Metric                        Type       Description
----------------------------  ---------  -----------
invarlock.operations.total    Counter    Total operations by type and status.
invarlock.errors.total        Counter    Total errors by type.
invarlock.edits.applied       Counter    Total edits applied.
invarlock.guards.triggered    Counter    Guard triggers.
invarlock.operation.duration  Histogram  Operation duration distribution.
invarlock.edit.duration       Histogram  Edit operation duration.
invarlock.guard.duration      Histogram  Guard execution duration.
invarlock.memory.usage        Gauge      Current memory usage.
invarlock.gpu.memory.usage    Gauge      Current GPU memory usage.
invarlock.cpu.usage           Gauge      Current CPU usage.
invarlock.model.parameters    Gauge      Model parameter count.
invarlock.model.size_mb       Gauge      Model size in MB.
invarlock.model.loads         Counter    Model loads.
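The duration histograms above feed percentile summaries (see latency_percentiles in MonitoringConfig). As an illustration, here is a standalone nearest-rank percentile computation over recorded durations; it is a sketch of the technique, not InvarLock's internal implementation:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile of a list of durations (q in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: ceil(q/100 * n), clamped so rank is at least 1.
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Example durations (seconds) and the default percentile set.
durations = [12.0, 45.2, 30.1, 8.4, 61.7, 27.5, 33.3, 50.0, 19.9, 41.2]
summary = {q: percentile(durations, q) for q in [50, 90, 95, 99]}
```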

Health Checks

The HealthChecker runs these checks by default:

Check    Status Thresholds              Details
-------  -----------------------------  -------
memory   WARNING > 80%, CRITICAL > 90%  System RAM usage.
cpu      WARNING > 85%, CRITICAL > 95%  CPU utilization.
disk     WARNING > 85%, CRITICAL > 95%  Disk space on /.
gpu      WARNING > 85%, CRITICAL > 95%  GPU memory (if CUDA available).
pytorch  CRITICAL on failure            PyTorch functionality test.

InvarLock-specific checks (via InvarLockHealthChecker):

Check         Description
------------  -----------
adapters      Verifies adapter classes can be instantiated.
guards        Verifies guard classes can be instantiated.
dependencies  Checks for torch, transformers, numpy, psutil.

Health Status API

from invarlock.observability import InvarLockHealthChecker

checker = InvarLockHealthChecker()

# Check all components
results = checker.check_all()
for name, health in results.items():
    print(f"{name}: {health.status.value} - {health.message}")

# Get overall status
overall = checker.get_overall_status()
print(f"Overall: {overall.value}")

# Get summary
summary = checker.get_summary()
print(summary["status_counts"])

TelemetryCollector

Track operation lifecycles:

from invarlock.observability import MonitoringManager, TelemetryCollector

monitor = MonitoringManager()
telemetry = TelemetryCollector(monitor)

# Start tracking
op_id = telemetry.start_operation(
    "op-123",
    "evaluate",
    model_id="gpt2",
    profile="ci",
)

# ... perform operation ...

# End tracking
telemetry.end_operation(
    op_id,
    status="success",
    ratio_vs_baseline=1.02,
)

# Get stats
stats = telemetry.get_operation_stats()
print(f"Success rate: {stats['success_rate']:.2%}")
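Because end_operation should run even when the wrapped code raises, it can help to pair the two calls in a context manager. A sketch, assuming only the start_operation/end_operation calls shown above (the tracked helper itself is not part of the InvarLock API):

```python
from contextlib import contextmanager
import uuid

@contextmanager
def tracked(telemetry, operation_type, **metadata):
    """Ensure end_operation runs even when the wrapped operation raises."""
    op_id = telemetry.start_operation(str(uuid.uuid4()), operation_type, **metadata)
    try:
        yield op_id
    except Exception:
        telemetry.end_operation(op_id, status="error")
        raise
    else:
        telemetry.end_operation(op_id, status="success")
```

Usage would then be `with tracked(telemetry, "evaluate", model_id="gpt2"): ...`, which keeps the success/error bookkeeping out of the operation body.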

ResourceMonitor

from invarlock.observability.core import ResourceMonitor, MonitoringConfig
from invarlock.observability import MetricsRegistry

config = MonitoringConfig()
metrics = MetricsRegistry()
resource_monitor = ResourceMonitor(metrics, config)

# Get current usage
usage = resource_monitor.get_current_usage()
print(f"CPU: {usage['cpu_percent']:.1f}%")
print(f"Memory: {usage['memory_percent']:.1f}%")
print(f"Disk free: {usage['disk_free_gb']:.1f} GB")

# Check thresholds
warnings = resource_monitor.check_thresholds()
for warning in warnings:
    print(f"WARNING: {warning}")

JSON Export

When json_export_enabled=True, metrics are periodically exported to json_export_path:

ls ./monitoring/
# metrics_20260115_120000.json
# metrics_20260115_121000.json
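A small helper can load the most recent export. This assumes only the timestamped filename pattern shown above, not any particular JSON schema:

```python
import glob
import json
import os

def latest_metrics(export_dir):
    """Load the most recent metrics_*.json export, or None if none exist."""
    # Timestamped filenames sort lexicographically, so the last one is newest.
    files = sorted(glob.glob(os.path.join(export_dir, "metrics_*.json")))
    if not files:
        return None
    with open(files[-1]) as f:
        return json.load(f)
```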

Health HTTP Endpoint

For containerized deployments:

from invarlock.observability.health import create_health_endpoint

HTTPServer, HealthHandler = create_health_endpoint()
if HTTPServer:
    server = HTTPServer(("0.0.0.0", 8080), HealthHandler)
    server.serve_forever()

Access at http://localhost:8080/health:

{
  "overall_status": "healthy",
  "total_components": 8,
  "status_counts": {
    "healthy": 8,
    "warning": 0,
    "critical": 0,
    "unknown": 0
  },
  "components": {
    "memory": { "status": "healthy", "message": "Memory usage normal: 45.2%" },
    "cpu": { "status": "healthy", "message": "CPU usage normal: 12.3%" }
  }
}
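For a liveness probe in scripts or orchestration hooks, the endpoint can be polled with the standard library. This sketch assumes only the overall_status field shown in the example payload:

```python
import json
import urllib.request

def check_health(url="http://localhost:8080/health", timeout=5.0):
    """Return True only if the endpoint responds and reports 'healthy'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except OSError:
        # Connection refused, DNS failure, or timeout: treat as unhealthy.
        return False
    return payload.get("overall_status") == "healthy"
```

Returning False on any connection error keeps the probe usable as a binary readiness signal.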

Troubleshooting

  • High memory warnings: reduce batch size or use --device cpu for smaller footprint.
  • GPU memory critical: clear CUDA cache between runs or use chunked snapshots.
  • Health check failures: run invarlock doctor for detailed diagnostics.
  • Missing metrics: ensure monitoring is started before operations.

Observability in CLI

The CLI doesn't start full monitoring by default, but you can enable telemetry:

# Enable single-line telemetry summary
INVARLOCK_TELEMETRY=1 invarlock evaluate --baseline gpt2 --subject gpt2

Reports include telemetry under report.metrics:

  • latency_ms_per_tok — mean latency per token
  • memory_mb_peak — peak memory during run
  • throughput_tok_per_s — average throughput

Reports also copy these values into the telemetry block.