Technical Note

This note covers reproducibility contracts, evidence guarantees, budgets, and gated evolution. It describes interfaces and guarantees, not internal routing details.

Reproducibility contracts

Every UCogNet evaluation run operates under a strict reproducibility contract. The system is designed so that any result can be independently verified by replaying the frozen state with the same seeds and budgets.

Frozen runs

Every evaluation run is sealed with SHA-256 hashes, fixed random seeds, and declared token/time budgets. Nothing changes post-hoc.
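As a rough illustration of what sealing a run could involve, the sketch below hashes a run configuration (seed and budgets) and each artifact with SHA-256, then verifies them later. The function names `seal_run` and `verify_run`, the manifest layout, and the config keys are assumptions made for this example, not UCogNet's actual format.

```python
import hashlib
import json


def seal_run(config: dict, artifacts: dict[str, bytes]) -> dict:
    """Return a manifest that pins the config and every artifact by SHA-256."""
    # Canonical JSON (sorted keys, fixed separators) so equal configs hash equally.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "config": config,
        "config_sha256": hashlib.sha256(canonical).hexdigest(),
        "artifact_sha256": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(artifacts.items())
        },
    }


def verify_run(manifest: dict, artifacts: dict[str, bytes]) -> bool:
    """Recompute all hashes and compare them with the sealed manifest."""
    return seal_run(manifest["config"], artifacts) == manifest


# Hypothetical run: the seed and budget values are illustrative only.
manifest = seal_run(
    {"seed": 1234, "token_budget": 4096, "time_budget_s": 60},
    {"items.jsonl": b'{"task_id": "demo"}\n'},
)
```

Any change to the seed, a budget, or an artifact after sealing changes at least one hash, so `verify_run` returns `False`.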

Per-item evidence logs

Each benchmark item produces a JSONL record with task ID, prompt hash, raw output, extracted answer, and scoring trace. Fully auditable.
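A minimal sketch of such a record, assuming the five fields listed above map to the keys shown here. The key names, the `sha256:` prefix on the prompt hash, and the shape of the scoring trace are illustrative guesses rather than the documented schema.

```python
import hashlib
import io
import json


def make_record(task_id, prompt, raw_output, extracted_answer, scoring_trace):
    """Build one evidence record; only the prompt's hash is stored, not the prompt."""
    return {
        "task_id": task_id,
        "prompt_hash": "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "raw_output": raw_output,
        "extracted_answer": extracted_answer,
        "scoring_trace": scoring_trace,
    }


def append_record(stream, record):
    """Write one JSON object per line (the JSONL convention)."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# An in-memory stream stands in for the log file in this example.
log = io.StringIO()
append_record(
    log,
    make_record(
        "demo-001",
        "What is 2 + 2?",
        "The answer is 4.",
        "4",
        [{"step": "exact_match", "expected": "4", "score": 1.0}],
    ),
)
records = [json.loads(line) for line in log.getvalue().splitlines()]
```

Because each line is a complete JSON object, an auditor can stream the log and check items one at a time without loading the whole run.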

Artifact indexing + replay

All generated artifacts (tool code, sandbox outputs, intermediate reasoning) are indexed and replayable from the same frozen state.

Cost tracking

Wall-clock time, token counts (prompt + completion), and inference cost are tracked per item. Budget overruns trigger automatic rollback.

Evidence architecture

UCogNet does not produce bare answers. Every response carries structured evidence:

{
  "task_id": "arc-agi-2-item-042",
  "mode_selected": "puzzle_short",
  "confidence": 0.72,
  "claims": [
    { "claim": "pattern repeats on axis-1", "evidence_type": "grid_analysis" }
  ],
  "provenance": { "model": "qwen2.5-7b", "quant": "Q4_K_M", "tokens": 1240 },
  "replay_hash": "sha256:a3f8c1..."
}

Claims are explicit, provenance is machine-readable, and every output can be replayed. This is the foundation of audit-ready AI.
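One way a consumer might check that a response carries this structure is sketched below. The validator is hypothetical: the required-field list is inferred from the example above, and the assumption that `confidence` lies in [0, 1] is ours.

```python
REQUIRED_FIELDS = {
    "task_id": str,
    "mode_selected": str,
    "confidence": (int, float),
    "claims": list,
    "provenance": dict,
    "replay_hash": str,
}


def validate_evidence(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in response:
            problems.append(f"missing field: {name}")
        elif not isinstance(response[name], expected):
            problems.append(f"wrong type for field: {name}")
    confidence = response.get("confidence")
    # Assumption: confidence is a probability-like score in [0, 1].
    if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    claims = response.get("claims")
    if isinstance(claims, list):
        for index, claim in enumerate(claims):
            if not isinstance(claim, dict) or not {"claim", "evidence_type"} <= set(claim):
                problems.append(f"claim {index} lacks claim/evidence_type")
    replay_hash = response.get("replay_hash")
    if isinstance(replay_hash, str) and not replay_hash.startswith("sha256:"):
        problems.append("replay_hash is not prefixed with sha256:")
    return problems


# The example response from this section, with the hash left elided as published.
example = {
    "task_id": "arc-agi-2-item-042",
    "mode_selected": "puzzle_short",
    "confidence": 0.72,
    "claims": [{"claim": "pattern repeats on axis-1", "evidence_type": "grid_analysis"}],
    "provenance": {"model": "qwen2.5-7b", "quant": "Q4_K_M", "tokens": 1240},
    "replay_hash": "sha256:a3f8c1...",
}
```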

Budget system

Every task execution operates under declared budgets:

  • Token budget: Maximum prompt + completion tokens per item. Enforced at the adapter level.
  • Time budget: Wall-clock seconds. Sandbox execution is killed after timeout.
  • Cost budget: Aggregate $/run cap. Prevents runaway inference on paid APIs.
  • Tool budget: Maximum number of tool calls per task. Prevents infinite loops.
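The four budgets above can be pictured as running totals checked against declared limits. The sketch below is a simplified illustration: `BudgetTracker` and `BudgetExceeded` are hypothetical names, and a single tracker is used for all four limits even though the list above scopes them differently (per item, per task, per run).

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when any declared limit is crossed."""


@dataclass
class BudgetTracker:
    max_tokens: int
    max_seconds: float
    max_cost_usd: float
    max_tool_calls: int
    tokens: int = 0
    seconds: float = 0.0
    cost_usd: float = 0.0
    tool_calls: int = 0

    def charge(self, *, tokens: int = 0, seconds: float = 0.0,
               cost_usd: float = 0.0, tool_calls: int = 0) -> None:
        """Record usage, then raise if any running total is over its limit."""
        self.tokens += tokens
        self.seconds += seconds
        self.cost_usd += cost_usd
        self.tool_calls += tool_calls
        usage = {
            "token": (self.tokens, self.max_tokens),
            "time": (self.seconds, self.max_seconds),
            "cost": (self.cost_usd, self.max_cost_usd),
            "tool": (self.tool_calls, self.max_tool_calls),
        }
        for name, (used, limit) in usage.items():
            if used > limit:
                raise BudgetExceeded(f"{name} budget exceeded: {used} > {limit}")


# Illustrative limits and usage; none of these numbers come from UCogNet.
tracker = BudgetTracker(max_tokens=4096, max_seconds=60.0,
                        max_cost_usd=0.50, max_tool_calls=3)
tracker.charge(tokens=1240, seconds=2.5, cost_usd=0.01, tool_calls=1)
```

A caller would catch `BudgetExceeded` and stop the item. In a real sandbox the time limit would also need an external kill, since a check of this kind only runs when the code reports usage.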

Gated evolution with A/B gates

UCogNet evolves its policies through controlled mutations. Every candidate policy must pass through a series of gates before replacing the current best:

Improvement threshold

The candidate must exceed the baseline by a statistically significant margin, assessed with a bootstrap confidence interval (CI).
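One common way to implement a gate of this kind is a paired percentile bootstrap over per-item score differences. The sketch below is an assumption about how such a check could look: the resample count, the 95% level, and the rule "lower CI bound above zero" are illustrative choices, not disclosed parameters.

```python
import random


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences."""
    rng = random.Random(seed)  # fixed seed keeps the gate decision replayable
    n = len(diffs)
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def passes_improvement_gate(candidate_scores, baseline_scores, **kwargs):
    """Pass only if the whole CI of (candidate - baseline) lies above zero."""
    if len(candidate_scores) != len(baseline_scores) or not candidate_scores:
        raise ValueError("scores must be non-empty and paired item by item")
    diffs = [c - b for c, b in zip(candidate_scores, baseline_scores)]
    lower, _ = bootstrap_ci(diffs, **kwargs)
    return lower > 0.0
```

Pairing by item matters: both policies are scored on the same benchmark items, so resampling the differences removes item-difficulty variance that would otherwise widen the interval.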

Cost cap

A new mutation cannot cost more than 1.2x the current best policy.

Safety anomaly detection

Reward spikes > 3σ from rolling mean trigger automatic audit
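A minimal sketch of a rolling 3σ check follows. The window size, the warm-up length, and the decision to flag only upward spikes are assumptions for illustration, as is the class name `RewardMonitor`.

```python
import statistics
from collections import deque


class RewardMonitor:
    """Flag rewards that spike above the rolling mean by more than k sigma."""

    def __init__(self, window=50, threshold_sigma=3.0, min_history=10):
        self.history = deque(maxlen=window)  # oldest rewards fall off automatically
        self.threshold_sigma = threshold_sigma
        self.min_history = min_history

    def observe(self, reward: float) -> bool:
        """Return True if this reward should trigger an audit."""
        flagged = False
        if len(self.history) >= self.min_history:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history)
            # With zero variance there is no meaningful sigma, so nothing is flagged.
            flagged = std > 0 and (reward - mean) > self.threshold_sigma * std
        # Assumption: flagged rewards still enter the window.
        self.history.append(reward)
        return flagged


monitor = RewardMonitor()
warmup_flags = [monitor.observe(1.0 if i % 2 == 0 else 1.2) for i in range(20)]
normal_flag = monitor.observe(1.15)
spike_flag = monitor.observe(5.0)
```

Whether a flagged reward should be kept in the window is a design choice: keeping it, as here, lets the statistics adapt to a genuine shift, while excluding it keeps one outlier from inflating σ and masking the next.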

Gradual rollout

Mutations deploy to 10% → 30% → 100% traffic with gates at each stage

Rollback

If any gate fails, the system reverts to the previous policy within one evaluation cycle.
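The rollout and rollback steps can be read together as a small control loop. In the sketch below, `gate` is a hypothetical callable standing in for the full set of checks at each traffic stage, and policies are represented by plain strings.

```python
ROLLOUT_STAGES = (0.10, 0.30, 1.00)  # traffic fractions from the rollout schedule


def staged_rollout(current, candidate, gate, stages=ROLLOUT_STAGES):
    """Promote `candidate` stage by stage; fall back to `current` on any failure.

    `gate(policy, fraction)` returns True when every check passes at that
    traffic fraction. Returns the policy left active and a per-stage history.
    """
    history = []
    for fraction in stages:
        passed = gate(candidate, fraction)
        history.append((fraction, passed))
        if not passed:
            # Rollback: the previous policy stays active, later stages never run.
            return current, history
    return candidate, history


# Hypothetical gates: one that always passes, one that fails at the 30% stage.
promoted, promoted_history = staged_rollout("policy-v1", "policy-v2", lambda p, f: True)
reverted, reverted_history = staged_rollout("policy-v1", "policy-v2", lambda p, f: f < 0.30)
```

The candidate replaces the current policy only after the final stage passes, so a failure at any point leaves the previous policy in place without a separate restore step.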

Scope of this document

This technical note describes interfaces and guarantees, not internal implementation. Specifically, we do not disclose:

  • Internal routing logic or routing model weights
  • Reward function coefficients or shaping details
  • Specific mutation operators or search strategies
  • Benchmark-specific prompt engineering

These are available under NDA for qualified partners and investors. Contact samuel@ucognet.pro for access.
