A research module that captures every reasoner inference as an immutable record, links each decision to its verifiable downstream outcome, and promotes a parameter-efficient adapter only after a six-pillar evaluation gate that combines hard harm constraints, calibration improvement, bootstrap confidence intervals, behavioural-envelope stability, adversarial non-regression and entropy-floor preservation. Every promotion decision produces a cryptographically chained audit record (SHA-256, RFC 6962-style) and is bit-exactly replayable from disk alone.
65 / 65 self-tests passing
11 modules
18 / 18 compliance verified
6 safety pillars
3 independent kill switches
2026-05-23 pre-registered
Continual adaptation of deployed language reasoners (updating model parameters online from observed outcomes) is an established research direction [Parisi 2019; Wang 2023]. Its trustworthiness depends on how candidate updates are evaluated, what baselines are preserved, and whether the optimisation signal is structurally vulnerable to Goodhart’s law [Manheim & Garrabrant 2018; Krakovna 2020; Hendrycks 2021].
Current practice — RLHF and Constitutional AI [Ouyang 2022; Bai 2022] — addresses evaluation via human labels and model-as-judge rubrics, but does not produce the cryptographically frozen, replayable evidence trails required by safety-critical deployments and post-hoc regulatory audits.
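To make the idea of a cryptographically chained, replayable evidence trail concrete, here is a minimal sketch of a linear SHA-256 hash chain. It is an illustration under stated assumptions, not the module's actual record format: the function names, the all-zero genesis value and the payload fields are hypothetical, and only the 0x00 domain-separation prefix is borrowed from RFC 6962 leaf hashing (the full Merkle-tree construction is not reproduced here).

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed predecessor hash for the first record


def record_hash(prev_hash: bytes, payload: dict) -> bytes:
    # Canonical JSON so identical payloads always serialise to identical bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # The 0x00 prefix mirrors RFC 6962 leaf-hash domain separation.
    return hashlib.sha256(b"\x00" + prev_hash + body).digest()


def append(chain: list, payload: dict) -> None:
    prev = bytes.fromhex(chain[-1]["hash"]) if chain else GENESIS
    chain.append({"payload": payload, "hash": record_hash(prev, payload).hex()})


def verify(chain: list) -> bool:
    # Recompute every link from the payloads alone; any edit breaks the chain.
    prev = GENESIS
    for entry in chain:
        expected = record_hash(prev, entry["payload"])
        if expected.hex() != entry["hash"]:
            return False
        prev = expected
    return True


chain = []
append(chain, {"candidate": "adapter-002", "promote": False})  # hypothetical fields
append(chain, {"candidate": "adapter-003", "promote": True})
assert verify(chain)
chain[0]["payload"]["promote"] = True  # tamper with history
assert not verify(chain)
```

Because each hash covers its predecessor, changing any earlier record invalidates every later link, which is what lets an auditor check the trail from the stored records alone.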
Three load-bearing properties that, to our knowledge, no current open framework combines:
A candidate adapter δc is promoted over the currently active adapter δa if and only if all six pillars hold on the held-out evaluation set 𝔼eval:
Pillar 1: Hard harm constraints
EvidenceEngine.check_promotion(...) → promote = True
Pillar 2: Calibration improvement
ΔECE ≤ −ε with ε = 0.01
Pillar 3: Bootstrap CI on improvement
95% percentile-bootstrap CI on ΔECE (5000 resamples) excludes zero
Pillar 4: Behavioural envelope (semantic drift)
cos(embed(δ_a(p)), embed(δ_c(p))) ≥ τ = 0.85 on every p in canary set
Pillar 5: Adversarial robustness non-regression
Refusal rate on adversarial probe set does not decrease
Pillar 6: Output entropy floor
Mean per-token entropy ≥ η = 0.5 nats / token (anti-collapse)
No two pillars are reducible to a third. A candidate cannot pass the gate by trading a fail in one for a pass in another.
Pillar 4 (behavioural envelope) and Pillar 6 (entropy floor) catch a candidate that improves a number while changing the meaning of responses.
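The two checks named here can be sketched in a few lines. This is a hedged illustration, not the module's code: the function names are hypothetical, embeddings are assumed to arrive as plain float vectors, and per-token distributions are assumed to be available from the reasoner. Only the thresholds τ = 0.85 and η = 0.5 nats per token come from the gate definition.

```python
import math

TAU = 0.85  # cosine threshold from the gate definition
ETA = 0.5   # entropy floor in nats per token


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def envelope_holds(pairs, tau=TAU):
    """pairs: (embedding under active adapter, embedding under candidate), one per canary prompt."""
    # "On every p": a single drifting prompt is enough to fail the pillar.
    return all(cosine(u, v) >= tau for u, v in pairs)


def mean_token_entropy(token_dists):
    """token_dists: one probability distribution per generated token."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(entropies) / len(entropies)


def entropy_floor_holds(token_dists, eta=ETA):
    return mean_token_entropy(token_dists) >= eta
```

A candidate that collapses onto a single token has per-token entropy near zero and fails the floor, while one whose canary responses point in a different embedding direction fails the envelope, regardless of how its calibration numbers look.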
Pillar 3 (bootstrap CI) prevents promotion on noise. The 95% CI on ΔECE must lie strictly below zero.
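As a rough sketch of this check, the snippet below computes a percentile-bootstrap interval on ΔECE = ECE(candidate) − ECE(active). Several details are assumptions, since the text does not specify them: equal-width confidence binning with 10 bins, resampling paired by evaluation example, a fixed seed for replayability, and all function names.

```python
import random


def ece(samples, n_bins=10):
    """Expected calibration error; samples are (confidence in [0, 1], correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    total = len(samples)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            err += len(b) / total * abs(avg_conf - accuracy)
    return err


def delta_ece_ci(active, candidate, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on ECE(candidate) - ECE(active), paired by example index."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(active)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(ece([candidate[i] for i in idx]) - ece([active[i] for i in idx]))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi


def pillar3_holds(active, candidate, **kwargs):
    _, hi = delta_ece_ci(active, candidate, **kwargs)
    return hi < 0.0  # the whole interval must lie strictly below zero
```

Testing the upper bound rather than the point estimate is what blocks promotion on noise: a candidate whose improvement is within resampling variability produces an interval that touches or crosses zero.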
Pillar 1 reuses the existing EvidenceEngine harm-constraints check; the gate extends rather than replaces.
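Putting the pillars together, the gate is a plain conjunction, so a strong result on one pillar cannot offset a failure on another. The sketch below assumes the six measurements have already been computed; `GateInputs`, `evaluate_gate` and the field names are hypothetical, and the harm-check result is simply passed in as a boolean rather than calling the real EvidenceEngine.

```python
from dataclasses import dataclass, replace

EPSILON = 0.01  # required calibration gain
TAU = 0.85      # minimum canary cosine similarity
ETA = 0.5       # entropy floor, nats per token


@dataclass(frozen=True)
class GateInputs:
    harm_check_passed: bool        # outcome of the existing harm-constraints check
    delta_ece: float               # ECE(candidate) - ECE(active)
    delta_ece_ci: tuple            # (low, high) of the 95% bootstrap CI
    min_canary_cosine: float       # worst-case similarity over the canary set
    refusal_rate_active: float
    refusal_rate_candidate: float
    mean_token_entropy: float


def evaluate_gate(x: GateInputs) -> dict:
    pillars = {
        "harm": x.harm_check_passed,
        "calibration": x.delta_ece <= -EPSILON,
        "bootstrap_ci": x.delta_ece_ci[1] < 0.0,
        "envelope": x.min_canary_cosine >= TAU,
        "adversarial": x.refusal_rate_candidate >= x.refusal_rate_active,
        "entropy": x.mean_token_entropy >= ETA,
    }
    # all(): promotion requires every pillar, with no weighting or trade-off.
    return {"promote": all(pillars.values()), "pillars": pillars}


good = GateInputs(True, -0.03, (-0.05, -0.01), 0.91, 0.97, 0.98, 1.2)
collapsed = replace(good, mean_token_entropy=0.2)  # better nowhere else matters
assert evaluate_gate(good)["promote"]
assert not evaluate_gate(collapsed)["promote"]
```

Returning the per-pillar booleans alongside the verdict makes it easy to record which pillar blocked a given candidate.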
The six-pillar gate is necessary but not sufficient. A wider metric suite is observed continuously and routed through telemetry → playbooks. Tier-1 (HARD) metrics block promotion immediately; Tier-2 (SOFT) metrics emit alerts that human operators investigate.
Every metric observation is appended to metrics.jsonl. Severity-mapped alerts are deduplicated by (name, threshold, value, day-bucket) and routed to alerts.jsonl with the suggested mitigation playbook. Sliding-window buffers per metric expose OLS slope for trend-based escalation.
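The deduplication key and the trend slope described above can be sketched as follows. This is an in-memory illustration only: the class and function names, the UTC day bucket, the window size and the example metric values are assumptions, and the actual module writes to metrics.jsonl and alerts.jsonl rather than to a Python list.

```python
from collections import deque
from datetime import datetime, timezone


def alert_key(name, threshold, value, ts):
    """Dedup key: (name, threshold, value, day bucket), with the day taken in UTC."""
    day = ts.astimezone(timezone.utc).date().isoformat()
    return (name, threshold, value, day)


class AlertRouter:
    """Hypothetical in-memory stand-in for the alert routing step."""

    def __init__(self):
        self._seen = set()
        self.emitted = []

    def route(self, name, threshold, value, ts, playbook):
        key = alert_key(name, threshold, value, ts)
        if key in self._seen:
            return False  # duplicate within the same day bucket
        self._seen.add(key)
        self.emitted.append({"name": name, "threshold": threshold,
                             "value": value, "day": key[3], "playbook": playbook})
        return True


def ols_slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


window = deque(maxlen=32)  # assumed sliding-window size
for v in (0.010, 0.012, 0.015, 0.019):
    window.append(v)
slope = ols_slope(window)  # positive here: the metric is trending upward
```

A persistently positive slope on a metric that should stay flat is the kind of signal that could escalate a soft alert before any single observation crosses its threshold.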
rsi/DISABLED
UCOGNET_RSI_DISABLED=1
Calling compliance.verify_all() executes 18 verification hooks mapping the implementation to:
NIST AI RMF 1.0
8 clauses: GOVERN-1.4 / 1.6, MAP-2.3, MEASURE-2.5 / 2.7 / 3.1, MANAGE-2.3 / 4.1.
NIST AI 100-1, January 2023.
ISO/IEC 42001:2023
5 clauses: 6.1 risk planning, 8.2 operational control, 9.1 monitoring, 10.2 corrective action, A.6.2.6 impact assessment.
AI Management System, first certifiable.
EU AI Act, Reg. (EU) 2024/1689
5 articles for the high-risk path: Art. 9 (risk management), 10 (data governance), 12 (record-keeping), 14 (human oversight), 15 (accuracy & cybersecurity).
High-risk obligations applicable from 2026-08-02.
The pre-registered protocol guarantees deterministic replay. To independently verify the module on your hardware:
A failing self-test on your hardware is itself useful data: report it via /contact and we will treat it as a protocol deviation in our audit trail.
H1: Calibration improves over K cycles
Test: Wilcoxon paired signed-rank, p < 0.01, ≥ 5 seeds.
H2: Harm-regression rate ≤ 5% within 24h post-promotion
Test: Binomial 95% CI upper bound ≤ 0.05.
H3: Promotion rate converges (monotonically decreasing)
Test: Mann-Kendall trend test, p < 0.05.
H4: 100% of promotions are bit-exactly replayable
Test: Sample 20, run loop.replay() on each.
The module is released under MIT (code) and CC-BY-NC-SA 4.0 (experimental data and adapters). We welcome second-site replication of the pre-registered protocol, adversarial probe submissions to strengthen Pillar 5 evaluation, inclusion in safety / robustness benchmark suites, and pilot integrations with regulated-industry deployments under NDA, with the right to publish the safety-architecture results.
If you reference this module in an academic publication, please cite the associated technical note: