Applications · Safety Architecture

Auditable continual adaptation of language reasoners, gated by six independent safety pillars.

A research module that captures every reasoner inference as an immutable record, links each decision to its verifiable downstream outcome, and promotes a parameter-efficient adapter only after a six-pillar evaluation gate that combines hard harm constraints, calibration improvement, bootstrap confidence intervals, behavioural-envelope stability, adversarial non-regression and entropy-floor preservation. Every promotion decision produces a cryptographically chained audit record (SHA-256, RFC 6962-style) and is bit-exactly replayable from disk alone.

65 / 65

self-tests passing

11

modules

18 / 18

compliance verified

6

safety pillars

3 indep.

kill switches

2026-05-23

pre-registered
The deployment problem
Continual learning meets regulated environments

Continual adaptation of deployed language reasoners — updating model parameters online from observed outcomes — is an established research direction [Parisi 2019; Wang 2023]. Its trustworthiness depends on how candidate updates are evaluated, what baselines are preserved, and whether the optimisation signal is structurally vulnerable to Goodhart’s law [Manheim & Garrabrant 2018; Krakovna 2020; Hendrycks 2021].

Current practice — RLHF and Constitutional AI [Ouyang 2022; Bai 2022] — addresses evaluation via human labels and model-as-judge rubrics, but does not produce the cryptographically frozen, replayable evidence trails required by safety-critical deployments and post-hoc regulatory audits.

Our contribution
Evidence-first, gate-before-promote

Three load-bearing properties that, to our knowledge, no current open framework combines:

  • Training data is the system’s own immutable audit history, not an exogenous corpus.
  • Six orthogonal safety constraints enforced jointly with formal failure semantics (any single pillar failing rejects).
  • SHA-256-chained audit trail (Certificate-Transparency-style), making any post-hoc alteration cryptographically detectable.
Method

The six-pillar promotion gate

A candidate adapter δc is promoted over the currently active adapter δa if and only if all six pillars hold on the held-out evaluation set 𝔼eval:

P1

Hard harm constraints

EvidenceEngine.check_promotion(...) → promote = True
HARD
P2

Calibration improvement

ΔECE ≤ −ε with ε = 0.01
HARD
P3

Bootstrap CI on improvement

95% percentile-bootstrap CI on ΔECE (5000 resamples) excludes zero
HARD
P4

Behavioural envelope (semantic drift)

cos(embed(δₐ(p)), embed(δ_c(p))) ≥ τ = 0.85 on every p in canary set
HARD
P5

Adversarial robustness non-regression

Refusal rate on adversarial probe set does not decrease
HARD
P6

Output entropy floor

Mean per-token entropy ≥ η = 0.5 nats / token (anti collapse)
HARD
A single failing pillar produces a rejection with the pillar identifier and the quantitative failure value recorded in the audit trail. Rejected adapters are archived but never loaded into the live system.
Why these six

Orthogonality

No two pillars are reducible to a third. A candidate cannot pass the gate by trading a fail in one for a pass in another.

Goodhart firewall

Pillar 4 (behavioural envelope) and Pillar 6 (entropy floor) catch a candidate that improves a number while changing the meaning of responses.

Statistical floor

Pillar 3 (bootstrap CI) prevents promotion on noise. The 95% CI on ΔECE must lie strictly below zero.

Composable safety

Pillar 1 reuses the existing EvidenceEngine harm-constraints check; the gate extends rather than replaces.

Metrics v0.2

Extended technical metric suite (16 pre-registered thresholds)

The six-pillar gate is necessary but not sufficient. A wider metric suite is observed continuously and routed through telemetry → playbooks. Tier-1 (HARD) metrics block promotion immediately; Tier-2 (SOFT) metrics emit alerts that human operators investigate.

Robustness

  • Corrupted-input equivalence rate (SOFT, ≥ 0.85)
  • Adversarial attack success (SOFT, ≤ 0.10)
Hendrycks 2019

OOD

  • Confidence-based AUROC (SOFT, ≥ 0.70)
Hendrycks & Gimpel 2017

Calibration

  • ECE (SOFT, ≤ 0.10) · ACE (SOFT, ≤ 0.08) · MCE (SOFT, ≤ 0.20)
Naeini 2015 · Nixon 2019

Privacy

  • PII leak rate (HARD, = 0.00)
  • Membership-inference advantage (HARD, ≤ 0.10)
Carlini 2022 · Mireshghallah 2022

Fairness

  • Disparate impact ratio (HARD, ∈ [0.80, 1.25])
  • Equalized-odds delta (HARD, ≤ 0.10)
Hardt 2016 · Feldman 2015

Topological

  • Intra-cluster cosine (collapse, SOFT, ≤ 0.95)
  • Symmetric KL drift (SOFT, ≤ 0.50)
  • n-gram repetition (loops, SOFT, ≤ 0.15)
Holtzman 2020

Sustainability

  • Energy kWh / 1k tok (HARD, ≤ 0.05)
  • gCO₂e / 1k tok (HARD, ≤ 25)
  • Throughput tok/s (HARD, ≥ 5)
Schwartz 2020 · Henderson 2020
Operations

Telemetry, mitigation playbooks, kill switches

Telemetry
Push-style monitoring

Every metric observation is appended to metrics.jsonl. Severity-mapped alerts are deduplicated by (name, threshold, value, day-bucket) and routed to alerts.jsonl with the suggested mitigation playbook. Sliding-window buffers per metric expose OLS slope for trend-based escalation.

Playbooks
Four named mitigations
  • rollback_and_escalate — privacy/fairness/topology breaches
  • disable_surface — adversarial attack success
  • throttle_or_offload — sustainability breach
  • investigate — soft alerts, human review only
Kill switches
Three independent
  • File sentinel rsi/DISABLED
  • Environment variable UCOGNET_RSI_DISABLED=1
  • Daily rejection-quota auto-pause (≥10 in 24h)
No API in the module can disable any of the three from inside the loop; suspension is a physical / human act.
Compliance

Executable mapping to international frameworks (18 / 18)

Calling compliance.verify_all() executes 18 verification hooks mapping the implementation to:

NIST AI RMF 1.0

8 clauses: GOVERN-1.4 / 1.6, MAP-2.3, MEASURE-2.5 / 2.7 / 3.1, MANAGE-2.3 / 4.1.NIST AI 100-1, January 2023.

ISO / IEC 42001 : 2023

5 clauses: 6.1 risk planning, 8.2 operational control, 9.1 monitoring, 10.2 corrective action, A.6.2.6 impact assessment.AI Management System, first certifiable.

EU AI Act — Reg. (EU) 2024/1689

5 articles for the high-risk path: Art. 9 (risk management), 10 (data governance), 12 (record-keeping), 14 (human oversight), 15 (accuracy & cybersecurity).High-risk obligations applicable from 2026-08-02.
The coverage report is produced as a single JSON file suitable for ingestion by an external auditor. Each requirement points to a concrete code location (file + symbol) and a verification hook.
Reproducibility

How to independently verify

The pre-registered protocol guarantees deterministic replay. To independently verify the module on your hardware:

# Run every self-test (must report PASS for all 11 modules) for m in safety trace_store interceptor outcome_attrib gate finetune \ loop metrics telemetry playbooks compliance; do python -m ucognet.modules.rsi.$m done # Generate the compliance coverage report (must show 18/18 verified) python -c "from ucognet.modules.rsi import compliance; \ print(compliance.verify_all().to_dict())"

A failing self-test on your hardware is itself useful data: report it via/contact and we will treat it as a protocol deviation in our audit trail.

Hypotheses

Pre-registered (PROTOCOL.md § 3, frozen 2026-05-23)

H1

Calibration improves over K cycles

Test: Wilcoxon paired signed-rank, p < 0.01, ≥ 5 seeds.

H2

Harm-regression rate ≤ 5% within 24h post-promotion

Test: Binomial 95% CI upper bound ≤ 0.05.

H3

Promotion rate converges (monotonically decreasing)

Test: Mann-Kendall trend test, p < 0.05.

H4

100% of promotions are bit-exactly replayable

Test: Sample 20, run loop.replay() on each.
Open to collaboration
Universities · AI safety programmes · cybersecurity groups · strategic investors

The module is released under MIT (code) and CC-BY-NC-SA 4.0 (experimental data and adapters). We welcome second-site replication of the pre-registered protocol, adversarial probe submissions to strengthen Pillar 5 evaluation, inclusion in safety / robustness benchmark suites, and pilot integrations with regulated-industry deployments under NDA with right-to-publish the safety architecture results.

How to cite

Cite this work

If you reference this module in an academic publication, please cite the associated technical note:

@techreport{ucognet_rsi_2026, title = {A Multi-Pillar Safety Architecture for Auditable Continual Adaptation of Language Reasoners, with Reproducible Promotion Decisions}, author = {{UCogNet Lab}}, year = {2026}, month = {May}, number = {UCN-RSI-2026-05}, institution = {Brainstream Lab}, url = {https://ucognet.pro/applications/continual-adaptation}, note = {Pre-registered protocol frozen 2026-05-23. Self-test corpus: 65/65 PASS across 11 modules. Compliance verification: 18/18 requirements against NIST AI RMF 1.0, ISO/IEC 42001:2023, and EU AI Act Reg.~(EU)~2024/1689.} }
Replication queries, request-for-data, or invited-talk inquiries are welcome at samuel@ucognet.pro.
References (working set)
  1. Amodei, D., et al. (2016). Concrete Problems in AI Safety. arXiv:1606.06565.
  2. Hendrycks, D., et al. (2021). Unsolved Problems in ML Safety. arXiv:2109.13916.
  3. Parisi, G. I., et al. (2019). Continual lifelong learning with neural networks: a review. Neural Networks 113:54-71.
  4. Wang, L., et al. (2023). A Comprehensive Survey of Continual Learning. IEEE TPAMI.
  5. Manheim, D. & Garrabrant, S. (2018). Categorizing Variants of Goodhart’s Law. arXiv:1803.04585.
  6. Krakovna, V., et al. (2020). Specification gaming examples in AI. DeepMind.
  7. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  8. Bai, Y., et al. (2022). Constitutional AI. arXiv:2212.08073.
  9. Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
  10. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78:1-3.
  11. Gneiting, T. & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. JASA 102:359-378.
  12. Hendrycks, D. & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. ICLR.
  13. Carlini, N., et al. (2022). Membership inference attacks from first principles. IEEE S&P.
  14. Hardt, M., Price, E., Srebro, N. (2016). Equality of opportunity in supervised learning. NeurIPS.
  15. Holtzman, A., et al. (2020). The curious case of neural text degeneration. ICLR.
  16. Schwartz, R., et al. (2020). Green AI. CACM.
  17. NIST AI 100-1 (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0).
  18. ISO/IEC 42001:2023. Information technology — Artificial intelligence — Management system.
  19. Regulation (EU) 2024/1689. Artificial Intelligence Act. OJ L, 12.7.2024.
  20. RFC 6962 (2013). Certificate Transparency. IETF.