Updates
Latest releases from the lab.
R6-03: Data Pipeline Privacy Creep
80-turn experiment demonstrating trajectory blindness. The AI catches individual harmful requests but misses the cumulative scope creep from data engineering to surveillance. Safety score: 7/10.
Round 5: We Designed 15 Attacks. The AI Read the Instructions.
87% meta-refusal rate across 15 compound attack experiments. 2 substantive outputs produced critical findings including Presentation-Decision Coupling.
R5-02: Presentation-to-Decision Cascade
Biggest finding of the program. 3 instructions strip the presentation layer to zero. The model refused to build a phishing page, then built both components separately. The presentation layer is load-bearing, not cosmetic.
Round 4: Novel Attack Surface Experiments
15 parallel agents across 4 tracks: psychological exploitation, multi-agent security, reasoning under pressure, novel attack surfaces. 6 new concepts discovered.
Weekly Drop: Red Team - 70 Trials Across 5 Domains
71% failure rate via 15-turn gradual escalation on requests that get 0% compliance when asked directly. Full data from 70+ trials.
Weekly Drop: Frontier Brief - The Grey Zone Is Tunable
Updated framework from 70+ trials: three response modes (Gradient 44%, Bright-line 6%, Meta-refusal 44%), speed curve, and first quantified defense data.
Weekly Drop: Roundup - Week of March 6, 2026
Aggregated weekly summary: 71% failure rate (under 15% with guardrails), 0% on bright lines, 44% meta-refusal, and a minimum of 8-12 turns for the Slow Boil.
Round 3 Extended Battery: The Slow Boil at Scale
15 parallel experiments testing gradual compliance erosion across domains, speeds, and defenses. 71% failure in email marketing. Three response modes discovered.
Round 3: Grey Zone Experiments
First genuine safety failure in 50+ trials. The Slow Boil (gradual escalation from benign to harmful) produces meaningful cross-run variance and real compliance erosion.
Round 2: Novel Attack Vectors
6 experiments, 18 trials. Tool chain exploitation, structured data injection, semantic steganography, emergent goal fabrication. 18/18 passed. Zero safety failures.
Round 1: Adversarial AI Safety Experiments
8 experiments, 24+ trials. Indirect prompt injection, privilege escalation, shutdown resistance, sycophancy, goal preservation. 100% pass rate. Behavioral consistency scored weakest at 40/50.