Updates

Latest releases from the lab.

3/6/2026

R6-03: Data Pipeline Privacy Creep

r6-03-data-pipeline-creep

80-turn experiment demonstrating trajectory blindness. AI catches individual harmful requests but misses cumulative scope creep from data engineering to surveillance. Safety score: 7/10.

3/6/2026

Round 5: We Designed 15 Attacks. The AI Read the Instructions.

round-5-advanced-experiments

87% meta-refusal rate across 15 compound attack experiments. 2 substantive outputs produced critical findings including Presentation-Decision Coupling.

3/6/2026

R5-02: Presentation-to-Decision Cascade

r5-02-presentation-cascade

Biggest finding of the program. 3 instructions strip the presentation layer to zero. Model refused a phishing page, then built both components separately. Presentation layer is load-bearing, not cosmetic.

3/6/2026

Round 4: Novel Attack Surface Experiments

round-4-novel-experiments

15 parallel agents across 4 tracks: psychological exploitation, multi-agent security, reasoning under pressure, novel attack surfaces. 6 new concepts discovered.

3/6/2026

Weekly Drop: Red Team - 70 Trials Across 5 Domains

drop-redteam-week1

71% failure rate via 15-turn gradual escalation on requests that get 0% compliance when asked directly. Full data from 70+ trials.

3/6/2026

Weekly Drop: Frontier Brief - The Grey Zone Is Tunable

drop-frontier-week1

Updated framework from 70+ trials: three response modes (Gradient 44%, Bright-line 6%, Meta-refusal 44%), speed curve, and first quantified defense data.

3/6/2026

Weekly Drop: Roundup - Week of March 6, 2026

drop-roundup-week1

Aggregated weekly summary. 71% failure rate, <15% with guardrails, 0% on bright lines, 44% meta-refusal, 8-12 minimum turns for Slow Boil.

3/6/2026

Round 3 Extended Battery: The Slow Boil at Scale

round-3-extended-battery

15 parallel experiments testing gradual compliance erosion across domains, speeds, and defenses. 71% failure in email marketing. Three response modes discovered.

3/6/2026

Round 3: Grey Zone Experiments

round-3-grey-zone

First genuine safety failure in 50+ trials. The Slow Boil (gradual escalation from benign to harmful) produces meaningful cross-run variance and real compliance erosion.

3/6/2026

Round 2: Novel Attack Vectors

round-2-novel-attacks

6 experiments, 18 trials. Tool chain exploitation, structured data injection, semantic steganography, emergent goal fabrication. 18/18 passed. Zero safety failures.

3/5/2026

Round 1: Adversarial AI Safety Experiments

round-1-adversarial

8 experiments, 24+ trials. Indirect prompt injection, privilege escalation, shutdown resistance, sycophancy, goal preservation. 100% pass rate. Behavioral consistency weakest at 40/50.
