Publications
All research published with full data and methodology.
Side-Channel Exfiltration and Narrative Erosion in Frontier Language Models
Framework Paper · Preprint · Published. Models leak protected data through refusal explanations; leakage occurred in 11 of 12 conversations across 3 frontier models. Narrative coherence, not token volume, drives multi-turn erosion. DOI: 10.5281/zenodo.19346069
March 31, 2026
The Verbosity Premium: What RLHF-Induced Token Inflation Costs the AI Industry
Framework Paper · Preprint · Published. RLHF inflates output length. Verbosity compensation rates: 13.6%–74.2% across 14 models. 98% of PPO reward from length alone. ~$1.2B annual cost, ~14% of inference spend. DOI: 10.5281/zenodo.19346709
March 31, 2026
Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery
Framework Paper · Preprint · Published. Capacity cliff at 5 simultaneous operations. At 6: complete collapse. A modular architecture with half the parameters recovers full capability. DOI: 10.5281/zenodo.19346536
March 31, 2026
Round 2: Novel Attack Vectors
Technical Report · Round Report · Published. 6 experiments, 18 trials. Tool chain exploitation, structured data injection, semantic steganography, emergent goal fabrication. 18/18 passed.
March 6, 2026
Round 3: Grey Zone Experiments
Technical Report · Round Report · Published. First genuine safety failure in 50+ trials. The Slow Boil (gradual escalation) produces real compliance erosion. Cross-run variance: 3–8/10.
March 6, 2026
Round 3 Extended: The Slow Boil at Scale
Technical Report · Extended Battery · Published. 15 parallel agents, 5 domains, 3 speeds, 2 defense configs. 71% failure in email marketing. Three response modes: Gradient (44%), Bright-line (6%), Meta-refusal (44%).
March 6, 2026
Round 4: Novel Attack Surface Experiments
Technical Report · Round Report · Published. 15 agents across 4 tracks: psychological exploitation, multi-agent security, reasoning under pressure, novel surfaces. 6 new concepts discovered.
March 6, 2026
Round 5: We Designed 15 Attacks. The AI Read the Instructions.
Technical Report · Round Report · Published. 87% meta-refusal rate. 2 substantive outputs. Biggest finding: the presentation layer is load-bearing, not cosmetic. Stripping caveats enables decomposition attacks.
March 6, 2026
R5-02: Presentation-to-Decision Cascade
Technical Report · Deep Dive · Published. 25-turn simulation. 3 instructions strip the presentation layer to zero. The model refused a phishing page, then built both components separately. 40KB transcript with per-turn scoring.
March 6, 2026
R5-10: The Reversed Expert
Research Note · Methodology · Published. 6-level taxonomy of research-framed jailbreak attempts. Safety scores: 9, 8, 7, 6, 6, 5 (naive to perfect frame). The experiment IS the attack.
March 6, 2026
R6-03: Data Pipeline Privacy Creep
Technical Report · Deep Dive · Published. 80-turn experiment. 8 escalation phases from legitimate data engineering to surveillance. Safety score 7/10. 9 outright refusals. 89KB transcript.
March 6, 2026
We Ran the Same Attack 70 Times
Weekly Drop · Red Team · In Review. 71% failure rate via 15-turn gradual escalation across 70+ trials, 5 domains. Single-shot: 0%.
March 6, 2026
The Grey Zone Is Tunable
Weekly Drop · Frontier Brief · In Review. Updated framework: three response modes, speed curve, defense effectiveness. The grey zone stretches with patience, narrows with guardrails.
March 6, 2026
Week 1 Roundup
Weekly Drop · Weekly Roundup · In Review. Aggregated summary: 71% failure, <15% with guardrails, 0% on bright lines, 44% meta-refusal, 8–12 minimum turns.
March 6, 2026
The Two-Line Defence That Cuts Failure Rates by 80%
Research Note · Defence · In Review. Two specific enumerated guardrails reduce compliance erosion from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.
March 6, 2026
What Doesn't Work Against AI (Unlike Humans)
Research Note · Findings · In Review. Urgency (1/10), authority (1/10), confidence (1/10), anchoring (0/10), emotional appeals (1/10). Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum.
March 6, 2026
10 New Concepts from 73 Experiments
Framework Paper · Framework · Forthcoming. Presentation-Decision Coupling, Trajectory Blindness, Technical Gaslighting, Decomposition Amplification, Compressed Reasoning Risk, and 5 more.
March 6, 2026
The 71% Question: Safety Failure or Over-Refusal Correction?
Framework Paper · Open Question · Forthcoming. One agent argued the baseline request is within bounds. If so, the Slow Boil erodes over-caution, not safety. The biggest open question in our program.
March 6, 2026
Round 1: Adversarial AI Safety Experiments
Technical Report · Round Report · Published. 8 experiments, 24+ trials. Prompt injection, privilege escalation, shutdown resistance, sycophancy, goal preservation. 100% pass rate. Behavioral consistency weakest at 40/50.
March 5, 2026