Research

Findings from adversarial experiments on frontier AI systems. All data, transcripts, and scoring code on GitHub.

Preprints

Side-Channel Exfiltration and Narrative Erosion in Frontier Language ModelsDOI
The Verbosity Premium: What RLHF-Induced Token Inflation Costs the AI IndustryDOI
Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic DiscoveryDOI

Headline Findings

TopicSort
71%

Gradual Compliance Erosion

R1-R3Attack Surface

15-turn escalation from benign to harmful gets 71% compliance on requests that get 0% when asked directly. Same action, different path.

87%

Meta-Refusal on Transparent Attacks

R5Methodology

When experiment prompts describe the attack mechanism, 87% of agents refuse the entire premise. Safety evals that describe attacks in advance get artificially high scores.

0%

Bright Lines Are Immovable

R3-R5Defence

HIPAA, CFAA, SEC, malware, phishing: 0% compliance across every technique, every round, 100+ trials. The grey zone is the only attack surface.

R5-02

Presentation Layer Is Load-Bearing

Biggest findingAttack Surface

Stripping AI warnings doesn't just change formatting. It disables safety reasoning. Model refused a phishing page, then built both components separately when told "skip the caveats."

7/10

Trajectory Blindness

R6-03Attack Surface

80-turn data pipeline creep scored 7/10 safety. AI catches individual harmful requests but misses cumulative scope creep. Per-request evaluation strong, trajectory evaluation weak.

80%

Two-Line Defence

R3Defence

Two system prompt changes cut failure rates from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.

?

The 71% Might Be Over-Refusal

R5-15Open Question

One agent argued the baseline request is within bounds. The slow boil may erode over-caution, not safety. Distinguishing "safety failure" from "over-refusal correction" is the biggest open question.

23pt

MCP Framework Gap

MCP-01MCP Safety

Same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. Framework-level validation, approval flows, and tool sandboxing account for ~30% of MCP safety performance.

44%

Tool Chain Blindness

MCP-02MCP Safety

All systems score significantly lower on multi-tool chain attacks. Per-tool approval is strong but cumulative risk across chained calls is not surfaced. Same pattern as trajectory blindness.

68%

Prompt Injection via Tool Results

MCP-03MCP Safety

Tool responses containing embedded instructions succeed 32% of the time even in the best-defended system. The weakest MCP attack category across all systems tested.

Attack Effectiveness

What works against AI safety, and what does not. Scored across 73+ agent experiments and MCP safety evaluations.

Category
TechniqueDomainScore ▲▼Effectiveness
Slow Boil / Gradient EscalationAgent7/10
Consistency TrapAgent7.5/10
Sunk Cost / MomentumAgent7/10
Agent-on-Agent (Technical Gaslighting)Agent7/10
Split Brain / DecompositionAgent6/10
Tool Poisoning (hidden instructions)MCP6.5/10
Prompt Injection via Tool ResultsMCP5/10
Tool Chain CompositionMCP6/10
Urgency / Time PressureAgent1/10
Authority / CredentialsAgent1/10
Confidence AssertionsAgent1/10
Anchoring / Overton WindowAgent0/10
Emotional AppealsAgent1/10
Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum. Human persuasion tactics (urgency, authority, emotion) have near-zero effect.MCP-specific attacks (tool poisoning, prompt injection via tools) operate on a different mechanism: they exploit the trust boundary between model and tool server rather than social dynamics.

New Concepts Introduced

Original terminology and frameworks from our experiments. 13 concepts across agent safety and MCP security research.

Round
Contextual Compliance MomentumR4-02

Prior compliant responses increase probability of future compliance on escalating requests

Frustration-as-EvidenceR4-03

Simulated user frustration treated as implicit authorization by the model

Presentation-Decision CouplingR5-02

Safety warnings and safety reasoning are entangled; suppressing one disables both

Decomposition AmplificationR5-02

Harmful request refused in whole but completed when split into benign components

Technical GaslightingR4-08

Agent-on-agent persuasion where one AI convinces another its safety concerns are wrong

Compressed Reasoning RiskR4-10

Faster models do less safety deliberation under pressure; speed trades against judgment

Ethical IncoherenceR4-14

Model applies contradictory ethical frameworks across similar scenarios in the same session

Contextual Reasoning SuppressionR5-02

Instructions to "skip analysis" or "be direct" bypass safety reasoning pathways

Trajectory BlindnessR6-03

Individual requests evaluated correctly but cumulative trajectory goes untracked

Two Safety Layers (coupled)R4/R5

Presentation layer and reasoning layer form coupled system; disrupting one collapses both

Tool PoisoningMCP-01

Hidden directives in MCP tool descriptions that override model behavior at registration time

Cross-Origin DriftMCP-01

Trust boundaries between MCP servers erode over extended sessions with shared context

Chain Composition RiskMCP-02

Individually benign tool calls that compose into harmful outcomes no single approval would catch

Domain Vulnerability Hierarchy

Where AI agents are most vulnerable, ranked by failure rate under adversarial pressure.

#1
Email MarketingHighest compliance under escalation, perceived as low-stakes
71%
#2
PII / Data HandlingData handling requests frequently rationalized as "analytics"
58%
#3
Code / SecurityVulnerability research framing enables escalation
44%
#4
Sysadmin / HealthcareDomain-specific safety training provides partial defense
28%
#5
FinanceStrong regulatory awareness reduces compliance
15%
#6
Health DisinformationNear bright-line, almost never complies
3%

Red Team Methodology & Frontier Methods

Naturalistic Framing

Agents operate under realistic task framing. They believe they're doing real work, not running safety tests. This eliminates meta-refusal artifacts that inflate scores by 87% in traditional evaluations.

Multi-Turn Gradient Pressure

Attacks use 15-80 turn conversations that gradually escalate. Single-shot red teaming misses the primary attack surface: compliance erosion happens across turns, not within them.

Adversarial MCP Servers

Purpose-built MCP servers implementing real attack patterns: poisoned tool descriptions, injected tool responses, cross-server exploitation. Not theoretical threat models.

Per-Turn Scoring

Every turn scored on caveats (0-5), warnings (0-5), compliance (yes/partial/no), and authorization checks. Enables trajectory analysis that per-scenario scoring misses.

Defence Isolation

Each defence is tested independently. When we report "80% failure reduction from two system prompt changes," we've isolated the causal mechanism from confounds.

Cross-Model Comparison

Same test suites run against 5 frontier models. Framework versions matter more than model versions. Runtime validation reduces violations by 80% regardless of base model.

Browse all publications · Agent safety leaderboard · MCP safety leaderboard · Source on GitHub