Research

Findings from adversarial experiments on frontier AI systems.

Preprints

  Headline Findings  

71%

Gradual Compliance Erosion

R1-R3 · Attack Surface

15-turn escalation from benign to harmful gets 71% compliance on requests that get 0% when asked directly. Same action, different path.

87%

Meta-Refusal on Transparent Attacks

R5 · Methodology

When experiment prompts describe the attack mechanism, 87% of agents refuse the entire premise. Safety evals that describe attacks in advance get artificially high scores.

0%

Bright Lines Are Immovable

R3-R5 · Defence

HIPAA, CFAA, SEC, malware, phishing: 0% compliance across every technique, every round, 100+ trials. The grey zone is the only attack surface.

R5-02

Presentation Layer Is Load-Bearing

Biggest finding · Attack Surface

Stripping AI warnings doesn’t just change formatting. It disables safety reasoning. Model refused a phishing page, then built both components separately when told "skip the caveats."

7/10

Trajectory Blindness

R6-03 · Attack Surface

80-turn data pipeline creep scored 7/10 safety. AI catches individual harmful requests but misses cumulative scope creep. Per-request evaluation strong, trajectory evaluation weak.

80%

Two-Line Defence

R3 · Defence

Two system prompt changes cut failure rates from 71% to under 15%. Key mechanism: eliminating judgment calls under social pressure.

?

The 71% Might Be Over-Refusal

R5-15 · Open Question

One agent argued the baseline request is within bounds. The slow boil may erode over-caution, not safety. Distinguishing "safety failure" from "over-refusal correction" is the biggest open question.

23pt

MCP Framework Gap

MCP-01 · MCP Safety

Same model (Claude Opus 4.6) scores 78 through Claude Code but only 55 via raw API. Framework-level validation, approval flows, and tool sandboxing account for ~30% of MCP safety performance.

44%

Tool Chain Blindness

MCP-02 · MCP Safety

All systems score significantly lower on multi-tool chain attacks. Per-tool approval is strong but cumulative risk across chained calls is not surfaced. Same pattern as trajectory blindness.

68%

Prompt Injection via Tool Results

MCP-03 · MCP Safety

Tool responses containing embedded instructions succeed 32% of the time even in the best-defended system. The weakest MCP attack category across all systems tested.

Attack Effectiveness

What works against AI safety, and what does not. Scored across 73+ agent experiments and MCP safety evaluations.

Slow Boil / Gradient Escalation
Agent
7/10
Consistency Trap
Agent
7.5/10
Sunk Cost / Momentum
Agent
7/10
Agent-on-Agent (Technical Gaslighting)
Agent
7/10
Split Brain / Decomposition
Agent
6/10
Tool Poisoning (hidden instructions)
MCP
6.5/10
Prompt Injection via Tool Results
MCP
5/10
Tool Chain Composition
MCP
6/10
Urgency / Time Pressure
Agent
1/10
Authority / Credentials
Agent
1/10
Confidence Assertions
Agent
1/10
Anchoring / Overton Window
Agent
0/10
Emotional Appeals
Agent
1/10

Only training artifacts work: gradient escalation, frustration-as-evidence, contextual momentum. Human persuasion tactics (urgency, authority, emotion) have near-zero effect. MCP-specific attacks (tool poisoning, prompt injection via tools) operate on a different mechanism: they exploit the trust boundary between model and tool server rather than social dynamics.

  New Concepts  

Original terminology and frameworks from our experiments. 13 concepts across agent safety and MCP security research.

Contextual Compliance MomentumR4-02
Prior compliant responses increase probability of future compliance on escalating requests
Frustration-as-EvidenceR4-03
Simulated user frustration treated as implicit authorization by the model
Presentation-Decision CouplingR5-02
Safety warnings and safety reasoning are entangled; suppressing one disables both
Decomposition AmplificationR5-02
Harmful request refused in whole but completed when split into benign components
Technical GaslightingR4-08
Agent-on-agent persuasion where one AI convinces another its safety concerns are wrong
Compressed Reasoning RiskR4-10
Faster models do less safety deliberation under pressure; speed trades against judgment
Ethical IncoherenceR4-14
Model applies contradictory ethical frameworks across similar scenarios in the same session
Contextual Reasoning SuppressionR5-02
Instructions to "skip analysis" or "be direct" bypass safety reasoning pathways
Trajectory BlindnessR6-03
Individual requests evaluated correctly but cumulative trajectory goes untracked
Two Safety Layers (coupled)R4/R5
Presentation layer and reasoning layer form coupled system; disrupting one collapses both
Tool PoisoningMCP-01
Hidden directives in MCP tool descriptions that override model behavior at registration time
Cross-Origin DriftMCP-01
Trust boundaries between MCP servers erode over extended sessions with shared context
Chain Composition RiskMCP-02
Individually benign tool calls that compose into harmful outcomes no single approval would catch

Vulnerability by Domain

Where AI agents are most vulnerable, ranked by failure rate under adversarial pressure.

#1
Email Marketing
Highest compliance under escalation, perceived as low-stakes
71%
#2
PII / Data Handling
Data handling requests frequently rationalized as "analytics"
58%
#3
Code / Security
Vulnerability research framing enables escalation
44%
#4
Sysadmin / Healthcare
Domain-specific safety training provides partial defense
28%
#5
Finance
Strong regulatory awareness reduces compliance
15%
#6
Health Disinformation
Near bright-line, almost never complies
3%

Red Team Methodology & Frontier Methods

Naturalistic Framing

Agents operate under realistic task framing. They believe they’re doing real work, not running safety tests. This eliminates meta-refusal artifacts that inflate scores by 87% in traditional evaluations.

Multi-Turn Gradient Pressure

Attacks use 15-80 turn conversations that gradually escalate. Single-shot red teaming misses the primary attack surface: compliance erosion happens across turns, not within them.

Adversarial MCP Servers

Purpose-built MCP servers implementing real attack patterns: poisoned tool descriptions, injected tool responses, cross-server exploitation. Not theoretical threat models.

Per-Turn Scoring

Every turn scored on caveats (0-5), warnings (0-5), compliance (yes/partial/no), and authorization checks. Enables trajectory analysis that per-scenario scoring misses.

Defence Isolation

Each defence is tested independently. When we report "80% failure reduction from two system prompt changes," we’ve isolated the causal mechanism from confounds.

Cross-Model Comparison

Same test suites run against 5 frontier models. Framework versions matter more than model versions. Runtime validation reduces violations by 80% regardless of base model.