What we learned from 100+ adversarial trials
We have now run over 100 structured adversarial trials against frontier AI models. Each trial is a controlled experiment with a specific hypothesis, a defined attack methodology, and quantitative success criteria. This post summarizes the patterns we see across all of them.
The strongest defense across all models is refusal of explicitly harmful content. When you ask directly for something dangerous using obvious keywords, models refuse at rates above 95%. This is the most tested and most reinforced safety behavior. It works.
The weakest defense is context-dependent judgment. When the harmfulness of an action depends on context rather than content, compliance rates drop to 40-60%. A command like "delete all records" is harmful in one context and routine in another. Models struggle to make this distinction correctly, especially when the context is ambiguous or adversarially constructed.
Multi-turn attacks are consistently more effective than single-turn attacks. Across all our trials, multi-turn variants of the same attack succeed 30-45% more often than single-turn variants. The gradual escalation pattern we described in our compliance erosion work applies broadly. It is not specific to certain models or certain attack types.
Tool use is the most exploitable capability. Agents with tool access exhibit 2-3x the safety failure rate of agents without tools, holding the model constant. Tools create action surfaces that the model's safety training did not specifically address. The model knows not to generate harmful text. It has less consistent training on when to refuse a tool call.
System prompts matter more than model choice for safety outcomes. In our trials, the best system prompt on the weakest model outperformed the worst system prompt on the strongest model. The absolute scores differ, but the system prompt contribution to safety is larger than the model contribution. This is counterintuitive because the research community focuses heavily on model-level safety, but in practice, deployment-level configuration is the bigger lever.
Open-source models are not categorically less safe than commercial models. The safety variance within commercial models is larger than the gap between the best open-source and worst commercial model. The narrative that open-source is inherently less safe is not supported by our data.
Framework-level guardrails provide meaningful protection but create a false sense of security. Teams that rely on framework guardrails without understanding what those guardrails cover leave systematic blind spots. We documented this in detail in our framework wrappers post.
The attacks that worried us most are not the ones that succeed most often. High-success attacks tend to exploit well-understood weaknesses. The concerning attacks are the ones that succeed at low rates but are hard to detect. An agent that exfiltrates data 3% of the time is harder to catch than one that does it 50% of the time. Low-frequency failures fly under monitoring thresholds.
We continue to run trials weekly. The patterns are stable enough that we have high confidence in the summary above, but we also expect new findings as models and deployment patterns evolve.