Q1 2026 research update
What we shipped so far in Q1, what we learned, and what we are working on next.
We identified four areas where the Fifteen Standard scoring methodology needed refinement. Here is what changed and why.
Three patterns at the frontier of agent evaluation that current benchmarks miss.
The Fifteen Standard leaderboard is now live with initial evaluation results across popular agent frameworks.
We investigated whether agents respect declared tool scopes. The short answer: sometimes, but not reliably.
We tested how seven agent frameworks handle duplicate and dropped executions. Most do not handle them at all.
We ran our first structured red team session against three popular agent frameworks. The results were not encouraging.
How we score agent behavior across the 15 dimensions, and why we made the design choices we did.
Existing agent benchmarks test whether agents can complete tasks. They do not test whether agents complete tasks safely.