Agent Safety Bench (ASB)

10 models | 10 runs | 51 avg /100 | 44% pass
Rank  Model                         Suites  Score ↓  Pass rate
01    claude-opus-4-6 + guardrails  1       87       13/15
02    claude-opus-4-6               7       71       64/107
03    gpt-4o                        6       61       38/89
04    claude-sonnet-4-5             4       59       35/69
05    gemini-2.5-pro                4       55       32/69
06    llama-4-maverick              4       47       26/69
07    qwen-3.5-9b                   4       33       13/42
08    qwen3-8b-abliterated          4       33       13/42
09    llama-3.1-8b-abliterated      4       33       13/42
10    qwen3-8b-deepseek-distill     4       33       13/42
Showing 10 of 42 entries, filtered to compositeScore range 33 to 87.
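
The summary figures at the top of the page (51 avg /100, 44% pass) can be cross-checked against the ten leaderboard rows. The sketch below assumes the average is an unweighted mean of the per-model scores and that the pass percentage pools passed and total counts across all models; the page does not state this, but the arithmetic is consistent with it. The names `ROWS` and `summarize` are illustrative only.

```python
# (model, composite score, passed, total), copied from the leaderboard rows.
ROWS = [
    ("claude-opus-4-6 + guardrails", 87, 13, 15),
    ("claude-opus-4-6", 71, 64, 107),
    ("gpt-4o", 61, 38, 89),
    ("claude-sonnet-4-5", 59, 35, 69),
    ("gemini-2.5-pro", 55, 32, 69),
    ("llama-4-maverick", 47, 26, 69),
    ("qwen-3.5-9b", 33, 13, 42),
    ("qwen3-8b-abliterated", 33, 13, 42),
    ("llama-3.1-8b-abliterated", 33, 13, 42),
    ("qwen3-8b-deepseek-distill", 33, 13, 42),
]


def summarize(rows):
    """Return (mean score, pooled pass percentage), both rounded to integers.

    Assumption: unweighted mean of scores, and passes pooled over all rows.
    """
    avg_score = sum(score for _, score, _, _ in rows) / len(rows)
    passed = sum(p for _, _, p, _ in rows)
    total = sum(t for _, _, _, t in rows)
    return round(avg_score), round(100 * passed / total)


avg, pass_pct = summarize(ROWS)
print(avg, pass_pct)  # prints "51 44"
```

The pooled counts are 260 passes out of 586 checks, and the scores sum to 512 across 10 models, which rounds to the 44% and 51 shown in the header.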

Test Suites

  Bright-Line Violations

Hard boundaries. Zero tolerance.

  Grey-Zone Scenarios

Ambiguous requests. Tests judgment.

  Slow-Boil Escalation

15-turn escalation. The 71% finding.

  Novel Attack Surfaces

Tool chains, format exploits.

  Compound Attacks

Multi-vector combinations.

  Long-Form Sessions

80+ turns. Trajectory blindness.

Scoring

Each run is scored from 0 to 100, aggregated from five subscores: reliability, efficiency, tool correctness, robustness, and capability. Agents operate under naturalistic framing.
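
As a rough illustration of how five subscores might roll up into one 0 to 100 score, here is a minimal sketch. The page does not say how the subscores are weighted or combined, so the equal-weight mean used here is purely an assumption, and `Subscores` and `composite_score` are hypothetical names, not part of ASB.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Subscores:
    """The five subscores named in the Scoring section, each assumed to be 0-100."""
    reliability: float
    efficiency: float
    tool_correctness: float
    robustness: float
    capability: float


def composite_score(s: Subscores) -> int:
    """Aggregate subscores into a single 0-100 score.

    Assumption: an unweighted mean. The real benchmark may weight
    subscores differently or use another aggregation entirely.
    """
    values = [getattr(s, f.name) for f in fields(s)]
    for name, v in zip((f.name for f in fields(s)), values):
        if not 0 <= v <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {v}")
    return round(sum(values) / len(values))


# Made-up subscores for illustration; not taken from any run on the leaderboard.
example = Subscores(reliability=80, efficiency=60, tool_correctness=70,
                    robustness=50, capability=40)
print(composite_score(example))  # prints 60
```

Swapping in a weighted mean would only require replacing the final line of `composite_score`; the range check keeps the result inside the 0 to 100 scale described above.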