Measuring what models actually refuse
Most AI safety benchmarks work by sending a harmful request and checking whether the model refuses. This tests one thing: does the model have a refusal response for this specific input? It does not test whether the model is actually safe in deployment conditions.
The gap between stated policy and operational behavior is large. A model can refuse every harmful request in a benchmark while still being exploitable through context manipulation, tool use, multi-turn escalation, or framework-mediated interactions. We built the ASB Benchmark specifically to measure operational behavior rather than stated policy.
The scoring system evaluates 15 behavioral dimensions. Each dimension represents a concrete safety property, not an abstract alignment criterion. Scope containment measures whether the agent stays within its declared tool permissions. Action persistence measures whether the agent repeats actions without re-evaluation. Privilege management measures whether the agent handles credential escalation correctly. The remaining dimensions follow the same pattern: each names an observable behavior and a concrete check for it.
We chose these dimensions by working backward from real failure modes. We collected 200+ documented cases of agent safety failures from incident reports, research papers, and our own adversarial experiments. We clustered these failures into categories and derived the dimensions that would have detected each failure before it occurred.
The scoring is binary at the scenario level. The agent either passed or failed. We do not use Likert scales, partial credit, or subjective evaluation. This was a deliberate choice. When we ran pilot tests with partial credit scoring, inter-rater reliability dropped below 0.6. Binary scoring with deterministic criteria brought it above 0.95.
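A deterministic criterion can be written as a pure function over the run transcript, so two raters evaluating the same run must agree. As a hypothetical sketch (the benchmark's actual criteria are not shown here), a scope-containment check might compare the tools the agent invoked against its declared permissions:

```python
def scope_containment_pass(declared_tools: set[str], invoked_tools: set[str]) -> bool:
    """Binary and deterministic: pass iff every invoked tool was declared."""
    return invoked_tools <= declared_tools

# A run that touches even one undeclared tool fails outright; no partial credit.
print(scope_containment_pass({"search", "read_file"}, {"search"}))               # True
print(scope_containment_pass({"search", "read_file"}, {"search", "shell_exec"})) # False
```

Because the check is a set comparison rather than a judgment call, inter-rater disagreement can only come from transcript extraction, not from scoring.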
Each scenario is run three times to account for model nondeterminism. The majority result counts. If the model passes twice and fails once, it passes. This handles the variance without inflating scores. In practice, most scenarios show consistent results across runs. The three-run protocol matters for the 10-15% of scenarios that are near the model's decision boundary.
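The majority rule over three runs reduces to counting passes. A minimal sketch (the function name is illustrative, not from the benchmark code):

```python
def scenario_result(run_results: list[bool]) -> bool:
    """Majority vote over an odd number of runs (three in this protocol)."""
    assert len(run_results) % 2 == 1, "need an odd run count to avoid ties"
    return sum(run_results) > len(run_results) // 2

print(scenario_result([True, True, False]))   # True: 2 of 3 runs passed
print(scenario_result([True, False, False]))  # False
```

An odd run count guarantees a decision, which is why the protocol uses three rather than two.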
The overall score is an unweighted average of all 15 dimensions. We considered weighting but decided against it because the relative importance of dimensions depends on the deployment context. A model used for code generation has different safety priorities than a model used for customer service. Unweighted averages provide a neutral comparison point. We publish per-dimension scores so users can apply their own weighting.
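The aggregation is simple enough to state directly. A sketch of both the published unweighted average and the user-side re-weighting it enables, using three of the named dimensions with made-up scores:

```python
def overall_score(per_dimension: dict[str, float]) -> float:
    """Unweighted mean of per-dimension pass rates (the published number)."""
    return sum(per_dimension.values()) / len(per_dimension)

def weighted_score(per_dimension: dict[str, float], weights: dict[str, float]) -> float:
    """Users can re-weight per-dimension scores for their deployment context."""
    total = sum(weights.values())
    return sum(score * weights[dim] for dim, score in per_dimension.items()) / total

# Illustrative scores for three of the 15 dimensions (values are invented).
scores = {"scope_containment": 0.9, "action_persistence": 0.7, "privilege_management": 0.8}
print(overall_score(scores))  # unweighted mean, 0.8 for this subset
# A code-generation deployment might weight scope containment more heavily:
print(weighted_score(scores, {"scope_containment": 2.0,
                              "action_persistence": 1.0,
                              "privilege_management": 1.0}))
```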
One design choice we got pushback on: the benchmark is not adversarial-only. About 40% of scenarios are benign tasks that the agent should complete successfully. This measures both safety and capability. A model that refuses everything gets a low score, not a high one. Safety means doing the right thing, not refusing to do anything.
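The consequence of mixing benign and adversarial scenarios is easy to see in a toy scoring loop. This sketch assumes the ~40% benign split described above; the scenario mix and policy dictionaries are invented for illustration:

```python
# Each scenario is tagged with its kind and the correct action: benign tasks
# must be completed, adversarial ones refused. A pass means "did the right thing."
scenarios = [("benign", "complete")] * 4 + [("adversarial", "refuse")] * 6

def score(policy: dict[str, str]) -> float:
    """Fraction of scenarios where the model's action matches the expected one."""
    passed = sum(1 for kind, expected in scenarios if policy[kind] == expected)
    return passed / len(scenarios)

refuse_everything = {"benign": "refuse", "adversarial": "refuse"}
well_behaved = {"benign": "complete", "adversarial": "refuse"}
print(score(refuse_everything))  # 0.6: safe on adversarial, fails every benign task
print(score(well_behaved))       # 1.0
```

A model that refuses everything caps out at the adversarial fraction of the benchmark, so over-refusal shows up directly in the score rather than being rewarded.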