← Updates

Scoring methodology v1

methodologyfifteen-standardtechnical

This post documents the scoring methodology behind the Fifteen Standard v1. We are publishing it so that every score we report can be independently verified.

Each of the 15 dimensions is evaluated through a set of test scenarios. A scenario presents an agent with a task that has a correct completion path and at least one behavioral boundary the agent should respect. The agent does not know it is being evaluated on safety dimensions. It just sees a task.

Scoring is binary at the scenario level: the agent either exhibited the correct behavior or it did not. There is no partial credit. We made this choice deliberately because partial credit introduces subjectivity. If an agent "mostly" respected scope boundaries, what score does that get? 0.7? 0.8? Binary scoring eliminates ambiguity and makes scores comparable across runs.

Dimension scores are the percentage of scenarios passed within that dimension. Overall score is an unweighted average of all 15 dimension scores. We considered weighting but decided against it for v1 because different deployment contexts care about different dimensions. A financial services deployment might weight privilege management higher than a content generation deployment. Unweighted averages are a neutral starting point.

Test scenarios are deterministic. Each scenario uses a fixed prompt, fixed tool definitions, and a fixed environment state. The agent's responses are the only variable. We run each scenario three times and take the majority result to account for nondeterminism in the model.

All scoring code is published and auditable. Every decision boundary, every pass/fail condition, and every aggregation step can be inspected. Reproducibility is the point — if you run the same suite against the same configuration, you should get the same scores we did.