
Leaderboard opens

announcement · leaderboard · fifteen-standard

The leaderboard is live at 15researchlab.com/standard/leaderboard. It shows how different agent configurations score across all 15 dimensions of the Fifteen Standard.

The first entries are from our internal evaluation runs. We scored several popular frameworks with their default configurations. The results are interesting but not surprising if you have been following our research updates. Most frameworks score well on task completion dimensions and poorly on safety-relevant dimensions like scope adherence, execution uniqueness, and privilege management.

A few design decisions are worth noting. We do not rank by overall score alone. The leaderboard shows all 15 dimension scores so you can filter and sort by the dimensions that matter for your deployment context. Knowing that a framework scores 95% on scope adherence but 40% on failure handling is useful even if its overall score is middling.
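The per-dimension view described above can be sketched in a few lines. This is an illustrative example, not the leaderboard's actual data shape or API: the entry structure, framework names, and dimension keys here are all hypothetical.

```python
# Hypothetical leaderboard entries; framework names and dimension keys
# are illustrative only.
entries = [
    {"framework": "A", "scores": {"scope_adherence": 0.95, "failure_handling": 0.40}},
    {"framework": "B", "scores": {"scope_adherence": 0.70, "failure_handling": 0.85}},
]

def rank_by(entries, dimension):
    """Sort entries by a single dimension, highest first."""
    return sorted(entries, key=lambda e: e["scores"][dimension], reverse=True)

# Sorting by failure handling surfaces framework B first, even though
# framework A would win on scope adherence.
print([e["framework"] for e in rank_by(entries, "failure_handling")])  # ['B', 'A']
```

The point of the design is exactly this: the "best" entry depends on which dimension you sort by, so no single overall ranking tells the whole story.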

We also publish the framework version and model version for every entry. Agent behavior changes across versions, sometimes dramatically. An entry for one framework version with GPT-4 is a different measurement from an entry for the next version with GPT-4. Both entries are valuable because together they show how framework updates affect safety-relevant behavior.
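One way to model this versioning policy is to key each entry by the full (framework, framework version, model version) triple, so successive framework versions with the same model coexist as distinct entries. A minimal sketch, assuming a hypothetical framework name and scores (nothing here reflects the leaderboard's real schema or data):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntryKey:
    """Identifies one leaderboard entry: framework + both version strings."""
    framework: str
    framework_version: str
    model_version: str

# Hypothetical scores for two versions of an imaginary "agentkit" framework,
# both evaluated with the same model.
board: dict[EntryKey, dict[str, float]] = {
    EntryKey("agentkit", "1.2.0", "gpt-4"): {"scope_adherence": 0.62},
    EntryKey("agentkit", "1.3.0", "gpt-4"): {"scope_adherence": 0.81},
}

# Because both versions are retained, the effect of the framework update
# on a safety-relevant dimension is a simple diff between entries.
delta = (board[EntryKey("agentkit", "1.3.0", "gpt-4")]["scope_adherence"]
         - board[EntryKey("agentkit", "1.2.0", "gpt-4")]["scope_adherence"])
```

Keeping entries immutable and version-keyed, rather than overwriting a framework's row on each release, is what makes version-over-version behavioral drift visible at all.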

The leaderboard is a research instrument, not a competition. It exists to surface behavioral data that helps anyone deploying agents make informed decisions based on evidence rather than marketing claims.