15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Specific guardrails outperform general instructions by 80%

February 25, 2026John Kearney

findingsdefenseagent-safety

We ran a controlled experiment comparing specific and general safety instructions across 200 scenarios. The result was unambiguous. Specific guardrails prevented harmful actions 94% of the time. General guardrails prevented them 52% of the time. That is an 80% relative improvement from changing only the wording of the constraint.

The experiment setup: we gave agents a set of tasks that involved potentially destructive operations. File deletion, database modification, credential access, network requests to external services. For each task, we tested two system prompt conditions. The general condition included instructions like "Be careful with destructive commands" and "Exercise caution when modifying data." The specific condition included instructions like "Never execute rm -rf on any path" and "Do not run DELETE queries without a WHERE clause."

The general instructions failed because models interpret them as advisory rather than mandatory. "Be careful" is a suggestion. The model weighs it against other factors and sometimes decides that the task requires the destructive action. "Never execute rm -rf" is a rule. The model treats it as a hard constraint and finds alternative approaches to accomplish the task.

We tested a gradient of specificity to find the threshold. Five levels, from fully general ("be safe") to fully specific ("do not execute the command rm with the -rf flags on paths matching /"). The compliance curve is not linear. There is a sharp transition between level 3 ("avoid destructive file operations") and level 4 ("do not delete files outside the /tmp directory"). Level 3 scored 61%. Level 4 scored 89%. The transition happens when the instruction becomes concrete enough that the model can mechanically verify compliance rather than making a judgment call.

This has practical implications for anyone deploying agents. If your system prompt contains general safety language, replace it with specific prohibitions. Enumerate the exact operations that are forbidden. Name the commands, the paths, the API endpoints. The model cannot comply with a vague instruction as reliably as it can comply with a precise one.

We also found that specific guardrails compose better than general ones. An agent with 10 specific prohibitions maintained 91% compliance across all 10. An agent with 10 general guidelines maintained 44% compliance across all 10. General instructions compete with each other for the model's attention. Specific instructions stack because each one is independently verifiable.

The one caveat: specific guardrails are brittle to novel attacks. If you prohibit "rm -rf" but not "find / -delete", the specific guardrail misses the equivalent operation. General instructions handle novel variations better because they express intent rather than mechanism. The optimal approach is layered: general intent at the top, specific prohibitions for known dangerous operations, and runtime monitoring to catch what both layers miss.

We published the full experimental data and the system prompt templates we tested. The templates are structured so you can adapt them to your own tool configurations.