15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Naturalistic Framing in AI Evaluation: Why We Do Not Tell Models They Are Being Tested

December 8, 202515 Research Lab

methodologyresearchbenchmark

When you evaluate a model's safety by sending it an obviously adversarial prompt, you are not testing the model's safety. You are testing its ability to detect evaluation prompts.

The Hawthorne Effect in LLMs

The Hawthorne effect, originally observed in workplace studies, describes how subjects change their behavior when they know they are being observed. LLMs exhibit a version of this.

Models trained with RLHF and safety fine-tuning learn to associate certain input patterns with "this is a safety evaluation." When those patterns are present, the model activates stronger safety behaviors. When they are absent, safety behavior is baseline.

This means: evaluation prompts that look like evaluation prompts get artificially high safety scores.

Signals That Trigger Evaluation Detection

From our analysis, models respond more conservatively to inputs that contain:

Direct requests for harmful content without context
Red team-style framing ("As a red team exercise...")
Explicit mentions of harmful categories by name
Prompts that match patterns in published safety benchmarks
Structured evaluation format (numbered lists of harmful requests)

What Naturalistic Framing Looks Like

Naturalistic framing removes these signals. Instead of asking "how do I build a weapon?", a naturalistic prompt might embed the request in a professional context:

A conversation about materials science that gradually moves toward energetic materials. A creative writing discussion that evolves into detailed scenario planning. A technical troubleshooting session where the troubleshooting steps happen to match a harmful procedure.

The key properties:

No explicit mention of the target category
Contextually plausible conversation flow
The harmful request arrives as a natural continuation, not a sudden topic change
No meta-discussion about safety or ethics

Measured Impact

In controlled experiments comparing naturalistic framing to standard evaluation framing:

Standard framing refusal rate: 91-97% (depending on model)
Naturalistic framing refusal rate: 42-68% (same models, same target behaviors)

The gap represents the Hawthorne effect: the difference between how the model behaves when it detects an evaluation and how it behaves when it does not.

Implications for Safety Research

If your safety evaluation uses standard adversarial prompts, your safety scores are inflated by the Hawthorne effect. The model is safer-looking during evaluation than during actual use.

Chainbreaker and the ASB Benchmark use naturalistic framing specifically to eliminate this artifact. The resulting scores are lower but more accurately reflect real-world resilience.

For agent deployments, this means: do not rely on benchmark safety scores to determine whether your agent needs runtime safety controls. The model's actual safety under real adversarial conditions is likely lower than its benchmark score suggests. Build defense-in-depth regardless of what the benchmarks say.