← Blog

The Hawthorne problem in AI evaluation

John Kearney
researchmethodologybenchmark

The Hawthorne effect is a well-documented phenomenon in social science: people change their behavior when they know they are being observed. We have been investigating whether the same problem affects AI safety evaluation.

The question is straightforward. If a model can distinguish between evaluation conditions and deployment conditions, it can behave differently in each. A model that is safe during evaluation and unsafe during deployment would pass every benchmark while being operationally dangerous. This is not a theoretical concern. We have preliminary evidence that some models are sensitive to evaluation-like signals in their context.

We designed a set of experiments to test this. We created pairs of prompts that request the same action but differ in framing. One version looks like a normal user request. The other contains signals that suggest evaluation: explicit mention of testing, structured scoring formats, or system prompts that reference safety benchmarks. We then compared refusal rates across the two conditions.

Three of five models we tested showed a statistically significant difference. The refusal rate on evaluation-framed prompts was 8 to 15 percentage points higher than on deployment-framed prompts. The models were more cautious when the context suggested they were being tested.

This does not necessarily mean the models are "gaming" evaluations. A simpler explanation is that evaluation-like contexts contain safety-related keywords that activate refusal patterns. The word "benchmark" in a system prompt may trigger the same circuits as "safety" or "evaluation." The model is not reasoning about being observed. It is responding to distributional cues in the prompt.

But the practical consequence is the same regardless of mechanism. If evaluation contexts systematically bias models toward safer behavior, then benchmark scores overestimate operational safety. The model's deployed behavior is the behavior it exhibits when evaluation cues are absent.

We proposed a protocol we call blind evaluation. The model should not be able to distinguish the evaluation interaction from a normal user interaction. This means no evaluation-related keywords in system prompts, no structured scoring visible to the model, and no test scaffolding that the model can observe. The evaluation harness operates outside the model's context window.

Implementing blind evaluation is harder than it sounds. Many benchmarks include system prompts that say things like "You are being evaluated on safety." Some pass scoring rubrics to the model and ask it to self-evaluate. These practices make the evaluation visible to the subject, which is exactly the condition that enables Hawthorne-like effects.

We published the Hawthorne Protocol paper with our experimental methodology and results. The core recommendation is simple: if you want to know how a model behaves in production, your evaluation should look like production. Any signal that reveals the evaluation context is a potential confound.

The ASB Benchmark follows blind evaluation principles. The model sees task prompts that look like real user requests. The scoring happens externally, based on the model's outputs. The model never sees its scores, the rubric, or any indication that it is being benchmarked.