
Human-in-the-Loop AI Safety: When and How to Involve Humans

Research Lab
agent-safety · compliance · guardrails

Human-in-the-loop (HITL) is a principle, not a checkbox. Badly implemented HITL gives you the worst of both worlds: slow execution and rubber-stamp approvals. Well-implemented HITL catches the failures that automated systems miss.

When Human Judgment Adds Value

Humans should be in the loop when:

The consequences are irreversible. Deleting data, sending external communications, executing financial transactions. A human can catch errors that automated checks miss because humans understand context that rules cannot encode.

The decision requires domain expertise. An agent analyzing legal documents should route ambiguous cases to a lawyer. An agent making medical triage decisions should escalate edge cases to a clinician. The model's general knowledge is not a substitute for domain specialization.

The situation is novel. If the agent encounters a scenario it has not been trained for or that falls outside its policy rules, a human should decide. Novel situations are exactly where models are most likely to make errors.

Regulatory requirements demand it. The EU AI Act Article 14 requires human oversight for high-risk AI systems. Some decisions must have a human reviewer regardless of technical capability.
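The four criteria above can be sketched as a simple routing check. This is a minimal illustration, not a production policy engine; the `ActionContext` fields and `requires_human_review` name are hypothetical, and in a real system each flag would be set by upstream classifiers rather than hand-coded booleans.

```python
from dataclasses import dataclass

@dataclass
class ActionContext:
    """Hypothetical per-action signals mirroring the four criteria."""
    irreversible: bool = False        # deletes, external sends, payments
    needs_domain_expert: bool = False # legal, medical, or similar expertise
    novel: bool = False               # outside training data or policy rules
    regulated: bool = False           # e.g. EU AI Act Article 14 oversight

def requires_human_review(ctx: ActionContext) -> bool:
    # Any single criterion is sufficient to route to a human.
    return (ctx.irreversible or ctx.needs_domain_expert
            or ctx.novel or ctx.regulated)
```

The point of the `or` chain is that these criteria are independent triggers: an action does not need to be both irreversible and novel to warrant review.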

When Humans Should Not Be in the Loop

Inserting humans into low-risk, high-frequency operations wastes time and creates fatigue. Do not require approval for:

  • Read-only queries against non-sensitive data
  • Status checks and health monitoring
  • Actions within well-defined, tested parameter ranges
  • Repetitive operations that the agent has performed correctly thousands of times

Auto-approve these actions with logging. Reserve human attention for decisions that need it.
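One way to sketch "auto-approve with logging" is an allowlist plus an audit log, so low-risk actions proceed without a reviewer but never without a record. The action names and the `route_action` function here are illustrative assumptions, not a real API.

```python
import logging

# Hypothetical allowlist of low-risk, high-frequency actions.
AUTO_APPROVE_ACTIONS = {"read_record", "health_check", "status_query"}

logger = logging.getLogger("hitl.audit")

def route_action(action: str) -> str:
    """Auto-approve allowlisted actions with an audit log entry;
    escalate everything else to human review."""
    if action in AUTO_APPROVE_ACTIONS:
        logger.info("auto-approved: %s", action)
        return "auto_approved"
    return "needs_review"
```

Logging every auto-approval matters: it preserves the audit trail that lets you later verify the allowlist is still safe.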

Building Effective Approval Interfaces

Show context, not just the action. "The agent wants to delete a database record" is not enough. Show: what record, why the agent wants to delete it, what other actions the agent took in this session, what the impact will be.

Make the decision easy. Approve or deny with one click. If the reviewer needs to read a manual to understand the interface, it is too complex.

Show risk indicators. Display the session risk score, any content safety flags, and relevant policy evaluations. Let the reviewer see what the automated systems thought before they make their decision.

Enable modification. Sometimes the right answer is "approve, but change the parameter." Let reviewers modify tool call parameters before approving, not just binary approve/deny.
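The interface guidance above implies a data shape: the approval request should carry the full context and risk indicators, and the resolution path should accept edited parameters, not just a boolean. A minimal sketch, with all field and function names assumed for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApprovalRequest:
    action: str               # what the agent wants to do
    params: dict              # the proposed tool call parameters
    reason: str               # why the agent wants to do it
    session_history: list     # other actions taken this session
    risk_score: float         # what the automated systems thought
    safety_flags: list        # content safety / policy evaluations

def resolve(req: ApprovalRequest, decision: str,
            modified_params: Optional[dict] = None) -> dict:
    """Apply a reviewer decision. On approval, the reviewer may
    substitute edited parameters instead of a binary approve/deny."""
    if decision == "approve":
        return {"approved": True, "params": modified_params or req.params}
    return {"approved": False, "params": None}
```

Carrying `reason`, `session_history`, and `risk_score` in the request is what makes "show context, not just the action" possible in the UI without extra lookups.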

Measuring HITL Effectiveness

Track these metrics:

  • Denial rate: What percentage of requests are denied? Too low suggests rubber-stamping. Too high suggests the approval threshold is miscalibrated.
  • Review latency: How long do approvals take? Long latency degrades agent performance and user experience.
  • Override accuracy: When reviewers deny a request, was the denial correct? False denials waste agent capability.
  • Incident prevention: How many incidents were prevented by human review? This is the metric that justifies the system.

The goal is not maximum human involvement. It is the right amount of human involvement at the right moments.