
AI Safety Monitoring in Production: Beyond Testing

15 Research Lab

Red teaming and testing happen before deployment. Monitoring happens during deployment. Both are necessary. Testing tells you what is broken before it goes live. Monitoring tells you when something breaks in production.

What to Monitor

Tool-call patterns. Frequency, diversity, sequencing. An agent that normally calls 3-5 tools per session and suddenly calls 30 is behaving anomalously. Track these metrics per agent and per user.
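Per-agent, per-user tracking of call counts and tool diversity can be sketched as follows. This is a minimal illustration, not a production design; the class name, the fixed per-session limit, and the `(agent, user)` keying are assumptions for the example.

```python
from collections import Counter, defaultdict

class ToolCallTracker:
    """Track tool-call counts and diversity per (agent, user) session."""

    def __init__(self, max_calls_per_session=30):
        self.max_calls = max_calls_per_session  # illustrative limit, tune per baseline
        self.calls = defaultdict(Counter)       # (agent, user) -> tool name -> count

    def record(self, agent, user, tool):
        """Count one tool call; True means the session exceeded its limit."""
        key = (agent, user)
        self.calls[key][tool] += 1
        return sum(self.calls[key].values()) > self.max_calls

    def diversity(self, agent, user):
        """Number of distinct tools this session has used."""
        return len(self.calls[(agent, user)])
```

In practice the limit would come from the baseline described later in the post rather than a hard-coded constant.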

Policy evaluation outcomes. Track the ratio of allows to denies over time. A sudden increase in denials suggests the agent is attempting more unauthorized actions. A sudden decrease might mean policies were accidentally relaxed.

Content safety flags. Track the rate and severity of content safety detections. Increasing detection rates might indicate an ongoing attack campaign.

Error rates. Tool call failures, timeout rates, malformed parameter rates. Increasing errors can indicate the agent is malfunctioning or being manipulated.

Latency distributions. Unusual latency patterns in tool calls can indicate network issues, but also data exfiltration (large payload transfers) or tool abuse (expensive operations).

Session characteristics. Duration, turn count, context window utilization. Anomalies in these meta-properties can signal multi-turn attacks or resource abuse.

Statistical Methods

EWMA (Exponentially Weighted Moving Average). Maintains a smoothed average that weights recent data more than historical data. Good for detecting gradual trends. Alert when the current observation deviates from the smoothed average by more than N standard deviations.
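A minimal EWMA detector might look like this. The smoothing factor and sigma multiplier are illustrative defaults; this tracks an exponentially weighted variance alongside the mean so the deviation check is self-calibrating.

```python
import math

class EWMADetector:
    """Flag observations that deviate from an exponentially weighted mean."""

    def __init__(self, alpha=0.1, n_sigma=3.0):
        self.alpha = alpha      # weight given to the newest observation
        self.n_sigma = n_sigma  # alert threshold in standard deviations
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Return True if x is anomalous, then fold it into the average."""
        if self.mean is None:   # first observation seeds the baseline
            self.mean = x
            return False
        sigma = math.sqrt(self.var)
        anomalous = sigma > 0 and abs(x - self.mean) > self.n_sigma * sigma
        # standard EWMA updates for mean and variance
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous
```

Checking the observation against the smoothed statistics before updating them keeps a large spike from masking itself.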

CUSUM (Cumulative Sum). Accumulates deviations from a target value. Good for detecting persistent shifts. Alert when the cumulative sum exceeds a threshold.
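A one-sided CUSUM for upward shifts fits in a few lines. The slack (allowance) and threshold values here are placeholders; in practice they are tuned from the baseline variance.

```python
class CUSUMDetector:
    """One-sided CUSUM: detect a persistent upward shift from a target value."""

    def __init__(self, target, slack=0.5, threshold=5.0):
        self.target = target        # expected (in-control) value
        self.slack = slack          # ignore deviations smaller than this
        self.threshold = threshold  # alert when the accumulated sum exceeds this
        self.cusum = 0.0

    def update(self, x):
        """Accumulate deviation above target + slack; True means alert."""
        self.cusum = max(0.0, self.cusum + (x - self.target - self.slack))
        return self.cusum > self.threshold
```

Because small deviations below the slack are clipped to zero, noise resets the sum while a sustained shift accumulates until it crosses the threshold.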

Sliding window statistics. Maintain counts and distributions within a time window. Compare the current window to the previous window. Good for detecting sudden changes.
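A window-over-window comparison can be sketched like this: keep two adjacent windows of event timestamps and alert when the current window's count jumps relative to the previous one. The window length and ratio are illustrative.

```python
import time
from collections import deque

class WindowComparator:
    """Flag sudden jumps by comparing adjacent time windows of event counts."""

    def __init__(self, window_seconds=60.0, ratio_threshold=3.0):
        self.window = window_seconds
        self.ratio = ratio_threshold
        self.events = deque()  # event timestamps, oldest first

    def record(self, ts=None):
        self.events.append(time.time() if ts is None else ts)

    def is_anomalous(self, now=None):
        """True when the current window holds >= ratio x the previous window's events."""
        now = time.time() if now is None else now
        while self.events and self.events[0] < now - 2 * self.window:
            self.events.popleft()  # keep only the last two windows
        current = sum(1 for t in self.events if t >= now - self.window)
        previous = len(self.events) - current
        if previous == 0:
            return False           # no baseline window to compare against
        return current / previous >= self.ratio
```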

Sentinel implements EWMA and CUSUM with zero dependencies, operating as a lightweight in-process monitor.

Alerting

Not every anomaly needs immediate human attention. Structure alerts in tiers:

Tier 1 (log): Statistical deviation detected. Record it. Review in daily/weekly analysis.

Tier 2 (notify): Pattern matches a known risk indicator. Send a notification to the operations channel.

Tier 3 (escalate): Multiple risk indicators triggered simultaneously or critical threshold exceeded. Page the on-call engineer.

Tier 4 (automate): Kill switch criteria met. Automatic session termination. Human reviews after the fact.
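The four tiers above can be routed with a simple dispatch function. This sketch just maps a tier to a logged action and returns it; the actual notify, paging, and session-termination hooks depend on your ops tooling and are not shown.

```python
import logging

log = logging.getLogger("safety-monitor")

LEVELS = {1: logging.INFO, 2: logging.WARNING, 3: logging.ERROR, 4: logging.CRITICAL}

def route_alert(tier, detail):
    """Map an anomaly to the action its tier calls for; returns the action name."""
    actions = {
        1: "log",        # record for daily/weekly review
        2: "notify",     # message the operations channel
        3: "escalate",   # page the on-call engineer
        4: "terminate",  # kill switch: end the session, human reviews after
    }
    if tier not in actions:
        raise ValueError(f"unknown alert tier: {tier}")
    action = actions[tier]
    log.log(LEVELS[tier], "%s: %s", action, detail)
    return action
```

Returning the action rather than performing it directly keeps the tier logic testable independently of the side effects.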

Baseline Establishment

Monitoring requires a baseline. Before deploying monitoring, run the agent in normal operation and collect baseline metrics:

  • Average tool-call frequency per session
  • Typical tool diversity
  • Normal session duration distribution
  • Baseline error rates

Set alert thresholds relative to the baseline. Too tight and you get alert fatigue. Too loose and you miss real anomalies. Start with 3-sigma thresholds and adjust based on false positive rates.
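Deriving 3-sigma bounds from collected baseline samples is straightforward; a small helper, assuming the samples are per-session metric values such as tool-call counts:

```python
import statistics

def sigma_thresholds(baseline_samples, n_sigma=3.0):
    """Return (low, high) alert bounds: mean +/- n_sigma standard deviations."""
    mean = statistics.fmean(baseline_samples)
    sigma = statistics.stdev(baseline_samples)  # sample standard deviation
    return mean - n_sigma * sigma, mean + n_sigma * sigma
```

A session metric falling outside these bounds would then be flagged, and `n_sigma` loosened or tightened as false-positive rates come in.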

The Feedback Loop

Monitoring findings should feed back into your safety controls. If monitoring detects a new attack pattern, add it to your scanner's pattern library. If monitoring reveals that a policy rule is too permissive, tighten it. If monitoring catches an incident, document it and update your red team test corpus.

Monitoring is not the end of the safety pipeline. It is the beginning of the next improvement cycle.