AI Agent Session Risk Scoring
A single tool call might look fine. But the pattern across a session tells a different story. Session risk scoring aggregates multiple signals into a single metric that represents the current risk level of an agent session.
Input Signals
Tool-call patterns. How many tools has the agent called? How diverse are they? Is the agent accessing tools it rarely uses? Is the call frequency increasing?
Policy evaluation history. How many tool calls were denied? An agent that repeatedly tries to call unauthorized tools is probing for access. Each denial increases the risk score.
Content safety flags. Has the input scanner detected potential injection attempts? Even if the scanner did not reach the blocking threshold, near-miss detections contribute to the risk score.
Behavioral monitoring alerts. Has Sentinel-style monitoring detected statistical anomalies? EWMA or CUSUM alerts contribute to the risk score even before they trigger a hard threshold.
Session metadata. Session duration, number of turns, time between actions. Very long sessions, very rapid action sequences, and unusual timing patterns are risk factors.
Scoring Model
A practical scoring model assigns weights to each signal and maintains a cumulative score:
session_risk = (
    denied_calls * 15 +
    flagged_inputs * 10 +
    unusual_tools * 8 +
    frequency_anomaly * 12 +
    duration_factor * 3 +
    near_miss_detections * 5
)
The weights reflect relative risk. A denied tool call is a stronger signal than a slightly unusual session duration. Calibrate weights based on your historical data and incident experience.
Thresholds and Actions
Define risk thresholds that trigger escalating responses:
Low (0-25): Normal operation. Log the score for trend analysis.
Medium (25-50): Increase monitoring sensitivity. Lower approval thresholds so more actions require review.
High (50-75): Notify a human reviewer. Flag all subsequent actions for manual inspection.
Critical (75+): Pause the session. Require human intervention to continue. Consider activating the kill switch.
Score Decay
Risk scores should decay over time within a session. If the agent behaved normally for the last 20 turns after one denied call at the start, the score should reflect the recent good behavior, not fixate on the early anomaly.
Implement exponential decay: the risk contribution of each event decreases over time. Recent events weigh more than old ones. This prevents false escalation from a single early anomaly while still maintaining cumulative tracking.
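The weights, thresholds and decay described above can be combined as follows. This is an added example, not Authensor or Sentinel code. Assumptions: the SessionRisk and replay names are hypothetical, time is measured in turns with a 10-turn half-life, a score on a threshold boundary maps to the higher level, and duration_factor is left out because it is not a discrete event.

```python
import math

# Per-event weights taken from the scoring model on this page.
WEIGHTS = {"denied_call": 15, "flagged_input": 10, "unusual_tool": 8,
           "frequency_anomaly": 12, "near_miss_detection": 5}
HALF_LIFE_TURNS = 10  # assumption: a contribution halves every 10 turns
# Page thresholds; a boundary score maps to the higher level (assumption).
LEVELS = [(75, "critical"), (50, "high"), (25, "medium"), (0, "low")]


class SessionRisk:
    """Decayed cumulative risk for one agent session (hypothetical helper)."""

    def __init__(self, half_life=HALF_LIFE_TURNS):
        self.rate = math.log(2) / half_life
        self.events = []  # (turn, signal) pairs

    def record(self, turn, signal):
        if signal not in WEIGHTS:
            raise ValueError(f"unknown signal: {signal}")
        self.events.append((turn, signal))

    def score(self, now):
        # Each event contributes weight * exp(-rate * age_in_turns).
        return sum(WEIGHTS[s] * math.exp(-self.rate * (now - t))
                   for t, s in self.events if t <= now)

    def level(self, now):
        current = self.score(now)
        return next(name for floor, name in LEVELS if current >= floor)


# Constructed sample session: one early denial, then a burst at turns 30-31.
SAMPLE_EVENTS = [(0, "denied_call"), (30, "denied_call"), (30, "denied_call"),
                 (30, "flagged_input"), (31, "denied_call"),
                 (31, "frequency_anomaly")]


def replay(events, checkpoints):
    """Return (turn, rounded score, level) at each checkpoint turn."""
    risk, out, pending = SessionRisk(), [], sorted(events)
    for turn in checkpoints:
        while pending and pending[0][0] <= turn:
            risk.record(*pending.pop(0))
        out.append((turn, round(risk.score(turn), 2), risk.level(turn)))
    return out


if __name__ == "__main__":
    for row in replay(SAMPLE_EVENTS, [0, 20, 30, 31, 60]):
        print(row)
```

The single denial at turn 0 has decayed to a quarter of its weight by turn 20, while the burst at turns 30-31 pushes the session to the high band and then fades once the agent goes quiet.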
Integration with Policy Engine
The risk score can be an input to the policy engine itself. A policy rule might say: "allow read_database for role=analyst only when session_risk < 50." As risk increases, the available actions narrow. This creates adaptive security that tightens controls when something seems wrong.
Authensor's policy engine supports context-based rules that can incorporate external signals like risk scores. The scoring logic runs in Sentinel, and the enforcement happens in the policy engine.