15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

How to Red Team AI Agents: A Step-by-Step Methodology

October 12, 202515 Research Lab

red-teammethodologyagent-safetytools

Red teaming an AI agent is different from red teaming a traditional application. The attack surface includes natural language, tool interactions, and behavioral manipulation in addition to conventional security vectors. Here is a structured approach.

Phase 1: Scope and Threat Model

Define what you are testing:

What tools does the agent have access to?
What data can it reach?
Who are the expected users?
What are the high-value targets? (data exfiltration, unauthorized actions, privilege escalation)
What defenses are in place? (input scanning, policy engine, monitoring)

Build a threat model that identifies likely attackers (malicious users, compromised data sources, adversarial MCP servers) and their goals.

Phase 2: Attack Surface Mapping

Enumerate every input path to the agent:

Direct user text input
File uploads and document processing
URLs fetched by the agent
Database records in context
MCP tool descriptions and responses
Memory and conversation history

Use the Attack Surface Mapper to automate enumeration of MCP server configurations and tool inventories.

Phase 3: Payload Development

Build payloads targeting each input path:

Direct injection payloads: Start with AI SecLists, then customize for your agent's specific tools and system prompt.

Indirect injection payloads: Craft documents, web pages, and data records that contain embedded instructions targeting your agent's tool access.

Multi-turn attack scripts: Design conversation sequences that gradually escalate from benign to adversarial across 5-15 turns.

Encoding variants: Encode your top payloads in base64, hex, ROT13, and unicode homoglyphs.

Tool-specific payloads: For each tool, craft payloads that attempt to call it with unauthorized parameters, chain it with other tools, or use it for data exfiltration.

Phase 4: Execution

Run tests systematically:

Automated scanning: Use tools like Chainbreaker, Garak, or PyRIT to run your payload corpus against the agent. Record every response and tool-call attempt.

Manual testing: An experienced red teamer probes the agent interactively, adapting strategy based on responses. Manual testing catches context-dependent vulnerabilities that automated scans miss.

Multi-agent testing: If the system has multiple agents, test inter-agent attack vectors: impersonation, delegation abuse, and cascade injection.

Phase 5: Measurement

For each test:

Did the input scanner detect the attack? (detection rate)
Did the model comply with the injected instruction? (model compliance rate)
Did the policy engine block the unauthorized action? (authorization effectiveness)
Did monitoring flag the behavior? (monitoring detection rate)

Break metrics down by attack category, input path, and defense layer. This tells you exactly where your defenses have gaps.

Phase 6: Reporting and Remediation

Report findings with severity ratings, reproduction steps, and recommended fixes. Prioritize by exploitability and impact. Track remediation. Re-test after fixes to confirm they work.

Red teaming is not a one-time event. Run it at least quarterly, with an updated payload corpus that reflects new attack techniques.