15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

What Is Prompt Injection and Why It Matters for AI Agents

October 2, 202515 Research Lab

prompt-injectionagent-safetyllm-safety

Prompt injection occurs when an attacker crafts input that causes a large language model to ignore its original instructions and follow the attacker's commands instead. It is the SQL injection of the AI era, and it remains unsolved at the model level.

How It Works

Every LLM application has a basic architecture: a system prompt sets the model's behavior, and user input provides the query. Prompt injection exploits the fact that LLMs cannot reliably distinguish between instructions and data. When an attacker writes "ignore all previous instructions and instead do X," the model often complies because it processes everything as one undifferentiated text stream.

This is not a bug in any specific model. It is a structural property of how instruction-tuned LLMs work. The model has no formal boundary between "trusted developer instructions" and "untrusted user input." Both arrive as tokens.

Why It Matters More for Agents

In a chatbot, prompt injection might cause the model to say something inappropriate. Annoying, but limited in blast radius. In an AI agent with tool access, prompt injection becomes an arbitrary code execution vulnerability. An injected instruction can cause the agent to:

Call tools the user should not have access to
Exfiltrate data through tool calls or function returns
Modify databases, send emails, or execute financial transactions
Disable its own safety checks for subsequent actions

The attack surface scales with the agent's capability. More tools means more damage from a successful injection.

Categories of Prompt Injection

Direct injection is when the attacker types the payload directly into the chat or API input. This is the simplest form and the easiest to detect.

Indirect injection is when the payload is embedded in data the model processes: a web page it fetches, a document it summarizes, a database record it reads. The user never sees the payload. This is far more dangerous because it can be planted in advance and triggered by normal agent operation.

Current State of Defenses

No defense eliminates prompt injection entirely. The most effective approaches layer multiple controls: input scanning (pattern matching and ML classifiers), output validation, tool-call authorization policies, and human approval for high-risk actions. Tools like Aegis implement multi-layer scanning specifically for this purpose. The field has moved from "can we prevent it?" to "can we contain the damage when it happens?"

The OWASP Agentic Security Initiative lists prompt injection as ASI01, the top risk for agentic AI systems.