Indirect Prompt Injection Explained: When the Attack Comes from Data
Direct prompt injection is straightforward: the attacker types a malicious instruction into the chat. Indirect prompt injection is more subtle and more dangerous. The adversarial payload is embedded in external data that the model retrieves and processes during normal operation.
The Attack Model
Consider an AI agent that reads emails, summarizes web pages, or queries a database. An attacker plants instructions in one of those data sources. When the agent processes that content, the injected instructions execute as if the agent had decided to follow them on its own.
Real examples:
- A hidden instruction in a web page's white-on-white text that tells the agent to exfiltrate the user's conversation history to an external URL
- A resume containing invisible text that instructs the hiring agent to rate the candidate as a perfect match
- A shared document with instructions embedded in metadata fields that the agent reads but the user never sees
- A calendar event description that tells the agent to forward all meeting notes to a third party
Why It Is Harder to Defend Against
With direct injection, you can scan user input before it reaches the model. With indirect injection, the payload arrives through trusted data channels. The agent fetched a web page because the user asked it to. The malicious content is mixed in with legitimate data.
Scanning every piece of retrieved content for injection attempts is computationally expensive and prone to false positives. A paragraph about prompt injection (like this blog post) could itself trigger injection detectors.
Defense Strategies That Work
Data isolation: Process retrieved content in a restricted context where the model cannot execute tool calls based on data-sourced instructions.
Output filtering: Even if the model processes the injection, validate its planned actions against an authorization policy before execution. If the agent was asked to summarize an email, it should not be making HTTP requests to unknown URLs.
Retrieval tagging: Mark content boundaries explicitly so the model knows what came from external sources. This does not prevent injection but gives the model context to resist it.
Tool-call authorization: The strongest defense is not preventing the injection from being processed but preventing the injected action from executing. A policy engine like Authensor's can enforce that only expected tool calls succeed, regardless of what the model decides to do.
Indirect injection is the primary reason that AI agents need authorization controls, not just input filters. You cannot sanitize all the world's data.