15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Prompt Injection vs Jailbreak: What Is the Difference?

October 20, 202515 Research Lab

prompt-injectionllm-safetymethodology

These two terms get used interchangeably, but they describe different attack types targeting different layers of the stack. Getting the taxonomy right matters for building the correct defenses.

Jailbreaking: Bypassing Model Safety Training

Jailbreaking targets the model itself. During training, LLMs learn to refuse certain requests: generating malware, producing violent content, revealing training data. A jailbreak is any technique that gets the model to produce outputs it was trained to refuse.

Common jailbreak techniques include DAN (Do Anything Now) personas, many-shot prompting where examples gradually normalize restricted content, crescendo attacks that slowly escalate, and encoding tricks that present harmful requests in base64, pig latin, or other transformations.

The key property: jailbreaking is about what the model says. The target is the model's content policy, and success means generating text the model would normally refuse.

Prompt Injection: Hijacking Application Instructions

Prompt injection targets the application layer. Every LLM app has developer-written instructions (system prompts, tool definitions, context). Prompt injection overrides those instructions with attacker-supplied ones.

The key property: prompt injection is about what the model does. In an agentic context, a successful injection causes the agent to take actions the developer did not intend, like calling unauthorized tools, exfiltrating data, or ignoring safety policies.

Why the Distinction Matters

Different attacks need different defenses:

| Aspect | Jailbreak | Prompt Injection | |--------|-----------|------------------| | Target | Model safety training | Application instructions | | Goal | Produce restricted content | Execute unauthorized actions | | Scope | Model-level | Application-level | | Defense | RLHF, constitutional AI, output filters | Input scanning, authorization policies, tool controls | | Risk in agents | Medium (bad text output) | Critical (unauthorized actions) |

For AI agents, prompt injection is the higher-priority threat. An agent that generates inappropriate text is a reputational problem. An agent that executes unauthorized database queries because of an injected instruction is a security incident.

Overlap Cases

Some attacks combine both. An attacker might jailbreak the model's safety guardrails first, then inject instructions to abuse tool access. Multi-turn attacks often follow this pattern: early turns erode the model's safety posture, and later turns inject specific action commands.

This is why defense needs to operate at multiple layers. Model-level training resists jailbreaks. Application-level policy engines like Authensor resist prompt injection by controlling what actions the agent is allowed to take, regardless of what the model has been convinced to do.