15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Prompt Injection Detection Methods: Pattern Matching, ML Classifiers, and Hybrid Approaches

October 8, 202515 Research Lab

prompt-injectiondefenseguardrailsmethodology

Detecting prompt injection is harder than detecting traditional security threats because the payloads are natural language. There is no equivalent of a malformed packet or a known malware signature. Still, three main approaches have emerged, each with distinct tradeoffs.

Pattern Matching

The simplest approach uses regular expressions and keyword lists to catch known injection patterns. Phrases like "ignore previous instructions," "you are now," and "disregard your system prompt" can be flagged with basic string matching.

Strengths: Fast, deterministic, easy to understand and audit. Zero false negatives on known patterns.

Weaknesses: Trivially evaded. Attackers use synonyms, encoding, language switching, or multi-turn strategies to bypass static patterns. Maintenance burden grows as the evasion catalog expands.

Pattern matching works best as a first filter that catches low-effort attacks before more expensive analysis runs.

ML Classifiers

Fine-tuned transformer models trained on injection/benign pairs can detect semantic intent rather than surface patterns. DeBERTa-based classifiers and distilled models from projects like ProtectAI's rebuff achieve good accuracy on benchmark datasets.

Strengths: Catches novel phrasings. Understands intent beyond keyword presence. Can be tuned for precision/recall tradeoffs.

Weaknesses: Requires training data that covers your domain. Models can be adversarially attacked. Inference latency adds to response time. Black-box decisions are harder to audit than regex rules.

Hybrid / Layered Detection

The most effective production systems combine both approaches in a pipeline. A typical architecture:

Fast pattern scan rejects obvious attacks (sub-millisecond)
Statistical features check for anomalies: unusual token distributions, high perplexity shifts, encoding artifacts
ML classifier scores the input on injection probability
Policy engine evaluates the requested action regardless of input classification

Aegis uses this layered approach: pattern matching first, then statistical analysis, then a scoring system that combines multiple signals. The key insight is that each layer catches what the others miss.

What the Numbers Look Like

In our testing across the AI SecLists payload corpus, pattern matching alone catches roughly 40-60% of injection attempts. ML classifiers reach 85-92%. Hybrid systems with policy enforcement hit 95%+ detection with sub-5% false positive rates.

The remaining gap is why defense-in-depth matters. Detection is necessary but not sufficient. You also need authorization policies that limit what a successful injection can actually do.