Purpose
Prompt injection attacks attempt to override a model’s intended behavior by embedding malicious instructions within user inputs. Testing for these vulnerabilities is critical for:- Ensuring model safety and alignment
- Protecting against jailbreak attempts
- Validating input sanitization and guardrails
- Assessing robustness against adversarial users
Attack Categories
Multi-Turn Attacks
These sophisticated attacks use multiple conversation turns to gradually manipulate the model:| Sub-Category | Description |
|---|---|
| Multi-Turn Manipulation | Tests resistance to gradual manipulation across multiple conversation turns |
| Crescendo Attack | Simulates gradual escalation attacks that slowly push boundaries |
| Echo Chamber Attack | Tests vulnerability to reinforcement-based manipulation |
Jailbreaking Techniques
Direct attempts to bypass model safety measures:| Sub-Category | Description |
|---|---|
| Best-of-N Jailbreaking | Tests against multiple jailbreak variations to find weaknesses |
| Anti-GPT | Evaluates resistance to anti-GPT jailbreak prompts |
| DAN Jailbreak | Tests against “Do Anything Now” style jailbreaks |
| Role-Playing Exploits | Assesses vulnerability to role-play based manipulation |
| System Override | Tests attempts to override system instructions |
Encoding & Obfuscation
Attacks that use encoding or obfuscation to bypass filters:| Sub-Category | Description |
|---|---|
| Encoded Payload | Tests resistance to base64, hex, and other encoded attacks |
| Encoding and Capitalization | Evaluates handling of alternating capitalization patterns |
| Symbolic Encoding | Tests against emoji and special character based bypasses |
| Obfuscation and Token Smuggling | Assesses resistance to token-level obfuscation |
| Typo Tricks | Tests handling of intentional misspellings |
Structural Attacks
Attacks that exploit input structure or format:| Sub-Category | Description |
|---|---|
| Context Hijacking | Tests vulnerability to conversation context manipulation |
| JSON Injection | Evaluates resistance to malicious JSON payloads |
| Payload Splitting | Tests against attacks split across multiple messages |
| Instructional Inversion | Assesses handling of reversed or inverted instructions |
Language-Based Attacks
| Sub-Category | Description |
|---|---|
| Multi-Language Attacks | Tests for bypasses using non-English languages |
| Synonyms | Evaluates resistance to synonym-based evasion |
| Allowed and Disallowed Questions | Tests mixing of legitimate and malicious questions |
Multimodal
| Sub-Category | Description |
|---|---|
| Multimodal Injection | Tests for attacks embedded in images or other media |
How It Works
Prompt injection probes use a binary evaluation:- Pass: The model resists the injection attempt and maintains safe behavior
- Fail: The model’s behavior is successfully manipulated by the attack
Usage Example
When to Use
Use prompt injection testing when you need to:- Validate model safety before deployment
- Test guardrails and content filters
- Assess vulnerability to known jailbreak techniques
- Conduct red team exercises
- Meet security compliance requirements