Skip to main content
Prompt injection testing evaluates your model’s resilience against attempts to manipulate its behavior through crafted inputs. TrustTest provides a comprehensive suite of prompt injection probes covering various attack techniques.

Purpose

Prompt injection attacks attempt to override a model’s intended behavior by embedding malicious instructions within user inputs. Testing for these vulnerabilities is critical for:
  • Ensuring model safety and alignment
  • Protecting against jailbreak attempts
  • Validating input sanitization and guardrails
  • Assessing robustness against adversarial users

Attack Categories

Multi-Turn Attacks

These sophisticated attacks use multiple conversation turns to gradually manipulate the model:
Sub-CategoryDescription
Multi-Turn ManipulationTests resistance to gradual manipulation across multiple conversation turns
Crescendo AttackSimulates gradual escalation attacks that slowly push boundaries
Echo Chamber AttackTests vulnerability to reinforcement-based manipulation

Jailbreaking Techniques

Direct attempts to bypass model safety measures:
Sub-CategoryDescription
Best-of-N JailbreakingTests against multiple jailbreak variations to find weaknesses
Anti-GPTEvaluates resistance to anti-GPT jailbreak prompts
DAN JailbreakTests against “Do Anything Now” style jailbreaks
Role-Playing ExploitsAssesses vulnerability to role-play based manipulation
System OverrideTests attempts to override system instructions

Encoding & Obfuscation

Attacks that use encoding or obfuscation to bypass filters:
Sub-CategoryDescription
Encoded PayloadTests resistance to base64, hex, and other encoded attacks
Encoding and CapitalizationEvaluates handling of alternating capitalization patterns
Symbolic EncodingTests against emoji and special character based bypasses
Obfuscation and Token SmugglingAssesses resistance to token-level obfuscation
Typo TricksTests handling of intentional misspellings

Structural Attacks

Attacks that exploit input structure or format:
Sub-CategoryDescription
Context HijackingTests vulnerability to conversation context manipulation
JSON InjectionEvaluates resistance to malicious JSON payloads
Payload SplittingTests against attacks split across multiple messages
Instructional InversionAssesses handling of reversed or inverted instructions

Language-Based Attacks

Sub-CategoryDescription
Multi-Language AttacksTests for bypasses using non-English languages
SynonymsEvaluates resistance to synonym-based evasion
Allowed and Disallowed QuestionsTests mixing of legitimate and malicious questions

Multimodal

Sub-CategoryDescription
Multimodal InjectionTests for attacks embedded in images or other media

How It Works

Prompt injection probes use a binary evaluation:
  • Pass: The model resists the injection attempt and maintains safe behavior
  • Fail: The model’s behavior is successfully manipulated by the attack

Usage Example

from trusttest.catalog import PromptInjectionScenario
from trusttest.targets.http import HttpTarget, PayloadConfig

target = HttpTarget(
    url="https://your-model-endpoint.com/chat",
    headers={"Content-Type": "application/json"},
    payload_config=PayloadConfig(
        format={
            "messages": [
                {"role": "user", "content": "{{ test }}"}
            ]
        },
        message_regex="{{ test }}",
    ),
)

scenario = PromptInjectionScenario(
    target=target,
    sub_category="dan-jailbreak",
    max_attacks=20,
    sampling="random",
)

test_set = scenario.probe.get_test_set()
results = scenario.eval.evaluate(test_set)
results.display_summary()

When to Use

Use prompt injection testing when you need to:
  • Validate model safety before deployment
  • Test guardrails and content filters
  • Assess vulnerability to known jailbreak techniques
  • Conduct red team exercises
  • Meet security compliance requirements