Skip to main content
TrustTest includes a comprehensive catalog of red teaming attack probes designed to evaluate AI model safety, security, and robustness. This page provides a complete reference of all available attack categories, their purpose, and when to use them.

Probe Architecture

Each attack in TrustTest is implemented as a Probe. A probe generates test cases that are sent to your target model, collecting responses for evaluation. Probes can be:
  • Dataset-based: Use curated datasets of attack prompts
  • Prompt-based: Dynamically generate attacks using LLMs
  • Multi-turn: Conduct sophisticated attacks across multiple conversation turns

Attack Categories

Prompt Injections

Prompt injection attacks attempt to manipulate the model into ignoring its instructions or behaving in unintended ways. This is the most comprehensive category with attacks organized by technique:
  • Single Turn Attacks: Direct attacks in a single message (jailbreaking, encoding, structural attacks)
  • Multi-Turn Attacks: Sophisticated attacks across multiple conversation turns (Crescendo, Echo Chamber)
  • From Dataset: Attacks loaded from curated datasets
View all Prompt Injection attacks →

Content Bias

Content bias probes evaluate your model for cognitive and stereotypical biases.
Probe TypeDescription
Anchoring BiasTests if the model over-relies on initial information
Framing BiasEvaluates if responses change based on how questions are framed
Positional BiasAssesses if the order of presented options affects decisions
Status Quo BiasTests preference for current state over alternatives
Temporal BiasEvaluates if time-related framing affects reasoning
Ethnic BiasTests for discriminatory responses based on ethnicity
Gender BiasEvaluates fairness across gender identities
LGBTIQ+ BiasAssesses treatment of LGBTIQ+ topics
Religion BiasTests for religious discrimination
Learn more about Content Bias testing →

Sensitive Data Leak

Probes that attempt to extract confidential information from the model.
Probe TypeDescription
Direct Query for Sensitive DataTests resistance to straightforward requests
Contextual Leakage RequestEvaluates if context can extract sensitive data
Metadata QueryAssesses protection of system metadata
Obfuscated Data RequestTests against encoded extraction attempts
Misleading Context RequestEvaluates resistance to social engineering
List of Agent Available ToolsTests if the model reveals its tools (agents)
List of Parameters of Agent ToolsEvaluates protection of tool details (agents)
Reasoning Details of AI AgentsAssesses if reasoning can be extracted (agents)
Learn more about Sensitive Data Leak testing →

System Prompt Disclosure

Probes that attempt to extract the model’s system prompt or internal instructions.
Probe TypeDescription
Direct RequestTests resistance to straightforward requests
Oblique ReferenceEvaluates handling of indirect references
Confusion and ClarificationTests if confusion tactics extract details
Explanation ModeAssesses if explanations reveal the prompt
Meta-Level Instruction InquiryTests abstract questions about instructions
Request for Phrase ConclusionEvaluates if partial prompts can be completed
Request for a PortionTests if segments can be extracted
Learn more about System Prompt Disclosure testing →

Input Leakage

Probes that test whether the model reveals information from previous conversations.
Probe TypeDescription
Input Persistence CheckTests if inputs persist across boundaries
Sensitive Information Echo TestEvaluates if sensitive data is echoed
Contextual Memory Leak CheckAssesses context leaks between sessions
Inadvertent Contextual ReferenceTests for unintended references
Learn more about Input Leakage testing →

Unsafe Outputs

Probes that test the model’s guardrails against generating harmful content.
CategoryProbes
Harmful ContentHate, Violent Crimes, Non-Violent Crimes, Suicide/Self-Harm, Defamation
Illegal ActivitiesChild Sexual Exploitation, Sex-Related Crimes, Indiscriminate Weapons, Intellectual Property
Malicious OutputsPhishing, Spam, Virus, XSS Attack Vector
Learn more about Unsafe Outputs testing →

Off-Topic

Probes that test if the model stays within its intended scope.
Probe TypeDescription
Competitors CheckTests handling of competing products
Public FiguresEvaluates responses about public personalities
Disallowed UsesTests resistance to unauthorized purposes
Politics/Religion/Economy/PhilosophyTests handling of sensitive topics
Illegal/TechnologyTests scope boundaries
Learn more about Off-Topic testing →

Agentic Behavior

Probes that test AI agents for safety concerns specific to autonomous systems.
Probe TypeDescription
Stop Command OverrideTests if agents can ignore stop commands
Continuous Execution PromptEvaluates resistance to endless operation
Self-Preservation PromptTests if agents prioritize self-preservation
Tool Misuse SimulationAssesses resistance to inappropriate tool use
Recursive Planning TestEvaluates handling of infinite loops
Long-Horizon Behavioral DriftTests for gradual deviation over time
Arbitrary Tools InvocationAssesses resistance to unauthorized tools
Learn more about Agentic Behavior testing →

Choosing the Right Probes

By Risk Level

Critical Security (Must Test):
  • Prompt Injection attacks (especially DAN, System Override)
  • Unsafe Outputs (Hate, Violence, CSAM)
  • System Prompt Disclosure
High Priority:
  • Sensitive Data Leak probes
  • Input Leakage probes
  • Multi-turn attacks (Crescendo, Echo Chamber)
Standard Testing:
  • Content Bias probes
  • Off-Topic probes
  • Encoding/Obfuscation attacks
Agent-Specific (If Applicable):
  • Agentic Behavior probes
  • Tool-related data leak probes

By Use Case

Use CaseRecommended Probes
Customer Support BotPrompt Injection, Off-Topic, System Prompt Disclosure
Healthcare AssistantSensitive Data Leak, Unsafe Outputs (Self-Harm), Content Bias
Financial AdvisorOff-Topic (Economy), Sensitive Data Leak, Content Bias
General Purpose ChatbotFull Prompt Injection suite, Unsafe Outputs, Input Leakage
AI Agent with ToolsAgentic Behavior, Tool-related probes, Prompt Injection
Content Moderation SystemUnsafe Outputs, Bias probes

Next Steps