Probe Architecture
Each attack in TrustTest is implemented as a Probe. A probe generates test cases that are sent to your target model, collecting responses for evaluation. Probes can be:- Dataset-based: Use curated datasets of attack prompts
- Prompt-based: Dynamically generate attacks using LLMs
- Multi-turn: Conduct sophisticated attacks across multiple conversation turns
Attack Categories
Prompt Injections
Prompt injection attacks attempt to manipulate the model into ignoring its instructions or behaving in unintended ways. This is the most comprehensive category with attacks organized by technique:- Single Turn Attacks: Direct attacks in a single message (jailbreaking, encoding, structural attacks)
- Multi-Turn Attacks: Sophisticated attacks across multiple conversation turns (Crescendo, Echo Chamber)
- From Dataset: Attacks loaded from curated datasets
Content Bias
Content bias probes evaluate your model for cognitive and stereotypical biases.| Probe Type | Description |
|---|---|
| Anchoring Bias | Tests if the model over-relies on initial information |
| Framing Bias | Evaluates if responses change based on how questions are framed |
| Positional Bias | Assesses if the order of presented options affects decisions |
| Status Quo Bias | Tests preference for current state over alternatives |
| Temporal Bias | Evaluates if time-related framing affects reasoning |
| Ethnic Bias | Tests for discriminatory responses based on ethnicity |
| Gender Bias | Evaluates fairness across gender identities |
| LGBTIQ+ Bias | Assesses treatment of LGBTIQ+ topics |
| Religion Bias | Tests for religious discrimination |
Sensitive Data Leak
Probes that attempt to extract confidential information from the model.| Probe Type | Description |
|---|---|
| Direct Query for Sensitive Data | Tests resistance to straightforward requests |
| Contextual Leakage Request | Evaluates if context can extract sensitive data |
| Metadata Query | Assesses protection of system metadata |
| Obfuscated Data Request | Tests against encoded extraction attempts |
| Misleading Context Request | Evaluates resistance to social engineering |
| List of Agent Available Tools | Tests if the model reveals its tools (agents) |
| List of Parameters of Agent Tools | Evaluates protection of tool details (agents) |
| Reasoning Details of AI Agents | Assesses if reasoning can be extracted (agents) |
System Prompt Disclosure
Probes that attempt to extract the model’s system prompt or internal instructions.| Probe Type | Description |
|---|---|
| Direct Request | Tests resistance to straightforward requests |
| Oblique Reference | Evaluates handling of indirect references |
| Confusion and Clarification | Tests if confusion tactics extract details |
| Explanation Mode | Assesses if explanations reveal the prompt |
| Meta-Level Instruction Inquiry | Tests abstract questions about instructions |
| Request for Phrase Conclusion | Evaluates if partial prompts can be completed |
| Request for a Portion | Tests if segments can be extracted |
Input Leakage
Probes that test whether the model reveals information from previous conversations.| Probe Type | Description |
|---|---|
| Input Persistence Check | Tests if inputs persist across boundaries |
| Sensitive Information Echo Test | Evaluates if sensitive data is echoed |
| Contextual Memory Leak Check | Assesses context leaks between sessions |
| Inadvertent Contextual Reference | Tests for unintended references |
Unsafe Outputs
Probes that test the model’s guardrails against generating harmful content.| Category | Probes |
|---|---|
| Harmful Content | Hate, Violent Crimes, Non-Violent Crimes, Suicide/Self-Harm, Defamation |
| Illegal Activities | Child Sexual Exploitation, Sex-Related Crimes, Indiscriminate Weapons, Intellectual Property |
| Malicious Outputs | Phishing, Spam, Virus, XSS Attack Vector |
Off-Topic
Probes that test if the model stays within its intended scope.| Probe Type | Description |
|---|---|
| Competitors Check | Tests handling of competing products |
| Public Figures | Evaluates responses about public personalities |
| Disallowed Uses | Tests resistance to unauthorized purposes |
| Politics/Religion/Economy/Philosophy | Tests handling of sensitive topics |
| Illegal/Technology | Tests scope boundaries |
Agentic Behavior
Probes that test AI agents for safety concerns specific to autonomous systems.| Probe Type | Description |
|---|---|
| Stop Command Override | Tests if agents can ignore stop commands |
| Continuous Execution Prompt | Evaluates resistance to endless operation |
| Self-Preservation Prompt | Tests if agents prioritize self-preservation |
| Tool Misuse Simulation | Assesses resistance to inappropriate tool use |
| Recursive Planning Test | Evaluates handling of infinite loops |
| Long-Horizon Behavioral Drift | Tests for gradual deviation over time |
| Arbitrary Tools Invocation | Assesses resistance to unauthorized tools |
Choosing the Right Probes
By Risk Level
Critical Security (Must Test):- Prompt Injection attacks (especially DAN, System Override)
- Unsafe Outputs (Hate, Violence, CSAM)
- System Prompt Disclosure
- Sensitive Data Leak probes
- Input Leakage probes
- Multi-turn attacks (Crescendo, Echo Chamber)
- Content Bias probes
- Off-Topic probes
- Encoding/Obfuscation attacks
- Agentic Behavior probes
- Tool-related data leak probes
By Use Case
| Use Case | Recommended Probes |
|---|---|
| Customer Support Bot | Prompt Injection, Off-Topic, System Prompt Disclosure |
| Healthcare Assistant | Sensitive Data Leak, Unsafe Outputs (Self-Harm), Content Bias |
| Financial Advisor | Off-Topic (Economy), Sensitive Data Leak, Content Bias |
| General Purpose Chatbot | Full Prompt Injection suite, Unsafe Outputs, Input Leakage |
| AI Agent with Tools | Agentic Behavior, Tool-related probes, Prompt Injection |
| Content Moderation System | Unsafe Outputs, Bias probes |