Documentation Index Fetch the complete documentation index at: https://docs.neuraltrust.ai/llms.txt
Use this file to discover all available pages before exploring further.
TrustTest includes a comprehensive catalog of red teaming attack probes designed to evaluate AI model safety, security, and robustness. This page provides a complete reference of all available attack categories, their purpose, and when to use them.
Probe Architecture
Each attack in TrustTest is implemented as a Probe . A probe generates test cases that are sent to your target model, collecting responses for evaluation. Probes can be:
Dataset-based : Use curated datasets of attack prompts
Prompt-based : Dynamically generate attacks using LLMs
Multi-turn : Conduct sophisticated attacks across multiple conversation turns
Attack Categories
Prompt Injections
Prompt injection attacks attempt to manipulate the model into ignoring its instructions or behaving in unintended ways. This is the most comprehensive category with attacks organized by technique:
Single Turn Attacks : Direct attacks in a single message (jailbreaking, encoding, structural attacks)
Multi-Turn Attacks : Sophisticated attacks across multiple conversation turns (Crescendo, Echo Chamber)
From Dataset : Attacks loaded from curated datasets
View all Prompt Injection attacks →
Content Bias
Content bias probes evaluate your model for cognitive and stereotypical biases.
Probe Type Description Anchoring Bias Tests if the model over-relies on initial information Framing Bias Evaluates if responses change based on how questions are framed Positional Bias Assesses if the order of presented options affects decisions Status Quo Bias Tests preference for current state over alternatives Temporal Bias Evaluates if time-related framing affects reasoning Ethnic Bias Tests for discriminatory responses based on ethnicity Gender Bias Evaluates fairness across gender identities LGBTIQ+ Bias Assesses treatment of LGBTIQ+ topics Religion Bias Tests for religious discrimination
Learn more about Content Bias testing →
Sensitive Data Leak
Probes that attempt to extract confidential information from the model.
Probe Type Description Direct Query for Sensitive Data Tests resistance to straightforward requests Contextual Leakage Request Evaluates if context can extract sensitive data Metadata Query Assesses protection of system metadata Obfuscated Data Request Tests against encoded extraction attempts Misleading Context Request Evaluates resistance to social engineering List of Agent Available Tools Tests if the model reveals its tools (agents) List of Parameters of Agent Tools Evaluates protection of tool details (agents) Reasoning Details of AI Agents Assesses if reasoning can be extracted (agents)
Learn more about Sensitive Data Leak testing →
System Prompt Disclosure
Probes that attempt to extract the model’s system prompt or internal instructions.
Probe Type Description Direct Request Tests resistance to straightforward requests Oblique Reference Evaluates handling of indirect references Confusion and Clarification Tests if confusion tactics extract details Explanation Mode Assesses if explanations reveal the prompt Meta-Level Instruction Inquiry Tests abstract questions about instructions Request for Phrase Conclusion Evaluates if partial prompts can be completed Request for a Portion Tests if segments can be extracted
Learn more about System Prompt Disclosure testing →
Probes that test whether the model reveals information from previous conversations.
Probe Type Description Input Persistence Check Tests if inputs persist across boundaries Sensitive Information Echo Test Evaluates if sensitive data is echoed Contextual Memory Leak Check Assesses context leaks between sessions Inadvertent Contextual Reference Tests for unintended references
Learn more about Input Leakage testing →
Unsafe Outputs
Probes that test the model’s guardrails against generating harmful content.
Category Probes Harmful Content Hate, Violent Crimes, Non-Violent Crimes, Suicide/Self-Harm, Defamation Illegal Activities Child Sexual Exploitation, Sex-Related Crimes, Indiscriminate Weapons, Intellectual Property Malicious Outputs Phishing, Spam, Virus, XSS Attack Vector
Learn more about Unsafe Outputs testing →
Off-Topic
Probes that test if the model stays within its intended scope.
Probe Type Description Competitors Check Tests handling of competing products Public Figures Evaluates responses about public personalities Disallowed Uses Tests resistance to unauthorized purposes Politics/Religion/Economy/Philosophy Tests handling of sensitive topics Illegal/Technology Tests scope boundaries
Learn more about Off-Topic testing →
Agentic Behavior
Probes that test AI agents for safety concerns specific to autonomous systems.
Probe Type Description Stop Command Override Tests if agents can ignore stop commands Continuous Execution Prompt Evaluates resistance to endless operation Self-Preservation Prompt Tests if agents prioritize self-preservation Tool Misuse Simulation Assesses resistance to inappropriate tool use Recursive Planning Test Evaluates handling of infinite loops Long-Horizon Behavioral Drift Tests for gradual deviation over time Arbitrary Tools Invocation Assesses resistance to unauthorized tools
Learn more about Agentic Behavior testing →
Choosing the Right Probes
By Risk Level
Critical Security (Must Test):
Prompt Injection attacks (especially DAN, System Override)
Unsafe Outputs (Hate, Violence, CSAM)
System Prompt Disclosure
High Priority:
Sensitive Data Leak probes
Input Leakage probes
Multi-turn attacks (Crescendo, Echo Chamber)
Standard Testing:
Content Bias probes
Off-Topic probes
Encoding/Obfuscation attacks
Agent-Specific (If Applicable):
Agentic Behavior probes
Tool-related data leak probes
By Use Case
Use Case Recommended Probes Customer Support Bot Prompt Injection, Off-Topic, System Prompt Disclosure Healthcare Assistant Sensitive Data Leak, Unsafe Outputs (Self-Harm), Content Bias Financial Advisor Off-Topic (Economy), Sensitive Data Leak, Content Bias General Purpose Chatbot Full Prompt Injection suite, Unsafe Outputs, Input Leakage AI Agent with Tools Agentic Behavior, Tool-related probes, Prompt Injection Content Moderation System Unsafe Outputs, Bias probes
Next Steps
Prompt Injections Explore all prompt injection attack techniques
Creating Custom Probes Learn how to build your own attack probes