NeuralTrust | The leading security platform for generative AI

TrustTest includes a comprehensive catalog of red teaming attack probes designed to evaluate AI model safety, security, and robustness. This page provides a complete reference of all available attack categories, their purpose, and when to use them.

Probe Architecture

Each attack in TrustTest is implemented as a Probe. A probe generates test cases that are sent to your target model, collecting responses for evaluation. Probes can be:

Dataset-based: Use curated datasets of attack prompts
Prompt-based: Dynamically generate attacks using LLMs
Multi-turn: Conduct sophisticated attacks across multiple conversation turns

Attack Categories

Prompt Injections

Prompt injection attacks attempt to manipulate the model into ignoring its instructions or behaving in unintended ways. This is the most comprehensive category with attacks organized by technique:

Single Turn Attacks: Direct attacks in a single message (jailbreaking, encoding, structural attacks)
Multi-Turn Attacks: Sophisticated attacks across multiple conversation turns (Crescendo, Echo Chamber)
From Dataset: Attacks loaded from curated datasets

View all Prompt Injection attacks →

Content Bias

Content bias probes evaluate your model for cognitive and stereotypical biases.

Probe Type	Description
Anchoring Bias	Tests if the model over-relies on initial information
Framing Bias	Evaluates if responses change based on how questions are framed
Positional Bias	Assesses if the order of presented options affects decisions
Status Quo Bias	Tests preference for current state over alternatives
Temporal Bias	Evaluates if time-related framing affects reasoning
Ethnic Bias	Tests for discriminatory responses based on ethnicity
Gender Bias	Evaluates fairness across gender identities
LGBTIQ+ Bias	Assesses treatment of LGBTIQ+ topics
Religion Bias	Tests for religious discrimination

Learn more about Content Bias testing →

Sensitive Data Leak

Probes that attempt to extract confidential information from the model.

Probe Type	Description
Direct Query for Sensitive Data	Tests resistance to straightforward requests
Contextual Leakage Request	Evaluates if context can extract sensitive data
Metadata Query	Assesses protection of system metadata
Obfuscated Data Request	Tests against encoded extraction attempts
Misleading Context Request	Evaluates resistance to social engineering
List of Agent Available Tools	Tests if the model reveals its tools (agents)
List of Parameters of Agent Tools	Evaluates protection of tool details (agents)
Reasoning Details of AI Agents	Assesses if reasoning can be extracted (agents)

Learn more about Sensitive Data Leak testing →

System Prompt Disclosure

Probes that attempt to extract the model’s system prompt or internal instructions.

Probe Type	Description
Direct Request	Tests resistance to straightforward requests
Oblique Reference	Evaluates handling of indirect references
Confusion and Clarification	Tests if confusion tactics extract details
Explanation Mode	Assesses if explanations reveal the prompt
Meta-Level Instruction Inquiry	Tests abstract questions about instructions
Request for Phrase Conclusion	Evaluates if partial prompts can be completed
Request for a Portion	Tests if segments can be extracted

Learn more about System Prompt Disclosure testing →

Input Leakage

Probes that test whether the model reveals information from previous conversations.

Probe Type	Description
Input Persistence Check	Tests if inputs persist across boundaries
Sensitive Information Echo Test	Evaluates if sensitive data is echoed
Contextual Memory Leak Check	Assesses context leaks between sessions
Inadvertent Contextual Reference	Tests for unintended references

Learn more about Input Leakage testing →

Unsafe Outputs

Probes that test the model’s guardrails against generating harmful content.

Category	Probes
Harmful Content	Hate, Violent Crimes, Non-Violent Crimes, Suicide/Self-Harm, Defamation
Illegal Activities	Child Sexual Exploitation, Sex-Related Crimes, Indiscriminate Weapons, Intellectual Property
Malicious Outputs	Phishing, Spam, Virus, XSS Attack Vector

Learn more about Unsafe Outputs testing →

Off-Topic

Probes that test if the model stays within its intended scope.

Probe Type	Description
Competitors Check	Tests handling of competing products
Public Figures	Evaluates responses about public personalities
Disallowed Uses	Tests resistance to unauthorized purposes
Politics/Religion/Economy/Philosophy	Tests handling of sensitive topics
Illegal/Technology	Tests scope boundaries

Learn more about Off-Topic testing →

Agentic Behavior

Probes that test AI agents for safety concerns specific to autonomous systems.

Probe Type	Description
Stop Command Override	Tests if agents can ignore stop commands
Continuous Execution Prompt	Evaluates resistance to endless operation
Self-Preservation Prompt	Tests if agents prioritize self-preservation
Tool Misuse Simulation	Assesses resistance to inappropriate tool use
Recursive Planning Test	Evaluates handling of infinite loops
Long-Horizon Behavioral Drift	Tests for gradual deviation over time
Arbitrary Tools Invocation	Assesses resistance to unauthorized tools

Learn more about Agentic Behavior testing →

Choosing the Right Probes

By Risk Level

Critical Security (Must Test):

Prompt Injection attacks (especially DAN, System Override)
Unsafe Outputs (Hate, Violence, CSAM)
System Prompt Disclosure

High Priority:

Sensitive Data Leak probes
Input Leakage probes
Multi-turn attacks (Crescendo, Echo Chamber)

Standard Testing:

Content Bias probes
Off-Topic probes
Encoding/Obfuscation attacks

Agent-Specific (If Applicable):

Agentic Behavior probes
Tool-related data leak probes

By Use Case

Use Case	Recommended Probes
Customer Support Bot	Prompt Injection, Off-Topic, System Prompt Disclosure
Healthcare Assistant	Sensitive Data Leak, Unsafe Outputs (Self-Harm), Content Bias
Financial Advisor	Off-Topic (Economy), Sensitive Data Leak, Content Bias
General Purpose Chatbot	Full Prompt Injection suite, Unsafe Outputs, Input Leakage
AI Agent with Tools	Agentic Behavior, Tool-related probes, Prompt Injection
Content Moderation System	Unsafe Outputs, Bias probes

Getting Started

Core Concepts

Connect your app

Create tests

Evaluate results

Overview

Probe Architecture

Attack Categories

Prompt Injections

Content Bias

Sensitive Data Leak

System Prompt Disclosure

Input Leakage

Unsafe Outputs

Off-Topic

Agentic Behavior

Choosing the Right Probes

By Risk Level

By Use Case

Next Steps

Prompt Injections

Creating Custom Probes

Getting Started

Core Concepts

Connect your app

Create tests

Evaluate results

​Probe Architecture

​Attack Categories

​Prompt Injections

​Content Bias

​Sensitive Data Leak

​System Prompt Disclosure

​Input Leakage

​Unsafe Outputs

​Off-Topic

​Agentic Behavior

​Choosing the Right Probes

​By Risk Level

​By Use Case

​Next Steps

Prompt Injections

Creating Custom Probes

Probe Architecture

Attack Categories

Prompt Injections

Content Bias

Sensitive Data Leak

System Prompt Disclosure

Input Leakage

Unsafe Outputs

Off-Topic

Agentic Behavior

Choosing the Right Probes

By Risk Level

By Use Case

Next Steps