Unsafe output testing evaluates your model’s guardrails against generating harmful, illegal, or dangerous content. The probes cover a comprehensive range of safety categories, detailed below.

Purpose

Unsafe output testing is essential for:
  • Ensuring model safety before deployment
  • Protecting users from harmful content
  • Meeting content policy requirements
  • Validating safety guardrails

Safety Categories

Harmful Content

| Sub-Category | Description |
| --- | --- |
| Hate | Tests resistance to generating hate speech and discriminatory content |
| Violent Crimes | Evaluates guardrails against violent content |
| Non-Violent Crimes | Tests handling of requests related to illegal activities |
| Suicide and Self-Harm | Assesses protection against self-harm content |
| Defamation | Tests resistance to generating defamatory content |

Illegal Activities

| Sub-Category | Description |
| --- | --- |
| Child Sexual Exploitation | Tests the strongest guardrails against CSAM-related content |
| Sex-Related Crimes | Evaluates handling of sexual crime content |
| Indiscriminate Weapons | Tests resistance to providing weapons manufacturing information |
| Intellectual Property | Assesses handling of IP infringement requests |

Malicious Outputs

| Sub-Category | Description |
| --- | --- |
| Phishing | Tests whether the model can be used to generate phishing content |
| Spam | Evaluates resistance to spam content generation |
| Virus | Tests guardrails against malware code generation |
| XSS Attack Vector | Assesses resistance to generating cross-site scripting attacks |
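
If you plan to iterate over these categories programmatically, it can help to collect them in one structure. The mapping below is only a sketch: apart from "hate", which appears in the usage example later, the exact sub-category identifier strings accepted by UnsafeOutputScenario are assumptions and should be verified against the trusttest catalog.

# Hypothetical mapping of the tables above to sub-category identifiers.
# Only "hate" is confirmed by the usage example below; verify the other
# strings against the trusttest catalog before using them.
SAFETY_CATEGORIES = {
    "harmful_content": [
        "hate", "violent_crimes", "non_violent_crimes",
        "suicide_and_self_harm", "defamation",
    ],
    "illegal_activities": [
        "child_sexual_exploitation", "sex_related_crimes",
        "indiscriminate_weapons", "intellectual_property",
    ],
    "malicious_outputs": [
        "phishing", "spam", "virus", "xss_attack_vector",
    ],
}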

How It Works

The evaluator uses a binary scoring system:
  • Pass: The model refuses to generate unsafe content
  • Fail: The model produces content that violates safety guidelines
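
Conceptually, each probe response is reduced to a single pass or fail judgment. The snippet below is a minimal illustration of that idea, not the trusttest evaluator itself; the refusal markers and the score_response helper are assumptions made for demonstration, and a real evaluator applies far more robust judgment than keyword matching.

# Illustrative only: a toy binary scorer, not part of the trusttest API.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def score_response(response: str) -> str:
    """Return "pass" if the response looks like a refusal, "fail" otherwise."""
    text = response.lower()
    return "pass" if any(marker in text for marker in REFUSAL_MARKERS) else "fail"

print(score_response("I can't help with that request."))  # pass
print(score_response("Sure, here is how to do it..."))    # fail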

Usage Example

from trusttest.catalog import UnsafeOutputScenario
from trusttest.targets.http import HttpTarget, PayloadConfig

# Point the target at your model's HTTP endpoint; the "{{ test }}"
# placeholder marks where each probe prompt is inserted.
target = HttpTarget(
    url="https://your-model-endpoint.com/chat",
    headers={"Content-Type": "application/json"},
    payload_config=PayloadConfig(
        format={
            "messages": [
                {"role": "user", "content": "{{ test }}"}
            ]
        },
        message_regex="{{ test }}",
    ),
)

# Probe a single sub-category with up to 20 attack prompts.
scenario = UnsafeOutputScenario(
    target=target,
    sub_category="hate",
    max_attacks=20,
)

# Build the test set, run the evaluation, and print a pass/fail summary.
test_set = scenario.probe.get_test_set()
results = scenario.eval.evaluate(test_set)
results.display_summary()
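
To cover more than one sub-category, the same pattern can be run in a loop. This sketch reuses only the calls shown above; the sub-category names other than "hate" are assumptions and should be verified against the catalog.

# Sketch: run the same evaluation over several sub-categories.
# Names other than "hate" are assumptions; verify them against the catalog.
for sub_category in ["hate", "phishing", "spam"]:
    scenario = UnsafeOutputScenario(
        target=target,
        sub_category=sub_category,
        max_attacks=20,
    )
    test_set = scenario.probe.get_test_set()
    results = scenario.eval.evaluate(test_set)
    print(f"--- {sub_category} ---")
    results.display_summary()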

When to Use

Use unsafe output testing when you need to:
  • Validate safety guardrails before deployment
  • Demonstrate content policy compliance
  • Conduct safety audits
  • Test across all harm categories
  • Ensure responsible AI deployment