Unsafe output testing evaluates your model’s guardrails against generating harmful, illegal, or dangerous content. The probes cover a comprehensive range of safety categories, detailed below.

Purpose

Unsafe output testing is essential for:
  • Ensuring model safety before deployment
  • Protecting users from harmful content
  • Meeting content policy requirements
  • Validating safety guardrails

Safety Categories

Harmful Content

| Sub-Category | Description |
| --- | --- |
| Hate | Tests resistance to generating hate speech and discriminatory content |
| Violent Crimes | Evaluates guardrails against violent content |
| Non-Violent Crimes | Tests handling of requests related to illegal activities |
| Suicide and Self-Harm | Assesses protection against self-harm content |
| Defamation | Tests resistance to generating defamatory content |

Illegal Activities

| Sub-Category | Description |
| --- | --- |
| Child Sexual Exploitation | Tests the strongest guardrails against CSAM-related content |
| Sex-Related Crimes | Evaluates handling of sexual crime content |
| Indiscriminate Weapons | Tests resistance to providing weapons manufacturing information |
| Intellectual Property | Assesses handling of IP infringement requests |

Malicious Outputs

| Sub-Category | Description |
| --- | --- |
| Phishing | Tests whether the model can be used to generate phishing content |
| Spam | Evaluates resistance to spam content generation |
| Virus | Tests guardrails against malware code generation |
| XSS Attack Vector | Assesses resistance to generating cross-site scripting attacks |
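
If you plan to iterate over these categories programmatically, it can help to collect them in one structure. The mapping below is only a sketch: apart from "hate", which appears in the usage example later, the exact sub-category identifier strings accepted by UnsafeOutputScenario are assumptions and should be verified against the trusttest catalog.

# Hypothetical mapping of the tables above to sub-category identifiers.
# Only "hate" is confirmed by the usage example below; verify the other
# strings against the trusttest catalog before using them.
SAFETY_CATEGORIES = {
    "harmful_content": [
        "hate", "violent_crimes", "non_violent_crimes",
        "suicide_and_self_harm", "defamation",
    ],
    "illegal_activities": [
        "child_sexual_exploitation", "sex_related_crimes",
        "indiscriminate_weapons", "intellectual_property",
    ],
    "malicious_outputs": [
        "phishing", "spam", "virus", "xss_attack_vector",
    ],
}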

How It Works

The evaluator uses a binary scoring system:
  • Pass: The model refuses to generate unsafe content
  • Fail: The model produces content that violates safety guidelines
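
Conceptually, each probe response is reduced to a single pass or fail judgment. The snippet below is a minimal illustration of that idea, not the trusttest evaluator itself; the refusal markers and the score_response helper are assumptions made for demonstration, and a real evaluator applies far more robust judgment than keyword matching.

# Illustrative only: a toy binary scorer, not part of the trusttest API.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def score_response(response: str) -> str:
    """Return "pass" if the response looks like a refusal, "fail" otherwise."""
    text = response.lower()
    return "pass" if any(marker in text for marker in REFUSAL_MARKERS) else "fail"

print(score_response("I can't help with that request."))  # pass
print(score_response("Sure, here is how to do it..."))    # fail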

Usage Example

from trusttest.catalog import UnsafeOutputScenario
from trusttest.targets.http import HttpTarget, PayloadConfig

# Point the target at your model's HTTP endpoint; the "{{ test }}"
# placeholder marks where each probe prompt is inserted.
target = HttpTarget(
    url="https://your-model-endpoint.com/chat",
    headers={"Content-Type": "application/json"},
    payload_config=PayloadConfig(
        format={
            "messages": [
                {"role": "user", "content": "{{ test }}"}
            ]
        },
        message_regex="{{ test }}",
    ),
)

# Probe a single sub-category with up to 20 attack prompts.
scenario = UnsafeOutputScenario(
    target=target,
    sub_category="hate",
    max_attacks=20,
)

# Build the test set, run the evaluation, and print a pass/fail summary.
test_set = scenario.probe.get_test_set()
results = scenario.eval.evaluate(test_set)
results.display_summary()
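
To cover more than one sub-category, the same pattern can be run in a loop. This sketch reuses only the calls shown above; the sub-category names other than "hate" are assumptions and should be verified against the catalog.

# Sketch: run the same evaluation over several sub-categories.
# Names other than "hate" are assumptions; verify them against the catalog.
for sub_category in ["hate", "phishing", "spam"]:
    scenario = UnsafeOutputScenario(
        target=target,
        sub_category=sub_category,
        max_attacks=20,
    )
    test_set = scenario.probe.get_test_set()
    results = scenario.eval.evaluate(test_set)
    print(f"--- {sub_category} ---")
    results.display_summary()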

When to Use

Use unsafe output testing when you need to:
  • Validate safety guardrails before deployment
  • Demonstrate content policy compliance
  • Conduct safety audits
  • Test across all harm categories
  • Ensure responsible AI deployment