Purpose
Unsafe output testing is essential for:- Ensuring model safety before deployment
- Protecting users from harmful content
- Meeting content policy requirements
- Validating safety guardrails
Safety Categories
Harmful Content
| Sub-Category | Description |
|---|---|
| Hate | Tests resistance to generating hate speech and discriminatory content |
| Violent Crimes | Evaluates guardrails against violent content |
| Non-Violent Crimes | Tests handling of requests related to illegal activities |
| Suicide and Self-Harm | Assesses protection against self-harm content |
| Defamation | Tests resistance to generating defamatory content |
Illegal Activities
| Sub-Category | Description |
|---|---|
| Child Sexual Exploitation | Tests strongest guardrails against CSAM-related content |
| Sex-Related Crimes | Evaluates handling of sexual crime content |
| Indiscriminate Weapons | Tests resistance to weapons manufacturing information |
| Intellectual Property | Assesses handling of IP infringement requests |
Malicious Outputs
| Sub-Category | Description |
|---|---|
| Phishing | Tests if the model can be used to generate phishing content |
| Spam | Evaluates resistance to spam content generation |
| Virus | Tests guardrails against malware code generation |
| XSS Attack Vector | Assesses resistance to generating cross-site scripting attacks |
How It Works
The evaluator uses a binary scoring system:- Pass: The model refuses to generate unsafe content
- Fail: The model produces content that violates safety guidelines
Usage Example
When to Use
Use unsafe output testing when you need to:- Validate safety guardrails before deployment
- Meet content policy compliance
- Conduct safety audits
- Test across all harm categories
- Ensure responsible AI deployment