NeuralTrust Guardrail includes comprehensive toxicity detection capabilities as part of its defense system. This feature is designed to identify and block harmful, offensive, or inappropriate content before it reaches the underlying model.

Toxicity & Content Moderation

We identify, among others:

| Category   | Description |
|------------|-------------|
| Harassment | Detection of abusive, threatening, or targeted harassment |
| Hate       | Identification of hate speech or discriminatory language |
| Illicit    | Detection of illegal activities or promotion of criminal behavior |
| Self-harm  | Language promoting, instructing, or indicating intent of self-harm |
| Sexual     | Explicit or sexual content, including inappropriate references to minors |
| Violence   | Violent or graphic content, including threats or depictions of violence |

Configuration

The toxicity detection feature can be configured in the NeuralTrust Guardrail plugin:

{
    "name": "neuraltrust_guardrail",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "credentials": {
            "token": "{{ NEURALTRUST_API_KEY }}",
            "base_url": "https://data.neuraltrust.ai"
        },
        "toxicity":{
            "enabled": true,
            "threshold": 0.7
        }
    }
}
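
With this example configuration, any prompt whose toxicity score exceeds 0.7 is rejected before it reaches the model. As a rough mental model of that decision, the Python sketch below compares a score against the configured threshold and builds the same error body shown under Error Responses; the score itself is left as an input (in the plugin it comes from the NeuralTrust moderation service), so treat this as an illustration rather than the plugin's implementation:

from typing import Optional

# Matches toxicity.threshold in the example configuration above.
THRESHOLD = 0.7

def check_toxicity(score: float, threshold: float = THRESHOLD) -> Optional[dict]:
    """Return a 403-style error body if the score exceeds the threshold, else None.

    The score is assumed to already come from the moderation service; producing
    it is outside the scope of this sketch.
    """
    if score > threshold:
        return {
            "error": f"toxicity: score {score:.2f} exceeded threshold {threshold:.2f}",
            "retry_after": None,
        }
    return None  # below threshold: the request is passed through to the model

print(check_toxicity(0.85))  # {'error': 'toxicity: score 0.85 exceeded threshold 0.70', 'retry_after': None}
print(check_toxicity(0.30))  # None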

Configuration Parameters

| Parameter          | Type    | Description | Required | Default |
|--------------------|---------|-------------|----------|---------|
| toxicity           | object  | Settings object that controls prompt classification via the moderation endpoint. | No | - |
| toxicity.enabled   | boolean | Whether to enable the toxicity check. | No | false |
| toxicity.threshold | float   | Score threshold between 0.0 and 1.0 above which the input is flagged. | No | 0 |
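
For completeness, the sketch below shows one way to read these settings and apply the documented defaults in Python; the file name neuraltrust_guardrail.json and the helper load_toxicity_settings are hypothetical and only illustrate the defaults and the 0.0 to 1.0 range:

import json

def load_toxicity_settings(plugin_config: dict) -> dict:
    """Pull the toxicity settings out of a plugin config dict, applying documented defaults."""
    toxicity = plugin_config.get("settings", {}).get("toxicity", {})
    enabled = bool(toxicity.get("enabled", False))     # default: false
    threshold = float(toxicity.get("threshold", 0.0))  # default: 0
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"toxicity.threshold must be between 0.0 and 1.0, got {threshold}")
    return {"enabled": enabled, "threshold": threshold}

# Hypothetical file containing the JSON configuration shown above.
with open("neuraltrust_guardrail.json") as f:
    print(load_toxicity_settings(json.load(f)))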

Error Responses

When the toxicity score exceeds the configured threshold, the plugin blocks the request and returns a 403 Forbidden error:

{
  "error": "toxicity: score 0.85 exceeded threshold 0.70",
  "retry_after": null
}
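
Clients can therefore treat a 403 from the gateway as a guardrail block and surface the error field to the caller. The snippet below is a hedged example built on the requests library; the gateway URL and request payload are placeholders, not NeuralTrust-specific values:

import requests

# Hypothetical endpoint fronted by the NeuralTrust Guardrail plugin.
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"

resp = requests.post(
    GATEWAY_URL,
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)

if resp.status_code == 403:
    body = resp.json()
    # e.g. "toxicity: score 0.85 exceeded threshold 0.70"
    print(f"Blocked by guardrail: {body.get('error')}")
else:
    resp.raise_for_status()
    print(resp.json())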