NeuralTrust Toxicity Detection
NeuralTrust Guardrail includes comprehensive toxicity detection capabilities as part of its defense system. This feature is designed to identify and block harmful, offensive, or inappropriate content before it reaches the underlying model.
Toxicity & Content Moderation
We identify, among other categories, the following:
| Category | Description |
|---|---|
| Harassment | Detection of abusive, threatening, or targeted harassment |
| Hate | Identification of hate speech or discriminatory language |
| Illicit | Detection of illegal activities or promotion of criminal behavior |
| Self-harm | Language promoting, instructing, or indicating intent of self-harm |
| Sexual | Explicit or sexual content, including inappropriate references to minors |
| Violence | Violent or graphic content, including threats or depictions of violence |
Configuration
The toxicity detection feature can be configured in the NeuralTrust Guardrail plugin:
Configuration Parameters
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| toxicity | object | Configuration object for prompt toxicity classification via the moderation endpoint. | No | — |
| toxicity.enabled | boolean | Whether to enable the toxicity check. | No | false |
| toxicity.threshold | float | Score threshold between 0.0 and 1.0 used to flag the input. | No | 0 |
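As a sketch, the toxicity portion of the plugin settings might look like the following. The surrounding plugin definition is omitted because it depends on how the guardrail is deployed, and the 0.7 threshold is an illustrative value rather than a recommendation.

```json
{
  "toxicity": {
    "enabled": true,
    "threshold": 0.7
  }
}
```

With a configuration like this, an input whose toxicity score crosses the 0.7 threshold in any of the categories above would be flagged and rejected with the 403 Forbidden error described below.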
Error Responses
When toxic content is detected, the plugin returns a 403 Forbidden error: