NeuralTrust Guardrail includes comprehensive toxicity detection capabilities as part of its defense system. This feature is designed to identify and block harmful, offensive, or inappropriate content before it reaches the underlying model.
Toxicity & Content Moderation
We identify, among others:

| Category | Description |
|---|---|
| Harassment | Detection of abusive, threatening, or targeted harassment |
| Hate | Identification of hate speech or discriminatory language |
| Illicit | Detection of illegal activities or promotion of criminal behavior |
| Self-harm | Language promoting, instructing, or indicating intent of self-harm |
| Sexual | Explicit or sexual content, including inappropriate references to minors |
| Violence | Violent or graphic content, including threats or depictions of violence |
Configuration
The toxicity detection feature can be configured via the neuraltrust_toxicity plugin, using the parameters listed below.
Configuration Parameters
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| toxicity | object | Enables or disables prompt classification via moderation endpoint. | No | — |
| toxicity.enabled | boolean | Whether to enable the toxicity check. | No | false |
| toxicity.threshold | float | Threshold between 0.0 and 1.0 to flag the input. | No | 0 |
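
As an illustration of how the parameters above fit together, the following Python sketch builds a settings object with the documented names (`toxicity`, `toxicity.enabled`, `toxicity.threshold`) and checks the documented types and range. It is not the plugin's actual API: the helper names `build_toxicity_settings` and `is_flagged` are hypothetical, and the rule that an input is flagged when its score is greater than or equal to the threshold is an assumption, since the table only states that the threshold is used to flag the input. How these settings are attached to the neuraltrust_toxicity plugin entry is not shown here.

```python
import json


def build_toxicity_settings(enabled: bool = False, threshold: float = 0.0) -> dict:
    """Build a settings object using the documented parameter names.

    Defaults mirror the table: enabled=false, threshold=0.
    This helper is hypothetical and only validates types and range.
    """
    if not isinstance(enabled, bool):
        raise TypeError("toxicity.enabled must be a boolean")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError("toxicity.threshold must be a number")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("toxicity.threshold must be between 0.0 and 1.0")
    return {"toxicity": {"enabled": enabled, "threshold": float(threshold)}}


def is_flagged(score: float, settings: dict) -> bool:
    """Hypothetical check: ASSUMES an input is flagged when score >= threshold."""
    toxicity = settings.get("toxicity", {})
    if not toxicity.get("enabled", False):
        # A disabled check never flags anything.
        return False
    return score >= toxicity.get("threshold", 0.0)


settings = build_toxicity_settings(enabled=True, threshold=0.7)
print(json.dumps(settings, indent=2))
```

Under the assumed comparison rule, a threshold of 0 (the documented default) would flag every input once the check is enabled, so setting an explicit threshold alongside `enabled` is likely worthwhile.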