NeuralTrust Guardrail includes comprehensive toxicity detection capabilities as part of its defense system. This feature is designed to identify and block harmful, offensive, or inappropriate content before it reaches the underlying model.

Toxicity & Content Moderation

We identify, among others:

| Category   | Description |
|------------|-------------|
| Harassment | Detection of abusive, threatening, or targeted harassment |
| Hate       | Identification of hate speech or discriminatory language |
| Illicit    | Detection of illegal activities or promotion of criminal behavior |
| Self-harm  | Language promoting, instructing, or indicating intent of self-harm |
| Sexual     | Explicit or sexual content, including inappropriate references to minors |
| Violence   | Violent or graphic content, including threats or depictions of violence |

Configuration

The toxicity detection feature can be configured in the NeuralTrust Guardrail plugin:

{
    "name": "neuraltrust_guardrail",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "credentials": {
            "token": "{{ NEURALTRUST_API_KEY }}",
            "base_url": "https://data.neuraltrust.ai"
        },
        "toxicity":{
            "enabled": true,
            "threshold": 0.7
        }
    }
}
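
With this example configuration, any prompt whose toxicity score exceeds 0.7 is rejected before it reaches the model. As a rough mental model of that decision, the Python sketch below compares a score against the configured threshold and builds the same error body shown under Error Responses; the score itself is left as an input (in the plugin it comes from the NeuralTrust moderation service), so treat this as an illustration rather than the plugin's implementation:

from typing import Optional

# Matches toxicity.threshold in the example configuration above.
THRESHOLD = 0.7

def check_toxicity(score: float, threshold: float = THRESHOLD) -> Optional[dict]:
    """Return a 403-style error body if the score exceeds the threshold, else None.

    The score is assumed to already come from the moderation service; producing
    it is outside the scope of this sketch.
    """
    if score > threshold:
        return {
            "error": f"toxicity: score {score:.2f} exceeded threshold {threshold:.2f}",
            "retry_after": None,
        }
    return None  # below threshold: the request is passed through to the model

print(check_toxicity(0.85))  # {'error': 'toxicity: score 0.85 exceeded threshold 0.70', 'retry_after': None}
print(check_toxicity(0.30))  # None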

Configuration Parameters

| Parameter          | Type    | Description | Required | Default |
|--------------------|---------|-------------|----------|---------|
| toxicity           | object  | Settings object that controls prompt classification via the moderation endpoint. | No | - |
| toxicity.enabled   | boolean | Whether to enable the toxicity check. | No | false |
| toxicity.threshold | float   | Score threshold between 0.0 and 1.0 above which the input is flagged. | No | 0 |
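
For completeness, the sketch below shows one way to read these settings and apply the documented defaults in Python; the file name neuraltrust_guardrail.json and the helper load_toxicity_settings are hypothetical and only illustrate the defaults and the 0.0 to 1.0 range:

import json

def load_toxicity_settings(plugin_config: dict) -> dict:
    """Pull the toxicity settings out of a plugin config dict, applying documented defaults."""
    toxicity = plugin_config.get("settings", {}).get("toxicity", {})
    enabled = bool(toxicity.get("enabled", False))     # default: false
    threshold = float(toxicity.get("threshold", 0.0))  # default: 0
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"toxicity.threshold must be between 0.0 and 1.0, got {threshold}")
    return {"enabled": enabled, "threshold": threshold}

# Hypothetical file containing the JSON configuration shown above.
with open("neuraltrust_guardrail.json") as f:
    print(load_toxicity_settings(json.load(f)))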

Error Responses

When the toxicity score exceeds the configured threshold, the plugin blocks the request and returns a 403 Forbidden error:

{
  "error": "toxicity: score 0.85 exceeded threshold 0.70",
  "retry_after": null
}
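
Clients can therefore treat a 403 from the gateway as a guardrail block and surface the error field to the caller. The snippet below is a hedged example built on the requests library; the gateway URL and request payload are placeholders, not NeuralTrust-specific values:

import requests

# Hypothetical endpoint fronted by the NeuralTrust Guardrail plugin.
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"

resp = requests.post(
    GATEWAY_URL,
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)

if resp.status_code == 403:
    body = resp.json()
    # e.g. "toxicity: score 0.85 exceeded threshold 0.70"
    print(f"Blocked by guardrail: {body.get('error')}")
else:
    resp.raise_for_status()
    print(resp.json())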