Implementation Details

Message Processing

The plugin processes messages in the following format:

{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "message content"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
}
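
For reference, these payloads might be modeled with types along these lines (a sketch only; the struct and field names are assumptions, not the plugin's actual definitions):

package toxicity

// Illustrative types for the message format above.
type ImageURL struct {
	URL string `json:"url"`
}

type ContentPart struct {
	Type     string    `json:"type"`                // "text" or "image_url"
	Text     string    `json:"text,omitempty"`      // set when Type is "text"
	ImageURL *ImageURL `json:"image_url,omitempty"` // set when Type is "image_url"
}

type Message struct {
	Role    string        `json:"role"`
	Content []ContentPart `json:"content"`
}

type ChatRequest struct {
	Messages []Message `json:"messages"`
}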

Content Types

  1. Text Content
  • Direct text analysis
  • Multi-message support
  • UTF-8 encoding
  • Length validation
  2. Image Content
  • URL-based processing
  • Image format validation
  • Size restrictions
  • Accessibility checks
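
Both content types might be separated in a single pass along these lines (a sketch reusing the types above; extractContent is an illustrative name, and the length, format, size, and accessibility checks listed above are omitted for brevity):

// Sketch: split a request's parts into moderatable text and image URLs.
func extractContent(msgs []Message) (texts, imageURLs []string) {
	for _, m := range msgs {
		for _, part := range m.Content {
			switch part.Type {
			case "text":
				if part.Text != "" {
					texts = append(texts, part.Text)
				}
			case "image_url":
				if part.ImageURL != nil {
					imageURLs = append(imageURLs, part.ImageURL.URL)
				}
			}
		}
	}
	return texts, imageURLs
}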

Moderation Categories

The plugin supports comprehensive content analysis across multiple categories:

| Category | Description | Implementation Details |
|----------|-------------|------------------------|
| Sexual | Sexual content detection | - Base category scoring<br>- Sub-category detection<br>- Context analysis |
| Violence | Violence and threats | - Direct violence detection<br>- Graphic content analysis<br>- Threat assessment |
| Hate | Hate speech and bias | - Bias detection<br>- Discriminatory content<br>- Hate speech patterns |
| Self-harm | Self-harm content | - Intent detection<br>- Instruction filtering<br>- Risk assessment |
| Harassment | Harassment detection | - Personal attacks<br>- Threatening behavior<br>- Bullying patterns |
| Illicit | Illegal activity | - Criminal content<br>- Prohibited activities<br>- Legal compliance |

API Integration

The plugin integrates with OpenAI’s moderation API:

  1. Request Formation
{
    "input": [
        {
            "type": "text",
            "text": "content to moderate"
        }
    ],
    "model": "omni-moderation-latest"
}
  2. Response Processing
{
    "id": "modr-123",
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": true,
            "categories": {
                "sexual": false,
                "violence": true
            },
            "category_scores": {
                "sexual": 0.01,
                "violence": 0.92
            }
        }
    ]
}
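
Putting the two steps together, the round trip might look like this in Go (the endpoint, payload, and response fields match OpenAI's published moderation API; the surrounding function and type names are illustrative):

package toxicity

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ModerationResponse mirrors the response fields shown above.
type ModerationResponse struct {
	ID      string             `json:"id"`
	Model   string             `json:"model"`
	Results []ModerationResult `json:"results"`
}

type ModerationResult struct {
	Flagged        bool               `json:"flagged"`
	Categories     map[string]bool    `json:"categories"`
	CategoryScores map[string]float64 `json:"category_scores"`
}

// moderate sends one text input to the moderation endpoint.
func moderate(apiKey, text string) (*ModerationResponse, error) {
	payload, err := json.Marshal(map[string]any{
		"input": []map[string]string{{"type": "text", "text": text}},
		"model": "omni-moderation-latest",
	})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, "https://api.openai.com/v1/moderations", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("moderation API returned %s", resp.Status)
	}

	var out ModerationResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}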

🛠️ Error Handling

The plugin handles errors at three stages: configuration validation, runtime, and content processing.

🔧 Configuration Validation

  • ✅ API key verification
  • ✅ Action type validation
  • ✅ Threshold validation
  • ✅ Category validation
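
A sketch of these startup checks (the Config type is reconstructed from the configuration reference later in this document; the set of valid action types is an assumption based on the block/allow decisions described under Features):

package toxicity

import (
	"errors"
	"fmt"
)

// Reconstructed from the configuration reference; illustrative only.
type ActionConfig struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

type Config struct {
	OpenAIKey  string             `json:"openai_key"`
	Actions    ActionConfig       `json:"actions"`
	Categories []string           `json:"categories"`
	Thresholds map[string]float64 `json:"thresholds"`
}

var knownCategories = map[string]bool{
	"sexual": true, "violence": true, "hate": true,
	"self-harm": true, "harassment": true, "illicit": true,
}

// validateConfig performs the four checks listed above.
func validateConfig(cfg Config) error {
	if cfg.OpenAIKey == "" {
		return errors.New("openai_key is required")
	}
	if cfg.Actions.Type != "block" && cfg.Actions.Type != "allow" {
		return fmt.Errorf("unsupported action type %q", cfg.Actions.Type)
	}
	for cat, t := range cfg.Thresholds {
		if t < 0 || t > 1 {
			return fmt.Errorf("threshold for %q must be within [0, 1], got %v", cat, t)
		}
	}
	for _, cat := range cfg.Categories {
		if !knownCategories[cat] {
			return fmt.Errorf("unknown moderation category %q", cat)
		}
	}
	return nil
}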

⚠️ Runtime Error Handling

  • 🔌 API connection errors
  • 🔍 Response parsing errors
  • ⏱️ Timeout handling
  • 🚦 Rate limit management
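
Connection failures, timeouts, and rate limits might be funneled through a single helper (a sketch; the timeout value and the way back-off is signaled are assumptions):

package toxicity

import (
	"fmt"
	"net/http"
	"time"
)

// A bounded client timeout covers both connection errors and hung responses.
var httpClient = &http.Client{Timeout: 5 * time.Second}

// doModerationRequest executes the request and surfaces rate limiting explicitly.
func doModerationRequest(req *http.Request) (*http.Response, error) {
	resp, err := httpClient.Do(req)
	if err != nil {
		// Connection failures and client-side timeouts land here.
		return nil, fmt.Errorf("moderation request failed: %w", err)
	}
	if resp.StatusCode == http.StatusTooManyRequests {
		retryAfter := resp.Header.Get("Retry-After")
		resp.Body.Close()
		// Rate limited; Retry-After, when present, suggests how long to back off.
		return nil, fmt.Errorf("rate limited by moderation API (Retry-After: %s)", retryAfter)
	}
	return resp, nil
}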

📄 Content Processing Errors

  • ❌ Invalid content format
  • 🚫 Missing required fields
  • 📏 Size limit violations
  • 🧬 Encoding issues

🚀 Performance Optimizations

The plugin is designed with performance in mind, optimizing both request and response handling:

📥 Request Processing

  • 📦 Batch message processing
  • ⚡ Efficient JSON parsing
  • 🧠 Minimal memory allocation
  • 🔁 Request pooling

📤 Response Handling

  • 📡 Streaming response processing
  • 📊 Efficient score calculation
  • ⛔ Early termination when thresholds are met
  • 💾 Result caching for repeated evaluations

Configuration Reference

Required Settings

{
    "openai_key": "YOUR_API_KEY",
    "actions": {
        "type": "block",
        "message": "Content violation detected"
    },
    "categories": ["sexual", "violence", "hate"],
    "thresholds": {
        "sexual": 0.3,
        "violence": 0.5,
        "hate": 0.4
    }
}

Advanced Options

  • Custom error messages
  • Category-specific actions
  • Threshold adjustments
  • Logging configuration

Monitoring and Metrics

The plugin provides detailed monitoring capabilities:

  • Request/response logging
  • Category score tracking
  • Error rate monitoring
  • Performance metrics

Features

| Feature | Capabilities |
|---------|--------------|
| Multi-Category Detection | • Comprehensive content analysis across multiple categories (sexual, violence, hate, etc.)<br>• Real-time detection with configurable sensitivity levels<br>• Customizable thresholds per category |
| Flexible Actions | • Configurable response actions<br>• Custom error messages<br>• Block or allow decisions |
| OpenAI Integration | • Powered by OpenAI’s moderation API<br>• Real-time content analysis<br>• High accuracy detection |
| Request Stage Processing | • Pre-request content analysis<br>• Configurable priority in plugin chain<br>• Non-blocking architecture |

How It Works

Content Analysis

The plugin analyzes incoming requests by examining both text and image content for various types of toxic or inappropriate material. For text content, it processes the content directly through OpenAI’s moderation API. For images, it can analyze image URLs provided in the request. The results are evaluated against configured thresholds:

// Example Request Content - Text
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Let's discuss this topic respectfully"
                }
            ]
        }
    ]
}

// OpenAI Moderation API Response (Internal)
{
    "results": [
        {
            "category_scores": {
                "sexual": 0.0001,
                "violence": 0.0002,
                "hate": 0.0001
            }
        }
    ]
}

Threshold Evaluation

Each category has its own configurable threshold. Content is blocked if any category’s score exceeds its threshold:

{
    "thresholds": {
        "sexual": 0.3,    // Block if sexual content score > 0.3
        "violence": 0.5,  // Block if violence score > 0.5
        "hate": 0.4      // Block if hate speech score > 0.4
    }
}
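
In code, that evaluation reduces to a loop over the configured thresholds (a sketch; the function name is illustrative, and the strict "greater than" comparison follows the comments above):

package toxicity

import "fmt"

// evaluate blocks on the first category whose score exceeds its threshold.
func evaluate(scores, thresholds map[string]float64) (blocked bool, reason string) {
	for category, limit := range thresholds {
		if score, ok := scores[category]; ok && score > limit {
			// One violation is enough; remaining categories are skipped.
			return true, fmt.Sprintf("%s score %.2f exceeds threshold %.2f", category, score, limit)
		}
	}
	return false, ""
}

Against the moderation response shown earlier (violence at 0.92 with a threshold of 0.5), this check blocks the request.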

Action Execution

Based on the evaluation results, the plugin can take configured actions:

{
    "actions": {
        "type": "block",
        "message": "Content contains inappropriate content."
    }
}
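
Applied after evaluation, the action step might look like this (a sketch reusing the ActionConfig type from earlier; only the block action described in this document is handled):

package toxicity

import "errors"

// applyAction turns a block decision into the configured error message.
func applyAction(action ActionConfig, blocked bool) error {
	if blocked && action.Type == "block" {
		return errors.New(action.Message) // e.g. "Content contains inappropriate content."
	}
	return nil // allow the request to continue
}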

Configuration Examples

Basic Configuration

A simple configuration that enables toxicity detection with default settings:

{
    "name": "toxicity_detection",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "openai_key": "${OPENAI_API_KEY}",
        "actions": {
            "type": "block",
            "message": "Content contains inappropriate content."
        },
        "categories": [
            "sexual",
            "violence",
            "hate"
        ],
        "thresholds": {
            "sexual": 0.3,
            "violence": 0.5,
            "hate": 0.4
        }
    }
}

Key components of the basic configuration:

Plugin Settings

| Property | Description | Required | Default |
|----------|-------------|----------|---------|
| name | Plugin identifier | Yes | "toxicity_detection" |
| enabled | Enable/disable plugin | Yes | true |
| stage | Processing stage | Yes | "pre_request" |
| priority | Plugin execution priority | Yes | 1 |

Category Thresholds

| Category | Description | Default Threshold | Impact |
|----------|-------------|-------------------|--------|
| sexual | Sexual content detection | 0.3 | Lower values = stricter filtering |
| violence | Violence detection | 0.5 | Higher values = more permissive |
| hate | Hate speech detection | 0.4 | Balance based on needs |
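
Violence-Focused Configuration

A configuration can also target a single category. The sketch below matches the description that follows it; the log_level field is an assumption based on the "Logging configuration" advanced option, while the remaining fields mirror the basic configuration's schema:

{
    "name": "toxicity_detection",
    "enabled": true,
    "stage": "pre_request",
    "priority": 1,
    "settings": {
        "openai_key": "${OPENAI_API_KEY}",
        "actions": {
            "type": "block",
            "message": "Message contains violent content and cannot be processed."
        },
        "categories": ["violence"],
        "thresholds": {
            "violence": 0.5
        },
        "log_level": "warning"
    }
}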

This configuration:

  • Focuses solely on violence detection
  • Sets a moderate threshold of 0.5 for violent content
  • Provides a specific error message for violent content
  • Enables warning-level logging for monitoring

Best Practices

Threshold Configuration

  1. Content Policy Alignment:

    • Set thresholds according to your content policy

    • Consider your audience and use case

    • Test thresholds with sample content

  2. Category Selection:

    • Enable relevant categories for your use case

    • Consider regulatory requirements

    • Balance between safety and usability

  3. Performance Considerations:

    • Set appropriate plugin priority

    • Consider API rate limits

    • Monitor response times

Security Considerations

  1. API Key Management:

    • Secure storage of OpenAI API key

    • Regular key rotation

    • Access control for configuration changes

  2. Logging and Monitoring:

    • Enable appropriate logging

    • Monitor blocked content patterns

    • Regular threshold adjustments

Performance Considerations

The Toxicity Detection plugin uses a straightforward HTTP client implementation to interact with OpenAI’s moderation API. The plugin processes requests sequentially, making direct API calls to OpenAI’s moderation endpoint for each incoming request. The implementation includes comprehensive logging at various levels (debug, info, error) to help track and diagnose the plugin’s behavior.

The plugin performs efficient JSON processing by unmarshaling only the required fields from the request and response bodies. It concatenates multiple messages with newlines when needed and processes them in a single API call to OpenAI, which helps reduce the number of API requests when handling multi-message content.
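
The concatenation step described above might be as simple as the following (illustrative; reuses the moderate helper sketched earlier):

package toxicity

import "strings"

// combineTexts joins all extracted text parts so a multi-message request
// costs a single moderation API call.
func combineTexts(texts []string) string {
	return strings.Join(texts, "\n")
}
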

The plugin’s architecture is designed to be lightweight, with minimal memory overhead as it doesn’t maintain any state between requests. However, be aware that each request will incur the latency of an HTTP call to OpenAI’s API. Consider this when planning your rate limits and timeout configurations, as the total processing time will largely depend on OpenAI’s API response time.