# OpenAI Toxicity Detection

## Technical Overview
The OpenAI Toxicity Detection plugin implements real-time content moderation using OpenAI's moderation API. It processes both text and image content through a multi-stage analysis pipeline.
## Core Components

- **Content Extractor**
  - Processes multiple message types
  - Handles text and image URL content
  - Supports structured message formats
  - Maintains content context
- **Moderation Engine**
  - Real-time API integration
  - Batch processing capability
  - Configurable thresholds
  - Category-specific scoring
- **Response Analyzer**
  - Score evaluation
  - Threshold comparison
  - Category aggregation
  - Detailed violation reporting
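To make the division of responsibilities concrete, the three components can be pictured as interfaces along the following lines. This is an illustrative sketch in Go; the type and method names are hypothetical, not the plugin's actual API.

```go
package toxicity

// Hypothetical interfaces sketching the three-stage pipeline described
// above; names and signatures are illustrative, not the plugin's actual API.

// ModerationResult mirrors the fields read from the moderation API response.
type ModerationResult struct {
	Flagged        bool
	CategoryScores map[string]float64
}

// ContentExtractor pulls moderatable text and image URLs out of a request.
type ContentExtractor interface {
	Extract(body []byte) (texts []string, imageURLs []string, err error)
}

// ModerationEngine submits extracted content to the moderation API.
type ModerationEngine interface {
	Moderate(texts []string, imageURLs []string) (*ModerationResult, error)
}

// ResponseAnalyzer compares category scores against configured thresholds
// and reports any violations.
type ResponseAnalyzer interface {
	Analyze(result *ModerationResult, thresholds map[string]float64) (blocked bool, violations []string)
}
```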
## Implementation Details

### Message Processing

The plugin processes messages in the following format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "message content"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}
```
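A minimal extractor for this format might look like the following sketch (Go; the struct and function names are illustrative assumptions, not the plugin's actual code):

```go
package toxicity

import "encoding/json"

// Structs mirroring the message format shown above; illustrative names.
type requestBody struct {
	Messages []struct {
		Role    string        `json:"role"`
		Content []contentPart `json:"content"`
	} `json:"messages"`
}

type contentPart struct {
	Type     string `json:"type"`
	Text     string `json:"text,omitempty"`
	ImageURL *struct {
		URL string `json:"url"`
	} `json:"image_url,omitempty"`
}

// extractContent collects text fragments and image URLs from a request body.
func extractContent(body []byte) (texts, imageURLs []string, err error) {
	var req requestBody
	if err = json.Unmarshal(body, &req); err != nil {
		return nil, nil, err
	}
	for _, msg := range req.Messages {
		for _, part := range msg.Content {
			switch part.Type {
			case "text":
				texts = append(texts, part.Text)
			case "image_url":
				if part.ImageURL != nil {
					imageURLs = append(imageURLs, part.ImageURL.URL)
				}
			}
		}
	}
	return texts, imageURLs, nil
}
```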
### Content Types

- **Text Content**
  - Direct text analysis
  - Multi-message support
  - UTF-8 encoding
  - Length validation
- **Image Content**
  - URL-based processing
  - Image format validation
  - Size restrictions
  - Accessibility checks
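The image checks listed above could be approximated as follows. This is an illustrative sketch: the accepted file extensions and the HEAD-request accessibility probe are assumptions, not the plugin's documented behavior.

```go
package toxicity

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// validateImageURL applies the checks listed above: a well-formed HTTPS URL,
// a recognized image extension, and a reachability probe.
// Illustrative sketch only; the real validation rules may differ.
func validateImageURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme != "https" {
		return fmt.Errorf("image URL must be a valid https URL: %q", raw)
	}
	lower := strings.ToLower(u.Path)
	if !strings.HasSuffix(lower, ".jpg") && !strings.HasSuffix(lower, ".jpeg") &&
		!strings.HasSuffix(lower, ".png") && !strings.HasSuffix(lower, ".webp") {
		return fmt.Errorf("unsupported image format: %q", u.Path)
	}
	resp, err := http.Head(raw)
	if err != nil {
		return fmt.Errorf("image URL not reachable: %w", err)
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("image URL returned status %d", resp.StatusCode)
	}
	return nil
}
```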
### Moderation Categories

The plugin supports comprehensive content analysis across multiple categories:

| Category | Description | Implementation Details |
|---|---|---|
| Sexual | Sexual content detection | Base category scoring<br>Sub-category detection<br>Context analysis |
| Violence | Violence and threats | Direct violence detection<br>Graphic content analysis<br>Threat assessment |
| Hate | Hate speech and bias | Bias detection<br>Discriminatory content<br>Hate speech patterns |
| Self-harm | Self-harm content | Intent detection<br>Instruction filtering<br>Risk assessment |
| Harassment | Harassment detection | Personal attacks<br>Threatening behavior<br>Bullying patterns |
| Illicit | Illegal activity | Criminal content<br>Prohibited activities<br>Legal compliance |
### API Integration

The plugin integrates with OpenAI's moderation API:

**Request Formation**

```json
{
  "input": [
    {
      "type": "text",
      "text": "content to moderate"
    }
  ],
  "model": "omni-moderation-latest"
}
```

**Response Processing**

```json
{
  "id": "modr-123",
  "model": "omni-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "sexual": false,
        "violence": true
      },
      "category_scores": {
        "sexual": 0.01,
        "violence": 0.92
      }
    }
  ]
}
```
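Putting the two shapes together, a bare-bones moderation call might look like the sketch below (Go; error handling is minimal and only text input is shown). The endpoint and payload follow OpenAI's documented `/v1/moderations` contract; everything else is illustrative.

```go
package toxicity

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type moderationRequest struct {
	Input []map[string]string `json:"input"`
	Model string              `json:"model"`
}

type moderationResponse struct {
	Results []struct {
		Flagged        bool               `json:"flagged"`
		Categories     map[string]bool    `json:"categories"`
		CategoryScores map[string]float64 `json:"category_scores"`
	} `json:"results"`
}

// moderate sends one text input to OpenAI's moderation endpoint and
// decodes the category scores from the response.
func moderate(apiKey, text string) (*moderationResponse, error) {
	payload, err := json.Marshal(moderationRequest{
		Input: []map[string]string{{"type": "text", "text": text}},
		Model: "omni-moderation-latest",
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", "https://api.openai.com/v1/moderations", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("moderation API returned status %d", resp.StatusCode)
	}
	var out moderationResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```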
### Error Handling

The plugin implements comprehensive error handling:

- **Configuration Validation**
  - API key verification
  - Action type validation
  - Threshold validation
  - Category validation
- **Runtime Error Handling**
  - API connection errors
  - Response parsing errors
  - Timeout handling
  - Rate limit management
- **Content Processing Errors**
  - Invalid content format
  - Missing required fields
  - Size limit violations
  - Encoding issues
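The configuration checks in the first group might reduce to something like this sketch (field names follow the configuration reference below; the exact validation rules are illustrative assumptions):

```go
package toxicity

import "fmt"

// Config mirrors the plugin settings documented in the configuration
// reference; illustrative, not the plugin's actual struct.
type Config struct {
	OpenAIKey  string             `json:"openai_key"`
	Categories []string           `json:"categories"`
	Thresholds map[string]float64 `json:"thresholds"`
	Actions    struct {
		Type    string `json:"type"`
		Message string `json:"message"`
	} `json:"actions"`
}

// validate checks the settings the plugin cannot run without.
func (c *Config) validate() error {
	if c.OpenAIKey == "" {
		return fmt.Errorf("openai_key is required")
	}
	// "block" is the action type documented here; the plugin may accept others.
	if c.Actions.Type != "block" {
		return fmt.Errorf("unsupported action type %q", c.Actions.Type)
	}
	for cat, t := range c.Thresholds {
		if t < 0 || t > 1 {
			return fmt.Errorf("threshold for %q must be between 0 and 1, got %v", cat, t)
		}
	}
	for _, cat := range c.Categories {
		if _, ok := c.Thresholds[cat]; !ok {
			return fmt.Errorf("category %q has no configured threshold", cat)
		}
	}
	return nil
}
```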
### Performance Optimizations

- **Request Processing**
  - Batch message processing
  - Efficient JSON parsing
  - Minimal memory allocation
  - Request pooling
- **Response Handling**
  - Streaming response processing
  - Efficient score calculation
  - Early termination
  - Result caching
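As one illustration of the response-side optimizations, result caching could be keyed on a hash of the moderated content so identical inputs skip the API round trip. The sketch below is hypothetical; note that the plugin is described later as stateless between requests, so a cache like this would be an optional layer rather than core behavior.

```go
package toxicity

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// resultCache memoizes moderation verdicts by content hash (hypothetical).
type resultCache struct {
	mu      sync.RWMutex
	entries map[string]bool // content hash -> flagged verdict
}

func newResultCache() *resultCache {
	return &resultCache{entries: make(map[string]bool)}
}

func cacheKey(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

func (c *resultCache) get(content string) (flagged, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	flagged, ok = c.entries[cacheKey(content)]
	return
}

func (c *resultCache) put(content string, flagged bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[cacheKey(content)] = flagged
}
```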
## Configuration Reference

### Required Settings

```json
{
  "openai_key": "YOUR_API_KEY",
  "actions": {
    "type": "block",
    "message": "Content violation detected"
  },
  "categories": ["sexual", "violence", "hate"],
  "thresholds": {
    "sexual": 0.3,
    "violence": 0.5,
    "hate": 0.4
  }
}
```
### Advanced Options
- Custom error messages
- Category-specific actions
- Threshold adjustments
- Logging configuration
## Monitoring and Metrics
The plugin provides detailed monitoring capabilities:
- Request/response logging
- Category score tracking
- Error rate monitoring
- Performance metrics
## Features

| Feature | Capabilities |
|---|---|
| Multi-Category Detection | Comprehensive content analysis across multiple categories (sexual, violence, hate, etc.)<br>Real-time detection with configurable sensitivity levels<br>Customizable thresholds per category |
| Flexible Actions | Configurable response actions<br>Custom error messages<br>Block or allow decisions |
| OpenAI Integration | Powered by OpenAI's moderation API<br>Real-time content analysis<br>High accuracy detection |
| Request Stage Processing | Pre-request content analysis<br>Configurable priority in plugin chain<br>Non-blocking architecture |
## How It Works

### Content Analysis
The plugin analyzes incoming requests by examining both text and image content for various types of toxic or inappropriate material. For text content, it processes the content directly through OpenAI's moderation API. For images, it can analyze image URLs provided in the request. The results are evaluated against configured thresholds:
```json
// Example Request Content - Text
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Let's discuss this topic respectfully"
        }
      ]
    }
  ]
}
```

```json
// OpenAI Moderation API Response (Internal)
{
  "results": [
    {
      "category_scores": {
        "sexual": 0.0001,
        "violence": 0.0002,
        "hate": 0.0001
      }
    }
  ]
}
```
### Threshold Evaluation

Each category has its own configurable threshold. Content is blocked if any category's score exceeds its threshold:

```json
{
  "thresholds": {
    "sexual": 0.3,   // Block if sexual content score > 0.3
    "violence": 0.5, // Block if violence score > 0.5
    "hate": 0.4      // Block if hate speech score > 0.4
  }
}
```
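In code, the evaluation reduces to a single pass over the configured thresholds (a sketch; `scores` stands for the `category_scores` map from the API response):

```go
package toxicity

// exceedsThresholds reports which configured categories scored above their
// limits; any violation means the content is blocked.
func exceedsThresholds(scores, thresholds map[string]float64) (blocked bool, violations []string) {
	for category, limit := range thresholds {
		if score, ok := scores[category]; ok && score > limit {
			violations = append(violations, category)
		}
	}
	return len(violations) > 0, violations
}
```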
### Action Execution

Based on the evaluation results, the plugin takes the configured action:

```json
{
  "actions": {
    "type": "block",
    "message": "Content contains inappropriate content."
  }
}
```
## Configuration Examples

### Basic Configuration

A simple configuration that enables toxicity detection with default settings:

```json
{
  "name": "toxicity_detection",
  "enabled": true,
  "stage": "pre_request",
  "priority": 1,
  "settings": {
    "openai_key": "${OPENAI_API_KEY}",
    "actions": {
      "type": "block",
      "message": "Content contains inappropriate content."
    },
    "categories": [
      "sexual",
      "violence",
      "hate"
    ],
    "thresholds": {
      "sexual": 0.3,
      "violence": 0.5,
      "hate": 0.4
    }
  }
}
```
Key components of the basic configuration:

**Plugin Settings**

| Property | Description | Required | Default |
|---|---|---|---|
| name | Plugin identifier | Yes | "toxicity_detection" |
| enabled | Enable/disable plugin | Yes | true |
| stage | Processing stage | Yes | "pre_request" |
| priority | Plugin execution priority | Yes | 1 |

**Category Thresholds**

| Category | Description | Default Threshold | Impact |
|---|---|---|---|
| sexual | Sexual content detection | 0.3 | Lower values = stricter filtering |
| violence | Violence detection | 0.5 | Higher values = more permissive |
| hate | Hate speech detection | 0.4 | Balance based on needs |
This configuration:

- Enables detection for the sexual, violence, and hate categories
- Applies a strict 0.3 threshold for sexual content, a moderate 0.5 for violence, and 0.4 for hate speech
- Blocks violating requests with a generic error message
- Runs at priority 1 in the pre-request stage
## Best Practices

### Threshold Configuration

- **Content Policy Alignment**
  - Set thresholds according to your content policy
  - Consider your audience and use case
  - Test thresholds with sample content
- **Category Selection**
  - Enable relevant categories for your use case
  - Consider regulatory requirements
  - Balance between safety and usability
- **Performance Considerations**
  - Set appropriate plugin priority
  - Consider API rate limits
  - Monitor response times
### Security Considerations

- **API Key Management**
  - Secure storage of the OpenAI API key
  - Regular key rotation
  - Access control for configuration changes
- **Logging and Monitoring**
  - Enable appropriate logging
  - Monitor blocked content patterns
  - Adjust thresholds regularly
## Performance Considerations
The Toxicity Detection plugin uses a straightforward HTTP client implementation to interact with OpenAI's moderation API. The plugin processes requests sequentially, making direct API calls to OpenAI's moderation endpoint for each incoming request. The implementation includes comprehensive logging at various levels (debug, info, error) to help track and diagnose the plugin's behavior.
The plugin performs efficient JSON processing by unmarshaling only the required fields from the request and response bodies. It concatenates multiple messages with newlines when needed and processes them in a single API call to OpenAI, which helps reduce the number of API requests when handling multi-message content.
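The concatenation step can be as simple as joining the extracted fragments before the single API call (an illustrative sketch):

```go
package toxicity

import "strings"

// combineMessages joins all extracted text fragments with newlines so a
// multi-message request is moderated in one API call.
func combineMessages(texts []string) string {
	return strings.Join(texts, "\n")
}
```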
The plugin's architecture is designed to be lightweight, with minimal memory overhead as it doesn't maintain any state between requests. However, be aware that each request will incur the latency of an HTTP call to OpenAI's API. Consider this when planning your rate limits and timeout configurations, as the total processing time will largely depend on OpenAI's API response time.