NeuralTrust Content Moderation
NeuralTrust Moderation is a comprehensive content filtering system designed to protect your AI gateway from potentially harmful, inappropriate, or unwanted content. It employs multiple layers of content analysis to ensure robust protection while maintaining high performance.
Overview
The NeuralTrust Moderation plugin offers three powerful moderation approaches that can be used independently or in combination:
- Embedding-Based Moderation: Uses semantic similarity to detect content similar to predefined deny samples
- Keyword & Regex Moderation: Employs fuzzy keyword matching and regular expression pattern matching
- LLM-Based Moderation: Leverages large language models to analyze content for policy violations
This multi-layered approach provides comprehensive protection against various types of harmful content, from simple keyword matching to sophisticated semantic analysis.
Features
Embedding-Based Moderation
- Semantic Similarity Detection: Identifies content similar in meaning to predefined deny samples
- Configurable Threshold: Adjust sensitivity to balance protection and false positives
- Vector Database Integration: Efficiently stores and searches embeddings
- Multiple Embedding Providers: Support for various embedding models
Keyword & Regex Moderation
- Fuzzy Keyword Matching: Detects similar words using Levenshtein distance
- Configurable Similarity Threshold: Adjust sensitivity for word matching
- Case-Insensitive Matching: Catches variations regardless of capitalization
- Regular Expression Support: Complex pattern matching for sophisticated detection
- Pre-compiled Patterns: Optimized for performance
LLM-Based Moderation
- AI-Powered Content Analysis: Uses LLMs to detect policy violations
- Multiple Provider Support: Compatible with OpenAI and Gemini models
- Customizable Instructions: Define specific content policies
- Structured Response Format: Clear categorization of detected issues
Additional Features
- Customizable Actions: Configure how to handle detected violations
- Detailed Error Reporting: Clear explanations of why content was blocked
- Performance Optimization: Efficient processing for minimal latency
- Fingerprint Tracking: Monitor and manage repeated violation attempts
Configuration
The NeuralTrust Moderation plugin can be configured with various options to suit your specific needs:
Common Configuration Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
mapping_field | string | JSON path to extract content from request body (e.g., “prompt” or “messages.content”) | No | — |
retention_period | integer | Time in seconds to retain fingerprint data for violation tracking | No | 60 |
Embedding-Based Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
embedding_param_bag.enabled | boolean | Whether to enable embedding-based moderation | No | false |
embedding_param_bag.threshold | float | Similarity threshold (0.0-1.0) to flag content | Yes | — |
embedding_param_bag.deny_topic_action | string | Action to take when content matches deny samples (only “block” supported) | Yes | — |
embedding_param_bag.deny_samples | array | List of text samples to block (semantic similarity comparison) | No | [] |
embedding_param_bag.embeddings_config.provider | string | Embedding provider (e.g., “openai”) | Yes | — |
embedding_param_bag.embeddings_config.model | string | Model to use for generating embeddings | Yes | — |
embedding_param_bag.embeddings_config.credentials | object | Credentials for the embedding service | Yes | — |
Keyword & Regex Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
key_reg_param_bag.enabled | boolean | Whether to enable keyword and regex moderation | No | false |
key_reg_param_bag.similarity_threshold | float | Word similarity threshold (0.0-1.0) | No | 0.8 |
key_reg_param_bag.keywords | array | List of keywords to block | No | [] |
key_reg_param_bag.regex | array | List of regex patterns to block | No | [] |
key_reg_param_bag.actions.type | string | Action to take when content is blocked (only “block” supported) | Yes | — |
key_reg_param_bag.actions.message | string | Custom message for blocked content | No | — |
LLM-Based Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
llm_param_bag.enabled | boolean | Whether to enable LLM-based moderation | No | false |
llm_param_bag.provider | string | LLM provider (“openai” or “gemini”) | Yes | — |
llm_param_bag.model | string | Model to use for content analysis | Yes | — |
llm_param_bag.max_tokens | integer | Maximum tokens for LLM response | No | 1000 |
llm_param_bag.instructions | string | Custom instructions for content analysis | No | — |
llm_param_bag.credentials | object | Credentials for the LLM service | Yes | — |
Configuration Examples
Basic Embedding-Based Moderation
Keyword & Regex Moderation for Security
LLM-Based Moderation
Comprehensive Multi-Layer Protection
Error Responses
When moderated content is detected, the plugin returns a 403 Forbidden error with a message indicating the reason for blocking:
Embedding-Based Moderation Error
Keyword & Regex Moderation Error
or
LLM-Based Moderation Error
Best Practices
Embedding-Based Moderation
-
Carefully Select Deny Samples
- Choose clear examples of content you want to block
- Include variations to improve detection coverage
- Keep samples focused on specific categories of harmful content
-
Threshold Tuning
- Start with a threshold around 0.8
- Increase for fewer false positives (stricter matching)
- Decrease for broader protection (may increase false positives)
-
Embedding Model Selection
- Choose models with good semantic understanding
- Consider performance vs. accuracy tradeoffs
- Test with your specific use cases
Keyword & Regex Moderation
-
Keyword Selection
- Start with a focused list of clearly harmful terms
- Avoid overly common words that may cause false positives
- Consider language variations and common misspellings
- Regularly update keywords based on new threats
-
Pattern Crafting
- Use specific regex patterns targeting known attack vectors
- Test patterns thoroughly before deployment
- Consider performance impact of complex patterns
- Document pattern purposes for maintenance
-
Similarity Threshold Tuning
- Start with the default 0.8 threshold
- Increase for stricter matching
- Lower to catch more variations
- Monitor false positive/negative rates
LLM-Based Moderation
-
Provider and Model Selection
- Choose models with strong understanding of policy violations
- Consider latency requirements for your application
- Test different models to find the best balance of accuracy and performance
-
Instruction Crafting
- Be specific about what constitutes a violation
- Include examples if possible
- Clearly define categories of prohibited content
-
Response Handling
- Implement appropriate error messages for users
- Consider logging violations for analysis
- Monitor false positive rates
Performance Considerations
The NeuralTrust Moderation plugin is designed for optimal performance through several key optimizations:
-
Efficient Resource Management
- Memory pooling for reduced allocations
- Pre-compiled regex patterns
- Optimized string comparison algorithms
-
Parallel Processing
- Concurrent execution of different moderation methods
- Early termination when violations are detected
- Efficient context cancellation
-
Caching and Reuse
- Embedding caching for similar content
- Fingerprint tracking to identify repeat offenders
- Pooled resources for reduced memory pressure
-
Scalability
- Linear scaling with increasing workloads
- Minimal CPU utilization
- Low memory footprint
These optimizations ensure the plugin can handle high throughput while maintaining robust content filtering capabilities.