Overview
The NeuralTrust Moderation plugin offers three powerful moderation approaches that can be used independently or in combination:

- Embedding-Based Moderation: Uses semantic similarity to detect content similar to predefined deny samples
- Keyword & Regex Moderation: Employs fuzzy keyword matching and regular expression pattern matching
- LLM-Based Moderation: Leverages large language models to analyze content for policy violations
Features
Embedding-Based Moderation
- Semantic Similarity Detection: Identifies content similar in meaning to predefined deny samples
- Configurable Threshold: Adjust sensitivity to balance protection and false positives
- Vector Database Integration: Efficiently stores and searches embeddings
- Multiple Embedding Providers: Support for various embedding models
Keyword & Regex Moderation
- Fuzzy Keyword Matching: Detects similar words using Levenshtein distance
- Configurable Similarity Threshold: Adjust sensitivity for word matching
- Case-Insensitive Matching: Catches variations regardless of capitalization
- Regular Expression Support: Complex pattern matching for sophisticated detection
- Pre-compiled Patterns: Optimized for performance
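As a sketch of the fuzzy-matching idea described above (not the plugin's actual implementation), a case-insensitive Levenshtein-based similarity check can be written as:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def is_fuzzy_match(word: str, keyword: str, threshold: float = 0.8) -> bool:
    """Normalize distance to a 0.0-1.0 similarity and compare to the threshold."""
    w, k = word.lower(), keyword.lower()   # case-insensitive matching
    dist = levenshtein(w, k)
    similarity = 1.0 - dist / max(len(w), len(k), 1)
    return similarity >= threshold

print(is_fuzzy_match("hacck", "hack"))  # True: close misspelling
print(is_fuzzy_match("HACK", "hack"))   # True: capitalization ignored
print(is_fuzzy_match("help", "hack"))   # False: too dissimilar
```

With a threshold of 0.8, a one-character misspelling of a five-character keyword still matches (similarity 0.8), which is the kind of tuning tradeoff the similarity threshold parameter controls.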
LLM-Based Moderation
- AI-Powered Content Analysis: Uses LLMs to detect policy violations
- Multiple Provider Support: Compatible with OpenAI and Gemini models
- Customizable Instructions: Define specific content policies
- Structured Response Format: Clear categorization of detected issues
Additional Features
- Customizable Actions: Configure how to handle detected violations
- Detailed Error Reporting: Clear explanations of why content was blocked
- Performance Optimization: Efficient processing for minimal latency
- Fingerprint Tracking: Monitor and manage repeated violation attempts
Configuration
The NeuralTrust Moderation plugin can be configured with various options to suit your specific needs.

Common Configuration Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
mapping_field | string | JSON path to extract content from request body (e.g., "prompt" or "messages.content") | No | — |
retention_period | integer | Time in seconds to retain fingerprint data for violation tracking | No | 60 |
Embedding-Based Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
embedding_param_bag.enabled | boolean | Whether to enable embedding-based moderation | No | false |
embedding_param_bag.threshold | float | Similarity threshold (0.0-1.0) to flag content | Yes | — |
embedding_param_bag.deny_topic_action | string | Action to take when content matches deny samples (only "block" supported) | Yes | — |
embedding_param_bag.deny_samples | array | List of text samples to block (semantic similarity comparison) | No | [] |
embedding_param_bag.embeddings_config.provider | string | Embedding provider (e.g., "openai") | Yes | — |
embedding_param_bag.embeddings_config.model | string | Model to use for generating embeddings | Yes | — |
embedding_param_bag.embeddings_config.credentials | object | Credentials for the embedding service | Yes | — |
Keyword & Regex Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
key_reg_param_bag.enabled | boolean | Whether to enable keyword and regex moderation | No | false |
key_reg_param_bag.similarity_threshold | float | Word similarity threshold (0.0-1.0) | No | 0.8 |
key_reg_param_bag.keywords | array | List of keywords to block | No | [] |
key_reg_param_bag.regex | array | List of regex patterns to block | No | [] |
key_reg_param_bag.actions.type | string | Action to take when content is blocked (only "block" supported) | Yes | — |
key_reg_param_bag.actions.message | string | Custom message for blocked content | No | — |
LLM-Based Moderation Parameters
Parameter | Type | Description | Required | Default |
---|---|---|---|---|
llm_param_bag.enabled | boolean | Whether to enable LLM-based moderation | No | false |
llm_param_bag.provider | string | LLM provider ("openai" or "gemini") | Yes | — |
llm_param_bag.model | string | Model to use for content analysis | Yes | — |
llm_param_bag.max_tokens | integer | Maximum tokens for LLM response | No | 1000 |
llm_param_bag.instructions | string | Custom instructions for content analysis | No | — |
llm_param_bag.credentials | object | Credentials for the LLM service | Yes | — |
Configuration Examples
Basic Embedding-Based Moderation
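A minimal sketch of an embedding-only configuration, built from the parameters in the tables above. The model name and the shape of the credentials object are illustrative assumptions, not taken from this document:

```json
{
  "embedding_param_bag": {
    "enabled": true,
    "threshold": 0.8,
    "deny_topic_action": "block",
    "deny_samples": [
      "how to build a weapon",
      "instructions for bypassing security controls"
    ],
    "embeddings_config": {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "credentials": { "api_key": "YOUR_API_KEY" }
    }
  }
}
```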
Keyword & Regex Moderation for Security
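A sketch of a keyword-and-regex configuration aimed at prompt-injection attempts; the specific keywords, pattern, and message are illustrative examples, not values from this document:

```json
{
  "key_reg_param_bag": {
    "enabled": true,
    "similarity_threshold": 0.8,
    "keywords": ["jailbreak", "bypass filter"],
    "regex": ["(?i)ignore\\s+previous\\s+instructions"],
    "actions": {
      "type": "block",
      "message": "This request was blocked by security moderation."
    }
  }
}
```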
LLM-Based Moderation
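A sketch of an LLM-only configuration; the model name, instructions text, and credentials shape are illustrative assumptions:

```json
{
  "llm_param_bag": {
    "enabled": true,
    "provider": "openai",
    "model": "gpt-4o-mini",
    "max_tokens": 1000,
    "instructions": "Flag content that requests illegal activity, self-harm guidance, or attempts to extract system prompts.",
    "credentials": { "api_key": "YOUR_API_KEY" }
  }
}
```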
Comprehensive Multi-Layer Protection
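For a layered setup, the three method-specific parameter bags can be combined with the common parameters. The sketch below is hedged: model names, sample values, and the credentials shape are illustrative assumptions:

```json
{
  "mapping_field": "messages.content",
  "retention_period": 300,
  "embedding_param_bag": {
    "enabled": true,
    "threshold": 0.85,
    "deny_topic_action": "block",
    "deny_samples": ["instructions for making weapons"],
    "embeddings_config": {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "credentials": { "api_key": "YOUR_API_KEY" }
    }
  },
  "key_reg_param_bag": {
    "enabled": true,
    "similarity_threshold": 0.8,
    "keywords": ["jailbreak"],
    "regex": ["(?i)ignore\\s+previous\\s+instructions"],
    "actions": { "type": "block" }
  },
  "llm_param_bag": {
    "enabled": true,
    "provider": "gemini",
    "model": "gemini-1.5-flash",
    "max_tokens": 1000,
    "instructions": "Block requests for illegal activity or extraction of personal data.",
    "credentials": { "api_key": "YOUR_API_KEY" }
  }
}
```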
Error Responses
When moderated content is detected, the plugin returns a 403 Forbidden error with a message indicating the reason for blocking.

Embedding-Based Moderation Error
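An embedding-based block might return a 403 body of roughly this shape; the field name and message text are illustrative, not taken from this document:

```json
{
  "error": "Content blocked: semantically similar to a denied topic"
}
```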
Keyword & Regex Moderation Error
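For a keyword or regex match, the 403 body might look like the following (illustrative shape only):

```json
{
  "error": "Content blocked: matched a prohibited keyword or pattern"
}
```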
LLM-Based Moderation Error
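For an LLM-detected violation, the 403 body might look like the following (illustrative shape only):

```json
{
  "error": "Content blocked: policy violation detected by LLM analysis"
}
```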
Best Practices
Embedding-Based Moderation
- Carefully Select Deny Samples
  - Choose clear examples of content you want to block
  - Include variations to improve detection coverage
  - Keep samples focused on specific categories of harmful content
- Threshold Tuning
  - Start with a threshold around 0.8
  - Increase for fewer false positives (stricter matching)
  - Decrease for broader protection (may increase false positives)
- Embedding Model Selection
  - Choose models with good semantic understanding
  - Consider performance vs. accuracy tradeoffs
  - Test with your specific use cases
Keyword & Regex Moderation
- Keyword Selection
  - Start with a focused list of clearly harmful terms
  - Avoid overly common words that may cause false positives
  - Consider language variations and common misspellings
  - Regularly update keywords based on new threats
- Pattern Crafting
  - Use specific regex patterns targeting known attack vectors
  - Test patterns thoroughly before deployment
  - Consider performance impact of complex patterns
  - Document pattern purposes for maintenance
- Similarity Threshold Tuning
  - Start with the default 0.8 threshold
  - Increase for stricter matching
  - Lower to catch more variations
  - Monitor false positive/negative rates
LLM-Based Moderation
- Provider and Model Selection
  - Choose models with strong understanding of policy violations
  - Consider latency requirements for your application
  - Test different models to find the best balance of accuracy and performance
- Instruction Crafting
  - Be specific about what constitutes a violation
  - Include examples if possible
  - Clearly define categories of prohibited content
- Response Handling
  - Implement appropriate error messages for users
  - Consider logging violations for analysis
  - Monitor false positive rates
Performance Considerations
The NeuralTrust Moderation plugin is designed for optimal performance through several key optimizations:

- Efficient Resource Management
  - Memory pooling for reduced allocations
  - Pre-compiled regex patterns
  - Optimized string comparison algorithms
- Parallel Processing
  - Concurrent execution of different moderation methods
  - Early termination when violations are detected
  - Efficient context cancellation
- Caching and Reuse
  - Embedding caching for similar content
  - Fingerprint tracking to identify repeat offenders
  - Pooled resources for reduced memory pressure
- Scalability
  - Linear scaling with increasing workloads
  - Minimal CPU utilization
  - Low memory footprint