NeuralTrust Moderation is a comprehensive content filtering system designed to protect your AI gateway from potentially harmful, inappropriate, or unwanted content. It employs multiple layers of content analysis to ensure robust protection while maintaining high performance.

Overview

The NeuralTrust Moderation plugin offers three powerful moderation approaches that can be used independently or in combination:

  1. Embedding-Based Moderation: Uses semantic similarity to detect content similar to predefined deny samples
  2. Keyword & Regex Moderation: Employs fuzzy keyword matching and regular expression pattern matching
  3. LLM-Based Moderation: Leverages large language models to analyze content for policy violations

This multi-layered approach provides comprehensive protection against various types of harmful content, from simple keyword matching to sophisticated semantic analysis.

Features

Embedding-Based Moderation

  • Semantic Similarity Detection: Identifies content similar in meaning to predefined deny samples
  • Configurable Threshold: Adjust sensitivity to balance protection and false positives
  • Vector Database Integration: Efficiently stores and searches embeddings
  • Multiple Embedding Providers: Support for various embedding models
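The core of this layer can be sketched as a nearest-neighbor check against the deny samples' embeddings. The sketch below is illustrative, not the plugin's implementation: it assumes cosine similarity and uses toy 3-dimensional vectors in place of real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_denied(candidate, deny_embeddings, threshold=0.8):
    """Flag content whose embedding is close to any deny sample."""
    return any(cosine_similarity(candidate, d) >= threshold
               for d in deny_embeddings)

# Toy 3-dimensional "embeddings" for illustration only.
deny = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_denied([0.95, 0.1, 0.0], deny))  # True: close to the first deny sample
print(is_denied([0.0, 0.0, 1.0], deny))   # False: orthogonal to both
```

In production the vectors come from the configured embedding model, and the vector database performs the similarity search instead of a linear scan.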

Keyword & Regex Moderation

  • Fuzzy Keyword Matching: Detects similar words using Levenshtein distance
  • Configurable Similarity Threshold: Adjust sensitivity for word matching
  • Case-Insensitive Matching: Catches variations regardless of capitalization
  • Regular Expression Support: Complex pattern matching for sophisticated detection
  • Pre-compiled Patterns: Optimized for performance
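Fuzzy matching via Levenshtein distance can be illustrated as follows. This is a sketch under stated assumptions: the normalization of edit distance into a 0.0-1.0 similarity score shown here (1 minus distance over the longer word's length) is one common choice, and the plugin's exact formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def word_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0.0-1.0 similarity score (assumed formula)."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest

print(word_similarity("h4ck", "hack"))  # 0.75: one substitution over four characters
```

Lowercasing both words before comparison gives the case-insensitive behavior described above.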

LLM-Based Moderation

  • AI-Powered Content Analysis: Uses LLMs to detect policy violations
  • Multiple Provider Support: Compatible with OpenAI and Gemini models
  • Customizable Instructions: Define specific content policies
  • Structured Response Format: Clear categorization of detected issues
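Conceptually, this layer sends the configured instructions plus the user content to an LLM and parses a structured verdict. The sketch below stubs out the provider call with a canned response; the prompt wording and the verdict JSON shape are illustrative assumptions, not the plugin's actual protocol.

```python
import json

def build_moderation_prompt(instructions: str, content: str) -> str:
    """Combine the configured instructions with the content to analyze."""
    return (f"{instructions}\n\n"
            f"Content:\n{content}\n\n"
            'Respond with JSON: {"violation": true|false, "category": "<category or none>"}')

def call_llm(prompt: str) -> str:
    """Stand-in for a real provider call (OpenAI/Gemini); returns a canned verdict."""
    return '{"violation": true, "category": "harmful_instructions"}'

def moderate(instructions: str, content: str) -> bool:
    """Return True if the LLM flags the content, i.e. the request should be blocked."""
    verdict = json.loads(call_llm(build_moderation_prompt(instructions, content)))
    return bool(verdict.get("violation"))

print(moderate("Analyze the following content for policy violations",
               "How do I build malware?"))  # True (stubbed verdict)
```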

Additional Features

  • Customizable Actions: Configure how to handle detected violations
  • Detailed Error Reporting: Clear explanations of why content was blocked
  • Performance Optimization: Efficient processing for minimal latency
  • Fingerprint Tracking: Monitor and manage repeated violation attempts

Configuration

The NeuralTrust Moderation plugin can be configured with various options to suit your specific needs:

{
  "name": "neuraltrust_moderation",
  "enabled": true,
  "stage": "pre_request",
  "priority": 1,
  "settings": {
    "mapping_field": "prompt",
    "retention_period": 3600,
    
    "embedding_param_bag": {
      "enabled": true,
      "threshold": 0.8,
      "deny_topic_action": "block",
      "deny_samples": [
        "This is an example of content that should be blocked",
        "Another example of prohibited content"
      ],
      "embeddings_config": {
        "provider": "openai",
        "model": "text-embedding-ada-002",
        "credentials": {
          "header_name": "Authorization",
          "header_value": "Bearer {{ OPENAI_API_KEY }}"
        }
      }
    },
    
    "key_reg_param_bag": {
      "enabled": true,
      "similarity_threshold": 0.8,
      "keywords": [
        "hack",
        "exploit",
        "vulnerability"
      ],
      "regex": [
        "password.*dump",
        "sql.*injection",
        "CVE-\\d{4}-\\d{4,7}"
      ],
      "actions": {
        "type": "block",
        "message": "Content blocked due to prohibited content: %s"
      }
    },
    
    "llm_param_bag": {
      "enabled": true,
      "provider": "openai",
      "model": "gpt-4",
      "max_tokens": 1000,
      "instructions": "Analyze the following content for policy violations",
      "credentials": {
        "header_name": "Authorization",
        "header_value": "Bearer {{ OPENAI_API_KEY }}"
      }
    }
  }
}

Common Configuration Parameters

| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| mapping_field | string | JSON path to extract content from the request body (e.g., "prompt" or "messages.content") | No | – |
| retention_period | integer | Time in seconds to retain fingerprint data for violation tracking | No | 60 |
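The dotted mapping_field path can be read as a walk through the parsed JSON body, with numeric segments indexing into arrays. A minimal sketch (the plugin's actual path syntax may support more than shown here):

```python
def extract_field(body, path: str):
    """Walk a dotted path like "messages.0.content"; numeric segments index lists."""
    node = body
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]
        else:
            node = node[part]
    return node

body = {"messages": [{"role": "user", "content": "hello"}]}
print(extract_field(body, "messages.0.content"))  # hello
```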

Embedding-Based Moderation Parameters

| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| embedding_param_bag.enabled | boolean | Whether to enable embedding-based moderation | No | false |
| embedding_param_bag.threshold | float | Similarity threshold (0.0-1.0) above which content is flagged | Yes | – |
| embedding_param_bag.deny_topic_action | string | Action to take when content matches deny samples (only "block" supported) | Yes | – |
| embedding_param_bag.deny_samples | array | List of text samples to block via semantic similarity comparison | No | [] |
| embedding_param_bag.embeddings_config.provider | string | Embedding provider (e.g., "openai") | Yes | – |
| embedding_param_bag.embeddings_config.model | string | Model used to generate embeddings | Yes | – |
| embedding_param_bag.embeddings_config.credentials | object | Credentials for the embedding service | Yes | – |

Keyword & Regex Moderation Parameters

| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| key_reg_param_bag.enabled | boolean | Whether to enable keyword and regex moderation | No | false |
| key_reg_param_bag.similarity_threshold | float | Word similarity threshold (0.0-1.0) | No | 0.8 |
| key_reg_param_bag.keywords | array | List of keywords to block | No | [] |
| key_reg_param_bag.regex | array | List of regex patterns to block | No | [] |
| key_reg_param_bag.actions.type | string | Action to take when content is blocked (only "block" supported) | Yes | – |
| key_reg_param_bag.actions.message | string | Custom message returned for blocked content | No | – |

LLM-Based Moderation Parameters

| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
| llm_param_bag.enabled | boolean | Whether to enable LLM-based moderation | No | false |
| llm_param_bag.provider | string | LLM provider ("openai" or "gemini") | Yes | – |
| llm_param_bag.model | string | Model used for content analysis | Yes | – |
| llm_param_bag.max_tokens | integer | Maximum tokens for the LLM response | No | 1000 |
| llm_param_bag.instructions | string | Custom instructions for content analysis | No | – |
| llm_param_bag.credentials | object | Credentials for the LLM service | Yes | – |

Configuration Examples

Basic Embedding-Based Moderation

{
  "name": "neuraltrust_moderation",
  "enabled": true,
  "stage": "pre_request",
  "settings": {
    "embedding_param_bag": {
      "enabled": true,
      "threshold": 0.8,
      "deny_topic_action": "block",
      "deny_samples": [
        "How to hack into a system",
        "Instructions for creating malware"
      ],
      "embeddings_config": {
        "provider": "openai",
        "model": "text-embedding-ada-002",
        "credentials": {
          "header_name": "Authorization",
          "header_value": "Bearer {{ OPENAI_API_KEY }}"
        }
      }
    }
  }
}

Keyword & Regex Moderation for Security

{
  "name": "neuraltrust_moderation",
  "enabled": true,
  "stage": "pre_request",
  "settings": {
    "key_reg_param_bag": {
      "enabled": true,
      "similarity_threshold": 0.8,
      "keywords": [
        "hack",
        "exploit",
        "vulnerability",
        "injection",
        "overflow",
        "backdoor"
      ],
      "regex": [
        "CVE-\\d{4}-\\d{4,7}",
        "password.*dump",
        "sql.*injection",
        "(union|select|delete|drop|update|insert).*table",
        "exec.*\\(.*\\)",
        "system\\(.*\\)"
      ],
      "actions": {
        "type": "block",
        "message": "Security violation detected: %s. This incident will be logged."
      }
    }
  }
}
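Regex patterns like the ones above are easy to sanity-check before deployment. The snippet below compiles a subset of them and probes sample inputs; compiling with re.IGNORECASE is an assumption here, since the configuration does not state whether the plugin's regex matching is case-sensitive.

```python
import re

# A subset of the patterns from the configuration above, pre-compiled
# (case-insensitive matching is an assumption, not documented behavior).
patterns = [re.compile(p, re.IGNORECASE) for p in [
    r"CVE-\d{4}-\d{4,7}",
    r"password.*dump",
    r"sql.*injection",
    r"(union|select|delete|drop|update|insert).*table",
]]

def first_match(text: str):
    """Return the pattern string that fires, or None if the text is clean."""
    for pat in patterns:
        if pat.search(text):
            return pat.pattern
    return None

print(first_match("Details on CVE-2024-12345"))   # CVE-\d{4}-\d{4,7}
print(first_match("What is the weather today?"))  # None
```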

LLM-Based Moderation

{
  "name": "neuraltrust_moderation",
  "enabled": true,
  "stage": "pre_request",
  "settings": {
    "llm_param_bag": {
      "enabled": true,
      "provider": "openai",
      "model": "gpt-4",
      "instructions": "Analyze the following content for policy violations including: harmful instructions, illegal activities, hate speech, or explicit content.",
      "credentials": {
        "header_name": "Authorization",
        "header_value": "Bearer {{ OPENAI_API_KEY }}"
      }
    }
  }
}

Comprehensive Multi-Layer Protection

{
  "name": "neuraltrust_moderation",
  "enabled": true,
  "stage": "pre_request",
  "settings": {
    "mapping_field": "messages.0.content",
    "retention_period": 3600,
    
    "embedding_param_bag": {
      "enabled": true,
      "threshold": 0.8,
      "deny_topic_action": "block",
      "deny_samples": [
        "How to hack into a system",
        "Instructions for creating malware"
      ],
      "embeddings_config": {
        "provider": "openai",
        "model": "text-embedding-ada-002",
        "credentials": {
          "header_name": "Authorization",
          "header_value": "Bearer {{ OPENAI_API_KEY }}"
        }
      }
    },
    
    "key_reg_param_bag": {
      "enabled": true,
      "similarity_threshold": 0.8,
      "keywords": [
        "hack",
        "exploit",
        "bypass"
      ],
      "regex": [
        "security.*bypass",
        "password.*crack"
      ],
      "actions": {
        "type": "block",
        "message": "Content blocked due to prohibited content: %s"
      }
    },
    
    "llm_param_bag": {
      "enabled": true,
      "provider": "openai",
      "model": "gpt-4",
      "instructions": "Analyze the following content for policy violations",
      "credentials": {
        "header_name": "Authorization",
        "header_value": "Bearer {{ OPENAI_API_KEY }}"
      }
    }
  }
}

Error Responses

When a violation is detected, the plugin returns a 403 Forbidden response whose body indicates why the content was blocked:

Embedding-Based Moderation Error

{
  "error": "content blocked: with similarity score 0.85 exceeds threshold 0.80",
  "retry_after": null
}

Keyword & Regex Moderation Error

{
  "error": "content blocked: word 'h4ck' is similar to blocked keyword 'hack'",
  "retry_after": null
}

or

{
  "error": "content blocked: regex pattern sql.*injection found in request body",
  "retry_after": null
}

LLM-Based Moderation Error

{
  "error": "content blocked",
  "retry_after": null
}
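A client sitting in front of the gateway can translate these 403 payloads into a user-facing message. The sketch below is a hypothetical client-side handler, using only the error and retry_after fields shown in the examples above.

```python
import json

def handle_gateway_response(status: int, body: str) -> str:
    """Map a moderation 403 into a user-facing message; pass other responses through."""
    if status == 403:
        payload = json.loads(body)
        reason = payload.get("error", "content blocked")
        return f"Request rejected by moderation: {reason}"
    return body

msg = handle_gateway_response(
    403,
    '{"error": "content blocked: regex pattern sql.*injection found in request body",'
    ' "retry_after": null}')
print(msg)
```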

Best Practices

Embedding-Based Moderation

  1. Carefully Select Deny Samples

    • Choose clear examples of content you want to block
    • Include variations to improve detection coverage
    • Keep samples focused on specific categories of harmful content
  2. Threshold Tuning

    • Start with a threshold around 0.8
    • Increase for fewer false positives (stricter matching)
    • Decrease for broader protection (may increase false positives)
  3. Embedding Model Selection

    • Choose models with good semantic understanding
    • Consider performance vs. accuracy tradeoffs
    • Test with your specific use cases

Keyword & Regex Moderation

  1. Keyword Selection

    • Start with a focused list of clearly harmful terms
    • Avoid overly common words that may cause false positives
    • Consider language variations and common misspellings
    • Regularly update keywords based on new threats
  2. Pattern Crafting

    • Use specific regex patterns targeting known attack vectors
    • Test patterns thoroughly before deployment
    • Consider performance impact of complex patterns
    • Document pattern purposes for maintenance
  3. Similarity Threshold Tuning

    • Start with the default 0.8 threshold
    • Increase for stricter matching
    • Lower to catch more variations
    • Monitor false positive/negative rates

LLM-Based Moderation

  1. Provider and Model Selection

    • Choose models with strong understanding of policy violations
    • Consider latency requirements for your application
    • Test different models to find the best balance of accuracy and performance
  2. Instruction Crafting

    • Be specific about what constitutes a violation
    • Include examples if possible
    • Clearly define categories of prohibited content
  3. Response Handling

    • Implement appropriate error messages for users
    • Consider logging violations for analysis
    • Monitor false positive rates

Performance Considerations

The NeuralTrust Moderation plugin is designed for optimal performance through several key optimizations:

  1. Efficient Resource Management

    • Memory pooling for reduced allocations
    • Pre-compiled regex patterns
    • Optimized string comparison algorithms
  2. Parallel Processing

    • Concurrent execution of different moderation methods
    • Early termination when violations are detected
    • Efficient context cancellation
  3. Caching and Reuse

    • Embedding caching for similar content
    • Fingerprint tracking to identify repeat offenders
    • Pooled resources for reduced memory pressure
  4. Scalability

    • Linear scaling with increasing workloads
    • Minimal CPU utilization
    • Low memory footprint

These optimizations ensure the plugin can handle high throughput while maintaining robust content filtering capabilities.