NeuralTrust | The leading security platform for generative AI

Overview

The NeuralTrust Moderation plugin (neuraltrust_moderation) is a sophisticated content filtering system designed to protect your AI gateway from potentially harmful or unwanted content. It employs multiple layers of content analysis to ensure comprehensive protection while maintaining high performance. At its core, the plugin implements content filtering through two primary mechanisms: keyword-based blocking and regular expression pattern matching. The keyword system is enhanced with fuzzy matching capabilities, allowing it to detect not just exact matches but also similar variations of prohibited words. This is particularly effective in catching attempts to circumvent the filter through minor word modifications. The smart detection system utilizes advanced string matching algorithms, specifically the Levenshtein distance calculation, to determine word similarity. This approach is case-insensitive by default and operates with a configurable similarity threshold (ranging from 0 to 1). The default threshold is set to 0.8, providing a good balance between strict matching and flexibility. This means that words that are 80% similar to blocked keywords will trigger the filter, effectively catching common evasion techniques like character substitutions or misspellings.

Features

Keyword-based Blocking:

Fuzzy matching for similar words
Configurable similarity threshold
Case-insensitive matching

Pattern-based Blocking:

Regular expression support
Complex pattern matching
Pre-compiled patterns for performance
Support for common attack patterns

Action Configuration:

Customizable block messages
Configurable response codes
Detailed error reporting
Logging and monitoring

Configuration Examples

Basic Configuration

The basic configuration provides essential content filtering with commonly needed protections:

{
    "settings": {
        "similarity_threshold": 0.8,
        "keywords": [
            "hack",
            "exploit",
            "vulnerability"
        ],
        "regex": [
            "password.*dump",
            "sql.*injection"
        ],
        "actions": {
            "type": "block",
            "message": "Content blocked due to prohibited content: %s"
        }
    }
}

Key components of the basic configuration:

Similarity Threshold

Value Range	Description	Impact
0 to 1 (0.8 default)	Determines how closely strings must match	Controls matching sensitivity
Higher values (>0.8)	Requires closer matches	Reduces false positives
Lower values (<0.8)	Allows more variation	Catches more variations

Security-Focused Configuration

For environments requiring strict security controls:

{
    "settings": {
        "similarity_threshold": 0.8,
        "keywords": [
            "hack",
            "exploit",
            "vulnerability",
            "injection",
            "overflow",
            "backdoor"
        ],
        "regex": [
            "CVE-\\d{4}-\\d{4,7}",
            "password.*dump",
            "sql.*injection",
            "(union|select|delete|drop|update|insert).*table",
            "exec.*\\(.*\\)",
            "system\\(.*\\)"
        ],
        "actions": {
            "type": "block",
            "message": "Security violation detected: %s. This incident will be logged."
        }
    }
}

AI Safety Configuration

Specifically designed for AI model protection:

{
    "settings": {
        "similarity_threshold": 0.8,
        "keywords": [
            "jailbreak",
            "bypass",
            "ignore previous",
            "ignore rules",
            "system prompt"
        ],
        "regex": [
            "you\\s+are\\s+now\\s+.*",
            "ignore\\s+all\\s+.*instructions",
            "disregard\\s+.*safety",
            "\\{\\{.*\\}\\}",
            "<.*>.*</.*>"
        ],
        "actions": {
            "type": "block",
            "message": "Prompt rejected: Potential safety bypass attempt detected."
        }
    }
}

Example Use Cases

1. Content Moderation

The keyword-based blocking feature is particularly effective for content moderation in AI systems where users might try to bypass filters using creative spelling. For instance, if you configure the plugin with the keyword “hack” and a similarity threshold of 0.8, it will catch variations like “h4ck”, “hakk”, or “h@ck”. This is especially useful in educational platforms where you want to prevent discussions about hacking while allowing legitimate security discussions. The fuzzy matching capability means that even if someone tries to evade detection by using “hacc” or “haccc”, the system will still identify and block these attempts, making it significantly more difficult to circumvent the protection. First, create a gateway to protect hacking:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prompt Moderation Gateway",
    "subdomain": "prompt-mod-test",
    "required_plugins": [
        {
            "name": "neuraltrust_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.5,
                "keywords": [
                    "hack",
                    "exploit",
                    "vulnerability"
                ],
                "regex": [
                    "password.*dump",
                    "sql.*injection",
                    "CVE-\\d{4}-\\d{4,7}"
                ],
                "actions": {
                    "type": "block",
                    "message": "Content blocked due to prohibited content: %s"
                }
            }
        }
    ]
}'

Then test with clean content (should pass):

curl -X POST "http://localhost:8081/post" \
    -H "Host: prompt-mod-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me about machine learning"}'

# Expected Response: 200 OK

Test with blocked content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: prompt-mod-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to hacking into a system"}'

# Expected Response: 403 Forbidden
# Message: "Content blocked due to prohibited content: hack"

2. CVE Protection

Complex pattern matching combines multiple rules to create sophisticated content filters. For example, in a security context, you might want to prevent any attempts to discuss specific vulnerabilities or exploits. The system can detect and block CVE numbers and related security discussions using regex patterns like “CVE-\d-\d” combined with keywords like “exploit” or “vulnerability”. This comprehensive approach helps prevent the AI from inadvertently providing information about security vulnerabilities. First, create a the gateway to protect CVE numbers:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "CVE Protection Gateway",
    "subdomain": "cve-protection-test",
    "required_plugins": [
        {
            "name": "neuraltrust_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.8,
                "keywords": [
                    "exploit",
                    "vulnerability",
                    "security bug"
                ],
                "regex": [
                    "CVE-\\d{4}-\\d{4,7}",
                    "vulnerability.*found",
                    "security.*patch"
                ],
                "actions": {
                    "type": "block",
                    "message": "Security information blocked: %s"
                }
            }
        }
    ]
}'

Test with clean content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: cve-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me about software development"}'

# Expected Response: 200 OK

Test with CVE pattern:

curl -X POST "http://localhost:8081/post" \
    -H "Host: cve-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me about CVE-2024-1234"}'

# Expected Response: 403 Forbidden
# Message: "Security information blocked: CVE-2024-1234"

3. SQL Injection Prevention

Pattern-based blocking using regular expressions provides powerful protection against structured threats and known attack patterns. Consider a scenario where you want to prevent users from attempting to extract sensitive information through SQL injection. By implementing a regex pattern like “(select|union|drop).*from.*where”, the system can detect and block sophisticated SQL injection attempts even when they’re disguised within natural language. First, create a gateway with SQL injection protection:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "SQL Protection Gateway",
    "subdomain": "sql-protection-test",
    "required_plugins": [
        {
            "name": "neuraltrust_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.8,
                "keywords": [
                    "sql",
                    "injection",
                    "database"
                ],
                "regex": [
                    "sql.*injection",
                    "(union|select|delete|drop|update|insert).*table",
                    "from.*where.*password"
                ],
                "actions": {
                    "type": "block",
                    "message": "SQL injection attempt blocked: %s"
                }
            }
        }
    ]
}'

Test with clean content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: sql-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is a database?"}'

# Expected Response: 200 OK

Test with SQL injection pattern:

curl -X POST "http://localhost:8081/post" \
    -H "Host: sql-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to perform sql injection attacks"}'

# Expected Response: 403 Forbidden
# Message: "SQL injection attempt blocked: sql.*injection"

4. Cryptocurrency Scam Prevention

The fuzzy matching capability is particularly powerful when dealing with intentional misspellings and character substitutions. In a real-world scenario, if you’re protecting against cryptocurrency scams, you might block variations of cryptocurrency terms and common scam phrases. First, create a gateway with crypto scam protection:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Crypto Protection Gateway",
    "subdomain": "crypto-protection-test",
    "required_plugins": [
        {
            "name": "neuraltrust_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.8,
                "keywords": [
                    "ethereum",
                    "bitcoin",
                    "crypto",
                    "moonshot",
                    "pump"
                ],
                "regex": [
                    "\\d+x.*gains",
                    "invest.*crypto",
                    "buy.*(btc|eth|coin)"
                ],
                "actions": {
                    "type": "block",
                    "message": "Cryptocurrency discussion blocked: %s"
                }
            }
        }
    ]
}'

Test with clean content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: crypto-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me about traditional banking"}'

# Expected Response: 200 OK

Test with crypto scam pattern:

curl -X POST "http://localhost:8081/post" \
    -H "Host: crypto-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to invest in eth3reum for 100x gains"}'

# Expected Response: 403 Forbidden
# Message: "Cryptocurrency discussion blocked: ethereum (similarity: 0.85)"

5. Financial Advice Protection

Complex pattern matching combines multiple rules to create sophisticated content filters for preventing unauthorized financial advice. First, create a gateway with financial advice protection:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Financial Protection Gateway",
    "subdomain": "finance-protection-test",
    "required_plugins": [
        {
            "name": "neuraltrust_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.8,
                "keywords": [
                    "invest",
                    "trading",
                    "portfolio",
                    "stocks",
                    "options"
                ],
                "regex": [
                    "invest.*\\$\\d+",
                    "buy.*(stock|option|share)",
                    "trade.*(strategy|tip)",
                    "market.*timing"
                ],
                "actions": {
                    "type": "block",
                    "message": "Financial advice request blocked: %s"
                }
            }
        }
    ]
}'

Test with clean content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: finance-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is a bank account?"}'

# Expected Response: 200 OK

Test with financial advice pattern:

curl -X POST "http://localhost:8081/post" \
    -H "Host: finance-protection-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to invest $1000 for maximum gains"}'

# Expected Response: 403 Forbidden
# Message: "Financial advice request blocked: invest.*\\$\\d+"

6. Custom Error Messages

The action configuration feature allows you to customize how the system responds to detected threats with context-aware messages. First, create a gateway with custom error messages:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Custom Error Gateway",
    "subdomain": "custom-error-test",
    "required_plugins": [
        {
            "name": "prompt_moderation",
            "enabled": true,
            "stage": "pre_request",
            "priority": 1,
            "settings": {
                "similarity_threshold": 0.8,
                "keywords": [
                    "hack",
                    "exploit",
                    "bypass"
                ],
                "regex": [
                    "security.*bypass",
                    "password.*crack",
                    "system.*hack"
                ],
                "actions": {
                    "type": "block",
                    "message": "Security policy violation #SEC-%d (Corporate Policy 4.2.1). This attempt has been logged and reported. Detected: %s"
                }
            }
        }
    ]
}'

Test with clean content:

curl -X POST "http://localhost:8081/post" \
    -H "Host: custom-error-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to improve system security?"}'

# Expected Response: 200 OK

Test with security violation:

curl -X POST "http://localhost:8081/post" \
    -H "Host: custom-error-test.example.com" \
    -H "X-TG-API-Key: ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "How to bypass system security"}'

# Expected Response: 403 Forbidden
# Message: "Security policy violation #SEC-123 (Corporate Policy 4.2.1). This attempt has been logged and reported. Detected: security.*bypass"

Best Practices

1. Keyword Selection

• Start with a focused list of clearly harmful terms • Avoid overly common words that may cause false positives • Consider language variations and common misspellings • Regularly update keywords based on new threats

2. Pattern Crafting

• Use specific regex patterns targeting known attack vectors • Test patterns thoroughly before deployment • Consider performance impact of complex patterns • Document pattern purposes for maintenance

3. Threshold Tuning

• Start with the default 0.8 threshold • Increase threshold for stricter matching • Lower threshold to catch more variations • Monitor false positive/negative rates

4. Performance Optimization

• Keep keyword lists focused and relevant • Use simple regex patterns where possible • Monitor processing times • Cache compiled patterns

Performance Considerations

The Prompt Moderation plugin delivers exceptional performance through a sophisticated architecture that combines efficient pattern compilation and string matching capabilities. At initialization, all regex patterns are compiled once and stored in memory, enabling quick access and efficient pattern matching throughout the plugin’s lifecycle. The string matching system leverages optimized Levenshtein distance calculations and smart word tokenization, while maintaining memory-efficient string comparisons. This comprehensive approach ensures that pattern matching operations are executed with minimal computational overhead, making the plugin ideal for high-throughput environments where performance is critical. The plugin’s resource management strategy is built around maintaining minimal memory footprint and optimal CPU utilization. Through intelligent caching mechanisms and efficient data structures, the system achieves consistent performance while scaling linearly with increasing workloads. The architecture supports concurrent processing with low latency, enabling seamless handling of multiple requests simultaneously. This scalability-focused design, combined with pre-compiled patterns and optimized algorithms, ensures the plugin can handle growing demands while maintaining robust content filtering capabilities across distributed operations.

Getting Started

Core Concepts

Traffic Management

Actions API

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Agent Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

Deployment

API Reference

Keywords & Regex

Overview

Features

Configuration Examples

Basic Configuration

Similarity Threshold

Security-Focused Configuration

AI Safety Configuration

Example Use Cases

1. Content Moderation

2. CVE Protection

3. SQL Injection Prevention

4. Cryptocurrency Scam Prevention

5. Financial Advice Protection

6. Custom Error Messages

Best Practices

1. Keyword Selection

2. Pattern Crafting

3. Threshold Tuning

4. Performance Optimization

Performance Considerations

Getting Started

Core Concepts

Traffic Management

Actions API

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Agent Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

Deployment

API Reference

​Overview

​Features

​Configuration Examples

​Basic Configuration

​Similarity Threshold

​Security-Focused Configuration

​AI Safety Configuration

​Example Use Cases

​1. Content Moderation

​2. CVE Protection

​3. SQL Injection Prevention

​4. Cryptocurrency Scam Prevention

​5. Financial Advice Protection

​6. Custom Error Messages

​Best Practices

​1. Keyword Selection

​2. Pattern Crafting

​3. Threshold Tuning

​4. Performance Optimization

​Performance Considerations

Overview

Features

Configuration Examples

Basic Configuration

Similarity Threshold

Security-Focused Configuration

AI Safety Configuration

Example Use Cases

1. Content Moderation

2. CVE Protection

3. SQL Injection Prevention

4. Cryptocurrency Scam Prevention

5. Financial Advice Protection

6. Custom Error Messages

Best Practices

1. Keyword Selection

2. Pattern Crafting

3. Threshold Tuning

4. Performance Optimization

Performance Considerations