Why Detect Toxicity?
Toxic content can take many forms, including hate speech, sexual content, self-harm encouragement, and threats of violence. Left unchecked, this type of content can:
- Damage brand reputation and user trust
- Violate regional and global regulations
- Expose users to psychological harm
- Undermine platform safety and inclusive experiences
What TrustGate Offers
TrustGate supports three high-performance plugins for toxicity moderation:

Plugin | Provider | Content Types | Key Features |
---|---|---|---|
Azure Toxicity Detection | Azure Content Safety | Text & Image | Severity level thresholds, multi-modal analysis, category-based filtering |
OpenAI Toxicity Detection | OpenAI Moderation API | Text & Image URL | Category-specific scoring, configurable thresholds, sub-category analysis |
NeuralTrust Toxicity Detection | NeuralTrust API | Text | Category-specific scoring, configurable thresholds, sub-category analysis |
Across these plugins, you can:
- Analyze both text and images
- Detect specific categories of risk (e.g., violence, hate, sexual content)
- Customize response actions (block, warn, log)
- Configure thresholds to balance sensitivity and false positive rates (a configuration sketch follows this list)
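To make thresholds and response actions concrete, here is a minimal configuration sketch expressed as a Python dict. All field names and the plugin identifier are illustrative assumptions, not TrustGate's actual schema; consult each plugin's reference for the real settings.

```python
# Hypothetical plugin configuration (field names are illustrative only).
toxicity_config = {
    "plugin": "openai_toxicity_detection",  # assumed plugin identifier
    "action": "block",                      # block | warn | log
    "thresholds": {                         # per-category cutoffs in [0, 1]
        "hate": 0.7,
        "violence": 0.8,
        "sexual": 0.9,
        "self-harm": 0.5,                   # stricter: flags at lower scores
    },
}
```

Lower cutoffs make a category more sensitive (more blocks, more false positives); higher cutoffs do the reverse, which is the trade-off the last list item refers to.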
Toxicity Categories
Typical categories analyzed include:

Category | Description |
---|---|
Hate | Discriminatory or hateful language based on race, gender, religion, etc. |
Violence | Threats, graphic violence, promotion of harm to others |
Self-Harm | Encouragement or glorification of suicide, self-injury, etc. |
Sexual | Explicit, suggestive, or otherwise inappropriate sexual content |
Harassment | Threatening or targeted language aimed at individuals or groups |
Illicit | Content involving illegal activity or regulatory violations |
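To show how these categories interact with configured thresholds, the sketch below compares per-category scores against cutoffs and decides an action. The function and field names are illustrative, not part of any plugin's API.

```python
def moderate(scores: dict[str, float],
             thresholds: dict[str, float]) -> tuple[str, list[str]]:
    """Compare per-category toxicity scores against configured cutoffs.

    Returns the action to take and the categories that tripped.
    A category with no configured threshold is never flagged.
    """
    flagged = [cat for cat, score in scores.items()
               if score >= thresholds.get(cat, 1.1)]
    return ("block" if flagged else "allow"), flagged


# Example: a hate score of 0.82 exceeds the 0.70 cutoff, so the text is blocked.
action, hits = moderate(
    scores={"hate": 0.82, "violence": 0.10, "sexual": 0.02},
    thresholds={"hate": 0.70, "violence": 0.80, "sexual": 0.90},
)
print(action, hits)  # -> block ['hate']
```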
Example Use Cases
Toxicity detection is vital for:
- LLM applications to prevent prompt abuse and jailbreaks
- Messaging platforms for filtering hate speech and harassment
- Education tech to keep learning environments safe
- Public comment systems to block harmful or illegal content
- Gaming communities for anti-toxicity and moderation automation
- Customer support AI to intercept offensive or harmful messages
Response Format
Toxicity plugins typically return rich response data, including per-category scores and the resulting action:
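A representative payload might look like the sketch below. The exact field names vary by plugin and are assumptions here; see each plugin's reference for its actual response schema.

```python
# Hypothetical moderation response (field names are illustrative only).
response = {
    "flagged": True,
    "action": "block",
    "categories": {
        "hate":     {"score": 0.91, "threshold": 0.70, "flagged": True},
        "violence": {"score": 0.12, "threshold": 0.80, "flagged": False},
    },
}

# Log which categories tripped and by how much, e.g. for tuning thresholds.
for category, result in response["categories"].items():
    if result["flagged"]:
        print(f"{category}: score {result['score']:.2f} exceeded "
              f"threshold {result['threshold']:.2f}")
```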
Best Practices
- Use moderate default thresholds and adjust with data
- Monitor blocked content and false positives for tuning
- Secure API keys and rotate regularly
- Enable category-level logging to understand trends
- Test with realistic sample data to validate coverage (see the smoke-test sketch after this list)
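As an illustration of the last two practices, this sketch runs labeled samples through a stand-in scorer and asserts the verdicts match expectations. The hard-coded scores are placeholders for real plugin output; every name here is hypothetical.

```python
# Minimal smoke test for threshold tuning. The fixed scores stand in for
# real plugin responses; swap them for live output when validating coverage.
thresholds = {"hate": 0.70, "violence": 0.80}

def verdict(scores: dict[str, float]) -> str:
    hit = any(scores.get(cat, 0.0) >= cut for cat, cut in thresholds.items())
    return "block" if hit else "allow"

samples = [
    ({"violence": 0.93, "hate": 0.10}, "block"),  # overtly violent sample
    ({"violence": 0.05, "hate": 0.02}, "allow"),  # benign sample
]

for scores, expected in samples:
    assert verdict(scores) == expected, f"unexpected verdict for {scores}"
print("all samples passed")
```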
Explore individual plugin capabilities: