NeuralTrust | Platform for Agent Security.

Content-security detectors are the LLM-aware core of TrustGuard: they score prompts and model output for jailbreaks, toxicity, and off-topic or disallowed content, and they analyze documents and URLs. Several call the NeuralTrust Firewall; its credentials are configured globally (env) or overridden per detector under `settings.credentials`. All of these detectors are detection-only: the action (Monitor / Block) and the evaluation phase (Input / Output) are set on the policy rule that references the detector.

| Detector | Slug | Sides | Protocols | Backend |
| --- | --- | --- | --- | --- |
| Prompt Guard | `prompt_guard` | input, output | all | NeuralTrust Firewall |
| Toxicity | `toxicity` | input, output | all | NeuralTrust Firewall |
| Moderation | `prompt_moderation` | input, output | all | keyword/regex + NeuralTrust topics |
| URL Analyzer | `url_analyzer` | input | llm, mcp | fetch + NeuralTrust Firewall |
| Document Analyzer | `doc_analyzer` | input | llm | extract/OCR + PII + Firewall |

Prompt Guard — `prompt_guard`

Scores input/output with the NeuralTrust Firewall jailbreak detector and reports a finding (`signal.type: "jailbreak"`) when the score exceeds the configured threshold.

Detector Sensitivity in the console uses three levels for all detectors: Permissive, Balanced (recommended), and Strict. The API field `jailbreak.threshold` (and similar threshold fields) maps to the same sensitivity knob and is intended for automation; prefer the console Sensitivity control when configuring from the UI.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `jailbreak.threshold` | number | ✅ | Score in `[0, 1]` above which content is flagged. |
| `credentials.{base_url,token,openai_api_key}` | string | — | Override global firewall creds. |

```json
{
  "name": "Jailbreak — strict",
  "plugin_slug": "prompt_guard",
  "settings": { "jailbreak": { "threshold": 0.6 } }
}
```
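
The per-detector credential override from the field table might look like the sketch below. It assumes that dotted field names map to nested objects under `settings`, in the same way `jailbreak.threshold` does in the example above. The `name`, the `base_url`, and the `token` are placeholders chosen for illustration, not real endpoints or secrets.

```json
{
  "name": "Jailbreak (custom firewall credentials)",
  "plugin_slug": "prompt_guard",
  "settings": {
    "jailbreak": { "threshold": 0.6 },
    "credentials": {
      "base_url": "https://firewall.example.com",
      "token": "<FIREWALL_TOKEN>"
    }
  }
}
```

Fields omitted from `credentials` would presumably fall back to the global configuration, but the source does not state this, so verify it before relying on partial overrides.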

Toxicity — `toxicity`

Scores content with the NeuralTrust Firewall toxicity detector and reports a finding when the score exceeds the threshold. The `signal.type` is the Firewall category that scored highest (e.g. `hate`, `violence`, `harassment`, `self_harm`, `sexual`).

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `toxicity.threshold` | number | ✅ | Score in `[0, 1]`. |
| `credentials.*` | object | — | Override global firewall creds. |
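
A toxicity detector definition would plausibly follow the same shape as the Prompt Guard example. The sketch below assumes `toxicity.threshold` nests under `settings.toxicity`. The `name` and the threshold value of `0.7` are illustrative choices, not documented defaults.

```json
{
  "name": "Toxicity (balanced)",
  "plugin_slug": "toxicity",
  "settings": { "toxicity": { "threshold": 0.7 } }
}
```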

Moderation — `prompt_moderation`

Dual-mode moderation: keyword/regex matching and NeuralTrust topic moderation. Enable at least one mode.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `keyreg_moderation.enabled` | boolean | `false` | Keyword/regex matching (`signal.type: "keyreg"`). |
| `keyreg_moderation.keywords` | array<string> | — | |
| `keyreg_moderation.regex` | array<string> | — | Each must compile. |
| `keyreg_moderation.similarity_threshold` | number | `0.8` | |
| `nt_topic_moderation.enabled` | boolean | `false` | NeuralTrust topic probability (`signal.type: "nt_topic"`). |
| `nt_topic_moderation.topics` | array<string> | — | |
| `nt_topic_moderation.thresholds` | map<string, number> | — | Per‑topic threshold. |
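
As a rough sketch, a definition that enables both modes could look like this. It assumes the dotted field names nest under `settings`. The keyword, the regex pattern, and the topic name `politics` are hypothetical values chosen for illustration; the set of topics the service actually supports is not listed here.

```json
{
  "name": "Moderation (keywords and topics)",
  "plugin_slug": "prompt_moderation",
  "settings": {
    "keyreg_moderation": {
      "enabled": true,
      "keywords": ["internal roadmap"],
      "regex": ["project\\s+aurora"],
      "similarity_threshold": 0.8
    },
    "nt_topic_moderation": {
      "enabled": true,
      "topics": ["politics"],
      "thresholds": { "politics": 0.7 }
    }
  }
}
```

Note that backslashes in regex patterns must be escaped inside JSON strings, so `\s` is written as `\\s`.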

URL Analyzer — `url_analyzer`

Extracts URLs from content, fetches each page (SSRF‑guarded, size/timeout‑bounded, up to 10 URLs per request), and screens the fetched text for jailbreaks and PII.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `threshold` | number | `0.7` | Jailbreak score threshold. |
| `url.timeout` | integer | `20000` | Milliseconds. |
| `url.max_content_size` | integer | `1048576` | Bytes (1 MiB). |
| `url.allowed_domains` / `url.blocked_domains` | array<string> | — | Allow/deny lists. |
| `pii.entities` | array<enum> | — | PII entities to check in fetched content. |
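
A possible URL Analyzer definition, using the defaults from the table, is sketched below. It assumes the dotted field names nest under `settings`. The blocked domain and the `email` entity value are assumptions made for illustration; the valid entity enum values are not listed in this section.

```json
{
  "name": "URL screening",
  "plugin_slug": "url_analyzer",
  "settings": {
    "threshold": 0.7,
    "url": {
      "timeout": 20000,
      "max_content_size": 1048576,
      "blocked_domains": ["example.org"]
    },
    "pii": { "entities": ["email"] }
  }
}
```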

Document Analyzer — `doc_analyzer`

Extracts text from uploaded documents (PDF, Office, images via OCR, plain text) sent as `payload.attachments`, then screens for PII and (optionally) jailbreaks.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `max_file_size` | integer | `52428800` (50 MiB) | Bytes. |
| `entities` | array<enum> | — | PII entities to detect. |
| `firewall.enabled` | boolean | `false` | Jailbreak screening of extracted text. |
| `firewall.threshold` | number | `0.7` | |
| `ocr.enabled` | boolean | `false` | OCR for images/scans (requires the OCR build). |
| `ocr.languages` | array<string> | — | Tesseract language codes. |
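
A Document Analyzer definition with jailbreak screening and OCR turned on might look like the following sketch. It assumes the dotted field names nest under `settings`. The `email` entity value is a hypothetical enum member, while `eng` is the standard Tesseract code for English.

```json
{
  "name": "Document screening",
  "plugin_slug": "doc_analyzer",
  "settings": {
    "max_file_size": 52428800,
    "entities": ["email"],
    "firewall": { "enabled": true, "threshold": 0.7 },
    "ocr": { "enabled": true, "languages": ["eng"] }
  }
}
```

Because `ocr.enabled` requires the OCR build, this configuration would only apply OCR on deployments that include it.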

When to use

`prompt_guard`: the baseline jailbreak defense for chat traffic.
`url_analyzer` / `doc_analyzer`: RAG and agent flows that ingest links or files.
`toxicity`: abuse and safety screening on input and/or output.
`prompt_moderation`: topic and scope control (“only answer about X”).
