Content-security detectors are the LLM-aware core of TrustGuard: they score prompts and model output for jailbreaks, toxicity, and off-topic or disallowed content, and they analyze documents and URLs. Several call the NeuralTrust Firewall or an external provider (OpenAI, Azure, AWS Bedrock). Configure provider credentials globally (via environment variables) or per detector under settings.credentials.
| Detector | Slug | Sides | Protocols | Backend |
| --- | --- | --- | --- | --- |
| Prompt Guard | prompt_guard | input, output | all | NeuralTrust Firewall |
| Multi-turn Guard | multiturn_guard | input | all | Stateful (per session) |
| Toxicity | toxicity | input, output | all | NeuralTrust / OpenAI |
| Toxicity (OpenAI) | toxicity_openai | input, output | all | OpenAI Moderation |
| Toxicity (Azure) | toxicity_azure | input, output | all | Azure Content Safety |
| Prompt Moderation | prompt_moderation | input, output | all | keyword/regex + NeuralTrust topics |
| URL Analyzer | url_analyzer | input | llm, mcp | fetch + NeuralTrust Firewall |
| Document Analyzer | doc_analyzer | input | llm | extract/OCR + PII + Firewall |
| Bedrock Guardrail | bedrock_guardrail | both | all | AWS Bedrock |

Prompt Guard — prompt_guard

Scores input or output with the NeuralTrust Firewall jailbreak detector and flags content whose score exceeds the threshold. Sensitivity ranges from 1 to 4 (default 2).

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| jailbreak.threshold | number | | Score in [0,1] above which content is flagged. |
| credentials.{base_url,token,openai_api_key} | string | | Override global firewall creds. |

{ "type": "prompt_guard", "mode": "block", "direction": "input",
  "settings": { "jailbreak": { "threshold": 0.6 } } }

Multi-turn Guard — multiturn_guard

Records conversation turns keyed on session_id and evaluates them together, catching jailbreaks that build up gradually across turns rather than arriving in a single message. Pass a stable session_id on the guard request.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| session_ttl | integer | 3600 | Seconds to retain session state. |
| retention_period | integer | | Seconds; malicious-counter TTL. |
| threshold | number | 0.7 | Score in [0,1]. |
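
A minimal configuration sketch, assuming multiturn_guard takes the same envelope as the prompt_guard example (type, mode, direction, settings) and that the fields above sit directly under settings. The retention_period value of 86400 is an illustrative assumption, since no default is documented here.

```json
{ "type": "multiturn_guard", "mode": "block", "direction": "input",
  "settings": { "session_ttl": 3600, "retention_period": 86400, "threshold": 0.7 } }
```

With settings like these, session state would expire after an hour of retention while the malicious counter persists for a day, provided each request carries the same session_id.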

Toxicity — toxicity

Scores content for toxicity via the configured provider.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| provider | enum | neuraltrust | neuraltrust or openai. |
| toxicity.threshold | number | none (required) | Score in [0,1]. |
| mapping_field | string | | JSON path to the text to score. |
| credentials.* | object | | Override global creds. |
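
As a hedged illustration, the entry below sets the required toxicity.threshold, assuming dotted field names nest the same way jailbreak.threshold does in the prompt_guard example. The threshold value and the choice of the output side are placeholders, not recommendations.

```json
{ "type": "toxicity", "mode": "block", "direction": "output",
  "settings": { "provider": "neuraltrust", "toxicity": { "threshold": 0.8 } } }
```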

Toxicity (OpenAI) — toxicity_openai

Uses the OpenAI Moderation API (omni-moderation-latest).

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| openai_key | string | | OpenAI API key. |
| categories | array<string> | | Moderation categories to consider. |
| thresholds | map<string, number> | | Per-category score threshold [0,1]. |
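
The sketch below shows one way the three fields might fit together. It assumes the detector accepts the category names used by the OpenAI Moderation API (such as hate and violence) as both the categories entries and the thresholds keys; the key is a placeholder and the scores are illustrative.

```json
{ "type": "toxicity_openai", "mode": "block", "direction": "input",
  "settings": {
    "openai_key": "<OPENAI_API_KEY>",
    "categories": ["hate", "violence"],
    "thresholds": { "hate": 0.5, "violence": 0.5 } } }
```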

Toxicity (Azure) — toxicity_azure

Uses Azure AI Content Safety.
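
| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| api_key | string | | | Azure key. |
| endpoints.text | string | | | Text analyze URL. |
| endpoints.image | string | | | Image analyze URL. |
| output_type | enum | | FourSeverityLevels | FourSeverityLevels or EightSeverityLevels. |
| categories | array<string> | | | Hate, Violence, SelfHarm, Sexual |
| category_severity | map<string,int> | | | Per-category min severity. |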
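
An example entry for text-only screening might look like the following. The key and URL are placeholders, the severity numbers are illustrative, and it is an assumption that category_severity is keyed by the same names listed under categories.

```json
{ "type": "toxicity_azure", "mode": "block", "direction": "input",
  "settings": {
    "api_key": "<AZURE_CONTENT_SAFETY_KEY>",
    "endpoints": { "text": "<AZURE_TEXT_ANALYZE_URL>" },
    "output_type": "FourSeverityLevels",
    "categories": ["Hate", "Violence"],
    "category_severity": { "Hate": 2, "Violence": 4 } } }
```

With FourSeverityLevels, Azure AI Content Safety reports severities of 0, 2, 4, or 6, so minimum severities would normally be chosen from those values.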

Prompt Moderation — prompt_moderation

Dual-mode moderation; enable at least one mode.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| keyreg_moderation.enabled | boolean | false | Keyword/regex matching. |
| keyreg_moderation.keywords | array<string> | | |
| keyreg_moderation.regex | array<string> | | Each must compile. |
| keyreg_moderation.similarity_threshold | number | 0.8 | |
| nt_topic_moderation.enabled | boolean | false | NeuralTrust topic probability. |
| nt_topic_moderation.topics | array<string> | | |
| nt_topic_moderation.thresholds | map<string, number> | | Per-topic threshold. |
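
To make the two modes concrete, here is a sketch with both enabled. The keyword, the regex, and the topic name politics are hypothetical values chosen for illustration; the set of topic names NeuralTrust actually supports is not listed in this section.

```json
{ "type": "prompt_moderation", "mode": "block", "direction": "input",
  "settings": {
    "keyreg_moderation": {
      "enabled": true,
      "keywords": ["internal roadmap"],
      "regex": ["\\bproject[- ]atlas\\b"],
      "similarity_threshold": 0.8 },
    "nt_topic_moderation": {
      "enabled": true,
      "topics": ["politics"],
      "thresholds": { "politics": 0.7 } } } }
```

Note the doubled backslashes: the regex is written inside a JSON string, so each backslash in the pattern must be escaped.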

URL Analyzer — url_analyzer

Extracts URLs from content, fetches each page (SSRF-guarded, size/timeout-bounded), and screens the fetched text for jailbreaks and PII.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| threshold | number | 0.7 | Jailbreak score threshold. |
| url.timeout | integer | 20000 | Milliseconds. |
| url.max_content_size | integer | 1048576 | Bytes. |
| url.allowed_domains / url.blocked_domains | array<string> | | Allow/deny lists. |
| pii.entities | array<enum> | | PII entities to check in fetched content. |
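
The following sketch spells out the documented defaults and adds a deny list. The blocked domain is a placeholder, and pii.entities is left out because its allowed enum values are not listed in this section.

```json
{ "type": "url_analyzer", "mode": "block", "direction": "input",
  "settings": {
    "threshold": 0.7,
    "url": {
      "timeout": 20000,
      "max_content_size": 1048576,
      "blocked_domains": ["example.com"] } } }
```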

Document Analyzer — doc_analyzer

Extracts text from uploaded documents (PDF, Office, images via OCR, plain text) sent as input.attachments, then screens for PII and (optionally) jailbreaks.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| max_file_size | integer | 52428800 (50 MiB) | Bytes. |
| entities | array<enum> | | PII entities to detect. |
| firewall.enabled | boolean | false | Jailbreak screening of extracted text. |
| firewall.threshold | number | 0.7 | |
| ocr.enabled | boolean | false | OCR for images/scans (requires the OCR build). |
| ocr.languages | array<string> | | Tesseract language codes. |
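
One possible configuration, shown as a sketch: it lowers the size cap to 10 MiB (10485760 bytes), turns on jailbreak screening, and enables OCR for English and Spanish using the Tesseract codes eng and spa. All values are illustrative, and entities is omitted because its enum values are not listed in this section.

```json
{ "type": "doc_analyzer", "mode": "block", "direction": "input",
  "settings": {
    "max_file_size": 10485760,
    "firewall": { "enabled": true, "threshold": 0.7 },
    "ocr": { "enabled": true, "languages": ["eng", "spa"] } } }
```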

Bedrock Guardrail — bedrock_guardrail

Applies an AWS Bedrock guardrail and reports topic, content, or sensitive-information violations.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| guardrail_id | string | | | |
| version | string | | "1" | |
| credentials.aws_access_key / aws_secret_key / aws_region / aws_session_token | string | | | Static creds. |
| credentials.use_role | boolean | | false | Assume a role instead. |
| credentials.role_arn | string | | | Required when use_role=true. |
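
A role-based setup could be sketched as below. The guardrail ID, account number, role name, and region are placeholders, and it is an assumption that aws_region is still supplied alongside the role when use_role is true.

```json
{ "type": "bedrock_guardrail", "mode": "block", "direction": "input",
  "settings": {
    "guardrail_id": "<GUARDRAIL_ID>",
    "version": "1",
    "credentials": {
      "use_role": true,
      "role_arn": "arn:aws:iam::123456789012:role/<ROLE_NAME>",
      "aws_region": "us-east-1" } } }
```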

When to use

  • prompt_guard (input, block) is the baseline jailbreak defense; add multiturn_guard for chat where attacks unfold over turns.
  • url_analyzer / doc_analyzer for RAG and agent flows that ingest links/files.
  • Pick one toxicity provider per side that matches your stack.
  • prompt_moderation for topic/scope control (“only answer about X”).