Documents are a blind spot for most AI security tooling. Content is buried inside PDFs, images, spreadsheets, and scans, so detectors that only look at the prompt text miss what matters. The Document analyzer is one of TrustGate’s two Content Analyzers: it parses uploaded files, runs OCR where needed, and then executes the same PII and jailbreak detections that protect the prompt, this time against the extracted document content.
Why it matters
| Concern | Impact |
|---|---|
| Indirect prompt injection | A benign-looking PDF can carry hidden instructions (“ignore previous rules…”) that hijack the model once the file is summarized or embedded. |
| PII exposure | Scanned contracts, IDs, and forms carry personal data that text-only detectors never see. |
| Structured leakage | Tables and forms encode account numbers, amounts, and identifiers that need the same protection as prose. |
| Payload smuggling | Malicious actors hide instructions or secrets in metadata, embedded scripts, or steganographic content. |
Where it lives in the picker
The Document analyzer sits under the Content Security category in Create Policy → When, alongside prompt- and response-level detections (Prompt Guard, Prompt Moderation, Response Moderation, Toxicity Protection, URL Analyzer).
Add the detection to a policy, pick whether you want PII, Jailbreak, or both signals, and set the outcome in the Then step — Log to observe, Block to reject the request with a 403, or Mask on the PII side to redact sensitive spans before the model sees them.
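To make the three outcomes concrete, here is a minimal Python sketch of how they could behave. The names `Outcome`, `Finding`, and `apply_outcome` are hypothetical and chosen for illustration; they are not TrustGate's API. Only the behaviours themselves (observe, reject with a 403, redact PII spans) come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical names mirroring the Then-step options.
    LOG = "log"
    BLOCK = "block"
    MASK = "mask"


@dataclass
class Finding:
    kind: str   # "pii" or "jailbreak"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


def apply_outcome(text, findings, outcome):
    """Return (status_code, text_forwarded_to_the_model)."""
    if not findings or outcome is Outcome.LOG:
        # Log only observes; a real system would record the findings here.
        return 200, text
    if outcome is Outcome.BLOCK:
        return 403, None
    # Mask: redact PII spans only, right to left so earlier offsets stay valid.
    masked = text
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        if f.kind == "pii":
            masked = masked[:f.start] + "[REDACTED]" + masked[f.end:]
    return 200, masked
```

In this sketch a jailbreak finding under Mask passes through unchanged, because masking is described only for the PII side; a policy that must stop injections would pair it with Block.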
How it works
The analyzer runs at the pre-request stage, before the payload reaches the upstream LLM or tool:
- Parse: native text is pulled from PDFs, Office files, markdown, HTML, and archives.
- OCR: images and scanned documents are run through OCR to produce searchable text. Languages are configurable per deployment.
- Normalize: tables, forms, and metadata are converted into a structured representation so detections can reason about fields as well as free text.
- Detect: PII and jailbreak detectors run against both the extracted text and the structured view. Results feed back into the policy engine’s When evaluation.
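The four stages can be sketched as a small Python pipeline. Everything here is an illustrative assumption: the `doc` dictionary layout, the function names, the single email regex, and the keyword list stand in for the real parsers, OCR engine, and detectors, which this page does not describe in detail.

```python
import re
from dataclasses import dataclass, field

# Toy detectors (assumptions): one PII pattern and a keyword heuristic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION_HINTS = ("ignore previous", "disregard the above")


@dataclass
class Extracted:
    text: str = ""
    fields: dict = field(default_factory=dict)


def parse(doc):
    # Parse: pull the native text layer.
    return Extracted(text=doc.get("text", ""))


def ocr(doc, extracted):
    # OCR stand-in: each image is assumed to carry its recognized text already.
    for image in doc.get("images", []):
        extracted.text += "\n" + image["recognized_text"]
    return extracted


def normalize(doc, extracted):
    # Normalize: flatten table rows into a field -> value mapping.
    for row in doc.get("tables", []):
        extracted.fields.update(row)
    return extracted


def detect(extracted):
    # Detect: scan both the free text and the structured view.
    haystack = [extracted.text, *map(str, extracted.fields.values())]
    pii = [m for chunk in haystack for m in EMAIL_RE.findall(chunk)]
    lowered = extracted.text.lower()
    jailbreak = any(hint in lowered for hint in INJECTION_HINTS)
    return {"pii": pii, "jailbreak": jailbreak}


def analyze(doc):
    extracted = parse(doc)
    extracted = ocr(doc, extracted)
    extracted = normalize(doc, extracted)
    return detect(extracted)
```

The point of the sketch is the ordering: detection runs last, so text that only exists after OCR or table extraction is still visible to the same detectors.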
Supported inputs
| Input | Treatment |
|---|---|
| PDF, DOCX, XLSX, PPTX, Markdown, HTML | Native text + structured extraction (tables, headings, metadata). |
| PNG, JPEG, TIFF, scanned PDFs | OCR to text, with language-specific tokenization. |
| Archives (zip, tar, etc.) | Recursively unpacked; each inner file goes through the same pipeline. |
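
Recursive unpacking can be illustrated with Python's standard `zipfile` module. This is a sketch under stated assumptions, not the analyzer's implementation: it handles only zip archives, the `max_depth` guard against deeply nested archives is an addition of this example, and Office files are routed by suffix because DOCX, XLSX, and PPTX are themselves zip containers.

```python
import io
import zipfile

# Office formats are zip containers; treat them as documents, not archives.
OFFICE_SUFFIXES = (".docx", ".xlsx", ".pptx")


def iter_inner_files(name, data, max_depth=5):
    """Yield (path, bytes) for every leaf file, unpacking nested zip archives."""
    buffer = io.BytesIO(data)
    is_archive = (
        not name.lower().endswith(OFFICE_SUFFIXES)
        and zipfile.is_zipfile(buffer)
    )
    if not is_archive:
        yield name, data
        return
    if max_depth == 0:
        # Assumed safeguard: refuse archives nested deeper than the limit.
        raise ValueError(f"archive nesting too deep at {name}")
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            yield from iter_inner_files(
                f"{name}/{info.filename}", archive.read(info), max_depth - 1
            )
```

Each yielded file would then enter the same parse, OCR, normalize, and detect pipeline as a directly uploaded document.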
What is detected
PII
Email, credit card, IBAN, phone number, SSN, passport number, national IDs — detected across text, tables, and OCR’d regions. Same entity catalog as Data protection & masking.
Jailbreak
Hidden prompt-injection payloads embedded in document body, metadata, alt text, or visual steganography. Uses the same scoring engine as Prompt security; sensitivity is picked on a 4-level scale (see below).
Structured signals
Detections run against extracted tables and form fields, not just free text — so numbers, codes, and identifiers are caught even when they never appear as prose.
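As an illustration of detection over structured data, the sketch below scans table cells for card numbers and confirms candidates with the Luhn checksum. The row format (a list of dictionaries), the `scan_table` name, and the choice of Luhn validation are assumptions of this example; the page does not say how the analyzer validates matches.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_ok(digits):
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_table(rows):
    """Return (row_index, column, entity) for each cell holding a valid card number."""
    hits = []
    for row_index, row in enumerate(rows):
        for column, value in row.items():
            for match in CARD_RE.finditer(str(value)):
                digits = re.sub(r"\D", "", match.group())
                if luhn_ok(digits):
                    hits.append((row_index, column, "credit_card"))
    return hits
```

Reporting the row and column alongside the entity type is what lets a policy mask a single cell rather than the whole table.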
Embedded content
Links and references inside documents are surfaced; pair with the URL Analyzer to follow and inspect them.
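Surfacing links from extracted text might look like the following sketch. The regex and the `extract_links` helper are illustrative assumptions; a production extractor would also read hyperlink annotations and metadata rather than only visible text.

```python
import re

# Simplified pattern: http(s) URLs up to whitespace or a closing delimiter.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text):
    """Return unique URLs in first-seen order, trimming trailing punctuation."""
    found = (m.rstrip(".,;:!?") for m in URL_RE.findall(text))
    return list(dict.fromkeys(found))
```

The resulting list is the kind of input that could be handed to the URL Analyzer for fetching and inspection.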
Configuration
The Document analyzer exposes the following fields in the policy’s When step:
| Field | Purpose |
|---|---|
| Detection Threshold | Sensitivity level for the jailbreak classifier applied to extracted content. Uses the shared 4-level scale — see below. |
| Max File Size | Upper bound on the uploaded file, in bytes. Default 52428800 (50 MB). Oversized uploads are rejected before parsing. |
| OCR Languages | List of ISO language codes OCR should try (e.g. en, es, fr). Disable OCR entirely for text-only deployments to cut latency. |
| PII Entities to Detect | Which entity types to scan for (Email, Phone, Credit Card, IBAN, SSN, Passport, etc.). |
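
For readers who think in code, the four fields can be modelled as a small Python structure. The class name, attribute names, the default language and entity lists, and the inclusive size check are all assumptions of this sketch; only the field meanings and the 52428800-byte default come from the table above.

```python
from dataclasses import dataclass, field

DEFAULT_MAX_FILE_SIZE = 50 * 1024 * 1024  # 52428800 bytes


@dataclass
class DocumentAnalyzerConfig:
    # Hypothetical representation of the When-step fields.
    detection_threshold: str = "L2"
    max_file_size: int = DEFAULT_MAX_FILE_SIZE
    ocr_languages: list = field(default_factory=lambda: ["en"])
    pii_entities: list = field(default_factory=lambda: ["email", "phone"])

    @property
    def ocr_enabled(self):
        # Assumption: an empty language list stands for "OCR disabled".
        return bool(self.ocr_languages)


def accept_upload(config, size_in_bytes):
    """Size gate applied before parsing; the limit is assumed to be inclusive."""
    return size_in_bytes <= config.max_file_size
```

Checking the size before any parsing keeps oversized uploads from consuming parser or OCR time at all.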
Detection Threshold — sensitivity levels
| Level | Label | Behaviour |
|---|---|---|
| L1 | Lenient | Minimal filtering, only the most obvious threats. |
| L2 | Balanced | Recommended for most use cases. Default. |
| L3 | Enhanced | Higher sensitivity, may flag borderline content. |
| L4 | Strict | Maximum protection, strictest filtering. |
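
One way to picture the scale is as a set of score cutoffs that get lower as the level gets stricter. The numeric cutoffs below are invented purely for illustration, as are the names `CUTOFFS` and `is_flagged`; the page does not publish the classifier's actual thresholds.

```python
# Hypothetical cutoffs: a stricter level flags content at a lower score.
CUTOFFS = {"L1": 0.9, "L2": 0.7, "L3": 0.5, "L4": 0.3}


def is_flagged(score, level="L2"):
    """Return True when a jailbreak score in [0, 1] meets the level's cutoff."""
    return score >= CUTOFFS[level]
```

Under these made-up numbers, a borderline score of 0.6 passes at L2 but is flagged at L3, which matches the table's note that Enhanced may flag borderline content.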
Related detections
The Document analyzer is one of two Content Analyzers in TrustGate. Both share the same configuration pattern (mode + PII + jailbreak) but look at different input channels:
- Document analyzer (this page): inspects files that arrive as uploads or attachments.
- URL Analyzer: fetches the content behind URLs referenced in the prompt and runs the same detections over what it downloads.