Documentation Index
Fetch the complete documentation index at: https://docs.neuraltrust.ai/llms.txt
Use this file to discover all available pages before exploring further.
The Semantic Cache plugin intercepts incoming requests and checks whether a semantically similar request has already been processed. When a match is found above the configured similarity threshold, the cached response is returned instantly — skipping the upstream AI call entirely.
This dramatically reduces latency, saves API costs, and improves throughput for workloads with repetitive or near-duplicate queries.
How It Works
- Pre-request: The plugin extracts text from the incoming request body, generates a vector embedding, and searches for semantically similar cached entries. If a match exceeds the similarity threshold, the cached response is returned immediately.
- Post-response: After a successful upstream response (2xx), the plugin stores the request embedding alongside the response with a configurable TTL.
The plugin uses cosine similarity for efficient nearest-neighbor lookups.
Features
- Semantic matching: Finds similar queries even when worded differently
- Fail-open design: Errors in cache operations never block requests — they pass through to the upstream
- Per-rule isolation: Cache entries are scoped to individual rules, preventing cross-contamination
- Cache bypass: Supports
Cache-Control: no-cache header to skip the cache
- Configurable TTL: Control how long cached responses are retained
- Observability headers: Cache hits include
X-Cache-Status and X-Cache-Similarity response headers
Configuration
The plugin automatically runs in both pre_request and post_response stages — you only need to configure it once in the pre_request stage:
{
"plugin_chain": [
{
"name": "semantic_cache",
"enabled": true,
"stage": "pre_request",
"settings": {
"similarity_threshold": 0.90,
"ttl": "1h",
"embedding": {
"provider": "your-provider",
"model": "your-embedding-model",
"api_key": "your-api-key"
}
}
}
]
}
Settings
| Parameter | Type | Description | Default |
|---|
similarity_threshold | float | Minimum cosine similarity to consider a cache hit (0.0–1.0) | 0.85 |
ttl | string | Time-to-live for cached entries (Go duration format) | "24h" |
Embedding Configuration
| Parameter | Type | Description | Default |
|---|
embedding.provider | string | Embedding provider to use | — (required) |
embedding.model | string | Embedding model identifier | — (required) |
embedding.api_key | string | API key for the embedding provider | — (required) |
The plugin extracts text from the request body by checking the following fields in order:
| Field | Format | Description |
|---|
prompt | string | Simple prompt field |
input | string | Input text field |
messages | array | Chat-style messages array — all content fields are concatenated |
If none of these fields contain extractable text, the request passes through without caching.
On a cache hit, the following headers are added to the response:
| Header | Example | Description |
|---|
X-Cache-Status | HIT | Indicates the response was served from cache |
X-Cache-Similarity | 0.9523 | The cosine similarity score between the request and the cached entry |
Cache Bypass
Send the Cache-Control: no-cache header to skip the cache for a specific request:
curl -H "Cache-Control: no-cache" \
-H "Content-Type: application/json" \
-d '{"prompt": "What is machine learning?"}' \
https://your-gateway/api/v1/chat
Prerequisites
- A valid API key for the configured embedding provider.
Tuning the Similarity Threshold
The similarity_threshold parameter controls how “similar” two queries must be to produce a cache hit:
| Threshold | Behavior |
|---|
0.95–1.0 | Very strict — only near-identical queries match |
0.85–0.95 | Balanced — catches paraphrases and rewordings |
0.70–0.85 | Loose — broader matching, may return less precise results |
< 0.70 | Not recommended — too many false positives |
Example
Request (cache miss — first call)
curl -X POST https://your-gateway/api/v1/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain what deep learning is"}'
Response is fetched from the upstream and cached automatically.
Request (cache hit — similar query)
curl -X POST https://your-gateway/api/v1/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "What is deep learning?"}'
Response headers:
X-Cache-Status: HIT
X-Cache-Similarity: 0.9412
The cached response is returned instantly without contacting the upstream.