The Semantic Cache plugin intercepts incoming requests and checks whether a semantically similar request has already been processed. When a match is found above the configured similarity threshold, the cached response is returned instantly — skipping the upstream AI call entirely. This dramatically reduces latency, saves API costs, and improves throughput for workloads with repetitive or near-duplicate queries.

How It Works

  1. Pre-request: The plugin extracts text from the incoming request body, generates a vector embedding, and searches for semantically similar cached entries. If a match exceeds the similarity threshold, the cached response is returned immediately.
  2. Post-response: After a successful upstream response (2xx), the plugin stores the request embedding alongside the response with a configurable TTL.
The plugin uses cosine similarity for efficient nearest-neighbor lookups.
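The pre-request lookup described above can be sketched in Python. This is a simplified illustration, not the plugin's actual implementation: the in-memory `cache` list, the entry layout, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, cache, threshold=0.85):
    """Return the best cached entry whose similarity meets the threshold.

    Each cache entry is assumed to look like:
        {"embedding": [...], "response": ...}
    """
    best_score, best_entry = 0.0, None
    for entry in cache:
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score >= threshold and score > best_score:
            best_score, best_entry = score, entry
    if best_entry is None:
        return None  # cache miss: forward the request to the upstream
    # Cache hit: serve the stored response with observability headers.
    return {
        "response": best_entry["response"],
        "headers": {
            "X-Cache-Status": "HIT",
            "X-Cache-Similarity": f"{best_score:.4f}",
        },
    }
```

A real deployment would use an approximate nearest-neighbor index rather than a linear scan, but the hit/miss decision reduces to the same threshold comparison.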

Features

  • Semantic matching: Finds similar queries even when worded differently
  • Fail-open design: Errors in cache operations never block requests — they pass through to the upstream
  • Per-rule isolation: Cache entries are scoped to individual rules, preventing cross-contamination
  • Cache bypass: Supports Cache-Control: no-cache header to skip the cache
  • Configurable TTL: Control how long cached responses are retained
  • Observability headers: Cache hits include X-Cache-Status and X-Cache-Similarity response headers

Configuration

The plugin automatically runs in both pre_request and post_response stages — you only need to configure it once in the pre_request stage:
```json
{
  "plugin_chain": [
    {
      "name": "semantic_cache",
      "enabled": true,
      "stage": "pre_request",
      "settings": {
        "similarity_threshold": 0.90,
        "ttl": "1h",
        "embedding": {
          "provider": "your-provider",
          "model": "your-embedding-model",
          "api_key": "your-api-key"
        }
      }
    }
  ]
}
```

Settings

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `similarity_threshold` | float | Minimum cosine similarity to consider a cache hit (0.0–1.0) | `0.85` |
| `ttl` | string | Time-to-live for cached entries (Go duration format) | `"24h"` |
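The `ttl` behavior can be illustrated with a minimal expiry sketch in Python. The helper names and entry layout are hypothetical; the plugin's real storage backend is not shown.

```python
import time

def store(cache, embedding, response, ttl_seconds=24 * 3600):
    """Cache a response with an absolute expiry time (default mirrors the "24h" TTL)."""
    cache.append({
        "embedding": embedding,
        "response": response,
        "expires_at": time.monotonic() + ttl_seconds,
    })

def prune(cache):
    """Drop expired entries so lookups never return stale responses."""
    now = time.monotonic()
    cache[:] = [e for e in cache if e["expires_at"] > now]
```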

Embedding Configuration

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `embedding.provider` | string | Embedding provider to use | — (required) |
| `embedding.model` | string | Embedding model identifier | — (required) |
| `embedding.api_key` | string | API key for the embedding provider | — (required) |

Supported Request Formats

The plugin extracts text from the request body by checking the following fields in order:
| Field | Format | Description |
| --- | --- | --- |
| `prompt` | string | Simple prompt field |
| `input` | string | Input text field |
| `messages` | array | Chat-style messages array — all content fields are concatenated |
If none of these fields contain extractable text, the request passes through without caching.
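The extraction order above can be sketched as follows. This is a hypothetical simplification; the real plugin's validation of each field may differ.

```python
def extract_text(body: dict):
    """Extract cacheable text from a request body, checking fields in order:
    prompt, then input, then messages. Returns None if nothing is extractable."""
    if isinstance(body.get("prompt"), str) and body["prompt"]:
        return body["prompt"]
    if isinstance(body.get("input"), str) and body["input"]:
        return body["input"]
    messages = body.get("messages")
    if isinstance(messages, list):
        # Concatenate every string "content" field in the messages array.
        parts = [m["content"] for m in messages
                 if isinstance(m, dict) and isinstance(m.get("content"), str)]
        if parts:
            return " ".join(parts)
    return None  # request passes through without caching
```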

Response Headers

On a cache hit, the following headers are added to the response:
| Header | Example | Description |
| --- | --- | --- |
| `X-Cache-Status` | `HIT` | Indicates the response was served from cache |
| `X-Cache-Similarity` | `0.9523` | The cosine similarity score between the request and the cached entry |

Cache Bypass

Send the Cache-Control: no-cache header to skip the cache for a specific request:
```shell
curl -H "Cache-Control: no-cache" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is machine learning?"}' \
     https://your-gateway/api/v1/chat
```

Prerequisites

  • A valid API key for the configured embedding provider.

Tuning the Similarity Threshold

The similarity_threshold parameter controls how “similar” two queries must be to produce a cache hit:
| Threshold | Behavior |
| --- | --- |
| 0.95–1.0 | Very strict — only near-identical queries match |
| 0.85–0.95 | Balanced — catches paraphrases and rewordings |
| 0.70–0.85 | Loose — broader matching, may return less precise results |
| < 0.70 | Not recommended — too many false positives |
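To see how the threshold gates matches, consider some hypothetical similarity scores for queries compared against a cached "Explain what deep learning is" entry (the scores below are invented for illustration):

```python
def is_cache_hit(similarity: float, threshold: float) -> bool:
    """A hit requires the similarity score to meet or exceed the threshold."""
    return similarity >= threshold

# Hypothetical scores against the cached entry:
scores = {
    "What is deep learning?": 0.94,          # paraphrase
    "How do neural networks learn?": 0.81,   # related but different
    "Best pizza in town?": 0.12,             # unrelated
}

# Balanced threshold (0.85): only the paraphrase hits.
hits_balanced = [q for q, s in scores.items() if is_cache_hit(s, 0.85)]

# Loose threshold (0.70): the related-but-different query now also hits,
# which may serve a less precise answer.
hits_loose = [q for q, s in scores.items() if is_cache_hit(s, 0.70)]
```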

Example

Request (cache miss — first call)

```shell
curl -X POST https://your-gateway/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain what deep learning is"}'
```
Response is fetched from the upstream and cached automatically.

Request (cache hit — similar query)

```shell
curl -X POST https://your-gateway/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is deep learning?"}'
```
Response headers:
```
X-Cache-Status: HIT
X-Cache-Similarity: 0.9412
```
The cached response is returned instantly without contacting the upstream.