The Semantic Cache plugin intercepts incoming requests and checks whether a semantically similar request has already been processed. When a match is found above the configured similarity threshold, the cached response is returned instantly — skipping the upstream AI call entirely. This dramatically reduces latency, saves API costs, and improves throughput for workloads with repetitive or near-duplicate queries.

How It Works

  1. Pre-request: The plugin extracts text from the incoming request body, generates a vector embedding, and searches for semantically similar cached entries. If a match exceeds the similarity threshold, the cached response is returned immediately.
  2. Post-response: After a successful upstream response (2xx), the plugin stores the request embedding alongside the response with a configurable TTL.
The plugin uses cosine similarity for efficient nearest-neighbor lookups.
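The pre-request lookup described above can be sketched in Python. This is a simplified illustration, not the plugin's actual implementation: the in-memory `cache` list, the entry layout, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, cache, threshold=0.85):
    """Return the best cached entry whose similarity meets the threshold.

    Each cache entry is assumed to look like:
        {"embedding": [...], "response": ...}
    """
    best_score, best_entry = 0.0, None
    for entry in cache:
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score >= threshold and score > best_score:
            best_score, best_entry = score, entry
    if best_entry is None:
        return None  # cache miss: forward the request to the upstream
    # Cache hit: serve the stored response with observability headers.
    return {
        "response": best_entry["response"],
        "headers": {
            "X-Cache-Status": "HIT",
            "X-Cache-Similarity": f"{best_score:.4f}",
        },
    }
```

A real deployment would use an approximate nearest-neighbor index rather than a linear scan, but the hit/miss decision reduces to the same threshold comparison.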

Features

  • Semantic matching: Finds similar queries even when worded differently
  • Fail-open design: Errors in cache operations never block requests — they pass through to the upstream
  • Per-rule isolation: Cache entries are scoped to individual rules, preventing cross-contamination
  • Cache bypass: Supports Cache-Control: no-cache header to skip the cache
  • Configurable TTL: Control how long cached responses are retained
  • Observability headers: Cache hits include X-Cache-Status and X-Cache-Similarity response headers

Configuration

The plugin automatically runs in both pre_request and post_response stages — you only need to configure it once in the pre_request stage:
```json
{
  "plugin_chain": [
    {
      "name": "semantic_cache",
      "enabled": true,
      "stage": "pre_request",
      "settings": {
        "similarity_threshold": 0.90,
        "ttl": "1h",
        "embedding": {
          "provider": "your-provider",
          "model": "your-embedding-model",
          "api_key": "your-api-key"
        }
      }
    }
  ]
}
```

Settings

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `similarity_threshold` | float | Minimum cosine similarity to consider a cache hit (0.0–1.0) | `0.85` |
| `ttl` | string | Time-to-live for cached entries (Go duration format) | `"24h"` |
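The `ttl` behavior can be illustrated with a minimal expiry sketch in Python. The helper names and entry layout are hypothetical; the plugin's real storage backend is not shown.

```python
import time

def store(cache, embedding, response, ttl_seconds=24 * 3600):
    """Cache a response with an absolute expiry time (default mirrors the "24h" TTL)."""
    cache.append({
        "embedding": embedding,
        "response": response,
        "expires_at": time.monotonic() + ttl_seconds,
    })

def prune(cache):
    """Drop expired entries so lookups never return stale responses."""
    now = time.monotonic()
    cache[:] = [e for e in cache if e["expires_at"] > now]
```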

Embedding Configuration

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `embedding.provider` | string | Embedding provider to use | — (required) |
| `embedding.model` | string | Embedding model identifier | — (required) |
| `embedding.api_key` | string | API key for the embedding provider | — (required) |

Supported Request Formats

The plugin extracts text from the request body by checking the following fields in order:
| Field | Format | Description |
| --- | --- | --- |
| `prompt` | string | Simple prompt field |
| `input` | string | Input text field |
| `messages` | array | Chat-style messages array — all content fields are concatenated |
If none of these fields contain extractable text, the request passes through without caching.
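The extraction order above can be sketched as follows. This is a hypothetical simplification; the real plugin's validation of each field may differ.

```python
def extract_text(body: dict):
    """Extract cacheable text from a request body, checking fields in order:
    prompt, then input, then messages. Returns None if nothing is extractable."""
    if isinstance(body.get("prompt"), str) and body["prompt"]:
        return body["prompt"]
    if isinstance(body.get("input"), str) and body["input"]:
        return body["input"]
    messages = body.get("messages")
    if isinstance(messages, list):
        # Concatenate every string "content" field in the messages array.
        parts = [m["content"] for m in messages
                 if isinstance(m, dict) and isinstance(m.get("content"), str)]
        if parts:
            return " ".join(parts)
    return None  # request passes through without caching
```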

Response Headers

On a cache hit, the following headers are added to the response:
| Header | Example | Description |
| --- | --- | --- |
| `X-Cache-Status` | `HIT` | Indicates the response was served from cache |
| `X-Cache-Similarity` | `0.9523` | The cosine similarity score between the request and the cached entry |

Cache Bypass

Send the Cache-Control: no-cache header to skip the cache for a specific request:
```shell
curl -H "Cache-Control: no-cache" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is machine learning?"}' \
     https://your-gateway/api/v1/chat
```

Prerequisites

  • A valid API key for the configured embedding provider.

Tuning the Similarity Threshold

The similarity_threshold parameter controls how “similar” two queries must be to produce a cache hit:
| Threshold | Behavior |
| --- | --- |
| 0.95–1.0 | Very strict — only near-identical queries match |
| 0.85–0.95 | Balanced — catches paraphrases and rewordings |
| 0.70–0.85 | Loose — broader matching, may return less precise results |
| < 0.70 | Not recommended — too many false positives |
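To see how the threshold gates matches, consider some hypothetical similarity scores for queries compared against a cached "Explain what deep learning is" entry (the scores below are invented for illustration):

```python
def is_cache_hit(similarity: float, threshold: float) -> bool:
    """A hit requires the similarity score to meet or exceed the threshold."""
    return similarity >= threshold

# Hypothetical scores against the cached entry:
scores = {
    "What is deep learning?": 0.94,          # paraphrase
    "How do neural networks learn?": 0.81,   # related but different
    "Best pizza in town?": 0.12,             # unrelated
}

# Balanced threshold (0.85): only the paraphrase hits.
hits_balanced = [q for q, s in scores.items() if is_cache_hit(s, 0.85)]

# Loose threshold (0.70): the related-but-different query now also hits,
# which may serve a less precise answer.
hits_loose = [q for q, s in scores.items() if is_cache_hit(s, 0.70)]
```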

Example

Request (cache miss — first call)

```shell
curl -X POST https://your-gateway/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain what deep learning is"}'
```
Response is fetched from the upstream and cached automatically.

Request (cache hit — similar query)

```shell
curl -X POST https://your-gateway/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is deep learning?"}'
```
Response headers:
```
X-Cache-Status: HIT
X-Cache-Similarity: 0.9412
```
The cached response is returned instantly without contacting the upstream.