How It Works
- Pre-request: The plugin extracts text from the incoming request body, generates a vector embedding, and searches for semantically similar cached entries. If a match exceeds the similarity threshold, the cached response is returned immediately.
- Post-response: After a successful upstream response (2xx), the plugin stores the request embedding alongside the response with a configurable TTL.
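The two stages above can be sketched as a small in-memory cache. This is only an illustration, not the plugin's implementation: `toy_embed` is a hypothetical stand-in for a real embedding provider, the names `SemanticCache`, `pre_request`, and `post_response` are chosen for this example, and treating a score equal to the threshold as a hit is an assumption.

```python
import math
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: letter counts. A real provider returns model vectors."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: List[float]
    response: str
    expires_at: float


class SemanticCache:
    def __init__(self, threshold: float = 0.85, ttl_seconds: float = 24 * 3600):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries: List[Entry] = []

    def pre_request(self, text: str, now: Optional[float] = None) -> Optional[Tuple[str, float]]:
        """Return (cached response, similarity) for the best live match, or None on a miss."""
        now = time.time() if now is None else now
        query = toy_embed(text)
        best: Optional[Tuple[str, float]] = None
        for entry in self.entries:
            if entry.expires_at <= now:
                continue  # entry has outlived its TTL
            score = cosine(query, entry.embedding)
            if score >= self.threshold and (best is None or score > best[1]):
                best = (entry.response, score)
        return best

    def post_response(self, text: str, status: int, response: str,
                      now: Optional[float] = None) -> None:
        """Store the request embedding with the response, but only for 2xx statuses."""
        now = time.time() if now is None else now
        if 200 <= status < 300:
            self.entries.append(Entry(toy_embed(text), response, now + self.ttl_seconds))
```

A linear scan is used here for clarity; a production cache would more likely query a vector index.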
Features
- Semantic matching: Finds similar queries even when worded differently
- Fail-open design: Errors in cache operations never block requests — they pass through to the upstream
- Per-rule isolation: Cache entries are scoped to individual rules, preventing cross-contamination
- Cache bypass: Supports the `Cache-Control: no-cache` header to skip the cache
- Configurable TTL: Control how long cached responses are retained
- Observability headers: Cache hits include `X-Cache-Status` and `X-Cache-Similarity` response headers
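As a rough illustration of the fail-open and per-rule isolation behaviors listed above, the sketch below wraps a cache lookup so that any error is treated as a miss, and keeps a separate bucket of entries per rule. `RuleScopedStore`, `lookup_fail_open`, and the rule IDs are hypothetical names, and exact-text matching stands in for semantic matching to keep the example short.

```python
from typing import Callable, Dict, Optional


class RuleScopedStore:
    """Keeps a separate bucket of cached responses for each rule ID."""

    def __init__(self) -> None:
        self._buckets: Dict[str, Dict[str, str]] = {}

    def put(self, rule_id: str, text: str, response: str) -> None:
        self._buckets.setdefault(rule_id, {})[text] = response

    def get(self, rule_id: str, text: str) -> Optional[str]:
        # Only the bucket for this rule is consulted, never another rule's entries.
        return self._buckets.get(rule_id, {}).get(text)


def lookup_fail_open(lookup: Callable[[str, str], Optional[str]],
                     rule_id: str, text: str) -> Optional[str]:
    """Treat any cache error as a miss so the request still reaches the upstream."""
    try:
        return lookup(rule_id, text)
    except Exception:
        return None
```

Returning `None` on failure means the caller follows the same code path as an ordinary cache miss.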
Configuration
The plugin automatically runs in both the `pre_request` and `post_response` stages, so you only need to configure it once, in the `pre_request` stage.
Settings
| Parameter | Type | Description | Default |
|---|---|---|---|
| `similarity_threshold` | float | Minimum cosine similarity to consider a cache hit (0.0–1.0) | 0.85 |
| `ttl` | string | Time-to-live for cached entries (Go duration format) | "24h" |
Embedding Configuration
| Parameter | Type | Description | Default |
|---|---|---|---|
| `embedding.provider` | string | Embedding provider to use | none (required) |
| `embedding.model` | string | Embedding model identifier | none (required) |
| `embedding.api_key` | string | API key for the embedding provider | none (required) |
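The settings in the two tables above could be validated along these lines. This is a sketch under assumptions: the nested `embedding` dictionary layout is inferred from the dotted parameter names, `validate_settings` and `parse_go_duration` are hypothetical helpers, and the duration parser handles only a subset of Go's format (non-negative values with the units `ns`, `us`, `ms`, `s`, `m`, `h`).

```python
import re

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# "ms" is listed before "m" so that "5ms" is not read as 5 minutes followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def parse_go_duration(value: str) -> float:
    """Convert a Go-style duration such as "24h" or "1h30m" to seconds."""
    pos = 0
    total = 0.0
    while pos < len(value):
        match = _PART.match(value, pos)
        if match is None:
            raise ValueError("invalid duration: %r" % value)
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def validate_settings(settings: dict) -> dict:
    """Apply the documented defaults and check required embedding fields."""
    threshold = float(settings.get("similarity_threshold", 0.85))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("similarity_threshold must be between 0.0 and 1.0")
    ttl_seconds = parse_go_duration(settings.get("ttl", "24h"))
    embedding = settings.get("embedding") or {}
    missing = [key for key in ("provider", "model", "api_key") if not embedding.get(key)]
    if missing:
        raise ValueError("missing required embedding settings: " + ", ".join(missing))
    return {
        "similarity_threshold": threshold,
        "ttl_seconds": ttl_seconds,
        "embedding": dict(embedding),
    }
```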
Supported Request Formats
The plugin extracts text from the request body by checking the following fields in order:

| Field | Format | Description |
|---|---|---|
| `prompt` | string | Simple prompt field |
| `input` | string | Input text field |
| `messages` | array | Chat-style messages array; all `content` fields are concatenated |
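The field order above can be sketched as a small extraction function. `extract_text` is a hypothetical name; joining message contents with a newline and skipping empty or non-string values are assumptions, since the separator and edge-case handling are not specified here.

```python
from typing import Any, Dict, Optional


def extract_text(body: Dict[str, Any]) -> Optional[str]:
    """Return the text to embed, checking prompt, then input, then messages."""
    for field in ("prompt", "input"):
        value = body.get(field)
        if isinstance(value, str) and value:
            return value
    messages = body.get("messages")
    if isinstance(messages, list):
        parts = [
            message["content"]
            for message in messages
            if isinstance(message, dict) and isinstance(message.get("content"), str)
        ]
        if parts:
            return "\n".join(parts)  # separator is an assumption
    return None  # nothing to embed
```

For example, a body containing both `prompt` and `messages` yields the `prompt` value, because `prompt` is checked first.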
Response Headers
On a cache hit, the following headers are added to the response:

| Header | Example | Description |
|---|---|---|
| `X-Cache-Status` | HIT | Indicates the response was served from cache |
| `X-Cache-Similarity` | 0.9523 | The cosine similarity score between the request and the cached entry |
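A client could inspect these headers roughly as follows. `read_cache_headers` is a hypothetical helper, and treating a missing `X-Cache-Status` header as a miss is an assumption, since only the cache-hit case is described above.

```python
from typing import Mapping, Optional, Tuple


def read_cache_headers(headers: Mapping[str, str]) -> Tuple[bool, Optional[float]]:
    """Return (was_cache_hit, similarity); similarity is None unless a hit reports one."""
    lowered = {name.lower(): value for name, value in headers.items()}
    hit = lowered.get("x-cache-status", "").upper() == "HIT"
    similarity: Optional[float] = None
    if hit:
        try:
            similarity = float(lowered["x-cache-similarity"])
        except (KeyError, ValueError):
            similarity = None  # header absent or not a number
    return hit, similarity
```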
Cache Bypass
Send the `Cache-Control: no-cache` header to skip the cache for a specific request.
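One way to build such a request with Python's standard library is sketched below. The URL and the `prompt` body are placeholders for illustration, and `build_bypass_request` is a hypothetical helper; substitute your own gateway address and payload.

```python
import json
import urllib.request


def build_bypass_request(url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request that asks the plugin to skip the cache."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Cache-Control": "no-cache",  # bypasses the cache for this request only
        },
    )


# Hypothetical gateway address, used only as an example.
request = build_bypass_request("http://localhost:8080/v1/completions", "What is semantic caching?")
```

Passing `request` to `urllib.request.urlopen` would send it; nothing is sent by the code above.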
Prerequisites
- A valid API key for the configured embedding provider.
Tuning the Similarity Threshold
The `similarity_threshold` parameter controls how “similar” two queries must be to produce a cache hit:
| Threshold | Behavior |
|---|---|
| 0.95–1.0 | Very strict: only near-identical queries match |
| 0.85–0.95 | Balanced: catches paraphrases and rewordings |
| 0.70–0.85 | Loose: broader matching, may return less precise results |
| < 0.70 | Not recommended: too many false positives |
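To make the bands concrete, the sketch below classifies a threshold according to the table and shows how one similarity score can be a hit or a miss depending on the setting. `threshold_band` and `is_hit` are hypothetical helpers; assigning the shared boundary values (0.95, 0.85) to the stricter band and counting a score equal to the threshold as a hit are assumptions.

```python
def threshold_band(threshold: float) -> str:
    """Map a similarity_threshold value to the band names used in the table."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if threshold >= 0.95:
        return "very strict"
    if threshold >= 0.85:
        return "balanced"
    if threshold >= 0.70:
        return "loose"
    return "not recommended"


def is_hit(similarity: float, threshold: float) -> bool:
    """A cached entry qualifies when its similarity reaches the threshold."""
    return similarity >= threshold


# The same score is served from cache under the default but not under a strict setting.
score = 0.9523
served_with_default = is_hit(score, 0.85)   # True
served_when_strict = is_hit(score, 0.97)    # False
```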