The `semanticcache` policy caches completions and serves them for prompts that are
semantically similar, not just byte-identical. It embeds the prompt, looks for a close
match in a Redis vector store, and returns the cached response on a hit.
It runs in two stages: pre_request (lookup) and post_response (populate).
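
The two stages can be pictured with a small in-memory sketch. It is not the policy's implementation: the real store is a Redis vector index and the real embeddings come from the configured provider. `SemanticCacheSketch`, `Entry`, and `toy_embed` are hypothetical names used only for illustration, and the word-count embedding is a stand-in for a real model.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Entry:
    vector: List[float]
    response: str
    expires_at: float  # absolute time in seconds


@dataclass
class SemanticCacheSketch:
    # Hypothetical stand-in for the policy; a plain list replaces the vector store.
    embed: Callable[[str], List[float]]
    similarity_threshold: float = 0.85
    ttl_seconds: float = 24 * 3600
    entries: List[Entry] = field(default_factory=list)

    def pre_request(self, prompt: str, now: Optional[float] = None) -> Optional[Tuple[str, float]]:
        """Lookup stage: return (response, score) for the best live match, else None."""
        now = time.time() if now is None else now
        query = self.embed(prompt)
        best = None
        for entry in self.entries:
            if entry.expires_at <= now:
                continue  # expired entries never produce a hit
            score = cosine_similarity(query, entry.vector)
            if score >= self.similarity_threshold and (best is None or score > best[1]):
                best = (entry.response, score)
        return best

    def post_response(self, prompt: str, response: str, now: Optional[float] = None) -> None:
        """Populate stage: store the upstream response under the prompt's embedding."""
        now = time.time() if now is None else now
        self.entries.append(Entry(self.embed(prompt), response, now + self.ttl_seconds))


def toy_embed(prompt: str) -> List[float]:
    # Toy embedding (assumption): word counts over a tiny fixed vocabulary.
    vocab = ["reset", "password", "weather", "today"]
    words = prompt.lower().split()
    return [float(words.count(w)) for w in vocab]
```

With this sketch, a prompt that shares the same key words as a stored one scores close to 1.0 and is served from the cache, while an unrelated prompt scores 0.0 and falls through to the upstream call.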
| Setting | Type | Default | Notes |
|---|---|---|---|
| `similarity_threshold` | float | 0.85 | Minimum cosine similarity for a hit (0–1). |
| `ttl` | duration | 24h | How long entries live. |
| `embedding.provider` | string | openai | Embedding provider. |
| `embedding.model` | string | text-embedding-ada-002 | Embedding model. |
| `embedding.api_key` | string | (none) | Credential for the embedding provider. |
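
As a rough illustration of how these settings and their defaults fit together, the sketch below mirrors the table in a Python dataclass and checks the two constraints the table states: the threshold range and a parseable duration. `SemanticCacheSettings` and `parse_duration` are hypothetical helpers, and the duration grammar (an integer followed by `s`, `m`, `h`, or `d`) is an assumption; the proxy may accept other forms.

```python
import re
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Assumed duration suffixes; the proxy's actual duration syntax may differ.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '24h' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


@dataclass
class SemanticCacheSettings:
    # Defaults copied from the settings table.
    similarity_threshold: float = 0.85
    ttl: str = "24h"
    embedding_provider: str = "openai"
    embedding_model: str = "text-embedding-ada-002"
    embedding_api_key: Optional[str] = None  # the table lists no default

    def validate(self) -> timedelta:
        """Check the threshold range and return the TTL as a timedelta."""
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0 and 1")
        return parse_duration(self.ttl)
```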
On a hit
The cached response is returned and the proxy sets two response headers: `X-Cache-Status: HIT` and `X-Cache-Similarity: <score>`.
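
A client can use these headers to tell cached answers from fresh ones. The helper below is a minimal sketch: `read_cache_headers` is a hypothetical name, it assumes the score is sent as a plain decimal number, and it treats anything other than `HIT` as not a hit, since the headers sent on a miss are not specified here.

```python
from typing import Mapping, Optional, Tuple


def read_cache_headers(headers: Mapping[str, str]) -> Tuple[bool, Optional[float]]:
    """Return (is_hit, similarity) from a response's headers."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    is_hit = lowered.get("x-cache-status", "").strip().upper() == "HIT"
    score = None
    raw = lowered.get("x-cache-similarity")
    if is_hit and raw is not None:
        try:
            score = float(raw)  # assumption: the score is a decimal string
        except ValueError:
            score = None
    return is_hit, score
```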
Bypassing
Send `Cache-Control: no-cache` to skip the lookup and force a fresh upstream call (the
result still populates the cache).
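
From the client side, bypassing is just a matter of adding the header to the request. The sketch below builds such a request with the standard library without sending it; `build_bypass_request` is a hypothetical helper, and the URL and JSON payload shape in the usage note are placeholders rather than the proxy's documented API.

```python
import json
import urllib.request


def build_bypass_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that asks the proxy to skip the cache lookup."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Cache-Control": "no-cache",  # skip lookup; the response is still cached
        },
    )
```

For example, `build_bypass_request("http://localhost:8080/v1/completions", {"prompt": "hello"})` returns a request object that can be passed to `urllib.request.urlopen` once pointed at a real proxy address.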

Tune `similarity_threshold` carefully: too low risks serving a cached answer for a meaningfully different prompt; too high reduces hit rate. Start at 0.9 and lower it while watching answer quality.
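
One way to watch answer quality while lowering the threshold is to replay a set of reviewed prompt pairs offline and see how each candidate value trades hit rate against wrong hits. The sketch below assumes you have such pairs, each with a similarity score and a human judgment of whether the cached answer would have been acceptable. `sweep_thresholds` is a hypothetical helper, and `ILLUSTRATIVE_SAMPLES` is made-up data included only so the function can be exercised.

```python
from typing import Dict, List, Sequence, Tuple

# Made-up (similarity, cached_answer_acceptable) pairs, for illustration only.
ILLUSTRATIVE_SAMPLES: List[Tuple[float, bool]] = [
    (0.97, True),
    (0.93, True),
    (0.91, False),
    (0.88, True),
    (0.86, False),
    (0.70, False),
]


def sweep_thresholds(
    samples: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
    """For each threshold, report the hit rate and the share of hits that were wrong."""
    if not samples:
        raise ValueError("samples must not be empty")
    rows = []
    for threshold in thresholds:
        # A sample counts as a hit when its similarity meets the threshold.
        hits = [acceptable for similarity, acceptable in samples if similarity >= threshold]
        wrong = sum(1 for acceptable in hits if not acceptable)
        rows.append(
            {
                "threshold": threshold,
                "hit_rate": len(hits) / len(samples),
                "wrong_hit_rate": wrong / len(hits) if hits else 0.0,
            }
        )
    return rows
```

Calling `sweep_thresholds(ILLUSTRATIVE_SAMPLES, [0.95, 0.90, 0.85])` shows the expected pattern on the toy data: each step down raises the hit rate and also the share of hits that a reviewer marked as unacceptable.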