global. A group_by_header sub-partitions the
counter within that scope (e.g. per end-user or tenant).
ratelimit — request rate limiting
Counts requests in a sliding window.
| Setting | Type | Notes |
|---|---|---|
| limit | int | Max requests per window. |
| window | duration | Go duration string: 30s, 1m, 1h. |
| retry_after | string | Retry-After value when limited (default 60). |
| group_by_header | string | Optional sub-partition key (e.g. a tenant header). |
Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and
(when limited) Retry-After.
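
To make the sliding-window behaviour concrete, here is a minimal sketch of one common approach, a sliding-window log that stores one timestamp per accepted request. The gateway's actual algorithm is not documented here, so treat this as an illustration only. The class name `SlidingWindowLimiter`, its `allow` method, and the `key` argument (standing in for the `group_by_header` value) are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical sliding-window log limiter (not the gateway's own code)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit            # max requests per window
        self.window = window_seconds  # window length in seconds
        self.clock = clock            # injectable clock, handy for testing
        self.hits = defaultdict(deque)  # partition key -> accepted timestamps

    def allow(self, key=""):
        """Return (allowed, remaining) for the partition named by key."""
        now = self.clock()
        q = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False, 0
        q.append(now)
        return True, self.limit - len(q)
```

The `remaining` value is the kind of number a gateway could report in X-RateLimit-Remaining, and each distinct `key` gets its own counter, which is the sub-partitioning that `group_by_header` describes.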
tokenratelimit — token rate limiting
Counts prompt + completion tokens from the provider’s usage block — the right control for
LLM cost, since a few large requests can cost more than many small ones.
| Setting | Type | Notes |
|---|---|---|
| window.unit | enum | second · minute · hour · day. |
| window.max | int | Max tokens per window. |
| group_by_header | string | Optional sub-partition key. |
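
As an illustration of token-based accounting, the sketch below charges prompt plus completion tokens against a per-window budget. Several things here are assumptions rather than documented behaviour: the usage field names `prompt_tokens` and `completion_tokens` (provider payloads differ), the use of fixed windows aligned to the unit, and the check-before, charge-after flow (token counts are only known once the provider responds). `TokenWindow` and its methods are hypothetical names.

```python
import time

# Seconds per window.unit value.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


class TokenWindow:
    """Hypothetical fixed-window token budget (not the gateway's own code)."""

    def __init__(self, unit, max_tokens, clock=time.time):
        self.size = UNIT_SECONDS[unit]
        self.max = max_tokens
        self.clock = clock
        # (key, window index) -> tokens used. Old slots are never evicted
        # in this sketch; a real implementation would expire them.
        self.used = {}

    def _slot(self, key):
        return (key, int(self.clock() // self.size))

    def allow(self, key=""):
        """Admit a request only while the current window has budget left."""
        return self.used.get(self._slot(key), 0) < self.max

    def record(self, usage, key=""):
        """Charge the tokens reported in a usage block after the response."""
        # Assumed field names; adjust to the provider's actual payload.
        tokens = usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)
        slot = self._slot(key)
        self.used[slot] = self.used.get(slot, 0) + tokens
        return tokens
```

Because the charge lands after the response, a single large request can overshoot the budget; in this sketch the partition is then blocked until the next window starts.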
Choosing a scope
- Global policy → a gateway-wide ceiling protecting your upstream spend.
- Consumer-scoped policy → per-tenant quotas.
- group_by_header → fairness within a tenant (per end-user), without a policy per user.
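
One way to picture how the three scopes relate is as the key that selects a counter. The sketch below is only a mental model: the function `counter_key`, the key layout, and the sample header name `X-User-Id` are hypothetical and not taken from the gateway.

```python
def counter_key(policy_id, consumer=None, group_by_header=None, headers=None):
    """Build a hypothetical counter key from policy scope and request headers."""
    parts = [policy_id]
    if consumer is not None:
        # Consumer-scoped policy: one counter per tenant.
        parts.append("consumer=" + consumer)
    if group_by_header:
        # Header names are case-insensitive, so compare in lowercase.
        lowered = {k.lower(): v for k, v in (headers or {}).items()}
        name = group_by_header.lower()
        # Requests without the header share the empty-value partition.
        parts.append(name + "=" + lowered.get(name, ""))
    return "|".join(parts)


# Global ceiling: every request shares one counter.
global_key = counter_key("rl-global")
# Per-tenant quota, further split per end-user by a header value.
user_key = counter_key(
    "rl-tenant",
    consumer="acme",
    group_by_header="X-User-Id",
    headers={"x-user-id": "u42"},
)
```

Under this model, adding `group_by_header` appends one more component to the key, so a single policy yields a separate counter per end-user.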