Rate Limiting
TrustGate provides two types of rate limiting: request-based and token-based. Together they protect your AI models from overuse and ensure fair resource distribution.
Request-Based Rate Limiting
Configure the rate limiter plugin with multiple limit types:
curl -X POST http://localhost:8080/api/v1/gateways/{gateway-id} \
  -H "Content-Type: application/json" \
  -d '{
    "required_plugins": [
      {
        "name": "rate_limiter",
        "enabled": true,
        "stage": "pre_request",
        "priority": 1,
        "settings": {
          "limits": {
            "global": {
              "limit": 15,
              "window": "1m"
            },
            "per_ip": {
              "limit": 5,
              "window": "1m"
            },
            "per_user": {
              "limit": 5,
              "window": "1m"
            }
          },
          "actions": {
            "type": "reject",
            "retry_after": "60"
          }
        }
      }
    ]
  }'
Configuration Options
- Limit Types:
  - global: Total requests across all users
  - per_ip: Requests from each IP address
  - per_user: Requests per API key
- Window Settings:
  - Use time units: s (seconds), m (minutes), h (hours)
  - Example: "window": "1m" for a one-minute window
- Actions:
  - reject: Block requests when limit exceeded
  - retry_after: Time in seconds before retrying
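The limits above count requests per fixed time window, keyed globally, by IP, or by API key. As an illustration of that counting model (a sketch, not TrustGate's internal implementation — the `FixedWindowLimiter` name is hypothetical):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_secs`, per key (e.g. an IP)."""
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.counters = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_secs)  # current fixed window
        bucket = (key, window)
        if self.counters[bucket] >= self.limit:
            return False  # limit for this window already reached
        self.counters[bucket] += 1
        return True

# Mirror the "per_ip" settings above: 5 requests per 1m window
per_ip = FixedWindowLimiter(limit=5, window_secs=60)
results = [per_ip.allow("10.0.0.1", now=0) for _ in range(6)]
print(results)  # first five allowed, sixth rejected
```

A new window index starts at each minute boundary, so the counter resets automatically without a background sweep.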
Token-Based Rate Limiting
For AI model requests, use token-based rate limiting to control token consumption:
curl -X POST http://localhost:8080/api/v1/gateways/{gateway-id} \
  -H "Content-Type: application/json" \
  -d '{
    "required_plugins": [
      {
        "name": "token_rate_limiter",
        "enabled": true,
        "settings": {
          "tokens_per_request": 20,
          "tokens_per_minute": 100,
          "bucket_size": 1000,
          "requests_per_minute": 60
        }
      }
    ]
  }'
Token Limiter Settings
- tokens_per_request: Maximum tokens per single request
- tokens_per_minute: Total tokens allowed per minute
- bucket_size: Maximum token burst capacity
- requests_per_minute: Maximum number of requests per minute
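These settings describe a token bucket: bucket_size caps how large a burst can be, while tokens_per_minute sets the sustained refill rate. A minimal sketch of that algorithm (an illustration of the general technique, not TrustGate's implementation):

```python
class TokenBucket:
    """Token bucket: holds up to `bucket_size` tokens, refilled continuously
    at `tokens_per_minute`. A request succeeds only if enough tokens remain."""
    def __init__(self, bucket_size, tokens_per_minute):
        self.capacity = bucket_size
        self.refill_per_sec = tokens_per_minute / 60.0
        self.tokens = float(bucket_size)  # start with a full bucket
        self.last = 0.0                   # timestamp of the last refill

    def try_consume(self, tokens, now):
        # Refill for the elapsed time, capped at the bucket's capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if tokens > self.tokens:
            return False  # not enough tokens: reject (or defer) the request
        self.tokens -= tokens
        return True

# Mirror the settings above: bucket_size=1000, tokens_per_minute=100
bucket = TokenBucket(bucket_size=1000, tokens_per_minute=100)
print(bucket.try_consume(900, now=0))   # True: a burst within bucket_size
print(bucket.try_consume(200, now=0))   # False: only 100 tokens remain
print(bucket.try_consume(200, now=60))  # True: one minute refills 100 tokens
```

The bucket is why a burst up to bucket_size can exceed tokens_per_minute momentarily, while the long-run average stays at the refill rate.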
Monitoring Rate Limits
The gateway includes rate limit information in response headers:
Request-Based Headers
X-RateLimit-Limit-Minute: 60
X-RateLimit-Remaining-Minute: 59
X-RateLimit-Reset-Minute: 52
Token-Based Headers
X-RateLimit-Limit-Tokens: 1000
X-RateLimit-Remaining-Tokens: 900
X-RateLimit-Reset-Tokens: 45
X-Tokens-Consumed: 100
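Clients can read these headers after each response to track remaining quota. A small helper for doing so (the `summarize_token_limits` name is hypothetical; the header names are the ones documented above):

```python
def summarize_token_limits(headers):
    """Extract token rate-limit state from a response's headers dict."""
    return {
        "limit": int(headers["X-RateLimit-Limit-Tokens"]),
        "remaining": int(headers["X-RateLimit-Remaining-Tokens"]),
        "reset_secs": int(headers["X-RateLimit-Reset-Tokens"]),
        "consumed": int(headers["X-Tokens-Consumed"]),
    }

# Values matching the example headers above
headers = {
    "X-RateLimit-Limit-Tokens": "1000",
    "X-RateLimit-Remaining-Tokens": "900",
    "X-RateLimit-Reset-Tokens": "45",
    "X-Tokens-Consumed": "100",
}
state = summarize_token_limits(headers)
print(state["remaining"])  # 900
```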
Rate Limit Response
When a rate limit is exceeded, you'll receive a 429 (Too Many Requests) response:
{
  "error": "Rate limit exceeded",
  "retry_after": 60
}
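A client should treat a 429 as a signal to back off for retry_after seconds before retrying. A sketch of that decision, based on the response body shown above (the `retry_delay` helper is illustrative, not part of any TrustGate SDK):

```python
import json

def retry_delay(status_code, body_text, default=60):
    """Return how many seconds to wait before retrying, or None if the
    response was not rate-limited. Falls back to `default` when the 429
    body is missing or unparseable."""
    if status_code != 429:
        return None
    try:
        return json.loads(body_text).get("retry_after", default)
    except ValueError:
        return default

print(retry_delay(200, ""))  # None: success, no retry needed
print(retry_delay(429, '{"error": "Rate limit exceeded", "retry_after": 60}'))  # 60
```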
Best Practices
- Layered Rate Limiting
  - Use both request and token limits for AI endpoints
  - Configure global, IP, and user-based limits
- Window Sizing
  - Use shorter windows (1m) for precise control
  - Use longer windows (1h) for overall quotas
- Monitoring
  - Watch rate limit headers to track usage
  - Implement client-side rate limiting
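Client-side rate limiting can be as simple as spacing outgoing requests so they stay under the gateway's requests_per_minute, avoiding 429s entirely. A minimal pacing sketch (the `Pacer` class is illustrative, not a TrustGate component):

```python
class Pacer:
    """Space requests evenly to stay under `requests_per_minute`."""
    def __init__(self, requests_per_minute):
        self.min_interval = 60.0 / requests_per_minute
        self.next_allowed = 0.0  # earliest time the next request may be sent

    def wait_time(self, now):
        """Seconds the caller should sleep before sending the next request."""
        delay = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay

pacer = Pacer(requests_per_minute=60)  # at most one request per second
print(pacer.wait_time(0.0))  # 0.0: the first request goes out immediately
print(pacer.wait_time(0.2))  # ~0.8: the second must wait for the next slot
```

In a real client, sleep for the returned delay (e.g. `time.sleep`) before each call.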
Next Steps
After configuring rate limiting:
- Configure Load Balancing for high availability