Overview
The token rate limiter operates on two dimensions:
- Token Usage: Controls the total number of tokens consumed
- Request Count: Limits the number of requests per minute
Configuration
Configuration Parameters
- tokens_per_request: Default number of tokens reserved for each request when actual usage cannot be determined
- tokens_per_minute: Rate at which tokens are replenished (per minute)
- bucket_size: Maximum number of tokens that can be accumulated
- requests_per_minute: Maximum number of requests allowed per minute
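To show how the four parameters relate to one another, here is a minimal Python sketch. The class name RateLimitConfig, the validation rules, and the sample values are assumptions made for illustration; they are not the plugin's actual schema or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitConfig:
    """Hypothetical container for the four parameters listed above."""
    tokens_per_request: int   # reserved when actual usage cannot be determined
    tokens_per_minute: int    # replenishment rate
    bucket_size: int          # cap on accumulated tokens
    requests_per_minute: int  # request-count limit

    def __post_init__(self) -> None:
        for name in ("tokens_per_request", "tokens_per_minute",
                     "bucket_size", "requests_per_minute"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        # Assumed sanity check: a reservation larger than the bucket
        # could never be satisfied.
        if self.tokens_per_request > self.bucket_size:
            raise ValueError("tokens_per_request exceeds bucket_size")


# Illustrative values only, not recommended defaults.
config = RateLimitConfig(
    tokens_per_request=500,
    tokens_per_minute=10_000,
    bucket_size=20_000,
    requests_per_minute=60,
)
```

With these sample values the bucket holds at most two minutes' worth of replenishment, which is one way to leave room for short bursts.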
How It Works
Token Bucket Algorithm
- Pre-Request Phase:
  - Checks if enough tokens are available in the bucket
  - Reserves tokens for the request
  - Verifies request count limits
  - For streaming requests, deducts tokens immediately
- Post-Response Phase:
  - Calculates actual token usage from the response
  - Updates the bucket with actual consumption
  - Updates request counts
  - Adds rate limit headers to the response
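The two phases above can be sketched roughly as follows. This is a simplified in-memory illustration, not the plugin's code: the Bucket class and both function names are hypothetical, the reservation is deducted up front for every request, and clamping the balance at zero is an assumption. Response headers are left out because their names are not given here.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: int              # tokens currently available
    requests_remaining: int  # requests left in the current minute


def pre_request(bucket: Bucket, tokens_per_request: int) -> bool:
    """Check both limits and reserve tokens; False means reject with 429."""
    if bucket.requests_remaining <= 0:
        return False
    if bucket.tokens < tokens_per_request:
        return False
    bucket.tokens -= tokens_per_request  # hold the reservation
    return True


def post_response(bucket: Bucket, reserved: int, actual: int) -> None:
    """Swap the reservation for the actual usage and count the request."""
    # Refund the unused part of the reservation, or charge the overrun.
    # Assumption: the balance is clamped at zero instead of going negative.
    bucket.tokens = max(0, bucket.tokens + reserved - actual)
    bucket.requests_remaining = max(0, bucket.requests_remaining - 1)
```

For example, reserving 300 tokens from a bucket of 1000 and then reporting an actual usage of 120 leaves 880 tokens, because the unused 180 are returned.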
Token Replenishment
- Tokens are replenished automatically based on the tokens_per_minute rate
- Replenishment occurs when checking the bucket state
- Maximum tokens cannot exceed the bucket_size
- Request counts reset every minute
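One way to realise the replenishment rules above is a lazy refill computed whenever the bucket is read. The sketch below assumes a continuous, proportional top-up and a fixed one-minute window for request counts; the text does not say whether the plugin refills proportionally or in whole-minute steps, and the function names are hypothetical.

```python
from typing import Tuple


def refill(tokens: float, last_refill: float, now: float,
           tokens_per_minute: float, bucket_size: float) -> Tuple[float, float]:
    """Return (new_token_count, new_last_refill) after a lazy top-up."""
    elapsed = max(0.0, now - last_refill)       # seconds since the last check
    added = elapsed * tokens_per_minute / 60.0  # proportional replenishment
    return min(bucket_size, tokens + added), now


def reset_requests_if_due(remaining: int, window_start: float, now: float,
                          requests_per_minute: int) -> Tuple[int, float]:
    """Restore the full request allowance once a minute has passed."""
    if now - window_start >= 60.0:
        return requests_per_minute, now
    return remaining, window_start
```

Because the refill is derived from the elapsed time, no background timer is needed: a bucket that sits idle simply receives its accumulated tokens, capped at bucket_size, the next time it is checked.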
Response Headers
The plugin adds the following headers to track usage:
Rate Limit Response
When limits are exceeded, the plugin returns a 429 status code with details:
Streaming Support
The plugin provides special handling for streaming requests:
- Tokens are reserved and deducted immediately at request time
- Token usage is tracked throughout the stream
- Final token consumption is calculated from the accumulated usage
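The streaming flow above might look like the following sketch. It assumes each streamed chunk arrives with its own token count, which is an illustrative simplification; how the plugin actually extracts usage from a stream is not described here, and track_stream and settle are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, Tuple


def track_stream(chunks: Iterable[Tuple[str, int]],
                 usage: Dict[str, int]) -> Iterator[str]:
    """Pass chunk text through while accumulating token counts in usage."""
    for text, token_count in chunks:
        usage["tokens"] = usage.get("tokens", 0) + token_count
        yield text


def settle(bucket_tokens: int, reserved: int, used: int) -> int:
    """Reconcile the up-front deduction with the accumulated usage."""
    # Assumption: the balance is clamped at zero instead of going negative.
    return max(0, bucket_tokens + reserved - used)


# Usage: deduct at request time, stream, then settle at the end.
bucket_tokens = 1000
reserved = 200
bucket_tokens -= reserved
usage: Dict[str, int] = {}
body = "".join(track_stream([("Hel", 1), ("lo", 1)], usage))
bucket_tokens = settle(bucket_tokens, reserved, usage["tokens"])
```

Deducting before the first chunk is sent means concurrent requests see the reduced balance straight away, even though the final cost is only known when the stream ends.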
Implementation Details
Storage
- Uses Redis for persistent storage
- Bucket state includes:
  - Current token count
  - Remaining requests
  - Last refill timestamp
- 24-hour TTL on bucket data
- Thread-safe operations with mutex locks
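As a rough picture of the storage model, the sketch below keeps the three state fields as JSON, applies a 24-hour expiry, and guards access with a lock. It uses an in-memory stand-in so it runs without a Redis server; the class names, the JSON encoding, and the caller-supplied key are all assumptions, not the plugin's actual layout.

```python
import json
import threading
import time
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Tuple

BUCKET_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL mentioned above


@dataclass
class BucketState:
    tokens: int
    requests_remaining: int
    last_refill: float  # Unix timestamp


class InMemoryStore:
    """Hypothetical stand-in for Redis, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._lock = threading.Lock()  # serialises reads and writes

    def save(self, key: str, state: BucketState,
             now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        with self._lock:
            payload = json.dumps(asdict(state))
            self._data[key] = (payload, now + BUCKET_TTL_SECONDS)

    def load(self, key: str,
             now: Optional[float] = None) -> Optional[BucketState]:
        now = time.time() if now is None else now
        with self._lock:
            entry = self._data.get(key)
            if entry is None or entry[1] <= now:
                self._data.pop(key, None)  # missing or expired
                return None
            return BucketState(**json.loads(entry[0]))
```

An in-process lock only protects callers within one process. When several gateway instances share the same Redis, a read-modify-write on the bucket would also need an atomic operation on the Redis side; whether the plugin does this is not stated here.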
Key Format
Best Practices
- Token Configuration
  - Set tokens_per_request based on average request size
  - Configure bucket_size to handle burst traffic
  - Adjust tokens_per_minute based on subscription tiers
- Request Limits
  - Set requests_per_minute to prevent rapid small requests
  - Consider both token and request limits for complete protection
- Monitoring
  - Track token consumption patterns using response headers
  - Monitor rate limit responses for capacity planning
  - Implement exponential backoff in clients
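For the client-side backoff recommended above, a minimal sketch could look like this. The function name, the delay values, and the jitter range are assumptions; send stands for any callable that performs the request and returns a status code and body.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(send: Callable[[], Tuple[int, str]],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Tuple[int, str]:
    """Retry send() while it returns HTTP 429, doubling the delay each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from many clients.
            sleep(delay + random.uniform(0, delay / 2))
    return status, body  # still rate limited after the final attempt
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real waiting.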