OpenAI Token-Based Rate Limiting
This guide demonstrates how to implement token-based rate limiting in TrustGate to effectively manage AI model usage and costs. Unlike traditional request-based rate limiting, token-based rate limiting considers the actual token consumption of each request, providing more granular control over API usage.
Prerequisites
- TrustGate installed and running
- An API key for an AI provider (e.g., OpenAI)
- Basic understanding of rate limiting concepts
Step 1: Create a Gateway with Token Rate Limiter
First, create a gateway with the token rate limiter plugin enabled:
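A sketch of what this call might look like. The admin address (localhost:8080), endpoint path, and payload envelope are assumptions about a typical TrustGate deployment — check your version’s API reference. The `token_rate_limiter` settings use the parameters explained below:

```bash
# Sketch: create a gateway with the token rate limiter plugin enabled.
# Endpoint path and envelope fields are assumptions; verify against
# your TrustGate version's admin API reference.
curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ai-gateway",
    "required_plugins": [{
      "name": "token_rate_limiter",
      "enabled": true,
      "settings": {
        "tokens_per_request": 1000,
        "tokens_per_minute": 10000,
        "bucket_size": 10000,
        "requests_per_minute": 60
      }
    }]
  }'
```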
Configuration Parameters
- tokens_per_request: Maximum number of tokens allowed in a single request
- tokens_per_minute: Total token budget allocated per minute
- bucket_size: Maximum capacity of the token bucket
- requests_per_minute: Maximum number of requests allowed per minute
Step 2: Configure an Upstream
Set up an upstream that connects to your AI provider:
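For example, pointing the upstream at the OpenAI API. The endpoint path and target fields are illustrative assumptions; substitute the gateway ID returned in Step 1:

```bash
# Sketch: register an upstream targeting api.openai.com.
# Field names are assumptions -- adapt to your TrustGate version.
curl -X POST "http://localhost:8080/api/v1/gateways/$GATEWAY_ID/upstreams" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-upstream",
    "targets": [{
      "host": "api.openai.com",
      "port": 443,
      "protocol": "https",
      "provider": "openai",
      "weight": 100,
      "credentials": {
        "header_name": "Authorization",
        "header_value": "Bearer '"$OPENAI_API_KEY"'"
      }
    }]
  }'
```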
Step 3: Create a Service
Create a service that uses the upstream:
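A minimal sketch, assuming services reference an upstream by ID (use the IDs returned in the previous steps):

```bash
# Sketch: create a service backed by the upstream from Step 2.
curl -X POST "http://localhost:8080/api/v1/gateways/$GATEWAY_ID/services" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-service",
    "type": "upstream",
    "upstream_id": "'"$UPSTREAM_ID"'"
  }'
```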
Step 4: Add a Rule
Configure a rule to route requests to your service:
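For example, routing POST requests under /v1 to the service (path and field names are illustrative):

```bash
# Sketch: route POST /v1/* traffic to the service from Step 3.
curl -X POST "http://localhost:8080/api/v1/gateways/$GATEWAY_ID/rules" \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/v1",
    "service_id": "'"$SERVICE_ID"'",
    "methods": ["POST"],
    "strip_path": false,
    "active": true
  }'
```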
Step 5: Generate an API Key
Create an API key for authentication:
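A sketch, assuming keys are minted per gateway through the admin API; the response should include the key value used to authenticate requests below:

```bash
# Sketch: generate an API key for this gateway.
curl -X POST "http://localhost:8080/api/v1/gateways/$GATEWAY_ID/keys" \
  -H "Content-Type: application/json" \
  -d '{"name": "demo-key"}'
```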
Using the Rate-Limited API
When making requests to your rate-limited API, you’ll receive headers indicating your current rate limit status:
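For example (the proxy address, request path, and auth header name are illustrative — adjust for your deployment):

```bash
# Sketch: call the proxy (assumed to listen on localhost:8081) with the
# key from Step 5; -i prints the rate-limit headers with the response.
curl -i "http://localhost:8081/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TRUSTGATE_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# Illustrative response headers:
#   X-RateLimit-Limit-Tokens: 10000
#   X-RateLimit-Remaining-Tokens: 9942
#   X-RateLimit-Reset-Tokens: 42
#   X-Tokens-Consumed: 58
```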
Response Headers
The API returns several headers to help you track your rate limit status:
- X-RateLimit-Limit-Tokens: Total token budget per minute
- X-RateLimit-Remaining-Tokens: Remaining tokens in the current window
- X-RateLimit-Reset-Tokens: Time until token budget resets (in seconds)
- X-RateLimit-Limit-Requests: Maximum requests per minute
- X-RateLimit-Remaining-Requests: Remaining requests in the current window
- X-RateLimit-Reset-Requests: Time until request count resets (in seconds)
- X-Tokens-Consumed: Number of tokens consumed by the current request
Best Practices
- Set Appropriate Limits
  - Consider your AI model’s pricing
  - Account for both input and output tokens
  - Leave headroom for burst traffic
- Monitor Usage
  - Track token consumption patterns
  - Watch for rate limit errors
  - Adjust limits based on actual usage
- Handle Rate Limit Errors
  - Implement exponential backoff (see the retry sketch after this list)
  - Queue requests when approaching limits
  - Provide clear feedback to users
- Optimize Token Usage
  - Compress prompts where possible
  - Use efficient model configurations
  - Implement client-side token estimation (see the estimation sketch after this list)
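If a request comes back rate-limited (HTTP 429 is assumed here), retry with exponentially growing pauses rather than hammering the gateway. A minimal shell sketch, reusing the illustrative proxy URL and auth header from above:

```bash
# Sketch: retry on rate-limit errors with exponential backoff.
# Assumes the gateway answers HTTP 429 when a limit is exhausted.
url="http://localhost:8081/v1/chat/completions"
payload='{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'

for attempt in 0 1 2 3 4; do
  status=$(curl -s -o response.json -w "%{http_code}" "$url" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TRUSTGATE_API_KEY" \
    -d "$payload")
  if [ "$status" != "429" ]; then
    break                    # success, or an error that retrying won't fix
  fi
  sleep $((2 ** attempt))    # wait 1s, 2s, 4s, 8s, 16s between attempts
done
```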
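Client-side estimation lets you reject oversized prompts before they reach the gateway. The sketch below uses the common rule of thumb that English text averages about four characters per token; for exact counts you would use your model’s actual tokenizer:

```bash
# Sketch: rough token estimate (~4 characters per token for English).
# This is a heuristic, not the model's real tokenizer.
prompt="Summarize the following document in three bullet points..."
chars=$(printf '%s' "$prompt" | wc -c)
est_tokens=$(( (chars + 3) / 4 ))
echo "Estimated prompt tokens: $est_tokens"
```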
Troubleshooting
If you encounter rate limit errors, check:
- Current rate limit status using response headers
- Token consumption patterns in your requests
- Plugin configuration in the gateway
- Time until rate limits reset
Next Steps
- Implement monitoring for rate limit metrics
- Set up alerts for high token usage
- Configure multiple rate limit tiers
- Add fallback providers for high availability