NeuralTrust | The leading security platform for generative AI

This guide demonstrates how to implement token-based rate limiting in TrustGate to effectively manage AI model usage and costs. Unlike traditional request-based rate limiting, token-based rate limiting considers the actual token consumption of each request, providing more granular control over API usage.

Prerequisites

TrustGate installed and running
Access to an AI provider (e.g., OpenAI) API key
Basic understanding of rate limiting concepts

Step 1: Create a Gateway with Token Rate Limiter

First, create a gateway with the token rate limiter plugin enabled:

curl -X POST "http://localhost:8080/api/v1/gateways" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "name": "OpenAI Token Rate Limiter",
    "required_plugins": [
      {
        "name": "token_rate_limiter",
        "enabled": true,
        "settings": {
          "tokens_per_request": 20,     # Maximum tokens per single request
          "tokens_per_minute": 100,     # Token budget per minute
          "bucket_size": 1000,          # Maximum token bucket capacity
          "requests_per_minute": 60     # Maximum requests per minute
        }
      }
    ]
  }'

Configuration Parameters

tokens_per_request: Maximum number of tokens allowed in a single request
tokens_per_minute: Total token budget allocated per minute
bucket_size: Maximum capacity of the token bucket
requests_per_minute: Maximum number of requests allowed per minute

Step 2: Configure an Upstream

Set up an upstream that connects to your AI provider:

curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/upstreams" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "name": "ai-upstream",
    "algorithm": "round-robin",
    "targets": [{
      "path": "/v1/chat/completions",
      "provider": "openai",
      "weight": 100,
      "models": ["gpt-3.5-turbo", "gpt-4"],
      "default_model": "gpt-3.5-turbo",
      "credentials": {
        "header_name": "Authorization",
        "header_value": "Bearer your-api-key"
      }
    }]
  }'

Step 3: Create a Service

Create a service that uses the upstream:

curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/services" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "name": "ai-service",
    "type": "upstream",
    "description": "AI API Service",
    "upstream_id": "{upstream_id}",
    "retries": 3
  }'

Step 4: Add a Rule

Configure a rule to route requests to your service:

curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/rules" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "path": "/v1",
    "service_id": "{service_id}",
    "methods": ["POST"],
    "strip_path": false,
    "active": true
  }'

Step 5: Generate an API Key

Create an API key for authentication:

curl -X POST "http://localhost:8080/api/v1/gateways/{gateway_id}/keys" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "name": "test-key",
    "expires_at": "2024-12-31T23:59:59Z"
  }'

Using the Rate-Limited API

When making requests to your rate-limited API, you’ll receive headers indicating your current rate limit status:

curl -X POST "http://localhost:8081/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "X-TG-API-Key: your-api-key" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ]
  }'

Response Headers

The API returns several headers to help you track your rate limit status:

X-RateLimit-Limit-Tokens: Total token budget per minute
X-RateLimit-Remaining-Tokens: Remaining tokens in the current window
X-RateLimit-Reset-Tokens: Time until token budget resets (in seconds)
X-RateLimit-Limit-Requests: Maximum requests per minute
X-RateLimit-Remaining-Requests: Remaining requests in the current window
X-RateLimit-Reset-Requests: Time until request count resets (in seconds)
X-Tokens-Consumed: Number of tokens consumed by the current request

Best Practices

Set Appropriate Limits
- Consider your AI model’s pricing
- Account for both input and output tokens
- Leave headroom for burst traffic
Monitor Usage
- Track token consumption patterns
- Watch for rate limit errors
- Adjust limits based on actual usage
Handle Rate Limit Errors
- Implement exponential backoff
- Queue requests when approaching limits
- Provide clear feedback to users
Optimize Token Usage
- Compress prompts where possible
- Use efficient model configurations
- Implement client-side token estimation

Troubleshooting

If you encounter rate limit errors, check:

Current rate limit status using response headers
Token consumption patterns in your requests
Plugin configuration in the gateway
Time until rate limits reset

Next Steps

Implement monitoring for rate limit metrics
Set up alerts for high token usage
Configure multiple rate limit tiers
Add fallback providers for high availability

Getting Started

Core Concepts

Traffic Management

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

API Reference

Open AI Token-Based Rate Limiting

Prerequisites

Step 1: Create a Gateway with Token Rate Limiter

Configuration Parameters

Step 2: Configure an Upstream

Step 3: Create a Service

Step 4: Add a Rule

Step 5: Generate an API Key

Using the Rate-Limited API

Response Headers

Best Practices

Troubleshooting

Next Steps

Getting Started

Core Concepts

Traffic Management

Non-REST Connectivity

Rate Limiting & Request Control

Content Security

Application Security

Server Security

Data masking

Extending Functionality

Observability & Monitoring

Benchmark

API Reference

​Prerequisites

​Step 1: Create a Gateway with Token Rate Limiter

​Configuration Parameters

​Step 2: Configure an Upstream

​Step 3: Create a Service

​Step 4: Add a Rule

​Step 5: Generate an API Key

​Using the Rate-Limited API

​Response Headers

​Best Practices

​Troubleshooting

​Next Steps

Prerequisites

Step 1: Create a Gateway with Token Rate Limiter

Configuration Parameters

Step 2: Configure an Upstream

Step 3: Create a Service

Step 4: Add a Rule

Step 5: Generate an API Key

Using the Rate-Limited API

Response Headers

Best Practices

Troubleshooting

Next Steps