Rate Limiting¶

Overview¶

llm-toll provides local rate limiting to prevent HTTP 429 errors from LLM providers. Limits are enforced before the API call is made, saving you both time and potential retry costs.

Configuration¶

from llm_toll import track_costs

@track_costs(
    rate_limit=50,     # max 50 requests per minute (RPM)
    tpm_limit=40000,   # max 40,000 tokens per minute (TPM)
)
def chat(prompt):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )

Parameter	Type	Description
`rate_limit`	`int \\| None`	Maximum requests per minute (RPM). `None` to disable.
`tpm_limit`	`int \\| None`	Maximum tokens per minute (TPM). `None` to disable.

Sliding Window Algorithm¶

The rate limiter uses a 60-second sliding window:

For RPM: tracks timestamps of recent requests in a deque
For TPM: tracks (timestamp, token_count) tuples in a deque

On each check, entries older than 60 seconds are pruned. If the remaining count meets or exceeds the limit, LocalRateLimitError is raised.

LocalRateLimitError¶

from llm_toll.exceptions import LocalRateLimitError

try:
    result = chat("Hello")
except LocalRateLimitError as e:
    print(f"Limit type: {e.limit_type}")    # "rpm" or "tpm"
    print(f"Limit value: {e.limit_value}")   # e.g., 50
    print(f"Retry after: {e.retry_after:.1f}s")

The retry_after attribute tells you exactly how many seconds to wait before the next request is allowed.

Attribute	Type	Description
`limit_type`	`str \\| None`	`"rpm"` or `"tpm"`
`limit_value`	`int \\| None`	The configured limit value
`retry_after`	`float \\| None`	Seconds until the next request is allowed

Scoping¶

Each @track_costs decorator gets its own RateLimiter instance. Limits are scoped per-decorator, not per-project or globally.

# These have independent rate limiters
@track_costs(rate_limit=10)
def func_a(): ...

@track_costs(rate_limit=20)
def func_b(): ...

Check vs Record¶

The rate limiter separates checking from recording:

check() -- Called before the API call. Raises LocalRateLimitError if limits would be breached.
record(tokens=N) -- Called after a successful API call. Adds the request to the sliding window.

This two-step approach avoids counting failed or skipped calls against your limits.

Concurrency Note

Under concurrent use, multiple threads may pass check() before any of them calls record(), slightly exceeding the configured limit. This is an accepted trade-off: the alternative (holding the lock across the API call) would serialize all LLM requests.

Combining with Budget¶

Rate limits and budget enforcement work together:

@track_costs(
    project="my_app",
    max_budget=5.00,
    rate_limit=50,
    tpm_limit=40000,
)
def chat(prompt):
    ...

The order of checks is:

Budget check (pre-call)
Rate limit check (pre-call)
Execute function
Extract usage and log cost (post-call)