Cost Tracking
How It Works
The @track_costs decorator wraps your function and performs the following steps on each call:
- Pre-call budget check -- If max_budget is set, queries the store for the project's accumulated cost. Raises BudgetExceededError if the budget is already exhausted.
- Pre-call rate limit check -- If rate_limit or tpm_limit is set, checks the sliding window. Raises LocalRateLimitError if limits would be breached.
- Execute the wrapped function -- Calls your function normally.
- Extract token usage -- Auto-detects the SDK from the response object using duck-typing. Falls back to the extract_usage callback if auto-detection fails.
- Calculate cost -- Looks up per-token pricing in the PricingRegistry and computes input_tokens * input_cost + output_tokens * output_cost.
- Log usage -- Writes the usage record to the store (SQLite or PostgreSQL) in an atomic transaction.
- Report cost -- Prints a color-coded summary to stderr via CostReporter.
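The steps above can be sketched as a small standalone decorator. This is an illustration only, not the library's implementation: track_costs_sketch, BudgetExceeded, the dict-based store, and the prices mapping are all hypothetical stand-ins, and the rate-limit step is omitted.

```python
import functools
import sys


class BudgetExceeded(Exception):
    """Hypothetical stand-in for BudgetExceededError."""


def track_costs_sketch(project, store, prices, extract_usage, max_budget=None):
    """Illustrative decorator; `store` is a plain dict of project -> accumulated cost."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Pre-call budget check against the accumulated cost.
            if max_budget is not None and store.get(project, 0.0) >= max_budget:
                raise BudgetExceeded(f"{project} has exhausted its budget")
            # Execute the wrapped function.
            result = func(*args, **kwargs)
            # Extract token usage (here: always via the callback).
            model, input_tokens, output_tokens = extract_usage(result)
            # Calculate cost from per-token prices.
            input_cost, output_cost = prices[model]
            cost = input_tokens * input_cost + output_tokens * output_cost
            # Log usage by incrementing the project's accumulated cost.
            store[project] = store.get(project, 0.0) + cost
            # Report the cost on stderr.
            print(f"[{project}] {model}: ${cost:.6f}", file=sys.stderr)
            return result

        return wrapper

    return decorator


store = {}
prices = {"demo-model": (1e-06, 3e-06)}  # (input, output) price per token


@track_costs_sketch(
    "demo",
    store,
    prices,
    extract_usage=lambda r: (r["model"], r["input_tokens"], r["output_tokens"]),
    max_budget=0.001,
)
def fake_call(prompt):
    # Pretend LLM call returning a usage-bearing dict.
    return {"model": "demo-model", "input_tokens": 100, "output_tokens": 200}


fake_call("hello")
```

Because the budget check runs before the call, a call that starts under budget is allowed to finish even if it pushes the total over the limit; only the next call is rejected.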
Auto-Detection
The decorator uses duck-typing to detect response objects from supported SDKs:
- OpenAI -- Checks for usage.prompt_tokens and usage.completion_tokens attributes
- Anthropic -- Checks for usage.input_tokens and usage.output_tokens attributes
- Gemini -- Checks for usage_metadata.prompt_token_count and usage_metadata.candidates_token_count
Parsers are tried in sequence (OpenAI, Anthropic, Gemini). The first successful match is used.
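A minimal sketch of this kind of duck-typed detection is shown below. The attribute names come from the list above; the function names (detect_usage and the _parse_* helpers) are hypothetical, and the library's actual parsers may differ.

```python
from types import SimpleNamespace


def _parse_openai(response):
    usage = response.usage
    return usage.prompt_tokens, usage.completion_tokens


def _parse_anthropic(response):
    usage = response.usage
    return usage.input_tokens, usage.output_tokens


def _parse_gemini(response):
    meta = response.usage_metadata
    return meta.prompt_token_count, meta.candidates_token_count


# Parsers are tried in this order; the first one that succeeds wins.
PARSERS = [
    ("openai", _parse_openai),
    ("anthropic", _parse_anthropic),
    ("gemini", _parse_gemini),
]


def detect_usage(response):
    """Return (sdk, input_tokens, output_tokens), or None if nothing matches."""
    for sdk, parser in PARSERS:
        try:
            input_tokens, output_tokens = parser(response)
        except AttributeError:
            continue  # Response does not have this SDK's shape; try the next.
        return sdk, input_tokens, output_tokens
    return None


# An Anthropic-shaped response: the OpenAI parser fails, the Anthropic one matches.
anthropic_like = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=12, output_tokens=34)
)
detected = detect_usage(anthropic_like)
```

A plain dict or raw HTTP response has none of these attributes, so detect_usage returns None for it; that is the case where a manual extraction callback is needed.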
Manual Extraction
When auto-detection is not possible (e.g., raw HTTP responses, custom APIs), use the extract_usage parameter:
@track_costs(
    extract_usage=lambda res: (res["model"], res["input_tokens"], res["output_tokens"])
)
def my_custom_llm_call(prompt):
    # Returns a dict with model, input_tokens, output_tokens
    return call_my_api(prompt)
The callable must return a tuple of (model: str, input_tokens: int, output_tokens: int).
If extract_usage raises an exception, the decorator emits a warning and skips cost tracking for that call (the function's return value is still passed through).
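The warn-and-skip behaviour can be pictured roughly as follows. This is a sketch under assumptions: safe_extract is a hypothetical helper, and the warning type and message the library actually emits are not specified here.

```python
import warnings


def safe_extract(extract_usage, response):
    """Return (model, input_tokens, output_tokens), or None if extraction fails."""
    try:
        model, input_tokens, output_tokens = extract_usage(response)
    except Exception as exc:
        # Extraction failed: warn and signal that cost tracking should be skipped.
        warnings.warn(f"usage extraction failed, skipping cost tracking: {exc!r}")
        return None
    return str(model), int(input_tokens), int(output_tokens)


def extractor(res):
    return res["model"], res["input_tokens"], res["output_tokens"]


good = {"model": "my-custom-model", "input_tokens": 10, "output_tokens": 20}
bad = {"model": "my-custom-model"}  # Missing token counts -> KeyError in the callback.

usage_ok = safe_extract(extractor, good)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usage_bad = safe_extract(extractor, bad)
```

In both cases the caller still receives the wrapped function's return value; only the tracking step is affected.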
Pricing Registry
Costs are computed using the PricingRegistry, which stores per-token pricing for each model. The registry ships with built-in pricing for:
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini, o3, o3-mini, o4-mini
- Anthropic: claude-sonnet-4-20250514, claude-3.5-sonnet, claude-3-haiku, claude-3-opus, claude-3.5-haiku
- Gemini: gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash, gemini-2.5-pro, gemini-2.5-flash
- Local: ollama/, local/, llama.cpp/ (all $0.00)
Model lookup uses exact match first, then longest-prefix match (e.g., gpt-4o-2024-08-06 resolves to gpt-4o).
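The lookup rule can be illustrated with a short standalone function. resolve_model is a hypothetical name, and the registry's internal data structure is assumed here to be a simple collection of known model names and prefixes.

```python
def resolve_model(name, known_models):
    """Exact match first, then the longest known entry that prefixes `name`."""
    if name in known_models:
        return name
    candidates = [m for m in known_models if name.startswith(m)]
    # Longest prefix wins, so "gpt-4o-mini-..." is not mistaken for "gpt-4o".
    return max(candidates, key=len, default=None)


known = {"gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "ollama/"}

dated = resolve_model("gpt-4o-2024-08-06", known)
dated_mini = resolve_model("gpt-4o-mini-2024-07-18", known)
local = resolve_model("ollama/llama3", known)
unknown = resolve_model("some-other-model", known)
```

Here dated resolves to "gpt-4o", dated_mini to "gpt-4o-mini", local to the "ollama/" prefix, and unknown to None.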
Custom Model Pricing
from llm_toll import default_registry

# Register a custom model
default_registry.register_model(
    "my-custom-model",
    input_cost_per_token=1e-06,
    output_cost_per_token=3e-06,
)
Fallback Pricing
# Set fallback for any unrecognized model
default_registry.set_fallback_pricing(
    input_cost_per_token=1e-06,
    output_cost_per_token=3e-06,
)
Without fallback pricing, unrecognized models emit a PricingMatrixOutdatedWarning and are tracked at $0.00.
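As a rough illustration of how fallback pricing changes the outcome, the sketch below prices a call for an unrecognized model with and without a fallback. price_call and OutdatedPricingWarning are hypothetical stand-ins (the latter for PricingMatrixOutdatedWarning), and the sketch assumes the warning is raised only when no fallback is configured.

```python
import warnings


class OutdatedPricingWarning(UserWarning):
    """Hypothetical stand-in for PricingMatrixOutdatedWarning."""


def price_call(model, input_tokens, output_tokens, prices, fallback=None):
    """Price a call; `prices` maps model -> (input_cost, output_cost) per token."""
    pricing = prices.get(model, fallback)
    if pricing is None:
        # No entry and no fallback: warn and track the call at $0.00.
        warnings.warn(f"no pricing for {model!r}; tracking at $0.00",
                      OutdatedPricingWarning)
        return 0.0
    input_cost, output_cost = pricing
    return input_tokens * input_cost + output_tokens * output_cost


# With the fallback from the example above: 1000 * 1e-06 + 500 * 3e-06.
with_fallback = price_call("brand-new-model", 1000, 500, {}, fallback=(1e-06, 3e-06))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    without_fallback = price_call("brand-new-model", 1000, 500, {})
```

With the fallback the call is priced at roughly $0.0025; without it the call is recorded at $0.00, which understates real spend until the model is registered.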
Cost Rounding
All costs are rounded to 10 decimal places (COST_ROUND_PLACES = 10) to prevent floating-point drift from accumulating across many calls.
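The effect of rounding can be seen in a small experiment. COST_ROUND_PLACES comes from the text above; round_cost is a hypothetical helper, and where exactly the library applies the rounding (per call, on the running total, or both) is an assumption of this sketch.

```python
COST_ROUND_PLACES = 10


def round_cost(cost):
    """Round a cost to the fixed number of decimal places."""
    return round(cost, COST_ROUND_PLACES)


# Naive accumulation: binary floating point drifts away from the decimal value.
naive_total = 0.0
for _ in range(10):
    naive_total += 0.1

# Rounded accumulation: drift is removed after every addition.
rounded_total = 0.0
for _ in range(10):
    rounded_total = round_cost(rounded_total + 0.1)

single = round_cost(0.1 + 0.2)  # 0.30000000000000004 before rounding
```

Ten decimal places is far finer than typical per-token prices, so the rounding removes accumulated noise without discarding meaningful precision.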
Persistence
Usage logs and budget state are stored locally in ~/.llm_toll.db (SQLite) by default. Each call creates:
- A row in usage_logs with project, model, token counts, cost, and timestamp
- An upsert to budgets incrementing the project's accumulated cost
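The two writes can be sketched with the standard sqlite3 module. The table names come from the list above, but the column names, schema, and log_usage helper are assumptions made for illustration, and an in-memory database stands in for ~/.llm_toll.db.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE usage_logs (
    id INTEGER PRIMARY KEY,
    project TEXT, model TEXT,
    input_tokens INTEGER, output_tokens INTEGER,
    cost REAL, created_at TEXT
);
CREATE TABLE budgets (project TEXT PRIMARY KEY, total_cost REAL NOT NULL);
"""


def log_usage(conn, project, model, input_tokens, output_tokens, cost):
    """Insert a usage row and bump the project's total in one transaction."""
    with conn:  # Commits on success, rolls back if either statement fails.
        conn.execute(
            "INSERT INTO usage_logs "
            "(project, model, input_tokens, output_tokens, cost, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (project, model, input_tokens, output_tokens, cost,
             datetime.now(timezone.utc).isoformat()),
        )
        # Upsert: create the budget row, or add this cost to the existing total.
        conn.execute(
            "INSERT INTO budgets (project, total_cost) VALUES (?, ?) "
            "ON CONFLICT(project) DO UPDATE "
            "SET total_cost = total_cost + excluded.total_cost",
            (project, cost),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_usage(conn, "demo", "gpt-4o", 100, 200, 0.25)
log_usage(conn, "demo", "gpt-4o", 50, 80, 0.5)

total = conn.execute(
    "SELECT total_cost FROM budgets WHERE project = ?", ("demo",)
).fetchone()[0]
rows = conn.execute("SELECT COUNT(*) FROM usage_logs").fetchone()[0]
```

Keeping both statements in one transaction means the log and the accumulated budget cannot disagree after a crash: either both writes land or neither does. The ON CONFLICT upsert syntax requires SQLite 3.24 or newer.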
See PostgreSQL Backend for team-wide tracking.