Local / Ollama

Overview

Local models (Ollama, llama.cpp, and other locally-hosted LLMs) are tracked at $0 cost. Rate limiting still applies, which is useful for managing local GPU resources.

Recognized Prefixes

Prefix       Cost
ollama/      $0.00
local/       $0.00
llama.cpp/   $0.00

Any model name beginning with one of these prefixes is resolved through the pricing registry's prefix matching and tracked at zero cost.
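Under the hood, this kind of prefix matching can be pictured as a simple lookup. The names LOCAL_PREFIXES and resolve_cost below are illustrative only, not llm_toll's actual internals:

```python
# Illustrative sketch of prefix-based $0 pricing; not llm_toll's real code.
LOCAL_PREFIXES = ("ollama/", "local/", "llama.cpp/")

def resolve_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return $0.00 for any model carrying a recognized local prefix."""
    if model.startswith(LOCAL_PREFIXES):
        return 0.0
    # A real registry would fall through to per-token pricing here.
    raise KeyError(f"{model} not found in pricing registry")
```

Token counts are still accepted (and logged) even though they contribute nothing to the cost.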

Usage with extract_usage

from llm_toll import track_costs

@track_costs(
    model="ollama/llama3",
    rate_limit=10,       # limit local GPU to 10 RPM
    extract_usage=lambda r: ("ollama/llama3", r["prompt_tokens"], r["completion_tokens"])
)
def local_inference(prompt):
    # Ollama call here
    pass
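One way to fill in the body above, sketched against Ollama's native /api/generate endpoint using only the standard library. The helpers to_usage_dict and call_ollama are illustrative, not part of llm_toll; prompt_eval_count and eval_count are Ollama's token-count fields, but verify them against your Ollama version:

```python
import json
import urllib.request

def to_usage_dict(ollama_response: dict) -> dict:
    """Map Ollama's token-count fields to the keys extract_usage reads."""
    return {
        "prompt_tokens": ollama_response.get("prompt_eval_count", 0),
        "completion_tokens": ollama_response.get("eval_count", 0),
        "text": ollama_response.get("response", ""),
    }

def call_ollama(prompt: str, model: str = "llama3") -> dict:
    """POST a non-streaming generate request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return to_usage_dict(json.load(resp))
```

Returning the dict from the decorated function lets the extract_usage lambda shown above pull prompt_tokens and completion_tokens directly from it.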

Usage with OpenAI-Compatible API

Ollama's API is OpenAI-compatible. If you point the openai client at localhost, token usage is parsed automatically:

from openai import OpenAI
from llm_toll import track_costs

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

@track_costs(model="ollama/llama3", rate_limit=10)
def local_chat(prompt):
    response = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": prompt}]
    )
    return response

Tip

When using the OpenAI client with Ollama, set model="ollama/llama3" in the decorator to ensure the ollama/ prefix pricing ($0) is applied. The model name from the response may not include the prefix.
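If you ever record model names from responses yourself, a small helper can normalize unprefixed names to the prefixed form the registry recognizes. with_local_prefix is a hypothetical convenience, not part of llm_toll:

```python
# Hypothetical helper: ensure a local model name carries a $0-cost prefix.
LOCAL_PREFIXES = ("ollama/", "local/", "llama.cpp/")

def with_local_prefix(model: str, prefix: str = "ollama/") -> str:
    """Prepend `prefix` unless the name already has a recognized local prefix."""
    if model.startswith(LOCAL_PREFIXES):
        return model
    return prefix + model
```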

Rate Limiting Local Models

Even at $0 cost, rate limiting is valuable for local models:

@track_costs(
    model="ollama/llama3",
    rate_limit=5,        # max 5 requests per minute
    tpm_limit=10000,     # max 10k tokens per minute
)
def gpu_bound_inference(prompt):
    ...

This prevents overwhelming local GPU resources with concurrent requests.

Monitoring

Usage is still logged to the SQLite store even at $0 cost. You can view call counts and token usage via the CLI:

llm-toll --stats --model ollama/llama3