LLM Cost Watchdog

Data & APIs

Monitors real-time LLM API costs, detects runaway loops, enforces budgets, audits code risk, and reports usage across multiple providers and models.

Install

openclaw skills install llm-cost-watchdog

Cost Watchdog 💰

Real-time cost tracking layer for LLM-based agents. Prices every call live, detects runaway loops in code, enforces budget ceilings mid-execution.

1. Identity

Observes LLM spend without disturbing the agent. Prevents $2,400-overnight-loop disasters by making cost a first-class concern: priced at write time, budgeted at check time, surfaced in reports.

2. Triggers

Activate when:

User mentions cost, budget, tokens, billing.
Code contains LLM API calls (Anthropic, OpenAI, OpenRouter, Google, Groq, ...).
Agent loops or recursive workflows.
Batch / streaming processing with unclear bounds.
/cost-watchdog [command] is invoked.

3. Commands

Run via python3 scripts/cost_watchdog.py <cmd> (or hook into your own CLI).

Command	What it does
`session`	Spend totals from `usage.jsonl` — calls, tokens, cost, top models.
`report`	24h / 7d / 30d windows with top model per window.
`tail [--once]`	Watch OpenClaw session JSONL and log every assistant turn.
`detect [--json]`	Identify which model the agent is currently using (5 probes).
`audit <file.py>`	AST-based code risk scan: unbounded loops, recursion, missing `max_tokens`.
`price <model>`	Live pricing for one model, with source + cache age.
`estimate <model>`	Project cost for `n` iterations of a given call.
`alternatives <model>`	Cheaper same-unit models.
`errors [--limit N]`	Recent swallowed exceptions (silent failures made visible).
`validate-tokens <model>`	Compare our heuristic against provider's authoritative count.
`reset [--all]`	Clear current-day log (`--all` also clears rolled files).

4. Pricing layer

Source chain

openrouter/*            → OpenRouter API (live)           → static fallback
anything else           → LiteLLM JSON (live, cached 24h) → OpenRouter (permissive) → static fallback

2600+ models indexed across chat, completion, embedding, image, audio, video, rerank, OCR, search modes.
30+ providers in the static fallback: Anthropic, OpenAI, Google, Groq, Mistral, Cohere, DeepSeek, Perplexity, xAI, Bedrock, Azure, and more.
Unit-aware: token, image, second, query, page, character, pixel. Alternatives never compare across units.
Circuit breaker opens after 3 consecutive network failures for a host; falls through to cache/static until the cool-down ends (60s).

Tuning

Env var	Default	Effect
`CW_PRICE_TTL_SECONDS`	86400 (24h)	Cache lifetime. `0` = hit network every call.
`CW_OFFLINE`	unset	If `1`, never touch the network.
`CW_STATIC_ONLY`	unset	If `1`, skip live sources entirely. Used by tests.
`CW_LOG_DIR`	`~/.cost-watchdog`	Where usage/errors/cache files live.
`CW_BUDGET_USD`	unset	Ceiling; wrappers raise `BudgetExceeded` when crossed.

Refresh static pricing

python3 scripts/refresh_pricing.py

Regenerates references/pricing.md from the live sources so the offline fallback is fresh. Aborts if fewer than 100 rows came back (protects against clobbering on a network outage).

5. Tracking layer — how we know what was spent

Four independent paths, all write to ~/.cost-watchdog/usage.jsonl:

Path	When to use	Covers streams?
`openclaw_tailer.py --watch`	Running OpenClaw. Zero code changes.	yes (reads completed turns)
`track_openai(client)`	You call OpenAI-compatible SDK (covers OpenRouter, Groq, DeepSeek, Mistral, Together, Fireworks, Cerebras, Anyscale, ...).	yes (tee'd iterator, auto-injects `stream_options={"include_usage": True}`)
`track_anthropic(client)`	Direct Anthropic SDK.	yes (wraps `messages.stream()`)
`track_gemini(model)` / `track_cohere(client)` / `track_bedrock(client)`	Direct provider SDKs.	no (add wrappers if you need streams)
`install_global_capture()` (httpx)	Any modern Python SDK using httpx.	no — streams are flagged into `errors.jsonl` so the gap is visible. Use the SDK wrappers for stream coverage.

Usage log rotates daily: usage.YYYY-MM-DD.jsonl. session_total(since=...) skips files outside the window before scanning.

Aggregation uses canonical_family() so claude-haiku-4-5-20251001, claude-haiku-4-5, and claude-haiku-4.5 are one row in reports.

6. Budget enforcement

Two mechanisms:

Write-time check (race-safe): append_usage(entry, budget_ceiling=X) takes an fcntl.flock on a sidecar, sums the current session, and refuses the write (raises BudgetExceeded) if the call would cross X.
Post-write check: wrappers compare cumulative spend to CW_BUDGET_USD after logging and raise if over. Used when the wrapper doesn't know the ceiling at call time.

Either path stops the agent mid-loop; the LLM call still returns to the caller, but the next one blocks.

7. Code audit (AST)

python3 scripts/cost_watchdog.py audit path/to/agent.py

Walks the AST and reports:

CRITICAL — while True with an LLM call and no max_iterations-style bound.
CRITICAL — function that recurses and calls an LLM API with no depth argument.
HIGH — plain while that calls an API with no retry/iteration counter.
MEDIUM — LLM call missing max_tokens / max_completion_tokens.
MEDIUM — function with ≥5 sequential LLM calls (batching candidate).

Every finding has a file line number. No more count('def ') > 3 and count('self.') > 5 → "recursion detected" false positives.

8. Detection — "what model is the agent using?"

python3 scripts/cost_watchdog.py detect

Five probe layers, ranked by confidence:

Probe	Confidence
OpenClaw session JSONL	high
Claude Code session JSONL	high
Most recent usage-log entry	high
Claude Code `settings.json`	medium
Env vars (`ANTHROPIC_MODEL`, `OPENAI_MODEL`, ...)	medium

Emits a table or --json.

9. Files

Path	Purpose
`scripts/_pricing.py`	Router: picks LiteLLM / OpenRouter / static per query.
`scripts/_sources.py`	Three `PricingSource` classes + disk cache + circuit breaker.
`scripts/tokenizer.py`	Provider-aware token counting (tiktoken for OpenAI; calibrated heuristics for others).
`scripts/model_canon.py`	`canonical_family()` — collapses model variants.
`scripts/code_audit.py`	AST cost-risk walker.
`scripts/usage_log.py`	JSONL writer + rotation + aggregation.
`scripts/tracker.py`	SDK wrappers + streaming + budget enforcement.
`scripts/http_capture.py`	`install_global_capture()` — httpx transport hook.
`scripts/openclaw_tailer.py`	Watches OpenClaw sessions.
`scripts/detect_model.py`	Multi-layer detector.
`scripts/errors.py`	`errors.jsonl` writer + reader.
`scripts/io_utils.py`	`write_json_atomic` / `read_json`.
`scripts/refresh_pricing.py`	Regenerates static `pricing.md` from live sources.
`scripts/cost_watchdog.py`	Unified CLI dispatcher.
`references/pricing.md`	Static fallback (regenerated; ~2600 models).
`tests/test_cost_watchdog.py`	73 tests: router, cache, AST, tokenizer, rotation, cassettes, circuit breaker, canonicalization.

10. Quality checklist

11. Known limits (be honest)

Tokenizer heuristics for Claude/Gemini/etc. are calibrated from docs, not measured. Run cost_watchdog validate-tokens <model> to check drift against the provider's authoritative count when you have an API key.
install_global_capture() can't see streaming responses — httpx exposes an empty body until the user reads the stream. Use track_openai / track_anthropic for stream coverage; http_capture logs skipped streams to errors.jsonl so the gap is visible.
Non-httpx SDKs (older Cohere, boto3 with custom transport) need the per-SDK wrappers — HTTP capture won't see them.
LiteLLM community data can lag 24-48h on brand-new models. OpenRouter's API is truly live for anything it routes.

12. Testing

python3 -m unittest tests.test_cost_watchdog     # 73 tests
python3 scripts/code_audit.py test_risky_code.py # sample risks
python3 scripts/cost_watchdog.py report          # current spend summary