LLM Cost Watchdog

Monitors real-time LLM API costs, detects runaway loops, enforces budgets, audits code risk, and reports usage across multiple providers and models.

Install

openclaw skills install llm-cost-watchdog

Cost Watchdog 💰

Real-time cost tracking layer for LLM-based agents. It prices every call live, detects runaway loops in code, and enforces budget ceilings mid-execution.

1. Identity

Observes LLM spend without disturbing the agent. Prevents $2,400-overnight-loop disasters by making cost a first-class concern: priced at write time, budgeted at check time, surfaced in reports.

2. Triggers

Activate when:

  • User mentions cost, budget, tokens, billing.
  • Code contains LLM API calls (Anthropic, OpenAI, OpenRouter, Google, Groq, ...).
  • Agent loops or recursive workflows.
  • Batch / streaming processing with unclear bounds.
  • /cost-watchdog [command] is invoked.

3. Commands

Run via python3 scripts/cost_watchdog.py <cmd> (or hook into your own CLI).

| Command | What it does |
|---|---|
| session | Spend totals from usage.jsonl: calls, tokens, cost, top models. |
| report | 24h / 7d / 30d windows with top model per window. |
| tail [--once] | Watch OpenClaw session JSONL and log every assistant turn. |
| detect [--json] | Identify which model the agent is currently using (5 probes). |
| audit <file.py> | AST-based code risk scan: unbounded loops, recursion, missing max_tokens. |
| price <model> | Live pricing for one model, with source + cache age. |
| estimate <model> | Project cost for n iterations of a given call. |
| alternatives <model> | Cheaper same-unit models. |
| errors [--limit N] | Recent swallowed exceptions (silent failures made visible). |
| validate-tokens <model> | Compare our heuristic against provider's authoritative count. |
| reset [--all] | Clear current-day log (--all also clears rolled files). |
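
The arithmetic behind a projection like the one estimate produces can be sketched as follows. This is an illustration only: the function name, the per-million-token price parameters, and the example prices are assumptions, not the skill's actual code or any real model's pricing.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float,
                  iterations: int = 1) -> float:
    """Projected USD cost of repeating one call `iterations` times."""
    per_call = (input_tokens * input_price_per_mtok
                + output_tokens * output_price_per_mtok) / 1_000_000
    return per_call * iterations


# Hypothetical prices: $1.00 per million input tokens, $5.00 per million output tokens.
projected = estimate_cost(2_000, 500, 1.00, 5.00, iterations=1_000)
print(f"${projected:.2f}")  # prints $4.50
```

Multiplying by the iteration count is what makes a cheap-looking single call visible as a real number before the loop runs.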

4. Pricing layer

Source chain

openrouter/*            → OpenRouter API (live)           → static fallback
anything else           → LiteLLM JSON (live, cached 24h) → OpenRouter (permissive) → static fallback

  • 2600+ models indexed across chat, completion, embedding, image, audio, video, rerank, OCR, search modes.
  • 30+ providers in the static fallback: Anthropic, OpenAI, Google, Groq, Mistral, Cohere, DeepSeek, Perplexity, xAI, Bedrock, Azure, and more.
  • Unit-aware: token, image, second, query, page, character, pixel. Alternatives never compare across units.
  • Circuit breaker opens after 3 consecutive network failures for a host; falls through to cache/static until the cool-down ends (60s).
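
The circuit-breaker behaviour described in the last bullet can be sketched roughly as below. The class and method names are hypothetical and the real logic in scripts/_sources.py may differ; only the threshold (3 consecutive failures) and the 60s cool-down come from the text above. The injectable clock is there purely to make the sketch testable.

```python
import time
from typing import Callable, Dict


class CircuitBreaker:
    """Per-host breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def allow(self, host: str) -> bool:
        """True if a network call to `host` may be attempted."""
        opened = self._opened_at.get(host)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cool-down elapsed: close the breaker and allow a fresh attempt.
            del self._opened_at[host]
            self._failures[host] = 0
            return True
        return False  # caller should fall through to cache / static pricing

    def record_success(self, host: str) -> None:
        self._failures[host] = 0

    def record_failure(self, host: str) -> None:
        self._failures[host] = self._failures.get(host, 0) + 1
        if self._failures[host] >= self.threshold:
            self._opened_at[host] = self.clock()
```

A caller would check allow(host) before each live lookup and skip straight to the cached or static price when it returns False, which is what avoids paying a network timeout on every call.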

Tuning

| Env var | Default | Effect |
|---|---|---|
| CW_PRICE_TTL_SECONDS | 86400 (24h) | Cache lifetime. 0 = hit network every call. |
| CW_OFFLINE | unset | If 1, never touch the network. |
| CW_STATIC_ONLY | unset | If 1, skip live sources entirely. Used by tests. |
| CW_LOG_DIR | ~/.cost-watchdog | Where usage/errors/cache files live. |
| CW_BUDGET_USD | unset | Ceiling; wrappers raise BudgetExceeded when crossed. |
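
One way these variables could be read into a single settings object is sketched below. The variable names and defaults are taken from the table; the WatchdogConfig dataclass and load_config function are hypothetical and are not claimed to exist in the skill.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class WatchdogConfig:
    price_ttl_seconds: int
    offline: bool
    static_only: bool
    log_dir: Path
    budget_usd: Optional[float]


def load_config(env: Mapping[str, str] = os.environ) -> WatchdogConfig:
    """Read the CW_* variables, applying the documented defaults."""
    budget = env.get("CW_BUDGET_USD")
    return WatchdogConfig(
        price_ttl_seconds=int(env.get("CW_PRICE_TTL_SECONDS", "86400")),
        offline=env.get("CW_OFFLINE") == "1",
        static_only=env.get("CW_STATIC_ONLY") == "1",
        log_dir=Path(env.get("CW_LOG_DIR", "~/.cost-watchdog")).expanduser(),
        budget_usd=float(budget) if budget else None,  # unset means no ceiling
    )
```

Passing the environment as a mapping keeps the parsing testable without touching the real process environment.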

Refresh static pricing

python3 scripts/refresh_pricing.py

Regenerates references/pricing.md from the live sources so the offline fallback is fresh. Aborts if fewer than 100 rows came back (protects against clobbering on a network outage).
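
The row-count guard can be illustrated with the sketch below. The 100-row minimum comes from the paragraph above; the function name, the return-value convention, and the temp-file-then-replace write are assumptions for illustration, not the actual contents of refresh_pricing.py.

```python
import os
import tempfile
from pathlib import Path
from typing import List

MIN_ROWS = 100  # threshold stated above


def write_pricing_table(rows: List[str], target: Path) -> bool:
    """Replace `target` only when enough rows came back; return True if written."""
    if len(rows) < MIN_ROWS:
        return False  # likely an outage or partial fetch: keep the old file
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(rows) + "\n")
    os.replace(tmp_name, target)
    return True
```

Because the existing file is only replaced after the new one is fully written, a failed or partial refresh leaves the previous fallback intact.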

5. Tracking layer — how we know what was spent

Five independent paths, all of which write to ~/.cost-watchdog/usage.jsonl:

| Path | When to use | Covers streams? |
|---|---|---|
| openclaw_tailer.py --watch | Running OpenClaw. Zero code changes. | yes (reads completed turns) |
| track_openai(client) | You call an OpenAI-compatible SDK (covers OpenRouter, Groq, DeepSeek, Mistral, Together, Fireworks, Cerebras, Anyscale, ...). | yes (tee'd iterator, auto-injects stream_options={"include_usage": True}) |
| track_anthropic(client) | Direct Anthropic SDK. | yes (wraps messages.stream()) |
| track_gemini(model) / track_cohere(client) / track_bedrock(client) | Direct provider SDKs. | no (add wrappers if you need streams) |
| install_global_capture() (httpx) | Any modern Python SDK using httpx. | no; skipped streams are flagged in errors.jsonl so the gap is visible. Use the SDK wrappers for stream coverage. |
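
The "tee'd iterator" idea used for streams can be sketched as a generator that passes chunks through untouched and records usage when the stream ends. This is a simplified illustration: chunks are plain dicts here, whereas real SDK chunks are objects, and tee_usage and its callback are hypothetical names rather than the wrapper's actual internals.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def tee_usage(stream: Iterable[Dict[str, Any]],
              on_usage: Callable[[Dict[str, int]], None]) -> Iterator[Dict[str, Any]]:
    """Yield every chunk unchanged; report usage once the stream is exhausted."""
    usage: Optional[Dict[str, int]] = None
    for chunk in stream:
        if chunk.get("usage"):
            usage = chunk["usage"]  # with include_usage, the final chunk carries it
        yield chunk
    if usage is not None:
        on_usage(usage)
```

The caller keeps iterating exactly as before, so tracking stays invisible to the agent. Note that usage is only reported if the consumer reads the stream to the end.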

Usage log rotates daily: usage.YYYY-MM-DD.jsonl. session_total(since=...) skips files outside the window before scanning.
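
A rough sketch of how date-stamped file names allow whole files to be skipped before any line is parsed. The usage.YYYY-MM-DD.jsonl pattern is from the text; the helper names are hypothetical, and the sketch only considers dated files.

```python
from datetime import date
from pathlib import Path
from typing import Iterable, List


def usage_file_for(day: date, log_dir: Path) -> Path:
    """Name of the rotated log file for a given day."""
    return log_dir / f"usage.{day.isoformat()}.jsonl"


def files_in_window(paths: Iterable[Path], since: date) -> List[Path]:
    """Keep only rotated files whose date stamp is on or after `since`."""
    kept = []
    for path in paths:
        parts = path.name.split(".")  # ["usage", "YYYY-MM-DD", "jsonl"]
        if len(parts) != 3 or parts[0] != "usage" or parts[2] != "jsonl":
            continue
        try:
            stamp = date.fromisoformat(parts[1])
        except ValueError:
            continue  # not a date-stamped usage file
        if stamp >= since:
            kept.append(path)
    return sorted(kept)
```

Filtering on the file name means a 24h report never has to open the logs from last month.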

Aggregation uses canonical_family() so claude-haiku-4-5-20251001, claude-haiku-4-5, and claude-haiku-4.5 are one row in reports.
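
A minimal sketch of the kind of normalization that would collapse the three spellings above into one key. It is not the real canonical_family() from scripts/model_canon.py; the two rules shown (drop a trailing 8-digit date, turn dotted versions into hyphens) are assumptions chosen to reproduce the example.

```python
import re

_DATE_SUFFIX = re.compile(r"-\d{8}$")             # e.g. -20251001
_DOTTED_VERSION = re.compile(r"(?<=\d)\.(?=\d)")  # 4.5 -> 4-5


def canonical_family_sketch(model: str) -> str:
    """Collapse date-stamped and dotted variants of a model name into one key."""
    name = model.strip().lower()
    name = _DATE_SUFFIX.sub("", name)
    return _DOTTED_VERSION.sub("-", name)


variants = ["claude-haiku-4-5-20251001", "claude-haiku-4-5", "claude-haiku-4.5"]
print({canonical_family_sketch(v) for v in variants})  # prints {'claude-haiku-4-5'}
```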

6. Budget enforcement

Two mechanisms:

  1. Write-time check (race-safe): append_usage(entry, budget_ceiling=X) takes an fcntl.flock on a sidecar, sums the current session, and refuses the write (raises BudgetExceeded) if the call would cross X.
  2. Post-write check: wrappers compare cumulative spend to CW_BUDGET_USD after logging and raise if over. Used when the wrapper doesn't know the ceiling at call time.

Either path stops the agent mid-loop: the LLM call that crosses the ceiling has already been made, so it still returns to the caller, but the next call is blocked.
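
The write-time check can be sketched as a check-and-append performed under an exclusive fcntl.flock. This is an illustration under stated assumptions: the function name, the sidecar naming, and the cost_usd field are hypothetical, the real append_usage signature may differ, and fcntl is only available on Unix-like systems.

```python
import fcntl
import json
from pathlib import Path


class BudgetExceeded(RuntimeError):
    """Raised when logging a call would push spend past the ceiling."""


def append_usage_sketch(log_path: Path, entry: dict, budget_ceiling: float) -> float:
    """Check-and-append under an exclusive lock; returns the new running total."""
    lock_path = log_path.with_name(log_path.name + ".lock")  # sidecar lock file
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            spent = 0.0
            if log_path.exists():
                with open(log_path, encoding="utf-8") as log:
                    spent = sum(json.loads(line)["cost_usd"] for line in log if line.strip())
            total = spent + entry["cost_usd"]
            if total > budget_ceiling:
                raise BudgetExceeded(f"${total:.4f} > ${budget_ceiling:.4f}")
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(json.dumps(entry) + "\n")
            return total
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Holding the lock across both the sum and the append is what closes the race: two processes cannot each read a total just under the ceiling and then both write.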

7. Code audit (AST)

python3 scripts/cost_watchdog.py audit path/to/agent.py

Walks the AST and reports:

  • CRITICAL: while True with an LLM call and no max_iterations-style bound.
  • CRITICAL: function that recurses and calls an LLM API with no depth argument.
  • HIGH: plain while that calls an API with no retry/iteration counter.
  • MEDIUM: LLM call missing max_tokens / max_completion_tokens.
  • MEDIUM: function with ≥5 sequential LLM calls (batching candidate).

Every finding has a file line number. No more count('def ') > 3 and count('self.') > 5 → "recursion detected" false positives.
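
Two of the checks listed above can be sketched with the standard ast module. This is a deliberately reduced illustration, not the walker in scripts/code_audit.py: the set of attribute names treated as "an LLM call" is an assumption, and the sketch does not look for iteration bounds inside the loop.

```python
import ast
from typing import List, Tuple

LLM_CALL_ATTRS = {"create", "generate_content"}  # assumed markers of an LLM call


def _is_llm_call(node: ast.AST) -> bool:
    return (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in LLM_CALL_ATTRS)


def audit_source(source: str) -> List[Tuple[str, int, str]]:
    """Return (severity, line number, message) findings, ordered by line."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.While):
            always_true = isinstance(node.test, ast.Constant) and node.test.value is True
            if always_true and any(_is_llm_call(n) for n in ast.walk(node)):
                findings.append(("CRITICAL", node.lineno, "while True around an LLM call"))
        if _is_llm_call(node):
            keywords = {kw.arg for kw in node.keywords}
            if not keywords & {"max_tokens", "max_completion_tokens"}:
                findings.append(("MEDIUM", node.lineno, "LLM call without max_tokens"))
    return sorted(findings, key=lambda finding: finding[1])
```

Working on the syntax tree rather than on text is what ties every finding to a real node and therefore to a real line number.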

8. Detection — "what model is the agent using?"

python3 scripts/cost_watchdog.py detect

Five probe layers, ranked by confidence:

| Probe | Confidence |
|---|---|
| OpenClaw session JSONL | high |
| Claude Code session JSONL | high |
| Most recent usage-log entry | high |
| Claude Code settings.json | medium |
| Env vars (ANTHROPIC_MODEL, OPENAI_MODEL, ...) | medium |

Emits a table or --json.
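
The ranked-probe approach can be sketched as an ordered list of small functions, where the first one that returns a model wins. Only the environment variable names come from the table above; the function names and the tuple layout are hypothetical, and the real detector in scripts/detect_model.py may combine probes differently.

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple

Probe = Callable[[], Optional[str]]


def env_probe(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Lowest-ranked probe: look for a model name in well-known env vars."""
    for var in ("ANTHROPIC_MODEL", "OPENAI_MODEL"):
        if env.get(var):
            return env[var]
    return None


def detect(probes: List[Tuple[str, str, Probe]]) -> Optional[Tuple[str, str, str]]:
    """Return (model, confidence, probe name) from the first probe that answers."""
    for name, confidence, probe in probes:
        model = probe()
        if model:
            return model, confidence, name
    return None
```

Ordering the list from highest to lowest confidence means a weak signal such as an environment variable is only used when nothing better is available.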

9. Files

| Path | Purpose |
|---|---|
| scripts/_pricing.py | Router: picks LiteLLM / OpenRouter / static per query. |
| scripts/_sources.py | Three PricingSource classes + disk cache + circuit breaker. |
| scripts/tokenizer.py | Provider-aware token counting (tiktoken for OpenAI; calibrated heuristics for others). |
| scripts/model_canon.py | canonical_family(): collapses model variants. |
| scripts/code_audit.py | AST cost-risk walker. |
| scripts/usage_log.py | JSONL writer + rotation + aggregation. |
| scripts/tracker.py | SDK wrappers + streaming + budget enforcement. |
| scripts/http_capture.py | install_global_capture(): httpx transport hook. |
| scripts/openclaw_tailer.py | Watches OpenClaw sessions. |
| scripts/detect_model.py | Multi-layer detector. |
| scripts/errors.py | errors.jsonl writer + reader. |
| scripts/io_utils.py | write_json_atomic / read_json. |
| scripts/refresh_pricing.py | Regenerates static pricing.md from live sources. |
| scripts/cost_watchdog.py | Unified CLI dispatcher. |
| references/pricing.md | Static fallback (regenerated; ~2600 models). |
| tests/test_cost_watchdog.py | 73 tests: router, cache, AST, tokenizer, rotation, cassettes, circuit breaker, canonicalization. |

10. Quality checklist

  • Live pricing from LiteLLM + OpenRouter, 24h-cached, with static fallback.
  • Exact-match model lookup (no substring conflation).
  • Multi-modal (token / image / second / query / page / character / pixel).
  • Unit-aware alternatives (never compares tokens to images).
  • AST-based code audit with line numbers.
  • Provider-aware tokenization (no more tiktoken-for-Claude).
  • Variance-based confidence (no += 0.05 theater).
  • Atomic writes to all shared state files.
  • fcntl.flock-guarded budget check-and-log (no race).
  • Circuit breaker on flaky networks (no 5s hang per call).
  • Streaming capture via SDK wrappers; streams flagged in errors.jsonl via HTTP capture.
  • Daily log rotation + date-scoped aggregation.
  • Canonical model families (variants collapse in reports).
  • errors.jsonl surfaces silent failures; cost_watchdog errors shows them.
  • Cassette tests for LiteLLM + OpenRouter parse paths (schema-drift safety net).
  • 73 logic tests passing.

11. Known limits (be honest)

  • Tokenizer heuristics for Claude/Gemini/etc. are calibrated from docs, not measured. Run cost_watchdog validate-tokens <model> to check drift against the provider's authoritative count when you have an API key.
  • install_global_capture() can't see streaming responses — httpx exposes an empty body until the user reads the stream. Use track_openai / track_anthropic for stream coverage; http_capture logs skipped streams to errors.jsonl so the gap is visible.
  • Non-httpx SDKs (older Cohere, boto3 with custom transport) need the per-SDK wrappers — HTTP capture won't see them.
  • LiteLLM community data can lag 24-48h on brand-new models. OpenRouter's API is truly live for anything it routes.

12. Testing

python3 -m unittest tests.test_cost_watchdog     # 73 tests
python3 scripts/code_audit.py test_risky_code.py # sample risks
python3 scripts/cost_watchdog.py report          # current spend summary