Cost Watchdog 💰
Real-time cost tracking layer for LLM-based agents. Prices every call live,
detects runaway loops in code, enforces budget ceilings mid-execution.
1. Identity
Observes LLM spend without disturbing the agent. Prevents $2,400-overnight-loop
disasters by making cost a first-class concern: priced at write time,
budgeted at check time, surfaced in reports.
2. Triggers
Activate when:
- User mentions cost, budget, tokens, billing.
- Code contains LLM API calls (Anthropic, OpenAI, OpenRouter, Google, Groq, ...).
- Agent loops or recursive workflows.
- Batch / streaming processing with unclear bounds.
- /cost-watchdog [command] is invoked.
3. Commands
Run via python3 scripts/cost_watchdog.py <cmd> (or hook into your own CLI).
| Command | What it does |
|---|---|
| session | Spend totals from usage.jsonl — calls, tokens, cost, top models. |
| report | 24h / 7d / 30d windows with top model per window. |
| tail [--once] | Watch the OpenClaw session JSONL and log every assistant turn. |
| detect [--json] | Identify which model the agent is currently using (5 probes). |
| audit <file.py> | AST-based code risk scan: unbounded loops, recursion, missing max_tokens. |
| price <model> | Live pricing for one model, with source + cache age. |
| estimate <model> | Project cost for n iterations of a given call. |
| alternatives <model> | Cheaper same-unit models. |
| errors [--limit N] | Recent swallowed exceptions (silent failures made visible). |
| validate-tokens <model> | Compare our heuristic against the provider's authoritative count. |
| reset [--all] | Clear the current-day log (--all also clears rolled files). |
4. Pricing layer
Source chain:
- openrouter/* → OpenRouter API (live) → static fallback
- anything else → LiteLLM JSON (live, cached 24h) → OpenRouter (permissive) → static fallback
- 2600+ models indexed across chat, completion, embedding, image, audio,
video, rerank, OCR, search modes.
- 30+ providers in the static fallback: Anthropic, OpenAI, Google,
Groq, Mistral, Cohere, DeepSeek, Perplexity, xAI, Bedrock, Azure, and more.
- Unit-aware: token, image, second, query, page, character, pixel.
  Alternatives never compare across units.
- Circuit breaker opens after 3 consecutive network failures for a host;
  falls through to cache/static until the cool-down ends (60s). A minimal
  sketch of this logic follows the list.
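A minimal sketch of that breaker, assuming one failure counter and one
open-timestamp per host. Names are illustrative, not the actual _sources.py
internals:

```python
import time

FAILURE_THRESHOLD = 3   # consecutive failures before the breaker opens
COOLDOWN_SECONDS = 60   # how long to skip the host before retrying

class HostBreaker:
    """Per-host circuit breaker: open after N failures, retry after a cool-down."""

    def __init__(self):
        self.failures = {}   # host -> consecutive failure count
        self.opened_at = {}  # host -> monotonic time the breaker opened

    def allow(self, host):
        opened = self.opened_at.get(host)
        if opened is None:
            return True
        if time.monotonic() - opened >= COOLDOWN_SECONDS:
            # cool-down over: close the breaker and let one request through
            del self.opened_at[host]
            self.failures[host] = 0
            return True
        return False  # caller falls through to cache/static pricing

    def record(self, host, ok):
        if ok:
            self.failures[host] = 0
        else:
            self.failures[host] = self.failures.get(host, 0) + 1
            if self.failures[host] >= FAILURE_THRESHOLD:
                self.opened_at[host] = time.monotonic()
```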
Tuning
| Env var | Default | Effect |
|---|---|---|
| CW_PRICE_TTL_SECONDS | 86400 (24h) | Cache lifetime. 0 = hit the network on every call. |
| CW_OFFLINE | unset | If 1, never touch the network. |
| CW_STATIC_ONLY | unset | If 1, skip live sources entirely. Used by tests. |
| CW_LOG_DIR | ~/.cost-watchdog | Where usage/errors/cache files live. |
| CW_BUDGET_USD | unset | Spend ceiling; wrappers raise BudgetExceeded when crossed. |
Refresh static pricing
python3 scripts/refresh_pricing.py
Regenerates references/pricing.md from the live sources so the offline
fallback stays fresh. Aborts if fewer than 100 rows come back, which protects
against clobbering the file during a network outage.
5. Tracking layer — how we know what was spent
Four independent paths; all write to ~/.cost-watchdog/usage.jsonl:
| Path | When to use | Covers streams? |
|---|---|---|
| openclaw_tailer.py --watch | Running OpenClaw. Zero code changes. | Yes (reads completed turns). |
| track_openai(client) (see the sketch below) | You call an OpenAI-compatible SDK (covers OpenRouter, Groq, DeepSeek, Mistral, Together, Fireworks, Cerebras, Anyscale, ...). | Yes (tee'd iterator; auto-injects stream_options={"include_usage": True}). |
| track_anthropic(client) | Direct Anthropic SDK. | Yes (wraps messages.stream()). |
| track_gemini(model) / track_cohere(client) / track_bedrock(client) | Direct provider SDKs. | No (add wrappers if you need streams). |
| install_global_capture() (httpx) | Any modern Python SDK using httpx. | No — streams are flagged into errors.jsonl so the gap is visible. Use the SDK wrappers for stream coverage. |
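A minimal usage sketch for the SDK-wrapper path, assuming tracker.py is
importable and that track_openai hands back the instrumented client (both
assumptions; check tracker.py for the exact contract):

```python
from openai import OpenAI

from tracker import track_openai  # assumed import path (scripts/tracker.py)

client = track_openai(OpenAI())   # every call is priced and appended to usage.jsonl
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this file in one line."}],
    max_tokens=64,                # bound the output; the audit flags calls without it
)
```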
The usage log rotates daily: usage.YYYY-MM-DD.jsonl. session_total(since=...)
skips files outside the window before scanning. Aggregation uses
canonical_family(), so claude-haiku-4-5-20251001, claude-haiku-4-5, and
claude-haiku-4.5 collapse to one row in reports, as illustrated below.
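Illustrative behavior, assuming model_canon.py is importable; the exact
normalization rules live there:

```python
from model_canon import canonical_family  # assumed import path

for variant in ("claude-haiku-4-5-20251001", "claude-haiku-4-5", "claude-haiku-4.5"):
    print(canonical_family(variant))  # all three print the same family key
```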
6. Budget enforcement
Two mechanisms:
- Write-time check (race-safe): append_usage(entry, budget_ceiling=X) takes an
  fcntl.flock on a sidecar file, sums the current session, and refuses the
  write (raising BudgetExceeded) if the call would cross X. This path is
  sketched below.
- Post-write check: wrappers compare cumulative spend to CW_BUDGET_USD after
  logging and raise if over. Used when the wrapper doesn't know the ceiling at
  call time.
Either path stops the agent mid-loop; the current LLM call still returns to
the caller, but the next one blocks.
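A hedged sketch of the write-time path, assuming entries carry a cost_usd
field and the lock sidecar sits next to the log; the real usage_log.py also
handles rotation:

```python
import fcntl
import json
import os

class BudgetExceeded(RuntimeError):
    pass

def session_total(log_path):
    """Sum cost_usd over the current session log."""
    if not os.path.exists(log_path):
        return 0.0
    with open(log_path) as f:
        return sum(json.loads(line).get("cost_usd", 0.0)
                   for line in f if line.strip())

def append_usage(entry, budget_ceiling=None,
                 log_path="usage.jsonl", lock_path="usage.jsonl.lock"):
    with open(lock_path, "a+") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # serialize concurrent writers
        try:
            spent = session_total(log_path)
            if budget_ceiling is not None and spent + entry["cost_usd"] > budget_ceiling:
                raise BudgetExceeded(
                    f"${spent:.4f} spent; this call would cross ${budget_ceiling:.2f}")
            with open(log_path, "a") as f:
                f.write(json.dumps(entry) + "\n")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```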
7. Code audit (AST)
python3 scripts/cost_watchdog.py audit path/to/agent.py
Walks the AST and reports:
- CRITICAL — while True with an LLM call and no max_iterations-style bound.
- CRITICAL — function that recurses and calls an LLM API with no depth argument.
- HIGH — plain while that calls an API with no retry/iteration counter.
- MEDIUM — LLM call missing max_tokens / max_completion_tokens.
- MEDIUM — function with ≥5 sequential LLM calls (batching candidate).
Every finding carries a file line number. No more count('def ') > 3 and
count('self.') > 5 → "recursion detected" false positives. The core check is
sketched below.
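An illustrative reduction of the first CRITICAL check, not the actual
code_audit.py implementation; the set of method names treated as LLM calls is
an assumption:

```python
import ast

LLM_CALL_ATTRS = {"create", "generate_content", "stream"}  # assumed heuristic

def find_unbounded_llm_loops(source, filename="<agent>"):
    """Flag `while True:` loops that contain an LLM-looking method call."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if not isinstance(node, ast.While):
            continue
        infinite = isinstance(node.test, ast.Constant) and node.test.value is True
        calls_llm = any(
            isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)
            and n.func.attr in LLM_CALL_ATTRS
            for n in ast.walk(node)
        )
        if infinite and calls_llm:
            findings.append(("CRITICAL", node.lineno, "while True wraps an LLM call"))
    return findings

risky = "while True:\n    client.chat.completions.create(model='m', messages=[])\n"
print(find_unbounded_llm_loops(risky))  # [('CRITICAL', 1, 'while True wraps an LLM call')]
```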
8. Detection — "what model is the agent using?"
python3 scripts/cost_watchdog.py detect
Five probe layers, ranked by confidence:
| Probe | Confidence |
|---|---|
| OpenClaw session JSONL | high |
| Claude Code session JSONL | high |
| Most recent usage-log entry | high |
| Claude Code settings.json | medium |
| Env vars (ANTHROPIC_MODEL, OPENAI_MODEL, ...) | medium |
Emits a table, or JSON with --json. The env-var probe, the lowest-ranked
layer, is sketched below.
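A sketch of that layer; the shape of the result dict is an assumption, not
detect_model.py's actual return type:

```python
import os

def probe_env_vars():
    """Medium-confidence probe: model named explicitly in the environment."""
    for var in ("ANTHROPIC_MODEL", "OPENAI_MODEL"):
        model = os.environ.get(var)
        if model:
            return {"model": model, "source": var, "confidence": "medium"}
    return None  # no env hint; this layer contributes nothing
```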
9. Files
| Path | Purpose |
|---|---|
| scripts/_pricing.py | Router: picks LiteLLM / OpenRouter / static per query. |
| scripts/_sources.py | Three PricingSource classes + disk cache + circuit breaker. |
| scripts/tokenizer.py | Provider-aware token counting (tiktoken for OpenAI; calibrated heuristics for others). |
| scripts/model_canon.py | canonical_family() — collapses model variants. |
| scripts/code_audit.py | AST cost-risk walker. |
| scripts/usage_log.py | JSONL writer + rotation + aggregation. |
| scripts/tracker.py | SDK wrappers + streaming + budget enforcement. |
| scripts/http_capture.py | install_global_capture() — httpx transport hook. |
| scripts/openclaw_tailer.py | Watches OpenClaw sessions. |
| scripts/detect_model.py | Multi-layer detector. |
| scripts/errors.py | errors.jsonl writer + reader. |
| scripts/io_utils.py | write_json_atomic / read_json. |
| scripts/refresh_pricing.py | Regenerates static pricing.md from live sources. |
| scripts/cost_watchdog.py | Unified CLI dispatcher. |
| references/pricing.md | Static fallback (regenerated; ~2600 models). |
| tests/test_cost_watchdog.py | 73 tests: router, cache, AST, tokenizer, rotation, cassettes, circuit breaker, canonicalization. |
10. Quality checklist
11. Known limits (be honest)
- Tokenizer heuristics for Claude/Gemini/etc. are calibrated from docs, not
  measured. Run cost_watchdog validate-tokens <model> to check drift against
  the provider's authoritative count when you have an API key.
- install_global_capture() can't see streaming responses — httpx exposes an
  empty body until the user reads the stream. Use track_openai /
  track_anthropic for stream coverage; http_capture logs skipped streams to
  errors.jsonl so the gap is visible (see the sketch after this list).
- Non-httpx SDKs (older Cohere, boto3 with custom transport) need the
  per-SDK wrappers — HTTP capture won't see them.
- LiteLLM community data can lag 24-48h on brand-new models. OpenRouter's
  API is truly live for anything it routes.
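A sketch of the transport-hook idea behind install_global_capture(),
including the stream flag; illustrative only, not the actual http_capture.py
code:

```python
import httpx

class CaptureTransport(httpx.BaseTransport):
    """Wraps the default transport; observes responses, flags streams it can't read."""

    def __init__(self, inner=None):
        self.inner = inner or httpx.HTTPTransport()

    def handle_request(self, request):
        response = self.inner.handle_request(request)
        ctype = response.headers.get("content-type", "")
        if "event-stream" in ctype:
            # streaming body isn't materialized here; surface the gap instead
            print("skipped stream:", request.url)  # real code appends to errors.jsonl
        return response

client = httpx.Client(transport=CaptureTransport())
```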
12. Testing
python3 -m unittest tests.test_cost_watchdog # 73 tests
python3 scripts/code_audit.py test_risky_code.py # sample risks
python3 scripts/cost_watchdog.py report # current spend summary