Universal token audit & optimization framework for OpenClaw agents.
Based on real-world practice (2026-05-04).
Core Principles
Tier your model usage: Simple tasks use cheap models; complex reasoning
uses expensive ones. Don't mix the two.
Prompts say what, not why: Background rationale and philosophy are
noise to an agent. Strip them.
Batch > Serial: One call for 10 results costs less than three calls for
3+3+4 results, because every call pays the fixed context overhead again. Combine.
Context = Cost: Every file loaded at session start, every tool schema
registered, and every past message injected has a token price.
Idle = Zero burn: Nighttime, weekends, and idle periods should run
nothing. Configure active hours.
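The "Batch > Serial" principle can be illustrated with a small cost model. This is a minimal sketch, not a measurement: the per-call overhead and per-result token counts are hypothetical numbers chosen only to show the shape of the arithmetic.

```python
def total_tokens(calls: int, results: int,
                 overhead_per_call: int, tokens_per_result: int) -> int:
    """Fixed context overhead is paid once per call; result tokens scale with output."""
    return calls * overhead_per_call + results * tokens_per_result


# Hypothetical figures: 2,000 tokens of context per call, 150 tokens per result.
batched = total_tokens(calls=1, results=10, overhead_per_call=2000, tokens_per_result=150)
serial = total_tokens(calls=3, results=10, overhead_per_call=2000, tokens_per_result=150)

print(batched, serial)  # 3500 7500
```

Under these assumptions the result tokens are identical in both cases; the difference comes entirely from paying the context overhead three times instead of once.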
Output
After each full execution, write a report (token-audit-report-YYYY-MM-DD.md)
containing: before/after comparison table, estimated weekly savings per change,
items deferred and why, recommended next step.
Quick Start
Not every audit needs the full Phase 1-5 treatment. Use these shortcuts
based on your goal:
agents.list[].tools.profile — full, coding, or custom
agents.list[].model — per-agent model override
1C Measure Context Load
List every file that is injected at session start (typically files in the
workspace root directory). Measure each in chars and estimate token cost
(~3 chars per token for CJK-heavy text, ~4 for English-heavy).
If LCM (Lossless Context Management) is active, note the number and average
size of compacted summary blocks injected per turn.
If tool schemas are accessible, estimate total schema chars:
(count of registered tools × average schema size in chars).
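The measurement in 1C could be scripted roughly as follows. This is a sketch under stated assumptions: it treats every `*.md` file in the workspace root as a bootstrap file, and it calls a text "CJK-heavy" when at least 30% of its characters fall in common CJK ranges. Both the glob pattern and the 30% threshold are illustrative choices, not part of the framework.

```python
from pathlib import Path

# Common CJK blocks: unified ideographs, kana, hangul syllables.
CJK_RANGES = ((0x4E00, 0x9FFF), (0x3040, 0x30FF), (0xAC00, 0xD7AF))


def is_cjk(ch: str) -> bool:
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)


def estimate_tokens(text: str, cjk_threshold: float = 0.3) -> int:
    """Apply ~3 chars/token to CJK-heavy text and ~4 chars/token otherwise."""
    if not text:
        return 0
    cjk_share = sum(is_cjk(ch) for ch in text) / len(text)
    chars_per_token = 3 if cjk_share >= cjk_threshold else 4
    return -(-len(text) // chars_per_token)  # ceiling division


def measure_boot_files(root: str) -> list[tuple[str, int, int]]:
    """Return (file name, chars, estimated tokens) for each assumed bootstrap file."""
    rows = []
    for path in sorted(Path(root).glob("*.md")):  # assumption: boot files are *.md
        text = path.read_text(encoding="utf-8")
        rows.append((path.name, len(text), estimate_tokens(text)))
    return rows


def estimate_schema_tokens(tool_count: int, avg_schema_chars: int) -> int:
    """Count of registered tools x average schema size, at ~4 chars/token."""
    return -(-(tool_count * avg_schema_chars) // 4)
```

For example, `estimate_schema_tokens(40, 1000)` gives 10,000 tokens for 40 tools averaging 1,000 schema chars each. The estimates are rough by design; use a real tokenizer if exact counts matter.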
1D Map Models to Tiers
Categorize all available models into three tiers based on capability and cost:
🏆 Premium (strong reasoning, high cost): e.g. deepseek-v4-pro, gpt-5.x
🟡 Standard (balanced): e.g. deepseek-v4-flash, minimax-m2.7
🟢 Economy (lightweight): e.g. minimax-m2.7-highspeed, ollama local
Map each task from 1A to its current model tier.
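A tier map can be kept as plain data so that the mapping is easy to review. The sketch below uses the model names listed above; the task names and their assigned models are hypothetical placeholders for whatever the 1A inventory contains.

```python
# Tier membership, taken from the examples above.
TIERS = {
    "premium": {"deepseek-v4-pro"},
    "standard": {"deepseek-v4-flash", "minimax-m2.7"},
    "economy": {"minimax-m2.7-highspeed"},
}


def tier_of(model: str) -> str:
    """Return the tier a model belongs to, or 'unmapped' if it is not listed."""
    for tier, models in TIERS.items():
        if model in models:
            return tier
    return "unmapped"


def tier_map(tasks: dict[str, str]) -> dict[str, str]:
    """Map each task name to the tier of the model it currently uses."""
    return {task: tier_of(model) for task, model in tasks.items()}


# Hypothetical task inventory (task name -> current model).
tasks = {
    "nightly-log-cleanup": "deepseek-v4-pro",
    "weekly-architecture-review": "deepseek-v4-pro",
    "heartbeat-check": "minimax-m2.7-highspeed",
}
print(tier_map(tasks))
```

A simple task such as the hypothetical `nightly-log-cleanup` showing up under `premium` is the kind of mismatch Phase 2 is meant to surface.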
⚠️ Checkpoint: Before moving to Phase 2, present your Phase 1 findings
(task inventory, file sizes, model tier map) to the user.
Confirm that the inventory is complete and the measurements are correct.
This prevents optimizing the wrong things.
Phase 2: PRIORITIZE — Build Your Decision Matrix
Score each finding from Phase 1 along three independent dimensions:
| Dimension | Scale | Assessment |
|---|---|---|
| Token Impact 🎯 | High / Med / Low | Tokens per occurrence × occurrences per period |
| Risk ⚠️ | Safe / Moderate / High | Can you undo it? Does it affect core function? |
| Effort 🔧 | Easy / Med / Hard | Single config change? Multi-file edit? Needs research? |
How to Score
Compute a relative priority for each finding by inverting Risk and Effort:
Where each dimension maps to a simple numeric weight:
Impact: High=3, Med=2, Low=1
Risk: Safe=1, Moderate=2, High=3
Effort: Easy=1, Med=2, Hard=3
Focus on items scoring ≥ 1.5 first. Skip items < 1.0 unless they are
trivially easy (effort=1) and safe (risk=1).
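A minimal sketch of the scoring step, assuming the score is Impact divided by the product of Risk and Effort. That formula is an assumption (one reading of "inverting Risk and Effort"); substitute your own if it differs. The weights and thresholds are the ones given above.

```python
IMPACT = {"High": 3, "Med": 2, "Low": 1}
RISK = {"Safe": 1, "Moderate": 2, "High": 3}
EFFORT = {"Easy": 1, "Med": 2, "Hard": 3}


def priority(impact: str, risk: str, effort: str) -> float:
    """ASSUMED formula: Impact / (Risk x Effort)."""
    return IMPACT[impact] / (RISK[risk] * EFFORT[effort])


def triage(score: float, risk: str, effort: str) -> str:
    """Apply the thresholds: focus at >= 1.5, skip below 1.0 unless safe and easy."""
    if score >= 1.5:
        return "focus"
    if score < 1.0 and not (risk == "Safe" and effort == "Easy"):
        return "skip"
    return "consider"


# Hypothetical findings: (label, impact, risk, effort).
findings = [
    ("verbose cron prompt", "High", "Safe", "Easy"),
    ("full tool profile on sub-agent", "High", "Moderate", "Easy"),
    ("custom tool profile", "Low", "High", "Hard"),
]
for label, impact, risk, effort in findings:
    score = priority(impact, risk, effort)
    print(f"{label}: {score:.2f} -> {triage(score, risk, effort)}")
```

Keeping the formula in one function makes it easy to swap in a different weighting without touching the triage thresholds.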
Common High-Impact Patterns
These patterns tend to score high across most deployments:
| Pattern | Typical Impact | Typical Risk | Typical Effort |
|---|---|---|---|
| Overly verbose task prompts | High | Safe | Easy |
| Heavy models on simple tasks | High | Safe | Easy |
| No active hours on heartbeat | Med-High | Safe | Easy |
| Duplicated content across bootstrap files | Med-High | Safe | Easy-Med |
| Full tool profile on task-specific agents | High | Moderate | Easy |
| Idle-time session not configured | Med | Safe | Easy |
| Outdated tool/plugin configs still loaded | Low-Med | Safe | Easy |
⚠️ Checkpoint: Show your top-3 priority items to the user.
Confirm direction before starting optimization.
If the highest-score items seem wrong, revisit Phase 1 measurements.
Phase 3: OPTIMIZE — Apply Categorical Techniques
⚠️ User confirmation gate: Techniques marked Moderate or High risk
involve config changes, profile switches, or task merging. Before applying them,
present the proposed change using this template and get explicit approval:
Each category below contains a set of techniques. Apply them in priority
order from Phase 2 — start with the highest-score items first, regardless
of which category they fall into.
Failure Recovery
If a technique causes a problem:
Config change: Restore the backed-up config file and reload.
Cron merge broken: Restore the old separate cron job from version control
or re-create it from the original prompt.
Profile switch issue: Revert to "full" profile, report the missing tool.
Prompt compression over-aggressive: Restore from the diff backup (keep
pre-optimization prompt versions in a prompts/backup/ directory).
Category Selection Guide
Match your Phase 2 findings to the best starting category:
| Finding | Starting category |
|---|---|
| No active hours, co-located tasks running separately | F Session Lifecycle |
| Repeated system prompts without caching structure | G Provider-Side Caching |
| Agent retries failed approaches instead of switching | H Behavioral Discipline |
| Simple/complex tasks both use premium model | J Intelligent Model Routing |
Category Decision Tree
If you're not sure which category to start with, follow this tree from top
to bottom — the first match tells you your likely best starting category:
1. Is the main session slow or expensive?
   → Check B (tiering) and J (routing)
   → Also check D (too many tools loaded?)
2. Are cron jobs consuming more than expected?
   → Check A (prompts too wordy?), then B (wrong model?)
   → Then check F (same-tier jobs not batched?)
3. Is context getting cut off mid-task?
   → Check C (bootstrap too large?) → I (progressive disclosure?)
   → Then J3 (incremental delivery?)
4. Are agent outputs too verbose?
   → Check E (output discipline) → H (behavioral discipline)
5. Is the same heavy prompt repeated across tasks?
   → Check G (provider-side caching: fixed prefix first?)
6. Are you seeing the same errors repeatedly?
   → Check H2 (fail once, switch) → H4 (fix root cause)
7. Default (no obvious symptom):
   → Run Phase 1 from scratch → Phase 2 will tell you where to go
Pro tip: Start with G (Provider-Side Caching) if you use DeepSeek.
Cache pricing is 0.83% of uncached — fixing prefix structure alone
can cut token costs by 90%+.
| Technique | Description | Risk |
|---|---|---|
| | Replace multi-sentence descriptions with keyword checklists. | Safe |
| A3 Constrain output | Add "Answer concisely in ≤3 lines" or equivalent to reduce generated tokens. | Safe |
| A4 Remove redundancy | Delete "What NOT to do" sections; proper instructions make negatives implicit. | Safe |
| A5 Reference > inline | Replace full instructions for sub-tasks with file references ("See X.md") when the referenced file is always loaded. | Safe |
B. Model Tiering
| Technique | Description | Risk |
|---|---|---|
| B1 Right-size each task | Map every automated task to the cheapest model that can do it adequately. Test borderline cases. | Safe |
| B2 Define tier boundaries | Document which model(s) belong to each tier so new tasks are assigned correctly. | Safe |
| B3 Batch same-tier runs | Schedule same-tier tasks back-to-back to reuse the same session (single context load). | Moderate |
C. Context Slimming
| Technique | Description | Risk |
|---|---|---|
| C1 Measure every boot file | List all files loaded at session start and identify those > 2K chars for potential trimming. | Safe |
| C2 Cross-reference dedup | When the same content appears in 2+ files (e.g. "Core Principles" in SOUL.md and IDENTITY.md), keep it in one authoritative file and replace the others with a `see <file>` reference. | Safe |
| C3 Archive aged-out content | Move old diary entries, superseded milestones, and historical promoted entries to a dedicated archive directory. | Safe |
| C4 Trim to one-liner | Convert verbose descriptions to single-line summaries.<br>Before: "This project's coding conventions were established after three code reviews revealed inconsistent patterns: use 2-space indent for HTML/CSS, 4-space for Python, tabs for Go. Prefix private methods with underscore. No Hungarian notation. Import order: stdlib, third-party, local."<br>After: "Coding conventions (see CONTRIBUTING.md): 6 rules, numbered." |
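The cross-reference dedup in C2 can be partly automated by looking for identical paragraphs across bootstrap files. The sketch below is one possible approach, assuming paragraphs are separated by blank lines and that exact matches (after whitespace normalization) are the target; near-duplicates would need fuzzier matching. The file names in the example are taken from C2, and the 80-char minimum is an arbitrary cutoff.

```python
import hashlib
from collections import defaultdict


def duplicated_paragraphs(files: dict[str, str], min_chars: int = 80) -> dict[str, list[str]]:
    """Map each paragraph that appears in 2+ files to the sorted list of those files.

    `files` maps file name -> file content. Paragraphs shorter than
    `min_chars` are ignored to avoid flagging headings and short lines.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    texts: dict[str, str] = {}
    for name, content in files.items():
        for para in content.split("\n\n"):
            normalized = " ".join(para.split())  # collapse whitespace
            if len(normalized) < min_chars:
                continue
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            seen[digest].add(name)
            texts[digest] = normalized
    return {texts[d]: sorted(names) for d, names in seen.items() if len(names) >= 2}


shared = "Core principle " * 10
report = duplicated_paragraphs({
    "SOUL.md": "Intro\n\n" + shared,
    "IDENTITY.md": shared + "\n\nOther notes",
})
for paragraph, names in report.items():
    print(f"{len(paragraph)} chars duplicated in: {', '.join(names)}")
```

Each reported paragraph is a candidate for keeping in one authoritative file and replacing with a reference elsewhere.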
| Technique | Description | Risk |
|---|---|---|
| | Count all registered tools and estimate total schema chars. This is typically the single largest per-turn overhead. | Safe (measure only) |
| D2 Switch profile per agent | Use "coding" profile for sub-agents/cron jobs (excludes browser, canvas, media generation, feishu tools). Use "full" only where those tools are actually needed. | Moderate (test on sub-agents first) |
| D3 Disable unused tools | If you have disabled skills or orphaned plugin tools still registering schemas, disable or remove them from the registry. Check skills.entries and plugins.load.paths. | Safe |
| D4 Create custom profile | If neither "full" nor "coding" fits, define a custom profile with exactly the 15-25 tools your use-case needs. Requires config reload. | High |
E. Output Discipline
| Technique | Description | Risk |
|---|---|---|
| E1 No operation narration | Remove "I'll...", "Let me check..." patterns. Do the action directly. | Safe (behavioral) |
| E2 Lead with conclusion | Put the answer first. Add explanation only when needed. | Safe (behavioral) |
| E3 Batch turns | Read → plan → apply all changes in as few turns as possible, instead of read→think→edit→think→verify per-item. Each extra turn adds LCM context overhead. | Safe (behavioral) |
| E4 Sub-agent conciseness | When spawning sub-agents, specify a concise return format. Their full output is injected into context if returned. | Safe |
F. Session Lifecycle
| Technique | Description | Risk |
|---|---|---|
| F1 Set active hours | Configure heartbeat.activeHours so no work runs during idle time (overnight, weekends). | Safe |
| F2 Isolated sessions | Set heartbeat.isolatedSession: true so periodic checks don't accumulate in the main session. | Safe |
| F3 Light context | Set heartbeat.lightContext: true to skip loading all bootstrap files; only HEARTBEAT.md is injected. | Safe |
| F4 Merge co-located tasks | If two cron jobs run within minutes of each other (e.g. both at 23:xx), merge them into one session with a combined prompt. Copy both prompts into one job's message field separated by a blank line, then remove the later job. Saves one full startup context per day. | Moderate |
| F5 Merge example | Before: Job A at 23:00 (System health check), Job B at 23:10 (Log cleanup). After: Single job at 23:00 with prompt "Do A then B.~A: ...~B: ..." | Moderate |
| F6 Configure queue | If the platform supports message queue settings (debounce, collect), tune them to prevent rapid-turn accumulation during tool execution. | Safe |
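The merge described in F4 can be sketched as a small helper that finds jobs scheduled close together and joins their prompts with a blank line. The job structure here is hypothetical: only the `message` field is named above, while `name`, `time` (zero-padded `HH:MM`), and the 15-minute window are assumptions made for illustration. The sketch does not handle schedules that wrap around midnight.

```python
from datetime import datetime


def minutes_apart(a: str, b: str) -> int:
    """Absolute difference in minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(b, fmt) - datetime.strptime(a, fmt)
    return int(abs(delta.total_seconds()) // 60)


def merge_colocated(jobs: list[dict], window_minutes: int = 15) -> list[dict]:
    """Merge jobs that start within `window_minutes` of the first job in a group.

    The earlier job is kept; the later job's prompt is appended to its
    `message`, separated by a blank line, and the later job is dropped.
    """
    merged: list[dict] = []
    for job in sorted(jobs, key=lambda j: j["time"]):
        if merged and minutes_apart(merged[-1]["time"], job["time"]) <= window_minutes:
            merged[-1]["message"] += "\n\n" + job["message"]
            merged[-1]["name"] += "+" + job["name"]
        else:
            merged.append(dict(job))  # copy so the input list is left untouched
    return merged


jobs = [
    {"name": "health", "time": "23:00", "message": "System health check"},
    {"name": "cleanup", "time": "23:10", "message": "Log cleanup"},
    {"name": "report", "time": "08:00", "message": "Morning report"},
]
result = merge_colocated(jobs)
for job in result:
    print(job["time"], job["name"])
```

Treat the output as a proposal to review, not something to apply automatically: F4 is a Moderate-risk change and falls under the user confirmation gate.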
G. Provider-Side Caching
This category's impact can be roughly 10× that of any other. The DeepSeek V4 Pro
cached price is 0.83% of the uncached price, and cache hit rates of 91-96% are
achievable with proper prompt structure.
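The figures above translate into savings through simple arithmetic. The sketch below blends cached and uncached pricing by hit rate; it assumes the ratio applies to input tokens only and that every cache hit is billed at the cached price, which is a simplification of real provider billing.

```python
def effective_cost_ratio(hit_rate: float, cached_price_ratio: float) -> float:
    """Blended input cost as a fraction of fully uncached cost.

    Misses pay full price (1.0); hits pay `cached_price_ratio`.
    """
    return (1 - hit_rate) + hit_rate * cached_price_ratio


for hit_rate in (0.91, 0.96):
    ratio = effective_cost_ratio(hit_rate, 0.0083)
    print(f"hit rate {hit_rate:.0%}: input cost {ratio:.1%} of uncached, saving {1 - ratio:.1%}")
# hit rate 91%: input cost 9.8% of uncached, saving 90.2%
# hit rate 96%: input cost 4.8% of uncached, saving 95.2%
```

At these hit rates almost all of the remaining cost comes from cache misses, which is why prefix stability matters more than the cached price itself.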
| Technique | Description | Risk |
|---|---|---|
| G1 Fixed prefix first | Design all prompts as [static prefix] + [dynamic suffix]. The static prefix includes system instructions, bootstrap summary, and tool schemas. The dynamic suffix holds the runtime instruction. This maximizes KV cache hits on the provider side.<br>Wrong: "Analyze this code for memory leaks... You are a code review assistant. The review rules are as follows: ..."<br>Right: "You are a code review assistant. The review rules are as follows: ... Now analyze this code for memory leaks: ..." | Safe |
| G2 Session contiguity | Don't insert unrelated messages between consecutive calls to the same model, because this breaks the KV cache prefix. Batch related calls into a single turn instead. | Safe |
| G3 Monitor cache rate | Check provider dashboards for cache hit rate. If <80%, your prefix structure likely has variability. Fix it. | Safe |
| G4 Route to best caching provider | Different providers have wildly different cached prices. DeepSeek V4 Pro: 0.83% of uncached. MiniMax: ~20%. Route routine tasks to the provider with the best cache economics. | Moderate |
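The "fixed prefix first" structure from G1 can be enforced with a small prompt builder. This is a sketch only: `STATIC_PREFIX`, `build_prompt`, and `shared_prefix_len` are hypothetical names, and the shared-prefix length is used here as a rough stand-in for how much of a prompt a provider-side prefix cache could reuse.

```python
# Hypothetical static prefix: identical text at the start of every call.
STATIC_PREFIX = "You are a code review assistant. The review rules are as follows: ..."


def build_prompt(dynamic_instruction: str) -> str:
    """Static prefix first, dynamic instruction last."""
    return STATIC_PREFIX + "\n\n" + dynamic_instruction


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


good_1 = build_prompt("Now analyze this code for memory leaks: ...")
good_2 = build_prompt("Now summarize this diff: ...")

# Wrong order: the dynamic part comes first, so the prompts diverge immediately.
bad_1 = "Analyze this code for memory leaks: ... " + STATIC_PREFIX
bad_2 = "Summarize this diff: ... " + STATIC_PREFIX

print(shared_prefix_len(good_1, good_2) >= len(STATIC_PREFIX))  # True
print(shared_prefix_len(bad_1, bad_2))  # 0
```

Routing every prompt through one builder keeps the prefix byte-identical across calls, which is the variability G3 asks you to look for when the hit rate drops.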
H. Behavioral Discipline
These are zero-config, zero-cost techniques. The savings come from how you use
the system, not how it's configured.
| Technique | Description | Risk |
|---|---|---|
| H1 Default to working path | Use known-working tools before alternatives. Don't retry tools known to be broken in the current deployment; each retry is a wasted tool call + error response.<br>Bad: web_search (broken) → error → web_search again → error → baidu-search → works<br>Good: baidu-search → works (first attempt) | Safe |
| H2 Fail once, switch | If a method fails, switch immediately to a known alternative. Don't retry the same approach with slightly different parameters. Each retry costs full tool-call tokens. | Safe |
| H3 Batch > Poll | Gather all data before acting instead of incrementally. One exec or read call that returns 10 results costs less than 5 separate calls returning 2 each. | Safe |
| H4 Fix root cause | If a tool works inconsistently due to a known config issue (API key expired, wrong provider), fix the config. Working around it each time costs more in accumulated failed calls. |