## Install

```bash
openclaw skills install token-optimizer-skills
```

Full token economy suite for OpenClaw agents. Audits context window usage (skills, history, tool outputs), then applies 5 creative strategies to reduce bloat without losing memory quality. Use when asked to "analyze tokens", "reduce context", "find bloat", "optimize memory", or "distill history".

This skill is the complete token management layer for your OpenClaw agent. It covers two phases: first audit where your tokens go, then optimize with the strategies below.
Trigger a token audit when:

- The user asks to "analyze tokens", "reduce context", or "find bloat".
- `count_tokens.py --estimate` returns a value above 80% of the model's limit.

Run this sequence:
1. **Calculate system prompt & skills cost.** Identify which skills are loaded and extract their metadata and SKILL.md sizes.
2. **Calculate conversation history cost.** Examine how long the conversation has run and estimate recent tool output lengths (grep results, file reads, API responses).
3. **Count precisely** using the helper script:
```bash
# Zero-dependency estimate (works everywhere)
python count_tokens.py --input my_prompt.txt --estimate

# Precise count for GPT-4o
python count_tokens.py --input my_prompt.txt --model gpt-4o

# Precise count for Gemma/Ollama
python count_tokens.py --input my_prompt.txt --model gemma

# Precise count for Claude (proxy via cl100k_base, ±5% accuracy)
python count_tokens.py --input my_prompt.txt --model claude

# Compare original vs. compressed file
python count_tokens.py --input original.txt --diff compressed.txt --model gpt-4o
```
4. **Generate the Token Budget Report** (see format below).
Present findings to the user in this format:
## 📊 Token Budget Report
**Total Estimated Tokens:** `~<NUM>k (of 128k / 200k / 1M limit)`
### 🍰 Breakdown by Layer
| Layer | Estimated Tokens | % of Total | Status |
|----------------------------|------------------|------------|-----------------------------------|
| 🛠️ System + KIs + Skills | X,XXX | XX% | `Healthy` / `Bloated` |
| 💬 Conversation History | X,XXX | XX% | `Healthy` / `Too Long` |
| 📄 Open Documents | X,XXX | XX% | `Normal` |
| 🧠 Tool Outputs | X,XXX | XX% | `Warning: Huge outputs detected!` |
### 🚨 Bloat Warnings
- **[Skill Name]:** XXXX tokens (oversized description metadata).
- **[Conversation]:** Multiple large search/RAG dumps in the chat loop.
### 💡 Optimization Recommendations
1. Use Memory Distillation (Strategy 1) to compress conversation history.
2. Trim oversized skill metadata with `analyze_skills.py`.
3. Lazy-load skills — only inject SKILL.md when explicitly invoked.
### Common Anti-Patterns

| Anti-Pattern | Description | Fix |
|---|---|---|
| Verbatim Tool Dumps | Pasting 500-line grep results into context | Return only matching lines |
| Overloaded Skill Metadata | SKILL.md description > 500 chars | Edit to be concise |
| Repetitive Instructions | Same boilerplate instructions repeated every turn | Consolidate to a single reference block |
| Full File Reads | Reading a 2,000-line file when only needing one function | Use grep or targeted read |
| Raw Conversation History | Storing every message un-distilled past 20 turns | Distill to episodic JSON facts |
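To make the first fix concrete, here is a minimal sketch (a hypothetical `trim_tool_output` helper, not one of this skill's scripts) that keeps only the matching lines of a tool dump and caps the total:

```python
# Hypothetical helper (not part of this skill) illustrating the
# "return only matching lines" fix for verbatim tool dumps.
import re

def trim_tool_output(output: str, pattern: str, max_lines: int = 20) -> str:
    """Keep only lines matching `pattern`, capped at `max_lines`."""
    matches = [line for line in output.splitlines() if re.search(pattern, line)]
    kept = matches[:max_lines]
    if len(matches) > len(kept):
        kept.append(f"... ({len(matches) - len(kept)} more matching lines omitted)")
    return "\n".join(kept)
```

With this in place, a 500-line grep dump enters context as at most 21 lines.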
### Tokenizer Reference

| Model Family | Tokenizer | Vocab Size | Python Tool |
|---|---|---|---|
| GPT-4o, GPT-4o-mini | o200k_base | ~200k | tiktoken |
| GPT-4, GPT-3.5, Claude (proxy) | cl100k_base | ~100k | tiktoken |
| Gemma 1/2/3, Ollama/Gemma | SentencePiece google/gemma-7b | ~256k | transformers |
| Any model (fallback) | char // 4 estimate | — | Built-in (zero deps) |
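For intuition, the table above reduces to a small dispatch with a graceful fallback. This is an illustrative sketch, not the actual source of `count_tokens.py`, and `google/gemma-7b` is a gated Hugging Face model (authenticated access assumed):

```python
# Illustrative sketch of the fallback chain above; the real count_tokens.py
# may differ. tiktoken and transformers are optional dependencies.
def count_tokens(text: str, model: str = "estimate") -> int:
    try:
        if model in ("gpt-4o", "gpt-4o-mini"):
            import tiktoken
            return len(tiktoken.get_encoding("o200k_base").encode(text))
        if model in ("gpt-4", "gpt-3.5", "claude"):  # Claude via cl100k_base proxy, ±5%
            import tiktoken
            return len(tiktoken.get_encoding("cl100k_base").encode(text))
        if model == "gemma":
            from transformers import AutoTokenizer  # gated model: auth assumed
            tok = AutoTokenizer.from_pretrained("google/gemma-7b")
            return len(tok.encode(text))
    except ImportError:
        pass  # missing dependency: fall through to the estimate
    return len(text) // 4  # zero-dependency char // 4 estimate
```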
### General Rules

When tasked with reducing context bloat, apply one or more of these strategies, in order of impact vs. risk:
### Strategy 1: Memory Distillation

Impact: 40–70% reduction | Risk: Very Low
Use when the conversation history is long (20+ turns or filling significant context).
What it does: Converts verbose chat history into structured JSON facts, preserving only decisions, preferences, constraints, and actions.
How to run:
```bash
# Distill a JSON chat history file
python distill_memory.py --input history.json --output facts.json

# Distill a plain text transcript
python distill_memory.py --input conversation.txt --output facts.json
```
Output schema (v2.0-openclaw):
```json
{
  "metadata": {
    "distillation_version": "2.0-openclaw",
    "original_chars": 45000,
    "lines_processed": 312,
    "facts_extracted_count": 18
  },
  "facts": [
    {
      "id": "a3f2b1c4",
      "type": "decision",
      "content": "Use React Router for navigation",
      "confidence": "high",
      "source_turn": "inferred"
    },
    {
      "id": "d8e1f290",
      "type": "preference",
      "content": "Prefers dark mode primary color palette",
      "confidence": "medium",
      "source_turn": "inferred"
    },
    {
      "id": "c7a9b034",
      "type": "constraint",
      "content": "Must not use third-party authentication libraries",
      "confidence": "high",
      "source_turn": "inferred"
    },
    {
      "id": "f1e2d5c6",
      "type": "next_action",
      "content": "Implement the dashboard layout component",
      "confidence": "low",
      "source_turn": "inferred"
    }
  ]
}
```
Fact types:

- `decision` — A confirmed architectural or design choice.
- `preference` — A user or agent preference about style or tools.
- `constraint` — A hard rule that must not be violated.
- `entity` — An important referenced class, file, or system name.
- `next_action` — A pending or future task.

Workflow:
1. Save the active conversation history to `temp_history.json`.
2. Run `distill_memory.py` on it.
3. If `agent-memory-mcp` is available, commit the facts via `memory_write`.

Reference: See the full 3-Tiered Memory Architecture in Section 2.6.
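For intuition, here is a minimal sketch of such an extraction pass using naive keyword heuristics. The keyword lists are illustrative assumptions; the real `distill_memory.py` may classify facts quite differently:

```python
# Naive keyword-based distiller producing the v2.0-openclaw schema above.
# The RULES table is an illustrative assumption, not the shipped logic.
# "entity" extraction would need NER-style logic and is omitted here.
import hashlib
import json

RULES = {
    "decision":    ("decided", "we'll use", "let's go with"),
    "preference":  ("prefer", "i like"),
    "constraint":  ("must not", "never", "required to"),
    "next_action": ("todo", "next step", "remaining"),
}

def distill(transcript: str) -> dict:
    facts = []
    for line in transcript.splitlines():
        lowered = line.lower()
        for fact_type, keywords in RULES.items():
            if any(kw in lowered for kw in keywords):
                facts.append({
                    "id": hashlib.sha1(line.encode()).hexdigest()[:8],
                    "type": fact_type,
                    "content": line.strip(),
                    "confidence": "medium",  # keyword hits alone can't earn "high"
                    "source_turn": "inferred",
                })
                break
    return {
        "metadata": {
            "distillation_version": "2.0-openclaw",
            "original_chars": len(transcript),
            "lines_processed": len(transcript.splitlines()),
            "facts_extracted_count": len(facts),
        },
        "facts": facts,
    }

with open("conversation.txt") as f:
    print(json.dumps(distill(f.read()), indent=2))
```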
### Strategy 2: Skill Metadata Trimming

Impact: 10–30% reduction | Risk: Zero

Use when the system prompt metadata is bloated with dozens of passively loaded skills.
What it does: Identifies skills whose SKILL.md description fields are too verbose (>500 chars) and flags them for trimming.
How to run:
```bash
# Auto-detects your skills directory (checks SKILLS_DIR env var, ./skills,
# ~/.openclaw/skills, ~/.gemini/antigravity/skills)
python analyze_skills.py

# Or set explicitly
SKILLS_DIR=/path/to/your/skills python analyze_skills.py
```
Sample output:
```
Analyzing skills in: /app/skills

--- Skill Context Audit ---
Total Skills Analyzed: 239
Estimated Token Cost of Loaded Skill Database: ~20539 tokens

🚨 Bloated Skills Detected (Metadata > 500 chars):
  - planning-with-files: 1049 chars (~262 tokens)
  - ui-ux-pro-max: 841 chars (~210 tokens)

💡 Recommendation: Edit the description in these SKILL.md files to be concise.
💡 Use Token Genome routing: only inject heavy SKILL descriptions when requested.
```
Fix: Edit the `description:` field in the flagged SKILL.md frontmatter to be one clean, concise sentence.
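For illustration, the audit boils down to scanning each SKILL.md for an oversized frontmatter description. A minimal sketch, assuming one skill per subdirectory and a single-line `description:` field (the real `analyze_skills.py` may parse frontmatter more robustly):

```python
# Sketch of the metadata audit. Assumes skills live in <skills_dir>/<name>/SKILL.md
# with a single-line `description:` field in the YAML frontmatter.
import os
import pathlib
import re

def audit_skills(skills_dir: str | None = None, limit: int = 500) -> None:
    root = pathlib.Path(skills_dir or os.environ.get("SKILLS_DIR", "./skills"))
    total_chars = 0
    for skill_md in sorted(root.glob("*/SKILL.md")):
        m = re.search(r"^description:\s*(.+)$", skill_md.read_text(), re.MULTILINE)
        desc = m.group(1) if m else ""
        total_chars += len(desc)
        if len(desc) > limit:
            print(f"🚨 {skill_md.parent.name}: {len(desc)} chars (~{len(desc) // 4} tokens)")
    print(f"Estimated token cost of skill metadata: ~{total_chars // 4} tokens")

audit_skills()
```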
### Strategy 3: Code Boilerplate Stubbing

Impact: Up to 80% on UI/boilerplate code | Risk: Low (never removes logic)
Use when working with large frontend or backend code files where the agent only needs to understand the logic, not re-read identical boilerplate.
What it does: Instead of reading an entire 500-line component into context, collapses standard import blocks and boilerplate into single-line comment stubs. The agent still understands what is there — it just doesn't re-tokenize redundant text.
Examples:
```tsx
// BEFORE — 5 lines, ~40 tokens
import { Button } from "@/components/ui/button"
import { Card, CardHeader, CardContent } from "@/components/ui/card"
import { useState, useEffect } from "react"
import { useRouter } from "next/navigation"
import type { FC } from "react"

// AFTER — 1 line, ~8 tokens
// Standard imports: Button, Card, useState, useEffect, useRouter (FC)
```
Rule: Only compress sections the agent does NOT need to modify. Never compress the logic function you are about to edit.
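A minimal sketch of the transform, in Python for consistency with the other scripts; the import-matching regex is an illustrative assumption tuned to the TypeScript example above:

```python
# Collapse a leading TypeScript/JS import block into a one-line comment stub.
# Illustrative sketch only; never run it on code you are about to edit.
import re

IMPORT_RE = re.compile(r'import\s+(?:type\s+)?\{?\s*([^}]+?)\s*\}?\s+from')

def stub_import_block(source: str) -> str:
    lines = source.splitlines()
    names, block_end = [], 0
    for i, line in enumerate(lines):
        m = IMPORT_RE.match(line)
        if m:
            names.extend(n.strip() for n in m.group(1).split(","))
            block_end = i + 1
        elif line.strip():
            break  # first real code line ends the import block
        else:
            block_end = i + 1  # blank lines inside the block are absorbed
    if not names:
        return source
    stub = f"// Standard imports: {', '.join(names)}"
    return "\n".join([stub] + lines[block_end:])
```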
### Strategy 4: XML Prompt Dialect for Gemma

Impact: 10–20% | Risk: Low for Gemma/Ollama; not needed for GPT/Claude
Use when the backend model is Gemma or a local Ollama model.
What it does: Rewrites flowing natural language prompts into structured XML blocks, which Gemma's SentencePiece tokenizer handles far more efficiently.
Example:
```text
# BEFORE — flowing prose (~22 tokens)
Here is the context of the user's issue which happens in the API.
Please read it and write a Python script to fix it.

# AFTER — XML dialect (~9 tokens)
<context>Issue in API.</context><task>Write Python fix.</task>
```
Why it works: XML tags like `<context>` are often single tokens in Gemma's 256k vocabulary, and you stop paying for filler like "The following is the context of...", which carries no information.
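To check the savings on your own prompts, a short sketch using the Gemma tokenizer (`google/gemma-7b` is gated on Hugging Face, so authenticated access is assumed; the sentences are the example above):

```python
# Compare token counts of the prose prompt vs. the XML dialect.
# Assumes authenticated access to the gated google/gemma-7b tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b")

prose = ("Here is the context of the user's issue which happens in the API. "
         "Please read it and write a Python script to fix it.")
xml = "<context>Issue in API.</context><task>Write Python fix.</task>"

print(len(tok.encode(prose)), "->", len(tok.encode(xml)))
```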
### Strategy 5: Perplexity-Based Compression (LLMLingua-2)

Impact: 20–50% | Risk: HIGH on live prompts — use only for offline large docs
🚨 Critical Guardrail: NEVER apply this to:
- Live system prompts
- Tool schemas or function definitions
- JSON configs or structured data
- Code (unless completely isolated and reversible)
Only use for: offline experiments on large pasted documentation blocks, research dumps, or very long user-provided text.
What it does: Uses LLMLingua-2 to score each token's information content (a perplexity/entropy measure) and remove low-information filler without changing the core meaning.
Setup (one-time):
```bash
pip install llmlingua
```
How to run:
```bash
# Compress a large research document by 50%
python compress_prompt.py --input large_doc.txt --ratio 0.5

# Preview first 500 chars before committing (dry-run safety check)
python compress_prompt.py --input large_doc.txt --ratio 0.5 --dry-run
```
Always use --dry-run first to visually verify the output is coherent before permanently replacing the source.
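A plausible core for `compress_prompt.py`, assuming the `PromptCompressor` / `compress_prompt(..., rate=...)` API shown in the LLMLingua README; verify the names against your installed version:

```python
# Plausible core of compress_prompt.py. API names follow the LLMLingua
# README (PromptCompressor, compress_prompt, rate); verify your version.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

with open("large_doc.txt") as f:
    result = compressor.compress_prompt(f.read(), rate=0.5)  # keep ~50% of tokens

# --dry-run behaviour: preview the first 500 chars instead of writing output.
print(result["compressed_prompt"][:500])
```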
### The 3-Tiered Memory Architecture

The 3-tiered memory model is the foundation of all quality-preserving token reduction:
```
┌─────────────────────────────────────┐
│ TIER 1: Working Memory (Expensive)  │
│ Active conversation log             │
│ → Keep SHORT. Max 20 turns.         │
└──────────────┬──────────────────────┘
               │ distill_memory.py
               ▼
┌─────────────────────────────────────┐
│ TIER 2: Episodic Store (Cheap)      │
│ JSON facts extracted from history   │
│ → 50-100 tokens per session         │
└──────────────┬──────────────────────┘
               │ memory_write (agent-memory-mcp)
               ▼
┌─────────────────────────────────────┐
│ TIER 3: Semantic Store (Free)       │
│ RAG / Vector / MCP external store   │
│ → Zero context cost until queried   │
└─────────────────────────────────────┘
```
🚨 The agent strictly ONLY distills memory under these conditions:
- At explicit task completion — a defined sub-task is marked done.
- After N turns — a dense session has exceeded the configured turn limit.
- At 80% context threshold — `count_tokens.py --estimate` confirms context is near capacity.
Never distill preemptively mid-task — you risk losing the working context needed to finish the current action.
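The three conditions reduce to a simple gate; a sketch with illustrative thresholds:

```python
# Distillation gate implied by the three rules above.
# The 128k limit and 20-turn cap are illustrative defaults.
def should_distill(task_done: bool, turns: int, tokens_used: int,
                   context_limit: int = 128_000, max_turns: int = 20) -> bool:
    if task_done:                           # explicit task completion
        return True
    if turns > max_turns:                   # dense session past the turn limit
        return True
    if tokens_used >= 0.8 * context_limit:  # 80% context threshold
        return True
    return False                            # never distill preemptively mid-task
```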
### Quick Reference

| Goal | Script | Command |
|---|---|---|
| Count tokens in a file (precise) | count_tokens.py | python count_tokens.py --input file.txt --model gpt-4o |
| Count tokens (no dependencies) | count_tokens.py | python count_tokens.py --input file.txt --estimate |
| Find bloated skill descriptions | analyze_skills.py | python analyze_skills.py |
| Distill chat to JSON facts | distill_memory.py | python distill_memory.py --input chat.json --output facts.json |
| Compress a large document (offline) | compress_prompt.py | python compress_prompt.py --input doc.txt --ratio 0.5 --dry-run |
Phase 2 — OpenClaw Production Hardened. All scripts include graceful dependency fallbacks.