## Install

```bash
openclaw skills install token-optimizer-skills
```

Full token economy suite for OpenClaw agents. Audits context window usage (skills, history, tool outputs), then applies 5 creative strategies to reduce bloat without losing memory quality. Use when asked to "analyze tokens", "reduce context", "find bloat", "optimize memory", or "distill history".

This skill is the complete token management layer for your OpenClaw agent. It covers two phases: first audit where your tokens go, then optimize with the strategies below.
Trigger a token audit when:

- The user asks to "analyze tokens", "reduce context", or "find bloat".
- `count_tokens.py --estimate` returns a value above 80% of the model's limit.

Run this sequence:
1. **Calculate system prompt & skills cost.** Identify which skills are loaded and extract their metadata and SKILL.md sizes.
2. **Calculate conversation history cost.** Examine how long the conversation has run and estimate recent tool output lengths (grep results, file reads, API responses).
3. **Count precisely** using the helper script:
```bash
# Zero-dependency estimate (works everywhere)
python count_tokens.py --input my_prompt.txt --estimate

# Precise count for GPT-4o
python count_tokens.py --input my_prompt.txt --model gpt-4o

# Precise count for Gemma/Ollama
python count_tokens.py --input my_prompt.txt --model gemma

# Precise count for Claude (proxy via cl100k_base, ±5% accuracy)
python count_tokens.py --input my_prompt.txt --model claude

# Compare original vs. compressed file
python count_tokens.py --input original.txt --diff compressed.txt --model gpt-4o
```
4. **Generate the Token Budget Report** (see format below).
Present findings to the user in this format:
## 📊 Token Budget Report
**Total Estimated Tokens:** `~<NUM>k (of 128k / 200k / 1M limit)`
### 🍰 Breakdown by Layer
| Layer | Estimated Tokens | % of Total | Status |
|----------------------------|------------------|------------|-----------------------------------|
| 🛠️ System + KIs + Skills | X,XXX | XX% | `Healthy` / `Bloated` |
| 💬 Conversation History | X,XXX | XX% | `Healthy` / `Too Long` |
| 📄 Open Documents | X,XXX | XX% | `Normal` |
| 🧠 Tool Outputs | X,XXX | XX% | `Warning: Huge outputs detected!` |
### 🚨 Bloat Warnings
- **[Skill Name]:** XXXX tokens (oversized description metadata).
- **[Conversation]:** Multiple large search/RAG dumps in the chat loop.
### 💡 Optimization Recommendations
1. Use Memory Distillation (Strategy 1) to compress conversation history.
2. Trim oversized skill metadata with `analyze_skills.py`.
3. Lazy-load skills — only inject SKILL.md when explicitly invoked.
### Common Anti-Patterns

| Anti-Pattern | Description | Fix |
|---|---|---|
| Verbatim Tool Dumps | Pasting 500-line grep results into context | Return only matching lines |
| Overloaded Skill Metadata | SKILL.md description > 500 chars | Edit to be concise |
| Repetitive Instructions | Same boilerplate instructions repeated every turn | Consolidate to a single reference block |
| Full File Reads | Reading a 2,000-line file when only needing one function | Use grep or targeted read |
| Raw Conversation History | Storing every message un-distilled past 20 turns | Distill to episodic JSON facts |
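To make the first fix concrete, here is a minimal sketch (a hypothetical `trim_tool_output` helper, not one of this skill's scripts) that keeps only the matching lines of a tool dump and caps the total:

```python
# Hypothetical helper (not part of this skill) illustrating the
# "return only matching lines" fix for verbatim tool dumps.
import re

def trim_tool_output(output: str, pattern: str, max_lines: int = 20) -> str:
    """Keep only lines matching `pattern`, capped at `max_lines`."""
    matches = [line for line in output.splitlines() if re.search(pattern, line)]
    kept = matches[:max_lines]
    if len(matches) > len(kept):
        kept.append(f"... ({len(matches) - len(kept)} more matching lines omitted)")
    return "\n".join(kept)
```

With this in place, a 500-line grep dump enters context as at most 21 lines.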
### Tokenizer Reference

| Model Family | Tokenizer | Vocab Size | Python Tool |
|---|---|---|---|
| GPT-4o, GPT-4o-mini | o200k_base | ~200k | tiktoken |
| GPT-4, GPT-3.5, Claude (proxy) | cl100k_base | ~100k | tiktoken |
| Gemma 1/2/3, Ollama/Gemma | SentencePiece google/gemma-7b | ~256k | transformers |
| Any model (fallback) | char // 4 estimate | — | Built-in (zero deps) |
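For intuition, the table above reduces to a small dispatch with a graceful fallback. This is an illustrative sketch, not the actual source of `count_tokens.py`, and `google/gemma-7b` is a gated Hugging Face model (authenticated access assumed):

```python
# Illustrative sketch of the fallback chain above; the real count_tokens.py
# may differ. tiktoken and transformers are optional dependencies.
def count_tokens(text: str, model: str = "estimate") -> int:
    try:
        if model in ("gpt-4o", "gpt-4o-mini"):
            import tiktoken
            return len(tiktoken.get_encoding("o200k_base").encode(text))
        if model in ("gpt-4", "gpt-3.5", "claude"):  # Claude via cl100k_base proxy, ±5%
            import tiktoken
            return len(tiktoken.get_encoding("cl100k_base").encode(text))
        if model == "gemma":
            from transformers import AutoTokenizer  # gated model: auth assumed
            tok = AutoTokenizer.from_pretrained("google/gemma-7b")
            return len(tok.encode(text))
    except ImportError:
        pass  # missing dependency: fall through to the estimate
    return len(text) // 4  # zero-dependency char // 4 estimate
```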
### General Rules

When tasked with reducing context bloat, apply one or more of these strategies, in order of impact vs. risk:
### Strategy 1: Memory Distillation

Impact: 40–70% reduction | Risk: Very Low
Use when the conversation history is long (20+ turns or filling significant context).
What it does: Converts verbose chat history into structured JSON facts, preserving only decisions, preferences, constraints, and actions.
How to run:
```bash
# Distill a JSON chat history file
python distill_memory.py --input history.json --output facts.json

# Distill a plain text transcript
python distill_memory.py --input conversation.txt --output facts.json
```
Output schema (v2.0-openclaw):
```json
{
  "metadata": {
    "distillation_version": "2.0-openclaw",
    "original_chars": 45000,
    "lines_processed": 312,
    "facts_extracted_count": 18
  },
  "facts": [
    {
      "id": "a3f2b1c4",
      "type": "decision",
      "content": "Use React Router for navigation",
      "confidence": "high",
      "source_turn": "inferred"
    },
    {
      "id": "d8e1f290",
      "type": "preference",
      "content": "Prefers dark mode primary color palette",
      "confidence": "medium",
      "source_turn": "inferred"
    },
    {
      "id": "c7a9b034",
      "type": "constraint",
      "content": "Must not use third-party authentication libraries",
      "confidence": "high",
      "source_turn": "inferred"
    },
    {
      "id": "f1e2d5c6",
      "type": "next_action",
      "content": "Implement the dashboard layout component",
      "confidence": "low",
      "source_turn": "inferred"
    }
  ]
}
```
Fact types:

- `decision` — A confirmed architectural or design choice.
- `preference` — A user or agent preference about style or tools.
- `constraint` — A hard rule that must not be violated.
- `entity` — An important referenced class, file, or system name.
- `next_action` — A pending or future task.

Workflow:
1. Save the active conversation history to `temp_history.json`.
2. Run `distill_memory.py` on it.
3. If `agent-memory-mcp` is available, commit the facts via `memory_write`.

Reference: See the full 3-Tiered Memory Architecture in Section 2.6.
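For intuition, here is a minimal sketch of such an extraction pass using naive keyword heuristics. The keyword lists are illustrative assumptions; the real `distill_memory.py` may classify facts quite differently:

```python
# Naive keyword-based distiller producing the v2.0-openclaw schema above.
# The RULES table is an illustrative assumption, not the shipped logic.
# "entity" extraction would need NER-style logic and is omitted here.
import hashlib
import json

RULES = {
    "decision":    ("decided", "we'll use", "let's go with"),
    "preference":  ("prefer", "i like"),
    "constraint":  ("must not", "never", "required to"),
    "next_action": ("todo", "next step", "remaining"),
}

def distill(transcript: str) -> dict:
    facts = []
    for line in transcript.splitlines():
        lowered = line.lower()
        for fact_type, keywords in RULES.items():
            if any(kw in lowered for kw in keywords):
                facts.append({
                    "id": hashlib.sha1(line.encode()).hexdigest()[:8],
                    "type": fact_type,
                    "content": line.strip(),
                    "confidence": "medium",  # keyword hits alone can't earn "high"
                    "source_turn": "inferred",
                })
                break
    return {
        "metadata": {
            "distillation_version": "2.0-openclaw",
            "original_chars": len(transcript),
            "lines_processed": len(transcript.splitlines()),
            "facts_extracted_count": len(facts),
        },
        "facts": facts,
    }

with open("conversation.txt") as f:
    print(json.dumps(distill(f.read()), indent=2))
```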
### Strategy 2: Skill Metadata Trimming

Impact: 10–30% reduction | Risk: Zero

Use when the system prompt metadata is bloated with dozens of passively loaded skills.
What it does: Identifies skills whose SKILL.md description fields are too verbose (>500 chars) and flags them for trimming.
How to run:
```bash
# Auto-detects your skills directory (checks SKILLS_DIR env var, ./skills,
# ~/.openclaw/skills, ~/.gemini/antigravity/skills)
python analyze_skills.py

# Or set explicitly
SKILLS_DIR=/path/to/your/skills python analyze_skills.py
```
Sample output:
```
Analyzing skills in: /app/skills

--- Skill Context Audit ---
Total Skills Analyzed: 239
Estimated Token Cost of Loaded Skill Database: ~20539 tokens

🚨 Bloated Skills Detected (Metadata > 500 chars):
  - planning-with-files: 1049 chars (~262 tokens)
  - ui-ux-pro-max: 841 chars (~210 tokens)

💡 Recommendation: Edit the description in these SKILL.md files to be concise.
💡 Use Token Genome routing: only inject heavy SKILL descriptions when requested.
```
Fix: Edit the `description:` field in the flagged SKILL.md frontmatter to be one clean, concise sentence.
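For illustration, the audit boils down to scanning each SKILL.md for an oversized frontmatter description. A minimal sketch, assuming one skill per subdirectory and a single-line `description:` field (the real `analyze_skills.py` may parse frontmatter more robustly):

```python
# Sketch of the metadata audit. Assumes skills live in <skills_dir>/<name>/SKILL.md
# with a single-line `description:` field in the YAML frontmatter.
import os
import pathlib
import re

def audit_skills(skills_dir: str | None = None, limit: int = 500) -> None:
    root = pathlib.Path(skills_dir or os.environ.get("SKILLS_DIR", "./skills"))
    total_chars = 0
    for skill_md in sorted(root.glob("*/SKILL.md")):
        m = re.search(r"^description:\s*(.+)$", skill_md.read_text(), re.MULTILINE)
        desc = m.group(1) if m else ""
        total_chars += len(desc)
        if len(desc) > limit:
            print(f"🚨 {skill_md.parent.name}: {len(desc)} chars (~{len(desc) // 4} tokens)")
    print(f"Estimated token cost of skill metadata: ~{total_chars // 4} tokens")

audit_skills()
```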
### Strategy 3: Code Boilerplate Stubbing

Impact: Up to 80% on UI/boilerplate code | Risk: Low (never removes logic)
Use when working with large frontend or backend code files where the agent only needs to understand the logic, not re-read identical boilerplate.
What it does: Instead of reading an entire 500-line component into context, collapses standard import blocks and boilerplate into single-line comment stubs. The agent still understands what is there — it just doesn't re-tokenize redundant text.
Examples:
```tsx
// BEFORE — 5 lines, ~40 tokens
import { Button } from "@/components/ui/button"
import { Card, CardHeader, CardContent } from "@/components/ui/card"
import { useState, useEffect } from "react"
import { useRouter } from "next/navigation"
import type { FC } from "react"

// AFTER — 1 line, ~8 tokens
// Standard imports: Button, Card, useState, useEffect, useRouter (FC)
```
Rule: Only compress sections the agent does NOT need to modify. Never compress the logic function you are about to edit.
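A minimal sketch of the transform, in Python for consistency with the other scripts; the import-matching regex is an illustrative assumption tuned to the TypeScript example above:

```python
# Collapse a leading TypeScript/JS import block into a one-line comment stub.
# Illustrative sketch only; never run it on code you are about to edit.
import re

IMPORT_RE = re.compile(r'import\s+(?:type\s+)?\{?\s*([^}]+?)\s*\}?\s+from')

def stub_import_block(source: str) -> str:
    lines = source.splitlines()
    names, block_end = [], 0
    for i, line in enumerate(lines):
        m = IMPORT_RE.match(line)
        if m:
            names.extend(n.strip() for n in m.group(1).split(","))
            block_end = i + 1
        elif line.strip():
            break  # first real code line ends the import block
        else:
            block_end = i + 1  # blank lines inside the block are absorbed
    if not names:
        return source
    stub = f"// Standard imports: {', '.join(names)}"
    return "\n".join([stub] + lines[block_end:])
```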
### Strategy 4: XML Prompt Dialect for Gemma

Impact: 10–20% | Risk: Low for Gemma/Ollama; not needed for GPT/Claude
Use when the backend model is Gemma or a local Ollama model.
What it does: Rewrites flowing natural language prompts into structured XML blocks, which Gemma's SentencePiece tokenizer handles far more efficiently.
Example:
```text
# BEFORE — flowing prose (~22 tokens)
Here is the context of the user's issue which happens in the API.
Please read it and write a Python script to fix it.

# AFTER — XML dialect (~9 tokens)
<context>Issue in API.</context><task>Write Python fix.</task>
```
Why it works: XML tags like `<context>` are often single tokens in Gemma's 256k vocabulary, and you stop paying for filler like "The following is the context of...", which carries no information.
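To check the savings on your own prompts, a short sketch using the Gemma tokenizer (`google/gemma-7b` is gated on Hugging Face, so authenticated access is assumed; the sentences are the example above):

```python
# Compare token counts of the prose prompt vs. the XML dialect.
# Assumes authenticated access to the gated google/gemma-7b tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b")

prose = ("Here is the context of the user's issue which happens in the API. "
         "Please read it and write a Python script to fix it.")
xml = "<context>Issue in API.</context><task>Write Python fix.</task>"

print(len(tok.encode(prose)), "->", len(tok.encode(xml)))
```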
### Strategy 5: Perplexity-Based Compression (LLMLingua-2)

Impact: 20–50% | Risk: HIGH on live prompts — use only for offline large docs
🚨 Critical Guardrail: NEVER apply this to:
- Live system prompts
- Tool schemas or function definitions
- JSON configs or structured data
- Code (unless completely isolated and reversible)
Only use for: offline experiments on large pasted documentation blocks, research dumps, or very long user-provided text.
What it does: Uses LLMLingua-2 to score each token's information content (a perplexity/entropy measure) and remove low-information filler without changing the core meaning.
Setup (one-time):
```bash
pip install llmlingua
```
How to run:
```bash
# Compress a large research document by 50%
python compress_prompt.py --input large_doc.txt --ratio 0.5

# Preview first 500 chars before committing (dry-run safety check)
python compress_prompt.py --input large_doc.txt --ratio 0.5 --dry-run
```
Always use --dry-run first to visually verify the output is coherent before permanently replacing the source.
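A plausible core for `compress_prompt.py`, assuming the `PromptCompressor` / `compress_prompt(..., rate=...)` API shown in the LLMLingua README; verify the names against your installed version:

```python
# Plausible core of compress_prompt.py. API names follow the LLMLingua
# README (PromptCompressor, compress_prompt, rate); verify your version.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

with open("large_doc.txt") as f:
    result = compressor.compress_prompt(f.read(), rate=0.5)  # keep ~50% of tokens

# --dry-run behaviour: preview the first 500 chars instead of writing output.
print(result["compressed_prompt"][:500])
```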
### The 3-Tiered Memory Architecture

The 3-tiered memory model is the foundation of all quality-preserving token reduction:
```
┌─────────────────────────────────────┐
│ TIER 1: Working Memory (Expensive)  │
│ Active conversation log             │
│ → Keep SHORT. Max 20 turns.         │
└──────────────┬──────────────────────┘
               │ distill_memory.py
               ▼
┌─────────────────────────────────────┐
│ TIER 2: Episodic Store (Cheap)      │
│ JSON facts extracted from history   │
│ → 50-100 tokens per session         │
└──────────────┬──────────────────────┘
               │ memory_write (agent-memory-mcp)
               ▼
┌─────────────────────────────────────┐
│ TIER 3: Semantic Store (Free)       │
│ RAG / Vector / MCP external store   │
│ → Zero context cost until queried   │
└─────────────────────────────────────┘
```
🚨 The agent strictly ONLY distills memory under these conditions:
- At explicit task completion — a defined sub-task is marked done.
- After N turns — a dense session has exceeded the configured turn limit.
- At 80% context threshold — `count_tokens.py --estimate` confirms context is near capacity.
Never distill preemptively mid-task — you risk losing the working context needed to finish the current action.
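The three conditions reduce to a simple gate; a sketch with illustrative thresholds:

```python
# Distillation gate implied by the three rules above.
# The 128k limit and 20-turn cap are illustrative defaults.
def should_distill(task_done: bool, turns: int, tokens_used: int,
                   context_limit: int = 128_000, max_turns: int = 20) -> bool:
    if task_done:                           # explicit task completion
        return True
    if turns > max_turns:                   # dense session past the turn limit
        return True
    if tokens_used >= 0.8 * context_limit:  # 80% context threshold
        return True
    return False                            # never distill preemptively mid-task
```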
### Quick Reference

| Goal | Script | Command |
|---|---|---|
| Count tokens in a file (precise) | count_tokens.py | python count_tokens.py --input file.txt --model gpt-4o |
| Count tokens (no dependencies) | count_tokens.py | python count_tokens.py --input file.txt --estimate |
| Find bloated skill descriptions | analyze_skills.py | python analyze_skills.py |
| Distill chat to JSON facts | distill_memory.py | python distill_memory.py --input chat.json --output facts.json |
| Compress a large document (offline) | compress_prompt.py | python compress_prompt.py --input doc.txt --ratio 0.5 --dry-run |
Phase 2 — OpenClaw Production Hardened. All scripts include graceful dependency fallbacks.