Lean Context
v1.0.0 · Reduce token usage in AI agent systems (Claude Code, OpenClaw, GPT Codex, Cursor, Windsurf, Aider, etc.) by applying context compression, selective loading, ...
Token Optimizer
Cut token usage without cutting quality. Every technique below is battle-tested in production Claude Code, OpenClaw, and agentic systems.
The 3 Token Drains (fix these first)
- Tool output accumulation — Every file read, shell command, and MCP response appends full output to context permanently. A 10K-line log file stays in context for every subsequent message. This is the #1 silent drain.
- Context compounding — The model re-reads the entire conversation on every turn. Message 50 costs more than message 5 because it re-reads 49 prior messages. Long sessions become token furnaces.
- System prompt baseline — CLAUDE.md / AGENTS.md / system prompts load before every single request. A 5,000-token config file costs 5,000 tokens per turn, forever.
Strategy 1: Slash System Prompt Size
Target: under 500 tokens for primary config files.
# CLAUDE.md / AGENTS.md template (good — ~150 tokens)
## Rules
- TypeScript strict mode
- Test every new function
- Follow existing patterns
## Key Files
- API routes: src/api/README.md
- DB schema: docs/schema.md
- Style guide: docs/style-guide.md
Principles:
- Give the shape of the project, not the full docs
- Use file pointers instead of inline documentation
- Link to reference files — the model reads them only when needed
- 3-5 rules + 3-5 file pointers is the sweet spot
Strategy 2: Selective Loading (Progressive Disclosure)
Never load everything upfront. Use a three-tier system:
| Tier | When loaded | Size |
|---|---|---|
| Metadata (name + description) | Always | ~100 words |
| Core instructions (SKILL.md body) | On trigger | <5K words |
| Reference files | On demand | Unlimited |
Implementation patterns:
File pointers over inline content:
# Bad: inline everything
## API Reference
[2000 lines of API docs]
# Good: pointer + on-demand load
## API Reference
See docs/api.md — load only when working on API endpoints.
Domain-split references:
skill/
├── SKILL.md (core workflow only)
└── references/
├── aws.md # load only for AWS tasks
├── gcp.md # load only for GCP tasks
└── azure.md # load only for Azure tasks
Conditional loading via grep patterns: For large reference files, include search patterns in SKILL.md:
# BigQuery Metrics
See references/metrics.md. Search patterns:
- Revenue queries: grep -E "revenue|billing" references/metrics.md
- User analytics: grep -E "cohort|retention|churn" references/metrics.md
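Putting the tiers together, a minimal loader sketch in Python; the `skill/` layout matches the tree above, and the trigger keywords are illustrative assumptions, not part of any agent's API:

```python
from pathlib import Path

SKILL_DIR = Path("skill")  # hypothetical layout matching the directory tree above

def load_skill_context(task: str) -> str:
    """Assemble context in tiers: metadata always, core body on trigger, references on demand."""
    skill_md = (SKILL_DIR / "SKILL.md").read_text()
    metadata, _, body = skill_md.partition("\n## ")       # tier 1: name + description only
    parts = [metadata]
    if body and any(word in task.lower() for word in ("deploy", "provision", "infra")):
        parts.append("## " + body)                        # tier 2: core instructions, on trigger
    for cloud in ("aws", "gcp", "azure"):                 # tier 3: domain-split references, on demand
        if cloud in task.lower():
            parts.append((SKILL_DIR / "references" / f"{cloud}.md").read_text())
    return "\n\n".join(parts)
```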
Strategy 3: Compaction & Session Hygiene
When to compact vs clear:
- `/compact` — Context is long but the thread is still relevant. Summarizes and restarts from the summary.
- `/clear` — Switching tasks entirely. Wipes everything. Clean slate.
Rules:
- New topic = new session. No exceptions.
- Point at specific files, never "read the codebase"
- Batch related tasks in one prompt: "Fix bug, refactor, add tests" > three separate prompts
- Use scripts for data you'll read yourself — keep the agent out of the loop
Strategy 4: Efficient Tool Definitions
Tools are context too. Every tool definition loads on every request.
Principles:
- Minimal viable tool set — if a human can't tell which tool to use, the model can't either
- CLI > MCP when possible: `head -20 file.log` (10 tokens) vs an MCP JSON response (1,000+ tokens)
- Tools should return summaries, not raw dumps
- Disconnect unused MCP servers — per-request overhead is real
Tool output compression:
# Bad: full JSON dump
{"status":"success","data":{"items":[...500 lines...],"meta":{"page":1,"total":847}}}
# Good: structured summary
"847 items found. First 5: [names]. See full results in .cache/search.json"
Strategy 5: Prompt Compression Techniques
Extractive Compression (low effort, good ROI)
Select relevant sentences, discard the rest. Best for narrative documents.
# Before: 500 tokens
Customer John reported unstable internet for 3 days with video call disruptions.
Support ticket #4521 opened on 2026-04-15. Multiple attempts to reset router failed.
# After (extractive): 30 tokens
John: unstable internet 3 days, video call disruptions, router reset failed.
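A minimal way to automate this is naive word-overlap scoring against the query; in practice embedding similarity or a reranker does better, but this sketch shows the shape of the technique:

```python
import re

def extractive_compress(text: str, query: str, keep: int = 2) -> str:
    """Naive extractive compression: keep the sentences with the most word overlap with the query."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    query_words = set(query.lower().split())
    top = sorted(sentences, key=lambda s: len(query_words & set(s.lower().split())), reverse=True)[:keep]
    return " ".join(s for s in sentences if s in top)      # preserve original sentence order
```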
Selection-Based Compression (chunk-level filter)
Keep or discard entire chunks. Best for factual/citation-heavy content. Zero rewrite cost.
# Approach: filter chunks by relevance score, only pass top-k to model
relevant_chunks = [c for c in retrieved if c.score > 0.75][:3]
LLMLingua-style Token-Level Compression
Uses a small model to remove low-information tokens. Up to 20x compression with <2% quality loss.
- Use for: ICL examples, retrieved documents, long instructions
- Don't use for: code, structured data, exact citations
- Libraries: `llmlingua` (Python), `LLMLingua-2`
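A usage sketch based on the `llmlingua` project's published examples; the model name and argument names should be verified against the current release before relying on them:

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# LLMLingua-2 model name taken from the project's examples; verify before use
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_document = "<your retrieved document or ICL examples>"  # not code or exact citations

result = compressor.compress_prompt(
    long_document,
    rate=0.33,                   # keep roughly a third of the tokens
    force_tokens=["\n", "?"],    # preserve structure-bearing tokens
)
print(result["compressed_prompt"], result["origin_tokens"], result["compressed_tokens"])
```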
Strategy 6: Deduplication Patterns
Across context:
- Never repeat instructions already in system prompt
- Reference previous context by turn number instead of re-stating: "as discussed in turn 3"
- Deduplicate few-shot examples — 3 diverse examples > 10 similar ones
Across sessions:
- Store shared context in files, not in conversation
- Use memory files (`memory/*.md`) for cross-session continuity instead of re-explaining
- Hash-check before loading: if a file hasn't changed, use the cached summary (sketch below)
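A minimal hash-check sketch; the cache location and helper names are assumptions, and `summarize` stands in for whatever summarization call you already use:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".cache/summaries.json")  # hypothetical cache location

def load_or_reuse_summary(path: str, summarize) -> str:
    """Reuse the cached summary while a file's content hash is unchanged; re-summarize only on change."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    entry = cache.get(path)
    if entry and entry["hash"] == digest:
        return entry["summary"]                       # unchanged file: zero extra model tokens
    summary = summarize(Path(path).read_text())       # changed file: pay the summarization cost once
    cache[path] = {"hash": digest, "summary": summary}
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps(cache))
    return summary
```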
Across tools:
- Don't read the same file twice — reference previous read
- Use `jq` to extract specific fields instead of loading full JSON
- Pipe commands: `grep ERROR log.txt | head -5` instead of `cat log.txt`
Strategy 7: Caching & Architecture
Prompt caching (provider-level):
- Anthropic: prompt caching for repeated prefixes (system prompts, tools), opted in per content block via `cache_control`
- OpenAI: automatic prefix caching for prompts beyond a minimum length
- Strategy: structure prompts so static content comes first, dynamic content last (sketch below)
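For Anthropic, a minimal sketch of marking the static prefix cacheable; the model id and file names are placeholders to substitute for your own:

```python
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = open("CLAUDE.md").read()           # static prefix: identical on every request
user_query = "Fix the failing test in src/api/users.ts"   # dynamic content goes last

response = client.messages.create(
    model="claude-sonnet-4-20250514",            # placeholder model id; substitute a current one
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # Anthropic's prompt-caching marker
    }],
    messages=[{"role": "user", "content": user_query}],
)
```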
Sub-agent architecture:
Main agent (clean context, ~2K tokens)
├── Sub-agent 1: deep research (uses 50K tokens, returns 1K summary)
├── Sub-agent 2: code generation (uses 30K tokens, returns diff)
└── Sub-agent 3: testing (uses 20K tokens, returns pass/fail + details)
Each sub-agent explores extensively but returns only distilled results. Main agent stays lean.
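A sketch of the pattern using the Anthropic SDK; the model id, token cap, and system instruction are illustrative choices, not a prescribed setup:

```python
import anthropic

client = anthropic.Anthropic()

def run_subagent(task: str, model: str = "claude-3-5-haiku-latest") -> str:
    """Run a task in an isolated context and return only a distilled summary to the main agent."""
    result = client.messages.create(
        model=model,                       # placeholder id; pick a tier per sub-task
        max_tokens=500,                    # hard cap on what flows back into the main context
        system="Explore as much as needed, but reply with a short summary of findings only.",
        messages=[{"role": "user", "content": task}],
    )
    return result.content[0].text          # only this summary enters the main agent's context

summary = run_subagent("Survey error handling across src/api/ and list the patterns used.")
```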
Model-tiering:
- Haiku/fast models: formatting, lookups, mechanical edits
- Mid-tier: implementation, tests, explanations
- Top-tier: architecture decisions, complex reasoning, multi-file refactors
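A routing sketch for the tiers above; the model ids are placeholders to map onto whatever your provider currently offers:

```python
# Placeholder model ids; map each tier to your provider's current models
MODEL_TIERS = {
    "lookup": "fast-model",        # formatting, lookups, mechanical edits
    "implement": "mid-model",      # implementation, tests, explanations
    "architect": "top-model",      # architecture decisions, complex reasoning, multi-file refactors
}

def pick_model(task_kind: str) -> str:
    """Route each request to the cheapest tier that can handle it; default to the mid tier."""
    return MODEL_TIERS.get(task_kind, MODEL_TIERS["implement"])
```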
Strategy 8: Agent-Specific Optimizations
OpenClaw:
- HEARTBEAT.md: Keep under 200 tokens. Reference external files for detailed checks.
- MEMORY.md: Curated, not raw. Review monthly. Remove stale entries.
- SOUL.md / USER.md: Load every session. Keep tight.
- Skills: Only the SKILL.md body loads on trigger. Put details in `references/`.
- Tool output: Use `jq`, `head`, `tail`, `grep` to scope output before it enters context.
- Write to files instead of returning large outputs in chat.
Claude Code:
- `.claudeignore` — Exclude node_modules, build artifacts, lock files, data files (sample below)
- `context-mode` MCP plugin — Automatically compresses MCP tool outputs
- `/compact` after 20-30 messages or when switching sub-tasks
- `/model` to switch tiers mid-session (Haiku for lookups, Sonnet for implementation, Opus for architecture)
- `MAX_THINKING_TOKENS=8000` env var to cap extended thinking budget
- Skills load on-demand only — install freely, they don't add baseline cost
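A starting-point `.claudeignore`, assuming it follows gitignore-style pattern syntax; trim to your project:

```
node_modules/
dist/
build/
*.lock
*.min.js
*.map
data/*.csv
*.png
*.jpg
```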
General (applies to all agents):
- `.gitignore`-style patterns for AI: exclude binaries, media, large generated files
- Structured output (JSON, YAML) > unstructured prose for tool responses
- `--no-ask-user` / `--allow-all` flags reduce confirmation round-trips
Quick Audit Checklist
Run this against any AI project:
Token Audit:
□ System prompt / config files under 500 tokens?
□ Reference docs in separate files, not inline?
□ Tool outputs scoped (jq/head/grep), not raw dumps?
□ Sessions reset between topics?
□ MCP servers limited to active ones only?
□ Few-shot examples: 3 diverse > 10 similar?
□ Sub-agents used for deep exploration work?
□ Model tier matches task complexity?
□ Caching enabled for static prompt prefixes?
□ Deduplication: no repeated instructions across context layers?
□ .claudeignore / .gitignore excluding non-essential files?
□ Extended thinking budget capped for simple tasks?
Strategy ROI Matrix
| Strategy | Effort | Savings | Best For |
|---|---|---|---|
| Slash system prompt | Low | 10-30% baseline | Every project |
| Selective loading | Medium | 40-70% per-query | Multi-domain agents |
| Compaction | Low | 50-80% long sessions | Coding agents |
| Efficient tools | Medium | 50-90% MCP usage | Tool-heavy workflows |
| Prompt compression | High | Up to 20x on docs | RAG, research agents |
| Deduplication | Low | 10-25% per session | All agents |
| Caching & sub-agents | High | 30-60% overall | Production systems |
| Model tiering | Low | 3-5x per-query cost | All multi-model setups |
References
For deeper implementation details — LLMLingua integration code, LangChain compression pipelines, TikToken measurement, sub-agent architecture patterns, semantic deduplication, and dollar savings formulas — see references/compression-deep-dive.md.
