Install
openclaw skills install lean-context
Reduce token usage in AI agent systems (Claude Code, OpenClaw, GPT Codex, Cursor, Windsurf, Aider, etc.) by applying context compression, selective loading, prompt deduplication, caching strategies, and efficient tool definitions. Use when: (1) the user wants to cut AI costs or token burn, (2) optimizing CLAUDE.md, AGENTS.md, system prompts, or skill files for size, (3) designing token-efficient agent architectures, (4) auditing a project or config for context bloat, (5) building skills or prompts that minimize context window usage, (6) asking about context engineering, prompt compression, LLMLingua, compaction, or sub-agent patterns for token savings. Triggers on: "reduce tokens", "optimize tokens", "token usage", "context bloat", "prompt compression", "context engineering", "token audit", "cut costs", "token-efficient", "compact context".
Cut token usage without cutting quality. Every technique below is battle-tested in production Claude Code, OpenClaw, and agentic systems.
Target: under 500 tokens for primary config files.
# CLAUDE.md / AGENTS.md template (good — ~150 tokens)
## Rules
- TypeScript strict mode
- Test every new function
- Follow existing patterns
## Key Files
- API routes: src/api/README.md
- DB schema: docs/schema.md
- Style guide: docs/style-guide.md
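To check a config file against the 500-token target, a rough estimate is often enough. The sketch below assumes about four characters per token, which is only an approximation for English text. `estimate_tokens` and `within_budget` are hypothetical helper names, and an exact count would need the tokenizer of the model in use.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def within_budget(text: str, budget: int = 500) -> bool:
    """Return True when the estimated token count fits the budget."""
    return estimate_tokens(text) <= budget


config = "## Rules\n- TypeScript strict mode\n- Test every new function\n"
print(estimate_tokens(config), within_budget(config))
```

Reading the file with `Path("CLAUDE.md").read_text()` and passing the result to `within_budget` would turn this into a quick pre-commit check.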
Principles:
Never load everything upfront. Use a three-tier system:
| Tier | When loaded | Size budget |
|---|---|---|
| Metadata (name+description) | Always | ~100 words |
| Core instructions (SKILL.md body) | On trigger | <5K words |
| Reference files | On demand | Unlimited |
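One way to picture the three tiers is as a context builder that adds each tier only when it is needed. This is a minimal sketch, not how any particular agent implements skill loading; the `Skill` class and `build_context` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str  # tier 1: always in context
    body: str         # tier 2: loaded when the skill triggers
    references: dict[str, str] = field(default_factory=dict)  # tier 3: on demand


def build_context(skill: Skill, triggered: bool, needed_refs: list[str]) -> str:
    """Assemble only the tiers the current request needs."""
    parts = [f"{skill.name}: {skill.description}"]
    if triggered:
        parts.append(skill.body)
        # Reference files are added one by one, and only when requested.
        parts.extend(skill.references[r] for r in needed_refs if r in skill.references)
    return "\n\n".join(parts)


skill = Skill("deploy", "Deploy apps to the cloud", "Core workflow...", {"aws.md": "AWS notes"})
print(build_context(skill, triggered=False, needed_refs=[]))
```

When the skill is not triggered, only the one-line metadata enters the context, regardless of how large the body and references are.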
Implementation patterns:
File pointers over inline content:
# Bad: inline everything
## API Reference
[2000 lines of API docs]
# Good: pointer + on-demand load
## API Reference
See docs/api.md — load only when working on API endpoints.
Domain-split references:
skill/
├── SKILL.md (core workflow only)
└── references/
    ├── aws.md    # load only for AWS tasks
    ├── gcp.md    # load only for GCP tasks
    └── azure.md  # load only for Azure tasks
Conditional loading via grep patterns: For large reference files, include search patterns in SKILL.md:
# BigQuery Metrics
See references/metrics.md. Search patterns:
- Revenue queries: grep -E "revenue|billing" references/metrics.md
- User analytics: grep -E "cohort|retention|churn" references/metrics.md
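The same idea can be sketched in Python for agents that load references programmatically: return only the lines that match a pattern, with a cap on how many enter the context. `grep_lines` and the sample `metrics_doc` content are hypothetical and exist only for illustration.

```python
import re


def grep_lines(text: str, pattern: str, max_lines: int = 20) -> list[str]:
    """Return up to max_lines lines of text that match the regex pattern."""
    regex = re.compile(pattern)
    matches = []
    for line in text.splitlines():
        if regex.search(line):
            matches.append(line)
            if len(matches) == max_lines:
                break  # stop early so a huge file cannot flood the context
    return matches


# Hypothetical stand-in for the contents of references/metrics.md
metrics_doc = """revenue_daily: sum of billing.amount per day
cohort_size: users grouped by signup week
latency_p95: request latency
"""
print(grep_lines(metrics_doc, r"revenue|billing"))
```

The `max_lines` cap plays the same role as piping grep into `head`.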
When to compact vs clear:
- /compact: Context is long but the thread is still relevant. Summarizes and restarts from the summary.
- /clear: Switching tasks entirely. Wipes everything. Clean slate.
Rules:
Tools are context too. Every tool definition loads on every request.
Principles:
- head -20 file.log (10 tokens) vs MCP JSON response (1000+ tokens)
Tool output compression:
# Bad: full JSON dump
{"status":"success","data":{"items":[...500 lines...],"meta":{"page":1,"total":847}}}
# Good: structured summary
"847 items found. First 5: [names]. See full results in .cache/search.json"
Select relevant sentences, discard the rest. Best for narrative documents.
# Before: 500 tokens
Customer John reported unstable internet for 3 days with video call disruptions.
Support ticket #4521 opened on 2026-04-15. Multiple attempts to reset router failed.
# After (extractive): 30 tokens
John: unstable internet 3 days, video call disruptions, router reset failed.
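A very simple form of extractive selection scores each sentence by keyword overlap and keeps the best ones verbatim. The sketch below assumes the caller supplies the keywords; `extractive_compress` is a hypothetical helper, and production systems typically use embeddings or a trained ranker instead of keyword counts.

```python
import re


def extractive_compress(text: str, keywords: set[str], max_sentences: int = 2) -> str:
    """Keep the sentences that mention the most keywords, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for index, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        scored.append((len(words & keywords), index))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    keep = sorted(index for score, index in top if score > 0)
    return " ".join(sentences[i] for i in keep)


ticket = (
    "Customer John reported unstable internet for 3 days with video call disruptions. "
    "Support ticket #4521 opened on 2026-04-15. "
    "Multiple attempts to reset router failed."
)
print(extractive_compress(ticket, {"internet", "router", "unstable"}))
```

Unlike the condensed "after" text above, this keeps whole sentences unchanged, which avoids any rewriting step at the cost of a smaller compression ratio.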
Keep or discard entire chunks. Best for factual/citation-heavy content. Zero rewrite cost.
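# Approach: filter chunks by relevance score, then pass only the top-k to the model
# `retrieved` is the retriever's output; each chunk exposes a numeric `.score`
above_threshold = [c for c in retrieved if c.score > 0.75]
relevant_chunks = sorted(above_threshold, key=lambda c: c.score, reverse=True)[:3]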
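For a self-contained version with sample data, the filter can be wrapped in a function. The `Chunk` dataclass and `top_k_chunks` are hypothetical names, and the sketch assumes the retriever returns a relevance score where higher means more relevant.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever (assumed: higher is better)


def top_k_chunks(retrieved: list[Chunk], threshold: float = 0.75, k: int = 3) -> list[Chunk]:
    """Drop chunks at or below the threshold, then keep the k highest-scoring ones."""
    kept = [c for c in retrieved if c.score > threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:k]


retrieved = [Chunk("a", 0.80), Chunk("b", 0.50), Chunk("c", 0.95), Chunk("d", 0.90), Chunk("e", 0.76)]
print([c.text for c in top_k_chunks(retrieved)])
```

Sorting before slicing matters: without it, the slice keeps the first three chunks above the threshold rather than the three most relevant ones.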
Uses a small model to remove low-information tokens. Up to 20x compression with <2% quality loss.
llmlingua (Python), LLMLingua-2
Across context:
Across sessions:
- Use memory files (memory/*.md) for cross-session continuity instead of re-explaining
Across tools:
- Use jq to extract specific fields instead of loading full JSON
- Use grep ERROR log.txt | head -5 instead of cat log.txt
Prompt caching (provider-level):
Sub-agent architecture:
Main agent (clean context, ~2K tokens)
├── Sub-agent 1: deep research (uses 50K tokens, returns 1K summary)
├── Sub-agent 2: code generation (uses 30K tokens, returns diff)
└── Sub-agent 3: testing (uses 20K tokens, returns pass/fail + details)
Each sub-agent explores extensively but returns only distilled results. Main agent stays lean.
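The pattern can be sketched with plain functions standing in for model calls. Everything here is hypothetical: `run_subagent`, `main_agent`, and the worker callables are illustrative names, and the truncation step is only a stand-in for a real summarization call.

```python
from typing import Callable


def run_subagent(task: str, worker: Callable[[str], str], max_summary_chars: int = 200) -> str:
    """Run a worker in its own scratch context and return only a bounded result."""
    scratch = worker(task)              # may be very large; never enters the main context
    return scratch[:max_summary_chars]  # stand-in for a real summarization step


def main_agent(tasks: dict[str, Callable[[str], str]]) -> list[str]:
    """Collect distilled results; the main context holds summaries only."""
    context = []
    for task, worker in tasks.items():
        context.append(f"{task}: {run_subagent(task, worker)}")
    return context


workers = {
    "research": lambda task: "finding " * 10_000,  # simulates a 50K-token exploration
    "testing": lambda task: "pass",
}
for entry in main_agent(workers):
    print(len(entry), entry[:40])
```

However much a worker produces, the main context grows by at most `max_summary_chars` per task.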
Model-tiering:
OpenClaw:
- references/.
- Use jq, head, tail, grep to scope output before it enters context.
Claude Code:
- .claudeignore: Exclude node_modules, build artifacts, lock files, data files
- context-mode MCP plugin: Automatically compresses MCP tool outputs
- Run /compact after 20-30 messages or when switching sub-tasks
- Use /model to switch tiers mid-session (Haiku for lookups, Sonnet for implementation, Opus for architecture)
- Set the MAX_THINKING_TOKENS=8000 env var to cap extended thinking budget
General (applies to all agents):
- .gitignore pattern for AI: exclude binaries, media, large generated files
- --no-ask-user / --allow-all flags reduce confirmation round-trips
Run this against any AI project:
Token Audit:
□ System prompt / config files under 500 tokens?
□ Reference docs in separate files, not inline?
□ Tool outputs scoped (jq/head/grep), not raw dumps?
□ Sessions reset between topics?
□ MCP servers limited to active ones only?
□ Few-shot examples: 3 diverse > 10 similar?
□ Sub-agents used for deep exploration work?
□ Model tier matches task complexity?
□ Caching enabled for static prompt prefixes?
□ Deduplication: no repeated instructions across context layers?
□ .claudeignore / .gitignore excluding non-essential files?
□ Extended thinking budget capped for simple tasks?
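The deduplication item in the checklist can be partly automated. The sketch below flags instruction lines that appear in more than one context layer after light normalization. `find_duplicates` is a hypothetical helper, and it only catches near-verbatim repeats, not paraphrases.

```python
from collections import defaultdict


def find_duplicates(layers: dict[str, str]) -> dict[str, list[str]]:
    """Map each normalized instruction line to the layers that repeat it."""
    seen = defaultdict(list)
    for layer_name, text in layers.items():
        # A set ensures a line repeated inside one layer counts once for that layer.
        lines = {line.strip().lstrip("-* ").lower() for line in text.splitlines()}
        for line in lines:
            if line:
                seen[line].append(layer_name)
    return {line: names for line, names in seen.items() if len(names) > 1}


layers = {
    "CLAUDE.md": "- Test every new function\n- TypeScript strict mode",
    "system prompt": "Test every new function\nBe concise",
}
print(find_duplicates(layers))
```

Each reported line is a candidate for keeping in one layer only.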
| Strategy | Effort | Savings | Best For |
|---|---|---|---|
| Slash system prompt | Low | 10-30% baseline | Every project |
| Selective loading | Medium | 40-70% per-query | Multi-domain agents |
| Compaction | Low | 50-80% long sessions | Coding agents |
| Efficient tools | Medium | 50-90% MCP usage | Tool-heavy workflows |
| Prompt compression | High | Up to 20x on docs | RAG, research agents |
| Deduplication | Low | 10-25% per session | All agents |
| Caching & sub-agents | High | 30-60% overall | Production systems |
| Model tiering | Low | 3-5x lower per-query cost | All multi-model setups |
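Model tiering from the last row can start as a simple lookup from task type to tier. This sketch reuses the Haiku, Sonnet, and Opus split mentioned earlier. The task-type labels, `TIER_BY_TASK`, and `pick_tier` are hypothetical, and the tier strings are labels rather than real model identifiers.

```python
# Assumption: tasks are pre-classified into these hypothetical categories.
TIER_BY_TASK = {
    "lookup": "haiku",           # cheapest tier for simple retrieval
    "implementation": "sonnet",  # mid tier for everyday coding
    "architecture": "opus",      # most capable tier for design work
}


def pick_tier(task_type: str, default: str = "sonnet") -> str:
    """Return the tier label for a task type, falling back to the default."""
    return TIER_BY_TASK.get(task_type, default)


for task in ("lookup", "architecture", "unknown"):
    print(task, "->", pick_tier(task))
```

Defaulting unknown tasks to the middle tier is a design choice: it avoids paying top-tier prices by accident while keeping quality acceptable for unclassified work.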
For deeper implementation details — LLMLingua integration code, LangChain compression pipelines, TikToken measurement, sub-agent architecture patterns, semantic deduplication, and dollar savings formulas — see references/compression-deep-dive.md.