## Install

```shell
openclaw skills install model-benchmarks
```

> "Know thy models, optimize thy costs" — real-time AI model capability tracking via leaderboards (LMSYS Chatbot Arena, HuggingFace, etc.) for intelligent compute routing and cost optimization.
Transform your OpenClaw deployment from guesswork to data-driven model selection:

- **Problem:** OpenClaw users often overpay by sending simple tasks to expensive models, or get poor results by sending complex work to cheap ones.
- **Solution:** This skill provides real-time model intelligence so each task can be routed to the best-value model.
- **Result:** Users report 60-95% cost reduction with maintained or improved quality.
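The routing idea can be sketched in a few lines. The thresholds below and the mapping from complexity to model are illustrative assumptions for this sketch, not the skill's actual logic (the model names are ones tracked by the skill):

```python
# Illustrative sketch only: route a task to a model tier based on a rough
# complexity estimate in [0, 1]. Thresholds are arbitrary assumptions.
def route(task_complexity: float) -> str:
    if task_complexity < 0.3:
        return "gemini-2.0-flash"   # cheap and fast for simple tasks
    if task_complexity < 0.7:
        return "gpt-4o-mini"        # mid-tier balance
    return "claude-3.5-sonnet"      # strongest, most expensive

print(route(0.1))  # simple task -> cheap model
print(route(0.9))  # complex task -> premium model
```

In practice the skill derives these decisions from live leaderboard data rather than a fixed threshold table.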
```shell
# Fetch latest model intelligence
python3 skills/model-benchmarks/scripts/run.py fetch

# Find the best model for your task
python3 skills/model-benchmarks/scripts/run.py recommend --task coding

# Check any model's capabilities
python3 skills/model-benchmarks/scripts/run.py query --model gpt-4o
```
```
🏆 Top 3 recommendations for coding:

1. gemini-2.0-flash
   Task Score:      81.5/100
   Cost Efficiency: 445.33
   Avg Price:       $0.19/1M tokens

2. claude-3.5-sonnet
   Task Score:      92.0/100
   Cost Efficiency: 10.28
   Avg Price:       $9.00/1M tokens
```
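The `Cost Efficiency` figures above are roughly consistent with task score divided by average price per million tokens (92.0 / $9.00 ≈ 10.2 for claude-3.5-sonnet). This definition is inferred from the sample output, not confirmed by the skill's source:

```python
# Inferred (not confirmed) definition: benchmark points per dollar,
# i.e. task score divided by average price per 1M tokens.
def cost_efficiency(task_score: float, avg_price_per_mtok: float) -> float:
    return task_score / avg_price_per_mtok

print(round(cost_efficiency(92.0, 9.00), 2))   # ~10.22, close to the 10.28 shown
```

This is why a mid-scoring but very cheap model (gemini-2.0-flash) can top the ranking: dividing a decent score by a tiny price yields a huge efficiency value.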
```shell
# Get the optimal model, then configure OpenClaw
BEST_MODEL=$(python3 skills/model-benchmarks/scripts/run.py recommend --task coding --json | jq -r '.models[0]')
openclaw config set agents.defaults.model.primary "$BEST_MODEL"
```
```
# Add to crontab for fresh data (daily at 08:00)
0 8 * * * cd ~/.openclaw/workspace && python3 skills/model-benchmarks/scripts/run.py fetch
```
```shell
# Generate a cost-efficiency report
python3 skills/model-benchmarks/scripts/run.py analyze --export-csv > model_costs.csv
```
| Platform | Coverage | Update Frequency | Capabilities Tracked |
|---|---|---|---|
| LMSYS Chatbot Arena | 100+ models | Daily | General, Reasoning, Creative |
| BigCode Leaderboard | 50+ models | Weekly | Coding (HumanEval, MBPP) |
| Open LLM Leaderboard | 200+ models | Daily | Knowledge, Comprehension |
| Alpaca Eval | 80+ models | Weekly | Instruction Following |
The skill intelligently maps your tasks to optimal models:
| Task Type | Primary Capability | Recommended Models |
|---|---|---|
| `coding` | Coding + Reasoning | Gemini 2.0 Flash, Claude 3.5 Sonnet |
| `writing` | Creative + General | Claude 3.5 Sonnet, GPT-4o |
| `analysis` | Reasoning + Comprehension | GPT-4o, Claude 3.5 Sonnet |
| `translation` | General + Knowledge | Gemini 2.0 Flash, GPT-4o Mini |
| `math` | Reasoning + Knowledge | GPT-4o, Claude 3.5 Sonnet |
| `simple` | General | Gemini 2.0 Flash, GPT-4o Mini |
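A minimal sketch of how such a mapping might weight capabilities when scoring a model for a task. The dictionary shape and the weights here are assumptions for illustration, not the skill's actual `TASK_CAPABILITY_MAP`:

```python
# Hypothetical capability weights per task type (shape and values assumed).
TASK_CAPABILITY_MAP = {
    "coding":   {"coding": 0.7, "reasoning": 0.3},
    "writing":  {"creative": 0.6, "general": 0.4},
    "analysis": {"reasoning": 0.6, "comprehension": 0.4},
    "simple":   {"general": 1.0},
}

def task_score(task: str, capability_scores: dict) -> float:
    """Weighted average of a model's per-capability benchmark scores."""
    weights = TASK_CAPABILITY_MAP[task]
    return sum(capability_scores.get(cap, 0.0) * w for cap, w in weights.items())

# A model scoring 90 on coding benchmarks and 80 on reasoning:
print(task_score("coding", {"coding": 90.0, "reasoning": 80.0}))  # 87.0
```

Weighting (rather than using a single benchmark) is what lets a task like `analysis` blend reasoning and comprehension scores into one ranking.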
```shell
# Discover undervalued models
python3 skills/model-benchmarks/scripts/run.py analyze --sort-by efficiency --limit 10

# Compare model performance over time
python3 skills/model-benchmarks/scripts/run.py trends --model gpt-4o --days 30
```
Edit `BENCHMARK_SOURCES` in `scripts/run.py` to add new evaluation platforms.
Customize `TASK_CAPABILITY_MAP` to weight capabilities for your specific use cases.
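As a rough illustration, a new `BENCHMARK_SOURCES` entry might look like the following. The field names and the URL are hypothetical placeholders; check the existing entries in `scripts/run.py` for the real schema:

```python
# Hypothetical entry shape -- mirror the existing entries in scripts/run.py.
BENCHMARK_SOURCES = {
    "my-eval-platform": {
        "url": "https://example.com/leaderboard.json",  # placeholder endpoint
        "capabilities": ["coding", "reasoning"],
        "update_frequency": "weekly",
    },
}

assert "my-eval-platform" in BENCHMARK_SOURCES
```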
Startups using this skill report substantial cost savings without sacrificing output quality.
Found a new benchmark platform? Want to improve the scoring algorithm? Contributions are welcome!
Run `python3 skills/model-benchmarks/scripts/run.py --help` for the full command reference, and see the `examples/` directory for more usage patterns.

Make every token count — choose your models wisely! 🧠