Install
openclaw skills install benchmark-model-providerBenchmark and rank AI providers/models against a user-specific prompt suite derived from the user's purpose, domain, and usage frequency. Use when users ask which model is smarter, cheaper, deeper, faster, worth using daily, better as local vs service, or when building repeatable benchmark specs, reranking old runs, generating markdown/HTML/PDF benchmark reports.
openclaw skills install benchmark-model-providerUse this skill to help users choose the most suitable model for their own workflow instead of giving generic “best model” advice.
Tiếng Việt
Dùng skill này khi Boss muốn biết model nào thật sự đáng dùng cho workflow hằng ngày: model nào research tốt hơn, viết báo cáo ổn hơn, code ngon hơn, rẻ hơn, nhanh hơn, hay đáng dùng lâu dài hơn. Skill này không trả lời kiểu cảm tính, mà dựng benchmark theo đúng nhu cầu thực tế của người dùng rồi chấm, rerank và xuất report rõ ràng.
中文说明
当用户想知道“哪个模型更聪明、更便宜、更适合日常工作流、更适合研究/写报告/编程”时,使用这个技能。它不会给出泛泛而谈的“最佳模型”建议,而是根据用户自己的实际任务构建基准测试,保留原始结果、重新排序,并生成可审阅、可分享的报告。
Treat the benchmark as a personal decision framework:
People often ask questions like:
This skill exists to answer those questions with a repeatable benchmark process, not with vague preferences.
| Area | Default |
|---|---|
| Benchmark mode | prompt_only |
| Overall scoring | quality + depth + cost |
| Speed handling | measured and reported, excluded from default overall |
| Execution strategy | sequential unless orchestration is needed |
| Web publish target | (no built-in publish) — suggest Vercel / Netlify / Cloudflare Pages / GitHub Pages |
prompt_only unless the user explicitly wants agent_context.prompt_only, send only the raw prompt.prompt_only mode.agent_context, use one fixed shared system/context layer for all compared models and record it in metadata.sequential and subagent_orchestrated execution strategies.--max-parallel 4) when the endpoint can tolerate it.rerank as a first-class operation; do not rerun models when only the scoring formula changes.This skill may perform network I/O depending on how the benchmark spec is configured.
run_benchmark.py sends prompts to the base_url configured in the benchmark spec.For detailed runtime assumptions, read:
references/runtime-safety.mdreferences/environment-vars.mdreferences/pricing-sources.mdRead only what you need:
references/initial-project-spec.md — authoritative design baselinereferences/benchmark-schema.md — benchmark spec structure, run artifacts, file layoutreferences/scoring-rubric.md — scoring model, normalization rules, default weightsreferences/pricing-sources.md — pricing precedence and estimation policyreferences/execution-modes.md — benchmark modes, execution strategies, operational modesreferences/output-modes.md — delivery choices, publish rules, progress feedback rulesreferences/runtime-safety.md — trust boundaries, network behavior, safe usage guidancereferences/environment-vars.md — expected environment variables and dependency notesexamples/*.yaml — benchmark context templates and ready-made examples in multiple languages| Script | Purpose |
|---|---|
scripts/build_benchmark_spec.py | Build a benchmark spec from benchmark context |
scripts/run_benchmark.py | Execute benchmark runs and write raw outputs/metrics |
scripts/estimate_tokens.py | Estimate token counts when provider usage is missing |
scripts/resolve_pricing.py | Resolve pricing sources and compute estimated/official pricing |
scripts/score_models.py | Combine raw metrics and rubric scores into rankings |
scripts/build_report.py | Build markdown, HTML, and PDF report artifacts |
scripts/publish_report.py | No deployment automation. Export/copy PDF and print suggested static hosting options (Vercel/Netlify/Cloudflare Pages/GitHub Pages). |
Try to produce these artifacts whenever possible: