Install
```
openclaw skills install llm-benchmark-analyst
```

Search and analyze LLM benchmark results within a fixed benchmark universe, then produce evidence-based model strength-and-weakness reports or domain-leader summaries. Use it when comparing a model across benchmarks, ranking the best models by domain, explaining what a benchmark measures, checking predecessor-vs-current progress, or writing benchmark reports that must prioritize exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and benchmark defect warnings. Works with browser, web, and multimodal extraction for text, table, canvas, or image-only leaderboards.
Use this skill to research benchmark evidence and write structured reports: single-model strength-and-weakness analyses, domain-leader rankings, and benchmark explainers.
Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.
Always work from the bundled reference files:

- Only cite benchmarks listed in references/benchmark-source.md. If a benchmark is not in that file, exclude it.
- Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
- Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
- Use references/report-template.md for output structure.
- Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
Normalize the model identity before searching

Do not search ambiguous names such as claude, gemini pro, gpt latest, or qwen max until you have the exact, currently relevant model string for the searched leaderboard rows.
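For illustration only, a minimal sketch of that gate; the alias set and function name are hypothetical, not part of the skill:

```python
# Illustrative gate: these aliases are too ambiguous to search directly.
AMBIGUOUS = {"claude", "gemini pro", "gpt latest", "qwen max"}

def searchable_model_string(name: str) -> str:
    """Refuse to run a leaderboard search until the name is exact."""
    if name.strip().lower() in AMBIGUOUS:
        raise ValueError(
            f"{name!r} is ambiguous; resolve it to the exact model string "
            "used on the target leaderboard before searching."
        )
    return name
```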
Route the request through core dimensions before web crawling

Use references/core-dimensions.md to select the primary dimension(s).
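The real mapping lives in references/core-dimensions.md; the sketch below uses placeholder benchmark and dimension names and only illustrates the idea of collapsing many benchmarks into a few report dimensions:

```python
# Placeholder map: real entries live in references/core-dimensions.md.
CORE_DIMENSIONS = {
    "coding": {"benchmark-a", "benchmark-b"},
    "reasoning": {"benchmark-c"},
    "multimodal": {"benchmark-d", "benchmark-e"},
}

def route(requested: set[str]) -> set[str]:
    """Return every core dimension touched by the requested benchmarks."""
    return {
        dim for dim, members in CORE_DIMENSIONS.items() if members & requested
    }
```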
Expand beyond section labels

The same capability is often split across differently labeled leaderboard sections; apply the overlap expansion rules in references/search-playbook.md instead of trusting a single section name.

Collect evidence in this order

Follow the token-efficient search order defined in references/search-playbook.md.
Use multimodal extraction when the leaderboard is not machine-readable
When scores are readable only from canvas renders or screenshots, extract them multimodally and flag every such value as image-extracted.
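As a hedged illustration (the field names and score value are made up, not defined by the skill), the flag can travel with the value itself:

```python
from dataclasses import dataclass

@dataclass
class ExtractedScore:
    """A leaderboard value plus a record of how it was obtained."""
    value: float
    provenance: str  # e.g. "table", "text", or "image-extracted"

# A value read off a canvas-only leaderboard keeps its warning flag,
# so the final report can caution about possible transcription error.
score = ExtractedScore(value=71.3, provenance="image-extracted")
```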
Apply anchor comparisons

The search playbook defines which model in each family currently counts as latest. Search that first.

Apply predecessor comparison

Check predecessor-vs-current progress on the same benchmark and variant wherever both models appear.
Attach defect warnings
Pull the matching warnings from references/data-defect-warnings.md.

When the user asks for the best models in a domain, do not use only one benchmark. Use a cluster of relevant benchmarks and explain why each one matters.

When the user asks what a model is good or bad at, synthesize at the core-dimension level first, then support with benchmark evidence.

For every benchmark you cite, capture the exact model version, evaluation date, benchmark variant, score semantics, sub-scores, and any applicable defect warnings.
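A minimal sketch of that evidence record, assuming nothing beyond the fields listed above (names and types are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkCitation:
    """One cited result, carrying every field the skill requires."""
    benchmark: str         # must appear in references/benchmark-source.md
    variant: str           # exact benchmark variant or subset
    model_version: str     # exact model string, never an alias
    eval_date: str         # when the score was produced or published
    score: float
    score_semantics: str   # what the number means: pass@1, Elo, accuracy, ...
    sub_scores: dict[str, float] = field(default_factory=dict)
    defect_warnings: list[str] = field(default_factory=list)
```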
Use the matching template in references/report-template.md.
At minimum, every substantive report must include exact model versions, evaluation dates, benchmark variants, score semantics, available sub-scores, and a limitations section that repeats the relevant defect warnings.
References

- references/core-dimensions.md: benchmark routing and de-fragmentation map
- references/search-playbook.md: token-efficient search order, overlap expansion, and comparison rules
- references/data-defect-warnings.md: warning catalog and ready-to-use caution language
- references/report-template.md: output structures for single-model, domain-leader, and benchmark-explainer tasks
- references/benchmark-source.md: full allowed benchmark universe copied from the user's benchmark document

Example prompts

- "Analyze gpt-5's coding and agentic coding strengths and weaknesses, and compare it with the latest claude opus, claude sonnet, and gpt model."
- "Find the best multimodal models right now using only the approved benchmark list and explain each benchmark briefly."
- "Write a report on qwen's reasoning strengths, benchmark gaps, predecessor comparison, and all data-quality caveats."
- "Tell me which models lead in deep research and search, with benchmark-specific warnings and freshness notes."