Multi-Skill-Eval v1.0.0
Integrated Multi-Method Skill Evaluation System
Combines three evaluation approaches into one unified system:
- Skill Assessment — lightweight static analysis (fast, automated)
- Skill Evaluator — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
- Skill-Eval — autonomous benchmark evaluation with skill card generation
🚀 Quick Start
# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill
# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick
# Full evaluation + detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full
# Compare two skills
multi-skill-eval --compare skill-a skill-b
# Batch-evaluate all local skills
multi-skill-eval --all
# Benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2
Three Evaluation Methods
Method 1: Static Analysis (fast, ~30 seconds)
Lightweight automated checks covering four dimensions:
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json  # machine-readable output
Checks:
- Documentation completeness (SKILL.md, description quality, examples)
- Code quality and security signals (script syntax, error handling)
- Configuration friendliness (documented environment variables, clear defaults)
- Maintenance signals (version management, recent updates)
Output: a 0-100 score plus a list of issues grouped by severity.
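For batch pipelines, the --json output can be consumed programmatically. A minimal sketch, assuming only the documented flag (the helper name is illustrative, and the report's exact field layout is not specified here):

```python
import json
import os
import subprocess

def static_report(skill_path):
    """Run static-analyze.py with --json and parse the machine-readable report."""
    result = subprocess.run(
        ["python3", "scripts/static-analyze.py",
         os.path.expanduser(skill_path), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

report = static_report("~/.openclaw/skills/my-skill")
```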
Method 2: Rubric Scoring (detailed, ~10 minutes)
25 criteria across 8 categories, combining automated checks with manual review.
Run the automated structural checks:
python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose
Then score manually using references/rubric.md.
The 25 Criteria (8 Categories)
| # | Category | Framework | Criteria |
|---|---|---|---|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |
Scoring: each criterion is scored 0–4; 25 criteria × 4 points = 100 maximum.
| Score | Verdict | Action |
|---|---|---|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
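The verdict bands are easy to encode when automating a pre-publish gate; a minimal sketch (the function name is illustrative):

```python
def rubric_verdict(score):
    """Map a 0-100 rubric total to the verdict bands above."""
    if score >= 90:
        return "Excellent: publish confidently"
    if score >= 80:
        return "Good: publishable, note known issues"
    if score >= 70:
        return "Acceptable: fix P0s before publishing"
    if score >= 60:
        return "Needs Work: fix P0+P1 before publishing"
    return "Not Ready: significant rework needed"
```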
Rubric Score Sheet
Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md.
P0 Issues (block publishing):
- Missing SKILL.md or invalid frontmatter
- Hardcoded credentials or secrets
- Phantom tooling (referenced scripts not in package)
- No description or description < 50 chars
P1 Issues (should fix):
- No usage examples
- No error handling in scripts
- Missing dependency documentation
- Unclear trigger conditions
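Several of these issues are mechanically checkable. A minimal sketch of two P0 checks, phantom tooling and short descriptions (the regex and helper names are illustrative assumptions):

```python
import re
from pathlib import Path

def find_phantom_tooling(skill_dir, skill_md):
    """P0: flag scripts referenced in SKILL.md that are absent from the package."""
    referenced = re.findall(r"scripts/[\w.\-]+\.(?:py|sh|js)", skill_md)
    return [ref for ref in sorted(set(referenced))
            if not (Path(skill_dir) / ref).exists()]

def description_too_short(description):
    """P0: no description, or description under 50 characters."""
    return not description or len(description.strip()) < 50
```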
Method 3: Autonomous Benchmark (deep, ~30 minutes per skill)
Full multi-phase evaluation with multi-model support. Requires AI agent execution.
# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4
⚠️ Note: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.
📋 Planned: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.
Phase 1: Pre-flight Analysis
- Read SKILL.md — understand claims, dependencies, target use cases
- Classify skill type:
  - Capability uplift — teaches the agent something it can't do well
  - Encoded preference — sequences steps according to a specific process
- Dependency check:
  - Required CLI tools, API keys, env vars
  - Mark dependency-gated if credentials are missing (skip the eval; this is not the skill's fault)
  - Check for phantom tooling (referenced scripts not in package)
- Marketing claims check: flag any metrics ("7.8x faster") without evidence (see the sketch after this list)
- Read knowledge base: knowledge/lessons.md, eval-patterns.md, failures.md
- Check prior evaluations: knowledge/skill-profiles/<slug>.md
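The marketing-claims check lends itself to a regex pass. A minimal sketch, assuming quantified claims like "7.8x faster" or "40% improvement" should be surfaced for evidence review (the pattern and function name are illustrative):

```python
import re

# Quantified claims ("7.8x faster", "40% improvement") flagged for evidence review.
CLAIM_RE = re.compile(
    r"\b\d+(?:\.\d+)?x\s+(?:faster|better|cheaper)"
    r"|\b\d+(?:\.\d+)?%\s+(?:improvement|reduction|gain)",
    re.IGNORECASE,
)

def marketing_claims(skill_md):
    """Return quantified claims; the evaluator then checks each for evidence."""
    return CLAIM_RE.findall(skill_md)
```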
Phase 2: Test Case Design
Design 2-3 test prompts across four categories:
- Outcome — Did the task complete correctly?
- Process — Did the agent follow the skill's intended steps?
- Style — Does output follow skill-claimed conventions?
- Efficiency — Reasonable time/token usage?
Assertion design (two layers):
Layer 1: Deterministic checks (fast, reproducible)
- File existence, word counts, keyword presence
- Format compliance (valid JSON, SQL, markdown)
- Programmatic verification (run tests, check syntax)
Layer 2: Rubric-based quality assessment (LLM-as-judge)
- Judge model (NOT execution model) grades output against specific rubric
- Structured scoring, not pass/fail
Key assertion patterns:
- Banned-word checks for style-constrained skills (highly discriminating)
- Methodology/structure assertions for technical domains (baseline already strong on correctness)
- Output-floor assertions: required sections must appear even in error/fallback paths
- Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
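A minimal sketch of Layer-1 deterministic checks, covering keyword presence, banned words, bilingual variants, and JSON validity (the result format is illustrative):

```python
import json
from pathlib import Path

def layer1_checks(out_dir, required, banned):
    """Deterministic Layer-1 checks: keywords, banned words, JSON validity."""
    out_dir = Path(out_dir)
    text = "\n".join(p.read_text(errors="ignore")
                     for p in out_dir.rglob("*") if p.is_file())
    results = {"has:" + kw: kw in text for kw in required}   # e.g. ["索引", "index"]
    results.update({"banned:" + w: w not in text for w in banned})
    for p in out_dir.rglob("*.json"):
        try:
            json.loads(p.read_text())
            results["valid_json:" + p.name] = True
        except json.JSONDecodeError:
            results["valid_json:" + p.name] = False
    return results
```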
Phase 3: Execution
For each test case, spawn two subagents:
With-skill:
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
Without-skill (baseline):
[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
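A minimal sketch of how the paired prompts might be templated; the function is illustrative, since actual execution is orchestrated through AI agent sessions (see the note above):

```python
def build_prompts(model, skill_path, task, workspace, iteration, test_name):
    """Build the paired with-skill and baseline prompts for one test case."""
    base = f"{workspace}/iteration-{iteration}/{test_name}"
    with_skill = (
        f"[Model: {model}]\n"
        f"Read the skill at {skill_path}/SKILL.md and follow its instructions.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/with_skill/outputs/"
    )
    baseline = (
        f"[Model: {model}]\n"
        f"Complete this task using only built-in capabilities. Do NOT read SKILL.md.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/without_skill/outputs/"
    )
    return with_skill, baseline
```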
Multi-model mode: run the same skill across multiple models to check cross-model consistency.
Phase 4: Grading
Programmatic grading for deterministic checks; LLM-based grading for qualitative ones:
python3 scripts/grade-assertions.py --workspace /path/to/results
Save to grading.json:
{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
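The summary block follows mechanically from the expectations list; a minimal sketch (the helper name is illustrative):

```python
def summarize(expectations):
    """Derive the grading.json summary block from graded expectations."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```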
Phase 5: Benchmark Aggregation
{
  "with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "delta": {"pass_rate": "+0.XX", "time": "+Xx"},
  "model_used": "claude-sonnet-4",
  "verdict": "Recommended"
}
Efficiency flags: Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
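A minimal sketch of that flag, assuming avg_tokens is the cost proxy and treating a pass-rate delta within ±0.05 as "approximately zero" (both are assumptions, not specified by the engine):

```python
def high_overhead(with_skill, without_skill, quality_epsilon=0.05):
    """Flag high-overhead framework inflation: quality delta ~0, cost delta >2x."""
    quality_delta = with_skill["pass_rate"] - without_skill["pass_rate"]
    cost_ratio = with_skill["avg_tokens"] / max(without_skill["avg_tokens"], 1)
    return abs(quality_delta) <= quality_epsilon and cost_ratio > 2.0
```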
Phase 6: Skill Card Generation
python3 scripts/generate_skill_card.py \
--workspace /path/to/results \
--skill-name "My Skill" \
--skill-slug my-skill \
--eval-model claude-sonnet-4 \
--output skill-cards/my-skill-v1.md
Skill Card Contents:
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
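The overall score composes directly from its three parts, and the recommendation follows the verdict bands in the Scoring Summary below; a minimal sketch:

```python
def overall_score(quality, delta, efficiency):
    """Overall 0-10 = Quality (0-5) + Delta (0-3) + Efficiency (0-2)."""
    return quality + delta + efficiency

def recommendation(score):
    """Map the overall 0-10 score to the verdict bands in the Scoring Summary."""
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"
```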
Phase 7: Leaderboard Update
python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html
Self-Evolution Improvement Engine
⚠️ Planned — Not Yet Implemented
The self-evolution improvement engine is designed but not yet implemented. The knowledge base (knowledge/improve/) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.
Planned Improvement Process (Phase 7-12)
1. Read the knowledge base:
   - knowledge/improve/lessons.md — proven strategies
   - knowledge/improve/patterns.md — category-specific playbooks
   - knowledge/improve/failures.md — what NOT to try
2. Diagnose the root cause:
   - Skill too vague? (Doesn't specify enough to change model behavior)
   - Skill redundant? (Teaches things the model already knows)
   - Skill too heavy? (Adds overhead without a proportional quality gain)
   - Missing structure? (No clear output format)
   - Phantom tooling? (References tools that don't exist)
   - Reference manual anti-pattern? (>200 lines of educational content)
   - Library-as-skill anti-pattern? (Contains code instead of instructions)
3. Select an improvement strategy from the patterns:
   - Reference Manual Slim-Down: delete 70%+ of redundant content, add MUST/ALWAYS/NEVER mandates
   - Library-to-Instructions: convert code into behavioral instructions
   - Phantom Tooling Replacement: replace missing tool references with inline instructions
   - Overhead Routing: add quick-mode vs. full-framework routing
   - Assertion-Aligned Rewrite: rewrite to pass specific failed assertions
4. Rewrite SKILL.md with the selected strategy:
   - Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
   - Add specific, enforceable conventions
   - Remove redundant content the model already handles
   - Save as SKILL-improved.md
5. Update assertions to match the improved skill
6. Re-evaluate with the improved version
Planned Re-Eval (Phase 10-11)
Run the same eval against SKILL-improved.md:
- Score improved by >= 1.5 points → Success
- Less than 50% of previously-failed assertions fixed → Document limitation, move on
Planned Improvement Knowledge Update (Phase 12)
After each improvement batch:
- Update knowledge/improve/lessons.md with what worked
- Update knowledge/improve/patterns.md with reusable patterns
- Update knowledge/improve/failures.md with failed attempts
- Fold proven patterns back into this SKILL.md
Scoring Summary
| Method | Speed | Coverage | Best For |
|---|---|---|---|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min | Full pipeline (self-evolution planned) | Production evaluation, skill improvement |
| Overall Score | Verdict |
|---|---|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |
Anti-Patterns to Detect
- Reference manual anti-pattern: SKILL.md >200 lines of educational content (not behavioral instructions)
- Library-as-skill anti-pattern: SKILL.md contains Python/JS class definitions instead of instructions
- Phantom tooling: SKILL.md references scripts/binaries not in the package
- Framework skills with phantom tooling: evaluate template/output structure separately from real-data execution
- Unsubstantiated claims: Skill claims specific metrics without evidence — do not use self-reported numbers
- High-overhead framework inflation: Quality delta ≈ 0 but cost delta >2x — penalize efficiency
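The first two anti-patterns lend themselves to cheap heuristics. A minimal sketch; the 200-line threshold comes from the definition above, while the code-marker heuristic is an assumption:

```python
from pathlib import Path

def reference_manual_suspect(skill_md_path):
    """Heuristic: over 200 lines suggests educational, not behavioral, content."""
    return len(Path(skill_md_path).read_text().splitlines()) > 200

def library_as_skill_suspect(skill_md):
    """Heuristic: code definitions in SKILL.md suggest a library shipped as a skill."""
    markers = ("class ", "def ", "module.exports", "export default")
    return sum(m in skill_md for m in markers) >= 2
```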
Deeper Security Scanning
For thorough security audits, complement these methods with SkillLens:
npx skilllens scan /path/to/skill
Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.
Dependencies
- Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
- PyYAML (pip install pyyaml) — frontmatter parsing
- Node.js (for SkillLens security scanning)