Multi-Skill-Eval v1.0.0
Integrated Multi-Method Skill Evaluation System
Combines three evaluation approaches into one unified system:
- Skill Assessment — lightweight static analysis (fast, automated)
- Skill Evaluator — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
- Skill-Eval — autonomous benchmark evaluation with skill card generation
🚀 Quick Start
# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill
# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick
# Full evaluation + detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full
# Compare two skills
multi-skill-eval --compare skill-a skill-b
# Batch-evaluate all local skills
multi-skill-eval --all
# Benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2
Three Evaluation Methods
Method 1: Static Analysis (fast, ~30 seconds)
Lightweight automated checks covering four dimensions:
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json  # machine-readable output
Checks:
- Documentation completeness (SKILL.md, description quality, examples)
- Code quality and security signals (script syntax, error handling)
- Configuration friendliness (documented environment variables, clear defaults)
- Maintenance signals (version management, recent updates)
Output: a 0-100 score plus a list of issues grouped by severity.
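For batch pipelines, the --json output can be consumed programmatically. A minimal sketch, assuming only the documented flag (the helper name is illustrative, and the report's exact field layout is not specified here):

```python
import json
import os
import subprocess

def static_report(skill_path):
    """Run static-analyze.py with --json and parse the machine-readable report."""
    result = subprocess.run(
        ["python3", "scripts/static-analyze.py",
         os.path.expanduser(skill_path), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

report = static_report("~/.openclaw/skills/my-skill")
```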
Method 2: Rubric Scoring (detailed, ~10 minutes)
25 criteria across 8 categories, combining automated checks with manual review.
Run the automated structural checks:
python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose
Then score manually using references/rubric.md.
The 25 Criteria (8 Categories)
| # | Category | Framework | Criteria |
|---|---|---|---|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |
Scoring: each criterion is scored 0–4; 25 criteria × 4 points = 100 maximum.
| Score | Verdict | Action |
|---|---|---|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
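The verdict bands are easy to encode when automating a pre-publish gate; a minimal sketch (the function name is illustrative):

```python
def rubric_verdict(score):
    """Map a 0-100 rubric total to the verdict bands above."""
    if score >= 90:
        return "Excellent: publish confidently"
    if score >= 80:
        return "Good: publishable, note known issues"
    if score >= 70:
        return "Acceptable: fix P0s before publishing"
    if score >= 60:
        return "Needs Work: fix P0+P1 before publishing"
    return "Not Ready: significant rework needed"
```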
Rubric Score Sheet
Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md.
P0 Issues (block publishing):
- Missing SKILL.md or invalid frontmatter
- Hardcoded credentials or secrets
- Phantom tooling (referenced scripts not in package)
- No description or description < 50 chars
P1 Issues (should fix):
- No usage examples
- No error handling in scripts
- Missing dependency documentation
- Unclear trigger conditions
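Several of these issues are mechanically checkable. A minimal sketch of two P0 checks, phantom tooling and short descriptions (the regex and helper names are illustrative assumptions):

```python
import re
from pathlib import Path

def find_phantom_tooling(skill_dir, skill_md):
    """P0: flag scripts referenced in SKILL.md that are absent from the package."""
    referenced = re.findall(r"scripts/[\w.\-]+\.(?:py|sh|js)", skill_md)
    return [ref for ref in sorted(set(referenced))
            if not (Path(skill_dir) / ref).exists()]

def description_too_short(description):
    """P0: no description, or description under 50 characters."""
    return not description or len(description.strip()) < 50
```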
Method 3: Autonomous Benchmark (deep, ~30 minutes per skill)
Full multi-phase evaluation with multi-model support. Requires AI agent execution.
# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4
⚠️ Note: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.
📋 Planned: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.
Phase 1: Pre-flight Analysis
- Read SKILL.md — understand claims, dependencies, target use cases
- Classify skill type:
  - Capability uplift — teaches the agent something it can't do well
  - Encoded preference — sequences steps according to a specific process
- Dependency check:
  - Required CLI tools, API keys, env vars
  - Mark dependency-gated if credentials are missing (skip the eval; this is not the skill's fault)
  - Check for phantom tooling (referenced scripts not in package)
- Marketing claims check: flag any metrics ("7.8x faster") without evidence (see the sketch after this list)
- Read knowledge base: knowledge/lessons.md, eval-patterns.md, failures.md
- Check prior evaluations: knowledge/skill-profiles/<slug>.md
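The marketing-claims check lends itself to a regex pass. A minimal sketch, assuming quantified claims like "7.8x faster" or "40% improvement" should be surfaced for evidence review (the pattern and function name are illustrative):

```python
import re

# Quantified claims ("7.8x faster", "40% improvement") flagged for evidence review.
CLAIM_RE = re.compile(
    r"\b\d+(?:\.\d+)?x\s+(?:faster|better|cheaper)"
    r"|\b\d+(?:\.\d+)?%\s+(?:improvement|reduction|gain)",
    re.IGNORECASE,
)

def marketing_claims(skill_md):
    """Return quantified claims; the evaluator then checks each for evidence."""
    return CLAIM_RE.findall(skill_md)
```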
Phase 2: Test Case Design
Design 2-3 test prompts across four categories:
- Outcome — Did the task complete correctly?
- Process — Did the agent follow the skill's intended steps?
- Style — Does output follow skill-claimed conventions?
- Efficiency — Reasonable time/token usage?
Assertion design (two layers):
Layer 1: Deterministic checks (fast, reproducible)
- File existence, word counts, keyword presence
- Format compliance (valid JSON, SQL, markdown)
- Programmatic verification (run tests, check syntax)
Layer 2: Rubric-based quality assessment (LLM-as-judge)
- Judge model (NOT execution model) grades output against specific rubric
- Structured scoring, not pass/fail
Key assertion patterns:
- Banned-word checks for style-constrained skills (highly discriminating)
- Methodology/structure assertions for technical domains (baseline already strong on correctness)
- Output-floor assertions: required sections must appear even in error/fallback paths
- Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
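A minimal sketch of Layer-1 deterministic checks, covering keyword presence, banned words, bilingual variants, and JSON validity (the result format is illustrative):

```python
import json
from pathlib import Path

def layer1_checks(out_dir, required, banned):
    """Deterministic Layer-1 checks: keywords, banned words, JSON validity."""
    out_dir = Path(out_dir)
    text = "\n".join(p.read_text(errors="ignore")
                     for p in out_dir.rglob("*") if p.is_file())
    results = {"has:" + kw: kw in text for kw in required}   # e.g. ["索引", "index"]
    results.update({"banned:" + w: w not in text for w in banned})
    for p in out_dir.rglob("*.json"):
        try:
            json.loads(p.read_text())
            results["valid_json:" + p.name] = True
        except json.JSONDecodeError:
            results["valid_json:" + p.name] = False
    return results
```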
Phase 3: Execution
For each test case, spawn two subagents:
With-skill:
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
Without-skill (baseline):
[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
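A minimal sketch of how the paired prompts might be templated; the function is illustrative, since actual execution is orchestrated through AI agent sessions (see the note above):

```python
def build_prompts(model, skill_path, task, workspace, iteration, test_name):
    """Build the paired with-skill and baseline prompts for one test case."""
    base = f"{workspace}/iteration-{iteration}/{test_name}"
    with_skill = (
        f"[Model: {model}]\n"
        f"Read the skill at {skill_path}/SKILL.md and follow its instructions.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/with_skill/outputs/"
    )
    baseline = (
        f"[Model: {model}]\n"
        f"Complete this task using only built-in capabilities. Do NOT read SKILL.md.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/without_skill/outputs/"
    )
    return with_skill, baseline
```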
Multi-model mode: run the same skill across multiple models to check cross-model consistency.
Phase 4: Grading
Programmatic grading for deterministic checks; LLM-based grading for qualitative ones:
python3 scripts/grade-assertions.py --workspace /path/to/results
Save to grading.json:
{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
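The summary block follows mechanically from the expectations list; a minimal sketch (the helper name is illustrative):

```python
def summarize(expectations):
    """Derive the grading.json summary block from graded expectations."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```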
Phase 5: Benchmark Aggregation
{
  "with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "delta": {"pass_rate": "+0.XX", "time": "+Xx"},
  "model_used": "claude-sonnet-4",
  "verdict": "Recommended"
}
Efficiency flags: Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
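A minimal sketch of that flag, assuming avg_tokens is the cost proxy and treating a pass-rate delta within ±0.05 as "approximately zero" (both are assumptions, not specified by the engine):

```python
def high_overhead(with_skill, without_skill, quality_epsilon=0.05):
    """Flag high-overhead framework inflation: quality delta ~0, cost delta >2x."""
    quality_delta = with_skill["pass_rate"] - without_skill["pass_rate"]
    cost_ratio = with_skill["avg_tokens"] / max(without_skill["avg_tokens"], 1)
    return abs(quality_delta) <= quality_epsilon and cost_ratio > 2.0
```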
Phase 6: Skill Card Generation
python3 scripts/generate_skill_card.py \
--workspace /path/to/results \
--skill-name "My Skill" \
--skill-slug my-skill \
--eval-model claude-sonnet-4 \
--output skill-cards/my-skill-v1.md
Skill Card Contents:
- Metadata: name, source, eval date, model, engine version
- Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
- With-skill vs without-skill comparison table
- Per-test-case breakdown with assertions, timing, grading
- Strengths / Weaknesses
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
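The overall score composes directly from its three parts, and the recommendation follows the verdict bands in the Scoring Summary below; a minimal sketch:

```python
def overall_score(quality, delta, efficiency):
    """Overall 0-10 = Quality (0-5) + Delta (0-3) + Efficiency (0-2)."""
    return quality + delta + efficiency

def recommendation(score):
    """Map the overall 0-10 score to the verdict bands in the Scoring Summary."""
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"
```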
Phase 7: Leaderboard Update
python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html
Self-Evolution Improvement Engine
⚠️ Planned — Not Yet Implemented
The self-evolution improvement engine is designed but not yet implemented. The knowledge base (knowledge/improve/) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.
Planned Improvement Process (Phase 7-12)
1. Read the knowledge base:
   - knowledge/improve/lessons.md — proven strategies
   - knowledge/improve/patterns.md — category-specific playbooks
   - knowledge/improve/failures.md — what NOT to try
2. Diagnose the root cause:
   - Skill too vague? (Doesn't specify enough to change model behavior)
   - Skill redundant? (Teaches things the model already knows)
   - Skill too heavy? (Adds overhead without a proportional quality gain)
   - Missing structure? (No clear output format)
   - Phantom tooling? (References tools that don't exist)
   - Reference manual anti-pattern? (>200 lines of educational content)
   - Library-as-skill anti-pattern? (Contains code instead of instructions)
3. Select an improvement strategy from the patterns:
   - Reference Manual Slim-Down: delete 70%+ of redundant content, add MUST/ALWAYS/NEVER mandates
   - Library-to-Instructions: convert code into behavioral instructions
   - Phantom Tooling Replacement: replace missing tool references with inline instructions
   - Overhead Routing: add quick-mode vs. full-framework routing
   - Assertion-Aligned Rewrite: rewrite to pass specific failed assertions
4. Rewrite SKILL.md with the selected strategy:
   - Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
   - Add specific, enforceable conventions
   - Remove redundant content the model already handles
   - Save as SKILL-improved.md
5. Update assertions to match the improved skill
6. Re-evaluate with the improved version
Planned Re-Eval (Phase 10-11)
Run the same eval against SKILL-improved.md:
- Score improved by >= 1.5 points → Success
- Less than 50% of previously-failed assertions fixed → Document limitation, move on
Planned Improvement Knowledge Update (Phase 12)
After each improvement batch:
- Update knowledge/improve/lessons.md with what worked
- Update knowledge/improve/patterns.md with reusable patterns
- Update knowledge/improve/failures.md with failed attempts
- Fold proven patterns back into this SKILL.md
Scoring Summary
| Method | Speed | Coverage | Best For |
|---|---|---|---|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min | Full pipeline (self-evolution planned) | Production evaluation, skill improvement |
| Overall Score | Verdict |
|---|---|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |
Anti-Patterns to Detect
- Reference manual anti-pattern: SKILL.md >200 lines of educational content (not behavioral instructions)
- Library-as-skill anti-pattern: SKILL.md contains Python/JS class definitions instead of instructions
- Phantom tooling: SKILL.md references scripts/binaries not in the package
- Framework skills with phantom tooling: evaluate template/output structure separately from real-data execution
- Unsubstantiated claims: Skill claims specific metrics without evidence — do not use self-reported numbers
- High-overhead framework inflation: Quality delta ≈ 0 but cost delta >2x — penalize efficiency
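The first two anti-patterns lend themselves to cheap heuristics. A minimal sketch; the 200-line threshold comes from the definition above, while the code-marker heuristic is an assumption:

```python
from pathlib import Path

def reference_manual_suspect(skill_md_path):
    """Heuristic: over 200 lines suggests educational, not behavioral, content."""
    return len(Path(skill_md_path).read_text().splitlines()) > 200

def library_as_skill_suspect(skill_md):
    """Heuristic: code definitions in SKILL.md suggest a library shipped as a skill."""
    markers = ("class ", "def ", "module.exports", "export default")
    return sum(m in skill_md for m in markers) >= 2
```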
Deeper Security Scanning
For thorough security audits, complement these methods with SkillLens:
npx skilllens scan /path/to/skill
Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.
Dependencies
- Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
- PyYAML (pip install pyyaml) — frontmatter parsing
- Node.js (for SkillLens security scanning)