Multi-Skill-Eval | Integrated Skill Evaluation System

v1.0.2

Integrated multi-method skill evaluation system. Combines static analysis (skill-assessment), rubric quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Use it to comprehensively evaluate, compare, audit, or improve OpenClaw skills. Covers documentation completeness, code quality, 25-criterion rubric scoring, and multi-model benchmarking. Trigger words (Chinese): 评估技...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for wangzairong/multi-skill-eval.

Prompt preview (Install & Setup):
Install the skill "Multi-Skill-Eval | 集成化技能评估系统" (wangzairong/multi-skill-eval) from ClawHub.
Skill page: https://clawhub.ai/wangzairong/multi-skill-eval
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install multi-skill-eval

ClawHub CLI


npx clawhub@latest install multi-skill-eval
Security Scan
Capability signals
Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
Name/description (skill evaluation, static analysis, rubric scoring, benchmark) match the included files and CLI instructions: scripts perform static analysis, generate cards/leaderboards, and run benchmarks. There are no unrelated required env vars or declared binaries that would be incoherent with the stated purpose.
Instruction Scope
SKILL.md instructs running the local Python CLI scripts against target skill directories and — for the benchmark method — to have the AI agent read SKILL.md, source files, and spawn subagents to execute tests. This is coherent for an evaluator but means the skill (and you, when running it) will read the full contents of whatever skill path you point it at; review inputs before evaluating sensitive code. I saw no instructions that attempt to exfiltrate data to unknown endpoints, but parts of the scripts were truncated so full review would be prudent.
Install Mechanism
No install spec (instruction-only with accompanying scripts) — lowest-risk delivery model. The package contains Python scripts; there is no evidence in the provided fragments of downloads from remote URLs or unusual install behavior. You should run the scripts in a controlled Python environment and review any third-party requirements not listed here.
Credentials
The skill declares no required environment variables, no primary credential, and no config paths. The rubric and scripts reference handling of credentials and dependency-gating conceptually, but there are no hardcoded secrets visible in the provided files. If you plan to run benchmarks that require external APIs or models, those credentials would be supplied by your agent environment — not by this skill.
Persistence & Privilege
Flags show always:false and default autonomous invocation allowed (normal). The skill does not request permanent presence or attempt to modify other skills' configs in the visible files. The benchmark method deliberately relies on agent orchestration (spawning subagents), which increases the operational blast radius if misused — this is expected for a benchmarking/evaluation tool but worth noting.
Assessment
This package appears coherent for its stated purpose: it runs local static checks, rubric grading, and an agent-driven benchmark. Before installing/running: (1) review the scripts (especially the truncated ones) for any network calls, subprocess calls, or file-write operations you don't expect; (2) run in an isolated environment (container or VM) if you plan to evaluate untrusted skills; (3) when using the benchmark mode, be aware it will cause the agent to read the full target skill directory and spawn subagents — do not point it at directories containing credentials or sensitive secrets. If you want a higher-confidence assessment, provide the remaining script contents (static-analyze.py and the truncated parts) so I can check for external network calls, subprocess/shell-injection usage, or hardcoded secrets.

Like a lobster shell, security has layers — review code before you run it.

latest: vk97dgm97pthk5a2sdqztt5r7b985kzh1 · 91 downloads · 1 star · 3 versions · Updated 1d ago · v1.0.2 · MIT-0

Multi-Skill-Eval v1.0.0

Integrated Multi-Method Skill Evaluation System

Combines three evaluation approaches into one unified system:

  1. Skill Assessment — lightweight static analysis (fast, automated)
  2. Skill Evaluator — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
  3. Skill-Eval — autonomous benchmark evaluation with skill card generation

🚀 Quick Start

# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill

# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick

# Full evaluation with a detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full

# Compare two skills
multi-skill-eval --compare skill-a skill-b

# Batch-evaluate all local skills
multi-skill-eval --all

# Benchmark with a specified model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2

Three Evaluation Methods

Method 1: Static Analysis (fast, ~30 s)

Lightweight automated checks covering four dimensions:

python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json    # machine-readable output

Checks:

  • Documentation completeness (SKILL.md present, description quality, examples)
  • Code quality and security signals (script syntax, error handling)
  • Configuration friendliness (documented environment variables, clear defaults)
  • Maintainability signals (version management, recent updates)

Output: a 0-100 score plus a list of issues grouped by severity.
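
As an illustration, a batch scan could shell out to static-analyze.py with --json and rank skills by score. This is a hypothetical sketch: it assumes the JSON output carries a top-level "score" field, which you should verify against the script's actual schema.

# Hypothetical batch scan (run from the skill package root, where
# scripts/static-analyze.py lives). ASSUMPTION: the --json output has a
# top-level "score" key; check the real schema before relying on this.
import json
import subprocess
from pathlib import Path

SKILLS_DIR = Path.home() / ".openclaw" / "skills"

results = []
for skill_dir in sorted(p for p in SKILLS_DIR.iterdir() if p.is_dir()):
    proc = subprocess.run(
        ["python3", "scripts/static-analyze.py", str(skill_dir), "--json"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        print(f"{skill_dir.name}: analysis failed")
        continue
    report = json.loads(proc.stdout)
    results.append((skill_dir.name, report.get("score", 0)))

# Highest-scoring skills first
for name, score in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{score:>3}  {name}")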


Method 2: Rubric Scoring (detailed, ~10 min)

25 criteria across 8 categories, combining automated checks with manual review.

Run the automated structural checks:

python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose

Then score manually using references/rubric.md.

The 25 Criteria (8 Categories)

| # | Category | Framework | Criteria |
|---|----------|-----------|----------|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |

Scoring: each criterion is scored 0–4; 25 criteria × 4 points = 100 max.

| Score | Verdict | Action |
|-------|---------|--------|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0 + P1 before publishing |
| <60 | Not Ready | Significant rework needed |

Rubric Score Sheet

Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md.

P0 Issues (blocks publishing):

  • Missing SKILL.md or invalid frontmatter (see the sketch after this list)
  • Hardcoded credentials or secrets
  • Phantom tooling (referenced scripts not in package)
  • No description or description < 50 chars
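
A minimal sketch of the frontmatter-related P0 checks above, assuming a standard "---"-delimited YAML frontmatter block with a description field (both assumptions; adapt to the real SKILL.md layout). It uses PyYAML, which is already a declared dependency.

# Sketch of the frontmatter P0 checks: SKILL.md must exist, its YAML
# frontmatter must parse, and the description must be >= 50 characters.
# ASSUMPTION: frontmatter is delimited by "---" lines at the top of the file.
from pathlib import Path
import yaml  # pip install pyyaml

def p0_frontmatter_issues(skill_dir):
    skill_md = Path(skill_dir).expanduser() / "SKILL.md"
    if not skill_md.exists():
        return ["Missing SKILL.md"]
    parts = skill_md.read_text(encoding="utf-8").split("---", 2)
    if len(parts) < 3 or parts[0].strip():
        return ["Invalid frontmatter: no leading '---' block"]
    try:
        meta = yaml.safe_load(parts[1]) or {}
    except yaml.YAMLError as exc:
        return [f"Invalid frontmatter YAML: {exc}"]
    description = (meta.get("description") or "").strip()
    if len(description) < 50:
        return ["No description or description < 50 chars"]
    return []

print(p0_frontmatter_issues("~/.openclaw/skills/my-skill"))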

P1 Issues (should fix):

  • No usage examples
  • No error handling in scripts
  • Missing dependency documentation
  • Unclear trigger conditions

Method 3: Autonomous Benchmark (deep, ~30 min per skill)

Full multi-phase evaluation with multi-model support. Requires AI agent execution.

# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4

⚠️ Note: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.

📋 Planned: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.

Phase 1: Pre-flight Analysis

  1. Read SKILL.md — understand claims, dependencies, target use cases
  2. Classify skill type:
    • Capability uplift — teaches the agent something it can't do well
    • Encoded preference — sequences steps according to a specific process
  3. Dependency check:
    • Required CLI tools, API keys, env vars
    • Mark dependency-gated if credentials are missing (skip the eval, not the skill's fault; see the sketch after this list)
    • Check for phantom tooling (referenced scripts not in package)
  4. Marketing claims check: flag any metrics ("7.8x faster") without evidence
  5. Read knowledge base: knowledge/lessons.md, eval-patterns.md, failures.md
  6. Check prior evaluations: knowledge/skill-profiles/<slug>.md
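
A minimal sketch of the dependency-gating check in step 3: if required env vars or CLI tools are missing, mark the skill dependency-gated and skip the evaluation instead of failing it. The function name and the example requirements (GITHUB_TOKEN, gh) are hypothetical.

# Sketch: missing credentials/tools => dependency-gated (skip, don't penalize).
import os
import shutil

def missing_dependencies(required_env, required_tools):
    missing = [v for v in required_env if not os.environ.get(v)]
    missing += [t for t in required_tools if shutil.which(t) is None]
    return missing

# Hypothetical example requirements
missing = missing_dependencies(["GITHUB_TOKEN"], ["gh"])
if missing:
    print(f"dependency-gated, skipping eval: missing {missing}")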

Phase 2: Test Case Design

Design 2-3 test prompts across four categories:

  • Outcome — Did the task complete correctly?
  • Process — Did the agent follow the skill's intended steps?
  • Style — Does output follow skill-claimed conventions?
  • Efficiency — Reasonable time/token usage?

Assertion design (two layers):

Layer 1: Deterministic checks (fast, reproducible)

  • File existence, word counts, keyword presence
  • Format compliance (valid JSON, SQL, markdown)
  • Programmatic verification (run tests, check syntax)

Layer 2: Rubric-based quality assessment (LLM-as-judge)

  • Judge model (NOT execution model) grades output against specific rubric
  • Structured scoring, not pass/fail

Key assertion patterns:

  • Banned-word checks for style-constrained skills (highly discriminating; see the sketch after this list)
  • Methodology/structure assertions for technical domains (baseline already strong on correctness)
  • Output-floor assertions: required sections must appear even in error/fallback paths
  • Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
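
A sketch of what Layer-1 deterministic checks could look like in practice. The helper names are hypothetical, not part of the package; each returns (passed, evidence) so results can feed directly into Phase 4 grading.

# Illustrative Layer-1 deterministic assertions (hypothetical helpers).
import json
from pathlib import Path

def assert_file_exists(path):
    ok = Path(path).exists()
    return ok, f"{path} {'found' if ok else 'missing'}"

def assert_no_banned_words(text, banned):
    # Highly discriminating for style-constrained skills.
    hits = [w for w in banned if w.lower() in text.lower()]
    return not hits, f"banned words found: {hits}" if hits else "no banned words"

def assert_valid_json(path):
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True, "valid JSON"
    except (OSError, json.JSONDecodeError) as exc:
        return False, f"invalid JSON: {exc}"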

Phase 3: Execution

For each test case, spawn two subagents:

With-skill:

[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/

Without-skill (baseline):

[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
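
A sketch of composing the paired prompts from these templates programmatically; the builder function is hypothetical.

# Hypothetical prompt builder for the with-skill / without-skill pair.
def build_prompts(model, skill_path, task, workspace, iteration, test_name):
    base = f"{workspace}/iteration-{iteration}/{test_name}"
    with_skill = (
        f"[Model: {model}]\n"
        f"Read the skill at {skill_path}/SKILL.md and follow its instructions.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/with_skill/outputs/"
    )
    without_skill = (
        f"[Model: {model}]\n"
        "Complete this task using only built-in capabilities. Do NOT read SKILL.md.\n"
        f"Task: {task}\n"
        f"Save outputs to: {base}/without_skill/outputs/"
    )
    return with_skill, without_skill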

Multi-model mode: Run same skill across multiple models to check cross-model consistency.

Phase 4: Grading

Use programmatic grading for deterministic checks and LLM-based grading for qualitative ones:

python3 scripts/grade-assertions.py --workspace /path/to/results

Save to grading.json:

{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
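
A minimal sketch of assembling grading.json in the schema above; only the schema comes from the skill, the helper itself is illustrative.

# Build the grading.json summary from a list of expectation dicts
# ({"text": ..., "passed": bool, "evidence": ...}).
import json

def write_grading(expectations, out_path):
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    grading = {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": round(passed / total, 2) if total else 0.0,
        },
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(grading, f, indent=2, ensure_ascii=False)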

Phase 5: Benchmark Aggregation

{
  "with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "delta": {"pass_rate": "+0.XX", "time": "+Xx"},
  "model_used": "claude-sonnet-4",
  "verdict": "Recommended"
}

Efficiency flags: Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
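
A sketch of that flag as code. Only the >2x cost threshold comes from the text; the near-zero epsilon of 0.05 is an assumption.

# Flag "high-overhead framework inflation": quality delta ~ 0, cost delta > 2x.
# EPSILON is an assumed tolerance for "quality delta ≈ 0".
EPSILON = 0.05

def high_overhead(with_skill, without_skill):
    quality_delta = with_skill["pass_rate"] - without_skill["pass_rate"]
    cost_ratio = with_skill["avg_tokens"] / max(without_skill["avg_tokens"], 1)
    return abs(quality_delta) < EPSILON and cost_ratio > 2.0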

Phase 6: Skill Card Generation

python3 scripts/generate_skill_card.py \
  --workspace /path/to/results \
  --skill-name "My Skill" \
  --skill-slug my-skill \
  --eval-model claude-sonnet-4 \
  --output skill-cards/my-skill-v1.md

Skill Card Contents:

  • Metadata: name, source, eval date, model, engine version
  • Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
  • With-skill vs without-skill comparison table
  • Per-test-case breakdown with assertions, timing, grading
  • Strengths / Weaknesses
  • Recommendation: Recommended / Conditional / Marginal / Not Recommended
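
A sketch combining the score composition above with the verdict bands from the Scoring Summary below; the function names are illustrative.

# Overall score 0-10 = Quality (0-5) + Delta (0-3) + Efficiency (0-2),
# mapped to the verdict bands used on skill cards.
def overall_score(quality, delta, efficiency):
    assert 0 <= quality <= 5 and 0 <= delta <= 3 and 0 <= efficiency <= 2
    return quality + delta + efficiency

def verdict(score):
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"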

Phase 7: Leaderboard Update

python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html

Self-Evolution Improvement Engine

⚠️ Planned — Not Yet Implemented

The self-evolution improvement engine is designed but not yet implemented. The knowledge base (knowledge/improve/) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.

Planned Improvement Process (Phase 7-12)

  1. Read knowledge base:

    • knowledge/improve/lessons.md — proven strategies
    • knowledge/improve/patterns.md — category-specific playbooks
    • knowledge/improve/failures.md — what NOT to try
  2. Diagnose root cause:

    • Skill too vague? (Doesn't specify enough to change model behavior)
    • Skill redundant? (Teaches things model already knows)
    • Skill too heavy? (Adds overhead without proportional quality gain)
    • Missing structure? (No clear output format)
    • Phantom tooling? (References tools that don't exist)
    • Reference manual anti-pattern? (>200 lines of educational content)
    • Library-as-skill anti-pattern? (Contains code instead of instructions)
  3. Select improvement strategy from patterns:

    • Reference Manual Slim-Down: Delete 70%+ redundant content, add MUST/ALWAYS/NEVER mandates
    • Library-to-Instructions: Convert code to behavioral instructions
    • Phantom Tooling Replacement: Replace missing tool references with inline instructions
    • Overhead Routing: Add quick-mode vs full-framework routing
    • Assertion-Aligned Rewrite: Rewrite to pass specific failed assertions
  4. Rewrite SKILL.md with selected strategy:

    • Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
    • Add specific, enforceable conventions
    • Remove redundant content model already handles
    • Save as SKILL-improved.md
  5. Update assertions to match improved skill

  6. Re-evaluate with improved version

Planned Re-Eval (Phase 10-11)

Run same eval against SKILL-improved.md:

  • Score improved by >= 1.5 points → Success
  • Less than 50% of previously-failed assertions fixed → Document limitation, move on
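
A sketch of these planned criteria as a decision helper; illustrative only, since the engine is not implemented, and the fallback label is an assumption.

# Planned re-eval outcome (not yet implemented): a >= 1.5-point score gain
# counts as success; fixing < 50% of previously-failed assertions means
# documenting the limitation and moving on.
def reeval_outcome(old_score, new_score, prev_failed, now_fixed):
    if new_score - old_score >= 1.5:
        return "success"
    if prev_failed and now_fixed / prev_failed < 0.5:
        return "document limitation, move on"
    return "partial improvement"  # fallback label is an assumption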

Planned Improvement Knowledge Update (Phase 12)

After each improvement batch:

  • Update knowledge/improve/lessons.md with what worked
  • Update knowledge/improve/patterns.md with reusable patterns
  • Update knowledge/improve/failures.md with failed attempts
  • Fold proven patterns back into this SKILL.md

Scoring Summary

| Method | Speed | Coverage | Best For |
|--------|-------|----------|----------|
| Static Analysis | ~30 s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10 min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30 min | Full + self-evolution | Production evaluation, skill improvement |

| Overall Score | Verdict |
|---------------|---------|
| 7–10 | Recommended |
| 5–6.9 | Conditional |
| 3–4.9 | Marginal |
| 0–2.9 | Not Recommended |

Anti-Patterns to Detect

  • Reference manual anti-pattern: SKILL.md >200 lines of educational content (not behavioral instructions)
  • Library-as-skill anti-pattern: SKILL.md contains Python/JS class definitions instead of instructions
  • Phantom tooling: SKILL.md references scripts/binaries not in the package
  • Phantom-tooling framework skills: evaluate template/output structure separately from real-data execution
  • Unsubstantiated claims: Skill claims specific metrics without evidence — do not use self-reported numbers
  • High-overhead framework inflation: Quality delta ≈ 0 but cost delta >2x — penalize efficiency
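
A heuristic sketch for two of the anti-patterns above: the reference-manual line count and phantom tooling. The regex for how SKILL.md cites scripts is an assumption.

# Heuristic anti-pattern detection (illustrative).
import re
from pathlib import Path

def detect_anti_patterns(skill_dir):
    findings = []
    root = Path(skill_dir).expanduser()
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    if len(text.splitlines()) > 200:
        findings.append("Possible reference-manual anti-pattern: SKILL.md > 200 lines")
    # ASSUMPTION: scripts are cited as "scripts/<name>.py" in SKILL.md.
    for ref in sorted(set(re.findall(r"scripts/[\w\-]+\.py", text))):
        if not (root / ref).exists():
            findings.append(f"Phantom tooling: {ref} referenced but not in package")
    return findings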

Deeper Security Scanning

For thorough security audits, complement with SkillLens:

npx skilllens scan /path/to/skill

Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.


Dependencies

  • Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
  • PyYAML (pip install pyyaml) — frontmatter parsing
  • Node.js (for SkillLens security scanning)
