Skill Quality Evaluator

Other

Skill Quality Evaluator - Score any skill on 6 dimensions. Catch 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings.

Install

openclaw skills install xiaobai-skill-quality-eval

Skill Quality Evaluator 📊

Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently.

Why This Matters

Tessl Research (April 2026) found:

  • 20% accuracy gain when using a good skill vs no skill
  • 3X cost savings when small model + right skill matches large model
  • 40% activation rate — agents often fail to use available skills
  • 30% of evaluation tasks have leakage — skills that seem great but aren't

This skill helps you evaluate and improve your skills systematically.

6-Dimension Evaluation

1. Activation Reliability (0-100)

Can the agent find and activate this skill when needed?

Checklist:

  • Trigger words are specific and unambiguous
  • Description matches actual functionality
  • No conflicting skills with similar triggers
  • Skill is discovered when user asks relevant questions

Common Issues:

  • Vague description → agent doesn't know when to use it
  • Missing trigger words → skill never activates
  • Too broad → activates when it shouldn't

Score Guide:

  • 90+: Agent activates correctly 95%+ of the time
  • 70-89: Activates in most relevant contexts
  • 50-69: Sometimes activates, sometimes misses
  • <50: Agent rarely finds/uses this skill

2. Task Coverage (0-100)

Does the skill handle the tasks it claims to cover?

Checklist:

  • Each claimed capability has a usage example
  • Edge cases are documented
  • Known limitations are stated
  • Failure modes are explained

Common Issues:

  • Claims broad coverage but only handles happy path
  • No examples for secondary features
  • Undocumented prerequisites

Score Guide:

  • 90+: All claimed tasks have working examples
  • 70-89: Main tasks covered, some gaps in secondary features
  • 50-69: Core functionality works but incomplete
  • <50: Major claims unsupported

3. Instruction Clarity (0-100)

Can the agent follow the instructions without confusion?

Checklist:

  • Instructions are step-by-step, not vague guidelines
  • Decision points have clear criteria
  • Output format is specified
  • Anti-patterns are listed

Common Issues:

  • "Do X when appropriate" → when is appropriate?
  • Missing priority/precedence rules
  • Contradictory instructions

Score Guide:

  • 90+: Agent follows instructions correctly 95%+ of the time
  • 70-89: Mostly clear, occasional confusion
  • 50-69: Agent frequently asks for clarification
  • <50: Instructions are ambiguous or contradictory

4. Leakage Resistance (0-100)

Does the evaluation actually test the skill, or does it leak answers?

Checklist:

  • Examples don't contain verbatim solutions
  • Test tasks require genuine skill application
  • No shortcut paths that bypass skill content
  • Evaluation criteria measure real capability

Common Issues (from Tessl Research):

  • Example tasks are too similar to skill content
  • Skill contains answers verbatim
  • Test can be solved by pattern matching without understanding

Score Guide:

  • 90+: No leakage, genuine skill testing
  • 70-89: Minor leakage that doesn't significantly inflate scores
  • 50-69: Moderate leakage, scores may be 10-20% inflated
  • <50: Major leakage, evaluation results unreliable

5. Model Compatibility (0-100)

Does the skill work across different model sizes?

Checklist:

  • Tested with at least 2 model sizes
  • Works with smaller/cheaper models
  • Performance difference between models documented
  • Minimum model requirements stated

Tessl Finding: Small model + right skill ≈ Large model at 3X lower cost.

Score Guide:

  • 90+: Works well with small models (haiku-level)
  • 70-89: Works with medium models (sonnet-level)
  • 50-69: Requires large models (opus-level)
  • <50: Even large models struggle

6. Real-World Value (0-100)

Does using this skill actually improve outcomes vs no skill?

Checklist:

  • Measurable improvement over baseline
  • Users would notice the difference
  • Saves time or reduces errors
  • No negative side effects

Score Guide:

  • 90+: Clear, significant improvement (20%+ accuracy gain)
  • 70-89: Noticeable improvement
  • 50-69: Marginal improvement
  • <50: No improvement or negative impact

Evaluation Report Template

# Skill Evaluation Report

**Skill**: [name]
**Version**: [version]
**Date**: YYYY-MM-DD
**Evaluator**: [agent/session]

## Overall Score: XX/100

| Dimension | Score | Status |
|-----------|-------|--------|
| Activation Reliability | XX | 🟢/🟡/🔴 |
| Task Coverage | XX | 🟢/🟡/🔴 |
| Instruction Clarity | XX | 🟢/🟡/🔴 |
| Leakage Resistance | XX | 🟢/🟡/🔴 |
| Model Compatibility | XX | 🟢/🟡/🔴 |
| Real-World Value | XX | 🟢/🟡/🔴 |

🟢 80+ | 🟡 50-79 | 🔴 <50

## Critical Issues
1. [Issue] → [Fix]

## Improvement Recommendations
1. [Recommendation] → [Expected impact]

## Quick Wins (easy fixes, big impact)
1. [Fix] → +X points on [dimension]

Usage

Evaluate a skill

Read the skill's SKILL.md and evaluate on all 6 dimensions.
Generate the evaluation report.
Save to memory/evaluations/<skill-name>-eval.md

Improve a skill based on evaluation

1. Read evaluation report
2. Focus on lowest-scoring dimension
3. Apply quick wins first
4. Re-evaluate
5. Repeat until all dimensions ≥ 70

Batch evaluate all skills

For each skill in skills/ directory:
  1. Read SKILL.md
  2. Evaluate on 6 dimensions
  3. Generate report
  4. Identify top 3 improvements
Save summary to memory/evaluations/batch-report.md

Anti-Patterns to Detect

PatternIssueFix
"Do X when appropriate"Vague triggerDefine specific conditions
No examplesAgent can't learnAdd 3+ concrete examples
Only happy pathFragile in productionAdd error handling examples
Verbatim solutionsLeakage riskUse different examples for eval
No model requirementsUnknown compatibilityTest with 2+ model sizes

License

MIT