Skill Quality Evaluator

v1.0.0

Skill Quality Evaluator - Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently. Based on Tessl Research (2026) findings.

by Erwin@aptratcn

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt below, then paste it into OpenClaw to install aptratcn/xiaobai-skill-quality-eval.

Prompt preview (Install & Setup):
Install the skill "Skill Quality Evaluator" (aptratcn/xiaobai-skill-quality-eval) from ClawHub.
Skill page: https://clawhub.ai/aptratcn/xiaobai-skill-quality-eval
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install xiaobai-skill-quality-eval

ClawHub CLI

npx clawhub@latest install xiaobai-skill-quality-eval
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill's name/description (skill-quality-eval) matches the runtime instructions: read a skill's SKILL.md, evaluate on six dimensions, and generate/save a report. There are no unrelated requirements (no binaries, env vars, or installs) that would be disproportionate to the stated purpose.
Instruction Scope
SKILL.md instructs the agent to read SKILL.md files, evaluate them, and save reports (e.g., memory/evaluations/*.md). This is appropriate for an evaluator, but it assumes the agent has read access to the skills/ directory and write access to its memory or filesystem. If those paths include sensitive files or skills, the operator should verify permissions and review produced reports before persistent storage.
Install Mechanism
No install spec and no code files — instruction-only skill. This minimizes risk (nothing is written to disk or executed beyond the agent following textual instructions).
Credentials
The skill requests no environment variables, credentials, or config paths. All declared capabilities are achievable without additional secrets; therefore requested environment access is proportionate.
Persistence & Privilege
The `always` flag is false and the skill does not request elevated or system-wide persistence. It does instruct saving evaluations to memory/evaluations/, which is a reasonable local persistence need for an evaluator. Operators should confirm the agent's memory/storage policy (what is stored, who can read it) before allowing persistent saves.
Assessment
This is an instruction-only evaluator and appears coherent with its purpose. Before installing: confirm the agent has only intended read access to the skills/ directory and that writing evaluation files to memory/evaluations/ is acceptable for your privacy/policy needs. Run it first on a small, non-sensitive set of skills to verify outputs and ensure reports don't accidentally record secrets from evaluated SKILL.md files. If you rely on automated batch runs, consider restricting which directories it can read and where it can write reports.


Tags: evaluation, latest, quality, skill
89 downloads · 0 stars · 1 version · Updated 6d ago
v1.0.0 · MIT-0

Skill Quality Evaluator 📊

Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently.

Why This Matters

Tessl Research (April 2026) found:

  • 20% accuracy gain when using a good skill vs no skill
  • 3X cost savings when a small model with the right skill matches a large model
  • 40% activation rate — agents often fail to use available skills
  • 30% of evaluation tasks have leakage — skills that seem great but aren't

This skill helps you evaluate and improve your skills systematically.

6-Dimension Evaluation

1. Activation Reliability (0-100)

Can the agent find and activate this skill when needed?

Checklist:

  • Trigger words are specific and unambiguous
  • Description matches actual functionality
  • No conflicting skills with similar triggers
  • Skill is discovered when user asks relevant questions

Common Issues:

  • Vague description → agent doesn't know when to use it
  • Missing trigger words → skill never activates
  • Too broad → activates when it shouldn't

Score Guide:

  • 90+: Agent activates correctly 95%+ of the time
  • 70-89: Activates in most relevant contexts
  • 50-69: Sometimes activates, sometimes misses
  • <50: Agent rarely finds/uses this skill
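
Part of this checklist can be approximated with a static lint over the skill's metadata. The sketch below is a heuristic aid, not a substitute for testing activation with a live agent; the `VAGUE_WORDS` list and the length threshold are illustrative assumptions, not part of any official rubric.

```python
import re

# Illustrative word list -- an assumption, not an official rubric.
VAGUE_WORDS = {"helpful", "various", "general", "stuff", "things", "appropriate"}

def activation_warnings(description: str, triggers: list[str]) -> list[str]:
    """Flag metadata patterns that tend to hurt activation reliability."""
    warnings = []
    if not triggers:
        warnings.append("no trigger words declared; skill may never activate")
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    vague = sorted(tokens & VAGUE_WORDS)
    if vague:
        warnings.append(f"vague wording in description: {vague}")
    if len(description.split()) < 8:
        warnings.append("description is very short; agent has little to match on")
    return warnings
```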

2. Task Coverage (0-100)

Does the skill handle the tasks it claims to cover?

Checklist:

  • Each claimed capability has a usage example
  • Edge cases are documented
  • Known limitations are stated
  • Failure modes are explained

Common Issues:

  • Claims broad coverage but only handles happy path
  • No examples for secondary features
  • Undocumented prerequisites

Score Guide:

  • 90+: All claimed tasks have working examples
  • 70-89: Main tasks covered, some gaps in secondary features
  • 50-69: Core functionality works but incomplete
  • <50: Major claims unsupported
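
One coarse, automatable proxy for this checklist is to count how many claimed capabilities are echoed by at least one usage example. A minimal sketch follows; keyword overlap is a rough stand-in for human judgment, and the function name and word-length cutoff are assumptions.

```python
def coverage_ratio(claims: list[str], examples: list[str]) -> float:
    """Fraction of claimed capabilities that share a keyword with at least
    one usage example. Keyword overlap roughly approximates a human check."""
    if not claims:
        return 0.0
    covered = 0
    for claim in claims:
        keywords = {w.lower().strip(".,") for w in claim.split() if len(w) > 4}
        if any(keywords & {w.lower().strip(".,") for w in ex.split()}
               for ex in examples):
            covered += 1
    return covered / len(claims)
```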

3. Instruction Clarity (0-100)

Can the agent follow the instructions without confusion?

Checklist:

  • Instructions are step-by-step, not vague guidelines
  • Decision points have clear criteria
  • Output format is specified
  • Anti-patterns are listed

Common Issues:

  • "Do X when appropriate" → when is appropriate?
  • Missing priority/precedence rules
  • Contradictory instructions

Score Guide:

  • 90+: Agent follows instructions correctly 95%+ of the time
  • 70-89: Mostly clear, occasional confusion
  • 50-69: Agent frequently asks for clarification
  • <50: Instructions are ambiguous or contradictory

4. Leakage Resistance (0-100)

Does the evaluation actually test the skill, or does it leak answers?

Checklist:

  • Examples don't contain verbatim solutions
  • Test tasks require genuine skill application
  • No shortcut paths that bypass skill content
  • Evaluation criteria measure real capability

Common Issues (from Tessl Research):

  • Example tasks are too similar to skill content
  • Skill contains answers verbatim
  • Test can be solved by pattern matching without understanding

Score Guide:

  • 90+: No leakage, genuine skill testing
  • 70-89: Minor leakage that doesn't significantly inflate scores
  • 50-69: Moderate leakage, scores may be 10-20% inflated
  • <50: Major leakage, evaluation results unreliable
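
Leakage is partly detectable by measuring verbatim overlap between evaluation tasks and the skill body. The sketch below scores the fraction of a task's word 5-grams that appear verbatim in the skill text; the 5-gram size and any cutoff you pick are assumptions to calibrate yourself, not values from the Tessl study.

```python
def leakage_score(eval_task: str, skill_body: str) -> float:
    """Fraction of the task's word 5-grams found verbatim in the skill body.
    Values near 1.0 suggest the task (and likely its answer) is leaked."""
    words = eval_task.lower().split()
    grams = [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]
    if not grams:
        return 0.0
    body = " ".join(skill_body.lower().split())
    return sum(g in body for g in grams) / len(grams)
```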

5. Model Compatibility (0-100)

Does the skill work across different model sizes?

Checklist:

  • Tested with at least 2 model sizes
  • Works with smaller/cheaper models
  • Performance difference between models documented
  • Minimum model requirements stated

Tessl Finding: Small model + right skill ≈ Large model at 3X lower cost.

Score Guide:

  • 90+: Works well with small models (haiku-level)
  • 70-89: Works with medium models (sonnet-level)
  • 50-69: Requires large models (opus-level)
  • <50: Even large models struggle
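
To ground this dimension in numbers, run the same skill-backed task set on two model sizes and compare accuracies from your own eval harness. The 5-point tolerance below is an assumption; only the ~3X cost figure comes from the Tessl finding above.

```python
def compatibility_verdict(small_acc: float, large_acc: float,
                          tolerance: float = 0.05) -> str:
    """Compare small- vs large-model accuracy (fractions in [0, 1]) on the
    same skill-backed task set."""
    gap = large_acc - small_acc
    if gap <= tolerance:
        return "small model keeps pace; document it as the minimum model size"
    return f"large model leads by {gap:.0%}; state a larger minimum model"
```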

6. Real-World Value (0-100)

Does using this skill actually improve outcomes vs no skill?

Checklist:

  • Measurable improvement over baseline
  • Users would notice the difference
  • Saves time or reduces errors
  • No negative side effects

Score Guide:

  • 90+: Clear, significant improvement (20%+ accuracy gain)
  • 70-89: Noticeable improvement
  • 50-69: Marginal improvement
  • <50: No improvement or negative impact
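
The cleanest measurement is a paired run: the same tasks with the skill enabled and disabled. A minimal sketch, assuming you record per-task pass/fail booleans; the 20% reference point is the Tessl figure cited above, not a universal bar.

```python
def value_delta(with_skill: list[bool], without_skill: list[bool]) -> float:
    """Accuracy gain from enabling the skill on the same task set."""
    acc = lambda results: sum(results) / len(results)
    return acc(with_skill) - acc(without_skill)

# Example: a gain of 0.20+ matches the "clear, significant improvement" band.
# value_delta([True]*8 + [False]*2, [True]*6 + [False]*4)  ->  ~0.20
```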

Evaluation Report Template

# Skill Evaluation Report

**Skill**: [name]
**Version**: [version]
**Date**: YYYY-MM-DD
**Evaluator**: [agent/session]

## Overall Score: XX/100

| Dimension | Score | Status |
|-----------|-------|--------|
| Activation Reliability | XX | 🟢/🟡/🔴 |
| Task Coverage | XX | 🟢/🟡/🔴 |
| Instruction Clarity | XX | 🟢/🟡/🔴 |
| Leakage Resistance | XX | 🟢/🟡/🔴 |
| Model Compatibility | XX | 🟢/🟡/🔴 |
| Real-World Value | XX | 🟢/🟡/🔴 |

🟢 80+ | 🟡 50-79 | 🔴 <50

## Critical Issues
1. [Issue] → [Fix]

## Improvement Recommendations
1. [Recommendation] → [Expected impact]

## Quick Wins (easy fixes, big impact)
1. [Fix] → +X points on [dimension]
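
A small renderer can keep reports consistent with this template. The sketch below assumes the overall score is the unweighted mean of the six dimensions; the template itself does not specify a weighting.

```python
DIMENSIONS = ["Activation Reliability", "Task Coverage", "Instruction Clarity",
              "Leakage Resistance", "Model Compatibility", "Real-World Value"]

def status(score: int) -> str:
    """Map a score to the template's traffic-light legend (80+ / 50-79 / <50)."""
    return "🟢" if score >= 80 else "🟡" if score >= 50 else "🔴"

def render_report(skill: str, version: str, scores: dict[str, int]) -> str:
    """Fill the evaluation-report template from per-dimension scores."""
    # Assumption: overall = unweighted mean of the six dimension scores.
    overall = round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS))
    rows = "\n".join(f"| {d} | {scores[d]} | {status(scores[d])} |"
                     for d in DIMENSIONS)
    return (f"# Skill Evaluation Report\n\n"
            f"**Skill**: {skill}\n**Version**: {version}\n\n"
            f"## Overall Score: {overall}/100\n\n"
            f"| Dimension | Score | Status |\n|---|---|---|\n{rows}\n")
```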

Usage

Evaluate a skill

Read the skill's SKILL.md and evaluate on all 6 dimensions.
Generate the evaluation report.
Save to memory/evaluations/<skill-name>-eval.md

Improve a skill based on evaluation

1. Read evaluation report
2. Focus on lowest-scoring dimension
3. Apply quick wins first
4. Re-evaluate
5. Repeat until all dimensions ≥ 70

Batch evaluate all skills

For each skill in skills/ directory:
  1. Read SKILL.md
  2. Evaluate on 6 dimensions
  3. Generate report
  4. Identify top 3 improvements
Save summary to memory/evaluations/batch-report.md
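
A minimal batch driver might look like the sketch below. `evaluate()` stands in for your scoring routine (for example, one built from the heuristics sketched earlier); it is a hypothetical helper, not something this skill ships.

```python
from pathlib import Path

def batch_evaluate(skills_dir: str = "skills",
                   out_dir: str = "memory/evaluations") -> None:
    """Read each skills/*/SKILL.md and write one evaluation report per skill."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        name = skill_md.parent.name
        body = skill_md.read_text(encoding="utf-8")
        report = evaluate(name, body)  # hypothetical scoring routine
        (out / f"{name}-eval.md").write_text(report, encoding="utf-8")
```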

Anti-Patterns to Detect

| Pattern | Issue | Fix |
|---------|-------|-----|
| "Do X when appropriate" | Vague trigger | Define specific conditions |
| No examples | Agent can't learn | Add 3+ concrete examples |
| Only happy path | Fragile in production | Add error handling examples |
| Verbatim solutions | Leakage risk | Use different examples for eval |
| No model requirements | Unknown compatibility | Test with 2+ model sizes |
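
Most rows of this table can be caught by a crude text lint. The regexes below are illustrative approximations (assumptions, not an official detection spec); the leakage row is better served by the n-gram check under Leakage Resistance.

```python
import re

def detect_anti_patterns(skill_body: str) -> list[str]:
    """Rough lint mirroring the anti-pattern table; tune before trusting it."""
    findings = []
    if re.search(r"\bwhen (appropriate|needed|necessary)\b", skill_body, re.I):
        findings.append("Vague trigger -> define specific conditions")
    # Counting mentions of the word "example" is a crude proxy for example count.
    if len(re.findall(r"\bexamples?\b", skill_body, re.I)) < 3:
        findings.append("Fewer than 3 examples -> add concrete examples")
    if not re.search(r"\b(error|failure|edge case)s?\b", skill_body, re.I):
        findings.append("Only happy path -> add error handling examples")
    if not re.search(r"\bmodel\b", skill_body, re.I):
        findings.append("No model requirements -> test with 2+ model sizes")
    return findings
```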

License

MIT
