Skill Quality Evaluator

v1.0.0

Skill Quality Evaluator - Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently. Based on Tessl Research (2026) findings.

by Erwin@aptratcn

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt below, then paste it into OpenClaw to install aptratcn/xiaobai-skill-quality-eval.

Prompt preview (Install & Setup):
Install the skill "Skill Quality Evaluator" (aptratcn/xiaobai-skill-quality-eval) from ClawHub.
Skill page: https://clawhub.ai/aptratcn/xiaobai-skill-quality-eval
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install xiaobai-skill-quality-eval

ClawHub CLI

npx clawhub@latest install xiaobai-skill-quality-eval
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill's name/description (skill-quality-eval) matches the runtime instructions: read a skill's SKILL.md, evaluate on six dimensions, and generate/save a report. There are no unrelated requirements (no binaries, env vars, or installs) that would be disproportionate to the stated purpose.
Instruction Scope
SKILL.md instructs the agent to read SKILL.md files, evaluate them, and save reports (e.g., memory/evaluations/*.md). This is appropriate for an evaluator, but it assumes the agent has read access to the skills/ directory and write access to its memory or filesystem. If those paths include sensitive files or skills, the operator should verify permissions and review produced reports before persistent storage.
Install Mechanism
No install spec and no code files — instruction-only skill. This minimizes risk (nothing is written to disk or executed beyond the agent following textual instructions).
Credentials
The skill requests no environment variables, credentials, or config paths. All declared capabilities are achievable without additional secrets; therefore requested environment access is proportionate.
Persistence & Privilege
The `always` flag is false and the skill does not request elevated or system-wide persistence. It does instruct saving evaluations to memory/evaluations/, which is a reasonable local persistence need for an evaluator. Operators should confirm the agent's memory/storage policy (what is stored, who can read it) before allowing persistent saves.
Assessment
This is an instruction-only evaluator and appears coherent with its purpose. Before installing: confirm the agent has only intended read access to the skills/ directory and that writing evaluation files to memory/evaluations/ is acceptable for your privacy/policy needs. Run it first on a small, non-sensitive set of skills to verify outputs and ensure reports don't accidentally record secrets from evaluated SKILL.md files. If you rely on automated batch runs, consider restricting which directories it can read and where it can write reports.


Tags: evaluation, latest, quality, skill
89 downloads · 0 stars · 1 version · Updated 6d ago
v1.0.0 · MIT-0

Skill Quality Evaluator 📊

Score any skill on 6 dimensions. Catch the 30% of skills that look good but fail silently.

Why This Matters

Tessl Research (April 2026) found:

  • 20% accuracy gain when using a good skill vs no skill
  • 3X cost savings when a small model with the right skill matches a large model
  • 40% activation rate — agents often fail to use available skills
  • 30% of evaluation tasks have leakage — skills that seem great but aren't

This skill helps you evaluate and improve your skills systematically.

6-Dimension Evaluation

1. Activation Reliability (0-100)

Can the agent find and activate this skill when needed?

Checklist:

  • Trigger words are specific and unambiguous
  • Description matches actual functionality
  • No conflicting skills with similar triggers
  • Skill is discovered when user asks relevant questions

Common Issues:

  • Vague description → agent doesn't know when to use it
  • Missing trigger words → skill never activates
  • Too broad → activates when it shouldn't

Score Guide:

  • 90+: Agent activates correctly 95%+ of the time
  • 70-89: Activates in most relevant contexts
  • 50-69: Sometimes activates, sometimes misses
  • <50: Agent rarely finds/uses this skill
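
Part of this checklist can be approximated with a static lint over the skill's metadata. The sketch below is a heuristic aid, not a substitute for testing activation with a live agent; the `VAGUE_WORDS` list and the length threshold are illustrative assumptions, not part of any official rubric.

```python
import re

# Illustrative word list -- an assumption, not an official rubric.
VAGUE_WORDS = {"helpful", "various", "general", "stuff", "things", "appropriate"}

def activation_warnings(description: str, triggers: list[str]) -> list[str]:
    """Flag metadata patterns that tend to hurt activation reliability."""
    warnings = []
    if not triggers:
        warnings.append("no trigger words declared; skill may never activate")
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    vague = sorted(tokens & VAGUE_WORDS)
    if vague:
        warnings.append(f"vague wording in description: {vague}")
    if len(description.split()) < 8:
        warnings.append("description is very short; agent has little to match on")
    return warnings
```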

2. Task Coverage (0-100)

Does the skill handle the tasks it claims to cover?

Checklist:

  • Each claimed capability has a usage example
  • Edge cases are documented
  • Known limitations are stated
  • Failure modes are explained

Common Issues:

  • Claims broad coverage but only handles happy path
  • No examples for secondary features
  • Undocumented prerequisites

Score Guide:

  • 90+: All claimed tasks have working examples
  • 70-89: Main tasks covered, some gaps in secondary features
  • 50-69: Core functionality works but incomplete
  • <50: Major claims unsupported
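
One coarse, automatable proxy for this checklist is to count how many claimed capabilities are echoed by at least one usage example. A minimal sketch follows; keyword overlap is a rough stand-in for human judgment, and the function name and word-length cutoff are assumptions.

```python
def coverage_ratio(claims: list[str], examples: list[str]) -> float:
    """Fraction of claimed capabilities that share a keyword with at least
    one usage example. Keyword overlap roughly approximates a human check."""
    if not claims:
        return 0.0
    covered = 0
    for claim in claims:
        keywords = {w.lower().strip(".,") for w in claim.split() if len(w) > 4}
        if any(keywords & {w.lower().strip(".,") for w in ex.split()}
               for ex in examples):
            covered += 1
    return covered / len(claims)
```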

3. Instruction Clarity (0-100)

Can the agent follow the instructions without confusion?

Checklist:

  • Instructions are step-by-step, not vague guidelines
  • Decision points have clear criteria
  • Output format is specified
  • Anti-patterns are listed

Common Issues:

  • "Do X when appropriate" → when is appropriate?
  • Missing priority/precedence rules
  • Contradictory instructions

Score Guide:

  • 90+: Agent follows instructions correctly 95%+ of the time
  • 70-89: Mostly clear, occasional confusion
  • 50-69: Agent frequently asks for clarification
  • <50: Instructions are ambiguous or contradictory

4. Leakage Resistance (0-100)

Does the evaluation actually test the skill, or does it leak answers?

Checklist:

  • Examples don't contain verbatim solutions
  • Test tasks require genuine skill application
  • No shortcut paths that bypass skill content
  • Evaluation criteria measure real capability

Common Issues (from Tessl Research):

  • Example tasks are too similar to skill content
  • Skill contains answers verbatim
  • Test can be solved by pattern matching without understanding

Score Guide:

  • 90+: No leakage, genuine skill testing
  • 70-89: Minor leakage that doesn't significantly inflate scores
  • 50-69: Moderate leakage, scores may be 10-20% inflated
  • <50: Major leakage, evaluation results unreliable
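
Leakage is partly detectable by measuring verbatim overlap between evaluation tasks and the skill body. The sketch below scores the fraction of a task's word 5-grams that appear verbatim in the skill text; the 5-gram size and any cutoff you pick are assumptions to calibrate yourself, not values from the Tessl study.

```python
def leakage_score(eval_task: str, skill_body: str) -> float:
    """Fraction of the task's word 5-grams found verbatim in the skill body.
    Values near 1.0 suggest the task (and likely its answer) is leaked."""
    words = eval_task.lower().split()
    grams = [" ".join(words[i:i + 5]) for i in range(len(words) - 4)]
    if not grams:
        return 0.0
    body = " ".join(skill_body.lower().split())
    return sum(g in body for g in grams) / len(grams)
```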

5. Model Compatibility (0-100)

Does the skill work across different model sizes?

Checklist:

  • Tested with at least 2 model sizes
  • Works with smaller/cheaper models
  • Performance difference between models documented
  • Minimum model requirements stated

Tessl Finding: Small model + right skill ≈ Large model at 3X lower cost.

Score Guide:

  • 90+: Works well with small models (haiku-level)
  • 70-89: Works with medium models (sonnet-level)
  • 50-69: Requires large models (opus-level)
  • <50: Even large models struggle
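
To ground this dimension in numbers, run the same skill-backed task set on two model sizes and compare accuracies from your own eval harness. The 5-point tolerance below is an assumption; only the ~3X cost figure comes from the Tessl finding above.

```python
def compatibility_verdict(small_acc: float, large_acc: float,
                          tolerance: float = 0.05) -> str:
    """Compare small- vs large-model accuracy (fractions in [0, 1]) on the
    same skill-backed task set."""
    gap = large_acc - small_acc
    if gap <= tolerance:
        return "small model keeps pace; document it as the minimum model size"
    return f"large model leads by {gap:.0%}; state a larger minimum model"
```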

6. Real-World Value (0-100)

Does using this skill actually improve outcomes vs no skill?

Checklist:

  • Measurable improvement over baseline
  • Users would notice the difference
  • Saves time or reduces errors
  • No negative side effects

Score Guide:

  • 90+: Clear, significant improvement (20%+ accuracy gain)
  • 70-89: Noticeable improvement
  • 50-69: Marginal improvement
  • <50: No improvement or negative impact
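
The cleanest measurement is a paired run: the same tasks with the skill enabled and disabled. A minimal sketch, assuming you record per-task pass/fail booleans; the 20% reference point is the Tessl figure cited above, not a universal bar.

```python
def value_delta(with_skill: list[bool], without_skill: list[bool]) -> float:
    """Accuracy gain from enabling the skill on the same task set."""
    acc = lambda results: sum(results) / len(results)
    return acc(with_skill) - acc(without_skill)

# Example: a gain of 0.20+ matches the "clear, significant improvement" band.
# value_delta([True]*8 + [False]*2, [True]*6 + [False]*4)  ->  ~0.20
```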

Evaluation Report Template

# Skill Evaluation Report

**Skill**: [name]
**Version**: [version]
**Date**: YYYY-MM-DD
**Evaluator**: [agent/session]

## Overall Score: XX/100

| Dimension | Score | Status |
|-----------|-------|--------|
| Activation Reliability | XX | 🟢/🟡/🔴 |
| Task Coverage | XX | 🟢/🟡/🔴 |
| Instruction Clarity | XX | 🟢/🟡/🔴 |
| Leakage Resistance | XX | 🟢/🟡/🔴 |
| Model Compatibility | XX | 🟢/🟡/🔴 |
| Real-World Value | XX | 🟢/🟡/🔴 |

🟢 80+ | 🟡 50-79 | 🔴 <50

## Critical Issues
1. [Issue] → [Fix]

## Improvement Recommendations
1. [Recommendation] → [Expected impact]

## Quick Wins (easy fixes, big impact)
1. [Fix] → +X points on [dimension]
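
A small renderer can keep reports consistent with this template. The sketch below assumes the overall score is the unweighted mean of the six dimensions; the template itself does not specify a weighting.

```python
DIMENSIONS = ["Activation Reliability", "Task Coverage", "Instruction Clarity",
              "Leakage Resistance", "Model Compatibility", "Real-World Value"]

def status(score: int) -> str:
    """Map a score to the template's traffic-light legend (80+ / 50-79 / <50)."""
    return "🟢" if score >= 80 else "🟡" if score >= 50 else "🔴"

def render_report(skill: str, version: str, scores: dict[str, int]) -> str:
    """Fill the evaluation-report template from per-dimension scores."""
    # Assumption: overall = unweighted mean of the six dimension scores.
    overall = round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS))
    rows = "\n".join(f"| {d} | {scores[d]} | {status(scores[d])} |"
                     for d in DIMENSIONS)
    return (f"# Skill Evaluation Report\n\n"
            f"**Skill**: {skill}\n**Version**: {version}\n\n"
            f"## Overall Score: {overall}/100\n\n"
            f"| Dimension | Score | Status |\n|---|---|---|\n{rows}\n")
```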

Usage

Evaluate a skill

Read the skill's SKILL.md and evaluate on all 6 dimensions.
Generate the evaluation report.
Save to memory/evaluations/<skill-name>-eval.md

Improve a skill based on evaluation

1. Read evaluation report
2. Focus on lowest-scoring dimension
3. Apply quick wins first
4. Re-evaluate
5. Repeat until all dimensions ≥ 70

Batch evaluate all skills

For each skill in skills/ directory:
  1. Read SKILL.md
  2. Evaluate on 6 dimensions
  3. Generate report
  4. Identify top 3 improvements
Save summary to memory/evaluations/batch-report.md
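
A minimal batch driver might look like the sketch below. `evaluate()` stands in for your scoring routine (for example, one built from the heuristics sketched earlier); it is a hypothetical helper, not something this skill ships.

```python
from pathlib import Path

def batch_evaluate(skills_dir: str = "skills",
                   out_dir: str = "memory/evaluations") -> None:
    """Read each skills/*/SKILL.md and write one evaluation report per skill."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        name = skill_md.parent.name
        body = skill_md.read_text(encoding="utf-8")
        report = evaluate(name, body)  # hypothetical scoring routine
        (out / f"{name}-eval.md").write_text(report, encoding="utf-8")
```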

Anti-Patterns to Detect

| Pattern | Issue | Fix |
|---------|-------|-----|
| "Do X when appropriate" | Vague trigger | Define specific conditions |
| No examples | Agent can't learn | Add 3+ concrete examples |
| Only happy path | Fragile in production | Add error handling examples |
| Verbatim solutions | Leakage risk | Use different examples for eval |
| No model requirements | Unknown compatibility | Test with 2+ model sizes |
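
Most rows of this table can be caught by a crude text lint. The regexes below are illustrative approximations (assumptions, not an official detection spec); the leakage row is better served by the n-gram check under Leakage Resistance.

```python
import re

def detect_anti_patterns(skill_body: str) -> list[str]:
    """Rough lint mirroring the anti-pattern table; tune before trusting it."""
    findings = []
    if re.search(r"\bwhen (appropriate|needed|necessary)\b", skill_body, re.I):
        findings.append("Vague trigger -> define specific conditions")
    # Counting mentions of the word "example" is a crude proxy for example count.
    if len(re.findall(r"\bexamples?\b", skill_body, re.I)) < 3:
        findings.append("Fewer than 3 examples -> add concrete examples")
    if not re.search(r"\b(error|failure|edge case)s?\b", skill_body, re.I):
        findings.append("Only happy path -> add error handling examples")
    if not re.search(r"\bmodel\b", skill_body, re.I):
        findings.append("No model requirements -> test with 2+ model sizes")
    return findings
```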

License

MIT
