# Scoring Rubric — skill-scorer

> This rubric defines the exact criteria for each of the 8 evaluation dimensions.
> Each dimension is scored 0–100. The final score is a weighted average.

---

## Dimension 1: Metadata & Triggering (Weight: 15%)

Evaluates whether Claude can **discover and correctly activate** this skill.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **YAML frontmatter exists** with `name` and `description` | Parse the `---` block at the top of SKILL.md |
| 0-20 | **Name follows conventions** — descriptive, uses gerund or verb-noun form, no cryptic abbreviations | Check: would a user or LLM understand what this skill does from the name alone? |
| 0-30 | **Description is effective** — specific enough to trigger correctly, broad enough to catch edge queries, includes anti-trigger guidance | Test: given 5 hypothetical user queries, would this description cause correct triggering? |
| 0-15 | **Description is "pushy" enough** — addresses undertriggering by being explicit about when to activate | Check: does the description say "use when..." or "triggers on..." with concrete examples? |
| 0-15 | **Trigger section in body** (if present) — complements the description with explicit trigger phrases and anti-triggers | Check for "When to Activate" / "Do NOT activate for" sections |
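For instance, a frontmatter block that would score well against these criteria might look like the sketch below; the skill name, description wording, and trigger phrases are all illustrative, not a required format:

```yaml
---
# Hypothetical skill: name and all wording are illustrative
name: analyzing-csv-data
description: >
  Analyzes CSV files: summary statistics, column profiling, and outlier
  detection. Use when the user uploads a .csv file or asks to "analyze",
  "profile", or "summarize" tabular data. Do NOT use for Excel (.xlsx)
  files or for generating synthetic datasets.
---
```

The description names concrete trigger phrases (the "pushy" criterion) and closes with an explicit anti-trigger.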
### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Description alone would reliably trigger for all intended use cases. Name is intuitive. Anti-triggers defined. |
| 70-89 | Description covers main use cases but misses edge cases. Name is clear. |
| 50-69 | Description is vague or too narrow. May undertrigger or overtrigger. |
| 30-49 | Description is a generic sentence. Name is cryptic. Claude would rarely choose this skill. |
| 0-29 | Missing frontmatter, no description, or name is meaningless. |

---

## Dimension 2: Structure & Architecture (Weight: 15%)

Evaluates the **file organization and information architecture** of the skill.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **SKILL.md is the entry point** — contains the core logic, not just a pointer to other files | Check that the main workflow is in SKILL.md, not buried in references |
| 0-20 | **Progressive disclosure** — 3-level loading: metadata (always) → SKILL.md body (on trigger) → references (on demand) | Check: are references clearly separated? Does SKILL.md tell Claude when to read each reference? |
| 0-15 | **SKILL.md length is appropriate** — ideally under 500 lines; if longer, must justify with complexity | Count lines. Under 200 = good. 200-500 = acceptable. Over 500 = needs splitting. |
| 0-15 | **Logical section ordering** — follows a natural flow (overview → when → how → output → references) | Check section sequence for coherence |
| 0-15 | **Reference files are well-organized** — clear naming, each file has a single purpose, SKILL.md has a reference table | Check references/ directory structure and the References section in SKILL.md |
| 0-15 | **No deeply nested references** — reference files should not reference other reference files (max 2 levels) | Check for chains of file references |

### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Clean 3-level progressive disclosure. SKILL.md under 300 lines. All references clearly mapped. |
| 70-89 | Good structure with minor issues (e.g., slightly long SKILL.md, one unclear reference). |
| 50-69 | Flat structure (everything in SKILL.md) or messy reference organization. |
| 30-49 | No separation of concerns. SKILL.md is a wall of text with no references. |
| 0-29 | Chaotic structure. Missing files. No logical organization. |

---

## Dimension 3: Instruction Clarity (Weight: 15%)

Evaluates whether the instructions are **clear, actionable, and well-written**.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **Uses imperative form** — "Do X" not "You should consider doing X" | Scan for weak modal verbs (should, could, might, consider) |
| 0-20 | **Explains WHY, not just WHAT** — gives reasoning behind rules so Claude can generalize | Check: are constraints accompanied by rationale? |
| 0-20 | **Includes concrete examples** — input/output pairs, before/after, or code samples | Count examples. 0 = bad. 2-3 = good. Excessive = bloat. |
| 0-15 | **Avoids redundancy** — no repeated instructions across sections | Check for duplicate or near-duplicate content |
| 0-15 | **Right degree of freedom** — tight constraints where precision matters, loose guidance where creativity helps | Assess: is every step locked down, or does the skill trust Claude where appropriate? |
| 0-10 | **Consistent terminology** — same concept uses same term throughout | Check for inconsistent naming of the same concept |
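As a quick sketch of the first two criteria, compare a weak instruction with its imperative rewrite (the instruction itself is invented for illustration):

```markdown
<!-- Weak: hedged modal verb, no rationale -->
You might want to consider validating the date format before searching.

<!-- Strong: imperative, with the WHY attached -->
Validate that dates are in YYYY-MM-DD format before searching,
because the search command silently returns zero results for
malformed dates.
```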
### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Every instruction is actionable. Examples are crisp. No redundancy. Perfect balance of prescriptive and permissive. |
| 70-89 | Clear overall, but some instructions are vague or redundant. Examples present. |
| 50-69 | Mix of clear and unclear instructions. Missing examples or excessive hand-waving. |
| 30-49 | Mostly vague or passive instructions. No examples. Claude would struggle to follow. |
| 0-29 | Incomprehensible, contradictory, or absent instructions. |

---

## Dimension 4: Workflow & Logic (Weight: 15%)

Evaluates the **operational correctness and completeness** of the skill's workflow.

### Step 0: Skill Type Detection (do this BEFORE scoring)

Different skill types have fundamentally different workflow patterns. Identify the type first, then apply the matching criteria.

| Type | How to Detect | Workflow Pattern |
|------|--------------|------------------|
| **Instruction-only** | No CLI commands, no API calls, no scripts, no external tools. Pure markdown instructions. Examples: brand-guidelines, commit-formatter, code-review checklist | Rules/templates → Claude applies them using its own judgment |
| **Single-command executor** | One CLI/API command with scenario-specific defaults. SKILL.md contains the command template directly in workflow steps | Collect params → execute one command → format output |
| **Multi-command orchestration** | Multiple CLI/API commands combined. Parameters block defines a command pool; Playbooks describe how to combine them; Usage Examples give executable samples | Detect query type → select commands → orchestrate sequence → merge outputs |
| **Script-bundled** | Has a `scripts/` directory with executable code (Python/bash/Node). SKILL.md defines when and why; scripts handle the how. Claude invokes scripts and interprets their output | SKILL.md dispatches → script executes deterministic logic → Claude interprets results |
| **MCP-integrated** | Depends on an external MCP server connection. Skill is the "knowledge layer" on top of MCP tools — it tells Claude how to use those tools effectively | Verify MCP connection → call MCP tools per skill rules → format output following skill conventions |

**Critical rules:**

- An orchestration skill that puts executable commands in Parameters + Usage Examples and orchestration logic in Playbooks is a valid design pattern — not a workflow gap. Do NOT penalize it for "playbooks contain descriptive logic instead of executable commands."
- A script-bundled skill where SKILL.md is short and delegates to scripts is a valid design pattern — the workflow completeness lives in the scripts, not in SKILL.md. Evaluate the dispatch logic and script documentation, not SKILL.md line count.
- An MCP-integrated skill that says "use MCP tool X to do Y" without reimplementing the tool's logic is correct — the skill provides workflow guidance, not tool reimplementation.

### Scoring Criteria — Instruction-Only Skills

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-25 | **Rules are complete** — cover all relevant cases, no obvious gaps | Could Claude handle edge cases with these instructions alone? |
| 0-25 | **Output format defined** — templates, examples, or structure specified | Check for output templates or format examples |
| 0-25 | **Realistic examples** — input/output pairs demonstrate correct application | Check: do examples cover typical and edge cases? |
| 0-25 | **Handles ambiguous input** — guidance for unclear or borderline cases | Check for decision criteria or "when in doubt" guidance |

### Scoring Criteria — Single-Command Executor Skills

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **Complete workflow** — all steps from param collection to output delivery | Trace: can Claude handle a request end-to-end? |
| 0-20 | **Parameter handling** — required vs. optional clear; defaults specified; collection SOP defined | Check for parameter tables, defaults, collection strategy |
| 0-15 | **Validation step exists** — self-check before delivering results | Look for output validation / quality gate step |
| 0-15 | **Step dependencies clear** — each step knows what it receives from previous | Check data flow between steps |
| 0-15 | **Realistic usage examples** — commands that would actually work | Check: real values, future dates, valid IDs? |
| 0-15 | **Handles ambiguous input** — what to do when user intent is unclear | Check for clarification prompts or default behaviors |
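A minimal sketch of what strong parameter handling can look like in a single-command skill's SKILL.md; the `weather-cli` tool, its flags, and the defaults are all hypothetical:

```markdown
## Parameters

| Param | Required | Default | Collection |
|-------|----------|---------|------------|
| city  | yes      | (none)  | Ask the user if not stated |
| units | no       | metric  | Accept "metric" or "imperial" |

## Command Template

weather-cli forecast --city "{city}" --units {units} --days 3
```

Required vs. optional is explicit, each parameter has a collection rule, and the template shows exactly where values are substituted.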
### Scoring Criteria — Multi-Command Orchestration Skills

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **Command pool defined** — all available commands and their parameters are listed in a structured format (Parameters block or equivalent). This is where the "what can I call" lives | Check: are all CLI/API commands the skill can use clearly documented with flags and descriptions? A well-structured Parameters block = full score |
| 0-20 | **Orchestration logic defined** — Playbooks describe how to select and combine commands based on query type. Playbooks are EXPECTED to contain descriptive/procedural logic (e.g., "Step 1: search flights → Step 2: use result to search hotels"), NOT raw executable commands. The executable form lives in Usage Examples, not here | Check: do playbooks show command selection logic and sequencing? Different playbooks should have genuinely different command combinations. Score based on clarity of orchestration logic, not whether playbook text is directly copy-pasteable into a terminal |
| 0-15 | **Query type routing** — skill determines which subset of commands to run based on user intent (e.g., single-point query vs full itinerary) | Check: does Step 1 or equivalent classify the query before executing? |
| 0-15 | **Usage Examples are executable** — the Usage Examples section in SKILL.md contains concrete command samples with realistic parameters that can be copy-pasted and run. This is separate from Playbooks — playbooks describe logic, Usage Examples demonstrate concrete invocations | Check: does the Usage Examples section contain real `cli command --flag value` lines? Are values realistic (real cities, future dates)? |
| 0-15 | **Output merging logic** — how results from multiple commands are combined into a unified response. This can be defined in SKILL.md's Output Rules, in references/templates.md, or both. Any of these locations is valid | Check: is there a defined format (table, day-by-day, sections) for presenting multi-command results together? Location doesn't matter — existence and clarity do |
| 0-15 | **Validation step exists** — self-check covers all commands' outputs, not just the last one | Check: does validation verify booking links / data sources across all command results? |
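To make the Playbooks vs. Usage Examples split concrete, here is a sketch of how the two sections differ; the `travel-cli` commands, flags, and values are hypothetical:

```markdown
## Playbook: Full Itinerary
1. Search flights for the requested dates.
2. Feed the arrival city and dates from step 1 into the hotel search.
3. Merge both results into the day-by-day output template.

## Usage Examples

travel-cli flights --from NYC --to LIS --depart 2026-03-10
travel-cli hotels --city LIS --checkin 2026-03-10 --nights 4
```

The playbook carries the orchestration logic in descriptive steps; the executable, copy-pasteable form lives only in Usage Examples, exactly as the criteria above expect.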
### Scoring Criteria — Script-Bundled Skills

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **Dispatch logic clear** — SKILL.md tells Claude which script to run and when, with explicit invocation commands (e.g., `Run scripts/analyze.py --input {file}`) | Check: can Claude determine which script to call from SKILL.md instructions alone? |
| 0-20 | **Script documentation** — each script's purpose, inputs, outputs, and expected behavior are documented (in SKILL.md, script comments, or references/) | Check: could Claude understand what a script returns without reading its source code? |
| 0-15 | **Dependency declaration** — required packages, runtimes, and environment prerequisites are listed | Check for Prerequisites section or equivalent. Scripts that silently fail on missing dependencies are a critical flaw |
| 0-15 | **Error handling in dispatch** — SKILL.md defines what to do when a script fails (non-zero exit, unexpected output, timeout) | Check: is there an "if script fails → do X" path? |
| 0-15 | **Output interpretation** — SKILL.md tells Claude how to read and present script output to the user | Check: does the skill define how to parse script results (JSON, HTML, plaintext) and format them? |
| 0-15 | **Self-containment** — scripts are bundled in the skill folder, not referenced as external paths | Check: are all scripts in `scripts/` within the skill directory? No `../../shared/` references? |

### Scoring Criteria — MCP-Integrated Skills

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-20 | **MCP dependency declared** — skill explicitly states which MCP server(s) it requires and how to verify connectivity | Check for: MCP server name, connection verification step, "Requires X MCP server" in Prerequisites or frontmatter |
| 0-20 | **Tool usage guidance** — skill defines which MCP tools to use, in what order, and with what parameters for each workflow | Check: does the skill go beyond "use MCP tools" to specify which tools and how? |
| 0-15 | **Connection failure handling** — skill defines what to do when MCP server is disconnected or unreachable | Check for: "MCP not connected → tell user to reconnect. Do NOT proceed." Not just generic error handling |
| 0-15 | **Workflow adds value over raw MCP** — skill provides orchestration, formatting, or domain logic that raw MCP tool calls don't | Check: would removing the skill and using MCP tools directly produce the same result? If yes, the skill adds no value |
| 0-15 | **Output formatting rules** — skill defines how to present MCP tool results to the user | Check: does the skill specify output structure, not just pass through raw MCP responses? |
| 0-15 | **Realistic usage examples** — examples show complete MCP-mediated workflows with realistic parameters | Check: do examples show the full flow (connect → call tools → format → deliver)? |

### Scoring Guide (applies to all 5 types)

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Workflow is complete for its type. All type-specific criteria met. Examples work. |
| 70-89 | Workflow is complete but some type-specific criteria lack detail. |
| 50-69 | Workflow has gaps relative to its type's requirements. |
| 30-49 | Incomplete workflow. Claude would need to improvise major steps. |
| 0-29 | No coherent workflow. Just a description of what the skill should do, not how. |

---

## Dimension 5: Error Handling (Weight: 10%)

Evaluates how the skill handles **failures, edge cases, and unexpected inputs**.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-25 | **Fallback paths defined** — what to do when primary approach fails | Look for fallback/recovery sections or conditional logic |
| 0-25 | **Common failure cases covered** — at minimum: no results, invalid input, tool not available | Count addressed failure cases |
| 0-25 | **Graceful degradation** — skill provides partial value even when full execution fails | Check: does the skill have a degraded mode or helpful error message? |
| 0-25 | **No silent failures** — errors are surfaced to user, not swallowed | Check: does every error path end with user-facing communication? |
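For example, a fallback section that would satisfy all four criteria might read like this sketch (failure cases and remedies are illustrative):

```markdown
## Fallbacks
- **No results**: widen the date range by ±2 days and retry once. If still
  empty, tell the user nothing was found and show the parameters searched.
- **Tool not installed**: stop, report the missing dependency with install
  instructions, and do NOT answer from memory.
- **Partial failure**: deliver the results that succeeded and clearly mark
  which part of the request could not be completed.
```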
### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Comprehensive fallback for every failure mode. Graceful degradation. Honest error reporting. |
| 70-89 | Main failure cases covered. Some edge cases missing. |
| 50-69 | Only 1-2 failure cases addressed. Others would leave Claude stuck. |
| 30-49 | Minimal error handling. Claude would likely hallucinate on failure. |
| 0-29 | No error handling at all. |

---

## Dimension 6: Context Efficiency (Weight: 10%)

Evaluates how well the skill **manages the context window budget**.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-25 | **No unnecessary information** — every line in SKILL.md earns its place | Ask for each section: "Would Claude make a mistake without this?" |
| 0-25 | **No information Claude already knows** — doesn't re-explain common knowledge | Check for explanations of widely-known concepts |
| 0-25 | **Heavy content is in references, not SKILL.md** — domain knowledge, large examples, lookup tables are deferred | Check: is SKILL.md bloated with data that should be in references? |
| 0-25 | **Metadata is compact** — name + description together stay under ~150 tokens | Estimate token count of frontmatter |

### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Every token is necessary. Perfect progressive disclosure. SKILL.md is lean. |
| 70-89 | Mostly efficient with minor bloat (a paragraph or two that could be cut). |
| 50-69 | Notable redundancy or unnecessary explanations. Could lose 30%+ without harm. |
| 30-49 | Significant bloat. Large knowledge sections in SKILL.md that should be deferred. |
| 0-29 | Massive context waste. Walls of text that Claude doesn't need. |

---

## Dimension 7: Portability & Compatibility (Weight: 10%)

Evaluates whether the skill **works across environments and is self-contained**.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-25 | **Self-contained** — no external path references (`../../shared/`), all needed files are in the skill folder | Search for relative paths pointing outside the skill directory |
| 0-25 | **Cross-platform** — works on Claude Code, Claude.ai, and other SKILL.md-compatible agents, or explicitly states compatibility | Check compatibility field and instructions for environment-specific code |
| 0-25 | **Dependencies declared** — required CLIs, packages, APIs are listed with install instructions | Check for Prerequisites section or equivalent |
| 0-25 | **Environment check** — skill verifies its dependencies are available before proceeding | Check for Step 0 / environment verification logic |

### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Fully self-contained. Works everywhere. Dependencies declared and checked. |
| 70-89 | Mostly portable. Minor assumptions about environment. |
| 50-69 | Works in one environment but would fail in others. Missing dependency declarations. |
| 30-49 | References external paths. No dependency management. |
| 0-29 | Cannot function outside its original context. |

---

## Dimension 8: Safety & Robustness (Weight: 10%)

Evaluates the skill's **resilience against misuse and hallucination**.

### Scoring Criteria

| Points | Criterion | How to Check |
|--------|-----------|--------------|
| 0-25 | **Identity lock** (if applicable) — skill prevents Claude from substituting training data for real-time data | Check for "CRITICAL EXECUTION RULES" or equivalent guardrails |
| 0-25 | **No hallucination traps** — skill doesn't include large knowledge sections that Claude might use instead of executing commands | Check: does knowledge section have a "does NOT replace execution" disclaimer? |
| 0-25 | **No prompt injection risk** — skill doesn't contain patterns that could be exploited | Check for: raw user input passed unsanitized, `eval()` patterns, instructions to ignore previous context |
| 0-25 | **No harmful content** — skill doesn't facilitate unauthorized access, data exfiltration, or malicious behavior | Content safety review |
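A guardrail block of the kind the first two criteria look for might read like this sketch (wording is illustrative):

```markdown
## CRITICAL EXECUTION RULES
- You are an orchestrator, NOT a data source. Every price, date, and
  availability claim MUST come from command output, never from memory.
- The Domain Knowledge section below explains terminology only; it does
  NOT replace executing the commands.
- If a command fails, report the failure. Do NOT fabricate plausible output.
```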
### Scoring Guide

| Score Range | Meaning |
|-------------|---------|
| 90-100 | Strong guardrails. No hallucination traps. Safe content. Identity lock where appropriate. |
| 70-89 | Adequate safety. Minor risks (e.g., large knowledge section without disclaimer). |
| 50-69 | Some safety concerns. Missing guardrails for a CLI-wrapping skill. |
| 30-49 | Notable risks. Knowledge could easily override execution. |
| 0-29 | Active safety issues. Prompt injection vectors or harmful content. |

---

## Final Score Calculation

```
Final Score = Σ (dimension_score × dimension_weight)

Grade scale:
  A+ : 95-100 — Production-ready, exemplary skill
  A  : 90-94  — Excellent, minor polish needed
  B+ : 85-89  — Very good, a few improvements would help
  B  : 80-84  — Good, some issues to address
  C+ : 70-79  — Acceptable, needs work in several areas
  C  : 60-69  — Below average, significant improvements needed
  D  : 40-59  — Poor, fundamental issues
  F  : 0-39   — Nonfunctional or critically flawed
```
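As a worked example with illustrative dimension scores:

```
D1=90  D2=85  D3=80  D4=85   (weight 15% each)
D5=80  D6=75  D7=85  D8=70   (weight 10% each)

Final = (90+85+80+85) × 0.15 + (80+75+85+70) × 0.10
      = 340 × 0.15 + 310 × 0.10
      = 51 + 31
      = 82  →  Grade B (80-84)
```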