Eval Skills
v0.1.1AI Agent Skill unit testing framework. A framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. Use this...
Like a lobster shell, security has layers — review code before you run it.
eval-skills
AI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills.
This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.
When to Use This Skill
- Before deploying a new skill to production — run
evalto verify it meets your quality gate. - When choosing between multiple candidate skills — run
selectto rank them on the same benchmark. - When a skill is upgraded — run
report diffto detect regressions. - In CI/CD — use
--exit-on-failto block merges that degrade skill quality. - When bootstrapping a new skill — run
createto generate a ready-to-fill skeleton.
Capabilities
1. Find Skills
Search for existing skills by keyword, tag, or adapter type.
eval-skills find \
--query "web search" \
--tag retrieval api \
--adapter http \
--min-completion 0.8 \
--skills-dir ./skills \
--limit 10
| Option | Description | Default |
|---|---|---|
-q, --query <string> | Keyword search (matches name, description, tags) | — |
-t, --tag <tags...> | Filter by tags (intersection: skill must have ALL specified tags) | — |
-a, --adapter <type> | Filter by adapter type (http, subprocess, mcp) | — |
--min-completion <rate> | Minimum historical completion rate (0.0 ~ 1.0) | — |
--skills-dir <dir> | Directory to scan for skill.json files | ./skills |
--limit <n> | Maximum number of results | 20 |
Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
2. Create Skills
Generate a skill skeleton from a template to bootstrap development.
eval-skills create \
--name my_api_skill \
--from-template http_request \
--output-dir ./skills \
--description "Fetches weather data from OpenWeather API"
| Option | Description | Default |
|---|---|---|
--name <name> | Required. Skill name | — |
--from-template <tpl> | Template type: http_request, python_script, mcp_tool | http_request |
--output-dir <dir> | Output directory | ./skills |
--description <text> | Human-readable description embedded in skill.json | — |
Generated file structure:
skills/my_api_skill/
skill.json # Skill metadata (id, schemas, adapter config)
adapter.config.json # Adapter-specific configuration
tests/
basic.eval.json # A starter benchmark with one sample task
skill.py # (python_script template only) JSON-RPC entrypoint
3. Evaluate Skills
Run benchmark evaluations against one or more skills. This is the core command.
eval-skills eval \
--skills ./skills/calculator/skill.json ./skills/search/ \
--benchmark coding-easy \
--concurrency 4 \
--timeout 30000 \
--retries 2 \
--runs 3 \
--evaluator exact \
--format json markdown html \
--output-dir ./reports \
--exit-on-fail --min-completion 0.8 \
--store ./eval-skills.db
| Option | Description | Default |
|---|---|---|
--skills <paths...> | Required. Skill file(s) or directory(ies) | — |
--benchmark <id|path> | Built-in benchmark ID or path to benchmark.json | coding-easy |
--tasks <file> | Custom tasks JSON file (replaces benchmark) | — |
--concurrency <n> | Number of parallel task executions | 4 |
--timeout <ms> | Per-task timeout in milliseconds | 30000 |
--retries <n> | Retry count on task failure (with incremental backoff) | 0 |
--runs <n> | Repeat evaluation N times for consistency scoring | 1 |
--evaluator <type> | Default scorer type (see Scorer Types below) | exact |
--format <formats...> | Output formats: json, markdown, html | json markdown |
--output-dir <dir> | Report output directory | ./reports |
--exit-on-fail | Exit with code 1 if any skill falls below threshold | disabled |
--min-completion <rate> | Threshold for --exit-on-fail | 0.7 |
--dry-run | Validate configuration only; do not execute tasks | disabled |
--benchmarks-dir <dir> | Directory containing built-in benchmarks | ./benchmarks |
--store <path> | SQLite database path for persistent result storage | ./eval-skills.db |
-c, --config <path> | Path to eval-skills.config.yaml | auto-detected |
Evaluation flow:
- Load skills from
--skillspaths (supports both singleskill.jsonand directories) - Load benchmark tasks from
--benchmarkor--tasks - Build the cartesian product:
skills x tasks x runs - Execute all task items concurrently (controlled by
--concurrency, with timeout and retry) - Score each result using the appropriate scorer
- Aggregate into
SkillCompletionReportper skill - Write reports to
--output-dir
4. Select Skills
Filter and rank skills based on evaluation reports using a multi-dimensional strategy.
eval-skills select \
--from ./skills \
--reports ./reports/eval-result.json \
--strategy ./strategy.yaml \
--min-completion 0.8 \
--top-k 5 \
--output ./selected.json
| Option | Description | Default |
|---|---|---|
--from <path> | Required. Candidate skills directory or JSON file | — |
--reports <file> | Evaluation reports JSON file | — |
--strategy <file> | SelectStrategy YAML/JSON file | built-in default |
--min-completion <rate> | Override minimum completion rate filter | — |
--top-k <n> | Return only the top K results | all |
--output <file> | Write selected skills to file | stdout |
Selection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK
Example strategy.yaml:
filters:
minCompletionRate: 0.8
maxErrorRate: 0.1
maxLatencyP95Ms: 5000
adapterTypes: [http, subprocess]
requiredTags: [production-ready]
sortBy: compositeScore
order: desc
topK: 5
5. Run Pipeline
Execute the full end-to-end pipeline: Find -> Eval -> Select -> Report in a single command.
eval-skills run \
--query "math" \
--benchmark coding-easy \
--skills-dir ./skills \
--top-k 3 \
--min-completion 0.7 \
--format json markdown \
--output-dir ./reports
This command automates the entire process:
- Find — scans
--skills-dirand optionally filters by--query - Eval — evaluates all candidate skills against
--benchmark - Select — filters and ranks results using
--min-completion,--top-k, and optional--strategy - Report — generates output files in all requested
--formats
6. Generate & Compare Reports
Convert report format
eval-skills report convert \
--input ./reports/eval-result.json \
--format html \
--output ./reports/eval-result.html
Supported output formats: markdown, html.
Diff two reports (regression detection)
eval-skills report diff \
./reports/v1.json ./reports/v2.json \
--label-a "v1.0" --label-b "v2.0" \
--output ./reports/diff.md
Generates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score with directional arrows.
7. Initialize Project
eval-skills init --dir .
Creates the project scaffold:
eval-skills.config.yaml— global configurationskills/— directory for skill definitionsbenchmarks/— directory for benchmark filesreports/— directory for evaluation output
8. Manage Configuration
# List all current configuration values
eval-skills config list
# Get a specific value (supports dot notation)
eval-skills config get llm.model
# Set a value (persisted to ~/.eval-skills/config.yaml)
eval-skills config set concurrency 8
eval-skills config set llm.model gpt-4o
eval-skills config set llm.temperature 0
Configuration is resolved in priority order:
- CLI flags (highest priority)
eval-skills.config.yamlin current directory~/.eval-skills/config.yaml- Built-in defaults (
concurrency: 4,timeoutMs: 30000,outputDir: ./reports)
Scorer Types
Each task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.
| Type | Aliases | Description | Score Range |
|---|---|---|---|
exact_match | exact | Strict equality comparison. Supports caseSensitive option. | 0 or 1 |
contains | — | Checks for the presence of all specified keywords in the output. Partial credit: matched_keywords / total_keywords. | 0.0 ~ 1.0 |
json_schema | schema | Validates output against a JSON Schema (using Ajv). | 0 or 1 |
llm_judge | — | Sends the output + expected rubric to an LLM (configurable model) for quality rating. | 0.0 ~ 1.0 |
custom | — | Loads a custom scorer from expectedOutput.customScorerPath. | 0.0 ~ 1.0 |
Evaluation Metrics
Every evaluation produces a SkillCompletionReport with these metrics:
| Metric | Description | Formula |
|---|---|---|
| Completion Rate | Fraction of tasks that passed | pass_count / total_count |
| Partial Score | Mean score across all tasks | mean(task_scores) |
| Error Rate | Fraction of tasks that errored or timed out | (error_count + timeout_count) / total_count |
| Consistency Score | Stability across multiple runs (requires --runs >= 2) | 1 - stddev(per_run_completion_rates) |
| P50 / P95 / P99 Latency | Response time percentiles | Sorted percentile of latencyMs |
| Composite Score | Weighted overall quality score | 0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER) |
Built-in Benchmarks
| ID | Domain | Tasks | Scoring | Description |
|---|---|---|---|---|
coding-easy | coding | 20 | mean / exact_match | Math expressions, string reversal, palindrome detection |
skill-quality | tool-use | 5 | mean / contains | Metadata completeness, description quality, structure checks |
web-search-basic | web | 8 | mean / contains + schema | Factual queries, keyword verification, structured output validation |
gaia-v1 | general | — | mean | Placeholder for GAIA benchmark Level 1 tasks |
toolbench-lite | tool-use | — | mean | Placeholder for ToolBench single-tool scenarios |
Custom Benchmark
Create a benchmark.json file:
{
"id": "my-benchmark",
"name": "My Custom Benchmark",
"version": "1.0.0",
"domain": "general",
"scoringMethod": "mean",
"maxLatencyMs": 30000,
"metadata": { "source": "internal", "lastUpdated": "2026-02-28" },
"tasks": [
{
"id": "task_001",
"description": "Test basic addition",
"inputData": { "expression": "2+3" },
"expectedOutput": { "type": "exact", "value": "5" },
"evaluator": { "type": "exact" },
"timeoutMs": 10000,
"tags": ["math"]
},
{
"id": "task_002",
"description": "Test keyword presence",
"inputData": { "query": "TypeScript" },
"expectedOutput": { "type": "contains", "keywords": ["JavaScript", "Microsoft"] },
"evaluator": { "type": "contains", "caseSensitive": false },
"timeoutMs": 15000,
"tags": ["search"]
}
]
}
eval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json
Adapter Types
Skills communicate through adapters. The adapter type is specified in skill.json via adapterType.
| Adapter | Protocol | How it works | Key config |
|---|---|---|---|
http | REST POST | Sends POST { skillId, version, input } to skill.entrypoint. Supports Bearer / API-Key auth via env vars. | baseUrl, authType, authTokenEnvKey |
subprocess | JSON-RPC 2.0 over stdin/stdout | Spawns skill.entrypoint (e.g. python3 skill.py), writes JSON-RPC request to stdin, reads response from stdout. | command, args |
mcp | MCP Protocol | (Phase 2) Native Model Context Protocol integration via @modelcontextprotocol/sdk. | — |
Workflow Examples
Evaluating a Single Skill
# 1. Create a skill skeleton
eval-skills create --name my_calc --from-template python_script
# 2. Implement your logic in skills/my_calc/skill.py
# 3. Run evaluation against the coding-easy benchmark
eval-skills eval \
--skills ./skills/my_calc/skill.json \
--benchmark coding-easy \
--runs 3 \
--format json markdown
# 4. Review the report
cat ./reports/eval-result-*.md
Comparing Multiple Candidate Skills
# 1. Discover candidates
eval-skills find --query "weather" --skills-dir ./skills
# 2. Evaluate all candidates on the same benchmark
eval-skills eval \
--skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \
--benchmark web-search-basic \
--runs 3
# 3. Select the best
eval-skills select \
--from ./skills \
--reports ./reports/eval-result-*.json \
--min-completion 0.8 \
--top-k 2
# 4. Compare two versions
eval-skills report diff \
./reports/v1.json ./reports/v2.json \
--label-a "weather_v1" --label-b "weather_v2"
Full Pipeline (One Command)
eval-skills run \
--skills-dir ./skills \
--benchmark coding-easy \
--top-k 3 \
--min-completion 0.7 \
--format json markdown html \
--output-dir ./reports
CI/CD Quality Gate
# In your CI pipeline — fail the build if completion rate drops below 80%
eval-skills eval \
--skills ./skills/production_skill \
--benchmark coding-easy \
--exit-on-fail \
--min-completion 0.8 \
--format json
Regression Detection
# Compare today's evaluation against the baseline
eval-skills report diff \
./reports/baseline.json ./reports/latest.json \
--label-a "baseline" --label-b "latest" \
--output ./reports/regression-check.md
Best Practices
-
Always use
--runs 3or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs. -
Use
--exit-on-failin CI/CD pipelines to enforce quality gates. Set--min-completionto your acceptable threshold (recommended: 0.8 for production skills). -
Create domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect real-world inputs your skill will encounter.
-
Use
report diffafter every skill upgrade to catch regressions early. Compare the new evaluation against a saved baseline report. -
Use
--dry-runbefore long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks. -
Persist results with
--storeto track skill quality over time. The SQLite store enables historical trend queries. -
Start with
--concurrency 1when debugging a failing skill, then increase for production benchmarking. -
Tag your benchmark tasks to enable per-category analysis (e.g., filter by
math,string,edge-case).
Skill JSON Schema
Every skill must provide a skill.json that conforms to this structure:
{
"id": "my_skill_v1",
"name": "My Skill",
"version": "1.0.0",
"description": "Does something useful",
"tags": ["utility", "math"],
"inputSchema": {
"type": "object",
"properties": { "query": { "type": "string" } },
"required": ["query"]
},
"outputSchema": {
"type": "object",
"properties": { "result": { "type": "string" } }
},
"adapterType": "subprocess",
"entrypoint": "python3 skill.py",
"metadata": {
"author": "Your Name",
"license": "MIT",
"homepage": "https://github.com/you/my-skill"
}
}
Validation rules:
id: lowercase alphanumeric with_or-, non-emptyversion: semver format (X.Y.Z)adapterType: one ofhttp,subprocess,mcp,langchain,customentrypoint: non-empty string (URL for http, command for subprocess)
Global Options
These options are available on all commands:
| Option | Description |
|---|---|
-c, --config <path> | Path to configuration file |
--json | JSON output format (CI-friendly) |
--no-color | Disable colored output |
-v, --verbose | Verbose logging |
--version | Show version |
-h, --help | Show help |
Comments
Loading comments...
