SkillBench
v2.0.0Track skill versions, benchmark performance, compare improvements, and get self-improvement signals. Integrates with tasktime and ClawVault.
Security Scan
OpenClaw
Suspicious
medium confidencePurpose & Capability
Name and description match the declared binary and CLI functionality (skillbench CLI). The install spec (npm @versatly/skillbench → skillbench binary) is consistent with the stated purpose of a benchmarking CLI. However the registry metadata provides no homepage or source repo to review.
Instruction Scope
SKILL.md instructs the agent to call the skillbench CLI and the tasktime ('tt') CLI and to sync with external services (ClawVault, ClawHub). The skill's requires.bins only lists 'skillbench' and does not declare 'tt' or any other external tool it references, and it does not declare where ClawVault/ClawHub credentials come from — so the runtime instructions rely on tools/credentials not described in the skill manifest.
Install Mechanism
Install uses npm (@versatly/skillbench) to create a global 'skillbench' binary — a common pattern for CLIs but one that executes third-party code during install/use. There is no homepage or source URL in the metadata to audit the package, increasing the risk because arbitrary npm package code would run on install and at runtime.
Credentials
The SKILL.md describes automatic syncing to ClawVault and ClawHub and interaction with external dashboards and CI. Yet the skill declares no required environment variables or auth tokens. This is a mismatch: the CLI likely needs credentials or config files to access those services, but the skill does not declare where those credentials come from or what variables/paths it will read.
Persistence & Privilege
The skill is not 'always' and does not request elevated platform privileges in the manifest. It installs a CLI binary (global npm install) but does not declare modifying other skills or agent-wide config; that is within normal bounds for a user-invokable CLI skill.
What to consider before installing
This skill appears to be a legitimate benchmarking CLI, but there are several red flags you should address before installing: 1) npm packages run code on install and at runtime — review the package source (GitHub repo) and the published package content before running npm install globally; 2) SKILL.md references the tasktime 'tt' CLI and external services (ClawVault/ClawHub) but the manifest doesn't declare those dependencies or any auth variables — verify how the CLI obtains credentials and ensure it won't read unexpected config files or exfiltrate data; 3) Prefer to install and test this tool in an isolated environment (container or VM) first, inspect what network endpoints it contacts, and check what files/dirs it writes; 4) If you plan to give it access to service tokens, issue scoped tokens with minimal privileges and rotate them after testing; 5) If you need help auditing the npm package contents or confirming the CLI's network behavior, provide the package URL or the package tarball and I can help review it. Proceed with caution.Like a lobster shell, security has layers — review code before you run it.
Runtime requirements
Binsskillbench
Install
Install SkillBench CLI (npm)
Bins: skillbench
npm i -g @versatly/skillbenchlatest
skillbench Skill
Self-improving skill ecosystem for AI agents.
Track skill versions, benchmark performance, compare improvements, and get signals on what to fix next.
Part of the ClawVault ecosystem | tasktime | ClawHub
Installation
npm install -g @versatly/skillbench
The Loop
1. Use a skill → skillbench use github@1.0.0
2. Do the task → tt start "Create PR" && ... && tt stop
3. Record result → skillbench record "Create PR" --success
4. Check scores → skillbench score github
5. Improve skill → Update skill, bump version
6. Repeat → Compare v1.0.0 vs v1.1.0
Commands
Track Skills
skillbench use github@1.2.0 # Set active skill version
skillbench skills # List tracked skills + signals
Record Benchmarks
# Auto-pulls duration from tasktime
skillbench record "Create PR" --success
# Manual duration
skillbench record "Create PR" --duration 45s --success
# Record failures
skillbench record "Create PR" --fail --error-type "auth-error"
Score & Compare
skillbench score # All skills with grades
skillbench score github # Single skill
skillbench compare github@1.0.0 github@1.1.0
Export & Dashboard
skillbench export --format markdown
skillbench export --format json
skillbench dashboard # Generate HTML dashboard
skillbench dashboard --open # Generate and open in browser
Automated Testing
skillbench test tasktime@1.1.0 # Run smoke test
skillbench test tasktime@1.1.0 --suite full # Run named suite
skillbench test tasktime@1.1.0 --dry-run # Test without recording
Sync
skillbench sync --clawhub # Import installed skills
skillbench sync --vault # Sync to ClawVault
skillbench sync --all # Everything
Health & Monitoring
skillbench health # Overall health report with alerts
skillbench watch --once # Run all test suites once
skillbench watch --interval 300 # Continuous monitoring every 5 min
Analysis & Improvement
skillbench improve # Get suggestions for weakest skill
skillbench improve github # Improvement plan for specific skill
skillbench trend tasktime --days 30 # Performance trend over time
skillbench leaderboard # Compare agents (multi-agent setups)
skillbench schedule --interval 60 # Generate cron config for auto-testing
Baselines & Regression Detection
skillbench baseline tasktime --set # Set baseline from current performance
skillbench baseline --list # List all baselines
skillbench baseline --check # Check all baselines (CI-friendly, exits 1 if failing)
skillbench baseline tasktime --remove # Remove a baseline
CI/CD Integration
skillbench ci # Run all tests + baseline checks
skillbench ci --json # JSON output for automation
skillbench badge # Generate shields.io badges for README
Copy examples/github-action.yml for ready-to-use GitHub Actions workflow.
Grading System
| Grade | Score | Meaning |
|---|---|---|
| 🏆 A+ | 95-100 | Elite performance |
| ✅ A | 85-94 | Excellent |
| 👍 B | 70-84 | Good |
| ⚠️ C | 50-69 | Needs work |
| ❌ D | <50 | Broken |
Based on: Success Rate (40%), Avg Duration (30%), Consistency (20%), Trend (10%)
tasktime Integration
When you omit --duration, skillbench pulls from tasktime:
tt start "Create PR" -c git
# ... do work ...
tt stop
skillbench record --success # Duration auto-pulled
ClawVault Integration
Benchmarks sync to ClawVault automatically.
Improvement Signals
skillbench skills shows:
- ⚠️ needs work — Success rate below 70%
- 🕐 stale — No benchmarks in 7+ days
- ↘️ declining — Getting worse over time
Related
Comments
Loading comments...
