skill-usefulness-audit
v0.2.2. Audit whether installed skills still create real value. Use only when the user explicitly asks to review, score, rank, consolidate, or d...
Skill Usefulness Audit
Overview
Use this skill to judge whether installed skills still deserve to stay installed. It turns vague "this feels useless" opinions into a repeatable audit based on usage evidence, overlap, outcome impact, confidence, community prior, and risk.
Manual Trigger Only
Run this skill only after a direct user request. Do not invoke it implicitly during normal task execution.
Audit Scope
Audit these layers in order:
- Usage evidence with recency and source quality.
- Installed skill metadata and instructions.
- Functional overlap across skills.
- Ablation impact on historical conversations for non-API and non-tool skills.
- Static health and risk signals.
- Optional offline community or registry metrics.
Treat API and tool skills as protected capability skills during ablation. Examples: Excel, DOCX, PDF, browser automation, deployment, OCR, external API wrappers, MCP/API gateway helpers.
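Deciding which skills count as protected api/tool capabilities, and which remain eligible for ablation, can be sketched as a keyword heuristic over each skill's name and description. The keyword lists below are illustrative assumptions, not the audit script's actual rule:

```python
# Hypothetical heuristic for routing skills into api / tool / general.
# The hint lists are assumptions for illustration, not the real rubric.
API_HINTS = ("api", "mcp", "gateway", "wrapper", "webhook")
TOOL_HINTS = ("excel", "docx", "pdf", "browser", "deploy", "ocr")

def classify_skill(name: str, description: str) -> str:
    """Return 'api', 'tool', or 'general' for ablation routing."""
    text = f"{name} {description}".lower()
    if any(hint in text for hint in API_HINTS):
        return "api"      # protected: never ablated via no-tool simulation
    if any(hint in text for hint in TOOL_HINTS):
        return "tool"     # protected capability skill
    return "general"      # eligible for ablation
```

A real classifier would also inspect scripts and declared tool bindings; the point is only that api/tool skills take the protected path before any ablation runs.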
Workflow
- Collect installed skills. Search user-provided roots first. Fall back to host-local roots such as ./skills, $CODEX_HOME/skills, or ~/.codex/skills.
- Collect usage evidence. Prefer native counters, logs, or telemetry. Read calls, recent_30d_calls, recent_90d_calls, last_used_at, and active_days when present. Fall back to transcript mentions only when native counts are unavailable.
- Read every installed SKILL.md. Extract name, description, headings, scripts, references, and source path.
- Classify each skill as api, tool, or general. Use the protected path for api and tool.
- Detect overlap. Compare descriptions, headings, and resource names. Keep the top overlap peer and similarity score for each skill.
- Run ablation for every general skill. Use fresh runs or isolated threads when the host supports them. Replay representative historical prompts with the target skill enabled and disabled.
- Scan risk and health signals. Record risky shell, network, protected-path, persistence, or dynamic-exec patterns.
- Load optional community metrics. Accept local registry exports through --community-file. Treat these metrics as an external prior, not local proof.
- Score every skill on a 10-point local scale. Read references/scoring-rubric.md.
- Produce the final report as tables. Include a full ranking table, a recommended-actions table, a delete-candidate table, and a short evidence note for each skill.
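The first and third steps above (discover installed skills, then read each SKILL.md) can be sketched as below. The frontmatter parser is a minimal assumption about a name/description YAML header, not the audit script's actual parser:

```python
import os
import re

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs from a SKILL.md YAML frontmatter block."""
    match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    meta = {}
    if match:
        for line in match.group(1).splitlines():
            key, _, value = line.partition(":")
            if value:
                meta[key.strip()] = value.strip()
    return meta

def collect_skills(roots):
    """Yield (path, metadata) for every SKILL.md under the given roots."""
    for root in roots:
        if not os.path.isdir(root):
            continue  # silently fall back to the next candidate root
        for dirpath, _dirnames, filenames in os.walk(root):
            if "SKILL.md" in filenames:
                path = os.path.join(dirpath, "SKILL.md")
                with open(path, encoding="utf-8") as fh:
                    yield path, parse_frontmatter(fh.read())
```

Passing the user-provided roots first and the host-local fallbacks last preserves the search order the workflow requires.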
Ablation Rules
Read references/ablation-protocol.md before running ablation.
For each eligible skill:
- Sample historical tasks where that skill plausibly matters.
- Keep the prompt and artifacts identical between the skill-on and skill-off runs.
- Judge pass/fail, quality delta, tool efficiency, and whether the final answer materially changed.
- Mark high consistency between skill-on and skill-off runs as evidence that the skill contributes little.
Do not ablate api or tool skills through fake no-tool simulations.
Use the protected-capability branch in the rubric for those skills.
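The consistency judgment above can be sketched as a single ratio, assuming each normalized ablation case records pass flags for the skill-on and skill-off runs plus an answer-changed flag (this record shape is an assumption about the ablation file, not its guaranteed schema):

```python
def ablation_consistency(cases):
    """Fraction of cases where the skill-on and skill-off runs agree.

    High consistency means disabling the skill rarely changed the
    outcome, i.e. weak evidence that the skill contributes value.
    """
    if not cases:
        return None  # no evidence; score with low confidence instead
    agree = sum(
        1 for c in cases
        if c["on_pass"] == c["off_pass"] and not c.get("answer_changed", False)
    )
    return agree / len(cases)
```

A consistency near 1.0 feeds the "contributes little" branch of the rubric; a low value is evidence the skill materially changes outcomes.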
Commands
Run the audit script after collecting evidence:
python scripts/skill_usefulness_audit.py audit \
--skills-root ./skills \
--usage-file ./usage.json \
--history-file ./history.jsonl \
--ablation-file ./ablation.json \
--community-file ./community.json \
--markdown-out ./skill-audit-report.md \
--json-out ./skill-audit-report.json
Input contracts:
- --usage-file: JSON, JSONL, CSV, or TSV with per-skill usage evidence.
- --history-file: raw transcript export, used only when direct usage counts are weak or missing.
- --ablation-file: normalized JSON or JSONL with skill-on versus skill-off case results.
- --community-file: optional offline JSON, JSONL, CSV, or TSV registry metrics.
Run without extra files only when you need a structure-only audit. Usage, community, and ablation evidence become lower-confidence in that mode.
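Whatever format the usage file arrives in, each record eventually has to be coerced into the fields the workflow names. A sketch of that normalization, where the days_since_use field is an assumption added for illustration rather than part of the documented contract:

```python
from datetime import datetime, timezone

USAGE_FIELDS = ("calls", "recent_30d_calls", "recent_90d_calls", "active_days")

def normalize_usage(record, now=None):
    """Coerce one per-skill usage record into comparable numbers."""
    now = now or datetime.now(timezone.utc)
    out = {field: int(record.get(field, 0)) for field in USAGE_FIELDS}
    last = record.get("last_used_at")
    if last:
        then = datetime.fromisoformat(last)
        out["days_since_use"] = (now - then).days
    else:
        out["days_since_use"] = None  # missing evidence lowers confidence
    return out
```

Missing fields default to zero, and a missing last_used_at is kept as an explicit gap so the scoring step can lower confidence instead of treating absence as disuse.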
Output Contract
Always return these tables:
- Full score table with: rank, skill, source, kind, calls, recent_30d, usage, uniqueness, impact, community, confidence, risk, total, verdict, action, basis.
- Recommended actions with: skill, total, confidence, risk, action, reason.
- Deletion or merge candidates with: skill, total, kind, action, trigger, reason.
- Missing-evidence table when usage, ablation, or optional community data is incomplete.
Keep deletion advice conservative for system or host-core skills.
Recommend narrowing or merging before deletion when two high-overlap skills still serve distinct host integrations.
Use quarantine-review for useful but risky skills.
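The recommended-actions table can be rendered with a small helper like the following; the column order follows the contract above, while the formatting itself is a sketch rather than the script's actual renderer:

```python
def render_actions_table(rows):
    """Render recommended-action rows as a Markdown table."""
    header = ("skill", "total", "confidence", "risk", "action", "reason")
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[col]) for col in header) + " |")
    return "\n".join(lines)
```

Emitting plain Markdown keeps the report readable both in the chat transcript and in the --markdown-out file.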
Resources
- scripts/skill_usefulness_audit.py: collect metadata, score skills, scan risk, and render Markdown/JSON tables.
- references/scoring-rubric.md: 10-point scoring rules, confidence logic, community prior, and action thresholds.
- references/ablation-protocol.md: normalized replay method for historical conversation tests.