skill-usefulness-audit
v0.2.2. Audit whether installed skills still create real value. Use only when the user explicitly asks to review, score, rank, consolidate, or d...
Skill Usefulness Audit
Overview
Use this skill to judge whether installed skills still deserve to stay installed. It turns vague "this feels useless" opinions into a repeatable audit based on usage evidence, overlap, outcome impact, confidence, community prior, and risk.
Manual Trigger Only
Run this skill only after a direct user request. Do not invoke it implicitly during normal task execution.
Audit Scope
Audit these layers in order:
- Usage evidence with recency and source quality.
- Installed skill metadata and instructions.
- Functional overlap across skills.
- Ablation impact on historical conversations for non-API and non-tool skills.
- Static health and risk signals.
- Optional offline community or registry metrics.
Treat API and tool skills as protected capability skills during ablation. Examples: Excel, DOCX, PDF, browser automation, deployment, OCR, external API wrappers, MCP/API gateway helpers.
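Deciding which skills count as protected api/tool capabilities, and which remain eligible for ablation, can be sketched as a keyword heuristic over each skill's name and description. The keyword lists below are illustrative assumptions, not the audit script's actual rule:

```python
# Hypothetical heuristic for routing skills into api / tool / general.
# The hint lists are assumptions for illustration, not the real rubric.
API_HINTS = ("api", "mcp", "gateway", "wrapper", "webhook")
TOOL_HINTS = ("excel", "docx", "pdf", "browser", "deploy", "ocr")

def classify_skill(name: str, description: str) -> str:
    """Return 'api', 'tool', or 'general' for ablation routing."""
    text = f"{name} {description}".lower()
    if any(hint in text for hint in API_HINTS):
        return "api"      # protected: never ablated via no-tool simulation
    if any(hint in text for hint in TOOL_HINTS):
        return "tool"     # protected capability skill
    return "general"      # eligible for ablation
```

A real classifier would also inspect scripts and declared tool bindings; the point is only that api/tool skills take the protected path before any ablation runs.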
Workflow
- Collect installed skills. Search user-provided roots first. Fall back to host-local roots such as ./skills, $CODEX_HOME/skills, or ~/.codex/skills.
- Collect usage evidence. Prefer native counters, logs, or telemetry. Read calls, recent_30d_calls, recent_90d_calls, last_used_at, and active_days when present. Fall back to transcript mentions only when native counts are unavailable.
- Read every installed SKILL.md. Extract name, description, headings, scripts, references, and source path.
- Classify each skill as api, tool, or general. Use the protected path for api and tool.
- Detect overlap. Compare descriptions, headings, and resource names. Keep the top overlap peer and similarity score for each skill.
- Run ablation for every general skill. Use fresh runs or isolated threads when the host supports them. Replay representative historical prompts with the target skill enabled and disabled.
- Scan risk and health signals. Record risky shell, network, protected-path, persistence, or dynamic-exec patterns.
- Load optional community metrics. Accept local registry exports through --community-file. Treat these metrics as an external prior, not local proof.
- Score every skill on a 10-point local scale. Read references/scoring-rubric.md.
- Produce the final report as tables. Include a full ranking table, a recommended-actions table, a delete-candidate table, and a short evidence note for each skill.
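The first and third steps above (discover installed skills, then read each SKILL.md) can be sketched as below. The frontmatter parser is a minimal assumption about a name/description YAML header, not the audit script's actual parser:

```python
import os
import re

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs from a SKILL.md YAML frontmatter block."""
    match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    meta = {}
    if match:
        for line in match.group(1).splitlines():
            key, _, value = line.partition(":")
            if value:
                meta[key.strip()] = value.strip()
    return meta

def collect_skills(roots):
    """Yield (path, metadata) for every SKILL.md under the given roots."""
    for root in roots:
        if not os.path.isdir(root):
            continue  # silently fall back to the next candidate root
        for dirpath, _dirnames, filenames in os.walk(root):
            if "SKILL.md" in filenames:
                path = os.path.join(dirpath, "SKILL.md")
                with open(path, encoding="utf-8") as fh:
                    yield path, parse_frontmatter(fh.read())
```

Passing the user-provided roots first and the host-local fallbacks last preserves the search order the workflow requires.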
Ablation Rules
Read references/ablation-protocol.md before running ablation.
For each eligible skill:
- Sample historical tasks where that skill plausibly matters.
- Keep the prompt and artifacts identical between the skill-on and skill-off runs.
- Judge pass/fail, quality delta, tool efficiency, and whether the final answer materially changed.
- Mark high consistency between skill-on and skill-off runs as evidence that the skill contributes little.
Do not ablate api or tool skills through fake no-tool simulations.
Use the protected-capability branch in the rubric for those skills.
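The consistency judgment above can be sketched as a single ratio, assuming each normalized ablation case records pass flags for the skill-on and skill-off runs plus an answer-changed flag (this record shape is an assumption about the ablation file, not its guaranteed schema):

```python
def ablation_consistency(cases):
    """Fraction of cases where the skill-on and skill-off runs agree.

    High consistency means disabling the skill rarely changed the
    outcome, i.e. weak evidence that the skill contributes value.
    """
    if not cases:
        return None  # no evidence; score with low confidence instead
    agree = sum(
        1 for c in cases
        if c["on_pass"] == c["off_pass"] and not c.get("answer_changed", False)
    )
    return agree / len(cases)
```

A consistency near 1.0 feeds the "contributes little" branch of the rubric; a low value is evidence the skill materially changes outcomes.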
Commands
Run the audit script after collecting evidence:
python scripts/skill_usefulness_audit.py audit \
--skills-root ./skills \
--usage-file ./usage.json \
--history-file ./history.jsonl \
--ablation-file ./ablation.json \
--community-file ./community.json \
--markdown-out ./skill-audit-report.md \
--json-out ./skill-audit-report.json
Input contracts:
- --usage-file: JSON, JSONL, CSV, or TSV with per-skill usage evidence.
- --history-file: raw transcript export, used only when direct usage counts are weak or missing.
- --ablation-file: normalized JSON or JSONL with skill-on versus skill-off case results.
- --community-file: optional offline JSON, JSONL, CSV, or TSV registry metrics.
Run without extra files only when you need a structure-only audit. Usage, community, and ablation evidence become lower-confidence in that mode.
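Whatever format the usage file arrives in, each record eventually has to be coerced into the fields the workflow names. A sketch of that normalization, where the days_since_use field is an assumption added for illustration rather than part of the documented contract:

```python
from datetime import datetime, timezone

USAGE_FIELDS = ("calls", "recent_30d_calls", "recent_90d_calls", "active_days")

def normalize_usage(record, now=None):
    """Coerce one per-skill usage record into comparable numbers."""
    now = now or datetime.now(timezone.utc)
    out = {field: int(record.get(field, 0)) for field in USAGE_FIELDS}
    last = record.get("last_used_at")
    if last:
        then = datetime.fromisoformat(last)
        out["days_since_use"] = (now - then).days
    else:
        out["days_since_use"] = None  # missing evidence lowers confidence
    return out
```

Missing fields default to zero, and a missing last_used_at is kept as an explicit gap so the scoring step can lower confidence instead of treating absence as disuse.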
Output Contract
Always return these tables:
- Full score table with: rank, skill, source, kind, calls, recent_30d, usage, uniqueness, impact, community, confidence, risk, total, verdict, action, basis.
- Recommended actions with: skill, total, confidence, risk, action, reason.
- Deletion or merge candidates with: skill, total, kind, action, trigger, reason.
- Missing-evidence table when usage, ablation, or optional community data is incomplete.
Keep deletion advice conservative for system or host-core skills.
Recommend narrowing or merging before deletion when two high-overlap skills still serve distinct host integrations.
Use quarantine-review for useful but risky skills.
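The recommended-actions table can be rendered with a small helper like the following; the column order follows the contract above, while the formatting itself is a sketch rather than the script's actual renderer:

```python
def render_actions_table(rows):
    """Render recommended-action rows as a Markdown table."""
    header = ("skill", "total", "confidence", "risk", "action", "reason")
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[col]) for col in header) + " |")
    return "\n".join(lines)
```

Emitting plain Markdown keeps the report readable both in the chat transcript and in the --markdown-out file.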
Resources
- scripts/skill_usefulness_audit.py: collect metadata, score skills, scan risk, and render Markdown/JSON tables.
- references/scoring-rubric.md: 10-point scoring rules, confidence logic, community prior, and action thresholds.
- references/ablation-protocol.md: normalized replay method for historical conversation tests.