skill-deep-audit

Generic skill-quality auditor for any agent skill (Claude, OpenClaw, Cursor, etc.). Runs a 7-dimension static analysis (D1 process closure & idempotency, D2 tool/command conventions, D3 portability & defense, D4 skill usability, D5 security & op risk, D6 code & doc quality, D7 dependency & footprint) with explicit ERR / WARN severity, 115-point scoring (PASS requires a score of at least 90 and zero ERR), and an opt-in `--fix` workflow that always backs up first. Two depths: L1 static (~2 min) and L2 dryRun (~5 min, read-only hub + reachability checks). Strict red lines: read-only by default, and it never executes the audited skill's writes. Use when the user asks to "audit a skill", "check skill quality", "is this skill ready to ship", "lint my skill", or runs this tool by name. Also triggers on the Chinese phrasings "审计这个 Skill" (audit this skill), "检查 Skill 质量" (check skill quality), "Skill 能上线吗" (can this skill go live?), and "审一下 xxx skill" (take a look at the xxx skill), and on the literal name "skill-deep-audit".

Install

openclaw skills install @songhonglei/skill-deep-audit

skill-deep-audit — Generic Skill Auditor

A read-only, multi-dimensional quality auditor for agent skills. Runs static analysis + optional dryRun reachability checks and produces a scorecard.

Design principles

Can it run?            → D3 Portability + D4 Usability conventions
Does it run correctly? → D1 Process closure + D6 Code & doc quality
Is it safe to run?     → D5 Security & op risk
Is it well-conformed?  → D2 Tool & command conventions
Is the whole healthy?  → D7 Dependency & footprint

NEVER DO

  • Do not execute any write operation of the audited skill (read-only, reachability checks, and static analysis only).
  • Do not modify the audited skill's files (read-only and report-only; --fix is the one exception and requires explicit user authorization).
  • Do not fabricate check results — when undecidable, mark "cannot confirm, manual verification needed".

MUST DO

  • At the start of every audit, ask the user to choose the check depth (L1 / L2 dryRun) and wait for the choice.
  • Any ERR → final result is FAIL, regardless of total score.
  • After the audit, write the scorecard MD file.
  • If an L2 dryRun check fails because an external dependency is unavailable, downgrade to WARN with the reason — do not abort.

Dependencies of this skill (all soft — degrade gracefully if missing)

| Dependency | Purpose | Behavior if missing |
| --- | --- | --- |
| A skill-hub query tool (e.g. clawhub) | Hub publish-status check (D7-W1) and dependency existence check (D7-W2 step 3) | Skip Hub checks; related items downgrade to WARN; do not abort the audit |

The core of this skill is pure static / read-only analysis. There are no hard external dependencies — L1 static audit works even if no tooling is installed.


Step 0: Ask for check depth

At the start of the audit, present the following options and wait for the user to choose explicitly:

Please choose check depth:

L1 Static analysis (~2 min)
   File read, structural check, keyword scan, syntax check.
   Max 112 (skips items that need to touch external systems). Pass line ≥ 90.
   Good for: quick first-draft check.

L2 dryRun (~5 min, recommended) ⭐
   L1 + Hub existence check + dependency existence check + branch reachability
   simulation (file existence / env config / read-only verification of
   unhit branches).
   Max 115. Pass line ≥ 90.
   Good for: pre-release / pre-ship full acceptance.

Default recommendation: L2 dryRun. Reply 1 / L1 for static, 2 / L2 for dryRun
(empty Enter = L2).

⚠️ Note: L2 dryRun does ONLY read-only queries and reachability checks. It
   performs no writes / updates, and it does NOT actually run the audited
   skill's business workflow.
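The reply handling described in the prompt can be sketched as follows. This is a minimal illustration, not part of the skill itself; the function name `parse_depth` is hypothetical, and returning `None` for an unrecognized reply is an assumption so that the caller asks again rather than guessing.

```python
from typing import Optional


def parse_depth(reply: str) -> Optional[str]:
    """Map a user reply to "L1" or "L2"; return None when the reply is not recognized."""
    choice = reply.strip().lower()
    if choice == "":            # empty Enter selects the recommended default (L2 dryRun)
        return "L2"
    if choice in {"1", "l1"}:
        return "L1"
    if choice in {"2", "l2"}:
        return "L2"
    return None                 # assumption: the caller re-asks instead of picking a depth
```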

Step 1: Locate the skill directory

The user may provide:

  • A skill name (e.g. my-skill) → look for a same-named folder under <skills-dir>/. <skills-dir> is the agent's skills directory and may be e.g. ~/.claude/skills/, ~/.openclaw/workspace/skills/, or any path the user specifies. Some agents use a different layout — adjust to what's actually on disk.
  • A relative or absolute path (e.g. skills/my-skill/) → use directly.

Then confirm the directory exists and preview the head of its SKILL.md:

ls {skill-path}/
head -n 20 {skill-path}/SKILL.md

If the directory cannot be found → tell the user and stop. Do not guess.
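The lookup order above might look like this in Python. It is a sketch under stated assumptions: `locate_skill` and its `skills_dirs` parameter are hypothetical names, and the caller is expected to supply the candidate skills directories (for example `~/.claude/skills/`). A `None` result corresponds to "tell the user and stop".

```python
from pathlib import Path
from typing import Iterable, Optional


def locate_skill(user_input: str, skills_dirs: Iterable[Path]) -> Optional[Path]:
    """Resolve a path or a bare skill name to an existing directory, or return None."""
    candidate = Path(user_input).expanduser()
    if candidate.is_dir():              # relative or absolute path: use directly
        return candidate
    if "/" in user_input:               # looked like a path but does not exist: do not guess
        return None
    for root in skills_dirs:            # bare name: look for a same-named folder
        by_name = Path(root).expanduser() / user_input
        if by_name.is_dir():
            return by_name
    return None
```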


Step 2: Static analysis (runs at every depth)

Execute each check defined in references/check-rules.md in order.

⚖️ Determinism guarantee: each rule's hit/miss decision uses the grep pattern, keyword list, and numeric thresholds defined for it in check-rules.md — not the agent's subjective judgement. Edge cases not explicitly covered by a rule are handled by the "False-Positive General Rules" section (marked "manual verification needed", not hard-judged). This guarantees stable, repeatable results across different agents / re-runs.

2.1 Collect file list

# Script extensions must cover mixed-language skills: .js/.cjs/.mjs/.ts must not be missed
find {skill-path} -type f \( \
  -name "*.py" -o -name "*.sh" -o -name "*.md" -o -name "*.json" -o -name "*.yaml" \
  -o -name "*.js" -o -name "*.cjs" -o -name "*.mjs" -o -name "*.ts" \) \
  | grep -vE '/(__pycache__|node_modules|\.git)/'
# The exclusion matches whole directory components; an unescaped ".git" pattern
# would also drop unrelated files such as digit.py or legit.md.

⚠️ Extension coverage blind-spot: find -name "*.js" does not match .cjs / .mjs / .ts. Python scripts often subprocess-call a sibling xxx.cjs — if the file list misses .cjs, the auditor will wrongly report "called script does not exist" (false positive on D6-E4 / D6-E6). All later extension-scoped scans must include the full set.
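For agents that prefer to build the file list in Python, the same collection can be sketched with `pathlib`. The names `SCAN_EXTENSIONS`, `EXCLUDED_DIRS`, and `collect_files` are hypothetical; the extension set and the excluded directories are taken from the `find` command above. Filtering on path components (rather than substrings) avoids accidentally excluding files whose names merely contain "git".

```python
from pathlib import Path
from typing import List

# Full extension set from the find command above, including .cjs / .mjs / .ts.
SCAN_EXTENSIONS = {".py", ".sh", ".md", ".json", ".yaml", ".js", ".cjs", ".mjs", ".ts"}
EXCLUDED_DIRS = {"__pycache__", "node_modules", ".git"}


def collect_files(skill_path: Path) -> List[Path]:
    """Return scannable files under skill_path, skipping excluded directories."""
    files = []
    for path in sorted(skill_path.rglob("*")):
        parent_parts = path.relative_to(skill_path).parts[:-1]
        if any(part in EXCLUDED_DIRS for part in parent_parts):
            continue                      # inside __pycache__, node_modules, or .git
        if path.is_file() and path.suffix in SCAN_EXTENSIONS:
            files.append(path)
    return files
```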

2.2 Per-dimension static scan

Execute in order: D1 → D2 → D3 → D4 → D5 → D6.

ℹ️ D7 is not in this step: D7 (dependencies & footprint) needs the code stats (see 2.4) plus Hub / existence checks, so it is consolidated into Step 4. Step 2 only scans D1–D6.

Execution-level convention:

  • Rule title contains L1 → runs at all depths.
  • Rule title contains L2 dryRun → runs only at L2 dryRun; for L1, mark as ➖ skipped (L2 dryRun item).
  • D4-E5 scan must exclude {skill-path}/AUDIT-*.md — audit reports are produced by this tool itself and are not part of the audited skill's package.

For each rule:

  1. Check whether its execution level matches the current depth; if not, record ➖ skipped.
  2. Run the corresponding scan command (grep / regex / file parse / agent-read judgement).
  3. Record: pass ✅ / fail ❌ / skipped ➖.
  4. Accumulate deductions.
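The per-rule loop can be sketched as below. This is illustrative only: the real rules, IDs, and point values live in references/check-rules.md, and the `Rule` type, the `check` callable, and the status strings are assumptions made for the sketch.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Rule(NamedTuple):
    rule_id: str                 # hypothetical ID, used only in this sketch
    title: str                   # contains "L1" or "L2 dryRun", per the convention above
    deduction: int               # points deducted when the check fails
    check: Callable[[], bool]    # returns True when the rule passes


def run_rules(rules: List[Rule], depth: str) -> Tuple[Dict[str, str], int]:
    """Run rules for the chosen depth ("L1" or "L2"); return statuses and total deduction."""
    statuses: Dict[str, str] = {}
    deductions = 0
    for rule in rules:
        if "L2 dryRun" in rule.title and depth != "L2":
            statuses[rule.rule_id] = "skipped"    # L2-only rule during an L1 audit
            continue
        if rule.check():
            statuses[rule.rule_id] = "pass"
        else:
            statuses[rule.rule_id] = "fail"
            deductions += rule.deduction
    return statuses, deductions
```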

2.3 D6-E1 script syntax check

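# Python: compile in memory. `python3 -m py_compile` would write __pycache__/*.pyc
# into the audited skill, which breaks the read-only red line.
find {skill-path}/scripts -type f -name "*.py" -print0 2>/dev/null | while IFS= read -r -d '' f; do
  python3 -c 'import sys; compile(open(sys.argv[1], "rb").read(), sys.argv[1], "exec")' "$f" 2>&1 \
    && echo "OK: $f" || echo "SYNTAX ERR: $f"
done

# Shell: bash -n parses the script without executing it
find {skill-path}/scripts -type f -name "*.sh" -print0 2>/dev/null | while IFS= read -r -d '' f; do
  bash -n "$f" 2>&1 && echo "OK: $f" || echo "SYNTAX ERR: $f"
done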

2.4 Code-size stats (prerequisite for D7)

# Number of script files (covers mixed skills: .js/.cjs/.mjs/.ts)
find {skill-path}/scripts -type f \( -name "*.py" -o -name "*.sh" -o -name "*.js" -o -name "*.cjs" -o -name "*.mjs" -o -name "*.ts" \) 2>/dev/null | grep -v node_modules | wc -l

# Total line count (cat + wc gives a single total and cannot hang on no-match;
# filenames with spaces are handled by find itself)
find {skill-path} -type f \( -name "*.py" -o -name "*.sh" -o -name "*.js" -o -name "*.cjs" -o -name "*.mjs" -o -name "*.ts" \) ! -path "*/node_modules/*" -exec cat {} + 2>/dev/null | wc -l

# Skill-on-skill dependency: precise extraction (see D7-W2 "three-step join" algorithm)

# ① List all suspicious import candidates (just module names; ownership is resolved later)
grep -rnE "^\s*(from [a-zA-Z_][a-zA-Z0-9_]* import|import [a-zA-Z_][a-zA-Z0-9_]*)" {skill-path}/scripts/ 2>/dev/null
# ① supplementary: look for sys.path injection / skill_root concatenation
#    (this is the physical evidence of which skill an import belongs to)
grep -rnE "sys\.path\.insert.*skills/|_skill_root|skills/[a-z-]+/scripts" {skill-path}/scripts/ 2>/dev/null

# ② subprocess calls into other skills' scripts (by path)
grep -rnE "skills/[a-z-]+/scripts|_skill_root.*scripts" {skill-path} 2>/dev/null | grep -v __pycache__
# ③ Explicit declaration in SKILL.md
grep -nE "metadata.*requires|depends on .* skill|requires the .* skill|use .* skill" {skill-path}/SKILL.md 2>/dev/null
# → Agent then deduplicates, applies the three-step join to fix ownership, annotates purpose,
#   runs the existence check (D7-W2), and writes the result into report section
#   "VI. Skill Dependencies".
# → Stdlib and well-known PyPI packages (os/sys/json/re/requests/openpyxl …) are excluded
#   from ownership judgement.
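The exclusion of stdlib and well-known packages from the import candidates can be sketched in Python. This is only a first-pass filter, not the "three-step join" itself (that algorithm is defined in check-rules.md). `import_candidates` and `WELL_KNOWN_PYPI` are hypothetical names, the PyPI set holds just the examples named above, and `sys.stdlib_module_names` requires Python 3.10+ (a small fallback set covers older interpreters). The regex mirrors the grep pattern in ①, so dotted `from a.b import c` forms are not matched.

```python
import re
import sys
from typing import Set

# Mirrors the grep pattern in step ①: top-level module name only.
IMPORT_RE = re.compile(
    r"^\s*(?:from\s+([A-Za-z_][A-Za-z0-9_]*)\s+import|import\s+([A-Za-z_][A-Za-z0-9_]*))"
)
WELL_KNOWN_PYPI = {"requests", "openpyxl"}          # illustrative, not exhaustive
STDLIB = set(getattr(sys, "stdlib_module_names", ())) | {"os", "sys", "json", "re"}


def import_candidates(source: str) -> Set[str]:
    """Return imported module names that may belong to another skill."""
    found = set()
    for line in source.splitlines():
        match = IMPORT_RE.match(line)
        if not match:
            continue
        name = match.group(1) or match.group(2)
        if name not in STDLIB and name not in WELL_KNOWN_PYPI:
            found.add(name)
    return found
```

Each remaining name still needs its ownership resolved (for example from the `sys.path` evidence collected above) before it is reported as a skill dependency.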

Step 3: Hub existence check (runs at L2 dryRun)

Pre-check: this step requires a skill-hub query tool (e.g. clawhub). If unavailable → skip Hub checks; mark D7-W1 as "cannot verify (no hub tooling)", downgrade to WARN, do not abort.

  1. Extract the name field from frontmatter.
  2. Use the available skill-hub query tool to check whether the skill is already published.
  3. Record the result against D7-W1 (not published → WARN, not ERR).
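Extracting the name field can be done without a YAML dependency. The sketch below assumes the frontmatter is a block delimited by `---` lines at the top of SKILL.md and that `name` is a simple top-level `key: value` pair; `frontmatter_name` is a hypothetical helper. When it returns `None`, the result should be reported as "cannot confirm" rather than guessed.

```python
from typing import Optional


def frontmatter_name(skill_md_text: str) -> Optional[str]:
    """Return the top-level `name` value from a ----delimited frontmatter block, if any."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None                       # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":         # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key == "name":         # unindented key only, so nested keys are ignored
            return value.strip().strip("\"'") or None
    return None
```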

Step 4: Dependency & footprint analysis (D7)

  • D7-W1 Hub publish status (consolidated from Step 3).
  • D7-W2 Precise dependency-skill list + purpose annotation + existence check. Each dependency gets one status: local ✅ (installed locally), hub-has-not-installed ⚠️ (published on the hub but not installed locally), or not-found ❌ (neither local nor on the hub). Always output the full list in report section "VI. Skill Dependencies", regardless of count. Severity: ≥ 5 deps → WARN; any "not-found ❌" dependency → ERR; any "hub-has-not-installed ⚠️" dependency → WARN.
  • D7-W3 Code ≥ 5000 lines or scripts ≥ 10 → identify high-cohesion modules and suggest a split direction.
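The D7 severity rules above can be sketched as a small decision function. The status strings follow the labels in the list; the function name `d7_findings`, the tuple layout, and treating D7-W3 as a WARN (inferred from its "W" ID) are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def d7_findings(dep_status: Dict[str, str], code_lines: int,
                script_count: int) -> List[Tuple[str, str, str]]:
    """Return (rule_id, severity, message) tuples for D7-W2 and D7-W3."""
    findings = []
    for name, status in sorted(dep_status.items()):
        if status == "not-found":
            findings.append(("D7-W2", "ERR", f"{name}: not found locally or on the hub"))
        elif status == "hub-has-not-installed":
            findings.append(("D7-W2", "WARN", f"{name}: on the hub but not installed"))
    if len(dep_status) >= 5:
        findings.append(("D7-W2", "WARN", "5 or more skill dependencies"))
    if code_lines >= 5000 or script_count >= 10:
        findings.append(("D7-W3", "WARN", "large footprint: suggest a split direction"))
    return findings
```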

Step 5: Aggregate scoring

Total 115 points

| Dimension | Max |
| --- | --- |
| D1 Process closure & idempotency | 13 |
| D2 Tool & command conventions | 10 |
| D3 Portability & defense | 15 |
| D4 Skill usability conventions | 21 |
| D5 Security & op risk | 21 |
| D6 Code & doc quality | 31 |
| D7 Dependency & footprint health | 4 |
| Total | 115 |

📊 Scoring convention: ERR is uniformly 3 points (a hit means FAIL; the point value carries no real meaning). WARN uses three priority tiers (high 3 / mid 2 / low 1) — the difference is meant to guide fix order.
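As a worked illustration of the point values, the deductions for a set of findings can be tallied like this. The finding representation (a severity plus a WARN tier) and the names used are hypothetical; only the point values come from the convention above.

```python
from typing import Iterable, Optional, Tuple

ERR_POINTS = 3
WARN_POINTS = {"high": 3, "mid": 2, "low": 1}


def total_deduction(findings: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Sum deductions for (severity, tier) pairs; the tier is ignored for ERR."""
    total = 0
    for severity, tier in findings:
        if severity == "ERR":
            total += ERR_POINTS
        elif severity == "WARN":
            total += WARN_POINTS[tier]
        else:
            raise ValueError(f"unknown severity: {severity}")
    return total
```

For example, one ERR plus a high-priority and a low-priority WARN deduct 3 + 3 + 1 = 7 points, and the ERR alone already forces a FAIL.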

Dual-judgement (both conditions must hold for PASS):

Pass line is uniformly 90 at both depths (skipped items don't count toward the actual max but don't change the pass line):

| Depth | Actual max | Pass line |
| --- | --- | --- |
| L1 static | 112 | ≥ 90 |
| L2 dryRun | 115 | ≥ 90 |

| Condition | Result |
| --- | --- |
| Total ≥ pass line AND zero ERR | PASS |
| Any ERR, OR total < pass line | FAIL |
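The dual-judgement can be expressed in a few lines. This sketch assumes the total is computed as the depth's actual max minus the accumulated deductions; the names `DEPTH_MAX`, `PASS_LINE`, and `verdict` are hypothetical.

```python
from typing import Tuple

DEPTH_MAX = {"L1": 112, "L2": 115}   # actual max per depth, from the table above
PASS_LINE = 90                       # uniform across both depths


def verdict(depth: str, deductions: int, err_count: int) -> Tuple[int, str]:
    """Return (score, "PASS" or "FAIL"); both conditions must hold for PASS."""
    score = DEPTH_MAX[depth] - deductions      # assumption: score = actual max - deductions
    passed = score >= PASS_LINE and err_count == 0
    return score, "PASS" if passed else "FAIL"
```

Note that a single ERR fails the audit even when the score stays well above 90.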

Step 6: Produce the scorecard MD file

Generate the full report using references/output-template.md.

Write path: {skill-path}/AUDIT-{YYYY-MM-DD}.md

AUDIT-*.md should not be packaged with the skill (D4-E5 will detect this).
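A small sketch of the report naming, plus a helper for recognizing earlier reports by the AUDIT-*.md pattern. Both function names are hypothetical; the date is passed in explicitly so the result is reproducible.

```python
from datetime import date
from fnmatch import fnmatch
from pathlib import Path


def report_path(skill_path: Path, today: date) -> Path:
    """Build {skill-path}/AUDIT-{YYYY-MM-DD}.md for the given date."""
    return skill_path / f"AUDIT-{today.isoformat()}.md"


def is_audit_report(path: Path) -> bool:
    """True for files produced by this tool (AUDIT-*.md)."""
    return fnmatch(path.name, "AUDIT-*.md")
```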


Step 7: Output summary

📋 Audit complete: {skill-name}
─────────────────────────────────────
Total score: {score}/{max}   {PASS ✅ / FAIL ❌}  (L1 max 112 / L2 dryRun max 115)
Pass line:   ≥ 90 (uniform across L1 / L2 dryRun)  AND  zero ERR (dual-judgement)
Depth:       {L1 static / L2 dryRun}

🔴 ERR: {n}   |   🟡 WARN: {n}
Highest-priority fix: {ID and name of the highest-deduction ERR}

Estimated score after fixing all ERR: {estimated}/{max}

📁 Scorecard: {skill-path}/AUDIT-{date}.md

🔧 Fix: {N} items auto-fixable / {M} items need human confirmation
   Reply "fix" to start auto-fix (the skill folder is backed up first).

Step 8: Auto-fix (--fix) behavior spec

⚠️ This is the only step in skill-deep-audit that is allowed to modify the audited skill's files, and only after explicit user authorization. The day-to-day audit (Steps 0–7) strictly observes the "audit-only, never fix" red line.

Trigger conditions

  • The user explicitly replies "fix", "apply fix", "--fix", "fix 5.1", etc. after the report is delivered.
  • Without explicit user authorization, never auto-fix. The report only recommends; it does not execute.

Fix scope tiers (corresponds to report section "V. Fix Recommendations")

| Sub-section | Type | Auto-fix? |
| --- | --- | --- |
| 5.1 Auto-fixable | Pure text / config / docs (add version, add prerequisites, edit wording, add dependency declaration, normalize reference prefixes; no business logic) | ✅ User says "fix" → batch apply |
| 5.2 Needs human confirmation | Business logic / script code (change control flow, change field matching, change HTTP call, change column mapping, remove over-privileged steps) | ⚠️ Must confirm each item with the user; user approves one → fix one |

Execution flow (strict order)

  1. Mandatory pre-fix backup:

    • Copy the entire audited skill directory to a backup path: {skill-path}.bak-{YYYYMMDD-HHMMSS}
    • Immediately tell the user the full backup path.
    • If backup fails → abort the fix, don't touch anything.
    BACKUP="{skill-path}.bak-$(date +%Y%m%d-%H%M%S)"
    cp -r "{skill-path}" "$BACKUP" && echo "✅ Backed up to $BACKUP" \
      || { echo "❌ Backup failed, aborting fix" >&2; exit 1; }
    
  2. Apply fixes item by item:

    • 5.1 items: edit directly per the report's "③ Fix" section. After each change, briefly report ✅ Fixed [ID].
    • 5.2 items: only change after the user explicitly confirms that item; items the user hasn't approved are not touched.
  3. Do not auto-re-audit:

    • After fixes, prompt the user: 🔧 Fixed {n} items. Re-run the audit now to verify? (reply "re-audit" to start)
    • Wait for the user to confirm "re-audit" before re-running Steps 0–7.
  4. Fix record: in the report or reply, list "which files / which items were changed + backup path" so the user can roll back.
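The mandatory pre-fix backup can also be sketched in Python. `backup_skill` is a hypothetical helper; it relies on `shutil.copytree` raising an exception when the destination already exists or the copy fails, which maps onto "if backup fails → abort the fix". The timestamp is passed in so the backup path is predictable.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_skill(skill_path: Path, now: datetime) -> Path:
    """Copy the whole skill directory to {skill-path}.bak-{YYYYMMDD-HHMMSS}."""
    source = skill_path.resolve()
    backup = source.with_name(f"{source.name}.bak-{now:%Y%m%d-%H%M%S}")
    shutil.copytree(source, backup)   # raises on failure; the caller must then abort the fix
    return backup
```

The returned path is what should be reported to the user immediately and repeated in the fix record.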

Red lines (also apply during fix)

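  • Do not execute any write operation of the audited skill. Fixes edit the skill's files only; they never run its scripts or business workflow.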
  • Do not delete any file (even if it looks redundant); if deletion is needed ask the user separately.
  • 5.2 business-logic items are never changed unilaterally, even if they "look safe".
  • If the user wants to roll back after fix: instruct the user to restore by copying the backup directory over.

Part of build-better-skills

This skill is part of the build-better-skills suite — open-source skills that help you build better skills, end-to-end:

| Stage | Skill | Status | What it does |
| --- | --- | --- | --- |
| Creation | skill-creator | 🚧 Not yet released | Scaffold a new skill from intent |
| Audit | glic-check | v1.0.x | Fast, qualitative multi-dimension review (G/L/I/C + U); run right after any edit |
| Audit | skill-deep-audit | v1.0.0 | Comprehensive dryRun-level exam: 7 dimensions, 115-pt score, --fix |
| Testing | skill-regression | 🚧 Not yet released | End-to-end regression testing |
| Sediment | skill-sediment | 🚧 Not yet released | Promote successful workflows into new skills |

Two complementary tools share the Audit stage:

  • glic-check — lightweight, qualitative. Run it right after a change for a quick multi-dimension sanity review (no score). Best for tight edit loops.
  • skill-deep-audit (this skill) — heavyweight, quantitative. A full dryRun-level evaluation that grades the skill on a 115-point scale with ERR/WARN findings and a scorecard. Best as a pre-ship "final exam".

Only glic-check and skill-deep-audit ship today. The other entries are roadmap placeholders — they will appear in the suite repo as they are open-sourced.

Rule references