skill-evaluation
Review
Audited by ClawScan on May 14, 2026.
Overview
The visible package is a coherent skill-testing toolkit, with the main caution that evaluations and trigger probes should be run in a sandbox because they can invoke local AI tools and record outputs.
This skill appears safe to use for evaluating other skills, provided you follow its own sandbox-first guidance. Use disposable workspaces, mock external dependencies, avoid real credentials or private data, and run optional helper scripts only when you intend to test trigger behavior with your local AI platform tools.
Findings (4)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
If a user evaluates an untrusted skill outside a sandbox, that target skill could modify files, call APIs, or perform browser actions.
The skill is designed to execute or observe target skills that may use mutating tools, but it explicitly sets a sandbox and approval boundary.
Enable approval mode — require human confirmation for all mutating tool calls (file writes, API calls, browser actions, shell commands) ... these MUST be sandboxed or mocked.
Run evaluations only in disposable workspaces with test data/accounts, keep approvals enabled, and mock external systems.
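The disposable-workspace recommendation above can be sketched with Python's standard `tempfile` module; the `evaluate` callable and the directory prefix are illustrative assumptions, not part of the skill itself:

```python
import tempfile
from pathlib import Path

def evaluate_in_sandbox(evaluate):
    # Run the evaluation callable inside a throwaway directory that is
    # deleted when the context exits, so any file writes made by the
    # target skill are contained and cleaned up automatically.
    with tempfile.TemporaryDirectory(prefix="skill-eval-") as workdir:
        return evaluate(Path(workdir))
```

This contains file writes only; network calls and API access still need mocking or approval gating, as the finding notes.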
Running the trigger evaluator can consume local AI-provider quota and executes the local platform tool in the current project context.
The optional trigger evaluator launches a local AI platform CLI subprocess to test whether a skill triggers. It uses argument lists rather than shell execution, which is aligned with the trigger-evaluation purpose.
cmd = ["claude", "-p", query, "--output-format", "stream-json", ...] ... process = subprocess.Popen(cmd, ... cwd=str(project_root), env=env)
Only run the trigger evaluator intentionally, preferably in a test project, and verify the local CLI/account being used.
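The argument-list invocation quoted in this finding avoids shell interpretation of the query string. A minimal, hedged illustration of that pattern follows; it substitutes a harmless Python subprocess for the real `claude` CLI, so the helper name and demo command are assumptions:

```python
import subprocess
import sys

def run_probe(cmd, workdir=None):
    # cmd is an argument list, not a shell string: each element is passed
    # to the child as a separate argv entry, so shell metacharacters in a
    # probe query are never interpreted by a shell.
    result = subprocess.run(
        cmd, cwd=workdir, capture_output=True, text=True, timeout=60
    )
    return result.stdout

# Demo: a query containing shell metacharacters is echoed back verbatim
# instead of being split or executed.
query = "does this trigger? && echo injected"
output = run_probe([sys.executable, "-c", "import sys; print(sys.argv[1])", query])
```

Because no shell is involved, `&&` in the query arrives as literal text in `sys.argv[1]` rather than chaining a second command.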
The local platform CLI may run under the user's existing account/session when trigger probes are executed.
The helper subprocess inherits the user's environment, which may include platform credentials or configuration expected by local AI CLIs. The artifacts do not show logging or exfiltration of those values.
env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"} ... subprocess.Popen(..., env=env)
Use a test account or limited environment for evaluations of untrusted prompts or skills, and avoid exposing unnecessary environment secrets.
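A stricter variant of the quoted filter is an allowlist rather than a single-key blocklist: pass only the variables the child CLI actually needs, so API keys and other secrets are not inherited by the probe subprocess. The prefix list below is an assumption about a minimal CLI environment, not taken from the skill:

```python
import os

# Assumed minimal allowlist; adjust for what the local CLI requires.
SAFE_PREFIXES = ("PATH", "HOME", "LANG", "TMP", "TEMP")

def filtered_env(extra_blocked=("CLAUDECODE",)):
    # Keep only variables whose names start with an allowed prefix,
    # and still honor the skill's own CLAUDECODE exclusion.
    return {
        k: v
        for k, v in os.environ.items()
        if k.startswith(SAFE_PREFIXES) and k not in extra_blocked
    }
```

The resulting dict can be passed as `env=` to `subprocess.Popen`, replacing the copy-everything-except-one-key approach shown in the evidence.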
Generated reports may contain target-skill outputs, tool observations, or sample data that should not be shared if sensitive.
The evaluation schema stores actual outputs and case details in run artifacts, which is expected for reporting but may retain sensitive data if real inputs are used.
## runs/run-{date}-v{N}/results.json ... "actual": "GET /api/users/123 -> {name, email, role, avatar}"
Use mock data for tests and review generated JSON/HTML reports before sharing or committing them.
