Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Skill Eval

Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_history.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Suspicious (view report)
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the included files and runtime instructions: the repository contains resolver and analysis scripts, example evals, and a SKILL.md that instructs the agent to spawn subagents and run local Python analysis. The requested actions (reading skill paths, running trigger/quality/model workflows, writing per-iteration workspaces) are coherent with an evaluation framework.
Instruction Scope
Runtime instructions explicitly tell the agent to read ~/.openclaw/openclaw.json to locate skill directories, call sessions_spawn and sessions_history, run local Python scripts via exec, and write full evaluation data to eval-workspace/. Those actions are expected for an eval tool, but they grant the skill access to user config and full conversation histories (including tool calls/results).
Install Mechanism
No install spec is present (instruction-only skill). Scripts are bundled in the repo and meant to be run locally via exec. There are no remote downloads or installers referenced in SKILL.md; requirements.txt lists requests but analysis scripts are documented as offline. This is low install risk.
Credentials
The skill declares no required env vars or credentials (good). However, it reads ~/.openclaw/openclaw.json and requires subagents to use sandbox="inherit" in spawn calls, which means the spawned sessions may inherit the main agent's registration environment/skill context. While not an explicit credential request, this can expose the same runtime environment to subagents — the behavior is explainable by the tool's purpose but worth noting.
Persistence & Privilege
Workflows require cleanup="keep" and saving full_history.json / raw transcripts to eval-workspace/<skill>/iter-N/. Persisting full session histories (tool_use + tool_result) can retain sensitive data (API keys, tokens, user-provided secrets) if any eval touches them. Combined with sandbox="inherit", retained histories may contain environment-derived data. This is expected for an evaluation tool but represents a real privacy/storage risk that users must manage.
Assessment
This skill is internally consistent with its stated purpose, but take these precautions before running it:

  • Review SKILL.md and the bundled scripts (especially anything that writes files) so you understand what will be stored under eval-workspace/.
  • Don't run evaluations against skills or prompts that will surface sensitive credentials or personal data; persisted histories include tool calls and tool results and may capture secrets.
  • Because the workflow uses sandbox="inherit" and cleanup="keep", spawned subagents inherit the main agent environment and histories are retained — consider running in a disposable/test account or environment if you have any sensitive registrations or tokens available to your agent.
  • If you need to test locally, create a clean ~/.openclaw/openclaw.json or ensure skills.load.extraDirs points only to safe directories; the resolver reads that file to find skill paths.
  • The skill does not auto-install anything (fake-tool requires manual copy + gateway restart), and there are no remote downloads in SKILL.md — still, inspect any scripts before executing them in your environment.

If you want to proceed safely: run evaluations on a non-production agent, delete eval-workspace/ after reviewing results, and avoid exposing real credentials during tests.
Patterns worth reviewing

These patterns may indicate risky behavior. Check the VirusTotal and OpenClaw results above for context-aware analysis before installing.

Dynamic code execution detected in:

  • scripts/analyze_latency.py:219
  • scripts/analyze_model_compare.py:330
  • scripts/analyze_quality.py:210
  • scripts/analyze_triggers.py:243
  • scripts/build_evals_with_context.py:89
  • scripts/legacy/run_compare.py:91
  • scripts/legacy/run_diagnostics.py:605
  • scripts/legacy/run_latency_profile.py:495


Current version: v1.1.1

Tags: benchmark · eval · latest · testing

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

🔬 Clawdis

SKILL.md

openclaw-eval-skill

Evaluation framework for any OpenClaw skill. No claude CLI dependency — all agent execution runs through sessions_spawn + sessions_history.

Scope: Works with CLI tool skills, conversational skills, and API integration skills.


Runtime Actions Disclosure

This skill performs the following actions during evaluation:

| Action | Purpose | When |
|---|---|---|
| Read ~/.openclaw/openclaw.json | Find skill directories (extraDirs) | Path resolution |
| Write to eval-workspace/ | Store evaluation results | Every eval run |
| Call sessions_spawn | Run test queries in isolated sessions | Trigger & quality tests |
| Call sessions_history | Collect conversation data for analysis | After each spawn |
| Persist cleanup="keep" sessions | Required for trigger detection | Trigger rate tests |

NOT performed automatically: Gateway restart, config modification, skill installation. These require manual user action (see "Bundled Test Skill" section).


Quick Eval

Just say:

evaluate weather

The agent will:

  1. Run scripts/resolve_paths.py weather to find all paths
  2. Execute trigger rate + quality compare with detected evals
  3. Output results to eval-workspace/weather/iter-N/

Options:

  • evaluate weather trigger — trigger rate only
  • evaluate weather quality — quality compare only
  • evaluate github --mode all — explicit mode

What gets auto-detected:

  • Skill path: from OpenClaw built-in skills or registered extraDirs
  • Evals: from evals/{skill-name}/ or fallback to evals/example-*.json
  • Output: next available iter-N directory
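The iter-N auto-detection step can be sketched as follows. This is a minimal illustration under stated assumptions; the hypothetical `next_iteration_dir` helper is not part of the bundled scripts.

```python
import re
from pathlib import Path

def next_iteration_dir(workspace: Path) -> Path:
    """Return the next unused iter-N directory under eval-workspace/<skill>/."""
    pattern = re.compile(r"iter-(\d+)$")
    taken = [
        int(m.group(1))
        for p in workspace.glob("iter-*")
        if (m := pattern.search(p.name))
    ]
    # iter-1 for a fresh workspace, otherwise one past the highest existing N
    return workspace / f"iter-{max(taken, default=0) + 1}"
```

Note that gaps are not reused: with iter-1 and iter-3 present, the next directory is iter-4.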

First step for agent: Run the resolver to get paths:

python scripts/resolve_paths.py {skill-name} --mode {trigger|quality|all}

Use the JSON output to fill in paths for the workflows below.
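A minimal sketch of consuming that JSON output. The field names skill_path, evals_path, and output_dir are assumptions for illustration; check resolve_paths.py for the actual schema.

```python
import json

# Hypothetical field names -- check resolve_paths.py for the real schema.
REQUIRED_KEYS = ("skill_path", "evals_path", "output_dir")

def parse_resolver_output(raw: str) -> dict:
    """Parse resolve_paths.py JSON output and fail fast on missing fields."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise KeyError(f"resolver output missing keys: {missing}")
    return data
```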


Bundled Test Skill: fake-tool

A test skill (test-skills/fake-tool/) is included for validating trigger rate detection. It simulates a fictional "Zephyr API" that models cannot know from training.

Manual setup required: The agent will NOT automatically install fake-tool or restart your gateway. If you want to test with fake-tool:

  1. Copy fake-tool to your skills directory:

    cp -r test-skills/fake-tool ~/.openclaw/workspace/skills/
    
  2. Restart OpenClaw gateway (from terminal):

    openclaw gateway restart
    
  3. Verify registration:

    python scripts/resolve_paths.py fake-tool
    

If step 3 returns a valid path, fake-tool is ready. If "not found", check that your ~/.openclaw/openclaw.json includes the skills directory in skills.load.extraDirs.
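The same check can be done programmatically, assuming the skills.load.extraDirs key path described above. This helper is illustrative, not bundled with the skill.

```python
import json
from pathlib import Path

def skills_dir_registered(config_path: Path, skills_dir: Path) -> bool:
    """Return True if skills_dir is listed in skills.load.extraDirs of the
    OpenClaw config (the key path SKILL.md says the resolver reads)."""
    config = json.loads(Path(config_path).read_text())
    extra_dirs = config.get("skills", {}).get("load", {}).get("extraDirs", [])
    target = Path(skills_dir).expanduser().resolve()
    return any(Path(d).expanduser().resolve() == target for d in extra_dirs)
```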


Evaluation Scenarios

Tier 1: Core (Always Run)

| Scenario | What It Tests | Output |
|---|---|---|
| Trigger Rate | Does description trigger SKILL.md reads at the right times? Includes positive (should trigger) AND negative (should NOT trigger) cases. | recall, specificity, precision, F1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | quality_score, assertion pass rate |
| Description Diagnosis | Why did triggers fail? Analyzes both false negatives AND false positives. | gap analysis, recommendations |

Tier 2: Optional (Run When Needed)

| Scenario | What It Tests | When to Use |
|---|---|---|
| Model Comparison | Quality + speed across haiku/sonnet/opus | Before deployment: which model is enough? |
| Efficiency Profile | Response time + retry patterns | When skill feels slow: is agent walking wrong paths? |

Tier 3: Future (Roadmap)

| Scenario | What It Tests | Status |
|---|---|---|
| Cross-skill Conflict | Two skills with overlapping descriptions | Planned |
| Error Recovery | Does agent recover when CLI fails? | Planned |

How This Skill Works

Two-layer architecture:

Layer 1: Agent (main OpenClaw session) — YOU ARE HERE
  → Reads evals.json
  → Calls sessions_spawn to run subagents
  → Calls sessions_history to collect results
  → Writes raw data to workspace/

Layer 2: Python analysis scripts (run via exec)
  → Read the raw data from workspace/
  → Compute statistics
  → Generate reports

Python scripts (analyze_*.py) are data processors — they cannot call sessions_spawn. The agent drives the workflow.
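A Layer 2 script boils down to this shape. An illustrative sketch only: the bundled analyze_*.py scripts do much more, and the "triggered" field here stands in for real history parsing.

```python
import json
from pathlib import Path

def summarize_raw_histories(raw_dir: Path) -> dict:
    """Read the per-session JSON files the agent wrote under raw/,
    compute simple statistics, and return a report dict."""
    sessions = [json.loads(p.read_text()) for p in sorted(Path(raw_dir).glob("*.json"))]
    triggered = sum(1 for s in sessions if s.get("triggered"))
    return {
        "sessions": len(sessions),
        "triggered": triggered,
        "trigger_rate": triggered / len(sessions) if sessions else 0.0,
    }
```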


Usage

Follow USAGE.md for all workflows.

Quick reference:

| Workflow | What It Tests | USAGE.md Section |
|---|---|---|
| Trigger Rate | Does description trigger SKILL.md reads at the right times? | Workflow 1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | Workflow 2 |
| Model Comparison | Quality + speed across haiku/sonnet/opus | Workflow 3 |
| Latency Profile | Response time p50/p90 | Workflow 4 |

Each workflow follows the same pattern:

  1. Agent spawns subagents using sessions_spawn
  2. Agent collects histories using sessions_history
  3. Agent writes raw data to workspace/{skill}/iter-{n}/raw/
  4. Agent runs analysis script via exec

Core Principles

  1. Never modify the evaluated skill — observe only, give recommendations
  2. Keep eval records in workspace — output goes to eval-workspace/<skill-name>/iter-N/
  3. Keep full records — save full_history.json (including tool_use + tool_result)

agents/ Reference

| File | Purpose | When to Use |
|---|---|---|
| grader.md | Check assertions, record behavior anomalies, give priority recommendations | Required for every Quality Compare eval |
| comparator.md | Blind A/B comparison without assertions | When unbiased comparison is needed |
| analyzer.md | Analyze cross-eval patterns after all evals complete | Post-analysis |

Directory Structure

eval-workspace/<skill-name>/
├── evals.json                    ← Eval definition (shared across iterations)
└── iter-1/
    ├── raw/
    │   ├── histories/            ← Trigger test session histories
    │   └── transcripts/          ← Quality compare transcripts
    ├── trigger_results.json      ← analyze_triggers output
    ├── quality_results.json      ← analyze_quality output
    └── diagnostics/
        └── RECOMMENDATIONS.md

evals.json Format

Quality Compare (prompt + assertions):

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "name": "onboarding-fresh",
      "prompt": "Check the weather in Tokyo",
      "context": "Clean machine, no prior setup. For grader only.",
      "expected_output": "Install → configure → verify profile",
      "assertions": [
        {
          "id": "a1-1",
          "description": "Install command executed",
          "type": "output_contains",
          "value": "pip install"
        },
        {
          "id": "a1-2",
          "description": "Profile verified after setup",
          "type": "output_contains",
          "value": "profile current",
          "priority": true
        }
      ]
    }
  ]
}
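Loading and lightly validating this format can be sketched as follows (illustrative; the bundled scripts may validate differently):

```python
import json

def load_evals(path):
    """Load an evals.json file and lightly validate the quality-compare
    shape shown above."""
    with open(path) as f:
        spec = json.load(f)
    if "skill_name" not in spec or not isinstance(spec.get("evals"), list):
        raise ValueError("evals.json needs 'skill_name' and an 'evals' list")
    for ev in spec["evals"]:
        for key in ("id", "name", "prompt", "assertions"):
            if key not in ev:
                raise ValueError(f"eval {ev.get('id', '?')} missing '{key}'")
    return spec
```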

Trigger Rate (query + expected):

{
  "id": 1,
  "name": "direct-weather",
  "query": "What's the weather in Singapore?",
  "expected": true,
  "category": "positive"
}
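From the expected flags in these entries plus the observed trigger outcomes, the Tier 1 metrics follow the standard confusion-matrix definitions. A sketch:

```python
def trigger_metrics(results):
    """Compute recall, specificity, precision, and F1 from a list of
    (expected, observed) trigger booleans. Positive = skill should trigger."""
    tp = sum(1 for e, o in results if e and o)          # triggered as expected
    fn = sum(1 for e, o in results if e and not o)      # missed trigger
    tn = sum(1 for e, o in results if not e and not o)  # correctly silent
    fp = sum(1 for e, o in results if not e and o)      # spurious trigger
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1}
```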

Assertion Types

| Type | Detection |
|---|---|
| output_contains | Value appears in conversation or tool output |
| output_not_contains | Value does not appear |
| output_count_max | Occurrences ≤ max |
| tool_called | Specific tool called at least once |
| tool_not_called | Specific tool not called |
| conversation_contains | Value appears anywhere in conversation |
| conversation_contains_any | At least one value appears |

Priority assertions ("priority": true): any failure makes the overall result FAIL. Gap assertions ("note": "Best practice..."): a failure marks a skill design gap rather than failing the eval.
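A sketch of how a grader might evaluate these assertion types (illustrative, not the bundled grader; the "max" field name for output_count_max is an assumption):

```python
def check_assertion(assertion: dict, output: str, tool_calls: list) -> bool:
    """Evaluate one evals.json assertion against collected output.
    Covers a subset of the documented types; the conversation_* types
    would need the full transcript rather than just the final output."""
    kind, value = assertion["type"], assertion.get("value")
    if kind == "output_contains":
        return value in output
    if kind == "output_not_contains":
        return value not in output
    if kind == "output_count_max":
        return output.count(value) <= assertion["max"]
    if kind == "tool_called":
        return value in tool_calls
    if kind == "tool_not_called":
        return value not in tool_calls
    raise ValueError(f"unsupported assertion type: {kind}")

def overall_result(assertions, output, tool_calls):
    """Any failed priority assertion fails the whole eval; other failures
    only lower the pass rate."""
    results = [(a, check_assertion(a, output, tool_calls)) for a in assertions]
    priority_fail = any(a.get("priority") and not ok for a, ok in results)
    passed = sum(1 for _, ok in results if ok)
    return {"pass_rate": passed / len(results),
            "overall": "FAIL" if priority_fail else "PASS"}
```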


Issue Priority (grader output)

🔴 P0 Critical  — Core functionality broken
🟠 P1 High      — Significantly impacts usability
🟡 P2 Medium    — Room for improvement
🟢 P3 Low       — Minor polish

Behavior Anomaly Tracking

Grader records these signals beyond assertions:

| Field | Trigger |
|---|---|
| path_corrections | Wrong path then self-corrected |
| retry_count | Same command executed multiple times |
| missing_file_reads | Attempted to read non-existent files |
| skipped_steps | Steps required by skill were not executed |
| hallucinations | Fabricated non-existent commands/APIs |
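retry_count, for example, can be approximated from the session's command list (a rough sketch of the signal, not the grader's actual logic):

```python
from collections import Counter

def count_retries(commands):
    """Number of extra executions of any repeated command, a rough
    proxy for the grader's retry_count signal."""
    counts = Counter(commands)
    return sum(n - 1 for n in counts.values() if n > 1)
```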

Key Constraints

  • sandbox="inherit" — subagents inherit skill registration environment
  • cleanup="keep" — history must be retained for trigger detection
  • Skill must be in a real directory under skills.load.extraDirs (symlinks rejected)

Files

40 total
