OpenClaw Smartness Eval

Pass. Audited by ClawScan on May 10, 2026.

Overview

This is a disclosed evaluator that runs local OpenClaw tests, reads bounded agent state, and only calls an external LLM judge when explicitly enabled.

Before installing, confirm you trust the local OpenClaw workspace scripts that the task suite will run. Keep `--llm-judge` off unless you are comfortable sending evaluation summaries to an external provider, and review generated reports before sharing because they may summarize local agent logs or reasoning history.

Findings (5)

This is an artifact-based, informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Finding 1: Executes local workspace scripts

What this means

Running an evaluation can execute local OpenClaw scripts in the workspace.

Why it was flagged

The skill intentionally executes local test commands. This is central to the evaluation purpose and the documentation describes validation and timeouts, but users should still understand that local workspace scripts will run.

Skill content

This skill runs the test commands defined in `task-suite.json` via `subprocess`.

Recommendation

Review `config/task-suite.json` before running standard or deep evaluations, and run it only in a workspace whose scripts you trust.
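The execution pattern described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function names and the structure of `task-suite.json` are assumptions; only the file path and the use of `subprocess` with timeouts come from the report.

```python
import json
import subprocess

def run_task(command, timeout_s=60):
    """Run one test command from the suite with a hard timeout.

    Hypothetical sketch: the real skill's validation and result schema
    are not shown in the report.
    """
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        return {"exit_code": result.returncode, "stdout": result.stdout}
    except subprocess.TimeoutExpired:
        # A runaway workspace script is killed rather than hanging the eval.
        return {"exit_code": None, "error": "timeout"}

def load_suite(path="config/task-suite.json"):
    """Load the task definitions the evaluator will execute."""
    with open(path) as f:
        return json.load(f)
```

The timeout matters here: because these are arbitrary local workspace scripts, a bound on their runtime is the evaluator's only leverage if one misbehaves.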

Finding 2: Depends on unbundled local OpenClaw components

What this means

Evaluation results and command behavior depend on unbundled local OpenClaw components.

Why it was flagged

The skill's tests depend on external OpenClaw core scripts that are not shipped in this package, so actual runtime behavior also depends on the local workspace's copy of those scripts.

Skill content

These scripts belong to the OpenClaw core and are not distributed with the skill. Users installing this skill need a complete OpenClaw V5 environment.

Recommendation

Use the skill with a trusted, up-to-date OpenClaw installation and inspect local core scripts if you are in a sensitive environment.
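One way to act on this recommendation is a preflight check before running the suite. This is a sketch under assumptions: the specific core script names below are hypothetical examples, not the actual OpenClaw V5 file layout.

```python
from pathlib import Path

def missing_core_scripts(workspace, expected=("openclaw-core.sh", "agent-runner.sh")):
    """Return the names of expected core scripts absent from the workspace.

    The `expected` defaults are placeholders; substitute the scripts your
    OpenClaw installation actually provides.
    """
    root = Path(workspace)
    return [name for name in expected if not (root / name).exists()]
```

Running such a check first turns a confusing mid-evaluation failure into an explicit "your local OpenClaw environment is incomplete" error.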

Finding 3: Reads local logs and the reasoning store

What this means

Reports may summarize metrics derived from prior interactions, logs, alerts, or reasoning-store contents.

Why it was flagged

The evaluator reads local runtime logs and the reasoning knowledge store. This is purpose-aligned for scoring intelligence and trends, but these sources may contain sensitive or interaction-derived context.

Skill content

`state/message-analyzer-log.json` (sampled from real logs) ... `.reasoning/reasoning-store.sqlite` (reasoning knowledge store)

Recommendation

Review generated reports before sharing them, and avoid running the skill on workspaces containing sensitive logs unless that data use is acceptable.
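The bounded, read-only access pattern this finding describes might look like the sketch below. The file paths come from the report; the sampling cap and the idea of opening the SQLite store in read-only mode are assumptions about how such an evaluator could limit its footprint.

```python
import json
import sqlite3
from pathlib import Path

def sample_log(path, max_entries=100):
    """Read at most max_entries records from the analyzer log.

    Capping the sample keeps the evaluation bounded and limits how much
    interaction-derived context ends up in a report.
    """
    entries = json.loads(Path(path).read_text())
    return entries[:max_entries]

def open_store_readonly(path):
    """Open the reasoning store without the ability to modify it."""
    # SQLite URI mode=ro rejects any write to the store.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

Even with read-only access, whatever is read can surface in the generated report, which is why the recommendation above asks you to review reports before sharing.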

Finding 4: Uses provider API credentials (opt-in)

What this means

If enabled, the skill can use your DeepSeek or OpenAI API account and may incur provider-side logging or cost.

Why it was flagged

The optional LLM judge uses provider API credentials. This is expected for the feature and explicitly opt-in, though the registry metadata does not declare these optional environment variables.

Skill content

Requires the `DEEPSEEK_API_KEY` or `OPENAI_API_KEY` environment variable. This feature makes external API requests; it is off by default and is enabled only when `--llm-judge` is explicitly passed.

Recommendation

Only set the API key and pass `--llm-judge` if you are comfortable using that provider for evaluation.
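The opt-in gating described here can be sketched as follows. The `--llm-judge` flag and the two environment variable names come from the report; the function shape and error handling are assumptions.

```python
import argparse
import os

def judge_config(argv):
    """Return LLM-judge settings, or None when judging stays disabled."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--llm-judge", action="store_true")
    args = parser.parse_args(argv)
    if not args.llm_judge:
        return None  # fully local: no external calls, no credentials read
    key = os.environ.get("DEEPSEEK_API_KEY") or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("--llm-judge requires DEEPSEEK_API_KEY or OPENAI_API_KEY")
    return {"api_key": key}
```

Because the credential is only read after the flag check, simply not passing `--llm-judge` guarantees neither account is touched.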

Finding 5: Sends evaluation summaries externally when judging is enabled

What this means

When LLM judging is enabled, evaluation summaries are sent to an external model provider.

Why it was flagged

The skill documents an optional external provider data flow for LLM judging. It claims not to send raw logs, and it is disabled by default, but summaries and evidence still leave the local workspace when enabled.

Skill content

`--llm-judge` ... Sends dimension summary to LLM API (no raw logs or user data)

Recommendation

Keep `--llm-judge` disabled for fully local evaluation, or review what summaries are sent before enabling it.
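The "summary only" claim amounts to a payload-construction discipline like the sketch below. The exact field names are assumptions; the point, taken from the report, is that only aggregated dimension scores, never raw log entries or reasoning-store rows, are serialized for the external provider.

```python
def build_judge_payload(dimension_scores):
    """Build the data sent to the external judge.

    dimension_scores: a mapping like {"reasoning": 0.82, "planning": 0.5}
    (hypothetical dimension names).
    """
    return {
        "dimensions": {name: round(score, 3) for name, score in dimension_scores.items()},
        # Deliberately omitted: raw logs, message history, reasoning-store rows.
    }
```

If you want to verify the claim yourself before enabling `--llm-judge`, inspecting whatever function builds this payload in the skill's source is the place to look.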