Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

OpenClaw Smartness Eval

A comprehensive smartness evaluation skill for OpenClaw. Covers understanding, analysis, thinking, reasoning, self-iteration, dialogue, and response latency, plus extended dimensions, and outputs an overall score, evidence, risk flags, and trends.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
1 · 34 · 0 current installs · 0 all-time installs
by 圆规 (@yh22e)
Security Scan
VirusTotal
Benign
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The name/description (agent smartness evaluation) aligns with the code and configs: the skill uses 28 tests, rubrics, and multiple local state/log files to compute scores. Requiring access to workspace state files (logs, latency metrics, benchmark history) and some core OpenClaw scripts is consistent with the stated purpose. One small mismatch: registry metadata showed no required env vars, yet SKILL.md documents optional DEEPSEEK_API_KEY/OPENAI_API_KEY for the LLM-judge feature.
Instruction Scope
Runtime instructions and scripts read many workspace files (state/*, .reasoning/reasoning-store.sqlite, logs) and run other Python scripts in workspace/scripts (e.g., message-analyzer-v5.py, security-config-audit.py). That is coherent for evaluation, but it means the skill will access potentially sensitive user data and will execute code outside the skill bundle. SKILL.md explicitly documents external API calls only when --llm-judge is passed, which is good, but the codebase uses workspace-level scripts and logs broadly.
Install Mechanism
No install spec or remote downloads; this is an instruction/code-only skill. Nothing is fetched from arbitrary URLs or written to system paths outside the workspace/state directories created at runtime.
Credentials
Registry metadata declares no required environment variables, but SKILL.md and scripts mention (optional) DEEPSEEK_API_KEY / OPENAI_API_KEY for the LLM Judge feature and note that external API requests will be made when --llm-judge is used. The skill will therefore access external network resources if explicitly enabled and can read sensitive local files (logs, reasoning DB) even without extra env vars. The lack of declared env requirements is an inconsistency the user should be aware of.
Persistence & Privilege
The skill does not set always:true and does not request permanent elevated platform privileges. It will create and write to state/smartness-eval/ (runs, reports, history) under the workspace — this is expected for a reporting tool. It does run other workspace scripts (via python3), which is necessary for its tests but increases its execution surface.
What to consider before installing
This skill is plausible for evaluating an agent, but it needs broad local access and (optionally) external API keys. Before running or installing:

1) Inspect the workspace scripts the skill will invoke (workspace/scripts/*). The skill executes Python scripts located there (validate_command allows 'scripts/' paths), so those scripts determine what actually runs.
2) Be aware it reads logs and the reasoning DB (.reasoning/reasoning-store.sqlite and many state/*.json files). These may contain sensitive user messages or secrets; avoid running it against a workspace with private data unless you trust the environment.
3) Do not pass --llm-judge or set DEEPSEEK_API_KEY / OPENAI_API_KEY unless you want the skill to make external API calls; SKILL.md states the network calls are opt-in.
4) Run scripts/check.py first to confirm local files are present.
5) Note the small metadata inconsistencies (owner id/version); confirm the skill source and trust the publisher before granting it access to your workspace.

If you need lower risk, run it in an isolated/sandbox workspace with non-sensitive logs.
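One quick way to act on point 1) is a static pass over workspace/scripts/ looking for modules that imply network or process access. This is an illustrative sketch, not part of the skill; the pattern list and default path are assumptions:

```python
import re
from pathlib import Path

# Module names that suggest a script touches the network or spawns
# processes. Illustrative only; adapt the list to your own threat model.
SUSPECT = re.compile(r"\b(requests|urllib|http\.client|socket|subprocess|os\.system)\b")

def scan_workspace_scripts(root="workspace/scripts"):
    """Map each *.py file under `root` to the suspect modules it mentions."""
    hits = {}
    for path in sorted(Path(root).glob("*.py")):
        matches = sorted(set(SUSPECT.findall(path.read_text(errors="ignore"))))
        if matches:
            hits[path.name] = matches
    return hits

if __name__ == "__main__":
    for name, mods in scan_workspace_scripts().items():
        print(f"{name}: mentions {', '.join(mods)}")
```

A mention is not proof of misbehavior, but any flagged script deserves a manual read before you let the skill execute it.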

Like a lobster shell, security has layers — review code before you run it.

Current version: v0.2.1
Tags: agent-eval · benchmark · evaluation · latest

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

OpenClaw Smartness Eval

Evaluates whether OpenClaw is actually getting "smarter", rather than judging by whether a single answer merely looks good.

Use cases

  • Post-upgrade regression: confirm whether capabilities actually improved
  • Weekly / monthly self-assessment: produce a structured capability report
  • Spotting degradation: see which dimension is declining fastest
  • External presentations: generate capability evaluation results with consistent criteria

Commands

1) Standard evaluation

python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard

2) Quick evaluation

python3 skills/openclaw-smartness-eval/scripts/eval.py --mode quick

3) Deep evaluation

python3 skills/openclaw-smartness-eval/scripts/eval.py --mode deep --compare-last

4) Markdown output only

python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard --format markdown

5) Health check

python3 skills/openclaw-smartness-eval/scripts/check.py

Output

Evaluation results are written to:

  • state/smartness-eval/runs/<timestamp>.json
  • state/smartness-eval/reports/<date>.md
  • state/smartness-eval/history.jsonl

The results include:

  • overall_score
  • grade
  • dimension_scores
  • expanded_scores
  • evidence
  • risk_flags
  • upgrade_recommendations
  • trend_vs_last
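As a rough sketch of how a trend_vs_last delta can be derived from history.jsonl: the overall_score field and file path come from the lists above, but the logic here is illustrative rather than the skill's actual implementation.

```python
import json
from pathlib import Path

def score_delta(history_path="state/smartness-eval/history.jsonl"):
    """Return the overall_score change between the two most recent runs,
    or None when fewer than two runs have been recorded."""
    runs = [json.loads(line)
            for line in Path(history_path).read_text().splitlines()
            if line.strip()]
    if len(runs) < 2:
        return None
    return runs[-1]["overall_score"] - runs[-2]["overall_score"]
```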

6) LLM Judge subjective scoring

python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard --llm-judge

Requires the DEEPSEEK_API_KEY or OPENAI_API_KEY environment variable to be set. This feature makes external API requests; it is off by default and is enabled only when --llm-judge is explicitly passed.
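The opt-in gating described above can be sketched as follows; the exact flag and key handling inside eval.py may differ, so treat this as an assumption:

```python
import argparse
import os

def llm_judge_enabled(argv=None):
    """True only when --llm-judge is explicitly passed AND an API key is
    configured; otherwise no external calls should be attempted."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--llm-judge", action="store_true")
    args, _ = parser.parse_known_args(argv)
    has_key = bool(os.environ.get("DEEPSEEK_API_KEY")
                   or os.environ.get("OPENAI_API_KEY"))
    return args.llm_judge and has_key
```

Gating on both the flag and the key means that merely having a key in the environment never triggers network traffic by itself.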

New output fields (v0.2)

  • dimension_spread — dispersion across dimensions
  • trend_vs_last.dimension_deltas — per-dimension score changes
  • trend_vs_last.degradation_alert — dimensions that degraded by more than 5 points
  • pass_at_k — pass@k reliability of each test in deep mode
  • llm_judge — the LLM judge's subjective scores and comments
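pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k) for n repeated runs of which c passed; whether eval.py uses exactly this formula is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts, c of which passed, is a pass."""
    if n - c < k:
        return 1.0  # too few failures to fill all k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With deep mode's x2 repetition n = 2, so a test that passes once out of two runs gets pass@1 = 0.5.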

Data sources

  • state/response-latency-metrics.json
  • state/error-tracker.json (time-window filtered)
  • state/pattern-library.json
  • state/cron-governor-report.json
  • state/benchmark-results/history.jsonl
  • state/v5-orchestrator-log.json
  • state/v5-finalize-log.json
  • state/message-analyzer-log.json (sampled from real logs)
  • state/reflection-reports/ (reflection reports)
  • state/alerts.jsonl (alert log)
  • state/rule-candidates.json
  • .reasoning/reasoning-store.sqlite (reasoning knowledge base)
  • scripts/regression-metrics-report.py (regression metrics)
  • 28 rule-based test commands from the task suite
  • Random probe tests (anti-gaming)
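The time-window filtering noted for state/error-tracker.json could look like this sketch; the file schema and the timestamp field name are assumptions, not the skill's documented format:

```python
import json
import time
from pathlib import Path

def recent_errors(path="state/error-tracker.json", window_days=30):
    """Keep only entries whose (assumed) unix `timestamp` field falls
    inside the evaluation window."""
    entries = json.loads(Path(path).read_text())
    cutoff = time.time() - window_days * 86400
    return [e for e in entries if e.get("timestamp", 0) >= cutoff]
```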

Modes

  • quick — small sample + key logs, ~10 tests
  • standard — default weekly evaluation, ~25 tests + 2 random probes
  • deep — all tests repeated x2 + pass@k + 30-day window + trend comparison

File layout

openclaw-smartness-eval/
├── SKILL.md
├── _meta.json
├── config/
│   ├── config.json
│   ├── rubrics.json
│   └── task-suite.json
└── scripts/
    ├── eval.py
    └── check.py

Files

9 total
