Install
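openclaw skills install openclaw-smartness-eval

OpenClaw comprehensive smartness evaluation skill. It scores 14 dimensions (including planning ability and hallucination control) and reports an overall score, evidence, risks, and trends. It is aligned with the CLEAR, T-Eval, and Anthropic industry standards.

Use it to assess whether OpenClaw has actually become "smarter", rather than judging whether a single answer happens to look good.
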
python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard
python3 skills/openclaw-smartness-eval/scripts/eval.py --mode quick
python3 skills/openclaw-smartness-eval/scripts/eval.py --mode deep --compare-last
python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard --format markdown
python3 skills/openclaw-smartness-eval/scripts/check.py
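Evaluation results are written to:
- state/smartness-eval/runs/<timestamp>.json
- state/smartness-eval/reports/<date>.md
- state/smartness-eval/history.jsonl

The output contains:
- overall_score
- grade
- dimension_scores
- expanded_scores
- evidence
- risk_flags
- upgrade_recommendations
- trend_vs_last

To add an LLM judge to the evaluation:
python3 skills/openclaw-smartness-eval/scripts/eval.py --mode standard --llm-judge
This requires the DEEPSEEK_API_KEY or OPENAI_API_KEY environment variable to be set.
The LLM judge makes external API requests. It is off by default and is enabled only when --llm-judge is passed explicitly.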
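
As a rough illustration of the precondition above, the sketch below builds the eval.py command line and refuses to add --llm-judge unless one of the two API key variables is set. The helper names (`judge_key_present`, `build_eval_command`) are hypothetical and are not part of the skill; only the script path, the flags, and the variable names come from this page.

```python
import os

# Variable names are taken from the text above.
JUDGE_KEY_VARS = ("DEEPSEEK_API_KEY", "OPENAI_API_KEY")
EVAL_SCRIPT = "skills/openclaw-smartness-eval/scripts/eval.py"


def judge_key_present(env=None):
    """Return True if at least one judge API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in JUDGE_KEY_VARS)


def build_eval_command(mode="standard", llm_judge=False, env=None):
    """Build the argv list for eval.py (hypothetical helper).

    --llm-judge is added only when the caller asks for it explicitly,
    and only if an API key is available.
    """
    cmd = ["python3", EVAL_SCRIPT, "--mode", mode]
    if llm_judge:
        if not judge_key_present(env):
            raise RuntimeError(
                "--llm-judge needs DEEPSEEK_API_KEY or OPENAI_API_KEY to be set"
            )
        cmd.append("--llm-judge")
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the workspace root. Keeping `llm_judge` an explicit argument mirrors the opt-in behaviour described above.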
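
The output also includes:
- dimension_spread: dispersion across dimensions
- trend_vs_last.dimension_deltas: score change for each dimension
- trend_vs_last.degradation_alert: dimensions that degraded by more than 5 points
- pass_at_k: pass@k reliability of each test in deep mode
- llm_judge: subjective scores and comments from the LLM judge

The evaluation draws on the following data sources:
- state/response-latency-metrics.json
- state/error-tracker.json (filtered by time window)
- state/pattern-library.json
- state/cron-governor-report.json
- state/benchmark-results/history.jsonl
- state/v5-orchestrator-log.json
- state/v5-finalize-log.json
- state/message-analyzer-log.json (sampled from real logs)
- state/reflection-reports/ (reflection reports)
- state/alerts.jsonl (alert log)
- state/rule-candidates.json
- .reasoning/reasoning-store.sqlite (reasoning knowledge base)
- scripts/regression-metrics-report.py (regression metrics)

Evaluation modes:
- quick: small sample plus key logs, about 10 tests
- standard: the default weekly evaluation, about 25 tests plus 2 random probes
- deep: all tests run twice, plus pass@k, a 30-day window, and trend comparison

This skill is designed as a read-only evaluation tool. The complete statement of its behavior follows.
It only reads the following workspace state files and does not modify any existing file: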
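| File | Purpose | Written? |
|---|---|---|
| state/response-latency-metrics.json | Latency P50/P95 calculation | ❌ Read-only |
| state/error-tracker.json | Error fix-rate statistics | ❌ Read-only |
| state/pattern-library.json | Pattern library health | ❌ Read-only |
| state/cron-governor-report.json | Cron task status | ❌ Read-only |
| state/benchmark-results/history.jsonl | Benchmark pass rate | ❌ Read-only |
| state/v5-orchestrator-log.json | Orchestrator usage | ❌ Read-only |
| state/v5-finalize-log.json | Finalize approval rate | ❌ Read-only |
| state/message-analyzer-log.json | Sampling of real interactions | ❌ Read-only |
| state/reflection-reports/ | Number of self-reflection reports | ❌ Read-only |
| state/alerts.jsonl | Alert frequency statistics | ❌ Read-only |
| .reasoning/reasoning-store.sqlite | Reasoning depth queries | ❌ Read-only |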
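
One way to check the read-only claim for yourself is to fingerprint the files in the table before and after a run and compare the two snapshots. The sketch below is an assumption about how such a check could be written and is not shipped with the skill. The helper names (`fingerprint`, `snapshot`, `changed_paths`) are hypothetical, and the path list is copied from the table.

```python
import hashlib
from pathlib import Path

# Paths copied from the table above, relative to the workspace root.
READ_ONLY_INPUTS = [
    "state/response-latency-metrics.json",
    "state/error-tracker.json",
    "state/pattern-library.json",
    "state/cron-governor-report.json",
    "state/benchmark-results/history.jsonl",
    "state/v5-orchestrator-log.json",
    "state/v5-finalize-log.json",
    "state/message-analyzer-log.json",
    "state/reflection-reports",
    "state/alerts.jsonl",
    ".reasoning/reasoning-store.sqlite",
]


def fingerprint(path):
    """Return a SHA-256 hex digest for a file or directory, or None if missing."""
    p = Path(path)
    if p.is_file():
        return hashlib.sha256(p.read_bytes()).hexdigest()
    if p.is_dir():
        digest = hashlib.sha256()
        # Hash relative names and contents in a stable order.
        for child in sorted(p.rglob("*")):
            if child.is_file():
                digest.update(child.relative_to(p).as_posix().encode())
                digest.update(child.read_bytes())
        return digest.hexdigest()
    return None


def snapshot(paths=READ_ONLY_INPUTS, root="."):
    """Map each relative path to its current fingerprint."""
    return {rel: fingerprint(Path(root) / rel) for rel in paths}


def changed_paths(before, after):
    """List the paths whose fingerprint differs between two snapshots."""
    return sorted(rel for rel in before if before[rel] != after.get(rel))
```

Typical use: take `before = snapshot()`, run eval.py, take `after = snapshot()`, and expect `changed_paths(before, after)` to be empty. Other processes writing to the same logs during the run would also show up as changes, so a non-empty result is a prompt to investigate, not proof that the skill wrote anything.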
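This skill writes only evaluation results, all under the state/smartness-eval/ directory:
- state/smartness-eval/runs/<timestamp>.json: full evaluation JSON
- state/smartness-eval/reports/<date>.md: Markdown report
- state/smartness-eval/history.jsonl: historical score records

This skill runs the test commands defined in task-suite.json through subprocess:
- Commands are validated before they run (the validate_command() function).
- Only commands that begin with a safe prefix, such as python3 scripts/, cat state/, or sqlite3 .reasoning/, are allowed.
- When the --llm-judge flag is passed, the skill calls the DeepSeek/OpenAI API (users must configure their own API key).

openclaw-smartness-eval/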
├── SKILL.md
├── _meta.json
├── config/
│   ├── config.json
│   ├── rubrics.json
│   └── task-suite.json
└── scripts/
    ├── eval.py
    └── check.py
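
The safe-prefix rule for subprocess test commands, described earlier, can be pictured with a short sketch. This is not the skill's validate_command() implementation, which is not shown on this page. The function names `is_allowed_command` and `to_argv` are hypothetical, and rejecting shell metacharacters is an extra assumption made here, not something the page states.

```python
import shlex

# Prefixes quoted in the text above; the real list may be longer.
SAFE_PREFIXES = ("python3 scripts/", "cat state/", "sqlite3 .reasoning/")

# Assumption: reject shell metacharacters so a safe prefix cannot be
# followed by a chained or redirected command.
FORBIDDEN_TOKENS = (";", "&", "|", "`", "$(", ">", "<")


def is_allowed_command(command):
    """Return True if the command starts with a safe prefix and has no shell metacharacters."""
    normalized = " ".join(command.split())  # collapse runs of whitespace
    if any(token in normalized for token in FORBIDDEN_TOKENS):
        return False
    return normalized.startswith(SAFE_PREFIXES)


def to_argv(command):
    """Split an allowed command into an argv list for subprocess.run (no shell)."""
    if not is_allowed_command(command):
        raise ValueError(f"command not allowed: {command!r}")
    return shlex.split(command)
```

For example, `to_argv("cat state/alerts.jsonl")` yields `["cat", "state/alerts.jsonl"]`, which can be passed to `subprocess.run` without `shell=True`, while `"cat state/alerts.jsonl; rm -rf /"` is rejected. A prefix check alone does not stop path traversal such as `cat state/../secrets`, so a stricter validator would also need to resolve and check the paths.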