OpenClaw Skill Tester

v1.0.0

OpenClaw Skill Testing Framework - 自动化测试技能质量,验证触发时机、功能正确性、性能指标对比

0· 9·0 current·0 all-time
MIT-0
Download zip
LicenseMIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description (skill testing framework) align with required binaries (python3, bash) and included scripts. Minor mismatch: SKILL.md metadata lists extra packages (pytest, requests) that the shipped scripts do not actually use — harmless but slightly inconsistent.
Instruction Scope
Runtime instructions are limited to running the included Python/bash scripts, reading SKILL.md files for target skills, and writing reports to /tmp. Important operational note: the runner will execute test code against other skills found in the workspace, which means running arbitrary code from those skills (expected for a tester, but a potential side-effect if those target skills are untrusted).
Install Mechanism
No install spec; this is instruction- and script-only. Nothing is downloaded from external URLs or written to system locations beyond report files in /tmp and relative workspace paths.
Credentials
The skill does not require credentials or privileged environment variables. SKILL.md shows example SKILL_TESTER_* env vars for configuration (optional), which are reasonable and not required.
Persistence & Privilege
always is false and the skill does not modify other skills' configuration. It writes reports under /tmp/skill-tests and uses relative workspace paths — typical and proportionate for a test runner.
Assessment
This skill appears to do what it says: run trigger, functionality, and comparison tests against local skills. Before running: (1) inspect any target skills you plan to test — the runner will execute test logic against those directories, so untrusted target skills could run arbitrary code; (2) run in an isolated environment (container or CI) if you want to avoid side effects; (3) if you rely on the metadata packages (pytest, requests), consider installing them or removing them from the SKILL.md metadata to avoid confusion; (4) review and, if desired, change the report directory (/tmp/skill-tests) in the runner. No credentials or network downloads are required by the included scripts themselves.

Like a lobster shell, security has layers — review code before you run it.

latestvk976bsh8nb8cvr7eeyjqag70bn84447x

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

Binspython3, bash

SKILL.md

OpenClaw Skill Testing Framework

自动化测试框架,确保技能质量与性能。

作者: Cao Xiaosi (曹小四)
版本: 1.0.0
许可证: MIT

🎯 测试目标

1. 触发测试 (Trigger Testing)

验证技能在正确时机加载,测试应该触发和不应该触发的场景。

测试内容:

  • ✅ 验证触发词匹配
  • ✅ 测试边界条件
  • ✅ 验证不应触发的场景

测试方法:

python3 scripts/test_trigger.py --skill stock-watcher --input "监控股票价格" --expected true
python3 scripts/test_trigger.py --skill stock-watcher --input "随便聊聊" --expected false

2. 功能测试 (Functionality Testing)

验证技能输出是否正确,测试各种输入场景。

测试内容:

  • ✅ 正常输入场景
  • ✅ 边界值测试
  • ✅ 异常输入处理
  • ✅ 输出格式验证

测试方法:

python3 scripts/test_functionality.py --skill a-stock-monitor --test-case "market_sentiment"
python3 scripts/test_functionality.py --skill a-stock-monitor --test-case "stock_price_query"

3. 对比测试 (Comparison Testing)

证明技能比没有技能时更好,对比工具调用次数、token 消耗等。

测试内容:

  • ✅ 工具调用次数对比
  • ✅ Token 消耗对比
  • ✅ 响应时间对比
  • ✅ 准确率对比

测试方法:

python3 scripts/test_comparison.py --skill stock-watcher --baseline "no-skill" --metric "tool_calls"
python3 scripts/test_comparison.py --skill stock-watcher --baseline "no-skill" --metric "tokens"

📁 目录结构

skills/skill-tester/
├── SKILL.md                           # 本文件
├── scripts/
│   ├── test_trigger.py                # 触发测试脚本
│   ├── test_functionality.py          # 功能测试脚本
│   ├── test_comparison.py             # 对比测试脚本
│   ├── test_runner.sh                 # 测试执行器
│   ├── utils/
│   │   ├── trigger_validator.py       # 触发验证工具
│   │   ├── output_validator.py        # 输出验证工具
│   │   └── metrics_collector.py       # 性能指标收集器
│   └── fixtures/
│       └── sample_inputs.json         # 测试用例数据
└── references/
    ├── TEST_GUIDELINES.md             # 测试指南
    └── METRICS_DEFINITION.md          # 指标定义

🚀 使用方式

快速测试

# 测试单个技能
python3 scripts/test_runner.sh --skill stock-watcher

# 测试所有技能
python3 scripts/test_runner.sh --all

# 详细模式
python3 scripts/test_runner.sh --skill stock-watcher --verbose

触发测试

# 测试应该触发的场景
python3 scripts/test_trigger.py --skill "stock-watcher" \
  --inputs "监控股票价格" "查看A股" "实时行情" \
  --expected true

# 测试不应该触发的场景
python3 scripts/test_trigger.py --skill "stock-watcher" \
  --inputs "随便聊聊" "今天天气" "帮我写代码" \
  --expected false

功能测试

# 运行特定测试用例
python3 scripts/test_functionality.py --skill "a-stock-monitor" \
  --test-case "market_sentiment_calculation"

# 运行所有测试用例
python3 scripts/test_functionality.py --skill "a-stock-monitor" \
  --all-cases

对比测试

# 对比工具调用次数
python3 scripts/test_comparison.py --skill "stock-watcher" \
  --baseline "no-skill" \
  --metric "tool_calls" \
  --iterations 10

# 对比 Token 消耗
python3 scripts/test_comparison.py --skill "stock-watcher" \
  --baseline "no-skill" \
  --metric "tokens" \
  --iterations 10

📊 测试报告

输出格式

{
  "skill_name": "stock-watcher",
  "timestamp": "2026-04-03T12:00:00",
  "test_summary": {
    "total": 20,
    "passed": 18,
    "failed": 2,
    "skipped": 0
  },
  "trigger_tests": {
    "passed": 10,
    "failed": 0
  },
  "functionality_tests": {
    "passed": 8,
    "failed": 2
  },
  "comparison_metrics": {
    "tool_calls_reduction": "45%",
    "token_savings": "32%",
    "response_time_improvement": "28%"
  },
  "failed_tests": [
    {
      "test_name": "test_stock_price_accuracy",
      "error": "Expected price 10.50, got 10.45",
      "input": "600000"
    }
  ]
}

报告生成

# 生成 JSON 报告
python3 scripts/test_runner.sh --skill stock-watcher --report json

# 生成 HTML 报告
python3 scripts/test_runner.sh --skill stock-watcher --report html

# 生成 Markdown 报告
python3 scripts/test_runner.sh --skill stock-watcher --report md

🔧 配置说明

测试配置文件

创建 test_config.json:

{
  "skills_path": "/Users/mars/.openclaw/workspace/skills",
  "test_iterations": 10,
  "timeout_seconds": 30,
  "verbose": false,
  "save_reports": true,
  "report_format": ["json", "md"],
  "exclude_skills": ["skill-tester", "self-improving"]
}

环境变量

export SKILL_TESTER_VERBOSE=true
export SKILL_TESTER_TIMEOUT=60
export SKILL_TESTER_REPORT_DIR="/tmp/skill-tests"

🧪 测试用例示例

触发测试用例

{
  "skill": "stock-watcher",
  "trigger_tests": {
    "should_trigger": [
      "监控股票价格",
      "查看A股实时行情",
      "A股今日走势"
    ],
    "should_not_trigger": [
      "帮我写代码",
      "今天天气如何",
      "随便聊聊"
    ]
  }
}

功能测试用例

{
  "skill": "a-stock-monitor",
  "functionality_tests": [
    {
      "name": "market_sentiment",
      "input": {"action": "calculate_sentiment"},
      "expected_output": {
        "type": "json",
        "fields": ["score", "level", "stats"]
      }
    },
    {
      "name": "stock_price_query",
      "input": {"stock_code": "600000"},
      "expected_output": {
        "type": "json",
        "fields": ["price", "change_pct", "volume"]
      }
    }
  ]
}

📈 性能指标定义

1. 工具调用次数 (Tool Calls)

  • 定义: 技能执行所需的工具调用总数
  • 目标: 减少不必要的工具调用
  • 对比基准: 无技能时的手动操作

2. Token 消耗 (Token Usage)

  • 定义: LLM 调用消耗的 Token 总数
  • 目标: 降低 Token 消耗,节省成本
  • 对比基准: 无技能时的原始对话

3. 响应时间 (Response Time)

  • 定义: 从请求到响应的总耗时(秒)
  • 目标: 提升响应速度
  • 对比基准: 无技能时的响应时间

4. 准确率 (Accuracy)

  • 定义: 输出结果的正确率
  • 目标: 100% 准确
  • 评估方法: 人工验证 + 自动化检查

🛠️ 开发指南

添加新技能测试

  1. scripts/fixtures/sample_inputs.json 中添加测试用例
  2. 创建技能特定的测试脚本(可选)
  3. 运行测试验证

扩展测试类型

创建新的测试模块:

# scripts/test_custom.py
from utils.metrics_collector import MetricsCollector

def test_custom_metric(skill_name):
    collector = MetricsCollector()
    # 自定义测试逻辑
    results = collector.collect_custom_metrics(skill_name)
    return results

🔍 故障排查

问题1: 测试超时

原因: 技能执行时间过长
解决: 增加超时配置 --timeout 60

问题2: 报告生成失败

原因: 报告目录不存在
解决: 创建目录 mkdir -p /tmp/skill-tests

问题3: 测试用例缺失

原因: 未定义测试用例
解决: 在 fixtures/sample_inputs.json 中添加

📝 最佳实践

  1. 定期运行测试: 每次技能更新后运行完整测试
  2. 保持测试用例更新: 随技能功能同步更新测试用例
  3. 监控性能指标: 定期对比性能指标,优化技能
  4. 自动化集成: 将测试集成到 CI/CD 流程中
  5. 文档化测试结果: 保存测试报告,便于追溯

📄 许可证

MIT License

🤝 贡献

欢迎提交问题、建议和改进意见!

📞 联系作者

Files

5 total
Select a file
Select a file to preview.

Comments

Loading comments…