Agent Eval Suite

Provides benchmark testing, A/B testing, performance regression detection, and simulation environment testing for agent evaluation.

Audits

Pass

Install

openclaw skills install yuyonghao-agent-eval-suite

Agent Eval Suite

Agent 评估套件,提供基准测试、A/B测试、性能回归检测和模拟环境测试。

功能特性

1. 基准测试框架

  • 标准化测试用例
  • 多维度评估指标
  • 基准对比
  • 评分系统

2. A/B 测试

  • 对照组设计
  • 随机分组
  • 统计显著性检验
  • 结果分析

3. 性能回归检测

  • 历史性能对比
  • 回归告警
  • 性能趋势图
  • 根因分析

4. 模拟环境测试

  • 沙箱环境
  • 场景模拟
  • 边界条件测试
  • 故障注入

安装

npm install

使用方法

Benchmark

const { Benchmark } = require('./src');

const benchmark = new Benchmark({ iterations: 100 });

benchmark.addTest('task-completion', {
  execute: async () => await agent.completeTask(task)
});

const results = await benchmark.run();
console.log(results);

ABTester

const { ABTester } = require('./src');

const ab = new ABTester({ confidenceLevel: 0.95 });

ab.createExperiment('new-prompt', {
  control: async () => await oldPrompt(),
  treatment: async () => await newPrompt()
});

const result = await ab.run('new-prompt', { sampleSize: 200 });
console.log(result); // { winner: 'treatment', confidence: 0.97 }

RegressionDetector

const { RegressionDetector } = require('./src');

const detector = new RegressionDetector({ threshold: 0.1 });

detector.record('response-time', { version: 'v1.1.0', value: 1200 });

const regressions = detector.detect();
console.log(regressions);

测试

npm test

License

MIT