Install
openclaw skills install @gechengling/ai-agent-evaluator
AI-powered agent evaluation and benchmarking assistant: design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoning accuracy), compare multi-agent frameworks (CrewAI, LangChain, AutoGen), generate benchmark reports, and guide developers in selecting the right evaluation methodology. Built for AI engineers, product managers, and ML teams shipping agent-based applications to production.
Keywords: AI agent evaluation, agent benchmarking, LLM testing, CrewAI, AutoGen, LangChain, SWE-bench, AgentBench, AI quality assurance, agent reliability.
Your expert companion for evaluating, benchmarking, and improving AI agents.
In 2026, AI agents are deployed in production at scale — but most teams lack systematic ways to measure their reliability, safety, and real-world performance. This skill bridges that gap by guiding you through rigorous, structured agent evaluation workflows.
English:
Chinese / 中文:
Input: Agent description, task type, sample inputs/outputs Steps:
Input: Agent capabilities, deployment domain Steps:
Input: Agent goal, available test data, budget/time Steps:
Input: Agent logs, failed task transcripts Steps:
Input: Use case requirements (e.g., "code review pipeline with 3 agents") Steps:
User: "I built a customer support agent with GPT-4o. How do I know if it's ready for production?"
Agent Evaluator:
Great question. Let's run a production readiness assessment across 5 dimensions:
- Task Completion Rate — target >92% for Tier-1 support queries
- Hallucination Rate — should be <2% on factual product questions
- Escalation Accuracy — correctly identifies when to route to human
- Latency P95 — target <3s for chat use case
- Safety Compliance — passes 50-case adversarial test suite
To start, share 10 sample conversations (anonymized) and I'll generate your baseline scorecard.
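A baseline scorecard of this kind could be computed roughly as follows. This is a minimal sketch, not the skill's actual implementation: the `Conversation` fields, the `scorecard` function, and the nearest-rank percentile are all assumptions made for illustration. Only three of the five dimensions (task completion, hallucination rate, P95 latency) are covered; escalation accuracy and safety compliance need labeled test cases and are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Conversation:
    """Hypothetical per-conversation labels, assumed to come from manual review."""
    completed: bool      # task resolved end-to-end
    hallucinated: bool   # contained a factually wrong product claim
    latency_s: float     # response latency in seconds


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def scorecard(conversations):
    """Return {metric: (value, passes_target)} using the targets quoted above."""
    n = len(conversations)
    completion = sum(c.completed for c in conversations) / n
    hallucination = sum(c.hallucinated for c in conversations) / n
    latency = p95([c.latency_s for c in conversations])
    return {
        "task_completion_rate": (completion, completion > 0.92),
        "hallucination_rate": (hallucination, hallucination < 0.02),
        "latency_p95_s": (latency, latency < 3.0),
    }
```

With only 10 sample conversations each rate moves in steps of 10%, so a scorecard like this is a rough baseline rather than a statistically meaningful pass/fail verdict.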
User: "Compare CrewAI vs LangChain for a financial report analysis pipeline"
Agent Evaluator:
Here's a side-by-side for your use case... [generates structured comparison table + recommendation]
| Concept | Description |
|---|---|
| SWE-Bench | Software engineering task benchmark (GitHub issues) |
| AgentBench | Multi-domain agent task evaluation suite |
| BFCL | Berkeley Function Calling Leaderboard |
| WebArena | Browser automation + web task benchmark |
| Task Success Rate (TSR) | % of tasks completed correctly end-to-end |
| Step Success Rate (SSR) | % of individual reasoning steps correct |
| Hallucination Rate | Frequency of factually incorrect outputs |
| Grounding Accuracy | Correct attribution to source documents |
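The difference between Task Success Rate and Step Success Rate can be illustrated with a short sketch. The `Trace` structure and both function names are hypothetical, and SSR is computed here by pooling all steps across tasks, which is one possible convention among several.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical graded record of one agent run."""
    steps: List[bool]   # True for each individual step judged correct
    succeeded: bool     # True if the task was completed correctly end-to-end


def task_success_rate(traces):
    """Fraction of tasks completed correctly end-to-end (TSR)."""
    return sum(t.succeeded for t in traces) / len(traces)


def step_success_rate(traces):
    """Fraction of individual steps judged correct, pooled over all tasks (SSR)."""
    steps = [ok for t in traces for ok in t.steps]
    return sum(steps) / len(steps)


traces = [
    Trace(steps=[True, True, True], succeeded=True),
    Trace(steps=[True, False, True], succeeded=False),
    Trace(steps=[True, True], succeeded=True),
]
tsr = task_success_rate(traces)   # 2 of 3 tasks succeeded
ssr = step_success_rate(traces)   # 7 of 8 steps were correct
```

A high SSR with a low TSR usually means single-step errors are derailing otherwise sound runs, which is why the two metrics are worth reporting together.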
Built for AI teams who ship agents to production, not just demos.
Author: @gechengling | version: "3.0.2"
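| Failure category | Subtype | Detection method | Fix direction | Frequency |
|---|---|---|---|---|
| Tool call failure | API timeout / rate limiting | Count API error codes in logs | Retry with backoff | 22% |
| Tool call failure | Malformed parameters | Compare against the tool schema definition | Schema correction + type validation | 15% |
| Tool call failure | Authentication failure (401/403) | Detect 401/403 responses | Automatic token refresh | 8% |
| Hallucinated output | Fabricated tool return data | Compare against raw tool output | Enforce source citation | 18% |
| Hallucinated output | Faulty reasoning chain | Check the logic of reasoning steps | CoT + self-verification | 12% |
| Loop / deadlock | Infinite retry loop | Detect repeated calls (>5 times) | Cap on maximum retries | 10% |
| Loop / deadlock | Mutual-call deadlock | Detect cycles in the call graph | Timeout + human intervention | 3% |
| Context loss | Truncation past the token limit | Monitor context length | Summary compression + external storage | 7% |
| Context loss | Key facts forgotten | Compare against facts from earlier turns | Explicit memory + retrieval | 5% |
| Safety block | Sensitive-term trigger | Inspect safety filter logs | Prompt adjustment + allowlist | 4% |
| Safety block | Content policy refusal | Detect refusal response patterns | Content rewriting + tiered policy | 3% |
| Data quality | Irrelevant retrieval results | Evaluate RAG hit rate | Query rewriting + multi-path retrieval | 14% |
| Data quality | Stale / incorrect data | Compare data source timestamps | Data freshness checks | 6% |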
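Three of the detection methods in the taxonomy table above (repeated calls, cyclic call graphs, 401/403 responses) are simple enough to sketch directly. The function names and input shapes below are assumptions for illustration, not part of the skill. The retry detector counts identical calls across the whole log rather than only consecutive ones, which is a simplification.

```python
from collections import Counter

MAX_REPEATS = 5  # threshold taken from the table: more than 5 identical calls


def find_retry_loops(calls, max_repeats=MAX_REPEATS):
    """calls: list of (tool_name, args_repr) tuples in call order.
    Returns the calls that appear more than max_repeats times in total."""
    counts = Counter(calls)
    return [call for call, n in counts.items() if n > max_repeats]


def has_call_cycle(graph):
    """graph: dict mapping an agent name to the list of agents it calls.
    Depth-first search; reaching a node that is still 'in progress' means a cycle."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph.get(node, []):
            s = state.get(nxt, UNSEEN)
            if s == IN_PROGRESS:
                return True
            if s == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state.get(n, UNSEEN) == UNSEEN and visit(n) for n in list(graph))


def count_auth_failures(status_codes):
    """Number of responses with HTTP status 401 or 403."""
    return sum(code in (401, 403) for code in status_codes)
```

In practice these checks would run over parsed agent logs; how the log is parsed into call tuples, call graphs, and status codes depends on the framework and is not covered here.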
Failure root-cause analysis (Top 3):
Recommended evaluation tools (2026):