{"skill":{"slug":"agent-evaluation","displayName":"Agent Evaluation","summary":"Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":4527,"installsAllTime":56,"installsCurrent":54,"stars":7,"versions":1},"createdAt":1770674841158,"updatedAt":1777525063743},"latestVersion":{"version":"1.0.0","createdAt":1770674841158,"changelog":"- Initial release of agent-evaluation skill for testing and benchmarking LLM agents.\n- Supports behavioral testing, capability assessment, reliability metrics, and production monitoring.\n- Includes practical testing patterns: statistical test evaluation, behavioral contract testing, and adversarial testing.\n- Highlights common anti-patterns and sharp edges in LLM agent evaluation.\n- Designed for use alongside related skills such as multi-agent orchestration and autonomous agents.","license":null},"metadata":null,"owner":{"handle":"rustyorb","userId":"s1780eqenh71n1ts8v2dr3dfp583h39d","displayName":"rustyorb","image":"https://avatars.githubusercontent.com/u/111198602?v=4"},"moderation":null}