Agent Evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
6 · 2.9k · 38 current installs · 39 all-time installs
MIT-0
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name and description match the SKILL.md content. The skill is instruction-only and requests no binaries, credentials, or config paths—appropriate for a testing/evaluation guidance document. Related-skill references (multi-agent orchestration, etc.) are plausible and consistent.
Instruction Scope
SKILL.md provides high-level evaluation patterns (statistical testing, adversarial testing, behavioral contracts) and anti-patterns. It does not instruct the agent to read system files, access credentials, or post data to external endpoints. However, the instructions are deliberately high-level and give the agent wide discretion about how to run tests (including adversarial tests), so operational bounds should be set by the user or platform to prevent unintended actions against external systems or private data.
Install Mechanism
No install spec and no code files — lowest-risk instruction-only skill. Nothing is written to disk by the skill itself.
Credentials
The skill declares no environment variables, credentials, or config paths. There is no disproportionate request for secrets or access.
Persistence & Privilege
always is false and the skill is user-invocable with normal autonomous invocation allowed. That is expected for a testing skill and does not request persistent or cross-skill configuration changes.
Assessment
This skill is a high-level guide for designing and running LLM-agent tests and appears internally consistent. Before enabling it in an agent, decide and enforce operational limits: which systems or services tests may target, whether adversarial tests are allowed against live/third-party systems, and how test data is stored to avoid leaking prompts into training data. Because the skill is only instructions, also review any actual test code or other skills you pair it with (for example multi-agent orchestration) — those components will determine real access and risk. Finally, enable logging and sandboxing for agent-executed tests to detect unexpected behavior.

Like a lobster shell, security has layers — review code before you run it.

Current versionv1.0.0
Download zip
latestvk971gwmyjapsntsac2j2vbapx980tyae

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

IssueSeveritySolution
Agent scores well on benchmarks but fails in productionhigh// Bridge benchmark and production evaluation
Same test passes sometimes, fails other timeshigh// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual taskmedium// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or promptscritical// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Files

1 total
Select a file
Select a file to preview.

Comments

Loading comments…