Agent Evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

MIT-0 · Free to use, modify, and redistribute. No attribution required.

⭐ 6 · 2.9k · 38 current installs · 39 all-time installs

by@rustyorb

MIT-0

Security Scan

VirusTotal

Benign

View report →

OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name and description match the SKILL.md content. The skill is instruction-only and requests no binaries, credentials, or config paths—appropriate for a testing/evaluation guidance document. Related-skill references (multi-agent orchestration, etc.) are plausible and consistent.

ℹ

Instruction Scope

SKILL.md provides high-level evaluation patterns (statistical testing, adversarial testing, behavioral contracts) and anti-patterns. It does not instruct the agent to read system files, access credentials, or post data to external endpoints. However, the instructions are deliberately high-level and give the agent wide discretion about how to run tests (including adversarial tests), so operational bounds should be set by the user or platform to prevent unintended actions against external systems or private data.

✓

Install Mechanism

No install spec and no code files — lowest-risk instruction-only skill. Nothing is written to disk by the skill itself.

✓

Credentials

The skill declares no environment variables, credentials, or config paths. There is no disproportionate request for secrets or access.

✓

Persistence & Privilege

always is false and the skill is user-invocable with normal autonomous invocation allowed. That is expected for a testing skill and does not request persistent or cross-skill configuration changes.

Assessment

This skill is a high-level guide for designing and running LLM-agent tests and appears internally consistent. Before enabling it in an agent, decide and enforce operational limits: which systems or services tests may target, whether adversarial tests are allowed against live/third-party systems, and how test data is stored to avoid leaking prompts into training data. Because the skill is only instructions, also review any actual test code or other skills you pair it with (for example multi-agent orchestration) — those components will determine real access and risk. Finally, enable logging and sandboxing for agent-executed tests to detect unexpected behavior.

Like a lobster shell, security has layers — review code before you run it.

Current versionv1.0.0

Download zip

latestvk971gwmyjapsntsac2j2vbapx980tyae

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Termshttps://spdx.org/licenses/MIT-0.html

SKILL.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Files

1 total

Select a file

Select a file to preview.

Comments

Loading comments…