Agent Evaluation
v1.0.0Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
⭐ 6· 4k·53 current·55 all-time
MIT-0
Download zip
LicenseMIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign
high confidencePurpose & Capability
Name and description match the SKILL.md content. The skill is instruction-only and requests no binaries, credentials, or config paths—appropriate for a testing/evaluation guidance document. Related-skill references (multi-agent orchestration, etc.) are plausible and consistent.
Instruction Scope
SKILL.md provides high-level evaluation patterns (statistical testing, adversarial testing, behavioral contracts) and anti-patterns. It does not instruct the agent to read system files, access credentials, or post data to external endpoints. However, the instructions are deliberately high-level and give the agent wide discretion about how to run tests (including adversarial tests), so operational bounds should be set by the user or platform to prevent unintended actions against external systems or private data.
Install Mechanism
No install spec and no code files — lowest-risk instruction-only skill. Nothing is written to disk by the skill itself.
Credentials
The skill declares no environment variables, credentials, or config paths. There is no disproportionate request for secrets or access.
Persistence & Privilege
always is false and the skill is user-invocable with normal autonomous invocation allowed. That is expected for a testing skill and does not request persistent or cross-skill configuration changes.
Assessment
This skill is a high-level guide for designing and running LLM-agent tests and appears internally consistent. Before enabling it in an agent, decide and enforce operational limits: which systems or services tests may target, whether adversarial tests are allowed against live/third-party systems, and how test data is stored to avoid leaking prompts into training data. Because the skill is only instructions, also review any actual test code or other skills you pair it with (for example multi-agent orchestration) — those components will determine real access and risk. Finally, enable logging and sandboxing for agent-executed tests to detect unexpected behavior.Like a lobster shell, security has layers — review code before you run it.
latestvk971gwmyjapsntsac2j2vbapx980tyae
License
MIT-0
Free to use, modify, and redistribute. No attribution required.
