Reddi Agent Evaluation

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 27 · 0 current installs · 0 all-time installs
byNissan Dookeran@nissan
MIT-0
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description, SKILL.md content, and included test cases all describe agent evaluation and benchmarking. The declared required binary (python3) is surprising because the skill is instruction-only and ships no runnable code; it may be harmless (a generic dependency hint) but is disproportionate to the provided files.
Instruction Scope
The instructions are focused on designing and running evaluation tests, statistical approaches, and anti-patterns. They do not instruct the agent to read arbitrary files, exfiltrate data, or call unexpected external endpoints. The metadata allows outbound network calls for LLM API usage, which matches the skill's purpose of scoring agents.
Install Mechanism
There is no install spec and no code files to download or execute. This is the lowest-risk model for an OpenClaw skill.
Credentials
The skill declares no required environment variables, no primary credential, and no config paths. That aligns with an instruction-only evaluation guide.
Persistence & Privilege
The skill is not force-included (always: false) and uses normal autonomous invocation semantics. It does not request persistent system-wide changes or other skills' credentials.
Assessment
This skill appears coherent and low-risk: it only provides guidance and test cases for evaluating LLM agents and requires no secrets or installs. Before installing, consider: (1) confirm why python3 is declared — if you have no intent to run external Python scripts this requirement is unnecessary; (2) note that metadata permits outbound network calls (standard for calling an LLM API) — ensure your agent's configured LLM endpoints and keys are ones you trust; (3) because this is instruction-only, future versions could add code or env requirements — re-review on updates. If you plan to run any evaluation scripts referenced in your own workflows, run them in a controlled environment and audit any code they download.

Like a lobster shell, security has layers — review code before you run it.

Current versionv1.0.2
Download zip
latestvk97dqa3czh5xt5yab4zp4g71rx82yffp

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

📋 Clawdis
Binspython3

SKILL.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

IssueSeveritySolution
Agent scores well on benchmarks but fails in productionhigh// Bridge benchmark and production evaluation
Same test passes sometimes, fails other timeshigh// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual taskmedium// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or promptscritical// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Files

3 total
Select a file
Select a file to preview.

Comments

Loading comments…