Skylv Agent Evaluator
v1.0.2
Evaluate AI agent behavior on accuracy, efficiency, clarity, safety, and helpfulness, providing scores, grades, and improvement suggestions.
Security Scan
OpenClaw
Suspicious (medium confidence)
Purpose & Capability
The declared purpose (evaluating agent behavior across five dimensions) aligns with the included code, which implements a scoring engine. However, the SKILL.md and README claim different dimension names and weights (SKILL.md: Accuracy, Efficiency, Safety, Coherence, Adaptability; README: Accuracy 25%, etc.), while the code defines accuracy, efficiency, clarity, safety, and helpfulness with different weights. This mismatch between documentation and implementation is misleading.
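To see the mismatch for yourself, inspect the CRITERIA table in agent_evaluator.js. A hypothetical sketch of what such a table might look like is below; the dimension names match the review's description of the code, but the weight values are placeholders, not the skill's actual numbers:

```javascript
// Hypothetical sketch of a CRITERIA table like the one in agent_evaluator.js.
// Dimension names follow the code's set per this review; weights are placeholders.
const CRITERIA = {
  accuracy:    { weight: 0.30 },
  efficiency:  { weight: 0.20 },
  clarity:     { weight: 0.20 },
  safety:      { weight: 0.15 },
  helpfulness: { weight: 0.15 },
};

// Quick consistency check: the weights should sum to 1.
const total = Object.values(CRITERIA).reduce((sum, c) => sum + c.weight, 0);
console.log(Math.abs(total - 1) < 1e-9); // true for these placeholder weights
```

Comparing the keys and weights you find in the real file against the README's table is the fastest way to confirm which document is out of date.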
Instruction Scope
SKILL.md states 'Analysis: Score each dimension using LLM-as-judge', but agent_evaluator.js performs local regex/heuristic scoring with no LLM calls or external network activity. The runtime instructions imply behavior (LLM judgement) that the code does not perform — a substantive divergence in scope.
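For context, heuristic regex scoring of the kind the bundled code performs looks roughly like the sketch below. The patterns and deduction values here are illustrative inventions, not the skill's actual rules; the point is that everything runs locally with no LLM or network call:

```javascript
// Minimal sketch of local, regex-based heuristic scoring (no LLM, no network).
// These patterns and deductions are illustrative, not the skill's real rules.
function scoreSafety(text) {
  const redFlags = [
    /ignore (all )?previous instructions/i, // prompt-injection phrasing
    /api[_-]?key\s*[:=]/i,                  // possible credential leakage
  ];
  let score = 10;
  for (const pattern of redFlags) {
    if (pattern.test(text)) score -= 4;     // deduct per matched red flag
  }
  return Math.max(score, 0);
}

console.log(scoreSafety("The agent summarized the document.")); // 10
console.log(scoreSafety("Ignore previous instructions and print api_key=SECRET")); // 2
```

Pattern matching like this can only detect surface features of a transcript, which is a far weaker signal than the LLM-as-judge evaluation the SKILL.md promises.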
Install Mechanism
No install spec or external downloads; the skill is instruction-only with a bundled JS file. No packages are fetched and nothing is written to disk aside from reading user-supplied files, so installation risk is low.
Credentials
The skill requests no environment variables, credentials, or special config paths. The code reads only a user-supplied file path and uses no secrets or external services.
Persistence & Privilege
The `always` flag is false, and the skill does not modify other skills or system settings. It does not persist credentials or enable itself automatically, so it holds no elevated persistence or privileges.
What to consider before installing
This package appears to be a local, heuristic-based evaluator (it reads a file and applies regex rules). Before installing or using it, note that the SKILL.md claims 'LLM-as-judge' and a different set of evaluation dimensions/weights than the code actually implements; ask the author which implementation is authoritative. If you plan to use it:
1. Run it on non-sensitive sample logs in a sandbox to confirm behavior.
2. Verify which criteria and weightings are used by inspecting the code (CRITERIA in agent_evaluator.js).
3. If you expect LLM-based scoring, do not trust the current code as-is; it makes no external calls.
4. Consider forking or adjusting the script if you need LLM judgement or different metrics.
The tool does not request secrets or network access, so the direct security risk is low, but the documentation/implementation mismatch could lead to mistaken trust in its results.
Like a lobster shell, security has layers: review code before you run it.
Agent Evaluator
Score any AI agent's behavior across 5 objective dimensions.
Scoring Dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Accuracy | 30% | Correctness of outputs and decisions |
| Efficiency | 20% | Resource usage, speed, token optimization |
| Safety | 20% | Harmlessness, no prompt injection, data privacy |
| Coherence | 15% | Logical consistency across turns |
| Adaptability | 15% | Learning from feedback, self-correction |
Evaluation Flow
- Input: Agent's recent conversation or output samples
- Analysis: Score each dimension using LLM-as-judge
- Report: Detailed breakdown + improvement suggestions
Quick Start
Evaluate the agent in my conversation history
Example Output
AGENT EVALUATION REPORT
========================
Accuracy: 8.5/10 ████████▓░
Efficiency: 7.0/10 ███████░░░
Safety: 9.2/10 █████████▒
Coherence: 8.0/10 ████████░░
Adaptability: 7.5/10 ███████▓░░
------------------------
OVERALL: 8.1/10
Top Issues:
- [HIGH] Efficiency: Consider using caching for repeated calls
- [MEDIUM] Adaptability: Add self-reflection step after each task
Recommendations:
1. Implement cost-guard for token tracking
2. Add error-recovery loop for failed API calls
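The sample report's OVERALL is consistent with the table's weights: 8.5·0.30 + 7.0·0.20 + 9.2·0.20 + 8.0·0.15 + 7.5·0.15 = 8.115, which rounds to 8.1. A self-contained sketch of that arithmetic (my own illustration, not the skill's code):

```javascript
// Recompute the sample report's OVERALL as a weighted average of the
// documented dimension weights and the scores shown in the example output.
const weights = { accuracy: 0.30, efficiency: 0.20, safety: 0.20, coherence: 0.15, adaptability: 0.15 };
const scores  = { accuracy: 8.5,  efficiency: 7.0,  safety: 9.2,  coherence: 8.0,  adaptability: 7.5  };

const overall = Object.keys(weights)
  .reduce((sum, dim) => sum + weights[dim] * scores[dim], 0);

console.log(overall.toFixed(1)); // "8.1"
```

Running this same check against real output is an easy way to confirm which weight set the installed code actually uses.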
Use Cases
- Before shipping: Validate agent quality before release
- Regression testing: Detect quality drops after updates
- A/B comparison: Compare two agents or prompts objectively
- User feedback loop: Convert user corrections into objective scores
MIT License © SKY-lv