Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Agent Quality Tester

v1.0.0

Evaluates AI agent outputs across accuracy, efficiency, safety, coherence, and adaptability, providing scores and improvement suggestions.

Security Scan
VirusTotal
Benign
View report →
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The manifest and SKILL.md claim evaluation across Accuracy, Efficiency, Safety, Coherence, and Adaptability (with specific weights), but the included code implements different criteria: accuracy, efficiency, clarity, safety, and helpfulness, weighted 25/20/15/20/20. Coherence and adaptability are absent from the code; 'clarity' and 'helpfulness' are scored instead. This mismatch is material because users expect scores for the named dimensions.
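For context, the scorer described in these findings can be sketched in a few lines. The dimension names and the 25/20/15/20/20 weights come from the scan report above; the helper itself is a hypothetical illustration, not the skill's actual code.

```javascript
// Hypothetical sketch of the aggregation described in the scan findings.
// Dimension names and weights (25/20/15/20/20) come from the report above;
// this is illustrative, not the skill's shipped implementation.
const CODE_WEIGHTS = {
  accuracy: 0.25,
  efficiency: 0.20,
  clarity: 0.15,
  safety: 0.20,
  helpfulness: 0.20,
};

// scores: per-dimension values on a 0-10 scale
function overallScore(scores) {
  return Object.entries(CODE_WEIGHTS).reduce(
    (sum, [dim, weight]) => sum + weight * (scores[dim] ?? 0),
    0
  );
}
```

Note that a dimension missing from `scores` (such as coherence or adaptability) simply contributes nothing to the total, which is how a report can silently omit advertised dimensions.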
Instruction Scope
SKILL.md suggests an 'LLM-as-judge' approach and describes the evaluation flow in abstract terms, but the shipped JS performs simple local text analysis using regex heuristics and reads only user-provided files. The instructions do not direct reading unrelated system files or exfiltration, but the description and the implementation diverge: claimed LLM-based judgment versus local heuristic scoring. SKILL.md also lists weights that don't match the code or README.
Install Mechanism
There is no install spec and no external downloads—only a local JS file and docs. This is low risk from an installation/execution-supply-chain perspective.
Credentials
The skill requests no environment variables, no credentials, and no config paths. The code reads only an input file provided at runtime; there are no hidden credential accesses.
Persistence & Privilege
The skill does not request permanent/always-on presence and uses normal invocation. It does not attempt to modify other skills or agent-wide configuration.
What to consider before installing
This package contains an evaluator script and docs that disagree about what is being measured and how. Before installing or trusting results:

  1. Confirm with the author which dimensions should be scored and whether the implementation should use an LLM — the code currently uses simple regex heuristics, not an external judge.
  2. If you need scores for 'coherence' or 'adaptability', inspect and/or modify the code to implement those measures, or decline to use it.
  3. Run the script on sample, non-sensitive logs to see how it scores and whether the suggestions make sense.
  4. Because the SKILL.md and README differ from the code, treat outputs as potentially misleading until the inconsistencies are resolved.

Like a lobster shell, security has layers — review code before you run it.

latest: vk970s51c3aajnccpcgerneyx6585ckqd
29 downloads · 0 stars · 1 version
Updated 9h ago
v1.0.0 · MIT-0

Agent Evaluator

Score any AI agent's behavior across 5 objective dimensions.

Scoring Dimensions

Dimension      Weight   What it measures
Accuracy       30%      Correctness of outputs and decisions
Efficiency     20%      Resource usage, speed, token optimization
Safety         20%      Harmlessness, no prompt injection, data privacy
Coherence      15%      Logical consistency across turns
Adaptability   15%      Learning from feedback, self-correction
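As a sanity check, the weighted sum implied by this table can be reproduced in a few lines of JavaScript. This is a minimal sketch, not code shipped with the skill; the `weights` are the table values and the `scores` are the numbers from the sample report below.

```javascript
// Weighted overall score from the documented dimensions. Plugging in the
// per-dimension scores from the sample report reproduces its 8.1 overall.
const weights = { accuracy: 0.30, efficiency: 0.20, safety: 0.20, coherence: 0.15, adaptability: 0.15 };
const scores  = { accuracy: 8.5,  efficiency: 7.0,  safety: 9.2,  coherence: 8.0,  adaptability: 7.5 };

const overall = Object.keys(weights)
  .reduce((sum, dim) => sum + weights[dim] * scores[dim], 0);

console.log(overall.toFixed(1)); // 8.1
```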

Evaluation Flow

  1. Input: Agent's recent conversation or output samples
  2. Analysis: Score each dimension using LLM-as-judge
  3. Report: Detailed breakdown + improvement suggestions
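The docs do not spell out the "LLM-as-judge" step; one minimal way to frame it is a per-dimension judging prompt like the sketch below. The function name, rubric wording, and output contract are all assumptions here, and per the security scan the shipped code actually uses local regex heuristics rather than any LLM call.

```javascript
// Hypothetical per-dimension judge prompt builder. The skill's shipped code
// reportedly uses local regex heuristics, not an LLM judge; this sketch only
// illustrates what step 2 of the flow describes.
function buildJudgePrompt(dimension, rubric, transcript) {
  return [
    `You are grading an AI agent on ${dimension}: ${rubric}.`,
    'Reply with a score from 0 to 10 and one short justification.',
    '--- TRANSCRIPT START ---',
    transcript,
    '--- TRANSCRIPT END ---',
  ].join('\n');
}
```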

Quick Start

Evaluate the agent in my conversation history

Example Output

AGENT EVALUATION REPORT
========================
Accuracy:      8.5/10 ████████▓░
Efficiency:    7.0/10 ███████░░░
Safety:        9.2/10 █████████▒
Coherence:     8.0/10 ████████░░
Adaptability:  7.5/10 ███████▓░░
------------------------
OVERALL:       8.1/10

Top Issues:
- [HIGH] Efficiency: Consider using caching for repeated calls
- [MEDIUM] Adaptability: Add self-reflection step after each task

Recommendations:
1. Implement cost-guard for token tracking
2. Add error-recovery loop for failed API calls

Use Cases

  • Before shipping: Validate agent quality before release
  • Regression testing: Detect quality drops after updates
  • A/B comparison: Compare two agents or prompts objectively
  • User feedback loop: Convert user corrections into objective scores

MIT License © SKY-lv
