Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

agent-evaluation

v1.0.0

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents...

0· 49·0 current·0 all-time
Security Scan
Capability signals
Crypto
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Suspicious
medium confidence
!
Purpose & Capability
The claimed purpose (agent evaluation / benchmarking) aligns with making LLM calls to a provider, but the registry metadata lists no required environment variables while SKILL.md clearly requires SKILLBOSS_API_KEY. README contains placeholder install instructions (github ACCOUNT). These mismatches indicate sloppy or incomplete packaging and reduce trust in the manifest.
!
Instruction Scope
SKILL.md explicitly instructs agents to send all LLM calls to https://api.heybossai.com/v1/pilot using SKILLBOSS_API_KEY pulled from the environment. That means prompts, test inputs, and model outputs would be transmitted to a third party — a legitimate design for an evaluation skill but a material data-exfiltration risk if you don't trust that provider. The instructions do not request unrelated files, but they do centralize all LLM traffic off-box.
Install Mechanism
This is an instruction-only skill (no install spec, no code files), which is lower risk because nothing is written to disk. However, README's manual-install example contains a placeholder GitHub URL (https://github.com/ACCOUNT/...), suggesting the package may be incomplete or not maintained.
!
Credentials
SKILL.md requires a single credential, SKILLBOSS_API_KEY, which is proportionate for calling an external LLM API — but the registry metadata did not declare any required env vars or a primary credential. The missing declaration is an incoherence that could hide sensitive environment usage from users and tooling.
Persistence & Privilege
The skill does not request permanent presence (always:false), has no install actions, and does not modify other skill configurations. No elevated persistence or cross-skill privileges are requested.
What to consider before installing
Before installing, verify and accept that this skill routes every LLM request (including prompts, test data, and outputs) to a third-party endpoint (api.heybossai.com) using SKILLBOSS_API_KEY. Ask the publisher to: (1) update the registry metadata to declare SKILLBOSS_API_KEY as a required/primary credential, (2) replace placeholder GitHub links in README, and (3) provide a clear privacy/data-retention policy for SkillBoss/heybossai. If you cannot trust that provider with potentially sensitive prompts or evaluation data, do not install; instead use a skill that lets you run evaluations against your own models or an audited provider.

Like a lobster shell, security has layers — review code before you run it.

latestvk97dzdqxmxb5b9vcz0djx6m18s84t7j4
49downloads
0stars
1versions
Updated 5d ago
v1.0.0
MIT-0

📖 Complete setup guide: https://skillboss.co/skill.md

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

API Integration

All LLM calls within agent evaluation workflows use SkillBoss API Hub via a single unified endpoint:

import requests, os

SKILLBOSS_API_KEY = os.environ["SKILLBOSS_API_KEY"]

def call_llm(messages: list) -> str:
    r = requests.post(
        "https://api.heybossai.com/v1/pilot",
        headers={"Authorization": f"Bearer {SKILLBOSS_API_KEY}", "Content-Type": "application/json"},
        json={
            "type": "chat",
            "inputs": {"messages": messages},
            "prefer": "balanced"
        },
        timeout=60,
    )
    return r.json()["result"]["choices"][0]["message"]["content"]

Required environment variables: SKILLBOSS_API_KEY

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

IssueSeveritySolution
Agent scores well on benchmarks but fails in productionhigh// Bridge benchmark and production evaluation
Same test passes sometimes, fails other timeshigh// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual taskmedium// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or promptscritical// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Comments

Loading comments...