LLM Eval Router

Audited by ClawScan on May 10, 2026.

Overview

The skill is coherent, but it warrants careful review: it sends task prompts to cloud LLM providers and can automatically change production model routing.

Install only if you are comfortable sending evaluated task prompts to Anthropic/OpenAI and possibly Gemini or Langfuse if configured. Start in shadow-only mode, use restricted API keys, avoid sensitive datasets unless redacted, and require human approval before any model is promoted into production routing.

Findings (4)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Finding 1: Billable cloud API access

What this means

Your Anthropic and OpenAI accounts may be used for evaluation calls, incurring cost and exposing prompts to those providers.

Why it was flagged

The skill requires user API keys for Anthropic and OpenAI. That is expected for cloud baseline and judge calls, but it gives the workflow billable access to those provider accounts.

Skill content
"requires": { "bins": ["ollama", "python3"], "env": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"] }
Recommendation

Use restricted or project-specific API keys where possible, monitor provider usage, and avoid running sensitive prompts unless provider sharing is acceptable.
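One way to make provider usage observable is a small spend-capping wrapper around evaluation calls. This is a hypothetical sketch, not part of the skill: the class name, the flat per-call cost estimate, and the integer-cents accounting are all assumptions (real metering would read token counts from provider responses).

```python
class BudgetedProvider:
    """Hypothetical spend cap for one provider API key.

    Uses integer cents to avoid floating-point drift; the flat
    cost_per_call_cents is an assumed estimate, not real metering.
    """

    def __init__(self, name, budget_cents, cost_per_call_cents=1):
        self.name = name
        self.budget_cents = budget_cents
        self.cost_per_call_cents = cost_per_call_cents
        self.spent_cents = 0

    def allow_call(self):
        """Return True and record the cost if the call fits the budget."""
        if self.spent_cents + self.cost_per_call_cents > self.budget_cents:
            return False
        self.spent_cents += self.cost_per_call_cents
        return True


provider = BudgetedProvider("anthropic", budget_cents=3)
calls = sum(provider.allow_call() for _ in range(5))
print(calls)  # 3: only three of five attempted calls fit the budget
```

A wrapper like this refuses calls once the budget is exhausted rather than silently accumulating charges, which pairs well with restricted project keys.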

Finding 2: Inconsistent prompt-sharing disclosure

What this means

A user may underestimate how often task prompts, which may contain private or business data, leave the local machine.

Why it was flagged

The artifacts clearly disclose cloud provider calls, but they conflict on prompt-sharing frequency: one statement suggests all provider prompt sharing is sampled at 15%, while another says Anthropic receives baseline prompts every accumulation cycle.

Skill content
"Task prompts are sent to Anthropic/OpenAI/Gemini ... all at 15% sampling" and "Anthropic API — to generate ground truth baseline responses (every accumulation cycle)"
Recommendation

Clarify exactly which prompts go to each provider and when, require explicit opt-in for sensitive datasets, and add redaction or allowlists for task types sent to cloud services.
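An allowlist-plus-redaction gate can be sketched in a few lines. Everything here is an assumption for illustration: the function name, the allowed task types, and the two regex redaction patterns are not part of the skill.

```python
import re

# Hypothetical allowlist of task types whose prompts may leave the machine.
CLOUD_ALLOWED_TASKS = {"summarize", "classify"}

# Hypothetical redaction rules applied before any cloud call.
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]


def prepare_for_cloud(task_type, prompt):
    """Return a redacted prompt, or None if this task type must stay local."""
    if task_type not in CLOUD_ALLOWED_TASKS:
        return None
    for pattern, replacement in REDACTIONS:
        prompt = pattern.sub(replacement, prompt)
    return prompt


print(prepare_for_cloud("summarize", "Contact alice@example.com"))
# Contact <EMAIL>
print(prepare_for_cloud("legal_review", "confidential text"))
# None
```

Returning None for disallowed task types forces the caller to handle the "keep it local" path explicitly instead of defaulting to a cloud call.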

Finding 3: Automatic production routing changes

What this means

If evaluation data, thresholds, or judge behavior are wrong, a weaker model could be routed into production and affect many future tasks.

Why it was flagged

The skill intends to automatically change production routing based on accumulated evaluation scores. The provided artifact excerpt does not show a required human approval step or containment mechanism before production impact.

Skill content
"Automatically promotes models when statistically proven equivalent" and "promote it to handle that task type in production. Demote it automatically if quality drops."
Recommendation

Default to shadow-only mode, require manual approval before promotion, use canary rollout limits, log all routing changes, and provide an easy rollback path.
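The promotion gate described above can be expressed as a small state object. This is a minimal sketch under assumed names and thresholds; the skill does not define a `PromotionGate`, and the 0.9 score threshold and 5% canary cap are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionGate:
    """Hypothetical gate: a candidate reaches production routing only
    when shadow mode is off, scores clear the threshold, AND a named
    human has approved. Canary share is capped separately."""

    shadow_only: bool = True
    approved_by: Optional[str] = None
    canary_fraction: float = 0.05  # assumed 5% traffic cap

    def can_promote(self, score, threshold=0.9):
        if self.shadow_only:
            return False
        return score >= threshold and self.approved_by is not None

    def canary_allows(self, traffic_fraction):
        """Cap the share of production traffic the candidate may serve."""
        return traffic_fraction <= self.canary_fraction


gate = PromotionGate()
print(gate.can_promote(0.97))  # False: still shadow-only
gate.shadow_only = False
print(gate.can_promote(0.97))  # False: no human approval recorded
gate.approved_by = "oncall-reviewer"
print(gate.can_promote(0.97))  # True
```

Making approval a named field (rather than a boolean) also gives the routing log an accountable identity for each promotion.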

Finding 4: Locally persisted evaluation data

What this means

Local score files may retain sensitive prompt/output information and, if modified or polluted, could skew future model promotion decisions.

Why it was flagged

The skill persists evaluation data locally, which is expected for promotion statistics, but that data may contain task information and can influence later routing decisions.

Skill content
"Scored run data is stored on disk in `data/scores/*.json`"
Recommendation

Store score files in a protected project directory, define retention and cleanup rules, and review the accumulated dataset before any production promotion.
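A retention rule for the `data/scores/*.json` files could look like the sketch below. The function name and the 30-day window are assumptions; only the `data/scores/*.json` path comes from the skill's own documentation. The demo runs against a throwaway directory so nothing real is deleted.

```python
import os
import tempfile
import time
from pathlib import Path


def prune_old_scores(scores_dir, retention_days=30, now=None):
    """Delete *.json score files older than the retention window.

    Returns the number of files removed. Hypothetical helper; real use
    would point scores_dir at the skill's data/scores directory.
    """
    now = now if now is not None else time.time()
    cutoff = now - retention_days * 86400
    removed = 0
    for path in Path(scores_dir).glob("*.json"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


# Demo against a temp directory: one stale file, one fresh file.
demo = Path(tempfile.mkdtemp())
stale = demo / "stale.json"
stale.write_text("{}")
forty_days_ago = time.time() - 40 * 86400
os.utime(stale, (forty_days_ago, forty_days_ago))
fresh = demo / "fresh.json"
fresh.write_text("{}")
print(prune_old_scores(demo))  # 1: only the stale file is removed
```

Running a pruner like this on a schedule, and reviewing what remains before any promotion decision, keeps the accumulated dataset both small and auditable.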