LLM as Judge

Review: Audited by ClawScan on May 1, 2026.

Overview

This instruction-only LLM evaluation skill is coherent and discloses its use of Anthropic/OpenAI API keys and outbound judge calls, but users should consider API cost and data sensitivity.

This appears purpose-aligned for building an LLM evaluation ensemble. Before installing or using it, confirm you are comfortable sending sampled evaluation data to Anthropic/OpenAI, use controlled API keys with spending limits, and avoid including confidential source text unless provider handling is acceptable.

Findings (2)

This is an artifact-based, informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Finding 1: Provider API keys and billing

What this means

Using the skill may consume Anthropic/OpenAI quota or incur charges under the user's accounts.

Why it was flagged

The skill requires provider API credentials, which is expected for using Claude and GPT as judge models; however, those keys authorize usage of, and billing against, the holder's provider accounts.

Skill content
"requires": { "bins": ["python3", "ollama"], "env": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"] }, "primaryEnv": "ANTHROPIC_API_KEY"
Recommendation

Use least-privilege project keys where possible, set spending limits, and avoid sharing keys outside the intended environment.
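As a minimal illustration of this recommendation, a wrapper can fail fast when a required key is absent and mask key values before they ever reach logs. This is a hypothetical sketch: the function name and masking rule are assumptions, not part of the skill; only the environment variable names come from the skill's "requires" spec.

```python
import os

# Key names taken from the skill's declared "requires" / "env" spec.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")

def check_keys() -> dict:
    """Fail fast if a required provider key is missing; never echo full values."""
    masked = {}
    for name in REQUIRED_KEYS:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"{name} is not set; supply a least-privilege project key")
        # Mask all but the last 4 characters so log output never leaks the key.
        masked[name] = "*" * max(len(value) - 4, 0) + value[-4:]
    return masked
```

Pairing a check like this with project-scoped keys and provider-side spending limits keeps a leaked or misused key bounded in blast radius.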

Finding 2: Evaluation data sent to external providers

What this means

Private or sensitive evaluation data could be transmitted to Anthropic/OpenAI if included in judged samples.

Why it was flagged

The skill discloses that evaluation inputs may be sent to external LLM providers for judging; those inputs can include ground truth, candidate output, and source text.

Skill content
"outbound": true, "reason": "Calls Anthropic (Claude Sonnet) and OpenAI (GPT-4o-mini) as judge models at 15% sampling rate" ... "ground_truth=gt_response, candidate=candidate_response, source_text=original_text"
Recommendation

Do not evaluate confidential data unless the provider terms and project settings are acceptable; redact sensitive inputs or keep evaluation local when needed.