LLM as Judge

v1.0.1

Build a cost-efficient LLM evaluation ensemble with sampling, tiebreakers, and deterministic validators. Learned from 600+ production runs judging local Olla...

by Nissan Dookeran (@nissan)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill is an LLM evaluation ensemble that uses local models (requiring ollama and python3) alongside sampled cloud judges (via Anthropic and OpenAI API keys). The required binaries and declared env vars are consistent with that purpose; nothing requested appears unrelated to running a judge ensemble.
Instruction Scope
SKILL.md instructs the agent to run deterministic validators locally and to call external judge APIs for a 15% sample. The instructions do not appear to ask for unrelated system files or credentials, but they do result in sending sampled evaluation payloads to third‑party LLM providers (using the user's keys). Users should be aware that sampled inputs/outputs will be transmitted to those providers.
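The sampling behavior described above can be illustrated with a minimal sketch. The helper names (`run_validators`, `call_cloud_judges`) are hypothetical stand-ins, not the skill's actual implementation; the point is that deterministic validators run on every payload while only a random ~15% slice is transmitted to third-party judge APIs.

```python
import random

SAMPLE_RATE = 0.15  # roughly 15% of evaluations also go to cloud judges

def evaluate(payload, run_validators, call_cloud_judges):
    """Run local deterministic validators on every payload; forward a
    random sample to external judge APIs (helper names are hypothetical)."""
    result = {"validator": run_validators(payload), "cloud": None}
    if random.random() < SAMPLE_RATE:
        # Only this branch sends the payload to third-party providers,
        # using the user's API keys and incurring per-call cost.
        result["cloud"] = call_cloud_judges(payload)
    return result
```

Anything reaching the cloud branch leaves the machine, so redacting sensitive fields before `evaluate` is the natural place to enforce the privacy caveat noted above.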
Install Mechanism
This is an instruction-only skill with no install spec and no code written to disk, which is low-risk from an install standpoint.
Credentials
The skill requires ANTHROPIC_API_KEY and OPENAI_API_KEY, which matches the stated use of Anthropic and OpenAI judges. Minor inconsistency: the SKILL.md mentions an optional Gemini tiebreaker (Google) but does not declare any Google-related env var — users enabling that path would need additional credentials. Also note both cloud keys are declared required even though cloud calls are sampled at 15%.
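A preflight check for the credential situation described above could look like the following sketch. The Gemini variable name is an assumption (the manifest declares no Google env var), so treat `GOOGLE_API_KEY` as a placeholder.

```python
import os

REQUIRED = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
# Not declared in the manifest; needed only if the optional Gemini
# tiebreaker is enabled. The exact variable name is an assumption.
OPTIONAL_GEMINI = "GOOGLE_API_KEY"

def missing_credentials(env=os.environ, use_gemini=False):
    """Return the names of credentials absent from the environment."""
    needed = list(REQUIRED) + ([OPTIONAL_GEMINI] if use_gemini else [])
    return [name for name in needed if not env.get(name)]
```

Running such a check before the first judged batch surfaces the undeclared Google dependency early instead of mid-run.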
Persistence & Privilege
The manifest's always flag is false, and the skill is user-invocable with normal autonomous invocation allowed. No elevated or persistent privileges are requested in the manifest.
Assessment
This skill appears to do what it claims: run local validators and periodically call cloud LLMs as judges using your API keys. Before installing:
1. Confirm you are comfortable that sampled evaluation data (inputs and model outputs) will be sent to Anthropic/OpenAI and may incur cost; remove or redact any sensitive PII from evaluated data.
2. If you plan to enable the optional Gemini tiebreaker, expect to supply Google credentials not listed in the manifest.
3. Verify you have ollama installed and trust the local models you run on-device.
4. Review API usage and quotas to avoid unexpected billing from sampled judge calls.


latest: vk97ajq52390hmd74n4sk2bkcsn83r706


Runtime requirements

⚖️ Clawdis
Bins: python3, ollama
Env: ANTHROPIC_API_KEY, OPENAI_API_KEY
Primary env: ANTHROPIC_API_KEY
