LLM as Judge

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Neal Meyer (@ngmeyer)
Security Scan
VirusTotal
Benign
OpenClaw
Benign
high confidence
Purpose & Capability
Name and description match the content: the skill defines a prompt-and-workflow for spawning a judge subagent using a different model for review. It does not request binaries, credentials, or system access that would be out of scope for a cross-model review pattern. It does reference specific providers/models (Claude, Kimi, Grok, Gemini, Opus), which is an expectation about available model providers rather than a secret or extra entitlement.
Instruction Scope
SKILL.md and templates are scoped to reviewing plans, code, and high‑stakes systems and constrain judge output to APPROVE/REVISE/REJECT with scored feedback. There is no instruction to read unrelated files, access environment secrets, or call external endpoints. Note: in practice using third‑party judge models may involve sending potentially sensitive project data to another provider — the skill does not explicitly warn about avoiding secrets or PHI when sending content to an external judge model.
Install Mechanism
Instruction-only skill with no install spec and no code files. This is low risk and expected for a prompt/workflow template.
Credentials
The skill declares no required environment variables or credentials, which is coherent for a pattern. However it presumes the agent/platform can invoke alternative models/providers; in real use you may need provider credentials or API keys (not declared here). Consider whether your agent will route judge calls to third‑party providers and whether those providers will receive sensitive data.
Persistence & Privilege
The `always` flag is false, the skill requests no persistent presence, and it does not attempt to modify other skills or system settings. Autonomous invocation is allowed (the platform default), but it is not combined with elevated privileges or secret access.
Assessment
This is a coherent, low-risk prompt/workflow template for cross-model review. Before installing or using it, confirm your agent/platform can actually call alternative models/providers as the skill expects; if that involves third‑party APIs, avoid sending sensitive secrets or personal data to judge models, and check provider data handling and costs. If you plan to run security‑critical or proprietary code through an external judge model, obtain explicit consent and consider local/private review alternatives.


Current version: v1.2.0


SKILL.md

LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern when:

  • Architecture or system design decisions
  • Multi-file changes affecting >5 files or >500 LOC
  • Security-critical code (auth, payments, crypto/DeFi)
  • Financial/trading systems (market making, quant strategies)
  • Planning documents that will drive weeks of work
  • Stuck after 3+ failed attempts on same problem

Skip when:

  • Simple edits, config tweaks, bug fixes with obvious cause
  • Documentation updates
  • Single-file changes under 100 LOC
  • Tasks where self-review is sufficient

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
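The executor → judge loop above can be sketched in Python. Here `execute` and `judge` stand in for whatever model-calling functions your agent platform provides; the names and `Verdict` shape are illustrative assumptions, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    decision: str   # "APPROVE" | "REVISE" | "REJECT"
    feedback: str   # scored, specific feedback from the judge

def review_loop(task, execute, judge, max_rounds=3):
    """Run executor output through a cross-model judge until approved."""
    output = execute(task)
    for _ in range(max_rounds):
        verdict = judge(output)
        if verdict.decision == "APPROVE":
            return output
        if verdict.decision == "REJECT":
            output = execute(task)  # restart from scratch
        else:
            # REVISE: rework the output with the judge's specific feedback
            output = execute(f"{task}\n\nJudge feedback:\n{verdict.feedback}")
    return output
```

Capping the rounds keeps a lenient or confused judge from looping forever; after `max_rounds`, the latest output is returned for human review.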

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

  • Executor: Claude → Judge: kimi or grok or gemini-pro
  • Executor: Kimi/Gemini → Judge: opus
  • Principle: Different provider, similar capability tier
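A minimal sketch of that pairing rule, assuming a hand-maintained provider map (the model names and candidate order below are assumptions drawn from the pairings above, not a real model registry):

```python
# Hypothetical provider map — extend to match the models your platform exposes.
PROVIDER = {
    "claude": "anthropic", "opus": "anthropic", "sonnet": "anthropic",
    "kimi": "moonshot", "grok": "xai",
    "gemini": "google", "gemini-pro": "google",
}

# Ordered preference list; "opus" first so non-Anthropic executors get it.
JUDGE_CANDIDATES = ["opus", "kimi", "grok", "gemini-pro"]

def pick_judge(executor_model: str) -> str:
    """Pick the first judge candidate from a different provider."""
    executor_provider = PROVIDER.get(executor_model.lower())
    for candidate in JUDGE_CANDIDATES:
        if PROVIDER[candidate] != executor_provider:
            return candidate
    raise ValueError("no cross-provider judge available")
```

The only invariant worth enforcing is the principle itself: the judge's provider must differ from the executor's.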

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

  • Plan completeness, feasibility, risk, testing strategy
  • Architecture review with scoring (0-10 per dimension)
  • Code review checklist (correctness, design, safety, maintainability)
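For illustration only — the actual templates live in references/judge-prompts.md, which is not reproduced here — a plan-review prompt following the shape described above might look like:

```python
# Illustrative sketch, not the contents of references/judge-prompts.md.
JUDGE_PROMPT = """You are reviewing a plan produced by another model. Do not rewrite it.

Score each dimension 0-10 with a one-line justification:
- Completeness
- Feasibility
- Risk coverage
- Testing strategy

Then give a verdict on the last line, exactly one of:
VERDICT: APPROVE | VERDICT: REVISE | VERDICT: REJECT

If REVISE or REJECT, list specific, actionable items.

PLAN:
{plan}
"""

def build_judge_prompt(plan: str) -> str:
    return JUDGE_PROMPT.format(plan=plan)
```

Pinning the verdict to a fixed last-line token makes the judge's output trivially parseable by the calling agent.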

Integration Points

  • With adversarial review: This IS the formalized version of "spawn a separate model to review"
  • With planning-protocol: Judge reviews the plan before the Execute phase
  • With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)
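The decision table can be expressed as a small helper. The thresholds come straight from the activation criteria above; the `domain` labels are an assumption about how you tag tasks:

```python
HIGH_STAKES = {"security", "financial", "architecture"}  # assumed task tags

def needs_judge(files_changed: int, loc: int, domain: str,
                failed_attempts: int = 0) -> bool:
    """Apply the activation criteria: high stakes, stuck, or large change."""
    if domain in HIGH_STAKES:
        return True               # mandatory for security/financial/design work
    if failed_attempts >= 3:
        return True               # stuck: a fresh perspective is warranted
    return files_changed > 5 or loc > 500
```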

Gotchas

  • Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
  • Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
  • Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
  • Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
  • Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
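The approval-rate drift gotcha is easy to watch for mechanically. A rolling-window monitor like this sketch (the 80% ceiling and window size follow the text above; the class itself is an illustration) can flag an overly lenient pairing:

```python
from collections import deque

class ApprovalMonitor:
    """Track a rolling judge approval rate; flag drift above the 60-70% target."""

    def __init__(self, window: int = 20):
        self.verdicts = deque(maxlen=window)

    def record(self, decision: str) -> None:
        self.verdicts.append(decision == "APPROVE")

    def approval_rate(self) -> float:
        return sum(self.verdicts) / len(self.verdicts) if self.verdicts else 0.0

    def too_lenient(self) -> bool:
        # >80% approvals suggests the pairing is too similar or prompts too soft
        return len(self.verdicts) >= 10 and self.approval_rate() > 0.8
```

When `too_lenient()` fires, swap the judge to a more distant provider or tighten the prompt's scoring requirements.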

Files

2 total
