LLM Judge

LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code.

by Kevin Anderson (@anderskev)
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (LLM-as-judge across repos) match the runtime instructions: spawning repo-gathering agents, collecting structured facts, running tests, and spawning judge agents to score using rubrics. No unrelated credentials, binaries, or install steps are requested.
Instruction Scope
Instructions legitimately require reading repository files, running git commands, and executing tests to collect facts. This is coherent with the stated purpose. However, running tests (pytest, npm test, go test, etc.) and executing shell commands inside each repo can execute arbitrary code from the target repository — a safety risk if the repo is untrusted. The skill does not specify sandboxing, network restrictions, or limits on what tests may do.
Install Mechanism
Instruction-only skill with no install spec and no external downloads. This minimizes supply-chain risk and is proportionate to the task.
Credentials
No environment variables, credentials, or config paths are requested. The skill references other internal skills (e.g., @beagle:llm-artifacts-detection), which is expected for modular analysis; nothing asks for unrelated secrets or external service keys.
Persistence & Privilege
The skill's always flag is false and it does not request elevated platform privileges. It will write its report to .beagle/llm-judge-report.json in the analyzed repo (expected behavior). Because the skill can be invoked autonomously by agents (platform default), and because it runs each repo's tests, the operational blast radius is higher when it is used on untrusted repos; consider restricting autonomy or using sandboxing.
Assessment
This skill appears to be what it claims: an LLM-based judge that inspects repositories, runs tests, gathers structured facts, and scores repos with rubrics. Before installing or using it, consider the following:

  1. Run it only on repositories you trust, or in an isolated/sandboxed environment (containers or VMs), because Phase 1 executes tests and shell commands that can run arbitrary code.
  2. Restrict network access for the test runs if you need to avoid exfiltration or external downloads.
  3. Be aware it will write its report to .beagle/llm-judge-report.json in the repo.
  4. Review any linked/loaded skills (e.g., @beagle:llm-artifacts-detection) to confirm you trust them as well.

If you plan to analyze untrusted code, enforce strong sandboxing and resource limits, or run the skill only on CI runners designed for untrusted workloads.


Current version: v1.0.0

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

LLM Judge Skill

Compare code implementations across 2+ repositories using structured evaluation.

Overview

This skill implements a two-phase LLM-as-judge evaluation:

  1. Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
  2. Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
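
As a rough illustration of the control flow, the sketch below wires the two phases together. The helpers `run_repo_agent` and `run_judge_agent` are hypothetical stand-ins for however the host platform spawns Task agents; they are not part of this skill.

```python
# Hypothetical orchestration sketch, not the skill's actual implementation.
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

def run_repo_agent(repo, spec_content):
    raise NotImplementedError("spawn a Phase 1 repo agent here")

def run_judge_agent(dimension, facts, spec_content):
    raise NotImplementedError("spawn a Phase 2 judge agent here")

def evaluate(repos, spec_content):
    """repos: list of (label, path) pairs; spec_content: the spec document text."""
    # Phase 1: one fact-gathering agent per repository, run in parallel.
    with ThreadPoolExecutor() as pool:
        facts = list(pool.map(lambda repo: run_repo_agent(repo, spec_content), repos))

    # Phase 2: one judge per scoring dimension; every judge sees all the facts.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda dim: run_judge_agent(dim, facts, spec_content), DIMENSIONS))

    return facts, scores
```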

Reference Files

| File | Purpose |
| --- | --- |
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

Scoring Dimensions

| Dimension | Default Weight | Evaluates |
| --- | --- | --- |
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |

Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
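
For illustration only, the placeholders in the template above could be filled with Python's `string.Template`; the repo label, path, and spec filename below are made-up examples, and the prompt text is abridged.

```python
from string import Template

# Abridged copy of the Phase 1 prompt; $-placeholders mirror the template above.
REPO_AGENT_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n\n"
    "**Your Repo:** $REPO_LABEL at $REPO_PATH\n"
    "**Spec Document:**\n$SPEC_CONTENT\n\n"
    "**Instructions:** Read @beagle:llm-judge references/repo-agent.md\n"
    "Return ONLY valid JSON, no markdown or explanations.\n"
)

prompt = REPO_AGENT_PROMPT.substitute(
    REPO_LABEL="repo-a",                  # hypothetical label
    REPO_PATH="/work/repo-a",             # hypothetical checkout path
    SPEC_CONTENT=open("spec.md").read(),  # hypothetical spec file
)
```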

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.
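
A similar sketch can build the five judge prompts, one per dimension. `build_judge_prompts` is a hypothetical helper and the prompt text is abridged; the real dispatch is whatever Task-spawning call the platform provides.

```python
import json
from string import Template

# Abridged copy of the Phase 2 prompt; $-placeholders mirror the template above.
JUDGE_PROMPT = Template(
    "You are the $DIMENSION Judge for the LLM Judge evaluation.\n\n"
    "**Spec Document:**\n$SPEC_CONTENT\n\n"
    "**Facts from all repos:**\n$ALL_FACTS_JSON\n\n"
    "Return ONLY valid JSON following the judge output schema.\n"
)

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

def build_judge_prompts(spec_content, all_facts):
    """all_facts: list of Phase 1 fact objects; returns one prompt per dimension."""
    return {
        dim: JUDGE_PROMPT.substitute(
            DIMENSION=dim,
            SPEC_CONTENT=spec_content,
            ALL_FACTS_JSON=json.dumps(all_facts, indent=2),
        )
        for dim in DIMENSIONS
    }
```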

Aggregation

After Phase 2 completes:

  1. Collect scores from all 5 judges
  2. For each repo, compute weighted total:
    weighted_total = sum(score[dim] * weight[dim]) / 100
    
  3. Rank repos by weighted total (descending)
  4. Generate verdict explaining the ranking
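
A minimal sketch of the weighted-total computation using the default weights from the Scoring Dimensions table. The shape of `all_scores` (repo label mapped to per-dimension scores on the 1-5 scale) is an assumption for illustration.

```python
# Default weights, in percent, matching the Scoring Dimensions table.
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

def weighted_total(dim_scores):
    """dim_scores: e.g. {"Functionality": 4, "Security": 5, ...} on the 1-5 scale."""
    return sum(dim_scores[dim] * weight for dim, weight in WEIGHTS.items()) / 100

def rank_repos(all_scores):
    """all_scores: {repo_label: dim_scores}; returns (label, total) pairs, best first."""
    totals = {label: weighted_total(scores) for label, scores in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```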

Output

Write results to .beagle/llm-judge-report.json and display a markdown summary.
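
For example, the write-out step might look like the sketch below. The report keys, and the assumption that `.beagle/` sits under the current working directory, are illustrative rather than a definition of the actual report schema.

```python
import json
from pathlib import Path

def write_report(rankings, all_scores, all_facts, out_dir="."):
    # Key names here are assumptions for illustration; the real structure is
    # defined by the skill's output format.
    report = {"rankings": rankings, "scores": all_scores, "facts": all_facts}
    out_path = Path(out_dir) / ".beagle" / "llm-judge-report.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(report, indent=2))
    return out_path
```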

Dependencies

  • @beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering

Files

5 files total
