LLM Judge
LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Kevin Anderson (@anderskev)
Security Scan
OpenClaw
Benign (high confidence)
Purpose & Capability
Name/description (LLM-as-judge across repos) matches the runtime instructions: spawning repo-gathering agents, collecting structured facts, running tests, and spawning judge agents to score using rubrics. No unrelated credentials, binaries, or install steps are requested.
Instruction Scope
Instructions legitimately require reading repository files, running git commands, and executing tests to collect facts. This is coherent with the stated purpose. However, running tests (pytest, npm test, go test, etc.) and executing shell commands inside each repo can execute arbitrary code from the target repository — a safety risk if the repo is untrusted. The skill does not specify sandboxing, network restrictions, or limits on what tests may do.
Install Mechanism
Instruction-only skill with no install spec and no external downloads. This minimizes supply-chain risk and is proportionate to the task.
Credentials
No environment variables, credentials, or config paths are requested. The skill references other internal skills (e.g., @beagle:llm-artifacts-detection) which is expected for modular analysis; nothing asks for unrelated secrets or external service keys.
Persistence & Privilege
The `always` flag is false and the skill does not request elevated platform privileges. It writes its report to .beagle/llm-judge-report.json in the analyzed repo (expected behavior). Because the skill can be invoked autonomously by agents (the platform default), and it runs each repo's tests, the operational blast radius is higher on untrusted repos; consider restricting autonomy or using sandboxing.
Assessment
This skill appears to be what it claims: an LLM-based judge that inspects repositories, runs tests, gathers structured facts, and scores repos with rubrics. Before installing or using it, consider:
1. Run it only on repositories you trust, or in an isolated/sandboxed environment (containers or VMs), because Phase 1 executes tests and shell commands that can run arbitrary code.
2. Restrict network access during test runs if you need to prevent exfiltration or external downloads.
3. Be aware it will write its report to .beagle/llm-judge-report.json in the repo.
4. Review any linked/loaded skills (e.g., @beagle:llm-artifacts-detection) to confirm you trust them as well.
If you plan to analyze untrusted code, enforce strong sandboxing and resource limits, or run the skill only on CI runners designed for untrusted workloads.
Like a lobster shell, security has layers: review code before you run it.
Current version: v1.0.0
SKILL.md
LLM Judge Skill
Compare code implementations across 2+ repositories using structured evaluation.
Overview
This skill implements a two-phase LLM-as-judge evaluation:
- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
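As a sketch of Phase 1 orchestration, the prompt above can be filled in once per repository before each Task agent is spawned. The template text is taken from this skill; the function name and the repo-mapping shape are illustrative, not part of the skill's API:

```python
PHASE1_TEMPLATE = """You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** {label} at {path}
**Spec Document:**
{spec}
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations."""

def build_phase1_prompts(repos: dict[str, str], spec: str) -> list[str]:
    """Fill the template once per repo; each prompt goes to one parallel agent."""
    return [
        PHASE1_TEMPLATE.format(label=label, path=path, spec=spec)
        for label, path in repos.items()
    ]
```

Each resulting string would then be handed to a separate Task agent so the repos are explored in parallel.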
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
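Phase 2 can be sketched the same way: one judge prompt per scoring dimension, with the combined Phase 1 facts serialized into each. The dimension names come from the table above; the function name is illustrative:

```python
import json

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

JUDGE_TEMPLATE = """You are the {dimension} Judge for the LLM Judge evaluation.
**Spec Document:**
{spec}
**Facts from all repos:**
{facts}
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on {dimension} using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema."""

def build_judge_prompts(spec: str, all_facts: list[dict]) -> list[str]:
    """One prompt per dimension; every judge sees the same facts."""
    facts_json = json.dumps(all_facts, indent=2)
    return [
        JUDGE_TEMPLATE.format(dimension=dim, spec=spec, facts=facts_json)
        for dim in DIMENSIONS
    ]
```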
Aggregation
After Phase 2 completes:
- Collect scores from all 5 judges
- For each repo, compute weighted total: weighted_total = sum(score[dim] * weight[dim]) / 100
- Rank repos by weighted total (descending)
- Generate verdict explaining the ranking
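The steps above can be sketched directly from the formula, using the default weights from the dimensions table (function names are illustrative):

```python
# Default weights from the Scoring Dimensions table (percentages summing to 100).
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

def weighted_total(scores: dict[str, float]) -> float:
    """scores maps each dimension to its 1-5 judge score."""
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items()) / 100

def rank_repos(repo_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank repos by weighted total, descending."""
    totals = {repo: weighted_total(scores) for repo, scores in repo_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A repo scoring 5 on every dimension gets a weighted total of 5.0, and 1 on every dimension gives 1.0, so the aggregate stays on the same 1-5 scale as the individual rubric scores.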
Output
Write results to .beagle/llm-judge-report.json and display markdown summary.
Dependencies
- @beagle:llm-artifacts-detection: Reused by repo agents for dead code/overengineering analysis
Files
5 total