Install
openclaw skills install llm-judge

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.

Compare code implementations across multiple repositories using structured evaluation.
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |
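For example, with two vendor submissions and custom weights (the paths and labels below are hypothetical):

```
/beagle-analysis:llm-judge specs/payment-api.md ./vendor-a ./vendor-b --labels=vendor-a,vendor-b --weights=functionality:40,security:30
```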
Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.

Sequenced workflow: do not start the next phase until the current gate passes. Each pass condition must be checkable (a file on disk, non-empty content, or json.load succeeds), not "I reviewed internally."
| Gate | Pass condition | Unblocks |
|---|---|---|
| A — Inputs | spec_path is a readable file and non-empty; len(repo_paths) ≥ 2; each path contains .git. | Phase 1 repo agents |
| B — Phase 1 facts | Each repo agent's output parses as JSON; required keys/shape match references/fact-schema.md. | Phase 2 judge agents |
| C — Phase 2 scores | Five judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for every repo label. | Aggregation |
| D — Report file | .beagle/llm-judge-report.json exists; python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" exits 0. | Markdown summary to the user |
| E — Consistency | Summary table and verdict use the same labels, weights, and per-dimension scores as the JSON report. | Mark task complete |
Parallelism is allowed within a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.
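A minimal sketch of that sequencing, assuming one helper per phase and per gate (every function name below is illustrative, not part of the skill):

```python
# Illustrative orchestration skeleton; all helper names are hypothetical.
def run_llm_judge(spec_path, repo_paths, labels, weights):
    check_gate_a(spec_path, repo_paths)              # Gate A: readable spec, 2+ git repos

    # Phase 1: repo agents may run in parallel, but all must finish before Gate B.
    facts = {label: run_repo_agent(label, path)
             for label, path in zip(labels, repo_paths)}
    check_gate_b(facts)                              # Gate B: valid JSON matching the fact schema

    # Phase 2: one judge per dimension, also parallelizable.
    judge_outputs = {dim: run_judge(dim, facts) for dim in weights}
    check_gate_c(judge_outputs, labels)              # Gate C: a score for every repo label

    report = aggregate(judge_outputs, labels, weights)
    write_report(report)                             # Gate D: .beagle/llm-judge-report.json parses
    return render_summary(report)                    # Gate E: summary consistent with the JSON
```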
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main

Default Weights:
{
"functionality": 30,
"security": 25,
"tests": 20,
"overengineering": 15,
"dead_code": 10
}
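A sketch of this parsing step in Python, assuming $ARGUMENTS arrives as a single shell-style string; the function name is illustrative, and merging --weights overrides into the defaults is an assumption:

```python
import os
import shlex

DEFAULT_WEIGHTS = {"functionality": 30, "security": 25, "tests": 20,
                   "overengineering": 15, "dead_code": 10}

def parse_arguments(arguments: str):
    positional, labels, weights, branch = [], None, dict(DEFAULT_WEIGHTS), "main"
    for token in shlex.split(arguments):
        if token.startswith("--labels="):
            labels = token.split("=", 1)[1].split(",")
        elif token.startswith("--weights="):
            # e.g. --weights=functionality:40,security:30 replaces only the named weights
            for pair in token.split("=", 1)[1].split(","):
                dimension, value = pair.split(":")
                weights[dimension] = int(value)
        elif token.startswith("--branch="):
            branch = token.split("=", 1)[1]
        else:
            positional.append(token)

    spec_path, repo_paths = positional[0], positional[1:]
    if len(repo_paths) < 2:
        raise SystemExit("Error: Need at least 2 repositories to compare")
    if labels is None:
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    return spec_path, repo_paths, labels, weights, branch
```

For instance, passing "specs/api.md repos/team-a repos/team-b --weights=functionality:40,security:30" yields labels derived from the directory names and a weights map with only functionality and security overridden.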
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
for repo in "${REPO_PATHS[@]}"; do
[ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -z "$SPEC_CONTENT" ] && { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")
Spawn one Task per repo:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis
Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.
Do NOT score or judge. Only gather facts.
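One way to fill the $LABEL, $REPO_PATH, and $SPEC_CONTENT placeholders per repo, sketched with string.Template; the use of Template and the truncated prompt text are assumptions:

```python
from string import Template

REPO_AGENT_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n"
    "**Your Repo:** $LABEL at $REPO_PATH\n"
    "**Spec Document:**\n$SPEC_CONTENT\n"
    # ...remaining instructions exactly as in the template above...
)

prompts = {
    label: REPO_AGENT_PROMPT.substitute(LABEL=label, REPO_PATH=path, SPEC_CONTENT=spec_content)
    for label, path in zip(labels, repo_paths)
}
```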
Collect all repo outputs into ALL_FACTS.
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
Spawn five judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
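The shape the aggregation below expects from each judge, written as an illustrative Python literal; the authoritative output schema is in references/judge-agents.md and the values here are invented:

```python
security_judge_output = {
    "scores": {
        "repo-a": {"score": 4, "justification": "Parameterized queries; secrets kept out of the repo."},
        "repo-b": {"score": 2, "justification": "Hard-coded credentials and unvalidated input."},
    }
}
```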
scores = {}  # per-label scores keyed by dimension, plus a weighted_total
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)
Name the winner, explain why they won, and note any close calls or trade-offs.
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
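A minimal sketch of assembling that file; only the fields listed above come from this document, and the exact key names are assumptions:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

report = {
    "version": "1.0",                          # assumed format version
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "repos": [{"label": l, "path": p} for l, p in zip(labels, repo_paths)],
    "weights": weights,
    "scores": scores,                          # per-label dimension scores plus weighted_total
    "ranking": ranking,
    "verdict": verdict,
}

Path(".beagle").mkdir(exist_ok=True)
Path(".beagle/llm-judge-report.json").write_text(json.dumps(report, indent=2))
```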
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
The generated report includes the version, timestamp, repo metadata, weights, per-dimension scores, weighted totals, ranking, and verdict.

Reference files bundled with the skill:
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
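For example (hypothetical scores), a repo rated 4 on functionality, 5 on security, 3 on test quality, 4 on overengineering, and 5 on dead code gets a weighted total of 4×0.30 + 5×0.25 + 3×0.20 + 4×0.15 + 5×0.10 = 4.15 under the default weights.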
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
Write the report to .beagle/llm-judge-report.json. Display a markdown summary with scores, ranking, verdict, and detailed justifications.
Before completing (maps to Hard gates D and E):
- .beagle/llm-judge-report.json exists and json.load succeeds.
- weighted_total equals the sum over dimensions of (score × weight / 100) using the configured weights.
- The markdown summary matches the JSON report.
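A sketch of automating these checks, assuming the report layout from the earlier sketch (the key names are assumptions):

```python
import json

report = json.load(open(".beagle/llm-judge-report.json"))      # Gate D: file exists and parses

for label, entry in report["scores"].items():                  # Gate E: totals match the configured weights
    expected = round(
        sum(entry[dim]["score"] * weight / 100 for dim, weight in report["weights"].items()),
        2,
    )
    assert entry["weighted_total"] == expected, f"weighted_total mismatch for {label}"
```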