Install
openclaw skills install llm-judge-ensemble
Build a cost-efficient LLM evaluation ensemble with sampling, tiebreakers, and deterministic validators. Learned from 600+ production runs judging local Ollama models.
Last used: 2026-03-24 Memory references: 1 Status: Active
Build a cost-efficient LLM evaluation ensemble for comparing and scoring generative AI outputs at scale.
Use LLM-as-Judge when:
Do NOT use LLM-as-Judge when:
These two evaluation approaches overlap but serve different purposes:
| | LLM-as-Judge (this skill) | Oli QA Gate |
|---|---|---|
| What it evaluates | Model output quality across many runs | A specific build, PR, or code change |
| Scale | 100–1000s of outputs, statistical view | Single task, single output |
| Rubric | Multi-dimension, configurable weights | Oli's built-in code/content quality checklist |
| Cost model | 15% sampling to control spend | One-shot review, no sampling |
| When to use | Model comparison, shadow testing, promotion gates | Post-Kit build review, content QA before publish |
| Output | Weighted numeric score + dimension breakdown | Pass/Fail with specific feedback items |
Rule: If you're evaluating a model's outputs at scale → LLM-as-Judge. If you're reviewing a specific piece of work → Oli.
Layer 1 (deterministic validators): run on 100% of outputs. Zero cost. Catches obvious failures before burning judge tokens.
If Layer 1 fails, score is 0.0 — no need to invoke expensive judges.
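A minimal sketch of the short-circuit, assuming the only deterministic checks are a non-empty test and the 50–500 character bounds from this skill's usage example. The function name `layer1_gate` is hypothetical and not part of the skill's API.

```python
from typing import Optional


def layer1_gate(output: str, min_length: int = 50, max_length: int = 500) -> Optional[float]:
    """Return 0.0 if a deterministic check fails, else None (keep evaluating)."""
    text = output.strip()
    if not text:
        return 0.0  # empty output: fail immediately, no judge tokens spent
    if not (min_length <= len(text) <= max_length):
        return 0.0  # outside the allowed length bounds
    return None  # passed: later layers decide the score


short_circuit = layer1_gate("too short")  # fails the length check
passed = layer1_gate("x" * 120)           # within bounds, continues to Layer 2
```

Returning `None` rather than a score keeps "passed the gate" distinct from any numeric result.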
Layer 2 (heuristic scorers): run on 100% of outputs that pass Layer 1. Minimal cost (local computation only).
Layer 2 produces heuristic scores (0.0–1.0) that contribute to the final weighted score.
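As an illustration of one such heuristic, the sketch below scores numerical consistency as the fraction of numbers in the candidate that also appear in the source. The regex, the function name `numerical_consistency`, and the choice to return 1.0 when the candidate has no numbers are all assumptions, not the skill's actual scorer.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def numerical_consistency(candidate: str, source: str) -> float:
    """Fraction of numbers in the candidate that also appear in the source (0.0-1.0)."""
    candidate_numbers = NUMBER_RE.findall(candidate)
    if not candidate_numbers:
        return 1.0  # assumption: no numbers means nothing to contradict
    source_numbers = set(NUMBER_RE.findall(source))
    matched = sum(1 for n in candidate_numbers if n in source_numbers)
    return matched / len(candidate_numbers)


source = "Revenue rose 12.5 percent to 340 million in 2023."
candidate = "Revenue grew 12.5 percent to 350 million."
score = numerical_consistency(candidate, source)  # "350" is not in the source
```

Exact string matching is deliberately crude: it treats "340" and "340.0" as different numbers, which a production scorer would likely normalise.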
Layer 3 (LLM judges): sampled at 15% of runs to control cost, forced to 100% during promotion gates.
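The sampling decision can be sketched as a single function. The name `should_invoke_judges` and the `promotion_gate` flag are hypothetical, and independent uniform sampling per run is an assumption.

```python
import random


def should_invoke_judges(rng: random.Random, sample_rate: float = 0.15,
                         promotion_gate: bool = False) -> bool:
    """Decide whether this run is sent to the LLM judges."""
    if promotion_gate:
        return True  # promotion gates force 100% sampling
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded so the sketch is reproducible
sampled = sum(should_invoke_judges(rng) for _ in range(10_000))
forced = all(should_invoke_judges(rng, promotion_gate=True) for _ in range(100))
print(f"sampled {sampled} of 10000 runs; promotion gate always judged: {forced}")
```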
Two independent judges (e.g., Claude + GPT-4o) score the output. Each judge evaluates all 6 dimensions independently.
Tiebreaker pattern: When primary judges disagree by Δ ≥ 0.20 on any dimension, a third judge is invoked. The tiebreaker score replaces the outlier. This reduced score variance by 34% at only 8% additional cost.
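A sketch of the tiebreaker for a single dimension follows. The text above says the tiebreaker score replaces the outlier but does not say how the remaining scores are combined. Averaging the kept primary score with the tiebreaker is an assumption here, as are the name `resolve_dimension` and the definition of the outlier as the primary score farther from the tiebreaker.

```python
from typing import Callable


def resolve_dimension(a: float, b: float, tiebreak: Callable[[], float],
                      threshold: float = 0.20) -> float:
    """Combine two judge scores for one dimension, calling a third judge on disagreement."""
    # Small tolerance so that a nominal gap of exactly 0.20 is not lost to float rounding.
    if abs(a - b) + 1e-9 < threshold:
        return (a + b) / 2  # judges agree closely enough; no third call
    t = tiebreak()  # invoked lazily, so the extra cost is only paid on disagreement
    # The outlier is the primary score farther from the tiebreaker; the tiebreaker
    # replaces it, and the resulting pair is averaged (assumption).
    kept = a if abs(a - t) <= abs(b - t) else b
    return (kept + t) / 2


calls = []


def third_judge() -> float:
    """Stand-in for a real judge call; records that it was invoked."""
    calls.append(1)
    return 0.85


agreed = resolve_dimension(0.80, 0.70, third_judge)    # gap 0.10: no tiebreaker
disputed = resolve_dimension(0.90, 0.55, third_judge)  # gap 0.35: 0.55 is replaced
```

Passing the tiebreaker as a callable rather than a precomputed score is what keeps its cost proportional to the disagreement rate.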
| Dimension | Weight | What It Measures |
|---|---|---|
| Structural accuracy | 0.20 | Format compliance, schema adherence |
| Semantic similarity | 0.25 | Meaning preservation vs ground truth |
| Factual accuracy | 0.25 | Correctness of facts, numbers, entities |
| Task completion | 0.15 | Does it actually answer the question? |
| Tool use correctness | 0.05 | Valid tool calls (when applicable) |
| Latency | 0.10 | Response time within acceptable bounds |
Weights are configurable per task type. Tool use weight is redistributed when not applicable.
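One way to redistribute a non-applicable weight is to drop the dimension and rescale the rest proportionally, as sketched below. Proportional rescaling is an assumption (the skill may redistribute differently), and the dictionary keys and the function name `redistribute` are illustrative.

```python
DEFAULT_WEIGHTS = {
    "structural": 0.20,
    "semantic": 0.25,
    "factual": 0.25,
    "completion": 0.15,
    "tool_use": 0.05,
    "latency": 0.10,
}


def redistribute(weights: dict, not_applicable: set) -> dict:
    """Drop non-applicable dimensions and rescale the rest so they sum to 1.0."""
    kept = {k: w for k, w in weights.items() if k not in not_applicable}
    total = sum(kept.values())
    return {k: w / total for k, w in kept.items()}


no_tools = redistribute(DEFAULT_WEIGHTS, {"tool_use"})
for name, weight in no_tools.items():
    print(f"{name}: {weight:.4f}")
```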
If you're building a new judge, start here. A rubric needs at minimum: criterion name, description, scoring anchor points, and weight.
```python
SUMMARISATION_RUBRIC = [
    {
        "criterion": "factual_accuracy",
        "description": "All facts, numbers, and named entities in the summary are present and correct relative to the source document.",
        "weight": 0.40,
        "anchors": {
            1: "Multiple factual errors or hallucinated entities",
            5: "Mostly accurate with minor omissions",
            10: "Fully accurate: every fact traceable to the source",
        },
    },
    {
        "criterion": "completeness",
        "description": "The summary covers the key points of the source without omitting critical information.",
        "weight": 0.35,
        "anchors": {
            1: "Covers less than half the key points",
            5: "Covers main points, misses some secondary ones",
            10: "Comprehensive: all key points captured",
        },
    },
    {
        "criterion": "conciseness",
        "description": "The summary avoids redundancy and filler. Information density is high.",
        "weight": 0.25,
        "anchors": {
            1: "Bloated: repetition or filler makes up >30% of the text",
            5: "Acceptable density with some redundancy",
            10: "Every sentence carries unique information",
        },
    },
]
```
Anchor points are mandatory. Without concrete score anchors, judges will default to the centre of the scale (all 5s or all 7s). See Score Calibration below.
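To show how anchors might reach the judge, the sketch below renders one rubric entry into a prompt section and asks for a critique before the score. The prompt wording and the function name `render_criterion` are assumptions, not the skill's actual prompt template.

```python
def render_criterion(criterion: dict) -> str:
    """Turn one rubric entry into a prompt section with explicit score anchors."""
    lines = [
        f"Criterion: {criterion['criterion']}",
        f"Description: {criterion['description']}",
        "Score anchors:",
    ]
    for score in sorted(criterion["anchors"]):
        lines.append(f"  {score}: {criterion['anchors'][score]}")
    # Asking for the critique first puts the reasoning before the number.
    lines.append("First write a 1-sentence critique, then give your score (1-10).")
    return "\n".join(lines)


example = {
    "criterion": "conciseness",
    "description": "The summary avoids redundancy and filler.",
    "weight": 0.25,
    "anchors": {
        1: "Bloated: repetition or filler makes up >30% of the text",
        5: "Acceptable density with some redundancy",
        10: "Every sentence carries unique information",
    },
}
prompt = render_criterion(example)
print(prompt)
```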
LLM judges are prone to centrality bias (clustering around the middle) and leniency bias (inflating scores). Counter these with:
1. Forced anchor examples in the rubric prompt: Include 1–2 concrete output examples at score 2 and score 9 for each criterion. Judges calibrate against examples, not abstract descriptions.
2. Require justification before score: Prompt: "First write a 1-sentence critique, then give your score." Chain-of-thought before the number reduces leniency bias significantly.
3. Watch for score compression on small datasets: If all outputs score 7.2–7.8 on a 10-point scale, your rubric criteria may be too vague or your ground truth is low-quality. Tighten anchor definitions or add adversarial examples.
4. Calibration run before production: Before trusting a new rubric, manually score 10 outputs, then compare to judge scores. If the correlation is below 0.7, rewrite the weakest criterion's anchor points.
5. Normalise by judge: Different judges have different baseline biases. Track each judge's mean score across a calibration set and apply a per-judge offset if needed.
```python
# Example: detect and log judge bias
calibration_scores = {
    "claude": [7.2, 7.4, 7.1, 7.5, 7.3],  # compressed range → leniency bias
    "gpt4o": [5.1, 8.3, 6.7, 9.1, 4.2],   # wider spread → better discrimination
}

for judge, scores in calibration_scores.items():
    spread = max(scores) - min(scores)
    print(f"{judge}: spread={spread:.1f}, mean={sum(scores)/len(scores):.1f}")
    # If spread < 3.0 on a 10-point scale, that judge has calibration problems
```
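The calibration run (step 4) and the per-judge offset (step 5) can be sketched together. The text does not say which correlation to use, so Pearson is an assumption here, and the scores are made up for illustration.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equal-length score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores for 10 outputs, for illustration only
human = [2, 4, 5, 6, 7, 8, 3, 9, 6, 5]
judge = [3, 4, 6, 6, 8, 8, 4, 9, 7, 5]

r = pearson(human, judge)
offset = sum(human) / len(human) - sum(judge) / len(judge)  # add to judge scores
needs_rewrite = r < 0.7  # below 0.7: rewrite the weakest criterion's anchors

print(f"correlation={r:.2f}, per-judge offset={offset:+.2f}, rewrite={needs_rewrite}")
```

A high correlation with a non-zero offset, as in this made-up data, suggests the judge ranks outputs well but is systematically lenient, which an offset can correct without touching the rubric.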
Avoid these patterns — they produce noisy, unreliable scores:
| Anti-pattern | Example | Why it fails |
|---|---|---|
| Compound criterion | "Accurate AND well-written AND concise" | Judge can't assign one score to three things |
| Vague superlatives | "Is the response excellent?" | No anchor for what "excellent" means |
| Overlapping criteria | Separate "clarity" and "readability" | Judges score the same thing twice, inflating that dimension |
| No anchor points | "Score 1–10 for quality" | Produces centrality bias — everyone gets 5–7 |
| Unmeasurable criterion | "Does the response feel right?" | Subjective, inconsistent across judge models |
| Too many criteria | 12+ dimensions | Attention degrades, later criteria scored worse than early ones |
Good rubric shape: 3–6 criteria, each independently measurable, each with explicit anchor examples at the low, mid, and high end of the scale.
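A few of these rules can be checked mechanically before a rubric reaches a judge. The sketch below is a hypothetical linter: the name `lint_rubric`, the requirement that weights sum to 1.0, and the crude `" AND "` test for compound criteria are all assumptions.

```python
def lint_rubric(rubric: list) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = []
    if not 3 <= len(rubric) <= 6:
        problems.append(f"expected 3-6 criteria, found {len(rubric)}")
    total = sum(c.get("weight", 0.0) for c in rubric)
    if abs(total - 1.0) > 1e-6:
        problems.append(f"weights sum to {total:.2f}, expected 1.00")
    for c in rubric:
        if len(c.get("anchors", {})) < 3:
            problems.append(f"{c['criterion']}: needs low, mid and high anchors")
        if " AND " in c.get("description", ""):
            problems.append(f"{c['criterion']}: looks like a compound criterion")
    return problems


bad_rubric = [
    {"criterion": "quality", "description": "Accurate AND well-written AND concise",
     "weight": 1.0, "anchors": {}},
]
issues = lint_rubric(bad_rubric)
for issue in issues:
    print(issue)
```

Overlapping or unmeasurable criteria still need a human read; a linter like this only catches the structural anti-patterns.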
When a dimension is not sampled (LLM judge not invoked on this run), record the score as null, not 0.0. Unsampled dimensions must be excluded from the weighted average, not treated as failures.
Early bug: recording unsampled dimensions as 0.0 created a systematic 0.03–0.08 downward bias across all models. The fix: null means "not measured", which is fundamentally different from "scored zero".
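```python
# WRONG: penalises unsampled dimensions (recorded as 0.0)
weighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# RIGHT: exclude null dimensions and renormalise over the remaining weights
pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
# Guard against a run where nothing was measured at all
weighted = sum(s * w for s, w in pairs) / sum(w for _, w in pairs) if pairs else None
```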
With 15% LLM sampling, average cost per evaluated run: ~$0.003
At 200 runs for promotion: total judge cost ≈ $0.60 per model per task type.
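The arithmetic behind these figures, as a worked sketch. The per-judged-run cost is derived, not stated in the text, and assumes Layers 1 and 2 add nothing to the average.

```python
SAMPLE_RATE = 0.15
AVG_COST_PER_RUN = 0.003  # USD, averaged over sampled and unsampled runs
PROMOTION_RUNS = 200

total_cost = AVG_COST_PER_RUN * PROMOTION_RUNS

# Implied cost of a run that actually reaches the LLM judges,
# assuming Layers 1 and 2 contribute nothing to the average (assumption).
cost_per_judged_run = AVG_COST_PER_RUN / SAMPLE_RATE

print(f"total: ${total_cost:.2f}")
print(f"per judged run: ${cost_per_judged_run:.3f}")
```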
```python
from evaluation import JudgeEnsemble, DeterministicValidator, HeuristicScorer

# Layer 1: must be valid text, 50-500 chars
validator = DeterministicValidator(
    min_length=50,
    max_length=500,
    required_format="text",
)

# Layer 2: check entity overlap with source
heuristic = HeuristicScorer(
    check_entity_overlap=True,
    check_novel_facts=True,
    check_numerical_consistency=True,
)

# Layer 3: LLM judges (sampled)
ensemble = JudgeEnsemble(
    judges=["claude-sonnet-4-20250514", "gpt-4o"],
    tiebreaker="claude-sonnet-4-20250514",
    sample_rate=0.15,
    tiebreaker_threshold=0.20,
    dimensions=["structural", "semantic", "factual", "completion", "latency"],
)

# Evaluate
result = ensemble.evaluate(
    task_type="summarize",
    ground_truth=gt_response,
    candidate=candidate_response,
    source_text=original_text,
    validator=validator,
    heuristic=heuristic,
)

print(f"Weighted score: {result.weighted_score:.3f}")
print(f"Dimensions: {result.scores}")  # {semantic: 0.95, factual: 0.88, ...}
# None values for unsampled dimensions
```
Recording unsampled dimensions as 0.0 instead of null
No anchor points in rubric → all scores cluster at 7
Compound criteria ("accurate AND clear AND concise")
Running 15% sampling at promotion gates
Not pinning judge model versions
claude-sonnet-4 scoring patterns can shift between API versions. Pin the model string and update consciously.
Rubric criteria that overlap
Skipping Layer 1 to "save time"