Install
openclaw skills install reddi-llm-judge

Build a cost-efficient LLM evaluation ensemble for comparing and scoring generative AI outputs at scale, with sampling, tiebreakers, and deterministic validators. Learned from 600+ production runs judging local Ollama models.
Layer 1: deterministic validators. Run on 100% of outputs. Zero cost. Catches obvious failures before burning judge tokens.
If Layer 1 fails, score is 0.0 — no need to invoke expensive judges.
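A minimal sketch of the Layer 1 gate, assuming the 50–500 character bounds used in the usage example at the end; the helper name `layer1_gate` is illustrative, not part of the skill's API:

```python
def layer1_gate(candidate: str, min_length: int = 50, max_length: int = 500) -> bool:
    """Layer 1: deterministic checks, run on every output at zero cost."""
    if not isinstance(candidate, str) or not candidate.strip():
        return False
    return min_length <= len(candidate) <= max_length

candidate = "Too short."                          # illustrative output
score = None if layer1_gate(candidate) else 0.0   # a Layer 1 failure scores 0.0; judges are never invoked
```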
Layer 2: heuristic scorers. Run on 100% of outputs that pass Layer 1. Minimal cost (local computation only).
Layer 2 produces heuristic scores (0.0–1.0) that contribute to the final weighted score.
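One Layer 2 check, sketched with a naive notion of "entity" (capitalised tokens and numbers); the real `HeuristicScorer` configuration appears in the usage example below:

```python
import re

def entity_overlap(candidate: str, source: str) -> float:
    """Heuristic 0.0-1.0 score: fraction of source entities echoed in the candidate."""
    def entities(text: str) -> set:
        # crude proxy for named entities: capitalised words and numbers
        return set(re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", text))
    source_entities = entities(source)
    if not source_entities:
        return 1.0  # nothing to check against
    return len(source_entities & entities(candidate)) / len(source_entities)
```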
Layer 3: LLM judges. Sampled at 15% of runs to control cost; forced to 100% during promotion gates.
Two independent judges (e.g., Claude + GPT-4o) each score the output across all 6 dimensions.
Tiebreaker pattern: When primary judges disagree by Δ ≥ 0.20 on any dimension, a third judge is invoked. The tiebreaker score replaces the outlier. This reduced score variance by 34% at only 8% additional cost.
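A sketch of the tiebreaker rule; the outlier is assumed to be the primary score farther from the tiebreaker, and agreeing judges are assumed to be averaged (both details are assumptions, not confirmed by the skill):

```python
def resolve_dimension(judge_a: float, judge_b: float, call_tiebreaker, threshold: float = 0.20) -> float:
    """Combine two primary judge scores for one dimension, invoking a third judge on disagreement."""
    if abs(judge_a - judge_b) < threshold:
        return (judge_a + judge_b) / 2          # judges agree: no tiebreaker cost
    tiebreak = call_tiebreaker()                # third judge invoked only on disagreement
    # keep the primary score closer to the tiebreaker; the other is treated as the outlier
    kept = judge_a if abs(judge_a - tiebreak) <= abs(judge_b - tiebreak) else judge_b
    return (kept + tiebreak) / 2

resolve_dimension(0.90, 0.60, lambda: 0.85)     # 0.60 is dropped as the outlier -> 0.875
```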
| Dimension | Weight | What It Measures |
|---|---|---|
| Structural accuracy | 0.20 | Format compliance, schema adherence |
| Semantic similarity | 0.25 | Meaning preservation vs ground truth |
| Factual accuracy | 0.25 | Correctness of facts, numbers, entities |
| Task completion | 0.15 | Does it actually answer the question? |
| Tool use correctness | 0.05 | Valid tool calls (when applicable) |
| Latency | 0.10 | Response time within acceptable bounds |
Weights are configurable per task type. Tool use weight is redistributed when not applicable.
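One plausible way to redistribute the tool-use weight is to renormalise the remaining weights so they still sum to 1 (a sketch; how the skill redistributes internally is not specified):

```python
WEIGHTS = {
    "structural": 0.20, "semantic": 0.25, "factual": 0.25,
    "completion": 0.15, "tool_use": 0.05, "latency": 0.10,
}

def effective_weights(applicable):
    """Drop non-applicable dimensions and renormalise the rest to sum to 1."""
    kept = {dim: w for dim, w in WEIGHTS.items() if dim in applicable}
    total = sum(kept.values())
    return {dim: w / total for dim, w in kept.items()}

# Summarisation task: no tool calls, so the 0.05 tool-use weight is spread across the other five
effective_weights({"structural", "semantic", "factual", "completion", "latency"})
```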
When a dimension is not sampled (LLM judge not invoked on this run), record the score as null, not 0.0. Unsampled dimensions must be excluded from the weighted average, not treated as failures.
Early bug: recording unsampled dimensions as 0.0 created a systematic 0.03–0.08 downward bias across all models. The fix: null means "not measured", which is fundamentally different from "scored zero".
# WRONG — penalises unsampled dimensions
weighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
# RIGHT — exclude null dimensions
pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
weighted = sum(s * w for s, w in pairs) / sum(w for _, w in pairs)
With 15% LLM sampling, average cost per evaluated run: ~$0.003
At 200 runs for promotion: total judge cost ≈ $0.60 per model per task type.
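The promotion figure is just the per-run average times the run count (numbers taken from above):

```python
avg_cost_per_run = 0.003       # dollars per evaluated run at 15% LLM-judge sampling
runs_for_promotion = 200
total_judge_cost = runs_for_promotion * avg_cost_per_run   # ~ $0.60 per model per task type
```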
from evaluation import JudgeEnsemble, DeterministicValidator, HeuristicScorer
# Layer 1: must be valid text, 50-500 chars
validator = DeterministicValidator(
min_length=50,
max_length=500,
required_format="text",
)
# Layer 2: check entity overlap with source
heuristic = HeuristicScorer(
check_entity_overlap=True,
check_novel_facts=True,
check_numerical_consistency=True,
)
# Layer 3: LLM judges (sampled)
ensemble = JudgeEnsemble(
judges=["claude-sonnet-4-20250514", "gpt-4o"],
tiebreaker="claude-sonnet-4-20250514",
sample_rate=0.15,
tiebreaker_threshold=0.20,
dimensions=["structural", "semantic", "factual", "completion", "latency"],
)
# Evaluate
result = ensemble.evaluate(
task_type="summarize",
ground_truth=gt_response,
candidate=candidate_response,
source_text=original_text,
validator=validator,
heuristic=heuristic,
)
print(f"Weighted score: {result.weighted_score:.3f}")
print(f"Dimensions: {result.scores}") # {semantic: 0.95, factual: 0.88, ...}
# None values for unsampled dimensions