Install
openclaw skills install llm-eval-router

Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent, reducing API costs with evidence, not hope.
Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality — reducing inference costs with evidence, not hope.
Run every task through your best local model (shadow) in parallel with your cloud baseline (ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model hits 0.95 mean score, promote it to handle that task type in production. Demote it automatically if quality drops.
ollama pull qwen2.5 or ollama pull phi4

This skill makes outbound API calls to: the cloud baseline model (for ground-truth responses), the judge models (Claude Sonnet, GPT-4o mini, and Gemini when sampled), and optionally Langfuse for observability.

What stays local: all candidate inference (it runs through your Ollama instance) and all score data under data/scores/*.json.

Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.
Every response is scored on:
| Dimension | Default weight | Analyze weight | What it measures |
|---|---|---|---|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic | 25% | 40% | Meaning equivalence to ground truth |
| Factual | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use | 10% | 4% | Correct tool/format selection |
| Latency | 5% | 3% | Within acceptable bounds |
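As a concrete sketch (DEFAULT_WEIGHTS and composite_score are illustrative names, not taken from the skill), the composite can be computed as a weighted sum over per-dimension scores in [0, 1], using the same dimension keys as the override profiles shown below:

# Illustrative sketch: weights mirror the "Default weight" column above.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_correctness": 0.10,
    "latency_score": 0.05,
}

def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over the six dimensions; a missing dimension contributes 0."""
    return sum(w * dimension_scores.get(dim, 0.0) for dim, w in weights.items())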
Important: Use per-task weight overrides. The default 25/25 split treats structural
accuracy equally with semantic similarity — which works for extract/classify/format tasks
(where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher
on two prose analyses of the same question scores ~0.29 even when they're semantically
identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.
# src/evaluator.py — per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,   # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,   # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
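A small lookup (weights_for is a hypothetical helper, not a name from the skill) then applies the right profile at scoring time:

def weights_for(task_type: str) -> dict[str, float]:
    """Return the override profile for this task type, or the defaults."""
    return TASK_WEIGHT_OVERRIDES.get(task_type, DEFAULT_WEIGHTS)

# e.g. composite_score(dim_scores, weights_for("analyze")) lets semantic similarity
# dominate for prose analysis, while extract/classify/format keep the strict default split.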
Also: For analyze tasks, constrain output structure via system_prompt so GT and candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning). This reduces Layer 2 drift and improves difflib scores even at reduced weight.
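A minimal sketch of such a system prompt, assuming the four-section template named above (the constant name and exact wording are placeholders, not the skill's actual prompt):

ANALYZE_SYSTEM_PROMPT = """\
Respond using exactly these four sections, in this order:
Finding: <one-sentence main finding>
Recommendation: <one concrete recommendation>
Confidence: <low | medium | high>
Reasoning: <2-4 sentences supporting the finding>
"""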
Layers 1 and 2 (the deterministic and heuristic checks) run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.
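A minimal sketch of those two layers, assuming they live in src/validators.py; the real checks likely cover more (required keys, safety terms, length bounds):

# src/validators.py (sketch only; signatures are assumptions)
import difflib
import json

def layer1(response: str, task_type: str, require_json: bool = False) -> float:
    """Deterministic gate: 0.0 on a hard format/safety failure, 1.0 otherwise."""
    # task_type would select the per-task checks (e.g. require_json) in the real thing
    if not response.strip():
        return 0.0                     # empty output is a hard fail
    if require_json:
        try:
            json.loads(response)
        except json.JSONDecodeError:
            return 0.0                 # format violation is a hard fail
    return 1.0

def layer2(response: str, gt_response: str) -> float:
    """Heuristic drift check: cheap surface similarity against the ground truth."""
    return difflib.SequenceMatcher(None, gt_response, response).ratio()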
Create config/task_types.yaml:
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]

  - id: classify
    description: "Classify text into one of N categories"
    require_json: true   # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]

  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]

  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
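A loader sketch (load_task_types is a hypothetical helper; it only assumes PyYAML and the field names above):

import yaml  # PyYAML

def load_task_types(path: str = "config/task_types.yaml") -> dict[str, dict]:
    """Index task definitions by id, e.g. load_task_types()["classify"]["require_json"]."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    return {task["id"]: task for task in cfg["tasks"]}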
The router assigns each task to a model using a round-robin strategy during burn-in (building up each candidate's sample count n), then switches to confidence-weighted routing after promotion.
# src/router.py — simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)   # per-task round-robin position

    def route(self, task_type: str, confidence_tracker: "ConfidenceTracker") -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly

        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
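Usage sketch; the model names are the ones mentioned elsewhere in this document, and confidence_tracker is the tracker from the scoring section below:

router = Router(candidates=["qwen2.5", "llama3.1", "phi4"], control_floor="phi4:latest")
model = router.route("summarize", confidence_tracker)  # round-robin until a model is promoted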
For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production — it's only for evaluation.
# validators, the judge_* helpers, and weighted_score are defined elsewhere in the repo
import random
from statistics import median

JUDGE_SAMPLE_RATE = 0.15   # judges run on ~15% of passing responses

async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail — safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sampled judge ensemble (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            # Judges disagree: bring in a third and take the median
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        return weighted_score(l1_score, l2_score, judge_score=None)
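weighted_score itself isn't shown in this skill; one plausible composition, stated purely as an assumption, leans on the cheap layers when no judge was sampled and lets the judge dominate when it was:

def weighted_score(l1: float, l2: float, judge_score: float | None = None) -> float:
    """Blend the layers into one score in [0, 1]; these blend weights are a guess."""
    if judge_score is None:
        return 0.4 * l1 + 0.6 * l2
    return 0.2 * l1 + 0.3 * l2 + 0.5 * judge_score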
Track scores per model/task pair on disk (so restarts don't lose data):
# src/scoring/confidence.py — simplified
from dataclasses import dataclass, field

@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float] = field(default_factory=list)  # all recorded scores (None runs excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50 scores
        if not recent:
            return False            # nothing recorded yet
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
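Persistence is what makes restarts safe; a sketch assuming one JSON file per model/task pair under data/scores/ (the file naming scheme is an assumption):

import json
from dataclasses import asdict
from pathlib import Path

def save_stats(stats: ModelStats, root: str = "data/scores") -> None:
    """Write one JSON file per (model, task) pair so accumulated scores survive restarts."""
    path = Path(root) / f"{stats.model_id.replace(':', '_')}__{stats.task_type}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(stats), indent=2))

def load_stats(model_id: str, task_type: str, root: str = "data/scores") -> ModelStats:
    path = Path(root) / f"{model_id.replace(':', '_')}__{task_type}.json"
    return ModelStats(**json.loads(path.read_text()))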
Run this on a schedule (every 10-20 minutes via cron, launchd, or systemd):
# run_accumulate.py
import asyncio

async def accumulate():
    task_type = pick_next_task()                    # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # calls the cloud baseline
    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)
        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type,
                                   confidence_tracker.stats(candidate, task_type))

if __name__ == "__main__":
    asyncio.run(accumulate())
# config/routing_policy.yaml
control_floor_model: phi4:latest   # never promote below this model's score

task_policies:
  policy_check_high_risk:
    never_local: true              # these tasks always use the cloud model

  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]

  classify:
    min_score_for_routing: 0.90    # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
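How the policy is consumed at request time is only implied by the file; a hypothetical sketch (pick_model and the "cloud" sentinel are illustrative; only the YAML keys above come from the config):

import yaml

def pick_model(task_type: str, tracker, policy_path: str = "config/routing_policy.yaml") -> str:
    with open(policy_path) as f:
        policy = yaml.safe_load(f)
    task_policy = policy.get("task_policies", {}).get(task_type, {})

    if task_policy.get("never_local"):
        return "cloud"                                 # high-risk tasks never route locally

    min_score = task_policy.get("min_score_for_routing", 0.85)
    for model in task_policy.get("fallback_chain", []):
        stats = tracker.stats(model, task_type)        # ModelStats or None
        if stats and stats.mean >= min_score:
            return model                               # first local model clearing the bar

    return "cloud"                                     # nothing qualifies yet; stay on cloud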
Expose a simple HTTP API (FastAPI):
POST /run — route a task through the best available model
GET /health — service status + promoted models + ollama connectivity
GET /status — full scoreboard (model × task × mean × n)
GET /report — cost heatmap + efficiency analysis
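A minimal FastAPI sketch for two of these routes; the response shapes and the promoted_models/ping helpers are assumptions, while router, confidence_tracker, and ollama_client are the objects from the earlier sections:

from fastapi import FastAPI

app = FastAPI()

@app.post("/run")
async def run(payload: dict):
    model = router.route(payload["task_type"], confidence_tracker)
    response = await ollama_client.complete(model, payload["prompt"])
    return {"model": model, "response": response}

@app.get("/health")
async def health():
    return {
        "status": "ok",
        "promoted": confidence_tracker.promoted_models(),  # assumed helper
        "ollama_reachable": await ollama_client.ping(),     # assumed helper
    }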
What worked:

- <think>...</think> blocks stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content.
- None ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run. Store None and exclude it from the mean; mixing None with 0.0 poisons the mean.
- require_json: false for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from the "is it valid JSON" check.
- A system_prompt that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and candidates follow the same template, improving structural alignment and reducing the drift penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses.
- Agent-callable tools (run_task, get_status, get_champions, get_promotion_timeline, get_cost_heatmap). These let an LLM agent query evaluation state without bespoke integration work.

What didn't work:
With a 20-minute accumulator cadence and 9 candidates × 7 task types:
Per accumulation cycle (one task, one model):
At 6 runs/hour × 24 hours: ~$0.70/day during burn-in. After first promotions: drops to ~$0.10/day (90%+ of task volume local).