Install

openclaw skills install llm-evaluation

Deep LLM evaluation workflow: quality dimensions, golden sets, human vs. automatic metrics, regression suites, offline/online signals, and safe rollout gates for model or prompt changes.

Evaluation turns “it feels better” into reproducible evidence. Design around the failure modes your product cares about, not only aggregate scores.

Trigger conditions: shipping prompt updates, swapping models, or building eval harnesses for agents and RAG.
Initial offer:
Use six stages: (1) define quality & constraints, (2) build datasets & rubrics, (3) automatic metrics, (4) human evaluation, (5) regression & gates, (6) online validation & iteration. Confirm latency/cost budgets and risk (PII, safety).
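One way to make the plan concrete before any scoring runs is to write it down as data. A minimal sketch in Python; the budget and risk fields and their values are illustrative placeholders to confirm with the team:

```python
# Illustrative plan skeleton; stage names follow the six stages above,
# budget and risk values are placeholders, not recommendations.
EVAL_PLAN = {
    "stages": [
        "define quality & constraints",
        "build datasets & rubrics",
        "automatic metrics",
        "human evaluation",
        "regression & gates",
        "online validation & iteration",
    ],
    "budgets": {"latency_p95_ms": 2000, "cost_per_request_usd": 0.01},
    "risk": {"handles_pii": True, "safety_review_required": True},
}
```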
Stage 1: Define quality & constraints
Goal: Name the quality dimensions whose failure maps to user harm.
Exit condition: Weighted priority of dimensions; non-goals stated.
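A stage-1 artifact can be as small as a weighted dictionary plus explicit non-goals. A sketch; the dimension names, weights, and non-goals are illustrative, not prescribed by the skill:

```python
# Illustrative weights; higher weight = greater user harm when the dimension fails.
QUALITY_DIMENSIONS = {
    "factual_accuracy": 0.40,
    "instruction_following": 0.25,
    "safety": 0.20,
    "tone_and_style": 0.15,
}
NON_GOALS = ["creative writing quality", "multilingual coverage"]

# Weights should sum to 1 so weighted scores stay comparable across runs.
assert abs(sum(QUALITY_DIMENSIONS.values()) - 1.0) < 1e-9
```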
Stage 2: Build datasets & rubrics
Goal: Fixed eval sets plus clear scoring rules.
Exit condition: Golden set size justified; inter-rater agreement plan in place if human scoring is used.
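A golden-set entry and rubric can be sketched as below, assuming a simple in-repo schema; the field names and scale anchors are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    example_id: str
    prompt: str
    reference_answer: str                                 # what a correct response must convey
    must_include: list = field(default_factory=list)      # hard requirements for automatic checks
    must_not_include: list = field(default_factory=list)  # e.g., PII, forbidden claims

# Scale anchors keep human raters consistent; pair with at least two raters per item.
RUBRIC = {
    "factual_accuracy": "1 = contradicts reference, 3 = partially correct, 5 = fully correct",
    "instruction_following": "1 = ignores instructions, 5 = follows every stated constraint",
}
```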
Stage 3: Automatic metrics
Goal: Fast signals, with their limitations understood.
Exit condition: Each automatic metric has its known blind spots documented.
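Keeping each metric's blind spots next to the metric itself is one way to satisfy this exit condition. A sketch; the two metrics below are illustrative:

```python
def exact_match(output: str, reference: str) -> float:
    return float(output.strip().lower() == reference.strip().lower())

def required_terms(output: str, required: list) -> float:
    hits = sum(term.lower() in output.lower() for term in required)
    return hits / max(len(required), 1)

# Each entry documents where the fast signal is known to lie.
AUTO_METRICS = {
    "exact_match": {
        "fn": exact_match,
        "blind_spots": "Penalizes correct paraphrases; useless for long-form answers.",
    },
    "required_terms": {
        "fn": required_terms,
        "blind_spots": "Rewards keyword stuffing; ignores contradictions around the terms.",
    },
}
```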
Stage 4: Human evaluation
Goal: Authoritative judgment where automatic metrics mislead.
Exit condition: Human scores correlate well enough with automatic metrics to rely on automatic scoring for ongoing monitoring, or you keep human review in the release loop.
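A sketch of this exit check in plain Python (no dependencies): compute Spearman rank correlation between human and automatic scores and compare it to a threshold. The 0.7 cutoff and the score scales are illustrative:

```python
def _avg_ranks(values):
    # 1-based ranks; tied values get the average of their rank positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(human, auto):
    # Spearman's rho = Pearson correlation of the ranks.
    rh, ra = _avg_ranks(human), _avg_ranks(auto)
    mh, ma = sum(rh) / len(rh), sum(ra) / len(ra)
    cov = sum((x - mh) * (y - ma) for x, y in zip(rh, ra))
    sh = sum((x - mh) ** 2 for x in rh) ** 0.5
    sa = sum((y - ma) ** 2 for y in ra) ** 0.5
    return cov / (sh * sa) if sh and sa else 0.0

def can_rely_on_auto(human_scores, auto_scores, min_rho=0.7):
    """True if automatic scores track human judgment well enough for monitoring."""
    return spearman_rho(human_scores, auto_scores) >= min_rho

print(can_rely_on_auto([5, 4, 2, 1, 3], [0.9, 0.8, 0.3, 0.1, 0.5]))  # True here
```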
Stage 5: Regression & gates
Goal: Block bad deploys in CI or the release pipeline.
Exit condition: Rollback criteria defined before rollout.
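A minimal CI gate sketch, assuming baseline and candidate scores per dimension are already computed; the dimensions, thresholds, and scores are illustrative, and the non-zero exit is what blocks the deploy:

```python
import sys

# Allowed score drop per dimension before the release is blocked (illustrative).
MAX_REGRESSION = {"factual_accuracy": 0.01, "safety": 0.00}

def gate(baseline: dict, candidate: dict) -> list:
    failures = []
    for dim, allowed_drop in MAX_REGRESSION.items():
        if baseline[dim] - candidate[dim] > allowed_drop:
            failures.append(f"{dim}: {baseline[dim]:.3f} -> {candidate[dim]:.3f}")
    return failures

if __name__ == "__main__":
    failures = gate({"factual_accuracy": 0.82, "safety": 0.99},
                    {"factual_accuracy": 0.80, "safety": 0.99})
    if failures:
        print("Blocked:", "; ".join(failures))
        sys.exit(1)  # non-zero exit fails the pipeline step
```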
Stage 6: Online validation & iteration
Goal: Production truth via shadow traffic, A/B tests, or a gradual ramp.
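A gradual-ramp sketch with a rollback trigger on an online guardrail metric; the traffic fractions and threshold are illustrative:

```python
RAMP_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of traffic on the new model/prompt
GUARDRAIL_MAX_BAD_RATE = 0.02            # e.g., thumbs-down or escalation rate

def next_traffic_fraction(current: float, observed_bad_rate: float) -> float:
    """Return the next ramp step, or 0.0 to roll back when the guardrail trips."""
    if observed_bad_rate > GUARDRAIL_MAX_BAD_RATE:
        return 0.0
    larger = [f for f in RAMP_STAGES if f > current]
    return larger[0] if larger else current

print(next_traffic_fraction(0.05, observed_bad_rate=0.01))  # 0.25: keep ramping
print(next_traffic_fraction(0.25, observed_bad_rate=0.05))  # 0.0: roll back
```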