Install
openclaw skills install refinement-loop
Design and run iterative generate→critique→revise loops optimized for Claude Opus 4.8, with thinking-as-critic, cost controls, and model routing.
A refinement loop improves an output that a model can't get right in one shot: it generates a candidate, evaluates it against explicit criteria, revises based on that evaluation, and repeats until the output is good enough.
Before building a loop: ask whether a single Opus call with extended thinking already produces acceptable output. Opus 4.8's thinking blocks perform internal multi-step self-critique before committing to a response. One well-crafted prompt + thinking=on often outperforms a sloppy three-pass loop. Refine only when quality genuinely benefits from critique the generator couldn't apply to itself in one go.
Use a refinement loop when:
Don't use one when:
Opus 4.8 with thinking enabled performs internal deliberation before responding. That thinking block IS a critique pass. Structure your prompt so the thinking does the evaluative work:
System: You are an expert [domain] writer and critic.
Think through this step by step:
1. Draft a response to the requirements below.
2. Critique your draft against this rubric: [rubric].
3. Identify the top 2–3 specific failures.
4. Revise your draft to fix them.
5. Output only the final revised version.
Requirements: [requirements]
This collapses the generator and critic into one Opus call. Use this as pass 1. Only escalate to a multi-call loop if the output still falls short.
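The single-call template above can be filled programmatically. The following is a minimal sketch, not part of the skill itself: `PATTERN_1_TEMPLATE` and `build_pattern_1_prompt` are hypothetical names, and the code only assembles the prompt text; sending it to a model with thinking enabled is left to whatever client you use.

```python
# Hypothetical helper: fills the single-call draft -> critique -> revise template.
PATTERN_1_TEMPLATE = """You are an expert {domain} writer and critic.
Think through this step by step:
1. Draft a response to the requirements below.
2. Critique your draft against this rubric: {rubric}
3. Identify the top 2–3 specific failures.
4. Revise your draft to fix them.
5. Output only the final revised version.
Requirements: {requirements}"""


def build_pattern_1_prompt(domain: str, rubric: list[str], requirements: str) -> str:
    """Return the prompt text; calling a model is the caller's job."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    # Join the criteria into one line so the rubric stays inside step 2.
    rubric_text = "; ".join(criterion.strip() for criterion in rubric)
    return PATTERN_1_TEMPLATE.format(
        domain=domain, rubric=rubric_text, requirements=requirements.strip()
    )


prompt = build_pattern_1_prompt(
    "marketing",
    ["under 150 words", "leads with benefit", "ends on CTA"],
    "Write a product blurb for a note-taking app.",
)
print(prompt)
```

Rejecting an empty rubric is a deliberate choice in this sketch: without criteria, step 2 has nothing concrete to critique against.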
When you need a true multi-call loop, don't run Opus on every step. Route by role:
| Role | Model | Rationale |
|---|---|---|
| Generator (pass 1) | Opus 4.8 + thinking | Best first draft |
| Critic (all passes) | Claude Sonnet | Fast, cheap, accurate at rubric evaluation |
| Reviser (passes 2+) | Sonnet or Opus | Sonnet if rubric-mechanical; Opus if creative/complex |
| Final pass | Opus 4.8 + thinking | Polish and coherence check |
This cuts loop cost by 60–80% vs. running Opus on every step.
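One way to encode the routing table is a small lookup function. This is a sketch under assumptions: `route_model` is a hypothetical name, and the returned strings are illustrative labels, not real model identifiers; map them to whatever IDs your client expects.

```python
def route_model(role: str, pass_number: int = 1, creative: bool = False) -> str:
    """Pick a model label for one loop step, following the routing table."""
    if role == "generator":
        return "opus+thinking"      # best first draft
    if role == "critic":
        return "sonnet"             # same cheap critic on every pass
    if role == "reviser":
        if pass_number < 2:
            raise ValueError("revision starts at pass 2")
        # Sonnet for rubric-mechanical fixes, Opus for creative/complex ones.
        return "opus" if creative else "sonnet"
    if role == "final":
        return "opus+thinking"      # polish and coherence check
    raise ValueError(f"unknown role: {role!r}")


print(route_model("critic", pass_number=3))
print(route_model("reviser", pass_number=2, creative=True))
```

Keeping the routing in one function means a later change of models touches a single place instead of every call site.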
When using Opus as critic, instruct it to surface the critique in the thinking and output only a structured critique object. The thinking block will be far more honest and thorough than the visible response (Opus tends to soften visible criticism):
System: You are a strict critic. Do not produce the revised artifact.
Output only a JSON critique object:
{
"score": <0-10>,
"failures": ["specific failure 1", "specific failure 2", ...],
"converged": <true if no meaningful improvements remain>
}
Rubric: [rubric]
Artifact: [artifact]
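Because the critic is asked for a JSON object, its reply can be validated before the loop trusts it. The sketch below assumes the reply arrives as a plain string containing only that object; `Critique` and `parse_critique` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Critique:
    score: float
    failures: list[str]
    converged: bool


def parse_critique(raw: str) -> Critique:
    """Parse the critic's JSON reply, rejecting anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("critique must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    failures = data.get("failures")
    if not isinstance(failures, list) or not all(isinstance(f, str) for f in failures):
        raise ValueError("failures must be a list of strings")
    converged = data.get("converged")
    if not isinstance(converged, bool):
        raise ValueError("converged must be true or false")
    return Critique(float(score), failures, converged)


reply = '{"score": 6, "failures": ["buries the benefit in sentence three"], "converged": false}'
critique = parse_critique(reply)
print(critique.score, critique.failures, critique.converged)
```

Failing loudly on a malformed critique is safer than defaulting fields, since a silently missing `converged` flag could keep the loop running or stop it early.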
Specify all five explicitly; a vague version of any one breaks the loop.
MAX_TOKENS = 50_000  # set before you start; abort if exceeded
MAX_ITERS = 4        # rarely need more; Opus is strong

# opus_generate, sonnet_evaluate, sonnet_revise, opus_polish, estimate_tokens and
# semantic_similarity are placeholders for your own model and utility wrappers.
def refine(requirements, rubric, bar, polish_cost):
    best = opus_generate(requirements, thinking=True)  # Pattern 1 first
    score, critique = sonnet_evaluate(best, rubric)    # cheap critic
    budget_tokens = estimate_tokens(best, critique)    # running total of tokens spent

    i = 0
    while score < bar and i < MAX_ITERS:
        if budget_tokens > MAX_TOKENS:
            break  # cost abort: return best seen so far
        candidate = sonnet_revise(best, critique, requirements)
        new_score, new_critique = sonnet_evaluate(candidate, rubric)
        budget_tokens += estimate_tokens(candidate, new_critique)

        # Measure similarity before best is updated, or it compares candidate to itself.
        unchanged = semantic_similarity(candidate, best) > 0.97
        if new_score > score:  # keep-best: a worse revision never replaces it
            best, score, critique = candidate, new_score, new_critique
        if new_critique.get("converged"):
            break  # model says no meaningful improvements remain
        if unchanged:
            break  # text stopped changing: convergence
        i += 1

    # Optional: final Opus polish pass if budget allows
    if budget_tokens + polish_cost < MAX_TOKENS:
        best = opus_polish(best, requirements, thinking=True)
    return best
"converged": true when the rubric has no remaining actionable failures.new_score - score < 0.5 (on a 10-point scale) for two consecutive passes, stop.Combine: stop when any one triggers.
The critique must be specific and actionable, not a grade.
Pass the full rubric to the critic every round. Where the artifact allows objective checks (code passes tests, JSON validates, under word limit), use those — far stronger than prose judgments.
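Objective checks of the kind mentioned above can run before (or instead of) a model critic. The sketch below is illustrative: the function names are hypothetical, and "under word limit" is read here as "at most `limit` words".

```python
import json


def check_word_limit(text: str, limit: int) -> list[str]:
    """Fail when the whitespace-separated word count exceeds the limit."""
    count = len(text.split())
    return [] if count <= limit else [f"{count} words, over the {limit}-word limit"]


def check_valid_json(text: str) -> list[str]:
    """Fail when the text does not parse as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    return []


def run_objective_checks(text: str, checks) -> list[str]:
    """Run each check and collect every failure message."""
    failures = []
    for check in checks:
        failures.extend(check(text))
    return failures


draft = "Ship faster. " * 80  # 160 words
failures = run_objective_checks(draft, [lambda t: check_word_limit(t, 150)])
print(failures)
```

Each check returns a list of specific failure strings, so the results can be merged directly into the critique's failures.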
Run generation and evaluation as separate roles: different prompts, different instructions, ideally different models (see Pattern 2). A critic that shares the prompt that just produced the text tends to rubber-stamp it.
With Opus's thinking-as-critic pattern (Pattern 1), the thinking block provides enough adversarial distance. When visible-output critique is too soft, switch to Pattern 3.
Carry three things between passes: original requirements, current best candidate, latest critique. Re-supply the original requirements every round.
With Opus 4.8's 200k context, pass the full history of all passes. This helps the model see the trajectory and avoid re-introducing earlier mistakes.
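The state carried between passes can be made explicit in a small container. This is a sketch under assumptions: `LoopState` and `build_revise_prompt` are hypothetical names, and the prompt wording is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    requirements: str    # original requirements, re-supplied every round
    best: str            # current best candidate
    critique: list[str]  # latest critique failures
    history: list[str] = field(default_factory=list)  # earlier candidates, oldest first


def build_revise_prompt(state: LoopState, include_history: bool = True) -> str:
    """Assemble the reviser's input from the carried state."""
    parts = [f"Requirements:\n{state.requirements}"]
    if include_history:
        for n, old in enumerate(state.history, start=1):
            parts.append(f"Earlier pass {n}:\n{old}")
    parts.append(f"Current best:\n{state.best}")
    parts.append("Critique to address:\n" + "\n".join(f"- {f}" for f in state.critique))
    parts.append("Revise the current best to fix every critique item.")
    return "\n\n".join(parts)


state = LoopState(
    requirements="150-word blurb, ends on a CTA.",
    best="Draft two text.",
    critique=["no CTA in the last sentence"],
    history=["Draft one text."],
)
print(build_revise_prompt(state))
```

The `include_history` flag lets you drop earlier passes when the token budget is tight while still sending the three required items.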
Set a MAX_TOKENS budget before the loop begins. Abort and return best if exceeded.

| Failure | Mitigation |
|---|---|
| Sycophantic critic | Separate critic role, concrete rubric, Pattern 3, objective checks |
| Drift from original goal | Re-supply requirements every pass |
| Over-correction | Critique against full rubric every round; keep-best |
| Mode collapse / blandness | Cap iterations; Opus final polish pass |
| Final ≠ best | Track and return highest-scoring, never blindly the last |
| Infinite churn | MAX_ITERS + convergence detection (all three signals) |
| Cost blowout | MAX_TOKENS budget cap + model routing |
| Looping when one prompt would do | Run Opus+thinking single call first |
| Vague convergence | Use all three convergence signals |
Goal: tight 150-word product blurb. Rubric: under 150 words, leads with benefit, one concrete proof point, active voice, ends on CTA.
Goal: prompt that extracts {name, date, total} as JSON from invoices.
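For an extraction goal like this one, the output shape can be checked objectively. The sketch below assumes the model's reply is a bare JSON object; `check_invoice_output` is a hypothetical name, and it validates only the presence of the three keys, not their values.

```python
import json

REQUIRED_KEYS = {"name", "date", "total"}


def check_invoice_output(raw: str) -> list[str]:
    """Return specific failures for an extraction reply; empty list means it passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    if extra:
        failures.append(f"unexpected keys: {sorted(extra)}")
    return failures


print(check_invoice_output('{"name": "Acme", "date": "2024-01-31", "total": 99.5}'))
print(check_invoice_output('{"name": "Acme"}'))
```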