Install
openclaw skills install double-agentThis skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evaluation roles. It implements the Generator-Evaluator dual-agent architecture (inspired by Anthropic's engineering blog and GAN design), where a Generator produces outputs and a dedicated Evaluator agent independently validates them through real interaction (e.g., Playwright browser operations), eliminating AI self-evaluation bias. Use when: building AI-generated UIs/code that needs quality checks, designing multi-agent pipelines with feedback loops, or upgrading existing coder+tester workflows to real interaction-based evaluation.
openclaw skills install double-agentThe DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.
When a single AI agent both generates and evaluates its own output, it systematically overestimates quality — the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.
This skill provides:
User Goal / Spec
↓
┌─────────────┐
│ Generator │ ← Produces output (code, UI, content, data)
└──────┬──────┘
│ output artifact
↓
┌────────────────────────────────────┐
│ Evaluator │
│ • Reads spec (NOT generator output)│
│ • Operates artifact via Playwright │
│ (click, fill form, navigate) │
│ • Scores on rubric (0-100) │
│ • Writes structured feedback │
└────────────────┬───────────────────┘
│ score + feedback
↓
┌────────────────┐
│ Score ≥ target? │
│ YES → Done │
│ NO → Loop │
└────────┬────────┘
│
└──→ Generator (next iteration)
Key principle: The Evaluator reads the original spec, not the Generator's output. It evaluates independently, as if it were a real user encountering the product for the first time.
| Scenario | Apply DoubleAgent? |
|---|---|
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |
Write a clear spec that both agents will reference independently. The spec must be:
See references/architecture.md for spec template.
Assign the Generator a single role: produce output that satisfies the spec.
Assign the Evaluator a single role: independently verify the spec is satisfied.
references/evaluator-prompts.md for calibrated prompt templatesMAX_ROUNDS = 15
PASS_THRESHOLD = 80 # out of 100
for round in range(MAX_ROUNDS):
output = generator.run(spec, history)
evaluation = evaluator.run(spec, output) # Playwright-based
history.append({"round": round, "score": evaluation.score, "feedback": evaluation.feedback})
if evaluation.score >= PASS_THRESHOLD:
break
if evaluation.score_trend == "plateauing":
generator.switch_approach() # Complete strategy reset
See scripts/iteration_loop.py for a complete implementation template.
To prevent score drift, run the Evaluator on 3-5 known examples FIRST:
If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
Effective rubrics for software systems:
| Dimension | Weight | What to Measure |
|---|---|---|
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template outputs when spec requires uniqueness |
Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
When evaluating web/H5/mini-program outputs, the Evaluator should:
Playwright MCP tool calls to use:
playwright_navigate → open URLplaywright_click → interact with elementsplaywright_fill → fill form inputsplaywright_screenshot → capture evidenceplaywright_get_visible_text → verify contentreferences/architecture.md — Detailed architecture patterns, spec templates, and design rationalereferences/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact typesscripts/iteration_loop.py — Complete iteration loop implementation templatescripts/calibrate_evaluator.py — Evaluator calibration utility