DoubleAgent — Generator-Evaluator Dual Agent Pattern

v1.0.0

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evaluation.

by mingyuan (@zmy1006-sudo)

Install

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zmy1006-sudo/double-agent.

Prompt preview (Install & Setup):
Install the skill "DoubleAgent — Generator-Evaluator Dual Agent Pattern" (zmy1006-sudo/double-agent) from ClawHub.
Skill page: https://clawhub.ai/zmy1006-sudo/double-agent
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI (bare skill slug):

openclaw skills install double-agent

ClawHub CLI:

npx clawhub@latest install double-agent

Security Scan
VirusTotal: Pending
OpenClaw: Benign (high confidence)
Purpose & Capability
The name/description (Generator–Evaluator dual-agent QA) matches the included SKILL.md, architecture references, prompt templates, and iteration/calibration scripts. One minor gap: the SKILL.md and templates assume Playwright-based browser evaluation and agent-subagent invocation (examples reference sessions_spawn / WorkBuddy), but the skill does not declare any runtime dependencies, binaries, or environment variables (e.g., Playwright, browsers, or subagent endpoint credentials). An integrator will need to supply those separately.
Instruction Scope
Runtime instructions focus on defining specs, running a generator, running an independent evaluator that performs real interactions (Playwright/HTTP), scoring, and looping. There are no instructions that read unrelated system files, exfiltrate secrets, or contact unexpected external endpoints. The included templates require implementers to replace placeholders with their actual agent invocation code.
Install Mechanism
This is an instruction-only skill with template scripts; there is no install spec and no downloads or external installers. The code files are templates and raise NotImplementedError where integrators must implement their agent calls — nothing will be executed automatically by the skill as provided.
Credentials
The skill declares no required environment variables or credentials, which is proportionate to an instructional/template skill. However, some evaluator templates (API evaluation) mention passing auth headers/tokens and the Playwright flow will typically require network access and potentially credentials for protected test environments. Those are not requested by the skill and must be supplied by the user when integrating; make sure any tokens needed for target artifacts are scoped and managed appropriately.
Persistence & Privilege
always:false and default autonomous invocation are set (normal). The skill does not request persistent presence or modify other skills. Because agent invocation hooks (run_generator/run_evaluator) are left for integrators to implement, the skill itself does not gain elevated privileges by itself.
Assessment
This skill is a coherent pattern/template for running a Generator→Evaluator loop and appears benign, but it is a framework rather than a turnkey integration. Before installing or using it:

  1. Be prepared to provide and secure any Playwright/browser binaries and test-environment access; the skill assumes browser automation but does not install it.
  2. Wire run_generator() and run_evaluator() yourself. Those functions currently raise NotImplementedError and contain commented examples referencing a WorkBuddy API; audit and restrict any subagent/session APIs you call.
  3. If you evaluate protected services/APIs, supply scoped test credentials (not broad production keys) and rotate them afterward.
  4. Be mindful of privacy: evaluator screenshots, logs, or artifact URLs may contain sensitive data, so store them securely or redact them before uploading.
  5. If you allow autonomous invocation, review and control which target URLs the Evaluator will access to avoid accidental scanning of internal systems.

If you want extra assurance, ask the developer for a concrete integration example (how it will invoke your agents and where screenshots/logs are stored) and a manifest of required local tooling (Playwright, browsers) before running it in production.


License: MIT-0

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality — the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

  1. Architecture templates for Generator-Evaluator agent pairs
  2. Evaluator prompt templates calibrated with few-shot scoring examples
  3. Iteration loop design for 5-15 round refinement cycles
  4. Playwright integration patterns for real browser-based evaluation
  5. Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │           Evaluator                 │
 │ • Reads spec (NOT generator output) │
 │ • Operates artifact via Playwright  │
 │   (click, fill form, navigate)      │
 │ • Scores on rubric (0-100)          │
 │ • Writes structured feedback        │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
          ┌─────────────────┐
          │ Score ≥ target? │
          │   YES → Done    │
          │   NO → Loop     │
          └───────┬─────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: The Evaluator reads the original spec, not the Generator's output. It evaluates independently, as if it were a real user encountering the product for the first time.
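
In code terms, this separation can be made structural: give the Evaluator an entry point that accepts only the spec and a handle to the deployed artifact, so it can never see the Generator's reasoning or self-assessment even by accident. A minimal sketch (the class and method names are illustrative assumptions, not the skill's actual interfaces; the skill's own templates likewise raise NotImplementedError where you wire in your agents):

class Generator:
    def run(self, spec: dict, history: list) -> str:
        """Produce an artifact from the spec plus prior evaluator feedback.

        Deliberately receives no rubric and no evaluation criteria.
        """
        raise NotImplementedError  # wire to your generator agent

class Evaluator:
    def run(self, spec: dict, artifact) -> "Evaluation":
        """Score the deployed artifact against the spec, as a first-time user.

        Deliberately receives no generator reasoning or self-report.
        """
        raise NotImplementedError  # wire to your evaluator agent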


When to Apply

Scenario                                    | Apply DoubleAgent?
--------------------------------------------|------------------------------
AI-generated frontend UI with interactions  | ✅ Yes
Multi-step workflow code (forms, flows)     | ✅ Yes
API endpoint implementation + validation    | ✅ Yes
Content generation (reports, copy, docs)    | ✅ Yes (text-based evaluator)
Single-function refactoring                 | ⚠️ Optional
Simple config changes                       | ❌ Not needed

Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

  • Concrete (measurable outcomes, not vague goals)
  • Observable (evaluable through interaction or inspection)
  • Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.
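
For illustration, a minimal spec contract in the dict form used by the loop below might look like this (the field names are assumptions for this sketch, not the skill's actual template):

spec = {
    "version": "1.0",
    "goal": "Signup form with client-side validation",
    "requirements": [
        {"id": "R1", "text": "Email field rejects invalid addresses",
         "observable_via": "fill + submit, check error message"},
        {"id": "R2", "text": "Submit button stays disabled until all fields validate",
         "observable_via": "DOM state after partial input"},
    ],
    "pass_threshold": 80,
}

Each requirement is phrased as something the Evaluator can verify purely by interacting with the artifact.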

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

  • Do NOT ask the Generator to self-evaluate
  • Do NOT include evaluation criteria in the Generator's prompt
  • Provide: spec + iteration history + previous evaluator feedback
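
A sketch of assembling the Generator's input under these constraints, assuming the dict-style spec above (note the prompt carries the spec and prior feedback but no rubric or scoring criteria):

def build_generator_prompt(spec: dict, history: list) -> str:
    """Spec + iteration history in; no evaluation criteria, ever."""
    prompt = f"Produce an implementation that satisfies this spec:\n{spec}\n"
    if history:
        last = history[-1]
        prompt += (f"\nYour previous attempt (round {last['round']}) scored "
                   f"{last['score']}/100. Evaluator feedback:\n{last['feedback']}\n")
    prompt += "\nReturn only the artifact. Do not assess your own work."
    return prompt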

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

  • Load references/evaluator-prompts.md for calibrated prompt templates
  • Use Playwright MCP for UI/web artifacts (real browser interaction)
  • Use structured JSON output for scores to enable automated loop control
  • Calibrate with few-shot examples BEFORE running (prevents grade inflation)
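
As one possible shape for that structured output once parsed from JSON (a sketch; the field names are assumptions, not the skill's schema):

from dataclasses import dataclass, field

@dataclass
class Evaluation:
    score: int            # 0-100, weighted rubric total
    feedback: str         # structured findings the Generator can act on next round
    breakdown: dict = field(default_factory=dict)             # per-dimension scores
    passed_requirements: list = field(default_factory=list)   # verified spec requirement IDs

The loop only needs score and feedback; the rest is useful for audit trails.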

Step 4: Design the Iteration Loop

MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

history = []
for round_num in range(1, MAX_ROUNDS + 1):
    output = generator.run(spec, history)
    evaluation = evaluator.run(spec, output)  # Playwright-based, real interaction

    history.append({"round": round_num,
                    "score": evaluation.score,
                    "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break

    # Plateau: three consecutive scores in a narrow band means incremental
    # fixes are exhausted -- force a complete strategy reset.
    recent = [h["score"] for h in history[-3:]]
    if len(recent) == 3 and max(recent) - min(recent) <= 3:
        generator.switch_approach()

See scripts/iteration_loop.py for a complete implementation template.

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

  • 1 example at ~30/100 (clearly bad)
  • 1 example at ~60/100 (mediocre)
  • 1 example at ~85/100 (good)
  • 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
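
A sketch of that check, assuming the evaluator interface above; the four example artifacts are placeholders you supply, and scripts/calibrate_evaluator.py is the real utility:

CALIBRATION_SET = [            # (known artifact, expected score)
    (clearly_bad_example, 30),
    (mediocre_example, 60),
    (good_example, 85),
    (excellent_example, 95),
]

for artifact, expected in CALIBRATION_SET:
    result = evaluator.run(spec, artifact)
    drift = abs(result.score - expected)
    if drift > 15:
        print(f"Calibration failed: scored {result.score}, expected ~{expected} "
              f"(drift {drift}). Adjust rubric wording/weights and re-run.")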


Scoring Rubric Design

Effective rubrics for software systems:

Dimension                 | Weight | What to Measure
--------------------------|--------|----------------------------------------------------
Functional completeness   | 30%    | Does each spec requirement work end-to-end?
Interaction quality       | 25%    | Click/form/navigation behavior as a real user
Edge case handling        | 20%    | Error states, empty data, boundary inputs
Code/design quality       | 15%    | Consistency, readability, no obvious anti-patterns
Originality / craft       | 10%    | Avoids generic/template outputs when spec requires uniqueness

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
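
The final score is then a weighted sum of per-dimension scores, each on a 0-100 scale. A sketch using the weights from the table above:

WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality":     0.25,
    "edge_case_handling":      0.20,
    "code_design_quality":     0.15,
    "originality_craft":       0.10,
}

def weighted_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 total."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

# Example: 90*0.30 + 80*0.25 + 60*0.20 + 70*0.15 + 50*0.10 = 74.5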


Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

  1. Navigate to the deployed artifact URL
  2. Execute each spec requirement as a user action sequence
  3. Observe actual behavior (DOM state, network requests, visual output)
  4. Record pass/fail per requirement with screenshots
  5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

  • playwright_navigate → open URL
  • playwright_click → interact with elements
  • playwright_fill → fill form inputs
  • playwright_screenshot → capture evidence
  • playwright_get_visible_text → verify content
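
For a concrete feel, here is the same flow written directly against the standard Playwright Python API rather than the MCP tools above (the URL and selectors are hypothetical):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://staging.example.com/signup")   # 1. navigate to the artifact
    page.fill("#email", "not-an-email")               # 2. execute a spec requirement
    page.click("button[type=submit]")
    error_text = page.inner_text(".error")            # 3. observe actual behavior
    page.screenshot(path="round1_evidence.png")       # 4. record evidence
    r1_passed = "valid email" in error_text.lower()   # 5. feeds the score breakdown
    browser.close()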

Reference Files

  • references/architecture.md — Detailed architecture patterns, spec templates, and design rationale
  • references/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact types

Scripts

  • scripts/iteration_loop.py — Complete iteration loop implementation template
  • scripts/calibrate_evaluator.py — Evaluator calibration utility
