DoubleAgent — Generator-Evaluator Dual Agent Pattern

v1.0.0

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evaluation.

by mingyuan (@zmy1006-sudo)

Install

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zmy1006-sudo/double-agent.

Prompt preview (Install & Setup):
Install the skill "DoubleAgent — Generator-Evaluator Dual Agent Pattern" (zmy1006-sudo/double-agent) from ClawHub.
Skill page: https://clawhub.ai/zmy1006-sudo/double-agent
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI (bare skill slug):

openclaw skills install double-agent

ClawHub CLI:

npx clawhub@latest install double-agent

Security Scan
VirusTotal: Pending
OpenClaw: Benign (high confidence)
Purpose & Capability
The name/description (Generator–Evaluator dual-agent QA) matches the included SKILL.md, architecture references, prompt templates, and iteration/calibration scripts. One minor gap: the SKILL.md and templates assume Playwright-based browser evaluation and agent-subagent invocation (examples reference sessions_spawn / WorkBuddy), but the skill does not declare any runtime dependencies, binaries, or environment variables (e.g., Playwright, browsers, or subagent endpoint credentials). An integrator will need to supply those separately.
Instruction Scope
Runtime instructions focus on defining specs, running a generator, running an independent evaluator that performs real interactions (Playwright/HTTP), scoring, and looping. There are no instructions that read unrelated system files, exfiltrate secrets, or contact unexpected external endpoints. The included templates require implementers to replace placeholders with their actual agent invocation code.
Install Mechanism
This is an instruction-only skill with template scripts; there is no install spec and no downloads or external installers. The code files are templates and raise NotImplementedError where integrators must implement their agent calls — nothing will be executed automatically by the skill as provided.
Credentials
The skill declares no required environment variables or credentials, which is proportionate to an instructional/template skill. However, some evaluator templates (API evaluation) mention passing auth headers/tokens and the Playwright flow will typically require network access and potentially credentials for protected test environments. Those are not requested by the skill and must be supplied by the user when integrating; make sure any tokens needed for target artifacts are scoped and managed appropriately.
Persistence & Privilege
always:false and default autonomous invocation are set (normal). The skill does not request persistent presence or modify other skills. Because agent invocation hooks (run_generator/run_evaluator) are left for integrators to implement, the skill itself does not gain elevated privileges by itself.
Assessment
This skill is a coherent pattern/template for running a Generator→Evaluator loop and appears benign, but it is a framework rather than a turnkey integration. Before installing or using it:

  1. Be prepared to provide and secure any Playwright/browser binaries and test-environment access; the skill assumes browser automation but does not install it.
  2. Wire run_generator() and run_evaluator() yourself. Those functions currently raise NotImplementedError and contain commented examples referencing a WorkBuddy API; audit and restrict any subagent/session APIs you call.
  3. If you evaluate protected services/APIs, supply scoped test credentials (not broad production keys) and rotate them afterward.
  4. Be mindful of privacy: evaluator screenshots, logs, or artifact URLs may contain sensitive data, so store them securely or redact them before uploading.
  5. If you allow autonomous invocation, review and control which target URLs the Evaluator will access to avoid accidental scanning of internal systems.

If you want extra assurance, ask the developer for a concrete integration example (how it will invoke your agents and where screenshots/logs are stored) and a manifest of required local tooling (Playwright, browsers) before running it in production.


License: MIT-0

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality — the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

  1. Architecture templates for Generator-Evaluator agent pairs
  2. Evaluator prompt templates calibrated with few-shot scoring examples
  3. Iteration loop design for 5-15 round refinement cycles
  4. Playwright integration patterns for real browser-based evaluation
  5. Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │           Evaluator                 │
 │ • Reads spec (NOT generator output) │
 │ • Operates artifact via Playwright  │
 │   (click, fill form, navigate)      │
 │ • Scores on rubric (0-100)          │
 │ • Writes structured feedback        │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
          ┌─────────────────┐
          │ Score ≥ target? │
          │   YES → Done    │
          │   NO → Loop     │
          └───────┬─────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: The Evaluator reads the original spec, not the Generator's output. It evaluates independently, as if it were a real user encountering the product for the first time.
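
In code terms, this separation can be made structural: give the Evaluator an entry point that accepts only the spec and a handle to the deployed artifact, so it can never see the Generator's reasoning or self-assessment even by accident. A minimal sketch (the class and method names are illustrative assumptions, not the skill's actual interfaces; the skill's own templates likewise raise NotImplementedError where you wire in your agents):

class Generator:
    def run(self, spec: dict, history: list) -> str:
        """Produce an artifact from the spec plus prior evaluator feedback.

        Deliberately receives no rubric and no evaluation criteria.
        """
        raise NotImplementedError  # wire to your generator agent

class Evaluator:
    def run(self, spec: dict, artifact) -> "Evaluation":
        """Score the deployed artifact against the spec, as a first-time user.

        Deliberately receives no generator reasoning or self-report.
        """
        raise NotImplementedError  # wire to your evaluator agent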


When to Apply

Scenario                                    | Apply DoubleAgent?
--------------------------------------------|------------------------------
AI-generated frontend UI with interactions  | ✅ Yes
Multi-step workflow code (forms, flows)     | ✅ Yes
API endpoint implementation + validation    | ✅ Yes
Content generation (reports, copy, docs)    | ✅ Yes (text-based evaluator)
Single-function refactoring                 | ⚠️ Optional
Simple config changes                       | ❌ Not needed

Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

  • Concrete (measurable outcomes, not vague goals)
  • Observable (evaluable through interaction or inspection)
  • Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.
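
For illustration, a minimal spec contract in the dict form used by the loop below might look like this (the field names are assumptions for this sketch, not the skill's actual template):

spec = {
    "version": "1.0",
    "goal": "Signup form with client-side validation",
    "requirements": [
        {"id": "R1", "text": "Email field rejects invalid addresses",
         "observable_via": "fill + submit, check error message"},
        {"id": "R2", "text": "Submit button stays disabled until all fields validate",
         "observable_via": "DOM state after partial input"},
    ],
    "pass_threshold": 80,
}

Each requirement is phrased as something the Evaluator can verify purely by interacting with the artifact.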

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

  • Do NOT ask the Generator to self-evaluate
  • Do NOT include evaluation criteria in the Generator's prompt
  • Provide: spec + iteration history + previous evaluator feedback
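
A sketch of assembling the Generator's input under these constraints, assuming the dict-style spec above (note the prompt carries the spec and prior feedback but no rubric or scoring criteria):

def build_generator_prompt(spec: dict, history: list) -> str:
    """Spec + iteration history in; no evaluation criteria, ever."""
    prompt = f"Produce an implementation that satisfies this spec:\n{spec}\n"
    if history:
        last = history[-1]
        prompt += (f"\nYour previous attempt (round {last['round']}) scored "
                   f"{last['score']}/100. Evaluator feedback:\n{last['feedback']}\n")
    prompt += "\nReturn only the artifact. Do not assess your own work."
    return prompt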

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

  • Load references/evaluator-prompts.md for calibrated prompt templates
  • Use Playwright MCP for UI/web artifacts (real browser interaction)
  • Use structured JSON output for scores to enable automated loop control
  • Calibrate with few-shot examples BEFORE running (prevents grade inflation)
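
As one possible shape for that structured output once parsed from JSON (a sketch; the field names are assumptions, not the skill's schema):

from dataclasses import dataclass, field

@dataclass
class Evaluation:
    score: int            # 0-100, weighted rubric total
    feedback: str         # structured findings the Generator can act on next round
    breakdown: dict = field(default_factory=dict)             # per-dimension scores
    passed_requirements: list = field(default_factory=list)   # verified spec requirement IDs

The loop only needs score and feedback; the rest is useful for audit trails.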

Step 4: Design the Iteration Loop

MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

history = []
for round_num in range(1, MAX_ROUNDS + 1):
    output = generator.run(spec, history)
    evaluation = evaluator.run(spec, output)  # Playwright-based, real interaction

    history.append({"round": round_num,
                    "score": evaluation.score,
                    "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break

    # Plateau: three consecutive scores in a narrow band means incremental
    # fixes are exhausted -- force a complete strategy reset.
    recent = [h["score"] for h in history[-3:]]
    if len(recent) == 3 and max(recent) - min(recent) <= 3:
        generator.switch_approach()

See scripts/iteration_loop.py for a complete implementation template.

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

  • 1 example at ~30/100 (clearly bad)
  • 1 example at ~60/100 (mediocre)
  • 1 example at ~85/100 (good)
  • 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
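
A sketch of that check, assuming the evaluator interface above; the four example artifacts are placeholders you supply, and scripts/calibrate_evaluator.py is the real utility:

CALIBRATION_SET = [            # (known artifact, expected score)
    (clearly_bad_example, 30),
    (mediocre_example, 60),
    (good_example, 85),
    (excellent_example, 95),
]

for artifact, expected in CALIBRATION_SET:
    result = evaluator.run(spec, artifact)
    drift = abs(result.score - expected)
    if drift > 15:
        print(f"Calibration failed: scored {result.score}, expected ~{expected} "
              f"(drift {drift}). Adjust rubric wording/weights and re-run.")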


Scoring Rubric Design

Effective rubrics for software systems:

Dimension                 | Weight | What to Measure
--------------------------|--------|----------------------------------------------------
Functional completeness   | 30%    | Does each spec requirement work end-to-end?
Interaction quality       | 25%    | Click/form/navigation behavior as a real user
Edge case handling        | 20%    | Error states, empty data, boundary inputs
Code/design quality       | 15%    | Consistency, readability, no obvious anti-patterns
Originality / craft       | 10%    | Avoids generic/template outputs when spec requires uniqueness

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
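
The final score is then a weighted sum of per-dimension scores, each on a 0-100 scale. A sketch using the weights from the table above:

WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality":     0.25,
    "edge_case_handling":      0.20,
    "code_design_quality":     0.15,
    "originality_craft":       0.10,
}

def weighted_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 total."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

# Example: 90*0.30 + 80*0.25 + 60*0.20 + 70*0.15 + 50*0.10 = 74.5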


Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

  1. Navigate to the deployed artifact URL
  2. Execute each spec requirement as a user action sequence
  3. Observe actual behavior (DOM state, network requests, visual output)
  4. Record pass/fail per requirement with screenshots
  5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

  • playwright_navigate → open URL
  • playwright_click → interact with elements
  • playwright_fill → fill form inputs
  • playwright_screenshot → capture evidence
  • playwright_get_visible_text → verify content
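
For a concrete feel, here is the same flow written directly against the standard Playwright Python API rather than the MCP tools above (the URL and selectors are hypothetical):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://staging.example.com/signup")   # 1. navigate to the artifact
    page.fill("#email", "not-an-email")               # 2. execute a spec requirement
    page.click("button[type=submit]")
    error_text = page.inner_text(".error")            # 3. observe actual behavior
    page.screenshot(path="round1_evidence.png")       # 4. record evidence
    r1_passed = "valid email" in error_text.lower()   # 5. feeds the score breakdown
    browser.close()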

Reference Files

  • references/architecture.md — Detailed architecture patterns, spec templates, and design rationale
  • references/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact types

Scripts

  • scripts/iteration_loop.py — Complete iteration loop implementation template
  • scripts/calibrate_evaluator.py — Evaluator calibration utility
