Install
openclaw skills install self-refine-reflectionSystematically review and iteratively refine your response for logic, accuracy, completeness, conciseness, actionability, and consistency before delivering.
openclaw skills install self-refine-reflectionAn AI agent skill for systematic self-reflection and iterative output refinement. Based on Madaan et al. (2023), Shinn et al. (2023), and Andrew Ng's Reflection design pattern.
At skill load time, detect your runtime environment and adjust capabilities:
| Capability | How to Check | Fallback |
|---|---|---|
| File system | Can you read references/reflection-templates.md? | Use the inline templates below instead |
| Persistent memory | Can you write to memory/? | Store reflection notes in conversation context only |
| Long context | Is your context window > 32K tokens? | Cap at Level 2 (skip adversarial review) |
| Tool access | Can you call external tools? | Use mental verification only |
Detection rules:
references/ directory → full mode (all levels + memory)This means every platform gets the best possible experience automatically — no manual configuration needed.
If you cannot read references/reflection-templates.md, use these directly:
Review drafted response: (1) Did I answer everything? (2) Logic gaps? (3) Can I cut 20%? (4) User can act on this? Fix → Deliver.
For each dimension (Logic / Facts / Completeness / Conciseness / Actionability / Consistency):
- Quote exact sentences with issues
- Rate: Critical / Minor / Pass
Fix ALL Critical. Fix Minor if straightforward. Re-read once. Deliver.
Round 1: Level 2 review.
Round 2: "If a domain expert attacked this, what would they target?" Fix valid attacks.
Round 3: Did fixes introduce new issues? If stable → deliver. If not → fix and deliver.
Activate when you are about to deliver a final response to the user, after you've gathered all information and formed your answer. Reflection happens before delivery, not instead of work.
1. GENERATE — Produce your initial response as normal
2. CRITIQUE — Apply the reflection framework (depth-dependent)
3. REFINE — Fix every issue found in CRITIQUE
4. CHECK — If issues remain AND within budget, loop to step 2
— If clear OR budget exhausted, deliver
Source: Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback" (2023) — the same LLM acts as generator, critic, and refiner.
Every critique examines the response across these dimensions. Not all apply to every response — skip irrelevant ones silently.
| # | Dimension | What to Check | Signal Words |
|---|---|---|---|
| 1 | Logical Completeness | Does the reasoning chain have gaps? Are conclusions supported by premises? | "therefore" without prior evidence, jumps in logic |
| 2 | Factual Accuracy | Are there unverified assertions? Claims that could be wrong? | Specific numbers, dates, names, "everyone knows..." |
| 3 | Response Completeness | Did I address every part of the user's question? Are there sub-questions I ignored? | Multiple questions in user message, implicit needs |
| 4 | Conciseness | Is there redundancy? Can paragraphs be merged? Are there filler phrases? | "It's important to note that", "In conclusion", repeated points |
| 5 | Actionability | Can the user act on this immediately? Or do they need to ask follow-ups? | Vague advice without steps, missing specifics |
| 6 | Internal Consistency | Do any parts of my response contradict each other? | Conflicting recommendations, contradictory statements |
The depth is determined by task complexity, not user preference. The agent auto-selects.
Trigger: Simple factual lookups, greetings, trivial questions, single-sentence answers.
Action: Do nothing extra. Just respond. Cost: 0 additional tokens.
Examples: "What time is it?", "Thanks", simple formatting requests.
Trigger: Medium-complexity tasks — explanations, how-to guides, code snippets, multi-paragraph responses.
Action: One pass through all 6 dimensions mentally. Fix issues. Deliver.
Budget: 1 refinement round. ~15-20% overhead on response tokens.
Internal process:
After drafting your response, ask yourself:
- Did I answer everything they asked?
- Any logical gaps or contradictions?
- Can I cut 20% of the words without losing meaning?
- Would the user know exactly what to do next?
Fix → Deliver.
Trigger: Complex tasks — technical architectures, multi-step plans, research summaries, anything with 5+ distinct claims.
Action: Explicit dimension-by-dimension review. One full refinement round with documented issues.
Budget: Up to 2 refinement rounds. ~30% overhead.
Internal process:
Draft response.
For each dimension (1-6):
- Identify specific issues (quote the exact sentence)
- Rate severity: Critical / Minor / Pass
If any Critical: refine entire response, then re-check
If only Minor: fix inline, deliver
Trigger: High-stakes tasks — production code, security decisions, medical/legal adjacent, public-facing content, or user explicitly requests thorough review.
Action: Full audit PLUS an adversarial pass where you actively try to find the weakest point in your response.
Budget: Up to 3 refinement rounds. ~50% overhead.
Internal process:
Draft response.
Round 1: Standard dimension-by-dimension review.
Round 2: Adversarial pass — "If I wanted to prove this response wrong,
what would I attack?" Check those attack vectors.
Round 3 (if needed): Verify fixes from Round 2 didn't introduce new issues.
Based on the empirical finding from Madaan et al. that "most gains are in the initial iterations":
Anti-pattern — DO NOT:
| Depth | Max Rounds | Approx. Token Overhead | When to Use |
|---|---|---|---|
| Level 0 | 0 | 0% | Simple Q&A |
| Level 1 | 1 | ~15% | Most conversations |
| Level 2 | 2 | ~30% | Complex technical |
| Level 3 | 3 | ~50% | High-stakes only |
Principle: Reflection should cost less than the cost of delivering a wrong answer. For low-stakes responses, skip reflection entirely.
Reflection is internal — the user should not see the raw critique. However:
Level 1: No note (keep it invisible).
Level 2: Optionally: _(response refined via self-review)_
Level 3: Optionally: _(refined through [N]-round adversarial self-review)_
Inspired by Shinn et al. (2023) — Reflexion stores verbal self-reflections in persistent memory for future tasks.
When to use: After completing a Level 2 or Level 3 reflection where you discovered a significant pattern in your own errors (e.g., "I tend to forget edge cases in X", "I consistently over-explain Y").
How to use: Write a brief note to memory/ capturing the pattern:
Self-reflection: When writing about [topic], I tend to [pattern].
Fix: [specific behavior change].
Date: [today].
This creates a growing corpus of self-corrections that improve future responses.
This skill synthesizes multiple academic approaches:
| Technique | Source | What We Borrow |
|---|---|---|
| Self-Refine | Madaan et al. 2023 | Core GENERATE→CRITIQUE→REFINE loop |
| Reflexion | Shinn et al. 2023 | Persistent memory of self-reflections |
| Chain-of-Verification | Dhuliawala et al. 2023 | Verification questions for factual claims (Level 2-3) |
| Self-Calibration | Kadavath et al. 2022 | Confidence assessment of own outputs |
| Andrew Ng's Reflection | Ng 2024 | Design pattern: agent critiques and improves its own output iteratively |
| CRITIC | Gou et al. 2023 | Tool-interactive critiquing (use tools to verify when possible) |
REFLECT? ──→ Simple/trivial? ──→ NO → Just respond
│
YES
│
├─ Medium complexity → LEVEL 1 (1 round, mental check)
├─ Complex / multi-claim → LEVEL 2 (2 rounds, explicit review)
└─ High-stakes / user requested → LEVEL 3 (3 rounds, adversarial)
For each round:
1. Check 6 dimensions
2. Fix all issues
3. Converged? → Deliver
4. Budget left? → Next round
5. Budget exhausted → Deliver current version