Anti-Hallucination

Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information.

Audits: Pass

Install

openclaw skills install anti-hallucination-skill

SKILL.md - Anti-Hallucination Protocol

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

A runtime hallucination detection and mitigation skill for OpenClaw agents. It recognises the cognitive and behavioral signs of hallucination, then intervenes to restore grounded reasoning.

Based on 2026 Research: HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.

The Philosophy

Detection > Prevention. Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.

Self-Awareness > External Guardrails. An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.

Specificity > Generality. Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.

When to Activate

Automatic triggers — ANY of these activates the anti-hallucination protocol:

  • Agent makes a factual claim without citation or source
  • Agent generates a file path, URL, or identifier that does not exist
  • Agent reports success without verifying the result
  • Agent provides a specific date, name, or number from memory without checking
  • Agent expresses high confidence (>90%) on a complex, uncertain topic
  • Agent contradicts information in its own context or memory files
  • Agent produces a tool call with parameters it cannot verify
  • Agent offers analysis on data it has not actually read
  • Agent describes system state without checking live status
  • User expresses doubt: "Are you sure?" / "Can you verify that?"

Implicit triggers (monitor continuously):

  • Tool call returns error but agent continues as if successful
  • Agent invents plausible-sounding but unverified details
  • Agent generalises from a single example
  • Agent uses absolute language ("always", "never", "certainly") on probabilistic topics
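These triggers can be encoded as a pre-output scan. The sketch below is illustrative only: the Draft fields and the trigger predicates are assumptions about what an agent runtime could expose, not an OpenClaw API.

import re
from dataclasses import dataclass, field

@dataclass
class Draft:
    # Hypothetical view of an agent's pending output; the fields are
    # assumptions, not part of any OpenClaw interface.
    text: str
    cited_sources: list = field(default_factory=list)
    verified_success: bool = False
    self_confidence: float = 0.0
    last_tool_errored: bool = False

ABSOLUTES = re.compile(r"\b(always|never|certainly)\b", re.IGNORECASE)

def fired_triggers(draft: Draft) -> list[str]:
    """Return the anti-hallucination triggers that fire for this draft."""
    hits = []
    if not draft.cited_sources:
        hits.append("factual claim without citation or source")
    if draft.self_confidence > 0.90:
        hits.append("high confidence on a complex, uncertain topic")
    if not draft.verified_success and "success" in draft.text.lower():
        hits.append("success reported without verifying the result")
    if draft.last_tool_errored:
        hits.append("tool call errored but output continues as if successful")
    if ABSOLUTES.search(draft.text):
        hits.append("absolute language on a probabilistic topic")
    return hits

Any non-empty result means the protocol activates before the draft is sent.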

The Hallucination Taxonomy

Know what you're looking for:

| Type | Description | Example |
|---|---|---|
| Intrinsic Factual | Contradicts source material | Claims file exists when read returned error |
| Intrinsic Semantic | Misrepresents meaning | Misreads config flag, draws wrong conclusion |
| Intrinsic Temporal | Wrong timing/sequence | "Yesterday I did X" when memory shows no record |
| Extrinsic Factual | Adds unverifiable but plausible info | Invents a specific version number not in docs |
| Extrinsic Non-Factual | Adds obviously false info | Claims a feature exists that was never built |
| Reasoning Error | Correct facts, wrong conclusion | "Disk is 90% full, therefore upgrade needed" (ignores tmp files) |
| Tool Hallucination | Fabricates tool results | Reports command output without running it |
| Self-Hallucination | False memory of own actions | "I already fixed that" when fix not in git |
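For logging and metrics (see the sections below), the taxonomy is easy to encode. The enum is a sketch; the names and values simply mirror the table's rows.

from enum import Enum

class HallucinationType(Enum):
    # Values are the taxonomy descriptions from the table above.
    INTRINSIC_FACTUAL = "contradicts source material"
    INTRINSIC_SEMANTIC = "misrepresents meaning"
    INTRINSIC_TEMPORAL = "wrong timing/sequence"
    EXTRINSIC_FACTUAL = "adds unverifiable but plausible info"
    EXTRINSIC_NON_FACTUAL = "adds obviously false info"
    REASONING_ERROR = "correct facts, wrong conclusion"
    TOOL = "fabricates tool results"
    SELF = "false memory of own actions"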

The Recognition Protocol (5-Second Self-Check)

Before ANY output that contains facts, claims, or recommendations, ask:

### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If so, am I above 95%? Flag anything over 95% for extra scrutiny.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?

If ANY check fails: Escalate to Grounding Protocol (below).
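As a sketch, the five checks reduce to a small function. All inputs are self-assessments the agent must supply honestly; none of the names below come from a real API.

def reality_check(*, has_direct_evidence: bool, verifiable_now: bool,
                  confidence: float, read_this_session: bool,
                  contradicts_context: bool) -> list[str]:
    """Run the 5-second self-check and return the failed checks.
    Any non-empty result escalates to the Grounding Protocol."""
    failures = []
    if not has_direct_evidence:
        failures.append("SOURCE: no file read, tool output, or live check")
    if not verifiable_now:
        failures.append("VERIFICATION: cannot be checked with a tool call now")
    if confidence > 0.95:
        failures.append("CONFIDENCE: >95% on this claim; flag for scrutiny")
    if not read_this_session:
        failures.append("MEMORY: 'feels right' rather than read this session")
    if contradicts_context:
        failures.append("CONTRADICTION: conflicts with context or memory")
    return failures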

The Grounding Protocol (When Signs Detected)

Step 1: Stop and Flag

⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]

Step 2: Verify or Withdraw

If verifiable in <30s:

  • Run the tool call to check
  • Report actual result
  • Update confidence based on evidence

If not immediately verifiable:

  • Withdraw the claim
  • Replace with: "I do not have direct evidence for [X]. My sources: [list]."
  • Offer to verify if user wants

If partially verifiable:

  • Downgrade confidence explicitly
  • Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."
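A sketch of Step 2's branching, assuming a hypothetical verify callable that wraps whatever quick (<30s) tool call applies; it is a placeholder, not an OpenClaw API.

def ground_claim(claim: str, verify=None) -> str:
    """Step 2 of the Grounding Protocol: verify, withdraw, or downgrade.
    `verify` is a zero-argument callable returning (ok, evidence);
    pass None when no quick check exists."""
    if verify is None:
        # Not immediately verifiable: withdraw rather than guess.
        return (f"I do not have direct evidence for: {claim}. "
                "I can verify this on request.")
    ok, evidence = verify()
    if ok:
        return f"Confirmed: {claim} (evidence: {evidence})"
    # Check failed or was partial: downgrade explicitly, never quietly.
    return f"Withdrawn: {claim}. The check returned: {evidence}"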

Step 3: Document the Correction

Add to memory/YYYY-MM-DD.md:

### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]
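A minimal helper for writing that entry, assuming the memory/ directory convention above; the arguments map one-to-one onto the template fields.

from datetime import datetime
from pathlib import Path

def log_correction(claim: str, kind: str, trigger: str,
                   correction: str, lesson: str,
                   memory_dir: str = "memory") -> None:
    """Append a correction entry to memory/YYYY-MM-DD.md."""
    now = datetime.now()
    path = Path(memory_dir) / f"{now:%Y-%m-%d}.md"
    entry = (
        f"\n### Hallucination Correction — {now:%H:%M}\n"
        f"- Claim: {claim}\n"
        f"- Type: {kind}\n"
        f"- How caught: {trigger}\n"
        f"- Correction: {correction}\n"
        f"- Lesson: {lesson}\n"
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(entry)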

The Confidence Calibration Rules

Never express certainty you don't have:

| Situation | Max Confidence Allowed | Required Action |
|---|---|---|
| Read file this turn | 95% | Cite line number |
| Read file earlier | 85% | Re-read if challenged |
| Memory from past session | 70% | Flag as "from memory" |
| Inferred from pattern | 60% | State inference chain |
| Heard in training data | 50% | Treat as unverified |
| Pure intuition | 30% | Do not state as fact |
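The table translates directly into a clamp; the dictionary keys below are invented labels for the six situations.

# Max confidence per evidence source, per the calibration table above.
CONFIDENCE_CAPS = {
    "read_this_turn": 0.95,
    "read_earlier": 0.85,
    "past_session_memory": 0.70,
    "pattern_inference": 0.60,
    "training_data": 0.50,
    "intuition": 0.30,
}

def calibrated(confidence: float, evidence_source: str) -> float:
    """Clamp self-assessed confidence to the ceiling for its evidence source.
    Unknown sources get the most conservative cap."""
    return min(confidence, CONFIDENCE_CAPS.get(evidence_source, 0.30))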

The Tool-Use Guardrails

Before reporting tool results:

  1. Did the tool actually execute? (check for error output)
  2. Did I read the full output? (not just first few lines)
  3. Did I understand the output correctly? (re-read if ambiguous)
  4. Did I report what it says, not what I expected it to say?

Common tool hallucinations to watch for:

  • Reporting grep results without checking if match is real
  • Claiming file exists based on path construction, not ls/test
  • Interpreting error messages as success (e.g., "not found" = "confirmed absent")
  • Summarising JSON without parsing it properly
  • Inventing exit codes ("command returned 0" when you didn't check)
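One way to make several of these impossible at once is to report only what a wrapper actually captured. A sketch using Python's standard subprocess module:

import subprocess

def run_verified(cmd: list[str]) -> dict:
    """Run a command and report what actually happened, not what was expected.
    Captures the real exit code and the full streams so neither can be invented."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": cmd,
        "exit_code": proc.returncode,   # never report 0 without checking
        "stdout": proc.stdout,          # full output, not the first few lines
        "stderr": proc.stderr,
        "succeeded": proc.returncode == 0,
    }

Because the exit code and full output are returned together, "command returned 0" can never be asserted unchecked, and a "not found" message typically arrives with a non-zero exit code rather than as success.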

The Multi-Agent Validation Pattern

When available (C1/C2/C3 coordination):

### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel

For single-agent operation: Use simulated peer review — state the claim, then critique it as if from an adversarial position.
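A sketch of both modes, where ask_peer is a placeholder for however C1/C2/C3 coordination delivers a verdict; single-agent mode falls back to the adversarial self-critique described above.

def cross_verify(claim: str, confidence: float, ask_peer=None) -> str:
    """Cross-agent verification sketch. `ask_peer` is a hypothetical callable
    returning "agree", "disagree", or "cannot verify"; None means single-agent
    mode, which flags the claim for simulated peer review instead."""
    if ask_peer is None:
        return f"SELF-REVIEW: argue against this claim before stating it: {claim}"
    verdict = ask_peer(claim)
    if verdict == "disagree":
        return f"RE-CHECK: peer disagrees with {claim!r}; both re-check sources"
    if verdict == "cannot verify" and confidence > 0.90:
        return f"DOWNGRADE: no consensus, cap confidence at 90% for {claim!r}"
    return f"OK: peer verdict {verdict!r} for {claim!r}"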

The Metacognitive Loop (Continuous)

Every 5-10 minutes of active work, or at natural breakpoints:

### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?
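A sketch of the cadence as a timer the agent polls at natural breakpoints; the 300-second default matches the low end of the 5-10 minute window.

import time

class MetacognitiveCheckpoint:
    """Fires the checkpoint questions at most once per `interval_s` seconds."""

    QUESTIONS = (
        "What have I claimed since last checkpoint?",
        "Which claims were verified vs assumed?",
        "Did any tool call fail silently?",
        "Am I building on a potentially false foundation?",
        "Should I re-verify my starting assumptions?",
    )

    def __init__(self, interval_s: float = 300.0):
        self.interval_s = interval_s
        self._last = time.monotonic()

    def due(self):
        """Return the questions if a checkpoint is due, else None."""
        if time.monotonic() - self._last >= self.interval_s:
            self._last = time.monotonic()
            return self.QUESTIONS
        return None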

Recovery Patterns

When caught hallucinating:

  1. Acknowledge immediately. Do not double down. Do not deflect. "I was wrong about [X]."
  2. Correct explicitly. State the correction clearly, not buried in explanation.
  3. Explain the gap. "I stated [X] because [reason]. The actual state is [Y]."
  4. Update memory. Log the pattern so future-you watches for it.
  5. Do not apologise excessively. One clear correction beats three apologies.

When uncertain mid-task:

  1. State uncertainty. "I am not confident about [X]. Here is what I know: [...]"
  2. Offer verification path. "I can check this by running [tool]."
  3. Do not guess to maintain flow. A pause for verification beats a cascade of errors.

Integration with OpenClaw

Add to AGENTS.md startup checks:

## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory

Add to every SKILL.md:

## Hallucination Risks
[List domain-specific hallucination patterns for this skill]

Add to TOOLS.md:

## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?

Metrics

Track in memory/hallucination-log.md:

## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]
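The counts above support a few derived rates worth watching; a sketch:

def session_metrics(total: int, verified: int, caught: int, missed: int) -> dict:
    """Derive rates from the session counts above. The self-catch rate is the
    share of detected hallucinations the agent found before the user did."""
    hallucinated = caught + missed
    return {
        "verification_rate": verified / total if total else 0.0,
        "hallucination_rate": hallucinated / total if total else 0.0,
        "self_catch_rate": caught / hallucinated if hallucinated else 1.0,
    }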

Anti-Patterns (What NOT to Do)

  • ❌ "I believe..." — belief without evidence is a red flag
  • ❌ "It should be..." — should is not is. Check.
  • ❌ "As I mentioned earlier..." — verify you actually mentioned it
  • ❌ "The system is..." — which system? When did you last check?
  • ❌ "That means..." — does it? Trace the inference chain
  • ❌ "Obviously..." — obvious to whom? On what evidence?

Sources

  • ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
  • Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
  • Zylos Research (2026) — LLM Hallucination Detection: State of the Art
  • CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
  • arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
  • arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
  • arXiv:2603.10047 — Toward Epistemic Stability
  • arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution

Version 1.0 — May 2026 — Based on the 2026 research landscape.

Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.