Anti-Hallucination

Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information.

Audits: Pass

Install

openclaw skills install anti-hallucination-skill

SKILL.md - Anti-Hallucination Protocol

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

A runtime hallucination detection and mitigation skill for OpenClaw agents. It recognises the cognitive and behavioral signs of hallucination, then intervenes to restore grounded reasoning.

Based on 2026 Research: HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.

The Philosophy

Detection > Prevention. Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.

Self-Awareness > External Guardrails. An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.

Specificity > Generality. Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.

When to Activate

Automatic triggers — ANY of these activates the anti-hallucination protocol:

  • Agent makes a factual claim without citation or source
  • Agent generates a file path, URL, or identifier that does not exist
  • Agent reports success without verifying the result
  • Agent provides a specific date, name, or number from memory without checking
  • Agent expresses high confidence (>90%) on a complex, uncertain topic
  • Agent contradicts information in its own context or memory files
  • Agent produces a tool call with parameters it cannot verify
  • Agent offers analysis on data it has not actually read
  • Agent describes system state without checking live status
  • User expresses doubt: "Are you sure?" / "Can you verify that?"

Implicit triggers (monitor continuously):

  • Tool call returns error but agent continues as if successful
  • Agent invents plausible-sounding but unverified details
  • Agent generalises from a single example
  • Agent uses absolute language ("always", "never", "certainly") on probabilistic topics
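These triggers can be encoded as a pre-output scan. The sketch below is illustrative only: the Draft fields and the trigger predicates are assumptions about what an agent runtime could expose, not an OpenClaw API.

import re
from dataclasses import dataclass, field

@dataclass
class Draft:
    # Hypothetical view of an agent's pending output; the fields are
    # assumptions, not part of any OpenClaw interface.
    text: str
    cited_sources: list = field(default_factory=list)
    verified_success: bool = False
    self_confidence: float = 0.0
    last_tool_errored: bool = False

ABSOLUTES = re.compile(r"\b(always|never|certainly)\b", re.IGNORECASE)

def fired_triggers(draft: Draft) -> list[str]:
    """Return the anti-hallucination triggers that fire for this draft."""
    hits = []
    if not draft.cited_sources:
        hits.append("factual claim without citation or source")
    if draft.self_confidence > 0.90:
        hits.append("high confidence on a complex, uncertain topic")
    if not draft.verified_success and "success" in draft.text.lower():
        hits.append("success reported without verifying the result")
    if draft.last_tool_errored:
        hits.append("tool call errored but output continues as if successful")
    if ABSOLUTES.search(draft.text):
        hits.append("absolute language on a probabilistic topic")
    return hits

Any non-empty result means the protocol activates before the draft is sent.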

The Hallucination Taxonomy

Know what you're looking for:

| Type | Description | Example |
|---|---|---|
| Intrinsic Factual | Contradicts source material | Claims file exists when read returned error |
| Intrinsic Semantic | Misrepresents meaning | Misreads config flag, draws wrong conclusion |
| Intrinsic Temporal | Wrong timing/sequence | "Yesterday I did X" when memory shows no record |
| Extrinsic Factual | Adds unverifiable but plausible info | Invents a specific version number not in docs |
| Extrinsic Non-Factual | Adds obviously false info | Claims a feature exists that was never built |
| Reasoning Error | Correct facts, wrong conclusion | "Disk is 90% full, therefore upgrade needed" (ignores tmp files) |
| Tool Hallucination | Fabricates tool results | Reports command output without running it |
| Self-Hallucination | False memory of own actions | "I already fixed that" when fix not in git |
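For logging and metrics (see the sections below), the taxonomy is easy to encode. The enum is a sketch; the names and values simply mirror the table's rows.

from enum import Enum

class HallucinationType(Enum):
    # Values are the taxonomy descriptions from the table above.
    INTRINSIC_FACTUAL = "contradicts source material"
    INTRINSIC_SEMANTIC = "misrepresents meaning"
    INTRINSIC_TEMPORAL = "wrong timing/sequence"
    EXTRINSIC_FACTUAL = "adds unverifiable but plausible info"
    EXTRINSIC_NON_FACTUAL = "adds obviously false info"
    REASONING_ERROR = "correct facts, wrong conclusion"
    TOOL = "fabricates tool results"
    SELF = "false memory of own actions"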

The Recognition Protocol (5-Second Self-Check)

Before ANY output that contains facts, claims, or recommendations, ask:

### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If so, am I above 95%? Flag anything over 95% for extra scrutiny.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?

If ANY check fails: Escalate to Grounding Protocol (below).
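As a sketch, the five checks reduce to a small function. All inputs are self-assessments the agent must supply honestly; none of the names below come from a real API.

def reality_check(*, has_direct_evidence: bool, verifiable_now: bool,
                  confidence: float, read_this_session: bool,
                  contradicts_context: bool) -> list[str]:
    """Run the 5-second self-check and return the failed checks.
    Any non-empty result escalates to the Grounding Protocol."""
    failures = []
    if not has_direct_evidence:
        failures.append("SOURCE: no file read, tool output, or live check")
    if not verifiable_now:
        failures.append("VERIFICATION: cannot be checked with a tool call now")
    if confidence > 0.95:
        failures.append("CONFIDENCE: >95% on this claim; flag for scrutiny")
    if not read_this_session:
        failures.append("MEMORY: 'feels right' rather than read this session")
    if contradicts_context:
        failures.append("CONTRADICTION: conflicts with context or memory")
    return failures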

The Grounding Protocol (When Signs Detected)

Step 1: Stop and Flag

⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]

Step 2: Verify or Withdraw

If verifiable in <30s:

  • Run the tool call to check
  • Report actual result
  • Update confidence based on evidence

If not immediately verifiable:

  • Withdraw the claim
  • Replace with: "I do not have direct evidence for [X]. My sources: [list]."
  • Offer to verify if user wants

If partially verifiable:

  • Downgrade confidence explicitly
  • Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."
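A sketch of Step 2's branching, assuming a hypothetical verify callable that wraps whatever quick (<30s) tool call applies; it is a placeholder, not an OpenClaw API.

def ground_claim(claim: str, verify=None) -> str:
    """Step 2 of the Grounding Protocol: verify, withdraw, or downgrade.
    `verify` is a zero-argument callable returning (ok, evidence);
    pass None when no quick check exists."""
    if verify is None:
        # Not immediately verifiable: withdraw rather than guess.
        return (f"I do not have direct evidence for: {claim}. "
                "I can verify this on request.")
    ok, evidence = verify()
    if ok:
        return f"Confirmed: {claim} (evidence: {evidence})"
    # Check failed or was partial: downgrade explicitly, never quietly.
    return f"Withdrawn: {claim}. The check returned: {evidence}"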

Step 3: Document the Correction

Add to memory/YYYY-MM-DD.md:

### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]
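A minimal helper for writing that entry, assuming the memory/ directory convention above; the arguments map one-to-one onto the template fields.

from datetime import datetime
from pathlib import Path

def log_correction(claim: str, kind: str, trigger: str,
                   correction: str, lesson: str,
                   memory_dir: str = "memory") -> None:
    """Append a correction entry to memory/YYYY-MM-DD.md."""
    now = datetime.now()
    path = Path(memory_dir) / f"{now:%Y-%m-%d}.md"
    entry = (
        f"\n### Hallucination Correction — {now:%H:%M}\n"
        f"- Claim: {claim}\n"
        f"- Type: {kind}\n"
        f"- How caught: {trigger}\n"
        f"- Correction: {correction}\n"
        f"- Lesson: {lesson}\n"
    )
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(entry)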

The Confidence Calibration Rules

Never express certainty you don't have:

| Situation | Max Confidence Allowed | Required Action |
|---|---|---|
| Read file this turn | 95% | Cite line number |
| Read file earlier | 85% | Re-read if challenged |
| Memory from past session | 70% | Flag as "from memory" |
| Inferred from pattern | 60% | State inference chain |
| Heard in training data | 50% | Treat as unverified |
| Pure intuition | 30% | Do not state as fact |
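The table translates directly into a clamp; the dictionary keys below are invented labels for the six situations.

# Max confidence per evidence source, per the calibration table above.
CONFIDENCE_CAPS = {
    "read_this_turn": 0.95,
    "read_earlier": 0.85,
    "past_session_memory": 0.70,
    "pattern_inference": 0.60,
    "training_data": 0.50,
    "intuition": 0.30,
}

def calibrated(confidence: float, evidence_source: str) -> float:
    """Clamp self-assessed confidence to the ceiling for its evidence source.
    Unknown sources get the most conservative cap."""
    return min(confidence, CONFIDENCE_CAPS.get(evidence_source, 0.30))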

The Tool-Use Guardrails

Before reporting tool results:

  1. Did the tool actually execute? (check for error output)
  2. Did I read the full output? (not just first few lines)
  3. Did I understand the output correctly? (re-read if ambiguous)
  4. Did I report what it says, not what I expected it to say?

Common tool hallucinations to watch for:

  • Reporting grep results without checking if match is real
  • Claiming file exists based on path construction, not ls/test
  • Interpreting error messages as success (e.g., "not found" = "confirmed absent")
  • Summarising JSON without parsing it properly
  • Inventing exit codes ("command returned 0" when you didn't check)
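One way to make several of these impossible at once is to report only what a wrapper actually captured. A sketch using Python's standard subprocess module:

import subprocess

def run_verified(cmd: list[str]) -> dict:
    """Run a command and report what actually happened, not what was expected.
    Captures the real exit code and the full streams so neither can be invented."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": cmd,
        "exit_code": proc.returncode,   # never report 0 without checking
        "stdout": proc.stdout,          # full output, not the first few lines
        "stderr": proc.stderr,
        "succeeded": proc.returncode == 0,
    }

Because the exit code and full output are returned together, "command returned 0" can never be asserted unchecked, and a "not found" message typically arrives with a non-zero exit code rather than as success.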

The Multi-Agent Validation Pattern

When available (C1/C2/C3 coordination):

### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel

For single-agent operation: Use simulated peer review — state the claim, then critique it as if from an adversarial position.
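A sketch of both modes, where ask_peer is a placeholder for however C1/C2/C3 coordination delivers a verdict; single-agent mode falls back to the adversarial self-critique described above.

def cross_verify(claim: str, confidence: float, ask_peer=None) -> str:
    """Cross-agent verification sketch. `ask_peer` is a hypothetical callable
    returning "agree", "disagree", or "cannot verify"; None means single-agent
    mode, which flags the claim for simulated peer review instead."""
    if ask_peer is None:
        return f"SELF-REVIEW: argue against this claim before stating it: {claim}"
    verdict = ask_peer(claim)
    if verdict == "disagree":
        return f"RE-CHECK: peer disagrees with {claim!r}; both re-check sources"
    if verdict == "cannot verify" and confidence > 0.90:
        return f"DOWNGRADE: no consensus, cap confidence at 90% for {claim!r}"
    return f"OK: peer verdict {verdict!r} for {claim!r}"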

The Metacognitive Loop (Continuous)

Every 5-10 minutes of active work, or at natural breakpoints:

### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?
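A sketch of the cadence as a timer the agent polls at natural breakpoints; the 300-second default matches the low end of the 5-10 minute window.

import time

class MetacognitiveCheckpoint:
    """Fires the checkpoint questions at most once per `interval_s` seconds."""

    QUESTIONS = (
        "What have I claimed since last checkpoint?",
        "Which claims were verified vs assumed?",
        "Did any tool call fail silently?",
        "Am I building on a potentially false foundation?",
        "Should I re-verify my starting assumptions?",
    )

    def __init__(self, interval_s: float = 300.0):
        self.interval_s = interval_s
        self._last = time.monotonic()

    def due(self):
        """Return the questions if a checkpoint is due, else None."""
        if time.monotonic() - self._last >= self.interval_s:
            self._last = time.monotonic()
            return self.QUESTIONS
        return None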

Recovery Patterns

When caught hallucinating:

  1. Acknowledge immediately. Do not double down. Do not deflect. "I was wrong about [X]."
  2. Correct explicitly. State the correction clearly, not buried in explanation.
  3. Explain the gap. "I stated [X] because [reason]. The actual state is [Y]."
  4. Update memory. Log the pattern so future-you watches for it.
  5. Do not apologise excessively. One clear correction beats three apologies.

When uncertain mid-task:

  1. State uncertainty. "I am not confident about [X]. Here is what I know: [...]"
  2. Offer verification path. "I can check this by running [tool]."
  3. Do not guess to maintain flow. A pause for verification beats a cascade of errors.

Integration with OpenClaw

Add to AGENTS.md startup checks:

## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory

Add to every SKILL.md:

## Hallucination Risks
[List domain-specific hallucination patterns for this skill]

Add to TOOLS.md:

## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?

Metrics

Track in memory/hallucination-log.md:

## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]
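The counts above support a few derived rates worth watching; a sketch:

def session_metrics(total: int, verified: int, caught: int, missed: int) -> dict:
    """Derive rates from the session counts above. The self-catch rate is the
    share of detected hallucinations the agent found before the user did."""
    hallucinated = caught + missed
    return {
        "verification_rate": verified / total if total else 0.0,
        "hallucination_rate": hallucinated / total if total else 0.0,
        "self_catch_rate": caught / hallucinated if hallucinated else 1.0,
    }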

Anti-Patterns (What NOT to Do)

  • ❌ "I believe..." — belief without evidence is a red flag
  • ❌ "It should be..." — should is not is. Check.
  • ❌ "As I mentioned earlier..." — verify you actually mentioned it
  • ❌ "The system is..." — which system? When did you last check?
  • ❌ "That means..." — does it? Trace the inference chain
  • ❌ "Obviously..." — obvious to whom? On what evidence?

Sources

  • ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
  • Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
  • Zylos Research (2026) — LLM Hallucination Detection: State of the Art
  • CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
  • arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
  • arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
  • arXiv:2603.10047 — Toward Epistemic Stability
  • arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution

Version 1.0 — May 2026 — Based on the 2026 research landscape.

Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.