{"skill":{"slug":"hallucination-guard","displayName":"Hallucination Guard — 4-Layer AI Fabrication Defense","summary":"Detect and prevent AI agent hallucinations during task execution. Use when: (1) an agent claims to have created files, commits, or artifacts — verify them, (...","description":"---\nname: hallucination-guard\ndescription: \"Detect and prevent AI agent hallucinations during task execution. Use when: (1) an agent claims to have created files, commits, or artifacts — verify them, (2) an agent produces data reports or numbers — audit against source, (3) running long multi-step tasks where fabrication risk is high, (4) you need cross-model verification of critical outputs. Provides 4-layer defense: L0 context hygiene, L1 claim-evidence protocol, L2 cross-model audit, L3 drift detection. NOT for: simple Q&A, opinion-based tasks, or conversations where factual accuracy is not critical.\"\n---\n\n# Hallucination Guard\n\n4-layer defense against agent fabrication. Each layer is independent — use one or combine.\n\n## When Hallucinations Happen\n\nHighest risk conditions (apply more layers when these are present):\n- Extended sessions (>50 turns or >30min continuous work)\n- Tasks involving file creation, code, git, or data analysis\n- Agent reporting quantitative results (numbers, metrics, PnL)\n- Multiple sequential \"successes\" with no errors or retries\n\n## Layer 0: Context Hygiene (Prevention)\n\nReduce hallucination probability before it starts.\n\n**For long tasks (>10 steps):**\n1. Break into segments of ≤8 steps each\n2. Between segments: flush working state to a file, reload from file (not from in-context memory)\n3. 
Each segment starts with `read` of the state file; never trust carried-over context for facts\n\n**For data-intensive tasks:**\n- Load source data from files at point of use, not from earlier context\n- If a number was mentioned 20+ turns ago, re-read the source before citing it\n\n**Cost: Zero.** This is a workflow discipline, not an API call.\n\n## Layer 1: Claim-Evidence Protocol (Detection)\n\nEvery agent claim of a physical action (file created, commit made, test run) must include tool-verified evidence.\n\n### The Rule\n\n```\nCLAIM:    \"I created/modified/committed X\"\nEVIDENCE: Tool output proving X exists and matches the claim\nSTATUS:   VERIFIED (evidence confirms) or UNVERIFIED (no evidence yet)\n```\n\n### Verification Commands by Claim Type\n\n| Claim | Verify With |\n|-------|-------------|\n| Created file | `ls -la {path} && head -20 {path}` |\n| Modified file | `grep -n '{expected_content}' {path}` |\n| Git commit | `git log --oneline -3` |\n| Git push | `git log --oneline origin/{branch} -3` |\n| Ran tests | Show actual test output (pass AND fail counts) |\n| API response | Show raw response body |\n| Data analysis | Show `wc -l` of source + sample rows |\n\n### Red Flags (claim likely fabricated)\n\n- Claim references a file but no `read`/`exec` tool was called\n- Suspiciously specific figures (187 trades, +$126.50) cited without a source\n- \"All tests passed\" with no test output shown\n- Multiple consecutive successes with zero errors\n\n**Cost: ~50 tokens per claim.** One `exec` call per physical claim.\n\n## Layer 2: Cross-Model Audit (Verification)\n\nSpawn a second agent (different model) to independently verify claims.\n\n### When to Use\n\n- Critical outputs: financial reports, deployment decisions, data analysis\n- When L1 evidence exists but numbers need independent validation\n- After any task where the agent reported unusually perfect results\n\n### How to Run\n\nSee [references/audit-prompt.md](references/audit-prompt.md) for the spawn template.\n\nKey principles:\n1. 
Auditor receives ONLY the evidence (files, outputs) — not the original agent's conclusions\n2. Auditor independently extracts facts from evidence and compares to claims\n3. Auditor uses the cheapest model that can do the verification (flash for file checks, sonnet for logic)\n\n**Cost: 1 subagent spawn.** Use flash/gemini for simple checks (~$0.001). Reserve sonnet/opus for complex logic verification.\n\n## Layer 3: Drift Detection (Monitoring)\n\nMonitor long-running agent tasks for hallucination patterns.\n\n### When to Use\n\n- Tasks expected to take >15 minutes\n- Agent is working autonomously (coding agent, research agent)\n- High-stakes tasks where undetected fabrication causes real damage\n\n### Setup\n\nSee [references/drift-monitor.md](references/drift-monitor.md) for implementation.\n\nCore signals:\n- **Claim/Tool Ratio**: If claims > 3× tool calls → alert\n- **Zero-Error Streak**: 8+ consecutive \"successes\" with 0 errors → suspicious\n- **Phantom References**: Agent references files/branches never created → critical alert\n\n**Cost: Periodic check via `sessions_history`.** No extra model calls unless alert triggers.\n\n## Choosing Layers\n\n| Scenario | Recommended |\n|----------|-------------|\n| Quick file creation | L1 only |\n| Data report from CSV | L0 + L1 |\n| Multi-step coding task | L0 + L1 + L2 |\n| Autonomous long-running agent | All four layers |\n| Routine conversation | None needed |\n\n## Integration with Other Skills\n\n- **War Room**: Add L1 verification to each agent's output (verify cited data)\n- **Coding agents**: Wrap with L3 drift monitor for long sessions\n- **Any task with `sessions_spawn`**: Add L2 audit as a final verification step\n\n## References\n\n- [references/audit-prompt.md](references/audit-prompt.md) — Cross-model audit spawn template\n- [references/drift-monitor.md](references/drift-monitor.md) — Drift detection implementation\n- [references/taxonomy.md](references/taxonomy.md) — Hallucination types with real-world 
examples\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":210,"installsAllTime":8,"installsCurrent":3,"stars":0,"versions":1},"createdAt":1772559131513,"updatedAt":1778491707669},"latestVersion":{"version":"1.0.0","createdAt":1772559131513,"changelog":"Initial release: 4-layer defense against AI agent hallucinations. L0: Context hygiene (prevention, zero cost). L1: Claim-evidence protocol (50 tokens per claim). L2: Cross-model audit with spawn templates. L3: Drift detection for long-running agents. Includes hallucination taxonomy with 4 real-world fabrication patterns. Cost-conscious: defaults to cheapest model, layers are independent.","license":null},"metadata":null,"owner":{"handle":"scytheshan-pixel","userId":"s172xdy3mpvawwwc1w7n539e998843cm","displayName":"scytheshan-pixel","image":"https://avatars.githubusercontent.com/u/248577078?v=4"},"moderation":null}