{"skill":{"slug":"incident-replay","displayName":"Incident Replay","summary":"Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and...","description":"---\r\nname: \"Incident Replay Agent Failure Forensics\"\r\ndescription: \"Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and how to prevent it.\"\r\nauthor: \"@TheShadowRose\"\r\nversion: \"1.0.5\"\r\ntags: [\"forensics\", \"debugging\", \"post-mortem\", \"failure-analysis\", \"incident\", \"recovery\"]\r\nlicense: \"MIT\"\r\n---\r\n\r\n# Incident Replay Agent Failure Forensics\r\n\r\n**Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes.**\r\n\r\nWhen your agent breaks, you need to know what happened, why, and how to prevent it next time. Incident Replay captures workspace state at points in time, detects when things go wrong, reconstructs the sequence of events, and classifies root causes with actionable remediation steps.\r\n\r\n---\r\n\r\n## The Problem\r\n\r\nYour agent crashed overnight. Files are missing. The config looks wrong. The logs are a wall of text. What happened? When? Why?\r\n\r\nWithout forensics tooling, post-mortem analysis is manual detective work: diffing files by hand, grepping logs, guessing at causation. Incident Replay automates the mechanics so you can focus on understanding.\r\n\r\n## What It Does\r\n\r\n### 1. 
**Capture** (`incident_capture.py`)\r\n- Take point-in-time snapshots of your workspace (files, sizes, hashes, content)\r\n- Configurable include/exclude patterns (track what matters, ignore noise)\r\n- Automatic snapshot pruning (keep last N)\r\n- Compare any two snapshots to see exactly what changed\r\n- Trigger detection — automatically flag incidents based on:\r\n  - Log patterns (tracebacks, errors, fatal messages)\r\n  - File changes (unexpected deletions, config modifications)\r\n  - Content patterns (secrets in output, constraint violations)\r\n  - Empty output files\r\n\r\n### 2. **Replay** (`incident_replay.py`)\r\n- Build chronological timelines from snapshots, file changes, and triggers\r\n- Extract decision chains from agent logs and memory files\r\n- Heuristic root cause classification:\r\n  - **Config error** — misconfiguration caused the failure\r\n  - **Data corruption** — input data was malformed or missing\r\n  - **Drift** — gradual workspace state degradation\r\n  - **External failure** — API/network/filesystem dependency failed\r\n  - **Logic error** — bug in agent logic or prompt\r\n  - **Resource exhaustion** — ran out of memory, disk, tokens, or time\r\n- Remediation suggestions tailored to each root cause category\r\n- Incident database with persistent storage and pattern tracking\r\n\r\n### 3. **Report** (`incident_report.py`)\r\n- Full incident reports with timeline, changes, triggers, and remediation\r\n- Summary reports across all incidents with severity and root cause breakdowns\r\n- Decision chain visualisation (what the agent decided and why)\r\n- Export markdown or JSON\r\n\r\n---\r\n\r\n## Quick Start\r\n\r\n```bash\r\n# 1. Configure\r\ncp config_example.json incident_config.json\r\n# Edit workspace root, triggers, log patterns\r\n\r\n# 2. Take a baseline snapshot\r\npython3 incident_capture.py --config incident_config.json --snapshot --label baseline\r\n\r\n# 3. ... agent does work, something breaks ...\r\n\r\n# 4. 
Take a post-incident snapshot\r\npython3 incident_capture.py --config incident_config.json --snapshot --label post-incident\r\n\r\n# 5. See what changed\r\npython3 incident_capture.py --config incident_config.json \\\r\n  --diff incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json\r\n\r\n# 6. Check triggers\r\npython3 incident_capture.py --config incident_config.json \\\r\n  --triggers incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json\r\n\r\n# 7. Full analysis — creates an incident with timeline, root cause, remediation\r\npython3 incident_replay.py --config incident_config.json \\\r\n  --analyze incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json \\\r\n  --title \"Agent crashed during deployment\"\r\n\r\n# 8. Generate incident report\r\npython3 incident_report.py --config incident_config.json --incident INC-0001\r\n\r\n# 9. View all incidents and patterns\r\npython3 incident_replay.py --config incident_config.json --incidents\r\npython3 incident_replay.py --config incident_config.json --patterns\r\npython3 incident_report.py --config incident_config.json --summary\r\n```\r\n\r\n## Programmatic Usage\r\n\r\n```python\r\nfrom incident_capture import Capturer, Snapshot, _load_config\r\nfrom incident_replay import Analyzer\r\n\r\ncfg = _load_config(\"incident_config.json\")\r\ncap = Capturer(cfg)\r\nanalyzer = Analyzer(cfg)\r\n\r\n# Take snapshots\r\nbefore = cap.take_snapshot(label=\"before\")\r\n# ... 
agent runs ...\r\nafter = cap.take_snapshot(label=\"after\")\r\n\r\n# Analyse\r\nchanges = cap.diff_snapshots(before, after)\r\ntriggers = cap.check_triggers(before, after)\r\ndecisions = analyzer.extract_decisions(after)\r\ntimeline = analyzer.build_timeline(\r\n    [before, after],\r\n    triggers=[t.to_dict() for t in triggers],\r\n    changes=changes,\r\n)\r\n\r\n# Create incident\r\nincident = analyzer.create_incident(\r\n    title=\"Agent failed during task X\",\r\n    timeline=timeline,\r\n    triggers=[t.to_dict() for t in triggers],\r\n    file_changes=changes,\r\n    decisions=decisions,\r\n)\r\nprint(f\"Created {incident.id}: {incident.root_cause}\")\r\n```\r\n\r\n---\r\n\r\n## Use Cases\r\n\r\n- **Overnight failure analysis:** Agent ran unattended and broke — what happened?\r\n- **Config change impact:** Track exactly what changed after a config update\r\n- **Drift detection:** Compare weekly snapshots to catch gradual degradation\r\n- **Secret leak detection:** Catch credentials or sensitive data in agent outputs\r\n- **Regression forensics:** Agent used to work, now it doesn't — find the divergence point\r\n- **Team incident management:** Track incidents over time, find recurring patterns\r\n\r\n## What's Included\r\n\r\n| File | Purpose |\r\n|------|---------|\r\n| `incident_capture.py` | State snapshot and change detection |\r\n| `incident_replay.py` | Timeline reconstruction, analysis, incident management |\r\n| `incident_report.py` | Report generation (markdown, JSON) |\r\n| `config_example.json` | Full configuration template |\r\n| `LIMITATIONS.md` | What this tool doesn't do |\r\n| `LICENSE` | MIT License |\r\n\r\n## Requirements\r\n\r\n- Python 3.8+\r\n- No external dependencies (stdlib only)\r\n- Works on any OS\r\n- Platform-agnostic (works with any file-based AI agent workspace)\r\n\r\n## Configuration\r\n\r\nSee `config_example.json` for the complete reference. 
Key areas:\r\n\r\n- **`WORKSPACE_ROOT`** — Directory to monitor\r\n- **`INCLUDE/EXCLUDE_PATTERNS`** — What files to capture\r\n- **`TRIGGERS`** — Conditions that flag incidents (log patterns, file changes, content scans)\r\n- **`ROOT_CAUSE_CATEGORIES`** — Classification categories with descriptions and remediation\r\n- **`DECISION_MARKERS`** — Regex patterns to extract agent decisions from logs\r\n- **`LOG_FILES`** — Which files to scan for decision chains\r\n\r\n---\r\n\r\n## License\r\n\r\nMIT — See `LICENSE` file.\r\n\r\n\r\n---\r\n\r\n\r\n## ⚠️ Security Note — Config File\r\n\r\nConfiguration is loaded from a JSON file. This is safe to share — no code execution.\r\n\r\n- Config path is validated for existence and size (1MB cap) before loading\r\n- Must be a `.json` file — raises `ValueError` if given a non-JSON path\r\n- Keep your config under version control; it defines what triggers are watched and what's protected\r\n\r\n## ⚠️ Disclaimer\r\n\r\nThis software is provided \"AS IS\", without warranty of any kind, express or implied.\r\n\r\n**USE AT YOUR OWN RISK.**\r\n\r\n- The author(s) are NOT liable for any damages, losses, or consequences arising from \r\n  the use or misuse of this software — including but not limited to financial loss, \r\n  data loss, security breaches, business interruption, or any indirect/consequential damages.\r\n- This software does NOT constitute financial, legal, trading, or professional advice.\r\n- Users are solely responsible for evaluating whether this software is suitable for \r\n  their use case, environment, and risk tolerance.\r\n- No guarantee is made regarding accuracy, reliability, completeness, or fitness \r\n  for any particular purpose.\r\n- The author(s) are not responsible for how third parties use, modify, or distribute \r\n  this software after purchase.\r\n\r\nBy downloading, installing, or using this software, you acknowledge that you have read \r\nthis disclaimer and agree to use 
the software entirely at your own risk.\r\n\r\n\r\n**DATA DISCLAIMER:** This software processes and stores data locally on your system. \r\nThe author(s) are not responsible for data loss, corruption, or unauthorized access \r\nresulting from software bugs, system failures, or user error. Always maintain \r\nindependent backups of important data. This software does not transmit data externally \r\nunless explicitly configured by the user.\r\n\r\n------\r\n\r\n## Support & Links\r\n\r\n| | |\r\n|---|---|\r\n| 🐛 **Bug Reports** | TheShadowyRose@proton.me |\r\n| ☕ **Ko-fi** | [ko-fi.com/theshadowrose](https://ko-fi.com/theshadowrose) |\r\n| 🛒 **Gumroad** | [shadowyrose.gumroad.com](https://shadowyrose.gumroad.com) |\r\n| 🐦 **Twitter** | [@TheShadowyRose](https://twitter.com/TheShadowyRose) |\r\n| 🐙 **GitHub** | [github.com/TheShadowRose](https://github.com/TheShadowRose) |\r\n| 🧠 **PromptBase** | [promptbase.com/profile/shadowrose](https://promptbase.com/profile/shadowrose) |\r\n\r\n*Built with [OpenClaw](https://github.com/openclaw/openclaw) — thank you for making this possible.*\r\n\r\n---\r\n\r\n🛠️ **Need something custom?** Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. 
→ [Hire me on Fiverr](https://www.fiverr.com/s/jjmlZ0v)\r\n","tags":{"latest":"1.0.6","debugging":"1.0.3","forensics":"1.0.3","incidents":"1.0.3","logs":"1.0.3","replay":"1.0.3"},"stats":{"comments":0,"downloads":667,"installsAllTime":25,"installsCurrent":1,"stars":0,"versions":4},"createdAt":1773077505596,"updatedAt":1778491792475},"latestVersion":{"version":"1.0.6","createdAt":1773146921791,"changelog":"- No file or documentation changes in this version.\n- Version number remains at 1.0.5 in the documentation, with no updates for 1.0.6 reflected.\n- No new features, fixes, or updates introduced.","license":"MIT-0"},"metadata":{"setup":[],"os":null,"systems":null},"owner":{"handle":"theshadowrose","userId":"s1736mx5m1zt9qzh6fvzvffnhh83hgf8","displayName":"Shadow Rose","image":"https://avatars.githubusercontent.com/u/262919821?v=4"},"moderation":null}