Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

S³ Incident Runbook Templates

v1.0.0

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to...

by Solomon Neas (@solomonneas)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for solomonneas/s3-incident-runbooks.

Install the skill "S³ Incident Runbook Templates" (solomonneas/s3-incident-runbooks) from ClawHub.
Skill page: https://clawhub.ai/solomonneas/s3-incident-runbooks
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install s3-incident-runbooks

ClawHub CLI


npx clawhub@latest install s3-incident-runbooks
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
Name and description match the contents: templates and step-by-step operational runbooks for incident response. The commands and sections (kubectl, psql, curl, rollout undo, scaling, network policy) are plausible and expected in an on-call/infrastructure runbook. However, some items (internal endpoints, Sentry/Grafana links, feature-flag APIs) are placeholder/organization-specific and the SKILL.md assumes access to internal infrastructure without declaring those needs.
Instruction Scope (flagged)
The instructions include concrete, executable commands (kubectl, psql, curl, kubectl apply, pg_terminate_backend, rollout undo, scaling) that would perform destructive or high-privilege actions if executed. They reference environment variables ($DB_HOST, $DB_USER) and internal endpoints (api.company.com, prometheus, grafana, sentry) that are not declared in requires.env. The skill also suggests applying network policies and terminating DB backends — actions beyond mere read-only diagnostics. Because SKILL.md could be used to drive an agent to run these commands, the lack of explicit guardrails (explicitly requiring human confirmation or declaring that commands are examples only) is a concern.
Install Mechanism
Instruction-only skill (no install spec, no code files). This minimizes disk/installation risk because nothing is written or downloaded by the skill itself.
Credentials (flagged)
The skill declares no required environment variables or credentials but the runbooks reference sensitive values and services (DB_HOST, DB_USER, internal API endpoints, PagerDuty/Slack/Pager integrations, psql auth). That mismatch means the instructions assume access to secrets and internal systems without declaring or justifying them. Users should not supply full DB or cloud credentials to this skill without strict controls.
Persistence & Privilege
The skill is not always-enabled and doesn't request persistent privileges or modify other skills. However, it instructs high-privilege operational steps; combined with the platform default that the agent can invoke the skill autonomously, this increases the blast radius if the agent is permitted to execute commands. There are no special install-time persistence concerns.
Scan Findings in Context
[no_scan_findings] expected: The regex-based scanner found nothing — expected because this is an instruction-only SKILL.md with no code files. Absence of findings does not mean the instructions are safe; the file clearly references sensitive commands and environment variables.
What to consider before installing
This skill is coherent for building incident runbooks, but treat it as a recipe for human operators rather than something to run automatically. Before installing or letting an agent execute these instructions:

1. Do not provide DB or cloud credentials to the skill; supply examples or redact secrets.
2. Require explicit human confirmation for any destructive command (rollbacks, pg_terminate_backend, kubectl apply/scale).
3. Limit the agent's execution environment and Kubernetes/DB permissions (use least privilege, test in staging).
4. Verify and replace placeholder internal endpoints (api.company.com, prometheus, grafana, sentry) with your real URLs or remove them.
5. Have on-call/infrastructure owners review and approve the runbook steps and any referenced scripts (resources/implementation-playbook.md appears referenced but not included).

If you need the agent to perform actions, consider adding stricter guards (declared required env vars, explicit confirmation prompts, and scoped short-lived credentials).
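The guards recommended here (declared required env vars and explicit confirmation before destructive commands) could look roughly like the sketch below. The helper names `require_env` and `confirm_run` are hypothetical and are not part of the skill; this is one possible shape, not a vetted safety mechanism.

```bash
# Hypothetical helpers; the names are illustrative and not part of the skill.

# Fail early if any named environment variable is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env var: $name" >&2
      return 1
    fi
  done
}

# Show the command, then run it only if the operator types "yes".
confirm_run() {
  printf 'About to run: %s\n' "$*" >&2
  printf 'Type "yes" to continue: ' >&2
  read -r answer || answer=""
  if [ "$answer" = "yes" ]; then
    "$@"
  else
    echo "skipped: $*" >&2
    return 1
  fi
}
```

A destructive runbook step would then be wrapped, for example `require_env DB_HOST DB_USER && confirm_run kubectl rollout undo deployment/payment-service -n payments`, so nothing runs without a typed confirmation.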

Like a lobster shell, security has layers — review code before you run it.

Tags: incident-response, latest, playbooks, runbooks, soc
193 downloads
0 stars
1 version
Updated 15h ago
v1.0.0
MIT-0

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

Do not use this skill when

  • The task is unrelated to incident runbook templates
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
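For scripts that page or annotate incidents, the severity levels can be encoded as a small lookup. The function name `response_time_for` is hypothetical; the values are copied from the severity levels listed here.

```bash
# Hypothetical helper: map a severity label to its target response time.
response_time_for() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Returning a non-zero status for unknown labels lets a calling script stop rather than silently treat a typo as low severity.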

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
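As a starting point for a new service, the nine sections above can be emitted as an empty markdown skeleton. This is a minimal sketch; `new_runbook` is a hypothetical name and the output layout is an assumption modeled on the templates in this skill.

```bash
# Hypothetical scaffold: print an empty runbook with the nine sections listed above.
new_runbook() {
  printf '# %s Runbook\n' "$1"
  for section in \
    "Overview & Impact" "Detection & Alerts" "Initial Triage" \
    "Mitigation Steps" "Root Cause Investigation" "Resolution Procedures" \
    "Verification & Rollback" "Communication Templates" "Escalation Matrix"
  do
    printf '\n## %s\n' "$section"
  done
}
```

For example, `new_runbook "Payment Service" > payment-service-runbook.md` writes the skeleton to a file that the owning team can then fill in.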

Runbook Templates

Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# -G with --data-urlencode keeps curl from globbing the {} and [] in the query
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias cannot be referenced in WHERE, so the expression is repeated there
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Replace <PID> with a pid returned by Step 2
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<PID>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
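Step 4 reads its output format from a `curl-format.txt` file that this skill does not include. A minimal version might look like the sketch below; the choice of fields is an assumption, though each name is a standard curl `--write-out` variable.

```bash
# Assumed contents for curl-format.txt; the skill itself does not ship this file.
# Each %{...} name is a standard curl --write-out timing variable.
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_appconnect:    %{time_appconnect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
```

A large gap between `time_connect` and `time_starttransfer` suggests the remote service is slow to respond, while a high `time_namelookup` or `time_connect` suggests a DNS or network problem instead.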

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
# (curl -d sends form-encoded data by default, so the JSON content type is set explicitly)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current pod load (kubectl top reports CPU and memory, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# -G with --data-urlencode keeps curl from globbing the {} and [] in the query
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
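To turn "latency is acceptable" into a pass/fail check, the value returned by Prometheus can be compared against a limit. The sketch below assumes a hypothetical helper named `below_threshold`; the 2-second limit in the usage note mirrors the `payment_latency_p99 > 2s` alert and is otherwise an assumption.

```bash
# Hypothetical helper: succeed only if a numeric reading is below a limit.
# awk does the floating-point comparison that the shell's [ ] cannot.
below_threshold() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 < limit + 0) }'
}
```

With the p99 value extracted by `jq -r` into a variable, `below_threshold "$p99" 2 && echo "latency OK"` reports success only when the reading is under two seconds.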

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
# (curl -d sends form-encoded data by default, so the JSON content type is set explicitly)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
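The two numeric conditions in the escalation matrix (a SEV1 unresolved for more than 15 minutes, and financial impact above $10k) can be checked mechanically. This is a minimal sketch with a hypothetical function name, `escalation_targets`. The other two conditions need human judgment and are deliberately left out.

```bash
# Hypothetical helper: print the escalation targets whose numeric conditions are met.
# Arguments: severity, minutes unresolved, estimated financial impact in whole dollars.
escalation_targets() {
  severity="$1"; minutes="$2"; impact="$3"
  if [ "$severity" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager"
  fi
  if [ "$impact" -gt 10000 ]; then
    echo "Finance + Legal"
  fi
}
```

For instance, `escalation_targets SEV1 20 0` prints only the Engineering Manager line, because the 15-minute limit has passed but the financial threshold has not been reached.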

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
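During an incident it can help to generate the first message from arguments instead of editing a template by hand. The sketch below fills in the header fields of the internal notification. The function name `initial_notification` and its argument order are assumptions.

```bash
# Hypothetical helper: fill in the header of the internal notification template.
# Arguments: title, severity, impact, start time, incident commander.
initial_notification() {
  printf 'INCIDENT: %s\n\n' "$1"
  printf 'Severity: %s\nStatus: Investigating\nImpact: %s\n' "$2" "$3"
  printf 'Start Time: %s\nIncident Commander: %s\n' "$4" "$5"
}
```

A call such as `initial_notification "Payment Service Degradation" SEV2 "~20% of payment requests failing" "$(date -u +%H:%M)" "$USER"` produces text ready to paste into the incident channel.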

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change is when the session went idle; query_start is when its last query began)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# VACUUM FULL rewrites the table: it takes an ACCESS EXCLUSIVE lock
# and needs enough free disk space for the new copy
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
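To check disk usage from a script rather than by reading `df -h` output, the usage percentage can be pulled out of the portable `df -P` format. This is a minimal sketch; `disk_usage_percent` is a hypothetical name, and the 90% limit in the usage note is an assumption to tune per environment.

```bash
# Hypothetical helper: print the usage percentage of the filesystem holding a path.
# df -P forces the single-line POSIX format, where column 5 is the capacity.
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A check such as `[ "$(disk_usage_percent /var/lib/postgresql/data)" -ge 90 ] && echo "disk space critical"` can then feed an alert or gate the cleanup steps.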

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
