Install
openclaw skills install multi-site-health-monitor
Monitor dozens of websites with configurable health checks, auto-restart alerts, and intelligent alert routing. Use when the user needs uptime tracking, perf...
The Multi-Site Health Monitor skill automates continuous monitoring of 10-100+ websites with configurable health checks, intelligent alert routing, and automatic incident escalation. This production-grade monitoring solution integrates with Slack, PagerDuty, Datadog, Google Sheets, and WordPress to provide real-time visibility into your digital infrastructure.
Try these example prompts immediately:
Monitor these sites every 5 minutes and alert Slack if any fail:
- https://api.example.com/health
- https://app.example.com/status
- https://cdn.example.com/ping
- https://wordpress.example.com/wp-json/health
- https://db.example.com/check
Alert rules:
- Critical (page down): Slack #incidents + PagerDuty
- Warning (slow >3s): Slack #alerts
- Info (cert expires <30d): Google Sheets log
Monitor https://payment-service.example.com/health every 2 minutes.
If it fails 3 times in a row:
1. POST to https://restart-api.example.com/restart-payment-service
2. Alert PagerDuty with incident "Payment Service Down"
3. Notify Slack #critical-incidents
4. Log to Google Sheets with timestamp, error details, restart status
Response timeout: 10 seconds
Expected response: HTTP 200 with {"status":"healthy"}
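The "fails 3 times in a row" rule in the prompt above can be illustrated with a small counter. This is a minimal sketch, not the skill's actual implementation: the `FailureTracker` class and its method names are hypothetical, and it assumes that any healthy check resets the streak.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Counts consecutive failed checks (hypothetical helper, not part of the skill)."""

    threshold: int = 3  # "fails 3 times in a row" from the prompt above
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one check; return True only when the streak first reaches the threshold."""
        if healthy:
            self.consecutive_failures = 0  # assumption: any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold


tracker = FailureTracker()
results = [False, False, True, False, False, False, False]
triggered = [tracker.record(r) for r in results]
```

Returning True only on the check that reaches the threshold means the restart and the alerts fire once per outage rather than on every later failed check.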
Monitor these WordPress sites for health + security:
- https://site1.example.com/wp-json/wp/v2/health-check
- https://site2.example.com/wp-json/wp/v2/health-check
- https://site3.example.com/wp-json/wp/v2/health-check
Check for:
- Core updates available (warning if >1 week old)
- Plugin vulnerabilities (critical if any)
- Database connectivity (critical if down)
- SSL certificate expiry (warning if <30 days)
Alert destinations:
- Critical: PagerDuty + Slack #wordpress-critical
- Warning: Slack #wordpress-alerts
- Info: Google Sheets #monitoring-log
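The SSL expiry rule (warning if fewer than 30 days remain) comes down to date arithmetic on the certificate's `notAfter` field. The sketch below uses `ssl.cert_time_to_seconds` from the Python standard library. The function names are hypothetical, and treating an already expired certificate as critical is an assumption that the prompt does not state.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def expiry_severity(days_left: float, warn_days: int = 30) -> str:
    """Map remaining days to a severity label (labels are illustrative)."""
    if days_left < 0:
        return "critical"  # assumption: an expired certificate is critical
    if days_left < warn_days:
        return "warning"  # "warning if <30 days" from the prompt above
    return "ok"
```

In a live check, the `notAfter` value would come from the dictionary returned by `getpeercert()` on a TLS connection to the monitored host.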
Monitor https://api.example.com/metrics every 10 minutes.
Alert if:
- Response time > 2000ms (warning) or > 5000ms (critical)
- Error rate > 1% (warning) or > 5% (critical)
- CPU usage > 70% (warning) or > 90% (critical)
- Memory usage > 80% (warning) or > 95% (critical)
Send metrics to Datadog with tags: env:prod, service:api, team:backend
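The two-level thresholds in the prompt above amount to a lookup table plus a comparison. The following sketch shows one way to classify a sample. The metric key names and the `classify` and `worst_severity` helpers are hypothetical, and it assumes a value must be strictly greater than a threshold to trip it.

```python
from typing import Dict, Tuple

# (warning, critical) thresholds copied from the prompt above; key names are assumed.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "response_time_ms": (2000, 5000),
    "error_rate_percent": (1, 5),
    "cpu_percent": (70, 90),
    "memory_percent": (80, 95),
}

SEVERITY_ORDER = ["ok", "warning", "critical"]


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for one metric value."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def worst_severity(sample: Dict[str, float]) -> str:
    """Return the most severe classification across all metrics in a sample."""
    return max(
        (classify(metric, value) for metric, value in sample.items()),
        key=SEVERITY_ORDER.index,
        default="ok",
    )
```

Reporting the worst severity across a sample lets a single alert cover a check in which several metrics degrade at once.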
Monitor endpoints via:
Example: Monitor API health with custom authentication
Endpoint: https://api.example.com/health
Method: POST
Headers:
  Authorization: Bearer YOUR_API_KEY
  User-Agent: MultiSiteMonitor/1.0.0
Expected Status: 200
Expected Body: {"status":"healthy","version":"2.1.0"}
Timeout: 10 seconds
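To make the example above concrete, here is a rough sketch of how such a check could be built and evaluated with Python's standard `urllib.request` module. The `build_request` and `is_healthy` helpers are hypothetical, and comparing the response body by exact JSON equality is an assumption; the skill itself may match bodies differently.

```python
import json
import urllib.request


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the authenticated POST request described in the example above."""
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "User-Agent": "MultiSiteMonitor/1.0.0",
        },
    )


def is_healthy(status: int, body: str, expected_status: int, expected_body: dict) -> bool:
    """Check the status code, then compare the parsed JSON body (assumed exact match)."""
    if status != expected_status:
        return False
    try:
        return json.loads(body) == expected_body
    except json.JSONDecodeError:
        return False  # a non-JSON body cannot match the expected object
```

The request could then be sent with `urllib.request.urlopen(request, timeout=10)`, which applies the 10 second timeout from the example.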
# Slack notifications
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
export SLACK_CHANNEL="#incidents" # or #alerts, #monitoring, etc.
# PagerDuty incident creation
export PAGERDUTY_API_KEY="YOUR_PAGERDUTY_API_KEY"
export PAGERDUTY_SERVICE_ID="YOUR_SERVICE_ID"
# Datadog metrics ingestion
export DATADOG_API_KEY="YOUR_DATADOG_API_KEY"
export DATADOG_APP_KEY="YOUR_DATADOG_APP_KEY"
export DATADOG_SITE="datadoghq.com" # or datadoghq.eu
# Google Sheets logging
export GOOGLE_SHEETS_ID="YOUR_SPREADSHEET_ID"
export GOOGLE_SERVICE_ACCOUNT_JSON="/path/to/service-account.json"
# AWS auto-restart (optional)
export AWS_ACCESS_KEY_ID="YOUR_AWS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET"
export AWS_REGION="us-east-1"
# SSH for remote service restart (optional)
export SSH_PRIVATE_KEY="/path/to/private/key"
export SSH_USER="deploy"
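Before starting any monitors, it can help to confirm that the variables above are actually set. The sketch below is one possible pre-flight check. The `missing_variables` helper is hypothetical, and the grouping of variables per integration is an assumption inferred from the comments in the listing above.

```python
import os
from typing import Dict, Iterable, List, Mapping

# Assumed grouping, inferred from the comments in the export listing above.
REQUIRED_BY_INTEGRATION: Dict[str, List[str]] = {
    "slack": ["SLACK_WEBHOOK_URL"],
    "pagerduty": ["PAGERDUTY_API_KEY", "PAGERDUTY_SERVICE_ID"],
    "datadog": ["DATADOG_API_KEY", "DATADOG_APP_KEY"],
    "google_sheets": ["GOOGLE_SHEETS_ID", "GOOGLE_SERVICE_ACCOUNT_JSON"],
}


def missing_variables(
    integrations: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [
        name
        for integration in integrations
        for name in REQUIRED_BY_INTEGRATION[integration]
        if not environ.get(name)
    ]
```

Calling `missing_variables(["slack", "pagerduty"])` checks only the integrations that are in use, so the optional AWS and SSH settings do not produce false alarms.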
# monitors.yaml
monitors:
  - name: "Production API"
    url: "https://api.example.com/health"
    interval: 300 # seconds
    timeout: 10
    method: "GET"
    expected_status: 200
    expected_body_contains: "healthy"
    alert_rules:
      critical:
        - slack_channel: "#critical-incidents"
        - pagerduty_severity: "critical"
      warning:
        - slack_channel: "#alerts"
    auto_restart:
      enabled: true
      command: "systemctl restart api-service"
      max_retries: 3
      retry_delay: 60
  - name: "WordPress Site"
    url: "https://wordpress.example.com/wp-json/wp/v2/health-check"
    interval: 600
    timeout: 15
    method: "GET"
    headers:
      Authorization: "Bearer YOUR_WP_TOKEN"
    checks:
      - type: "wordpress_core_updates"
        alert_if: "available"
        severity: "warning"
      - type: "plugin_vulnerabilities"
        alert_if: "found"
        severity: "critical"
      - type: "ssl_certificate"
        expires_in_days: 30
        severity: "warning"
    alert_rules:
      critical:
        - pagerduty_severity: "critical"
        - slack_channel: "#wordpress-critical"
      warning:
        - slack_channel: "#wordpress-alerts"
        - google_sheets: true
  - name: "Database Health"
    url: "https://db-monitor.example.com/health"
    interval: 120
    timeout: 20
    method: "POST"
    headers:
      Content-Type: "application/json"
      Authorization: "Bearer YOUR_DB_TOKEN"
    body: '{"check":"full"}'
    expected_status: 200
    performance_thresholds:
      response_time_ms: 2000
      error_rate_percent: 1.0
    alert_rules:
      critical:
        - pagerduty_severity: "critical"
        - datadog_metric: "db.health.critical"
      warning:
        - datadog_metric: "db.health.warning"
        - slack_channel: "#database-alerts"
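The `alert_rules` sections above map a severity to a list of destinations. As an illustration of how that routing might be resolved, the sketch below assumes the file has already been parsed into dictionaries shaped like the entries above. The `route` helper is hypothetical and is not part of the skill.

```python
from typing import Any, Dict, List, Tuple


def route(monitor: Dict[str, Any], severity: str) -> List[Tuple[str, Any]]:
    """Return (destination, value) pairs configured for a severity, in file order."""
    rules = monitor.get("alert_rules", {}).get(severity, [])
    return [(key, value) for rule in rules for key, value in rule.items()]


# The "Production API" entry from monitors.yaml, reduced to the fields used here.
production_api = {
    "name": "Production API",
    "alert_rules": {
        "critical": [
            {"slack_channel": "#critical-incidents"},
            {"pagerduty_severity": "critical"},
        ],
        "warning": [{"slack_channel": "#alerts"}],
    },
}

critical_targets = route(production_api, "critical")
```

A severity with no configured rules yields an empty list, so a caller can skip alerting for it without special handling.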
- monitors.yaml
- .env file with all required API keys
- multi-site-health-monitor --validate to verify all URLs respond
🚨 CRITICAL: Production API Down
Service: Production API (api.example.com/health)
Status: HTTP 500 Internal Server Error
Response Time: 12.3s (timeout threshold: 10s)
Last Healthy: 2024-01-15 14:32:15 UTC
Incident Duration: 5 minutes 23 seconds
Alert Count: 3 consecutive failures
Auto-Restart Status: ✅ Triggered (attempt 1/3)
PagerDuty Incident: INC-12345 (assigned to @oncall-backend)
Next Check: 2024-01-15 14:42:15 UTC
Escalation: Will escalate to #management if unresolved in 25 minutes
Timestamp | Service | Status | Response Time | Error | Action Taken
2024-01-15 14:37 | Production API | FAIL | 12300ms | HTTP 500 | Auto-restart triggered
2024-01-15 14:32 | Production API | FAIL | 10100ms | Timeout | Alert sent
2024-01-15 14:27 | Production API | OK | 245ms | - | -
2024-01-15 14:22 | WordPress Site | WARN | 3200ms | Slow | Alert sent
2024-01-15 14:17 | Database Health | OK | 145ms | - | -
Incident: INC-12345
Service: Production API
Severity: Critical
Status: Triggered
Title: Production API Down - HTTP 500 (5min+ duration)
Description:
Endpoint: https://api.example.com/health
Status: HTTP 500 Internal Server Error
Response Time: 12.3s
Consecutive Failures: 3
Last Healthy: 2024-01-15 14:32:15 UTC
Auto-Restart: Triggered (attempt 1/3)
Assigned To: @oncall-backend
Created: 2024-01-15 14:37:00 UTC
Escalation Policy: Backend On-Call → Manager → VP Engineering
multi_site_monitor.health_check.response_time:245ms (tags: service:api, env:prod)
multi_site_monitor.health_check.status:200 (tags: service:api, env:prod)
multi_site_monitor.health_check.availability:99.87 (tags: service:api, env:prod)
multi_site_monitor.auto_restart.attempts:1 (tags: service:api, env:prod)
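The availability figure in the metrics above can be read as the share of checks that succeeded. The sketch below shows that calculation and a formatter that reproduces the line layout used above. Both helpers are hypothetical, and the availability formula is an assumption rather than the skill's documented definition.

```python
from typing import Dict, Sequence


def availability(results: Sequence[bool]) -> float:
    """Percentage of healthy checks, rounded to two decimals (assumed definition)."""
    if not results:
        raise ValueError("availability needs at least one check result")
    return round(100 * sum(results) / len(results), 2)


def format_metric(name: str, value: object, tags: Dict[str, str]) -> str:
    """Render a metric in the same layout as the sample lines above."""
    tag_text = ", ".join(f"{key}:{val}" for key, val in tags.items())
    return f"{name}:{value} (tags: {tag_text})"


line = format_metric(
    "multi_site_monitor.health_check.availability",
    availability([True, True, True, False]),
    {"service": "api", "env": "prod"},
)
```

The formatter only mirrors the display format shown here. Sending metrics to Datadog would go through its API or agent, which this sketch does not cover.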
/wp-json/custom/health returning comprehensive data