Install

```
openclaw skills install datadog-dashboard-builder
```
Design production-grade Datadog dashboards, monitors, and SLOs from scratch or audit existing ones. Recommends the right metrics, widget types, alert thresholds, and layout patterns so on-call engineers can diagnose incidents in under 60 seconds.
Use when: "build a Datadog dashboard", "set up monitoring for this service", "review our alerting", "we need SLOs", "our dashboard is too noisy", "what should we monitor", or when designing observability for a new service.
Before building anything, classify the service:
Service Name: ____
Type: [API | Worker | Queue Consumer | Batch Job | Frontend | Database | Cache]
Traffic Pattern: [Steady | Diurnal | Spiky | Event-Driven | Cron-Based]
Criticality: [Tier 1 (revenue) | Tier 2 (core feature) | Tier 3 (internal) | Tier 4 (best-effort)]
Dependencies: [list upstream and downstream services]
Current Pain Points: [what incidents happened, what was hard to debug]
Every service dashboard starts with the four golden signals (Google SRE book):
| Signal | What to Measure | Datadog Metric Pattern |
|---|---|---|
| Latency | Request duration (p50, p95, p99) | trace.{service}.request.duration or {service}.request.latency |
| Traffic | Requests per second | trace.{service}.request.hits or {service}.request.count |
| Errors | Error rate as percentage | trace.{service}.request.errors / trace.{service}.request.hits * 100 |
| Saturation | Resource utilization (CPU, memory, connections, queue depth) | system.cpu.user, system.mem.used, {service}.pool.active |
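To make the table concrete, here is a minimal sketch of the four signals expressed as Datadog queries for a hypothetical `checkout-api` service; the metric names follow the patterns above and should be swapped for whatever your APM integration actually emits.

```python
# Hypothetical golden-signal queries for a service named "checkout-api".
# Metric names follow the trace.{service}.* patterns above; substitute the
# names your tracer actually reports.
GOLDEN_SIGNALS = {
    "latency_p99": "avg:trace.checkout-api.request.duration.by.service.99p{env:production}",
    "traffic_rps": "sum:trace.checkout-api.request.hits{env:production}.as_rate()",
    "error_rate_pct": (
        "sum:trace.checkout-api.request.errors{env:production}.as_count() / "
        "sum:trace.checkout-api.request.hits{env:production}.as_count() * 100"
    ),
    "saturation_cpu": "avg:system.cpu.user{service:checkout-api} by {host}",
}
```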
For each service type, add specific metrics:
API Services:
- Endpoint-level latency breakdown (which endpoint is slow?)
- HTTP status code distribution (2xx, 4xx, 5xx)
- Request payload size (are large payloads causing timeouts?)
- Rate limiting triggers
- Authentication failures
Queue/Worker: Queue depth, processing rate, consumer lag, dead letter queue size, job duration by type, retry count.
Database: Query duration by operation, connection pool utilization, lock wait time, replication lag, cache hit ratio, slow query count.
Frontend/SPA: Core Web Vitals (LCP, FID, CLS) via RUM, JS error rate by page, client-side API latency, page load time, session crash rate.
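A few of the queue/worker and database signals above, expressed as illustrative queries; the metric names assume the SQS, Kafka consumer, and Postgres integrations are installed and are placeholders, not a prescription.

```python
# Illustrative queries for a few of the signals above. Metric and tag names
# assume specific Datadog integrations; adjust to the ones you actually run.
WORKER_AND_DB_QUERIES = {
    # Queue depth: messages waiting in a hypothetical "orders" SQS queue
    "queue_depth": "avg:aws.sqs.approximate_number_of_messages_visible{queuename:orders}",
    # Consumer lag per topic for a hypothetical consumer group
    "consumer_lag": "max:kafka.consumer_lag{consumer_group:orders-worker} by {topic}",
    # Replication lag on a Postgres replica
    "replication_lag": "max:postgresql.replication_delay{db:orders}",
    # Connection pool pressure
    "db_connections": "avg:postgresql.connections{db:orders}",
}
```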
Follow this proven layout pattern (top to bottom):
Row 1: Health Overview — 4x Query Value widgets (SLO burndown, Error Rate %, p99 Latency, RPS)
Row 2: Request Flow — Request Rate timeseries (stacked by endpoint) + Error Rate timeseries
Row 3: Latency — p50/p95/p99 overlay + Latency heatmap or top-list by endpoint
Row 4: Infrastructure — CPU %, Memory %, Disk I/O, Network (4 widgets)
Row 5: Dependencies — Downstream latency + Downstream error rate (DB, cache, APIs)
Row 6: Changes — Event overlay: deploys, config changes, incidents
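If dashboards are managed as code, each row maps naturally onto a group widget in an ordered dashboard. Below is a hedged sketch using the legacy `datadog` Python client (the newer `datadog-api-client` or Terraform work just as well); the service name, titles, and queries are placeholders, and only two rows are shown.

```python
from datadog import initialize, api

# Assumes DATADOG_API_KEY / DATADOG_APP_KEY are set in the environment.
initialize()

def row(title, widgets):
    """One dashboard row, modeled as a group widget."""
    return {"definition": {"type": "group", "layout_type": "ordered",
                           "title": title, "widgets": widgets}}

def timeseries(title, query):
    return {"definition": {"type": "timeseries", "title": title,
                           "requests": [{"q": query, "display_type": "line"}]}}

api.Dashboard.create(
    title="checkout-api service overview",
    layout_type="ordered",
    description="Golden signals, dependencies, and deploy events for checkout-api.",
    widgets=[
        row("Request Flow", [
            timeseries("Requests per second",
                       "sum:trace.checkout-api.request.hits{env:production}.as_rate()"),
            timeseries("Error rate (%)",
                       "sum:trace.checkout-api.request.errors{env:production}.as_count() / "
                       "sum:trace.checkout-api.request.hits{env:production}.as_count() * 100"),
        ]),
        row("Infrastructure", [
            timeseries("CPU %", "avg:system.cpu.user{service:checkout-api} by {host}"),
            timeseries("Memory used", "avg:system.mem.used{service:checkout-api} by {host}"),
        ]),
    ],
)
```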
Query Value Widgets (Row 1):

```json
{
  "type": "query_value",
  "requests": [{
    "q": "sum:trace.express.request.errors{service:my-api}.as_count() / sum:trace.express.request.hits{service:my-api}.as_count() * 100",
    "aggregator": "avg"
  }],
  "precision": 2,
  "custom_unit": "%",
  "conditional_formats": [
    {"comparator": "<", "value": 1, "palette": "white_on_green"},
    {"comparator": ">=", "value": 1, "palette": "white_on_yellow"},
    {"comparator": ">=", "value": 5, "palette": "white_on_red"}
  ]
}
```
Timeseries Widgets:
- Use avg aggregation for latency, sum for counts
- Use the week_before() function to overlay last week for trend comparison
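For instance, a p99 latency timeseries with a week_before() overlay might be defined like this (a sketch; the metric name assumes APM trace metrics for a hypothetical `checkout-api` service):

```python
# Sketch of a p99 latency timeseries with a week_before() overlay for trend
# comparison. Metric name assumes APM trace metrics for "checkout-api".
latency_widget = {
    "type": "timeseries",
    "title": "p99 latency vs. last week",
    "requests": [
        {"q": "avg:trace.checkout-api.request.duration.by.service.99p{env:production}",
         "display_type": "line"},
        {"q": "week_before(avg:trace.checkout-api.request.duration.by.service.99p{env:production})",
         "display_type": "line", "style": {"line_type": "dashed"}},
    ],
}
```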
Heatmaps:

Top Lists:
Error Rate Monitor:

```yaml
name: "[{service}] Error rate above {threshold}%"
type: metric alert
query: |
  sum(last_5m):
    sum:trace.{service}.request.errors{env:production}.as_count() /
    sum:trace.{service}.request.hits{env:production}.as_count() * 100
  > {threshold}
thresholds:
  critical: 5          # Page the on-call
  warning: 2           # Slack notification
  recovery: 1          # Auto-resolve
evaluation_delay: 60   # Wait for late-arriving data
require_full_window: false
notify_no_data: true
no_data_timeframe: 10
renotify_interval: 30
escalation_message: "Error rate still elevated after 30 minutes"
tags:
  - "service:{service}"
  - "team:{team}"
  - "tier:1"
```
Additional monitors to create (follow same pattern as error rate above):
- Latency p99: avg(last_5m):trace.{service}.request.duration.by.service.99p{env:production} > 2000 (critical at 2s, warning at 1s)
- CPU saturation: avg(last_10m):avg:system.cpu.user{service:{service}} by {host} > 80 (critical at 90%, warning at 80%)
- Traffic anomaly: anomalies() function with the agile algorithm, sensitivity 3, and weekly seasonality for traffic volume

Threshold starting points by service tier:

| Service Tier | Error Rate Critical | Latency p99 Critical | CPU Critical |
|---|---|---|---|
| Tier 1 (revenue) | 1% | 500ms | 80% |
| Tier 2 (core) | 5% | 2s | 85% |
| Tier 3 (internal) | 10% | 5s | 90% |
| Tier 4 (best-effort) | No page | No page | 95% |
Create metric-based SLOs with a numerator of successful requests (excluding 5xx) divided by a denominator of all requests, over a 30-day rolling window.
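As a sketch, a Tier 1 availability SLO could be defined with a payload like the following and created via the SLO API (POST /api/v1/slo) or a datadog_service_level_objective Terraform resource; the service and metric names are placeholders.

```python
# Sketch of a metric-based availability SLO for a hypothetical Tier 1 service.
# Good events = all requests minus errored requests; total events = all requests.
slo_payload = {
    "name": "checkout-api availability",
    "type": "metric",
    "query": {
        "numerator": (
            "sum:trace.checkout-api.request.hits{env:production}.as_count() - "
            "sum:trace.checkout-api.request.errors{env:production}.as_count()"
        ),
        "denominator": "sum:trace.checkout-api.request.hits{env:production}.as_count()",
    },
    "thresholds": [{"timeframe": "30d", "target": 99.95, "warning": 99.97}],
    "tags": ["service:checkout-api", "team:checkout", "tier:1"],
}
```

Note that the SLO warning threshold sits above the target, so it fires while there is still error budget left.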
Recommended SLO Targets by Tier:
| Tier | Availability SLO | Latency SLO (p99 < target) | Error Budget (30 days) |
|---|---|---|---|
| Tier 1 | 99.95% | 99.9% under 500ms | 21.6 min downtime |
| Tier 2 | 99.9% | 99.5% under 2s | 43.2 min downtime |
| Tier 3 | 99.5% | 99% under 5s | 3.6 hr downtime |
| Tier 4 | 99% | N/A | 7.2 hr downtime |
When reviewing an existing dashboard, check for:
- avg for latency instead of percentiles (hides the tail)
- Missing $env, $service, $host dropdowns (template variables)

Output format:

# Dashboard Design: {Service Name}
## Service Profile
- **Type:** {API/Worker/etc.}
- **Tier:** {1-4}
- **Dependencies:** {list}
## Dashboard Structure
{Layout description with widget specifications}
## Monitors
{List of monitors with thresholds and notification routing}
## SLOs
{SLO definitions with targets and error budgets}
## Audit Findings (if reviewing existing)
- {Finding 1: problem and recommendation}
- {Finding 2: problem and recommendation}
## Implementation Steps
1. {Step-by-step instructions to create in Datadog UI or via API/Terraform}
General tips:
- Use template variables for env and service so one dashboard works across environments
- Add week_before() overlays to catch gradual degradation that doesn't trigger alerts
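For the first tip, a sketch of dashboard-level template variables and a query that references them; the defaults are placeholders.

```python
# Dashboard-level template variables; widget queries can then reference
# $env and $service so one dashboard serves every environment.
template_variables = [
    {"name": "env", "prefix": "env", "default": "production"},
    {"name": "service", "prefix": "service", "default": "*"},
]

# Example widget query using the variables above
error_rate_query = (
    "sum:trace.express.request.errors{$env,$service}.as_count() / "
    "sum:trace.express.request.hits{$env,$service}.as_count() * 100"
)
```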