Install
openclaw skills install datadog-dashboard-builder

Design production-grade Datadog dashboards, monitors, and SLOs from scratch or audit existing ones. Recommends the right metrics, widget types, alert thresholds, and layout patterns so on-call engineers can diagnose incidents in under 60 seconds.
Use when: "build a Datadog dashboard", "set up monitoring for this service", "review our alerting", "we need SLOs", "our dashboard is too noisy", "what should we monitor", or when designing observability for a new service.
Before building anything, classify the service:
Service Name: ____
Type: [API | Worker | Queue Consumer | Batch Job | Frontend | Database | Cache]
Traffic Pattern: [Steady | Diurnal | Spiky | Event-Driven | Cron-Based]
Criticality: [Tier 1 (revenue) | Tier 2 (core feature) | Tier 3 (internal) | Tier 4 (best-effort)]
Dependencies: [list upstream and downstream services]
Current Pain Points: [what incidents happened, what was hard to debug]
Every service dashboard starts with the four golden signals (Google SRE book):
| Signal | What to Measure | Datadog Metric Pattern |
|---|---|---|
| Latency | Request duration (p50, p95, p99) | trace.{service}.request.duration or {service}.request.latency |
| Traffic | Requests per second | trace.{service}.request.hits or {service}.request.count |
| Errors | Error rate as percentage | trace.{service}.request.errors / trace.{service}.request.hits * 100 |
| Saturation | Resource utilization (CPU, memory, connections, queue depth) | system.cpu.user, system.mem.used, {service}.pool.active |
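As a concrete illustration, the four signals might map to queries like the following. These assume the Node.js/Express APM integration and a service named my-api; substitute your own integration and metric names:

```
# Latency: p99 request duration
avg:trace.express.request.duration.by.service.99p{service:my-api,env:production}

# Traffic: requests per second
sum:trace.express.request.hits{service:my-api,env:production}.as_rate()

# Errors: error rate as a percentage
(sum:trace.express.request.errors{service:my-api,env:production}.as_count() /
 sum:trace.express.request.hits{service:my-api,env:production}.as_count()) * 100

# Saturation: CPU utilization
avg:system.cpu.user{service:my-api}
```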
For each service type, add specific metrics:
API Services:
- Endpoint-level latency breakdown (which endpoint is slow?)
- HTTP status code distribution (2xx, 4xx, 5xx)
- Request payload size (are large payloads causing timeouts?)
- Rate limiting triggers
- Authentication failures
Queue/Worker: Queue depth, processing rate, consumer lag, dead letter queue size, job duration by type, retry count.
Database: Query duration by operation, connection pool utilization, lock wait time, replication lag, cache hit ratio, slow query count.
Frontend/SPA: Core Web Vitals (LCP, FID, CLS) via RUM, JS error rate by page, client-side API latency, page load time, session crash rate.
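For the queue/worker metrics above, here is a sketch of a queue-depth widget using the AWS SQS integration metrics. The queue name is a placeholder, and the metrics assume SQS; use your own broker's equivalents (e.g. Kafka consumer lag) where they differ:

```json
{
  "definition": {
    "type": "timeseries",
    "title": "Queue depth and age of oldest message",
    "requests": [
      {
        "q": "avg:aws.sqs.approximate_number_of_messages_visible{queuename:my-jobs-queue}",
        "display_type": "line"
      },
      {
        "q": "avg:aws.sqs.approximate_age_of_oldest_message{queuename:my-jobs-queue}",
        "display_type": "line"
      }
    ]
  }
}
```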
Follow this proven layout pattern (top to bottom):
Row 1: Health Overview — 4x Query Value widgets (SLO burndown, Error Rate %, p99 Latency, RPS)
Row 2: Request Flow — Request Rate timeseries (stacked by endpoint) + Error Rate timeseries
Row 3: Latency — p50/p95/p99 overlay + Latency heatmap or top-list by endpoint
Row 4: Infrastructure — CPU %, Memory %, Disk I/O, Network (4 widgets)
Row 5: Dependencies — Downstream latency + Downstream error rate (DB, cache, APIs)
Row 6: Changes — Event overlay: deploys, config changes, incidents
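The row structure above can be sketched as an ordered-layout dashboard definition, with one group widget per row. The title, service name, and group contents are assumptions to fill in; only the first two rows are shown, and the remaining rows follow the same pattern:

```json
{
  "title": "my-api Service Overview",
  "layout_type": "ordered",
  "template_variables": [
    {"name": "env", "prefix": "env", "default": "production"},
    {"name": "service", "prefix": "service", "default": "my-api"}
  ],
  "widgets": [
    {
      "definition": {
        "type": "group",
        "title": "Health Overview",
        "layout_type": "ordered",
        "widgets": []
      }
    },
    {
      "definition": {
        "type": "group",
        "title": "Request Flow",
        "layout_type": "ordered",
        "widgets": []
      }
    }
  ]
}
```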
Query Value Widgets (Row 1):
{
  "type": "query_value",
  "requests": [{
    "q": "sum:trace.express.request.errors{service:my-api}.as_count() / sum:trace.express.request.hits{service:my-api}.as_count() * 100",
    "aggregator": "avg"
  }],
  "precision": 2,
  "custom_unit": "%",
  "conditional_formats": [
    {"comparator": ">=", "value": 5, "palette": "white_on_red"},
    {"comparator": ">=", "value": 1, "palette": "white_on_yellow"},
    {"comparator": "<", "value": 1, "palette": "white_on_green"}
  ]
}
Conditional formats are evaluated top to bottom and the first match wins, so list the red (most severe) threshold first.
Timeseries Widgets:
- Use avg aggregation for latency and sum for counts
- Use the week_before() function to overlay last week's values for trend comparison

Heatmaps: use for latency distribution across hosts or endpoints.

Top Lists: use to rank endpoints by latency or error count.
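Putting the timeseries guidance into a widget sketch, with a dashed week_before() overlay for last week. The metric and service names are assumptions:

```json
{
  "definition": {
    "type": "timeseries",
    "title": "p99 latency vs. last week",
    "requests": [
      {
        "q": "avg:trace.express.request.duration.by.service.99p{service:my-api,env:production}",
        "display_type": "line"
      },
      {
        "q": "week_before(avg:trace.express.request.duration.by.service.99p{service:my-api,env:production})",
        "display_type": "line",
        "style": {"palette": "grey", "line_type": "dashed"}
      }
    ]
  }
}
```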
Error Rate Monitor:
name: "[{service}] Error rate above {threshold}%"
type: metric alert
query: |
  sum(last_5m):
    sum:trace.{service}.request.errors{env:production}.as_count() /
    sum:trace.{service}.request.hits{env:production}.as_count() * 100
    > {threshold}
thresholds:
  critical: 5    # Page the on-call
  warning: 2     # Slack notification
  recovery: 1    # Auto-resolve
evaluation_delay: 60    # Wait for late-arriving data
require_full_window: false
notify_no_data: true
no_data_timeframe: 10
renotify_interval: 30
escalation_message: "Error rate still elevated after 30 minutes"
tags:
  - "service:{service}"
  - "team:{team}"
  - "tier:1"
Additional monitors to create (follow the same pattern as the error rate monitor above):
- Latency: avg(last_5m):trace.{service}.request.duration.by.service.99p{env:production} > 2000 (critical at 2s, warning at 1s)
- CPU: avg(last_10m):avg:system.cpu.user{service:{service}} by {host} > 80 (critical at 90%, warning at 80%)
- Traffic anomaly: the anomalies() function with the agile algorithm, sensitivity 3, and weekly seasonality for traffic volume

Threshold guidance by service tier:

| Service Tier | Error Rate Critical | Latency p99 Critical | CPU Critical |
|---|---|---|---|
| Tier 1 (revenue) | 1% | 500ms | 80% |
| Tier 2 (core) | 5% | 2s | 85% |
| Tier 3 (internal) | 10% | 5s | 90% |
| Tier 4 (best-effort) | No page | No page | 95% |
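As one worked example, the Tier 2 latency monitor could look like this as a create-monitor API payload. The service name, notification handle, and the assumption that the duration metric reports milliseconds are all placeholders to adjust:

```json
{
  "name": "[my-api] p99 latency above 2s",
  "type": "metric alert",
  "query": "avg(last_5m):avg:trace.express.request.duration.by.service.99p{service:my-api,env:production} > 2000",
  "message": "p99 latency is elevated. Check recent deploys and downstream dependencies. @slack-my-team",
  "tags": ["service:my-api", "team:my-team", "tier:2"],
  "options": {
    "thresholds": {"critical": 2000, "warning": 1000},
    "evaluation_delay": 60,
    "require_full_window": false,
    "renotify_interval": 30
  }
}
```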
Create metric-based SLOs with numerator (successful requests excluding 5xx) divided by denominator (all requests). Set 30-day rolling window.
Recommended SLO Targets by Tier:
| Tier | Availability SLO | Latency SLO (p99 < target) | Error Budget (30 days) |
|---|---|---|---|
| Tier 1 | 99.95% | 99.9% under 500ms | 21.6 min downtime |
| Tier 2 | 99.9% | 99.5% under 2s | 43.2 min downtime |
| Tier 3 | 99.5% | 99% under 5s | 3.6 hr downtime |
| Tier 4 | 99% | N/A | 7.2 hr downtime |
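A sketch of the Tier 1 availability SLO as a metric-based SLO payload, using hits minus errors as the numerator. Metric and service names assume the Express integration, and the errors metric counts all traced errors, so narrow it to 5xx if your setup distinguishes them:

```json
{
  "type": "metric",
  "name": "my-api availability",
  "query": {
    "numerator": "sum:trace.express.request.hits{service:my-api,env:production}.as_count() - sum:trace.express.request.errors{service:my-api,env:production}.as_count()",
    "denominator": "sum:trace.express.request.hits{service:my-api,env:production}.as_count()"
  },
  "thresholds": [
    {"timeframe": "30d", "target": 99.95, "warning": 99.97}
  ],
  "tags": ["service:my-api", "tier:1"]
}
```

The 21.6-minute error budget in the table follows directly: 0.05% of a 30-day window (43,200 minutes) is 21.6 minutes.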
When reviewing an existing dashboard, check for:
- avg aggregation for latency instead of percentiles (it hides the tail)
- Missing template variables ($env, $service, $host dropdowns)

Use this template for the deliverable:

# Dashboard Design: {Service Name}
## Service Profile
- **Type:** {API/Worker/etc.}
- **Tier:** {1-4}
- **Dependencies:** {list}
## Dashboard Structure
{Layout description with widget specifications}
## Monitors
{List of monitors with thresholds and notification routing}
## SLOs
{SLO definitions with targets and error budgets}
## Audit Findings (if reviewing existing)
- {Finding 1: problem and recommendation}
- {Finding 2: problem and recommendation}
## Implementation Steps
1. {Step-by-step instructions to create in Datadog UI or via API/Terraform}
Tips:
- Add template variables for env and service so one dashboard works across environments
- Use week_before() overlays to catch gradual degradation that doesn't trigger alerts