Install
openclaw skills install prom-query

Prometheus Metrics Query & Alert Interpreter: query metrics, interpret timeseries, triage alerts

You have access to a Prometheus-compatible metrics server. Use this skill to query metrics, check alerts, inspect targets, and explore available metrics. You can query Prometheus, Thanos, Mimir, and VictoriaMetrics; they all share the same HTTP API.

| Command | Purpose | Example |
|---|---|---|
| `query <promql>` | Instant query (current value) | `prom-query query 'up'` |
| `range <promql> [--start=] [--end=] [--step=]` | Range query (timeseries over time) | `prom-query range 'rate(http_requests_total[5m])' --start=-1h --step=1m` |
| `alerts [--state=firing\|pending\|inactive]` | List active alerts | `prom-query alerts --state=firing` |
| `targets [--state=active\|dropped\|any]` | Scrape target health | `prom-query targets` |
| `explore [pattern]` | Search available metrics by name pattern | `prom-query explore 'http_request'` |
| `rules [--type=alert\|record]` | Alerting & recording rules | `prom-query rules --type=alert` |
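
Since all of these backends speak the Prometheus HTTP API, the `query` and `range` commands plausibly reduce to requests against `/api/v1/query` and `/api/v1/query_range`. The sketch below is not the skill's actual implementation: the helper names, the base URL, and the unit table for relative offsets such as `-1h` are assumptions made for illustration, and ISO8601 input is left out for brevity.

```python
import re
import time
from urllib.parse import urlencode

# Assumed unit table for relative offsets like -30m, -1h, -2d.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def resolve_time(spec, now):
    """Turn a relative offset ('-1h') or an epoch string into epoch seconds."""
    match = re.fullmatch(r"-(\d+)([smhd])", spec)
    if match:
        return now - int(match.group(1)) * _UNITS[match.group(2)]
    return float(spec)  # otherwise assume an epoch timestamp


def instant_query_url(base, promql):
    """Build the URL for an instant query (current value)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def range_query_url(base, promql, start="-1h", end=None, step="1m", now=None):
    """Build the URL for a range query; `end` defaults to the current time."""
    now = time.time() if now is None else now
    params = {
        "query": promql,
        "start": resolve_time(start, now),
        "end": now if end is None else resolve_time(end, now),
        "step": step,  # the API accepts a duration string or float seconds
    }
    return f"{base}/api/v1/query_range?{urlencode(params)}"


# Hypothetical local server; the PromQL is percent-encoded automatically.
print(range_query_url("http://localhost:9090", "rate(http_requests_total[5m])"))
```
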
When the user asks a question about their system, translate it to PromQL using these patterns:
# "What's the error rate for the API?"
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# "Error rate for the payments service"
sum(rate(http_requests_total{service="payments", code=~"5.."}[5m])) / sum(rate(http_requests_total{service="payments"}[5m]))
# "4xx and 5xx errors per second"
sum(rate(http_requests_total{code=~"[45].."}[5m])) by (code)
# "P99 latency"
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# "P50 latency by service"
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# "Average request duration"
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
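
The latency patterns above rely on `histogram_quantile`, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's own code; the function name and the sample bucket values are illustrative assumptions. In the real queries the bucket values are per-second rates rather than raw counts, but the interpolation works the same way.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket, mirroring the `le` label.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the previous bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Linear interpolation between the bucket's two bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up buckets: 50 requests <= 0.1s, 90 <= 0.5s, 99 <= 1s, 100 in total.
example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.
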
# "CPU usage per instance"
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# "CPU usage per pod (Kubernetes)"
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
# "Which pods use the most CPU?"
topk(10, sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace))
# "Memory usage percentage per instance"
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# "Memory usage per pod (Kubernetes)"
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
# "Pods using more than 1GB RAM"
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace) > 1e9
# "Disk usage percentage"
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# "Disk will be full in 4 hours?" (linear prediction)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
# "Network traffic in/out per interface"
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
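
The disk-full pattern above uses `predict_linear`, which fits a straight line to the samples in the window and extrapolates it forward. A minimal least-squares sketch of that idea follows; the function name and sample data are illustrative, and for simplicity it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the query evaluation time.

```python
def predict_linear(samples, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Needs at least two samples with distinct timestamps. Returns the
    predicted value horizon_seconds after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_seconds) + intercept


# Made-up data: available bytes dropping by 1 GB every 30 minutes.
shrinking = [(0, 4e9), (1800, 3e9), (3600, 2e9)]
predicted = predict_linear(shrinking, 4 * 3600)
print("will be full within 4h" if predicted < 0 else "enough space for 4h")
```

A single burst of writes inside the window can tilt the fitted line sharply, so a positive result is worth confirming against a longer window before raising alarm.
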
# "How many pods are not ready?"
sum(kube_pod_status_ready{condition="false"}) by (namespace)
# "Pods in CrashLoopBackOff"
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
# "Deployment replica mismatch"
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
# "Node conditions"
kube_node_status_condition{condition="Ready", status="true"} == 0
# "Show me everything about <service>"
# First, explore what metrics exist:
prom-query explore '<service_name>'
# "Is everything up?"
prom-query query 'up'
# "What changed in the last hour?"
# Use range query with the relevant metric and look for step changes:
prom-query range '<metric>' --start=-1h --step=1m
# Rate of any counter:
rate(<counter_metric>[5m])
# Sum across labels:
sum(<metric>) by (<label>)
# Top N:
topk(10, <metric>)
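
Most of the patterns above wrap a counter in `rate()`, which turns an ever-increasing total into a per-second value and compensates for counter resets. The sketch below shows the core of that calculation under simplifying assumptions: the function name is made up, and it omits the extrapolation to window boundaries that Prometheus performs.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (process restart), in
    which case the new value is counted as growth since the reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up samples: the counter resets between t=60 and t=120.
with_reset = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_rate(with_reset))
```

This is why `rate()` belongs on counters only: applied to a gauge, every ordinary decrease would be misread as a reset.
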
When you get range query results, look for:
Range query results include automatic summaries for each series:
- min / max / avg: Statistical summary of all values
- first / last: Start and end values (shows trend direction)
- pointCount: Number of data points
- downsampled: Whether the step was automatically increased to limit data volume

The script automatically downsamples range queries that would return more than 500 data points. When `downsampled: true`, tell the user the step was adjusted and offer to zoom into a narrower time window for full resolution.
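
To make the summary fields and the 500-point limit concrete, here is a hedged sketch of how such values could be computed. The script's real logic is not shown in this document, so the function names and the exact step-widening rule are assumptions; only the field names and the 500-point threshold come from the text above.

```python
import math

MAX_POINTS = 500  # threshold stated above


def effective_step(start, end, step):
    """Return (step, downsampled), widening step to stay within MAX_POINTS.

    start, end, and step are in seconds.
    """
    points = (end - start) // step + 1
    if points <= MAX_POINTS:
        return step, False
    # Smallest whole-second step that yields at most MAX_POINTS points.
    return math.ceil((end - start) / (MAX_POINTS - 1)), True


def summarize(values):
    """Compute the per-series summary fields for a list of sample values."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "first": values[0],
        "last": values[-1],
        "pointCount": len(values),
    }


# A 24h window at a 60s step would exceed the limit, so the step widens.
print(effective_step(0, 86400, 60))
print(summarize([1.0, 3.0, 2.0]))
```

Comparing `first` with `last` gives the trend direction, while `min` and `max` reveal spikes that the average would hide.
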
When helping with an incident or investigating a problem:
- `prom-query alerts --state=firing`: see what's actually firing
- `prom-query targets`: are any scrape targets down?

When presenting alerts to the user:
- … `activeAt`)
- … `value`, explain what it means in context

When running in a Discord channel:
- Show Last 1h Trend
- List Firing Alerts
- Explore Related Metrics

- The `explore` command uses regex pattern matching (case-insensitive).
- Time values accept relative offsets (`-1h`, `-30m`, `-2d`), epoch timestamps, or ISO8601 dates.

If a query fails:
- … `explore` to find the right metric name.
- … `topk()`.

Powered by Anvil AI 📊