Prometheus
Prometheus monitoring — scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by @wpank
Security Scan
OpenClaw verdict: Benign (high confidence)
Purpose & Capability
The name/description match the provided content: Prometheus config, scrape/service-discovery examples, recording and alerting rule templates, and deployment guidance. The skill requests no binaries, env vars, or credentials that would be unrelated to Prometheus setup.
Instruction Scope
SKILL.md and the templates reference system paths (/etc/prometheus, /var/run/secrets, CA files, bearer_token_file) and suggest tools such as promtool and curl for validating and reloading configs. These references are expected for a production Prometheus setup, but they mean the operator (or an agent acting on the operator's behalf) will need filesystem and network access to the Prometheus host and to Kubernetes service-account files; review before granting such access.
Install Mechanism
No install spec or downloaded code; the skill is instruction-only and includes config templates. No external packages or arbitrary URLs are fetched by the skill itself.
Credentials
No environment variables or credentials are required by the skill metadata. Template examples mention tokens, certs, and remote_write endpoints only as optional configuration items — appropriate for this purpose.
Persistence & Privilege
The always flag is false, and the skill does not request persistent presence or modify other skills. Autonomous invocation (model invocation enabled) is the platform default and not by itself a concern here.
Assessment
This skill is a coherent bundle of Prometheus configs and examples, not executable code. Before using it: (1) review and customize thresholds, job names, and target addresses before deploying; (2) be cautious when copying files into /etc or referencing bearer_token_file or CA/key files, which are sensitive and require proper permissions and RBAC; (3) do not enable remote_write to external endpoints you don't trust, since it can send metrics off-cluster; (4) validate configs with promtool and reload Prometheus via its API or Helm as appropriate; (5) if you allow an agent to act autonomously with this skill, restrict its access to only the hosts and configs you intend it to modify.
Current version: v1.0.0
SKILL.md
Prometheus
Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.
When to Use
| Scenario | Example |
|---|---|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |
Architecture
Applications ──(/metrics)──→ Prometheus Server ──→ AlertManager → Slack/PD
↑ │
client libraries ├──→ Grafana (dashboards)
(prom client) └──→ Thanos/Cortex (long-term storage)
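The /metrics endpoint the diagram starts from is plain text in the Prometheus exposition format. As a rough illustration of that format (plain Python; the metric and label names are made up, and real client libraries also emit HELP lines and handle escaping):

```python
# Minimal sketch of the text exposition format a /metrics endpoint returns:
# one "# TYPE" hint per metric family, then "name{labels} value" lines.
def render_metrics(counters):
    """Render {(name, labels_tuple): value} as exposition-format text."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

payload = render_metrics({
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 1027,
    ("http_requests_total", (("method", "POST"), ("status", "500"))): 3,
})
print(payload, end="")
```

Real services should use an official client library (prometheus_client, client_golang, etc.) rather than hand-rolling this format.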
Installation
Kubernetes (Helm)
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageVolumeSize=50Gi
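The --set flags above can equally be captured in a values file (a hypothetical values.yaml, passed with -f values.yaml), which is easier to review and version-control:

```yaml
# values.yaml — equivalent to the --set flags above
prometheus:
  prometheusSpec:
    retention: 30d
    storageVolumeSize: 50Gi
```

Then: helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace -f values.yaml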
Core Configuration
prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: production
region: us-west-2
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
rule_files:
- /etc/prometheus/rules/*.yml
scrape_configs:
# Self-monitoring
- job_name: prometheus
static_configs:
- targets: ["localhost:9090"]
# Node exporters
- job_name: node-exporter
static_configs:
- targets: ["node1:9100", "node2:9100", "node3:9100"]
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: "([^:]+)(:[0-9]+)?"
replacement: "${1}"
# Application metrics (TLS)
- job_name: my-app
scheme: https
metrics_path: /metrics
tls_config:
ca_file: /etc/prometheus/ca.crt
static_configs:
- targets: ["app1:9090", "app2:9090"]
Service Discovery
Kubernetes Pods (Annotation-Based)
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels:
[__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels:
[__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
Pod annotations to enable scraping:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
File-Based Discovery
scrape_configs:
- job_name: file-sd
file_sd_configs:
- files: ["/etc/prometheus/targets/*.json"]
refresh_interval: 5m
targets/production.json:
[{
"targets": ["app1:9090", "app2:9090"],
"labels": { "env": "production", "service": "api" }
}]
Discovery Method Comparison
| Method | Best For | Dynamic |
|---|---|---|
| static_configs | Fixed infrastructure, dev | No |
| file_sd_configs | Config-management inventories | Yes (file watch) |
| kubernetes_sd_configs | K8s workloads | Yes (API watch) |
| consul_sd_configs | Consul service mesh | Yes (Consul watch) |
| ec2_sd_configs | AWS EC2 instances | Yes (API poll) |
Recording Rules
Pre-compute expensive queries for dashboard and alert performance:
# /etc/prometheus/rules/recording_rules.yml
groups:
- name: api_metrics
interval: 15s
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_errors:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
- record: job:http_error_rate:ratio
expr: job:http_errors:rate5m / job:http_requests:rate5m
- record: job:http_duration:p95
expr: >
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: resource_metrics
interval: 30s
rules:
- record: instance:node_cpu:utilization
expr: >
100 - (avg by (instance)
(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: instance:node_memory:utilization
expr: >
100 - ((node_memory_MemAvailable_bytes
/ node_memory_MemTotal_bytes) * 100)
- record: instance:node_disk:utilization
expr: >
100 - ((node_filesystem_avail_bytes
/ node_filesystem_size_bytes) * 100)
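Dashboards and alerts then query the cheap pre-computed series instead of re-aggregating raw samples on every refresh:

```promql
# Raw expression, evaluated from scratch on every panel refresh:
sum by (job) (rate(http_requests_total[5m]))

# Recorded series, a simple lookup:
job:http_requests:rate5m
```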
Naming Convention
level:metric_name:operations
| Part | Example | Meaning |
|---|---|---|
| level | job:, instance: | Aggregation level |
| metric_name | http_requests | Base metric |
| operations | :rate5m, :ratio | Applied functions |
Alert Rules
# /etc/prometheus/rules/alert_rules.yml
groups:
- name: availability
rules:
- alert: ServiceDown
expr: up{job="my-app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "{{ $labels.instance }} is down"
description: "{{ $labels.job }} down for >1 minute"
- alert: HighErrorRate
expr: job:http_error_rate:ratio > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighP95Latency
expr: job:http_duration:p95 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "P95 latency {{ $value }}s for {{ $labels.job }}"
- name: resources
rules:
- alert: HighCPU
expr: instance:node_cpu:utilization > 80
for: 5m
labels: { severity: warning }
annotations:
summary: "CPU {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemory
expr: instance:node_memory:utilization > 85
for: 5m
labels: { severity: warning }
annotations:
summary: "Memory {{ $value }}% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: instance:node_disk:utilization > 90
for: 5m
labels: { severity: critical }
annotations:
summary: "Disk {{ $value }}% on {{ $labels.instance }}"
Alert Severity Guide
| Severity | Condition | Response |
|---|---|---|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review next business day |
Validation
# Validate config syntax
promtool check config prometheus.yml
# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml
# Test a query
promtool query instant http://localhost:9090 'up'
# Reload config without restart
curl -X POST http://localhost:9090/-/reload
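Rule files can also be unit-tested offline with promtool test rules. A hypothetical tests.yml exercising the ServiceDown alert defined above, run with promtool test rules tests.yml:

```yaml
# tests.yml
rule_files:
  - /etc/prometheus/rules/alert_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="my-app", instance="app1:9090"}'
        values: '0 0 0'          # down for three consecutive scrapes
    alert_rule_test:
      - eval_time: 2m            # past the "for: 1m" hold, so firing
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1:9090
```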
Best Practices
| Practice | Detail |
|---|---|
| Naming: prefix_name_unit | snake_case, _total for counters, _seconds/_bytes for units |
| Scrape intervals 15–60s | Shorter intervals waste resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | prometheus_tsdb_*, scrape_duration_seconds |
| HA deployment | 2+ instances scraping same targets |
| Retention planning | Match --storage.tsdb.retention.time to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
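Federation from the table above is itself just a scrape job against each regional instance's /federate endpoint; pulling only recording-rule series keeps the global instance's load bounded. A sketch with placeholder hostnames:

```yaml
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only pre-aggregated recording-rule series
    static_configs:
      - targets: ["prom-us-west:9090", "prom-eu-central:9090"]
```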
Troubleshooting Quick Reference
| Problem | Diagnosis | Fix |
|---|---|---|
| Target shows DOWN | Check /targets page for error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query up{job="x"} | Verify scrape config, check /metrics endpoint |
| High cardinality | prometheus_tsdb_head_series growing | Drop high-cardinality labels with metric_relabel_configs |
| Storage filling up | Check prometheus_tsdb_storage_* | Reduce retention, add disk, enable compaction |
| Slow queries | Check prometheus_engine_query_duration_seconds | Add recording rules, reduce range, limit series |
| Config not applied | Check prometheus_config_last_reload_successful | Fix syntax, POST /-/reload |
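For the high-cardinality case, metric_relabel_configs runs after the scrape but before storage, so offending labels or series never reach the TSDB. A sketch with hypothetical label and metric names:

```yaml
  - job_name: my-app
    static_configs:
      - targets: ["app1:9090"]
    metric_relabel_configs:
      - action: labeldrop          # strip the label, keep the series
        regex: request_id
      - action: drop               # discard matching series entirely
        source_labels: [__name__]
        regex: "debug_.*"
```

Note that dropping a label that distinguishes otherwise-identical series collapses them into duplicates, so test the relabeling against real scrape output first.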
NEVER Do
| Anti-Pattern | Why | Do Instead |
|---|---|---|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without for duration | Fires on transient spikes | Always set for: 1m minimum |
| Skip recording rules | Dashboards compute expensive queries every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often in Git | Use file-based secrets or env substitution |
| Ignore up metric | Miss targets silently going down | Alert on up == 0 for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit --storage.tsdb.retention.time |
Templates
| Template | Description |
|---|---|
| templates/prometheus.yml | Full config with static, file-based, and K8s discovery |
| templates/alert-rules.yml | 25+ alert rules by category |
| templates/recording-rules.yml | Pre-computed metrics for HTTP, latency, resources, SLOs |