Prometheus

Prometheus monitoring — scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill's name and description match its content: Prometheus configuration, scrape and service-discovery examples, recording and alerting rule templates, and deployment guidance. The skill requests no binaries, environment variables, or credentials unrelated to Prometheus setup.
Instruction Scope
SKILL.md and the templates reference system paths (/etc/prometheus, /var/run/secrets, CA files, bearer_token_file) and suggest tools such as promtool and curl to validate and reload configs. These references are expected for a production Prometheus setup, but they mean the operator (or an agent acting on the operator's behalf) needs filesystem and network access to the Prometheus host and to Kubernetes service-account files. Review before granting such access.
Install Mechanism
No install spec or downloaded code; the skill is instruction-only and includes config templates. No external packages or arbitrary URLs are fetched by the skill itself.
Credentials
No environment variables or credentials are required by the skill metadata. Template examples mention tokens, certs, and remote_write endpoints only as optional configuration items — appropriate for this purpose.
Persistence & Privilege
The always flag is false, and the skill neither requests persistent presence nor modifies other skills. Autonomous invocation (model invocation enabled) is the platform default and not by itself a concern here.
Assessment
This skill is a coherent bundle of Prometheus configs and examples, not executable code. Before using it:

1. Review and customize thresholds, job names, and target addresses before deploying.
2. Be cautious when copying files into /etc or using bearer_token_file or CA/key files; these are sensitive and require proper permissions and RBAC.
3. Do not enable remote_write to external endpoints you don't trust (it can send metrics off-cluster).
4. Validate configs with promtool and reload Prometheus via its API or Helm as appropriate.
5. If you allow an agent to act autonomously with this skill, restrict its access to only the hosts and configs you intend it to modify.


Current version: v1.0.0


SKILL.md

Prometheus

Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.

When to Use

| Scenario | Example |
|---|---|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |

Architecture

Applications ──(/metrics)──→ Prometheus Server ──→ AlertManager ──→ Slack/PD
      ↑                              │
  client libraries                   ├──→ Grafana (dashboards)
  (prom client)                      └──→ Thanos/Cortex (long-term storage)

Installation

Kubernetes (Helm)

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Core Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]

Service Discovery

Kubernetes Pods (Annotation-Based)

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

Pod annotations to enable scraping:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
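The port-rewrite rule in the relabel config above is dense. This Python sketch (illustrative only, not part of the skill) mimics how Prometheus joins the source labels with `;` and applies the regex replacement:

```python
import re

# Prometheus joins source_labels with ";" before matching, so the input is
# "<__address__>;<port annotation>". The regex strips any existing port from
# the address and appends the annotated one. (Prometheus anchors the regex
# to the whole string; here the pattern matches the full input anyway.)
PATTERN = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, annotated_port: str) -> str:
    joined = f"{address};{annotated_port}"
    return PATTERN.sub(r"\1:\2", joined)

print(rewrite_address("10.0.0.5:8080", "9090"))  # -> 10.0.0.5:9090
print(rewrite_address("10.0.0.5", "9090"))       # -> 10.0.0.5:9090
```

Either way, the annotated port wins, which is exactly what the `prometheus.io/port` annotation is for.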

File-Based Discovery

scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m

targets/production.json:

[{
  "targets": ["app1:9090", "app2:9090"],
  "labels": { "env": "production", "service": "api" }
}]
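Target files like this are often generated from an inventory. A minimal Python sketch (the file name and labels reuse the example above) that writes the JSON atomically, so Prometheus's file watcher never observes a half-written file:

```python
import json
import os
import tempfile

def write_targets(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Write a file_sd JSON atomically: temp file in the same dir, then rename."""
    payload = [{"targets": targets, "labels": labels}]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX

write_targets("production.json",
              ["app1:9090", "app2:9090"],
              {"env": "production", "service": "api"})
```

Prometheus picks up the change on its own (per refresh_interval or file watch); no reload is needed for file_sd targets.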

Discovery Method Comparison

| Method | Best For | Dynamic |
|---|---|---|
| static_configs | Fixed infrastructure, dev | No |
| file_sd_configs | CM-managed inventories | Yes (file watch) |
| kubernetes_sd_configs | K8s workloads | Yes (API watch) |
| consul_sd_configs | Consul service mesh | Yes (Consul watch) |
| ec2_sd_configs | AWS EC2 instances | Yes (API poll) |

Recording Rules

Pre-compute expensive queries for dashboard and alert performance:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes) * 100)

      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes
            / node_filesystem_size_bytes) * 100)

Naming Convention

level:metric_name:operations

| Part | Example | Meaning |
|---|---|---|
| level | job:, instance: | Aggregation level |
| metric_name | http_requests | Base metric |
| operations | :rate5m, :ratio | Applied functions |
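As a quick sanity check, the convention can be enforced with a short script. The regex below is an illustrative approximation of the pattern, not an official grammar:

```python
import re

# level:metric_name:operations, e.g. job:http_requests:rate5m
NAME_RE = re.compile(r"^[a-z_]+:[a-z_][a-z0-9_]*:[a-z0-9_]+$")

def follows_convention(record: str) -> bool:
    """True if a recording-rule name matches level:metric_name:operations."""
    return bool(NAME_RE.match(record))

print(follows_convention("job:http_requests:rate5m"))   # True
print(follows_convention("http_requests_total"))        # False: raw metric name
```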

Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"

      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"

  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"

Alert Severity Guide

| Severity | Criteria | Response |
|---|---|---|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review next business day |

Validation

# Validate config syntax
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test a query
promtool query instant http://localhost:9090 'up'

# Reload config without restart
curl -X POST http://localhost:9090/-/reload

Best Practices

| Practice | Detail |
|---|---|
| Naming: prefix_name_unit | snake_case; _total for counters, _seconds/_bytes for units |
| Scrape intervals 15–60s | Shorter intervals waste resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | prometheus_tsdb_*, scrape_duration_seconds |
| HA deployment | 2+ instances scraping the same targets |
| Retention planning | Match --storage.tsdb.retention.time to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
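For retention planning, disk needs can be estimated with the usual rule of thumb (Prometheus stores roughly 1–2 bytes per sample after compression). The figures in the example call are illustrative assumptions, not a sizing recommendation:

```python
def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """needed_disk ≈ ingested samples/sec * bytes/sample * retention seconds."""
    samples_per_sec = active_series / scrape_interval_s
    retention_sec = retention_days * 86400
    return samples_per_sec * bytes_per_sample * retention_sec

# 500k active series scraped every 15s, kept for 30 days:
gib = estimate_disk_bytes(500_000, 15, 30) / 2**30
print(f"{gib:.0f} GiB")  # ≈ 161 GiB
```

Leave generous headroom on top of the estimate: WAL, compaction, and churned series all add to the footprint.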

Troubleshooting Quick Reference

| Problem | Diagnosis | Fix |
|---|---|---|
| Target shows DOWN | Check /targets page for the error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query up{job="x"} | Verify scrape config, check /metrics endpoint |
| High cardinality | prometheus_tsdb_head_series growing | Drop high-cardinality labels with metric_relabel_configs |
| Storage filling up | Check prometheus_tsdb_storage_* | Reduce retention, add disk, enable compaction |
| Slow queries | Check prometheus_engine_query_duration_seconds | Add recording rules, reduce range, limit series |
| Config not applied | Check prometheus_config_last_reload_successful | Fix syntax, POST /-/reload |
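The up check can also be run ad hoc against the HTTP API (GET /api/v1/query?query=up). This sketch parses the documented vector-response shape and lists down instances; the payload here is a hard-coded illustration rather than a live query:

```python
import json

def down_targets(api_response: str) -> list[str]:
    """Return instance labels whose `up` sample value is 0."""
    data = json.loads(api_response)
    down = []
    for sample in data["data"]["result"]:
        _ts, value = sample["value"]  # value is [timestamp, "<string number>"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down

# Illustrative /api/v1/query response for query=up:
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "my-app", "instance": "app1:9090"},
         "value": [1700000000, "1"]},
        {"metric": {"job": "my-app", "instance": "app2:9090"},
         "value": [1700000000, "0"]},
    ]},
})
print(down_targets(sample))  # -> ['app2:9090']
```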

NEVER Do

| Anti-Pattern | Why | Do Instead |
|---|---|---|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without for duration | Fires on transient spikes | Always set for: 1m minimum |
| Skip recording rules | Dashboards compute expensive queries on every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often lives in Git | Use file-based secrets or env substitution |
| Ignore the up metric | Targets silently go down unnoticed | Alert on up == 0 for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit --storage.tsdb.retention.time |

Templates

| Template | Description |
|---|---|
| templates/prometheus.yml | Full config with static, file-based, and K8s discovery |
| templates/alert-rules.yml | 25+ alert rules by category |
| templates/recording-rules.yml | Pre-computed metrics for HTTP, latency, resources, SLOs |

Files

5 total
