Prometheus

Prometheus monitoring — scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The skill's name and description match its content: Prometheus configuration, scrape and service-discovery examples, recording and alerting rule templates, and deployment guidance. The skill requests no binaries, environment variables, or credentials unrelated to Prometheus setup.
Instruction Scope
SKILL.md and the templates reference system paths (/etc/prometheus, /var/run/secrets, CA files, bearer_token_file) and suggest tools such as promtool and curl to validate and reload configs. These references are expected for a production Prometheus setup, but they mean the operator (or an agent acting on the operator's behalf) needs filesystem and network access to the Prometheus host and to Kubernetes service-account files. Review before granting such access.
Install Mechanism
No install spec or downloaded code; the skill is instruction-only and includes config templates. No external packages or arbitrary URLs are fetched by the skill itself.
Credentials
No environment variables or credentials are required by the skill metadata. Template examples mention tokens, certs, and remote_write endpoints only as optional configuration items — appropriate for this purpose.
Persistence & Privilege
The always flag is false, and the skill neither requests persistent presence nor modifies other skills. Autonomous invocation (model invocation enabled) is the platform default and not by itself a concern here.
Assessment
This skill is a coherent bundle of Prometheus configs and examples, not executable code. Before using it:

1. Review and customize thresholds, job names, and target addresses before deploying.
2. Be cautious when copying files into /etc or using bearer_token_file or CA/key files; these are sensitive and require proper permissions and RBAC.
3. Do not enable remote_write to external endpoints you don't trust (it can send metrics off-cluster).
4. Validate configs with promtool and reload Prometheus via its API or Helm as appropriate.
5. If you allow an agent to act autonomously with this skill, restrict its access to only the hosts and configs you intend it to modify.


Current version: v1.0.0


SKILL.md

Prometheus

Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.

When to Use

| Scenario | Example |
|---|---|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |

Architecture

Applications ──(/metrics)──→ Prometheus Server ──→ AlertManager ──→ Slack/PD
      ↑                              │
  client libraries                   ├──→ Grafana (dashboards)
  (prom client)                      └──→ Thanos/Cortex (long-term storage)

Installation

Kubernetes (Helm)

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Core Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]

Service Discovery

Kubernetes Pods (Annotation-Based)

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

Pod annotations to enable scraping:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
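The port-rewrite rule in the relabel config above is dense. This Python sketch (illustrative only, not part of the skill) mimics how Prometheus joins the source labels with `;` and applies the regex replacement:

```python
import re

# Prometheus joins source_labels with ";" before matching, so the input is
# "<__address__>;<port annotation>". The regex strips any existing port from
# the address and appends the annotated one. (Prometheus anchors the regex
# to the whole string; here the pattern matches the full input anyway.)
PATTERN = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, annotated_port: str) -> str:
    joined = f"{address};{annotated_port}"
    return PATTERN.sub(r"\1:\2", joined)

print(rewrite_address("10.0.0.5:8080", "9090"))  # -> 10.0.0.5:9090
print(rewrite_address("10.0.0.5", "9090"))       # -> 10.0.0.5:9090
```

Either way, the annotated port wins, which is exactly what the `prometheus.io/port` annotation is for.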

File-Based Discovery

scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m

targets/production.json:

[{
  "targets": ["app1:9090", "app2:9090"],
  "labels": { "env": "production", "service": "api" }
}]
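Target files like this are often generated from an inventory. A minimal Python sketch (the file name and labels reuse the example above) that writes the JSON atomically, so Prometheus's file watcher never observes a half-written file:

```python
import json
import os
import tempfile

def write_targets(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Write a file_sd JSON atomically: temp file in the same dir, then rename."""
    payload = [{"targets": targets, "labels": labels}]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX

write_targets("production.json",
              ["app1:9090", "app2:9090"],
              {"env": "production", "service": "api"})
```

Prometheus picks up the change on its own (per refresh_interval or file watch); no reload is needed for file_sd targets.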

Discovery Method Comparison

| Method | Best For | Dynamic |
|---|---|---|
| static_configs | Fixed infrastructure, dev | No |
| file_sd_configs | CM-managed inventories | Yes (file watch) |
| kubernetes_sd_configs | K8s workloads | Yes (API watch) |
| consul_sd_configs | Consul service mesh | Yes (Consul watch) |
| ec2_sd_configs | AWS EC2 instances | Yes (API poll) |

Recording Rules

Pre-compute expensive queries for dashboard and alert performance:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes) * 100)

      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes
            / node_filesystem_size_bytes) * 100)

Naming Convention

level:metric_name:operations

| Part | Example | Meaning |
|---|---|---|
| level | job:, instance: | Aggregation level |
| metric_name | http_requests | Base metric |
| operations | :rate5m, :ratio | Applied functions |
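As a quick sanity check, the convention can be enforced with a short script. The regex below is an illustrative approximation of the pattern, not an official grammar:

```python
import re

# level:metric_name:operations, e.g. job:http_requests:rate5m
NAME_RE = re.compile(r"^[a-z_]+:[a-z_][a-z0-9_]*:[a-z0-9_]+$")

def follows_convention(record: str) -> bool:
    """True if a recording-rule name matches level:metric_name:operations."""
    return bool(NAME_RE.match(record))

print(follows_convention("job:http_requests:rate5m"))   # True
print(follows_convention("http_requests_total"))        # False: raw metric name
```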

Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"

      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"

  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"

Alert Severity Guide

| Severity | Criteria | Response |
|---|---|---|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review next business day |

Validation

# Validate config syntax
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test a query
promtool query instant http://localhost:9090 'up'

# Reload config without restart
curl -X POST http://localhost:9090/-/reload

Best Practices

| Practice | Detail |
|---|---|
| Naming: prefix_name_unit | snake_case; _total for counters, _seconds/_bytes for units |
| Scrape intervals 15–60s | Shorter intervals waste resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | prometheus_tsdb_*, scrape_duration_seconds |
| HA deployment | 2+ instances scraping the same targets |
| Retention planning | Match --storage.tsdb.retention.time to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
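For retention planning, disk needs can be estimated with the usual rule of thumb (Prometheus stores roughly 1–2 bytes per sample after compression). The figures in the example call are illustrative assumptions, not a sizing recommendation:

```python
def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """needed_disk ≈ ingested samples/sec * bytes/sample * retention seconds."""
    samples_per_sec = active_series / scrape_interval_s
    retention_sec = retention_days * 86400
    return samples_per_sec * bytes_per_sample * retention_sec

# 500k active series scraped every 15s, kept for 30 days:
gib = estimate_disk_bytes(500_000, 15, 30) / 2**30
print(f"{gib:.0f} GiB")  # ≈ 161 GiB
```

Leave generous headroom on top of the estimate: WAL, compaction, and churned series all add to the footprint.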

Troubleshooting Quick Reference

| Problem | Diagnosis | Fix |
|---|---|---|
| Target shows DOWN | Check /targets page for the error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query up{job="x"} | Verify scrape config, check /metrics endpoint |
| High cardinality | prometheus_tsdb_head_series growing | Drop high-cardinality labels with metric_relabel_configs |
| Storage filling up | Check prometheus_tsdb_storage_* | Reduce retention, add disk, enable compaction |
| Slow queries | Check prometheus_engine_query_duration_seconds | Add recording rules, reduce range, limit series |
| Config not applied | Check prometheus_config_last_reload_successful | Fix syntax, POST /-/reload |
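The up check can also be run ad hoc against the HTTP API (GET /api/v1/query?query=up). This sketch parses the documented vector-response shape and lists down instances; the payload here is a hard-coded illustration rather than a live query:

```python
import json

def down_targets(api_response: str) -> list[str]:
    """Return instance labels whose `up` sample value is 0."""
    data = json.loads(api_response)
    down = []
    for sample in data["data"]["result"]:
        _ts, value = sample["value"]  # value is [timestamp, "<string number>"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down

# Illustrative /api/v1/query response for query=up:
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "my-app", "instance": "app1:9090"},
         "value": [1700000000, "1"]},
        {"metric": {"job": "my-app", "instance": "app2:9090"},
         "value": [1700000000, "0"]},
    ]},
})
print(down_targets(sample))  # -> ['app2:9090']
```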

NEVER Do

| Anti-Pattern | Why | Do Instead |
|---|---|---|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without for duration | Fires on transient spikes | Always set for: 1m minimum |
| Skip recording rules | Dashboards compute expensive queries on every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often lives in Git | Use file-based secrets or env substitution |
| Ignore the up metric | Targets silently go down unnoticed | Alert on up == 0 for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit --storage.tsdb.retention.time |

Templates

| Template | Description |
|---|---|
| templates/prometheus.yml | Full config with static, file-based, and K8s discovery |
| templates/alert-rules.yml | 25+ alert rules by category |
| templates/recording-rules.yml | Pre-computed metrics for HTTP, latency, resources, SLOs |

Files

5 total
