{"skill":{"slug":"prometheus-devops","displayName":"Prometheus","summary":"Prometheus monitoring — scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.","description":"---\nname: prometheus\nmodel: fast\nversion: 1.0.0\ndescription: >\n  Prometheus monitoring — scrape configuration, service discovery, recording\n  rules, alert rules, and production deployment for infrastructure and\n  application metrics.\ncategory: devops\ntags: [prometheus, monitoring, metrics, alerting, observability]\nauthor: skills-factory\n---\n\n# Prometheus\n\nProduction Prometheus setup covering scrape configuration, service discovery,\nrecording rules, alert rules, and operational best practices for infrastructure\nand application monitoring.\n\n## When to Use\n\n| Scenario | Example |\n|----------|---------|\n| Set up metrics collection | New service needs Prometheus scraping |\n| Configure service discovery | K8s pods, file-based, or static targets |\n| Create recording rules | Pre-compute expensive PromQL queries |\n| Design alert rules | SLO-based alerts for availability and latency |\n| Production deployment | HA setup with retention and storage planning |\n| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |\n\n## Architecture\n\n```\nApplications ──(/metrics)──→ Prometheus Server ──→ Alertmanager → Slack/PD\n      ↑                           │\n  client libraries          ├──→ Grafana (dashboards)\n  (prom client)             └──→ Thanos/Cortex (long-term storage)\n```\n\n## Installation\n\n### Kubernetes (Helm)\n\n```bash\nhelm repo add prometheus-community \\\n  https://prometheus-community.github.io/helm-charts\nhelm install prometheus prometheus-community/kube-prometheus-stack \\\n  --namespace monitoring --create-namespace \\\n  --set prometheus.prometheusSpec.retention=30d \\\n  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi\n```\n\n## Core 
Configuration\n\n### prometheus.yml\n\n```yaml\nglobal:\n  scrape_interval: 15s\n  evaluation_interval: 15s\n  external_labels:\n    cluster: production\n    region: us-west-2\n\nalerting:\n  alertmanagers:\n    - static_configs:\n        - targets: [\"alertmanager:9093\"]\n\nrule_files:\n  - /etc/prometheus/rules/*.yml\n\nscrape_configs:\n  # Self-monitoring\n  - job_name: prometheus\n    static_configs:\n      - targets: [\"localhost:9090\"]\n\n  # Node exporters\n  - job_name: node-exporter\n    static_configs:\n      - targets: [\"node1:9100\", \"node2:9100\", \"node3:9100\"]\n    relabel_configs:\n      - source_labels: [__address__]\n        target_label: instance\n        regex: \"([^:]+)(:[0-9]+)?\"\n        replacement: \"${1}\"\n\n  # Application metrics (TLS)\n  - job_name: my-app\n    scheme: https\n    metrics_path: /metrics\n    tls_config:\n      ca_file: /etc/prometheus/ca.crt\n    static_configs:\n      - targets: [\"app1:9090\", \"app2:9090\"]\n```\n\n## Service Discovery\n\n### Kubernetes Pods (Annotation-Based)\n\n```yaml\nscrape_configs:\n  - job_name: kubernetes-pods\n    kubernetes_sd_configs:\n      - role: pod\n    relabel_configs:\n      - source_labels:\n          [__meta_kubernetes_pod_annotation_prometheus_io_scrape]\n        action: keep\n        regex: true\n      - source_labels:\n          [__meta_kubernetes_pod_annotation_prometheus_io_path]\n        action: replace\n        target_label: __metrics_path__\n        regex: (.+)\n      - source_labels:\n          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]\n        action: replace\n        regex: ([^:]+)(?::\\d+)?;(\\d+)\n        replacement: $1:$2\n        target_label: __address__\n      - source_labels: [__meta_kubernetes_namespace]\n        target_label: namespace\n      - source_labels: [__meta_kubernetes_pod_name]\n        target_label: pod\n```\n\n**Pod annotations to enable scraping:**\n\n```yaml\nmetadata:\n  annotations:\n    prometheus.io/scrape: 
\"true\"\n    prometheus.io/port: \"9090\"\n    prometheus.io/path: \"/metrics\"\n```\n\n### File-Based Discovery\n\n```yaml\nscrape_configs:\n  - job_name: file-sd\n    file_sd_configs:\n      - files: [\"/etc/prometheus/targets/*.json\"]\n        refresh_interval: 5m\n```\n\n**targets/production.json:**\n\n```json\n[{\n  \"targets\": [\"app1:9090\", \"app2:9090\"],\n  \"labels\": { \"env\": \"production\", \"service\": \"api\" }\n}]\n```\n\n### Discovery Method Comparison\n\n| Method | Best For | Dynamic |\n|--------|----------|---------|\n| `static_configs` | Fixed infrastructure, dev | No |\n| `file_sd_configs` | CM-managed inventories | Yes (file watch) |\n| `kubernetes_sd_configs` | K8s workloads | Yes (API watch) |\n| `consul_sd_configs` | Consul service mesh | Yes (Consul watch) |\n| `ec2_sd_configs` | AWS EC2 instances | Yes (API poll) |\n\n## Recording Rules\n\nPre-compute expensive queries for dashboard and alert performance:\n\n```yaml\n# /etc/prometheus/rules/recording_rules.yml\ngroups:\n  - name: api_metrics\n    interval: 15s\n    rules:\n      - record: job:http_requests:rate5m\n        expr: sum by (job) (rate(http_requests_total[5m]))\n\n      - record: job:http_errors:rate5m\n        expr: sum by (job) (rate(http_requests_total{status=~\"5..\"}[5m]))\n\n      - record: job:http_error_rate:ratio\n        expr: job:http_errors:rate5m / job:http_requests:rate5m\n\n      - record: job:http_duration:p95\n        expr: >\n          histogram_quantile(0.95,\n            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))\n          )\n\n  - name: resource_metrics\n    interval: 30s\n    rules:\n      - record: instance:node_cpu:utilization\n        expr: >\n          100 - (avg by (instance)\n            (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n\n      - record: instance:node_memory:utilization\n        expr: >\n          100 - ((node_memory_MemAvailable_bytes\n            / node_memory_MemTotal_bytes) * 100)\n\n      - 
record: instance:node_disk:utilization\n        expr: >\n          100 - ((node_filesystem_avail_bytes\n            / node_filesystem_size_bytes) * 100)\n```\n\n### Naming Convention\n\n```\nlevel:metric_name:operations\n```\n\n| Part | Example | Meaning |\n|------|---------|---------|\n| level | `job:`, `instance:` | Aggregation level |\n| metric_name | `http_requests` | Base metric |\n| operations | `:rate5m`, `:ratio` | Applied functions |\n\n## Alert Rules\n\n```yaml\n# /etc/prometheus/rules/alert_rules.yml\ngroups:\n  - name: availability\n    rules:\n      - alert: ServiceDown\n        expr: up{job=\"my-app\"} == 0\n        for: 1m\n        labels:\n          severity: critical\n        annotations:\n          summary: \"{{ $labels.instance }} is down\"\n          description: \"{{ $labels.job }} down for >1 minute\"\n\n      - alert: HighErrorRate\n        expr: job:http_error_rate:ratio > 0.05\n        for: 5m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}\"\n\n      - alert: HighP95Latency\n        expr: job:http_duration:p95 > 1\n        for: 5m\n        labels:\n          severity: warning\n        annotations:\n          summary: \"P95 latency {{ $value }}s for {{ $labels.job }}\"\n\n  - name: resources\n    rules:\n      - alert: HighCPU\n        expr: instance:node_cpu:utilization > 80\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"CPU {{ $value }}% on {{ $labels.instance }}\"\n\n      - alert: HighMemory\n        expr: instance:node_memory:utilization > 85\n        for: 5m\n        labels: { severity: warning }\n        annotations:\n          summary: \"Memory {{ $value }}% on {{ $labels.instance }}\"\n\n      - alert: DiskSpaceLow\n        expr: instance:node_disk:utilization > 90\n        for: 5m\n        labels: { severity: critical }\n        annotations:\n          summary: \"Disk {{ 
$value }}% on {{ $labels.instance }}\"\n```\n\n### Alert Severity Guide\n\n| Severity | Condition | Response |\n|----------|-----------|----------|\n| `critical` | Service down, data loss risk | Page on-call immediately |\n| `warning` | Degraded, approaching limit | Investigate within hours |\n| `info` | Notable but not urgent | Review on the next business day |\n\n## Validation\n\n```bash\n# Validate config syntax\npromtool check config prometheus.yml\n\n# Validate rule files\npromtool check rules /etc/prometheus/rules/*.yml\n\n# Test a query\npromtool query instant http://localhost:9090 'up'\n\n# Reload config without restart (requires --web.enable-lifecycle)\ncurl -X POST http://localhost:9090/-/reload\n```\n\n## Best Practices\n\n| Practice | Detail |\n|----------|--------|\n| Naming: `prefix_name_unit` | Snake_case, `_total` for counters, `_seconds`/`_bytes` for units |\n| Scrape intervals 15–60s | Shorter intervals waste resources and storage |\n| Recording rules for dashboards | Pre-compute anything queried repeatedly |\n| Monitor Prometheus itself | `prometheus_tsdb_*`, `scrape_duration_seconds` |\n| HA deployment | 2+ instances scraping the same targets |\n| Retention planning | Match `--storage.tsdb.retention.time` to disk capacity |\n| Federation for scale | Global Prometheus aggregates from regional instances |\n| Long-term storage | Thanos or Cortex for >30d retention |\n\n## Troubleshooting Quick Reference\n\n| Problem | Diagnosis | Fix |\n|---------|-----------|-----|\n| Target shows `DOWN` | Check `/targets` page for error | Fix firewall, verify endpoint, check TLS |\n| Metrics missing | Query `up{job=\"x\"}` | Verify scrape config, check `/metrics` endpoint |\n| High cardinality | `prometheus_tsdb_head_series` growing | Drop high-cardinality labels with `metric_relabel_configs` |\n| Storage filling up | Check `prometheus_tsdb_storage_*` | Reduce retention, add disk, cap size with `--storage.tsdb.retention.size` |\n| Slow queries | Check `prometheus_engine_query_duration_seconds` | Add recording rules, reduce range, limit 
series |\n| Config not applied | Check `prometheus_config_last_reload_successful` | Fix syntax, POST `/-/reload` |\n\n## NEVER Do\n\n| Anti-Pattern | Why | Do Instead |\n|-------------|-----|------------|\n| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |\n| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |\n| Alert without `for` duration | Fires on transient spikes | Set `for` to at least `1m` |\n| Skip recording rules | Dashboards recompute expensive queries on every load | Pre-compute with recording rules |\n| Store secrets in prometheus.yml | Config is often committed to Git | Use file-based secrets (`password_file`, `credentials_file`) |\n| Ignore `up` metric | Miss targets silently going down | Alert on `up == 0` for all jobs |\n| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |\n| Unbounded retention | Disk fills, Prometheus crashes | Set explicit `--storage.tsdb.retention.time` |\n\n## Templates\n\n| Template | Description |\n|----------|-------------|\n| [templates/prometheus.yml](templates/prometheus.yml) | Full config with static, file-based, and K8s discovery |\n| [templates/alert-rules.yml](templates/alert-rules.yml) | 25+ alert rules by category |\n| [templates/recording-rules.yml](templates/recording-rules.yml) | Pre-computed metrics for HTTP, latency, resources, SLOs |\n","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":1191,"installsAllTime":0,"installsCurrent":0,"stars":1,"versions":1},"createdAt":1770727486856,"updatedAt":1778486478500},"latestVersion":{"version":"1.0.0","createdAt":1770727486856,"changelog":"Initial release — overhaul from CLI/query utility to full production Prometheus configuration templates.\n\n- Replaces previous CLI/query scripts with ready-to-use configuration YAML for Prometheus.\n- Adds detailed templates for `prometheus.yml`, `alert-rules.yml`, and `recording-rules.yml` covering scraping, service 
discovery, rule definitions, and alerting best practices.\n- Includes best-practice documentation for installation (with Helm), architecture, configuration, and validation.\n- Targets DevOps, SREs, and engineers deploying Prometheus in production environments.\n- Removes all JavaScript files and CLI instructions in favor of declarative YAML configs and an operational playbook.","license":null},"metadata":null,"owner":{"handle":"wpank","userId":"s17bjjbnm7xjckd29e1h2641ks8849md","displayName":"wpank","image":"https://avatars.githubusercontent.com/u/9498646?v=4"},"moderation":null}