Logging & Observability
Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|---|---|---|---|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
The pillars are most powerful when correlated: embed `trace_id` in every log line so you can jump from a log entry to the full distributed trace.
Structured Logging
Always emit logs as structured JSON — never free-text strings.
Required Fields
| Field | Purpose | Required |
|---|---|---|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (e.g. order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
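A minimal sketch of this schema using Pino (field values are illustrative; in a real service `trace_id` and `span_id` come from the active OTel span):

```ts
import pino from 'pino';

// Base fields are attached to every line; timestamps are emitted
// as ISO-8601 strings instead of epoch milliseconds.
const logger = pino({
  base: { service: 'order-service' },
  timestamp: pino.stdTimeFunctions.isoTime,
});

logger.error(
  {
    trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
    span_id: '00f067aa0ba902b7',
    correlation_id: 'ORD-1234',
    err: new Error('payment declined'), // Pino's default serializer turns this into a structured error object
  },
  'payment declined'
);
```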
Context Enrichment
Attach context at the middleware level so downstream logs inherit automatically:
```ts
import { AsyncLocalStorage } from 'node:async_hooks';
import crypto from 'node:crypto';

const asyncLocalStorage = new AsyncLocalStorage();

app.use((req, res, next) => {
  // Reuse the incoming trace ID when present; otherwise start a new one.
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Everything inside this request can now read ctx from the store.
  asyncLocalStorage.run(ctx, () => next());
});
```
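To make downstream logs actually inherit that context, wire the store into the logger. A small sketch using Pino's `mixin` option, assuming the `asyncLocalStorage` instance from the middleware above:

```ts
import pino from 'pino';

// Every log line automatically merges in the per-request context
// established by the middleware above.
const logger = pino({
  mixin() {
    return asyncLocalStorage.getStore() ?? {};
  },
});

logger.info('inventory reserved'); // carries trace_id, request_id, user_id, …
```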
Library Recommendations
| Library | Language | Strengths | Perf |
|---|---|---|---|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
Log Levels
| Level | When to Use | Example |
|---|---|---|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
Distributed Tracing
OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Auto-instruments HTTP, Express, gRPC, popular DB clients, etc.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
Span Creation
```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      // Attributes make the span searchable in the trace backend.
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Mark the span failed and attach the exception as a span event.
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end(); // always end the span, success or failure
    }
  });
}
```
Context Propagation
- Use W3C Trace Context (the `traceparent` header) — the OTel default
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload (sketch below)
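A sketch of the async-worker case using the OTel propagation API (the `enqueue` client is hypothetical; the carrier handling is the point):

```ts
import { context, propagation, trace } from '@opentelemetry/api';

declare function enqueue(job: object): void; // hypothetical queue client

// Producer: capture the active trace context into the job payload.
function publishJob(payload: Record<string, unknown>) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier); // writes `traceparent` (and `tracestate`)
  enqueue({ ...payload, otel: carrier });
}

// Consumer: restore the context so the worker's spans join the original trace.
async function handleJob(job: { otel: Record<string, string> }) {
  const parentCtx = propagation.extract(context.active(), job.otel);
  await context.with(parentCtx, () =>
    trace.getTracer('worker').startActiveSpan('processJob', async (span) => {
      try {
        // ... process the job ...
      } finally {
        span.end();
      }
    }),
  );
}
```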
Trace Sampling
| Strategy | Use When |
|---|---|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
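A head-based probabilistic sampler sketch for the NodeSDK setup above. Note that a head-based sampler decides before the request outcome is known, so keeping 100% of error traces requires tail-based sampling at the collector:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  // Sample 10% of new root traces; child services follow the parent's
  // decision so a trace is never half-recorded.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

sdk.start();
```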
Metrics Collection
RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|---|---|---|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| Duration | Response time | `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` |
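A single histogram covers all three RED signals: request rate from the `_count` series, error ratio via the `status` label, and latency quantiles from the buckets. A minimal sketch assuming an Express app and `prom-client`:

```ts
import express from 'express';
import client from 'prom-client';

const app = express();

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method });
  // Label with the route template, never the raw path, to keep cardinality bounded.
  res.on('finish', () =>
    end({ route: req.route?.path ?? 'unmatched', status: String(res.statusCode) })
  );
  next();
});

// Scrape endpoint for Prometheus.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```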
USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
Monitoring Stack
| Tool | Category | Best For |
|---|---|---|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
Alert Design
Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
Alert Fatigue Prevention
- Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
- Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
- Require runbook links — every alert must link to diagnosis and remediation
- Review monthly — delete or tune alerts that never fire or always fire
- Group related alerts — use inhibition rules to suppress child alerts
- Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
Dashboard Patterns
Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
Observability Checklist
Every service must have:
- Structured JSON logging with consistent schema
- Correlation / trace IDs propagated on all requests
- RED metrics exposed for every external endpoint
- Health check endpoints (`/healthz` and `/readyz`)
- Distributed tracing with OpenTelemetry
- Dashboards for RED metrics and resource utilisation
- Alerts for error rate, latency, and saturation with runbook links
- Log level configurable at runtime without redeployment (see the sketch after this list)
- PII scrubbing verified and tested
- Retention policies defined for logs, metrics, and traces
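One way to satisfy the runtime log-level item, sketched as a hypothetical admin endpoint for the Express app and Pino logger from the earlier examples (lock it behind auth in practice):

```ts
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'fatal'];

// Hypothetical route: flip the active level without a redeploy.
app.post('/admin/log-level', (req, res) => {
  const { level } = req.body;
  if (!LEVELS.includes(level)) {
    return res.status(400).json({ error: `level must be one of ${LEVELS.join(', ')}` });
  }
  logger.level = level; // Pino applies the new threshold immediately
  res.json({ level: logger.level });
});
```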
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
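For the "Logging PII" row, Pino's built-in `redact` option is often enough; a minimal sketch (the listed paths are examples, not a complete inventory):

```ts
import pino from 'pino';

// Path-based redaction: matched fields are masked before serialisation,
// so secrets never reach stdout or the log pipeline.
const logger = pino({
  redact: {
    paths: ['password', 'token', 'req.headers.authorization', 'user.email', 'card.number'],
    censor: '[REDACTED]',
  },
});

logger.info({ user: { id: 'u_82', email: 'a@example.com' } }, 'user logged in');
// => ... "user":{"id":"u_82","email":"[REDACTED]"} ...
```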
NEVER Do
- NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
- NEVER use console.log / print in production — use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
- NEVER create alerts without a runbook link — unactionable alerts erode trust
- NEVER rely on logs alone — you need metrics and traces for full observability
- NEVER log request/response bodies by default — opt-in only, with PII redaction
- NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
- NEVER skip context propagation in async flows — broken traces are worse than no traces