{"skill":{"slug":"log-pii-redactor","displayName":"Log Pii Redactor","summary":"Detect and redact personally identifiable information (PII) in application logs to comply with GDPR, CCPA, HIPAA, and PCI DSS. Knows the realistic 2026 PII s...","description":"---\nname: log-pii-redactor\ndescription: Detect and redact personally identifiable information (PII) in application logs to comply with GDPR, CCPA, HIPAA, and PCI DSS. Knows the realistic 2026 PII surface — emails, phone numbers, SSNs, credit cards, IPv4/IPv6, JWT tokens, API keys, cloud secret patterns, addresses, names leaked via headers and stack traces. Picks the right strategy per field (irreversible mask vs deterministic tokenize vs salted hash vs drop) and ships a regex pack, a pre-prod scanner, and integration recipes for Fluent Bit, Logstash, Vector, and the OpenTelemetry Collector. Maps every redaction to the relevant compliance clause (GDPR Art 5/32, HIPAA Safe Harbor §164.514(b)(2), PCI DSS 3.4/3.5). Use when asked to scrub logs, build a redaction pipeline, audit a log stream for PII, design a tokenization scheme, prep for a SOC 2 or HIPAA audit, or stop sensitive data flowing into Datadog/Splunk/ELK/S3.\nmetadata:\n  tags: [\"pii\", \"logging\", \"redaction\", \"gdpr\", \"ccpa\", \"hipaa\", \"pci-dss\", \"compliance\", \"observability\", \"data-privacy\"]\n---\n\n# Log PII Redactor\n\nFind the PII bleeding into your logs, decide what to do with each kind, and put a deterministic redaction step in front of every sink that stores or indexes log data. The goal is not \"no PII anywhere\"; it is *no unredacted PII reaching a destination that retains it*. 
That distinction governs every design choice below.\n\n## Usage\n\n**Basic invocation:**\n> Audit my JSON app logs for PII\n> Build a Fluent Bit redaction filter\n> Should I mask, tokenize, or hash user IDs?\n> Write a pre-prod scanner that fails CI if PII is found\n> Map our redaction rules to HIPAA Safe Harbor\n\n**With context:**\n> Rails app, JSON logs to Datadog, EU users, GDPR scope\n> Fintech, PCI DSS Level 1, card data leaks suspected in payment service logs\n> Healthcare SaaS, HIPAA, log pipeline is Vector → S3 → Athena\n> Microservices, OTel Collector in front of Loki, names leaking via X-Forwarded-For and OAuth profile headers\n\nThe skill returns a regex pack, a per-field strategy table, integration config for the user's pipeline, a scanner script, and a compliance mapping table.\n\n## The Three Real Questions\n\nMost PII redaction projects fail because they conflate three different problems:\n\n1. **Detection** — what counts as PII *in your data*? (Emails are easy; \"names in stack traces\" is not.)\n2. **Strategy** — for each field, what action preserves the log's debugging value while removing the privacy harm?\n3. **Placement** — at which point in the pipeline do you redact? (Source, agent, collector, or sink — only one of these is correct, and it depends on threat model.)\n\nSolve them in that order. Skipping detection produces strategies for problems you don't have. Skipping strategy produces config that breaks debugging. Skipping placement produces redacted Splunk and unredacted S3 cold storage with five-year retention — the worst outcome.\n\n## Step 1: Detection — The PII Surface\n\nPII is broader than the obvious fields. 
The realistic surface in a 2026 web app:\n\n### Direct identifiers (always PII)\n\n| Type | Pattern shape | Notes |\n|---|---|---|\n| Email | RFC 5322-ish | The most common leak; appears in user objects, audit logs, error messages, OAuth callbacks |\n| Phone | E.164 + national formats | `+12025551234`, `(202) 555-1234`, `+44 20 7946 0958` |\n| SSN (US) | `\\d{3}-\\d{2}-\\d{4}` | Plus the unhyphenated variant; never log raw |\n| National IDs | Country-specific | UK NINO, Canadian SIN, German Steuer-ID, Indian Aadhaar — each has its own format |\n| Credit card | 13–19 digits, Luhn-valid | PCI DSS scope; redaction is mandatory, not optional |\n| IBAN | 2 letters + 2 digits + up to 30 alphanumeric | EU bank accounts |\n| Passport / Driver's license | Country-specific | Often appears in KYC flows |\n| Date of birth | Many formats | PII alone in HIPAA, quasi-identifier in GDPR |\n\n### Network identifiers (PII under GDPR)\n\n| Type | Pattern | GDPR status |\n|---|---|---|\n| IPv4 | `\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b` | PII — Recital 30, confirmed *Breyer v. 
Germany* |\n| IPv6 | RFC 4291 | PII; same rationale |\n| MAC address | `([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}` | PII; device-level identifier |\n| User-Agent | Free text | Quasi-identifier; combined with IP, fingerprintable |\n| Session/cookie IDs | Opaque tokens | PII when stable across requests |\n\n### Secrets (not PII per se, but leak-equivalents)\n\n| Type | Pattern hint |\n|---|---|\n| JWT | `eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+` |\n| AWS access key | `AKIA[0-9A-Z]{16}` |\n| AWS secret | 40-char base64 in `aws_secret_access_key` context |\n| GitHub PAT | `ghp_[0-9A-Za-z]{36}` |\n| Slack token | `xox[baprs]-[0-9A-Za-z-]+` |\n| Stripe key | `sk_live_[0-9A-Za-z]{24,}` |\n| Generic bearer | `Bearer [A-Za-z0-9._-]+` in Authorization header |\n| Private key | `-----BEGIN [A-Z ]*PRIVATE KEY-----` |\n\n### Indirect identifiers (the hard ones)\n\nThese are why naive regex packs fail:\n\n- **Names** in `X-Forwarded-For`, `User-Agent` extensions, OAuth profile dumps, CRM webhook bodies, exception messages (`User Petro Pankov not found`)\n- **Addresses** in delivery webhooks, geocoder responses, error contexts\n- **Free-text fields** that customers stuff with PII (support ticket bodies, search queries)\n- **URLs** with PII in query strings (`?email=alice@example.com&token=...`)\n- **Stack traces** that include serialized object dumps with user data\n- **GraphQL/SQL parameters** logged on slow-query traces\n\nIndirect identifiers can rarely be regex-matched cleanly. Strategy: redact at the *structured field* level by name (key allowlist/denylist), not by content scanning.\n\n## Step 2: Strategy — Mask, Tokenize, Hash, or Drop\n\nEvery PII field gets one of four treatments. 
The choice depends on whether you need to *correlate* logs after redaction.\n\n### Mask (irreversible, character-level)\n\n- `alice@example.com` → `a****@e******.com` or `***REDACTED***`\n- Use when: humans read the log, no need to correlate across entries\n- Pros: simple; safe\n- Cons: cannot pivot (\"show all errors for this user\")\n\n### Tokenize (deterministic, reversible with vault)\n\n- `alice@example.com` → `tok_a8f3c2...` (token in log; mapping in a separate vault)\n- Use when: you need to correlate across logs but never reverse without authorization\n- Pros: full debugging capability via vault lookup\n- Cons: requires vault infrastructure (usually a small Postgres + KMS-encrypted lookup service)\n\n### Hash (deterministic, irreversible, salted)\n\n- `alice@example.com` → `HMAC-SHA256(salt, value)` truncated to 16 hex chars\n- Use when: correlation needed, reversal forbidden (HIPAA Safe Harbor compliant)\n- Pros: no vault; deterministic across services if salt is shared\n- Cons: small input spaces (e.g., phone numbers) can be enumerated by anyone who obtains the salt, so rotate it quarterly; truncating the output does not prevent that, and values hashed under different salts no longer correlate\n- Critical: salt must be in KMS, never in the redaction config file\n\n### Drop (the field never enters the log)\n\n- The field is removed entirely or replaced with a fixed sentinel (`\"<dropped>\"`)\n- Use when: the field has no debugging value (raw card numbers, passwords, private keys)\n- Always-drop list (no exceptions): passwords, raw card PANs, CVVs, private keys, full session cookies, OAuth refresh tokens\n\n### Decision matrix\n\n| Field | Default strategy | Why |\n|---|---|---|\n| Password | Drop | Zero debug value; PCI/SOX/PII in one |\n| Credit card PAN | Drop, or mask all but the last 4 (`****-****-****-1234`) | PCI DSS 3.4 |\n| CVV | Drop | PCI DSS 3.2 — must never be stored |\n| Email | Hash (HMAC) | Correlation valuable; reversal not needed |\n| Phone | Hash | Same |\n| SSN/National ID | Drop | No debug value justifies retention |\n| User ID (internal) | 
Pass through | Already a pseudonym if generated server-side |\n| IP address | Truncate (`/24` v4, `/64` v6) or hash | GDPR-acceptable; preserves geo signal |\n| JWT | Drop payload, keep header for debugging | Payload has user claims |\n| API key | Drop | No debug case justifies retention |\n| Names in free text | Tokenize via NER pre-pass | Or drop the whole field if low-value |\n| URL query params | Allowlist params; drop unknowns | `?token=` always drops |\n| User-Agent | Pass through | Quasi-identifier; usually acceptable |\n| Stack trace | Scan + scrub | Apply field-level redaction inside |\n\n## Step 3: The Regex Pack\n\nA practical pack for log-stream filters. Keep these in one config as the source of truth. The syntax is Python/PCRE-style, so port patterns per tool: Vector (Rust `regex`) and the OTel Collector (Go RE2) reject the lookahead in `ssn_us_unhyphenated`, and Fluent Bit's Lua filter takes Lua patterns rather than regex.\n\n```yaml\n# pii-patterns.yaml — share across all log agents\npatterns:\n  email: '\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b'\n  phone_e164: '\\+[1-9]\\d{1,14}\\b'\n  phone_us: '(?:\\(\\d{3}\\)|\\b\\d{3})[-.\\s]?\\d{3}[-.\\s]?\\d{4}\\b'\n  ssn_us: '\\b\\d{3}-\\d{2}-\\d{4}\\b'\n  ssn_us_unhyphenated: '\\b(?!000|666|9\\d{2})\\d{9}\\b'  # context-checked\n  credit_card: '\\b(?:\\d[ -]*?){13,19}\\b'  # validate Luhn after match\n  ipv4: '\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b'\n  ipv6: '\\b(?:[0-9a-fA-F]{1,4}:){2,7}[0-9a-fA-F]{1,4}\\b'\n  jwt: '\\beyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\b'\n  aws_access_key: '\\bAKIA[0-9A-Z]{16}\\b'\n  github_pat: '\\bghp_[0-9A-Za-z]{36}\\b'\n  stripe_live_key: '\\bsk_live_[0-9A-Za-z]{24,}\\b'\n  slack_token: '\\bxox[baprs]-[0-9A-Za-z-]{10,}\\b'\n  bearer_token: '(?i)\\bBearer\\s+[A-Za-z0-9._-]+'\n  private_key_block: '-----BEGIN [A-Z ]*PRIVATE KEY-----[\\s\\S]+?-----END [A-Z ]*PRIVATE KEY-----'\n  iban: '\\b[A-Z]{2}\\d{2}[A-Z0-9]{1,30}\\b'\n```\n\nCritical caveats:\n\n- **Credit card regex must be Luhn-validated post-match** or you'll redact every order ID\n- **SSN unhyphenated requires context** (preceding `ssn`/`social`) or it false-positives on every 9-digit 
number\n- **IPv4 regex also matches four-part version strings** (`4.5.6.7`, `5.0.0.1`) alongside real addresses such as `192.168.0.1`; allowlist private ranges if they are not sensitive in your environment\n- **Email regex over-matches** on things like `path/to/file@domain` — acceptable cost\n\n## Step 4: The Pre-Production Scanner\n\nRun this in CI against a sample of staging logs *before* production rollout. It's the cheapest way to catch what your redaction rules miss.\n\n```python\n#!/usr/bin/env python3\n# pii_scan.py — fail CI if PII patterns appear in a log file\nimport hashlib, json, re, sys\nfrom pathlib import Path\n\nimport yaml  # third-party: PyYAML\n\nPATTERNS = yaml.safe_load(Path(\"pii-patterns.yaml\").read_text())[\"patterns\"]\nCOMPILED = {name: re.compile(p) for name, p in PATTERNS.items()}\n\ndef luhn_ok(num):\n    digits = [int(c) for c in num if c.isdigit()]\n    if not 13 <= len(digits) <= 19:\n        return False\n    checksum = 0\n    for i, d in enumerate(reversed(digits)):\n        if i % 2 == 1:\n            d *= 2\n            if d > 9:\n                d -= 9\n        checksum += d\n    return checksum % 10 == 0\n\ndef scan_line(line, lineno):\n    findings = []\n    for name, rx in COMPILED.items():\n        for m in rx.finditer(line):\n            val = m.group(0)\n            if name == \"credit_card\" and not luhn_ok(val):\n                continue\n            # Report a short hash of the match; never log the match itself\n            sample = hashlib.sha256(val.encode()).hexdigest()[:8]\n            findings.append((lineno, name, sample))\n    return findings\n\ndef main(path, threshold=0):\n    findings = []\n    with open(path, encoding=\"utf-8\", errors=\"replace\") as f:\n        for i, line in enumerate(f, 1):\n            findings.extend(scan_line(line, i))\n    by_type = {}\n    for _, name, _ in findings:\n        by_type[name] = by_type.get(name, 0) + 1\n    print(json.dumps({\"total\": len(findings), \"by_type\": by_type}, indent=2))\n    sys.exit(1 if len(findings) > threshold else 0)\n\nif __name__ == \"__main__\":\n    main(sys.argv[1], int(sys.argv[2]) if 
len(sys.argv) > 2 else 0)\n```\n\nRun as a CI gate against any sample of pre-prod logs. Hashed samples in the report let engineers triage without re-leaking.\n\n## Step 5: Pipeline Integration Recipes\n\n### Fluent Bit\n\n```ini\n[FILTER]\n    Name         modify\n    Match        app.*\n    Remove       password\n    Remove       authorization\n    Remove       cvv\n\n[FILTER]\n    Name         lua\n    Match        app.*\n    script       /fluent-bit/scripts/redact.lua\n    call         redact\n```\n\n```lua\n-- /fluent-bit/scripts/redact.lua: scrub every top-level string value.\n-- These are Lua patterns, not regex, so they are hand-ported from the pack.\nfunction redact(tag, ts, record)\n    for k, v in pairs(record) do\n        if type(v) == \"string\" then\n            v = string.gsub(v, \"[%w._%%+-]+@[%w.-]+%.%a%a+\", \"<EMAIL>\")\n            v = string.gsub(v, \"%+%d[%d ]+\", \"<PHONE>\")\n            v = string.gsub(v, \"Bearer [%w._-]+\", \"Bearer <REDACTED>\")\n            record[k] = v\n        end\n    end\n    return 1, ts, record  -- 1 = record (and timestamp) modified\nend\n```\n\n### Vector (the cleanest option for new pipelines)\n\n```toml\n[transforms.redact_pii]\ntype = \"remap\"\ninputs = [\"app_logs\"]\nsource = '''\n. 
= redact(., redactor: \"full\", filters: [\"us_social_security_number\"])\nif exists(.email) { .email = encode_base16(hmac(string!(.email), key: get_env_var!(\"PII_HMAC_KEY\"), algorithm: \"SHA-256\")) }\nif exists(.phone) { .phone = encode_base16(hmac(string!(.phone), key: get_env_var!(\"PII_HMAC_KEY\"), algorithm: \"SHA-256\")) }\ndel(.password)\ndel(.cvv)\ndel(.authorization)\nif exists(.client_ip) { .client_ip = ip_subnet!(.client_ip, \"/24\") }\n'''\n```\n\nVector's built-in `redact` function ships a named SSN filter and also accepts custom regex patterns; the example above adds field-level HMAC for correlation, hex-encoded because `hmac` returns raw bytes.\n\n### Logstash\n\n```ruby\nfilter {\n  mutate {\n    remove_field => [ \"password\", \"cvv\", \"[headers][authorization]\" ]\n  }\n  ruby {\n    code => '\n      h = event.to_hash\n      h.each do |k, v|\n        next unless v.is_a?(String)\n        v = v.gsub(/[\\w.+-]+@[\\w.-]+\\.\\w+/, \"<EMAIL>\")\n        v = v.gsub(/Bearer\\s+[\\w.-]+/, \"Bearer <REDACTED>\")\n        v = v.gsub(/eyJ[\\w-]+\\.[\\w-]+\\.[\\w-]+/, \"<JWT>\")\n        event.set(k, v)\n      end\n    '\n  }\n}\n```\n\n### OpenTelemetry Collector\n\n```yaml\nprocessors:\n  redaction:\n    allow_all_keys: true\n    blocked_values:\n      - '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}'\n      - 'eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+'\n      - '\\b(?:\\d[ -]*?){13,19}\\b'\n    summary: debug\n\n  attributes:\n    actions:\n      - key: http.request.header.authorization\n        action: delete\n      - key: http.request.body.password\n        action: delete\n      - key: enduser.id\n        action: hash\n```\n\nThe OTel `redaction` processor handles regex-style scrubbing; `attributes` handles field-level deletion and hashing.\n\n## Step 6: Placement Decision\n\n| Where you redact | Pros | Cons | Use when |\n|---|---|---|---|\n| In application code | Most precise | Every team must implement; drift over time | Highly regulated (HIPAA), small surface |\n| Sidecar agent (Fluent Bit on pod) | 
Centralized config; near source | Pod resource cost | Kubernetes; multi-language services |\n| Collector (Vector / OTel) | Single chokepoint; easy to audit | All raw PII still on the wire to collector | Most teams; default choice |\n| Sink-side (Datadog redaction rules) | Easy; vendor-managed | Raw PII already left your network; data leaves your control | Never as the only layer |\n\n**Default architecture:** redact in the collector (Vector or OTel) on the same VPC as the apps; treat sink-side rules as a defense-in-depth backup, never the primary control.\n\n## Compliance Mapping\n\n| Regulation | Clause | What it requires | How redaction satisfies |\n|---|---|---|---|\n| GDPR | Art 5(1)(c) data minimization | Only personal data necessary may be processed | Drop unused PII fields; hash IDs |\n| GDPR | Art 5(1)(f) integrity & confidentiality | Personal data secured against unauthorized access | Mask/hash before logs reach indexed sinks |\n| GDPR | Art 32 security of processing | Pseudonymization listed as exemplar safeguard | HMAC-with-KMS-salt = pseudonymization |\n| CCPA | §1798.140(o) | \"Personal information\" includes IP, device IDs | Truncate or hash IPs |\n| HIPAA | §164.514(b)(2) Safe Harbor | 18 identifiers must be removed | Drop names, SSNs, MRNs, full DOB, full ZIP, dates more granular than year, IPs, biometrics |\n| HIPAA | §164.312(a)(1) access control | Logs containing PHI must enforce same controls as the source | If logs aren't redacted, log store inherits PHI scope; redaction shrinks scope |\n| PCI DSS | 3.4 | PAN must be unreadable wherever stored | Mask to last-4 or drop |\n| PCI DSS | 3.5 | Cryptographic key management | HMAC salt in KMS, rotated |\n| PCI DSS | 10.5 | Audit trails secured | Redaction must not impair the audit trail's integrity (keep transaction IDs) |\n| SOC 2 | CC6.7 | Restrict transmission of PII | Redact before egress to third-party SaaS observability |\n\n## Common Pitfalls\n\n- **Redacting after sink ingestion** — the data 
is already on someone else's disk, possibly in cold-tier backup. Redact *before* the sink.\n- **Hashing without a salt** — rainbow-table attack on small spaces (phones, ZIPs) reverses your hashes in minutes.\n- **Salt in the config file** — anyone with config access can reverse the hash. Salt belongs in KMS / Vault.\n- **Forgetting backups** — log archives in S3/Glacier are long-retention. Redaction must run *before* archive write.\n- **Redacting the request ID** — kills your ability to correlate. Request IDs should be server-generated UUIDs and pass through.\n- **Trusting the application to never log PII** — it will. The redaction layer is your second line; treat the application's hygiene as best-effort, not a control.\n- **Skipping URLs and stack traces** — the highest-leak surfaces. Always scan query strings and exception messages.\n- **Indexing before redacting** — Elasticsearch keeps tokenized PII even after the source doc is overwritten. Redact upstream of the indexer.\n\n## Output Format\n\nThe skill returns:\n\n1. **PII surface report** — every field/pattern that needs redaction in the user's data\n2. **Strategy table** — per-field decision (mask/tokenize/hash/drop) with rationale\n3. **Regex pack** — yaml file ready for Vector, Fluent Bit, OTel, Logstash\n4. **Pipeline config** — full integration recipe for the user's specific stack\n5. **Pre-prod scanner** — Python script wired into CI to fail builds on PII leaks\n6. **Compliance mapping** — which clause each redaction satisfies\n7. **Placement diagram** — where in the architecture redaction runs\n8. 
**Rollout plan** — staging validation, sampled prod canary, full enable, audit log retention\n","topics":["Gdpr","Pipeline"],"tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":365,"installsAllTime":13,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1777851342819,"updatedAt":1778492842520},"latestVersion":{"version":"1.0.0","createdAt":1777851342819,"changelog":"Initial release of log-pii-redactor — a toolkit for PII detection and redaction in logs, covering modern compliance needs.\n\n- Detects a wide range of PII and secrets in logs, including emails, phones, SSNs, credit cards, JWTs, API keys, headers, stack traces, and more.\n- Provides tailored redaction strategies: masking, deterministic tokenization, salted hashing, and field dropping, mapped to compliance standards like GDPR, HIPAA, PCI DSS.\n- Includes regex packs, per-field strategy tables, integration recipes for Fluent Bit, Logstash, Vector, and OpenTelemetry Collector, plus pre-production scanners.\n- Guides users on detection, strategy selection, and placement within the log pipeline to prevent unredacted PII retention.\n- Supplies compliance mapping for audit and documentation needs.","license":"MIT-0"},"metadata":null,"owner":{"handle":"charlie-morrison","userId":"s17cttbdxry5kkyafjw983mq8s83p4y3","displayName":"charlie-morrison","image":"https://avatars.githubusercontent.com/u/271589886?v=4"},"moderation":null}