Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

data-anonymizer

v1.0.0

License: MIT-0

Data Anonymizer

Anonymize production data for safe use in testing, development, and analytics. Detect PII automatically, apply appropriate anonymization strategies (masking, hashing, synthetic replacement, generalization), and generate realistic fake data that preserves data relationships and statistical properties.

Use when: "anonymize data", "mask PII", "create test data from production", "GDPR compliance", "data masking", "remove personal data", "sanitize database", "fake data generation", or when preparing production data for non-production use.

Commands

1. detect — Find PII in Data Sources

Step 1: Scan for PII Patterns

# Scan files for common PII patterns (ripgrep skips binary files by default)
rg -n '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' 2>/dev/null | head -20
echo "--- Emails found above ---"

rg -n '\b\d{3}[-.]?\d{2}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- SSN-like patterns above ---"

rg -n '\b\d{3}[-.]?\d{3}[-.]?\d{4}\b' 2>/dev/null | head -20
echo "--- Phone numbers above ---"

rg -n '\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b' 2>/dev/null | head -20
echo "--- Credit card-like patterns above ---"
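
The credit card-like pattern above matches any 16-digit run, so order IDs and timestamps can show up as false positives. One common way to narrow the results is the Luhn checksum that payment card numbers satisfy. The following is a minimal sketch, not part of the skill itself; the names `CARD_RE`, `luhn_valid`, and `find_card_candidates` are hypothetical.

```python
import re

# Same 16-digit shape as the ripgrep pattern above.
CARD_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_candidates(text: str) -> list[str]:
    """Return 16-digit matches that also pass the Luhn check."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]
```

A Luhn pass does not prove a value is a real card number; it only removes matches that cannot be one.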

Step 2: Scan Database Schema

# Find columns likely containing PII (by name pattern)
python3 - <<'EOF'
pii_column_patterns = [
    'email', 'phone', 'address', 'street', 'city', 'zip', 'postal',
    'ssn', 'social_security', 'tax_id', 'national_id',
    'first_name', 'last_name', 'full_name', 'name',
    'birth', 'dob', 'date_of_birth', 'age',
    'credit_card', 'card_number', 'cvv', 'expiry',
    'ip_address', 'ip', 'user_agent',
    'password', 'secret', 'token', 'api_key',
    'latitude', 'longitude', 'lat', 'lng', 'geo',
    'photo', 'avatar', 'image_url',
    'salary', 'income', 'bank_account', 'iban', 'routing',
]

# List the patterns, then print a ripgrep command that searches
# SQL dumps or migration files for any of them.
for pattern in pii_column_patterns:
    print(f'  - {pattern}*')
print('\nUse these patterns to grep your database schema:')
print('rg -i "(' + '|'.join(pii_column_patterns) + ')" migrations/ schema.sql')
EOF

Step 3: Classify Sensitivity

| Level | Data Types | Strategy |
|-------|------------|----------|
| Critical | SSN, credit card, passwords, API keys | Delete or hash (irreversible) |
| High | Email, phone, full name, address | Synthetic replacement |
| Medium | Date of birth, IP address, location | Generalization (year only, /24 subnet) |
| Low | Age range, city, job title | Keep or slight perturbation |

2. anonymize — Apply Anonymization

Strategy 1: Synthetic Replacement (recommended for test data)

# Generate realistic fake data preserving format and relationships
import hashlib

from faker import Faker  # third-party: pip install faker

def _stable_seed(value):
    """Seed that is identical across runs (built-in hash() of a str is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

def anonymize_email(email):
    """Consistent fake email: the same input always produces the same output."""
    h = hashlib.sha256(email.encode()).hexdigest()[:8]
    # The original domain is kept (prefixed with 'test-'); switch to a fixed
    # domain such as 'example.com' if the domain itself is sensitive.
    domain = email.rsplit('@', 1)[1] if '@' in email else 'example.com'
    return f"user_{h}@test-{domain}"

def anonymize_name(name):
    """Replace with a consistent fake name."""
    fake = Faker()
    fake.seed_instance(_stable_seed(name))
    return fake.name()

def anonymize_phone(phone):
    """Keep the format, replace the digits."""
    # The decimal expansion of the hash gives a long, stable digit stream.
    digits = str(int(hashlib.sha256(phone.encode()).hexdigest(), 16))
    result = ''
    d = 0
    for c in phone:
        if c.isdigit():
            result += digits[d % len(digits)]
            d += 1
        else:
            result += c
    return result

def anonymize_address(address):
    """Replace with a consistent fake address (Faker's default locale, not necessarily the same region)."""
    fake = Faker()
    fake.seed_instance(_stable_seed(address))
    return fake.address()
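
A plain SHA-256 of an email or phone number can be reversed by hashing a list of candidate values and comparing, because the input space is small. If that matters for your threat model, a keyed hash keeps the "same input, same output" property while making the mapping depend on a secret. The following is a minimal sketch using the standard-library `hmac` module; `pseudonymize`, `pseudonymize_email`, and the key handling are assumptions, not part of the skill.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 12) -> str:
    """Return a stable, keyed token for `value` (case and outer whitespace ignored)."""
    normalized = value.strip().lower().encode("utf-8")
    mac = hmac.new(key, normalized, hashlib.sha256)
    return mac.hexdigest()[:length]


def pseudonymize_email(email: str, key: bytes) -> str:
    """Keyed replacement email on a reserved example domain."""
    return f"user_{pseudonymize(email, key)}@example.com"
```

The key should live outside the anonymized dataset (for example in a secrets manager). Discarding it after the run makes the mapping impractical to recompute, at the cost of not being able to anonymize later batches consistently.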

Strategy 2: Masking (quick, for logs/exports)

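def mask_email(email):
    """Keep the first two characters of the local part and the full domain."""
    if '@' not in email:
        return '***'
    local, domain = email.rsplit('@', 1)
    return f"{local[:2]}***@{domain}"

def mask_phone(phone):
    """Keep the first three and last two characters."""
    return phone[:3] + '***' + phone[-2:]

def mask_ssn(ssn):
    """Keep only the last four characters."""
    return '***-**-' + ssn[-4:]

def mask_card(card):
    """Keep only the last four characters."""
    return '****-****-****-' + card[-4:]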
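
The classification table lists generalization (year only, /24 subnet) for medium-sensitivity fields, but no code for it appears in this section. A minimal sketch follows, using only the standard library. The function names are hypothetical, and rounding coordinates to one decimal place is an assumption about what "location" generalization could look like, not something the skill specifies.

```python
import ipaddress
from datetime import date


def generalize_birth_date(dob: date) -> int:
    """Keep only the birth year."""
    return dob.year


def generalize_ip(ip: str) -> str:
    """Collapse an IPv4 address to its /24 network."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    # strict=False lets ip_network zero out the host bits for us.
    return str(ipaddress.ip_network(f"{addr}/24", strict=False))


def generalize_coordinate(value: float, decimals: int = 1) -> float:
    """Round a latitude or longitude (assumed granularity: one decimal place)."""
    return round(value, decimals)
```

Generalization trades precision for privacy, so the right granularity depends on how many individuals share each generalized value.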

Strategy 3: SQL-Level Anonymization

-- PostgreSQL anonymization script
-- hashtext() is an internal PostgreSQL function, not part of the documented API;
-- the cast to bigint avoids an overflow in abs() for the most negative integer.
UPDATE users SET
    email = 'user_' || md5(email) || '@example.com',
    first_name = 'User',
    last_name = 'Test_' || substring(md5(last_name) from 1 for 6),
    phone = '+1' || lpad(abs(hashtext(phone)::bigint)::text, 10, '0'),
    address_line1 = floor(random() * 9999)::text || ' Test Street',
    city = 'Testville',
    zip_code = lpad(abs(hashtext(zip_code)::bigint)::text, 5, '0'),
    date_of_birth = date_of_birth - (random() * 365)::int * interval '1 day',
    ssn = NULL;

-- Spot-check: list any emails that were not rewritten (expect zero rows)
SELECT email FROM users WHERE email NOT LIKE '%@example.com' LIMIT 5;

3. verify — Validate Anonymization

After anonymization, verify:

  • No real email addresses remain (every value matches the anonymized pattern, such as the @example.com domain)
  • No real phone numbers remain (values keep a valid format but none equals an original number)
  • Statistical properties preserved (age distribution, geographic spread)
  • Referential integrity maintained (FK relationships intact)
  • Uniqueness constraints respected (no duplicate generated values)
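
Several of the checks above can be automated once the original and anonymized values are loaded side by side. The sketch below covers leakage, uniqueness, and referential integrity on plain Python lists; the function names and the `example.com` default are assumptions, and a real run would read the values from the database.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def leaked_values(original: list[str], anonymized: list[str]) -> set[str]:
    """Original values that still appear verbatim after anonymization."""
    return set(original) & set(anonymized)


def duplicates(values: list[str]) -> list[str]:
    """Values that occur more than once (would violate a UNIQUE constraint)."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def orphaned_keys(child_fks: list[int], parent_ids: list[int]) -> set[int]:
    """Foreign-key values with no matching parent row."""
    return set(child_fks) - set(parent_ids)


def non_anonymized_emails(values: list[str], allowed_domain: str = "example.com") -> list[str]:
    """Email-shaped values that are not on the anonymized domain."""
    suffix = "@" + allowed_domain
    return [v for v in values if EMAIL_RE.fullmatch(v) and not v.endswith(suffix)]
```

Each function returns the offending values, so an empty result means the check passed.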

4. report — Generate Compliance Report

# Data Anonymization Report

## Scope
- Database: production_backup_20260429
- Tables processed: 15
- Records processed: 2.3M

## PII Found and Anonymized
| Column | Table | Records | Strategy | Verified |
|--------|-------|---------|----------|----------|
| email | users | 150,000 | Synthetic | ✅ |
| phone | users | 148,322 | Synthetic | ✅ |
| ssn | employees | 1,200 | Deleted | ✅ |
| address | orders | 890,000 | Synthetic | ✅ |
| ip_address | logs | 5.2M | Generalized (/24) | ✅ |

## Verification
- ✅ No real emails in anonymized data
- ✅ Foreign key integrity preserved
- ✅ Unique constraints satisfied
- ✅ Statistical distributions preserved (±5%)
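
The "PII Found and Anonymized" table in a report like the one above can be rendered from per-column results, which avoids hand-editing counts. This is a minimal sketch; `ColumnResult` and `render_pii_table` are hypothetical names, and the fields simply mirror the table columns.

```python
from dataclasses import dataclass


@dataclass
class ColumnResult:
    column: str
    table: str
    records: int
    strategy: str
    verified: bool


def render_pii_table(results: list[ColumnResult]) -> str:
    """Render results as a markdown table with the report's column layout."""
    lines = [
        "| Column | Table | Records | Strategy | Verified |",
        "|--------|-------|---------|----------|----------|",
    ]
    for r in results:
        mark = "✅" if r.verified else "❌"
        # {:,} adds thousands separators, e.g. 150000 -> 150,000.
        lines.append(f"| {r.column} | {r.table} | {r.records:,} | {r.strategy} | {mark} |")
    return "\n".join(lines)
```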
