multi-source-data-cleaner-pro

Data & APIs

EN: Production-grade data cleaning across heterogeneous sources (CSV/Excel/JSON/Parquet/SQL dumps/log files). Profiles schemas, detects encoding/delimiter, normalizes types, handles missing values, deduplicates fuzzy records, reconciles schema across sources, and outputs a clean unified dataset plus a full data-quality report. Use when user provides one or more dirty datasets and asks "清洗数据 / 合并数据 / 去重 / 缺失值处理 / data cleaning / dedup / schema reconcile". 中文:跨异构来源(CSV/Excel/JSON/Parquet/SQL 导出/日志文件)的工业级数据清洗。剖析 schema、自动识别编码与分隔符、归一化类型、处理缺失值、模糊去重、跨源字段对齐,输出统一的干净数据集与完整数据质量报告。当用户提供脏数据并要求"清洗/合并/去重/缺失值处理"时触发。

Install

openclaw skills install multi-source-data-cleaner-pro

Multi-Source Data Cleaner · 多源数据清洗

Drop a folder of CSVs, Excels, and JSONs from 5 different teams; get back a single clean table, a deduplication report, and a data-quality scorecard. No manual schema mapping required.

把 5 个部门各种格式的 CSV/Excel/JSON 一起扔进来,自动给你一张干净统一表、去重报告、数据质量评分。无需手工配字段映射。


🎯 When to Use · 何时使用

Trigger keywords (中文): 清洗数据、数据清洗、合并数据、去重、缺失值、字段对齐、schema 合并、数据质量、数据预处理、ETL

Trigger keywords (EN): clean data, data cleaning, deduplicate, missing values, schema reconcile, ETL, data quality, profile dataset

Supported sources:

格式 / Format说明
CSV / TSVAuto-detect encoding (UTF-8/GBK/BIG5), delimiter, quote char, header row
Excel (.xlsx/.xls/.xlsm)Multi-sheet, merged cells, formula values
JSON / JSONL / NDJSONNested structures auto-flattened
Parquet / FeatherNative columnar reading
SQL dumps (.sql)MySQL / PostgreSQL INSERT extraction
Log filesPattern-detected structured lines

Do NOT use when:

  • Input is unstructured free text (use NLP extraction skills first)
  • Input is binary/proprietary format with no parser (Adobe Indesign, custom CAD, etc.)
  • User wants real-time streaming cleaning (this is batch-oriented)

📋 Cleaning Pipeline · 清洗流程

Step 1: Source profiling · 源剖析

python3 scripts/profile.py --input <file-or-dir> --out profile.json

For each source produces:

  • File format, encoding, line endings
  • Schema (columns, inferred types, null rates, cardinality)
  • Sample rows
  • Quality flags: encoding mismatches, type inconsistencies, suspicious patterns

Step 2: Type inference & normalization · 类型推断与归一

scripts/normalize_types.py standardizes:

  • Numbers: thousands separators, scientific notation, currency symbols → numeric
  • Dates: 50+ formats (2024-03-15, 2024/3/15, 15 Mar 2024, 民国113年3月15日, Excel serial) → ISO 8601
  • Booleans: Y/N/是/否/0/1/true/false/T/F/✓/✗ → boolean
  • Phone numbers: normalize to E.164
  • Chinese names: full-width / half-width normalization
  • IDs: zero-padding, prefix detection

Step 3: Missing value handling · 缺失值处理

Per-column strategy (configurable in templates/missing_strategy.json):

  • drop_row — drop rows where this column is null
  • mean|median|mode — statistical imputation (with imputation flag column)
  • constant:<value> — fill with literal
  • forward_fill — for time-series
  • interpolate — linear/spline for numeric series
  • keep_null — preserve as null (default for unknown)

Critical rule: every imputed value gets a sidecar <col>_imputed boolean column so downstream analysis can distinguish original vs. imputed data.

Step 4: Schema reconciliation · Schema 合并

scripts/reconcile_schema.py aligns columns across sources using:

  • Exact name match
  • Fuzzy match (Levenshtein + Chinese pinyin)
  • Type compatibility check
  • User-supplied mapping override (--mapping mapping.yaml)

Outputs a crosswalk.json documenting every column mapping for audit.

Step 5: Fuzzy deduplication · 模糊去重

scripts/dedup.py uses configurable blocking + record linkage:

  • Blocking keys to narrow candidates (e.g. first 3 chars of name + phone last 4)
  • Similarity scoring: Jaro-Winkler for names, token-set for addresses, exact for IDs
  • Threshold-based merge with conflict resolution rules (newest wins / longest non-null / authoritative source priority)

Reports merge groups for human review before commit.

Step 6: PII handling · 隐私字段处理

Per CLEANER_PII_POLICY:

  • keep — leave as-is (use only with explicit user authorization)
  • mask — partial mask (王*三, 138****5678, 4400****1234)
  • drop — remove column entirely

Auto-detection of common PII: 姓名、身份证号、手机号、邮箱、地址、银行卡号、IP、车牌号。

Step 7: Data quality report · 数据质量报告

python3 scripts/quality_report.py --input cleaned.parquet --out dq_report.md

Six dimensions (per DAMA-DMBOK):

  • Completeness (完整性)
  • Accuracy (准确性, sample validation)
  • Consistency (一致性, cross-column rules)
  • Timeliness (时效性)
  • Uniqueness (唯一性, dedup outcome)
  • Validity (有效性, regex/range checks)

Each scored 0-100 with drill-down detail.


📤 Output Format · 输出格式

output/
├── cleaned.parquet              # main clean dataset (or .csv if requested)
├── crosswalk.json               # source → target schema mapping
├── dedup_groups.json            # merged record groups for review
├── dq_report.md                 # human-readable data quality report
├── dq_report.json               # machine-readable DQ metrics
├── audit/
│   ├── per_source_profile.json
│   ├── imputation_log.csv
│   └── pii_actions.log
└── provenance.csv               # row-level lineage: which source each row came from

⚠️ Safety & Compliance · 安全合规

  1. No silent data loss — every drop/merge/impute action logged in audit/.
  2. Imputation flags mandatory — imputed values marked so they cannot masquerade as originals.
  3. PII default mask — unless user explicitly authorizes keep, PII is masked.
  4. Reversibility — original sources never modified; cleaning is non-destructive.
  5. Dedup human-in-the-loop — fuzzy merges above threshold but below 0.95 confidence flagged for review, not auto-committed.
  6. No external network calls — all processing local; no data leaves the workspace.

不静默丢数据,所有删除/合并/填充均记录到 audit/;填充值带标志列防止假冒原值;隐私字段默认脱敏;原始文件不修改;模糊去重低置信度合并强制人工复核;不向外部上传任何数据。


🚀 Usage Examples · 使用示例

Example 1: Clean a single messy CSV

python3 scripts/run_pipeline.py \
  --input sales_q1.csv \
  --output-dir ./cleaned_q1/ \
  --pii-policy mask

Example 2: Merge 3 source CSVs into unified customer table

python3 scripts/run_pipeline.py \
  --input ./customer_sources/ \
  --output-dir ./unified_customers/ \
  --dedup-keys name,phone \
  --priority-source crm_export.csv

Example 3: Schema reconcile with manual mapping override

python3 scripts/run_pipeline.py \
  --input ./multi_team_data/ \
  --mapping mapping.yaml \
  --output-dir ./unified/

mapping.yaml:

target_schema:
  customer_id: { aliases: [客户ID, cust_id, ClientID, 编号] }
  phone:        { aliases: [手机, 联系电话, Mobile, tel] }
  signup_date:  { aliases: [注册日期, 开户日期, CreatedAt], type: date }

Example 4: Quality scan only (read-only audit)

python3 scripts/profile.py --input ./suspicious_dataset/ --out dq_audit.md --read-only

🧪 Testing · 测试

cd tests && python3 -m pytest -v

Fixtures include:

  • Encoding test set (UTF-8 BOM, GBK, BIG5, Latin1)
  • 12 date format variants
  • Schema-drift simulation across 5 source files
  • Synthetic dedup dataset (10k records with controlled duplication)
  • PII regression suite

📚 References · 参考资料

  • DAMA-DMBOK Data Quality dimensions
  • Fellegi-Sunter probabilistic record linkage
  • Jaro-Winkler distance for fuzzy match
  • pandas, pyarrow, recordlinkage library docs

🏷️ Tags · 标签

data ETL data-cleaning dedup schema-reconcile data-quality 数据清洗 多源整合 去重 数据质量