Install
openclaw skills install multi-source-data-cleaner-pro

Production-grade data cleaning across heterogeneous sources (CSV, Excel, JSON, Parquet, SQL dumps, log files). The skill profiles schemas, detects encoding and delimiter, normalizes types, handles missing values, deduplicates fuzzy records, reconciles schemas across sources, and outputs a clean unified dataset plus a full data-quality report. Use it when the user provides one or more dirty datasets and asks for "清洗数据 / 合并数据 / 去重 / 缺失值处理 / data cleaning / dedup / schema reconcile".

Drop in a folder of CSV, Excel, and JSON files from five different teams and get back a single clean table, a deduplication report, and a data-quality scorecard. No manual schema mapping required.
Trigger keywords (Chinese): 清洗数据, 数据清洗, 合并数据, 去重, 缺失值, 字段对齐, schema 合并, 数据质量, 数据预处理, ETL
Trigger keywords (EN): clean data, data cleaning, deduplicate, missing values, schema reconcile, ETL, data quality, profile dataset
Supported sources:
| Format | Notes |
|---|---|
| CSV / TSV | Auto-detect encoding (UTF-8/GBK/BIG5), delimiter, quote char, header row |
| Excel (.xlsx/.xls/.xlsm) | Multi-sheet, merged cells, formula values |
| JSON / JSONL / NDJSON | Nested structures auto-flattened |
| Parquet / Feather | Native columnar reading |
| SQL dumps (.sql) | MySQL / PostgreSQL INSERT extraction |
| Log files | Pattern-detected structured lines |
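The encoding and delimiter auto-detection listed for CSV/TSV can be approximated with the standard library alone. The following is a minimal sketch, not the skill's actual implementation: the candidate encoding order, the delimiter set, and the function names `detect_encoding` and `detect_dialect` are all assumptions made for illustration.

```python
import csv

# Assumed trial order; strict UTF-8 goes first because it rejects most non-UTF-8 bytes.
CANDIDATE_ENCODINGS = ("utf-8", "gbk", "big5")

def detect_encoding(raw: bytes) -> str:
    """Return the first candidate encoding that decodes the sample without error."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings decoded the sample")

def detect_dialect(raw: bytes):
    """Guess (encoding, delimiter, quote char) from a sample of the file's bytes."""
    enc = detect_encoding(raw)
    text = raw.decode(enc)
    # csv.Sniffer infers the delimiter from per-line character frequencies.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    return enc, dialect.delimiter, dialect.quotechar

sample = "id;name\n1;张三\n2;李四\n".encode("gbk")
encoding, delimiter, quotechar = detect_dialect(sample)
```

Trial decoding is only a heuristic: many Big5 byte sequences also decode as GBK without error, so a production detector would likely need statistical scoring on top of this.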
Do NOT use when:
python3 scripts/profile.py --input <file-or-dir> --out profile.json
For each source produces:
scripts/normalize_types.py standardizes:
- Dates (2024-03-15, 2024/3/15, 15 Mar 2024, 民国113年3月15日, Excel serial) → ISO 8601
- Y/N/是/否/0/1/true/false/T/F/✓/✗ → boolean

Per-column strategy (configurable in templates/missing_strategy.json):
- drop_row: drop rows where this column is null
- mean|median|mode: statistical imputation (with imputation flag column)
- constant:<value>: fill with a literal
- forward_fill: for time series
- interpolate: linear/spline for numeric series
- keep_null: preserve as null (default for unknown)

Critical rule: every imputed value gets a sidecar <col>_imputed boolean column so downstream analysis can distinguish original vs. imputed data.
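To make the sidecar-flag rule concrete, here is a small sketch of median imputation that writes a `<col>_imputed` column next to the filled values. It uses plain dictionaries rather than the skill's own code, and the function name `impute_median` and the sample rows are hypothetical.

```python
from statistics import median

def impute_median(rows, col):
    """Fill None values in `col` with the column median and add a `<col>_imputed` flag."""
    observed = [r[col] for r in rows if r[col] is not None]
    if not observed:
        raise ValueError(f"column {col!r} has no observed values to impute from")
    fill = median(observed)
    out = []
    for r in rows:
        missing = r[col] is None
        new = dict(r)                      # copy, so the input rows stay untouched
        new[col] = fill if missing else r[col]
        new[f"{col}_imputed"] = missing    # True only where a value was filled in
        out.append(new)
    return out

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 30.0},
]
cleaned = impute_median(rows, "amount")
```

Downstream code can then filter on `amount_imputed` to exclude filled values from, say, a variance estimate.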
scripts/reconcile_schema.py aligns columns across sources using:
- an explicit user-supplied mapping (--mapping mapping.yaml)

Outputs a crosswalk.json documenting every column mapping for audit.
scripts/dedup.py uses configurable blocking + record linkage:
Reports merge groups for human review before commit.
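Blocking followed by pairwise comparison can be sketched as below. This is an illustration under stated assumptions, not what scripts/dedup.py does: it substitutes the standard library's `difflib.SequenceMatcher` for a record-linkage library, and the blocking rule (last four phone digits), the 0.8 threshold, and all function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block_key(record):
    # Hypothetical blocking rule: only records sharing the last four phone digits are compared.
    return record["phone"][-4:]

def similarity(a, b):
    # Character-level similarity of the names, in [0, 1].
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def candidate_groups(records, threshold=0.8):
    """Return lists of record indices proposed for merging, for human review."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    parent = list(range(len(records)))      # union-find over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return [g for g in groups.values() if len(g) > 1]

records = [
    {"name": "Jon Smith", "phone": "555-0101"},
    {"name": "John Smith", "phone": "555-0101"},
    {"name": "Alice Wong", "phone": "555-0101"},
    {"name": "John Smith", "phone": "555-9999"},
]
groups = candidate_groups(records)
```

Blocking keeps the comparison count manageable, at the cost of missing duplicates that land in different blocks: here the fourth record is never compared with the second even though the names are identical.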
Per CLEANER_PII_POLICY:
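- keep: leave as-is (use only with explicit user authorization)
- mask: partial mask (王*三, 138****5678, 4400****1234)
- drop: remove the column entirely

Common PII is auto-detected: names, national ID numbers, mobile numbers, email addresses, postal addresses, bank card numbers, IP addresses, and license plate numbers.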
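The mask examples above (王*三, 138****5678, 4400****1234) suggest simple keep-the-ends rules. A minimal sketch that reproduces those three outputs follows; the function names and the exact rules (for instance, always emitting four asterisks for digit strings) are assumptions inferred from the examples, not the skill's documented behavior.

```python
def mask_name(name: str) -> str:
    """Keep the first and last character; replace everything between with asterisks."""
    if len(name) <= 1:
        return "*"
    if len(name) == 2:
        return name[0] + "*"
    return name[0] + "*" * (len(name) - 2) + name[-1]

def mask_digits(value: str, head: int, tail: int = 4) -> str:
    """Keep `head` leading and `tail` trailing characters; hide the middle behind four asterisks."""
    if len(value) <= head + tail:
        return "*" * len(value)   # too short to reveal anything safely
    return value[:head] + "****" + value[-tail:]

masked_name = mask_name("王小三")
masked_phone = mask_digits("13812345678", head=3)
masked_card = mask_digits("4400123456781234", head=4)
```

Using a fixed number of asterisks for digit strings also hides the original length, which is why a 16-digit card number and a shorter one produce masks of the same width.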
python3 scripts/quality_report.py --input cleaned.parquet --out dq_report.md
Six dimensions (per DAMA-DMBOK):
Each scored 0-100 with drill-down detail.
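As an illustration of what a 0-100 score might look like, the sketch below computes two commonly used data-quality measures, completeness and uniqueness. Whether these match the skill's six dimensions, and how the skill actually weights them, is not stated here; the formulas and function names are assumptions.

```python
def completeness(rows, columns):
    """Percentage of cells in `columns` that are neither None nor an empty string."""
    total = len(rows) * len(columns)
    if total == 0:
        return 100.0
    filled = sum(1 for r in rows for c in columns if r.get(c) not in (None, ""))
    return round(100.0 * filled / total, 1)

def uniqueness(rows, key_columns):
    """Percentage of rows whose key tuple is distinct."""
    if not rows:
        return 100.0
    distinct = {tuple(r.get(c) for c in key_columns) for r in rows}
    return round(100.0 * len(distinct) / len(rows), 1)

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": ""},
]
scores = {
    "completeness": completeness(rows, ["id", "email"]),
    "uniqueness": uniqueness(rows, ["id"]),
}
```

Here six of eight cells are populated and three of four ids are distinct, so both scores come out at 75.0.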
output/
├── cleaned.parquet # main clean dataset (or .csv if requested)
├── crosswalk.json # source → target schema mapping
├── dedup_groups.json # merged record groups for review
├── dq_report.md # human-readable data quality report
├── dq_report.json # machine-readable DQ metrics
├── audit/
│ ├── per_source_profile.json
│ ├── imputation_log.csv
│ └── pii_actions.log
└── provenance.csv # row-level lineage: which source each row came from
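Row-level lineage of the kind provenance.csv records can be captured at read time by tagging each row with its origin. The sketch below reads in-memory CSV text so it stays self-contained; the column names `_source` and `_source_line` and the second file name are hypothetical, and the real layout of provenance.csv is not specified in this document.

```python
import csv
import io

def read_with_lineage(source_name, text):
    """Yield each CSV row together with the file and line it came from."""
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 is the header, so the first data row sits on line 2.
    for line_no, row in enumerate(reader, start=2):
        yield {**row, "_source": source_name, "_source_line": line_no}

sources = {
    "crm_export.csv": "id,name\n1,Ann\n2,Bob\n",
    "web_signup.csv": "id,name\n3,Cy\n",
}
unified = [
    row
    for name, text in sources.items()
    for row in read_with_lineage(name, text)
]
```

The line numbers assume one record per physical line; quoted fields containing newlines would need `reader.line_num` instead.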
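No data is dropped silently: every deletion, merge, and imputation is logged under audit/. Imputed values carry a flag column so they cannot be mistaken for original values. Privacy fields are masked by default. Original files are never modified. Low-confidence fuzzy-dedup merges require human review. No data is uploaded to external services.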
python3 scripts/run_pipeline.py \
--input sales_q1.csv \
--output-dir ./cleaned_q1/ \
--pii-policy mask
python3 scripts/run_pipeline.py \
--input ./customer_sources/ \
--output-dir ./unified_customers/ \
--dedup-keys name,phone \
--priority-source crm_export.csv
python3 scripts/run_pipeline.py \
--input ./multi_team_data/ \
--mapping mapping.yaml \
--output-dir ./unified/
mapping.yaml:
target_schema:
  customer_id: { aliases: [客户ID, cust_id, ClientID, 编号] }
  phone: { aliases: [手机, 联系电话, Mobile, tel] }
  signup_date: { aliases: [注册日期, 开户日期, CreatedAt], type: date }
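One way to read the alias lists in mapping.yaml is as a lookup from any source column name to its target column. The sketch below mirrors the YAML as a Python dict to avoid a YAML dependency; the case-insensitive matching, the function name `build_crosswalk`, and the use of `None` for unmapped columns are assumptions, not documented behavior of scripts/reconcile_schema.py.

```python
# Mirrors the target_schema aliases from mapping.yaml above.
TARGET_SCHEMA = {
    "customer_id": ["客户ID", "cust_id", "ClientID", "编号"],
    "phone": ["手机", "联系电话", "Mobile", "tel"],
    "signup_date": ["注册日期", "开户日期", "CreatedAt"],
}

def build_crosswalk(source_columns, target_schema=TARGET_SCHEMA):
    """Map each source column to a target column, or None when no alias matches."""
    lookup = {}
    for target, aliases in target_schema.items():
        # The target name itself counts as an alias.
        for name in [target, *aliases]:
            lookup[name.strip().lower()] = target
    return {col: lookup.get(col.strip().lower()) for col in source_columns}

crosswalk = build_crosswalk(["ClientID", "Mobile", "开户日期", "notes"])
```

Columns that map to `None` (here `notes`) are the ones a reviewer would need to resolve by hand or by extending the alias lists.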
python3 scripts/profile.py --input ./suspicious_dataset/ --out dq_audit.md --read-only
cd tests && python3 -m pytest -v
Fixtures include:
pandas, pyarrow, recordlinkage library docs

data ETL data-cleaning dedup schema-reconcile data-quality 数据清洗 多源整合 去重 数据质量