Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using it.

多源数据清洗器 (Multi-Source Data Cleaner)

v1.0.0

Supports multi-format data parsing and intelligent field identification, with automatic deduplication, completion, and format unification; merges related data across sources and generates native Feishu cleaning reports.

by YK-Global (@qiji0802)
Security Scan
Capability signals
Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Pending (View report →)
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The code and SKILL.md implement the advertised features (parsing, field identification, cleaning, fuzzy dedup/join, Feishu export, and report generation). Declared dependencies (pandas, fuzzywuzzy, etc.) match the implementation. However, the registry metadata claims no required env vars while SKILL.md lists DATA_CLEANER_API_KEY, DATA_CLEANER_TIER, and an optional state file; this mismatch could mislead users about required credentials.
Instruction Scope
SKILL.md instructs running scripts/main.py and provides API-style run_clean_pipeline/run_merge_pipeline calls. Instructions and code reference writing monthly usage state to /tmp/data_cleaner_state.json and optional Feishu Bitable / Feishu cloud document creation. AI-powered features (field identification, classification) are optional but will call external AI providers (MiniMax/DeepSeek) when an API key is supplied. There are no instructions to read unrelated system files, but external network interactions (AI providers, Feishu) are implied and not fully documented in registry metadata.
Install Mechanism
No registry install spec is provided (the skill is listed as instruction-only), but the bundle contains Python scripts, and SKILL.md/README list pip dependencies with an install command. This is not dangerous per se, but the gap between the registry metadata and the included code is an inconsistency: users must install the dependencies manually and ensure the code runs in a Python environment.
Credentials
SKILL.md reasonably requests an AI API key for AI features (DATA_CLEANER_API_KEY), a tier setting, and an optional state-file path; these are proportionate to the advertised functionality. That said, Feishu export requires folder tokens/open_id, but those credentials are exposed only as function parameters in the docs (feishu_folder_token / feishu_open_id) and are not declared as required env vars in the registry metadata, which is another metadata/instruction mismatch. The skill does not request unrelated system credentials in the provided files.
Persistence & Privilege
The skill does not request permanent platform-wide privileges (always:false). It persists usage counts to /tmp/data_cleaner_state.json, which is a limited, local persistence. It does not appear to modify other skills or system settings.
What to consider before installing
This skill implements a local Python data-cleaning tool with optional AI and Feishu integrations. Before installing:

1. Note the metadata inconsistencies: the registry claims no env vars, while SKILL.md requires an AI API key (DATA_CLEANER_API_KEY) for AI features and the code can write a state file to /tmp.
2. If you enable AI features, your data (potentially sensitive PII) will be sent to third-party AI providers (MiniMax/DeepSeek); enable them only with safe, non-sensitive datasets and a trusted API key.
3. Feishu export needs folder tokens/open_id (not listed as registry-required env vars); inspect scripts/output.py and reporter.py to see which endpoints are used and how auth is handled.
4. The package has no automated install spec in the registry; you must install the Python dependencies manually (pip).

Recommended actions: review output.py/reporter.py for network calls, run the code in an isolated environment (an air-gapped or controlled VM) with non-sensitive test data, and provide API keys/tokens only if you trust the author. For higher assurance, ask the author for an explicit install manifest and confirmation of which external endpoints are contacted and what data is transmitted.

Like a lobster shell, security has layers — review code before you run it.

Tags: china · csv · data-cleaning · excel · feishu · json · latest
17 downloads
0 stars
1 version
Updated 6h ago
v1.0.0
MIT-0

SKILL.md — 多源数据清洗器 (Multi-Source Data Cleaner)

Metadata

Field           Value
name            multi-source-data-cleaner
label           多源数据清洗器 (Multi-Source Data Cleaner)
version         1.0.0
language        Python
runtime         subprocess (scripts/main.py)
trigger_words   数据清洗 / 数据去重 / 表格整理 / 数据合并 / 格式统一 / CRM数据整理 / Excel清洗

Description

Upload messy data, get clean data back. Supports multi-format parsing, intelligent field identification, AI-driven deduplication/completion/formatting, multi-source joins, and native Feishu output (Bitable plus a cloud-document quality report).

Typical scenarios: e-commerce order cleanup, CRM customer data cleaning, bank statement tidying, roster cleanup, merging data from multiple systems.


Capabilities

F1 · Multi-format recognition and parsing

  • Excel (.xlsx / .xls)
  • CSV / TSV
  • JSON (semi-structured)
  • Pasted clipboard text

F2 · Intelligent field identification

  • AI auto-detects: name, mobile number, email, address, amount, date, SKU, order number, national ID, gender, and more
  • User-defined field-mapping overrides supported

F3 · Data cleaning

  • Deduplication: exact dedup plus intelligent fuzzy dedup (FuzzyWuzzy, 88% threshold)
  • Completion: mean / mode / semantic inference / leave blank
  • Format unification
    • Mobile numbers → 1xx-xxxx-xxxx
    • Dates → YYYY-MM-DD
    • Amounts → two decimal places
    • Addresses → standardized province / city / district / street
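
The format-unification rules above can be sketched in plain Python. These helper names are illustrative, not the skill's actual API, and address standardization is omitted since it needs a region dictionary:

```python
import re
from datetime import datetime

def normalize_phone(raw: str) -> str:
    """Format an 11-digit Chinese mobile number as 1xx-xxxx-xxxx."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"
    return raw  # leave non-matching values untouched

def normalize_date(raw: str) -> str:
    """Parse common date spellings and emit YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y年%m月%d日", "%Y%m%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # unknown layout: hand back unchanged

def normalize_amount(raw: str) -> str:
    """Render a numeric amount with two decimal places."""
    try:
        return f"{float(raw):.2f}"
    except ValueError:
        return raw
```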

F4 · Data classification and labelling (Pro)

  • 8 built-in business rules (high-value customers, dormant users, VIP customers, corporate customers, etc.)
  • Custom JSON rules supported
  • AI auto-labelling (requires the Pro tier plus an AI API Key)

F5 · Multi-source join and merge (Pro)

  • Join across files by mobile number, name, order number, etc.
  • Fuzzy Join (adjustable match threshold)
  • Iterative merging of 2+ files
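
A Fuzzy Join can be sketched with nothing but the standard library. This example scores candidates with difflib.SequenceMatcher instead of FuzzyWuzzy (the two scores are related but not identical), and the record layout is illustrative:

```python
from difflib import SequenceMatcher

def fuzzy_join(left, right, key="姓名", threshold=85):
    """Pair each left record with its best-scoring right record at or above threshold."""
    merged = []
    for l in left:
        best, best_score = None, 0
        for r in right:
            # Score on a 0-100 scale, like a FuzzyWuzzy ratio.
            score = int(SequenceMatcher(None, l[key], r[key]).ratio() * 100)
            if score > best_score:
                best, best_score = r, score
        if best is not None and best_score >= threshold:
            merged.append({**l, **best})  # right-hand fields win on key collisions
    return merged
```

Each left-hand record keeps only its best match at or above the threshold; unmatched records are dropped, which mirrors an inner join.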

F6 · Native Feishu output

  • Export clean Excel / CSV
  • Feishu Bitable (Standard/Pro): writes directly into a Bitable
  • Data quality report → Feishu cloud document: duplicate rate, missing rate, before/after comparison

Pricing / Tier Features

Feature                         Free       Basic      Standard   Pro
Monthly quota                   50 rows    500 rows   3000 rows  Unlimited
Data sources                    1          3          Unlimited  Unlimited
Max columns                     10         50         200        Unlimited
Multi-format parsing
Basic dedup
Smart fuzzy dedup
Format unification
Smart completion
Multi-source merge
AI classification / labelling
Data quality report
Feishu Bitable

Tier isolation is implemented in scripts/tier_limits.py: every operation entry point calls check_tier() / check_feature() before running.
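
The gate in scripts/tier_limits.py is not shown in this listing; a minimal check_feature might look like the sketch below. The tier sets are reconstructed from the Capabilities section (F4/F5 are Pro, Bitable is Standard/Pro) and may not match the real code:

```python
# Hypothetical reimplementation; the real logic lives in scripts/tier_limits.py.
FEATURE_TIERS = {
    "merge": {"pro"},                  # F5 multi-source merge is marked Pro
    "ai_classify": {"pro"},            # F4 AI labelling requires Pro
    "bitable_output": {"std", "pro"},  # F6 Bitable is Standard/Pro
}

class FeatureNotAvailable(Exception):
    """Raised when the current tier does not unlock a feature."""

def check_feature(tier: str, feature: str) -> None:
    if tier not in FEATURE_TIERS.get(feature, set()):
        raise FeatureNotAvailable(f"{feature} is not available on the {tier!r} tier")
```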


Invocation

Direct agent invocation

from main import run_clean_pipeline, run_merge_pipeline

# Basic cleaning
result = run_clean_pipeline(
    sources=["订单数据.xlsx"],
    texts=None,
    output_format="xlsx",
    output_path="/tmp/cleaned.xlsx",
    dedup_strategy="auto",
    fill_strategy="auto",
    classify=True,
    ai_model="deepseek",
    generate_report=True,
)

# Multi-source merge
merge_result = run_merge_pipeline(
    sources=["客户表.xlsx", "订单表.csv"],
    on=["手机号"],
    fuzzy_on=["姓名"],
    fuzzy_threshold=85,
    output_format="xlsx",
)

CLI invocation

# Clean a local file
python scripts/main.py clean -i data.xlsx -o cleaned.xlsx -f xlsx

# Pasted text data
python scripts/main.py clean -t "姓名,电话
张三,13800138000
李四,13900139000" -o cleaned.csv -f csv

# Multi-source merge
python scripts/main.py merge --sources data1.csv data2.csv --on 手机号 -o merged.xlsx

# Generate a quality report
python scripts/main.py clean -i cleaned.xlsx --report-title "清洗报告" -o report.md

Function Reference

run_clean_pipeline()

Parameters:

Parameter              Type            Default           Description
sources                List[str]       None              List of file paths
texts                  List[str]       None              List of pasted text blocks
tier                   str             None              free / basic / std / pro
output_format          str             "xlsx"            xlsx or csv
output_path            str             None              Output path (a temp file is generated if omitted)
custom_field_mapping   Dict[str,str]   None              {column name: type} overrides
dedup_strategy         str             "auto"            exact / fuzzy / auto
fill_strategy          str             "auto"            auto / mean / mode / leave_blank
classify               bool            False             Enable AI classification
ai_model               str             None              minimax / deepseek
generate_report        bool            True              Generate a quality report
bitable_output         bool            False             Write output to a Feishu Bitable
feishu_folder_token    str             None              Feishu folder token
report_title           str             "数据质量报告"     Report document title

Returns: a Dict containing file_path, cleaned_rows, clean_report, usage, report_md, bitable, doc, etc.


Environment Variables

Variable                  Required                   Description
DATA_CLEANER_API_KEY      Required for AI features   MiniMax or DeepSeek API Key
DATA_CLEANER_TIER         Recommended                Subscription tier (free/basic/std/pro); defaults to free
DATA_CLEANER_STATE_FILE   Optional                   Path to the monthly-usage state file
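
Reading these variables with the documented defaults is straightforward. This loading snippet is illustrative, since the skill's actual startup code is not shown:

```python
import os

# Required only when AI features (classification, semantic fill) are enabled.
api_key = os.environ.get("DATA_CLEANER_API_KEY")

# Subscription tier; SKILL.md documents "free" as the default.
tier = os.environ.get("DATA_CLEANER_TIER", "free")

# Monthly-usage state file; defaults to the path named in the Notes section.
state_file = os.environ.get("DATA_CLEANER_STATE_FILE", "/tmp/data_cleaner_state.json")

ai_enabled = api_key is not None
```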

Dependencies

pandas>=1.5
openpyxl>=3.0
xlrd>=2.0
fuzzywuzzy>=0.18
python-Levenshtein>=0.12

Install: pip install pandas openpyxl xlrd fuzzywuzzy python-Levenshtein


Error Handling

Exception             Description                                     User prompt
TierLimitExceeded     Monthly quota or data-source limit exceeded     Suggests upgrading the tier
FeatureNotAvailable   Feature not available at the current tier       Explains how to unlock it
MergeError            Merge failed (e.g. join keys do not match)      Suggests checking the join keys
ExportError           Export failed (e.g. missing API key)            Explains how to configure it
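
The table maps one-to-one onto exception classes. The class bodies and the hint mapping below are a sketch of that shape; the real definitions in the bundle's scripts may differ:

```python
class TierLimitExceeded(Exception):
    """Monthly quota or data-source limit exceeded."""

class FeatureNotAvailable(Exception):
    """Feature is locked at the current tier."""

class MergeError(Exception):
    """Join failed, e.g. keys do not line up across sources."""

class ExportError(Exception):
    """Export failed, e.g. a missing or invalid API key."""

# Each failure maps to the user-facing hint from the table above.
USER_HINTS = {
    TierLimitExceeded: "Upgrade your tier to raise the quota.",
    FeatureNotAvailable: "See how to unlock this feature.",
    MergeError: "Check that the join keys match across sources.",
    ExportError: "Check the API key / export configuration.",
}
```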

Notes

  • All DataFrame operations use dtype=str and keep_default_na=False to avoid unintended type conversion
  • Date parsing supports: YYYY-MM-DD, YYYY/MM/DD, YYYY年MM月DD日, YYYYMMDD, and Unix timestamps
  • Mobile-number unification: 11-digit Chinese mobile numbers are detected and formatted as 1xx-xxxx-xxxx
  • The fuzzy-dedup threshold defaults to 88% (FuzzyWuzzy ratio) and can be adjusted via the fuzzy_threshold parameter of run_merge_pipeline
  • Monthly usage is persisted in /tmp/data_cleaner_state.json and survives restarts
  • Feishu Bitable output writes at most 500 rows per batch; larger sets are split automatically
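
The 500-row Bitable batching mentioned above reduces to simple chunking; write_batch in the commented usage is a placeholder, not the skill's real Feishu call:

```python
from typing import Iterator, List

def chunked(rows: List[dict], size: int = 500) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Typical use (write_batch is hypothetical):
# for batch in chunked(all_rows):
#     write_batch(batch)
```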

Skill Author

Skill Developer · YK Global
