data-synthesis
v1.0.0从 CSV 语料切块后,用同一套 LLM 接口依次生成问题与答案,输出 JSONL 训练数据。 适用于文档/表格语料合成 QA、微调数据准备;支持 OpenAI 兼容网关与内网 Qwen 等服务。
⭐ 0· 15·0 current·0 all-time
bychichisyun@erxiong0
MIT-0
Download zip
LicenseMIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign
high confidencePurpose & Capability
Name/description (synthesize QA from CSV for training) matches the included scripts and SKILL.md. The scripts only read the input CSV, chunk text, call an LLM endpoint when enabled, and write JSONL output — all consistent with the stated function.
Instruction Scope
SKILL.md and scripts limit actions to: validate/preview CSV, chunk text, call LLM (when DATA_SYNTHESIS_USE_API=1), and write output JSONL. There are no instructions to read unrelated files, access external endpoints beyond the configured OPENAI_BASE_URL, or exfiltrate other system data.
Install Mechanism
Instruction-only skill with bundled Python scripts; no install spec, no downloads, and only Python standard library usage (urllib, csv, json). No suspicious install sources or archive extraction.
Credentials
Optional environment variables (DATA_SYNTHESIS_USE_API, OPENAI_API_KEY, OPENAI_BASE_URL, DATA_SYNTHESIS_MODEL, etc.) are appropriate for contacting an LLM gateway. The registry metadata lists no required env vars, which is consistent because API use is opt-in. No unrelated credentials are requested.
Persistence & Privilege
Skill does not request always:true, does not modify other skills or system configuration, and does not persist credentials. Agent autonomy is default and not combined with other concerning privileges.
Assessment
This skill appears to do exactly what it says: chunk CSV text and produce JSONL Q/A pairs. Notes before you run it: (1) By default it runs in dry-run mode; it only calls external LLMs if you set DATA_SYNTHESIS_USE_API=1. (2) If you enable API mode and set OPENAI_API_KEY (or point OPENAI_BASE_URL to a gateway), the script will POST the text chunks and questions to that endpoint — that may reveal sensitive data from your CSV and incur costs. Audit the CSV for PII before sending, prefer an internal/enterprise gateway if available, and test with dry-run and small inputs first. (3) The output 'source_fields' includes other non-empty columns from each row — remove or redact sensitive columns beforehand. (4) Use a scoped API key and monitor usage. Overall the skill is internally coherent and there are no hidden endpoints or obfuscated behaviors in the provided code.Like a lobster shell, security has layers — review code before you run it.
latestvk97eqtfjzfc65tt717s1snne0x84rmyc
License
MIT-0
Free to use, modify, and redistribute. No attribution required.
