Corpus Builder
A corpus-building tool supporting intelligent chunking, AI annotation, and vector storage. Optional LLM annotation (requires the DashScope API) with a rule-based fallback.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign
Medium confidence
Purpose & Capability
Name/description (corpus building, chunking, AI annotation, embeddings) align with the shipped code: chunker, annotator, embedder, store and CLI script are present. The only credential mentioned (DASHSCOPE_API_KEY) is appropriate for the optional LLM annotation mode.
Instruction Scope
SKILL.md instructs running the included scripts and passing the optional DASHSCOPE_API_KEY via env var; code shown reads the key only from the environment. The docs/examples reference a path under ~/.openclaw/workspace/skills for where to run the project, but the code does not appear to read global OpenClaw config files (the README/CHANGELOG note that reading ~/.openclaw was removed). No instructions ask the agent to read unrelated system files or other credentials.
Install Mechanism
This is instruction-only (no platform install spec). The package contains Python code and a requirements.txt; installation is via pip as documented. No external binary downloads or URL-based extract/install steps are present in the provided metadata.
Credentials
The skill declares only one optional env var (DASHSCOPE_API_KEY) for LLM mode, which is proportionate. Minor inconsistency: requirements.txt lists 'openai' (and pysqlite3-binary) while pyproject.toml's dependencies do not include openai; this is a packaging/documentation mismatch to verify. No other secret env vars are requested.
Persistence & Privilege
Skill is not marked always:true and does not request persistent system-wide privileges. It writes checkpoints/embeddings into local directories under the skill (config-controlled) which is expected behavior for a corpus builder.
Scan Findings in Context
[PRE_SCAN_NONE] expected: Static pre-scan reported no injection signals. The repository contains many source files (annotator, chunker, embedder, store) and unit tests; absence of regex findings does not imply safety, but no automatic red flags were found.
Assessment
What to check before installing:
- The only sensitive input is an optional DASHSCOPE_API_KEY environment variable used for LLM annotation. If you don't set it, the skill will run in rule-based (offline) mode.
- Inspect the annotator's HTTP/OpenAI fallback (_call_llm_http) before use to confirm requests go only to the expected DashScope endpoint. The visible code uses the OpenAI-compatible client with base_url pointing to coding.dashscope.aliyuncs.com, which matches the README, but the fallback implementation was truncated in the provided view; verify it does not send data elsewhere.
- Packaging mismatch: requirements.txt includes 'openai' and pysqlite3-binary while pyproject.toml does not list 'openai' as a dependency. If you install via pip install -r requirements.txt you will pull the OpenAI SDK (needed for LLM mode); if you install with pyproject-based tools you may not. Use the install method you trust and review dependencies.
- The skill stores checkpoints, embeddings, and the Chroma DB under local directories (configurable). Ensure you point persist_directory/checkpoint_dir to storage you control and do not include sensitive texts you don't want stored.
- Prefer rule-based mode (no DASHSCOPE_API_KEY) if you want fully offline operation; test with a small dataset first and run the included unit tests (pytest) to validate behavior in your environment.
- Avoid putting API keys into files checked into git or shared shells; prefer per-session environment variables or a secrets manager.
Overall: the code and docs largely match the stated purpose; the mismatches are documentation/packaging points to confirm rather than indicators of malicious behavior.
Like a lobster shell, security has layers: review code before you run it.
Current version: v1.1.2
SKILL.md
Corpus Builder
A lightweight corpus-building tool optimized for Chinese novels, with intelligent scene-based chunking, 10-dimension AI annotation, and ChromaDB vector storage.
Annotation modes:
- LLM mode (recommended): intelligent annotation via the DashScope API (requires DASHSCOPE_API_KEY)
- Rule mode (fallback): automatic rule-engine annotation when no API key is available (fully offline)
🔐 Security Notes
This skill guarantees:
- ✅ The API key is passed only via the DASHSCOPE_API_KEY environment variable
- ❌ It does not read the ~/.openclaw/ directory or any global configuration files
- ❌ It does not store the API key in the skill directory or any local file
- ❌ It does not invoke external CLI tools via subprocess
- ❌ It does not access credentials for other providers
Environment Setup
LLM mode (API key required)
Set the environment variable (the only supported method):
# Temporary (current shell session only)
export DASHSCOPE_API_KEY="sk-xxx"
# Permanent (append to ~/.bashrc)
echo 'export DASHSCOPE_API_KEY="sk-xxx"' >> ~/.bashrc
source ~/.bashrc
⚠️ Note: do not commit the API key to Git or share it with others.
Rule mode (fully offline)
No API key required; the rule engine annotates automatically:
- Leave the DASHSCOPE_API_KEY environment variable unset
- The skill automatically falls back to rule-based annotation
- Lower quality, but runs fully offline
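The fallback behavior described above can be sketched in a few lines. This is an illustration only; the function name is hypothetical and not the skill's actual API:

```python
import os

def pick_annotation_mode(env=None):
    """Return 'llm' when a DashScope key is present, else 'rule'."""
    env = os.environ if env is None else env
    return "llm" if env.get("DASHSCOPE_API_KEY") else "rule"

print(pick_annotation_mode({}))                               # rule
print(pick_annotation_mode({"DASHSCOPE_API_KEY": "sk-xxx"}))  # llm
```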
Optional: SQLite3 compatibility
If you see a runtime error such as sqlite3 version < 3.35.0:
# Install pysqlite3-binary (only needed on older systems)
pip3 install pysqlite3-binary --user
Modern systems (Ubuntu 20.04+, macOS 12+, Python 3.10+) usually do not need this.
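ChromaDB refuses to start when the interpreter's sqlite3 is older than 3.35.0. A commonly used workaround (general to ChromaDB, not specific to this skill) is to alias pysqlite3 as sqlite3 before chromadb is imported:

```python
import sys

# Swap in pysqlite3-binary if it is installed; otherwise fall back
# to the stdlib sqlite3 (sufficient on modern systems).
try:
    __import__("pysqlite3")
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    pass

import sqlite3
print("sqlite3 runtime version:", sqlite3.sqlite_version)
```

This shim must execute before any `import chromadb` in the same process.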
Quick Start
Build a corpus
cd ~/.openclaw/workspace/skills/corpus-builder
# 1. Batch-process novel texts
python3 scripts/build_corpus.py \
--source ~/workspace/novels/reference \
--name 玄幻打斗 \
--genre 玄幻 \
--max-chunk-size 2000
# 2. Show statistics
python3 scripts/build_corpus.py \
--stats \
--collection 玄幻打斗
# 3. Export annotated data
python3 scripts/build_corpus.py \
--export json \
--collection 玄幻打斗 \
--output results.json
💡 Need to search the corpus? Use the corpus-search skill.
Annotation data example
{
"scene_type": "打斗",
"emotion": "紧张",
"quality_score": 8,
"original_text": "...",
"source_file": "没钱修什么仙.txt"
}
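As a quick post-export sanity check, records shaped like the example above can be filtered by score. This assumes `--export json` produces a list of such records; verify the layout against your actual output file:

```python
import json

def high_quality(records, min_score=7):
    """Keep records whose quality_score meets the threshold."""
    return [r for r in records if r.get("quality_score", 0) >= min_score]

# Illustrative records shaped like the example above
records = [
    {"scene_type": "打斗", "quality_score": 8},
    {"scene_type": "对话", "quality_score": 5},
]
print(json.dumps(high_quality(records), ensure_ascii=False))
```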
Installing Dependencies
cd ~/.openclaw/workspace/skills/corpus-builder
pip3 install -r requirements.txt --user
Required dependencies
| Package | Purpose |
|---|---|
| chromadb | Vector database |
| sentence-transformers | Embedding model |
| pyyaml | YAML handling |
| rich | CLI formatting |
| psutil | Memory monitoring |
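Given the requirements.txt / pyproject.toml mismatch flagged in the security review, a quick import check confirms what actually landed in your environment. Note that some PyPI names differ from import names (pyyaml imports as yaml, sentence-transformers as sentence_transformers):

```python
import importlib.util

def check_deps(modules):
    """Map each import name to whether it is importable here."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

status = check_deps(
    ["chromadb", "sentence_transformers", "yaml", "rich", "psutil", "openai"]
)
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```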
Memory Optimization
- Monitoring threshold: 2.5 GB
- Automatic release: browser/model caches
- Batching: AI annotation 5 per batch, embedding 32 per batch
- Incremental processing: checkpoint resume to avoid duplicate work
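The 2.5 GB guard can be approximated with the standard library alone. The skill itself depends on psutil and its real monitor may behave differently; this Unix-only sketch uses ru_maxrss instead:

```python
import resource
import sys

# Illustrative threshold matching the 2.5 GB figure above.
MEMORY_LIMIT_MB = 2500

def peak_rss_mb():
    """Peak resident set size of this process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS
    if sys.platform == "darwin":
        peak /= 1024
    return peak / 1024

print(f"peak RSS: {peak_rss_mb():.1f} MiB, limit: {MEMORY_LIMIT_MB} MiB")
```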
Configuration File
Edit configs/default_config.yml:
chunking:
max_chunk_size: 2000
min_chunk_size: 100
overlap: 200
processing:
batch_size: 5
embedding_batch_size: 32
max_workers: 3
models:
embedding: "BAAI/bge-small-zh-v1.5"
annotation: "dashscope-coding/qwen3.5-plus"
storage:
persist_directory: "./corpus/chroma"
checkpoint_dir: "./corpus/cache"
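For intuition about the three chunking parameters above, here is a minimal overlap-based splitter. This is a sketch only; the skill's scene-aware chunker is more sophisticated:

```python
def chunk_text(text, max_chunk_size=2000, min_chunk_size=100, overlap=200):
    """Split text into fixed-size chunks that overlap by `overlap` chars."""
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chunk_size]
        if len(chunk) >= min_chunk_size:  # drop fragments below the floor
            chunks.append(chunk)
    return chunks

chunks = chunk_text("x" * 5000)
print([len(c) for c in chunks])  # [2000, 2000, 1400]
```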
Troubleshooting
High memory usage
# Lower the memory limit
python3 scripts/build_corpus.py \
--source ./novels \
--name test \
--memory-limit 1500 \
--batch-size 3
LLM call failures
The skill falls back to rule-based annotation; results are still produced, just with lower quality scores.
ChromaDB errors
Delete the vector store and rebuild:
rm -rf corpus/chroma/{collection_name}
python3 scripts/build_corpus.py --source ./novels --name test
Related Scripts
| Script | Purpose |
|---|---|
| scripts/build_corpus.py | Main program (corpus building) |
License
MIT License
Created for OpenClaw 🦞
Version: 1.0.0
Last Updated: 2026-03-28
