Corpus Builder

A corpus-building tool supporting intelligent chunking, AI annotation, and vectorized storage. LLM annotation is optional (requires a DashScope API key), with a rule-based fallback.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal
Benign
OpenClaw
Benign
medium confidence
Purpose & Capability
The name and description (corpus building, chunking, AI annotation, embeddings) align with the shipped code: chunker, annotator, embedder, store, and a CLI script are present. The only credential mentioned (DASHSCOPE_API_KEY) is appropriate for the optional LLM annotation mode.
Instruction Scope
SKILL.md instructs running the included scripts and passing the optional DASHSCOPE_API_KEY via env var; code shown reads the key only from the environment. The docs/examples reference a path under ~/.openclaw/workspace/skills for where to run the project, but the code does not appear to read global OpenClaw config files (the README/CHANGELOG note that reading ~/.openclaw was removed). No instructions ask the agent to read unrelated system files or other credentials.
Install Mechanism
This is instruction-only (no platform install spec). The package contains Python code and a requirements.txt; installation is via pip as documented. No external binary downloads or URL-based extract/install steps are present in the provided metadata.
Credentials
The skill declares only one optional env var (DASHSCOPE_API_KEY) for LLM mode, which is proportionate. Minor inconsistency: requirements.txt lists 'openai' (and pysqlite3-binary) while pyproject.toml's dependencies do not include openai; this is a packaging/documentation mismatch to verify. No other secret env vars are requested.
Persistence & Privilege
Skill is not marked always:true and does not request persistent system-wide privileges. It writes checkpoints/embeddings into local directories under the skill (config-controlled) which is expected behavior for a corpus builder.
Scan Findings in Context
[PRE_SCAN_NONE] expected: Static pre-scan reported no injection signals. The repository contains many source files (annotator, chunker, embedder, store) and unit tests; absence of regex findings does not imply safety, but no automatic red flags were found.
Assessment
What to check before installing:

- The only sensitive input is an optional DASHSCOPE_API_KEY environment variable used for LLM annotation. If you don't set it, the skill runs in rule-based (offline) mode.
- Inspect the annotator's HTTP/OpenAI fallback (_call_llm_http) before use to confirm requests go only to the expected DashScope endpoint. The visible code uses the OpenAI-compatible client with base_url pointing to coding.dashscope.aliyuncs.com, which matches the README, but the fallback implementation was truncated in the provided view; verify it doesn't send data elsewhere.
- Packaging mismatch: requirements.txt includes 'openai' and pysqlite3-binary, while pyproject.toml does not list 'openai' as a dependency. Installing via pip -r requirements.txt pulls the OpenAI SDK (needed for LLM mode); installing with pyproject-based tools may not. Use the install method you trust and review the dependencies.
- The skill stores checkpoints, embeddings, and the Chroma DB under local, configurable directories. Point persist_directory/checkpoint_dir to storage you control, and avoid including sensitive texts you don't want stored.
- Prefer rule-based mode (no DASHSCOPE_API_KEY) if you want fully offline operation; test with a small dataset first and run the included unit tests (pytest) to validate behavior in your environment.
- Avoid putting API keys into files checked into git or shared shells; prefer per-session environment variables or a secrets manager.

Overall: the code and docs largely match the stated purpose; the mismatches are documentation/packaging points to confirm rather than indicators of malicious behavior.
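When inspecting the annotator's fallback, a small audit helper can confirm that any base_url you find points at the DashScope host named in this review. This is a hypothetical helper for manual review, not code from the skill; the host string is taken from the review text above.

```python
from urllib.parse import urlparse

# Host cited in this review (not extracted from the skill's code).
EXPECTED_HOST = "coding.dashscope.aliyuncs.com"

def endpoint_is_expected(base_url: str) -> bool:
    """Return True if base_url points at the expected DashScope host."""
    return urlparse(base_url).hostname == EXPECTED_HOST

print(endpoint_is_expected("https://coding.dashscope.aliyuncs.com/v1"))  # True
print(endpoint_is_expected("https://example.com/v1"))                    # False
```

Run it against every base_url literal you find in the annotator before trusting LLM mode with real text.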

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.1.2

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Corpus Builder

A lightweight corpus-building tool optimized for Chinese novels, with scene-aware intelligent chunking, 10-dimension AI annotation, and ChromaDB vector storage.

Annotation Modes

  • LLM mode (recommended): intelligent annotation via the DashScope API (requires DASHSCOPE_API_KEY)
  • Rule mode (fallback): rule-engine annotation when no API key is set (fully offline)
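The two modes differ only in whether an API key is present in the environment. A minimal sketch of the selection logic, assuming the documented behavior (the shipped annotator may implement the check differently):

```python
import os

def select_annotation_mode() -> str:
    """Return "llm" when DASHSCOPE_API_KEY is set, else "rule".

    Illustrative sketch of the documented fallback behavior,
    not the skill's actual implementation.
    """
    return "llm" if os.environ.get("DASHSCOPE_API_KEY") else "rule"
```

With no key in the environment, `select_annotation_mode()` returns `"rule"` and everything runs offline.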

🔐 Security Notes

This skill guarantees:

  • ✅ The API key is passed via the DASHSCOPE_API_KEY environment variable
  • It does not read the ~/.openclaw/ directory or any global config files
  • It does not store the API key in the skill directory or any local file
  • It does not use subprocess to invoke external CLI tools
  • It does not access other providers' credentials

Environment Setup

LLM mode (requires an API key)

Set the environment variable (the only supported method):

# Temporary (current shell only)
export DASHSCOPE_API_KEY="sk-xxx"

# Permanent (append to ~/.bashrc)
echo 'export DASHSCOPE_API_KEY="sk-xxx"' >> ~/.bashrc
source ~/.bashrc

⚠️ Note: never commit the API key to Git or share it with others.

Rule mode (fully offline)

No API key needed; the rule engine annotates automatically:

  • Leave the DASHSCOPE_API_KEY environment variable unset
  • The skill automatically falls back to rule-based annotation
  • Lower annotation quality, but runs fully offline

Optional: SQLite3 compatibility

If you hit a runtime error like sqlite3 version < 3.35.0:

# Install pysqlite3-binary (older systems only)
pip3 install pysqlite3-binary --user

Modern systems (Ubuntu 20.04+, macOS 12+, Python 3.10+) usually do not need this.
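Installing pysqlite3-binary alone is usually not enough; ChromaDB users on older systems commonly alias it as the stdlib sqlite3 module before chromadb is imported. This is a community workaround, not code taken from this skill, and it is harmless to skip on modern systems:

```python
import sys

# Alias pysqlite3 as sqlite3 so later imports (e.g. chromadb's)
# pick up the newer SQLite build. Must run before chromadb imports.
try:
    __import__("pysqlite3")
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    pass  # pysqlite3-binary not installed; fall back to stdlib sqlite3

import sqlite3
print(sqlite3.sqlite_version)  # ChromaDB wants this >= 3.35.0
```

Put the shim at the very top of your entry script if you need it.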

Quick Start

Build a corpus

cd ~/.openclaw/workspace/skills/corpus-builder

# 1. Batch-process novel texts
python3 scripts/build_corpus.py \
    --source ~/workspace/novels/reference \
    --name 玄幻打斗 \
    --genre 玄幻 \
    --max-chunk-size 2000

# 2. Show statistics
python3 scripts/build_corpus.py \
    --stats \
    --collection 玄幻打斗

# 3. Export annotation data
python3 scripts/build_corpus.py \
    --export json \
    --collection 玄幻打斗 \
    --output results.json

💡 Need to search the corpus? Use the corpus-search skill.

Annotation data example

{
    "scene_type": "打斗",
    "emotion": "紧张",
    "quality_score": 8,
    "original_text": "...",
    "source_file": "没钱修什么仙.txt"
}
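Exported records can be post-processed with plain JSON tooling. A hypothetical filter that keeps only high-scoring fight scenes, with field names taken from the example above and made-up records standing in for a real results.json:

```python
import json

# Made-up records shaped like the export example above.
records = [
    {"scene_type": "打斗", "emotion": "紧张", "quality_score": 8},
    {"scene_type": "对话", "emotion": "平静", "quality_score": 5},
]

# In practice, load the file produced by --export json instead:
# records = json.load(open("results.json", encoding="utf-8"))

good = [r for r in records
        if r["scene_type"] == "打斗" and r["quality_score"] >= 7]
print(len(good))  # 1
```

The same pattern works for any of the 10 annotation dimensions.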

Install Dependencies

cd ~/.openclaw/workspace/skills/corpus-builder
pip3 install -r requirements.txt --user

Required dependencies

| Dependency | Purpose |
| --- | --- |
| chromadb | vector database |
| sentence-transformers | embedding model |
| pyyaml | YAML handling |
| rich | CLI output formatting |
| psutil | memory monitoring |

Memory Optimization

  • Monitoring threshold: 2.5 GB
  • Auto-release: browser/model caches
  • Batch strategy: 5 chunks per AI annotation batch, 32 per embedding batch
  • Incremental processing: checkpoint resume avoids reprocessing
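The documented batch sizes (5 per annotation call, 32 per embedding call) amount to simple fixed-size batching. An illustrative helper, not the shipped implementation:

```python
def batched(items, batch_size):
    """Yield consecutive fixed-size batches from items.

    The last batch may be shorter than batch_size.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

chunks = list(range(12))
# Annotation uses batches of 5, embedding batches of 32.
print([len(b) for b in batched(chunks, 5)])  # [5, 5, 2]
```

Smaller batches trade throughput for a lower peak memory footprint, which is why --batch-size is also a troubleshooting knob below.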

Configuration

Edit configs/default_config.yml:

chunking:
  max_chunk_size: 2000
  min_chunk_size: 100
  overlap: 200
processing:
  batch_size: 5
  embedding_batch_size: 32
  max_workers: 3
models:
  embedding: "BAAI/bge-small-zh-v1.5"
  annotation: "dashscope-coding/qwen3.5-plus"
storage:
  persist_directory: "./corpus/chroma"
  checkpoint_dir: "./corpus/cache"

Troubleshooting

High memory usage

# Lower the memory limit
python3 scripts/build_corpus.py \
    --source ./novels \
    --name test \
    --memory-limit 1500 \
    --batch-size 3

LLM call failures

The rule-based fallback still produces annotations; only the quality scores are lower.

ChromaDB errors

Delete the vector store and rebuild:

rm -rf corpus/chroma/{collection_name}
python3 scripts/build_corpus.py --source ./novels --name test

Related Scripts

| Script | Purpose |
| --- | --- |
| scripts/build_corpus.py | main program (corpus building) |

License

MIT License

Created for OpenClaw 🦞
Version: 1.0.0
Last Updated: 2026-03-28

Files

22 total