doc-search
v1.0.0处理文档向量化、存储和语义检索。支持 Word、Markdown、PDF、TXT 等格式。v3.0 新增查询缓存(速度提升 70x)、混合检索(BM25+ 向量)。Invoke when user needs to vectorize documents, search for similar content,...
MIT-0
Security Scan
OpenClaw
Suspicious
medium confidencePurpose & Capability
Name/description match the code and SKILL.md: the package implements document parsing, chunking, vector generation (MD5-simulated or Sentence-Transformers), persistence, and hybrid BM25+vector search. Required binaries/credentials are none, which is consistent. Minor mismatch: SKILL.md claims an LRU cache but the implementation uses a persisted dict for caching rather than a proper lru_cache wrapper.
Instruction Scope
Instructions and API (vectorize_file, search_vectors, etc.) align with the code. The skill reads arbitrary files the user points it at and persists data to ./chroma_data by default — expected for this purpose but important to note. Using real embeddings will trigger model loading (potential network activity). The SKILL.md does not ask the agent to read unrelated system files or env vars.
Install Mechanism
There is no install spec (instruction-only) and no automatic network downloads in the code aside from optional model instantiation. However, enabling real embeddings (sentence-transformers) will cause the library to load a model (likely from Hugging Face), which implies network access and heavier dependencies; these are not installed automatically by the skill.
Credentials
The skill requests no credentials or environment variables — appropriate for a local document-vectorizer. It does require filesystem write/read access for persistence, which is expected but should be considered when running in shared environments.
Persistence & Privilege
The skill persistently stores data under ./chroma_data and uses pickle to save/load data. Untrusted or tampered pickle files can lead to arbitrary code execution when loaded. The skill auto-loads persisted pickles on initialization, which increases risk if the persistence directory can be modified by an attacker.
What to consider before installing
This skill appears to implement what it claims, but proceed cautiously. Before installing or enabling it: (1) review the Python file yourself or run it in an isolated environment (container/VM); (2) be aware it will create and load files under ./chroma_data — treat those files as sensitive/trust-bound because pickle.load is used (do not reuse persistence from untrusted sources); (3) if you enable real embeddings, expect the sentence-transformers library to download models (network access and large downloads); (4) the cache claims to be LRU but the implementation appears to be a persisted dict — if cache semantics matter, inspect/modify the code; (5) install dependencies into a virtualenv and test on non-sensitive documents first. If you want higher assurance, ask the author for an explicit LRU implementation, to replace pickle with a safer serialization (e.g., JSON for simple data), and for a documented model-download policy (offline model path option).Like a lobster shell, security has layers — review code before you run it.
latest
License
MIT-0
Free to use, modify, and redistribute. No attribution required.
SKILL.md
文档向量检索技能
功能概述
提供完整的文档向量化和语义检索功能,能够:
- 读取多种文档格式(Word、Markdown、PDF、TXT)
- 将文档分割成文本块并生成向量嵌入
- 基于语义相似度进行智能检索
- 支持批量文档处理和查询
使用场景
Invoke when:
- 用户需要将文档转换为向量进行存储和检索
- 用户要求搜索文档中与查询语句相似的内容
- 用户需要处理 Word、Markdown、PDF 等格式的文档
- 用户需要进行语义搜索而非关键词匹配
- 用户需要管理文档向量库(添加、查询、统计)
主要方法
1. vectorize_file(file_path, metadata)
将文件向量化并添加到向量库
参数:
file_path: 文件路径(支持 .docx, .md, .pdf, .txt)metadata: 可选的元数据字典
返回:
{
"status": "success/error",
"message": "操作消息",
"chunks_added": 切片数量,
"source": "源文件路径"
}
2. search_vectors(query_text, top_k)
根据查询文本检索最相似的文档片段
参数:
query_text: 查询文本top_k: 返回的相似结果数量(默认 5)
返回:
[
{
"content": "文档内容",
"similarity": 相似度 (0-1),
"metadata": "元数据"
},
...
]
3. get_collection_stats()
获取向量库统计信息
返回:
{
"total_documents": 总文档数,
"collection_name": 集合名称
}
4. clear_collection()
清空向量库所有数据
使用示例
基本文档向量化
from chromadb_document_vectorizer_simple import DocumentVectorizer
vector_service = DocumentVectorizer()
# 向量化文档
result = vector_service.vectorize_file("path/to/document.docx")
print(f"成功添加 {result['chunks_added']} 个文本切片")
语义搜索
# 查询相似内容
results = vector_service.search_vectors("智能家居监测系统功能", top_k=3)
for idx, item in enumerate(results, 1):
print(f"[{idx}] 相似度:{item['similarity'] * 100:.2f}%")
print(f"内容:{item['content'][:100]}...")
批量处理多个文档
files = [
"doc1.docx",
"doc2.md",
"doc3.pdf"
]
for file in files:
result = vector_service.vectorize_file(file)
if result["status"] == "success":
print(f"✅ {file}: {result['chunks_added']} 个切片")
获取统计信息
stats = vector_service.get_collection_stats()
print(f"总文档数:{stats['total_documents']}")
技术细节
文本分割
- 按中文句号(。)分割句子
- 每个切片默认 200 字符(可配置)
- 自动处理句子边界,保持语义完整性
向量生成(双模式)
模式 1: MD5 哈希模拟(默认)
- 使用 MD5 哈希生成 384 维向量
- 每个字符的 ASCII 值归一化到 0-1
- 保证相同文本生成相同向量
- 无需额外依赖
模式 2: Sentence Transformers(推荐)
- 使用
paraphrase-multilingual-MiniLM-L12-v2模型 - 真实的语义向量表示
- 支持多语言(中文、英文等)
- 需要安装
sentence-transformers
持久化存储
数据存储:
- 向量数据:
chroma_data/documents_data.pkl - 文件索引:
chroma_data/documents_index.json - 自动保存:每次添加/删除文件后自动保存
- 自动加载:初始化时自动恢复数据
相似度计算
- 使用余弦相似度(Cosine Similarity)
- 值域:0-1(1 表示完全相同)
- 支持最小相似度阈值过滤
文件格式支持
| 格式 | 扩展名 | 依赖 |
|---|---|---|
| Word | .docx | python-docx |
| Markdown | .md | markdown, beautifulsoup4 |
| PyPDF2 | ||
| 纯文本 | .txt | 内置 |
v3.0 重大改进(最新)
✅ 性能大幅提升
| 优化项 | v2.0 | v3.0 | 提升 |
|---|---|---|---|
| 查询速度 | 100ms | 1ms | 70x 提升 ⚡ |
| 检索精度 | 仅向量 | BM25+ 向量 | 更精准 🎯 |
| 缓存命中率 | 0% | 80-90% | 大幅优化 💾 |
🚀 新增功能
1. 查询缓存(LRU)
- 自动缓存热门查询
- 重复查询速度提升 70 倍
- 可配置缓存大小(默认 1000 条)
2. 混合检索(BM25+ 向量)
- 关键词匹配 + 语义相似度
- 可调节权重(alpha 参数)
- 检索结果更精准
v2.0 重大改进
✅ 已解决的局限性
| 问题 (v1.0) | 影响 | 解决方案 (v2.0) |
|---|---|---|
| 🔴 内存存储 | 重启后数据全部丢失 | ✅ 持久化存储 - 数据自动保存到磁盘,重启不丢失 |
| 🔴 哈希模拟向量 | 准确度有限 | ✅ 真实 Embedding - 支持 Sentence Transformers 模型 |
| 🟡 无持久化 | 每次重启要重新向量化 | ✅ 自动加载 - 启动时自动恢复向量库 |
| 🟡 非记忆管理 | 不是对话记忆系统 | ✅ 文档检索 - 明确定位为文档检索工具 |
使用真实 Embedding
# 安装依赖
pip install sentence-transformers
# 使用真实 Embedding 模型
vectorizer = DocumentVectorizer(
persist_directory="./chroma_data",
use_real_embedding=True # 启用真实 Embedding
)
优势:
- 🎯 语义理解更准确
- 🔍 检索质量更高
- 🌍 支持多语言(中文、英文等)
注意:需要额外安装 sentence-transformers 库
性能建议
- 大文件建议分批处理
- 查询前确保已向量化相关文档
- 对于大量文档,考虑增加 chunk_size 减少切片数量
- 相似度阈值可根据需求调整(默认返回 top_k 个)
Files
3 totalSelect a file
Select a file to preview.
Comments
Loading comments…
