Skill flagged — suspicious patterns detected
ClawHub Security flagged this skill as suspicious. Review the scan results before using.
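Atomic RAG Knowledge Base Builder (原子化RAG知识库构建器)
v1.0.0 · Atomic RAG Knowledge Base Builder: lets an AI truly learn a book rather than merely having seen it. Specialized for science, engineering, agriculture, and medicine, with methodology distillation. The best open-source skill on the web for building a dedicated knowledge base.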
MIT-0
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The name, README, SKILL.md, and Python code implement a PDF→atomic-RAG pipeline (OCR, semantic chunking, domain processors, embeddings, storage in Chroma/Milvus). The requested libraries and processors are coherent with the stated purpose. However, package.json contains a GitHub token-like string in the repository URL, and the requirements include several heavy components (Milvus, Pinecone, Chroma, OCR/system dependencies) that are plausible but not declared in the skill metadata or manifest.
Instruction Scope
SKILL.md demonstrates running builder.process_pdf() and storing to vector DBs, but the instructions do not mention required API keys (e.g., OpenAI for embeddings), vector DB credentials, or required system binaries (tesseract, poppler). The runtime code will read local PDF files (expected) and call external services (embedding provider, vector DBs) — those network operations are not documented in SKILL.md or the skill manifest. The omission grants the skill broad implicit network access and unspecified credential use.
Install Mechanism
There is no formal install spec in the registry (instruction-only style), but a requirements.txt and package.json are included, indicating Python dependencies. That itself is fine, but package.json contains an apparent GitHub personal access token embedded in the repository URL (exposes a secret-like string). Also some Python packages (pdf2image, pytesseract) require system-level binaries (poppler, tesseract) which are not declared as required binaries.
Credentials
The skill metadata lists no required environment variables or primary credential, yet the code uses LangChain's OpenAIEmbeddings (which requires an embedding provider API key at runtime, commonly OPENAI_API_KEY) and supports vector stores (Chroma, Milvus, Pinecone) that typically need credentials or endpoints. Additionally, the package.json contains a token-like string that is unrelated to the declared environment requirements. The manifest under-declares sensitive external credentials and system dependencies.
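The under-declared credentials described above can be checked before the pipeline runs. The following is a minimal sketch, not part of the skill: `missing_env_vars` and the `REQUIRED` list are hypothetical names, and only `OPENAI_API_KEY` is taken from the scan text. Any vector-store variables would need to be added once you know which backend you use.

```python
import os

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Assumption: only the embedding key is listed here; extend this list
# with whatever your chosen vector store actually requires.
REQUIRED = ["OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All declared credentials are set.")
```

Failing early like this makes the implicit credential use explicit, which is the gap the scan points out in the manifest.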
Persistence & Privilege
The skill is not marked always:true, is user-invocable, and does not modify other skills or system-wide agent settings. It writes data to user-specified outputs (JSON, vector DB) only when explicitly invoked. No elevated persistence or cross-skill modification behavior was detected.
Scan Findings in Context
[hardcoded_token_in_package_json_repository_url] unexpected: package.json repository.url contains a string starting with 'ghp_' (looks like a GitHub personal access token). This is not required by a knowledge-base builder and may be an accidental secret leak; it is not expected for the skill's purpose.
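A finding like this one can be reproduced locally with a short script. This is a rough sketch under stated assumptions: the function name `find_token_like_strings` is hypothetical, the regular expression is a deliberately loose match for strings starting with `ghp_` rather than an exact token grammar, and the sample package.json is synthetic.

```python
import json
import re

# Loose pattern for strings that look like GitHub tokens with a "ghp_" prefix.
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{20,}")

def find_token_like_strings(package_json_text: str) -> list[str]:
    """Return redacted token-like strings found in repository.url."""
    data = json.loads(package_json_text)
    repo = data.get("repository", "")
    # "repository" may be an object with a "url" field or a plain string.
    url = repo.get("url", "") if isinstance(repo, dict) else str(repo)
    # Redact so the secret is never printed in full.
    return [match[:8] + "..." for match in TOKEN_PATTERN.findall(url)]

# Synthetic example; the token below is 36 filler characters, not a real secret.
sample = json.dumps({
    "name": "example-skill",
    "repository": {
        "type": "git",
        "url": "https://ghp_" + "A" * 36 + "@github.com/example/example.git",
    },
})
print(find_token_like_strings(sample))
```

Only a redacted prefix is returned, so running the check does not copy the leaked value into logs.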
What to consider before installing
This skill largely implements a reasonable PDF→RAG pipeline, but several red flags mean you should be careful before installing or running it:
- Check package.json: it contains an apparent GitHub token embedded in the repository URL. Treat that as a leaked secret: do not reuse or rely on it. If you plan to use this code, remove the token from your copy and confirm that it has been revoked or rotated.
- Expect to provide external credentials at runtime: OpenAI (or another embedding provider) API key, and any vector DB credentials (Chroma/Milvus/Pinecone). The skill does not declare these, but the code will call embedding/vector services.
- System binaries are required but not declared: OCR (pytesseract) usually needs Tesseract installed on the host; pdf2image and pdf processing may need poppler. Install these in a controlled environment before running.
- Run in an isolated environment first (sandbox/VM/container) and inspect network activity: the code will perform network calls to embedding providers and possibly vector DBs. Monitor outbound connections and avoid processing sensitive documents until you confirm where data is sent.
- Review and test with non-sensitive PDFs: verify which external endpoints are contacted and what metadata/contents are transmitted (embeddings are generated by sending text to an embedding API).
- If you will use this for medical content, be aware this code extracts diagnostic/treatment steps — ensure compliance with applicable regulations and have domain experts validate outputs.
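The system-binary item in the list above lends itself to a quick preflight check. This is a minimal sketch, assuming the host exposes Tesseract as `tesseract` and poppler through its `pdftoppm` utility; `missing_binaries` is a hypothetical helper, not something the skill ships.

```python
import shutil

def missing_binaries(names, which=shutil.which):
    """Return the executables in `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]

# Assumption: these are the executable names on your platform.
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]

missing = missing_binaries(SYSTEM_BINARIES)
if missing:
    print("Install before running OCR:", ", ".join(missing))
else:
    print("OCR system dependencies found.")
```

The `which` parameter is injectable so the check can be tested without depending on what is installed on the machine.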
Given the evidence (missing declared env vars, system deps, and an embedded token), treat this skill as suspicious until you fix/confirm the issues above. If you want, I can point to the exact lines/files where environment-dependent calls are made and suggest specific mitigations (e.g., declare required env vars, remove tokens, add installation notes for system binaries).
Like a lobster shell, security has layers: review code before you run it.
latest: vk97d7bz3vg2kwy2tsca35naye584q3nr
Runtime requirements
📚 Clawdis
