Skill v1.0.0

ClawScan security

Atomic RAG Knowledge Base Builder (原子化RAG知识库构建器) · ClawHub's context-aware review of the artifact, metadata, and declared behavior.

Scanner verdict

Suspicious · Apr 12, 2026, 12:28 PM

Verdict: suspicious
Confidence: medium
Model: gpt-5-mini
Summary: The package mostly matches its stated purpose (building a RAG knowledge base), but the skill metadata omits several important runtime requirements that the code depends on (API keys, system binaries), and package.json embeds what looks like a GitHub token. These inconsistencies warrant caution.
Guidance: This skill largely implements a reasonable PDF→RAG pipeline, but several red flags mean you should be careful before installing or running it:
- Check package.json: it contains an apparent GitHub token embedded in the repository URL. Treat that as a leaked secret: do not reuse it, and avoid trusting it. If you plan to use this code, remove/rotate the token and confirm it is not active.
- Expect to provide external credentials at runtime: OpenAI (or another embedding provider) API key, and any vector DB credentials (Chroma/Milvus/Pinecone). The skill does not declare these, but the code will call embedding/vector services.
- System binaries are required but not declared: OCR (pytesseract) usually needs Tesseract installed on the host; pdf2image and pdf processing may need poppler. Install these in a controlled environment before running.
- Run in an isolated environment first (sandbox/VM/container) and inspect network activity: the code will perform network calls to embedding providers and possibly vector DBs. Monitor outbound connections and avoid processing sensitive documents until you confirm where data is sent.
- Review and test with non-sensitive PDFs: verify which external endpoints are contacted and what metadata/contents are transmitted (embeddings are generated by sending text to an embedding API).
- If you will use this for medical content, be aware this code extracts diagnostic/treatment steps. Ensure compliance with applicable regulations and have domain experts validate outputs.
Given the evidence (missing declared env vars, system deps, and an embedded token), treat this skill as suspicious until you fix/confirm the issues above. If you want, I can point to the exact lines/files where environment-dependent calls are made and suggest specific mitigations (e.g., declare required env vars, remove tokens, add installation notes for system binaries).
Findings: [hardcoded_token_in_package_json_repository_url] unexpected: package.json repository.url contains a string starting with 'ghp_' (looks like a GitHub personal access token). This is not required by a knowledge-base builder and may be an accidental secret leak; it is not expected for the skill's purpose.
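A finding like this one can be reproduced locally before installing. The following is a minimal sketch, not the scanner's actual rule: the regular expression is a loose heuristic for strings beginning with 'ghp_' (an assumption, not an exact specification of GitHub's token format), and the function names `find_token_like_strings` and `redact` are hypothetical helpers introduced for illustration.

```python
import json
import re
from typing import List

# Loose heuristic (assumption): 'ghp_' followed by a long alphanumeric run.
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{20,}")


def find_token_like_strings(package_json_text: str) -> List[str]:
    """Return token-like substrings found in the package.json repository URL."""
    data = json.loads(package_json_text)
    repo = data.get("repository", "")
    # npm allows "repository" to be either a plain string or an object with "url".
    url = repo.get("url", "") if isinstance(repo, dict) else str(repo)
    return TOKEN_PATTERN.findall(url)


def redact(url: str) -> str:
    """Replace any token-like substring so the URL can be logged safely."""
    return TOKEN_PATTERN.sub("<REDACTED>", url)
```

Redacting the string in a working copy does not neutralize the secret. If the token is real, it still has to be revoked by its owner.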

Review Dimensions

Purpose & Capability: note. The name, README, SKILL.md, and the Python code implement a PDF→atomic-RAG pipeline (OCR, semantic chunking, domain processors, embeddings, storing to Chroma/Milvus). The requested libraries and processors are coherent with the stated purpose. However, package.json contains a GitHub token-like string in the repository URL, and the requirements include several heavy components (Milvus, Pinecone, Chroma, OCR/system deps) that are plausible but not declared in the skill metadata/manifest.
Instruction Scope: concern. SKILL.md demonstrates running builder.process_pdf() and storing to vector DBs, but the instructions do not mention required API keys (e.g., OpenAI for embeddings), vector DB credentials, or required system binaries (tesseract, poppler). The runtime code will read local PDF files (expected) and call external services (embedding provider, vector DBs); those network operations are not documented in SKILL.md or the skill manifest. The omission grants the skill broad implicit network access and unspecified credential use.
Install Mechanism: note. There is no formal install spec in the registry (instruction-only style), but a requirements.txt and package.json are included, indicating Python dependencies. That itself is fine, but package.json contains an apparent GitHub personal access token embedded in the repository URL (exposes a secret-like string). Also, some Python packages (pdf2image, pytesseract) require system-level binaries (poppler, tesseract) which are not declared as required binaries.
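The undeclared system binaries can be checked before the pipeline runs. The sketch below assumes the usual executable names: `tesseract` for pytesseract and `pdftoppm` (shipped with poppler) for pdf2image. The helper `missing_binaries` and the `REQUIRED_BINARIES` table are hypothetical and are not part of the skill.

```python
import shutil

# Assumed executable names; adjust if the host packages them differently.
REQUIRED_BINARIES = {
    "tesseract": "OCR engine used by pytesseract",
    "pdftoppm": "poppler tool used by pdf2image to rasterize PDF pages",
}


def missing_binaries(required=REQUIRED_BINARIES, which=shutil.which):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    for name in missing_binaries():
        print(f"missing: {name} ({REQUIRED_BINARIES[name]})")
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default the check uses `shutil.which` against the current PATH.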
Credentials: concern. The skill metadata lists no required environment variables or primary credential, yet the code uses LangChain's OpenAIEmbeddings (which requires an embedding provider API key at runtime, commonly OPENAI_API_KEY) and supports vector stores (Chroma, Milvus, Pinecone) that typically need credentials or endpoints. Additionally, the package.json contains a token-like string that is unrelated to the declared environment requirements. The manifest under-declares sensitive external credentials and system dependencies.
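One way to make the credential requirement explicit is a preflight check that fails before any document text is sent anywhere. This is a minimal sketch under stated assumptions: `OPENAI_API_KEY` is the variable named above, while the helper `missing_env` and the `REQUIRED_ENV` list are hypothetical; other embedding providers or vector stores would need their own variable names added.

```python
import os

# Assumption: only the OpenAI key named in the review is required here.
REQUIRED_ENV = ["OPENAI_API_KEY"]


def missing_env(names=REQUIRED_ENV, environ=None):
    """Return required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit("Missing required environment variables: " + ", ".join(missing))
```

Listing the same variable names in the skill manifest would let the registry and the user see the credential use up front, instead of discovering it at runtime.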
Persistence & Privilege: ok. The skill is not marked always:true, is user-invocable, and does not modify other skills or system-wide agent settings. It writes data to user-specified outputs (JSON, vector DB) only when explicitly invoked. No elevated persistence or cross-skill modification behavior was detected.