Atomized RAG Knowledge Base Builder (原子化RAG知识库构建器)

Security checks across malware telemetry and agentic risk

Overview

The skill mostly does what it claims, but it includes an exposed GitHub-token-like repository URL and under-discloses external embedding and retained document storage risks.

Review before installing. Ask the publisher to remove and rotate the exposed GitHub-token-like credential, republish with a clean and consistent repository URL, and document that PDF text may be sent to OpenAI for embeddings and stored in vector databases or JSON files. Avoid using sensitive, regulated, proprietary, or medical documents unless you control the embedding backend, storage access, retention, and deletion process.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access

Findings (21)

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The repository URL contains what appears to be a live GitHub personal access token embedded directly in package metadata. Publishing credentials in a package file can expose repository access to anyone who reads or mirrors the package, enabling unauthorized cloning, modification, or broader account/repository compromise depending on the token's scope. The skill's purpose is knowledge-base building, so embedding a GitHub token is unrelated functionality and increases suspicion rather than reducing it.
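A publisher could catch this kind of leak before release with a simple metadata scan. The sketch below is illustrative only: the function name `find_credential_hints` is hypothetical, the regular expressions only approximate GitHub's documented token prefixes (`ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_`, `github_pat_`), and it is not the check SkillSpector itself runs.

```python
import re

# Approximate patterns for GitHub token prefixes; real tokens may differ in length.
TOKEN_RE = re.compile(r"\b(?:gh[pousr]_[A-Za-z0-9]{36,}|github_pat_[A-Za-z0-9_]{22,})\b")
# Any credentials placed in the userinfo part of a URL: scheme://user[:secret]@host
USERINFO_RE = re.compile(r"https?://[^/\s@]+@")


def find_credential_hints(text):
    """Return (line_number, reason) pairs for lines that look like they hold a credential."""
    hints = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if TOKEN_RE.search(line):
            hints.append((lineno, "github-token-like string"))
        elif USERINFO_RE.search(line):
            hints.append((lineno, "credentials in URL"))
    return hints
```

A hit only means the line deserves a manual look. Any token that was actually published should be revoked and rotated, not just deleted from the file.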

Missing User Warnings

Medium

Confidence: 87%
Finding: The document instructs users to process PDFs and store extracted content in a vector database, but it gives no warning about sensitive data, consent, retention, or access control. In a knowledge-base builder, users may ingest proprietary, personal, educational, or regulated documents, so omission of privacy guidance can lead to unintended disclosure through embeddings, retrieval, or downstream sharing.

Missing User Warnings

High

Confidence: 94%
Finding: The skill markets medical-domain capabilities such as diagnosis logic extraction and treatment-plan structuring without a disclaimer that outputs are informational only and not a substitute for licensed clinical judgment. This increases the risk that users over-trust generated medical guidance, potentially causing unsafe self-diagnosis, inappropriate treatment decisions, or misuse in clinical contexts.

Missing User Warnings

Medium

Confidence: 88%
Finding: The skill encourages ingesting PDFs and building a persistent knowledge base/vector store but provides no warning that source documents may contain sensitive, proprietary, or regulated data. In practice, users may upload internal manuals, medical material, research papers, or personally identifiable information, causing unintended long-term retention, broader retrieval exposure, or downstream leakage through RAG responses.

Missing User Warnings

Medium

Confidence: 93%
Finding: The usage example normalizes storing extracted PDF content into a vector database without disclosing retention, access, or privacy consequences. This is dangerous because examples are often copied directly, so users may persist confidential educational, enterprise, research, or medical content into a retrievable index without safeguards, increasing the risk of unauthorized disclosure or model-assisted data exfiltration.

Missing User Warnings

Medium

Confidence: 92%
Finding: The code sends full atom content to OpenAIEmbeddings for remote embedding generation, which can disclose PDF-derived text to an external service without any explicit consent gate, warning, or data-classification check. In a knowledge-base builder, users may process copyrighted, sensitive, proprietary, medical, or research material, so silent exfiltration to a third party creates a real privacy and compliance risk.
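One way to add the missing consent gate is to wrap the embedding call so that nothing is sent unless the caller opts in explicitly. This is a minimal sketch, not the skill's code: `embed_with_consent`, `RemoteEmbeddingNotApproved`, and the `allow_remote` flag are hypothetical names, and `embed_fn` stands in for whatever remote embedding call the skill makes.

```python
from typing import Callable, List, Sequence


class RemoteEmbeddingNotApproved(RuntimeError):
    """Raised when text would be sent to a remote service without explicit approval."""


def embed_with_consent(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    allow_remote: bool = False,
) -> List[List[float]]:
    """Call embed_fn only if the caller has explicitly approved remote transmission."""
    if not allow_remote:
        raise RemoteEmbeddingNotApproved(
            f"Refusing to send {len(texts)} text chunk(s) to a remote embedding "
            "service; pass allow_remote=True after reviewing the data."
        )
    return embed_fn(list(texts))
```

With LangChain, `embed_fn` could plausibly be the `embed_documents` method of an `OpenAIEmbeddings` instance. A local model could be passed in the same way for documents that must not leave the machine.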

Missing User Warnings

Medium

Confidence: 89%
Finding: The storage path persists complete text chunks and metadata into a vector database, but the code does not warn the caller that source content is being durably written. This is risky because the skill is specifically designed to ingest entire books or domain documents, making accidental long-term storage of sensitive or licensed material more likely.
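A storage layer can make durable writes visible and reversible by logging each write and tagging every record with its source document. The sketch below uses a plain JSON file as a stand-in for the vector database. The function names `persist_atoms` and `delete_source` and the record layout are assumptions for illustration, not the skill's actual storage API.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("kb_store")


def persist_atoms(atoms, source, path):
    """Append text chunks to a JSON file, warning the caller that content is stored."""
    path = Path(path)
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    logger.warning("Persisting %d text chunk(s) from %s to %s", len(atoms), source, path)
    # Tag each record with its source so it can be found and removed later.
    records.extend({"source": source, "text": text} for text in atoms)
    path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return len(records)


def delete_source(source, path):
    """Remove every stored chunk that came from the given source; return the count removed."""
    path = Path(path)
    if not path.exists():
        return 0
    records = json.loads(path.read_text(encoding="utf-8"))
    kept = [r for r in records if r["source"] != source]
    path.write_text(json.dumps(kept, ensure_ascii=False), encoding="utf-8")
    return len(records) - len(kept)
```

Keeping the source on each record is what makes a deletion request practical. Without it, removing one licensed or sensitive book means rebuilding the whole store.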

Unpinned Dependencies

Low

Category: Supply Chain
Content: python>=3.8 # PDF processing pdfplumber>=0.10.0 PyMuPDF>=1.23.0 pdf2image>=1.16.0
Confidence: 81%
Finding: pdfplumber>=0.10.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # PDF processing pdfplumber>=0.10.0 PyMuPDF>=1.23.0 pdf2image>=1.16.0 # OCR
Confidence: 93%
Finding: PyMuPDF>=1.23.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # PDF processing pdfplumber>=0.10.0 PyMuPDF>=1.23.0 pdf2image>=1.16.0 # OCR pytesseract>=0.3.10
Confidence: 78%
Finding: pdf2image>=1.16.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # OCR pytesseract>=0.3.10 Pillow>=10.0.0 # NLP and text processing spacy>=3.7.0
Confidence: 92%
Finding: Pillow>=10.0.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: pinecone-client>=2.2.0 # Embedding models langchain>=0.1.0 langchain-openai>=0.0.5 sentence-transformers>=2.2.0
Confidence: 94%
Finding: langchain>=0.1.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # Embedding models langchain>=0.1.0 langchain-openai>=0.0.5 sentence-transformers>=2.2.0 # Knowledge graph
Confidence: 88%
Finding: langchain-openai>=0.0.5

Unpinned Dependencies

Low

Category: Supply Chain
Content: networkx>=3.1.0 # Data science numpy>=1.24.0 pandas>=2.0.0 scikit-learn>=1.3.0
Confidence: 85%
Finding: numpy>=1.24.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # Data science numpy>=1.24.0 pandas>=2.0.0 scikit-learn>=1.3.0 # Other python-dotenv>=1.0.0
Confidence: 84%
Finding: scikit-learn>=1.3.0

Unpinned Dependencies

Low

Category: Supply Chain
Content: # Other python-dotenv>=1.0.0 tqdm>=4.65.0
Confidence: 82%
Finding: tqdm>=4.65.0
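The unpinned-dependency findings above all follow one pattern: a `>=` lower bound with no exact version. A rough way to list such entries in a requirements file is sketched below. The helper name `unpinned_requirements` is hypothetical, and the parsing is deliberately simplified (it ignores pip options such as `-r`, and treats only a single `==` clause without wildcards as pinned).

```python
import re

# Package name, optional extras in brackets, then the version specifier.
NAME_RE = re.compile(r"([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?\s*(.*)")


def unpinned_requirements(lines):
    """Return names of requirements that are not pinned to one exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = NAME_RE.match(line)
        if match is None:  # e.g. pip options such as "-r other.txt"
            continue
        name = match.group(1)
        spec = match.group(2).split(";", 1)[0].strip()  # drop environment markers
        clauses = [c.strip() for c in spec.split(",") if c.strip()]
        pinned = (
            len(clauses) == 1
            and clauses[0].startswith("==")
            and "*" not in clauses[0]
        )
        if not pinned:
            unpinned.append(name)
    return unpinned
```

Pinning with `==` makes installs reproducible. It does not by itself verify package integrity, and pip's hash-checking mode is a separate step.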

Known Vulnerable Dependency: PyMuPDF, 1 advisory: CVE-2026-3029 (PyMuPDF has a path traversal in __main__.py)

Low

Category: Supply Chain
Confidence: 87%
Finding: PyMuPDF

Known Vulnerable Dependency: Pillow, 10 advisories: CVE-2016-2533 (Pillow buffer overflow in ImagingPcdDecode); CVE-2023-50447 (Arbitrary Code Execution in Pillow); CVE-2021-27922 (Pillow Uncontrolled Resource Consumption); and 7 more

Critical

Category: Supply Chain
Confidence: 97%
Finding: Pillow

Known Vulnerable Dependency: langchain, 10 advisories: CVE-2023-36258 (langchain arbitrary code execution vulnerability); CVE-2026-45134 (LangSmith SDK: Public prompt pull deserializes untrusted manifests without trust); CVE-2024-2965 (Denial of service in langchain-community); and 7 more

Critical

Category: Supply Chain
Confidence: 97%
Finding: langchain

Known Vulnerable Dependency: langchain-openai, 2 advisories: CVE-2026-41488 (langchain-openai: Image token counting SSRF protection can be bypassed via DNS r); CVE-2026-41488 (LangChain is a framework for building agents and LLM-powered applications. Prior)

Medium

Category: Supply Chain
Confidence: 91%
Finding: langchain-openai

Known Vulnerable Dependency: scikit-learn, 6 advisories: CVE-2020-13092 (scikit-learn Deserialization of Untrusted Data); CVE-2024-5206 (scikit-learn sensitive data leakage vulnerability); CVE-2020-28975 (scikit-learn Denial of Service); and 3 more

Critical

Category: Supply Chain
Confidence: 86%
Finding: scikit-learn
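The advisories listed for these packages span several years, so which ones apply depends on the versions actually installed. As a first step, the installed versions can be read with the standard library, as sketched below. The helper name `installed_versions` is hypothetical, and matching versions against each advisory's affected range still has to be done separately, for example with a dedicated audit tool.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    flagged = ["PyMuPDF", "Pillow", "langchain", "langchain-openai", "scikit-learn"]
    for package, version in installed_versions(flagged).items():
        print(f"{package}: {version or 'not installed'}")
```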

VirusTotal

All 67 vendors reported this skill as clean.