Atomic RAG Knowledge Base Builder (原子化RAG知识库构建器)

Security checks across malware telemetry and agentic risk

Overview

The skill mostly does what it claims, but its package metadata exposes a repository URL that appears to contain a GitHub token, and it under-discloses two risks: PDF text is sent to an external embedding service, and extracted document content is retained in storage.

Review before installing. Ask the publisher to remove and rotate the exposed GitHub-token-like credential, republish with a clean and consistent repository URL, and document that PDF text may be sent to OpenAI for embeddings and stored in vector databases or JSON files. Avoid using sensitive, regulated, proprietary, or medical documents unless you control the embedding backend, storage access, retention, and deletion process.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
Findings (21)

Context-Inappropriate Capability

High
Confidence
99% confidence
Finding
The repository URL contains what appears to be a live GitHub personal access token embedded directly in package metadata. Publishing credentials in a package file can expose repository access to anyone who reads or mirrors the package, enabling unauthorized cloning, modification, or broader account/repository compromise depending on the token's scope. The skill's purpose is knowledge-base building, so embedding a GitHub token is unrelated functionality and increases suspicion rather than reducing it.
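
A check like the one this finding describes can be sketched in a few lines of Python. This is a minimal illustration, not the scanner's actual logic: the function names `find_url_credentials` and `strip_userinfo` are hypothetical, and the regular expression only covers the publicly documented GitHub token prefixes. Stripping the token from the URL does not make it safe again; the credential still has to be revoked and rotated.

```python
import re
from typing import List
from urllib.parse import urlsplit, urlunsplit

# Documented GitHub token prefixes: ghp_, gho_, ghu_, ghs_, ghr_, github_pat_.
TOKEN_PATTERN = re.compile(
    r"\b(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9]{36,}\b"
    r"|\bgithub_pat_[A-Za-z0-9_]{22,}\b"
)


def find_url_credentials(url: str) -> List[str]:
    """Return the reasons a URL looks like it carries a credential."""
    reasons = []
    parts = urlsplit(url)
    if parts.username or parts.password:
        reasons.append("userinfo component present")
    if TOKEN_PATTERN.search(url):
        reasons.append("GitHub-token-like string present")
    return reasons


def strip_userinfo(url: str) -> str:
    """Rebuild the URL without the user:password@ part."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Synthetic token built at runtime so no real-looking secret appears in source.
fake_url = "https://ghp_" + "A" * 36 + "@github.com/example/repo.git"
print(find_url_credentials(fake_url))
print(strip_userinfo(fake_url))
```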

Missing User Warnings

Medium
Confidence
87% confidence
Finding
The document instructs users to process PDFs and store extracted content in a vector database, but it gives no warning about sensitive data, consent, retention, or access control. In a knowledge-base builder, users may ingest proprietary, personal, educational, or regulated documents, so omission of privacy guidance can lead to unintended disclosure through embeddings, retrieval, or downstream sharing.

Missing User Warnings

High
Confidence
94% confidence
Finding
The skill markets medical-domain capabilities such as diagnosis logic extraction and treatment-plan structuring without a disclaimer that outputs are informational only and not a substitute for licensed clinical judgment. This increases the risk that users over-trust generated medical guidance, potentially causing unsafe self-diagnosis, inappropriate treatment decisions, or misuse in clinical contexts.

Missing User Warnings

Medium
Confidence
88% confidence
Finding
The skill encourages ingesting PDFs and building a persistent knowledge base/vector store but provides no warning that source documents may contain sensitive, proprietary, or regulated data. In practice, users may upload internal manuals, medical material, research papers, or personally identifiable information, causing unintended long-term retention, broader retrieval exposure, or downstream leakage through RAG responses.

Missing User Warnings

Medium
Confidence
93% confidence
Finding
The usage example normalizes storing extracted PDF content into a vector database without disclosing retention, access, or privacy consequences. This is dangerous because examples are often copied directly, so users may persist confidential educational, enterprise, research, or medical content into a retrievable index without safeguards, increasing the risk of unauthorized disclosure or model-assisted data exfiltration.

Missing User Warnings

Medium
Confidence
92% confidence
Finding
The code sends full atom content to OpenAIEmbeddings for remote embedding generation, which can disclose PDF-derived text to an external service without any explicit consent gate, warning, or data-classification check. In a knowledge-base builder, users may process copyrighted, sensitive, proprietary, medical, or research material, so silent exfiltration to a third party creates a real privacy and compliance risk.
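
One way to add the missing consent gate is to make remote embedding an explicit opt-in and to refuse it for content flagged as sensitive. The sketch below is an assumption about how such a gate could look, not the skill's code: `choose_embedder`, `RemoteEmbeddingNotAllowed`, and both embedder callables are hypothetical, and no real embedding API is called.

```python
from typing import Callable, List, Sequence

# An embedder is any callable that maps texts to vectors (hypothetical interface).
Embedder = Callable[[Sequence[str]], List[List[float]]]


class RemoteEmbeddingNotAllowed(RuntimeError):
    """Raised when text would leave the machine despite being flagged sensitive."""


def choose_embedder(
    local: Embedder,
    remote: Embedder,
    allow_remote: bool,
    contains_sensitive: bool,
) -> Embedder:
    """Pick an embedding backend; remote use requires explicit opt-in."""
    if not allow_remote:
        # Default path: nothing is sent to a third party.
        return local
    if contains_sensitive:
        raise RemoteEmbeddingNotAllowed(
            "Document is flagged sensitive; refusing to send text to a remote service."
        )
    return remote


def fake_local(texts: Sequence[str]) -> List[List[float]]:
    # Stand-in for an on-device model; returns one dummy vector per text.
    return [[0.0] for _ in texts]


def fake_remote(texts: Sequence[str]) -> List[List[float]]:
    # Stand-in for a hosted embedding API; no network call is made here.
    return [[1.0] for _ in texts]


embed = choose_embedder(fake_local, fake_remote, allow_remote=False, contains_sensitive=True)
print(embed(["chunk one", "chunk two"]))
```

Defaulting to the local backend means a copied usage example cannot send text to a third party unless the caller changes a flag on purpose.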

Missing User Warnings

Medium
Confidence
89% confidence
Finding
The storage path persists complete text chunks and metadata into a vector database, but the code does not warn the caller that source content is being durably written. This is risky because the skill is specifically designed to ingest entire books or domain documents, making accidental long-term storage of sensitive or licensed material more likely.
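
The durable-write concern can be addressed by requiring the caller to acknowledge persistence and by recording what was written, so it can be deleted later. The following is a rough sketch under assumptions: `store_with_manifest` and `delete_from_manifest` are hypothetical names, and a plain dictionary stands in for the vector database.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, List


def store_with_manifest(
    chunks: Iterable[str],
    write_chunk: Callable[[str, str], None],
    acknowledged: bool = False,
) -> List[Dict[str, object]]:
    """Write chunks through write_chunk only after explicit acknowledgement.

    Returns a manifest of stored IDs so the content can be deleted later.
    """
    if not acknowledged:
        raise PermissionError(
            "Source text will be stored durably; pass acknowledged=True to proceed."
        )
    manifest = []
    for text in chunks:
        # Content-derived ID; a real store may assign its own identifiers.
        chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        write_chunk(chunk_id, text)
        manifest.append({"id": chunk_id, "stored_at": int(time.time())})
    return manifest


def delete_from_manifest(
    manifest: List[Dict[str, object]],
    delete_chunk: Callable[[str], None],
) -> int:
    """Delete every chunk listed in the manifest; returns the number removed."""
    for entry in manifest:
        delete_chunk(str(entry["id"]))
    return len(manifest)


fake_store: Dict[str, str] = {}  # stand-in for a vector database
manifest = store_with_manifest(["alpha", "beta"], fake_store.__setitem__, acknowledged=True)
print(len(fake_store))
delete_from_manifest(manifest, fake_store.__delitem__)
print(len(fake_store))
```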

Unpinned Dependencies

Low
Category
Supply Chain
Content
python>=3.8

# PDF processing
pdfplumber>=0.10.0
PyMuPDF>=1.23.0
pdf2image>=1.16.0
Confidence
81% confidence
Finding
pdfplumber>=0.10.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# PDF processing
pdfplumber>=0.10.0
PyMuPDF>=1.23.0
pdf2image>=1.16.0

# OCR
Confidence
93% confidence
Finding
PyMuPDF>=1.23.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# PDF processing
pdfplumber>=0.10.0
PyMuPDF>=1.23.0
pdf2image>=1.16.0

# OCR
pytesseract>=0.3.10
Confidence
78% confidence
Finding
pdf2image>=1.16.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# OCR
pytesseract>=0.3.10
Pillow>=10.0.0

# NLP and text processing
spacy>=3.7.0
Confidence
92% confidence
Finding
Pillow>=10.0.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
pinecone-client>=2.2.0

# Embedding models
langchain>=0.1.0
langchain-openai>=0.0.5
sentence-transformers>=2.2.0
Confidence
94% confidence
Finding
langchain>=0.1.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# Embedding models
langchain>=0.1.0
langchain-openai>=0.0.5
sentence-transformers>=2.2.0

# Knowledge graph
Confidence
88% confidence
Finding
langchain-openai>=0.0.5

Unpinned Dependencies

Low
Category
Supply Chain
Content
networkx>=3.1.0

# Data science
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
Confidence
85% confidence
Finding
numpy>=1.24.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# Data science
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0

# Other
python-dotenv>=1.0.0
Confidence
84% confidence
Finding
scikit-learn>=1.3.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
# Other
python-dotenv>=1.0.0
tqdm>=4.65.0
Confidence
82% confidence
Finding
tqdm>=4.65.0
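
The unpinned-dependency findings above all follow one pattern: a requirement uses `>=` (or no specifier) instead of an exact `==` pin. A simplified detector can be sketched as follows. The function name `find_unpinned` is hypothetical, and the parser handles only plain `name<op>version` lines, not the full requirements-file grammar (extras, markers, URLs, hashes).

```python
import re
from typing import List, Tuple

# Package name followed by an optional comparison operator (simplified).
REQ_LINE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(===|==|>=|<=|~=|!=|>|<)?")


def find_unpinned(requirements_text: str) -> List[Tuple[str, str]]:
    """Return (name, line) pairs for requirements not pinned with '=='."""
    findings = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = REQ_LINE.match(line)
        if match is None:
            continue
        name, operator = match.groups()
        if operator not in ("==", "==="):
            findings.append((name, line))
    return findings


sample = """
# PDF processing
pdfplumber>=0.10.0
PyMuPDF==1.23.0
tqdm
"""
for name, line in find_unpinned(sample):
    print(f"unpinned: {name} ({line})")
```

Exact pins are usually combined with a lock file or hash checking, since a pin alone does not protect against a republished artifact.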

Known Vulnerable Dependency: PyMuPDF (1 advisory): CVE-2026-3029 (PyMuPDF has a path traversal in _main_.py)

Low
Category
Supply Chain
Confidence
87% confidence
Finding
PyMuPDF

Known Vulnerable Dependency: Pillow (10 advisories): CVE-2016-2533 (Pillow buffer overflow in ImagingPcdDecode); CVE-2023-50447 (Arbitrary Code Execution in Pillow); CVE-2021-27922 (Pillow Uncontrolled Resource Consumption) +7 more

Critical
Category
Supply Chain
Confidence
97% confidence
Finding
Pillow
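
A range such as `Pillow>=10.0.0` triggers this kind of finding when its lower bound still admits releases older than the first fixed version. The sketch below illustrates that comparison. It assumes CVE-2023-50447 was first fixed in Pillow 10.2.0, which should be verified against the advisory. The helper names are hypothetical, and the version parser handles only plain dotted integers, not pre-release or local version tags.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version, ignoring trailing zeros (10.2 == 10.2.0)."""
    parts = [int(piece) for piece in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def lower_bound_admits_vulnerable(lower_bound: str, first_fixed: str) -> bool:
    """True if a '>= lower_bound' requirement can resolve to an unfixed release."""
    return parse_version(lower_bound) < parse_version(first_fixed)


# Assumed first fixed release for CVE-2023-50447; confirm before relying on it.
ASSUMED_FIRST_FIXED = "10.2.0"

print(lower_bound_admits_vulnerable("10.0.0", ASSUMED_FIRST_FIXED))
print(lower_bound_admits_vulnerable("10.2.0", ASSUMED_FIRST_FIXED))
```

A true result means only that the range permits a vulnerable version, not that one is installed; the resolved version in the actual environment decides exposure.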

Known Vulnerable Dependency: langchain (10 advisories): CVE-2023-36258 (langchain arbitrary code execution vulnerability); CVE-2026-45134 (LangSmith SDK: Public prompt pull deserializes untrusted manifests without trust); CVE-2024-2965 (Denial of service in langchain-community) +7 more

Critical
Category
Supply Chain
Confidence
97% confidence
Finding
langchain

Known Vulnerable Dependency: langchain-openai (2 advisories): CVE-2026-41488 (langchain-openai: Image token counting SSRF protection can be bypassed via DNS r); CVE-2026-41488 (LangChain is a framework for building agents and LLM-powered applications. Prior)

Medium
Category
Supply Chain
Confidence
91% confidence
Finding
langchain-openai

Known Vulnerable Dependency: scikit-learn (6 advisories): CVE-2020-13092 (scikit-learn Deserialization of Untrusted Data); CVE-2024-5206 (scikit-learn sensitive data leakage vulnerability); CVE-2020-28975 (scikit-learn Denial of Service) +3 more

Critical
Category
Supply Chain
Confidence
86% confidence
Finding
scikit-learn

VirusTotal

67/67 vendors flagged this skill as clean.
