Sci Data Extractor

Security checks across malware telemetry and agentic risk

Overview

This skill does what it claims: it extracts data from user-selected PDFs using local PDF parsing plus configured AI/OCR services, with some privacy and install-hygiene cautions.

Install only if you are comfortable sending extracted PDF content to your configured LLM provider and, when using --ocr mathpix, uploading the PDF to Mathpix. Use a virtual environment, prefer EXTRACTOR_API_KEY and EXTRACTOR_BASE_URL over generic API_KEY/BASE_URL, avoid confidential or unpublished papers unless provider terms allow it, and prefer safer install methods over curl | sh.
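The advice to prefer EXTRACTOR_API_KEY and EXTRACTOR_BASE_URL can be illustrated with a small lookup helper. This is a minimal sketch only: the function name `resolve_credentials`, the fallback order, and the returned flag are assumptions made for illustration, since this report does not show the skill's actual lookup logic.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str], bool]:
    """Return (api_key, base_url, used_generic_fallback).

    Hypothetical helper: extractor-specific variables win, and the generic
    API_KEY / BASE_URL names are read only when the specific ones are unset.
    """
    api_key = env.get("EXTRACTOR_API_KEY")
    base_url = env.get("EXTRACTOR_BASE_URL")
    used_generic = False

    if api_key is None and "API_KEY" in env:
        api_key = env["API_KEY"]
        used_generic = True  # caller may want to warn about the generic name
    if base_url is None and "BASE_URL" in env:
        base_url = env["BASE_URL"]
        used_generic = True

    return api_key, base_url, used_generic
```

Returning the fallback flag lets a caller warn when a generic variable, possibly set for an unrelated tool, is about to be used as the provider credential.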

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Tool Misuse: Tool Parameter Abuse, Chaining Abuse, Unsafe Defaults
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (17)

Context-Inappropriate Capability

Medium

Confidence: 91%
Finding: The code loads API credentials for external AI/OCR providers and later uses them to transmit document content off-host. In the context of a PDF extractor, this is a genuine data-exposure risk if users are not clearly informed that their documents and extracted text may be sent to third parties.

Description-Behavior Mismatch

Medium

Confidence: 96%
Finding: The Mathpix path uploads the entire input PDF to an external service, which exceeds a plain local 'extract structured data from PDFs' expectation unless clearly disclosed. This can leak sensitive or unpublished scientific content, especially because the full document is transmitted rather than a minimized subset.

Description-Behavior Mismatch

Medium

Confidence: 97%
Finding: The extractor sends extracted paper text to an external chat-completions API for analysis, creating a real confidentiality and compliance risk for sensitive documents. The manifest describes extraction from PDFs but does not make clear that document contents are shared with a remote model provider.

Missing User Warnings

Medium

Confidence: 95%
Finding: The README explicitly promotes OCR and AI-based extraction using third-party services, which implies scientific document contents may be transmitted off-host to Anthropic, OpenAI, or Mathpix. Because there is no clear privacy, confidentiality, or data-transmission warning near the feature description or setup flow, users may unknowingly send unpublished, proprietary, or sensitive research content to external providers.

Missing User Warnings

Medium

Confidence: 90%
Finding: The README states that PDF content is processed via external AI/OCR services, but it does not clearly warn users that paper text, tables, figures, and possibly sensitive unpublished data may be transmitted to third-party providers. In a research workflow, this can create confidentiality, data governance, or IP leakage risk if users assume processing is local.

Missing User Warnings

Medium

Confidence: 94%
Finding: The feature description advertises PDF extraction and LLM-based processing but does not warn that uploaded PDF contents may be sent to external OCR or LLM providers. In a scientific-literature context, PDFs may contain unpublished, licensed, or sensitive research data, so omission of this warning can cause unintended data disclosure.

Missing User Warnings

Medium

Confidence: 95%
Finding: The script uploads the PDF to Mathpix without an explicit privacy or data-transfer warning at the point of use. Users may unknowingly send confidential research documents to a third party, making this a meaningful transparency and data-handling vulnerability.

Missing User Warnings

Medium

Confidence: 96%
Finding: The code forwards extracted PDF text to a remote LLM provider without clearly warning the user that document contents leave the local environment. This is dangerous because papers may contain unpublished, regulated, or proprietary information that should not be shared externally by default.
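The missing-warning findings above all point at the same gap: nothing asks the user before content leaves the machine. A minimal sketch of a point-of-use consent gate follows. The function name `confirm_external_send` and its parameters are hypothetical and do not come from the skill's code.

```python
from typing import Callable


def confirm_external_send(
    destination: str,
    what: str,
    ask: Callable[[str], str] = input,
) -> bool:
    """Ask for explicit consent before content is sent off-host.

    Hypothetical helper: `ask` defaults to the built-in input() and can be
    swapped out for non-interactive use or tests. Anything other than an
    explicit "y" or "yes" is treated as a refusal.
    """
    prompt = (
        f"About to send {what} to {destination}. "
        "This content will leave the local machine. Continue? [y/N] "
    )
    answer = ask(prompt).strip().lower()
    return answer in ("y", "yes")
```

A caller could invoke it as `confirm_external_send("Mathpix", "the full PDF")` before the upload and abort when it returns False. Defaulting to refusal keeps an accidental Enter key from sending a confidential paper.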

External Script Fetching

Low

Category: Supply Chain
Content:
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies in project directory
cd ~/.claude/skills/sci-data-extractor
```
Confidence: 93%
Finding: curl -LsSf https://astral.sh/uv/install.sh | sh
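One common mitigation for a piped installer is to download the script to a file, compare its digest against a value obtained from a trusted source, and execute it only on a match. The sketch below shows just the comparison step. The helper name `sha256_matches` is hypothetical, and it is an assumption that a trusted reference digest for the installer is available to compare against.

```python
import hashlib
import hmac


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True when the SHA-256 digest of `data` equals `expected_hex`.

    Hypothetical helper: `expected_hex` must come from a source you trust,
    not from the same download as the script itself.
    """
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

In use, the downloaded script's bytes would be read from disk and passed in, and the script would be run only when the function returns True.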

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    # Sci-Data-Extractor dependency list
    # Core PDF processing
    pymupdf>=1.23.0
    # LLM API calls
    openai>=1.12.0
Confidence: 89%
Finding: pymupdf>=1.23.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    pymupdf>=1.23.0
    # LLM API calls
    openai>=1.12.0
    # HTTP requests (Mathpix OCR)
    requests>=2.31.0
Confidence: 87%
Finding: openai>=1.12.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    openai>=1.12.0
    # HTTP requests (Mathpix OCR)
    requests>=2.31.0
    # Environment variable management (optional)
    python-dotenv>=1.0.0
Confidence: 95%
Finding: requests>=2.31.0

Unpinned Dependencies

Low

Category: Supply Chain
Content:
    requests>=2.31.0
    # Environment variable management (optional)
    python-dotenv>=1.0.0
Confidence: 84%
Finding: python-dotenv>=1.0.0
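The four unpinned-dependency findings above share one pattern: each requirement uses `>=` rather than an exact `==` version. A minimal sketch of a check that flags such lines is shown below. The function name `unpinned_requirements` and the regular expression are illustrative assumptions, and the parser is deliberately simplified: it ignores option lines, environment markers, and URLs that contain `#`.

```python
import re
from typing import List

# Matches "name==version", optionally with extras such as "name[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+\s*==\s*[^\s;#]+")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned to an exact version."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        if not PINNED.match(line):
            flagged.append(line)
    return flagged
```

Run against the dependency list quoted in these findings, the check would flag all four entries, since each uses a lower bound instead of an exact pin.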

Known Vulnerable Dependency: pymupdf (1 advisory): CVE-2026-3029 (PyMuPDF has a path traversal in _main_.py)

Low

Category: Supply Chain
Confidence: 78%
Finding: pymupdf

Known Vulnerable Dependency: requests (10 advisories): CVE-2014-1830 (Exposure of Sensitive Information to an Unauthorized Actor in Requests); CVE-2024-47081 (Requests vulnerable to .netrc credentials leak via malicious URLs); CVE-2024-35195 (Requests `Session` object does not verify requests after making first request wi) +7 more

High

Category: Supply Chain
Confidence: 97%
Finding: requests

Known Vulnerable Dependency: python-dotenv (1 advisory): CVE-2026-28684 (python-dotenv: Symlink following in set_key allows arbitrary file overwrite via )

Low

Category: Supply Chain
Confidence: 76%
Finding: python-dotenv

Chaining Abuse

High

Category: Tool Misuse
Content:
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies in project directory
cd ~/.claude/skills/sci-data-extractor
```
Confidence: 96%
Finding: | sh

VirusTotal

63/63 vendors flagged this skill as clean.
