Sci Data Extractor

Security checks across malware telemetry and agentic risk

Overview

This skill does what it claims: it extracts data from user-selected PDFs using local PDF parsing plus configured AI/OCR services, with some privacy and install-hygiene cautions.

Install only if you are comfortable sending extracted PDF content to your configured LLM provider and, when using --ocr mathpix, uploading the PDF to Mathpix. Use a virtual environment, prefer EXTRACTOR_API_KEY and EXTRACTOR_BASE_URL over generic API_KEY/BASE_URL, avoid confidential or unpublished papers unless provider terms allow it, and prefer safer install methods over curl | sh.
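
The preference for EXTRACTOR_API_KEY and EXTRACTOR_BASE_URL over the generic names can be illustrated with a minimal sketch. The helper name `resolve_credentials` and its opt-in fallback are assumptions made for illustration; the skill's actual lookup logic is not shown in this report.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    env: Optional[Mapping[str, str]] = None,
    allow_generic: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (api_key, base_url), preferring the skill-specific variables.

    Hypothetical helper: the generic API_KEY/BASE_URL names are read only
    when the caller opts in, so a key meant for another tool is not picked
    up silently.
    """
    env = os.environ if env is None else env
    api_key = env.get("EXTRACTOR_API_KEY")
    base_url = env.get("EXTRACTOR_BASE_URL")
    if allow_generic:
        # Fall back to the generic names only for values still missing.
        api_key = api_key or env.get("API_KEY")
        base_url = base_url or env.get("BASE_URL")
    return api_key, base_url
```

With this shape, an unrelated API_KEY already exported in the shell is ignored unless `allow_generic=True` is passed explicitly.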

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Tool Misuse: Tool Parameter Abuse, Chaining Abuse, Unsafe Defaults
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (17)

Context-Inappropriate Capability

Medium
Confidence
91% confidence
Finding
The code loads API credentials for external AI/OCR providers and later uses them to transmit document content off-host. In the context of a PDF extractor, this is a genuine data-exposure risk if users are not clearly informed that their documents and extracted text may be sent to third parties.

Description-Behavior Mismatch

Medium
Confidence
96% confidence
Finding
The Mathpix path uploads the entire input PDF to an external service, which exceeds a plain local 'extract structured data from PDFs' expectation unless clearly disclosed. This can leak sensitive or unpublished scientific content, especially because the full document is transmitted rather than a minimized subset.

Description-Behavior Mismatch

Medium
Confidence
97% confidence
Finding
The extractor sends extracted paper text to an external chat-completions API for analysis, creating a real confidentiality and compliance risk for sensitive documents. The manifest describes extraction from PDFs but does not make clear that document contents are shared with a remote model provider.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The README explicitly promotes OCR and AI-based extraction using third-party services, which implies scientific document contents may be transmitted off-host to Anthropic, OpenAI, or Mathpix. Because there is no clear privacy, confidentiality, or data-transmission warning near the feature description or setup flow, users may unknowingly send unpublished, proprietary, or sensitive research content to external providers.

Missing User Warnings

Medium
Confidence
90% confidence
Finding
The README states that PDF content is processed via external AI/OCR services, but it does not clearly warn users that paper text, tables, figures, and possibly sensitive unpublished data may be transmitted to third-party providers. In a research workflow, this can create confidentiality, data governance, or IP leakage risk if users assume processing is local.

Missing User Warnings

Medium
Confidence
94% confidence
Finding
The feature description advertises PDF extraction and LLM-based processing but does not warn that uploaded PDF contents may be sent to external OCR or LLM providers. In a scientific-literature context, PDFs may contain unpublished, licensed, or sensitive research data, so omission of this warning can cause unintended data disclosure.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The script uploads the PDF to Mathpix without an explicit privacy or data-transfer warning at the point of use. Users may unknowingly send confidential research documents to a third party, making this a meaningful transparency and data-handling vulnerability.

Missing User Warnings

Medium
Confidence
96% confidence
Finding
The code forwards extracted PDF text to a remote LLM provider without clearly warning the user that document contents leave the local environment. This is dangerous because papers may contain unpublished, regulated, or proprietary information that should not be shared externally by default.
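
One way to address the missing point-of-use warning described in the findings above is an explicit consent gate before any upload. The sketch below is illustrative only: `confirm_remote_transfer` is a hypothetical name, and the report does not show how the skill's Mathpix or LLM calls are structured.

```python
from typing import Callable


def confirm_remote_transfer(
    provider: str,
    what: str,
    ask: Callable[[str], str] = input,
) -> bool:
    """Ask for explicit consent before document content leaves the host.

    Hypothetical helper: returns True only when the user types "yes".
    The `ask` parameter exists so the prompt can be replaced in tests.
    """
    prompt = (
        f"This will send {what} to {provider}, a third-party service. "
        "Type 'yes' to continue: "
    )
    return ask(prompt).strip().lower() == "yes"
```

A caller would check the return value before the upload step, for example `confirm_remote_transfer("Mathpix", "the full PDF")`, and abort when it is False. Anything other than an explicit "yes" is treated as refusal, so the default is not to transmit.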

External Script Fetching

Low
Category
Supply Chain
Content
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies in project directory
cd ~/.claude/skills/sci-data-extractor
```
Confidence
93% confidence
Finding
curl -LsSf https://astral.sh/uv/install.sh | sh

Unpinned Dependencies

Low
Category
Supply Chain
Content
# Sci-Data-Extractor dependency list

# Core PDF processing
pymupdf>=1.23.0

# LLM API calls
openai>=1.12.0
Confidence
89% confidence
Finding
pymupdf>=1.23.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
pymupdf>=1.23.0

# LLM API calls
openai>=1.12.0

# HTTP requests (Mathpix OCR)
requests>=2.31.0
Confidence
87% confidence
Finding
openai>=1.12.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
openai>=1.12.0

# HTTP requests (Mathpix OCR)
requests>=2.31.0

# Environment variable management (optional)
python-dotenv>=1.0.0
Confidence
95% confidence
Finding
requests>=2.31.0

Unpinned Dependencies

Low
Category
Supply Chain
Content
requests>=2.31.0

# Environment variable management (optional)
python-dotenv>=1.0.0
Confidence
84% confidence
Finding
python-dotenv>=1.0.0
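
The unpinned-dependency findings above flag `>=` specifiers, which let a later install pull a different version than the one reviewed. A rough sketch of how such lines could be detected follows; the function name and the regular expression are assumptions for illustration, and this is a heuristic rather than a full requirement-specifier parser.

```python
import re
from typing import List

# Matches "name==version" followed by nothing except an optional comment.
_PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+\s*==\s*[^\s=*]+\s*(#.*)?$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned to one exact version."""
    flagged = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-only lines
        if not _PINNED.match(line):
            flagged.append(line)
    return flagged
```

Run against the excerpt above, every dependency line would be flagged because each uses `>=`. Pinning with `==` narrows the window; pip's hash-checking mode (`--require-hashes`) goes further by also fixing the artifact contents.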

Known Vulnerable Dependency: pymupdf - 1 advisory: CVE-2026-3029 (PyMuPDF has a path traversal in _main_.py)

Low
Category
Supply Chain
Confidence
78% confidence
Finding
pymupdf

Known Vulnerable Dependency: requests - 10 advisories: CVE-2014-1830 (Exposure of Sensitive Information to an Unauthorized Actor in Requests); CVE-2024-47081 (Requests vulnerable to .netrc credentials leak via malicious URLs); CVE-2024-35195 (Requests `Session` object does not verify requests after making first request wi) +7 more

High
Category
Supply Chain
Confidence
97% confidence
Finding
requests

Known Vulnerable Dependency: python-dotenv - 1 advisory: CVE-2026-28684 (python-dotenv: Symlink following in set_key allows arbitrary file overwrite via )

Low
Category
Supply Chain
Confidence
76% confidence
Finding
python-dotenv

Chaining Abuse

High
Category
Tool Misuse
Content
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies in project directory
cd ~/.claude/skills/sci-data-extractor
```
Confidence
96% confidence
Finding
| sh
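
The risk in piping a download straight into `sh` is that whatever the server returns runs unreviewed. A common mitigation is to download to a file first and compare its digest against a value obtained through a trusted channel before executing anything. The sketch below shows only the comparison step; `matches_sha256` is a hypothetical helper, and whether a published checksum is available for this particular installer is an assumption not confirmed by this report.

```python
import hashlib
import hmac


def matches_sha256(payload: bytes, expected_hex: str) -> bool:
    """Return True only if payload hashes to the expected SHA-256 digest."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A caller would read the downloaded script as bytes, call `matches_sha256(script_bytes, expected)`, and run the script only when the result is True.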

VirusTotal

63/63 vendors flagged this skill as clean.
