pdf-miner

Security checks across malware telemetry and agentic risk

Overview

This PDF extraction skill needs review because it can automatically send PDF page images to an external OCR API even though its summary says scanned-PDF OCR is out of scope.

Install only if you are comfortable with remote OCR processing. For confidential PDFs, either leave OCR credentials unconfigured or run with --no-auto-ocr. If you do use OCR, supply keys through environment variables rather than storing them in config.json, and verify the configured OCR endpoint and model before processing sensitive documents.
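One way to act on the "verify the configured OCR endpoint" advice is a small pre-flight check run before any document is processed. The sketch below is illustrative only: the environment variable names (OCR_API_KEY, OCR_BASE_URL, OCR_MODEL), the function resolve_ocr_settings, and the host allowlist are assumptions, not names taken from the skill, whose real configuration keys may differ.

```python
from urllib.parse import urlparse

# Hypothetical variable names; the skill's real names may differ.
ENV_KEY = "OCR_API_KEY"
ENV_BASE_URL = "OCR_BASE_URL"
ENV_MODEL = "OCR_MODEL"


def resolve_ocr_settings(env, allowed_hosts):
    """Return OCR settings read from `env`, or None when OCR is not configured.

    Raises ValueError if the endpoint is not HTTPS or its host is not
    in `allowed_hosts`.
    """
    api_key = env.get(ENV_KEY)
    base_url = env.get(ENV_BASE_URL)
    if not api_key or not base_url:
        # No credentials or endpoint: nothing can be sent off-host.
        return None

    parsed = urlparse(base_url)
    if parsed.scheme != "https":
        raise ValueError(f"OCR endpoint must use https: {base_url}")
    if parsed.hostname not in allowed_hosts:
        raise ValueError(f"OCR host is not allowlisted: {parsed.hostname}")

    return {"api_key": api_key, "base_url": base_url, "model": env.get(ENV_MODEL)}
```

In practice this would be called with os.environ and a set of hosts approved by the operator, so a mistyped or unexpected endpoint fails loudly instead of receiving page images.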

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Findings (15)

Lp3

Medium
Category
MCP Least Privilege
Confidence
89% confidence
Finding
The skill documentation instructs users to set environment variables and to read and write local files, but the manifest does not declare those capabilities. Hidden capability gaps matter because users and policy layers cannot accurately assess what the skill needs to access, especially when it handles PDFs and optional credentials. In this context the issue is a transparency and control failure rather than direct code execution, but it increases risk when combined with OCR and external API usage.

Tp4

High
Category
MCP Tool Poisoning
Confidence
97% confidence
Finding
The manifest says the skill is not for OCR on scanned PDFs, yet the body documents built-in OCR, automatic OCR triggering, and transmission of rendered page images to an external vision API. This mismatch can cause users or orchestrators to invoke the skill under false assumptions, leading to unanticipated network egress and exposure of sensitive document contents. Because PDFs often contain confidential business or financial data, the discrepancy materially increases the potential harm.

Description-Behavior Mismatch

High
Confidence
98% confidence
Finding
The skill explicitly states it is not for OCR on scanned/image PDFs, but later adds full OCR support and even automatic OCR behavior. That inconsistency is dangerous because operators may rely on the manifest for data handling boundaries, while the skill can actually send image renderings of document pages to third-party services. In a document-processing skill, this hidden expansion of scope is a significant trust and privacy risk.

Context-Inappropriate Capability

Medium
Confidence
87% confidence
Finding
The skill introduces storage and handling of external vision API credentials in environment variables, config files, and command-line arguments, despite presenting itself primarily as a local PDF extraction tool. This broadens the attack surface by encouraging secret placement in files and shell history and by adding network dependency without strong justification in the manifest. The risk is amplified because users may not expect remote processing from a PDF reader skill.

Intent-Code Divergence

High
Confidence
96% confidence
Finding
The documentation is self-contradictory: it first excludes OCR, then later describes OCR as automatic/default behavior for low-text pages. Automatic remote OCR is especially sensitive because it can activate without the user deliberately choosing it, causing confidential page images to leave the local environment unexpectedly. The contradiction undermines informed consent and safe deployment decisions.

Description-Behavior Mismatch

High
Confidence
98% confidence
Finding
The skill metadata and docstring state OCR/scanned PDFs are out of scope, yet the implementation includes OCR and enables auto-OCR by default. That causes low-text PDF pages to be rendered and potentially transmitted to an external API, expanding the skill's capabilities beyond what users would reasonably expect and creating an unexpected data exfiltration path for document contents.

Context-Inappropriate Capability

High
Confidence
99% confidence
Finding
The OCR function converts PDF pages to images and sends them to a remote vision model endpoint using the OpenAI-compatible client. This means potentially sensitive PDF contents leave the local environment and are disclosed to a third party, which is especially risky for financial, research, or internal documents the skill is designed to process.

Context-Inappropriate Capability

Medium
Confidence
90% confidence
Finding
The script loads OCR API keys, base URLs, and models from environment variables and a local config file to support remote OCR connectivity, even though the skill is presented primarily as PDF extraction and its metadata says OCR is out of scope. While reading configuration is common, in this context it silently prepares a networked data-transfer capability that users may not expect.

Intent-Code Divergence

Medium
Confidence
97% confidence
Finding
The code advertises OCR as optional, but runtime logic enables automatic OCR by default through the hidden auto_ocr flag. This discrepancy undermines informed consent and can trigger remote processing of document images without a clear, intentional user action.
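A minimal sketch of the opposite default, assuming a command-line entry point built on argparse: OCR stays off unless the user passes an explicit flag. The --auto-ocr flag, the build_parser and should_run_ocr helpers, and the 50-character threshold for a "low-text" page are all hypothetical and are not taken from the skill's code.

```python
import argparse


def build_parser():
    """Build a parser in which remote OCR is opt-in (hypothetical flag names)."""
    parser = argparse.ArgumentParser(prog="pdf-extract-sketch")
    parser.add_argument("pdf", help="path to the PDF to extract")
    parser.add_argument(
        "--auto-ocr",
        dest="auto_ocr",
        action="store_true",
        default=False,  # off unless the user asks for it
        help="send low-text pages to the configured remote OCR endpoint",
    )
    return parser


def should_run_ocr(args, has_credentials, page_text_chars, min_chars=50):
    """Return True only when the user opted in, credentials exist,
    and the page has fewer than `min_chars` extracted characters."""
    return bool(args.auto_ocr and has_credentials and page_text_chars < min_chars)
```

With this shape, the mere presence of credentials in the environment is never enough to trigger an upload; a deliberate user action is also required.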

Description-Behavior Mismatch

High
Confidence
96% confidence
Finding
The code explicitly implements OCR for scanned/image-based PDFs even though the skill metadata says the skill is NOT for OCR on scanned image-based PDFs. This scope mismatch is security-relevant because it enables processing and transmission of document image content in a way users and reviewers would not expect, increasing the chance of covert data exfiltration or unauthorized handling of sensitive files.

Context-Inappropriate Capability

High
Confidence
98% confidence
Finding
The function sends rendered PDF page images to a remote OpenAI-compatible endpoint, which can disclose the full visual contents of potentially sensitive PDFs to a third party. This is especially dangerous because the advertised purpose is local PDF extraction, so users may not expect external transmission of document data.

Intent-Code Divergence

Medium
Confidence
90% confidence
Finding
The module docstring advertises OCR of scanned/image-based PDFs despite the manifest saying such use is out of scope. This inconsistency can mislead integrators and reviewers, weakening trust boundaries and making it easier for sensitive document-processing behavior to be introduced without proper review.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
The OCR flow sends rendered PDF page images to an external vision API, but the documentation does not prominently warn that document contents may be transmitted off-host to a third party. This is a meaningful privacy and compliance issue because PDFs commonly contain proprietary, personal, or regulated information, and automatic OCR increases the chance of accidental disclosure. The absence of explicit warning makes misuse more likely.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
In the OCR integration path, low-text pages are automatically selected and sent for OCR if credentials are available, but the user-facing output only notes that OCR is happening rather than clearly warning that page images are being transmitted off-box. For document-processing skills, this is a significant privacy and compliance issue because users may assume extraction is local.

Missing User Warnings

Medium
Confidence
95% confidence
Finding
Page images are uploaded to a remote vision API without any visible user-facing warning, consent check, or disclosure in this script. That creates a privacy and compliance risk because sensitive PDF contents may leave the local environment unexpectedly.
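The missing warning and consent check described in these findings could look roughly like the sketch below. It is an illustration under stated assumptions, not the skill's code: confirm_upload, its parameters, and the message wording are hypothetical. The prompt function and output stream are injectable so the check can be exercised without a terminal.

```python
import sys


def confirm_upload(page_numbers, endpoint_host, assume_yes=False,
                   ask=input, out=sys.stderr):
    """Warn that page images will leave the host and ask for consent.

    Returns True if the upload may proceed. `assume_yes` stands in for a
    hypothetical non-interactive override; the warning is printed either way.
    """
    pages = ", ".join(str(n) for n in page_numbers)
    print(
        f"WARNING: images of page(s) {pages} will be sent to "
        f"{endpoint_host} for OCR.",
        file=out,
    )
    if assume_yes:
        return True
    answer = ask("Send these pages off-host? [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in {"y", "yes"}
```

Defaulting to "no" on an empty or unrecognized answer keeps accidental disclosure from being the path of least resistance.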

VirusTotal

66/66 vendors flagged this skill as clean.
