PDF Text Extractor

PassAudited by ClawScan on May 1, 2026.

Overview

The skill does not show exfiltration, persistence, or destructive behavior; it mainly reads user-selected PDFs, but its dependency and OCR claims are inconsistent and extracted document text should be treated as sensitive.

This skill appears safe to use for its stated purpose if you intentionally choose the PDFs. Before installing, note that it is not truly dependency-free, OCR support appears overstated, and any extracted document text may be visible to the agent and should not be treated as instructions.

Findings (3)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Low

#ASI02: Tool Misuse and Exploitation

What this means

If the agent or user selects the wrong PDF, private document text and metadata could be exposed in the conversation.

Why it was flagged

The skill reads the file path supplied by the caller. This is expected for a PDF extractor, but it means any selected PDF's contents can be brought into the agent output.

Skill content

const fileData = fs.readFileSync(pdfPath);

Recommendation

Use explicit, intended PDF paths and review batch inputs before extraction, especially for invoices, contracts, or other confidential documents.

Low

#ASI04: Agentic Supply Chain Vulnerabilities

What this means

A user expecting a dependency-free skill may need to install third-party npm code for the extractor to work.

Why it was flagged

The package declares an npm dependency even though the skill description says zero dependencies and the registry lists no install spec. The dependency is aligned with PDF parsing, but the dependency footprint is not consistently disclosed.

Skill content

"dependencies": { "pdfjs-dist": "^3.11.174" }

Recommendation

Install dependencies only from trusted sources, prefer the included lockfile or pinned versions, and treat the zero-dependency claim as inaccurate.

Low

#ASI06: Memory and Context Poisoning

What this means

Private document contents may become part of the model context, and text inside PDFs should not be treated as trusted instructions.

Why it was flagged

The documented workflow explicitly sends extracted document text toward LLM analysis. PDF text is untrusted user/document content and may contain sensitive information or prompt-like instructions.

Skill content

- Prepare content for LLM processing

Recommendation

Only process documents you are comfortable sharing with the agent, and instruct the agent to treat extracted PDF text as data rather than commands.