Pdf Extractor Skill

Suspicious

Audited by ClawScan on May 10, 2026.

Overview

The skill's PDF extraction functionality is consistent with its stated purpose, but the skill embeds a default Volcengine API key and can send PDF page images to an external LLM provider.

Use the local extraction modes for sensitive PDFs. Before enabling `--ark-code-latest` or other LLM modes, confirm where document pages are sent and replace the embedded API key with your own explicitly configured credential. Install the OCR dependencies in an isolated environment and verify package/model sources.

Findings (3)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

What this means

Anyone using or redistributing the skill may unknowingly use a shared provider credential, which could expose document-processing activity to the owner of that account or enable account and billing abuse.

Why it was flagged

The script embeds a default provider API key that is used when `--ark-code-latest` is selected, despite the registry declaring no primary credential or required environment variable.

Skill content
DEFAULT_ARK_OPENAI_API_KEY = os.environ.get("VOLCENGINE_CODING_PLAN_API_KEY") or os.environ.get(
    "ARK_API_KEY"
) or "991ee1db-32ff-4884-b45a-155fa632ecbb"
Recommendation

Remove the hardcoded fallback key and require users to provide their own API key through an explicitly declared environment variable or configuration setting.
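
One way this recommendation could look in practice is sketched below. It assumes the skill keeps the two environment variable names shown in the flagged snippet; the helper name `resolve_ark_api_key` and its error message are hypothetical and are not taken from the skill.

```python
import os

# Environment variable names taken from the flagged snippet.
ARK_KEY_ENV_VARS = ("VOLCENGINE_CODING_PLAN_API_KEY", "ARK_API_KEY")


def resolve_ark_api_key(environ=None):
    """Return the first non-empty key found in the environment.

    Hypothetical helper: it raises instead of falling back to an
    embedded default, so a missing credential is an explicit error.
    """
    env = os.environ if environ is None else environ
    for name in ARK_KEY_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No API key configured; set one of: " + ", ".join(ARK_KEY_ENV_VARS)
    )
```

Failing loudly when no key is set means the user always knows which account a request is billed to, and the registry metadata can then declare these variables as required.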

What this means

If LLM enhancement is used, PDF page images or extracted content may leave the local machine and be processed by the configured external provider.

Why it was flagged

In LLM mode, page images are encoded and sent through an OpenAI-compatible client to the configured provider endpoint.

Skill content
b64 = base64.b64encode(image_bytes.getvalue()).decode("utf-8") ... "image_url": {"url": f"data:image/{img_fmt.lower()};base64,{b64}"} ... return openai.OpenAI(api_key=self.openai_api_key, base_url=self.openai_base_url)
Recommendation

Use local-only Marker/Nougat mode for sensitive PDFs, and require explicit user confirmation before enabling LLM enhancement or sending document pages to a provider.
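
The confirmation step could be as small as the sketch below. The function `confirm_llm_upload`, its parameters, and the prompt wording are illustrative assumptions; the skill's real control flow is not shown in the reviewed excerpt.

```python
def confirm_llm_upload(provider_url, page_count, ask=input):
    """Return True only if the user explicitly types 'yes'.

    Hypothetical gate: `ask` defaults to the built-in input() and can be
    replaced with any callable that takes a prompt and returns a string.
    """
    prompt = (
        f"This will send {page_count} page image(s) to {provider_url}. "
        "Type 'yes' to continue: "
    )
    return ask(prompt).strip().lower() == "yes"
```

A caller would run this check before any page image is encoded, and stay in local-only mode whenever it returns False. Anything other than a literal "yes" is treated as a refusal, so an empty response or a stray keystroke does not trigger an upload.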

What this means

The installed packages and downloaded models may vary over time, affecting reproducibility and supply-chain reviewability.

Why it was flagged

The setup instructions use unpinned package installs and rely on automatic model downloads, which is expected for this OCR workflow but leaves dependency provenance and versions to the user environment.

Skill content
pip install marker-pdf

# Nougat (optional, for English-only papers)
pip install nougat-ocr

# Transformers
pip install transformers
...
Models are downloaded automatically on first use
Recommendation

Install in a dedicated environment, pin dependency versions where possible, and verify package/model sources before processing sensitive documents.
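
As a rough illustration of the pinning advice, the sketch below flags requirement lines that lack an exact `==` version. The helper name `unpinned_requirements` is hypothetical, and the check is deliberately simple: it does not understand hashes, URLs, or environment markers, so such lines would also be flagged.

```python
import re

# Matches "name==version", optionally with extras such as "name[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^\s=]+$")


def unpinned_requirements(lines):
    """Return requirement lines that lack an exact '==' version pin."""
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            flagged.append(line)
    return flagged
```

Run against the install commands quoted above, with each package written as a bare requirement line, this would flag all three packages, because none of them specifies a version.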