Pdf Extractor Skill
Suspicious
Audited by ClawScan on May 10, 2026.
Overview
The PDF extraction function is coherent, but the skill embeds a default Volcengine API key and can send PDF page images to an external LLM provider.
Use the local extraction modes for sensitive PDFs. Before enabling `--ark-code-latest` or other LLM modes, confirm where document pages are sent and replace the embedded API key with your own explicitly configured credential. Install the OCR dependencies in an isolated environment and verify package/model sources.
Findings (3)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
Anyone using or redistributing the skill may unknowingly use a shared provider credential, potentially exposing document processing activity to that account or causing account/billing abuse.
The script embeds a default provider API key that is used when `--ark-code-latest` is selected, despite the registry declaring no primary credential or required environment variable.
DEFAULT_ARK_OPENAI_API_KEY = os.environ.get("VOLCENGINE_CODING_PLAN_API_KEY") or os.environ.get(
    "ARK_API_KEY"
) or "991ee1db-32ff-4884-b45a-155fa632ecbb"
Remove the hardcoded fallback key and require users to provide their own API key through an explicitly declared environment variable or configuration setting.
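The recommendation above could be sketched as follows. This is a minimal illustration, not the skill's actual code: the environment variable names come from the quoted snippet, while the function name `resolve_ark_api_key` and the error message are assumptions made for the example.

```python
import os

# Variable names taken from the quoted snippet, checked in the same order.
API_KEY_ENV_VARS = ("VOLCENGINE_CODING_PLAN_API_KEY", "ARK_API_KEY")


def resolve_ark_api_key(env=None):
    """Return the first configured key; raise instead of using a built-in default.

    `env` is a hypothetical parameter that lets callers pass a mapping
    in place of os.environ.
    """
    env = os.environ if env is None else env
    for name in API_KEY_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No API key configured; set one of: " + ", ".join(API_KEY_ENV_VARS)
    )
```

Failing loudly when no key is set means a shared credential can never be used by accident, and the required variables can then be declared in the registry metadata.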
If LLM enhancement is used, PDF page images or extracted content may leave the local machine and be processed by the configured external provider.
In LLM mode, page images are encoded and sent through an OpenAI-compatible client to the configured provider endpoint.
b64 = base64.b64encode(image_bytes.getvalue()).decode("utf-8")
...
"image_url": {"url": f"data:image/{img_fmt.lower()};base64,{b64}"}
...
return openai.OpenAI(api_key=self.openai_api_key, base_url=self.openai_base_url)
Use local-only Marker/Nougat mode for sensitive PDFs, and require explicit user confirmation before enabling LLM enhancement or sending document pages to a provider.
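One way to implement the explicit confirmation step is a small gate that runs before any page image is encoded or uploaded. The sketch below is hypothetical: `confirm_llm_upload`, its parameters, and the prompt wording are assumptions for illustration and do not appear in the audited script.

```python
def confirm_llm_upload(provider_url, page_count, ask=input):
    """Return True only if the user explicitly types 'yes'.

    `ask` defaults to the built-in input(); it is injectable so the
    gate can be driven non-interactively (for example, in tests).
    """
    prompt = (
        f"LLM mode will send {page_count} page image(s) to {provider_url}. "
        "Type 'yes' to continue: "
    )
    return ask(prompt).strip().lower() == "yes"
```

A caller would fall back to local-only extraction whenever this returns False, so that anything other than an explicit "yes" keeps the document on the local machine.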
The installed packages and downloaded models may vary over time, affecting reproducibility and supply-chain reviewability.
The setup instructions use unpinned package installs and rely on automatic model downloads, which is expected for this OCR workflow but leaves dependency provenance and versions to the user environment.
pip install marker-pdf
# Nougat (optional, for English-only papers)
pip install nougat-ocr
# Transformers
pip install transformers
...
Models are downloaded automatically on first use
Install in a dedicated environment, pin dependency versions where possible, and verify package/model sources before processing sensitive documents.
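As a rough aid for the pinning recommendation, a short check can flag requirement lines that lack an exact `==` version. This is a simplified sketch: the helper name `unpinned_requirements` is hypothetical, and the pattern handles only plain `name==version` entries, not URLs, editable installs, or hash options.

```python
import re

# Matches "name==version" (optionally with extras such as name[extra]).
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;#]+")


def unpinned_requirements(lines):
    """Return the requirement lines that are not pinned to an exact version."""
    result = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not _PINNED.match(line):
            result.append(line)
    return result
```

Running this over a requirements file that lists the packages from the setup instructions without versions would report all of them, which is a signal to record the exact versions that were reviewed.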
