Junyi Doc Reader

v1.0.0

大文档归档与检索管线。将 Word/PDF/TXT/Markdown 文档转换、分块、可选 LLM 增强,输出结构化 Markdown 和索引,适合存入 Obsidian 或知识库。触发词:读大文档、归档文档、junyi-doc-reader、doc-reader、文档索引、帮我读这个PDF、把文档存到Obsid...

0· 171·0 current·0 all-time
MIT-0
Download zip
LicenseMIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description (document conversion, chunking, optional LLM enrichment, Obsidian output) align with the provided scripts (converter, chunker, enricher, assembler, pipeline). Optional system binaries (pandoc, pdftotext/poppler) are appropriate for converting .docx/.pdf. No unexpected services or credentials are required by default.
Instruction Scope
SKILL.md and pipeline instructions stay within the stated purpose: they read the supplied input file, convert it, split into chunks, optionally call an external LLM, and write outputs into the specified output_dir. One important behavior: enrichment mode will transmit document chunks to the configured API endpoint (DOC_READER_API_URL) when DOC_READER_ALLOW_EXTERNAL=true and an API key is provided — the README and code do document this, but users should note this explicit external data transmission.
Install Mechanism
No install spec; the skill is instruction+script only (no downloads or installers). The Python scripts use only the stdlib for network calls. System dependencies (pandoc, pdftotext/poppler) are optional and are standard tools for document conversion.
Credentials
No required env vars by registry metadata; enrichment requires DOC_READER_API_KEY and optional DOC_READER_API_URL/DOC_READER_MODEL and DOC_READER_ALLOW_EXTERNAL. Those env vars are proportionate to optional LLM enrichment. Minor inconsistency to be aware of: default DOC_READER_API_URL is an OpenAI-compatible endpoint while the default DOC_READER_MODEL string references a 'claude' style model — this is a configuration mismatch that requires the user to set correct API_URL/MODEL for their provider.
Persistence & Privilege
The skill does not request forced/always-enabled execution. It writes state.json, manifest.json and output files inside the user-specified output_dir for crash recovery and auditing — expected behavior for a pipeline. It does not modify system-wide agent settings or other skills.
Assessment
This skill is internally consistent with its stated purpose. Two practical things to check before using it: (1) Enrichment mode will send chunks of your document to whatever API endpoint and key you configure — by default DOC_READER_ALLOW_EXTERNAL is false, so enrichment is disabled unless you explicitly set DOC_READER_ALLOW_EXTERNAL=true and supply DOC_READER_API_KEY. Only enable that for non-sensitive documents or when you trust the target LLM provider. (2) Confirm the API endpoint and model: the default URL is an OpenAI-style endpoint but the default model name looks like a Claude model — set DOC_READER_API_URL and DOC_READER_MODEL to values that match your provider. Also: converter steps may call pandoc/pdftotext (poppler) which are optional system dependencies. The scripts write state.json, manifest.json, converted.md, chunks.jsonl, and other output files into your chosen output_dir — review those files and the target Obsidian vault path before copying. If you want to avoid any network transmission, leave DOC_READER_API_KEY unset and keep DOC_READER_ALLOW_EXTERNAL=false (the pipeline will downgrade to archive-only or archive+index modes).

Like a lobster shell, security has layers — review code before you run it.

latestvk9727de4jysgjcmf7c0ffrzswx82t2k9

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Comments