docx-pdf-knowledge-parser

v1.0.1

Parse local `.docx` and `.pdf` files into structured knowledge artifacts with detailed reports, tracking successes, failures, and summaries without auto-writ...

by @kaiasdobi

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for kaiasdobi/docx-pdf-knowledge-parser.

Install the skill "docx-pdf-knowledge-parser" (kaiasdobi/docx-pdf-knowledge-parser) from ClawHub.
Skill page: https://clawhub.ai/kaiasdobi/docx-pdf-knowledge-parser
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install docx-pdf-knowledge-parser

ClawHub CLI

npx clawhub@latest install docx-pdf-knowledge-parser

Security Scan

VirusTotal: Benign

OpenClaw: Benign (high confidence)

✓

Purpose & Capability

The name/description (parse local .docx/.pdf into report-first outputs) matches the code and SKILL.md. The included parsers and run.py implement the declared behavior. Mentions of Feishu in README/agent metadata are informational for future connectors but do not imply hidden Feishu integration.

ℹ

Instruction Scope

SKILL.md explicitly limits processing to local/already-available files and the code follows that. Be aware the tool will iterate all files in the provided input directory and will attempt to parse any .docx/.pdf it finds — so the operator must ensure the input directory contains only files intended for ingestion to avoid accidental parsing of sensitive documents.
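
One way to act on this warning is to list what would be picked up before running the parser. The sketch below is not part of the skill; `preview_inputs` is a hypothetical helper, and it assumes a non-recursive scan of the directory, which may differ from how `run.py` enumerates files.

```python
from pathlib import Path

# Extensions the skill treats as parseable in v1.0.
PARSEABLE_SUFFIXES = {".docx", ".pdf"}


def preview_inputs(input_dir: str) -> list[Path]:
    """Return the .docx/.pdf files directly inside input_dir, sorted by path.

    Assumption: only the top level is scanned; the real runner may differ.
    """
    root = Path(input_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in PARSEABLE_SUFFIXES
    )
```

Printing the returned paths and checking them by eye before a real run is usually enough to catch a sensitive document that ended up in the wrong folder.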

✓

Install Mechanism

There is no install spec; requirements.txt lists python-docx and pypdf which are appropriate for the task. No downloads from arbitrary URLs or extract operations are present.

✓

Credentials

The skill requests no environment variables, no credentials, and no config paths. The code does not reference any secrets or external services; the lack of credentials is consistent with an offline/local parsing utility.

✓

Persistence & Privilege

`always` is false, and the skill does not attempt to modify other skills or global agent settings. It writes output files only to the user-specified output directory (kb-items.jsonl, failed-items.jsonl, ingest-report.md, MEMORY.candidate.md).

Scan Findings in Context

[no_issues_found] expected: Static scan did not flag suspicious patterns. File I/O and use of python-docx/pypdf are expected for this purpose.

Assessment

This skill appears to be what it says: a local batch parser for .docx and .pdf files. Before running it, (1) ensure the --input-dir contains only files you want parsed (it will read and extract text from each .docx/.pdf it finds); (2) be aware extracted text and summary files (including MEMORY.candidate.md) will be written in plaintext to --output-dir — avoid writing to a shared or sensitive location; (3) install the two Python dependencies (python-docx, pypdf) in a controlled environment; (4) the README/metadata mentions Feishu but no network connector or credentials are included — adding Feishu integration would require extra code/credentials; and (5) if you need OCR for image-based PDFs, this version will mark them as failed and recommend manual/OCR workflows. No network exfiltration or credential use was found.

latestvk97ette4y348279f5b60sh4m0x83evyh

147 downloads

0 stars

2 versions

Updated 1mo ago

MIT-0

name: docx-pdf-knowledge-parser
description: parse local docx and pdf files into report-first knowledge artifacts. use when chatgpt needs to extract text from uploaded or locally available attachments, generate ingest-report.md, kb-items.jsonl, failed-items.jsonl, and memory.candidate.md without directly writing memory.md.

Docx PDF Knowledge Parser

Use this skill to turn local or uploaded .docx and .pdf files into structured, reviewable knowledge outputs.

What this skill does

Accept local or already-available .docx and .pdf files.
Classify files into parseable, manual-review, or failed.
Parse .docx and .pdf in v1.0.
Produce report-first outputs instead of writing MEMORY.md directly.
Preserve failures and uncertainty instead of guessing content.

Supported v1.0 scope

Inputs

Local .docx file path
Local .pdf file path
A batch of local .docx and .pdf files in one directory

Parsing

.docx
.pdf

Outputs

ingest-report.md
kb-items.jsonl
failed-items.jsonl
MEMORY.candidate.md

Required behavior

Only process files that are already available locally or have already been provided to the runtime.
Do not claim file content was learned unless text was actually extracted.
Default to report-first. Do not write MEMORY.md in v1.0.
Record every failed file with a concrete reason.
Prefer plain-text summaries over complex cards when reporting progress.

File routing rules

Parseable

Treat these as parseable in v1.0:

.docx
.pdf

Manual-review

Route here when the file is out of scope or low-confidence in v1.0:

.pptx
images
scans with no extractable text
archives
unusual file types

Failed

Route here when the file cannot be opened, parsed, or extracted successfully.
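
The three routes above can be sketched as a two-stage decision: a first pass on the file extension, then a refinement once extraction has been attempted. Both function names are hypothetical and are not taken from the skill's code. The sketch follows the rules as written here, so a file that opens but yields no text goes to manual-review; the security assessment on this page reports that the shipped version marks image-based PDFs as failed, so the actual behavior may differ.

```python
from pathlib import Path

PARSEABLE = {".docx", ".pdf"}


def route_by_extension(path: str) -> str:
    """Initial route from the suffix only: 'parseable' or 'manual-review'."""
    suffix = Path(path).suffix.lower()
    return "parseable" if suffix in PARSEABLE else "manual-review"


def final_route(path: str, opened_ok: bool, extracted_text: str) -> str:
    """Refine the route after an extraction attempt (hypothetical helper)."""
    if route_by_extension(path) != "parseable":
        return "manual-review"
    if not opened_ok:
        # The file could not be opened or parsed at all.
        return "failed"
    if not extracted_text.strip():
        # For example, a scanned PDF with no text layer.
        return "manual-review"
    return "parseable"
```

The filename only decides the first pass; the final route depends on what was actually extracted, in line with the rule that filenames are hints and never proof of content.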

Standard workflow

1. Resolve input type.
   - Single file path -> process one file
   - Directory path -> enumerate supported files
2. Create a batch record.
   - Generate batch_id
   - Record started_at
3. Build a manifest.
   - File name
   - File path
   - File type
   - Route decision
4. Attempt extraction.
   - .docx -> use parsers/parse_docx.py
   - .pdf -> use parsers/parse_pdf.py
5. Produce structured outputs.
   - success -> append to kb-items.jsonl
   - failure -> append to failed-items.jsonl
6. Summarize the batch.
   - Write ingest-report.md
   - Write MEMORY.candidate.md
7. Finish the batch.
   - Record finished_at
   - Never auto-write MEMORY.md
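
The batch record and manifest steps of this workflow could look roughly like the sketch below. It is an illustration only: the function names, the use of a UUID for batch_id, UTC ISO 8601 timestamps, and the manifest key names are all assumptions, since the source specifies the fields but not their format.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string (format is an assumption)."""
    return datetime.now(timezone.utc).isoformat()


def start_batch() -> dict:
    """Create a batch record with a fresh batch_id and started_at."""
    return {"batch_id": uuid.uuid4().hex, "started_at": now_iso(), "finished_at": None}


def build_manifest(paths: list[str]) -> list[dict]:
    """One manifest entry per file: name, path, type, and route decision."""
    manifest = []
    for raw in paths:
        p = Path(raw)
        suffix = p.suffix.lower()
        manifest.append({
            "file_name": p.name,
            "file_path": str(p),
            "file_type": suffix.lstrip("."),
            "route": "parseable" if suffix in {".docx", ".pdf"} else "manual-review",
        })
    return manifest


def finish_batch(batch: dict) -> dict:
    """Return a copy of the batch record with finished_at filled in."""
    return {**batch, "finished_at": now_iso()}
```

Keeping the batch record separate from the manifest means the same batch_id can be stamped on every line written to the JSONL outputs.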

Output contracts

kb-items.jsonl

Write one JSON object per successfully extracted knowledge item with at least:

batch_id
source_file
source_path
file_type
topic
content_type
summary
extracted_at
confidence
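
A minimal sketch of building and serializing one such record follows. The field names come from the list above; everything else is assumed for illustration, including the helper names, the ISO 8601 timestamp, and the idea of rejecting records that miss a required field.

```python
import json
from datetime import datetime, timezone

# Required fields, as listed in the kb-items.jsonl contract.
KB_REQUIRED = (
    "batch_id", "source_file", "source_path", "file_type", "topic",
    "content_type", "summary", "extracted_at", "confidence",
)


def make_kb_item(**fields) -> dict:
    """Build a kb item, defaulting extracted_at to now (hypothetical helper)."""
    item = {"extracted_at": datetime.now(timezone.utc).isoformat(), **fields}
    missing = [key for key in KB_REQUIRED if key not in item]
    if missing:
        raise ValueError(f"kb item missing fields: {missing}")
    return item


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line."""
    return json.dumps(record, ensure_ascii=False) + "\n"
```

Appending the returned line to kb-items.jsonl with the file opened in `"a"` mode keeps one JSON object per line, so a partially completed batch still leaves a readable file.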

failed-items.jsonl

Write one JSON object per failed file with at least:

batch_id
source_file
source_path
file_type
failure_reason
error_detail
suggested_action
failed_at
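
As an illustration of this contract, a failure record could be derived from the exception raised during extraction. The helper below is hypothetical; using the exception class name as failure_reason and the default suggested_action text are assumptions, not the skill's documented behavior.

```python
from datetime import datetime, timezone
from pathlib import Path


def make_failed_item(batch_id: str, path: str, exc: Exception,
                     suggested_action: str = "review manually") -> dict:
    """Turn an extraction error into a failed-items.jsonl record."""
    p = Path(path)
    return {
        "batch_id": batch_id,
        "source_file": p.name,
        "source_path": str(p),
        "file_type": p.suffix.lower().lstrip("."),
        "failure_reason": type(exc).__name__,  # assumption: class name as reason
        "error_detail": str(exc),
        "suggested_action": suggested_action,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Recording both a short reason and the raw error detail keeps the report readable while preserving enough information to reproduce the failure.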

MEMORY.candidate.md

Include:

batch header (batch_id, started_at, finished_at, source_directory or source_file)
grouped knowledge summaries
source references
confidence notes
items needing review

ingest-report.md

Include:

Batch summary
Input scope
File counts and routing counts
Successful extraction summary
Failures and risks
Recommended next actions
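
A rough sketch of assembling these sections from in-memory batch data is shown below. The section titles are the ones listed above; the function name, the plain-text layout, and the shape of the input dictionaries are assumptions made for the example.

```python
from collections import Counter


def render_ingest_report(batch_id: str, input_scope: str, manifest: list[dict],
                         kb_items: list[dict], failed_items: list[dict]) -> str:
    """Render a plain-text ingest report (layout is an assumption)."""
    routes = Counter(entry["route"] for entry in manifest)
    lines = [
        "Batch summary",
        f"batch_id: {batch_id}",
        "",
        "Input scope",
        input_scope,
        "",
        "File counts and routing counts",
        f"total files: {len(manifest)}",
    ]
    lines += [f"{route}: {count}" for route, count in sorted(routes.items())]
    lines += ["", "Successful extraction summary"]
    lines += [f"{i['source_file']}: {i['summary']}" for i in kb_items] or ["none"]
    lines += ["", "Failures and risks"]
    lines += [f"{f['source_file']}: {f['failure_reason']}" for f in failed_items] or ["none"]
    lines += ["", "Recommended next actions"]
    # De-duplicate the suggested actions carried by the failure records.
    lines += sorted({f["suggested_action"] for f in failed_items}) or ["none"]
    return "\n".join(lines) + "\n"
```

Empty sections are written as "none" rather than omitted, so a reader can tell that a section was checked and found empty.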

Safety rules

Never invent text that was not extracted.
If parsing fails, say so plainly and log it.
Treat filenames as hints only, never as proof of document contents.
Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.

Included files

run.py: minimal batch runner for local testing
parsers/parse_docx.py: docx text extraction helper
parsers/parse_pdf.py: pdf text extraction helper
references/output_examples.md: sample output shapes and field guidance
README.md: setup and usage notes
