DOCX Toolkit

v1.0.0

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an...


by Shihao Jiang (Zac) @zacjiang

Security Scan

VirusTotal

Benign


OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name/description match the included scripts: extract_text.py, extract_doc_text.py, extract_images.py, and resize_images.py. Declared Python libraries (python-docx, olefile, Pillow) are appropriate for the stated functionality. No unrelated binaries, env vars, or external services are requested.

ℹ

Instruction Scope

SKILL.md only instructs running the included scripts on local files and directories. The scripts read input document files, write extracted text/images to an output directory, and optionally write a JSON manifest. This is within scope. Notes: extract_doc_text reads raw OLE streams and may use significant RAM for very large .doc files; resize_images will overwrite files if output_dir is omitted; classify_by_context uses heuristic keyword matching (mostly Chinese keywords) and can misclassify. The scripts do not contact external endpoints or read environment variables.

✓

Install Mechanism

No install spec is provided (instruction-only), and the code is bundled with the skill. Dependencies are normal Python packages installable via pip. No downloads from arbitrary URLs or archive extraction are present.

✓

Credentials

The skill requests no environment variables, credentials, or special config paths. All required resources are local files and standard Python packages, which is proportionate to the functionality.

✓

Persistence & Privilege

The skill is not always-enabled and does not request persistent or elevated platform privileges. It does not alter other skills' configuration or require platform-wide settings.

Assessment

This skill appears to do what it claims: local extraction of text, tables, and images from Word files. Before using it on sensitive content, consider the following: run it in a sandbox or isolated environment for untrusted documents; expect the scripts to write files to the specified output_dir, and note that resize_images overwrites files in place when output_dir is omitted; very large legacy .doc files may use a lot of RAM; image extraction can pull out sensitive items (IDs, certificates), so review outputs before uploading them anywhere; classification is heuristic and language-specific (may mislabel). No network exfiltration or secret usage was observed in the code. If you require stronger assurance, inspect the bundled scripts locally or run them in a container.


MIT-0

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
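The pipe-delimited output described above can be sketched roughly as follows. This is an illustrative sketch, not the bundled extract_text.py: the names format_row and iter_blocks are hypothetical, and walking the document body to keep paragraphs and tables in order is an assumption about how "structure preserved" might be achieved with python-docx.

```python
def format_row(cells):
    """Join cell texts into one pipe-delimited line, collapsing inner whitespace."""
    return " | ".join(" ".join(text.split()) for text in cells)


def iter_blocks(path):
    """Yield paragraph text and table rows in document order (hypothetical helper)."""
    # Imported lazily so format_row stays usable without python-docx installed.
    from docx import Document
    from docx.oxml.ns import qn
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    doc = Document(path)
    for child in doc.element.body.iterchildren():
        if child.tag == qn("w:p"):
            yield Paragraph(child, doc).text
        elif child.tag == qn("w:tbl"):
            for row in Table(child, doc).rows:
                yield format_row(cell.text for cell in row.cells)


if __name__ == "__main__":
    import sys

    # Usage: python3 sketch.py input.docx output.txt
    with open(sys.argv[2], "w", encoding="utf-8") as out:
        for line in iter_blocks(sys.argv[1]):
            out.write(line + "\n")
```

Note that python-docx repeats a merged cell once per grid position it spans, so merged cells may appear more than once in a row.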

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
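A minimal sketch of this approach, assuming the text is recovered by scanning the WordDocument stream for runs of UTF-16LE characters. The helper names (utf16_text_runs, read_doc_text) and the character ranges are hypothetical, and the bundled script may work differently. A scan like this is a heuristic: it ignores the .doc piece table and will miss text stored in 8-bit form.

```python
import re

# Character ranges treated as text: printable ASCII, CJK punctuation and kana,
# CJK unified ideographs, Hangul syllables. This choice is an assumption.
PRINTABLE = "\u0020-\u007e\u3000-\u30ff\u4e00-\u9fff\uac00-\ud7af"


def utf16_text_runs(data: bytes, min_len: int = 4):
    """Return runs of at least min_len text characters found in UTF-16LE data."""
    text = data.decode("utf-16-le", errors="ignore")
    pattern = "[" + PRINTABLE + "]{" + str(min_len) + ",}"
    return re.findall(pattern, text)


def read_doc_text(path):
    """Read the WordDocument stream of a legacy .doc and return its text runs."""
    import olefile  # third-party; imported lazily

    with olefile.OleFileIO(path) as ole:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream found")
        # The whole stream is read into memory, which is why very large
        # files can need a lot of RAM.
        data = ole.openstream("WordDocument").read()
    return "\n".join(utf16_text_runs(data))
```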

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

Automatic deduplication (MD5 hash comparison)
Size filtering (skips tiny icons <5KB by default)
Sequential renaming (img_001.png, img_002.jpg, etc.)
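The three behaviours listed above can be illustrated with the standard library alone, since a .docx file is a ZIP archive that stores embedded images under word/media/. The sketch below is an assumption about the general technique, not the bundled extract_images.py; the function name and the min_bytes parameter are hypothetical.

```python
import hashlib
import zipfile
from pathlib import Path


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique embedded images out of a .docx and return the new file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()   # MD5 digests of images already written
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier image
            seen.add(digest)
            # Sequential renaming that keeps the original extension.
            suffix = Path(name).suffix.lower()
            target = out / f"img_{len(written) + 1:03d}{suffix}"
            target.write_bytes(data)
            written.append(target.name)
    return written
```

Because the comparison is a hash of the raw bytes, two images that differ by even one byte (for example, the same picture re-encoded) count as distinct.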

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Batch resize/compress images for API processing (can save roughly 50-70% on vision API costs, depending on the source images).
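As a rough sketch of the resizing step, assuming images wider than --max-width are scaled down proportionally and narrower ones keep their size: the names scaled_size and resize_dir are hypothetical, and the bundled resize_images.py may choose formats and quality settings differently.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def scaled_size(width, height, max_width=1024):
    """Return (width, height) scaled so width <= max_width, keeping aspect ratio."""
    if width <= max_width:
        return width, height
    ratio = max_width / width
    return max_width, max(1, round(height * ratio))


def resize_dir(input_dir, output_dir, max_width=1024):
    """Write resized copies of every image in input_dir into output_dir."""
    from PIL import Image  # Pillow; imported lazily

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(input_dir).iterdir()):
        if src.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        with Image.open(src) as img:
            resized = img.resize(scaled_size(img.width, img.height, max_width))
            # Saving under a separate directory leaves the originals untouched.
            resized.save(out / src.name, optimize=True)
```

Always passing a separate output directory, as this sketch does, avoids overwriting the source images.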

Dependencies

Python 3.6+
python-docx — for .docx processing
olefile — for legacy .doc processing
Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

Document analysis: Extract text for AI review/summarization
Migration: Pull content from Word docs into other formats
Image audit: Extract and review all embedded images
Cost optimization: Compress images before sending to vision APIs
Batch processing: Process multiple documents in a pipeline
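For the batch-processing case, one possible driver builds a command per document from the invocations shown under Capabilities. This is a sketch under assumptions: build_commands and run_all are hypothetical names, scripts_dir stands in for {baseDir}/scripts, and the output naming scheme is illustrative.

```python
from pathlib import Path

# Which bundled script handles which file extension.
SCRIPT_FOR_SUFFIX = {
    ".docx": "extract_text.py",
    ".doc": "extract_doc_text.py",
}


def build_commands(input_dir, output_dir, scripts_dir):
    """Return one argv list per Word file found directly inside input_dir."""
    commands = []
    for src in sorted(Path(input_dir).iterdir()):
        script = SCRIPT_FOR_SUFFIX.get(src.suffix.lower())
        if script is None:
            continue  # not a Word document
        # Keep the original extension in the output name so report.doc
        # and report.docx do not overwrite each other.
        dst = Path(output_dir) / (src.name + ".txt")
        commands.append(
            ["python3", str(Path(scripts_dir) / script), str(src), str(dst)]
        )
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    import subprocess

    for cmd in commands:
        subprocess.run(cmd, check=True)
```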

Notes

Large .doc files (>200MB) may require significant RAM for olefile processing
Image extraction preserves original format (png/jpg/gif/etc.)
Deduplication catches exact duplicates; near-duplicates still pass through
CJK (Chinese/Japanese/Korean) text is fully supported in both extractors
