Office Document Assistant

v0.1.1

Read, extract, summarize, and compare office documents including PDF, Word, Excel, and PowerPoint. Use when a user provides .pdf/.doc/.docx/.xls/.xlsx/.ppt/....

⭐ 0· 244·0 current·0 all-time

byWinrunner_20@windrunner20

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for windrunner20/office-document-assistant.

Previewing Install & Setup.

Prompt PreviewInstall & Setup

Install the skill "Office Document Assistant" (windrunner20/office-document-assistant) from ClawHub.
Skill page: https://clawhub.ai/windrunner20/office-document-assistant
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install office-document-assistant

ClawHub CLI

Package manager switcher

npx clawhub@latest install office-document-assistant

Security Scan

VirusTotal

Benign

View report →

OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name/description match the included scripts and declared dependencies: the code extracts text from PDFs, DOCX, XLSX, PPTX and falls back to OCR or libreoffice for legacy formats. Required binaries and Python modules listed in SKILL.md are reasonable for this purpose.

✓

Instruction Scope

SKILL.md instructs the agent to run the bundled Python extractor against user-supplied files and to prefer structured extraction then summarize. The runtime instructions only reference local files, local tools (pdftoppm, tesseract, libreoffice, antiword, catdoc), and the bundled scripts — no external network endpoints or unrelated system paths are referenced.

✓

Install Mechanism

No install spec is provided (instruction-only plus bundled scripts). The scripts use standard system binaries and Python packages; nothing is downloaded from arbitrary URLs or written to unexpected locations.

✓

Credentials

No environment variables, secrets, or external credentials are required. The script only checks for/uses local binaries and Python modules needed to perform extraction.

✓

Persistence & Privilege

always is false and the skill does not attempt to modify agent configuration or persist credentials. It runs on-demand and has normal, limited local execution scope.

Assessment

This skill appears to do what it says: it runs bundled Python scripts that call local extraction libraries and optional system tools (pdftoppm, tesseract, libreoffice, antiword, catdoc). Before installing or enabling: (1) be aware it will execute local binaries named in PATH — on a compromised host an attacker could replace those binaries, so ensure the runtime environment is trusted; (2) install recommended Python packages and system tools if you need OCR/legacy-format support; (3) avoid providing password-protected/encrypted documents (the skill does not handle decryption); and (4) no credentials or network endpoints are requested by the skill itself. If you need network-based extraction or higher-fidelity layout analysis, consider a different tool.

Like a lobster shell, security has layers — review code before you run it.

latestvk971rfweb5qn5hazf2wqmdmf4983tbnq

244downloads

0stars

2versions

Updated 4w ago

v0.1.1

MIT-0

Office Document Assistant

Read, extract, summarize, and compare common office documents:

PDF
Word (.docx, .doc)
Excel (.xlsx, .xls)
PowerPoint (.pptx, .ppt)

Use this skill when the user wants the contents of a document explained, summarized, searched, or extracted into a simpler structure.

When to Use

Use this skill when the user:

uploads a .pdf / .doc / .docx / .xls / .xlsx / .ppt / .pptx
asks to summarize a document
asks to extract dates, amounts, contacts, conclusions, specifications, risks, or action items
asks for page-by-page / slide-by-slide structure
asks what a spreadsheet or slide deck is saying
asks to compare two or more documents after extracting their text

When Not to Use

Do not position this skill as a high-fidelity layout or visual analysis system.

It is not ideal for:

precise preservation of original layout, formatting, or pagination
detailed chart / diagram / image interpretation
password-protected or encrypted files
OCR-heavy image understanding beyond basic text recovery
advanced spreadsheet analytics or formula auditing
tracked-changes / redline reconstruction in Office documents

Core Workflow

Confirm the document path.
Run the bundled script:
- python3 {skill_dir}/scripts/extract_office_text.py <file> --json
Inspect the JSON fields:
- type
- extraction
- warning
- truncated
- text
Separate clearly in your response:
- directly extracted content
- your summary / inference based on that content
If extraction is empty or weak:
- for PDF, check OCR availability first
- for legacy Office formats, check conversion tools
If the user asks for a summary, default to:
- one-sentence overview
- 3–8 key points
- extra sections only when clearly present (dates, people, risks, data, conclusions, contacts)
If the user asks for extraction, prefer structured fields over long prose.

Supported Formats and Strategy

PDF

First extract embedded text with pypdf.
If extracted text is too short, fall back to OCR.
OCR prefers chi_sim+eng, then chi_sim, then eng.
OCR pipeline requires both pdftoppm and tesseract.
If an official first-class PDF tool is exposed in the environment and the task is high-value or multi-PDF, you may prefer that tool; otherwise use this skill's script.

Word

.docx: extract paragraphs and tables directly.
.doc: try antiword, then catdoc, then LibreOffice conversion to .docx.

Excel

Extract sheet names and the first rows of each sheet.
Best for quickly understanding workbook structure and core fields.
When explaining, focus on what each sheet represents, key columns, important figures, and obvious anomalies.

PowerPoint

Extract slide text from shapes.
Extract speaker notes when present.
Summaries should usually be slide-by-slide or theme-based, not a giant raw dump.

Tools and Dependencies

Document clearly what is required versus optional.

Required runtime

python3

Required Python packages

pypdf — embedded text extraction from PDFs
python-docx — .docx extraction
openpyxl — .xlsx extraction
python-pptx — .pptx extraction

Optional but strongly recommended system tools

poppler-utils — provides pdftoppm for PDF → image conversion before OCR
tesseract-ocr — OCR engine
tesseract-ocr-chi-sim — Simplified Chinese OCR language pack
libreoffice — conversion fallback for legacy .doc, .xls, .ppt
antiword — direct .doc extraction fallback
catdoc — additional .doc extraction fallback

What each tool is used for

pypdf: try text-layer extraction from PDFs first
pdftoppm: rasterize PDF pages when OCR is needed
tesseract: recover text from scanned/image PDFs
python-docx: read paragraphs and tables from .docx
openpyxl: read sheets and rows from .xlsx
python-pptx: read slide text and notes from .pptx
libreoffice: convert older Office formats into newer parseable formats
antiword / catdoc: lightweight extraction options for .doc

Minimum useful setup

If only modern documents matter, the minimum practical setup is:

python3
Python packages: pypdf, python-docx, openpyxl, python-pptx

Recommended full setup

For the most robust behavior across real-world files, install:

python3
Python packages: pypdf, python-docx, openpyxl, python-pptx
system tools: poppler-utils, tesseract-ocr, tesseract-ocr-chi-sim, libreoffice, antiword, catdoc

Dependency check

Use the bundled checker to quickly see what is missing in the current environment:

python3 {skill_dir}/scripts/check_deps.py

Common Commands

python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.docx" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.xlsx" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pptx" --json

Useful flags:

# limit PDF pages scanned/extracted
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --page-limit 10 --json

# limit rows per sheet when probing spreadsheets
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.xlsx" --row-limit 30 --json

# cap output text size
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --max-chars 30000 --json

Output Style

Default to a compact answer:

one-sentence summary
3–8 key points
then expand only if the user asks for:
- detailed summary
- page-by-page / slide-by-slide notes
- field extraction
- document comparison

Failure Handling

If PDF text is empty, suspect scanned pages or missing OCR tools.
If Chinese OCR is weak, check whether tesseract-ocr-chi-sim is installed.
If .doc / .xls / .ppt extraction fails, check libreoffice, antiword, and catdoc.
If tables look messy, explain that this is text-first extraction rather than full layout reconstruction.
If a file is encrypted or unreadable, say so plainly and stop guessing.

References

Read these only when needed:

references/capabilities.md — capability boundaries and what each format can/can't do well
references/troubleshooting.md — dependency checks and common failure modes

Comments

Loading comments...