Install
openclaw skills install office-document-assistant
Read, extract, summarize, and compare office documents including PDF, Word, Excel, and PowerPoint. Use when a user provides .pdf/.doc/.docx/.xls/.xlsx/.ppt/.pptx files and asks for summaries, key point extraction, page-by-page outlines, field extraction, table explanation, or multi-document comparison. Prefer the bundled extraction script for deterministic text extraction; for PDFs, fall back to OCR when embedded text is missing.
Read, extract, summarize, and compare common office documents:
PDF (.pdf)
Word (.docx, .doc)
Excel (.xlsx, .xls)
PowerPoint (.pptx, .ppt)
Use this skill when the user wants the contents of a document explained, summarized, searched, or extracted into a simpler structure.
Use this skill when the user:
provides a .pdf / .doc / .docx / .xls / .xlsx / .ppt / .pptx file
Do not position this skill as a high-fidelity layout or visual analysis system.
It is not ideal for:
python3 {skill_dir}/scripts/extract_office_text.py <file> --json
Fields in the JSON output:
type
extraction
warning
truncated
text
PDF: try embedded text extraction with pypdf first.
When OCR is needed, try languages chi_sim+eng, then chi_sim, then eng.
OCR requires pdftoppm and tesseract.
.docx: extract paragraphs and tables directly.
.doc: try antiword, then catdoc, then LibreOffice conversion to .docx.
Document clearly what is required versus optional.
python3
pypdf: embedded text extraction from PDFs
python-docx: .docx extraction
openpyxl: .xlsx extraction
python-pptx: .pptx extraction

poppler-utils: provides pdftoppm for PDF → image conversion before OCR
tesseract-ocr: OCR engine
tesseract-ocr-chi-sim: Simplified Chinese OCR language pack
libreoffice: conversion fallback for legacy .doc, .xls, .ppt
antiword: direct .doc extraction fallback
catdoc: additional .doc extraction fallback

pypdf: try text-layer extraction from PDFs first
pdftoppm: rasterize PDF pages when OCR is needed
tesseract: recover text from scanned/image PDFs
python-docx: read paragraphs and tables from .docx
openpyxl: read sheets and rows from .xlsx
python-pptx: read slide text and notes from .pptx
libreoffice: convert older Office formats into newer parseable formats
antiword / catdoc: lightweight extraction options for .doc

If only modern documents matter, the minimum practical setup is:
python3
pypdf, python-docx, openpyxl, python-pptx
For the most robust behavior across real-world files, install:
python3
pypdf, python-docx, openpyxl, python-pptx
poppler-utils, tesseract-ocr, tesseract-ocr-chi-sim, libreoffice, antiword, catdoc
Use the bundled checker to quickly see what is missing in the current environment:
python3 {skill_dir}/scripts/check_deps.py
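The internals of the bundled checker are not shown here. As a rough sketch of the kind of check it might perform, the snippet below looks for the Python modules and command-line tools listed above. The function name `find_missing` is hypothetical. The import names (`docx` for python-docx, `pptx` for python-pptx) are the packages' real import names, but the assumption that each system tool is on PATH under the binary name shown is mine and may not hold everywhere.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above
# (python-docx imports as "docx", python-pptx as "pptx").
PYTHON_MODULES = ["pypdf", "docx", "openpyxl", "pptx"]

# Binary names are an assumption: some systems expose LibreOffice as "soffice".
SYSTEM_BINARIES = ["pdftoppm", "tesseract", "libreoffice", "antiword", "catdoc"]


def find_missing(modules=PYTHON_MODULES, binaries=SYSTEM_BINARIES):
    """Return the modules that cannot be found and the binaries not on PATH."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return {"modules": missing_modules, "binaries": missing_binaries}


if __name__ == "__main__":
    report = find_missing()
    for kind, names in report.items():
        print(f"missing {kind}: {', '.join(names) or 'none'}")
```

An empty list under both keys would suggest the full setup is present; a missing `tesseract` or `pdftoppm` only matters when scanned PDFs need OCR.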
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.docx" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.xlsx" --json
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pptx" --json
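Downstream code usually wants the script's JSON as a Python object. The sketch below parses it and reads the five fields named in this document (`type`, `extraction`, `warning`, `truncated`, `text`). Only the field names come from the skill; their types and meanings (for example, `truncated` being a boolean) are assumptions, the helper name `summarize_result` is hypothetical, and the sample payload is made up for illustration rather than captured from the script.

```python
import json


def summarize_result(raw_json):
    """Parse the extractor's JSON output; return a one-line status and the text."""
    data = json.loads(raw_json)
    # .get() is used because not every field is guaranteed to be present.
    status = f"{data.get('type', 'unknown')} via {data.get('extraction', 'unknown')}"
    if data.get("truncated"):
        status += " (truncated)"
    if data.get("warning"):
        status += f"; warning: {data['warning']}"
    return status, data.get("text", "")


# Hypothetical payload, not real script output.
sample = '{"type": "pdf", "extraction": "ocr", "warning": null, "truncated": true, "text": "Hello"}'
status, text = summarize_result(sample)
print(status)
```

In practice `raw_json` would be the standard output of one of the commands above, for instance captured with `subprocess.run(..., capture_output=True, text=True)`.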
Useful flags:
# limit PDF pages scanned/extracted
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --page-limit 10 --json
# limit rows per sheet when probing spreadsheets
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.xlsx" --row-limit 30 --json
# cap output text size
python3 {skill_dir}/scripts/extract_office_text.py "/path/to/file.pdf" --max-chars 30000 --json
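When the script is called from other Python code, the flags above can be assembled programmatically. This is a minimal sketch: the helper names `build_extract_command` and `run_extract` are hypothetical, and it assumes the three limit flags can be combined in one call, which the examples above do not confirm.

```python
import subprocess


def build_extract_command(script_path, file_path, page_limit=None,
                          row_limit=None, max_chars=None):
    """Assemble the argument list for extract_office_text.py with optional limits."""
    cmd = ["python3", script_path, file_path]
    if page_limit is not None:
        cmd += ["--page-limit", str(page_limit)]
    if row_limit is not None:
        cmd += ["--row-limit", str(row_limit)]
    if max_chars is not None:
        cmd += ["--max-chars", str(max_chars)]
    cmd.append("--json")
    return cmd


def run_extract(cmd):
    """Run the command and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.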
Default to a compact answer:
For Simplified Chinese OCR, check that tesseract-ocr-chi-sim is installed.
If .doc / .xls / .ppt extraction fails, check libreoffice, antiword, and catdoc.
Read these only when needed:
references/capabilities.md: capability boundaries and what each format can and can't do well
references/troubleshooting.md: dependency checks and common failure modes