Pdf Invoice Parser

v1.0.0

Extract structured data from PDF invoices and documents. Handles scanned PDFs (OCR) and digital PDFs. Outputs clean CSV/Excel with vendor, invoice number, da...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for tktk-ai/pdf-invoice-parser.

Install the skill "Pdf Invoice Parser" (tktk-ai/pdf-invoice-parser) from ClawHub.
Skill page: https://clawhub.ai/tktk-ai/pdf-invoice-parser
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

openclaw skills install pdf-invoice-parser

ClawHub CLI

npx clawhub@latest install pdf-invoice-parser
Security Scan
Capability signals
Crypto, Can make purchases
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the included scripts and declared functionality: parsing searchable PDFs, optional OCR via pytesseract, and writing CSV/JSON/Excel-ready output. Required libraries (PyMuPDF, PyPDF2, Pillow, pytesseract, openpyxl) are appropriate for the stated purpose.
Instruction Scope
SKILL.md instructs the agent/user to install dependencies and run the provided scripts on local PDF files or directories. The runtime instructions and the scripts operate only on user-provided PDFs and produce local output files; they do not attempt to read unrelated system files, environment variables, or contact external endpoints.
Install Mechanism
This is an instruction-only skill (no automated install spec). SKILL.md asks the user to pip install third-party packages and to install the tesseract system package via apt/brew. Installing packages via pip can execute arbitrary code during installation (normal for Python packages) — recommend using a virtualenv/container and verifying package sources. The pip flag --break-system-packages appears in the example; it's not harmful in itself but is uncommon and may be unnecessary for many users.
Credentials
The skill requests no environment variables, no credentials, and no config paths. All data access is limited to PDF files supplied by the user. There are no hidden credential usages in the code.
Persistence & Privilege
always is false and the skill does not modify other skills or system-wide agent settings. It does not persist credentials or enable itself automatically.
Assessment
The skill appears coherent and limited to local PDF parsing, but follow best practices before running it on sensitive data: (1) Run in a virtualenv or container to isolate pip-installed packages; (2) review and pin dependency versions if you will install them on production systems; (3) install tesseract from your OS package manager as instructed (verify the source); (4) test on non-sensitive sample invoices to confirm parsing quality; and (5) if you need network or cloud integration later, prefer adding explicit, minimal credentials and review any new code for unexpected network activity. Overall this skill is fit-for-purpose but exercise standard supply-chain caution when installing third-party Python packages.

latest vk97448mvzmzpjkjqmc4zmqgwj184gy8p
136 downloads
0 stars
1 version
Updated 2w ago
v1.0.0
MIT-0

PDF Invoice Parser

Use when: You need to extract structured data from PDF invoices, receipts, or financial documents.

Capabilities

  • Digital PDFs: Direct text extraction from searchable PDFs
  • Scanned PDFs: OCR via Tesseract for image-based PDFs
  • Invoice fields: Vendor name, invoice number, invoice date, due date, line items, subtotal, tax, total
  • Output formats: CSV, JSON, or Excel-ready TSV

Quick Start

# Install dependencies
pip install --break-system-packages PyPDF2 pymupdf pillow pytesseract openpyxl

# Parse a single invoice
python3 scripts/parse-invoice.py invoice.pdf --output invoice_data.csv

# Parse multiple invoices
python3 scripts/parse-invoices.py ./invoices/ --output consolidated.csv

Usage

Parse a single invoice

python3 scripts/parse-invoice.py path/to/invoice.pdf --output output.csv

Parse a directory of invoices

python3 scripts/parse-invoices.py ./invoice_directory/ --output consolidated.xlsx

With OCR (for scanned PDFs)

python3 scripts/parse-invoice.py scanned_invoice.pdf --ocr --output output.csv
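
The command above turns OCR on explicitly with --ocr. As a rough illustration of how one might decide, page by page, whether OCR is worth running, the sketch below flags pages whose extracted text is nearly empty. The function names and the 20-character threshold are assumptions made for this example; they are not taken from the skill's scripts.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page's extracted text looks too sparse to be a real text layer."""
    # Count only letters and digits; image-only pages often yield nothing
    # but whitespace or stray control characters.
    meaningful = sum(1 for ch in page_text if ch.isalnum())
    return meaningful < min_chars


def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages that would fall back to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Hypothetical per-page text, as a PDF text extractor might return it.
sample_pages = [
    "Invoice INV-2026-0042\nAcme Corp\nTotal: 275.00 USD",  # searchable page
    "  \n\x0c",                                              # image-only page
    "",                                                      # blank extraction
]
print(pages_needing_ocr(sample_pages))
```

A threshold like this is a heuristic: a page holding only a short stamp or page number would still be sent to OCR, which is usually the safer direction to err in.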

Extracted Fields

Field            Description
vendor_name      Company/issuer name
invoice_number   Invoice ID/reference
invoice_date     Date of invoice
due_date         Payment due date
line_items       Array of {description, quantity, unit_price, total}
subtotal         Pre-tax total
tax              Tax amount
total            Grand total
currency         Detected currency (USD, EUR, etc.)
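
To make the shape of one parsed record concrete, here is a minimal sketch that models the fields above as Python dataclasses and serializes them to JSON with line_items as a nested array. The class names, the field types, and the sample values are assumptions for illustration; the skill's own internal representation is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float
    total: float


@dataclass
class Invoice:
    vendor_name: str
    invoice_number: str
    invoice_date: str  # kept as a string; the date format is an assumption
    due_date: str
    subtotal: float
    tax: float
    total: float
    currency: str
    line_items: list[LineItem] = field(default_factory=list)


def to_json(invoice: Invoice) -> str:
    """Serialize an invoice; asdict() converts nested LineItem objects too."""
    return json.dumps(asdict(invoice), indent=2)


# Hypothetical record, not real parser output.
example = Invoice(
    vendor_name="Example Ltd",
    invoice_number="INV-0001",
    invoice_date="2026-01-01",
    due_date="2026-01-31",
    subtotal=100.0,
    tax=10.0,
    total=110.0,
    currency="USD",
    line_items=[LineItem("Consulting", 2, 50.0, 100.0)],
)
print(to_json(example))
```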

Output Format

CSV columns:

vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency

Each line item becomes a row, with invoice-level fields repeated.
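
The flattening rule described above can be sketched with the standard csv module. The example assumes the invoice is held as a plain dict with a line_items list, and that each item's total maps to the line_total column; the function names are hypothetical and the skill's scripts may do this differently.

```python
import csv
import io

CSV_COLUMNS = [
    "vendor_name", "invoice_number", "invoice_date", "due_date",
    "description", "quantity", "unit_price", "line_total",
    "subtotal", "tax", "total", "currency",
]
ITEM_COLUMNS = {"description", "quantity", "unit_price", "line_total"}


def invoice_to_rows(invoice: dict) -> list[dict]:
    """Produce one row per line item, repeating the invoice-level fields."""
    shared = {k: invoice.get(k, "") for k in CSV_COLUMNS if k not in ITEM_COLUMNS}
    rows = []
    # Assumption: an invoice with no line items yields no rows in this sketch.
    for item in invoice.get("line_items", []):
        row = dict(shared)
        row["description"] = item["description"]
        row["quantity"] = item["quantity"]
        row["unit_price"] = item["unit_price"]
        row["line_total"] = item["total"]  # item-level total, renamed for CSV
        rows.append(row)
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Render rows as CSV text with the documented column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical invoice used only to exercise the functions.
invoice = {
    "vendor_name": "Example Ltd", "invoice_number": "INV-0001",
    "invoice_date": "2026-01-01", "due_date": "2026-01-31",
    "subtotal": "30.00", "tax": "3.00", "total": "33.00", "currency": "USD",
    "line_items": [
        {"description": "Item A", "quantity": 1, "unit_price": "10.00", "total": "10.00"},
        {"description": "Item B", "quantity": 2, "unit_price": "10.00", "total": "20.00"},
    ],
}
print(rows_to_csv(invoice_to_rows(invoice)))
```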

Dependencies

  • PyPDF2: Digital PDF text extraction
  • PyMuPDF (fitz): Advanced PDF rendering
  • Pillow: Image processing for OCR
  • pytesseract: Python wrapper for the Tesseract OCR engine (requires the Tesseract binary, e.g. the tesseract-ocr package, to be installed)
  • openpyxl: Excel output support

Install system dependencies:

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr

# macOS
brew install tesseract
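
Before running the scripts, it can help to confirm that the Python packages and the Tesseract binary are actually visible to the interpreter. The following is a small pre-flight sketch using only the standard library; the function names are hypothetical, and the import names (fitz for PyMuPDF, PIL for Pillow) are the usual ones for those packages.

```python
import importlib.util
import shutil

# pip package name -> module name used at import time
REQUIRED_MODULES = {
    "PyPDF2": "PyPDF2",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "openpyxl": "openpyxl",
}


def missing_modules(modules: dict = REQUIRED_MODULES) -> list[str]:
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def tesseract_available() -> bool:
    """Return True if a 'tesseract' executable is on PATH."""
    return shutil.which("tesseract") is not None


print("Missing Python packages:", missing_modules() or "none")
print("Tesseract on PATH:", tesseract_available())
```

Running this inside the same virtualenv or container used for the skill shows what that environment can see, which may differ from the system Python.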

Limitations

  • Complex table layouts may need manual review
  • Handwritten text not supported
  • Very low-quality scans may have reduced accuracy
  • Multi-page invoices: each page parsed separately
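
Since some results may need manual review, a simple arithmetic check on each parsed invoice can help pick out the ones worth a second look. The sketch below is one possible approach and is not part of the skill: the review_flags function, the dict layout, and the 0.01 tolerance are all assumptions.

```python
from decimal import Decimal


def review_flags(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return human-readable reasons why an invoice may need manual review."""
    flags = []
    items = invoice.get("line_items", [])
    if not items:
        flags.append("no line items extracted")

    # Decimal(str(...)) avoids binary floating-point drift on money values.
    line_sum = sum((Decimal(str(i["total"])) for i in items), Decimal("0"))
    subtotal = Decimal(str(invoice["subtotal"]))
    tax = Decimal(str(invoice["tax"]))
    total = Decimal(str(invoice["total"]))

    if items and abs(line_sum - subtotal) > tolerance:
        flags.append(f"line items sum to {line_sum}, subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        flags.append(f"subtotal + tax is {subtotal + tax}, total is {total}")
    return flags


# Hypothetical parsed invoice with a deliberately wrong total.
suspect = {
    "subtotal": "30.00", "tax": "3.00", "total": "35.00",
    "line_items": [{"total": "10.00"}, {"total": "20.00"}],
}
for reason in review_flags(suspect):
    print("REVIEW:", reason)
```

An empty list means the totals are internally consistent; it does not confirm that the fields were read correctly from the PDF.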

Example

Input: invoice_1234.pdf

Output (output.csv):

vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget A,10,25.00,250.00,450.00,45.00,495.00,USD
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget B,5,40.00,200.00,450.00,45.00,495.00,USD

Integration with MoltyWork

For MoltyWork projects requiring PDF data extraction:

  1. Download PDFs from the project
  2. Run parse-invoices.py on the directory
  3. Upload the resulting CSV/Excel as the deliverable

python3 scripts/parse-invoices.py ./project_pdfs/ --output deliverable.xlsx
