PDF Field Extractor
AI-powered PDF structured data extraction — convert PDF key fields into Excel/JSON.
End-to-End Flow
User uploads PDF → Document type identification → AI field extraction → Structured output (Excel/JSON)
from scripts.pdf_extractor import extract_pdf_text
from scripts.field_extractor import extract_fields
from scripts.output_generator import generate_excel, generate_json
# Step 1: Extract PDF text (PyMuPDF + pdfplumber)
text, tables, images = extract_pdf_text("invoice.pdf")
# Step 2: AI field extraction (user provides own API Key, OpenAI-compatible)
fields = extract_fields(
text=text,
doc_type="invoice",
api_key="sk-xxx",
api_base="https://api.openai.com/v1",
model="gpt-4o",
)
Supported Document Types
| Type | Description |
|---|
| Invoice | VAT invoice, receipt invoice, electronic invoice |
| Contract | Contracts, agreements |
| Receipt | Receipts, tickets |
| Bank Statement | Bank reconciliation statements |
| License | Business license |
| ID Card | ID card, passport |
| Express | Waybill, shipping label |
| Generic | User-defined custom extraction |
Detection Modes
| Mode | Description |
|---|
| Auto | AI automatically identifies document type |
| Manual | User specifies document type |
Tiered Features
| Feature | FREE | PRO |
|---|
| Monthly pages | 10 | Unlimited |
| Document types | Invoice only | All types |
| Output formats | Text | Excel + JSON + Text |
| OCR languages | English | English + Chinese + 9 more |
| Batch processing | 1 page | Unlimited |
| Custom fields | — | Yes |
| Price | Free | $0.01/call |
Technical Implementation
- PDF parsing: PyMuPDF (fitz) + pdfplumber for text and table extraction
- OCR: EasyOCR / Tesseract for scanned documents (multi-language support)
- AI extraction: OpenAI-compatible API, model-agnostic (GPT-4o, DeepSeek, GLM, etc.)
- Output: Excel (.xlsx) with formatted sheets, JSON with structured hierarchy
Output Format
Excel Output
- Sheet per document type
- Header row with field names
- Data rows with extracted values
- Color-coded by confidence
JSON Output
{
"doc_type": "invoice",
"fields": {
"invoice_number": "...",
"date": "...",
"amount": "...",
"buyer": "...",
"seller": "..."
},
"confidence": 0.95
}
Security Notes
- AI API calls: Uses
requests.post to OpenAI-compatible endpoints with user-provided API key (not stored)
- Data storage: Uses
/tmp/pdf-extractor/ for temporary processing files (no home directory write)
- OCR: Local processing via EasyOCR/Tesseract (no external data transmission)
- Billing data:
FEISHU_USER_ID transmitted to skillpay.me/api/v1/billing for per-call charging
Billing
- Billing via
skillpay.me/api/v1/billing/charge
- User data transmitted to SkillPay for billing identification
- $0.01 USD per extraction call (PRO tier)
Required Environment Variables
| Variable | Description |
|---|
FEISHU_USER_ID | User open_id for billing |
SKILL_BILLING_API_KEY | SkillPay Builder API Key |
SKILL_BILLING_SKILL_ID | SkillPay Skill ID (default: pdf-extractor) |
Common Errors
| Error | Cause | Solution |
|---|
NO_TEXT_EXTRACTED | Scanned PDF without OCR | Enable OCR or use digital PDF |
UNSUPPORTED_DOC_TYPE | Document type not recognized | Specify type manually |
API_ERROR | AI API key invalid or quota exceeded | Check API key |