PaddleOCR Document Parsing V2
v1.0.4
PaddleOCR Document Parsing
Parse images and PDF files using PaddleOCR's API. Supports both synchronous and asynchronous parsing modes with structured output.
Resource Links
| Resource | Link |
|---|---|
| Official Website | https://www.paddleocr.com |
| API Documentation | https://ai.baidu.com/ai-doc/AISTUDIO/Cmkz2m0ma |
| GitHub | https://github.com/PaddlePaddle/PaddleOCR |
Key Features
- Multi-format support: PDF and image files (JPG, PNG, BMP, TIFF)
- Two parsing modes:
  - Sync mode: Fast response for small files (timeout under 600 s)
  - Async mode: For large files, with progress polling
- Layout analysis: Automatic detection of text blocks, tables, formulas
- Multi-language: Support for 110+ languages
- Structured output: Markdown format with preserved document structure
Setup
- Visit PaddleOCR (https://www.paddleocr.com) to obtain your API credentials
- Set environment variables:
export PADDLEOCR_ACCESS_TOKEN="your_token_here"
export PADDLEOCR_API_URL="https://your-endpoint.aistudio-app.com/layout-parsing"
# Optional: For async mode
export PADDLEOCR_JOB_URL="https://your-job-endpoint.aistudio-app.com/api/v2/ocr/jobs"
export PADDLEOCR_MODEL="PaddleOCR-VL-1.5"
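Before running a parse, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of the shipped scripts: `missing_vars` is a hypothetical helper, and treating both `PADDLEOCR_JOB_URL` and `PADDLEOCR_MODEL` as async-only is an assumption based on the "Optional: For async mode" comment.

```python
import os

# Required in every mode, per the runtime requirements of this skill.
REQUIRED_VARS = ("PADDLEOCR_ACCESS_TOKEN", "PADDLEOCR_API_URL")
# Assumption: both optional variables are only consulted in async mode.
ASYNC_VARS = ("PADDLEOCR_JOB_URL", "PADDLEOCR_MODEL")


def missing_vars(env, use_async=False):
    """Return the names of expected variables that are unset or empty."""
    names = REQUIRED_VARS + (ASYNC_VARS if use_async else ())
    return [name for name in names if not env.get(name)]
```

Calling `missing_vars(os.environ)` returns an empty list when the sync-mode variables are present; pass `use_async=True` to include the async ones in the check.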
Usage Examples
Sync Mode (Default)
For small files and quick processing:
# Parse local image
{baseDir}/paddleocr_parse.sh document.jpg
# Parse PDF
{baseDir}/paddleocr_parse.sh -t pdf document.pdf
# Parse from URL
{baseDir}/paddleocr_parse.sh https://example.com/document.jpg
# Save output to file
{baseDir}/paddleocr_parse.sh -o result.json document.jpg
# Verbose output
{baseDir}/paddleocr_parse.sh -v document.jpg
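When the script is driven from Python rather than typed by hand, the flags shown above can be assembled programmatically. This is a sketch under assumptions: `build_command` is a hypothetical helper, it uses only the flags that appear in this document (`-t`, `-o`, `-v`, `--async`), and it assumes those flags can be combined freely, which the examples do not confirm for every pairing.

```python
from pathlib import Path


def build_command(base_dir, source, file_type=None, output=None,
                  verbose=False, use_async=False):
    """Assemble the argument list for paddleocr_parse.sh."""
    cmd = [str(Path(base_dir) / "paddleocr_parse.sh")]
    if use_async:
        cmd.append("--async")
    if file_type:
        cmd += ["-t", file_type]   # e.g. "pdf"
    if output:
        cmd += ["-o", output]      # write the JSON result to this file
    if verbose:
        cmd.append("-v")
    cmd.append(source)             # local path or URL goes last, as in the examples
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` if the script exits with a non-zero code.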
Async Mode
For large files with progress tracking:
# Parse large PDF with async mode
{baseDir}/paddleocr_parse.sh --async large-document.pdf
# Parse from URL with async mode
{baseDir}/paddleocr_parse.sh --async -t pdf https://example.com/doc.pdf
# Save async result to file
{baseDir}/paddleocr_parse.sh --async -o result.json document.pdf
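Async mode relies on progress polling, which the script handles internally. For readers who want to see the general pattern, here is a generic sketch. It does not reproduce the script's actual logic: `poll_until_done` and `fetch_status` are hypothetical names, and the state strings `"done"` and `"failed"` are placeholders, since the job API's real status values are not listed in this document.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=600.0,
                    done_states=("done",), failed_states=("failed",),
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        state = fetch_status()
        if state in done_states:
            return state
        if state in failed_states:
            raise RuntimeError(f"job ended in state {state!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout} s")
        sleep(interval)  # wait before asking again
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays; in normal use the defaults are fine.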
Using Python Script Directly
# Sync mode
python3 {baseDir}/paddleocr_parse.py document.jpg
# Async mode
python3 {baseDir}/paddleocr_parse.py --async-mode document.pdf
# With output file
python3 {baseDir}/paddleocr_parse.py -o result.json --async-mode document.pdf
Response Structure
{
  "logId": "unique_request_id",
  "errorCode": 0,
  "errorMsg": "Success",
  "result": {
    "layoutParsingResults": [
      {
        "prunedResult": [...],
        "markdown": {
          "text": "# Document Title\n\nParagraph content...",
          "images": {}
        },
        "outputImages": [...],
        "inputImage": "http://input-image"
      }
    ],
    "dataInfo": {...}
  }
}
Important Fields:
- prunedResult: Contains detailed layout element information, including positions and categories.
- markdown: Stores the document content converted to Markdown, with structure and formatting preserved.
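To pull the Markdown out of a response shaped like the one above, something along these lines should work. It is a minimal sketch: `extract_markdown` is a hypothetical helper, and joining entries with a blank line assumes each item in `layoutParsingResults` is an independent chunk (presumably one per page or image).

```python
def extract_markdown(response):
    """Join the Markdown text of every entry in layoutParsingResults."""
    results = response.get("result", {}).get("layoutParsingResults", [])
    texts = [item.get("markdown", {}).get("text", "") for item in results]
    return "\n\n".join(text for text in texts if text)
```

For a file saved with `-o result.json`, load it with `json.load` and pass the resulting dictionary to `extract_markdown`.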
Mode Selection Guide
| Use Case | Recommended Mode |
|---|---|
| Small images (< 10MB) | Sync |
| Single page PDFs | Sync |
| Large PDFs (> 10MB) | Async |
| Multi-page documents | Async |
| Batch processing | Async |
| Quick text extraction | Sync |
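The table can be folded into a small decision helper. This is only a sketch of the guidance above: `choose_mode` is a hypothetical function, 10 MB is taken as 10 × 1024 × 1024 bytes, and a file of exactly that size is sent to sync mode, neither of which the table specifies.

```python
# Assumption: "10MB" in the table means 10 * 1024 * 1024 bytes.
SYNC_LIMIT_BYTES = 10 * 1024 * 1024


def choose_mode(size_bytes, page_count=1, batch=False):
    """Return "sync" or "async" following the mode selection table."""
    if batch or page_count > 1 or size_bytes > SYNC_LIMIT_BYTES:
        return "async"
    return "sync"
```

A single-page 2 MB scan maps to sync, while a 40-page PDF or any batch job maps to async regardless of size.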
Error Handling
The script exits with code 1 and prints an error message in the following cases:
- Missing required environment variables
- File not found
- API authentication failures
- Invalid JSON responses
- API error codes (non-zero)
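Code that consumes the raw API output can apply the last two checks itself. The sketch below mirrors them in Python; `ParseError` and `load_result` are hypothetical names, and the check relies only on the `errorCode`, `errorMsg`, and `result` fields shown under Response Structure.

```python
import json


class ParseError(Exception):
    """Raised when the API output is not a successful response."""


def load_result(raw_text):
    """Parse raw API output and return its "result" object, or raise ParseError."""
    try:
        response = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ParseError(f"invalid JSON response: {exc}") from exc
    if not isinstance(response, dict):
        raise ParseError("unexpected response: top level is not an object")
    code = response.get("errorCode")
    if code != 0:  # a missing errorCode is treated as a failure too
        raise ParseError(f"API error {code}: {response.get('errorMsg')}")
    return response.get("result", {})
```

Raising a dedicated exception lets a caller decide whether to retry, switch modes, or exit with code 1 as the script does.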
Quota Information
See the official documentation: https://ai.baidu.com/ai-doc/AISTUDIO/Xmjclapam
Runtime requirements
📄 Clawdis
- OS: macOS, Linux
- Binaries: curl, base64, jq, python3
- Environment variables: PADDLEOCR_ACCESS_TOKEN, PADDLEOCR_API_URL
