PDF Intelligence Suite
v1.0.0
PDF Intelligence Suite - Text extraction, table recognition, OCR, PDF to Word/Excel conversion
Security Scan
OpenClaw
Benign
medium confidence
Purpose & Capability
Name, README, SKILL.md, requirements, and the shown source files (extractor, ocr, converter, manipulator, etc.) are coherent: the requested libraries (PyPDF2, pdfplumber, pytesseract, pdf2image, python-docx, openpyxl, reportlab, Pillow, camelot) match the described features (text extraction, table recognition, OCR, conversion, page manipulation, security). No unrelated cloud credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md gives concrete install and usage steps (pip install -r requirements.txt, install system Tesseract/poppler), and the runtime examples and APIs operate only on local PDF files. The instructions do not ask the agent to read unrelated host files, access external endpoints, or exfiltrate environment variables.
Install Mechanism
There is no special install spec (the skill relies on pip requirements and system packages). This is low risk in terms of arbitrary downloads, but the dependency list includes system-level components (Tesseract, poppler) and heavy Python packages (camelot, opencv, pdf2image) that require native libraries; the README documents those needs. Minor inconsistency: setup.py defines a console entry_point 'pdf-suite=pdf_intelligence_suite.cli:main' but no cli.py was listed in the manifest, which may be a packaging oversight (not necessarily malicious).
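The entry-point mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the package: `parse_entry_point` and `module_is_importable` are hypothetical helper names, and the only input taken from the source is the entry-point string quoted from setup.py. It reports whether the target module can be located in the current environment, nothing more.

```python
import importlib.util


def parse_entry_point(spec):
    """Split 'name=package.module:func' into (name, module, attr)."""
    name, _, target = spec.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()


def module_is_importable(module_name):
    """Return True if the module can be located.

    find_spec imports any parent packages but does not execute the
    target module itself. A missing parent raises ModuleNotFoundError,
    which is a subclass of ImportError.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Entry-point string as quoted from setup.py in the scan above.
    name, module, attr = parse_entry_point(
        "pdf-suite=pdf_intelligence_suite.cli:main"
    )
    print(f"{name}: module {module!r} importable = {module_is_importable(module)}")
```

If the check reports that the module is not importable after installation, the `pdf-suite` console script would be expected to fail at startup, which is consistent with the packaging oversight noted above.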
Credentials
The skill does not declare required environment variables or credentials. The README notes optional TESSDATA_PREFIX for nonstandard Tesseract installs (reasonable). No environment variables named SECRET/TOKEN/KEY are requested and the code shown does not read unrelated env vars.
Persistence & Privilege
The skill does not request always:true and has default invocation privileges. It does not attempt to modify other skills or system-wide agent configuration in the reviewed files.
Assessment
This package looks like a straightforward local PDF processing library, and its files (extractor, ocr, converter, manipulator) match the documented features. Before installing or running on sensitive documents:
1) Review the omitted or truncated files (security.py, utils.py, tables.py, and any remaining code) for network calls or unexpected file access; these files were not fully shown.
2) Run the package in an isolated environment (VM or container), because several dependencies require native system packages (Tesseract, poppler, Ghostscript) and heavy Python packages.
3) Note the small packaging inconsistency: setup.py references pdf_intelligence_suite.cli:main but cli.py isn't in the manifest, so expect that the console script may not work until fixed.
4) If you will process confidential PDFs, verify the behavior of security.py (encryption/decryption) and any logging or network functionality to ensure nothing is transmitted externally.
If you provide the omitted files, I can scan them for network or subprocess usage.
Like a lobster shell, security has layers: review code before you run it.
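A first pass over the unreviewed files for network or subprocess usage can be automated with the standard library. This is a rough sketch under stated assumptions: `find_suspect_imports` and the `SUSPECT_MODULES` set are illustrative names and choices, not part of the package or the scanner, and the check only sees static `import` statements.

```python
import ast

# Illustrative watch list (an assumption); extend it to suit your policy.
SUSPECT_MODULES = {
    "socket", "http", "urllib", "requests", "ftplib", "smtplib", "subprocess",
}


def find_suspect_imports(source):
    """Return the sorted top-level suspect modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as 'from . import utils'.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in SUSPECT_MODULES:
                found.add(root)
    return sorted(found)


if __name__ == "__main__":
    sample = "import os\nimport urllib.request\nfrom subprocess import run\n"
    print(find_suspect_imports(sample))
```

A clean result is not proof of safety: dynamic imports, `os.system`, and calls made inside third-party dependencies are not detected, so manual review of security.py, utils.py, and tables.py is still needed.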
latest
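PDF Intelligence Suite
Chinese Description
Overview
PDF Intelligence Suite is a powerful PDF document processing toolkit that provides one-stop services for text extraction, table recognition, OCR, format conversion, and more.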
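Features
- 📄 Text Extraction: Extract plain or structured text from PDFs, with support for multiple layout analysis modes
- 📊 Table Recognition: Automatically detect tables in PDFs and extract them as structured data (CSV/Excel)
- 🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, with multi-language support
- 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
- ✂️ Page Operations: Merge, split, rotate, delete pages
- 🔒 Security: Encryption, decryption, watermarking, digital signatures
- 📝 Metadata Management: Read and modify PDF document properties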
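Tech Stack
- PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
- pdfplumber: Advanced text and table extraction with precise positioning
- camelot-py: Professional table recognition engine
- pytesseract: OCR text recognition (requires Tesseract to be installed)
- pdf2image: PDF to image conversion
- reportlab: PDF generation and editing
- Pillow: Image processing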
Directory Structure
pdf-intelligence-suite/
├── SKILL.md                      # This file
├── README.md                     # Usage documentation
├── requirements.txt              # Dependency declarations
├── setup.py                      # Installation configuration
├── src/
│   └── pdf_intelligence_suite/
│       ├── __init__.py
│       ├── extractor.py          # Text extraction module
│       ├── tables.py             # Table recognition module
│       ├── ocr.py                # OCR module
│       ├── converter.py          # Format conversion module
│       ├── manipulator.py        # Page operations module
│       ├── security.py           # Security module
│       └── utils.py              # Utility functions
├── examples/
│   └── basic_usage.py            # Usage examples
└── tests/
    └── test_pdf_suite.py         # Unit tests
Quick Start
from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor
# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")
# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")
# OCR recognition (Simplified Chinese plus English)
ocr = OCRProcessor(lang='chi_sim+eng')
text = ocr.process("scanned.pdf")
Installation
pip install -r requirements.txt
# Install the Tesseract OCR engine with Chinese language packs (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra
# macOS
brew install tesseract tesseract-lang
# Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki
English Description
Overview
PDF Intelligence Suite is a powerful PDF document processing toolkit providing one-stop services for text extraction, table recognition, OCR, format conversion, and more.
Features
- 📄 Text Extraction: Extract plain or structured text from PDFs with layout analysis
- 📊 Table Recognition: Automatically detect and extract tables as structured data (CSV/Excel)
- 🔍 OCR Recognition: Recognize text in scanned documents and image-based PDFs, multi-language support
- 🔄 Format Conversion: PDF to Word, PDF to Excel, PDF to images, etc.
- ✂️ Page Operations: Merge, split, rotate, delete pages
- 🔒 Security: Encryption, decryption, watermarking, digital signatures
- 📝 Metadata: Read and modify PDF document properties
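To illustrate the "tables as structured data (CSV/Excel)" step from the list above, here is a small standard-library sketch. It assumes extracted tables arrive as lists of rows whose cells are strings or `None` for empty cells, which is the general shape pdfplumber's table extraction produces; `table_to_csv` is a hypothetical helper, not the suite's API.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (list of rows of cells) to CSV text.

    Empty cells (None) become empty strings and surrounding whitespace
    is stripped from each cell.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(
            ["" if cell is None else str(cell).strip() for cell in row]
        )
    return buf.getvalue()


if __name__ == "__main__":
    table = [["Name", "Qty"], ["Widget", None], ["Bolt, M4", " 3 "]]
    print(table_to_csv(table), end="")
```

Cells containing the delimiter are quoted by the `csv` module, so values such as `Bolt, M4` survive a round trip. Writing the same rows to Excel would go through openpyxl, which the requirements list already includes.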
Tech Stack
- PyPDF2: Basic PDF operations (merge, split, encrypt, etc.)
- pdfplumber: Advanced text and table extraction with precise positioning
- camelot-py: Professional table recognition engine
- pytesseract: OCR text recognition (requires Tesseract installation)
- pdf2image: PDF to image conversion
- reportlab: PDF generation and editing
- Pillow: Image processing
Quick Start
from pdf_intelligence_suite import PDFExtractor, TableExtractor, OCRProcessor
# Text extraction
extractor = PDFExtractor()
text = extractor.extract_text("document.pdf")
# Table extraction
tables = TableExtractor.extract_tables("report.pdf", output_format="excel")
# OCR recognition
ocr = OCRProcessor(lang='eng')
text = ocr.process("scanned.pdf")
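The Quick Start leaves the choice between text extraction and OCR to the caller. One way to decide automatically is sketched below, calling the underlying libraries directly. Everything here is an assumption for illustration: the helper names (`looks_scanned`, `extract_page_texts`, `ocr_pages`, `extract_with_fallback`) and the 20-characters-per-page threshold are hypothetical, and the suite's own `PDFExtractor` and `OCRProcessor` may behave differently.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: treat the PDF as scanned if pages carry almost no text."""
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def extract_page_texts(pdf_path):
    """Extract the embedded text layer, one string per page."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return [page.extract_text() or "" for page in pdf.pages]


def ocr_pages(pdf_path, lang="eng"):
    """Rasterize each page and run Tesseract on it."""
    from pdf2image import convert_from_path  # needs poppler installed
    import pytesseract  # needs the Tesseract binary installed

    images = convert_from_path(pdf_path)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def extract_with_fallback(pdf_path, lang="eng"):
    """Use the text layer when present, otherwise fall back to OCR."""
    texts = extract_page_texts(pdf_path)
    if looks_scanned(texts):
        texts = ocr_pages(pdf_path, lang=lang)
    return "\n".join(texts)
```

The threshold is deliberately crude. Documents that mix scanned and digital pages would need a per-page decision rather than a per-document one.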
Installation
pip install -r requirements.txt
# Install Tesseract OCR engine (Ubuntu/Debian)
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
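After installation, a quick pre-flight check can confirm that the native tools and Python modules are visible. This is a sketch under assumptions: `preflight` is a hypothetical helper, `pdftoppm` is the poppler utility that pdf2image normally invokes, and the module list is a subset of the dependencies named in this document.

```python
import importlib.util
import os
import shutil


def preflight(
    binaries=("tesseract", "pdftoppm"),
    modules=("pdfplumber", "pytesseract", "pdf2image"),
):
    """Report which system binaries and Python modules can be found."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "binaries": {name: shutil.which(name) is not None for name in binaries},
        # find_spec returns None for a top-level module that is not installed.
        "modules": {
            name: importlib.util.find_spec(name) is not None for name in modules
        },
        # Optional; only needed for nonstandard Tesseract installs.
        "TESSDATA_PREFIX": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    for section, result in preflight().items():
        print(section, result)
```

A `False` entry under `binaries` usually means the system package is missing or not on `PATH`. An unset `TESSDATA_PREFIX` is fine for standard installs.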
License
MIT License
Author
ClawHub Skills Collection
