pdf-ocr-byzhangchong

v1.0.0

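Batch-OCR scanned PDFs, automatically producing PDFs with a text layer that can also be exported to Markdown or plain text. A typical use case is a teacher Agent that needs to turn large numbers of scanned textbook PDFs into searchable text.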


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for openclawzhangchong/pdf-ocr-zc.

Prompt preview (Install & Setup):
Install the skill "pdf-ocr-byzhangchong" (openclawzhangchong/pdf-ocr-zc) from ClawHub.
Skill page: https://clawhub.ai/openclawzhangchong/pdf-ocr-zc
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install pdf-ocr-zc

ClawHub CLI


npx clawhub@latest install pdf-ocr-zc

Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description (batch OCR -> searchable PDFs / text/Markdown) match the included script and docs. The script calls ocrmypdf/Tesseract (expected for OCR) and includes sensible CLI options. No unrelated credentials, binaries, or configuration paths are requested.
Instruction Scope
SKILL.md limits actions to local OCR processing and shows expected commands. The script only invokes local binaries (ocrmypdf/Tesseract), traverses an input directory when asked, and writes outputs and logs locally. Caution: it executes external binaries found on PATH (ocrmypdf/tesseract) — if those binaries were replaced or malicious, the skill would run them. Also, integrating the script into HEARTBEAT/cron grants regular automated access to whatever directories are configured.
Install Mechanism
There is no install spec; the skill is instruction-only with a small helper script. The references recommend installing Tesseract (from a GitHub repo) and ocrmypdf via pip — reasonable and traceable guidance. No downloads from obscure URLs or archive extraction are present.
Credentials
The skill requests no environment variables, credentials, or config paths. This is proportionate for a local OCR utility. Note: it relies on PATH for required binaries, so PATH integrity matters but no secrets are requested/exfiltrated by the code.
Persistence & Privilege
The skill does not force persistent installation (always:false) and is user-invocable. The docs suggest adding the script to scheduled workflows (cron/HEARTBEAT), which is a design choice — scheduling gives regular file access but is not performed by the skill itself. Autonomous model invocation is allowed by default (normal) but not used in the files.
Assessment
This skill appears to do exactly what it says: run ocrmypdf/Tesseract locally to add a text layer to PDFs and extract text. Before installing or enabling automated runs, consider the following:

  1. Install ocrmypdf and Tesseract from trusted sources and verify their checksums or official release pages.
  2. Run the script on copies of important PDFs first to avoid accidental overwrites; the script writes _ocr.pdf files next to inputs by default.
  3. Ensure the system PATH points to the genuine ocrmypdf/tesseract binaries (PATH hijacking is a general risk when running subprocesses).
  4. If you plan to schedule it (cron/HEARTBEAT), limit the watched directory to only the PDFs that should be processed and ensure the agent account has only the necessary filesystem permissions.
  5. Review logs/log path (logs/pdf_ocr_error.log) and monitor disk usage for large batches.

If you want extra isolation, run OCR jobs in a sandbox/container or a dedicated user account.
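
The PATH check recommended in the assessment can be done before the first run. The following is a minimal sketch, not part of the skill: the helper names `resolve_binaries` and `report` are hypothetical, and it only uses `shutil.which` from the Python standard library to show which executables PATH would actually resolve to.

```python
import shutil
from typing import Dict, Iterable, Optional


def resolve_binaries(
    names: Iterable[str] = ("ocrmypdf", "tesseract"),
) -> Dict[str, Optional[str]]:
    """Map each binary name to the path PATH resolves it to (None if missing)."""
    return {name: shutil.which(name) for name in names}


def report(resolved: Dict[str, Optional[str]]) -> str:
    """Render one line per binary so the resolved paths can be reviewed by eye."""
    lines = []
    for name, path in resolved.items():
        lines.append(f"{name}: {path if path else 'NOT FOUND on PATH'}")
    return "\n".join(lines)
```

Calling `print(report(resolve_binaries()))` lists the resolved locations; a path outside the expected install directory is a reason to stop and investigate before running any OCR job.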

Like a lobster shell, security has layers — review code before you run it.

latest: vk97a7mg81prkrftv9rt57mt4rn852jtz
86 downloads
1 star
1 version
Updated 1w ago
v1.0.0
MIT-0

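PDF OCR Processing Skill

When to use

  • You need text recognition (OCR) for a large number of scanned PDFs
  • You want a searchable PDF (with a text layer) or extracted plain text/Markdown directly
  • You need to automate this step in a teacher Agent workflow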

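Basic usage

# Run OCR once (requires Tesseract and ocrmypdf to be installed)
openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py <input-pdf> <output-pdf>
  • <input-pdf>: path to the original scanned PDF
  • <output-pdf>: output PDF with a text layer (same directory or a path you specify)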

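Advanced options

  • To process every PDF in a directory in one run, use the --batch-dir parameter:
openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir <pdf-dir>
  • Add --lang chi_sim to select the Simplified Chinese model. Pass it explicitly for Chinese scans: stock Tesseract and ocrmypdf fall back to English (eng) when no language is given rather than detecting the language automatically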
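
How a language option might translate into an ocrmypdf invocation can be sketched as follows. This is an illustration under assumptions, not the code in ocr_batch.py: the function name `build_ocr_command` is hypothetical, and the only ocrmypdf behavior relied on is its `-l` option, which accepts several language codes joined with `+`.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_ocr_command(
    src: Path, dst: Path, langs: Optional[Sequence[str]] = None
) -> List[str]:
    """Build an ocrmypdf argument list; the language flag is added only if given."""
    cmd = ["ocrmypdf"]
    if langs:
        # ocrmypdf takes multiple languages as one "+"-joined value.
        cmd += ["-l", "+".join(langs)]
    cmd += [str(src), str(dst)]
    return cmd
```

Passing the command as a list (for example to `subprocess.run(cmd, check=True)`) avoids shell quoting problems with paths that contain spaces.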

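Script notes (scripts/ocr_batch.py)

  • Checks that ocrmypdf is available; if it is not installed, prints the install command
  • Runs OCR through ocrmypdf, which calls the installed Tesseract internally
  • Supports a batch directory mode that iterates over *.pdf and produces a matching text-layer file for each
  • Errors are recorded in logs/pdf_ocr_error.log for troubleshooting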
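
The batch behavior listed above could look roughly like the sketch below. It is not the skill's actual script: `plan_batch` and `run_batch` are hypothetical names, the `_ocr` output suffix is taken from the security assessment on this page, and the `runner` parameter exists only so the loop can be exercised without ocrmypdf installed.

```python
import logging
import subprocess
from pathlib import Path
from typing import Callable, List, Tuple


def plan_batch(pdf_dir: Path, suffix: str = "_ocr") -> List[Tuple[Path, Path]]:
    """Pair each source PDF with its output path, skipping earlier OCR outputs."""
    jobs = []
    for src in sorted(pdf_dir.glob("*.pdf")):
        if src.stem.endswith(suffix):
            continue  # looks like an output from a previous run
        jobs.append((src, src.with_name(f"{src.stem}{suffix}.pdf")))
    return jobs


def run_batch(
    pdf_dir: Path, log_path: Path, runner: Callable = subprocess.run
) -> int:
    """Run ocrmypdf on every planned job; log failures and return their count."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("pdf_ocr_sketch")
    handler = logging.FileHandler(log_path, encoding="utf-8")
    logger.addHandler(handler)
    failures = 0
    try:
        for src, dst in plan_batch(pdf_dir):
            try:
                runner(["ocrmypdf", str(src), str(dst)], check=True)
            except (subprocess.CalledProcessError, OSError) as exc:
                # One bad file should not abort the rest of the batch.
                failures += 1
                logger.error("OCR failed for %s: %s", src, exc)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return failures
```

Catching `OSError` alongside `CalledProcessError` covers the case where ocrmypdf is missing from PATH, so the failure ends up in the log instead of as an unhandled traceback.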

References

  • references/ocr_tips.md: common OCR parameter tuning tips (such as DPI and image preprocessing)
  • references/install_ocr.md: detailed steps for installing Tesseract and ocrmypdf on Windows

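Integration with the teacher Agent

In a teacher Agent workflow (such as auto_ingest), add the following call to HEARTBEAT.md or cron to run OCR automatically every day:

openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir /path/to/teacher-pdfs

This way the PDFs already carry a text layer before the teacher Agent ingests them, so later vectorization and retrieval run smoothly.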
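
For a daily scheduled run it may help to skip files that were already processed. The sketch below shows one possible check based on modification times; `needs_ocr` is a hypothetical helper, and nothing on this page says whether ocr_batch.py does anything similar.

```python
from pathlib import Path


def needs_ocr(src: Path, dst: Path) -> bool:
    """Return True if the output is missing or older than the source PDF."""
    if not dst.exists():
        return True
    # Re-run when the scan was replaced after the last OCR pass.
    return src.stat().st_mtime > dst.stat().st_mtime
```

A scheduled wrapper could filter its job list with this check so that repeated runs only spend time on new or changed scans.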


Usage examples

  1. Single-file OCR:
openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py D:\docs\scan1.pdf D:\docs\scan1_text.pdf
  2. Batch directory OCR:
openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir D:\teacher-pdfs

For finer-grained text output (Markdown), run pdf2txt.py on the script's output afterwards.
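
Once plain text has been extracted, turning it into Markdown can be as simple as adding page headings. The sketch below assumes the extracted text separates pages with a form feed character (`\f`), which is worth confirming on your own output; `text_to_markdown` is a hypothetical helper and not part of the skill.

```python
from typing import List


def text_to_markdown(text: str, title: str = "OCR output") -> str:
    """Split extracted text on form feeds and emit one Markdown section per page."""
    pages = [page.strip() for page in text.split("\f")]
    parts: List[str] = [f"# {title}"]
    for number, page in enumerate(pages, start=1):
        if not page:
            continue  # skip blank pages and the trailing empty chunk
        parts.append(f"## Page {number}\n\n{page}")
    return "\n\n".join(parts) + "\n"
```

Page numbers are kept from the original position, so a blank page in the scan leaves a gap in the headings rather than shifting later pages.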


Note: this skill runs only on the local machine and does not trigger external network requests, in line with the security policy.
