#ocr#pdf#批量

pdf-ocr-byzhangchong

批量 OCR 处理扫描 PDF，自动生成带文字层的 PDF 并可导出为 Markdown/纯文本。使用场景包括老师 Agent 需要将大量扫描教材 PDF 转化为可检索文本。

openclaw skills install @openclawzhangchong/pdf-ocr-zc

PDF OCR 处理技能

bash

# 运行一次 OCR（需要已安装 Tesseract 与 ocrmypdf）
openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py <input-pdf> <output-pdf>

bash

openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir <pdf-dir>

在老师 Agent 的工作流（如 auto_ingest）中，可在 HEARTBEAT.md 或 cron 中加入如下调用，以实现每日自动 OCR：

bash

openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir /path/to/teacher-pdfs

这样老师 Agent 在 ingest 前就已拥有文字层，后续向量化、检索都能顺畅进行。

bash

openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py D:\docs\scan1.pdf D:\docs\scan1_text.pdf

bash

openclaw exec python skills/pdf-ocr/scripts/ocr_batch.py --batch-dir D:\teacher-pdfs

如需更细粒度的文本（Markdown），可在脚本后接 pdf2txt.py 转换。

注意：此技能仅在本机执行，不会触发外部网络请求，符合安全策略。