PDF Processor

v2.0.0

Academic PDF processing: extract text, detect the language, translate (English→Chinese), and generate a 200-character pure-Chinese summary. Uses a local Ollama model and consumes no online API quota. Suitable for academic papers, research reports, and similar PDFs. Use this skill when the user says "process PDF", "translate paper", or "generate paper summary", or drops a PDF into the paper-processing directory.


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for reaperchen/pdf-processor.

Prompt preview: Install & Setup
Install the skill "PDF Processor" (reaperchen/pdf-processor) from ClawHub.
Skill page: https://clawhub.ai/reaperchen/pdf-processor
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install pdf-processor

ClawHub CLI

Package manager switcher

npx clawhub@latest install pdf-processor
Security Scan
VirusTotal
Benign
View report →
OpenClaw
Benign
high confidence
Purpose & Capability
Overall coherent: the name/description (PDF extraction, translation, and summarization using local Ollama) matches the included scripts, which call a local Ollama HTTP API and read/write files under a user Documents directory. Minor inconsistency: the registry metadata at the top lists no required binaries/env, while package.json's openclaw.requires lists 'python3', 'pdfplumber', and the service 'ollama' — caveat: 'pdfplumber' is a Python package (not a system binary), and package.json also includes install steps. This appears to be a packaging/metadata mismatch rather than a functional red flag.
Instruction Scope
SKILL.md and scripts limit activity to PDF text extraction, local model calls (http://localhost:11434), progress files, and moving files within ~/Documents/论文处理; instructions do not attempt to read unrelated system paths or exfiltrate data to remote servers. The scripts do start 'ollama serve' locally and make HTTP requests to localhost only.
Install Mechanism
This is instruction-only in the skill bundle (no remote downloads). package.json documents Python dependency installation and manual instructions to install Ollama from the official site — no suspicious external URLs or automatic archive downloads are present.
Credentials
The skill requests no environment variables or credentials. The scripts operate against a local HTTP service and the user's home Documents directories only, which is proportionate to the stated purpose.
Persistence & Privilege
The 'always' flag is false, and the skill does not request elevated privileges or attempt to alter other skills or global agent settings. It creates and deletes progress files only within its working directory.
Assessment
This skill appears to do what it claims: extract text from PDFs, call a local Ollama model to translate and summarize, and organize results under ~/Documents/论文处理. Before running:

1. Confirm you have Ollama and the qwen2.5:7b model installed locally, and that you are comfortable starting a local Ollama process.
2. Inspect scripts/process_pdf.py and scripts/generate_index.py (both are included) and consider running them on non-sensitive PDFs first.
3. Note the metadata mismatch (package.json lists python3/pdfplumber/ollama while the registry metadata lists none); install the required Python packages (pdfplumber, requests) in a controlled environment (virtualenv/container) to avoid impacting the system Python.
4. Back up any sensitive PDFs you don't want moved or deleted, because the script moves PDF files and deletes temporary extraction files.
5. Run the tool locally (the only network calls are to localhost) and consider limiting its filesystem access by running it in a dedicated directory or container if you want extra containment.

Like a lobster shell, security has layers — review code before you run it.

latest: vk9793tkdwtt6xqsb0nngwzbegd846red
264 downloads
0 stars
1 version
Updated 3w ago
v2.0.0
MIT-0

PDF Processor

Overview

A complete workflow for processing academic-paper PDFs: text extraction, language detection, translation, and summary generation. Uses a local Ollama model (qwen2.5:7b) at zero cost, making it suitable for processing large volumes of academic literature.

Core Features

  • Extract all text from the PDF (separated by page)
  • Detect the language (Chinese/English)
  • Extract and translate the paper title (English→Chinese)
  • Segment-by-segment translation of English PDFs (2,000-4,000 characters per segment)
  • Real-time progress display (current segment/total segments, percentage)
  • Resume from checkpoint (continue from the break point after an interruption)
  • Generate a 200-character pure-Chinese summary
  • Automatic file organization and cleanup

Quick Start

Process a single PDF

python3 scripts/process_pdf.py <pdf_path> <output_base_dir>

Examples

# Process an English PDF
python3 scripts/process_pdf.py \
  ~/Documents/论文处理/未处理/英文/2602.23362v1.pdf \
  ~/Documents/论文处理

# Process a Chinese PDF
python3 scripts/process_pdf.py \
  ~/Documents/论文处理/未处理/中文/test.pdf \
  ~/Documents/论文处理

Directory Structure

Create the directory tree before processing any PDFs:

~/Documents/论文处理/
├── 未处理/          # unprocessed
│   ├── 中文/        # Chinese
│   └── 英文/        # English
├── 处理中/          # in progress
├── 已完成/          # completed
│   ├── 原文/        # originals
│   ├── 翻译/        # translations
│   └── 概述/        # summaries
└── 索引/            # index

See directory-structure.md for details.
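The layout above can be created in one step. A minimal sketch using pathlib; the helper name create_workspace is illustrative, not part of the skill's scripts:

```python
from pathlib import Path

def create_workspace(base: str) -> Path:
    """Create the documented folder layout for the PDF pipeline."""
    root = Path(base).expanduser()
    for sub in (
        "未处理/中文", "未处理/英文",                  # inbox: Chinese / English
        "处理中",                                      # in progress
        "已完成/原文", "已完成/翻译", "已完成/概述",   # done: originals / translations / summaries
        "索引",                                        # index
    ):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root

root = create_workspace("~/Documents/论文处理")
```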

Workflow

Full Processing Pipeline

1. Extract text from the PDF → save to 处理中/<filename>_提取.txt
2. Extract and translate the paper title (English→Chinese)
3. Detect the language (Chinese/English)
4. If English:
   - Split into segments (2,000-4,000 characters each)
   - Translate segment by segment (local Ollama)
   - **Show real-time progress** (current segment/total segments | percentage | character count)
   - **Save progress** (处理中/<filename>_progress.json)
   - Merge the translated segments
   - **Delete the progress file**
5. Generate the summary (200 characters of pure Chinese, local Ollama)
6. Save the translation file to 已完成/翻译/
7. Save the summary file to 已完成/概述/ (includes both English and Chinese titles)
8. Move the PDF to 已完成/原文/
9. Delete the extraction file from 处理中/

See workflow.md for details.
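The checkpoint logic in step 4 can be sketched as follows. This is an assumption about the approach, not the actual code of process_pdf.py; the function names and the progress-file schema are illustrative:

```python
import json
from pathlib import Path

def translate_with_resume(segments, progress_path, translate):
    """Translate segments, skipping any already recorded in the progress file."""
    path = Path(progress_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for i, seg in enumerate(segments):
        key = str(i)
        if key in done:            # already translated before an interruption
            continue
        done[key] = translate(seg)
        # checkpoint after each segment so an interruption loses at most one
        path.write_text(json.dumps(done, ensure_ascii=False))
        print(f"Segment {i + 1}/{len(segments)} | "
              f"{100 * (i + 1) // len(segments)}% | {len(seg)} chars")
    path.unlink(missing_ok=True)   # remove the progress file when finished
    return [done[str(i)] for i in range(len(segments))]
```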

Segmented Translation Strategy

  • Segment size: at most 4,000 characters per segment
  • Segment boundaries: break preferentially at paragraph boundaries (\n\n\n, \n\n, 。, !, ?)
  • Translation model: local Ollama (qwen2.5:7b)
  • Benefit: avoids long-text processing issues and improves translation quality
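A boundary-aware splitter along these lines might look like this. split_text is the name referenced under Troubleshooting, but this body is a sketch under the stated rules, not the script's actual implementation:

```python
def split_text(text: str, max_length: int = 4000) -> list[str]:
    """Split text into chunks of at most max_length characters,
    preferring paragraph breaks, then Chinese sentence endings."""
    separators = ["\n\n\n", "\n\n", "。", "!", "?"]
    segments = []
    while len(text) > max_length:
        window = text[:max_length]
        cut = -1
        for sep in separators:        # try boundaries in priority order
            pos = window.rfind(sep)
            if pos > 0:
                cut = pos + len(sep)  # keep the separator with the segment
                break
        if cut <= 0:                  # no boundary found: hard cut
            cut = max_length
        segments.append(text[:cut])
        text = text[cut:]
    if text:
        segments.append(text)
    return segments
```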

Summary Generation

  • Input: the first 5,000 characters (abstract and introduction)
  • Output: 200 characters of pure Chinese
  • Content: research background, main methods, key contributions, practical value
  • Cleanup: removes all English letters, digits, and symbols

Technical Details

Dependencies

  • Python libraries: pdfplumber and requests (plus the standard-library re, shutil, and pathlib)
  • Ollama model: qwen2.5:7b
  • Ollama service: running locally at http://localhost:11434
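Translation requests go to Ollama's /api/generate endpoint on localhost. A minimal sketch using requests; the prompt wording is an assumption, and the 0.3 temperature matches the value mentioned under Troubleshooting:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def translate_segment(segment: str, model: str = "qwen2.5:7b") -> str:
    """Translate one English segment to Chinese via the local Ollama service."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": f"将以下英文学术文本翻译为中文,保持段落结构:\n\n{segment}",
            "stream": False,                   # return one JSON object, not a stream
            "options": {"temperature": 0.3},   # low temperature for faithful output
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```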

Performance

  • Cost: zero (local model; no online API usage)
  • Processing time:
    • Extraction: seconds
    • Translation: 30 s-1 min per segment; total time depends on the number of segments
    • Summary: seconds
  • Quality:
    • Translation: high academic accuracy; preserves paragraph structure
    • Summary: 200 characters of pure Chinese, concise and accurate
  • New in v2.0:
    • Real-time progress display: current segment/total segments, percentage, character count
    • Resume from checkpoint: continue after an interruption, automatically skipping already-translated segments
    • Progress persistence: saved to a temporary JSON file and deleted automatically on completion

Segmentation Example

For a 45,873-character PDF:

  • Segments: 17
  • Characters per segment: 2,000-4,000
  • Total translation time: roughly 8-17 minutes
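A quick sanity check of these figures (the average segment length here is derived, not taken from the scripts):

```python
total_chars = 45_873
segments = 17
avg_len = total_chars / segments      # ≈ 2698 chars, within the 2,000-4,000 range
assert 2000 <= avg_len <= 4000
low_minutes = segments * 30 / 60      # at 30 s per segment → 8.5 min
high_minutes = segments * 60 / 60     # at 60 s per segment → 17 min
print(f"{low_minutes:.1f}-{high_minutes:.0f} minutes")
```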

Output Files

Translation File

Location: 已完成/翻译/<filename>_翻译.txt

Format:

# 论文翻译

**源文件**: 文件名.pdf
**处理时间**: YYYY-MM-DD HH:MM:SS
**翻译模型**: 本地Ollama (qwen2.5:7b)
**分段数**: N

## 📄 翻译内容

[翻译内容]

Summary File

Location: 已完成/概述/<filename>_概述.txt

Format:

# 论文概述

**源文件**: 文件名.pdf
**处理时间**: YYYY-MM-DD HH:MM:SS
**概述模型**: 本地Ollama (qwen2.5:7b)

## 📚 论文标题

**英文**: [论文英文标题]
**中文**: [论文中文标题]

## 📝 论文概述

[200字纯中文概述]

Resources

scripts/

  • process_pdf.py: 完整的PDF处理脚本(v2.0)

    • 文字提取、语言判断、翻译、概述生成
    • 实时进度显示(当前段/总段数 | 百分比)
    • 断点续传(中断后可从断点继续,自动跳过已翻译段落)
    • 自动文件组织和清理
    • 可独立运行或被其他脚本调用
  • generate_index.py: 索引生成脚本

    • 生成已完成论文的索引文件
    • 包含标题、语言、概述、处理时间
    • 支持关键词搜索

references/

  • workflow.md: full workflow description

    • Key steps
    • Technical dependencies
    • Performance characteristics
  • directory-structure.md: directory layout and usage notes

    • Complete directory tree
    • Purpose of each folder
    • File lifecycle

Troubleshooting

Ollama service not running

Error: Connection refused

Fix:

# Start the Ollama service
ollama serve

Ollama model not installed

Error: model 'qwen2.5:7b' not found

Fix:

# Pull the model
ollama pull qwen2.5:7b

# List installed models
ollama list

Wrong PDF path

Error: PDF文件不存在 (PDF file does not exist)

Check:

  • Confirm the PDF path is correct
  • Confirm the output directory exists
  • Confirm you have read/write permissions

Poor translation quality

Possible causes:

  • Segments too large (over 4,000 characters)
  • Temperature set too high (currently 0.3)

Adjustments:

  • Lower the max_length parameter of split_text()
  • Lower the temperature option in translate_segment()

Notes

  • 成本优势: 使用本地Ollama,完全避免线上API费用
  • 质量平衡: 分段翻译在质量和速度之间取得平衡
  • 自动化: 文件自动组织和清理,无需手动管理
  • v2.0改进:
    • 进度显示: 实时显示翻译进度(当前段/总段数 | 百分比 | 字符数)
    • 断点续传: 中断后可从断点继续,自动跳过已翻译段落,节省时间
    • 进度文件: 保存到处理中/文件名_progress.json,完成后自动删除
    • 串行翻译: 稳定可靠,适合各类机器性能
  • 适用范围: 适用于学术论文、研究报告、技术文档等PDF文件
