Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

WPS PDF Processing

v1.0.0

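Use this skill whenever the user needs to do anything with a PDF file. This includes reading or extracting text/tables from a PDF, merging multiple PDFs, splitting PDFs, rotating pages, adding watermarks, creating new PDFs, filling in PDF forms, encrypting/decrypting PDFs, extracting images, and running OCR on scanned PDFs to make them searchable. Whenever the user mentions a .pdf file or wants to generate a PDF,...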


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for xixihaha123123123123/wps-pdf.

Install the skill "WPS PDF Processing" (xixihaha123123123123/wps-pdf) from ClawHub.
Skill page: https://clawhub.ai/xixihaha123123123123/wps-pdf
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

openclaw skills install wps-pdf

ClawHub CLI

npx clawhub@latest install wps-pdf

Security Scan

VirusTotal: Suspicious
OpenClaw: Suspicious (high confidence)

[!] Purpose & Capability
The skill's description promises local PDF processing (text/table extraction, OCR, merging, etc.), which can be implemented with local libraries. However, the shipped script posts the user's PDF to https://api.wps.cn (endpoint /v7/longtask/exporter/export_file_content) and requires a WPS session id (TMP_LX_UUID / wps_sid / WPS_SID). Requiring a cloud service credential is not documented in the SKILL metadata and is not obviously necessary for all the advertised operations (many could be done locally). This is an incoherence between claimed purpose and actual behavior.
[!] Instruction Scope
SKILL.md instructs agents to call parse('<PDF路径>', '<输出目录>') via the included pdf_to_md.parse, but does not disclose that parse will upload the PDF to an external WPS API and relies on an environment credential. The runtime instructions do not warn about network transmission or the need to provide a WPS sid. The script also rewrites image links to absolute local paths after downloading images from remote URLs returned by the API. The upload of potentially sensitive PDFs is within the script's runtime behavior but is not stated in the SKILL.md.
Install Mechanism
There is no installer that downloads remote code; the skill is instruction-first and includes a local script. No external install URLs or archive extraction are present. Risk mainly comes from the script's network behavior, not from an install mechanism.
[!] Credentials
Registry metadata lists no required env vars, but the script requires a session credential via TMP_LX_UUID or wps_sid or WPS_SID (and optionally WPS_API_BASE to change the API host). These are sensitive values (session cookies / tokens) and are not declared in the skill manifest or SKILL.md. Requesting such credentials is disproportionate unless the author explicitly states the skill uses WPS cloud OCR and needs an account/session.
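
To see whether any of the session variables named above are already present in an environment before the skill runs, a check along these lines may help. It is a minimal sketch and not part of the skill: the helper name `wps_credentials_present` is hypothetical, and the variable names are taken from this scan report.

```python
import os

# Variable names as listed in the scan report; the skill manifest does not declare them.
WPS_CREDENTIAL_VARS = ("TMP_LX_UUID", "wps_sid", "WPS_SID")


def wps_credentials_present(environ=None):
    """Return the names of WPS session variables that are set and non-empty."""
    env = os.environ if environ is None else environ
    return [name for name in WPS_CREDENTIAL_VARS if env.get(name)]


if __name__ == "__main__":
    found = wps_credentials_present()
    if found:
        print("WPS session variables set:", ", ".join(found))
    else:
        print("No WPS session variables set.")
```

An empty result means the script has no session credential to send from this environment.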
Persistence & Privilege
The skill does not request always:true and does not modify other skills or system-wide settings. Autonomous invocation is enabled by default but is not itself a new concern in this skill's manifest.
What to consider before installing
Before installing or using this skill, ask the author to clarify and document whether PDF files are uploaded to the WPS cloud and why a WPS session id is required. Do not set TMP_LX_UUID / wps_sid / WPS_SID in your environment unless you trust the service and the skill's owner. If you need offline processing for sensitive documents, request or prefer a version that uses only local tools (pdfplumber, pypdf, tesseract/pypdfium2) and does not call external APIs. Test with non-sensitive PDFs first, and consider scanning the script locally to verify network calls. If you can't get a clear answer or a local-only alternative, treat this skill as risky for confidential documents.
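
One rough way to do the local scan suggested above is to search the script's source text for URL literals and environment variable lookups. The sketch below is an assumption about how such a check could look, not a tool shipped with ClawHub; `scan_source` is a hypothetical helper, and a text search like this will miss URLs or variable names that are built dynamically.

```python
import re

# Matches http(s) URL literals up to the next quote, bracket, or whitespace.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")
# Matches os.getenv("X"), os.environ.get("X"), and os.environ["X"].
ENV_RE = re.compile(
    r"""(?:os\.getenv|os\.environ\.get)\(\s*['"]([^'"]+)['"]|os\.environ\[\s*['"]([^'"]+)['"]\s*\]"""
)


def scan_source(source):
    """Return URL literals and environment variable names found in Python source text."""
    urls = sorted(set(URL_RE.findall(source)))
    env_vars = sorted({a or b for a, b in ENV_RE.findall(source)})
    return {"urls": urls, "env_vars": env_vars}


if __name__ == "__main__":
    sample = 'BASE = os.getenv("WPS_API_BASE", "https://api.wps.cn")\n'
    print(scan_source(sample))
```

Pass it the text of the shipped script (read with `open(path, encoding="utf-8").read()`) and review every host and variable it reports.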

Like a lobster shell, security has layers — review code before you run it.

Tags: document, latest, ocr, pdf
62 downloads
0 stars
1 version
Updated 1w ago
v1.0.0
MIT-0

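PDF Processing Guide

Quick tool reference

Task | Recommended library | Notes
Merge / split / rotate / watermark / encrypt | pypdf | Lightweight, pure Python
Extract text / tables (structured) | pdfplumber | High accuracy; supports coordinates
Create formatted PDFs | reportlab | Supports paragraphs, tables, and styles
Scanned-PDF OCR / structured conversion to Markdown | pdf-to-md | Returns images + Markdown; fallback when text extraction fails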

pypdf: basic operations

from pypdf import PdfReader, PdfWriter

# Extract text (image-only pages yield an empty string)
reader = PdfReader("doc.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Merge
writer = PdfWriter()
for path in ["a.pdf", "b.pdf"]:
    for page in PdfReader(path).pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as f:
    writer.write(f)

# Split (save each page as its own file)
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        w.write(f)

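# Rotate / watermark / crop / encrypt
page = writer.pages[0]  # edit a page owned by the writer so the change is saved
page.rotate(90)
page.merge_page(PdfReader("watermark.pdf").pages[0])
page.mediabox.lower_left = (50, 50)
page.mediabox.upper_right = (550, 750)
writer.encrypt("user_pass", "owner_pass")
with open("processed.pdf", "wb") as f:
    writer.write(f)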
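
The split example above writes one file per page. When only some pages are wanted, a small helper can turn a range string into zero-based indices for `reader.pages`. This is a sketch under assumptions: `parse_page_ranges` is a hypothetical helper, not a pypdf function, and the "1-3,5" syntax is just one plausible convention.

```python
def parse_page_ranges(spec, page_count):
    """Convert a string such as "1-3,5" into sorted zero-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Page numbers in the spec are 1-based and must lie inside the document.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range {part!r} for a {page_count}-page document")
        indices.update(range(start - 1, end))
    return sorted(indices)
```

The resulting indices can be fed to `writer.add_page(reader.pages[i])` in a loop to build a PDF containing only the selected pages.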

pdfplumber: text and table extraction

import pandas as pd
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    page = pdf.pages[0]
    # Text
    text = page.extract_text()
    # Tables → DataFrames (first row used as the header)
    dfs = [pd.DataFrame(t[1:], columns=t[0]) for t in page.extract_tables() if t]
    # Extract by region: bbox is (left, top, right, bottom)
    region_text = page.within_bbox((100, 100, 400, 200)).extract_text()
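
Tables come back from `extract_tables()` as plain lists of rows, and cells the extractor could not fill may be `None`. A light clean-up before building a DataFrame can be sketched with the standard library alone; `table_to_records` is a hypothetical helper, and treating the first row as the header is an assumption that will not hold for every document.

```python
def table_to_records(table):
    """Turn a list-of-rows table (first row = header) into a list of dicts."""
    if not table or len(table) < 2:
        return []
    # Blank or missing header cells get a positional placeholder name.
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        # Pad short rows so every record has the same keys.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records
```

The records can be passed straight to `pd.DataFrame(records)` if pandas is available.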

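Notes

  • Chinese fonts: reportlab's default fonts contain no Chinese glyphs. When generating a PDF that contains Chinese, first register a system Chinese font (such as Noto Sans CJK, Microsoft YaHei, or WenQuanYi) via pdfmetrics.registerFont(TTFont(...)) and specify that font in the style; otherwise the Chinese text will render as garbled characters.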
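
The registration step described in the note can be sketched as follows. The font paths are hypothetical and depend entirely on the machine; `find_font` and `register_cjk_font` are illustrative helper names, while `pdfmetrics.registerFont` and `TTFont` are the reportlab calls the note refers to.

```python
import os

# Hypothetical candidate locations; adjust to the fonts installed on your system.
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/msyh.ttc",                       # Microsoft YaHei (assumed path)
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # WenQuanYi (assumed path)
]


def find_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def register_cjk_font(font_name="CJK", candidates=CANDIDATE_FONTS):
    """Register the first available font with reportlab and return its name."""
    # Imported lazily so find_font() works even where reportlab is not installed.
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    path = find_font(candidates)
    if path is None:
        raise FileNotFoundError("no candidate CJK font found")
    pdfmetrics.registerFont(TTFont(font_name, path))
    return font_name
```

The returned name is then used as the `fontName` of a `ParagraphStyle`, so that paragraphs containing Chinese are drawn with the registered font.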

reportlab: creating PDFs

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("out.pdf", pagesize=letter)
styles = getSampleStyleSheet()
doc.build([
    Paragraph("Title", styles["Title"]),
    Spacer(1, 12),
    Paragraph("Body text", styles["Normal"]),
])

Subscripts/superscripts: do not use Unicode characters (₀¹²); use XML tags instead: H<sub>2</sub>Ox<super>2</super>
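
If the input text already contains Unicode subscript or superscript digits, they can be rewritten into the tag form before the string is handed to `Paragraph`. The helper below is a sketch; `to_reportlab_markup` is a hypothetical name and it handles digits only, not letters or signs.

```python
import re

SUB_DIGITS = "₀₁₂₃₄₅₆₇₈₉"
SUPER_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"


def _wrap(tag, digits):
    """Build a re.sub callback that maps a digit run to <tag>ASCII digits</tag>."""
    table = str.maketrans(digits, "0123456789")
    return lambda match: f"<{tag}>{match.group(0).translate(table)}</{tag}>"


def to_reportlab_markup(text):
    """Replace runs of Unicode subscript/superscript digits with <sub>/<super> tags."""
    text = re.sub(f"[{SUB_DIGITS}]+", _wrap("sub", SUB_DIGITS), text)
    text = re.sub(f"[{SUPER_DIGITS}]+", _wrap("super", SUPER_DIGITS), text)
    return text
```

For example, `Paragraph(to_reportlab_markup("H₂O"), styles["Normal"])` receives the tagged string rather than the raw Unicode character.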


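OCR and common problems

OCR: PDF → Markdown (with images)

If the text extracted with pypdf or pdfplumber is empty, very short, or full of garbled characters, switch to this OCR approach. Scanned PDFs, photographed PDFs, and image-only PDFs yield no content through ordinary text extraction, so OCR is the only reliable option.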

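import os
import sys

# skill_path must be set in the environment; os.environ raises KeyError if it is missing
sys.path.insert(0, os.path.join(os.environ['skill_path'], 'pdf', 'scripts'))
from pdf_to_md import parse

# Output goes to <output directory>/content.md; images go to <output directory>/images/
# Per the security scan above, parse() uploads the PDF to the WPS API and needs a WPS session id
parse('<PDF path>', '<output directory>')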

Suitable for: scanned PDFs, mixed text-and-image layouts, documents whose images must be preserved, and mostly Chinese content.
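
The "empty, very short, or garbled" rule for falling back to OCR can be approximated with a small heuristic. This is only a sketch: `needs_ocr` is a hypothetical helper, and both thresholds are assumed values that would need tuning on real documents.

```python
def needs_ocr(text, page_count, min_chars_per_page=20, max_garbled_ratio=0.3):
    """Heuristic: True when extracted text is empty, very short, or mostly garbled."""
    stripped = "".join(text.split())  # drop all whitespace before counting
    if len(stripped) < min_chars_per_page * max(page_count, 1):
        return True
    # Count replacement characters and non-printable characters as garbled.
    garbled = sum(1 for ch in stripped if ch == "\ufffd" or not ch.isprintable())
    return garbled / len(stripped) > max_garbled_ratio
```

Text that is garbled because of a bad font mapping can consist of valid but meaningless characters, which a check like this will not catch; a quick manual look at the extracted text remains useful.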

Other common problems

# Handle an encrypted PDF
from pypdf import PdfReader
reader = PdfReader("enc.pdf")
if reader.is_encrypted:
    reader.decrypt("password")

# Extract embedded images (img_obj.name already carries a file extension)
for page_no, page in enumerate(reader.pages, start=1):
    for img_obj in page.images:
        with open(f"page{page_no}_{img_obj.name}", "wb") as f:
            f.write(img_obj.data)

# Read a damaged PDF in lenient mode
reader = PdfReader("damaged.pdf", strict=False)

Advanced PDF processing reference


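pypdfium2: rendering to images

Built on Chromium's PDFium; needs no external dependencies such as poppler, and suits rendering and OCR workflows.

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("doc.pdf")
for i in range(len(pdf)):
    bitmap = pdf[i].render(scale=2.0)   # scale=2 is 144 DPI (1 PDF unit = 1/72 inch)
    bitmap.to_pil().save(f"page_{i+1}.png")

# Extract text (goes through a text page object)
for i in range(len(pdf)):
    textpage = pdf[i].get_textpage()
    print(f"Page {i+1}: {len(textpage.get_text_range())} characters")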
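
pypdfium2's `scale` is the number of pixels per PDF canvas unit, and a canvas unit is normally 1/72 inch, so a target DPI converts to a scale factor by dividing by 72. The helpers below are a small sketch of that arithmetic; the function names are illustrative, and the pixel size is an approximation because the renderer does its own rounding.

```python
PDF_UNITS_PER_INCH = 72  # one PDF canvas unit is normally 1/72 inch


def dpi_to_scale(dpi):
    """Scale factor to pass to page.render() for a target DPI."""
    return dpi / PDF_UNITS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page rendered at the given DPI."""
    scale = dpi_to_scale(dpi)
    return round(width_pt * scale), round(height_pt * scale)
```

For OCR, rendering at roughly 300 DPI is a common choice, which corresponds to `page.render(scale=dpi_to_scale(300))`.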

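pdfplumber: precise coordinates and complex tables

import pdfplumber

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "intersection_tolerance": 15,
}

with pdfplumber.open("doc.pdf") as pdf:
    page = pdf.pages[0]

    # Per-character extraction with coordinates
    for char in page.chars[:10]:
        print(f"'{char['text']}' x:{char['x0']:.1f} y:{char['y0']:.1f}")

    # Custom strategy for complex layouts
    tables = page.extract_tables(table_settings)

    # Visual debugging: overlay the detected table lines and cells on the page image
    page.to_image(resolution=150).debug_tablefinder(table_settings).save("debug.png")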

reportlab: complex tables and multi-page reports

from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()

data = [
    ["Product", "Q1", "Q2", "Q3", "Q4"],
    ["Widget", "120", "135", "142", "158"],
    ["Gadget", "85",  "92",  "98",  "105"],
]
table = Table(data, colWidths=[120, 60, 60, 60, 60])
table.setStyle(TableStyle([
    ("BACKGROUND",     (0, 0), (-1,  0), colors.HexColor("#4472C4")),
    ("TEXTCOLOR",      (0, 0), (-1,  0), colors.white),
    ("FONTNAME",       (0, 0), (-1,  0), "Helvetica-Bold"),
    ("ALIGN",          (0, 0), (-1, -1), "CENTER"),
    ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#EEF2FF")]),
    ("GRID",           (0, 0), (-1, -1), 0.5, colors.grey),
    ("BOX",            (0, 0), (-1, -1), 1,   colors.black),
]))

doc.build([
    Paragraph("Sales Report", styles["Title"]),
    table,
    PageBreak(),
    Paragraph("Second page content", styles["Normal"]),
])

Batch processing

import glob
import logging
from pypdf import PdfReader, PdfWriter

logger = logging.getLogger(__name__)

# Merge every PDF in a directory
writer = PdfWriter()
for pdf_file in sorted(glob.glob("input/*.pdf")):
    try:
        # Load all pages first so a broken file contributes no partial pages
        pages = list(PdfReader(pdf_file).pages)
    except Exception as e:
        logger.error("Skipping %s: %s", pdf_file, e)
        continue
    for page in pages:
        writer.add_page(page)
with open("merged_all.pdf", "wb") as f:
    writer.write(f)
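
The merge order above follows plain string sorting, which places `10.pdf` before `2.pdf`. If the files are numbered, a natural-sort key keeps them in numeric order. `natural_key` is a hypothetical helper shown as a sketch.

```python
import re


def natural_key(path):
    """Sort key that compares digit runs numerically, so 'p2.pdf' precedes 'p10.pdf'."""
    # re.split with a capturing group alternates text and digit tokens.
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", path)]
```

It drops into the loop as `sorted(glob.glob("input/*.pdf"), key=natural_key)`.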
