My Pdf Extract Skill
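v1.0.0
Intelligently extracts product codes, names, batches, and quantities from text-based PDFs, handles product names that span multiple lines, and writes a structured Excel file.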
Security Scan
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The stated purpose (extract structured label data from text PDFs and write Excel) matches the declared dependencies (pdfplumber, pandas, openpyxl). However, the package does not include the referenced executable/script (scripts/extract_exact.py), so the archive is incomplete or mispackaged.
Instruction Scope
SKILL.md tells the user to activate a virtualenv at ../venv and run python scripts/extract_exact.py and to edit variables inside that script. Those instructions reference files and an environment outside the skill bundle and assume a script that is not present. Activating ../venv may modify or reuse the user's existing environment—this is risky if the missing script is retrieved from elsewhere.
Install Mechanism
No install spec is provided; the README instructs to pip install common libraries (pdfplumber, pandas, openpyxl), which is proportionate to the described task. Installing from PyPI is normal but you should still vet package versions and perform installs inside an isolated venv.
Credentials
The skill does not request environment variables, credentials, or config paths. The requested resources (a local PDF file and writing an output XLSX) are proportional to the stated purpose.
Persistence & Privilege
The skill does not request always:true or other elevated persistence. It is user-invocable and allows autonomous invocation by default (platform default); nothing here grants it unusual system-wide privileges.
What to consider before installing
Do not run the provided commands or install packages until you verify the missing script. Steps to take:
1) Request the maintainer or publisher to provide the missing scripts (scripts/extract_exact.py). Do not fetch unknown code from other servers without review.
2) If you receive the script, inspect its source for any network calls, reading of unrelated files, or use of environment variables before running. Look for requests to send data to external endpoints or to read credentials.
3) Run installations and the script inside a brand-new isolated virtual environment you create (python -m venv .venv; source .venv/bin/activate) rather than activating ../venv, which could reuse/modify your existing environment.
4) Prefer running the script on a sample PDF in a disposable environment first. Verify the output path and file writes.
5) If you cannot obtain the script source, treat this package as incomplete/untrusted and do not execute arbitrary pip installs or commands it suggests. Providing the missing script and its content would change this assessment to benign if the script is small, self-contained, and contains no unexpected network or credential access.
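As a first pass for step 2, a static scan of the script's imports can be sketched as below. This is only an aid to manual review, not a replacement for it: the module list in `SUSPECT_MODULES` is an illustrative assumption, and dynamic imports or obfuscated code would not be caught.

```python
import ast
from typing import Set

# Modules whose presence suggests network access or process spawning.
# This list is an illustrative assumption, not an exhaustive policy.
SUSPECT_MODULES = {"socket", "urllib", "http", "ftplib", "smtplib",
                   "requests", "httpx", "subprocess"}


def find_suspect_imports(source: str) -> Set[str]:
    """Return the top-level imported module names that are in SUSPECT_MODULES."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            if top in SUSPECT_MODULES:
                found.add(top)
    return found


def uses_environment(source: str) -> bool:
    """Return True if the source accesses an attribute named environ or getenv.

    This catches os.environ and os.getenv, but not `from os import environ`.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in {"environ", "getenv"}:
            return True
    return False
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the text to both functions gives a quick list of things to look at more closely before running anything.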
latest
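PDF Data Extraction Skill
Description
Intelligently extracts product label data (product code, product name, product batch, product quantity) from PDF files and writes it to an Excel file.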
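Use cases
- Extracting structured data from PDF files
- Handling product names whose text spans multiple lines
- Saving the extracted data in Excel format
- Scenarios that require exact matching of product codes and names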
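Core features
- Smart extraction: automatically identifies data blocks in the PDF
- Cross-line handling: correctly handles product names that span multiple lines
- Exact matching: matches names exactly against a predefined name list
- Data validation: verifies the accuracy of the extracted results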
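The cross-line handling and exact matching described above could work roughly as follows. This is a minimal sketch and not the skill's actual code, which is not included in the bundle: the function name `match_name` and the product names in the usage note are hypothetical. It joins consecutive lines and keeps the longest joined text that equals an entry in the predefined name list.

```python
from typing import List, Optional, Tuple


def match_name(lines: List[str], start: int,
               known_names: List[str]) -> Optional[Tuple[str, int]]:
    """Match the longest known name that begins at lines[start].

    Consecutive lines are joined with single spaces and compared with each
    known name after whitespace normalisation. Returns
    (matched_name, number_of_lines_consumed), or None if nothing matches.
    """
    normalised = {" ".join(name.split()): name for name in known_names}
    best = None
    joined = ""
    for count, line in enumerate(lines[start:], start=1):
        joined = " ".join((joined + " " + line).split())
        if joined in normalised:
            best = (normalised[joined], count)
        # Stop once no known name can still begin with the joined text.
        if not any(key.startswith(joined) for key in normalised):
            break
    return best
```

For example, with the hypothetical list `["ACME WIDGET DELUXE 500ML", "ACME WIDGET"]`, the lines `["ACME WIDGET", "DELUXE 500ML", "LOTE 123"]` match the longer name and consume two lines. Names broken in the middle of a word are not handled by this sketch.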
Usage
Basic usage
# Activate the virtual environment
source ../venv/bin/activate
# Run the extraction script
python extract_exact.py
Script description
extract_exact.py: main extraction script
- Input: Lisa-3.pdf
- Output: Lisa-3_精确提取.xlsx
File structure
my-pdf-extract-skill/
├── SKILL.md # This file
├── references/
│ └── 完整标签数据.png # Reference image
├── scripts/
│ └── extract_exact.py # Extraction script
└── README.md # Usage instructions
Dependencies
- Python 3.8+
- pdfplumber
- pandas
- openpyxl
Installing dependencies
pip install pdfplumber pandas openpyxl
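To confirm that the three packages are importable in the currently active environment, a small check along these lines may help. It is a sketch using only the standard library, and the helper name `missing_packages` is made up for illustration.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["pdfplumber", "pandas", "openpyxl"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Running it inside the virtual environment you intend to use shows whether the install landed there rather than in some other interpreter.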
Configuration
Edit the following variables in the script:
pdf_path = "./Lisa-3.pdf" # Path to the PDF file
output_path = "./Lisa-3_精确提取.xlsx" # Path to the output file
Example
import pandas as pd
# Extract the data
labels = extract_exact_data(pdf_path)
# Save to Excel
df = pd.DataFrame(labels)
df.to_excel(output_path, index=False)
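Because the bundle does not ship `extract_exact.py`, the internals of `extract_exact_data` are unknown. The sketch below shows one way such an extraction could be structured, under loudly stated assumptions: the label layout (`CODIGO`, `LOTE`, and `CANTIDAD` lines), the dictionary keys, and the function names `parse_labels` and `read_pdf_text` are all hypothetical. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls.

```python
import re
from typing import Dict, List

# Hypothetical layout: each label starts with a line such as "CODIGO: A123"
# and carries "LOTE: ..." and "CANTIDAD: ..." lines. The real PDF may differ.
CODE_RE = re.compile(r"^CODIGO\s*:?\s*(\S+)\s*$")
BATCH_RE = re.compile(r"^LOTE\s*:?\s*(\S+)\s*$")
QTY_RE = re.compile(r"^CANTIDAD\s*:?\s*(\d+)\s*$")


def parse_labels(text: str) -> List[Dict[str, object]]:
    """Split extracted text into one dictionary per label."""
    labels = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = CODE_RE.match(line)
        if match:
            current = {"code": match.group(1), "name": "",
                       "batch": None, "quantity": None}
            labels.append(current)
            continue
        if current is None:
            continue  # text that appears before the first label
        match = BATCH_RE.match(line)
        if match:
            current["batch"] = match.group(1)
            continue
        match = QTY_RE.match(line)
        if match:
            current["quantity"] = int(match.group(1))
            continue
        # Any other line is treated as part of a possibly multi-line name.
        current["name"] = (current["name"] + " " + line).strip()
    return labels


def read_pdf_text(pdf_path: str) -> str:
    """Concatenate the extracted text of every page in the PDF."""
    import pdfplumber  # imported lazily so parse_labels works without it
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

With these helpers, `pd.DataFrame(parse_labels(read_pdf_text(pdf_path)))` would produce a frame that can be passed to `to_excel` as in the example.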
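Notes
- The PDF must contain extractable text (not a scanned document)
- The product name list must be adjusted to match the actual data
- Names that span multiple lines must be merged manually
- Test on a small batch of data first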
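The first note can be checked before any extraction is attempted. The sketch below assumes a hypothetical helper, `looks_text_based`, that receives the per-page text (for instance from pdfplumber's `page.extract_text()`) and reports whether enough characters came back. The 20-character threshold is an arbitrary assumption.

```python
from typing import Iterable, Optional


def looks_text_based(page_texts: Iterable[Optional[str]],
                     min_chars: int = 20) -> bool:
    """Return True if the pages hold at least min_chars non-whitespace characters.

    A scanned PDF without a text layer typically yields empty or missing
    text for every page, so the total stays below the threshold.
    """
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total >= min_chars


# Possible use with pdfplumber (not executed here):
#   with pdfplumber.open(pdf_path) as pdf:
#       ok = looks_text_based(page.extract_text() for page in pdf.pages)
```

If the check fails, the file probably needs OCR first, which is outside what this skill describes.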
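Troubleshooting
- Problem: the extracted product quantity is incorrect. Solution: check the format of the CODIGO lines in the PDF.
- Problem: product names are incomplete. Solution: adjust the name-splitting logic.
- Problem: the Excel file cannot be opened. Solution: check the openpyxl installation and file permissions.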
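When tracking down the first two problems, it can help to list which extracted records look wrong before opening the Excel file. The following is a minimal sketch. The record keys (`code`, `name`, `batch`, `quantity`) are hypothetical and would need to match whatever the real script produces.

```python
from typing import Dict, List

# Hypothetical keys; adjust to the columns the real script emits.
REQUIRED_FIELDS = ("code", "name", "batch", "quantity")


def find_problems(labels: List[Dict[str, object]]) -> List[str]:
    """Return a human-readable message for each suspicious record."""
    problems = []
    for index, label in enumerate(labels):
        for field in REQUIRED_FIELDS:
            value = label.get(field)
            if value is None or value == "":
                problems.append(f"record {index}: missing {field}")
        quantity = label.get("quantity")
        if isinstance(quantity, int) and quantity <= 0:
            problems.append(f"record {index}: non-positive quantity {quantity}")
    return problems
```

An empty result does not prove the extraction is correct. It only rules out the most obvious gaps, so a spot check against the source PDF is still worthwhile.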
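Extending
To adapt the skill to other PDF formats, you can:
- Modify the regular expressions in the extract_exact_data function
- Update the name list in the get_exact_names function
- Adjust the data-block identification logic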
Author
[Your name]
Version
v1.0.0
