Install
openclaw skills install glmv-pdf-to-ppt
Convert a PDF (research paper, report, or any document) into a polished multi-slide HTML presentation with a structured outline JSON and summary markdown. Trigger this skill when the user mentions making slides or a PPT from a PDF, in Chinese or English.
Pages are converted to images at DPI 120, read sequentially to understand the content, then a structured outline.json is saved, images are cropped locally (no cloud upload), slides are rendered one by one, and finally a summary.md is generated.
Scripts are in: {SKILL_DIR}/scripts/
Python packages (install once):
pip install pymupdf pillow
System tools: curl (pre-installed on macOS/Linux).
Trigger when the user asks to make slides or a presentation from a PDF — phrases like: "make a PPT from a PDF", "convert PDF to slides", "create a presentation from this paper", "根据pdf做ppt", "根据论文做幻灯片", "做PPT", "做幻灯片", "生成演示文稿", "把这个pdf转成ppt", or any similar intent in Chinese or English.
All output goes under {WORKSPACE}/ppt/<pdf_stem>_<timestamp>/:
ppt/
└── <pdf_stem>_<timestamp>/
    ├── outline.json          ← structured slide plan (SlidesPlan schema)
    ├── crops/                ← locally-saved cropped images
    │   ├── slide3_method_crop.png
    │   └── slide5_results_crop.png
    ├── slide_01.html
    ├── slide_02.html
    ├── ...
    └── summary.md            ← final summary document
- <pdf_stem> = PDF filename without extension
- <timestamp> = format YYYYMMDD_HHMMSS (e.g. 20240119_143022)
- Cropped images are saved in the crops/ subfolder
- Slides reference them by relative path: crops/<name>.png
$ARGUMENTS is the path to the PDF file (local) or an HTTP/HTTPS URL.
Compute the output path:
import os, datetime
pdf_stem = os.path.splitext(os.path.basename(pdf_path))[0]
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = os.path.join(workspace, "ppt", f"{pdf_stem}_{timestamp}")
Create it immediately:
mkdir -p "<out_dir>/crops"
Record out_dir — use it for all subsequent phases.
If the input is a URL, download it first:
pdf_stem=$(basename "$ARGUMENTS" .pdf)
curl -L -o "/tmp/${pdf_stem}.pdf" "$ARGUMENTS"
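Note that `basename "$ARGUMENTS" .pdf` keeps any query string in the name (for example `paper.pdf?download=1`). A minimal Python sketch of a more tolerant stem derivation follows; the helper names `is_url` and `pdf_stem_from_arg` and the `"document"` fallback are hypothetical and not part of the skill's scripts.

```python
import os
from urllib.parse import urlparse, unquote


def is_url(arg: str) -> bool:
    """True when the argument is an HTTP/HTTPS URL rather than a local path."""
    return urlparse(arg).scheme in ("http", "https")


def pdf_stem_from_arg(arg: str) -> str:
    """Return the file stem for a local path or an HTTP/HTTPS URL."""
    if is_url(arg):
        # Use only the URL path so query strings and fragments are ignored.
        name = os.path.basename(unquote(urlparse(arg).path))
    else:
        name = os.path.basename(arg)
    stem, _ext = os.path.splitext(name)
    # Fallback name is an assumption for URLs that end in "/".
    return stem or "document"
```

The resulting stem can be used both for the download target in /tmp and for the `<pdf_stem>` part of the output directory.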
Then convert (pass either the downloaded path or the original local path):
python {SKILL_DIR}/scripts/pdf_to_images.py "<pdf_path>" --dpi 120
Outputs JSON to stdout:
[{"page": 1, "path": "/abs/path/page_001.png"}, ...]
Parse and store the full page → path map. These local paths are used for viewing pages and as --path input to crop.py.
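One way to build the page → path map from that stdout, as a small sketch. The function name `build_page_map` is hypothetical; only the `page` and `path` keys come from the output format shown above.

```python
import json
from typing import Dict


def build_page_map(stdout_text: str) -> Dict[int, str]:
    """Map page number -> absolute image path from the pdf_to_images.py output."""
    entries = json.loads(stdout_text)
    return {int(entry["page"]): entry["path"] for entry in entries}


# Sample input in the documented format (paths are placeholders).
sample = (
    '[{"page": 1, "path": "/abs/path/page_001.png"},'
    ' {"page": 2, "path": "/abs/path/page_002.png"}]'
)
page_map = build_page_map(sample)
```

Keying by integer page number makes it easy to look up the source image later when filling the `url` field of each crop.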
View all page images sequentially before planning anything. Your goal here is pure understanding — absorb the full structure, content, figures, and arguments of the document.
While reading, note:
Do NOT plan or write slides yet — just read and understand all pages first.
After reading all pages, plan 8–15 slides (adapt freely for non-academic documents).
| Slide | Typical purpose |
|---|---|
| 1 | Title, authors, affiliation, venue/year |
| 2 | Motivation / Problem statement |
| 3 | Related Work (brief) |
| 4–N-2 | Method / Core contributions (one concept per slide) |
| N-1 | Results & Experiments |
| N | Conclusion & Future Work |
For each slide that needs a visual, identify:
Save the outline as <out_dir>/outline.json using exactly this schema:
{
  "presentation_title": "Paper Title Here",
  "lang": "Chinese",
  "total_slides": 10,
  "slides_plan": [
    {
      "slide_index": 1,
      "title": "Slide Title",
      "main_content": "Key points and text content for this slide",
      "template_id": null,
      "required_crops": [
        {
          "url": "<page_image_url_from_phase1>",
          "visual_description": "Figure 3: architecture diagram showing encoder-decoder",
          "usage_reason": "Illustrates the core model structure for slide 4"
        }
      ]
    }
  ]
}
Field notes:
lang: "Chinese" or "English" — match the PDF language
template_id: always null
required_crops: empty array [] if this slide needs no images
url in each crop: the local file path of the source page image (from Phase 1 path field) — this is what crop.py will open and crop from
visual_description: what the visual shows, including figure/table number if available
usage_reason: why this visual belongs on this particular slide
For images that need cropping, note the approximate region — exact crop boxes are determined in Phase 4
Write outline.json using the Write tool to <out_dir>/outline.json.
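Before moving on, it can help to sanity-check the outline against the field notes above. The sketch below is one possible check, not the skill's own validator: the name `validate_outline` is hypothetical, and two rules are assumptions inferred from the example (that `total_slides` equals the number of entries in `slides_plan`, and that `slide_index` counts up from 1).

```python
REQUIRED_SLIDE_KEYS = {"slide_index", "title", "main_content", "template_id", "required_crops"}
REQUIRED_CROP_KEYS = {"url", "visual_description", "usage_reason"}


def validate_outline(outline: dict) -> list:
    """Return a list of problems; an empty list means the outline looks valid."""
    problems = []
    for key in ("presentation_title", "lang", "total_slides", "slides_plan"):
        if key not in outline:
            problems.append(f"missing top-level field: {key}")
    if problems:
        return problems
    if outline["lang"] not in ("Chinese", "English"):
        problems.append("lang must be 'Chinese' or 'English'")
    slides = outline["slides_plan"]
    # Assumption: total_slides should equal the number of planned slides.
    if outline["total_slides"] != len(slides):
        problems.append("total_slides does not match len(slides_plan)")
    for pos, slide in enumerate(slides, start=1):
        missing = REQUIRED_SLIDE_KEYS - set(slide)
        if missing:
            problems.append(f"slide {pos}: missing {sorted(missing)}")
            continue
        # Assumption: slide_index is 1-based and sequential.
        if slide["slide_index"] != pos:
            problems.append(f"slide {pos}: slide_index is {slide['slide_index']}")
        if slide["template_id"] is not None:
            problems.append(f"slide {pos}: template_id must be null")
        for crop in slide["required_crops"]:
            missing_crop = REQUIRED_CROP_KEYS - set(crop)
            if missing_crop:
                problems.append(f"slide {pos}: crop missing {sorted(missing_crop)}")
    return problems
```

Load the file with `json.load` and call `validate_outline` on the result; any returned message points at the field to fix before cropping starts.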
IMPORTANT: You MUST delegate ALL cropping to a clean subagent using the Agent tool. By this phase your context is very long (all page images + outline), which degrades visual coordinate accuracy. A fresh subagent with only the target image produces much more precise coordinates.
IMPORTANT: You MUST use the provided {SKILL_DIR}/scripts/crop.py script for ALL image cropping. Do NOT write your own cropping code, do NOT use PIL/Pillow directly, do NOT use any other method.
Read outline.json. Collect all crops needed, then launch one subagent per source page (or one per crop if pages differ). The subagent uses grounding-style localization — it views the image, locates the target element, and outputs a precise bounding box in normalized 0–999 coordinates.
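Collecting the crops per source page could look like the following sketch. The function name `group_crops_by_page` is hypothetical; the field names are those of the outline schema.

```python
from collections import defaultdict


def group_crops_by_page(outline: dict) -> dict:
    """Group required crops by source page image path (the `url` field)."""
    by_page = defaultdict(list)
    for slide in outline["slides_plan"]:
        for crop in slide["required_crops"]:
            by_page[crop["url"]].append(
                {
                    "slide_index": slide["slide_index"],
                    "visual_description": crop["visual_description"],
                    "usage_reason": crop["usage_reason"],
                }
            )
    return dict(by_page)
```

Each key of the result then corresponds to one subagent, and its list supplies the "Crops needed" entries for that subagent's prompt.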
Use the Agent tool like this:
Agent tool call:
description: "Grounding crop page N"
prompt: |
You are a visual grounding and cropping assistant. Your task is to precisely
locate specified visual elements in a page image and crop them out.
## Grounding method
Use visual grounding to locate each target:
1. Read the source image using the Read tool to view it
2. Identify the target element described below
3. Determine its bounding box as normalized coordinates in the 0–999 range:
- 0 = left/top edge of the image
- 999 = right/bottom edge of the image
- These are thousandths, NOT pixels, NOT percentages (0–100)
- Format: [x1, y1, x2, y2] where (x1,y1) is top-left, (x2,y2) is bottom-right
- Example: [0, 0, 500, 500] = top-left quarter of the image
4. Be precise: tightly bound the target element with a small margin (~10–20 units)
around it. Do NOT crop too wide or too narrow.
## Source image
<page_image_path>
## Crops needed
For each crop below, first do grounding (locate the element), then crop:
1. Name: "slide<N>_<descriptive_name>"
Target: "<visual_description from outline.json>"
Context: "<usage_reason from outline.json>"
## Crop command
After determining the bounding box [X1, Y1, X2, Y2] for each target, run:
```bash
python <SKILL_DIR>/scripts/crop.py \
--path "<page_image_path>" \
--box X1 Y1 X2 Y2 \
--name "<crop_name>" \
--out-dir "<out_dir>/crops"
```
## Verification
After each crop, READ the output image to visually verify the correct region
was captured. If the crop missed the target or is too wide/narrow, adjust the
coordinates and re-run crop.py.
## Output
Report the final results as a list:
- crop_name: <name>, file: <output_filename>, box: [X1, Y1, X2, Y2]
Replace <page_image_path>, <SKILL_DIR>, <out_dir>, and crop details with actual values from your context.
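To make the 0–999 convention concrete, here is a sketch of how such a box could map to pixels. The internals of crop.py are not shown in this document, so the scaling rule (thousandths of the image size, consistent with `[0, 0, 500, 500]` being the top-left quarter) is an assumption, and `box_to_pixels` is a hypothetical name.

```python
def box_to_pixels(box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box (0-999) to pixel coordinates."""
    x1, y1, x2, y2 = box
    if not all(0 <= v <= 999 for v in box):
        raise ValueError("coordinates must be in the 0-999 range")
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box must satisfy x1 < x2 and y1 < y2")
    # Assumption: each value is a thousandth of the image width or height.
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )
```

For a 1020 × 1320 px page image (US Letter at DPI 120), `box_to_pixels([0, 0, 500, 500], 1020, 1320)` gives `(0, 0, 510, 660)`, the top-left quarter.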
The crop.py script outputs JSON: {"path": "/abs/path/slide3_method_crop.png"}
Collect results from all subagents and build the mapping: slide_index → [crop filename, ...] to reference in HTML. The filename will be <name>_crop.png.
Launch subagents for independent pages in parallel when possible. Wait for all to complete before proceeding.
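Once all subagents have reported, the slide_index → filenames mapping can be derived from the naming convention alone. This is a sketch that assumes every crop was named `slide<N>_<descriptive_name>` as in the prompt template; `crops_by_slide` is a hypothetical helper name.

```python
import re
from collections import defaultdict


def crops_by_slide(filenames):
    """Map slide_index -> crop filenames, based on the slide<N>_ name prefix."""
    mapping = defaultdict(list)
    for fname in sorted(filenames):
        match = re.match(r"slide(\d+)_.*_crop\.png$", fname)
        if match:
            mapping[int(match.group(1))].append(fname)
    return dict(mapping)
```

Passing `os.listdir("<out_dir>/crops")` to this function yields the list of images available to each slide; files that do not follow the convention are ignored.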
After cropping, get pixel dimensions:
python3 -c "
from PIL import Image; import os, json
d = '<out_dir>/crops'
sizes = {}
for f in sorted(os.listdir(d)):
    if f.endswith('.png'):
        w, h = Image.open(os.path.join(d, f)).size
        sizes[f] = {'width': w, 'height': h, 'aspect': round(w/h, 2)}
print(json.dumps(sizes, indent=2))
"
Use aspect ratios to pick each slide's layout:
| Aspect ratio | Layout recommendation |
|---|---|
| < 0.7 (tall/narrow) | text + image side-by-side — max-height: 600px on image |
| 0.7 – 1.3 (square-ish) | text + image — image takes ~50% width |
| 1.3 – 2.0 (wide) | image on top or bottom, text above/below |
| > 2.0 (very wide, e.g. tables) | full-image — spans full 1280px width, caption below |
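The table can be read as a simple threshold check, sketched below. The widest case is tested first so that ratios above 2.0 are not caught by the "wide" rule. The function name `pick_layout` and the returned label strings are illustrative, not identifiers used by the skill's scripts.

```python
def pick_layout(aspect: float) -> str:
    """Pick a layout label from an image's width/height ratio."""
    if aspect <= 0:
        raise ValueError("aspect ratio must be positive")
    if aspect > 2.0:
        return "full-image"  # spans the full 1280px width, caption below
    if aspect > 1.3:
        return "image top/bottom"  # text above or below the image
    if aspect >= 0.7:
        return "text + image (50% width)"
    return "text + image (side-by-side, max-height 600px)"
```

Applied to the `aspect` values printed by the size script, this gives one layout suggestion per crop; a slide with several crops may still need a grid layout chosen by hand.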
For each slide, write the HTML, save it to a temp file, then call generate_slide.py.
Step A: Write HTML to /tmp/slide_N.html
- <img src="..."> must use relative paths: crops/<name>_crop.png
- ← / → arrows also navigate
- <div> overlays covering each half, positioned absolute over the slide canvas
Step B: Save the slide:
python {SKILL_DIR}/scripts/generate_slide.py \
--html-file /tmp/slide_N.html \
--index N \
--total <total> \
--title "<presentation title>" \
--out-dir "<out_dir>/"
Repeat until all slides are saved.
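If the repetition is scripted, the command line for each slide can be assembled as in this sketch. Only the flags shown in the command above are used; the helper name `build_generate_cmd` is hypothetical, and the temp-file naming follows the /tmp/slide_N.html convention from Step A.

```python
def build_generate_cmd(skill_dir, out_dir, index, total, title):
    """Build the argv list for one generate_slide.py call."""
    return [
        "python",
        f"{skill_dir}/scripts/generate_slide.py",
        "--html-file", f"/tmp/slide_{index}.html",
        "--index", str(index),
        "--total", str(total),
        "--title", title,
        "--out-dir", f"{out_dir}/",
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` after the matching HTML file has been written, so a failing slide stops the loop instead of being silently skipped.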
Write <out_dir>/summary.md in the same language as the slides (lang from outline.json).
Include:
- A link to slide_01.html to open the first slide
Example structure:
# [Presentation Title]
> **来源 / Source:** [PDF filename] | **语言 / Language:** Chinese | **幻灯片数 / Slides:** 10
## 摘要
[2-3 sentence overview]
## 幻灯片概览
| # | 标题 | 主要内容 |
|---|------|---------|
| 1 | 标题页 | ... |
...
## 主要贡献
- ...
## 📂 打开演示文稿
[▶ 开始播放](slide_01.html)
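The slide overview table in that structure can be generated straight from outline.json, as in the sketch below. The function name `overview_table` is hypothetical, and the English column labels are an assumption (the example above only shows the Chinese ones).

```python
def overview_table(outline: dict) -> str:
    """Render the slide overview table of summary.md from the outline."""
    headers = {
        "Chinese": ("标题", "主要内容"),
        "English": ("Title", "Main content"),  # assumed English labels
    }
    title_h, content_h = headers[outline["lang"]]
    lines = [f"| # | {title_h} | {content_h} |", "|---|---|---|"]
    for slide in outline["slides_plan"]:
        # Escape pipes and flatten newlines so cell text cannot break the table.
        title = slide["title"].replace("|", "\\|")
        content = slide["main_content"].replace("|", "\\|").replace("\n", " ")
        lines.append(f"| {slide['slide_index']} | {title} | {content} |")
    return "\n".join(lines)
```

Because the headers are chosen from `lang`, the table stays in the same language as the slides.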
Each slide is a standalone HTML file — full <html>…</html> with embedded CSS only.
Canvas: fixed 1280 × 720 px, overflow: hidden — nothing scrolls.
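A minimal skeleton that satisfies these two constraints might look like the following sketch. The template, the `slide` class name, the colors, and the `render_slide` helper are all illustrative assumptions; what generate_slide.py adds on top of the HTML it receives is not described here.

```python
from html import escape

SLIDE_TEMPLATE = """<!DOCTYPE html>
<html lang="{lang_code}">
<head>
<meta charset="utf-8">
<title>{title}</title>
<style>
  html, body {{ margin: 0; background: #111; }}
  .slide {{ width: 1280px; height: 720px; overflow: hidden;
            position: relative; background: #fff; }}
</style>
</head>
<body>
<div class="slide">
{body}
</div>
</body>
</html>
"""


def render_slide(title, body_html, lang="English"):
    """Wrap slide content in a standalone 1280 x 720 HTML document."""
    lang_code = "zh" if lang == "Chinese" else "en"
    # The title is escaped; body_html is trusted markup written for the slide.
    return SLIDE_TEMPLATE.format(lang_code=lang_code, title=escape(title), body=body_html)
```

Keeping the CSS inline in a `<style>` block means each file can be opened directly from disk without any external assets besides the images in crops/.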
Consistent design across all slides:
Navigation on each slide:
- ← / → arrows also navigate
- ‹ / › hint at the edges that fades in on hover
Layout patterns:
- title-card: centered hero, large title, authors/venue below
- text-only: structured bullet points, max 5–6 items, generous whitespace
- text + image: image right or left, text opposite
- full-image: image fills canvas, minimal text overlay
- grid: 2×2 or 3-column figures with captions
Images:
- crops/<name>_crop.png
- style="object-fit: contain; max-width: 100%; max-height: 100%;"
Do NOT:
- <pdf_stem>_<timestamp>/outline.json saved with valid SlidesPlan schema
- crops/ (local only, no cloud upload)
- crops/<name>_crop.png
- summary.md written in the correct language, links to slide_01.html
Match the PDF language. Chinese PDF → Chinese slides and summary. English → English. No mixing.