Page Fetch

Data & APIs

Extract readable content from webpages with a stable, low-dependency workflow. Use when the user asks to open, inspect, summarize, translate, verify, or quote a web page, article, blog post, documentation page, or similar URL. Prefer this skill when cross-model reliability matters: fetch raw HTML first, inspect embedded data next, and escalate to browser rendering only when lightweight methods fail.

Install

openclaw skills install page-fetch

Page Fetch

Use this skill to extract webpage content in a reproducible way that works well across different models and avoids browser dependence unless necessary.

This skill is built for reliability first:

start with lightweight deterministic fetches
inspect embedded page data before escalating
use browser rendering only as a fallback
report the extraction method and any access limits clearly

What this skill is for

Use this skill when a user says things like:

"看一下这个网页的内容"
"Open this article and summarize it"
"Tell me what this page says"
"Translate this webpage"
"Check whether this page mentions X"
"Quote the main points from this documentation page"

This skill is best for:

news articles
blog posts
documentation pages
product pages
general public webpages

This skill is not magic. If a page is blocked by login, CAPTCHA, region restrictions, or aggressive anti-bot controls, report that clearly.

Design goal

The core goal is cross-model reliability.

Different LLMs often choose different ad-hoc ways to read webpages. This skill reduces that variance by giving them a standard path:

Route mp.weixin.qq.com links to the dedicated WeChat extractor first.
Try the lightweight deterministic extractor for general webpages.
Inspect embedded page data when available.
Use a real browser only when necessary.
Tell the user which path worked and what the limits were.

Workflow

Step 1: Use the unified runner by default

For routine webpage reads, run the wrapper first:

python3 scripts/page_fetch.py "https://example.com/article" --format json

What it does:

routes mp.weixin.qq.com to the dedicated WeChat extractor first
uses lightweight HTML extraction for general pages
escalates to browser rendering only when needed
does not persist files unless --save-json is explicitly passed
never defaults to writing transient JSON into the current working directory

Persistence rule:

default: no disk writes
only save when the caller explicitly passes --save-json
when saving is requested without --output, write to a non-workspace report path chosen by the caller or local runtime convention
do not write transient artifacts into the workspace root

Step 2: Direct script usage when debugging or forcing a method

Use direct scripts only when you need to debug or force a particular extraction path.

WeChat public articles

python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json

What it does:

uses a WeChat mobile-style request header
extracts article metadata from page meta tags and script variables
reads the article body from #js_content / .rich_media_content
reports explicit access limits when the page is replaced by verification or anti-bot flows
does not persist files; it only returns structured extraction output

What to look for in the output:

title
author
account_nickname
published_time
text
method
access_limited
access_limit_reason

General lightweight fetch

python3 scripts/fetch_page.py "https://example.com/article" --format json

What it does:

fetches raw HTML via requests
extracts metadata from HTML/meta tags
inspects JSON-LD
inspects embedded payloads such as __NEXT_DATA__
falls back to DOM paragraph extraction

What to look for in the output:

title
author
published_time
text
method
notes

Browser-render fallback

If the lightweight path returns thin, broken, or clearly incomplete content, run:

python3 scripts/render_page.py "https://example.com/article" --format json

What it does:

launches headless Chromium via Node Playwright
waits for the page to render
extracts title, metadata, and readable text from the rendered DOM

Use this only when needed. It is slower and heavier than the first-pass extractor.

Step 3: Report method and limitations

Always tell the user which method worked:

wechat-dom
wechat-access-limited
json-ld
embedded-data:__NEXT_DATA__
dom-paragraphs
browser-render:playwright

Also mention known limitations when relevant:

text was truncated
metadata only
browser runtime unavailable
login wall / CAPTCHA / region restriction
anti-bot blocking

Output contract

When using this skill, aim to return the following whenever possible:

page title
author and publish/update time
the main body text or a concise faithful summary
the extraction method used
any missing sections, uncertainty, or access limitations

Do not imply full page access if only metadata or fragments were recovered.

Scripts

`scripts/page_fetch.py`

Purpose:

unified no-persist entry point
routes WeChat vs general webpages automatically
escalates to browser rendering only when lightweight extraction is insufficient
only saves JSON when --save-json is explicitly requested

Typical usage:

python3 scripts/page_fetch.py "https://example.com/article" --format json

Optional explicit persistence:

python3 scripts/page_fetch.py "https://example.com/article" --format json --save-json --output ./example.json

Output fields:

all fields returned by the selected extraction path
notes including runner step trace
saved_to only when explicit persistence is requested

`scripts/fetch_wechat_article.py`

Purpose:

WeChat public article extraction without persistence
optimized for mp.weixin.qq.com article pages

Typical usage:

python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json --max-chars 12000

Output fields:

url
final_url
status_code
title
description
author
account_nickname
published_time
method
text
content_html
excerpt
notes
access_limited
access_limit_reason

`scripts/fetch_page.py`

Purpose:

deterministic first-pass extraction
optimized for cost, speed, and portability

Typical usage:

python3 scripts/fetch_page.py "https://example.com/article" --format json --max-chars 8000

Output fields:

url
final_url
status_code
title
description
author
published_time
method
text
excerpt
notes

`scripts/render_page.py`

Purpose:

browser-render fallback for JS-heavy or client-rendered pages

Typical usage:

python3 scripts/render_page.py "https://example.com/article" --format json --wait-ms 2500

Important notes:

requires Node Playwright for browser fallback
requires Chromium installed via Playwright for browser fallback
requires system shared libraries for headless Chromium when browser fallback is used
returns explicit machine-readable failure states when unavailable or broken

References

Read these when you need more context than the main workflow:

references/strategy.md
- default extraction strategy
- failure-to-next-action mapping
references/browser-runtime.md
- browser fallback runtime expectations
- common failure modes
- operational guidance

Guardrails

Prefer lightweight fetches over browser automation.
Do not silently switch to expensive browser rendering for every page.
Do not bluff when access is blocked.
For routine reads, do not save page contents to disk unless the user explicitly wants export or archival.
If output must be saved, prefer a caller-chosen report/output path rather than workspace-root artifacts.

Quick examples

Example A: WeChat public article

If the URL is mp.weixin.qq.com, try fetch_wechat_article.py first. If it returns article body text, use that directly. If it reports access limits, say so plainly.

Example B: standard news article

If fetch_page.py returns a solid body via embedded-data:__NEXT_DATA__ or dom-paragraphs, use that result directly.

Example C: JS-rendered docs site

If fetch_page.py returns thin text or metadata only, escalate to render_page.py.

Example D: blocked page

If browser rendering fails because of login, CAPTCHA, or anti-bot controls, report the limitation plainly and, when appropriate, look for an alternate accessible source.

Page Fetch

Install

Page Fetch

What this skill is for

Design goal

Workflow

Step 1: Use the unified runner by default

Step 2: Direct script usage when debugging or forcing a method

WeChat public articles

General lightweight fetch

Browser-render fallback

Step 3: Report method and limitations

Output contract

Scripts

scripts/page_fetch.py

scripts/fetch_wechat_article.py

scripts/fetch_page.py

scripts/render_page.py

References

Guardrails

Quick examples

Example A: WeChat public article

Example B: standard news article

Example C: JS-rendered docs site

Example D: blocked page

Related skills

`scripts/page_fetch.py`

`scripts/fetch_wechat_article.py`

`scripts/fetch_page.py`

`scripts/render_page.py`