Page Fetch
v1.0.1Extract readable content from webpages with a stable, low-dependency workflow. Use when the user asks to open, inspect, summarize, translate, verify, or quot...
Like a lobster shell, security has layers — review code before you run it.
Page Fetch
Use this skill to extract webpage content in a reproducible way that works well across different models and avoids browser dependence unless necessary.
This skill is built for reliability first:
- start with lightweight deterministic fetches
- inspect embedded page data before escalating
- use browser rendering only as a fallback
- report the extraction method and any access limits clearly
What this skill is for
Use this skill when a user says things like:
- "看一下这个网页的内容"
- "Open this article and summarize it"
- "Tell me what this page says"
- "Translate this webpage"
- "Check whether this page mentions X"
- "Quote the main points from this documentation page"
This skill is best for:
- news articles
- blog posts
- documentation pages
- product pages
- general public webpages
This skill is not magic. If a page is blocked by login, CAPTCHA, region restrictions, or aggressive anti-bot controls, report that clearly.
Design goal
The core goal is cross-model reliability.
Different LLMs often choose different ad-hoc ways to read webpages. This skill reduces that variance by giving them a standard path:
- Route
mp.weixin.qq.comlinks to the dedicated WeChat extractor first. - Try the lightweight deterministic extractor for general webpages.
- Inspect embedded page data when available.
- Use a real browser only when necessary.
- Tell the user which path worked and what the limits were.
Workflow
Step 1: Use the unified runner by default
For routine webpage reads, run the wrapper first:
python3 scripts/page_fetch.py "https://example.com/article" --format json
What it does:
- routes
mp.weixin.qq.comto the dedicated WeChat extractor first - uses lightweight HTML extraction for general pages
- escalates to browser rendering only when needed
- does not persist files unless
--save-jsonis explicitly passed - never defaults to writing transient JSON into the current working directory
Persistence rule:
- default: no disk writes
- only save when the caller explicitly passes
--save-json - when saving is requested without
--output, write to a non-workspace report path chosen by the caller or local runtime convention - do not write transient artifacts into the workspace root
Step 2: Direct script usage when debugging or forcing a method
Use direct scripts only when you need to debug or force a particular extraction path.
WeChat public articles
python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json
What it does:
- uses a WeChat mobile-style request header
- extracts article metadata from page meta tags and script variables
- reads the article body from
#js_content/.rich_media_content - reports explicit access limits when the page is replaced by verification or anti-bot flows
- does not persist files; it only returns structured extraction output
What to look for in the output:
titleauthoraccount_nicknamepublished_timetextmethodaccess_limitedaccess_limit_reason
General lightweight fetch
python3 scripts/fetch_page.py "https://example.com/article" --format json
What it does:
- fetches raw HTML via
requests - extracts metadata from HTML/meta tags
- inspects JSON-LD
- inspects embedded payloads such as
__NEXT_DATA__ - falls back to DOM paragraph extraction
What to look for in the output:
titleauthorpublished_timetextmethodnotes
Browser-render fallback
If the lightweight path returns thin, broken, or clearly incomplete content, run:
python3 scripts/render_page.py "https://example.com/article" --format json
What it does:
- launches headless Chromium via Node Playwright
- waits for the page to render
- extracts title, metadata, and readable text from the rendered DOM
Use this only when needed. It is slower and heavier than the first-pass extractor.
Step 3: Report method and limitations
Always tell the user which method worked:
wechat-domwechat-access-limitedjson-ldembedded-data:__NEXT_DATA__dom-paragraphsbrowser-render:playwright
Also mention known limitations when relevant:
- text was truncated
- metadata only
- browser runtime unavailable
- login wall / CAPTCHA / region restriction
- anti-bot blocking
Output contract
When using this skill, aim to return the following whenever possible:
- page title
- author and publish/update time
- the main body text or a concise faithful summary
- the extraction method used
- any missing sections, uncertainty, or access limitations
Do not imply full page access if only metadata or fragments were recovered.
Scripts
scripts/page_fetch.py
Purpose:
- unified no-persist entry point
- routes WeChat vs general webpages automatically
- escalates to browser rendering only when lightweight extraction is insufficient
- only saves JSON when
--save-jsonis explicitly requested
Typical usage:
python3 scripts/page_fetch.py "https://example.com/article" --format json
Optional explicit persistence:
python3 scripts/page_fetch.py "https://example.com/article" --format json --save-json --output ./example.json
Output fields:
- all fields returned by the selected extraction path
notesincluding runner step tracesaved_toonly when explicit persistence is requested
scripts/fetch_wechat_article.py
Purpose:
- WeChat public article extraction without persistence
- optimized for
mp.weixin.qq.comarticle pages
Typical usage:
python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json --max-chars 12000
Output fields:
urlfinal_urlstatus_codetitledescriptionauthoraccount_nicknamepublished_timemethodtextcontent_htmlexcerptnotesaccess_limitedaccess_limit_reason
scripts/fetch_page.py
Purpose:
- deterministic first-pass extraction
- optimized for cost, speed, and portability
Typical usage:
python3 scripts/fetch_page.py "https://example.com/article" --format json --max-chars 8000
Output fields:
urlfinal_urlstatus_codetitledescriptionauthorpublished_timemethodtextexcerptnotes
scripts/render_page.py
Purpose:
- browser-render fallback for JS-heavy or client-rendered pages
Typical usage:
python3 scripts/render_page.py "https://example.com/article" --format json --wait-ms 2500
Important notes:
- requires Node Playwright for browser fallback
- requires Chromium installed via Playwright for browser fallback
- requires system shared libraries for headless Chromium when browser fallback is used
- returns explicit machine-readable failure states when unavailable or broken
References
Read these when you need more context than the main workflow:
references/strategy.md- default extraction strategy
- failure-to-next-action mapping
references/browser-runtime.md- browser fallback runtime expectations
- common failure modes
- operational guidance
Guardrails
- Prefer lightweight fetches over browser automation.
- Do not silently switch to expensive browser rendering for every page.
- Do not bluff when access is blocked.
- For routine reads, do not save page contents to disk unless the user explicitly wants export or archival.
- If output must be saved, prefer a caller-chosen report/output path rather than workspace-root artifacts.
Quick examples
Example A: WeChat public article
If the URL is mp.weixin.qq.com, try fetch_wechat_article.py first. If it returns article body text, use that directly. If it reports access limits, say so plainly.
Example B: standard news article
If fetch_page.py returns a solid body via embedded-data:__NEXT_DATA__ or dom-paragraphs, use that result directly.
Example C: JS-rendered docs site
If fetch_page.py returns thin text or metadata only, escalate to render_page.py.
Example D: blocked page
If browser rendering fails because of login, CAPTCHA, or anti-bot controls, report the limitation plainly and, when appropriate, look for an alternate accessible source.
Comments
Loading comments...
