Page Fetch

v1.0.1

Extract readable content from webpages with a stable, low-dependency workflow. Use when the user asks to open, inspect, summarize, translate, verify, or quote webpage content.

Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the code and runtime behavior. The scripts implement lightweight HTML extraction, JSON-LD/embedded-data inspection, a WeChat-specific extractor, and an explicit browser-render fallback via Playwright — all consistent with the stated purpose.
Instruction Scope
Runtime instructions and scripts stay within the scope of fetching and extracting webpage content. They accept an optional cookie (WeChat use-case) and may write JSON only when --save-json is passed. Two points to note: (1) browser rendering executes page JavaScript in a headless Chromium instance (expected for rendering but worth awareness), and (2) the runner defaults to a fixed filesystem path (/home/admin/projects/openclaw/reports/page-fetch) when --save-json is used without --output, which may write outside the workspace.
Install Mechanism
There is no install spec (instruction-only install), which lowers risk. However, the browser fallback depends on Node.js and Playwright (documented in references). The Python scripts implicitly require requests and BeautifulSoup, but the skill does not declare Python package installation; callers must ensure these dependencies are available. No remote arbitrary download/install steps are present.
Credentials
The skill does not request environment variables or credentials. It accepts an optional cookie parameter for WeChat article fetching (justified by that use-case). No unrelated secrets or config paths are requested. The scripts do modify the process environment to include a global npm root in NODE_PATH when invoking Node, but this is a local runtime adjustment for Playwright resolution and not an exfiltration mechanism.
Persistence & Privilege
always:false (normal). The default behavior is no disk writes; persistence only occurs when the caller explicitly passes --save-json. If --save-json is used without --output, the skill writes to DEFAULT_SAVE_DIR (/home/admin/projects/openclaw/reports/page-fetch), which is outside a generic workspace and may be unexpected — review that path before enabling saves. The skill does not modify other skills or global agent config.
Assessment
This skill appears to do exactly what it says: fetch and extract webpage content, with a WeChat-specific path and an optional Playwright browser fallback. Before installing or running it, consider:

  1. Dependencies: ensure Python packages (requests, bs4) and, if you intend to use browser rendering, Node.js and Playwright plus a browser are installed in a controlled environment.
  2. Cookies: the WeChat extractor accepts an optional cookie argument; only pass cookies you trust and understand.
  3. Persistence: by default the runner does not write files; only use --save-json when you intend to persist output. If you use --save-json without --output, the skill writes to /home/admin/projects/openclaw/reports/page-fetch/latest.json; change the --output path or inspect DEFAULT_SAVE_DIR if that behavior is undesirable.
  4. Browser rendering runs page JS in a headless browser to extract text; this is expected but means pages execute normally (avoid rendering untrusted pages on privileged hosts).

Overall the package is coherent and proportionate, but run it in a sandbox or review the default save path and dependencies before enabling persistence or browser fallback.


latest: vk97cbjbqczgecp5v7vnz88twds84qmn4
55 downloads · 0 stars · 1 version
Updated 6d ago · v1.0.1 · MIT-0

Page Fetch

Use this skill to extract webpage content in a reproducible way that works well across different models and avoids browser dependence unless necessary.

This skill is built for reliability first:

  • start with lightweight deterministic fetches
  • inspect embedded page data before escalating
  • use browser rendering only as a fallback
  • report the extraction method and any access limits clearly

What this skill is for

Use this skill when a user says things like:

  • "看一下这个网页的内容" ("Take a look at the content of this webpage")
  • "Open this article and summarize it"
  • "Tell me what this page says"
  • "Translate this webpage"
  • "Check whether this page mentions X"
  • "Quote the main points from this documentation page"

This skill is best for:

  • news articles
  • blog posts
  • documentation pages
  • product pages
  • general public webpages

This skill is not magic. If a page is blocked by login, CAPTCHA, region restrictions, or aggressive anti-bot controls, report that clearly.

Design goal

The core goal is cross-model reliability.

Different LLMs often choose different ad-hoc ways to read webpages. This skill reduces that variance by giving them a standard path:

  1. Route mp.weixin.qq.com links to the dedicated WeChat extractor first.
  2. Try the lightweight deterministic extractor for general webpages.
  3. Inspect embedded page data when available.
  4. Use a real browser only when necessary.
  5. Tell the user which path worked and what the limits were.
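The routing order above can be sketched in a few lines. This is only an illustration of the decision, not the runner's actual internals; the function name and path labels are assumptions:

```python
from urllib.parse import urlparse

def choose_path(url: str) -> list[str]:
    """Return the ordered extraction paths to try for a URL (illustrative)."""
    host = urlparse(url).hostname or ""
    if host == "mp.weixin.qq.com":
        # WeChat articles go straight to the dedicated extractor.
        return ["wechat"]
    # General pages: lightweight deterministic fetch first,
    # browser rendering only as a fallback.
    return ["lightweight", "browser-render"]
```

Whichever path succeeds first is the one reported back to the user in step 5.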

Workflow

Step 1: Use the unified runner by default

For routine webpage reads, run the wrapper first:

python3 scripts/page_fetch.py "https://example.com/article" --format json

What it does:

  • routes mp.weixin.qq.com to the dedicated WeChat extractor first
  • uses lightweight HTML extraction for general pages
  • escalates to browser rendering only when needed
  • does not persist files unless --save-json is explicitly passed
  • never defaults to writing transient JSON into the current working directory

Persistence rule:

  • default: no disk writes
  • only save when the caller explicitly passes --save-json
  • when saving is requested without --output, write to a non-workspace report path chosen by the caller or local runtime convention
  • do not write transient artifacts into the workspace root

Step 2: Direct script usage when debugging or forcing a method

Use direct scripts only when you need to debug or force a particular extraction path.

WeChat public articles

python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json

What it does:

  • uses a WeChat mobile-style request header
  • extracts article metadata from page meta tags and script variables
  • reads the article body from #js_content / .rich_media_content
  • reports explicit access limits when the page is replaced by verification or anti-bot flows
  • does not persist files; it only returns structured extraction output

What to look for in the output:

  • title
  • author
  • account_nickname
  • published_time
  • text
  • method
  • access_limited
  • access_limit_reason

General lightweight fetch

python3 scripts/fetch_page.py "https://example.com/article" --format json

What it does:

  • fetches raw HTML via requests
  • extracts metadata from HTML/meta tags
  • inspects JSON-LD
  • inspects embedded payloads such as __NEXT_DATA__
  • falls back to DOM paragraph extraction

What to look for in the output:

  • title
  • author
  • published_time
  • text
  • method
  • notes
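One of the steps above, JSON-LD inspection, can be approximated with the standard library alone. The real script uses requests and BeautifulSoup; this regex-based version is only an illustrative sketch:

```python
import json
import re

# Match <script type="application/ld+json"> blocks in raw HTML.
JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_jsonld(html: str) -> list:
    """Return every parseable JSON-LD payload found in the page."""
    blocks = []
    for match in JSONLD_RE.finditer(html):
        try:
            blocks.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed blocks rather than failing the whole fetch.
            continue
    return blocks
```

For news articles, the first block with `"@type": "Article"` usually carries headline, author, and publish time.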

Browser-render fallback

If the lightweight path returns thin, broken, or clearly incomplete content, run:

python3 scripts/render_page.py "https://example.com/article" --format json

What it does:

  • launches headless Chromium via Node Playwright
  • waits for the page to render
  • extracts title, metadata, and readable text from the rendered DOM

Use this only when needed. It is slower and heavier than the first-pass extractor.

Step 3: Report method and limitations

Always tell the user which method worked:

  • wechat-dom
  • wechat-access-limited
  • json-ld
  • embedded-data:__NEXT_DATA__
  • dom-paragraphs
  • browser-render:playwright

Also mention known limitations when relevant:

  • text was truncated
  • metadata only
  • browser runtime unavailable
  • login wall / CAPTCHA / region restriction
  • anti-bot blocking

Output contract

When using this skill, aim to return the following whenever possible:

  • page title
  • author and publish/update time
  • the main body text or a concise faithful summary
  • the extraction method used
  • any missing sections, uncertainty, or access limitations

Do not imply full page access if only metadata or fragments were recovered.

Scripts

scripts/page_fetch.py

Purpose:

  • unified no-persist entry point
  • routes WeChat vs general webpages automatically
  • escalates to browser rendering only when lightweight extraction is insufficient
  • only saves JSON when --save-json is explicitly requested

Typical usage:

python3 scripts/page_fetch.py "https://example.com/article" --format json

Optional explicit persistence:

python3 scripts/page_fetch.py "https://example.com/article" --format json --save-json --output ./example.json

Output fields:

  • all fields returned by the selected extraction path
  • notes including runner step trace
  • saved_to only when explicit persistence is requested

scripts/fetch_wechat_article.py

Purpose:

  • WeChat public article extraction without persistence
  • optimized for mp.weixin.qq.com article pages

Typical usage:

python3 scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/..." --format json --max-chars 12000

Output fields:

  • url
  • final_url
  • status_code
  • title
  • description
  • author
  • account_nickname
  • published_time
  • method
  • text
  • content_html
  • excerpt
  • notes
  • access_limited
  • access_limit_reason

scripts/fetch_page.py

Purpose:

  • deterministic first-pass extraction
  • optimized for cost, speed, and portability

Typical usage:

python3 scripts/fetch_page.py "https://example.com/article" --format json --max-chars 8000

Output fields:

  • url
  • final_url
  • status_code
  • title
  • description
  • author
  • published_time
  • method
  • text
  • excerpt
  • notes

scripts/render_page.py

Purpose:

  • browser-render fallback for JS-heavy or client-rendered pages

Typical usage:

python3 scripts/render_page.py "https://example.com/article" --format json --wait-ms 2500

Important notes:

  • the browser fallback requires Node Playwright, a Chromium build installed via Playwright, and the system shared libraries headless Chromium depends on
  • returns explicit machine-readable failure states when unavailable or broken
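A caller can pre-flight the runtime before escalating. This sketch assumes that checking for a node binary on PATH is an adequate first test, which may differ from the script's own detection:

```python
import shutil

def browser_runtime_available() -> bool:
    """True only if a Node binary is on PATH (first prerequisite only)."""
    return shutil.which("node") is not None

def next_step(lightweight_result_thin: bool) -> str:
    """Pick the next action, with an explicit failure state when blocked."""
    if not lightweight_result_thin:
        return "use-lightweight-result"
    if browser_runtime_available():
        return "escalate:browser-render"
    # Machine-readable failure, consistent with the skill's reporting style.
    return "error:browser-runtime-unavailable"
```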

References

Read these when you need more context than the main workflow:

  • references/strategy.md
    • default extraction strategy
    • failure-to-next-action mapping
  • references/browser-runtime.md
    • browser fallback runtime expectations
    • common failure modes
    • operational guidance

Guardrails

  • Prefer lightweight fetches over browser automation.
  • Do not silently switch to expensive browser rendering for every page.
  • Do not bluff when access is blocked.
  • For routine reads, do not save page contents to disk unless the user explicitly wants export or archival.
  • If output must be saved, prefer a caller-chosen report/output path rather than workspace-root artifacts.

Quick examples

Example A: WeChat public article

If the URL is mp.weixin.qq.com, try fetch_wechat_article.py first. If it returns article body text, use that directly. If it reports access limits, say so plainly.

Example B: standard news article

If fetch_page.py returns a solid body via embedded-data:__NEXT_DATA__ or dom-paragraphs, use that result directly.

Example C: JS-rendered docs site

If fetch_page.py returns thin text or metadata only, escalate to render_page.py.

Example D: blocked page

If browser rendering fails because of login, CAPTCHA, or anti-bot controls, report the limitation plainly and, when appropriate, look for an alternate accessible source.
