Web Content Fetcher (WeChat images fix)

v0.0.1

Extract article content from any URL as clean Markdown. Uses Scrapling script as primary method (with auto fast→stealth fallback), Jina Reader as alternative...

by @haanya168 · fork of @mrtommywu/web-content-fetcher (1.0.1)

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for haanya168/web-content-fetcher-hanya.

Install the skill "Web Content Fetcher (WeChat images fix)" (haanya168/web-content-fetcher-hanya) from ClawHub.
Skill page: https://clawhub.ai/haanya168/web-content-fetcher-hanya
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install web-content-fetcher-hanya

ClawHub CLI

npx clawhub@latest install web-content-fetcher-hanya

Security Scan

VirusTotal

Benign

OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name/description (extract article content to Markdown) align with the provided files and runtime actions. The skill ships a Python script that fetches HTML, normalizes lazy images, and converts to Markdown using html2text — all expected for a webpage extractor.

ℹ

Instruction Scope

SKILL.md instructs the agent to run the included scripts/fetch.py and to prefer Scrapling with an optional Jina Reader fallback — instructions stay within the stated purpose. Note: the script will perform arbitrary HTTP(S) requests (including headless browser fetches) to whatever URL the user or agent supplies, which is expected but can fetch internal endpoints (e.g., cloud metadata or intranet) if given such URLs.

✓

Install Mechanism

No install spec is declared (instruction-only), and dependencies are standard Python packages listed in requirements.txt. Nothing is downloaded from an unknown URL or written to disk during an automated install step. The skill does require pip installation of scrapling and html2text as documented.

✓

Credentials

The skill requires no environment variables, secrets, or config paths. The code uses only local imports and network access to the target URLs — this is proportional to its purpose.

✓

Persistence & Privilege

The skill is not always-enabled and does not request elevated or persistent platform privileges. It does not modify other skills or system-wide settings.

Assessment

This skill appears to be what it claims: a local Python-based webpage-to-Markdown extractor. Before installing or enabling it, consider: (1) run pip installs inside a virtualenv/isolated environment (the script requires scrapling and html2text and may need a headless browser runtime like Chromium depending on your Scrapling setup); (2) audit and sandbox use if you plan to let the agent call arbitrary URLs — the script will fetch any URL given (so avoid giving it internal/cloud-metadata URLs or other sensitive endpoints); (3) verify the upstream 'scrapling' package and ensure you trust it for headless browsing; (4) if you need offline copies with embedded images, implement the post-processing step yourself since image downloading is not included. Overall the skill is coherent and proportionate, but deploy with usual caution for code that performs network fetching and headless browser actions.
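Point (2) above can be made concrete with a pre-flight check that runs before a URL is handed to the script. The following is only a minimal sketch, not part of the skill: `should_refuse_url` is a hypothetical helper, and it inspects literal hosts only. A DNS name that resolves to an internal address would pass, so this is a first filter rather than a full sandbox.

```python
import ipaddress
from urllib.parse import urlparse


def should_refuse_url(url: str) -> bool:
    """Return True for URLs that obviously point at internal endpoints.

    Hypothetical helper: only literal IPs and localhost names are checked;
    DNS resolution is deliberately out of scope for this sketch.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True  # refuse file://, ftp://, and anything else unexpected
    host = (parsed.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name, not an IP literal
    # Covers 10.0.0.0/8, 127.0.0.0/8, 169.254.0.0/16 (cloud metadata), ::1, etc.
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```

Calling it with the usual cloud metadata address, `http://169.254.169.254/latest/meta-data/`, returns `True`, while an ordinary public article URL returns `False`.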

Like a lobster shell, security has layers — review code before you run it.

latestvk970tp307mxtqbv4317297hhtx83f6nt

774 downloads

1 star

1 version

Updated 1mo ago

MIT-0

Web Content Fetcher

Given a URL, return its main content as clean Markdown — headings, links, images, lists, code blocks all preserved.

Note: This skill extracts content + remote image URLs. If the user wants an "offline" copy (download images to local disk and rewrite links), add a post-processing step (not included by default in this skill).
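One possible shape for that post-processing step is sketched below. It is an illustration under stated assumptions, not something the skill ships: `rewrite_image_links`, `local_name`, and the `images/` directory are hypothetical names, and only plain `![alt](url)` images without titles are handled. The function rewrites the links and returns a mapping of remote URL to local path, leaving the actual download to the caller.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches Markdown images of the form ![alt](http...); image titles are not handled.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def local_name(url: str) -> str:
    """Derive a stable local filename from a remote URL (hypothetical naming scheme)."""
    suffix = PurePosixPath(urlparse(url).path).suffix or ".img"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{digest}{suffix}"


def rewrite_image_links(markdown: str, image_dir: str = "images"):
    """Return (rewritten_markdown, {remote_url: local_path})."""
    mapping = {}

    def replace(match):
        alt, url = match.group(1), match.group(2)
        # Reuse the same local path when one image appears several times.
        path = mapping.setdefault(url, f"{image_dir}/{local_name(url)}")
        return f"![{alt}]({path})"

    return IMAGE_RE.sub(replace, markdown), mapping
```

The caller can then iterate over the returned mapping and fetch each remote URL to its local path, for example with `urllib.request.urlretrieve`. URLs without a file extension in the path fall back to the `.img` suffix here, so the extension may need correcting from the response's content type.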

Extraction Strategy

Always try one method per URL — don't cascade blindly. Pick the right one upfront.

URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py — check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback — only if Scrapling fails or dependencies not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
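As a small illustration of the fallback branch, the sketch below builds the Jina Reader URL by prefixing the target URL, as in the diagram, and skips hosts the diagram says it cannot handle. `jina_reader_url` and `JINA_UNSUPPORTED_HOSTS` are hypothetical names. Only WeChat is listed because the "some Chinese platforms" are not named in this document.

```python
from urllib.parse import urlparse

JINA_PREFIX = "https://r.jina.ai/"
# Assumption: only the host explicitly called out above (WeChat returns 403).
JINA_UNSUPPORTED_HOSTS = {"mp.weixin.qq.com"}


def jina_reader_url(url: str):
    """Return the Jina Reader URL for `url`, or None if the host is known not to work."""
    host = (urlparse(url).hostname or "").lower()
    if host in JINA_UNSUPPORTED_HOSTS:
        return None
    return JINA_PREFIX + url
```

A `None` result means the fallback is not worth attempting for that URL.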

Scrapling script

python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]

<SKILL_DIR> is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

Default (fast): HTTP fetch, ~1-3s, works for most sites
--stealth: Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without --stealth, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify --stealth manually — the only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
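The fallback behaviour described above can be pictured roughly as follows. This is a sketch of the idea only, not the code in scripts/fetch.py: the two fetchers are passed in as plain callables, and `MIN_CHARS` is a hypothetical threshold, since the script's real cutoff for "too little content" is not stated here.

```python
from typing import Callable, Tuple

MIN_CHARS = 200  # hypothetical cutoff; the actual script's value is not documented here


def fetch_with_fallback(
    url: str,
    fast: Callable[[str], str],
    stealth: Callable[[str], str],
    force_stealth: bool = False,
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Return (mode_used, content), trying fast first unless stealth is forced."""
    if force_stealth:
        # Skips the initial fast attempt for sites already known to need a browser.
        return "stealth", stealth(url)
    content = fast(url)
    if len(content.strip()) >= min_chars:
        return "fast", content
    # The fast result was too thin, so retry once with the headless browser.
    return "stealth", stealth(url)
```

Forcing stealth avoids paying for a fast attempt that is expected to come back nearly empty.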

Domain Routing

Use this table to pick the right mode on the first call:

Domain	Command	Why
`mp.weixin.qq.com`	`fetch.py <url> --stealth`	JS-rendered content
`zhuanlan.zhihu.com`	`fetch.py <url> --stealth`	Anti-scraping + JS
`juejin.cn`	`fetch.py <url> --stealth`	JS-rendered SPA
`sspai.com`	`fetch.py <url>`	Static HTML
`blog.csdn.net`	`fetch.py <url>`	Static HTML
`ruanyifeng.com`	`fetch.py <url>`	Static blog
`openai.com`	`fetch.py <url>`	Static HTML
`blog.google`	`fetch.py <url>`	Static HTML
Everything else	`fetch.py <url>`	Auto-fallback handles it
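The routing table can be encoded as a simple lookup. The sketch below assumes exact hostname matches against the three stealth rows; `STEALTH_DOMAINS` and `needs_stealth` are hypothetical names, and subdomain handling is left out because the table does not specify it.

```python
from urllib.parse import urlparse

# The three rows of the routing table that call for --stealth.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the routing table says to pass --stealth on the first call."""
    host = (urlparse(url).hostname or "").lower()
    return host in STEALTH_DOMAINS
```

Every other host returns `False`, which matches the "Everything else" row: run the default mode and let the auto-fallback decide.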

Script Options

# Basic — auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
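For callers that drive the script from Python, the options above could be assembled like this. It is a sketch under assumptions: `build_fetch_args` and `run_fetch` are hypothetical helpers, the argument order mirrors the usage line (URL, then max_chars, then flags), and the script is assumed to print its result to stdout and exit non-zero on failure. Whether the options can all be combined in one call is not confirmed by this document.

```python
import json
import subprocess


def build_fetch_args(script_path, url, max_chars=None, stealth=False, as_json=False):
    """Build the argument list for one fetch.py call (no shell quoting needed)."""
    args = ["python3", script_path, url]
    if max_chars is not None:
        args.append(str(max_chars))
    if stealth:
        args.append("--stealth")
    if as_json:
        args.append("--json")
    return args


def run_fetch(script_path, url, **options):
    """Run the script once; return parsed JSON if requested, else the raw output."""
    result = subprocess.run(
        build_fetch_args(script_path, url, **options),
        capture_output=True,
        text=True,
        check=True,  # assumption: a failed extraction exits non-zero
    )
    if options.get("as_json"):
        return json.loads(result.stdout)
    return result.stdout
```

Passing the arguments as a list avoids shell interpolation of the URL, which matters when URLs contain characters such as `&` or `?`.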

Install Dependencies

First use only — the script checks and tells you if anything is missing:

pip install scrapling html2text

On a system-managed Python (macOS/Linux), either install into a venv or add --break-system-packages to the pip command.

Failure Rules

Same URL fails once → give up, tell the user "unable to extract content from this URL"
Do not retry — each failed call wastes context tokens
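One way to enforce these rules in an agent loop is to remember which URLs have already failed. The sketch below is hypothetical (`FailureTracker` is not part of the skill) and keeps its state in memory for the current session only.

```python
FAILURE_MESSAGE = "unable to extract content from this URL"


class FailureTracker:
    """Remembers failed URLs so the same URL is never attempted twice."""

    def __init__(self):
        self._failed = set()

    def should_attempt(self, url: str) -> bool:
        return url not in self._failed

    def record_failure(self, url: str) -> str:
        """Mark the URL as failed and return the message to show the user."""
        self._failed.add(url)
        return FAILURE_MESSAGE
```

Checking `should_attempt` before each call keeps a second failing request from spending more context tokens.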

WeChat-specific gotchas

WeChat often uses lazy-loaded images where the real URL is in data-src and src is a tiny placeholder.
The extractor script normalizes these to real src URLs before running html2text.
If Markdown image lines ever carry URL-encoded SVG fragments (e.g. ...%3Csvg...%3E) appended after the closing ) of an image, the placeholder has leaked into Markdown parsing. In that case, update fix_lazy_images() in scripts/fetch.py to remove or replace placeholder src values that start with data:image/svg+xml.
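To make the normalization concrete, here is a rough sketch of the idea. It is not the actual fix_lazy_images() from scripts/fetch.py: `normalize_lazy_images` is a hypothetical stand-in that uses regular expressions and assumes double-quoted attributes. It promotes data-src to src, and drops img tags whose only source is an SVG placeholder.

```python
import re

IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
DATA_SRC_RE = re.compile(r'\bdata-src\s*=\s*"([^"]*)"', re.IGNORECASE)
# The lookbehind keeps this from matching the "src" inside "data-src".
SRC_RE = re.compile(r'(?<![-\w])src\s*=\s*"([^"]*)"', re.IGNORECASE)


def normalize_lazy_images(html: str) -> str:
    """Replace placeholder src values with the real URL held in data-src."""

    def fix(match):
        tag = match.group(0)
        data_src = DATA_SRC_RE.search(tag)
        src = SRC_RE.search(tag)
        if data_src and data_src.group(1):
            real = data_src.group(1)
            if src:
                # Overwrite the existing (placeholder) src attribute.
                return tag[: src.start()] + f'src="{real}"' + tag[src.end():]
            # No src at all: insert one right after "<img".
            return tag[:4] + f' src="{real}"' + tag[4:]
        if src and src.group(1).startswith("data:image/svg+xml"):
            return ""  # placeholder with no real URL: drop the tag entirely
        return tag

    return IMG_TAG_RE.sub(fix, html)
```

Running this before the HTML-to-Markdown conversion means the converter never sees a data:image/svg+xml source, which is what keeps the encoded fragments out of the Markdown. A production version would more likely use a real HTML parser than regular expressions.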
