# Scrapling Web Extractor

v1.0.0

Fetch one or more public webpages with Scrapling, extract the main content, and convert the HTML into Markdown using html2text. Supports static HTTP, concurrent async batches, stealth anti-bot bypass, and dynamic browser rendering.

## Web Markdown Scraper

Use this skill when the user wants to:
- Scrape one or more public webpages (static or JavaScript-rendered)
- Convert HTML pages into clean Markdown
- Extract article/body text for summarization, analysis, or indexing
- Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
- Scrape many URLs concurrently (async mode)
- Track page elements reliably across website redesigns (automatch)
- Save the extracted results as `.md` files
## Fetcher Mode Selection Guide

| Mode | Fetcher Class | Best For |
|---|---|---|
| `http` (default) | `Fetcher` | Fast static pages, RSS, APIs |
| `async` | `AsyncFetcher` | Batches of 5+ static URLs in parallel |
| `stealth` | `StealthyFetcher` | Anti-bot sites, Cloudflare, fingerprint checks |
| `dynamic` | `PlayWrightFetcher` | Heavy SPAs, React/Vue/Angular apps |
**Decision rule:** Start with `http`. If you get a 403, a CAPTCHA, or an empty body, switch to `stealth`. If the content is rendered client-side (empty on first load), use `dynamic`. Use `async` when scraping many static URLs at once to save time.
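The decision rule can be expressed as a small fallback helper. This is an illustrative sketch only: the `next_mode` name, the 403/CAPTCHA check, and the empty-body heuristic are assumptions, not part of the script.

```python
def next_mode(current_mode: str, status: int, body: str) -> str:
    """Pick the fetcher mode to retry with, following the decision rule:
    http -> stealth on 403/CAPTCHA-style blocks, -> dynamic on empty bodies."""
    blocked = status == 403 or "captcha" in body.lower()
    if current_mode == "http" and blocked:
        return "stealth"
    if current_mode in ("http", "stealth") and not body.strip():
        return "dynamic"
    return current_mode  # keep the mode that worked

print(next_mode("http", 403, ""))              # blocked -> stealth
print(next_mode("http", 200, ""))              # empty body -> dynamic
print(next_mode("http", 200, "<html>ok</html>"))  # fine -> stay on http
```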
## Inputs

### URL sources

- `--url URL` — one target URL (repeat the flag for multiple: `--url A --url B`)
- `--url-file FILE` — plain-text file with one URL per line
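Parsing the `--url-file` format could look like the sketch below; skipping blank lines and surrounding whitespace is an assumption about how the script treats the file.

```python
def read_url_file(text: str) -> list[str]:
    """One URL per line; ignore blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(read_url_file("https://a.example\n\n  https://b.example\n"))
# → ['https://a.example', 'https://b.example']
```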
### Fetcher

- `--mode http|async|stealth|dynamic` — fetcher backend (default: `http`)
### Content extraction

- `--selector CSS` — CSS selector for the main content area (omit to capture the full page)
- `--preserve-links` — keep hyperlinks in the Markdown output
- `--output-dir DIR` — save per-page `.md` files and a master `index.json` here
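The conversion itself is handled by html2text. As a rough illustration of what `--preserve-links` controls, here is a minimal stdlib-only sketch; the class name and the handled tags are invented for the example, and the real script's output will differ.

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter illustrating the --preserve-links idea."""
    def __init__(self, preserve_links: bool = True):
        super().__init__()
        self.preserve_links = preserve_links
        self.out: list[str] = []
        self.href: str | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "a" and self.preserve_links:
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.preserve_links and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "p"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)

    def markdown(self) -> str:
        return "".join(self.out).strip()

conv = MiniMarkdown(preserve_links=True)
conv.feed('<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>')
print(conv.markdown())
```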
### AutoMatch — production resilience

- `--auto-save` — fingerprint and persist the selected elements to the local DB on the first run
- `--auto-match` — on subsequent runs, find elements by fingerprint even if the site layout has changed (no need to update the CSS selector)
### Browser options (stealth / dynamic only)

- `--headless true|false|virtual` — headless mode; `virtual` uses Xvfb (default: `true`)
- `--network-idle` — wait until there is no network activity for ≥500 ms before capturing
- `--block-images` — block image loading (saves bandwidth and proxy quota)
- `--disable-resources` — drop fonts/images/media/stylesheets for ~25% faster loads
- `--wait-selector CSS` — pause until this element appears in the DOM
- `--wait-selector-state attached|visible|detached|hidden` — element state to wait for (default: `attached`)
- `--timeout MS` — global timeout in ms (default: 30000)
- `--wait MS` — extra idle wait after page load, in ms
### StealthyFetcher extras (stealth mode only)

- `--humanize SECONDS` — simulate human-like cursor movement (maximum duration in seconds)
- `--geoip` — spoof the browser timezone, locale, language, and WebRTC IP from the proxy's geolocation
- `--block-webrtc` — prevent real-IP leaks via WebRTC
- `--disable-ads` — install uBlock Origin in the browser session
- `--proxy URL` — HTTP/SOCKS proxy as a URL string, or as JSON: `'{"server":"host:port","username":"u","password":"p"}'`
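The dual proxy format (URL string or JSON object) can be normalized with a small helper. This is a sketch of one way to do it; the `parse_proxy` name and the exact validation are assumptions about the script's behavior.

```python
import json

def parse_proxy(value: str) -> dict:
    """Accept either a proxy URL string or a JSON object string
    like '{"server":"host:port","username":"u","password":"p"}'."""
    try:
        obj = json.loads(value)
    except json.JSONDecodeError:
        return {"server": value}  # plain URL form
    if not isinstance(obj, dict) or "server" not in obj:
        raise ValueError("proxy JSON must contain a 'server' key")
    return obj

print(parse_proxy("http://user:pass@host:8080"))
print(parse_proxy('{"server":"host:8080","username":"u","password":"p"}'))
```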
### Reliability

- `--retry N` — retry failed requests up to N times with exponential backoff (capped at 30 s)
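Capped exponential backoff might produce delays like the sketch below; the 1-second base and absence of jitter are assumptions, since only the 30 s cap is documented.

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry attempt: base * 2**attempt, capped at `cap` seconds."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]

print(backoff_delays(6))  # → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```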
## Rules

- Only process public `http://` or `https://` pages.
- Never bypass login walls, CAPTCHAs, paywalls, or access controls.
- Prefer the main article or body content; avoid polluting the output with navigation, headers, footers, or cookie banners — use `--selector` to target the content area.
- When `--auto-save` is used, always also pass `--selector` so Scrapling knows which element fingerprint to record.
- On subsequent runs for layout-changed pages, use `--auto-match` instead of `--auto-save`. Do not use both flags at once.
- Use `--mode async` for batch jobs with 5+ static URLs for parallel execution.
- Combine `--disable-resources` with `--block-images` in stealth/dynamic mode when you only need text content — this can cut load times by up to 40%.
- Always inspect the top-level `ok` field and the per-result `ok` fields before using content.
- If `ok` is `false`, report the exact `error` string — do not invent or guess content.
- When `--network-idle` is insufficient, use `--wait-selector` with a specific DOM element to guarantee the content has loaded before capture.
## Command Patterns

### Basic static page

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"
```

### Static page — target a specific content area

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"
```

### Stealth mode — bypass anti-bot protection

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
```

### Stealth + proxy + human fingerprint (maximum stealth)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --proxy "http://user:pass@host:port" \
  --humanize 2.0 \
  --geoip \
  --block-webrtc \
  --network-idle
```

### Dynamic SPA page (Playwright Chromium)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode dynamic \
  --wait-selector ".product-list" \
  --network-idle \
  --disable-resources
```

### Async concurrent batch (multiple URLs)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --mode async \
  --url "<URL1>" --url "<URL2>" --url "<URL3>"
```

### Batch from file + stealth + save to disk

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url-file urls.txt \
  --mode stealth \
  --disable-resources \
  --output-dir outputs
```

### First-run automatch setup (save fingerprint)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-save \
  --output-dir outputs
```

### Subsequent run after a site layout change (adaptive match)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-match \
  --output-dir outputs
```

### Full production scrape

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --selector "main article" \
  --auto-match \
  --preserve-links \
  --network-idle \
  --disable-resources \
  --timeout 60000 \
  --retry 3 \
  --output-dir outputs
```
## Output Handling

JSON is printed to stdout. Always check `ok` before using the content.

Top-level fields:

- `ok` — `true` only if every URL succeeded
- `total` / `succeeded` / `failed` — count summary
- `results` — array of per-URL result objects
- `output_index_file` — path to the saved `index.json` (if `--output-dir` was used)

Per-URL result fields (when `ok: true`):

- `url` — the requested URL
- `status` — HTTP status code (e.g. `200`)
- `title` — page `<title>` text
- `markdown` — extracted content as Markdown ← use this as the main content
- `markdown_length` — character count (useful for quality checks)
- `output_markdown_file` — path to the saved `.md` file (if `--output-dir` was used)

On failure (`ok: false` in a result):

- `error` — exact error message; report this verbatim, do not invent content
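A consumer of this output could validate it along these lines. The field names follow the schema above; the `extract_markdown` helper and the sample payload are illustrative, not part of the script.

```python
import json

def extract_markdown(raw_json: str) -> tuple[dict, dict]:
    """Split scraper output into per-URL Markdown and per-URL errors,
    checking the per-result `ok` flag before trusting any content."""
    data = json.loads(raw_json)
    pages, errors = {}, {}
    for result in data.get("results", []):
        if result.get("ok"):
            pages[result["url"]] = result["markdown"]
        else:
            # Report the exact error string; never substitute invented content.
            errors[result["url"]] = result.get("error", "unknown error")
    return pages, errors

sample = (
    '{"ok": false, "results": ['
    '{"ok": true, "url": "https://a.example", "markdown": "# A"},'
    '{"ok": false, "url": "https://b.example", "error": "403 Forbidden"}]}'
)
pages, errors = extract_markdown(sample)
print(pages)   # {'https://a.example': '# A'}
print(errors)  # {'https://b.example': '403 Forbidden'}
```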
## Version tags

- latest

## Runtime requirements

- 🕸️ Clawdis
- Bins: `python3`
