# Scrapling Web Extractor

v1.0.0

Fetch one or more public webpages with Scrapling, extract the main content, and convert the HTML into Markdown using html2text. Supports static HTTP, concurrent async batches, stealth anti-bot bypass, and dynamic browser rendering.

## Web Markdown Scraper

Use this skill when the user wants to:
- Scrape one or more public webpages (static or JavaScript-rendered)
- Convert HTML pages into clean Markdown
- Extract article/body text for summarization, analysis, or indexing
- Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
- Scrape many URLs concurrently (async mode)
- Track page elements reliably across website redesigns (automatch)
- Save the extracted results as `.md` files
## Fetcher Mode Selection Guide

| Mode | Fetcher Class | Best For |
|---|---|---|
| `http` (default) | `Fetcher` | Fast static pages, RSS, APIs |
| `async` | `AsyncFetcher` | Batches of 5+ static URLs in parallel |
| `stealth` | `StealthyFetcher` | Anti-bot sites, Cloudflare, fingerprint checks |
| `dynamic` | `PlayWrightFetcher` | Heavy SPAs, React/Vue/Angular apps |
**Decision rule:** Start with `http`. If you get a 403, a CAPTCHA, or an empty body, switch to `stealth`. If the content is rendered client-side (empty on first load), use `dynamic`. Use `async` when scraping many static URLs at once to save time.
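The decision rule can be expressed as a small fallback helper. This is an illustrative sketch only: the `next_mode` name, the 403/CAPTCHA check, and the empty-body heuristic are assumptions, not part of the script.

```python
def next_mode(current_mode: str, status: int, body: str) -> str:
    """Pick the fetcher mode to retry with, following the decision rule:
    http -> stealth on 403/CAPTCHA-style blocks, -> dynamic on empty bodies."""
    blocked = status == 403 or "captcha" in body.lower()
    if current_mode == "http" and blocked:
        return "stealth"
    if current_mode in ("http", "stealth") and not body.strip():
        return "dynamic"
    return current_mode  # keep the mode that worked

print(next_mode("http", 403, ""))              # blocked -> stealth
print(next_mode("http", 200, ""))              # empty body -> dynamic
print(next_mode("http", 200, "<html>ok</html>"))  # fine -> stay on http
```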
## Inputs

### URL sources

- `--url URL` — one target URL (repeat the flag for multiple: `--url A --url B`)
- `--url-file FILE` — plain-text file with one URL per line
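Parsing the `--url-file` format could look like the sketch below; skipping blank lines and surrounding whitespace is an assumption about how the script treats the file.

```python
def read_url_file(text: str) -> list[str]:
    """One URL per line; ignore blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(read_url_file("https://a.example\n\n  https://b.example\n"))
# → ['https://a.example', 'https://b.example']
```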
### Fetcher

- `--mode http|async|stealth|dynamic` — fetcher backend (default: `http`)
### Content extraction

- `--selector CSS` — CSS selector for the main content area (omit to capture the full page)
- `--preserve-links` — keep hyperlinks in the Markdown output
- `--output-dir DIR` — save per-page `.md` files and a master `index.json` here
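The conversion itself is handled by html2text. As a rough illustration of what `--preserve-links` controls, here is a minimal stdlib-only sketch; the class name and the handled tags are invented for the example, and the real script's output will differ.

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter illustrating the --preserve-links idea."""
    def __init__(self, preserve_links: bool = True):
        super().__init__()
        self.preserve_links = preserve_links
        self.out: list[str] = []
        self.href: str | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "a" and self.preserve_links:
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.preserve_links and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "p"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)

    def markdown(self) -> str:
        return "".join(self.out).strip()

conv = MiniMarkdown(preserve_links=True)
conv.feed('<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>')
print(conv.markdown())
```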
### AutoMatch — production resilience

- `--auto-save` — fingerprint and persist the selected elements to the local DB on the first run
- `--auto-match` — on subsequent runs, find elements by fingerprint even if the site layout has changed (no need to update the CSS selector)
### Browser options (stealth / dynamic only)

- `--headless true|false|virtual` — headless mode; `virtual` uses Xvfb (default: `true`)
- `--network-idle` — wait until there is no network activity for ≥500 ms before capturing
- `--block-images` — block image loading (saves bandwidth and proxy quota)
- `--disable-resources` — drop fonts/images/media/stylesheets for ~25% faster loads
- `--wait-selector CSS` — pause until this element appears in the DOM
- `--wait-selector-state attached|visible|detached|hidden` — element state to wait for (default: `attached`)
- `--timeout MS` — global timeout in ms (default: 30000)
- `--wait MS` — extra idle wait after page load, in ms
### StealthyFetcher extras (stealth mode only)

- `--humanize SECONDS` — simulate human-like cursor movement (maximum duration in seconds)
- `--geoip` — spoof the browser timezone, locale, language, and WebRTC IP from the proxy's geolocation
- `--block-webrtc` — prevent real-IP leaks via WebRTC
- `--disable-ads` — install uBlock Origin in the browser session
- `--proxy URL` — HTTP/SOCKS proxy as a URL string, or as JSON: `'{"server":"host:port","username":"u","password":"p"}'`
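The dual proxy format (URL string or JSON object) can be normalized with a small helper. This is a sketch of one way to do it; the `parse_proxy` name and the exact validation are assumptions about the script's behavior.

```python
import json

def parse_proxy(value: str) -> dict:
    """Accept either a proxy URL string or a JSON object string
    like '{"server":"host:port","username":"u","password":"p"}'."""
    try:
        obj = json.loads(value)
    except json.JSONDecodeError:
        return {"server": value}  # plain URL form
    if not isinstance(obj, dict) or "server" not in obj:
        raise ValueError("proxy JSON must contain a 'server' key")
    return obj

print(parse_proxy("http://user:pass@host:8080"))
print(parse_proxy('{"server":"host:8080","username":"u","password":"p"}'))
```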
### Reliability

- `--retry N` — retry failed requests up to N times with exponential backoff (capped at 30 s)
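Capped exponential backoff might produce delays like the sketch below; the 1-second base and absence of jitter are assumptions, since only the 30 s cap is documented.

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry attempt: base * 2**attempt, capped at `cap` seconds."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]

print(backoff_delays(6))  # → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```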
## Rules

- Only process public `http://` or `https://` pages.
- Never bypass login walls, CAPTCHAs, paywalls, or access controls.
- Prefer the main article or body content; avoid polluting the output with navigation, headers, footers, or cookie banners — use `--selector` to target the content area.
- When `--auto-save` is used, always also pass `--selector` so Scrapling knows which element fingerprint to record.
- On subsequent runs for layout-changed pages, use `--auto-match` instead of `--auto-save`. Do not use both flags at once.
- Use `--mode async` for batch jobs with 5+ static URLs for parallel execution.
- Combine `--disable-resources` with `--block-images` in stealth/dynamic mode when you only need text content — this can cut load times by up to 40%.
- Always inspect the top-level `ok` field and the per-result `ok` fields before using content.
- If `ok` is `false`, report the exact `error` string — do not invent or guess content.
- When `--network-idle` is insufficient, use `--wait-selector` with a specific DOM element to guarantee the content has loaded before capture.
## Command Patterns

### Basic static page

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"
```

### Static page — target a specific content area

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"
```

### Stealth mode — bypass anti-bot protection

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
```

### Stealth + proxy + human fingerprint (maximum stealth)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --proxy "http://user:pass@host:port" \
  --humanize 2.0 \
  --geoip \
  --block-webrtc \
  --network-idle
```

### Dynamic SPA page (Playwright Chromium)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode dynamic \
  --wait-selector ".product-list" \
  --network-idle \
  --disable-resources
```

### Async concurrent batch (multiple URLs)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --mode async \
  --url "<URL1>" --url "<URL2>" --url "<URL3>"
```

### Batch from file + stealth + save to disk

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url-file urls.txt \
  --mode stealth \
  --disable-resources \
  --output-dir outputs
```

### First-run automatch setup (save fingerprint)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-save \
  --output-dir outputs
```

### Subsequent run after a site layout change (adaptive match)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-match \
  --output-dir outputs
```

### Full production scrape

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --selector "main article" \
  --auto-match \
  --preserve-links \
  --network-idle \
  --disable-resources \
  --timeout 60000 \
  --retry 3 \
  --output-dir outputs
```
## Output Handling

JSON is printed to stdout. Always check `ok` before using the content.

Top-level fields:

- `ok` — `true` only if every URL succeeded
- `total` / `succeeded` / `failed` — count summary
- `results` — array of per-URL result objects
- `output_index_file` — path to the saved `index.json` (if `--output-dir` was used)

Per-URL result fields (when `ok: true`):

- `url` — the requested URL
- `status` — HTTP status code (e.g. `200`)
- `title` — page `<title>` text
- `markdown` — extracted content as Markdown ← use this as the main content
- `markdown_length` — character count (useful for quality checks)
- `output_markdown_file` — path to the saved `.md` file (if `--output-dir` was used)

On failure (`ok: false` in a result):

- `error` — exact error message; report this verbatim, do not invent content
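A consumer of this output could validate it along these lines. The field names follow the schema above; the `extract_markdown` helper and the sample payload are illustrative, not part of the script.

```python
import json

def extract_markdown(raw_json: str) -> tuple[dict, dict]:
    """Split scraper output into per-URL Markdown and per-URL errors,
    checking the per-result `ok` flag before trusting any content."""
    data = json.loads(raw_json)
    pages, errors = {}, {}
    for result in data.get("results", []):
        if result.get("ok"):
            pages[result["url"]] = result["markdown"]
        else:
            # Report the exact error string; never substitute invented content.
            errors[result["url"]] = result.get("error", "unknown error")
    return pages, errors

sample = (
    '{"ok": false, "results": ['
    '{"ok": true, "url": "https://a.example", "markdown": "# A"},'
    '{"ok": false, "url": "https://b.example", "error": "403 Forbidden"}]}'
)
pages, errors = extract_markdown(sample)
print(pages)   # {'https://a.example': '# A'}
print(errors)  # {'https://b.example': '403 Forbidden'}
```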
## Version tags

- latest

## Runtime requirements

- 🕸️ Clawdis
- Bins: `python3`
