Install
openclaw skills install webclaw-extractionWeb extraction for LLMs and agents. Scrape, crawl, map, search, extract, summarize, diff, monitor, and research any URL into clean Markdown, text, or JSON, including pages that block bots or render with JavaScript. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data from a page.
openclaw skills install webclaw-extractionHigh-quality web extraction for LLMs and agents. Turns any URL into clean Markdown, text, or JSON, and handles pages that block bots or render content with JavaScript.
web_fetch returns empty/blocked content (403s, bot challenges)All requests go to https://api.webclaw.io/v1/.
Authentication: Authorization: Bearer $WEBCLAW_API_KEY
Before calling any endpoint, look at the URL. If it matches one of the 28 vertical extractor patterns below, use POST /v1/scrape/{vertical} immediately. Do not try /v1/scrape first, do not append .json, do not parse markdown. The vertical endpoint returns typed JSON with the fields you need in one call.
| URL shape | Use this endpoint | Returns |
|---|---|---|
reddit.com/r/*/comments/* | /v1/scrape/reddit | {post: {title, author, score, comment_count}, comments: [...]} |
github.com/{owner}/{repo} | /v1/scrape/github_repo | repo metadata, stars, language, readme |
github.com/*/pull/* | /v1/scrape/github_pr | title, state, author, commits, reviews |
github.com/*/issues/* | /v1/scrape/github_issue | title, state, labels, comments |
github.com/*/releases/tag/* | /v1/scrape/github_release | name, tag, assets, body |
news.ycombinator.com/item?id=* | /v1/scrape/hackernews | story + threaded comments |
pypi.org/project/* | /v1/scrape/pypi | name, version, description, downloads |
npmjs.com/package/* | /v1/scrape/npm | name, version, deps, weekly downloads |
crates.io/crates/* | /v1/scrape/crates_io | name, version, dependents |
huggingface.co/{owner}/{name} | /v1/scrape/huggingface_model | model card, downloads, license |
huggingface.co/datasets/* | /v1/scrape/huggingface_dataset | dataset card, rows, license |
arxiv.org/abs/* | /v1/scrape/arxiv | title, authors, abstract, pdf_url |
hub.docker.com/_/* | /v1/scrape/docker_hub | image tags, description, pulls |
dev.to/*/* | /v1/scrape/dev_to | title, author, body, tags, reactions |
stackoverflow.com/questions/* | /v1/scrape/stackoverflow | question + answers + votes |
{pub}.substack.com/p/* | /v1/scrape/substack_post | title, author, body |
youtube.com/watch?v=* (or /shorts/*, youtu.be/*) | /v1/scrape (auto-detected) | full transcript + title, channel, channel_url, duration_seconds, view_count, like_count, tags, thumbnail. See "YouTube auto-detection" below. |
linkedin.com/feed/update/* | /v1/scrape/linkedin_post | author, body, engagement |
instagram.com/p/* | /v1/scrape/instagram_post | caption, likes, user |
instagram.com/{user}/ | /v1/scrape/instagram_profile | bio, followers, posts |
amazon.com/dp/* | /v1/scrape/amazon_product | title, price, rating, review_count |
ebay.com/itm/* | /v1/scrape/ebay_listing | title, price, condition, seller |
etsy.com/listing/* | /v1/scrape/etsy_listing | title, price, shop, reviews |
trustpilot.com/review/* | /v1/scrape/trustpilot_reviews | aggregate + individual reviews |
any {shop}/products/{handle} (Shopify) | /v1/scrape/shopify_product | title, variants, price, description |
any {shop}/collections/{handle} (Shopify) | /v1/scrape/shopify_collection | name, products |
| generic ecommerce product page | /v1/scrape/ecommerce_product | Schema.org product fields |
WooCommerce {shop}/product/* | /v1/scrape/woocommerce_product | title, price, description |
Call one of the above if the URL matches. Only fall back to /v1/scrape below when the URL does NOT match any vertical.
Example vertical call:
curl -X POST https://api.webclaw.io/v1/scrape/reddit \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url":"https://www.reddit.com/r/rust/comments/abc/title/"}'
Response: {"vertical":"reddit","url":"...","data":{"post":{...},"comments":[...]}}. Bills 1 credit.
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"only_main_content": true
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | URL to scrape |
formats | string[] | ["markdown"] | Output formats: markdown, text, llm, json |
include_selectors | string[] | [] | CSS selectors to keep (e.g. ["article", ".content"]) |
exclude_selectors | string[] | [] | CSS selectors to remove (e.g. ["nav", "footer", ".ads"]) |
only_main_content | bool | false | Extract only the main article/content area |
no_cache | bool | false | Skip cache, fetch fresh |
max_cache_age | int | server default | Max acceptable cache age in seconds |
Response:
{
"url": "https://example.com",
"metadata": {
"title": "Example",
"description": "...",
"language": "en",
"word_count": 1234
},
"markdown": "# Page Title\n\nContent here...",
"cache": { "status": "miss" }
}
Format options:
markdown — clean markdown, best for general usetext — plain text without formattingllm — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.json — full extraction result with all metadataWhen bot-protection handling activates (automatic, no extra config):
{
"antibot": {
"bypass": true,
"elapsed_ms": 3200
}
}
YouTube auto-detection (automatic, no extra config):
Pass any youtube.com/watch, youtube.com/shorts/, or youtu.be/ URL and the response carries two extra fields alongside markdown / text / llm:
transcript (string) — full auto-caption text. null when the video has no captions.youtube (object) — { video_id, title, description, channel, channel_url, uploader, upload_date, duration_seconds, view_count, like_count, thumbnail, tags, categories, language }.curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
{
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"metadata": { "title": "...", "image": "https://i.ytimg.com/..." },
"youtube": {
"video_id": "dQw4w9WgXcQ",
"channel": "Rick Astley",
"duration_seconds": 213,
"view_count": 1490000000,
"tags": ["rick astley", "..."]
},
"transcript": "We're no strangers to love...",
"markdown": "# Never Gonna Give You Up\n\n**Channel:** Rick Astley\n\n## Description\n..."
}
Cold p50 latency 2-4s; cache hits ~150ms. 1 credit per call. Same fields available on the SDKs as result.youtube.* / result.transcript (Python, JS) and resp.YouTube.* / resp.Transcript (Go).
Starts an async job. Poll for results.
Start crawl:
curl -X POST https://api.webclaw.io/v1/crawl \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"max_depth": 3,
"max_pages": 50,
"use_sitemap": true
}'
Response: { "job_id": "abc-123", "status": "running" }
Poll status:
curl https://api.webclaw.io/v1/crawl/abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"job_id": "abc-123",
"status": "completed",
"total": 47,
"completed": 45,
"errors": 2,
"pages": [
{
"url": "https://docs.example.com/intro",
"markdown": "# Introduction\n...",
"metadata": { "title": "Intro", "word_count": 500 }
}
]
}
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | Starting URL |
max_depth | int | 3 | How many links deep to follow |
max_pages | int | 100 | Maximum pages to crawl |
use_sitemap | bool | false | Seed URLs from sitemap.xml |
formats | string[] | ["markdown"] | Output formats per page |
include_selectors | string[] | [] | CSS selectors to keep |
exclude_selectors | string[] | [] | CSS selectors to remove |
only_main_content | bool | false | Main content only |
Fast URL discovery without full content extraction.
curl -X POST https://api.webclaw.io/v1/map \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"count": 142,
"urls": [
"https://example.com/about",
"https://example.com/pricing",
"https://example.com/docs/intro"
]
}
Scans a page's inline JavaScript and <script src> bundles for API
endpoints (relative paths, absolute URLs, GraphQL, WebSocket). This
surfaces the request surface that map (sitemap only) cannot see —
useful for reverse-engineering a site's backend before scraping or
extracting. Heuristic v1: regex over fetched JS, single URL.
curl -X POST https://api.webclaw.io/v1/endpoints \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"include_third_party": false,
"max_bundles": 20
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | Page URL to scan |
include_third_party | bool | false | Also report endpoints on hosts other than the target site |
max_bundles | int | 20 | Max number of <script src> bundles to fetch and scan (capped at 20) |
Response:
{
"url": "https://example.com",
"bundles_scanned": 8,
"endpoint_count": 23,
"endpoints": [
{ "value": "/api/v1/users", "kind": "relative_path", "first_party": true, "source": "inline" },
{ "value": "https://api.example.com/graphql", "kind": "absolute_url", "first_party": true, "source": "https://example.com/static/app.4f2c.js" },
{ "value": "query { viewer { id } }", "kind": "graph_ql", "first_party": true, "source": "https://example.com/static/app.4f2c.js" },
{ "value": "wss://realtime.example.com/socket", "kind": "web_socket", "first_party": true, "source": "inline" }
],
"hosts": ["api.example.com", "realtime.example.com"],
"truncated": false
}
Endpoint kind values: relative_path, absolute_url, graph_ql, web_socket.
source is either "inline" (found in the page's inline <script>) or
the URL of the bundle the endpoint was extracted from. first_party is
true when the endpoint host matches the target site;
include_third_party: false filters the rest out. truncated is true
when more bundles existed than max_bundles allowed.
Bills 2 credits per call.
curl -X POST https://api.webclaw.io/v1/batch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://a.com",
"https://b.com",
"https://c.com"
],
"formats": ["markdown"],
"concurrency": 5
}'
Response:
{
"total": 3,
"completed": 3,
"errors": 0,
"results": [
{ "url": "https://a.com", "markdown": "...", "metadata": {} },
{ "url": "https://b.com", "markdown": "...", "metadata": {} },
{ "url": "https://c.com", "error": "timeout" }
]
}
Pull structured data from any page using a JSON schema or plain-text prompt.
With JSON schema:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"schema": {
"type": "object",
"properties": {
"plans": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "string" },
"features": { "type": "array", "items": { "type": "string" } }
}
}
}
}
}
}'
With prompt:
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"prompt": "Extract all pricing tiers with names, monthly prices, and key features"
}'
Response:
{
"url": "https://example.com/pricing",
"data": {
"plans": [
{ "name": "Starter", "price": "$19/mo", "features": ["10k credits", "3 research runs"] },
{ "name": "Pro", "price": "$99/mo", "features": ["250k credits", "20 research runs"] }
]
}
}
curl -X POST https://api.webclaw.io/v1/summarize \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/long-article",
"max_sentences": 3
}'
Response:
{
"url": "https://example.com/long-article",
"summary": "The article discusses... Key findings include... The author concludes that..."
}
Compare current page content against a previous snapshot.
curl -X POST https://api.webclaw.io/v1/diff \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"previous": {
"markdown": "# Old content...",
"metadata": { "title": "Old Title" }
}
}'
Response:
{
"url": "https://example.com",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
"metadata_changes": [
{ "field": "title", "old": "Old Title", "new": "New Title" }
]
}
Analyze a website's visual identity: colors, fonts, logo.
curl -X POST https://api.webclaw.io/v1/brand \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"url": "https://example.com",
"brand": {
"colors": [
{ "hex": "#FF6B35", "usage": "primary" },
{ "hex": "#1A1A2E", "usage": "background" }
],
"fonts": ["Inter", "JetBrains Mono"],
"logo_url": "https://example.com/logo.svg",
"favicon_url": "https://example.com/favicon.ico"
}
}
Site-specific extractors that return typed JSON instead of generic markdown. Use when the target URL is a GitHub PR, Reddit thread, Amazon product, YouTube video, PyPI/npm/crates package, HuggingFace model/dataset, ArXiv paper, Instagram profile, Shopify product, Etsy listing, Trustpilot reviews, or similar.
# List every extractor with its label and URL shape
curl https://api.webclaw.io/v1/extractors \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
# Run a specific extractor (GitHub PR)
curl -X POST https://api.webclaw.io/v1/scrape/github_pr \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://github.com/rust-lang/rust/pull/123456"}'
Response shape (extractor-specific data):
{
"vertical": "github_pr",
"url": "https://github.com/rust-lang/rust/pull/123456",
"data": {
"title": "...",
"state": "open",
"author": "...",
"additions": 42,
"deletions": 7,
"commits": 3,
"reviews": [ /* ... */ ]
}
}
Full catalog (28): reddit, hackernews, github_repo, github_pr, github_issue, github_release, pypi, npm, crates_io, huggingface_model, huggingface_dataset, arxiv, docker_hub, dev_to, stackoverflow, substack_post, youtube_video, linkedin_post, instagram_post, instagram_profile, shopify_product, shopify_collection, ecommerce_product, woocommerce_product, amazon_product, ebay_listing, etsy_listing, trustpilot_reviews.
Most of these auto-dispatch from a plain POST /v1/scrape call (their URL patterns are distinctive). The generic-pattern ones (shopify_*, ecommerce_product, woocommerce_product, substack_post) require the explicit /v1/scrape/{vertical} route.
youtube_video is not a /v1/scrape/{vertical} route. YouTube watch / shorts / youtu.be URLs are handled by a short-circuit inside POST /v1/scrape that returns the transcript + youtube block described under "YouTube auto-detection" above. Do not POST to /v1/scrape/youtube_video.
Bills 1 credit per successful call.
Search the web and optionally scrape each result page.
curl -X POST https://api.webclaw.io/v1/search \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "best rust web frameworks 2026",
"num_results": 5,
"scrape": true,
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
query | string | required | Search query |
num_results | int | 10 | Number of search results to return |
scrape | bool | false | Also scrape each result page for full content |
formats | string[] | ["markdown"] | Output formats when scrape is true |
country | string | none | Country code for localized results (e.g. "us", "de") |
lang | string | none | Language code for results (e.g. "en", "fr") |
Response:
{
"query": "best rust web frameworks 2026",
"results": [
{
"title": "Top Rust Web Frameworks in 2026",
"url": "https://blog.example.com/rust-frameworks",
"snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
"position": 1,
"markdown": "# Top Rust Web Frameworks\n\n..."
},
{
"title": "Choosing a Rust Backend Framework",
"url": "https://dev.to/rust-backends",
"snippet": "When starting a new Rust web project...",
"position": 2,
"markdown": "# Choosing a Rust Backend\n\n..."
}
]
}
The markdown field on each result is only present when scrape: true. Without it, you get titles, URLs, snippets, and positions only.
Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.
Start research:
curl -X POST https://api.webclaw.io/v1/research \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"max_iterations": 5,
"max_sources": 10,
"topic": "security",
"deep": true
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
query | string | required | Research question or topic |
max_iterations | int | server default | Maximum research iterations (search-read-analyze cycles) |
max_sources | int | server default | Maximum number of sources to consult. Tier capped: Free 0, Starter 10, Pro 30, Scale 100 |
topic | string | none | Topic hint to guide search strategy (e.g. "security", "finance", "engineering") |
deep | bool | false | Enable deep research mode for more thorough analysis (costs 10 credits instead of 1) |
Response: { "id": "res-abc-123", "status": "running" }
Poll results:
curl https://api.webclaw.io/v1/research/res-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response when complete:
{
"id": "res-abc-123",
"status": "completed",
"query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
"report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
"sources": [
{ "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
{ "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
],
"findings": [
"Turnstile uses browser environment signals and proof-of-work challenges",
"Managed mode auto-selects challenge difficulty based on visitor risk score",
"Known bypass approaches include instrumented browser automation"
],
"iterations": 5,
"elapsed_ms": 34200
}
Status values: running, completed, failed
Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.
Create a monitor:
curl -X POST https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"]
}'
Request fields:
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | URL to monitor |
interval | string | required | Check frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600") |
webhook_url | string | none | URL to POST when changes are detected |
formats | string[] | ["markdown"] | Output formats for snapshots |
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"webhook_url": "https://hooks.example.com/pricing-changed",
"formats": ["markdown"],
"created_at": "2026-03-20T10:00:00Z",
"last_check": null,
"status": "active"
}
List all monitors:
curl https://api.webclaw.io/v1/watch \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"monitors": [
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"last_check": "2026-03-20T16:00:00Z",
"checks": 4
}
]
}
Get a monitor with snapshots:
curl https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Response:
{
"id": "watch-abc-123",
"url": "https://example.com/pricing",
"interval": "0 */6 * * *",
"status": "active",
"snapshots": [
{
"checked_at": "2026-03-20T16:00:00Z",
"status": "changed",
"diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
},
{
"checked_at": "2026-03-20T10:00:00Z",
"status": "baseline"
}
]
}
Trigger an immediate check:
curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
Delete a monitor:
curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
-H "Authorization: Bearer $WEBCLAW_API_KEY"
If you already have code written against the Firecrawl v2 API, point it at
https://api.webclaw.io and use these drop-in endpoints instead of the
native /v1/* ones. They accept Firecrawl-shaped requests and return
Firecrawl-shaped responses, backed by the same webclaw extraction pipeline.
| Method | Path | Maps to |
|---|---|---|
| POST | /v2/scrape | single-page scrape |
| POST | /v2/crawl | start async crawl |
| GET | /v2/crawl/{id} | crawl status/results |
| DELETE | /v2/crawl/{id} | cancel a crawl |
| POST | /v2/map | URL discovery |
| POST | /v2/search | web search |
curl -X POST https://api.webclaw.io/v2/scrape \
-H "Authorization: Bearer $WEBCLAW_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'
Same auth (Authorization: Bearer $WEBCLAW_API_KEY) and the same credit
billing as the native endpoints. Prefer the native /v1/* API for new
code — the /v2 layer exists only for Firecrawl migration compatibility
and does not expose webclaw-only features (vertical extractors, YouTube
short-circuit, llm format, endpoints discovery, brand, diff, research,
watch).
| Goal | Format | Why |
|---|---|---|
| Read and understand a page | markdown | Clean structure, headings, links preserved |
| Feed content to an AI model | llm | Optimized: includes title + URL header, clean link refs |
| Search or index content | text | Plain text, no formatting noise |
| Programmatic analysis | json | Full metadata, structured data, DOM statistics |
llm format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.include_selectors/exclude_selectors for fine-grained control when only_main_content isn't enough.map before crawl to discover the site structure first, then crawl specific sections.endpoints to reverse-engineer a site's backend — it pulls API paths, GraphQL, and WebSocket URLs out of the page's JS that map (sitemap only) can't see. Set include_third_party: true to also see analytics/3rd-party calls.extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.research for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.watch for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.Scope: This section describes the bundled CLI wrapper
scripts/webclaw.pyonly. It does not describe the HTTP API. Every call tohttps://api.webclaw.io/*(including/v2) bills credits — the API has no local-first path and no "free" tier of requests. The local-first / zero-credit behaviour below exists only because the wrapper runs a plain HTTP fetch on your own machine first and only calls the API when that fetch can't finish the job.
The scripts/webclaw.py wrapper uses a local-first approach:
extraction runs in your own process for the common case, and the cloud
API (api.webclaw.io) is only consulted — and only then billed — when
a local fetch can't finish the job.
┌───────────────────┐
scripts/webclaw.py scrape ─▶ │ local HTTP fetch │
└─────┬─────────────┘
│
┌─────────┴──────────┐
▼ ▼
success, clean HTML one of:
→ local extract • bot-protection page
→ return (0 credits) • JS-rendered SPA shell
• network / DNS error
│
▼
WEBCLAW_API_KEY set?
┌─────────────┴─────────────┐
yes no
│ │
▼ ▼
cloud API (api.webclaw.io) return best-effort
credits spent per request local result + warn
webclaw-core
pipeline but costs nothing.WEBCLAW_API_KEY is optional for the wrapper.
Without it, the local path still works for the 80% — only
bot-protected / JS-rendered sites surface a warning and return
degraded content. (The HTTP API always requires a key.)--cloud to skip the local attempt and go
straight to the API.| webclaw | web_fetch | |
|---|---|---|
| Bot-protected sites | Automatic | Fails (403) |
| JS-rendered pages | Automatic | Readability only |
| Output quality | 20-step optimization pipeline | Basic HTML parsing |
| Structured extraction | LLM-powered, schema-based | None |
| Crawling | Full site crawl with sitemap | Single page only |
| Caching | Built-in, configurable TTL | Per-session |
| Rate limiting | Managed server-side | Client responsibility |
Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.