webclaw

Other

Web extraction for LLMs and agents. Scrape, crawl, map, search, extract, summarize, diff, monitor, and research any URL into clean Markdown, text, or JSON, including pages that block bots or render with JavaScript. Use when you need reliable web content, the built-in web_fetch fails, or you need structured data from a page.

Install

openclaw skills install webclaw-extraction

webclaw

High-quality web extraction for LLMs and agents. Turns any URL into clean Markdown, text, or JSON, and handles pages that block bots or render content with JavaScript.

When to use this skill

  • Always when you need to fetch web content and want reliable results
  • When web_fetch returns empty/blocked content (403s, bot challenges)
  • When you need structured data extraction (pricing tables, product info)
  • When you need to crawl an entire site or discover all URLs
  • When you need to discover the API endpoints a page's JavaScript calls
  • When you need LLM-optimized content (cleaner than raw markdown)
  • When you need to summarize a page without reading the full content
  • When you need to detect content changes between visits
  • When you need brand identity analysis (colors, fonts, logos)
  • When you need web search results with optional page scraping
  • When you need deep multi-source research on a topic
  • When you need AI-guided scraping to accomplish a goal on a page
  • When you need to monitor a URL for changes over time

API base

All requests go to https://api.webclaw.io/v1/.

Authentication: Authorization: Bearer $WEBCLAW_API_KEY

Decision order — READ THIS FIRST

Before calling any endpoint, look at the URL. If it matches one of the 28 vertical extractor patterns below, use POST /v1/scrape/{vertical} immediately. Do not try /v1/scrape first, do not append .json, do not parse markdown. The vertical endpoint returns typed JSON with the fields you need in one call.

URL shapeUse this endpointReturns
reddit.com/r/*/comments/*/v1/scrape/reddit{post: {title, author, score, comment_count}, comments: [...]}
github.com/{owner}/{repo}/v1/scrape/github_reporepo metadata, stars, language, readme
github.com/*/pull/*/v1/scrape/github_prtitle, state, author, commits, reviews
github.com/*/issues/*/v1/scrape/github_issuetitle, state, labels, comments
github.com/*/releases/tag/*/v1/scrape/github_releasename, tag, assets, body
news.ycombinator.com/item?id=*/v1/scrape/hackernewsstory + threaded comments
pypi.org/project/*/v1/scrape/pypiname, version, description, downloads
npmjs.com/package/*/v1/scrape/npmname, version, deps, weekly downloads
crates.io/crates/*/v1/scrape/crates_ioname, version, dependents
huggingface.co/{owner}/{name}/v1/scrape/huggingface_modelmodel card, downloads, license
huggingface.co/datasets/*/v1/scrape/huggingface_datasetdataset card, rows, license
arxiv.org/abs/*/v1/scrape/arxivtitle, authors, abstract, pdf_url
hub.docker.com/_/*/v1/scrape/docker_hubimage tags, description, pulls
dev.to/*/*/v1/scrape/dev_totitle, author, body, tags, reactions
stackoverflow.com/questions/*/v1/scrape/stackoverflowquestion + answers + votes
{pub}.substack.com/p/*/v1/scrape/substack_posttitle, author, body
youtube.com/watch?v=* (or /shorts/*, youtu.be/*)/v1/scrape (auto-detected)full transcript + title, channel, channel_url, duration_seconds, view_count, like_count, tags, thumbnail. See "YouTube auto-detection" below.
linkedin.com/feed/update/*/v1/scrape/linkedin_postauthor, body, engagement
instagram.com/p/*/v1/scrape/instagram_postcaption, likes, user
instagram.com/{user}//v1/scrape/instagram_profilebio, followers, posts
amazon.com/dp/*/v1/scrape/amazon_producttitle, price, rating, review_count
ebay.com/itm/*/v1/scrape/ebay_listingtitle, price, condition, seller
etsy.com/listing/*/v1/scrape/etsy_listingtitle, price, shop, reviews
trustpilot.com/review/*/v1/scrape/trustpilot_reviewsaggregate + individual reviews
any {shop}/products/{handle} (Shopify)/v1/scrape/shopify_producttitle, variants, price, description
any {shop}/collections/{handle} (Shopify)/v1/scrape/shopify_collectionname, products
generic ecommerce product page/v1/scrape/ecommerce_productSchema.org product fields
WooCommerce {shop}/product/*/v1/scrape/woocommerce_producttitle, price, description

Call one of the above if the URL matches. Only fall back to /v1/scrape below when the URL does NOT match any vertical.

Example vertical call:

curl -X POST https://api.webclaw.io/v1/scrape/reddit \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.reddit.com/r/rust/comments/abc/title/"}'

Response: {"vertical":"reddit","url":"...","data":{"post":{...},"comments":[...]}}. Bills 1 credit.

Endpoints

1. Scrape — extract content from a single URL

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Request fields:

FieldTypeDefaultDescription
urlstringrequiredURL to scrape
formatsstring[]["markdown"]Output formats: markdown, text, llm, json
include_selectorsstring[][]CSS selectors to keep (e.g. ["article", ".content"])
exclude_selectorsstring[][]CSS selectors to remove (e.g. ["nav", "footer", ".ads"])
only_main_contentboolfalseExtract only the main article/content area
no_cacheboolfalseSkip cache, fetch fresh
max_cache_ageintserver defaultMax acceptable cache age in seconds

Response:

{
  "url": "https://example.com",
  "metadata": {
    "title": "Example",
    "description": "...",
    "language": "en",
    "word_count": 1234
  },
  "markdown": "# Page Title\n\nContent here...",
  "cache": { "status": "miss" }
}

Format options:

  • markdown — clean markdown, best for general use
  • text — plain text without formatting
  • llm — optimized for LLM consumption: includes page title, URL, and cleaned content with link references. Best for feeding to AI models.
  • json — full extraction result with all metadata

When bot-protection handling activates (automatic, no extra config):

{
  "antibot": {
    "bypass": true,
    "elapsed_ms": 3200
  }
}

YouTube auto-detection (automatic, no extra config):

Pass any youtube.com/watch, youtube.com/shorts/, or youtu.be/ URL and the response carries two extra fields alongside markdown / text / llm:

  • transcript (string) — full auto-caption text. null when the video has no captions.
  • youtube (object) — { video_id, title, description, channel, channel_url, uploader, upload_date, duration_seconds, view_count, like_count, thumbnail, tags, categories, language }.
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
{
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "metadata": { "title": "...", "image": "https://i.ytimg.com/..." },
  "youtube": {
    "video_id": "dQw4w9WgXcQ",
    "channel": "Rick Astley",
    "duration_seconds": 213,
    "view_count": 1490000000,
    "tags": ["rick astley", "..."]
  },
  "transcript": "We're no strangers to love...",
  "markdown": "# Never Gonna Give You Up\n\n**Channel:** Rick Astley\n\n## Description\n..."
}

Cold p50 latency 2-4s; cache hits ~150ms. 1 credit per call. Same fields available on the SDKs as result.youtube.* / result.transcript (Python, JS) and resp.YouTube.* / resp.Transcript (Go).

2. Crawl — scrape an entire website

Starts an async job. Poll for results.

Start crawl:

curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_depth": 3,
    "max_pages": 50,
    "use_sitemap": true
  }'

Response: { "job_id": "abc-123", "status": "running" }

Poll status:

curl https://api.webclaw.io/v1/crawl/abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "job_id": "abc-123",
  "status": "completed",
  "total": 47,
  "completed": 45,
  "errors": 2,
  "pages": [
    {
      "url": "https://docs.example.com/intro",
      "markdown": "# Introduction\n...",
      "metadata": { "title": "Intro", "word_count": 500 }
    }
  ]
}

Request fields:

FieldTypeDefaultDescription
urlstringrequiredStarting URL
max_depthint3How many links deep to follow
max_pagesint100Maximum pages to crawl
use_sitemapboolfalseSeed URLs from sitemap.xml
formatsstring[]["markdown"]Output formats per page
include_selectorsstring[][]CSS selectors to keep
exclude_selectorsstring[][]CSS selectors to remove
only_main_contentboolfalseMain content only

3. Map — discover all URLs on a site

Fast URL discovery without full content extraction.

curl -X POST https://api.webclaw.io/v1/map \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "count": 142,
  "urls": [
    "https://example.com/about",
    "https://example.com/pricing",
    "https://example.com/docs/intro"
  ]
}

4. Endpoints — discover API endpoints embedded in a page

Scans a page's inline JavaScript and <script src> bundles for API endpoints (relative paths, absolute URLs, GraphQL, WebSocket). This surfaces the request surface that map (sitemap only) cannot see — useful for reverse-engineering a site's backend before scraping or extracting. Heuristic v1: regex over fetched JS, single URL.

curl -X POST https://api.webclaw.io/v1/endpoints \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_third_party": false,
    "max_bundles": 20
  }'

Request fields:

FieldTypeDefaultDescription
urlstringrequiredPage URL to scan
include_third_partyboolfalseAlso report endpoints on hosts other than the target site
max_bundlesint20Max number of <script src> bundles to fetch and scan (capped at 20)

Response:

{
  "url": "https://example.com",
  "bundles_scanned": 8,
  "endpoint_count": 23,
  "endpoints": [
    { "value": "/api/v1/users", "kind": "relative_path", "first_party": true, "source": "inline" },
    { "value": "https://api.example.com/graphql", "kind": "absolute_url", "first_party": true, "source": "https://example.com/static/app.4f2c.js" },
    { "value": "query { viewer { id } }", "kind": "graph_ql", "first_party": true, "source": "https://example.com/static/app.4f2c.js" },
    { "value": "wss://realtime.example.com/socket", "kind": "web_socket", "first_party": true, "source": "inline" }
  ],
  "hosts": ["api.example.com", "realtime.example.com"],
  "truncated": false
}

Endpoint kind values: relative_path, absolute_url, graph_ql, web_socket.

source is either "inline" (found in the page's inline <script>) or the URL of the bundle the endpoint was extracted from. first_party is true when the endpoint host matches the target site; include_third_party: false filters the rest out. truncated is true when more bundles existed than max_bundles allowed.

Bills 2 credits per call.

5. Batch — scrape multiple URLs in parallel

curl -X POST https://api.webclaw.io/v1/batch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://a.com",
      "https://b.com",
      "https://c.com"
    ],
    "formats": ["markdown"],
    "concurrency": 5
  }'

Response:

{
  "total": 3,
  "completed": 3,
  "errors": 0,
  "results": [
    { "url": "https://a.com", "markdown": "...", "metadata": {} },
    { "url": "https://b.com", "markdown": "...", "metadata": {} },
    { "url": "https://c.com", "error": "timeout" }
  ]
}

6. Extract — LLM-powered structured extraction

Pull structured data from any page using a JSON schema or plain-text prompt.

With JSON schema:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "string" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'

With prompt:

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "prompt": "Extract all pricing tiers with names, monthly prices, and key features"
  }'

Response:

{
  "url": "https://example.com/pricing",
  "data": {
    "plans": [
      { "name": "Starter", "price": "$19/mo", "features": ["10k credits", "3 research runs"] },
      { "name": "Pro", "price": "$99/mo", "features": ["250k credits", "20 research runs"] }
    ]
  }
}

7. Summarize — get a quick summary of any page

curl -X POST https://api.webclaw.io/v1/summarize \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/long-article",
    "max_sentences": 3
  }'

Response:

{
  "url": "https://example.com/long-article",
  "summary": "The article discusses... Key findings include... The author concludes that..."
}

8. Diff — detect content changes

Compare current page content against a previous snapshot.

curl -X POST https://api.webclaw.io/v1/diff \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "previous": {
      "markdown": "# Old content...",
      "metadata": { "title": "Old Title" }
    }
  }'

Response:

{
  "url": "https://example.com",
  "status": "changed",
  "diff": "--- previous\n+++ current\n@@ -1 +1 @@\n-# Old content\n+# New content",
  "metadata_changes": [
    { "field": "title", "old": "Old Title", "new": "New Title" }
  ]
}

9. Brand — extract brand identity

Analyze a website's visual identity: colors, fonts, logo.

curl -X POST https://api.webclaw.io/v1/brand \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "brand": {
    "colors": [
      { "hex": "#FF6B35", "usage": "primary" },
      { "hex": "#1A1A2E", "usage": "background" }
    ],
    "fonts": ["Inter", "JetBrains Mono"],
    "logo_url": "https://example.com/logo.svg",
    "favicon_url": "https://example.com/favicon.ico"
  }
}

10. Vertical extractors — typed JSON for 28 sites

Site-specific extractors that return typed JSON instead of generic markdown. Use when the target URL is a GitHub PR, Reddit thread, Amazon product, YouTube video, PyPI/npm/crates package, HuggingFace model/dataset, ArXiv paper, Instagram profile, Shopify product, Etsy listing, Trustpilot reviews, or similar.

# List every extractor with its label and URL shape
curl https://api.webclaw.io/v1/extractors \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

# Run a specific extractor (GitHub PR)
curl -X POST https://api.webclaw.io/v1/scrape/github_pr \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://github.com/rust-lang/rust/pull/123456"}'

Response shape (extractor-specific data):

{
  "vertical": "github_pr",
  "url": "https://github.com/rust-lang/rust/pull/123456",
  "data": {
    "title": "...",
    "state": "open",
    "author": "...",
    "additions": 42,
    "deletions": 7,
    "commits": 3,
    "reviews": [ /* ... */ ]
  }
}

Full catalog (28): reddit, hackernews, github_repo, github_pr, github_issue, github_release, pypi, npm, crates_io, huggingface_model, huggingface_dataset, arxiv, docker_hub, dev_to, stackoverflow, substack_post, youtube_video, linkedin_post, instagram_post, instagram_profile, shopify_product, shopify_collection, ecommerce_product, woocommerce_product, amazon_product, ebay_listing, etsy_listing, trustpilot_reviews.

Most of these auto-dispatch from a plain POST /v1/scrape call (their URL patterns are distinctive). The generic-pattern ones (shopify_*, ecommerce_product, woocommerce_product, substack_post) require the explicit /v1/scrape/{vertical} route.

youtube_video is not a /v1/scrape/{vertical} route. YouTube watch / shorts / youtu.be URLs are handled by a short-circuit inside POST /v1/scrape that returns the transcript + youtube block described under "YouTube auto-detection" above. Do not POST to /v1/scrape/youtube_video.

Bills 1 credit per successful call.

11. Search — web search with optional scraping

Search the web and optionally scrape each result page.

curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best rust web frameworks 2026",
    "num_results": 5,
    "scrape": true,
    "formats": ["markdown"]
  }'

Request fields:

FieldTypeDefaultDescription
querystringrequiredSearch query
num_resultsint10Number of search results to return
scrapeboolfalseAlso scrape each result page for full content
formatsstring[]["markdown"]Output formats when scrape is true
countrystringnoneCountry code for localized results (e.g. "us", "de")
langstringnoneLanguage code for results (e.g. "en", "fr")

Response:

{
  "query": "best rust web frameworks 2026",
  "results": [
    {
      "title": "Top Rust Web Frameworks in 2026",
      "url": "https://blog.example.com/rust-frameworks",
      "snippet": "A comprehensive comparison of Axum, Actix, and Rocket...",
      "position": 1,
      "markdown": "# Top Rust Web Frameworks\n\n..."
    },
    {
      "title": "Choosing a Rust Backend Framework",
      "url": "https://dev.to/rust-backends",
      "snippet": "When starting a new Rust web project...",
      "position": 2,
      "markdown": "# Choosing a Rust Backend\n\n..."
    }
  ]
}

The markdown field on each result is only present when scrape: true. Without it, you get titles, URLs, snippets, and positions only.

12. Research — deep multi-source research

Starts an async research job that searches, scrapes, and synthesizes information across multiple sources. Poll for results.

Start research:

curl -X POST https://api.webclaw.io/v1/research \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
    "max_iterations": 5,
    "max_sources": 10,
    "topic": "security",
    "deep": true
  }'

Request fields:

FieldTypeDefaultDescription
querystringrequiredResearch question or topic
max_iterationsintserver defaultMaximum research iterations (search-read-analyze cycles)
max_sourcesintserver defaultMaximum number of sources to consult. Tier capped: Free 0, Starter 10, Pro 30, Scale 100
topicstringnoneTopic hint to guide search strategy (e.g. "security", "finance", "engineering")
deepboolfalseEnable deep research mode for more thorough analysis (costs 10 credits instead of 1)

Response: { "id": "res-abc-123", "status": "running" }

Poll results:

curl https://api.webclaw.io/v1/research/res-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response when complete:

{
  "id": "res-abc-123",
  "status": "completed",
  "query": "How does Cloudflare Turnstile work and what are its known bypass methods?",
  "report": "# Cloudflare Turnstile Analysis\n\n## Overview\nCloudflare Turnstile is a CAPTCHA replacement that...\n\n## How It Works\n...\n\n## Known Bypass Methods\n...",
  "sources": [
    { "url": "https://developers.cloudflare.com/turnstile/", "title": "Turnstile Documentation" },
    { "url": "https://blog.cloudflare.com/turnstile-ga/", "title": "Turnstile GA Announcement" }
  ],
  "findings": [
    "Turnstile uses browser environment signals and proof-of-work challenges",
    "Managed mode auto-selects challenge difficulty based on visitor risk score",
    "Known bypass approaches include instrumented browser automation"
  ],
  "iterations": 5,
  "elapsed_ms": 34200
}

Status values: running, completed, failed

13. Watch — monitor a URL for changes

Create persistent monitors that check a URL on a schedule and notify via webhook when content changes.

Create a monitor:

curl -X POST https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "interval": "0 */6 * * *",
    "webhook_url": "https://hooks.example.com/pricing-changed",
    "formats": ["markdown"]
  }'

Request fields:

FieldTypeDefaultDescription
urlstringrequiredURL to monitor
intervalstringrequiredCheck frequency as cron expression or seconds (e.g. "0 */6 * * *" or "3600")
webhook_urlstringnoneURL to POST when changes are detected
formatsstring[]["markdown"]Output formats for snapshots

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "webhook_url": "https://hooks.example.com/pricing-changed",
  "formats": ["markdown"],
  "created_at": "2026-03-20T10:00:00Z",
  "last_check": null,
  "status": "active"
}

List all monitors:

curl https://api.webclaw.io/v1/watch \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "monitors": [
    {
      "id": "watch-abc-123",
      "url": "https://example.com/pricing",
      "interval": "0 */6 * * *",
      "status": "active",
      "last_check": "2026-03-20T16:00:00Z",
      "checks": 4
    }
  ]
}

Get a monitor with snapshots:

curl https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Response:

{
  "id": "watch-abc-123",
  "url": "https://example.com/pricing",
  "interval": "0 */6 * * *",
  "status": "active",
  "snapshots": [
    {
      "checked_at": "2026-03-20T16:00:00Z",
      "status": "changed",
      "diff": "--- previous\n+++ current\n@@ -5 +5 @@\n-Pro: $99/mo\n+Pro: $119/mo"
    },
    {
      "checked_at": "2026-03-20T10:00:00Z",
      "status": "baseline"
    }
  ]
}

Trigger an immediate check:

curl -X POST https://api.webclaw.io/v1/watch/watch-abc-123/check \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Delete a monitor:

curl -X DELETE https://api.webclaw.io/v1/watch/watch-abc-123 \
  -H "Authorization: Bearer $WEBCLAW_API_KEY"

Firecrawl v2 compatibility layer

If you already have code written against the Firecrawl v2 API, point it at https://api.webclaw.io and use these drop-in endpoints instead of the native /v1/* ones. They accept Firecrawl-shaped requests and return Firecrawl-shaped responses, backed by the same webclaw extraction pipeline.

MethodPathMaps to
POST/v2/scrapesingle-page scrape
POST/v2/crawlstart async crawl
GET/v2/crawl/{id}crawl status/results
DELETE/v2/crawl/{id}cancel a crawl
POST/v2/mapURL discovery
POST/v2/searchweb search
curl -X POST https://api.webclaw.io/v2/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

Same auth (Authorization: Bearer $WEBCLAW_API_KEY) and the same credit billing as the native endpoints. Prefer the native /v1/* API for new code — the /v2 layer exists only for Firecrawl migration compatibility and does not expose webclaw-only features (vertical extractors, YouTube short-circuit, llm format, endpoints discovery, brand, diff, research, watch).

Choosing the right format

GoalFormatWhy
Read and understand a pagemarkdownClean structure, headings, links preserved
Feed content to an AI modelllmOptimized: includes title + URL header, clean link refs
Search or index contenttextPlain text, no formatting noise
Programmatic analysisjsonFull metadata, structured data, DOM statistics

Tips

  • Use llm format when passing content to yourself or another AI — it's specifically optimized for LLM consumption with better context framing.
  • Use only_main_content: true to skip navigation, sidebars, and footers. Reduces noise significantly.
  • Use include_selectors/exclude_selectors for fine-grained control when only_main_content isn't enough.
  • Batch over individual scrapes when fetching multiple URLs — it's faster and more efficient.
  • Use map before crawl to discover the site structure first, then crawl specific sections.
  • Use endpoints to reverse-engineer a site's backend — it pulls API paths, GraphQL, and WebSocket URLs out of the page's JS that map (sitemap only) can't see. Set include_third_party: true to also see analytics/3rd-party calls.
  • Use extract with a JSON schema for reliable structured output (e.g., pricing tables, product specs, contact info).
  • Bot-protection handling is automatic — no extra configuration needed. Works on sites that block bots and JS-rendered SPAs.
  • Use search with scrape: true to get full page content for each search result in one call instead of searching then scraping separately.
  • Use research for complex questions that need multiple sources — it handles the search-read-synthesize loop automatically. Enable deep: true for thorough analysis.
  • Use watch for ongoing monitoring — set up a cron schedule and a webhook to get notified when a page changes without polling manually.

Smart Fetch Architecture (local CLI wrapper only)

Scope: This section describes the bundled CLI wrapper scripts/webclaw.py only. It does not describe the HTTP API. Every call to https://api.webclaw.io/* (including /v2) bills credits — the API has no local-first path and no "free" tier of requests. The local-first / zero-credit behaviour below exists only because the wrapper runs a plain HTTP fetch on your own machine first and only calls the API when that fetch can't finish the job.

The scripts/webclaw.py wrapper uses a local-first approach: extraction runs in your own process for the common case, and the cloud API (api.webclaw.io) is only consulted — and only then billed — when a local fetch can't finish the job.

Decision tree (wrapper)

                                ┌───────────────────┐
  scripts/webclaw.py scrape ─▶  │  local HTTP fetch │
                                └─────┬─────────────┘
                                      │
                            ┌─────────┴──────────┐
                            ▼                    ▼
                      success, clean HTML    one of:
                      → local extract         • bot-protection page
                      → return (0 credits)    • JS-rendered SPA shell
                                              • network / DNS error
                                              │
                                              ▼
                                    WEBCLAW_API_KEY set?
                                    ┌─────────────┴─────────────┐
                                   yes                          no
                                    │                            │
                                    ▼                            ▼
                       cloud API (api.webclaw.io)        return best-effort
                       credits spent per request          local result + warn

What "~80% local" means in practice (wrapper)

  • Local path (free, zero credits — wrapper only): static HTML sites, public docs, product pages that server-render, news sites, blogs, anything on plain nginx/Apache/CDN. The wrapper does a basic stdlib text extraction here; quality is lower than the cloud webclaw-core pipeline but costs nothing.
  • Cloud fallback (credits charged): sites that block bots challenges, JS-rendered SPAs (React / Next.js / Vue shells where the HTML has no content until hydration), sites that require a browser TLS fingerprint or solved captcha. The wrapper fails cleanly on the local path and automatically retries against the cloud API.
  • No API key set: WEBCLAW_API_KEY is optional for the wrapper. Without it, the local path still works for the 80% — only bot-protected / JS-rendered sites surface a warning and return degraded content. (The HTTP API always requires a key.)

Practical guidance (wrapper)

  • If you're scraping public docs, blogs, reference material, or an API's HTML docs through the wrapper: local path is enough, don't worry about credits.
  • If the user asks for Amazon / LinkedIn / Twitter / Instagram / any aggressively-protected site: expect the wrapper's cloud fallback to kick in (credits billed).
  • If you get an empty / short result and the page looks bot-protected, re-run the wrapper with --cloud to skip the local attempt and go straight to the API.

vs web_fetch

webclawweb_fetch
Bot-protected sitesAutomaticFails (403)
JS-rendered pagesAutomaticReadability only
Output quality20-step optimization pipelineBasic HTML parsing
Structured extractionLLM-powered, schema-basedNone
CrawlingFull site crawl with sitemapSingle page only
CachingBuilt-in, configurable TTLPer-session
Rate limitingManaged server-sideClient responsibility

Use web_fetch for simple, fast lookups. Use webclaw when you need reliability, quality, or advanced features.