Install
openclaw skills install @abhishekj9621/web-scraper-skill
Use this skill to scrape, crawl, or extract data from websites using the Apify or Firecrawl APIs. Trigger it whenever the user wants to: scrape a URL, crawl a website, extract structured data from web pages, run an Apify Actor, batch scrape multiple URLs, search and scrape the web, map a site's URLs, collect product/price/review data, or build any web data pipeline. If the user says things like "scrape this site", "get data from this URL", "crawl this website", "run an Apify actor", "use Firecrawl", "extract content from a page", "pull data from the web", or mentions any web data extraction task, always use this skill. Also use it when the user wants to choose between Apify and Firecrawl.
This skill helps Openclaw scrape and extract data from websites using two APIs: Firecrawl and Apify. The table below lists the recommended tool for each use case.
| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | Firecrawl /scrape |
| Crawl an entire website (follow links) | Firecrawl /crawl |
| Map all URLs on a site | Firecrawl /map |
| Search web + scrape results | Firecrawl /search |
| Scrape Instagram / TikTok / Twitter | Apify (social actors) |
| Scrape Google Maps / reviews | Apify (compass/crawler-google-places) |
| Scrape Amazon products | Apify (apify/amazon-scraper) |
| Scrape Google Search results | Apify (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | Apify |
Both APIs require API keys passed via headers. Always ask the user for their key if not provided.
Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in URL)
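As a rough illustration of the two header formats above, the helpers below build request headers for each API and fall back to an environment variable when no key is passed. The function names and the idea of reading the key from an environment variable are assumptions made for this sketch, not part of either API.

```python
import os
from typing import Dict, Optional


def resolve_key(provided: Optional[str], env_var: str) -> str:
    """Return the key the user supplied, else the value of env_var.

    A ValueError here is the cue to ask the user for their key.
    """
    key = provided or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key provided and {env_var} is not set")
    return key


def firecrawl_headers(api_key: str) -> Dict[str, str]:
    # Firecrawl keys are sent as a Bearer token.
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}


def apify_headers(token: str) -> Dict[str, str]:
    # Apify accepts the same Bearer scheme (or ?token=... in the URL).
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Sending the Apify token in the header rather than the query string keeps it out of logged URLs.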
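Base URL: https://api.firecrawl.dev/v2 (the endpoint paths below already include the /v2 prefix, so the full scrape URL is https://api.firecrawl.dev/v2/scrape)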
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json
{
"url": "https://example.com",
"formats": ["markdown"], // Options: markdown, html, rawHtml, links, screenshot, json
"onlyMainContent": true, // Strips nav/footer/ads
"waitFor": 0, // ms to wait before scraping (for JS-heavy pages)
"timeout": 30000, // ms
"blockAds": true,
"proxy": "auto" // "auto", "basic", or "stealth"
}
Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }
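The scrape call above could be issued from Python roughly as follows. This is a minimal sketch using only the standard library; the function names are hypothetical, and only a subset of the options shown above is sent.

```python
import json
import urllib.request

FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(api_key: str, url: str, wait_ms: int = 0) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/scrape request."""
    payload = {
        "url": url,
        "formats": ["markdown"],
        "onlyMainContent": True,  # strip nav/footer/ads
        "waitFor": wait_ms,       # raise this for JS-heavy pages
    }
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_markdown(response: dict) -> str:
    """Pull the markdown out of a scrape response, failing loudly on errors."""
    if not response.get("success"):
        raise RuntimeError(f"Firecrawl scrape failed: {response}")
    return response["data"]["markdown"]


def scrape(api_key: str, url: str) -> str:
    """Send the request; this performs a real network call."""
    request = build_scrape_request(api_key, url)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_markdown(json.load(resp))
```

Keeping request construction separate from sending makes the payload easy to inspect before any credits are spent.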
Crawling is asynchronous: the request starts a job, and you then poll for results.
POST /v2/crawl
{
"url": "https://docs.example.com",
"limit": 50, // Max pages
"maxDepth": 3,
"allowExternalLinks": false,
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
Response: { "success": true, "id": "crawl-job-id" }
Poll status:
GET /v2/crawl/{crawl-job-id}
Response: { "status": "completed", "total": 50, "data": [...] }
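The start-then-poll pattern above might look like this in Python. It is a sketch under stated assumptions: `fetch_status` is a hypothetical callable that returns the parsed JSON of GET /v2/crawl/{id}, the polling interval and poll cap are arbitrary, and the failure status names are assumed rather than taken from the text above.

```python
import time
from typing import Callable


def wait_for_crawl(
    job_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Poll a crawl job until it completes and return its data list."""
    for _ in range(max_polls):
        body = fetch_status(job_id)
        status = body.get("status")
        if status == "completed":
            return body.get("data", [])
        if status in ("failed", "cancelled"):  # assumed failure states
            raise RuntimeError(f"Crawl {job_id} ended with status {status!r}")
        sleep(interval_s)
    raise TimeoutError(f"Crawl {job_id} did not complete after {max_polls} polls")
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely.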
POST /v2/map
{ "url": "https://example.com" }
Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }
POST /v2/search
{
"query": "best web scraping tools 2025",
"limit": 5,
"scrapeOptions": { "formats": ["markdown"] }
}
Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }
POST /v2/batch/scrape
{
"urls": ["https://a.com", "https://b.com"],
"formats": ["markdown"]
}
Returns a job ID; poll with GET /v2/batch/scrape/{id}
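One way the map and batch scrape endpoints can be combined is to turn a /v2/map response into batch scrape request bodies. The sketch below assumes the response shape shown above; the helper names and the batch size of 100 are illustrative choices, not documented limits.

```python
from typing import Dict, Iterator, List


def urls_from_map(map_response: dict) -> List[str]:
    """Extract page URLs from a /v2/map response, dropping duplicates in order."""
    if not map_response.get("success"):
        raise RuntimeError(f"Firecrawl map failed: {map_response}")
    seen: Dict[str, None] = {}
    for link in map_response.get("links", []):
        seen.setdefault(link["url"], None)
    return list(seen)


def batch_payloads(urls: List[str], batch_size: int = 100) -> Iterator[dict]:
    """Yield /v2/batch/scrape request bodies, batch_size URLs at a time."""
    for start in range(0, len(urls), batch_size):
        yield {"urls": urls[start:start + batch_size], "formats": ["markdown"]}
```

Each yielded body would be sent as a separate POST /v2/batch/scrape, producing one job ID per batch to poll.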
Base URL: https://api.apify.com/v2 (the endpoint paths below already include the /v2 prefix)
Auth: Pass token as query param ?token=YOUR_TOKEN or in Authorization header.
Apify runs "Actors" (pre-built scrapers). The flow is:
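1. Start an Actor run, which returns a runId and defaultDatasetId
2. Poll the run status until it is SUCCEEDED
3. Fetch the results from the run's dataset
POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN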
Content-Type: application/json
{ ...actor-specific input... }
Response:
{
"data": {
"id": "RUN_ID",
"status": "RUNNING",
"defaultDatasetId": "DATASET_ID"
}
}
Common Actor IDs:
apify/web-scraper: generic JS scraper
apify/google-search-scraper: Google SERPs
compass/crawler-google-places: Google Maps
apify/instagram-scraper: Instagram
clockworks/free-tiktok-scraper: TikTok
apify/amazon-scraper: Amazon products
In URL paths, write the slash in an actor ID as a tilde (for example, apify~web-scraper).
GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
Poll until status reaches a terminal value: SUCCEEDED, FAILED, ABORTED, or TIMED-OUT. Recommended interval: 5 seconds.
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
Optional params: format (json/csv/xlsx/xml), limit, offset
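The three Apify steps (start a run, poll its status, fetch the dataset items) can be sketched as a single function. The `http` parameter is a hypothetical callable `(method, url, json_body) -> parsed JSON` that you would back with your HTTP client of choice; the poll cap is an arbitrary assumption. The sketch writes the actor ID with a tilde in URL paths and treats ABORTED and TIMED-OUT as terminal alongside SUCCEEDED and FAILED.

```python
import time
from typing import Any, Callable
from urllib.parse import urlencode

APIFY_BASE = "https://api.apify.com/v2"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

# http(method, url, json_body) -> parsed JSON response (hypothetical signature)
Http = Callable[[str, str, Any], Any]


def run_actor(actor_id: str, actor_input: dict, token: str, http: Http,
              interval_s: float = 5.0, max_polls: int = 120,
              sleep: Callable[[float], None] = time.sleep) -> list:
    """Start an Actor run, wait for it to finish, and return its dataset items."""
    actor = actor_id.replace("/", "~")  # apify/web-scraper -> apify~web-scraper
    auth = urlencode({"token": token})

    # 1. Start the run.
    run = http("POST", f"{APIFY_BASE}/acts/{actor}/runs?{auth}", actor_input)["data"]
    run_id, dataset_id, status = run["id"], run["defaultDatasetId"], run["status"]

    # 2. Poll until the run reaches a terminal status.
    polls = 0
    while status not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise TimeoutError(f"Run {run_id} still {status} after {max_polls} polls")
        sleep(interval_s)
        status = http("GET", f"{APIFY_BASE}/acts/{actor}/runs/{run_id}?{auth}", None)["data"]["status"]
        polls += 1
    if status != "SUCCEEDED":
        raise RuntimeError(f"Run {run_id} ended with status {status}")

    # 3. Fetch the results from the run's default dataset.
    query = urlencode({"token": token, "format": "json"})
    return http("GET", f"{APIFY_BASE}/datasets/{dataset_id}/items?{query}", None)
```

Injecting `http` and `sleep` keeps the flow testable without network access or real delays.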
For short runs, use the sync endpoint — it waits and returns dataset items directly:
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json
{ ...actor input... }
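A sketch of calling the sync endpoint with the standard library follows. The function names and the 300-second client timeout are assumptions; the token is sent in the Authorization header rather than the query string, and the actor ID is written with a tilde in the URL path.

```python
import json
import urllib.request


def build_sync_run_request(actor_id: str, actor_input: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a run-sync-get-dataset-items request."""
    actor = actor_id.replace("/", "~")  # tilde form for URL paths
    return urllib.request.Request(
        f"https://api.apify.com/v2/acts/{actor}/run-sync-get-dataset-items",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_sync(actor_id: str, actor_input: dict, token: str, timeout_s: float = 300.0) -> list:
    """Send the request and return the dataset items; performs a real network call."""
    request = build_sync_run_request(actor_id, actor_input, token)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)
```

Because the connection stays open for the whole run, a generous client timeout matters here; for long runs the start-and-poll flow is the safer choice.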
Google Search Scraper:
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
Google Maps Scraper:
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
Web Scraper (generic):
{
"startUrls": [{ "url": "https://example.com" }],
"pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
"maxPagesPerCrawl": 10
}
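Because pageFunction is JavaScript embedded in a JSON string, it can be easier to keep it as a multi-line Python string and let `json.dumps` handle the escaping. A minimal sketch, with a hypothetical helper name, that builds the same Web Scraper input as above:

```python
import json
from typing import List

# The JavaScript page function, kept readable as a multi-line string.
PAGE_FUNCTION = """
async function pageFunction(context) {
    const $ = context.jQuery;
    return { title: $('title').text() };
}
""".strip()


def web_scraper_input(start_urls: List[str], max_pages: int = 10) -> dict:
    """Build the input object for the generic Web Scraper actor."""
    return {
        "startUrls": [{"url": u} for u in start_urls],
        "pageFunction": PAGE_FUNCTION,
        "maxPagesPerCrawl": max_pages,
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent as the actor input.
    print(json.dumps(web_scraper_input(["https://example.com"]), indent=2))
```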
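Actor results are stored in the run's dataset; fetch them with GET /v2/datasets/{id}/items.
See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.
- If an Apify run fails, check its log: GET /v2/acts/{id}/runs/{runId}/log
- Check for success: false in Firecrawl responses
- robots.txt is respected by default
- For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
- Test with a small limit before scaling
- Use onlyMainContent: true in Firecrawl to remove nav/footer noise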