Web Scraper Skill

v1.0.0

Use this skill to scrape, crawl, or extract data from websites using the Apify or Firecrawl APIs. Trigger whenever the user wants to: scrape a URL, crawl a website...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for abhishekj9621/web-scraper-skill.

Prompt preview (Install & Setup):
Install the skill "Web Scraper Skill" (abhishekj9621/web-scraper-skill) from ClawHub.
Skill page: https://clawhub.ai/abhishekj9621/web-scraper-skill
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install web-scraper-skill

ClawHub CLI


npx clawhub@latest install web-scraper-skill
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name and description match the SKILL.md and code templates: the skill is explicitly for driving Apify and Firecrawl APIs to scrape/crawl web pages. It does not request unrelated binaries, config paths, or credentials in the registry metadata.
Instruction Scope
Runtime instructions and templates focus on calling Firecrawl and Apify endpoints, starting/polling jobs, and saving results. They include concrete HTTP examples and code templates that post URLs and receive scraped content. The SKILL.md does not warn about sending potentially sensitive/private site content to third-party APIs (data exfiltration risk), and it instructs the agent to always use this skill for scraping tasks — which could cause automatic outbound requests if combined with autonomous invocation.
Install Mechanism
Instruction-only skill (no install spec, no code files executed on install). This is the lowest-risk install model — nothing is pulled or executed at install time.
Credentials
The registry lists no required environment variables, but the SKILL.md clearly expects the user to provide Firecrawl and Apify API keys at runtime. That is proportionate to the purpose (those APIs require keys), but the skill does not declare or manage any required env vars up-front, which could be confusing. It does not request unrelated credentials.
Persistence & Privilege
always:false (normal). disable-model-invocation:false allows autonomous invocation (platform default). This is expected for a skill that performs network calls, but users should be aware that the agent could call the external APIs without additional prompts when it decides scraping is required.
Assessment
This skill appears to do what it claims (drive Apify and Firecrawl). Before installing or using it, consider: (1) the agent will send URLs and scraped page content to third-party services — do not send private, behind-auth, or sensitive pages unless you trust those services and have permission; (2) the skill will ask you for API keys at runtime — treat those keys like secrets (use least-privilege tokens, rotate/revoke if needed, and avoid sharing long-lived account-wide keys); (3) if you don't want the agent to autonomously make scraping requests, restrict its ability to invoke the skill or require explicit confirmation before running; and (4) verify legal/robots/terms-of-service constraints for sites you plan to scrape.

Like a lobster shell, security has layers — review code before you run it.

latest: vk9782r5q0vfpt7a5b5f5f6eawd8426e7
126 downloads · 0 stars · 1 version
Updated 3w ago
v1.0.0 · MIT-0

Web Scraper Skill (Apify + Firecrawl)

This skill helps OpenClaw scrape and extract data from websites using two powerful APIs:

  • Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
  • Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

Quick Decision Guide: Apify vs Firecrawl

Use Case                                      Recommended Tool
Scrape a single page into markdown/JSON       Firecrawl /scrape
Crawl an entire website (follow links)        Firecrawl /crawl
Map all URLs on a site                        Firecrawl /map
Search web + scrape results                   Firecrawl /search
Scrape Instagram / TikTok / Twitter           Apify (social actors)
Scrape Google Maps / reviews                  Apify (compass/crawler-google-places)
Scrape Amazon products                        Apify (apify/amazon-scraper)
Scrape Google Search results                  Apify (apify/google-search-scraper)
Custom actor / any Apify Store actor          Apify

Authentication

Both APIs require API keys passed via headers. Always ask the user for their key if not provided.

Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in the URL)


Firecrawl API Reference

Base URL: https://api.firecrawl.dev/v2

1. Scrape a Single Page

POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}

Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }

2. Crawl an Entire Website

Crawling is async: you start a job, then poll for results.

POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

Response: { "success": true, "id": "crawl-job-id" }

Poll status:

GET /v2/crawl/{crawl-job-id}

Response: { "status": "completed", "total": 50, "data": [...] }

3. Map a Website's URLs

POST /v2/map
{ "url": "https://example.com" }

Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }

4. Search + Scrape in One Call

POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}

Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }

5. Batch Scrape Multiple URLs

POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}

Returns a job ID; poll with GET /v2/batch/scrape/{id}


Apify API Reference

Base URL: https://api.apify.com/v2
Auth: pass the token as a query param (?token=YOUR_TOKEN) or in the Authorization header.

Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

  1. Start a run → get a runId and defaultDatasetId
  2. Poll status until SUCCEEDED
  3. Fetch results from the dataset

1. Run an Actor (Async)

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }

Response:

{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}

Common Actor IDs:

  • apify/web-scraper — generic JS scraper
  • apify/google-search-scraper — Google SERPs
  • compass/crawler-google-places — Google Maps
  • apify/instagram-scraper — Instagram
  • clockworks/free-tiktok-scraper — TikTok
  • apify/amazon-scraper — Amazon products

2. Poll Run Status

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN

Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.

3. Fetch Results

GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json

Optional params: format (json/csv/xlsx/xml), limit, offset

4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }

Common Actor Inputs

Google Search Scraper:

{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }

Google Maps Scraper:

{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }

Web Scraper (generic):

{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}

Output Handling

  • Firecrawl returns data directly in the response (or via polling for crawl/batch).
  • Apify stores results in a dataset; retrieve with GET /v2/datasets/{id}/items.
  • Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
  • Apify also supports CSV, XLSX, XML output formats.

Code Templates

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.


Error Handling

  • Firecrawl 402 → out of credits; user needs to upgrade plan
  • Firecrawl 429 → rate limited; add delays between requests
  • Apify FAILED run → check run logs via GET /v2/acts/{id}/runs/{runId}/log
  • Always wrap API calls in try/catch and check success: false in Firecrawl responses (see the sketch after this list)
  • Firecrawl crawls respect robots.txt by default
  • For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)

Best Practices

  1. Start small — test with 1 URL or a small limit before scaling
  2. Use onlyMainContent: true in Firecrawl to remove nav/footer noise
  3. Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
  4. Store API keys securely — never hardcode them; use environment variables
  5. Check rate limits — Firecrawl: varies by plan; Apify: 250k requests/min global
  6. Prefer Firecrawl for LLM pipelines — markdown output is clean and ready for RAG/AI
  7. Prefer Apify for social/structured data — specialized actors handle anti-bot better
