Web Scraper Skill

v1.0.0

Use this skill to scrape, crawl, or extract data from websites using the Apify or Firecrawl APIs. Trigger whenever the user wants to: scrape a URL, crawl a website...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for abhishekj9621/web-scraper-skill.

Prompt preview (Install & Setup):
Install the skill "Web Scraper Skill" (abhishekj9621/web-scraper-skill) from ClawHub.
Skill page: https://clawhub.ai/abhishekj9621/web-scraper-skill
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install web-scraper-skill

ClawHub CLI


npx clawhub@latest install web-scraper-skill
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name and description match the SKILL.md and code templates: the skill is explicitly for driving Apify and Firecrawl APIs to scrape/crawl web pages. It does not request unrelated binaries, config paths, or credentials in the registry metadata.
Instruction Scope
Runtime instructions and templates focus on calling Firecrawl and Apify endpoints, starting/polling jobs, and saving results. They include concrete HTTP examples and code templates that post URLs and receive scraped content. The SKILL.md does not warn about sending potentially sensitive/private site content to third-party APIs (data exfiltration risk), and it instructs the agent to always use this skill for scraping tasks — which could cause automatic outbound requests if combined with autonomous invocation.
Install Mechanism
Instruction-only skill (no install spec, no code files executed on install). This is the lowest-risk install model — nothing is pulled or executed at install time.
Credentials
The registry lists no required environment variables, but the SKILL.md clearly expects the user to provide Firecrawl and Apify API keys at runtime. That is proportionate to the purpose (those APIs require keys), but the skill does not declare or manage any required env vars up-front, which could be confusing. It does not request unrelated credentials.
Persistence & Privilege
always:false (normal). disable-model-invocation:false allows autonomous invocation (platform default). This is expected for a skill that performs network calls, but users should be aware that the agent could call the external APIs without additional prompts when it decides scraping is required.
Assessment
This skill appears to do what it claims (drive Apify and Firecrawl). Before installing or using it, consider: (1) the agent will send URLs and scraped page content to third-party services — do not send private, behind-auth, or sensitive pages unless you trust those services and have permission; (2) the skill will ask you for API keys at runtime — treat those keys like secrets (use least-privilege tokens, rotate/revoke if needed, and avoid sharing long-lived account-wide keys); (3) if you don't want the agent to autonomously make scraping requests, restrict its ability to invoke the skill or require explicit confirmation before running; and (4) verify legal/robots/terms-of-service constraints for sites you plan to scrape.

Like a lobster shell, security has layers — review code before you run it.

latest: vk9782r5q0vfpt7a5b5f5f6eawd8426e7
126 downloads · 0 stars · 1 version
Updated 3w ago
v1.0.0 · MIT-0

Web Scraper Skill (Apify + Firecrawl)

This skill helps OpenClaw scrape and extract data from websites using two powerful APIs:

  • Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
  • Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

Quick Decision Guide: Apify vs Firecrawl

Use Case                                      Recommended Tool
Scrape a single page into markdown/JSON       Firecrawl /scrape
Crawl an entire website (follow links)        Firecrawl /crawl
Map all URLs on a site                        Firecrawl /map
Search web + scrape results                   Firecrawl /search
Scrape Instagram / TikTok / Twitter           Apify (social actors)
Scrape Google Maps / reviews                  Apify (compass/crawler-google-places)
Scrape Amazon products                        Apify (apify/amazon-scraper)
Scrape Google Search results                  Apify (apify/google-search-scraper)
Custom actor / any Apify Store actor          Apify

Authentication

Both APIs require API keys passed via headers. Always ask the user for their key if not provided.

Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in the URL)


Firecrawl API Reference

Base URL: https://api.firecrawl.dev/v2

1. Scrape a Single Page

POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}

Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }

2. Crawl an Entire Website

Crawling is async: you start a job, then poll for results.

POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

Response: { "success": true, "id": "crawl-job-id" }

Poll status:

GET /v2/crawl/{crawl-job-id}

Response: { "status": "completed", "total": 50, "data": [...] }

3. Map a Website's URLs

POST /v2/map
{ "url": "https://example.com" }

Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }

4. Search + Scrape in One Call

POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}

Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }

5. Batch Scrape Multiple URLs

POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}

Returns a job ID; poll with GET /v2/batch/scrape/{id}


Apify API Reference

Base URL: https://api.apify.com/v2
Auth: pass the token as a query param (?token=YOUR_TOKEN) or in the Authorization header.

Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

  1. Start a run → get a runId and defaultDatasetId
  2. Poll status until SUCCEEDED
  3. Fetch results from the dataset

1. Run an Actor (Async)

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }

Response:

{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}

Common Actor IDs:

  • apify/web-scraper — generic JS scraper
  • apify/google-search-scraper — Google SERPs
  • compass/crawler-google-places — Google Maps
  • apify/instagram-scraper — Instagram
  • clockworks/free-tiktok-scraper — TikTok
  • apify/amazon-scraper — Amazon products

2. Poll Run Status

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN

Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.

3. Fetch Results

GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json

Optional params: format (json/csv/xlsx/xml), limit, offset

4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }

Common Actor Inputs

Google Search Scraper:

{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }

Google Maps Scraper:

{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }

Web Scraper (generic):

{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}

Output Handling

  • Firecrawl returns data directly in the response (or via polling for crawl/batch).
  • Apify stores results in a dataset; retrieve with GET /v2/datasets/{id}/items.
  • Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
  • Apify also supports CSV, XLSX, XML output formats.

Code Templates

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.


Error Handling

  • Firecrawl 402 → out of credits; user needs to upgrade plan
  • Firecrawl 429 → rate limited; add delays between requests
  • Apify FAILED run → check run logs via GET /v2/acts/{id}/runs/{runId}/log
  • Always wrap API calls in try/catch and check success: false in Firecrawl responses (see the sketch after this list)
  • Firecrawl crawls respect robots.txt by default
  • For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)

Best Practices

  1. Start small — test with 1 URL or a small limit before scaling
  2. Use onlyMainContent: true in Firecrawl to remove nav/footer noise
  3. Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
  4. Store API keys securely — never hardcode them; use environment variables
  5. Check rate limits — Firecrawl: varies by plan; Apify: 250k requests/min global
  6. Prefer Firecrawl for LLM pipelines — markdown output is clean and ready for RAG/AI
  7. Prefer Apify for social/structured data — specialized actors handle anti-bot better
