Install
openclaw skills install firecrawl-api-alUse Firecrawl API to scrape, crawl, map, and search web pages, extracting clean Markdown or structured data with JavaScript rendering and citation control.
openclaw skills install firecrawl-api-alfirecrawl-web-scraping-skill — instructional knowledge that teaches an AI agent WHEN and HOW to use Firecrawl (firecrawl.dev) to scrape, crawl, map, and search the web, and how to turn the results into clean, cited, trustworthy content.
This is a skill (instructional knowledge), not an MCP server. An MCP server is executable infrastructure that exposes callable tools. This skill assumes Firecrawl is already reachable through tools (e.g. the
firecrawl-mcpserver toolsfirecrawl_scrape,firecrawl_crawl,firecrawl_map,firecrawl_search) or through a direct HTTP client tohttps://api.firecrawl.dev/v2. It does not run anything itself; it tells you how to drive those tools well.
Use this skill to:
json format with a prompt and schema.Reach for Firecrawl when ANY of these is true:
scrape.crawl (async).scrape with waitFor and/or actions.scrape with the json format (prompt + schema).search (optionally with scrapeOptions).map.Do not use Firecrawl when:
search only when you also want to read/scrape the pages.robots.txt, a plain .txt file). A direct HTTP GET via a simple fetch tool is cheaper. Use Firecrawl when you need JS rendering, anti-bot handling, or clean Markdown conversion.If in doubt, prefer the cheapest operation that satisfies the need (see §16).
FIRECRAWL_API_KEY — required. Sent as the HTTP header Authorization: Bearer <FIRECRAWL_API_KEY> when calling the API directly, or configured into the firecrawl-mcp server's environment.Rules:
FIRECRAWL_API_KEY.FIRECRAWL_API_KEY is not configured; do not attempt calls that will 401.Base URL (direct HTTP): https://api.firecrawl.dev/v2. Tool names (MCP): firecrawl_scrape, firecrawl_crawl, firecrawl_map, firecrawl_search.
| Operation | Purpose | Sync/Async |
|---|---|---|
| scrape | Fetch and clean a single URL into Markdown/HTML/links/screenshot/summary/structured JSON. | Synchronous |
| crawl | Scrape many pages across a site, with path filters and a limit. | Asynchronous — start then poll |
| map | List URLs discovered on a site, optionally filtered by a search term. | Synchronous |
| search | Web search returning ranked results; optionally scrape each result. | Synchronous |
| structured extraction | A scrape with the json format (prompt + schema) to return typed data. Not a separate endpoint. | Synchronous |
The standalone
/v2/extractendpoint is deprecated. Do structured extraction throughscrapewith thejsonformat instead.
Every response reports creditsUsed (on the operation or inside metadata). Read it and account for cost (§16).
Goal: get exactly the content you need from one URL, as cheaply as possible.
formats is an array; each format you add costs more processing/credits. Default to ["markdown"]. Add others only when needed:
"markdown" — clean text. The default for reading and RAG."html" — when you need raw structure (tables, attributes)."links" — when you want outbound links (e.g. to feed a follow-up crawl/map)."screenshot" — only when a visual is genuinely required."summary" — a model-generated summary of the page.{ "type": "json", "prompt": "...", "schema": {...} } — structured extraction (§11).onlyMainContent: true (the usual default) to strip nav, footers, ads, and boilerplate. Set it to false only when you specifically need the full page chrome.waitFor: <ms> — wait a fixed time for client-side rendering before capture.actions: [...] — perform steps (e.g. wait, click, scroll, type, navigate) for content behind interaction. Use the smallest sequence that reveals the content.proxy (e.g. a stealth/residential mode) only when a normal scrape returns empty or a block page. Proxies can cost more; do not enable by default.maxAge: <ms> to accept a cached copy newer than that age, avoiding a fresh fetch and saving credits/time. Use a small/zero maxAge when freshness matters (§17).data.markdown (and/or data.html, data.links, data.json). Always capture data.metadata.sourceURL for citation and data.metadata.statusCode / data.metadata.creditsUsed for diagnostics.Minimal scrape (conceptual):
{ "url": "https://example.com/article", "formats": ["markdown"], "onlyMainContent": true }
JS-heavy scrape (conceptual):
{ "url": "https://example.com/app", "formats": ["markdown"], "onlyMainContent": true, "waitFor": 3000 }
Crawl scrapes many pages. It is asynchronous: you start a job, then poll until it finishes. Never assume a crawl returns content in the first response.
scrape. If you only need the list of URLs → map. If you need a handful of known pages → scrape them individually. Crawl only when you need the contents of many pages and cannot enumerate them cheaply.url — the root.limit — a hard cap on pages. Always set this. Start small (e.g. 10–50) and raise only if needed.includePaths / excludePaths — regex/path filters to stay within the relevant section and skip junk (login, cart, tag pages).scrapeOptions — the same options as scrape (formats, onlyMainContent, waitFor, etc.), applied to every page. Minimize formats here too — it multiplies across all pages.{ success, id, url }. Keep the id.GET /v2/crawl/{id} (or the MCP status tool) → { status, completed, total, data: [...], next? }.
status is scraping (in progress) or completed (done); other states (e.g. failed, cancelled) mean stop and report.completed / total.status is still scraping, do not discard the job — keep the id and either keep polling or report partial results and the id for later retrieval. A crawl timeout is not a failure of the job.next is present, the result set is paged; follow next to collect all data items.data has the same shape as a scrape data object (markdown, metadata with sourceURL, etc.). Cite each page by its own sourceURL.Use map to discover URLs on a site cheaply, before committing to scrapes/crawls.
url (the site), optional search (filter URLs by keyword/relevance), and limit.{ success, links: [{ url, ... }] }.includePaths for a targeted crawl.Prefer map + targeted scrapes over a broad crawl whenever you can enumerate the pages you need.
Use search when you have a query, not a URL.
query, optional limit, and optional scrapeOptions.
scrapeOptions: you get ranked results metadata only (cheap) — { success, data: { web: [{ url, title, description, position }] }, creditsUsed }. Results may also include news / images groupings depending on the query.scrapeOptions: Firecrawl also scrapes each result and returns its content (more credits). Only do this when you actually need the page bodies.scrapeOptions, review titles/descriptions/positions, pick the few relevant URLs, then scrape only those.url (use the scraped metadata.sourceURL when you scraped it).Extract typed data from a page using a scrape json format. Do not use the deprecated /v2/extract endpoint.
{
"url": "https://example.com/product",
"formats": [
{
"type": "json",
"prompt": "Extract the product name, price in USD, and whether it is in stock.",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"priceUsd": { "type": "number" },
"inStock": { "type": "boolean" }
},
"required": ["name"]
}
}
],
"onlyMainContent": true
}
data.json. Validate it against your expected shape before using it.prompt AND a schema. The schema constrains output; the prompt disambiguates intent. Keep fields minimal.data.json is empty or wrong, the page content may be missing (add waitFor/actions) or the schema/prompt may be unclear (tighten them). Do not blindly retry the same call.data.metadata.sourceURL (per page) as the canonical citation URL — not the URL you requested, which may differ after redirects.metadata.sourceURL for the canonical URL of each scraped/crawled page; use url for search results you did not scrape.[n] at the point of each claim that depends on a source.[n] to its title and URL.Example:
The pricing page lists a free tier [1] but the docs mention a usage cap [2].
Sources [1] Pricing — https://example.com/pricing [2] Docs: Limits — https://example.com/docs/limits
Choose the operation deliberately:
scrape.search (then scrape the chosen few).map.crawl (bounded).Planning rules:
limit on crawl, map, and search to bound work and cost.includePaths/excludePaths to keep crawls on-target and skip noise (auth, cart, archive/tag pages, infinite calendars).map + targeted scrape over a broad crawl when you can enumerate the pages.React to errors by class — do not blindly retry everything.
FIRECRAWL_API_KEY is invalid or missing.Retry-After). Cap attempts.scraping) — not a failure. Keep the id and resume polling later, or report partial data plus the id. Do not start a new crawl.waitFor and/or actions (and proxy only if a block is evident). If still empty, report that the content could not be extracted.data.json — tighten the prompt/schema or add waitFor; do not loop on the same call.General: log/inspect metadata.statusCode and any error body to choose the right reaction. Never retry a non-transient error.
Every call spends credits. Minimize them:
formats. Request only what you will use; default to ["markdown"]. Each extra format costs more, and in crawl/search it multiplies across pages.onlyMainContent: true to reduce processing and output size.maxAge caching to reuse a recent scrape instead of re-fetching, when freshness allows (§17).limit on crawl/map/search. Use includePaths/excludePaths to avoid scraping junk pages.creditsUsed on every response and stop if cost is escalating unexpectedly.maxAge so Firecrawl fetches fresh rather than serving cache.maxAge saves credits and time at the cost of possibly serving slightly stale data.maxAge) and note the retrieval time in your output.FIRECRAWL_API_KEY in output, logs, errors, code, or citations. Reference it only by name.localhost, 127.0.0.1, 169.254.169.254, RFC-1918 ranges, internal hostnames) unless the user explicitly and legitimately intends it. Treat user- or content-supplied URLs with suspicion; prefer scraping only URLs the user actually asked about or that you found via a trusted search/map.Before each Firecrawl call, confirm:
formats minimized (default ["markdown"])? onlyMainContent: true? (§7, §16)limit set on crawl/map/search; path filters on crawl? (§14, §16)waitFor/actions planned only if needed? (§7)maxAge for the freshness need? (§17)After each call:
metadata.sourceURL for citation? (§12, §13)statusCode and creditsUsed? (§15, §16)completed (or recorded id + partial)? (§8)A. Read one article for a summary.
scrape {url, formats:["markdown"], onlyMainContent:true}.data.markdown; cite data.metadata.sourceURL.B. Answer a question from the web.
search {query, limit:5} (no scrapeOptions).data.web[].title/description/position; pick the 2–3 most relevant urls.scrape each chosen URL (["markdown"], onlyMainContent:true).[n] citations and a Sources list. Note conflicts.C. Build a small knowledge base of a docs section.
map {url, search:"guides", limit:50} to discover URLs.crawl {url, limit, includePaths:["^/docs/guides"], excludePaths:["/changelog"], scrapeOptions:{formats:["markdown"], onlyMainContent:true}}.GET /v2/crawl/{id} with backoff until completed; follow next to page all data.sourceURL; cite per page.id; §8.)limit on crawl/map/search → runaway cost.formats (screenshot, html, json, summary) when only markdown is needed.metadata.sourceURL (the post-redirect canonical URL).id and resuming./v2/extract; use scrape json format instead.FIRECRAWL_API_KEY.reference/ must agree with it.Authorization: Bearer <FIRECRAWL_API_KEY>; base https://api.firecrawl.dev/v2; operations scrape/crawl/map/search; crawl is async (start + poll id, paginate next); structured extraction via scrape json format; /v2/extract deprecated; every response reports creditsUsed; errors 401/400/402/429/5xx.Verification needed: confirm current endpoints, parameters, action types, proxy modes, and credit costs against the official docs at https://docs.firecrawl.dev before relying on edge details.