Web Crawler

Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page crawling, long-running spiders, or complex multi-step workflows that need connection pooling, rate limiting, proxy rotation, automatic retries, and Cheerio parsing. Skip this for single pages or simple fetch — use `curl` instead. Triggers: scrape website at scale, crawl hundreds of pages, batch download files, web spider, data extraction from many URLs, crawl multiple pages with retries and proxies.

Install

openclaw skills install node-crawler

Node Crawler (crawler package)

crawler is a Node.js web spider library: internal queue + configurable connection pool + per-domain rate limiting + automatic retries + proxy rotation + charset detection + server-side Cheerio (jQuery-style) HTML parsing. Built on got, supports HTTP/2.

Prerequisites

  • Node.js >= 22
  • Pure ESM: import Crawler from "crawler"
  • If the codebase must use CommonJS, install crawler@beta instead

npm install crawler

When to use

This skill is for production-grade, large-scale crawling. Reach for it when the task is substantial:

  • Scraping many pages (dozens to millions) with structured data extraction
  • Batch-downloading files — images, PDFs, archives — with retry and resume (resume logic is developer-implemented via userParams and file existence checks)
  • Long-running spiders that need rate limiting, retries, and connection pooling
  • Multi-step workflows — pagination, link following, cascading crawls
  • Proxy rotation, charset detection, HTTP/2 — infrastructure a real production crawler depends on
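
The resume logic mentioned in the list above is left to the developer. A minimal sketch of one way to do it, assuming each file is saved under the last segment of its URL path. `filenameFor`, `pendingTasks`, and the injected `exists` predicate are hypothetical helpers for illustration, not part of crawler:

```javascript
// Hypothetical helper: derive a local filename from the last URL path segment.
function filenameFor(url) {
  const name = new URL(url).pathname.split("/").pop();
  return name || "index"; // fall back when the path ends in "/"
}

// Hypothetical helper: build task objects only for files not yet on disk.
// `exists` is injected so the check can be fs.existsSync in real use.
function pendingTasks(urls, dir, exists) {
  return urls
    .map((url) => ({ url, userParams: { filename: `${dir}/${filenameFor(url)}` } }))
    .filter((task) => !exists(task.userParams.filename));
}

// Usage (assumes `c` is a Crawler configured for binary downloads):
//   import fs from "node:fs";
//   c.add(pendingTasks(urls, "downloads", fs.existsSync));
```

Re-running the script after an interruption then only enqueues the files that are still missing. A partially written file would count as present under this check, so writing to a temporary name and renaming on success is a reasonable refinement.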

When NOT to use

  • A single page or one-off request → curl is far lighter. Spinning up a Crawler instance for 1-2 pages is overkill.
  • Pages requiring JavaScript rendering → use the agent-browser skill (Playwright/Puppeteer) instead
  • Simple API data fetching → use fetch / got with JSON parsing

API

Core decision: Queue vs Send

| Approach | When |
| --- | --- |
| crawler.add() + 'drain' event | Most cases. Goes through the queue, pool, rate limiter, retries, and proxy rotation |
| crawler.send() | One-off requests. Bypasses the queue, rate limiter, preRequest, and the 'request' event |

Basic usage: Queue mode

import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$;  // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done();  // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({ url: "https://c.com", jQuery: false,
        callback: (e, res, done) => { /* custom callback */ done(); } });

Two most critical rules:

  1. Call done() from every branch of every queue callback, including the if (error) branch
  2. The crawler is finished when the 'drain' event fires, not when add() returns. add() merely enqueues tasks
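
Both rules can be enforced with two small helpers. This is a sketch, not part of the crawler API: `withDone` and `waitForDrain` are hypothetical names, and `waitForDrain` assumes the crawler instance exposes the standard EventEmitter `once()` method alongside the `on()` used above.

```javascript
// Hypothetical wrapper: runs a synchronous handler(error, res) and always
// calls done(), even if the handler throws, so the slot is released.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      console.error("callback failed:", err);
    } finally {
      done();
    }
  };
}

// Hypothetical helper: returns a promise that resolves on the next 'drain'.
function waitForDrain(crawler) {
  return new Promise((resolve) => crawler.once("drain", resolve));
}

// Usage (assumes `Crawler` is imported as in the basic example):
//   const c = new Crawler({
//     callback: withDone((error, res) => {
//       if (error) return console.error(error);
//       console.log(res.$("title").text());
//     }),
//   });
//   c.add(urls);
//   await waitForDrain(c); // code after this line runs once the queue is empty
```

The wrapper only suits synchronous handlers. If the handler does asynchronous work, done() should be called after that work settles instead.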

Rate limiting and retries

const c = new Crawler({
  rateLimit: 1000,    // at least 1000 ms between requests (forces maxConnections to 1)
  retries: 2,         // default: 2
  retryInterval: 3000,// ms to wait before retrying
  timeout: 20000,     // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
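
These settings interact, so it can help to estimate how long a single failing URL may hold its slot. A rough sketch, assuming every attempt runs to the full timeout and retryInterval is waited before each retry (the library's exact timing may differ). `worstCaseMs` is a hypothetical helper:

```javascript
// Rough upper bound, under the stated assumptions, on how long one URL
// can occupy a connection slot before it is finally reported as failed.
function worstCaseMs({ timeout, retries, retryInterval }) {
  const attempts = retries + 1; // the first try plus each retry
  return attempts * timeout + retries * retryInterval;
}

console.log(worstCaseMs({ timeout: 20000, retries: 2, retryInterval: 3000 })); // 66000
```

With the values above, a dead host could tie up the crawler for about a minute per URL, which matters most when rateLimit forces a single connection.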

Data extraction with Cheerio

Cheerio is enabled by default. Use jQuery selectors to extract data:

callback: (e, res, done) => {
  const $ = res.$;
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  const links  = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
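
Extracted href values are often relative, so they usually need normalizing before being passed back to add(). A minimal sketch using the built-in URL class. `absoluteLinks` is a hypothetical helper, and `pageUrl` stands for whatever URL the page was requested with:

```javascript
// Hypothetical helper: resolve hrefs against the page URL, keep only
// http(s) links, strip fragments, and drop duplicates and invalid values.
function absoluteLinks(hrefs, pageUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    try {
      const u = new URL(href, pageUrl);
      if (u.protocol !== "http:" && u.protocol !== "https:") continue;
      u.hash = ""; // "#section" variants point at the same page
      seen.add(u.href);
    } catch {
      // href could not be parsed as a URL; skip it
    }
  }
  return [...seen];
}

// Usage inside a callback, with `links` extracted as shown above:
//   c.add(absoluteLinks(links, pageUrl));
```

Deduplicating here only covers a single page. A full spider would also keep a set of visited URLs across callbacks.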

Binary file download

import fs from "fs";
const c = new Crawler({
  encoding: null,     // keep body as Buffer
  jQuery: false,      // skip Cheerio parsing
  callback: (err, res, done) => {
    if (err) {
      console.error(err);  // report the failure instead of dropping it silently
    } else {
      fs.writeFileSync(res.options.userParams.filename, res.body);
    }
    done();
  },
});
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });

Proxy rotation

Use the 'schedule' event for dynamic assignment (preferred over using the proxies array):

c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request",  options => { options.searchParams = { t: Date.now() }; });

Different proxies can have different rate limiters:

c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
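
The same pairing of proxy and rate limiter can be generated for a whole URL list. A sketch under the assumption that one limiter per proxy is wanted. `assignProxies` is a hypothetical helper and the proxy addresses are placeholders:

```javascript
// Hypothetical helper: spread URLs across proxies round-robin and give
// each proxy its own rateLimiterId (1-based, matching the example above).
function assignProxies(urls, proxies) {
  if (proxies.length === 0) throw new Error("need at least one proxy");
  return urls.map((url, i) => {
    const slot = i % proxies.length;
    return { url, proxy: proxies[slot], rateLimiterId: slot + 1 };
  });
}

// Usage (assumes `c` is an existing Crawler instance):
//   c.add(assignProxies(urls, ["http://p1:8080", "http://p2:8080"]));
```

Because each proxy has its own limiter, the rateLimit gap applies per proxy rather than across the whole crawl.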

HTTP/2

c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });

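When debugging through an intercepting proxy such as Charles, or against self-signed certificates, add rejectUnauthorized: false. This turns off TLS certificate verification, so keep it out of production configurations.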

Passing context data

Use userParams to attach data; read it back in the callback via res.options.userParams. Do not attach custom fields directly on the options object.

Gotchas

  • ❌ Forgetting done() in the if (error) branch → crawler deadlocks
  • ❌ Writing console.log("done") right after add() → listen for 'drain' instead
  • ❌ Setting maxConnections > 1 when rateLimit > 0 → it gets overridden to 1
  • ❌ Expecting send() to trigger preRequest or 'request' → send() bypasses all queue mechanics
  • ❌ Sending POST form data via body → v2 requires the form option
  • ❌ Binary download without encoding: null → corrupt output

Options reference

The complete options table is in references/options.md.

Code examples

Full, runnable examples for every scenario are in references/examples.md:

  • Basic queue crawling
  • Rate limiting
  • Cheerio data extraction
  • Binary download
  • Direct requests
  • HTTP/2
  • Proxy rotation
  • preRequest hooks
  • Full spider (pagination + extraction + following links)