Install

openclaw skills install node-crawler

Node.js web crawler for production-grade, large-scale tasks, NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page crawling, long-running spiders, or complex multi-step workflows that need connection pooling, rate limiting, proxy rotation, automatic retries, and Cheerio parsing. Skip this for single pages or simple fetch; use `curl` instead. Triggers: scrape website at scale, crawl hundreds of pages, batch download files, web spider, data extraction from many URLs, crawl multiple pages with retries and proxies.

crawler is a Node.js web spider library: internal queue + configurable
connection pool + per-domain rate limiting + automatic retries + proxy
rotation + charset detection + server-side Cheerio (jQuery-style) HTML
parsing. Built on got, supports HTTP/2.

import Crawler from "crawler"
crawler@beta
npm install crawler

This skill is for production-grade, large-scale crawling. Reach for it when the task is substantial: bulk scraping, multi-page crawling, long-running spiders, and batch downloading (with userParams and file existence checks).

For smaller jobs, lighter tools fit better:
- For single pages or a simple fetch, curl is far lighter. Spinning up a Crawler instance for 1-2 pages is overkill.
- When the page needs a real browser, use the agent-browser skill (Playwright/Puppeteer) instead.
- For JSON responses, use fetch / got with JSON parsing.

| Approach | When |
|---|---|
| crawler.add() + 'drain' event | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| crawler.send() | One-off requests. Bypasses queue, rate limiter, preRequest, 'request' event |

import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$; // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done(); // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({
  url: "https://c.com",
  jQuery: false,
  callback: (e, res, done) => { /* custom callback */ done(); },
});

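
Since a missed `done()` stalls the queue and completion is only signalled by `'drain'`, it can help to wrap both rules in small helpers. The following is a minimal sketch and not part of the crawler package: `withDone` and `untilDrained` are hypothetical names, `withDone` only covers synchronous handlers, and `untilDrained` assumes the crawler object exposes `on()` and `add()` as used above.

```javascript
// Hypothetical helper: wraps a synchronous handler so that done() is called
// exactly once, whether the handler returns normally or throws.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      console.error("handler threw:", err);
    } finally {
      done(); // always release the connection slot
    }
  };
}

// Hypothetical helper: enqueues tasks and resolves once 'drain' fires.
// The listener is registered before add() so the event cannot be missed.
function untilDrained(crawler, tasks) {
  return new Promise((resolve) => {
    crawler.on("drain", resolve);
    crawler.add(tasks);
  });
}
```

With these in place, a callback could be written as `callback: withDone((error, res) => { /* ... */ })`, and the calling code could `await untilDrained(c, urls)` instead of logging right after `add()`.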
Two most critical rules:
- Call done() from every branch of every queue callback, including the if (error) branch; a missed call deadlocks the crawler.
- The crawl is finished when the 'drain' event fires, not when add() returns. add() merely enqueues tasks.

const c = new Crawler({
  rateLimit: 1000, // gap between requests is at least 1000 ms (forces maxConnections = 1)
  retries: 2, // default: 2
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000, // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});

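
To get a feel for what these settings imply, the arithmetic can be sketched in a few lines. This is a back-of-envelope estimate, not library code: the function names are hypothetical, and the worst case assumes every attempt runs all the way to the timeout.

```javascript
// Mirrors the rule stated above: any rateLimit > 0 forces one connection.
function effectiveConcurrency(maxConnections, rateLimit) {
  return rateLimit > 0 ? 1 : maxConnections;
}

// Lower bound on wall-clock time for a rate-limited crawl:
// the first request starts immediately, each later one waits rateLimit ms.
function minCrawlTimeMs(urlCount, rateLimit) {
  return Math.max(0, urlCount - 1) * rateLimit;
}

// Rough upper bound for one URL that fails every attempt:
// (retries + 1) attempts that each hit the timeout, plus the waits between them.
function worstCasePerUrlMs({ retries, retryInterval, timeout }) {
  return (retries + 1) * timeout + retries * retryInterval;
}

console.log(effectiveConcurrency(10, 1000)); // 1
console.log(minCrawlTimeMs(500, 1000)); // 499000
console.log(worstCasePerUrlMs({ retries: 2, retryInterval: 3000, timeout: 20000 })); // 66000
```

With the values shown above, 500 URLs need at least about 8.3 minutes, and a single dead URL can occupy the only slot for up to 66 seconds.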
Cheerio is enabled by default. Use jQuery selectors to extract data:
callback: (e, res, done) => {
  if (e) {
    console.error(e);
    return done(); // release the slot on the error path too
  }
  const $ = res.$;
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  const links = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}

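
Extracted `href` values are often relative, duplicated, or not HTTP links at all, so they usually need cleaning before being passed back to `add()`. The sketch below shows one way to do that with the standard `URL` class. `normalizeLinks` is a hypothetical helper, and the page URL is assumed to be supplied by the caller.

```javascript
// Hypothetical helper: turns raw href values into unique absolute http(s) URLs.
function normalizeLinks(hrefs, pageUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // skip empty or missing attributes
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative links against the page
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // fragments point at the same document
    seen.add(url.href);
  }
  return [...seen];
}

console.log(
  normalizeLinks(["/a", "b#top", "mailto:x@example.com", "/a"], "https://example.com/dir/page.html")
);
```

The returned array can be handed to `c.add()` in the same way as the URL arrays shown earlier.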
import fs from "fs";
import Crawler from "crawler";

const c = new Crawler({
  encoding: null, // keep body as Buffer
  jQuery: false, // skip Cheerio parsing
  callback: (err, res, done) => {
    if (!err) fs.writeFileSync(res.options.userParams.filename, res.body);
    done();
  },
});

c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });

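
For batch downloads, it is common to skip files that are already on disk before enqueueing anything. A minimal sketch of that check follows. `buildDownloadTasks` is a hypothetical helper, the filename is assumed to be the last path segment of the URL, and the existence check is passed in by the caller (for example `(name) => fs.existsSync(name)`).

```javascript
// Hypothetical helper: builds one task per URL, carrying the target filename
// in userParams and skipping files the caller reports as already present.
function buildDownloadTasks(urls, alreadyExists) {
  const tasks = [];
  for (const url of urls) {
    // Assumption: the last path segment is a usable filename.
    const filename = new URL(url).pathname.split("/").pop();
    if (!filename || alreadyExists(filename)) continue;
    tasks.push({ url, userParams: { filename } });
  }
  return tasks;
}

// Example with a stubbed existence check (b.png is treated as downloaded):
const tasks = buildDownloadTasks(
  ["https://host.example/img/a.png", "https://host.example/img/b.png"],
  (name) => name === "b.png"
);
console.log(tasks.map((t) => t.userParams.filename));
```

Each resulting task has the same shape as the `c.add({ url, userParams })` call above, so it can be enqueued with `for (const t of tasks) c.add(t);`.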
Use the 'schedule' event for dynamic assignment (preferred over using the
proxies array):
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request", options => { options.searchParams = { t: Date.now() }; });
Different proxies can have different rate limiters:
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
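
The pairing of one proxy with one rate limiter can be automated when tasks are built. The sketch below assigns proxies round-robin and gives each proxy its own `rateLimiterId`, numbered from 1 as in the lines above. `makeProxyAssigner` is a hypothetical helper and the proxy addresses are placeholders.

```javascript
// Hypothetical helper: returns a function that stamps each task with the next
// proxy in the list and a rateLimiterId that is unique to that proxy.
function makeProxyAssigner(proxies) {
  if (proxies.length === 0) throw new Error("at least one proxy is required");
  let next = 0;
  return (options) => {
    const index = next % proxies.length;
    next += 1;
    return { ...options, proxy: proxies[index], rateLimiterId: index + 1 };
  };
}

// Placeholder proxies; replace with real addresses.
const assign = makeProxyAssigner(["http://p1:8080", "http://p2:8080"]);
const assigned = ["https://a.example", "https://b.example", "https://c.example"]
  .map((url) => assign({ url }));
console.log(assigned.map((t) => `${t.proxy} -> limiter ${t.rateLimiterId}`));
```

Each assigned object can then be passed to `c.add()`, so requests through different proxies are throttled independently.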
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
When using Charles or self-signed certs, add rejectUnauthorized: false.
Use userParams to attach data; read it back in the callback via
res.options.userParams. Do not attach custom fields directly on the
options object.
- Missing done() in the if (error) branch → crawler deadlocks
- console.log("done") right after add() → listen for 'drain' instead
- maxConnections > 1 when rateLimit > 0 → it gets overridden to 1
- Expecting send() to trigger preRequest or 'request' → send() bypasses all queue mechanics
- Passing request data as body → v2 requires form
- Omitting encoding: null for binary downloads → corrupt output

The complete options table is in references/options.md.
Full, runnable examples for every scenario are in references/examples.md.