Crawl4AI Web Scraper

v1.0.1

Full web page scraping with JavaScript rendering via local Crawl4AI instance, delivering clean markdown or detailed JSON including links and media.

5 stars · 2.8k downloads · 15 current · 15 all-time
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match what the files do: the SKILL.md and included Node script send URLs to a Crawl4AI instance and return markdown/JSON. Declared required binary (node) is appropriate. One minor inconsistency: registry metadata earlier listed no required env vars, but SKILL.md (and the script) require CRAWL4AI_URL (and optionally CRAWL4AI_KEY). This appears to be a metadata omission rather than malicious.
Instruction Scope
Instructions and the script stay within scope: they POST {urls: [...]} to the configured Crawl4AI URL and print the returned markdown/JSON. The script reads only CRAWL4AI_URL and the optional CRAWL4AI_KEY. Notes: the script uses Node's http module (not https), so HTTPS endpoints will not be handled correctly; the help text mentions a default API key of '1234' (harmless but confusing). There are no instructions to read unrelated files or environment variables.
Install Mechanism
No install spec or external downloads; the skill is instruction+embedded script only. No third-party packages are fetched at install time, so install risk is low.
Credentials
Only CRAWL4AI_URL (required) and CRAWL4AI_KEY (optional) are used. These directly map to the stated purpose (target instance URL and optional auth). No unrelated secrets or config paths are requested.
Persistence & Privilege
Skill does not request always:true and does not attempt to modify other skills or system-wide settings. Agent-autonomous invocation is allowed by default (normal for skills) but not specially privileged here.
Assessment
This skill appears to do what it claims, but review these points before installing:

  • You must set CRAWL4AI_URL to a trusted Crawl4AI endpoint (for local usage, set it to http://localhost:11235). The skill sends the URL(s) you want scraped to that endpoint, so pointing it at an untrusted remote service could leak the pages you request and any scraped content.
  • The script only supports plain HTTP (it uses Node's http module). If you provide an https:// URL the script may fail; treat that as an implementation bug, not a security feature.
  • The only environment variables used are CRAWL4AI_URL and the optional CRAWL4AI_KEY. If your Crawl4AI instance requires a key, use a secret you trust. There are no other hidden env reads or file accesses.
  • Metadata mismatch: the registry metadata omits the required env var that the SKILL.md and script require. This is likely a packaging oversight; confirm CRAWL4AI_URL will be supplied before use.
  • If you plan to let the agent invoke skills autonomously, remember that the agent could call this skill and thereby send target URLs to the configured Crawl4AI endpoint; ensure that behavior is acceptable in your environment.

Overall: coherent and low-risk, provided you point it to a trusted Crawl4AI instance and are aware of the HTTP/HTTPS limitation.
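The trusted-endpoint advice above could be enforced with a pre-flight allowlist check before any target URLs are sent out. A sketch under stated assumptions (the helper name and default host list are examples, not part of the skill):

```javascript
// Pre-flight check: only proceed if CRAWL4AI_URL points at a host you
// trust. The default allowlist below is an example; adjust for your
// own deployment.
function isTrustedEndpoint(endpoint, allowedHosts = ['localhost', '127.0.0.1']) {
  try {
    return allowedHosts.includes(new URL(endpoint).hostname);
  } catch (e) {
    return false; // malformed URL: treat as untrusted
  }
}
```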


Runtime requirements

🕷️ Clawdis
Bins: node
latest: vk974zkd8pc606c24efkfxg877n81582h
2.8k downloads
5 stars
2 versions
Updated 1mo ago
v1.0.1
MIT-0

Crawl4AI Web Scraper

Local Crawl4AI instance for full web page extraction with JavaScript rendering.

Endpoints

Proxy (port 11234) — Clean output, OpenWebUI-compatible

  • Returns: [{page_content, metadata}]
  • Use for: Simple content extraction

Direct (port 11235) — Full output with all data

  • Returns: {results: [{markdown, html, links, media, ...}]}
  • Use for: When you need links, media, or other metadata
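The two response shapes above can be normalized with a small helper. A sketch assuming the field names listed (the function itself is illustrative, not part of the skill):

```javascript
// Extract page content from either endpoint's response shape.
// Proxy (port 11234) returns [{page_content, metadata}]; direct
// (port 11235) returns {results: [{markdown, html, links, media, ...}]}.
// extractContent is an example helper, not part of the skill's script.
function extractContent(response) {
  if (Array.isArray(response)) {
    return response.map(item => item.page_content); // proxy shape
  }
  return response.results.map(r => r.markdown);     // direct shape
}
```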

Usage

# Via script
node {baseDir}/scripts/crawl4ai.js "url"
node {baseDir}/scripts/crawl4ai.js "url" --json

Script options:

  • --json — Full JSON response

Output: Clean markdown from the page.

Configuration

Required environment variable:

  • CRAWL4AI_URL — Your Crawl4AI instance URL (e.g., http://localhost:11235)

Optional:

  • CRAWL4AI_KEY — API key if your instance requires authentication
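A script reading this configuration might validate it along these lines; a minimal sketch, assuming a hypothetical loadConfig helper (not the skill's actual code):

```javascript
// Read CRAWL4AI_URL (required) and CRAWL4AI_KEY (optional) from the
// environment, failing fast when the required variable is missing.
// loadConfig is an illustrative name, not from the skill's script.
function loadConfig(env = process.env) {
  if (!env.CRAWL4AI_URL) {
    throw new Error('CRAWL4AI_URL is required (e.g. http://localhost:11235)');
  }
  return { url: env.CRAWL4AI_URL, key: env.CRAWL4AI_KEY || null };
}
```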

Features

  • JavaScript rendering — Handles dynamic content
  • Unlimited usage — Local instance, no API limits
  • Full content — HTML, markdown, links, media, tables
  • Better than Tavily for complex pages with JS

API

Uses your local Crawl4AI instance REST API. Auth header only sent if CRAWL4AI_KEY is set.
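The conditional auth behavior can be sketched as follows; the Bearer scheme is an assumption about how the instance authenticates, not something the skill documents:

```javascript
// Build request headers, attaching Authorization only when an API key
// is present, matching the behavior described above. The Bearer scheme
// is an assumption; check your Crawl4AI deployment's auth settings.
function buildHeaders(key) {
  const headers = { 'Content-Type': 'application/json' };
  if (key) headers.Authorization = `Bearer ${key}`;
  return headers;
}
```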
