Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

OpenClaw Ultra Scraping

Powerful web scraping, crawling, and data extraction with stealth anti-bot bypass (Cloudflare Turnstile, CAPTCHAs). Use when: (1) scraping websites that bloc...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 408 · 3 current installs · 3 all-time installs
by Leo Ye (@LeoYeAI)
Security Scan
VirusTotal
Suspicious
OpenClaw
Suspicious
medium confidence
Purpose & Capability
Name/description (anti-bot scraping/Cloudflare bypass) aligns with the included code: scrape.py and setup.sh implement fetchers, stealth mode, dynamic rendering, crawling, and proxy rotation. However, the SKILL.md declares a root-requiring install into /opt/scrapling-venv (apt-get + pip), which is not reflected in the registry-level 'No install spec' claim; users should be aware of this mismatch. The advanced anti-bot claims (CAPTCHA solving) may require external solver services, but no credentials are requested or documented.
Instruction Scope
Runtime instructions tell the user to run scripts/setup.sh (which runs apt-get, pip install, and 'scrapling install') and then use the bundled CLI. The instructions do not direct the agent to read unrelated host files or exfiltrate data. They do, however, direct the download and installation of system libraries and browser binaries from the network, a broader scope than a simple 'instruction-only' skill would suggest.
Install Mechanism
The included setup.sh performs apt-get and pip install (scrapling[all]) and runs 'scrapling install' to fetch browsers. These are standard package sources (apt, PyPI), but they download and execute code/binaries at install time. The install requires root and writes to /opt. There are no explicit third-party URLs in the script, but pip and 'scrapling install' may pull many dependencies and browser binaries from external hosts; this increases risk, so run the setup in an isolated environment after verifying package provenance.
Credentials
The skill declares no required environment variables or credentials, which is consistent with the files included. However, practical use of anti-CAPTCHA/anti-bot features often needs external solver APIs or paid proxy services (API keys, tokens) — none are declared or explained. That gap is operationally important and may lead users to supply credentials ad hoc.
Persistence & Privilege
The skill does not request 'always: true' and is user-invocable, which is normal. But setup.sh requires root (apt-get, venv creation in /opt) and installs system-level libraries and binaries. This elevated privilege and system-wide installation increases blast radius; the SKILL.md itself recommends using an isolated container/VM.
What to consider before installing
This package appears to implement what it claims (a heavy scraping tool with anti-bot features), but it performs system-level installs (apt-get, pip) and places a virtualenv under /opt, which requires root. Before installing:

1. Run the setup in an isolated VM or container.
2. Inspect the pip package 'scrapling' (and its dependencies) and confirm sources (PyPI project, maintainer); pip can install arbitrary code.
3. Be aware that 'scrapling install' will download browser binaries from the network.
4. Consider the legal/ToS implications of bypassing anti-bot protections and solving CAPTCHAs.
5. Expect that you may need to supply third-party solver or proxy credentials (not declared by the skill).
6. If you cannot review the upstream package or do not want root installs, do not install this skill on a shared host.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.2
Download zip
latest · vk97fd1fmqssvp0pgpkw0jtmaex82baqy

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

Bins: python3

SKILL.md

OpenClaw Ultra Scraping

Powered by MyClaw.ai — the AI personal assistant platform that gives every user a full server with complete code control. Part of the MyClaw.ai open skills ecosystem.

Handles everything from single-page extraction to full-scale concurrent crawls with anti-bot bypass.

Setup

Run once before first use:

bash scripts/setup.sh

This installs Scrapling + all browser dependencies into /opt/scrapling-venv.

Quick Start — CLI Script

The bundled scripts/scrape.py provides a unified CLI:

PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch "https://example.com" --css ".content"

# Extract text
$PYTHON scripts/scrape.py extract "https://example.com" --css "h1"

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch "https://protected-site.com" --stealth --solve-cloudflare --css ".data"

# Dynamic (full browser rendering)
$PYTHON scripts/scrape.py fetch "https://spa-site.com" --dynamic --css ".product"

# Extract links
$PYTHON scripts/scrape.py links "https://example.com" --filter "\.pdf$"

# Multi-page crawl
$PYTHON scripts/scrape.py crawl "https://example.com" --depth 2 --concurrency 10 --css ".item" -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch "https://example.com" -f markdown -o page.md
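The `--depth` flag bounds how far the crawl expands from the start URL. Conceptually this is a breadth-first traversal of the link graph; a minimal stdlib sketch of that idea (the `LINKS` map and `crawl` function are illustrative stand-ins, not part of scrape.py, with an in-memory link map replacing real HTTP fetches):

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched pages.
LINKS = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}

def crawl(start, depth):
    """Breadth-first crawl: visit every URL reachable within `depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, d = queue.popleft()
        order.append(url)
        if d == depth:
            continue  # depth limit reached; do not expand this page's links
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return order

# depth 2 reaches /a and /b (one hop) and /c (two hops), but never /d
print(crawl("https://example.com", 2))
```

In the real CLI, `--concurrency` additionally controls how many of these fetches run in parallel at each level.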

Quick Start — Python

For complex tasks, write Python directly using the venv:

#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('h1::text').getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch('https://protected.com', headless=True, solve_cloudflare=True)
data = page.css('.product').getall()

Fetcher Selection Guide

| Scenario | Fetcher | Flag |
|---|---|---|
| Normal sites, fast scraping | Fetcher | (default) |
| JS-rendered SPAs | DynamicFetcher | --dynamic |
| Cloudflare/anti-bot protected | StealthyFetcher | --stealth |
| Cloudflare Turnstile challenge | StealthyFetcher | --stealth --solve-cloudflare |

Selector Cheat Sheet

page.css('.class')                    # CSS
page.css('.class::text').getall()     # Text extraction
page.xpath('//div[@id="main"]')      # XPath
page.find_all('div', class_='item')  # BS4-style
page.find_by_text('keyword')         # Text search
page.css('.item', adaptive=True)     # Adaptive (survives redesigns)
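The `::text` pseudo-selector returns the text inside the matched elements rather than the elements themselves. A stdlib-only illustration of that behavior (using `html.parser`; `H1TextCollector` is a conceptual stand-in for what `page.css('h1::text').getall()` returns, not Scrapling's implementation):

```python
from html.parser import HTMLParser

class H1TextCollector(HTMLParser):
    """Collect the text inside <h1> tags, skipping everything else."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.texts.append(data.strip())

parser = H1TextCollector()
parser.feed("<html><h1>Pricing</h1><p>ignored</p><h1>Docs</h1></html>")
print(parser.texts)  # ['Pricing', 'Docs']
```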

Advanced Features

  • Adaptive tracking: auto_save=True on first run, adaptive=True later — elements are found even after site redesign
  • Proxy rotation: Pass proxy="http://host:port" or use ProxyRotator
  • Sessions: FetcherSession, StealthySession, DynamicSession for cookie/state persistence
  • Spider framework: Scrapy-like concurrent crawling with pause/resume
  • Async support: All fetchers have async variants
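Proxy rotation simply cycles outbound requests through a pool so that no single IP carries all the traffic. As a concept sketch (pure stdlib; `RoundRobinProxies` is an illustrative stand-in, not Scrapling's actual `ProxyRotator` class, and the proxy URLs are placeholders):

```python
from itertools import cycle

class RoundRobinProxies:
    """Hand out proxies from a fixed pool in round-robin order."""
    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next(self):
        return next(self._pool)

rotator = RoundRobinProxies([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])

# Each fetch would receive a different proxy, e.g. proxy=rotator.next()
assigned = [rotator.next() for _ in range(4)]
print(assigned)  # wraps back to proxy1 on the fourth request
```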

For full API details: read references/api-reference.md

Files

4 total
