Install
openclaw skills install openclaw-scrapling

Advanced web scraping with anti-bot bypass, JavaScript support, and adaptive selectors. Use when scraping websites with Cloudflare protection, dynamic content, or frequent UI changes.

Use Scrapling to scrape modern websites, including those with anti-bot protection, JavaScript-rendered content, and adaptive element tracking.
All commands use the scrape.py script in this skill's directory.
python scrape.py \
--url "https://example.com" \
--selector ".product" \
--output products.json
Use when: Static HTML, no JavaScript, no bot protection
python scrape.py \
--url "https://nopecha.com/demo/cloudflare" \
--stealth \
--selector "#content" \
--output data.json
Use when: Cloudflare protection, bot detection, fingerprinting
python scrape.py \
--url "https://spa-website.com" \
--dynamic \
--selector ".loaded-content" \
--wait-for ".loaded-content" \
--output data.json
Use when: React/Vue/Angular apps, lazy-loaded content, AJAX
# First time - save the selector pattern
python scrape.py \
--url "https://example.com" \
--selector ".product-card" \
--adaptive-save \
--output products.json
# Later, if website structure changes
python scrape.py \
--url "https://example.com" \
--adaptive \
--output products.json
Use when: Website frequently redesigns, need robust scraping
How it works: with --adaptive-save, Scrapling stores a fingerprint of each matched element (its tag, attributes, text, and position on the page); with --adaptive, it uses similarity matching against that saved fingerprint to relocate the closest-matching element even after the page structure changes.
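The idea can be pictured as a similarity match between a saved element fingerprint and candidates on the new page. This is only an illustrative sketch, not Scrapling's actual algorithm; the fingerprint fields and scoring weights here are assumptions.

```python
# Illustrative sketch of adaptive element relocation (NOT Scrapling's
# real implementation): score each candidate against a saved fingerprint.

def similarity(saved: dict, candidate: dict) -> float:
    """Score 0..1: how closely a candidate element matches the saved one."""
    score = 0.0
    if saved["tag"] == candidate["tag"]:
        score += 0.3
    shared = set(saved["classes"]) & set(candidate["classes"])
    union = set(saved["classes"]) | set(candidate["classes"])
    if union:
        score += 0.4 * len(shared) / len(union)
    if saved["text"] and saved["text"] == candidate["text"]:
        score += 0.3
    return score

def relocate(saved: dict, candidates: list) -> dict:
    """Return the best-scoring candidate above a minimum threshold, else None."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is not None and similarity(saved, best) >= 0.5:
        return best
    return None
```

Even if a redesign renames one class or moves the element, the closest match still wins, which is why adaptive mode survives moderate page changes.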
# Login and save session
python scrape.py \
--url "https://example.com/dashboard" \
--stealth \
--login \
--username "user@example.com" \
--password "password123" \
--session-name "my-session" \
--selector ".protected-data" \
--output data.json
# Reuse saved session (no login needed)
python scrape.py \
--url "https://example.com/another-page" \
--stealth \
--session-name "my-session" \
--selector ".more-data" \
--output more_data.json
Use when: Content requires authentication, multi-step scraping
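Session reuse works by persisting browser state (cookies and the like) under a name and loading it on later runs, so the login step can be skipped. A minimal sketch of that idea — the `sessions/` layout and JSON format here are hypothetical, not Scrapling's actual on-disk format:

```python
import json
from pathlib import Path

SESSIONS_DIR = Path("sessions")  # hypothetical location for saved state

def save_session(name: str, cookies: list) -> None:
    """Persist cookies so a later run can skip the login step."""
    SESSIONS_DIR.mkdir(exist_ok=True)
    (SESSIONS_DIR / f"{name}.json").write_text(json.dumps(cookies))

def load_session(name: str):
    """Return saved cookies, or None if no session with that name exists."""
    path = SESSIONS_DIR / f"{name}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

The `--session-name` flag plays the role of `name` above: the first run writes the state, subsequent runs load it.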
Text only:
python scrape.py \
--url "https://example.com" \
--selector ".content" \
--extract text \
--output content.txt
Markdown:
python scrape.py \
--url "https://docs.example.com" \
--selector "article" \
--extract markdown \
--output article.md
Attributes:
# Extract href links
python scrape.py \
--url "https://example.com" \
--selector "a.product-link" \
--extract attr:href \
--output links.json
Multiple fields:
python scrape.py \
--url "https://example.com/products" \
--selector ".product" \
--fields "title:.title::text,price:.price::text,link:a::attr(href)" \
--output products.json
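The `--fields` value packs several `name:selector` pairs into one comma-separated string. A sketch of how such a spec could be split into name/selector pairs — a hypothetical parser for illustration, not scrape.py's actual code:

```python
def parse_fields(spec: str) -> dict:
    """Split 'name:selector,name:selector' into {name: selector}.

    Only the first ':' in each part separates the name; later colons
    (e.g. '::text', '::attr(href)') stay part of the selector.
    """
    fields = {}
    for part in spec.split(","):
        name, _, selector = part.partition(":")
        fields[name.strip()] = selector.strip()
    return fields
```

Note the sketch assumes no commas inside selectors (e.g. `a[href*='x,y']` would break it); a real parser would need to handle that case.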
Proxy support:
python scrape.py \
--url "https://example.com" \
--proxy "http://user:pass@proxy.com:8080" \
--selector ".content"
Rate limiting:
python scrape.py \
--url "https://example.com" \
--selector ".content" \
--delay 2 # 2 seconds between requests
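The `--delay` flag simply spaces out successive requests. The same politeness is easy to reproduce in your own scripts with a tiny limiter — an illustrative sketch, not how scrape.py implements it:

```python
import time

class RateLimiter:
    """Ensure at least `delay` seconds elapse between successive calls."""

    def __init__(self, delay: float):
        self.delay = delay
        self._last = None

    def wait(self) -> None:
        """Sleep just long enough to honor the delay, then record the time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Call `limiter.wait()` before each fetch; the first call returns immediately and later calls block only for the remaining time.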
Custom headers:
python scrape.py \
--url "https://api.example.com" \
--headers '{"Authorization": "Bearer token123"}' \
--selector "body"
Screenshot (for debugging):
python scrape.py \
--url "https://example.com" \
--stealth \
--screenshot debug.png
You can also use Scrapling directly in Python scripts:
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher

# Basic HTTP request
page = Fetcher.get('https://example.com')
products = page.css('.product')
for product in products:
    title = product.css('.title::text').get()
    price = product.css('.price::text').get()
    print(f"{title}: {price}")

# Stealth mode (bypass anti-bot)
page = StealthyFetcher.fetch('https://protected-site.com', headless=True)
data = page.css('.content').getall()

# Dynamic content (full browser)
page = DynamicFetcher.fetch('https://spa-app.com', network_idle=True)
items = page.css('.loaded-item').getall()

# Sessions (login)
from scrapling.fetchers import StealthySession

with StealthySession(headless=True) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('#username', 'user@example.com')
    login_page.fill('#password', 'password123')
    login_page.click('#submit')

    # Access protected content
    protected_page = session.fetch('https://example.com/dashboard')
    data = protected_page.css('.private-data').getall()
Supported output formats (chosen by file extension):
--output data.json
--output data.jsonl
--output data.csv
--output data.txt
--output data.md
--output data.html

Scrapling supports multiple selector formats:
CSS selectors:
--selector ".product"
--selector "div.container > p.text"
--selector "a[href*='product']"
XPath selectors:
--selector "//div[@class='product']"
--selector "//a[contains(@href, 'product')]"
Pseudo-elements (like Scrapy):
--selector ".product::text" # Text content
--selector "a::attr(href)" # Attribute value
--selector ".price::text::strip" # Text with whitespace removed
Combined selectors:
--selector ".product .title::text" # Nested elements
Issue: "Element not found"
- Use --dynamic if content is JavaScript-loaded
- Use --wait-for SELECTOR to wait for the element
- Use --screenshot to debug what is visible

Issue: "Cloudflare blocking"
- Use --stealth mode
- Use the --solve-cloudflare flag (enabled by default in stealth)
- Use --delay 2 to slow down requests

Issue: "Login not working"
- Use --headless false to see the browser interaction

Issue: "Selector broke after website update"
- Use --adaptive mode to auto-relocate elements
- Use --adaptive-save to update saved patterns

python scrape.py \
--url "https://news.ycombinator.com" \
--selector ".athing" \
--fields "title:.titleline>a::text,link:.titleline>a::attr(href)" \
--output hn_stories.json
python scrape.py \
--url "https://example.com/data" \
--stealth \
--login \
--username "user@example.com" \
--password "secret" \
--session-name "example-session" \
--selector ".data-table tr" \
--output protected_data.json
# Save initial selector pattern
python scrape.py \
--url "https://store.com/product/123" \
--selector ".price" \
--adaptive-save \
--output price.txt
# Later, check price (even if page redesigned)
python scrape.py \
--url "https://store.com/product/123" \
--adaptive \
--output price_new.txt
python scrape.py \
--url "https://react-app.com/data" \
--dynamic \
--wait-for ".loaded-content" \
--selector ".item" \
--fields "name:.name::text,value:.value::text" \
--output app_data.json
Sessions are stored in the sessions/ directory and are reusable across runs. Adaptive selector patterns are stored in selector_cache.json and auto-updated. Respect robots.txt and add delays for ethical scraping.

Installed automatically when the skill is installed:
scrapling/
├── SKILL.md # This file
├── scrape.py # Main CLI script
├── requirements.txt # Python dependencies
├── sessions/ # Saved browser sessions
├── selector_cache.json # Adaptive selector patterns
└── examples/ # Example scripts
├── basic.py
├── stealth.py
├── dynamic.py
└── adaptive.py
For complex scraping tasks, you can create custom Python scripts in this directory:
# custom_scraper.py
from scrapling.fetchers import StealthyFetcher
from scrapling.spiders import Spider, Response
import json

class MySpider(Spider):
    name = "custom"
    start_urls = ["https://example.com/page1"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('.title::text').get(),
                "price": item.css('.price::text').get(),
            }

        # Follow pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run spider
result = MySpider().start()
with open('output.json', 'w') as f:
    json.dump(result.items, f, indent=2)
Run with:
python custom_scraper.py
Questions? Check Scrapling docs: https://scrapling.readthedocs.io