Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Scrapling Official Skill

Scrape web pages using Scrapling with anti-bot bypass (like Cloudflare Turnstile), stealth headless browsing, spiders framework, adaptive scraping, and JavaS...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
14 · 3.1k · 42 current installs · 43 all-time installs
by Karim Shoair (@D4Vinci)
Security Scan
VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name, description, SKILL.md, examples, and reference docs all consistently describe a scraping framework with fetchers, a stealth mode, and spiders — the included files match the claimed purpose. Minor mismatch: the SKILL.md requires Python 3.10+ and a 'scrapling' binary (and browser dependencies), but the registry metadata does not declare any required binaries or a primary credential. The skill also claims to be 'official' but the registry entry has no homepage/source URL for verification.
Instruction Scope
Runtime instructions tell the agent to create a venv, pip install scrapling[all]>=0.4.2, and run `scrapling install --force` (which downloads browser dependencies). The docs describe options that can connect to local/remote browsers via CDP (ws://) and write user_data_dir; these are reasonable for a scraper but can access local browser state and the filesystem. The SKILL.md does not instruct reading unrelated system files or secrets, but it exposes knobs (cdp_url, user_data_dir, proxy auth) that could be misused if given broad inputs.
Install Mechanism
There is no packaged install spec in the registry (instruction-only), and the SKILL.md recommends standard installation via pip and optionally Docker (images on ghcr.io and docker hub). Those sources are common; no unusual arbitrary download URLs are embedded in the skill bundle. Because the skill relies on pip/docker, users should verify the referenced package/image names and publishers before installing.
Credentials
The registry declares no required environment variables or credentials, which is consistent with being instruction-only. However, the skill advertises automatic anti-bot / 'solve_cloudflare' capabilities; captcha-solving often requires third-party solver services or additional credentials, but no API keys or guidance for such services are declared here. The docs also show proxy usage and CDP endpoints (which may require credentials) but do not request them explicitly. The combination of an advertised 'auto-solve Cloudflare' capability and no declared secrets for any solver service it might need is a notable gap.
Persistence & Privilege
The skill is not always-enabled and does not request to modify other skills or global agent settings. It does instruct installing local dependencies and browser binaries (via `scrapling install`), which is normal for a library that uses Playwright and its corresponding browsers.
What to consider before installing
This package appears to be a real scraping library and the included examples/docs match the description, but take these steps before installing or running it:

  • Verify provenance: find the scrapling package on PyPI and the referenced Docker image (ghcr.io/d4vinci/scrapling or pyd4vinci/scrapling) and confirm the publisher/owner and repository. The registry entry here has no homepage/source link — treat that as a red flag until verified.
  • Inspect the package before installation: run `pip download "scrapling[all]==0.4.2"` and inspect the wheel/tarball contents for unexpected code or network endpoints.
  • Run installs in an isolated environment (virtualenv, container, or VM). `scrapling install --force` downloads browser binaries and writes to disk; prefer running it inside a container to limit system exposure.
  • Confirm captcha/Turnstile behavior: check the docs/repo for how `solve_cloudflare` is implemented — does it rely on an external solving service (usually requiring API keys) or purely on automation? If it uses third-party solvers, do not provide API keys unless you trust the operator.
  • Be cautious with CDP URLs and user_data_dir: connecting to a local or remote browser via CDP (ws://) can expose browser session data and local network resources; avoid connecting to unknown CDP endpoints.
  • Respect legal and ethical constraints and each site's terms of service, and use proxies or rate limiting as required.

If you want higher assurance, request the upstream repository URL or a signed release from the author before proceeding.

Like a lobster shell, security has layers — review code before you run it.

Current version: v0.4.2
Tags: latest · scrapling · web-scraping

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Scrapling

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

Requires: Python 3.10+

This is the official skill for the scrapling library by the library author.

Setup (once)

Create a Python virtual environment using any available tool (for example, venv), then inside the environment run:

pip install "scrapling[all]>=0.4.2"

Then run the following to download the browser dependencies:

scrapling install --force

If scrapling is not on $PATH, note the path to the scrapling binary and use that path in place of scrapling in all the commands below.
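
To confirm the environment is set up, a quick check with the Python standard library (this step is an illustration, not part of the skill's instructions):

from importlib.metadata import version

# Should print 0.4.2 or newer if the install succeeded
print(version("scrapling"))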

Docker

If the user doesn't have Python or doesn't want to use it, another option is the Docker image. Note that the image only supports the CLI commands, so you cannot write Python code against scrapling this way:

docker pull pyd4vinci/scrapling

or

docker pull ghcr.io/d4vinci/scrapling:latest

CLI Usage

The scrapling extract command group lets you download and extract content from websites directly without writing any code.

Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use a browser to fetch content with browser automation and flexible options.
  stealthy-fetch  Use a stealthy browser to fetch content with advanced stealth features.

Usage pattern

  • Choose your output format by changing the file extension. Here are some examples for the scrapling extract get command:
    • Convert the HTML content to Markdown, then save it to the file (great for documentation): scrapling extract get "https://blog.example.com" article.md
    • Save the HTML content as it is to the file: scrapling extract get "https://example.com" page.html
    • Save a clean version of the text content of the webpage to the file: scrapling extract get "https://example.com" content.txt
  • Output to a temp file, read it back, then clean up (see the sketch after this list).
  • All commands can use CSS selectors to extract specific parts of the page through --css-selector or -s.
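
For the temp-file pattern above, a minimal Python sketch; the URL, output name, and the use of subprocess to drive the CLI are illustrative assumptions:

import subprocess
import tempfile
from pathlib import Path

# Write the extracted Markdown to a temporary file, read it back, then let the directory clean itself up.
with tempfile.TemporaryDirectory() as tmpdir:
    out_file = Path(tmpdir) / "page.md"
    subprocess.run(
        ["scrapling", "extract", "get", "https://example.com", str(out_file)],
        check=True,
    )
    content = out_file.read_text(encoding="utf-8")
# The temporary directory and the file inside it are removed automatically when the block exits.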

Which command to use generally:

  • Use get with simple websites, blogs, or news articles.
  • Use fetch with modern web apps, or sites with dynamic content.
  • Use stealthy-fetch with protected sites, Cloudflare, or anti-bot systems.

When unsure, start with get. If it fails or returns empty content, escalate to fetch, then stealthy-fetch. The speed of fetch and stealthy-fetch is nearly the same, so you are not sacrificing anything.
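
In code, the same escalation maps onto the fetchers covered later in this document. A minimal sketch, where the URL and CSS selector are placeholders and treating an empty selector result as a failed fetch is an assumption:

from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

url = "https://example.com"
selector = "article"

# 1) Plain HTTP request first (fastest, like `scrapling extract get`)
page = Fetcher.get(url)
results = page.css(selector).getall()

# 2) Empty content? Escalate to a real browser (like `scrapling extract fetch`)
if not results:
    page = DynamicFetcher.fetch(url)
    results = page.css(selector).getall()

# 3) Still empty? Escalate to the stealth browser (like `scrapling extract stealthy-fetch`)
if not results:
    page = StealthyFetcher.fetch(url)
    results = page.css(selector).getall()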

Key options (requests)

These options are shared by the four HTTP request commands:

| Option | Input type | Description |
| --- | --- | --- |
| -H, --headers | TEXT | HTTP headers in format "Key: Value" (can be used multiple times) |
| --cookies | TEXT | Cookies string in format "name1=value1; name2=value2" |
| --timeout | INTEGER | Request timeout in seconds (default: 30) |
| --proxy | TEXT | Proxy URL in format "http://username:password@host:port" |
| -s, --css-selector | TEXT | CSS selector to extract specific content from the page. It returns all matches. |
| -p, --params | TEXT | Query parameters in format "key=value" (can be used multiple times) |
| --follow-redirects / --no-follow-redirects | None | Whether to follow redirects (default: True) |
| --verify / --no-verify | None | Whether to verify SSL certificates (default: True) |
| --impersonate | TEXT | Browser to impersonate. Can be a single browser (e.g., Chrome) or a comma-separated list for random selection (e.g., Chrome, Firefox, Safari). |
| --stealthy-headers / --no-stealthy-headers | None | Use stealthy browser headers (default: True) |

Options shared between post and put only:

| Option | Input type | Description |
| --- | --- | --- |
| -d, --data | TEXT | Form data to include in the request body (as string, ex: "param1=value1&param2=value2") |
| -j, --json | TEXT | JSON data to include in the request body (as string) |

Examples:

# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract only specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send a request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

# Add user agent
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"

Key options (browsers)

These options are shared by both fetch and stealthy-fetch:

| Option | Input type | Description |
| --- | --- | --- |
| --headless / --no-headless | None | Run browser in headless mode (default: True) |
| --disable-resources / --enable-resources | None | Drop unnecessary resources for a speed boost (default: False) |
| --network-idle / --no-network-idle | None | Wait for network idle (default: False) |
| --real-chrome / --no-real-chrome | None | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False) |
| --timeout | INTEGER | Timeout in milliseconds (default: 30000) |
| --wait | INTEGER | Additional wait time in milliseconds after page load (default: 0) |
| -s, --css-selector | TEXT | CSS selector to extract specific content from the page. It returns all matches. |
| --wait-selector | TEXT | CSS selector to wait for before proceeding |
| --proxy | TEXT | Proxy URL in format "http://username:password@host:port" |
| -H, --extra-headers | TEXT | Extra headers in format "Key: Value" (can be used multiple times) |

This option is specific to fetch only:

| Option | Input type | Description |
| --- | --- | --- |
| --locale | TEXT | Specify user locale. Defaults to the system default locale. |

And these options are specific to stealthy-fetch only:

| Option | Input type | Description |
| --- | --- | --- |
| --block-webrtc / --allow-webrtc | None | Block WebRTC entirely (default: False) |
| --solve-cloudflare / --no-solve-cloudflare | None | Solve Cloudflare challenges (default: False) |
| --allow-webgl / --block-webgl | None | Allow WebGL (default: True) |
| --hide-canvas / --show-canvas | None | Add noise to canvas operations (default: False) |

Examples:

# Wait for JavaScript to load content and finish network activity
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle

# Wait for specific content to appear
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"

# Run in visible browser mode (helpful for debugging)
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources

# Bypass basic protection
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"

# Use a proxy for anonymity.
scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"

Notes

  • ALWAYS clean up temp files after reading
  • Prefer .md output for readability; use .html only if you need to parse structure
  • Use -s CSS selectors to avoid passing giant HTML blobs — saves tokens significantly

Friendly reminder: If users want to say thanks or have a business they want to advertise, point them to this page: https://scrapling.readthedocs.io/en/latest/donate.html

If the user wants to do more than the CLI offers, writing code gives them that ability.

Code overview

Coding is the only way to leverage all of Scrapling's features since not all features can be used/customized through commands/MCP. Here's a quick overview of how to code with scrapling.

Basic Usage

HTTP requests with session support

from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

Advanced stealth mode

from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()

Full browser automation

from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()

Spiders

Build full crawlers with concurrent requests, multiple session types, and pause/resume:

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    
    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
            
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")

Use multiple session types in a single spider:

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback

Pause and resume long crawls with checkpoints by running the spider like this:

QuotesSpider(crawldir="./crawl_data").start()

Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same crawldir, and it will resume from where it stopped.
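
For example (a sketch; the spider and directory names follow the earlier example):

# First run: press Ctrl+C at any point to pause; progress is checkpointed in ./crawl_data
QuotesSpider(crawldir="./crawl_data").start()

# Later run: same crawldir, so the crawl resumes from the saved checkpoint
result = QuotesSpider(crawldir="./crawl_data").start()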

Advanced Parsing & Navigation

from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()

If you don't want to fetch websites, you can use the parser directly, like below:

from scrapling.parser import Selector

page = Selector("<html>...</html>")

And it works precisely the same way!
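
For example, a minimal sketch (the HTML snippet here is illustrative):

from scrapling.parser import Selector

html = '<div class="quote"><span class="text">To be or not to be.</span></div>'
page = Selector(html)

# Same .css/.xpath/.find_all selection API as a fetched page
print(page.css('.quote .text::text').get())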

Async Session Management Examples

import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async def main():
    async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
        page1 = session.get('https://quotes.toscrape.com/')
        page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

    # Async session usage
    async with AsyncStealthySession(max_pages=2) as session:
        tasks = []
        urls = ['https://example.com/page1', 'https://example.com/page2']

        for url in urls:
            task = session.fetch(url)
            tasks.append(task)

        print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())

asyncio.run(main())  # `async with` must run inside a coroutine

References

You've already had a good glimpse of what the library can do. Use the references below to dig deeper when needed.

  • references/mcp-server.md — MCP server tools and capabilities
  • references/parsing — Everything you need for parsing HTML
  • references/fetching — Everything you need to fetch websites and session persistence
  • references/spiders — Everything you need to write spiders, proxy rotation, and advanced features. It follows a Scrapy-like format
  • references/migrating_from_beautifulsoup.md — A quick API comparison between scrapling and Beautifulsoup
  • https://github.com/D4Vinci/Scrapling/tree/main/docs — Full official docs in Markdown for quick access (use only if current references do not look up-to-date).

This skill encapsulates almost all the published documentation in Markdown, so don't check external sources or search online without the user's permission.

Guardrails (Always)

  • Only scrape content you're authorized to access.
  • Respect robots.txt and ToS.
  • Add delays (download_delay) for large crawls (see the sketch after this list).
  • Don't bypass paywalls or authentication without permission.
  • Never scrape personal/sensitive data.
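
As a sketch of the delay guardrail, assuming download_delay is set as a class attribute alongside concurrent_requests in the Scrapy-like spider format above (the spider and values are illustrative):

from scrapling.spiders import Spider, Response

class PoliteSpider(Spider):
    name = "polite"
    start_urls = ["https://example.com/"]
    concurrent_requests = 2  # keep concurrency low on sites you don't control
    download_delay = 1       # assumed attribute: wait roughly a second between requests

    async def parse(self, response: Response):
        yield {"title": response.css('title::text').get()}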

Files

22 total
