Scrapling MCP
Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC...
MIT-0 · Free to use, modify, and redistribute. No attribution required.
by Burak@DevBD1
fork of @DevBD1/scrapling-web-scraping (based on 1.1.0)
Security Scan
OpenClaw: Benign (medium confidence)
Purpose & Capability
Name/description (Scrapling MCP guidance) align with the provided files and instructions: SKILL.md, reference docs, and helper scripts all focus on scraping, MCP setup, fetcher selection, spiders, proxies and anti-bot handling. No unrelated env vars, binaries, or platform access are requested.
Instruction Scope
Runtime instructions and examples legitimately show installing scrapling/playwright, configuring an MCP server, calling mcporter, and using fetcher/stealthy/dynamic modes. The instructions include explicit guidance and examples for proxy rotation and 'solve_cloudflare' / stealthy fetchers; those are coherent for advanced scraping but can enable bypassing anti-bot measures if used without authorization — the docs repeatedly note 'use only when authorized', which mitigates but does not remove misuse risk.
Install Mechanism
No install specification is included in the registry (instruction-only). The SKILL.md instructs pip installs from known packages (scrapling, playwright) and to run playwright install; helper scripts are shipped with the skill but there is no downloader or remote install URL that would write arbitrary code at runtime.
Credentials
The skill declares no required environment variables, no primary credential, and no config-path requirements. Example snippets show proxy URLs (including username:password examples) and an example API Authorization header in a recipe — these are examples only and not requested by the skill; exercise caution when inserting real credentials into proxy strings or requests.
Persistence & Privilege
The skill is not marked always:true and does not request persistent or cross-skill configuration changes. It does not attempt to modify other skills or system-wide settings.
Assessment
This skill appears to be a legitimate guidance layer plus helper scripts for using Scrapling via MCP. Before installing or using it:
1. Verify you have permission to scrape target sites — do not use stealth or proxy rotation to evade protections, bypass paywalls, or access private data.
2. Inspect any proxy strings or Authorization headers you paste into configs; never store real credentials in public places.
3. The skill instructs installing third-party packages (scrapling, playwright) — prefer installing them into a controlled virtualenv.
4. The SKILL.md links refer to GitHub repos/docs; confirm those projects are the official upstream sources you expect.
If you need higher assurance, request the upstream package source and a checksum for the scrapling wheel, or review the pip package metadata before installing.
Like a lobster shell, security has layers — review code before you run it.
Current version: v0.1.2
SKILL.md
Scrapling MCP — Web Scraping Guidance
Source repo: https://github.com/DevBD1/openclaw-skill-scrapling-mcp
Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.
Quick Start (MCP)
1. Install Scrapling with MCP support
pip install scrapling[mcp]
# Or for full features:
pip install scrapling[mcp,playwright]
python -m playwright install chromium
2. Add to OpenClaw MCP config
{
"mcpServers": {
"scrapling": {
"command": "python",
"args": ["-m", "scrapling.mcp"]
}
}
}
3. Call via mcporter
mcporter call scrapling fetch_page --url "https://example.com"
Execution vs Guidance
| Task | Tool | Example |
|---|---|---|
| Fetch a page | mcporter | mcporter call scrapling fetch_page --url URL |
| Extract with CSS | mcporter | mcporter call scrapling css_select --selector ".title::text" |
| Which fetcher to use? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | This skill | See "Spider Recipes" |
Fetcher Selection Guide
┌──────────────┐     ┌────────────────┐     ┌─────────────────┐
│   Fetcher    │────▶│ DynamicFetcher │────▶│ StealthyFetcher │
│    (HTTP)    │     │  (Browser/JS)  │     │   (Anti-bot)    │
└──────────────┘     └────────────────┘     └─────────────────┘
    Fastest            JS-rendered            Cloudflare,
  Static pages       SPAs, React/Vue         Turnstile, etc.
Decision Tree
- Static HTML? → Fetcher (10-100x faster)
- Need JS execution? → DynamicFetcher
- Getting blocked? → StealthyFetcher
- Complex session? → Use Session variants
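The decision tree can be sketched as a small helper. This is a hypothetical function, not part of Scrapling's API; only the class names it returns come from the documentation above:

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   stateful: bool = False) -> str:
    """Map page characteristics to the Scrapling fetcher tier.

    Escalate only as far as needed: the plain HTTP Fetcher is
    10-100x faster than driving a real browser.
    """
    if blocked:
        return "StealthyFetcher"   # anti-bot protections detected
    if needs_js:
        return "DynamicFetcher"    # SPA / client-rendered content
    if stateful:
        return "FetcherSession"    # cookies/state across requests
    return "Fetcher"               # static HTML, fastest path
```

Start at the bottom of the ladder and only escalate when a cheaper tier fails.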
MCP Fetch Modes
- fetch_page — HTTP fetcher
- fetch_dynamic — Browser-based with Playwright
- fetch_stealthy — Anti-bot bypass mode
Anti-Bot Escalation Ladder
Level 1: Polite HTTP
# MCP call: fetch_page with options
{
"url": "https://example.com",
"headers": {"User-Agent": "..."},
"delay": 2.0
}
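Polite pacing can also be enforced client-side. A minimal sketch (a hypothetical helper, not a Scrapling option) that adds jitter around a base delay so requests don't land at a mechanical fixed interval:

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 0.5) -> float:
    """Sleep for base +/- jitter seconds and return the delay used.

    Randomized delays look less robotic than a fixed interval and
    spread load on the target server.
    """
    delay = max(0.0, base + random.uniform(-jitter, jitter))
    time.sleep(delay)
    return delay
```

Call it between requests in any hand-rolled fetch loop.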
Level 2: Session Persistence
# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome") # TLS fingerprint spoofing
Level 3: Stealth Mode
# MCP: fetch_stealthy
StealthyFetcher.fetch(
url,
headless=True,
solve_cloudflare=True, # Auto-solve Turnstile
network_idle=True
)
Level 4: Proxy Rotation
See references/proxy-rotation.md
Adaptive Scraping (Anti-Fragile)
Scrapling can survive website redesigns using adaptive selectors:
# First run — save fingerprints
products = page.css('.product', auto_save=True)
# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)
MCP usage:
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
Spider Framework (Large Crawls)
When to use Spiders vs direct fetching:
- ✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ Direct: 1-5 pages, quick extraction, simple flow
Basic Spider Pattern
from scrapling.spiders import Spider, Response
class ProductSpider(Spider):
name = "products"
start_urls = ["https://example.com/products"]
concurrent_requests = 10
download_delay = 1.0
async def parse(self, response: Response):
for product in response.css('.product'):
yield {
"name": product.css('h2::text').get(),
"price": product.css('.price::text').get(),
"url": response.url
}
# Follow pagination
next_page = response.css('.next a::attr(href)').get()
if next_page:
yield response.follow(next_page)
# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
Advanced: Multi-Session Spider
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
name = "multi"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
if "/protected/" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast")
Spider Features
- Pause/Resume: crawldir parameter saves checkpoints
- Streaming: async for item in spider.stream() for real-time processing
- Auto-retry: Configurable retry on blocked requests
- Export: Built-in to_json(), to_jsonl()
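The auto-retry behaviour can be approximated in plain Python. A sketch of exponential backoff with a hypothetical `fetch_with_retry` wrapper (this is not Scrapling's internal implementation, just the general pattern):

```python
import time

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0):
    """Call fetch(url) until it succeeds, doubling the wait each time.

    fetch is any callable that raises on a blocked or failed request;
    the last failure is re-raised once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Pair the backoff with escalation: if retries keep failing, move up the fetcher ladder instead of hammering the same endpoint.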
CLI & Interactive Shell
Terminal Extraction (No Code)
# Extract to markdown
scrapling extract get 'https://example.com' content.md
# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'
# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
Interactive Shell
scrapling shell
# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
Parser API (Beyond CSS/XPath)
BeautifulSoup-Style Methods
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))
# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')
# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children
# Similarity
similar = first.find_similar() # Find visually/structurally similar elements
below = first.below_elements() # Elements below in DOM
Auto-Generated Selectors
# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector() # Returns stable CSS path
xpath = element.auto_xpath()
Proxy Rotation
from scrapling.spiders import ProxyRotator
# Cyclic rotation
rotator = ProxyRotator([
"http://proxy1:8080",
"http://proxy2:8080",
"http://user:pass@proxy3:8080"
], strategy="cyclic")
# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
page = session.get('https://example.com')
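The "cyclic" strategy amounts to round-robin iteration; in plain Python it can be sketched with itertools.cycle (a stdlib stand-in, not ProxyRotator itself):

```python
from itertools import cycle

# Round-robin over the proxy pool, wrapping back to the start.
proxies = cycle([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

first, second, third = next(proxies), next(proxies), next(proxies)
```

Each request takes the next proxy in turn, so load spreads evenly across the pool.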
Common Recipes
Pagination Patterns
# Page numbers
for page_num in range(1, 11):
url = f"https://example.com/products?page={page_num}"
...
# Next button
while next_page := response.css('.next a::attr(href)').get():
yield response.follow(next_page)
# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
page = session.fetch(url)
page.scroll_to_bottom()
items = page.css('.item').getall()
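Next-button hrefs are often relative. response.follow typically resolves them for you, but when building request URLs yourself, the stdlib urljoin handles the resolution:

```python
from urllib.parse import urljoin

base = "https://example.com/products?page=3"
next_href = "?page=4"            # relative href scraped from the page
next_url = urljoin(base, next_href)
# Resolves against the base path, replacing only the query string.
```

The same call also resolves path-relative (`new/`) and root-relative (`/products/new`) links correctly.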
Login Sessions
with StealthySession(headless=False) as session:
# Login
login_page = session.fetch('https://example.com/login')
login_page.fill('input[name="username"]', 'user')
login_page.fill('input[name="password"]', 'pass')
login_page.click('button[type="submit"]')
# Now session has cookies
protected_page = session.fetch('https://example.com/dashboard')
Next.js Data Extraction
# Extract JSON from __NEXT_DATA__
import json
import re
next_data = json.loads(
re.search(
r'__NEXT_DATA__" type="application/json">(.*?)</script>',
page.html_content,
re.S
).group(1)
)
props = next_data['props']['pageProps']
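The same extraction against a self-contained sample, assuming the standard `<script id="__NEXT_DATA__" type="application/json">` tag that Next.js emits (the sample HTML and "Widget" payload are invented for illustration):

```python
import json
import re

html = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Widget"}}}'
    '</script>'
)

match = re.search(
    r'__NEXT_DATA__" type="application/json">(.*?)</script>',
    html,
    re.S,
)
props = json.loads(match.group(1))["props"]["pageProps"]
```

Pulling the embedded JSON is usually both faster and more stable than scraping the rendered DOM.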
Output Formats
# JSON (pretty)
result.items.to_json('output.json')
# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')
# Python objects
for item in result.items:
print(item['title'])
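JSONL is simply one JSON object per line, which is why it streams well. A stdlib sketch of writing and re-reading it, independent of Scrapling's to_jsonl (file name and items are illustrative):

```python
import json

items = [{"title": "A", "price": 9.99}, {"title": "B", "price": 4.5}]

# Write: one compact JSON record per line.
with open("products.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# Read: parse each line independently, no need to load the whole file.
with open("products.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Because each line stands alone, a crawl can append records as they arrive and a crash loses at most the last partial line.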
Performance Tips
- Use HTTP fetcher when possible — 10-100x faster than browser
- Impersonate browsers — impersonate='chrome' for TLS fingerprinting
- HTTP/3 support — FetcherSession(http3=True)
- Limit resources — disable_resources=True in Dynamic/Stealthy
- Connection pooling — Reuse sessions across requests
Guardrails (Always)
- Only scrape content you're authorized to access
- Respect robots.txt and ToS
- Add delays (download_delay) for large crawls
- Don't bypass paywalls or authentication without permission
- Never scrape personal/sensitive data
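Checking robots.txt is easy to automate with the standard library before any crawl. Here the rules are parsed from inline lines so the example is self-contained; against a real site you would use set_url and read instead:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Real usage: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
])

allowed = rp.can_fetch("*", "https://example.com/products")      # permitted
blocked = rp.can_fetch("*", "https://example.com/private/data")  # disallowed
delay = rp.crawl_delay("*")  # honor this as your download_delay
```

Gate every spider's start_urls and followed links through a check like this, and feed the declared crawl delay into download_delay.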
References
- references/mcp-setup.md — Detailed MCP configuration
- references/anti-bot.md — Anti-bot handling strategies
- references/proxy-rotation.md — Proxy setup and rotation
- references/spider-recipes.md — Advanced crawling patterns
- references/api-reference.md — Quick API reference
- references/links.md — Official docs links
Scripts
- scripts/scrapling_scrape.py — Quick one-off extraction
- scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators
Files: 10 total
