Scrapling MCP

Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 1.4k · 12 current installs · 13 all-time installs
byBurak@DevBD1
MIT-0
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
Name/description (Scrapling MCP web scraping) match the files and instructions: guidance, recipes, MCP setup, and two helper Python scripts that call Scrapling fetchers. No unrelated credentials, binaries, or config paths are requested.
Instruction Scope
SKILL.md focuses on scraping strategy, MCP configuration, and mcporter usage. Instructions reference only Scrapling APIs, proxy examples, and local crawl dirs. There are no instructions to read unrelated system files or exfiltrate secrets; the guidance repeatedly admonishes permissioned use for anti-bot/stealth modes.
Install Mechanism
No install spec is present (instruction-only), so nothing is automatically downloaded or written during install. The README suggests optional pip installs for Scrapling and Playwright, which is appropriate and expected for the skill's functionality.
Credentials
The skill declares no required environment variables, credentials, or config paths. Examples show optional PYTHONPATH and proxy URL examples (including example user:pass formats), which are reasonable for a scraping tool and proportionate to the stated purpose.
Persistence & Privilege
always:false and model invocation is not disabled (default), which is normal. The skill does not request persistent platform privileges or attempt to modify other skills or global agent config beyond advising the user how to add an MCP server entry for Scrapling.
Assessment
This skill appears coherent for web-scraping with Scrapling, but review and follow these precautions before use: 1) Ensure you have explicit permission to scrape target sites and comply with terms of service and law — the skill includes anti-bot/stealth features that should not be used to evade access controls. 2) Installing Playwright will download browser binaries; verify the install commands and their source. 3) Sample proxy examples include credentials in URLs — treat any real proxy credentials as sensitive and provide them only to trusted providers. 4) The included scripts write files (downloads, crawldir). Run in a controlled directory and inspect the code if you will run them. 5) If you run Scrapling as an MCP server, consider binding it to localhost and review its network surface before exposing it. 6) If you need higher assurance, verify the upstream Scrapling project (repo/docs links are provided) and review the included Python scripts for any local modifications.

Like a lobster shell, security has layers — review code before you run it.

Current versionv1.2.0
Download zip
latestvk97djhyvjrt5n5gaav4bzgaz7h82cnhm

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

SKILL.md

Scrapling MCP — Web Scraping Guidance

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

pip install scrapling[mcp]
# Or for full features:
pip install scrapling[mcp,playwright]
python -m playwright install chromium

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"

Execution vs Guidance

TaskToolExample
Fetch a pagemcportermcporter call scrapling fetch_page --url URL
Extract with CSSmcportermcporter call scrapling css_select --selector ".title::text"
Which fetcher to use?This skillSee "Fetcher Selection Guide" below
Anti-bot strategy?This skillSee "Anti-Bot Escalation Ladder"
Complex crawl patterns?This skillSee "Spider Recipes"

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

  1. Static HTML?Fetcher (10-100x faster)
  2. Need JS execution?DynamicFetcher
  3. Getting blocked?StealthyFetcher
  4. Complex session? → Use Session variants

MCP Fetch Modes

  • fetch_page — HTTP fetcher
  • fetch_dynamic — Browser-based with Playwright
  • fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing

Level 3: Stealth Mode

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)

Level 4: Proxy Rotation

See references/proxy-rotation.md

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \\
  --selector ".product" \\
  --adaptive true \\
  --auto-save true

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

  • Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
  • Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

  • Pause/Resume: crawldir parameter saves checkpoints
  • Streaming: async for item in spider.stream() for real-time processing
  • Auto-retry: Configurable retry on blocked requests
  • Export: Built-in to_json(), to_jsonl()

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \\
  --css-selector '.article' \\
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \\
  --no-headless \\
  --solve-cloudflare

Interactive Shell

scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\\$\\d+\\.\\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM

Auto-Generated Selectors

# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()

Proxy Rotation

from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button
while next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()

Login Sessions

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']

Output Formats

# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])

Performance Tips

  1. Use HTTP fetcher when possible — 10-100x faster than browser
  2. Impersonate browsersimpersonate='chrome' for TLS fingerprinting
  3. HTTP/3 supportFetcherSession(http3=True)
  4. Limit resourcesdisable_resources=True in Dynamic/Stealthy
  5. Connection pooling — Reuse sessions across requests

Guardrails (Always)

  • Only scrape content you're authorized to access
  • Respect robots.txt and ToS
  • Add delays (download_delay) for large crawls
  • Don't bypass paywalls or authentication without permission
  • Never scrape personal/sensitive data

References

  • references/mcp-setup.md — Detailed MCP configuration
  • references/anti-bot.md — Anti-bot handling strategies
  • references/proxy-rotation.md — Proxy setup and rotation
  • references/spider-recipes.md — Advanced crawling patterns
  • references/api-reference.md — Quick API reference
  • references/links.md — Official docs links

Scripts

  • scripts/scrapling_scrape.py — Quick one-off extraction
  • scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators

Files

11 total
Select a file
Select a file to preview.

Comments

Loading comments…