Scrapling MCP

Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC...

MIT-0 · Free to use, modify, and redistribute. No attribution required.

⭐ 0 · 198 · 2 current installs · 2 all-time installs

by Burak @DevBD1

fork of @DevBD1/scrapling-web-scraping (based on 1.1.0)

Security Scan

VirusTotal

Benign

OpenClaw

Benign

medium confidence

✓ Purpose & Capability

Name/description (Scrapling MCP guidance) align with the provided files and instructions: SKILL.md, reference docs, and helper scripts all focus on scraping, MCP setup, fetcher selection, spiders, proxies and anti-bot handling. No unrelated env vars, binaries, or platform access are requested.

ℹ Instruction Scope

Runtime instructions and examples legitimately show installing scrapling/playwright, configuring an MCP server, calling mcporter, and using fetcher/stealthy/dynamic modes. The instructions include explicit guidance and examples for proxy rotation and 'solve_cloudflare' / stealthy fetchers; those are coherent for advanced scraping but can enable bypassing anti-bot measures if used without authorization — the docs repeatedly note 'use only when authorized', which mitigates but does not remove misuse risk.

✓ Install Mechanism

No install specification is included in the registry (instruction-only). The SKILL.md instructs pip installs from known packages (scrapling, playwright) and to run playwright install; helper scripts are shipped with the skill but there is no downloader or remote install URL that would write arbitrary code at runtime.

✓ Credentials

The skill declares no required environment variables, no primary credential, and no config-path requirements. Example snippets show proxy URLs (including username:password examples) and an example API Authorization header in a recipe — these are examples only and not requested by the skill; exercise caution when inserting real credentials into proxy strings or requests.

✓ Persistence & Privilege

The skill is not marked always:true and does not request persistent or cross-skill configuration changes. It does not attempt to modify other skills or system-wide settings.

Assessment

This skill appears to be a legitimate guidance layer plus helper scripts for using Scrapling via MCP. Before installing or using it:

1) Verify you have permission to scrape target sites; do not use stealth or proxy rotation to evade protections, bypass paywalls, or access private data.
2) Inspect any proxy strings or Authorization headers you paste into configs; never store real credentials in public places.
3) The skill instructs installing third-party packages (scrapling, playwright); prefer installing those into a controlled virtualenv.
4) The SKILL.md links refer to GitHub repos/docs; confirm those projects are the official upstream sources you expect.

If you need higher assurance, request the upstream package source and a checksum for the scrapling wheel, or review the pip package metadata before installing.

Like a lobster shell, security has layers — review code before you run it.

Current version: v0.1.2

latest vk97fccfxkkc6b5wy6wdkzc1gj582d4m6

License

MIT-0

Terms: https://spdx.org/licenses/MIT-0.html

SKILL.md

Scrapling MCP — Web Scraping Guidance

Source repo: https://github.com/DevBD1/openclaw-skill-scrapling-mcp

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

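# Quote the extras so shells such as zsh do not expand the brackets.
# Upstream Scrapling publishes the MCP server under the "ai" extra.
pip install "scrapling[ai]"
# Or for all features:
pip install "scrapling[all]"
# Download the browsers used by the dynamic and stealthy fetchers
scrapling install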

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
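
If the config file already lists other servers, the entry can be merged in rather than pasted by hand. A minimal sketch, assuming the config is a JSON file with a top-level `mcpServers` object as shown above; the helper name `add_mcp_server` and its parameters are hypothetical and not part of OpenClaw, mcporter, or Scrapling:

```python
import json
from pathlib import Path
from typing import Dict, List


def add_mcp_server(config_path: str, name: str, command: str, args: List[str]) -> Dict:
    """Merge one server entry into an MCP config file, keeping existing servers."""
    path = Path(config_path)
    # Start from the existing config if the file is present, otherwise from an empty one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Overwrites any previous entry with the same name; other entries are untouched.
    servers[name] = {"command": command, "args": list(args)}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it with the path to your config file, the server name `scrapling`, and the same command and args used in the JSON above.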

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"

Execution vs Guidance

Task	Tool	Example
Fetch a page	mcporter	`mcporter call scrapling fetch_page --url URL`
Extract with CSS	mcporter	`mcporter call scrapling css_select --selector ".title::text"`
Which fetcher to use?	This skill	See "Fetcher Selection Guide" below
Anti-bot strategy?	This skill	See "Anti-Bot Escalation Ladder"
Complex crawl patterns?	This skill	See "Spider Recipes"

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

Static HTML? → Fetcher (10-100x faster than a browser)
Need JS execution? → DynamicFetcher
Getting blocked? → StealthyFetcher
Complex session? → Use Session variants
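
The decision tree above can be read as a small function. A sketch only: `choose_fetcher` and its three flags are hypothetical names for illustration, and the returned strings are the class names mentioned in this document:

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False, multi_step: bool = False) -> str:
    """Return the fetcher class name suggested by the decision tree."""
    if blocked:
        base = "StealthyFetcher"   # anti-bot protection in the way
    elif needs_js:
        base = "DynamicFetcher"    # content rendered by JavaScript
    else:
        base = "Fetcher"           # plain HTTP is enough for static HTML
    if not multi_step:
        return base
    # Multi-step flows (login, carts) keep cookies and state in a session variant.
    session_variant = {
        "Fetcher": "FetcherSession",
        "DynamicFetcher": "DynamicSession",
        "StealthyFetcher": "StealthySession",
    }
    return session_variant[base]


print(choose_fetcher(needs_js=True))                   # DynamicFetcher
print(choose_fetcher(blocked=True, multi_step=True))   # StealthySession
```

Checking `blocked` first reflects the tree's ordering: once a site blocks you, the stealth fetcher is the suggestion whether or not JavaScript is needed.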

MCP Fetch Modes

fetch_page — HTTP fetcher
fetch_dynamic — Browser-based with Playwright
fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

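# Use sessions to keep cookies/state across requests
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:  # Impersonates Chrome's TLS fingerprint
    page = session.get("https://example.com")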

Level 3: Stealth Mode

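# MCP: fetch_stealthy
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    headless=True,
    solve_cloudflare=True,  # Try to solve Cloudflare Turnstile challenges automatically
    network_idle=True,      # Wait until network activity settles
)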

Level 4: Proxy Rotation

See references/proxy-rotation.md
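
The ladder amounts to trying the cheapest level first and moving up only when a response looks blocked. A minimal sketch with plain callables standing in for the fetchers; `fetch_with_escalation`, `BLOCKED_STATUSES`, and the `(status, body)` return shape are assumptions for illustration, not part of Scrapling or its MCP server:

```python
from typing import Callable, Optional, Sequence, Tuple

# Status codes that often indicate blocking or rate limiting (an assumption; tune per site).
BLOCKED_STATUSES = {403, 429, 503}

FetchFn = Callable[[str], Tuple[int, str]]


def fetch_with_escalation(
    url: str, levels: Sequence[Tuple[str, FetchFn]]
) -> Tuple[Optional[str], int, str]:
    """Try each (name, fetch) level in order and stop at the first unblocked response.

    Returns (level_name, status, body); level_name is None when every level was blocked.
    """
    status, body = 0, ""
    for name, fetch in levels:
        status, body = fetch(url)
        if status not in BLOCKED_STATUSES:
            return name, status, body
    return None, status, body


# Stub fetchers so the sketch runs without network access.
def polite_http(url: str) -> Tuple[int, str]:
    return 403, ""


def stealth(url: str) -> Tuple[int, str]:
    return 200, "<html>ok</html>"


print(fetch_with_escalation("https://example.com", [("polite_http", polite_http), ("stealth", stealth)]))
# ('stealth', 200, '<html>ok</html>')
```

In real use each callable would wrap one level of the ladder, and escalation should only happen on sites you are authorized to access.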

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
✅ Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

Pause/Resume: crawldir parameter saves checkpoints
Streaming: async for item in spider.stream() for real-time processing
Auto-retry: Configurable retry on blocked requests
Export: Built-in to_json(), to_jsonl()

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare

Interactive Shell

scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find structurally similar elements
below = first.below_elements    # Elements below this one in the DOM (a property in upstream Scrapling, not a method)

Auto-Generated Selectors

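# Get a generated selector for any element
# (upstream Scrapling exposes these as properties named generate_*_selector)
element = page.css('.product')[0]
selector = element.generate_css_selector  # Generated CSS path
xpath = element.generate_xpath_selector   # Generated XPath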

Proxy Rotation

from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
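
To make "cyclic" concrete: each call hands out the next proxy in list order and wraps around at the end. The class below is a standard-library illustration of that strategy only; `CyclicProxies` is a hypothetical name and is not Scrapling's `ProxyRotator`:

```python
from itertools import cycle
from typing import Iterable


class CyclicProxies:
    """Hand out proxy URLs in a fixed, repeating order."""

    def __init__(self, proxies: Iterable[str]) -> None:
        proxy_list = list(proxies)
        if not proxy_list:
            raise ValueError("at least one proxy URL is required")
        self._cycle = cycle(proxy_list)

    def next(self) -> str:
        # The builtin next() advances the underlying itertools.cycle iterator.
        return next(self._cycle)


rotator = CyclicProxies(["http://proxy1:8080", "http://proxy2:8080"])
print([rotator.next() for _ in range(3)])
# ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy1:8080']
```

Cyclic rotation spreads requests evenly but predictably; it does not skip proxies that have started failing.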

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

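# Next button (inside a Spider.parse callback): follow one link per response.
# Use `if`, not `while`; the same response would otherwise yield the same link forever.
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicSession): scroll in the browser before the page is returned.
# Assumes the fetcher's page_action hook, which receives the Playwright page.
def scroll_down(pw_page):
    pw_page.mouse.wheel(0, 20000)  # One large scroll; repeat if the site loads in batches
    return pw_page

with DynamicSession() as session:
    page = session.fetch(url, page_action=scroll_down, network_idle=True)
    items = page.css('.item').getall()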
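
Two details of the patterns above are easy to get wrong: building the page-number URLs, and turning a relative "next" href into an absolute URL. A standard-library sketch; the helper names `page_urls` and `resolve_next` are hypothetical:

```python
from typing import List, Optional
from urllib.parse import urlencode, urljoin


def page_urls(base: str, first: int, last: int) -> List[str]:
    """Build ?page=N URLs for first..last inclusive."""
    return [f"{base}?{urlencode({'page': n})}" for n in range(first, last + 1)]


def resolve_next(current_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve a 'next' link (often relative) against the current URL; None on the last page."""
    return urljoin(current_url, href) if href else None


print(page_urls("https://example.com/products", 1, 2))
# ['https://example.com/products?page=1', 'https://example.com/products?page=2']
print(resolve_next("https://example.com/products?page=2", "?page=3"))
# https://example.com/products?page=3
```

Returning `None` for a missing href gives the crawl loop a clear stop condition.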

Login Sessions

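from scrapling.fetchers import StealthySession

def login(pw_page):
    # pw_page is the Playwright page; assumes the fetcher's page_action hook
    pw_page.fill('input[name="username"]', 'user')
    pw_page.fill('input[name="password"]', 'pass')
    pw_page.click('button[type="submit"]')
    return pw_page

with StealthySession(headless=False) as session:
    # Log in by driving the form in the browser
    session.fetch('https://example.com/login', page_action=login)

    # The session now holds the login cookies
    protected_page = session.fetch('https://example.com/dashboard')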

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
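
The snippet above raises if the page has no `__NEXT_DATA__` tag, because `re.search` returns `None`. A self-contained variant that guards against that case, using a made-up sample page; the function name `extract_page_props` and the sample HTML are illustrative assumptions:

```python
import json
import re

# Matches the script tag by its id, whatever order its attributes appear in.
NEXT_DATA_RE = re.compile(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.S)


def extract_page_props(html: str) -> dict:
    """Return props.pageProps from a Next.js page, or {} when the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return {}
    data = json.loads(match.group(1))
    return data.get("props", {}).get("pageProps", {})


sample = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Widget", "price": 9.5}}}'
    "</script></body></html>"
)
print(extract_page_props(sample))
# {'title': 'Widget', 'price': 9.5}
```

With Scrapling you would pass `page.html_content` as the `html` argument.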

Output Formats

# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
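
The difference between the two file formats is easiest to see side by side. A standard-library sketch of what each one contains; these helpers are illustrative and are not Scrapling's `to_json()` / `to_jsonl()` implementations:

```python
import json


def to_json_text(items: list) -> str:
    """One pretty-printed JSON array holding every item."""
    return json.dumps(items, indent=2, ensure_ascii=False)


def to_jsonl_text(items: list) -> str:
    """One compact JSON object per line, so a consumer can read items one at a time."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def read_jsonl(text: str) -> list:
    """Parse JSONL back into a list, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


items = [{"title": "Widget", "price": 9.5}, {"title": "Gadget", "price": 12}]
print(to_jsonl_text(items), end="")
```

JSONL suits large crawls because a partially written file is still readable line by line, whereas a truncated JSON array is not valid JSON.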

Performance Tips

Use HTTP fetcher when possible — 10-100x faster than browser
Impersonate browsers — impersonate='chrome' for TLS fingerprinting
HTTP/3 support — FetcherSession(http3=True)
Limit resources — disable_resources=True in Dynamic/Stealthy
Connection pooling — Reuse sessions across requests

Guardrails (Always)

Only scrape content you're authorized to access
Respect robots.txt and ToS
Add delays (download_delay) for large crawls
Don't bypass paywalls or authentication without permission
Never scrape personal/sensitive data
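
Part of the robots.txt guardrail can be checked in code before a crawl starts. A sketch using Python's standard `urllib.robotparser`; the sample rules, the `my-crawler` user agent, and the helper name `build_robots` are made up for illustration, and a real crawl would load the site's own robots.txt:

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_robots(SAMPLE_ROBOTS)
print(robots.can_fetch("my-crawler", "https://example.com/products"))    # True
print(robots.can_fetch("my-crawler", "https://example.com/private/x"))   # False
print(robots.crawl_delay("my-crawler"))                                  # 2
```

The `crawl_delay` value, when present, is a reasonable lower bound for the spider's `download_delay`. Terms of service and authorization still have to be checked by a person.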

References

references/mcp-setup.md — Detailed MCP configuration
references/anti-bot.md — Anti-bot handling strategies
references/proxy-rotation.md — Proxy setup and rotation
references/spider-recipes.md — Advanced crawling patterns
references/api-reference.md — Quick API reference
references/links.md — Official docs links

Scripts

scripts/scrapling_scrape.py — Quick one-off extraction
scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators

Files

10 total
