Scrapling
Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction.
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign (medium confidence)
Purpose & Capability
Name, description, SKILL.md, and the included scrape.py all align: this is a web-scraping helper that relies on a 'scrapling' Python package and supports stealth/dynamic fetchers, sessions, proxies, and a CLI. The code and prose request no unrelated system access or credentials.
Instruction Scope
SKILL.md instructs installing and running the external 'scrapling' package and optionally starting an MCP server for AI integration. The runtime instructions and example code operate only on URLs and local outputs; they do not instruct reading unrelated local files or exfiltrating secrets. However, the MCP server and 'collect data for AI training/RAG' guidance mean the skill may be used to send scraped data off-host if the installed package or operator config does so.
Install Mechanism
The registry has no install spec (instruction-only), so platform won't install binaries automatically. SKILL.md explicitly tells users to pip install 'scrapling' (and extras). Installing a third‑party PyPI package is common but introduces standard supply-chain risk: the package is external and not vetted by this bundle. The included script itself does not download additional code or call unknown endpoints.
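One concrete way to reduce the supply-chain risk noted above is to download the wheel separately and compare its SHA-256 digest against the one published on PyPI's "Download files" tab. A minimal sketch; the filename and expected digest are illustrative placeholders:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Compute the SHA-256 hex digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest shown on the package's PyPI page:
# expected = "..."  # copy from PyPI's "Download files" tab
# assert sha256_of("scrapling-x.y.z-py3-none-any.whl") == expected
```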
Credentials
The skill declares no required environment variables, credentials, or config paths and the code does not access environment secrets. This is proportionate to a scraping tool.
Persistence & Privilege
always:false and user-invocable:true. The skill does not request forced persistent presence or modify other skills/config. Autonomous invocation is allowed by default but is not combined with other high-risk indicators here.
Assessment
This bundle appears coherent for web scraping, but before installing or running anything:
1. Inspect the actual 'scrapling' PyPI package and its source; the SKILL.md instructs a pip install of that package, which is external code.
2. If you plan to run the MCP server or allow autonomous runs, consider running in a sandbox/container and monitor network traffic, because scraped data may be sent off-host.
3. Do not provide credentials (API keys, AWS, etc.) unless you verify the package's requirements.
4. Review the license and terms of the target sites, and ensure you have permission and comply with robots.txt and applicable laws.
5. If you want stronger assurance, ask the skill author for the package's source repository or a signed release before installation.

Like a lobster shell, security has layers: review code before you run it.
Current version: v1.0.0
SKILL.md
Scrapling
An adaptive web scraping framework: bypasses anti-bot protection, scales to large crawls, and keeps working when sites change their layout.
Installation
# Base install (parser only)
pip install scrapling
# Full install (fetchers and browsers)
pip install "scrapling[all]"
scrapling install
# Or install features individually
pip install "scrapling[fetchers]"  # fetching support
pip install "scrapling[ai]"        # MCP server
pip install "scrapling[shell]"     # interactive shell
Quick Start
Basic fetching
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
print(quotes)
Bypassing anti-bot protection (Cloudflare, etc.)
from scrapling.fetchers import StealthyFetcher
# Automatically solves Cloudflare Turnstile
page = StealthyFetcher.fetch(
    'https://target-site.example',
    headless=True,
    solve_cloudflare=True
)
data = page.css('.content::text').getall()
Dynamic pages (JS rendering)
from scrapling.fetchers import DynamicFetcher
# Full browser rendering
page = DynamicFetcher.fetch(
    'https://spa-site.example',
    headless=True,
    network_idle=True  # wait for network requests to settle
)
Fetcher Types
| Fetcher | Use case | Characteristics |
|---|---|---|
| Fetcher | Plain HTTP requests | Fastest; best for static pages |
| StealthyFetcher | Stealth mode | Bypasses anti-bot checks and Cloudflare |
| DynamicFetcher | Browser mode | JS rendering for SPA pages |
Element Selection
page = Fetcher.get('https://example.com')
# CSS selectors
items = page.css('.item')
title = page.css('h1::text').get()
titles = page.css('h2::text').getall()
# XPath
items = page.xpath('//div[@class="item"]')
# BeautifulSoup-style
items = page.find_all('div', class_='item')
items = page.find_by_text('keyword', tag='div')
# Chained selection
quote_text = page.css('.quote')[0].css('.text::text').get()
# Navigation
first = page.css('.item')[0]
parent = first.parent
sibling = first.next_sibling
similar = first.find_similar()  # find structurally similar elements
Session Management
from scrapling.fetchers import FetcherSession, StealthySession
# Keep a session (cookie reuse)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://example.com/login')
    page2 = session.fetch('https://example.com/dashboard')  # logged-in state
# Async session (inside an async function)
from scrapling.fetchers import AsyncStealthySession
async with AsyncStealthySession(headless=True) as session:
    page = await session.fetch('https://example.com')
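The async session pairs naturally with `asyncio.gather` to fetch several pages concurrently instead of one after another. A sketch with a stub `fetch` standing in for `session.fetch` (the stub is hypothetical, not Scrapling's API):

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for session.fetch(url); a real call would return a page object.
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"<html>{url}</html>"

async def crawl(urls: list[str]) -> list[str]:
    # Launch all fetches at once and wait for every result.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
```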
Building Spiders (large-scale crawling)
from scrapling.spiders import Spider, Response
class MySpider(Spider):
    name = "products"
    start_urls = ["https://shop.example.com/"]
    concurrent_requests = 10  # concurrency level

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }
        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run
result = MySpider().start()
print(f"Scraped {len(result.items)} items")
# Export
result.items.to_json("output.json")
result.items.to_jsonl("output.jsonl")
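JSON Lines keeps one record per line, so exports can stream and append without rewriting the whole file. A stdlib sketch of the format itself; the helper names are mine, not Scrapling's API:

```python
import json

def to_jsonl(items: list[dict]) -> str:
    # One JSON object per line; appending new records never touches old ones.
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

def from_jsonl(text: str) -> list[dict]:
    # Skip blank lines so trailing newlines are harmless.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

items = [{"title": "Widget", "price": "$9"}, {"title": "Gadget", "price": "$12"}]
assert from_jsonl(to_jsonl(items)) == items
```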
Resumable crawling
# Specify a crawl directory to enable pause/resume
MySpider(crawldir="./crawl_data").start()
# Ctrl+C pauses; running again resumes from the checkpoint
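Pause/resume works by persisting crawl state between runs. A toy sketch of the idea, with a file format of my own invention (not what Scrapling actually writes to its crawl directory):

```python
import json
import os

class Checkpoint:
    """Persist the set of completed URLs so a restart can skip them."""

    def __init__(self, path: str):
        self.path = path
        self.done: set[str] = set()
        if os.path.exists(path):
            with open(path) as fh:
                self.done = set(json.load(fh))

    def mark(self, url: str) -> None:
        # Record a finished URL and flush to disk immediately.
        self.done.add(url)
        with open(self.path, "w") as fh:
            json.dump(sorted(self.done), fh)

    def pending(self, urls: list[str]) -> list[str]:
        # URLs still left to crawl after a restart.
        return [u for u in urls if u not in self.done]
```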
Mixing multiple sessions
from scrapling.spiders import Spider, Request
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSpider(Spider):
    name = "multi"

    def configure_sessions(self, manager):
        # Plain requests - fast
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth requests - bypass anti-bot checks
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # use the stealth session
            else:
                yield Request(link, sid="fast")  # use the fast session
Adaptive Parsing
Automatically relocates elements after a site redesign:
# First crawl: save element signatures
products = page.css('.product', auto_save=True)
# After the redesign, adaptive=True relocates them automatically
products = page.css('.product', adaptive=True)
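Adaptive relocation rests on a simple idea: save features of the matched elements (tag, attributes, text), then after a redesign pick the candidate most similar to the saved profile. A toy scoring sketch to illustrate the concept; this is not Scrapling's actual algorithm:

```python
def similarity(saved: dict, candidate: dict) -> float:
    """Fraction of feature keys on which the two elements agree."""
    keys = set(saved) | set(candidate)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if saved.get(k) == candidate.get(k))
    return matches / len(keys)

# Profile saved on the first crawl (auto_save), as simple key/value features.
saved = {"tag": "div", "class": "product", "text": "Widget"}

# Candidates found after the redesign.
candidates = [
    {"tag": "div", "class": "item-card", "text": "Widget"},  # class renamed
    {"tag": "span", "class": "ad", "text": "Buy now"},
]
best = max(candidates, key=lambda c: similarity(saved, c))
```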
Proxy Rotation
from scrapling.fetchers import StealthyFetcher, ProxyRotator
proxies = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])
page = StealthyFetcher.fetch(
    'https://example.com',
    proxy=proxies.next()
)
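If the `ProxyRotator` helper above isn't available in your installed version, round-robin rotation is a few lines of stdlib. A sketch with the same `next()` interface (the class name is mine):

```python
from itertools import cycle

class RoundRobinProxies:
    """Hand out proxies in order, wrapping around at the end of the list."""

    def __init__(self, proxies: list[str]):
        self._cycle = cycle(proxies)

    def next(self) -> str:
        return next(self._cycle)

rotator = RoundRobinProxies(["http://proxy1:8080", "http://proxy2:8080"])
first, second, third = rotator.next(), rotator.next(), rotator.next()
# first and third are the same proxy; second is the other one
```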
CLI Commands
# Interactive shell
scrapling shell
# Fetch directly (no code required)
scrapling extract get 'https://example.com' output.md
scrapling extract stealthy-fetch 'https://protected.com' output.html --solve-cloudflare
# Install browsers
scrapling install
scrapling install --force
MCP Server (AI integration)
Lets Claude/Cursor call Scrapling directly to scrape data:
pip install "scrapling[ai]"
# Start the MCP server
scrapling mcp
Add to the Claude Desktop config:
{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
Common Use Cases
E-commerce price comparison
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch('https://item.jd.com/12345.html', headless=True)
price = page.css('.price::text').get()
title = page.css('.sku-name::text').get()
Job listings
from scrapling.spiders import Spider, Response
class JobsSpider(Spider):
    name = "jobs"
    start_urls = ["https://www.zhipin.com/job_detail/?query=Python"]

    async def parse(self, response: Response):
        for job in response.css('.job-list li'):
            yield {
                "title": job.css('.job-name::text').get(),
                "salary": job.css('.salary::text').get(),
                "company": job.css('.company-name::text').get(),
            }
Competitor monitoring
from scrapling.fetchers import Fetcher
import json
def check_competitor(url):
    page = Fetcher.get(url)
    return {
        "products": len(page.css('.product')),
        "price_range": page.css('.price::text').getall(),
        "updated": page.css('.update-time::text').get(),
    }
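Monitoring only pays off if you can diff snapshots over time. A sketch comparing two runs of `check_competitor`-style dicts; the field names are assumptions for illustration:

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Return the fields whose values changed between two scrape runs."""
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in set(old) | set(new)
        if old.get(key) != new.get(key)
    }

yesterday = {"products": 40, "min_price": "$9.99"}
today = {"products": 42, "min_price": "$9.99"}
changes = diff_snapshots(yesterday, today)
# changes == {"products": {"old": 40, "new": 42}}
```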
Tips
- Test before scaling: debug selectors in scrapling shell
- Set concurrency sensibly: too high a concurrent_requests gets you banned quickly
- Reuse sessions: keep login state and cookies in a Session
- Resume long crawls: always set crawldir for long-running jobs
- Respect robots.txt: scrape compliantly
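The stdlib can check robots.txt rules before you crawl. A sketch using `urllib.robotparser` with inline rules; in practice you would call `rp.set_url(...)` and `rp.read()` against the live site:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the robots.txt body as a list of lines.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])
ok_public = rp.can_fetch("*", "https://example.com/products")
ok_private = rp.can_fetch("*", "https://example.com/private/data")
```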