Skill

v0.1.0

A high-performance, adaptive Python web-scraping framework with built-in anti-bot bypass (Cloudflare Turnstile), intelligent element relocation, a full crawling framework, and an MCP server — suited to AI-assisted data extraction and large-scale scraping tasks

Security Scan
VirusTotal
Benign
View report →
OpenClaw
Benign
medium confidence
Purpose & Capability
The name/description (Scrapling — adaptive web-scraper with Cloudflare/Turnstile handling, adaptive selectors, Spider and MCP server) matches the content of SKILL.md and included guides. Commands and examples (pip install, scrapling fetch, scrapling mcp, Docker image, Playwright/Camoufox, proxies, adaptive=True) are all coherent with a scraping framework.
Instruction Scope
The SKILL.md instructs the AI to install packages (pip install scrapling), install browser drivers, run the MCP server (scrapling mcp), and check local files (e.g., ~/.scrapling/storage.db). Those filesystem reads and local server configuration are within the framework's purpose, but they do give the agent permission to run installers, read a user home path, and start a local service — items the user should explicitly approve. The docs also include examples that disable robots.txt and enable proxy credentials; these are behavioral choices (ethical/legal) but technically consistent with the stated purpose.
Install Mechanism
This is an instruction-only skill (no install spec). The docs recommend standard install methods (pip install, playwright install, docker pull). Those are common and expected; there are no opaque download URLs or archive/extract steps embedded in the skill artifact itself.
Credentials
The skill declares no required environment variables or credentials, which matches the metadata. Example usage, however, shows proxy URLs with user:pass and MCP configuration snippets for AI integrations; the skill does not automatically request or store secrets, but the user or agent will need to supply proxy credentials or modify MCP/AI config if they follow those examples. Users should not provide credentials unless they trust the package and understand where they will be stored or transmitted.
Persistence & Privilege
always is false (normal). The skill documents starting an MCP server (local service) and integrating it into an AI/MCP configuration which opens a local control surface for AI client tools. That is consistent with the feature set but increases runtime exposure (a local HTTP/IPC endpoint that can control browser automation). The skill does not indicate modifying other skills or system-wide agent settings beyond its own MCP entry, which is expected for this use case.
Scan Findings in Context
[no_regex_findings] expected: The static regex scanner found nothing — expected because this skill is instruction-only (no code files) and the instructions direct use of external packages (pip/docker) rather than including code to scan.
Assessment
This skill appears internally consistent for a web-scraping framework, but before installing it or letting an agent run these steps, consider:

  • Source verification: the SKILL.md references GitHub, PyPI, and Docker images; verify those upstream projects and package authors on PyPI/GitHub to avoid installing trojanized packages.
  • Pip/Docker risk: running pip install or docker pull installs third-party code on your machine — only run them for trusted packages and review package releases.
  • MCP server exposure: starting scrapling mcp opens a local service that AI tooling can call — make sure you understand which clients are allowed to connect and that you don't expose it to untrusted networks.
  • Secrets and proxies: the examples use proxy URLs with credentials and show how to configure MCP in AI clients. Do not provide credentials to the agent or paste them into configs unless you trust the package and know where the credentials will be stored or transmitted.
  • Legal/ethical: the docs show ways to bypass anti-bot protections (Cloudflare Turnstile) and to disable robots.txt — ensure your scraping activity complies with target-site policies and applicable law.

If you want a stronger assessment, provide the actual PyPI package page, the upstream GitHub repo contents (so the package code can be inspected), or any release tarball URLs; with those, the package code can be scanned for dangerous behaviors (exfiltration endpoints, obfuscated code, unexpected credential access).

Like a lobster shell, security has layers — review code before you run it.

latest: vk976a975wv112w595k50y0dt5985bdv6
27 downloads
0 stars
1 version
Updated 6h ago
v0.1.0
MIT-0

Scrapling — Adaptive Web-Scraping Framework

Scrapling is one of the most powerful Python web-scraping frameworks outside the Google Chrome DevTools ecosystem, handling everything from single HTTP requests to large-scale concurrent crawls. Its adaptive parsing engine automatically relocates elements after a site redesign, it ships with built-in Cloudflare Turnstile bypass, its Spider framework supports pause/resume, and it provides an MCP server so AI can assist with data extraction directly, cutting token consumption at the source.

Core Use Cases

  • Anti-bot site scraping: StealthyFetcher has built-in Cloudflare Turnstile bypass and supports TLS fingerprint spoofing and browser automation
  • Adaptive data collection: after a site redesign, auto_save=True stores element snapshots and adaptive=True automatically relocates changed elements
  • Large-scale concurrent crawling: the Spider framework supports multiple sessions, proxy rotation, and pause/resume, letting you define crawlers Scrapy-style
  • AI-assisted extraction: the built-in MCP server lets AI tools such as Claude and Cursor call Scrapling directly to extract target content
  • Dynamic page handling: DynamicFetcher is built on Playwright and supports full browser automation and network-idle waiting
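The adaptive relocation mentioned above boils down to one idea: save a lightweight fingerprint of a matched element, then, after a redesign, score candidate elements by similarity to that fingerprint. The pure-Python sketch below illustrates the concept only — the fingerprint fields, weights, and threshold are all invented for illustration and are not Scrapling's actual algorithm or API.

```python
# Conceptual illustration of adaptive element relocation (NOT Scrapling's
# real algorithm): save a fingerprint of an element, then after a redesign
# pick the candidate element most similar to that fingerprint.

def fingerprint(tag, classes, text):
    """Lightweight snapshot of an element, as saved under auto_save=True."""
    return {"tag": tag, "classes": set(classes), "text": text[:40]}

def similarity(fp, candidate):
    """Score a candidate element against a saved fingerprint (0.0-1.0)."""
    score = 0.0
    if candidate["tag"] == fp["tag"]:
        score += 0.3
    union = fp["classes"] | candidate["classes"]
    if union:
        score += 0.4 * len(fp["classes"] & candidate["classes"]) / len(union)
    if fp["text"] and fp["text"] in candidate["text"]:
        score += 0.3
    return score

def relocate(fp, candidates, threshold=0.5):
    """Return the best-matching candidate, mimicking adaptive=True."""
    best = max(candidates, key=lambda c: similarity(fp, c))
    return best if similarity(fp, best) >= threshold else None

# Saved before the redesign:
saved = fingerprint("div", ["product", "card"], "Widget 3000 - $19.99")

# After the redesign the class names changed, but tag and text survive:
candidates = [
    {"tag": "nav", "classes": {"menu"}, "text": "Home"},
    {"tag": "div", "classes": {"product", "tile"}, "text": "Widget 3000 - $19.99"},
]
print(sorted(relocate(saved, candidates)["classes"]))  # ['product', 'tile']
```

Class renames only lower the class-overlap term, so stable signals (tag and text) still carry the match — which is why a redesign that keeps content intact can be survived.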

AI-Assisted Workflow

  1. Install dependencies — the AI runs pip install scrapling and installs browser drivers as needed
  2. Pick a fetcher — the AI recommends Fetcher/StealthyFetcher/DynamicFetcher based on the target site
  3. Write scraping logic — the AI generates CSS/XPath selector code and configures auto_save for adaptive scraping
  4. Debug and optimize — the AI analyzes responses and adjusts selectors or switches fetcher strategies
  5. Scale to a Spider — the AI expands single-page scraping into a full Spider class with concurrency and proxies
  6. MCP mode — start the Scrapling MCP Server and let the AI drive the browser to extract data directly
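The last step wires the server into an MCP-capable client. The exact config format depends on the client; as an unverified illustration, a Claude Desktop-style entry might look like the snippet below. The server key name "scrapling" and the stdio launch via the scrapling mcp command are assumptions based on the CLI shown in this document, not documented configuration.

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
```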

Key Section Navigation

  • Installation Guide — pip installation, browser drivers, Docker image
  • Quick Start — fetcher selection, CSS/XPath selectors, adaptive scraping
  • Advanced Usage — Spider framework, proxy rotation, MCP server, CLI tools
  • Troubleshooting — anti-bot measures, browser drivers, timeouts, proxy issues

AI Assistant Capabilities

With this skill, the AI can:

  • ✅ Install Scrapling and configure browser drivers (scrapling install playwright / scrapling install camoufox)
  • ✅ Automatically pick the most suitable Fetcher class for the target site
  • ✅ Write CSS/XPath selectors to extract target data
  • ✅ Configure auto_save=True and adaptive=True for adaptive scraping
  • ✅ Build full Spider classes for concurrent crawling with pause/resume
  • ✅ Set up proxy rotation and DNS-leak protection (DoH mode)
  • ✅ Start and configure the Scrapling MCP server
  • ✅ Use the CLI to quickly test URL scraping
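The proxy-rotation capability listed above amounts, in its simplest form, to cycling through a proxy list one request at a time. This stdlib-only stand-in illustrates that round-robin concept; the class name and methods are invented here and are not Scrapling's ProxyRotator API.

```python
from itertools import cycle

# Minimal round-robin proxy rotation, illustrating the concept behind a
# rotator like ProxyRotator (this is NOT Scrapling's actual API).
class RoundRobinRotator:
    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = cycle(proxies)  # endless iterator over the proxy list

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        return next(self._cycle)

rotator = RoundRobinRotator([
    "http://proxy-a:8080",
    "http://proxy-b:8080",
])
print([rotator.next_proxy() for _ in range(3)])
# ['http://proxy-a:8080', 'http://proxy-b:8080', 'http://proxy-a:8080']
```

A custom strategy would swap the cycle for any policy — weighted choice, health-checked pools, per-domain stickiness — behind the same next-proxy call.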

Core Features

  • Three fetchers: Fetcher (fast HTTP), StealthyFetcher (anti-bot bypass), DynamicFetcher (browser automation)
  • Adaptive parsing — automatically relocates elements after site redesigns, cutting maintenance cost
  • Cloudflare bypass — built-in Turnstile/interstitial handling, no extra service needed
  • Spider framework — Scrapy-style API with concurrency, multiple sessions, and pause/resume
  • Streaming output: spider.stream() pushes results in real time, suited to large-scale jobs
  • MCP server — AI tools call Scrapling directly to extract data, reducing token consumption
  • Proxy rotation — built-in ProxyRotator with round-robin or custom strategies
  • Session management: FetcherSession/StealthySession/DynamicSession keep state across requests
  • Development mode — caches responses on first run, then replays them offline for fast iteration on parsing logic
  • CLI tools — scrape pages straight from the terminal, no code required
  • IPython shell — interactive debugging with a built-in curl converter
  • Docker image — production-ready image with all browsers preinstalled
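The development mode in the list above (cache on first run, replay offline afterwards) is a record-and-replay pattern. Here is a stdlib-only sketch of that pattern — the class, in-memory store, and hashing scheme are illustrative assumptions, not Scrapling's implementation, which persists to disk.

```python
import hashlib

# Record-and-replay sketch of a development-mode cache (NOT Scrapling's
# implementation): the first fetch stores the response, later fetches for
# the same URL are served from the cache without touching the network.
class DevCache:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn  # real network fetcher, e.g. an HTTP GET
        self._store = {}        # a real tool would persist this to disk

    def _key(self, url):
        return hashlib.sha256(url.encode()).hexdigest()

    def get(self, url):
        key = self._key(url)
        if key not in self._store:       # first run: hit the network
            self._store[key] = self._fetch(url)
        return self._store[key]          # replays offline afterwards

calls = []
def fake_fetch(url):
    calls.append(url)
    return f"<html>body of {url}</html>"

cache = DevCache(fake_fetch)
cache.get("https://example.com")   # network fetch, then cached
cache.get("https://example.com")   # served from the cache
print(len(calls))  # 1
```

Because repeat runs never touch the network, selector logic can be iterated on quickly and without hammering the target site.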

Quick Examples

from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher

# Plain HTTP fetch (fastest)
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

# Stealth mode to bypass Cloudflare
page = StealthyFetcher.fetch('https://protected-site.com', headless=True)
data = page.css('.content::text').get()

# Adaptive scraping (auto-relocation after a site redesign)
page = Fetcher.get('https://example.com/products')
products = page.css('.product', auto_save=True)   # first run saves element snapshots
# After the site redesign:
products = page.css('.product', adaptive=True)    # automatically relocates elements

# Quick CLI test (no code needed)
scrapling fetch https://quotes.toscrape.com/ --css ".quote .text"

# Start the MCP server
scrapling mcp

Installation Requirements

  • Python: >= 3.9
  • pip: any version
  • Playwright: optional (used by DynamicFetcher)
  • Camoufox: optional (used by StealthyFetcher)
  • Docker: optional (for the official image)

Project Links
