Scrapling Fetch Basic

v1.0.0

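A basic web scraping tool that can bypass anti-bot systems, automatically locate the main content area, and convert HTML to Markdown. Suited to static pages such as blogs, news articles, and announcements.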

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for shuxiangfanclaw/scrapling-fetch-basic.

Install the skill "Scrapling Fetch Basic" (shuxiangfanclaw/scrapling-fetch-basic) from ClawHub.
Skill page: https://clawhub.ai/shuxiangfanclaw/scrapling-fetch-basic
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

openclaw skills install scrapling-fetch-basic

ClawHub CLI

npx clawhub@latest install scrapling-fetch-basic

Security Scan

VirusTotal: Benign
OpenClaw: Benign (medium confidence)

Purpose & Capability
Name/description (web scraping, Cloudflare/stealth, HTML→Markdown) align with the provided script and declared deps (scrapling, html2text, playwright). No unrelated env vars, binaries, or config paths are required.
Instruction Scope
SKILL.md describes running the included Python script; the script only fetches the target URL, extracts content with a set of selectors, converts to Markdown, and prints output or JSON. It does not attempt to read local files, other env vars, or exfiltrate results to external endpoints.
Install Mechanism
There is no install spec (instruction-only) and a single Python script is included. Dependencies are listed but not installed automatically; the user environment must install scrapling, html2text, and playwright. Playwright typically requires downloading browser binaries (user should be aware).
Credentials
No credentials, secret environment variables, or config paths are requested. Required libraries are proportional to the functionality (HTML parsing and optional browser automation).
Persistence & Privilege
Skill does not request persistent always:true, does not modify other skills or system configs, and is user-invocable only. The script is executed on demand and does not persist credentials or install itself.
Assessment
This skill appears internally consistent, but take these precautions before using it:

  • Source verification: the package has no homepage and an unknown owner. Inspect the scrapling dependency source (PyPI/GitHub) before installing, and prefer installing in an isolated environment (virtualenv/container).
  • Install notes: playwright will usually download browser binaries when first used (run 'playwright install' or follow its docs). That adds sizeable executables to the host, so be prepared for that.
  • SSRF / network risk: the script fetches arbitrary URLs. If you run it on a server that can access internal resources, an attacker-supplied URL could cause server-side requests to internal endpoints. Only run it with trusted URLs or in a network-isolated environment.
  • Legal/ethical: stealth mode and Cloudflare bypass are intended to evade anti-bot protections. Ensure you have the right to scrape your targets and that you comply with their terms of service and applicable laws.
  • Dependency hygiene: install dependencies from official registries or pinned releases, and review the 'scrapling' package code, because the skill relies on it for network access and stealth behavior.
  • Runtime safety: run first with --debug and limited targets; consider timeouts and rate limits to avoid unintended heavy load.

If you want higher assurance, ask the author for a homepage or source repository and release provenance (e.g., GitHub repo and PyPI package/version).
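
As one way to act on the SSRF precaution, a caller could screen URLs before handing them to the script. The following is a minimal sketch, not part of the skill: the function name `is_obviously_internal` is hypothetical, and it only inspects the URL text (scheme, `localhost` names, and literal IP addresses). It does not resolve DNS, so a public-looking hostname that points at an internal address would still pass; network isolation remains the stronger control.

```python
import ipaddress
from urllib.parse import urlparse


def is_obviously_internal(url: str) -> bool:
    """Return True if the URL should be rejected before fetching.

    Hypothetical helper: rejects non-HTTP(S) schemes, missing hosts,
    localhost names, and literal IPs that are not globally routable.
    Hostnames are NOT resolved, so this is only a first-line filter.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True
    host = parsed.hostname
    if not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a regular hostname, which this sketch lets through.
        return False
    # Loopback, private, and link-local ranges are all non-global.
    return not address.is_global
```

A wrapper would call this check first and refuse to invoke the fetch script when it returns True.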

Like a lobster shell, security has layers — review code before you run it.

latest vk972s5jgcq5j942ewzwc5c042183r7ck
120 downloads
0 stars
1 version
Updated 1mo ago
v1.0.0
MIT-0

Scrapling Fetch Basic

The basic edition of the web scraping tool: fast, efficient, and suitable for most scenarios.

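Main Features

🌐 Web Content Scraping

  • Smart content extraction: automatically identifies and extracts the page's main content, with no need to specify selectors manually
  • Markdown output: automatically converts HTML into clean Markdown
  • Character limit: supports a custom maximum number of output characters (default 30000)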
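
The character limit can be pictured as a simple cap applied to the converted Markdown. This is a minimal sketch under an assumption: the source only states that a maximum (default 30000) exists, not how the script enforces it, so the plain cut-off shown here and the name `limit_chars` are illustrative.

```python
DEFAULT_MAX_CHARS = 30000  # default stated in the feature list


def limit_chars(markdown: str, max_chars: int = DEFAULT_MAX_CHARS) -> str:
    """Cap the output at max_chars characters (assumed behavior: hard cut)."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    return markdown[:max_chars]
```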

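🔓 Anti-Bot Bypass

  • Cloudflare Turnstile: stealth mode can get past Cloudflare's anti-bot verification
  • Browser fingerprint spoofing: stealth mode emulates a real browser

🎯 Mode Selection

  • basic mode: fast HTTP fetching, suited to static pages (default)
  • stealth mode: stealth-browser fetching, suited to sites with anti-bot protection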
Quick Start

# Basic fetch
python3 scripts/scrapling_fetch.py https://example.com/article

# Specify the maximum number of characters
python3 scripts/scrapling_fetch.py https://example.com/article 50000

# Bypass anti-bot protection
python3 scripts/scrapling_fetch.py https://protected-site.com --mode stealth

# JSON output
python3 scripts/scrapling_fetch.py https://example.com --json
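
When calling the script from another Python program, the commands above can be assembled as an argument list. The sketch below is a hypothetical wrapper (`build_command` is not part of the skill) that uses only the arguments shown in the Quick Start: the URL, the optional positional character limit, `--mode stealth`, and `--json`. How the flags combine beyond those examples is an assumption.

```python
from typing import List, Optional

SCRIPT_PATH = "scripts/scrapling_fetch.py"  # path used in the Quick Start


def build_command(
    url: str,
    max_chars: Optional[int] = None,
    mode: str = "basic",
    as_json: bool = False,
) -> List[str]:
    """Build the argv list for the fetch script (hypothetical helper)."""
    if mode not in ("basic", "stealth"):
        raise ValueError("mode must be 'basic' or 'stealth'")
    command = ["python3", SCRIPT_PATH, url]
    if max_chars is not None:
        command.append(str(max_chars))  # positional character limit
    if mode == "stealth":
        command += ["--mode", "stealth"]  # basic is the default, so omit it
    if as_json:
        command.append("--json")
    return command
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)`. Passing a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.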

Content Selectors (11)

Tried automatically in priority order:

  1. article - HTML5 article element
  2. main - HTML5 main element
  3. .post-content - common blog content area
  4. .article-content - common news content area
  5. .entry-content - common in WordPress
  6. .post-body - article body
  7. [class*='body'] - class names containing "body"
  8. [class*='content'] - class names containing "content"
  9. #content - content ID
  10. #main - main ID
  11. body - final fallback
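
The priority order above amounts to a first-match fallback loop. The sketch below illustrates that idea without depending on any parsing library: `select` is a hypothetical callable standing in for whatever CSS lookup the script performs, and skipping empty matches is an assumption rather than documented behavior.

```python
from typing import Callable, Optional, Sequence, Tuple

# The 11 selectors, in the priority order listed above.
CONTENT_SELECTORS = (
    "article",
    "main",
    ".post-content",
    ".article-content",
    ".entry-content",
    ".post-body",
    "[class*='body']",
    "[class*='content']",
    "#content",
    "#main",
    "body",
)


def first_matching(
    select: Callable[[str], Optional[str]],
    selectors: Sequence[str] = CONTENT_SELECTORS,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (selector, content) for the first selector that yields text.

    `select` is a hypothetical lookup: it takes a CSS selector and returns
    the matched HTML as a string, or None when nothing matches.
    """
    for selector in selectors:
        found = select(selector)
        if found and found.strip():  # assumed: ignore empty matches
            return selector, found
    return None, None
```

Because `body` is last in the list, a lookup backed by a real page would normally always return something, at the cost of including navigation and footer noise when no narrower selector matches.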

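Dependencies

| Package | Purpose |
|---|---|
| scrapling | Core scraping framework |
| html2text | HTML to Markdown conversion |
| playwright | Browser automation (stealth mode) |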

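Use Cases

  • ✅ Scraping blog posts
  • ✅ Scraping news pages
  • ✅ Scraping announcements and documents
  • ✅ Bypassing basic anti-bot protection
  • ⚠️ WeChat Official Account articles (limited support; the Pro edition is recommended)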

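Comparison with the Pro Edition

| Feature | Basic edition | Pro edition |
|---|---|---|
| Fetch modes | basic / stealth | basic / stealth / auto |
| Number of selectors | 11 | 16 |
| WeChat Official Accounts | ⚠️ Limited support | Full support |
| Noise cleanup | (none listed) | WeChat-specific cleanup |
| Auto-detection | (none listed) | Smart mode selection |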

Version: 1.0.0
Author: OpenClaw
