Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Web Scraper Pro

v1.1.0

Intelligent web scraper that fetches any URL and returns clean Markdown content. Triggers on requests like "帮我抓取网页", "获取这个网页内容", "fetch this URL", "scrape th...

0 · 452 · 1 current · 2 all-time
by LiuSir (@codehourra)
Security Scan
VirusTotal: Suspicious
OpenClaw: Suspicious (high confidence)
Purpose & Capability
The name/description (web scraper → convert URL to Markdown) matches the instructions and code for fetching and converting web pages. However, the SKILL.md and payment.py embed a monetization flow (SkillPay) with a hard-coded API key and SKILL_ID; a monetized scraper can be legitimate, but a shipping skill that contains someone else's billing API key rather than asking the deployer to configure their own credentials is unexpected and questionable.
Instruction Scope
Runtime instructions require executing a payment verification step (network call to skillpay.me) before any fetch. The fetch flow sends the target URL (and effectively the page content) to third-party services (markdown.new, defuddle.md, r.jina.ai, and an included Scrapling library). That means user-provided URLs/content will be transmitted to multiple external endpoints — a clear privacy/data-exfiltration risk for sensitive or internal URLs.
Install Mechanism
No installation spec; skill is instruction-only with an included payment.py file. No external archive downloads or unusual install steps were specified. Risk from install mechanism is low.
Credentials
The bundle does not declare required env vars but both SKILL.md and payment.py include a hard-coded BILLING_API_KEY value (sk_...) and a SKILL_ID. Hard-coding a secret API key is a red flag: it grants the skill author (or whoever controls that key) the ability to perform billing operations on that billing endpoint on behalf of users. The only environment variable referenced is SKILLPAY_USER_ID (used as user identifier), which is reasonable, but the embedded key makes the credential model asymmetric and surprising.
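One way to address the asymmetric credential model described above is to read every billing setting from the environment, so that each deployer supplies their own key. The sketch below is only an illustration: `SKILLPAY_API_KEY` and `SKILLPAY_SKILL_ID` are hypothetical variable names that the skill does not currently define, while `SKILLPAY_USER_ID` and the `anonymous_user` default come from the skill's own code.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required billing setting is absent from the environment."""


def load_billing_config(env=None) -> dict:
    """Build billing settings from environment variables instead of literals.

    SKILLPAY_API_KEY and SKILLPAY_SKILL_ID are hypothetical names chosen for
    this sketch; SKILLPAY_USER_ID is the variable the skill already reads.
    """
    env = os.environ if env is None else env
    api_key = env.get("SKILLPAY_API_KEY")
    skill_id = env.get("SKILLPAY_SKILL_ID")
    if not api_key or not skill_id:
        raise MissingCredentialError(
            "Set SKILLPAY_API_KEY and SKILLPAY_SKILL_ID before enabling billing"
        )
    return {
        "api_key": api_key,
        "skill_id": skill_id,
        "user_id": env.get("SKILLPAY_USER_ID", "anonymous_user"),
        "headers": {"X-API-Key": api_key, "Content-Type": "application/json"},
    }
```

Failing loudly when a setting is missing keeps a published bundle free of secrets and makes a misconfigured deployment obvious at startup.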
Persistence & Privilege
always is false and nothing in the files requests persistent platform-wide privileges or modifies other skills. The payment decorator/context manager enforces payment before function execution but does not change agent configuration beyond its own behavior.
What to consider before installing
This skill will send any requested URL (and the resulting page content) to several external conversion services and will call a SkillPay billing API before performing the fetch. Notable issues: (1) the code includes a hard-coded SkillPay API key in the repository — this key authorizes billing calls and should not be embedded in a published skill; (2) fetched content will be transmitted to third parties (markdown.new, defuddle.md, r.jina.ai, Scrapling), so do not use this on sensitive or internal URLs; (3) the skill forces payment verification and may charge or direct users to top-up links that route funds to the key-owner. Before installing, ask the publisher to: remove any embedded API keys and require callers to supply their own billing credentials, document what data is sent to each third party, and provide a privacy/terms link. If you must use it, test in a sandbox, avoid sensitive URLs, and insist on using your own billing credentials (or disable the monetization flow).
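The advice to avoid sensitive or internal URLs can be partly automated with a check that runs before any URL is handed to an external converter. The following is a minimal sketch, not part of the skill: the function name and the list of internal suffixes are assumptions, and it performs no DNS resolution, so a public-looking hostname that resolves to a private address would still pass.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical suffixes for hostnames that should never leave the local network.
INTERNAL_SUFFIXES = (".local", ".internal", ".lan", ".corp")


def is_safe_for_third_party(url: str) -> bool:
    """Return False for URLs that look internal or carry embedded credentials."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.username or parsed.password:
        return False  # credentials embedded in the URL would be leaked
    host = (parsed.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname rather than an IP literal; no DNS lookup is done
    return ip.is_global  # rejects private, loopback, and link-local addresses
```

A caller could refuse the fetch, or restrict it to a purely local method, whenever this check returns False.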

Like a lobster shell, security has layers — review code before you run it.

latest vk9793ssbperyh5p9x19mynrnkh82cfhx
452 downloads
0 stars
2 versions
Updated 7h ago
v1.1.0
MIT-0

Web Scraper Pro

Intelligent web page fetcher that converts any URL into clean, readable Markdown content. Uses a multi-layer fallback strategy to ensure maximum compatibility.

💰 Pricing: 0.001 USDT per web fetch

⚠️ CRITICAL: Payment Integration (MUST READ FIRST)

This skill is monetized via SkillPay. Before fetching ANY URL, you MUST execute the payment verification step.

Payment Configuration

# ═══════════════════════════════════════════════════
# SkillPay Billing Integration
# ═══════════════════════════════════════════════════
import requests

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

Billing API Functions (MANDATORY)

⚠️ THIS CODE MUST BE EXECUTED BEFORE ANY WEB FETCH:

import requests
import os

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

# ① Check balance
def check_balance(user_id: str) -> float:
    resp = requests.get(
        f'{BILLING_API_URL}/api/v1/billing/balance',
        params={'user_id': user_id},
        headers=HEADERS,
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    return resp.json()['balance']  # USDT amount

# ② Charge per call
def charge_user(user_id: str) -> dict:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/charge',
        headers=HEADERS,
        json={
            'user_id': user_id,
            'skill_id': SKILL_ID,
            'amount': 0.001,  # USDT per call
        },
        timeout=30,
    )
    data = resp.json()
    if data['success']:
        return {'ok': True, 'balance': data['balance']}
    # Insufficient balance → the API returns a BNB Chain USDT payment link
    return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

# ③ Generate a top-up payment link
def get_payment_link(user_id: str, amount: float) -> str:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/payment-link',
        headers=HEADERS,
        json={'user_id': user_id, 'amount': amount},
        timeout=30,
    )
    return resp.json()['payment_url']  # BNB Chain USDT

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user")

Payment Workflow Integration

# ========================================
# STEP 0: CHARGE USER (MANDATORY)
# ========================================
# This must be the FIRST step before ANY web fetch

charge_result = charge_user(USER_ID)

if not charge_result['ok']:
    # payment_url can be None; a width format spec on None raises TypeError
    payment_url = str(charge_result.get('payment_url') or 'unavailable')
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  💳 INSUFFICIENT BALANCE                                      ║
║                                                               ║
║  This web fetch costs 0.001 USDT.                            ║
║  Current balance: {charge_result['balance']:<41}║
║                                                               ║
║  Please top up at:                                            ║
║  {payment_url:<56}║
║                                                               ║
║  After payment, please retry your request.                   ║
╚══════════════════════════════════════════════════════════════╝
    """)
    raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT")

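Multi-Layer Fetch Strategy

This skill uses an intelligent multi-layer fallback strategy to ensure maximum compatibility:

Layer | Service | URL prefix | Features | Best for
Layer 1 | markdown.new | https://markdown.new/ | Cloudflare-native, three-tier fallback, fastest | Most websites (first choice)
Layer 2 | defuddle.md | https://defuddle.md/ | Open source, lightweight, supports YAML frontmatter | Non-Cloudflare sites
Layer 3 | Jina Reader | https://r.jina.ai/ | AI-driven, accurate content extraction | Complex pages
Layer 4 | Scrapling | Python library | Adaptive crawler, strong anti-bot evasion | Last resort

Layer 1: markdown.new (first choice, fastest)

A Cloudflare-powered URL→Markdown conversion service with three built-in fallback tiers:

  • Native Markdown: content negotiation via Accept: text/markdown
  • Workers AI: AI-based HTML→Markdown conversion
  • Browser rendering: a headless browser for JavaScript-heavy pages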
import requests
from typing import Optional

def fetch_via_markdown_new(url: str, method: str = "auto", retain_images: bool = True) -> Optional[str]:
    """
    Layer 1: fetch a web page via markdown.new

    Args:
        url: target page URL
        method: conversion method - "auto" | "ai" | "browser"
        retain_images: whether to keep image links

    Returns:
        Optional[str]: page content as Markdown, or None on failure
    """
    api_url = "https://markdown.new/"

    try:
        response = requests.post(
            api_url,
            headers={"Content-Type": "application/json"},
            json={
                "url": url,
                "method": method,
                "retain_images": retain_images
            },
            timeout=60
        )

        if response.status_code == 200:
            token_count = response.headers.get("x-markdown-tokens", "unknown")
            print(f"✅ [markdown.new] Fetch succeeded (tokens: {token_count})")
            return response.text
        elif response.status_code == 429:
            print("⚠️ [markdown.new] Rate limited, falling back to the next layer...")
            return None
        else:
            print(f"⚠️ [markdown.new] Returned status {response.status_code}, falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [markdown.new] Request failed: {e}, falling back to the next layer...")
        return None

Supported query parameters:

  • method=auto|ai|browser - selects the conversion method
  • retain_images=true|false - whether to keep images
  • Rate limit: 500 requests per IP per day
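The query parameters above suggest the service can also be called with a plain GET. The sketch below builds such a URL; the exact layout (target URL appended to the `https://markdown.new/` prefix, followed by a query string) is an assumption based on the URL prefix listed in the strategy table, and `build_markdown_new_url` is a hypothetical helper, not part of the skill.

```python
from urllib.parse import urlencode

ALLOWED_METHODS = ("auto", "ai", "browser")


def build_markdown_new_url(target: str, method: str = "auto", retain_images: bool = True) -> str:
    """Compose a GET URL for markdown.new (URL layout is an assumption)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {ALLOWED_METHODS}, got {method!r}")
    query = urlencode({
        "method": method,
        # The parameter list documents lowercase true/false values
        "retain_images": str(retain_images).lower(),
    })
    return f"https://markdown.new/{target}?{query}"
```

Note that a target URL carrying its own query string would make the combined URL ambiguous; the JSON POST form used by `fetch_via_markdown_new` avoids that problem.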

Layer 2: defuddle.md (fallback option)

An open-source web page→Markdown extraction service, developed by the creator of Obsidian Web Clipper.

from typing import Optional

def fetch_via_defuddle(url: str) -> Optional[str]:
    """
    Layer 2: fetch a web page via defuddle.md

    Args:
        url: target page URL (the https:// prefix is optional)

    Returns:
        Optional[str]: Markdown content with YAML frontmatter, or None on failure
    """
    # defuddle accepts the target URL appended directly to the path
    clean_url = url.replace("https://", "").replace("http://", "")
    api_url = f"https://defuddle.md/{clean_url}"

    try:
        response = requests.get(api_url, timeout=60)

        if response.status_code == 200 and len(response.text.strip()) > 50:
            print(f"✅ [defuddle.md] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [defuddle.md] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [defuddle.md] Request failed: {e}, falling back to the next layer...")
        return None

Layer 3: Jina Reader (AI content extraction)

Jina AI's reader service, which handles complex pages well.

from typing import Optional

def fetch_via_jina(url: str) -> Optional[str]:
    """
    Layer 3: fetch a web page via Jina Reader

    Args:
        url: full URL of the target page

    Returns:
        Optional[str]: the extracted main text, or None on failure
    """
    api_url = f"https://r.jina.ai/{url}"

    try:
        response = requests.get(
            api_url,
            headers={"Accept": "text/markdown"},
            timeout=60
        )

        if response.status_code == 200 and len(response.text.strip()) > 50:
            print(f"✅ [Jina Reader] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [Jina Reader] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [Jina Reader] Request failed: {e}, falling back to the next layer...")
        return None

Extra: Jina also supports a search mode at https://s.jina.ai/YOUR_SEARCH_QUERY

Layer 4: Scrapling (last resort, anti-bot evasion)

A powerful adaptive crawler framework that can bypass anti-bot mechanisms such as Cloudflare Turnstile.

# Install Scrapling
pip install scrapling
# For browser features (anti-bot evasion)
pip install "scrapling[fetchers]"
scrapling install
from typing import Optional

def fetch_via_scrapling(url: str, use_stealth: bool = False) -> Optional[str]:
    """
    Layer 4: fetch a web page with Scrapling (last-resort fallback)

    Args:
        url: target page URL
        use_stealth: whether to use stealth mode (to bypass Cloudflare and similar)

    Returns:
        Optional[str]: the extracted page text, or None on failure
    """
    try:
        if use_stealth:
            from scrapling.fetchers import StealthyFetcher
            page = StealthyFetcher.fetch(url)
        else:
            from scrapling.fetchers import Fetcher
            page = Fetcher.get(url)

        if page and page.status == 200:
            # Extract the main text content
            # Remove script, style, and page-chrome tags
            for tag in page.css('script, style, nav, footer, header'):
                tag.remove()

            # Locate the main content area, from most to least specific
            main_content = (
                page.css('article') or
                page.css('main') or
                page.css('.content, .post, .article, #content') or
                page.css('body')
            )

            if main_content:
                text = main_content[0].text_content
                print(f"✅ [Scrapling{'(Stealth)' if use_stealth else ''}] Fetch succeeded")
                return text

        print(f"⚠️ [Scrapling] Fetch failed or content was empty")
        return None

    except ImportError:
        print("⚠️ [Scrapling] Not installed, run: pip install scrapling")
        return None
    except Exception as e:
        print(f"⚠️ [Scrapling] Fetch failed: {e}")
        return None

Workflow

Phase 0: Payment Verification (MANDATORY - DO NOT SKIP)

⚠️ CRITICAL: This phase MUST be completed BEFORE any other phase. DO NOT proceed without payment verification.

  1. Execute the payment verification code (see the Payment Workflow Integration section above)
  2. If payment fails: Return the payment URL to user and STOP
  3. If payment succeeds: Proceed to Phase 1
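The three steps above amount to a gate in front of the fetch function. One way to express that gate is a decorator, sketched below under stated assumptions: `requires_payment` and `PaymentRequired` are hypothetical names, and the charge callable is injected so the gate can be exercised without contacting any billing service. The only thing taken from the skill is the shape of the result dict (`ok`, `balance`, `payment_url`) returned by `charge_user`.

```python
import functools
from typing import Callable, Optional


class PaymentRequired(Exception):
    """Raised when the charge step reports failure; carries the top-up URL."""

    def __init__(self, payment_url: Optional[str]):
        super().__init__("Insufficient balance")
        self.payment_url = payment_url


def requires_payment(charge: Callable[[], dict]):
    """Decorator factory: run `charge` first, call the function only on success."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = charge()
            if not result.get("ok"):
                # Step 2: surface the payment URL and stop
                raise PaymentRequired(result.get("payment_url"))
            # Step 3: payment succeeded, proceed with the wrapped call
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

In the skill's terms, the injected callable would be something like `lambda: charge_user(USER_ID)`, and the caller would catch `PaymentRequired` to show the top-up link.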

Phase 1: URL Analysis & Strategy Selection (MANDATORY)

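When the user provides a URL to fetch:

  1. Receive the URL: the user supplies the target address

  2. Analyze the URL: determine the site type and the best fetch strategy

    URL characteristics | Recommended strategy
    Ordinary web page / blog / documentation | Layer 1 (markdown.new) → automatic fallback
    GitHub / technical documentation | Layer 1 (markdown.new) → Layer 3 (Jina)
    Login required / paywall | Tell the user authentication is required; the page cannot be fetched automatically
    JavaScript-heavy rendering (SPA) | Layer 1 (markdown.new, method=browser) → Layer 4 (Scrapling)
    Strict anti-bot sites | Layer 4 (Scrapling, stealth mode)
  3. Tell the user the cost: "This fetch will cost 0.001 USDT"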
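The strategy table in step 2 can be read as a small decision function. The sketch below is one possible encoding, not the skill's implementation: `choose_strategy` and the layer labels are hypothetical, and because properties such as "requires login" or "JavaScript-heavy" cannot be derived from a URL alone, they are passed in as flags by the caller. Only the GitHub row is detected from the hostname.

```python
from urllib.parse import urlparse

# Full fallback order for ordinary pages (labels are illustrative only)
DEFAULT_CHAIN = ["markdown.new", "defuddle.md", "jina-reader", "scrapling"]


def choose_strategy(url: str, *, needs_login: bool = False,
                    js_heavy: bool = False, strict_anti_bot: bool = False) -> list:
    """Return the ordered list of layers to try for a URL."""
    if needs_login:
        return []  # nothing to try: the user has to authenticate first
    if strict_anti_bot:
        return ["scrapling-stealth"]
    if js_heavy:
        return ["markdown.new:browser", "scrapling"]
    host = (urlparse(url).hostname or "").lower()
    if host == "github.com" or host.endswith(".github.com"):
        return ["markdown.new", "jina-reader"]
    return list(DEFAULT_CHAIN)  # copy so callers cannot mutate the default
```

An empty list signals the "cannot be fetched automatically" row, which the caller would turn into a message for the user.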

Phase 2: Execute Fetch (WITH PAYMENT)

⚠️ IMPORTANT: Execute payment verification BEFORE fetching!

Use the following smart fetch function, which falls back through the layers automatically:

import requests
import os

# ═══════════════════════════════════════════════════
# STEP 0: SKILLPAY BILLING (MANDATORY)
# ═══════════════════════════════════════════════════

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

def charge_user(user_id: str) -> dict:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/charge',
        headers=HEADERS,
        json={'user_id': user_id, 'skill_id': SKILL_ID, 'amount': 0.001},
        timeout=30,  # avoid hanging forever on a stalled connection
    )
    data = resp.json()
    if data['success']:
        return {'ok': True, 'balance': data['balance']}
    return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user")
charge_result = charge_user(USER_ID)
if not charge_result['ok']:
    # payment_url can be None; a width format spec on None raises TypeError
    payment_url = str(charge_result.get('payment_url') or 'unavailable')
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  💳 INSUFFICIENT BALANCE                                      ║
║                                                               ║
║  This web fetch costs 0.001 USDT.                            ║
║  Current balance: {charge_result['balance']:<41}║
║                                                               ║
║  Please top up at (BNB Chain USDT):                          ║
║  {payment_url:<56}║
║                                                               ║
║  After payment, please retry your request.                   ║
╚══════════════════════════════════════════════════════════════╝
    """)
    raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT")

# ========================================
# STEP 1: INTELLIGENT MULTI-LAYER FETCH
# ========================================

def smart_fetch(url: str, prefer_method: str = "auto", retain_images: bool = True) -> dict:
    """
    Smart multi-layer fetch: try each layer in priority order until one succeeds.

    Args:
        url: target page URL
        prefer_method: conversion method for markdown.new ("auto", "ai", "browser")
        retain_images: whether to keep image links

    Returns:
        dict: {
            "success": bool,
            "content": str,        # Markdown content
            "source": str,         # layer that produced the content
            "url": str,            # original URL
            "char_count": int      # number of characters in the content
        }
    """
    # Make sure the URL has a scheme prefix
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    print(f"🔍 Starting fetch: {url}")
    print("=" * 60)

    # --- Layer 1: markdown.new ---
    print("📡 Layer 1: trying markdown.new ...")
    content = fetch_via_markdown_new(url, method=prefer_method, retain_images=retain_images)
    if content and len(content.strip()) > 100:
        return {"success": True, "content": content, "source": "markdown.new", "url": url, "char_count": len(content)}

    # --- Layer 2: defuddle.md ---
    print("📡 Layer 2: trying defuddle.md ...")
    content = fetch_via_defuddle(url)
    if content and len(content.strip()) > 100:
        return {"success": True, "content": content, "source": "defuddle.md", "url": url, "char_count": len(content)}

    # --- Layer 3: Jina Reader ---
    print("📡 Layer 3: trying Jina Reader ...")
    content = fetch_via_jina(url)
    if content and len(content.strip()) > 100:
        return {"success": True, "content": content, "source": "jina-reader", "url": url, "char_count": len(content)}

    # --- Layer 4a: Scrapling (regular mode) ---
    print("📡 Layer 4a: trying Scrapling (regular mode) ...")
    content = fetch_via_scrapling(url, use_stealth=False)
    if content and len(content.strip()) > 100:
        return {"success": True, "content": content, "source": "scrapling", "url": url, "char_count": len(content)}

    # --- Layer 4b: Scrapling (stealth mode) ---
    print("📡 Layer 4b: trying Scrapling (stealth mode) ...")
    content = fetch_via_scrapling(url, use_stealth=True)
    if content and len(content.strip()) > 100:
        return {"success": True, "content": content, "source": "scrapling-stealth", "url": url, "char_count": len(content)}

    # Every method failed
    print("❌ All fetch methods failed")
    return {"success": False, "content": None, "source": None, "url": url, "char_count": 0}


# ========================================
# Run the fetch
# ========================================

TARGET_URL = "{URL provided by the user}"

result = smart_fetch(TARGET_URL)

if result["success"]:
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  ✅ Fetch succeeded                                           ║
║                                                              ║
║  Source: {result['source']:<52}║
║  Characters: {result['char_count']:<48}║
║  URL: {result['url'][:50]:<55}║
╚══════════════════════════════════════════════════════════════╝
    """)

    # Print the Markdown content
    print("\n--- Page content (Markdown) ---\n")
    print(result["content"])
else:
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  ❌ Fetch failed                                              ║
║                                                              ║
║  All 4 fetch layers failed to retrieve any content.          ║
║  Possible causes:                                            ║
║  - The target site requires login/authentication             ║
║  - The target URL is invalid or unreachable                  ║
║  - The target site has very strong anti-bot measures         ║
║                                                              ║
║  Suggestions:                                                ║
║  - Check that the URL is correct                             ║
║  - Provide the page source saved after logging in            ║
╚══════════════════════════════════════════════════════════════╝
    """)

Phase 3: Content Processing & Output

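After a successful fetch:

  1. Return the Markdown content directly to the user
  2. If the content is too long (over 50,000 characters), truncate it intelligently and notify the user
  3. Record the transaction ID for payment tracking
# Content post-processing
def process_content(content: str, max_chars: int = 50000) -> str:
    """Process and truncate overly long content"""
    if len(content) <= max_chars:
        return content

    # Smart truncation: cut at a paragraph boundary when one is close enough
    truncated = content[:max_chars]
    last_newline = truncated.rfind('\n\n')
    if last_newline > max_chars * 0.8:
        truncated = truncated[:last_newline]

    truncated += f"\n\n---\n⚠️ Content too long: showing the first {len(truncated)} characters (of {len(content)} total)."
    return truncated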

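Usage Examples

Scenario 1: Fetching technical documentation

User: Fetch the content of https://docs.python.org/3/tutorial/index.html for me

Execution flow:

  1. Payment verification → passed
  2. Layer 1 (markdown.new) → attempt the fetch
  3. Return the Python tutorial content as Markdown

Scenario 2: Fetching a GitHub README

User: I'd like to see what this library is about: https://github.com/D4Vinci/Scrapling

Execution flow:

  1. Payment verification → passed
  2. Layer 1 (markdown.new) → usually succeeds for GitHub pages
  3. Return the README content of the Scrapling project

Scenario 3: Fetching an anti-bot protected site

User: Fetch this page for me: https://某反爬网站.com/article/123

Execution flow:

  1. Payment verification → passed
  2. Layer 1 → failed
  3. Layer 2 → failed
  4. Layer 3 → failed
  5. Layer 4 (Scrapling Stealth) → use stealth mode to bypass the anti-bot protection
  6. Return the extracted content

Scenario 4: Searching for information (using Jina Search)

User: Search for "Python asyncio best practices 2025" for me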
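import requests
from typing import Optional
from urllib.parse import quote

def search_via_jina(query: str) -> Optional[str]:
    """Search for information using Jina Search"""
    # Percent-encode the query so spaces and special characters survive in the URL path
    api_url = f"https://s.jina.ai/{quote(query)}"

    try:
        response = requests.get(api_url, timeout=60)
        if response.status_code == 200:
            return response.text
        return None
    except requests.exceptions.RequestException:
        # Narrowed from a bare except, which would also swallow KeyboardInterrupt
        return None

# Run the search
search_result = search_via_jina("Python asyncio best practices 2025")
print(search_result)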

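Prerequisites (install as needed)

Base dependency (Layers 1-3 only need requests)

pip install requests

Scrapling dependency (Layer 4, install only when needed)

# Basic install
pip install scrapling

# Full install (with browser and anti-bot evasion features)
pip install "scrapling[fetchers]"
scrapling install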
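Because Scrapling is optional, it can help to check for it up front rather than discover the missing package when Layer 4 is reached. The sketch below uses the standard library's `importlib.util.find_spec`; `module_available` is a hypothetical helper name, not something the skill defines.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Report whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for module in ("requests", "scrapling"):
        status = "installed" if module_available(module) else "missing"
        print(f"{module}: {status}")
```

A caller could use the result to drop the Scrapling layers from the fallback chain, or to print the install command once at startup.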

💰 Revenue & Analytics

Track your earnings in real-time at SkillPay Dashboard.

  • Price per fetch: 0.001 USDT
  • Your revenue share: 95%
  • Settlement: Instant (BNB Chain)
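As a worked example of the figures above, the publisher's share for a given number of fetches is the price times the count times 95%. The sketch uses `Decimal` to avoid binary floating-point rounding on such small amounts; `publisher_revenue` is a hypothetical helper, and the calculation assumes no other fees apply, which the listing does not state.

```python
from decimal import Decimal

PRICE_PER_FETCH = Decimal("0.001")  # USDT, from the pricing above
REVENUE_SHARE = Decimal("0.95")     # 95% publisher share


def publisher_revenue(fetches: int) -> Decimal:
    """Publisher earnings in USDT for a number of paid fetches."""
    if fetches < 0:
        raise ValueError("fetches must be non-negative")
    return PRICE_PER_FETCH * fetches * REVENUE_SHARE
```

Under these assumptions, 1,000 paid fetches bring in 1 USDT in total, of which 0.95 USDT goes to the publisher.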

Powered by SkillPay - AI Skill Monetization Infrastructure
