Install
openclaw skills install cn-data-scraper
Chinese website data scraping expert with anti-crawl bypass strategies. Teaches AI agents to scrape Chinese websites that Scrapling alone can't handle: Baidu anti-crawl, Taobao login walls, Douyin dynamic rendering, Zhihu verification, WeChat articles, 1688 product data.
Features: (1) Platform-specific anti-crawl bypass recipes for 10+ Chinese websites, (2) Scrapling integration guides with Chinese site configurations, (3) Adaptive selector strategies for frequently redesigned Chinese sites, (4) Legal compliance boundary guide (what is legal vs illegal in data scraping in China), (5) Executable scripts for common scraping tasks, (6) MCP server integration for AI agent workflows.
The only skill combining Chinese website scraping expertise, legal compliance, and Scrapling integration.
Use when: scraping Chinese websites, bypassing Baidu anti-crawl, scraping Taobao data, Douyin data extraction, Zhihu scraping, WeChat article scraping, 1688 product scraping, Chinese data collection, 爬虫, 数据爬取, 反爬绕过, 百度反爬, 淘宝爬虫, 抖音数据, 知乎爬虫, 微信文章抓取, Scrapling Chinese, 中国网站爬虫.
Triggers: Chinese web scraping, data scraping China, anti-crawl bypass, Baidu scraping, Taobao scraping, Douyin scraping, Zhihu scraping, WeChat scraping, 1688 scraping, 爬虫工具, 数据采集, 反爬虫, 信息差, Scrapling配置, 中国网站数据, cn-scraping, web scraping China, data extraction Chinese websites, crawler Chinese sites, Python爬虫中国网站
⚡ INSTANT VALUE. Install this if you:
- Need to scrape Baidu/Taobao/Douyin/Zhihu/WeChat/1688 but keep hitting anti-crawl walls
- Want platform-specific bypass recipes — not generic "use Selenium" advice, but tested configs for each Chinese site
- Are using Scrapling but need Chinese site configurations (Baidu's cookie walls, Taobao's login gates, Douyin's dynamic rendering)
- Want to know what's legal — China's data scraping legal boundaries (Criminal Law 285/286, Data Security Law, PIPL)
🎯 Why this over generic scraping skills? Generic scraping skills give you BeautifulSoup/Selenium tutorials. We give you tested anti-crawl configs for 10+ Chinese websites, legal compliance boundaries (avoid Criminal Law 285!), and Scrapling integration with Chinese site presets. Tutorials vs Recipes — you decide.
🔗 Based on Scrapling (31K+ GitHub Stars) — the fastest Python scraping framework with adaptive selectors and Cloudflare bypass. We add the China layer on top.
You are a Chinese website data scraping expert. You help AI agents and developers scrape data from Chinese websites that are notoriously difficult to crawl — Baidu, Taobao, Douyin, Zhihu, WeChat, 1688, and more.
Scrapling solves the general scraping problem. We solve the China-specific problem.
Chinese websites have unique anti-crawl mechanisms that generic tools can't handle.
This skill provides tested recipes for each platform, not generic advice.
┌─────────────────────────────────────┐
│           AI Agent / User           │
├─────────────────────────────────────┤
│        cn-data-scraper Skill        │
│  ┌─────────────┐ ┌───────────────┐  │
│  │  Platform   │ │     Legal     │  │
│  │   Recipes   │ │  Compliance   │  │
│  │ (10+ sites) │ │  Boundaries   │  │
│  └──────┬──────┘ └───────┬───────┘  │
│         │                │          │
│  ┌──────▼────────────────▼───────┐  │
│  │      Scrapling Framework      │  │
│  │     (Adaptive Selectors +     │  │
│  │       StealthyFetcher +       │  │
│  │       Camoufox Engine)        │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
Anti-crawl mechanisms:
Scrapling configuration:
from scrapling import StealthyFetcher

# Baidu search with anti-crawl bypass
page = StealthyFetcher.fetch(
    'https://www.baidu.com/s?wd=关键词',
    headless=True,
    network_idle=True,  # Wait for JS execution
    timeout=30000,
)

# Extract search results.
# Baidu frequently changes its hashed class names, so anchor on the plain
# container class and on structural selectors where possible.
results = page.css('div.c-container')
for result in results:
    title = result.css_first('h3 a')
    snippet = result.css_first('span.content-right_8Zs40')
    # Fallback: partial class match if the hashed class name changed
    if snippet is None:
        snippet = result.css_first('[class*="content"]')
Key tips:
- Use a User-Agent with the Baidu app identifier for mobile results
- Set network_idle=True, since Baidu loads results via AJAX
Anti-crawl mechanisms:
Scrapling configuration:
# Taobao requires login, so use cookie injection
page = StealthyFetcher.fetch(
    'https://s.taobao.com/search?q=关键词',
    headless=True,
    network_idle=True,
    # Must inject login cookies
    cookies={
        '_m_h5_tk': 'your_token_here',
        '_m_h5_tk_enc': 'your_enc_token_here',
        'cookie2': 'your_cookie2',
        'sgcookie': 'your_sgcookie',
    }
)

# Extract product data
products = page.css('div.Card--doubleCardWrapper')
for product in products:
    title = product.css_first('span.Title--titleSpan')
    price = product.css_first('span.Price--priceInt')
    sales = product.css_first('span.Sales--sales')
Key tips:
- Use the mtop.taobao.searchapi.search API endpoint for structured data
Anti-crawl mechanisms:
Scrapling configuration:
# Douyin web version (easier than the app API)
page = StealthyFetcher.fetch(
    'https://www.douyin.com/search/关键词',
    headless=True,
    network_idle=True,
    wait_selector='div.video-card',  # Wait for video cards to load
)

# Extract video data
videos = page.css('div.video-card')
for video in videos:
    title = video.css_first('p.title')
    author = video.css_first('span.author-card-user-name')
    likes = video.css_first('span.video-like-count')
Key tips:
Anti-crawl mechanisms:
Scrapling configuration:
# Zhihu search via the rendered search page
page = StealthyFetcher.fetch(
    'https://www.zhihu.com/search?type=content&q=关键词',
    headless=True,
    network_idle=True,
    # Zhihu requires specific headers (StealthyFetcher takes them as extra_headers)
    extra_headers={
        'Referer': 'https://www.zhihu.com/',
    }
)

# Extract search results
results = page.css('div.SearchResult-Card')
for result in results:
    title = result.css_first('h2.ContentItem-title a')
    excerpt = result.css_first('span.RichText')
    author = result.css_first('meta[itemprop="name"]')
Key tips:
- api.zhihu.com/search_v3 returns JSON, which is easier to parse
Anti-crawl mechanisms:
Scrapling configuration:
# WeChat article: direct URL access
page = StealthyFetcher.fetch(
    'https://mp.weixin.qq.com/s/ARTICLE_ID',
    headless=True,
    network_idle=True,
)

# Extract article content
title = page.css_first('h1.rich_media_title')
content = page.css_first('div.rich_media_content')
author = page.css_first('a.rich_media_meta_link')
publish_time = page.css_first('em#publish_time')
Key tips:
- weixin.sogou.com can be used to search for articles, but it has aggressive anti-crawl
- Use mp.weixin.qq.com direct access for known URLs
Anti-crawl mechanisms:
Scrapling configuration:
# 1688 search
page = StealthyFetcher.fetch(
    'https://s.1688.com/selloffer/offer_search.htm?keywords=关键词',
    headless=True,
    network_idle=True,
)

# Extract product listings
products = page.css('div.offer-item')
for product in products:
    title = product.css_first('a.title')
    price = product.css_first('span.price')
    min_order = product.css_first('span.min-order')
    supplier = product.css_first('a.company-name')
Anti-crawl mechanisms:
Scrapling configuration:
# Xiaohongshu web version
page = StealthyFetcher.fetch(
    'https://www.xiaohongshu.com/search_result?keyword=关键词',
    headless=True,
    network_idle=True,
    wait_selector='section.note-item',
)

# Extract notes
notes = page.css('section.note-item')
for note in notes:
    title = note.css_first('div.title')
    author = note.css_first('span.name')
    likes = note.css_first('span.like-wrapper .count')
Key tips:
Anti-crawl mechanisms:
Scrapling configuration:
# Weibo search
page = StealthyFetcher.fetch(
    'https://s.weibo.com/weibo?q=关键词',
    headless=True,
    network_idle=True,
    cookies={
        'SUB': 'your_sub_cookie',  # Required for search
    }
)

# Extract posts
posts = page.css('div.card-wrap[action-type="feed_list_item"]')
for post in posts:
    author = post.css_first('a.name')
    content = post.css_first('p.txt')
    reposts = post.css_first('a[action-type="fl_forward"] em')
    comments = post.css_first('a[action-type="flcomment"] em')
    likes = post.css_first('a[action-type="fl_like"] em')
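The cookie-based recipes above hard-code placeholder values such as 'your_sub_cookie'. One way to keep real session values out of source files is to read them from the environment. The sketch below assumes a hypothetical environment variable, CN_SCRAPER_COOKIES, holding a browser-style "name=value; name2=value2" string; neither the variable name nor the helper is part of Scrapling or of this skill.

```python
import os


def parse_cookie_header(header: str) -> dict:
    """Split a 'name=value; name2=value2' string into a dict of cookies."""
    cookies = {}
    for part in header.split(";"):
        # partition keeps any '=' inside the value intact
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# CN_SCRAPER_COOKIES is a hypothetical variable name chosen for this sketch.
cookies = parse_cookie_header(os.environ.get("CN_SCRAPER_COOKIES", ""))
print(sorted(cookies))  # cookie names only, so values are never logged
```

The resulting dict has the same shape as the `cookies=` arguments shown in the recipes, so it could be passed in their place.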
Anti-crawl mechanisms:
Scrapling configuration:
# CSDN article
page = StealthyFetcher.fetch(
    'https://blog.csdn.net/author/article/ID',
    headless=True,
    network_idle=True,
)

# The article body is present in the HTML; the anti-copy layer is only a
# CSS overlay, so selecting the article element is enough.
content = page.css_first('article.baidu_pl')
Anti-crawl mechanisms:
Scrapling configuration:
# Boss Zhipin search
page = StealthyFetcher.fetch(
    'https://www.zhipin.com/web/geek/job?query=关键词',
    headless=True,
    network_idle=True,
    cookies={
        'geek_zp_token': 'your_token',
    }
)

# Extract job listings
jobs = page.css('li.job-card-wrapper')
for job in jobs:
    title = job.css_first('span.job-name')
    salary = job.css_first('span.salary')
    company = job.css_first('h3.company-name')
    location = job.css_first('span.job-area')
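Each recipe above embeds the placeholder keyword 关键词 directly in the URL. In practice the keyword usually has to be percent-encoded before it is placed in a query string. The sketch below collects the search URLs from the recipes into one table and encodes the keyword as UTF-8; `SEARCH_URLS` and `build_search_url` are illustrative names, and UTF-8 encoding is an assumption that may not hold for every site.

```python
from urllib.parse import quote

# Search URL templates copied from the recipes above; {q} marks the keyword slot.
SEARCH_URLS = {
    "baidu": "https://www.baidu.com/s?wd={q}",
    "taobao": "https://s.taobao.com/search?q={q}",
    "douyin": "https://www.douyin.com/search/{q}",
    "zhihu": "https://www.zhihu.com/search?type=content&q={q}",
    "1688": "https://s.1688.com/selloffer/offer_search.htm?keywords={q}",
    "xiaohongshu": "https://www.xiaohongshu.com/search_result?keyword={q}",
    "weibo": "https://s.weibo.com/weibo?q={q}",
    "boss": "https://www.zhipin.com/web/geek/job?query={q}",
}


def build_search_url(platform: str, keyword: str) -> str:
    """Return the search URL for `platform` with `keyword` percent-encoded."""
    try:
        template = SEARCH_URLS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None
    # safe="" also encodes '/', '&' and '=' so the keyword cannot alter the URL
    return template.format(q=quote(keyword, safe=""))


print(build_search_url("baidu", "a b&c"))
```

Encoding with `safe=""` keeps characters such as `&` in the keyword from being read as extra query parameters.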
Legal compliance is NOT optional. Violating the laws below can result in criminal prosecution.
| Law | Scope | Max Penalty |
|---|---|---|
| Criminal Law Art. 253-1 | Personal information | 7 years + fine |
| Criminal Law Art. 285 | Unauthorized system access | 7 years |
| Criminal Law Art. 286 | System disruption | 15 years |
| Data Security Law | Data classification | ¥10M fine |
| PIPL (个人信息保护法) | Personal information | ¥50M or 5% revenue |
| Cybersecurity Law | Network data | ¥1M fine |
| Anti-Unfair Competition Law | Business data scraping | ¥3M fine |
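A site's robots.txt is one signal that can be checked before any request is sent. The sketch below uses only Python's standard library to test a URL against robots.txt rules. The sample rules are hypothetical, and the premise that robots.txt matters for a compliance assessment is an assumption of this example, not something the table above states; passing this check does not establish compliance with any of the listed laws.

```python
from urllib.robotparser import RobotFileParser


def is_path_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /
"""

print(is_path_allowed(SAMPLE_ROBOTS, "https://example.com/search?q=test"))
print(is_path_allowed(SAMPLE_ROBOTS, "https://example.com/about"))
```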
# Install Scrapling with all features (quotes stop shells such as zsh from expanding the brackets)
pip install "scrapling[all]"
# Or minimal install
pip install scrapling
from scrapling import StealthyFetcher, Fetcher

# 1. Simple HTTP fetch (fast, no JS rendering)
page = Fetcher.get('https://example.com')

# 2. Stealthy browser fetch (bypasses anti-bot)
page = StealthyFetcher.fetch(
    'https://www.baidu.com/s?wd=test',
    headless=True,
    network_idle=True,
)

# 3. Resilient selectors that survive site redesigns
element = page.find_by_text('价格')  # Find by text content ('价格' means "price")
element = page.css_first('[class*="price"]')  # Partial class match
Chinese websites redesign frequently. Use these strategies to make your selectors resilient:
# ❌ BAD: Exact hashed class names (break on redesign)
title = page.css_first('span.title_3wVZ1')
# ✅ GOOD: Structural selectors
title = page.css_first('h2 a')  # Semantic HTML
# ✅ GOOD: Partial class match
title = page.css_first('[class*="title"]')
# ✅ GOOD: Text-based selection
title = page.find_by_text('价格')  # '价格' means "price"
# ✅ GOOD: Attribute-based
title = page.css_first('[data-type="title"]')
# ✅ BEST: Scrapling's adaptive selectors
# Scrapling can save an element's characteristics and re-find it after a redesign.
# This only happens when adaptive (auto-match) mode is enabled and the element was
# saved on an earlier run; the argument names differ between Scrapling versions.
element = page.css_first('div.product-title')
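The strategies above can be combined into an ordered fallback chain: try the most specific selector first, then progressively looser ones. The sketch below is one way to express that. `first_match` is a hypothetical helper, not a Scrapling function, and `FakeNode` is a stand-in object so the example runs without network access; it relies only on the `css_first` method used throughout this document.

```python
from typing import Any, Iterable, Optional


def first_match(node: Any, selectors: Iterable[str]) -> Optional[Any]:
    """Try each CSS selector in order and return the first element found."""
    for selector in selectors:
        element = node.css_first(selector)
        if element is not None:
            return element
    return None


class FakeNode:
    """Stand-in for a parsed page or element: maps selectors to canned results."""

    def __init__(self, matches: dict):
        self._matches = matches

    def css_first(self, selector: str):
        return self._matches.get(selector)


# Simulate a redesign: the hashed class is gone, the partial match still works.
node = FakeNode({'[class*="content"]': "snippet text"})
found = first_match(node, ['span.content-right_8Zs40', '[class*="content"]'])
print(found)
```

With a real Scrapling page, the same call would take a result element in place of `node`.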
import random
import time

from scrapling import Fetcher


def polite_scrape(urls, min_delay=2, max_delay=5):
    """Scrape with polite rate limiting"""
    results = []
    for url in urls:
        page = Fetcher.get(url)
        results.append(page)
        # Random delay to appear human
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)
    return results
# Platform-specific rate limits (delays in seconds)
RATE_LIMITS = {
    'baidu': {'min': 3, 'max': 8, 'max_per_hour': 80},
    'taobao': {'min': 5, 'max': 12, 'max_per_hour': 30},
    'douyin': {'min': 5, 'max': 15, 'max_per_hour': 20},
    'zhihu': {'min': 3, 'max': 8, 'max_per_hour': 40},
    'wechat': {'min': 2, 'max': 5, 'max_per_hour': 60},
    '1688': {'min': 5, 'max': 10, 'max_per_hour': 30},
    'xiaohongshu': {'min': 8, 'max': 20, 'max_per_hour': 10},
    'weibo': {'min': 3, 'max': 8, 'max_per_hour': 40},
}
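The per-platform table above carries a `max_per_hour` value that a plain random delay does not enforce. A minimal sketch of a limiter that applies both the random delay and the hourly cap follows. `PlatformRateLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters are an assumption made so the logic can be exercised without real waiting.

```python
import random
import time
from collections import deque


class PlatformRateLimiter:
    """Random delay between requests plus a sliding one-hour request cap."""

    def __init__(self, min_delay, max_delay, max_per_hour,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_per_hour = max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the last hour

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self._clock()
        # Forget requests that are more than one hour old.
        while self._sent and now - self._sent[0] >= 3600:
            self._sent.popleft()
        if len(self._sent) >= self.max_per_hour:
            # Cap reached: sleep until the oldest request leaves the window.
            self._sleep(3600 - (now - self._sent[0]))
            self._sent.popleft()
        elif self._sent:
            self._sleep(random.uniform(self.min_delay, self.max_delay))
        self._sent.append(self._clock())


# Values copied from the 'baidu' entry of the table above.
baidu = {'min': 3, 'max': 8, 'max_per_hour': 80}
limiter = PlatformRateLimiter(baidu['min'], baidu['max'], baidu['max_per_hour'])
limiter.wait()  # the first call returns immediately; later calls sleep
```

Calling `limiter.wait()` before each fetch keeps the request rate within the delay range and the hourly cap chosen for that platform.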
def extract_product(page, platform='generic'):
    """Universal product data extractor"""
    templates = {
        'taobao': {
            'title': 'span.Title--titleSpan',
            'price': 'span.Price--priceInt',
            'sales': 'span.Sales--sales',
            'shop': 'a.ShopName--shopName',
        },
        '1688': {
            'title': 'a.title',
            'price': 'span.price',
            'min_order': 'span.min-order',
            'supplier': 'a.company-name',
        },
        'jd': {
            'title': 'div.sku-name',
            'price': 'span.price',
            'comments': 'a.comment-count',
        },
    }
    # Unknown platforms (including 'generic') fall back to the Taobao template
    selectors = templates.get(platform, templates['taobao'])
    result = {}
    for field, sel in selectors.items():
        element = page.css_first(sel)
        # .text is a property on Scrapling elements, not a method
        result[field] = element.text if element is not None else None
    return result
def extract_social_post(page, platform='generic'):
    """Universal social media post extractor"""
    templates = {
        'weibo': {
            'author': 'a.name',
            'content': 'p.txt',
            'reposts': 'a[action-type="fl_forward"] em',
            'comments': 'a[action-type="flcomment"] em',
            'likes': 'a[action-type="fl_like"] em',
        },
        'xiaohongshu': {
            'author': 'span.name',
            'content': 'span.note-text',
            'likes': 'span.like-wrapper .count',
            'collects': 'span.collect-wrapper .count',
        },
        'douyin': {
            'author': 'span.author-card-user-name',
            'content': 'p.title',
            'likes': 'span.video-like-count',
        },
    }
    # Unknown platforms (including 'generic') fall back to the Weibo template
    selectors = templates.get(platform, templates['weibo'])
    result = {}
    for field, sel in selectors.items():
        element = page.css_first(sel)
        # .text is a property on Scrapling elements, not a method
        result[field] = element.text if element is not None else None
    return result
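The extractors above return raw text for fields such as likes, sales, and comments. On Chinese sites these counts are often abbreviated with 万 (ten thousand) or 亿 (hundred million), so they typically need normalizing before analysis. The sketch below is one possible normalizer. `parse_count` is a hypothetical helper, and the set of suffixes it handles is an assumption rather than a list taken from any platform.

```python
import re
from typing import Optional

# Multipliers for common abbreviations; 'w' is an informal stand-in for 万.
_UNITS = {"万": 10_000, "亿": 100_000_000, "w": 10_000, "W": 10_000}
_COUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([万亿wW]?)")


def parse_count(text: Optional[str]) -> Optional[int]:
    """Convert strings like '1.2万' or '1,234' to an int; None if no number is found."""
    if not text:
        return None
    match = _COUNT_RE.search(text.replace(",", ""))
    if match is None:
        return None
    number, unit = match.groups()
    return round(float(number) * _UNITS.get(unit, 1))


for raw in ["1.2万", "3亿", "1,234", "赞"]:
    print(raw, parse_count(raw))
```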
scripts/scrape.sh: Quick CLI Scraper
#!/bin/bash
# cn-data-scraper CLI tool
# Usage: ./scripts/scrape.sh <platform> <keyword> [output_file]
PLATFORM=$1
KEYWORD=$2
OUTPUT=${3:-/tmp/scrape_result.json}
if [ -z "$PLATFORM" ] || [ -z "$KEYWORD" ]; then
  echo "Usage: ./scripts/scrape.sh <platform> <keyword> [output_file]"
  echo "Platforms: baidu zhihu weibo csdn (use the Python API for the others)"
  exit 1
fi
# Pass arguments via argv instead of interpolating them into the Python source,
# so quotes in the keyword cannot break the script or inject code.
python3 - "$PLATFORM" "$KEYWORD" "$OUTPUT" <<'PY'
import json
import sys
from urllib.parse import quote

from scrapling import StealthyFetcher, Fetcher

platform, keyword, output = sys.argv[1:4]
q = quote(keyword, safe='')  # Percent-encode the keyword for use in a URL
URLS = {
    'baidu': f'https://www.baidu.com/s?wd={q}',
    'zhihu': f'https://www.zhihu.com/search?type=content&q={q}',
    'weibo': f'https://s.weibo.com/weibo?q={q}',
    'csdn': f'https://so.csdn.net/so/search?q={q}',
}
url = URLS.get(platform)
if not url:
    print(json.dumps({'error': f'Platform {platform} not supported for CLI scraping. Use Python API for full features.'}))
    sys.exit(1)
try:
    if platform in ['baidu', 'zhihu', 'weibo']:
        page = StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=30000)
    else:
        page = Fetcher.get(url)
    # Extract all text content (.text is a property, not a method)
    texts = [el.text for el in page.css('p, span, h1, h2, h3, h4, h5, h6') if el.text]
    result = {
        'platform': platform,
        'keyword': keyword,
        'url': url,
        'content_count': len(texts),
        'preview': texts[:20],
    }
    with open(output, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print(json.dumps(result, ensure_ascii=False, indent=2))
except Exception as e:
    print(json.dumps({'error': str(e)}))
    sys.exit(1)
PY
For AI agent workflows, this skill can be used with MCP servers:
# Example: MCP tool for scraping Chinese websites
# The @tool() decorator is provided by FastMCP in the MCP Python SDK
from mcp.server.fastmcp import FastMCP

server = FastMCP("cn-data-scraper")


@server.tool()
def scrape_chinese_site(platform: str, keyword: str, max_results: int = 10) -> dict:
    """Scrape data from Chinese websites with anti-crawl bypass.

    Args:
        platform: Target platform (baidu/taobao/douyin/zhihu/wechat/1688/xiaohongshu/weibo)
        keyword: Search keyword
        max_results: Maximum number of results to return

    Returns:
        Dictionary with scraped data and metadata
    """
    # Implementation using Scrapling + platform recipes
    raise NotImplementedError


@server.tool()
def check_legal_compliance(scraping_plan: str) -> dict:
    """Check if a scraping plan complies with Chinese law.

    Args:
        scraping_plan: Description of what data you plan to scrape

    Returns:
        Compliance assessment with risk level and legal references
    """
    raise NotImplementedError
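The tool bodies above are left unimplemented. A first step any implementation would likely need is validating the arguments an agent sends. The sketch below shows one way to do that as a pure function. `normalize_request` and `MAX_RESULTS_CAP` are hypothetical names, and the cap of 50 is an arbitrary value chosen for illustration; the platform list is copied from the docstring above.

```python
# Platforms named in the scrape_chinese_site docstring.
SUPPORTED_PLATFORMS = frozenset(
    ["baidu", "taobao", "douyin", "zhihu", "wechat", "1688", "xiaohongshu", "weibo"]
)
MAX_RESULTS_CAP = 50  # hypothetical upper bound, not defined by the skill


def normalize_request(platform: str, keyword: str, max_results: int = 10) -> dict:
    """Validate tool arguments; return cleaned values or a dict with an 'error' key."""
    platform = platform.strip().lower()
    keyword = keyword.strip()
    if platform not in SUPPORTED_PLATFORMS:
        return {"error": f"unsupported platform: {platform}"}
    if not keyword:
        return {"error": "keyword must not be empty"}
    # Clamp max_results into the range [1, MAX_RESULTS_CAP].
    max_results = max(1, min(int(max_results), MAX_RESULTS_CAP))
    return {"platform": platform, "keyword": keyword, "max_results": max_results}


print(normalize_request(" Baidu ", "数据采集", 500))
```

Returning an error dict rather than raising keeps the tool's return type a plain dict, which matches the signatures shown above.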
| Feature | Scrapling | BeautifulSoup | Selenium | Playwright |
|---|---|---|---|---|
| Speed | 784x BS4 | Baseline | Slow | Medium |
| Anti-crawl bypass | ✅ Built-in | ❌ | ⚠️ Manual | ⚠️ Manual |
| Adaptive selectors | ✅ Auto | ❌ | ❌ | ❌ |
| Cloudflare bypass | ✅ Native | ❌ | ⚠️ Plugin | ⚠️ Plugin |
| Chinese site configs | ❌ (We add this) | ❌ | ❌ | ❌ |
| Legal compliance | ❌ (We add this) | ❌ | ❌ | ❌ |
| Memory usage | Low | Very Low | High | Medium |
| Setup complexity | pip install | pip install | Driver needed | pip install |
Our value-add: Scrapling handles the technical scraping. We add the China layer (platform recipes + legal compliance + adaptive selectors for Chinese sites).