Smart Crawler
v1.0.0 · Smart Web Crawler - Enterprise data collection with anti-detection
by Lv Lancer (@kaiyuelv)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan (OpenClaw): Benign, high confidence

Purpose & Capability
Name/description (enterprise web crawler with anti-detection) aligns with included code, examples, and requirements. The code implements request-based and Playwright-based crawling, proxy support, extraction, batching and dynamic rendering as claimed.
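The skill's own extraction API is not reproduced in this listing. As a minimal, standard-library-only illustration of the request-based extraction step the scan refers to (the `LinkExtractor` class and sample HTML below are hypothetical, not code from the package):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute href targets from anchor tags in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL
                    self.links.append(urljoin(self.base_url, value))

page = '<html><body><a href="/docs">Docs</a><a href="https://example.com/x">X</a></body></html>'
extractor = LinkExtractor("https://example.com")
extractor.feed(page)
print(extractor.links)  # ['https://example.com/docs', 'https://example.com/x']
```

A real crawl would fetch `page` over the network (via `requests` or a Playwright-rendered DOM for dynamic pages) before feeding it to the parser; the extraction logic is the same either way.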
Instruction Scope
SKILL.md and README instruct installing dependencies (pip and playwright) and running tests; runtime instructions and code only perform crawling, extraction, downloading, and optional proxy use. There are no instructions to read unrelated host files, environment variables, or to send data to unexpected endpoints.
Install Mechanism
There is no automated install spec in the registry entry (instruction-only), but SKILL.md tells users to run pip install -r requirements.txt and playwright install. Dependencies are standard public packages; Playwright will download browser binaries when installed — this is expected but worth noting.
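The two commands SKILL.md asks users to run, as a copy-paste setup fragment (run from the skill's root directory, where requirements.txt lives):

```shell
# Install the declared Python dependencies (standard public packages)
pip install -r requirements.txt

# Fetch Playwright's browser binaries -- expected behavior,
# but note the extra network and disk cost
playwright install
```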
Credentials
The skill declares no required env vars or credentials. Code supports optional proxy pools and mentions databases (in README), but no credentials are requested or embedded. Any credential use would come from how a user configures proxies or DB connectors, not from the skill itself.
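The package's proxy-pool interface is not documented in this listing; as a hedged standard-library sketch of what routing traffic through a user-supplied proxy looks like (the proxy URL below is hypothetical, and in practice such URLs can embed credentials as `user:pass@host`, which is why they deserve review):

```python
import urllib.request

# Hypothetical proxy URL -- in a real setup this would come from the
# user's proxy pool configuration, possibly with embedded credentials.
proxy_url = "http://127.0.0.1:8080"

# Route both plain and TLS traffic through the proxy
handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
opener = urllib.request.build_opener(handler)

# The opener now sends matching requests via the proxy; no request is made here.
print(sorted(handler.proxies))  # ['http', 'https']
```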
Persistence & Privilege
The skill is not always-enabled and does not request elevated platform privileges. It does not modify other skills or system-wide agent settings.
Assessment
This package is a normal web crawler and appears coherent with its description, but take the usual precautions: install and run it in a controlled environment (Playwright downloads browser binaries); review any proxy or database connection strings you supply, since they can embed credentials; remember that the tool makes network requests to whatever target URLs you provide, so don't point it at internal or sensitive endpoints without authorization; and audit any storage-backend config (MongoDB/MySQL) before entering secrets. If you plan to run it under an automated agent, consider sandboxing network access and limiting where results can be uploaded.
Like a lobster shell, security has layers: review code before you run it.
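The advice about not pointing the crawler at internal endpoints can be enforced with a small pre-flight check before each fetch. A sketch using only the standard library (the `is_safe_target` helper and its policy are assumptions for illustration, not part of the skill):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_target(url: str) -> bool:
    """Reject URLs whose host resolves to a private, loopback, or link-local address."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable host: refuse rather than guess
    for *_, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return False
    return True

print(is_safe_target("http://127.0.0.1/admin"))  # False: loopback address
```

A sandboxed agent would call this before handing any URL to the crawler; note that a determined SSRF defense also has to handle redirects and DNS rebinding, which this sketch does not.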
latest: vk9754rf6rf5fc9wd6p4tkp7vr5834db7
