Install
openclaw skills install webscraper
Extract and parse content from web pages into readable markdown or plain text.
✅ USE this skill when: the user wants to read article content, fetch documentation, grab product info, or get text from URLs.
❌ DON'T use this skill when: the site is interactive, requires login, or relies on complex JavaScript rendering.
# Using OpenClaw web_fetch tool (recommended)
# Called via tool, not direct CLI
# Basic fetch (markdown output)
web_fetch(url: "https://example.com/article")
# Text-only mode (no markdown)
web_fetch(url: "https://example.com/article", extractMode: "text")
# Limit content length
web_fetch(url: "https://example.com/article", maxChars: 5000)
# Simple HTML fetch
curl -s "https://example.com" | html2text -width 80
# With user-agent (avoid bot detection)
curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://example.com"
# Fetch and extract main content (requires readability-cli)
curl -s "https://example.com" | readability
# Get just the title
curl -s "https://example.com" | grep -oP '(?<=<title>).*?(?=</title>)'
# Install cheerio for HTML parsing (local install, so require() can resolve it)
npm install cheerio
# Parse HTML with Node (HTML is piped in on stdin)
curl -s "https://example.com" | node -e "
const cheerio = require('cheerio');
const html = require('fs').readFileSync(0, 'utf8'); // fd 0 = stdin
const \$ = cheerio.load(html);
console.log(\$('article').text());
"
When fetching content, structure responses as:
## 📄 [Page Title]
**Source:** [URL](https://...)
**Fetched:** 2026-03-20
### Content
[Extracted content here...]
---
*Summary: [1-2 sentence summary if helpful]*
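The layout above can also be produced programmatically. A minimal sketch, assuming a hypothetical `formatFetchResult` helper and field names (`title`, `url`, `content`, `summary`, `fetchedAt`) that are illustrative and not part of the skill:

```javascript
// Hypothetical helper: renders a fetched page in the response layout shown above.
// All parameter names are assumptions made for this example.
function formatFetchResult({ title, url, content, summary, fetchedAt = new Date() }) {
  const date = fetchedAt.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  const lines = [
    `## 📄 ${title}`,
    `**Source:** [${url}](${url})`,
    `**Fetched:** ${date}`,
    "### Content",
    content.trim(),
  ];
  // The summary block is optional, as in the template.
  if (summary) {
    lines.push("---", `*Summary: ${summary}*`);
  }
  return lines.join("\n");
}
```

Leaving out `summary` drops the trailing separator and summary line, which matches the "if helpful" wording in the template.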
# Add delay between requests
sleep 2 && curl "https://example.com/page1"
sleep 2 && curl "https://example.com/page2"
# Desktop Chrome
curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://example.com"
# Mobile Safari
curl -s -A "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1" "https://example.com"
# Check HTTP status
curl -s -o /dev/null -w "%{http_code}" "https://example.com"
# Timeout after 10 seconds
curl -s --max-time 10 "https://example.com"
# Retry on failure
curl -s --retry 3 "https://example.com"
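The same status, timeout, and retry handling can be approximated in Node. A sketch, assuming Node 18+ (global `fetch` and `AbortSignal.timeout`); `fetchWithRetry` and its option names are illustrative, not an OpenClaw API, and the retry policy only loosely mirrors `curl --retry`:

```javascript
// Sketch: fetch with a per-attempt timeout and a bounded number of retries.
// `fetchImpl` is injectable so the logic can be exercised without network access.
async function fetchWithRetry(url, { retries = 3, timeoutMs = 10000, fetchImpl = fetch } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (res.ok) return await res.text();
      lastError = new Error(`HTTP ${res.status} for ${url}`);
      // 4xx responses (other than 429) are unlikely to improve on retry.
      if (res.status < 500 && res.status !== 429) break;
    } catch (err) {
      lastError = err; // network error or timeout: try again
    }
  }
  throw lastError;
}
```

This version retries immediately. Adding a pause between attempts is kinder to the target site when failures come from rate limiting.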
# Get all links
curl -s "https://example.com" | grep -oP 'href="\K[^"]+' | head -20
# Get images
curl -s "https://example.com" | grep -oP 'src="\K[^"]+\.(jpg|png|webp)'
# Get meta description
curl -s "https://example.com" | grep -oP '(?<=<meta name="description" content=")[^"]+'
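For use inside agent code, the grep one-liners above translate to plain regular expressions. A rough sketch, assuming the HTML is already in a string; `extractPageInfo` is a hypothetical name, and the patterns are as fragile as their grep counterparts (double-quoted attributes only, fixed attribute order in the meta tag):

```javascript
// Regex-based equivalents of the grep one-liners above.
// For anything beyond quick extraction, a real HTML parser is more reliable.
function extractPageInfo(html) {
  // Collect the first capture group of every match.
  const all = (re) => [...html.matchAll(re)].map((m) => m[1]);
  const title = html.match(/<title>([^<]*)<\/title>/);
  const meta = html.match(/<meta name="description" content="([^"]+)/);
  return {
    title: title ? title[1] : null,
    links: all(/href="([^"]+)/g),
    images: all(/src="([^"]+\.(?:jpg|png|webp))/g),
    description: meta ? meta[1] : null,
  };
}
```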
// In your agent code
const content = await web_fetch({
  url: "https://example.com/article",
  extractMode: "markdown", // or "text"
  maxChars: 10000,
});
For multiple URLs, process sequentially with delays:
URL1 → fetch → wait 2s → URL2 → fetch → wait 2s → URL3 → fetch
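The sequence above could be sketched as a loop. Here `fetchPage` stands in for whatever fetcher the agent uses (for example, a wrapper around `web_fetch`); both it and `fetchSequentially` are assumed names for this example:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing between requests.
// `fetchPage` is a placeholder: any async function that takes a URL and returns content.
async function fetchSequentially(urls, fetchPage, delayMs = 2000) {
  const results = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await sleep(delayMs); // wait between requests, not before the first
    try {
      results.push({ url, content: await fetchPage(url) });
    } catch (err) {
      results.push({ url, error: String(err) }); // record the failure and keep going
    }
  }
  return results;
}
```

A failed URL is recorded in the results and does not abort the batch, so one blocked page does not cost the rest.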
1. Fetch article content
2. Extract main text (remove nav, footer, ads)
3. Generate summary
4. Return with source attribution
1. Fetch product page
2. Extract: name, price, description, specs
3. Format as structured data
4. Return comparison-ready format
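One possible "comparison-ready" shape is a markdown table with one row per product. The sketch below assumes a hypothetical record shape, `{ name, price, description, specs }`, with `specs` as a flat key/value object. How those fields are extracted depends on the page and is not shown:

```javascript
// Hypothetical record shape: { name, price, description, specs: { key: value } }.
// Builds one markdown table row per product, one column per spec key seen anywhere.
function toComparisonTable(products) {
  // Missing values become "n/a"; pipes are escaped so cells stay intact.
  const cell = (v) => String(v ?? "n/a").replace(/\|/g, "\\|");
  const specKeys = [...new Set(products.flatMap((p) => Object.keys(p.specs ?? {})))];
  const header = ["Name", "Price", "Description", ...specKeys];
  const rows = products.map((p) => [
    p.name,
    p.price,
    p.description,
    ...specKeys.map((k) => p.specs?.[k]),
  ]);
  return [
    `| ${header.map(cell).join(" | ")} |`,
    `|${header.map(() => "---").join("|")}|`,
    ...rows.map((r) => `| ${r.map(cell).join(" | ")} |`),
  ].join("\n");
}
```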
1. Fetch docs page
2. Extract relevant section
3. Search for specific topic
4. Return code examples + explanations
| Problem | Solution |
|---|---|
| Content empty/missing | Site uses JS rendering → use BrowserAgent |
| Blocked by site | Add User-Agent, add delay, use proxy |
| Timeout | Increase timeout, check URL validity |
| Garbled text | Check charset, try text mode |
| Login required | Use BrowserAgent with session cookies |
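For the garbled-text case, one thing to check is whether the body was decoded with the charset the server declared. A sketch, assuming the raw bytes and the `Content-Type` header value are available; `decodeBody` is a hypothetical helper, and the set of supported charset labels depends on how the Node runtime was built:

```javascript
// Sketch: decode a response body using the charset declared in Content-Type,
// defaulting to UTF-8 when none is declared or the label is not recognized.
function decodeBody(bytes, contentType = "") {
  const match = /charset=["']?([^"';\s]+)/i.exec(contentType);
  const label = match ? match[1] : "utf-8";
  try {
    return new TextDecoder(label).decode(bytes);
  } catch {
    return new TextDecoder("utf-8").decode(bytes); // unknown label: fall back to UTF-8
  }
}
```

Pages that declare their charset only in a `<meta>` tag are not handled here. Those need a second pass over the first bytes of the body.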
⚠️ Important: