WebScraper
Extract readable content from web pages. Use when: user wants to read article content, fetch documentation, grab product info, or get text from URLs. NOT for...
MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Suspicious (medium confidence)
Purpose & Capability
Name, description, and declared binaries (curl, node) are appropriate for a web content extraction skill. No unrelated credentials, config paths, or unexpected system access are requested.
Instruction Scope
Most instructions stay within scraping/extraction scope, but several recommendations are risky or inconsistent:
1. The provided Node one-liner embeds $(curl ...) inside a node -e string, so untrusted HTML undergoes shell substitution into executable JS; if the fetched page contains quotes or backticks, this can enable command injection or arbitrary execution.
2. Suggestions to 'use proxy' and to set a User-Agent to avoid bot detection encourage evading anti-bot measures, which conflicts with the skill's own 'Respect robots.txt' admonition.
3. Piping remote content into local commands (curl | readability) is suggested without any caution about executing untrusted data.
These are sloppy, insecure operational patterns that could lead to accidental code execution or misuse.
Install Mechanism
The skill is instruction-only (no install spec), which is low-risk, but the doc recommends installing global tools (npm -g cheerio, readability-cli, html2text). Global npm installs and third-party CLIs are normal for this task but increase the attack surface and call for user discretion; no install URLs or obscure downloads are present in the package metadata.
Credentials
The skill does not request environment variables, credentials, or config paths. Recommendations (e.g., using proxies) might imply credential usage in practice, but nothing is declared or required by the skill itself.
Persistence & Privilege
Flags are default (always: false, user-invocable: true, autonomous invocation allowed). The skill does not request permanent system presence, nor does it modify other skills or system configs.
What to consider before installing
This skill appears to do what it says (fetch and extract page content), but the runtime instructions include unsafe command patterns and operational advice you should not run verbatim. Before installing or using:
1. Avoid the node -e $(curl ...) pattern. Instead, download the HTML to a file (curl -s URL > page.html) and run a safe parser that reads the file, or fetch directly from Node using libraries (axios/fetch) to avoid shell injection.
2. Prefer installing libraries per-project rather than with npm -g to reduce global risk.
3. Do not use proxies or User-Agent tricks to bypass site protections unless you have explicit permission; doing so may violate terms of service and the law.
4. Be cautious about piping remote content into executables (curl | program), because that can execute untrusted data.
If the maintainer sanitizes the examples (safe Node fetch, no shell interpolation of remote data, clarified proxy guidance), the skill would be coherent and much safer.
Like a lobster shell, security has layers: review code before you run it.
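To make the first point concrete, here is a minimal sketch of the file-based pattern, using a hypothetical local file (/tmp/page.html) with deliberately hostile HTML standing in for a downloaded page. Content saved to disk is plain data, so an embedded $(...) is never executed by the shell:

```shell
# Write untrusted HTML to a file (stand-in for: curl -s "$URL" -o /tmp/page.html).
printf '<html><head><title>Demo $(rm -rf /) page</title></head></html>' > /tmp/page.html
# grep reads the file as data; the embedded $(...) is printed, never run.
grep -oP '(?<=<title>).*?(?=</title>)' /tmp/page.html
```

Contrast this with interpolating the page into a command string, where the same $(...) would be evaluated by the shell.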
Current version: v1.0.0
Runtime requirements
🕸️ Clawdis
Bins: curl, node
SKILL.md
WebScraper Skill
Extract and parse content from web pages into readable markdown or plain text.
When to Use
✅ USE this skill when:
- "Read this article: [URL]"
- "What does this page say?"
- "Get the content from [URL]"
- Fetch documentation, blog posts, news articles
- Extract product information from e-commerce sites
- Grab API documentation or tutorials
- Summarize web page content
When NOT to Use
❌ DON'T use this skill when:
- Login-required pages (use BrowserAgent with session)
- Heavy JavaScript-rendered content (use BrowserAgent)
- Interactive web apps (dashboards, SPAs)
- CAPTCHA-protected sites
- Sites with strict anti-bot measures
- Real-time data (stock tickers, live scores)
Commands
Fetch URL Content
# Using OpenClaw web_fetch tool (recommended)
# Called via tool, not direct CLI
# Basic fetch (markdown output)
web_fetch(url: "https://example.com/article")
# Text-only mode (no markdown)
web_fetch(url: "https://example.com/article", extractMode: "text")
# Limit content length
web_fetch(url: "https://example.com/article", maxChars: 5000)
Using curl (fallback)
# Simple HTML fetch
curl -s "https://example.com" | html2text -width 80
# With user-agent (avoid bot detection)
curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://example.com"
# Fetch and extract main content (requires readability-cli)
curl -s "https://example.com" | readability
# Get just the title
curl -s "https://example.com" | grep -oP '(?<=<title>).*?(?=</title>)'
Using Node.js (advanced)
# Install cheerio for HTML parsing
npm install -g cheerio
# Parse HTML with Node
node -e "
const cheerio = require('cheerio');
const html = \`\$(curl -s 'https://example.com')\`;
const \$ = cheerio.load(html);
console.log(\$('article').text());
"
Response Format
When fetching content, structure responses as:
## 📄 [Page Title]
**Source:** [URL](https://...)
**Fetched:** 2026-03-20
### Content
[Extracted content here...]
---
*Summary: [1-2 sentence summary if helpful]*
Best Practices
1. Respect Rate Limits
# Add delay between requests
sleep 2 && curl "https://example.com/page1"
sleep 2 && curl "https://example.com/page2"
2. Use Proper User-Agent
# Desktop Chrome
curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
# Mobile Safari
curl -A "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1"
3. Handle Errors
# Check HTTP status
curl -s -o /dev/null -w "%{http_code}" "https://example.com"
# Timeout after 10 seconds
curl -s --max-time 10 "https://example.com"
# Retry on failure
curl -s --retry 3 "https://example.com"
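The three checks above can be wired together into one guarded fetch. fetch_checked is a hypothetical helper name; it probes the status first and only fetches the body on HTTP 200:

```shell
# fetch_checked: probe the status, then fetch only on HTTP 200.
# (Hypothetical helper; flags mirror the examples above.)
fetch_checked() {
  url="$1"
  status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 --retry 3 "$url")
  if [ "$status" = "200" ]; then
    curl -s --max-time 10 "$url"
  else
    echo "fetch failed for $url (HTTP $status)" >&2
    return 1
  fi
}
```

Note this makes two requests per URL; curl -f (--fail) is a single-request alternative that exits non-zero on HTTP errors.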
4. Extract Specific Content
# Get all links
curl -s "https://example.com" | grep -oP 'href="\K[^"]+' | head -20
# Get images
curl -s "https://example.com" | grep -oP 'src="\K[^"]+\.(jpg|png|webp)'
# Get meta description
curl -s "https://example.com" | grep -oP '(?<=<meta name="description" content=")[^"]+'
Integration with OpenClaw
Using web_fetch Tool
// In your agent code
const content = await web_fetch({
url: "https://example.com/article",
extractMode: "markdown", // or "text"
maxChars: 10000
});
Batch Processing
For multiple URLs, process sequentially with delays:
URL1 → fetch → wait 2s → URL2 → fetch → wait 2s → URL3 → fetch
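That sequence can be sketched as a loop. /tmp/urls.txt is a hypothetical input file with one URL per line, and echo stands in for the actual fetch:

```shell
# One URL per line (hypothetical input file).
printf 'https://example.com/a\nhttps://example.com/b\n' > /tmp/urls.txt
while IFS= read -r url; do
  echo "would fetch: $url"   # swap in web_fetch/curl here
  sleep 2                    # keep the 2s delay between requests
done < /tmp/urls.txt
```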
Common Use Cases
1. Article Summarization
1. Fetch article content
2. Extract main text (remove nav, footer, ads)
3. Generate summary
4. Return with source attribution
2. Product Information
1. Fetch product page
2. Extract: name, price, description, specs
3. Format as structured data
4. Return comparison-ready format
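A minimal sketch of step 2 for the product case, using a local sample page with hypothetical markup. Real product pages vary widely, so pattern-based extraction like this is fragile and a proper HTML parser (e.g. cheerio) is the robust route:

```shell
# Sample product page standing in for a fetched one (hypothetical markup).
printf '<h1 class="name">Widget</h1><span class="price">$19.99</span>' > /tmp/product.html
# Crude pattern-based extraction; real pages need a parser.
name=$(grep -oP '(?<=<h1 class="name">)[^<]+' /tmp/product.html)
price=$(grep -oP '(?<=<span class="price">)[^<]+' /tmp/product.html)
printf 'name: %s\nprice: %s\n' "$name" "$price"
```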
3. Documentation Lookup
1. Fetch docs page
2. Extract relevant section
3. Search for specific topic
4. Return code examples + explanations
Troubleshooting
| Problem | Solution |
|---|---|
| Content empty/missing | Site uses JS rendering → use BrowserAgent |
| Blocked by site | Add User-Agent, add delay, use proxy |
| Timeout | Increase timeout, check URL validity |
| Garbled text | Check charset, try text mode |
| Login required | Use BrowserAgent with session cookies |
Related Skills
- BrowserAgent - For interactive/JS-heavy sites
- web_search - For finding URLs before fetching
- coding-agent - For processing extracted data
Security Notes
⚠️ Important:
- Respect robots.txt
- Don't scrape personal data
- Honor copyright/terms of service
- Add delays between requests (2-5s)
- Don't overload servers
- Use official APIs when available
Files
2 total
