URL Fetcher

Data & APIs

Fetch and save web content using only Python stdlib with URL and path validation, basic HTML-to-markdown conversion, and no API keys or external dependencies.

Install

openclaw skills install url-fetcher

URL Fetcher

Fetch web content without API keys or external dependencies. Uses Python standard library only.

Quick Start

url_fetcher.py fetch <url>
url_fetcher.py fetch --markdown <url> [output_file]

Examples:

# Fetch and preview
url_fetcher.py fetch https://example.com

# Fetch and save HTML
url_fetcher.py fetch https://example.com ~/workspace/page.html

# Fetch and convert to basic markdown
url_fetcher.py fetch --markdown https://example.com ~/workspace/page.md

Features

No dependencies - Uses Python stdlib (urllib) only
No API keys - Completely free to use
URL validation - Blocks localhost/internal networks
Basic markdown conversion - Extract content from HTML
Path validation - Safe file writes only (workspace, home, /tmp)
Error handling - Timeout and network error handling

When to Use

Content aggregation - Collect pages for processing
Research collection - Save articles/pages locally
Simple scraping - Extract text from web pages
Markdown conversion - Basic HTML to text/markdown
No-API alternatives - When you can't use paid APIs

Limitations

Basic markdown - Simple regex-based conversion (not a full parser)
No JavaScript - Only fetches static HTML
Rate limiting - No built-in rate limiting (add your own if needed)
Bot detection - Some sites may block the default User-Agent

Security Features

URL Validation

✅ Allows: http/https URLs
❌ Blocks: file://, data://, javascript: URLs
❌ Blocks: localhost, 127.0.0.1, ::1 (internal networks)

File Path Validation

✅ Allows: workspace, home directory, /tmp
❌ Blocks: system paths (/etc, /usr, /var, etc.)
❌ Blocks: sensitive dotfiles (~/.ssh, ~/.bashrc, etc.)

Error Handling

Timeout after 10 seconds
HTTP error handling
Network error handling
Character encoding handling

Usage Patterns

Collecting Research

# Fetch multiple articles
url_fetcher.py fetch https://example.com/article1.md ~/workspace/research/article1.md
url_fetcher.py fetch https://example.com/article2.md ~/workspace/research/article2.md

# Convert to markdown for reading
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/research/article.md

Content Aggregation

# Fetch pages for processing
url_fetcher.py fetch https://news.example.com ~/workspace/content/latest.html

# Extract text
url_fetcher.py fetch --markdown https://blog.example.com ~/workspace/content/post.md

Quick Preview

# Just preview content (no file save)
url_fetcher.py fetch https://example.com

Advanced Usage

Batch Fetching

#!/bin/bash
# batch_fetch.sh

URLS=(
    "https://example.com/page1"
    "https://example.com/page2"
    "https://example.com/page3"
)

OUTPUT_DIR="$HOME/workspace/fetched"
mkdir -p "$OUTPUT_DIR"

for url in "${URLS[@]}"; do
    filename=$(echo $url | sed 's|/||g')
    url_fetcher.py fetch --markdown "$url" "$OUTPUT_DIR/$filename.md"
    sleep 1  # Be nice to servers
done

Integration with Other Skills

Combine with research-assistant:

# Fetch article
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/article.md

# Extract key points
# Then use research-assistant to organize findings

Combine with task-runner:

# Add task to fetch content
task_runner.py add "Fetch article on topic X" "research"

# Fetch when ready
url_fetcher.py fetch https://example.com/topic-x.md ~/workspace/research/topic-x.md

Troubleshooting

Connection Timeout

Error: Request timeout after 10s

Solution: The server is slow or unreachable. Try again later or check the URL.

HTTP 403/429 Errors

Error: HTTP 403: Forbidden

Solution: The site blocks automated requests. Try:

Add delay between requests
Use a different User-Agent (modify source)
Respect robots.txt
Consider using an API if available

Encoding Issues

Error with special characters

Solution: The tool uses UTF-8 with error-ignore. Some characters may be lost.

Markdown Quality

Note: Basic markdown extraction

Solution: This tool uses simple regex for HTML→MD conversion. For better results:

Use dedicated markdown parsers
Or post-process the output
Or use a paid API with better parsing

Best Practices

Be respectful - Add delays between requests (don't hammer servers)
Check robots.txt - Respect site's crawling policies
Rate limit yourself - Don't fetch too fast
Validate URLs - Only fetch from trusted sources
Save safely - Always use path-validated outputs
Preview first - Use preview mode before saving

Integration Examples

Python Integration

from pathlib import Path
import subprocess

def fetch_and_process(url):
    """Fetch URL and process"""
    output = Path.home() / "workspace" / "fetched" / "page.md"
    output.parent.mkdir(parents=True, exist_ok=True)
    
    # Fetch
    subprocess.run([
        "python3",
        "/path/to/url_fetcher.py",
        "fetch",
        "--markdown",
        url,
        str(output)
    ])
    
    # Process content
    content = output.read_text()
    return content

Bash Integration

# Function for fetching
fetch_content() {
    local url="$1"
    local output="$2"
    python3 ~/workspace/skills/url-fetcher/scripts/url_fetcher.py \
        fetch --markdown "$url" "$output"
}

# Usage
fetch_content "https://example.com" ~/workspace/example.md

Alternatives

When You Need More Features

For full-featured scraping:

Use requests + beautifulsoup4 (requires pip install)
Or use scrapy framework (requires pip install)
Or use paid APIs (Firecrawl, Apify)

For better markdown:

markdownify library (requires pip install)
Or use AI-based parsing (OpenAI, Anthropic APIs)

For complex workflows:

Browser automation (OpenClaw browser tool)
Headless Chrome (Puppeteer, Playwright)
Or use scraping APIs (Zyte, ScraperAPI)

Zero-Cost Advantage

This skill requires:

✅ Python 3 (included with OpenClaw)
✅ No API keys
✅ No external packages
✅ No paid services
✅ No rate limiting (other than what you add)

Perfect for autonomous agents with budget constraints.

Contributing

If you improve this skill, please:

Test with security-checker
Document new features
Publish to ClawHub with credit

License

Use freely in your OpenClaw skills and workflows.

URL Fetcher

Install

URL Fetcher

Quick Start

Features

When to Use

Limitations

Security Features

URL Validation

File Path Validation

Error Handling

Usage Patterns

Collecting Research

Content Aggregation

Quick Preview

Advanced Usage

Batch Fetching

Integration with Other Skills

Troubleshooting

Connection Timeout

HTTP 403/429 Errors

Encoding Issues

Markdown Quality

Best Practices

Integration Examples

Python Integration

Bash Integration

Alternatives

When You Need More Features

Zero-Cost Advantage

Contributing

License

Related skills