URL to Markdown

The url2md skill fetches HTML pages over HTTP/HTTPS and converts them to clean, readable Markdown, with optional batch processing, YAML frontmatter, and output templates.

Install

openclaw skills install url2md

Url2md

Convert web pages to clean, readable Markdown.

Quick Start

Single URL

python3 scripts/url2md.py https://example.com/article

Output to a file:

python3 scripts/url2md.py https://example.com/article -o article.md

Batch Conversion

Create a file with URLs (one per line):

https://example.com/article-1
https://example.com/article-2
https://example.com/article-3

Convert all and save to a directory:

python3 scripts/url2md.py -f urls.txt -d ./markdown_files/
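
Before a batch run it can help to tidy the URL list. The sketch below is a hypothetical pre-processing helper, not part of url2md: the name clean_url_list is an assumption, and it simply drops blank lines, non-HTTP(S) entries, and duplicates so that the file handed to -f has one usable URL per line.

```python
from urllib.parse import urlsplit


def clean_url_list(lines):
    """Return unique http(s) URLs in first-seen order, skipping blanks."""
    seen = set()
    urls = []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue
        try:
            parts = urlsplit(url)
        except ValueError:
            # Malformed entries (for example a broken IPv6 host) are dropped.
            continue
        # Keep only absolute HTTP/HTTPS URLs that have a host component.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    candidates = [
        "https://example.com/article-1",
        "",
        "ftp://example.com/file",
        "https://example.com/article-1",
        "https://example.com/article-2",
    ]
    # Prints one cleaned URL per line, ready to redirect into urls.txt.
    print("\n".join(clean_url_list(candidates)))
```

Whether url2md itself tolerates blank lines or duplicates in the input file is not stated here, so cleaning the list up front avoids relying on that.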

Features

  • No dependencies: Uses only the Python standard library (urllib, html.parser)
  • Reader-style scope: Strips script/style/noscript/template, then prefers the first <article> or <main> (else <body>) so output focuses on primary content
  • Title extraction: Uses og:title / Twitter title when present, otherwise <title>, added as H1 when enabled
  • YAML Frontmatter: Extracts structured metadata (title, author, published, description, category, source) from <meta> tags and Schema.org JSON-LD for knowledge-base workflows
  • Template system: Customize output format with variables ({{title}}, {{content}}, {{author}}, {{published}}, {{date}}, etc.)
  • Link resolution: Converts relative URLs to absolute
  • Basic formatting: Headings, paragraphs, lists, links, images, fenced code (with optional language), GFM-style tables, bold/italic
  • Noise removal: Skips navigation, sidebars, footers, forms, and other chrome inside the parsed fragment
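
To make the title-extraction fallback concrete, here is a minimal sketch built on html.parser. It is an illustration, not the skill's actual code: the class and function names are hypothetical, and the exact priority between og:title and the Twitter title is an assumption.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect og:title, twitter:title, and the first <title> text."""

    def __init__(self):
        super().__init__()
        self.meta_titles = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=..., Twitter cards usually use name=...
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key in ("og:title", "twitter:title") and content:
                self.meta_titles.setdefault(key, content)
        elif tag == "title" and not self._title_parts:
            # Only the first <title> counts; later ones (e.g. in SVG) are ignored.
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def best_title(self):
        # Collapse runs of whitespace in the <title> text.
        doc_title = " ".join("".join(self._title_parts).split())
        # Assumed order: og:title, then twitter:title, then <title>.
        return (
            self.meta_titles.get("og:title")
            or self.meta_titles.get("twitter:title")
            or doc_title
            or None
        )


def extract_title(html):
    """Return the preferred title for an HTML document, or None."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.best_title()
```

Calling extract_title(page_html) returns a string that could be emitted as the H1, or None when the page has no usable title.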

Script Reference

scripts/url2md.py

Usage:

url2md.py [url] [options]

Options:

Option                 Description
url                    Single URL to convert
-o, --output           Output file (default: stdout)
-f, --file             File containing URLs to convert
-d, --dir              Output directory for batch conversion
--no-title             Skip adding page title as H1
--full-page            Parse full <body> instead of <article>/<main> first (more chrome, wider coverage)
--timeout              Request timeout in seconds (default: 30)
--frontmatter          Add YAML frontmatter with extracted metadata
-t, --template         Path to a template file for customizing output
--filename-template    Batch mode filename pattern (e.g. {{date}}-{{title}}.md)
--download-images      Download remote images to a local folder (e.g. assets)
-v, --version          Show version
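
The options listed above map naturally onto argparse. The following is a sketch of how such a command line could be declared, reconstructed only from the option list; the real script may differ, the float type for --timeout is an assumption, and the version string is a placeholder.

```python
import argparse


def build_parser():
    """Declare a parser mirroring the documented url2md.py options."""
    parser = argparse.ArgumentParser(
        prog="url2md.py",
        description="Convert web pages to clean, readable Markdown.",
    )
    parser.add_argument("url", nargs="?", help="Single URL to convert")
    parser.add_argument("-o", "--output", help="Output file (default: stdout)")
    parser.add_argument("-f", "--file", help="File containing URLs to convert")
    parser.add_argument("-d", "--dir", help="Output directory for batch conversion")
    parser.add_argument("--no-title", action="store_true",
                        help="Skip adding page title as H1")
    parser.add_argument("--full-page", action="store_true",
                        help="Parse full <body> instead of <article>/<main> first")
    # Assumption: the timeout accepts fractional seconds.
    parser.add_argument("--timeout", type=float, default=30,
                        help="Request timeout in seconds (default: 30)")
    parser.add_argument("--frontmatter", action="store_true",
                        help="Add YAML frontmatter with extracted metadata")
    parser.add_argument("-t", "--template",
                        help="Path to a template file for customizing output")
    parser.add_argument("--filename-template",
                        help="Batch mode filename pattern")
    parser.add_argument("--download-images", metavar="DIR",
                        help="Download remote images to a local folder")
    # Placeholder version text; the actual version is not stated in this document.
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s (version placeholder)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-f", "urls.txt", "-d", "./output/", "--timeout", "60"]
    )
    print(args.file, args.dir, args.timeout)
```

Making url optional (nargs="?") is what lets the same parser serve both the single-URL form and the -f batch form.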

Examples:

# Single URL to stdout
python3 scripts/url2md.py https://docs.python.org/3

# Save to file
python3 scripts/url2md.py https://docs.python.org/3 -o python-docs.md

# Batch with custom timeout
python3 scripts/url2md.py -f urls.txt -d ./output/ --timeout 60

# Skip title
python3 scripts/url2md.py https://example.com --no-title

# Whole body (no article/main focus)
python3 scripts/url2md.py https://example.com/sitemap --full-page -o sitemap.md

# YAML frontmatter (great for Obsidian / PKM)
python3 scripts/url2md.py https://example.com/article --frontmatter -o article.md

# Custom template
python3 scripts/url2md.py https://example.com/article -t article.tpl -o article.md

# Batch with smart filenames
python3 scripts/url2md.py -f urls.txt -d ./output/ --filename-template "{{date}}-{{title}}.md"

# Download images locally
python3 scripts/url2md.py https://example.com/article -o article.md --download-images assets
python3 scripts/url2md.py -f urls.txt -d ./output/ --download-images assets

Template variables: {{title}}, {{content}}, {{url}}, {{source}}, {{author}}, {{published}}, {{description}}, {{category}}, {{site_name}}, {{date}}, {{datetime}}
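
As an illustration of how {{variable}} placeholders like these can be expanded, here is a minimal substitution sketch. It is not the skill's implementation: the function names are hypothetical, replacing unknown variables with an empty string is an assumption, and the date value in the usage lines is a made-up example format.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace {{name}} placeholders; missing names become empty strings."""
    def substitute(match):
        value = values.get(match.group(1))
        return "" if value is None else str(value)
    return _PLACEHOLDER.sub(substitute, template)


def safe_filename(name, fallback="untitled"):
    """Collapse characters that are awkward in filenames into hyphens."""
    stem = re.sub(r"[^\w.-]+", "-", name).strip("-.")
    return stem or fallback


if __name__ == "__main__":
    page = {
        "title": "Hello, World!",
        "url": "https://example.com/article",
        "content": "Body text.",
        "date": "2024-01-15",  # hypothetical date format
    }
    print(render("# {{title}}\n\nSource: {{url}}\n\n{{content}}\n", page))
    # Sanitize the title first so the rendered pattern is a usable filename.
    name_values = dict(page, title=safe_filename(page["title"]))
    print(render("{{date}}-{{title}}.md", name_values))
```

Sanitizing individual values before rendering, rather than the rendered result, keeps separators and the .md extension from the pattern intact.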

When to Use

  • Converting documentation pages to Markdown for local reference
  • Archiving web articles as text files
  • Building a knowledge base with structured metadata (frontmatter / templates)
  • Building static content from dynamic sources
  • Extracting readable content when browser tools are unavailable
  • Batch processing a list of URLs

Limitations

  • Converts static HTML only; does not execute JavaScript
  • Complex layouts (multi-column, heavy CSS) may lose structural fidelity
  • Login-required or paywalled pages need authentication tokens, and none of the options listed above supplies them
  • Rate-limited sites may block repeated requests
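
For rate-limited sites, one common mitigation is to retry with exponential backoff. The sketch below shows the idea using only the standard library. It is a standalone illustration, not a url2md feature: the function name, retry count, delays, and the choice to retry only on HTTP 429 and 503 are all assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, fetch=None, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch url, retrying with exponential backoff on HTTP 429 or 503."""
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=30) as response:
                return response.read()
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as error:
            # Only retry responses that usually signal throttling or overload.
            if error.code not in (429, 503) or attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The fetch and sleep parameters exist so the retry logic can be exercised without network access; in normal use both can be left at their defaults.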