# Sitemap Generator

Generate XML sitemaps by crawling a live website or by scanning local HTML files.
## Crawl a Website

```bash
python3 scripts/sitemap_gen.py https://example.com
```
## Scan Local Files

```bash
python3 scripts/sitemap_gen.py --local ./public --base-url https://example.com
```
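In local mode, each scanned file's path relative to the scanned directory is joined onto `--base-url` to form the page URL. The script's internals aren't shown here, so this is only a minimal sketch of that mapping under assumed behavior; the function name `files_to_urls` and the exact extension set are hypothetical:

```python
from pathlib import Path

# Hypothetical sketch: map local page files under a root directory
# to absolute URLs rooted at base_url.
def files_to_urls(root, base_url, exts=(".html", ".htm", ".md", ".php")):
    root = Path(root)
    urls = []
    for path in sorted(root.rglob("*")):
        # Keep only page-like files; assets (CSS, images, ...) are skipped.
        if path.suffix.lower() in exts:
            rel = path.relative_to(root).as_posix()
            urls.append(base_url.rstrip("/") + "/" + rel)
    return urls
```

For example, `./public/about.htm` would become `https://example.com/about.htm`.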
## Save to File

```bash
# Save sitemap.xml
python3 scripts/sitemap_gen.py https://example.com --output sitemap.xml

# Save sitemap.xml + robots.txt
python3 scripts/sitemap_gen.py https://example.com --output sitemap.xml --robots
```
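The Features section notes that `--robots` emits a robots.txt with a User-agent line, an Allow rule, and a Sitemap reference. A minimal sketch of such a file, assuming a permissive allow-all policy (the helper name `make_robots` is hypothetical):

```python
# Hypothetical sketch of the robots.txt that --robots might generate:
# allow all crawlers everywhere, and point them at the sitemap.
def make_robots(sitemap_url):
    return "User-agent: *\nAllow: /\nSitemap: {}\n".format(sitemap_url)
```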
## Output Formats

```bash
# XML (default; a valid sitemap.xml)
python3 scripts/sitemap_gen.py https://example.com

# Text (human-readable summary + XML)
python3 scripts/sitemap_gen.py https://example.com --format text

# JSON (pages list + XML string)
python3 scripts/sitemap_gen.py https://example.com --format json
```
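The default XML output follows the sitemaps.org schema with proper escaping (see Features). A minimal sketch of how one `<url>` entry could be built; the helper name `url_entry` is hypothetical, but the element names and the escaping requirement come from the sitemaps.org protocol:

```python
from xml.sax.saxutils import escape

# Hypothetical sketch of one sitemap <url> entry.
# escape() handles &, <, > so URLs with query strings stay valid XML.
def url_entry(loc, lastmod=None):
    lines = ["  <url>", "    <loc>{}</loc>".format(escape(loc))]
    if lastmod:
        lines.append("    <lastmod>{}</lastmod>".format(lastmod))
    lines.append("  </url>")
    return "\n".join(lines)
```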
## Options

| Flag | Default | Description |
|---|---|---|
| `--max-pages` | 500 | Maximum pages to crawl |
| `--timeout` | 10 | Request timeout in seconds |
| `--output` / `-o` | stdout | Save sitemap.xml to a file |
| `--robots` | off | Also generate robots.txt |
| `--local` | off | Scan a local directory instead of crawling |
| `--base-url` | — | Base URL for local mode (required with `--local`) |
| `--verbose` / `-v` | off | Show crawl progress |
## Features

- Crawl mode: BFS link discovery, restricted to the same domain, with deduplication
- Local mode: scans HTML/HTM/MD/PHP files and auto-detects lastmod from each file's mtime
- Smart filtering: Skips images, CSS, JS, PDFs, archives, media files
- URL normalization: Removes fragments, normalizes trailing slashes
- robots.txt generation: User-agent + Allow + Sitemap reference
- Valid XML: Proper XML escaping, sitemaps.org schema
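The URL normalization described above (dropping fragments, normalizing trailing slashes) can be sketched with the stdlib; the function name `normalize` and the exact slash policy are assumptions, not the script's verified behavior:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical sketch of URL normalization:
# drop the #fragment and strip a trailing slash (except for the root path).
def normalize(url):
    parts = urlsplit(url)
    path = parts.path or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    # Empty final component removes the fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```

This makes `https://example.com/a/#top` and `https://example.com/a` deduplicate to the same entry.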
## Requirements
- Python 3.6+
- No external dependencies (stdlib only)