Install
openclaw skills install sitemap-content-scraperDiscover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy, or changelog, and scrape selected public pages into a local folder as Markdown plus a manifest.
openclaw skills install sitemap-content-scraperUse this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.
python3 {baseDir}/scripts/discover_sitemaps.py <site-or-url>.https://example.com/docs), use scope_hint_substring from discovery output as default filter guidance.python3 {baseDir}/scripts/scrape_sitemap.py --sitemap-url <chosen-sitemap> --output-dir <destination>, and when a scoped URL was provided add --include-substring <scope_hint_substring> unless the user overrides scope.Discover sitemap inventory:
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com
Discover and preserve scope hint from a direct URL prompt:
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
Scrape one sitemap into a chosen folder:
python3 {baseDir}/scripts/scrape_sitemap.py \
--sitemap-url https://example.com/docs-sitemap.xml \
--output-dir /tmp/example-docs
Filter to a subset of URLs when the sitemap mixes sections:
python3 {baseDir}/scripts/scrape_sitemap.py \
--sitemap-url https://example.com/sitemap.xml \
--output-dir /tmp/example-docs \
--include-substring /docs/ \
--exclude-substring /tag/
docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.manifest.json at the output root with success and failure details.Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.
example.com/docs content into ./out/docs."https://example.com/help."example.com and scrape only posts."http and https targets.localhost, private IP ranges, and internal-only hostnames.