Sitemap Content Scraper

v1.0.2

Discover website sitemaps from robots.txt and common sitemap locations, then choose the right sitemap or content family to scrape, such as docs, blog, help center, or academy.

by gunes alcan (@quareth)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for quareth/sitemap-content-scraper.

Prompt preview (Install & Setup):
Install the skill "Sitemap Content Scraper" (quareth/sitemap-content-scraper) from ClawHub.
Skill page: https://clawhub.ai/quareth/sitemap-content-scraper
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install sitemap-content-scraper

ClawHub CLI

npx clawhub@latest install sitemap-content-scraper
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description match the included Python scripts (discover_sitemaps.py and scrape_sitemap.py). The required runtime (python3) and the absence of credentials or config paths are consistent with a public-site sitemap discovery and scraping tool.
Instruction Scope
SKILL.md restricts activity to public http/https targets and instructs running the included scripts; the scripts perform network requests and write files to a user-specified output directory as expected. The SKILL.md guardrails (reject localhost/private IPs, avoid auth/cookies, ask before writing outside working area) align with the script behavior.
Install Mechanism
No install spec (instruction-only) and only a dependency on python3. The skill bundles the scraper scripts rather than downloading external code at runtime, avoiding high-risk remote installs.
Credentials
No environment variables, credentials, or unrelated binaries are requested. The scripts access network and local filesystem as required by a scraper; nothing asks for unrelated secrets or broad system config access.
Persistence & Privilege
The skill is user-invocable and not always-enabled; it does not request persistent privileges or attempt to modify other skills or global agent configuration.
Assessment
This skill appears to do what it says: it runs the included Python scripts to discover public sitemaps and fetch pages, then writes Markdown files to the destination folder you choose. Before running:

  1. Inspect the bundled scripts (they are included) and run them in a sandbox or container if you are cautious.
  2. Target only public http/https hosts and avoid internal/private hostnames, as advised.
  3. Choose an output directory you control and confirm the agent asks before writing outside that area.
  4. Be aware that the scraper performs arbitrary HTTP requests, so do not point it at services where requests could trigger actions or costs.


Runtime requirements

Bins: python3
Latest: vk9702j0mpsmaz0v71hff99vc8x83tg8h
131 downloads
2 stars
3 versions
Updated 4w ago
v1.0.2
MIT-0 license

Sitemap Content Scraper

Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.

Workflow

  1. Ask for the website or URL scope if it is not already provided.
  2. Run python3 {baseDir}/scripts/discover_sitemaps.py <site-or-url>.
  3. Summarize the discovered sitemap inventory in plain language.
  4. If the user gave a scoped URL (for example https://example.com/docs), use the scope_hint_substring value from the discovery output as the default filter guidance.
  5. Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
  6. Map the user request to the most relevant sitemap by name and sample URL patterns.
  7. If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
  8. Ask for the destination folder if it is missing.
  9. Run python3 {baseDir}/scripts/scrape_sitemap.py --sitemap-url <chosen-sitemap> --output-dir <destination>; when a scoped URL was provided, add --include-substring <scope_hint_substring> unless the user overrides the scope.
  10. Report what was scraped, where it was saved, and any skipped or failed pages.
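The discovery step in the workflow above can be sketched roughly as follows. The bundled discover_sitemaps.py is the authoritative implementation; `sitemap_candidates` is a hypothetical helper illustrating the robots.txt-plus-common-locations approach the skill description mentions:

```python
# Hypothetical sketch of sitemap discovery; the bundled
# discover_sitemaps.py is the real implementation.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]

def sitemap_candidates(site, robots_txt=""):
    """Collect candidate sitemap URLs from robots.txt Sitemap:
    directives and common well-known locations, preserving order
    and dropping duplicates."""
    base = site.rstrip("/")
    found = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    found.extend(base + path for path in COMMON_PATHS)
    return list(dict.fromkeys(found))
```

Sitemaps listed in robots.txt come first because they are the site's own declaration; the well-known paths are only fallbacks.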

Quick Commands

Discover sitemap inventory:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com

Discover and preserve scope hint from a direct URL prompt:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs

Scrape one sitemap into a chosen folder:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/docs-sitemap.xml \
  --output-dir /tmp/example-docs

Filter to a subset of URLs when the sitemap mixes sections:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/sitemap.xml \
  --output-dir /tmp/example-docs \
  --include-substring /docs/ \
  --exclude-substring /tag/
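The include/exclude flags above imply a simple substring filter. A minimal sketch of that behavior, assuming each flag takes a single substring as in the commands shown:

```python
def filter_urls(urls, include_substring=None, exclude_substring=None):
    """Keep only URLs containing the include substring (if given)
    and drop any containing the exclude substring."""
    kept = []
    for url in urls:
        if include_substring and include_substring not in url:
            continue
        if exclude_substring and exclude_substring in url:
            continue
        kept.append(url)
    return kept
```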

Selection Rules

  • Prefer sitemaps explicitly named for the requested content family, such as docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
  • Use the sample URLs returned by discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
  • If the request is broad, offer the discovered choices instead of scraping everything by default.
  • If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
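Mapping a requested content family to sitemap names, as the rules above describe, might be sketched like this. FAMILY_KEYWORDS is an illustrative mapping, not the skill's actual logic:

```python
# Illustrative family-to-keyword mapping; the skill itself matches
# by sitemap name and sample URL patterns.
FAMILY_KEYWORDS = {
    "docs": ["docs", "documentation"],
    "blog": ["blog", "post"],
    "help": ["help", "kb", "support"],
    "academy": ["academy", "course"],
}

def matching_sitemaps(sitemap_urls, family):
    """Return sitemaps whose names suggest the requested family;
    a multi-element result means the user should pick one."""
    keywords = FAMILY_KEYWORDS.get(family, [family])
    return [u for u in sitemap_urls if any(k in u.lower() for k in keywords)]
```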

Output Contract

  • Save one Markdown file per scraped page.
  • Save manifest.json at the output root with success and failure details.
  • Keep source URLs in the Markdown header so the corpus remains traceable.
  • Preserve a stable folder structure derived from the source URL path.
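The contract above implies a deterministic mapping from page URL to output file plus a traceable header. One plausible sketch, with the caveat that the manifest format and exact header layout are the bundled scripts' concern:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def output_path_for(url, output_dir):
    """Mirror the URL path under the output directory, one .md per page."""
    path = urlparse(url).path.strip("/") or "index"
    return str(PurePosixPath(output_dir) / (path + ".md"))

def markdown_page(url, title, body):
    """Keep the source URL in the header so the corpus stays traceable."""
    return f"---\nsource: {url}\ntitle: {title}\n---\n\n{body}\n"
```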

Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.

Trigger Examples

  • "Scrape example.com/docs content into ./out/docs."
  • "Pull the help center pages from https://example.com/help."
  • "Find blog sitemaps for example.com and scrape only posts."

Guardrails

  • Scrape only public content.
  • Accept only http and https targets.
  • Reject localhost, private IP ranges, and internal-only hostnames.
  • Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
  • Respect the chosen sitemap scope instead of broad site crawling.
  • Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
  • Do not use authentication headers, cookies, or tokens.
  • Ask before writing outside the intended working area.
  • Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
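The hostname-resolution guardrail above could be approximated with the standard library. This is a hedged sketch (is_public_ip and is_public_target are hypothetical names); a full implementation would also re-run the same check against every redirect target at request time:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_public_ip(addr_str):
    """Reject loopback, private, link-local, and reserved addresses."""
    addr = ipaddress.ip_address(addr_str)
    return not (addr.is_loopback or addr.is_private
                or addr.is_link_local or addr.is_reserved)

def is_public_target(url):
    """Accept only http/https URLs whose host resolves to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    return all(is_public_ip(info[4][0]) for info in infos)
```

Checking every resolved address, not just the first, matters because a hostname can mix public and private records.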
