{"skill":{"slug":"siteone-crawler","displayName":"Siteone Crawler","summary":"Website crawling, auditing, offline cloning, and markdown export using SiteOne Crawler (Rust). Trigger when the user asks to: crawl a website, audit/analyze...","description":"---\nname: siteone-crawler\ndescription: >\n  Website crawling, auditing, offline cloning, and markdown export using SiteOne Crawler (Rust).\n  Trigger when the user asks to: crawl a website, audit/analyze a site for SEO/security/performance/accessibility,\n  generate an HTML audit report, clone a website for offline browsing, export a website to markdown,\n  generate a sitemap, stress/load test a site, or run a CI/CD quality gate check.\n  Binary path: ~/.siteone-crawler/siteone-crawler\n---\n\n# SiteOne Crawler\n\nCross-platform website crawler/analyzer written in Rust.\n\n## Setup (run once)\n\nBefore first use, ensure the binary exists. If not found, install it automatically:\n\n1. Check if binary exists at the paths below (in order of priority):\n   - `$HOME/.siteone-crawler/siteone-crawler`\n   - Any `siteone-crawler` found in `$PATH` (via `which siteone-crawler`)\n2. If neither exists, download the latest release from GitHub:\n   ```bash\n   INSTALL_DIR=\"$HOME/.siteone-crawler\"\n   mkdir -p \"$INSTALL_DIR\"\n   # Detect OS/arch\n   OS=$(uname -s | tr '[:upper:]' '[:lower:]')\n   ARCH=$(uname -m)\n   case \"$ARCH\" in x86_64) ARCH=\"x64\" ;; aarch64|arm64) ARCH=\"arm64\" ;; esac\n   # Get latest release URL from GitHub API\n   RELEASE_URL=$(curl -sL https://api.github.com/repos/janreges/siteone-crawler/releases/latest \\\n     | grep -oP \"browser_download_url.*?${OS}-${ARCH}\\.zip\" | head -1 | sed 's/browser_download_url\": \"//')\n   curl -sL \"$RELEASE_URL\" -o /tmp/siteone-crawler.zip \\\n     && unzip -o /tmp/siteone-crawler.zip -d /tmp/siteone-crawler \\\n     && mv /tmp/siteone-crawler/siteone-crawler \"$INSTALL_DIR/\" \\\n     && chmod +x \"$INSTALL_DIR/siteone-crawler\" \\\n     && rm -rf /tmp/siteone-crawler /tmp/siteone-crawler.zip\n   ```\n3. After installation, set `CRAWLER` to the resolved path and verify with `$CRAWLER --version`.\n\n## Binary\n\n```bash\nCRAWLER=\"$HOME/.siteone-crawler/siteone-crawler\"\n```\n\nIf the above path doesn't exist, fall back to `$(which siteone-crawler)` after running Setup.\n\nAlways use the resolved path. The binary outputs colored text to terminal; use `--no-color` for script/pipeline usage and `--output json` for programmatic consumption.\n\n## Common Workflows\n\n### 1. Quick Audit (HTML report)\n\n```bash\n$CRAWLER --url=\"https://example.com\" --output-html-report=\"/path/to/report.html\"\n```\n\nGenerates a self-contained interactive HTML audit report with quality scores (0.0-10.0) across Performance, SEO, Security, Accessibility, Best Practices.\n\n### 2. Full Audit + JSON + Upload\n\n```bash\n$CRAWLER --url=\"https://example.com\" \\\n  --output-html-report=\"/path/to/report.html\" \\\n  --output-json-file=\"/path/to/result.json\" \\\n  --upload --upload-retention=\"7d\"\n```\n\n### 3. Offline Clone\n\n```bash\n$CRAWLER --url=\"https://example.com\" --offline-export-dir=\"/path/to/offline-site\" --disable-javascript\n```\n\nUse `--disable-javascript` for SPA/React sites to get a browsable static version. Use `--allowed-domain-for-external-files=\"*\"` to include CDN assets.\n\n### 4. Markdown Export\n\nMulti-file (browsable):\n```bash\n$CRAWLER --url=\"https://example.com\" --markdown-export-dir=\"/path/to/md-export\"\n```\n\nSingle-file (ideal for AI tools):\n```bash\n$CRAWLER --url=\"https://example.com\" --markdown-export-dir=\"/tmp/md\" --markdown-export-single-file=\"/path/to/site.md\" \\\n  --markdown-disable-images --markdown-disable-files\n```\n\n### 5. Sitemap Generation\n\n```bash\n$CRAWLER --url=\"https://example.com\" --sitemap-xml-file=\"/path/to/sitemap\" --sitemap-txt-file=\"/path/to/sitemap\"\n```\n\n### 6. CI/CD Quality Gate\n\n```bash\n$CRAWLER --url=\"https://example.com\" --ci --ci-min-score=\"7.0\" --ci-max-404=\"0\" --ci-max-5xx=\"0\"\n```\n\nExit code 10 if thresholds not met. See references/cli-params.md for all `--ci-*` options.\n\n### 7. Stress/Load Test\n\n```bash\n$CRAWLER --url=\"https://example.com\" --workers=\"20\" --max-reqs-per-sec=\"100\" --max-depth=\"1\"\n```\n\n**Warning**: high worker counts can cause DoS. Use with caution.\n\n### 8. Single Page Crawl\n\n```bash\n$CRAWLER --url=\"https://example.com/about\" --single-page --output-json-file=\"/path/to/result.json\"\n```\n\n### 9. HTML-to-Markdown (local file)\n\n```bash\n$CRAWLER --html-to-markdown=\"/path/to/page.html\" --html-to-markdown-output=\"/path/to/page.md\"\n```\n\n### 10. Browse Exported Content\n\n```bash\n$CRAWLER --serve-markdown=\"/path/to/md-export\" --serve-port=\"8321\"\n$CRAWLER --serve-offline=\"/path/to/offline-site\" --serve-port=\"8321\"\n```\n\n## Key Parameters Reference\n\nSee `references/cli-params.md` for the complete parameter reference organized by category.\n\n### Most-used flags\n\n| Flag | Purpose | Default |\n|------|---------|---------|\n| `--url` | Target URL (required) | - |\n| `--output` | `text` or `json` | text |\n| `--workers` | Concurrent threads | 3 |\n| `--max-reqs-per-sec` | Requests per second limit | 10 |\n| `--max-depth` | Crawl depth (0 = unlimited) | 0 |\n| `--timeout` | Request timeout in seconds | 5 |\n| `--no-color` | Disable colors | off |\n| `--ignore-robots-txt` | Ignore robots.txt | off |\n\n### Resource filtering\n\n| Flag | Effect |\n|------|--------|\n| `--disable-all-assets` | Only crawl pages |\n| `--disable-javascript` | No JS (recommended for offline/SPA) |\n| `--disable-images` | No images |\n| `--disable-styles` | No CSS |\n| `--disable-files` | No downloadable docs |\n\n### URL filtering\n\n| Flag | Effect |\n|------|--------|\n| `--include-regex` | PCRE regex to include URLs |\n| `--ignore-regex` | PCRE regex to skip URLs |\n| `--allowed-domain-for-crawling` | Allow cross-domain crawling |\n| `--allowed-domain-for-external-files` | Allow external asset domains |\n\n## Script Helpers\n\n### `scripts/audit.sh` — Quick audit wrapper\n\nRuns a full crawl with HTML report and optional JSON output. See script for usage.\n\n### `scripts/export-markdown.sh` — Markdown export wrapper\n\nExports a website to markdown (single or multi-file). See script for usage.\n\n## Tips\n\n- For modern JS frameworks (Next.js, React, Vue), add `--disable-javascript` when doing offline exports\n- Use `--output json` for programmatic processing; JSON goes to STDOUT, progress to STDERR\n- Use `--extra-columns=\"Title,Keywords,Description\"` to add SEO columns\n- Use `--timezone=\"Asia/Shanghai\"` for local timestamps\n- For large sites, increase `--memory-limit`, `--max-visited-urls`, and `--max-queue-length`\n- Use `--resolve` to test local/dev servers (like curl --resolve)\n- HTML reports are self-contained — open in any browser, no server needed\n","tags":{"latest":"0.0.1"},"stats":{"comments":0,"downloads":304,"installsAllTime":0,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1778139486280,"updatedAt":1778492867743},"latestVersion":{"version":"0.0.1","createdAt":1778139486280,"changelog":"- Initial release of SiteOne Crawler skill for website crawling, auditing, cloning, and markdown export.\n- Supports actions such as SEO/security/performance/accessibility audit (HTML or JSON report), offline site cloning, markdown export (single/multi-file), sitemap generation, and CI/CD quality gate checks.\n- Automatically installs or resolves the crawler Rust binary if not found.\n- Includes common usage workflows and command examples for quick setup.\n- Key CLI flags, advanced filtering, and tips for different site scenarios provided in documentation.","license":"MIT-0"},"metadata":null,"owner":{"handle":"tsingliuwin","userId":"s174akkb8q5djyfqbckt0d5wfs83k1cz","displayName":"Tsingliu","image":"https://avatars.githubusercontent.com/u/14992305?v=4"},"moderation":{"isSuspicious":false,"isMalwareBlocked":false,"verdict":"clean","reasonCodes":["review.llm_review"],"summary":"Review: review.llm_review","engineVersion":"v2.4.24","updatedAt":1779975626118}}