Install
openclaw skills install seaportalUse this skill when an agent needs to read or navigate websites without a browser: fetch a page as clean Markdown, get a JSON accessibility snapshot of interactive elements, follow links across pages, or quickly decide whether a site is static/SSR (seaportal can handle it) vs SPA/blocked (needs a real browser like pinchtab). HTTP-only, fast (<2s), token-efficient. Prefer this over a browser whenever the page is static or server-rendered.
openclaw skills install seaportalCLI-first read-only web fetcher. Use the seaportal command. No JavaScript execution — for SPAs/blocked pages, escalate to a real browser.
seaportal <url> # Markdown + frontmatter (also writes renders/seaportal/*.md, *.json)
seaportal --json <url> # Full Result struct as JSON
seaportal --xml <url> # TEI-Lite XML (teiHeader metadata + text/body content)
seaportal --snapshot <url> # Accessibility tree as JSON
seaportal --snapshot --format=compact <url> # Accessibility tree as text (most token-efficient)
seaportal --snapshot --filter=interactive --format=compact <url> # Only links/buttons/inputs
seaportal --max-tokens=2000 <url> # Cap Markdown body size (paragraph-boundary cut, sets truncated:true)
seaportal --snapshot --max-tokens=2000 <url> # Cap snapshot tree size
seaportal --fast <url> # Bail early if browser is needed
seaportal --head-only <url> # 16 KB range fetch — metadata + canonical only, no body
seaportal --respect-robots <url> # Consult robots.txt; refuse disallowed fetches
seaportal --rate-limit=500ms <url> # Min interval between same-host requests
seaportal --probe-search <url> # Override to needs-browser when search URL yields no results
seaportal --no-dedupe <url> # Disable repeated-block dedup
seaportal --no-prune-fallback <url> # Disable the heuristic fallback when readability is thin
seaportal --with-links <url> # Add structured list of discovered <a> links to output
seaportal --with-images <url> # Add structured list of discovered <img> entries to output
seaportal --with-tables <url> # Add structured tables (caption/headers/rows) to output, layout tables flattened
seaportal --with-comments <url> # Emit user-generated comments (Disqus/native) separately in result.comments (stripped from Content by default either way)
seaportal --links=text <url> # Markdown link retention: none|text|all|footer (default all)
seaportal --citations <url> # Synonym for --links=footer (back-compat)
seaportal --chunk=heading <url> # Chunk Markdown by heading / sentence / window
seaportal --select=".main-content" <url> # Scope extraction to a CSS subtree
seaportal --strip=".ads,.cookie-banner" <url> # Remove matching elements before extraction
seaportal --retries 5 <url> # Override default retry count (3)
seaportal --max-retry-wait 10s <url> # Override single backoff cap (30s)
seaportal --retry-timeout 60s <url> # Override total retry budget (90s)
seaportal --base-url=URL - # Read HTML from stdin; --base-url resolves relative links
seaportal --ua=googlebot <url> # User-Agent preset or literal string
seaportal --proxy=http://user:pass@proxy:8080 <url> # Route via HTTP(S) proxy
seaportal --cache=/tmp/sp-cache <url> # Reuse fresh 200 OK responses from disk (opt-in; default TTL 24h)
seaportal --cache=/tmp/sp-cache --cache-stale-tolerance=5m <url> # Stale-while-revalidate: serve stale within TTL+tolerance, refresh in background
seaportal --no-pdf <url> # Skip PDF extraction (default: extract PDF text)
seaportal --schema=path/to/schema.yaml <url> # Apply a CSS schema (JSON/YAML), populate result.schema
seaportal --query="compound interest" <url> # BM25-rank H2/H3 sections by relevance, annotate result.rankedSections
seaportal --query=... --top-n=3 <url> # Keep only the top-N most relevant sections in rankedSections
seaportal --query=... --top-n=3 --filter-by-query <url> # Replace Content with concatenated top-N sections
seaportal --split-out=dir/ --split-bytes=32768 <url> # Shard output across multiple files; print manifest (path\tindex/of\tbytes) to stdout
seaportal --version
The default verb (no subcommand) is URL extraction — seaportal <url> behaves exactly as documented above.
seaportal sitemap <url> — fetch a sitemap.xml, recurse into nested <sitemapindex> references, decompress .gz, and print one URL per line. Flags: --json (emit JSON array of {loc,lastmod,changefreq,priority} entries), --max-urls N (default 50000), --max-depth N (default 5), --allow-internal (permit trusted private/internal hosts). Example: seaportal sitemap https://example.com/sitemap.xml --json.seaportal feed <url> — fetch and parse RSS 2.0, Atom 1.0, or JSON Feed 1.x into a unified {title, link, published, summary, author, guid} shape (format sniffed from the root element / first byte). Default output is one TSV line per item (published\ttitle\tlink). Flags: --json (emit JSON array), --max-items N (default 200), --allow-internal (permit trusted private/internal hosts). Example: seaportal feed https://example.com/feed.xml --json.seaportal mcp — run as an MCP (Model Context Protocol) server over JSON-RPC 2.0 line-delimited stdio. Exposes four tools — fetch_url, fetch_snapshot, parse_sitemap, parse_feed — each routing to the library entry point of the same shape. No flags; configuration flows through MCP tool arguments. See MCP integration below.seaportal help — usage summary including subcommands.Register seaportal as an MCP server in your editor (Claude Desktop / Claude Code / Cursor) — example claude_desktop_config.json:
{
"mcpServers": {
"seaportal": {
"command": "seaportal",
"args": ["mcp"]
}
}
}
Tools exposed: fetch_url ({url, dedupe?, fast?, with_links?, with_images?, with_tables?, with_comments?, max_tokens?}), fetch_snapshot ({url, filter?, max_tokens?, allow_internal?}), parse_sitemap ({url, max_depth?, max_urls?, allow_internal?}), parse_feed ({url, max_items?, allow_internal?}). Each returns its library result as a single JSON text content block.
--ua=<name> accepts curated presets; unknown values pass through as literal UA strings. Empty (default) sends a real Chrome UA. Per-host DomainUserAgent overrides still win.
chrome (default), safari, firefox — real browser UAsgooglebot, bingbot, search-bot — bot UAs (may trigger reverse-DNS challenges)seaportal — honest self-identify for cooperative sites--cache=<dir> is opt-in: only fresh 200 OK responses are stored, keyed by SHA-256 of URL + Accept/Accept-Language/User-Agent. --cache-ttl=<dur> controls freshness (default 24h). --no-cache bypasses reads but still writes — i.e. "force refresh". Errors / 3xx / 4xx / 5xx are never cached. Result includes cacheHit: true on a replay.
Past-TTL entries that carry ETag or Last-Modified are automatically re-validated with a conditional GET (If-None-Match / If-Modified-Since). A 304 Not Modified replays the cached body, refreshes FetchedAt, and sets cacheRevalidated: true (distinct from cacheHit, which means "served from disk with no network call"). A 200 replaces the entry; other statuses leave the cache untouched.
--cache-stale-tolerance=<dur> enables stale-while-revalidate (SWR) semantics: entries whose age falls within TTL + tolerance are served immediately from disk (with cacheStale: true) while a background goroutine fires the conditional GET to refresh the entry for the next call. Default 0 keeps the existing synchronous-revalidate behaviour. The SWR band does not require validators on the cached entry — within tolerance the body is trusted unconditionally; beyond tolerance the existing validator-gated synchronous revalidation runs. --no-cache bypasses SWR entirely. Background refresh failures are silent and leave the stale entry intact for the next attempt.
application/pdf responses are extracted by default: text is pulled page-by-page via ledongthuc/pdf and flows through the same Result.Content Markdown pipeline (link retention, truncation, chunking, cache) with ExtractionMethod="pdf" and --- page N --- separators. Pass --no-pdf to restore the legacy "skipped binary content" behaviour. Image-only / scanned PDFs yield an empty extraction error (no OCR).
h2;
fall back to HTTP/1.1 otherwise.Result.protocol ("h2" or
"http/1.1").--proxy=URL routes the fetch through a proxy. http:// and https:// proxy URLs are supported with Basic auth taken from the URL userinfo (user:pass@host:port). HTTPS targets use a CONNECT tunnel; the Chrome TLS fingerprint is preserved end-to-end with the origin. socks5:// URLs work for HTTP target URLs only — HTTPS-over-SOCKS5 is a V1 limitation. Invalid proxy URLs fail fast with result.Error = "invalid proxy URL: ...".
The CLI is safe by default (DefaultSecurityPolicy): it blocks targets that
resolve to private/internal IPs (SSRF guard), allows only http/https, caps
redirects at 10 with per-hop re-validation, and bounds the raw (50 MiB) and
decompressed (200 MiB) body.
localhost, 127.0.0.1, a 192.168.*/10.* host, or any intranet
URL fails by default with Error: target resolves to a private/internal IP.
Add --allow-internal to permit it (you are vouching the target is trusted).--max-redirects N, --allow-domains / --deny-domains,
--trusted-resolve-cidrs, --max-response-bytes, --max-decompressed-bytes.fetch_url, fetch_snapshot,
parse_sitemap, and parse_feed.--proxy the dial-time rebinding check is skipped (the target is
still vetted before fetch and on each redirect, just not at connect time).--rate-limit=DURATION enforces a minimum interval between requests to the same host. Useful primarily for library callers sharing a HostRateLimiter across calls via Options.RateLimiter; a single CLI invocation only fires one request, so the throttle has no cross-call effect by itself. Combines with --respect-robots crawl-delay (both apply; effective wait is their sum).
--chunk=NAME[:SIZE[:OVERLAP]] populates result.chunks (off by default). Runs after --max-tokens truncation; fenced code blocks are never split.
--chunk=heading — split at H2/H3 boundaries; pre-heading prologue is its own chunk.--chunk=sentence:512 — group sentences until ~512 tokens; heading inherited from nearest H2/H3 above.--chunk=window:2000:200 — 2000-char windows, 200-char overlap, snapped to word boundaries.--query="..." scores each H2/H3-bounded Markdown section against the query with standard BM25 (k1=1.5, b=0.75) and populates result.rankedSections (descending score). Pure additive by default — Content is untouched. Combine with --top-n=N to truncate, or --filter-by-query to replace Content with the concatenated top-N sections (default top-3 when --top-n is unset). No stopwords/stemming in V1; IDF handles common words naturally.
seaportal --json --query="formula" <url> — annotate all sections with scores.seaportal --json --query="formula" --top-n=3 <url> — keep only the 3 highest-scoring sections in rankedSections.seaportal --query="formula" --top-n=3 --filter-by-query <url> — Markdown body becomes just those 3 sections.--schema=<path> applies a declarative CSS schema (JSON or YAML, format sniffed from extension) to the raw HTML and surfaces the result as result.schema in JSON output. Three modes per field: single value (selector only), multiple values (multiple: true), nested array of objects (fields: populated). Optional attr: reads an attribute instead of text. Runs on the pre-preprocess DOM so chrome elements (nav/sidebar) are reachable. Bad selectors or load failures become warnings, never crash. Example schema:
fields:
title: { selector: h1 }
tags: { selector: .tag, multiple: true }
products:
selector: .product
fields:
name: { selector: .name }
price: { selector: .price, attr: data-price }
Invoke: seaportal --json --schema=./schema.yaml <url>. XPath is a V2 limitation — CSS only for now.
When a page needs JS execution, let pinchtab (or any other fetcher) render the HTML, then pipe it into seaportal for extraction:
pinchtab fetch https://example.com | seaportal --base-url https://example.com --json -
--base-url is required in stdin mode. Network-only flags (--head-only, --respect-robots, --retries) are silently no-ops with a stderr warning.
seaportal <url>. Read the frontmatter — pageClass, trustworthy, needsBrowser, confidence tell you if extraction is reliable.pageClass:
static / ssr / hydrated → trustworthy, keep using seaportal.spa / dynamic → JS-only content, stop and escalate to a browser (pinchtab).blocked → bot-protection or login wall, escalate.seaportal --snapshot --filter=interactive --format=compact <url> to see only links/buttons with their hrefs. Each entry has a stable ref (e1, e2…) and CSS selector.href, resolve against the page URL if relative, and re-run seaportal <new-url>.There is no session, no click, no form submit — every navigation is a fresh HTTP GET. To "click" a link you re-invoke seaportal on its href.
| Goal | Command |
|---|---|
| Read article / docs content | seaportal <url> (Markdown) |
| Programmatic decision-making | seaportal --json <url> |
| Map of page structure (cheap on tokens) | seaportal --snapshot --format=compact <url> |
| Just the actionable links/buttons | seaportal --snapshot --filter=interactive --format=compact <url> |
| Large page, must cap tokens | add --max-tokens=2000 (caps snapshot tree in snapshot mode, or Markdown body otherwise; truncates at the latest paragraph boundary and sets truncated: true) |
Compact snapshot rows look like:
e2 link "Docs" <a> [interactive] href=/docs
e5 heading "Welcome" <h1> level=1
pageClass field in the frontmatter / JSON:
| Class | Action |
|---|---|
static | Use seaportal — pure HTML, high confidence. |
ssr | Use seaportal — server-rendered. |
hydrated | Use seaportal — SSR + JS, usually fine. |
spa | Escalate — JS-only, seaportal will return little/empty. |
dynamic | Escalate — heavy client rendering. |
blocked | Escalate — bot protection, captcha, or login wall. |
Also escalate if needsBrowser: true or validationOk: false. Use --fast when you want seaportal to bail early on any of these instead of doing full extraction.
For programmatic routing, prefer profile.decision + profile.browserRecommended over mapping pageClass yourself: one explicit decision (static-high-confidence / static-ok / static-caution / browser-needed / blocked / unreachable / not-found / unsupported), where browserRecommended: true means a real browser is likely to help. See docs/reference/browser-discriminator.md.
If seaportal classified the page as static/ssr/hydrated (i.e. it thinks extraction succeeded) but the Markdown body looks thin — length < ~1500, no real paragraphs, mostly headings or naked links — don't escalate yet. Readability sometimes prunes link-heavy or table-heavy sections that the accessibility tree still has in full. Retry with:
seaportal --snapshot --format=compact <url> # whole structure
seaportal --snapshot --filter=interactive --format=compact <url> # just links/buttons
If the snapshot returns substantial nodes, use those. If the snapshot is also thin, then escalate. Canonical situations where the fallback is worth a try: government index pages (usa.gov, gov.uk) where the link set is the content, and reference/listing pages where the prose is incidental to the structure. Don't loop — one snapshot retry is enough; if it's not there, it's not coming.
For search URLs specifically, --probe-search short-circuits CNN/DDG-style JS search shells with reason client-rendered-search — when set, seaportal flips outcome to needs-browser if a search-shaped URL returns a short body with no result-list structure.
The default Markdown mode (no flag) also writes two files under ./renders/seaportal/<domain>_<timestamp>.{md,json} from the working directory. If running from a directory where that's unwanted, use --json or --snapshot instead — those only print to stdout.
--xml)--xml emits a TEI-Lite-shaped document: <teiHeader> with title/author/published-date/language/source URL, plus <text><body> containing the Markdown converted into <head>, <list>/<item>, <code>, and <p> elements. Mutually exclusive with --json (exit 2). Scope is basic structural shaping — full TEI ODD validation, footnotes, and cross-references are out of scope.
--split-out=<dir> shards the rendered output into multiple files under <dir>, capped at --split-bytes (default --max-tokens × 4 or 32 KB). Prefers existing --chunk boundaries; otherwise splits on paragraph boundaries. Files are named <url-slug>-NNN.{md,json} and written atomically (.tmp + rename). Stdout receives a TSV manifest (path\tindex/of\tbytes) instead of the content body. Not supported with --xml.
pinchtab skill instead. A common pattern is: seaportal first, pinchtab on escalation.