# bdata Recipes — SEO Audit

Concrete `bdata` commands for each audit step. Each recipe has the same four parts: Purpose, Command(s), Parsing, Evidence-to-cite. Parse-only recipes that re-use rendered HTML from R-04 (rather than making a fresh `bdata` call) omit the Command section — their input is whichever R-04 page output the analysis is being run against.

## Per-Audit Data Flow (Mode B)

```
1. Discover     → R-01 (robots.txt) + R-02 (sitemap.xml) → URL list, group by path prefix
2. Sample       → R-03 stratified pick of 10–15 URLs (homepage + per-section samples)
3. Fetch        → R-04 parallel: bdata scrape <each-url> -f html
                  (single Bash message, multiple tool calls)
4a. SERP (always) → R-12 site: indexation proxy (one call, always run)
4b. SERP (cond.)  → R-13/R-14 (only when user named a target keyword or asked a diagnostic prompt)
5. Analyze      → parse rendered HTML per page (R-05..R-11), aggregate site-wide signals
6. Report       → format per output-templates.md
```

## Cost Discipline (read first)

- **Default page budget**: 10–15 pages for site-wide audits. The user can ask for a larger budget in natural language (e.g., "audit 30 pages") — there is no `bdata` CLI flag for this; it's an audit-level parameter the skill applies when sampling URLs in R-03.
- **Always parallelize** the sampled-page fetches: single Bash message, multiple `bdata scrape` tool calls. Do not loop sequentially.
- **Always pass `--json`** when piping to `jq`.
- **For arbitrary websites, `bdata scrape` is correct** — there is no `bdata pipelines` type for generic-website SEO. Pipelines exist for platforms (Amazon, LinkedIn, etc.); skip them here.
- **Never re-fetch** the same URL twice in one audit.

## Failure Handling

- If `bdata scrape <url>` returns empty/error, note it as a finding (5xx, blocked, etc.) and continue. Don't abort.
- If `sitemap.xml` is 404, fall back to homepage + `<a href>` extraction (R-10) for URL discovery — degraded but functional.
- If `bdata` is missing or unauthenticated, stop immediately and point the user at the prerequisites section in SKILL.md and the `brightdata-cli` skill.

---

## Recipes

### R-01: Fetch robots.txt

**Purpose:** Detect `Disallow: /` on important paths and presence of `Sitemap:` reference.

**Command:**
```bash
bdata scrape https://example.com/robots.txt -f html -o robots.txt
```

**Parsing:**
```bash
# Disallow lines
grep -E '^Disallow:' robots.txt
# Sitemap reference
grep -iE '^Sitemap:' robots.txt | awk '{print $2}'
```

**Evidence to cite:** Command + `Disallow:` lines verbatim.

---

### R-02: Fetch sitemap.xml

**Purpose:** Pull the URL inventory for stratified sampling.

**Command:**
```bash
bdata scrape https://example.com/sitemap.xml -f html -o sitemap.xml
```

**Parsing:**
```bash
# Extract <loc> URLs
grep -oE '<loc>[^<]+</loc>' sitemap.xml | sed -E 's|</?loc>||g'
# If sitemap-index, recurse: each <loc> points to a sub-sitemap
# Hreflang entries (multilingual)
grep -oE '<xhtml:link[^>]+/>' sitemap.xml
```

If sitemap is a sitemap-index (top-level `<sitemapindex>`), fetch each child sitemap with R-02.
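
A minimal sketch of that recursion step (file names are illustrative; issue the actual fetches as parallel tool calls per the cost-discipline rules, not a sequential loop):

```bash
# If the top-level file is a sitemap index, its <loc> entries are child sitemaps, not pages.
if grep -q '<sitemapindex' sitemap.xml; then
  grep -oE '<loc>[^<]+</loc>' sitemap.xml | sed -E 's|</?loc>||g' > child-sitemaps.txt
  # Emit one R-02 fetch per child sitemap:
  while read -r sm; do
    echo "bdata scrape \"$sm\" -f html -o \"/tmp/seo/$(basename "$sm")\""
  done < child-sitemaps.txt
fi
```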

**Evidence to cite:** Command + total URL count + first 5 URLs as a sample.

---

### R-03: Stratified URL sampling

**Purpose:** Pick 10–15 representative URLs from sitemap output, grouped by path prefix.

**Parsing (no bdata call — pure local processing of R-02 output):**
```bash
# Extract first path segment from each URL
sed -E 's|^https?://[^/]+/||' urls.txt | awk -F/ '{print $1}' | sort | uniq -c | sort -rn
```

Decision rule (a bash sketch follows the list):
1. Always include homepage (`/`).
2. For each path-prefix group with ≥3 URLs, sample 1 URL (or 2 if it's the dominant group, e.g., `/products/` on e-commerce).
3. Cap the total at the page budget (10–15 by default). Document any budget override.
4. If `<budget>` is exceeded by the per-group caps, drop smallest groups first.
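
A rough sketch of this rule, assuming `urls.txt` holds the R-02 output (the "two from the dominant group" refinement and the drop-smallest-groups tie-break are left out for brevity):

```bash
budget=10   # default; raise only if the user asked for a larger audit
{
  # 1. Homepage first
  grep -E '^https?://[^/]+/?$' urls.txt | head -1
  # 2. One URL from each path-prefix group with >=3 URLs
  sed -E 's|^https?://[^/]+/||' urls.txt | awk -F/ '{print $1}' | sort | uniq -c \
    | awk '$1 >= 3 {print $2}' \
    | while read -r prefix; do
        grep -E "^https?://[^/]+/$prefix(/|$)" urls.txt | head -1
      done
} | awk '!seen[$0]++' | head -"$budget" > sampled-urls.txt
```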

**Evidence to cite:** "Sampled 12 URLs from sitemap of 847; groups: `/blog/` (3), `/products/` (5), `/category/` (2), `/` (1), `/about` (1)".

---

### R-04: Parallel page fetch

**Purpose:** Fetch all sampled URLs as rendered HTML in parallel.

**Command (issue all `bdata scrape` calls in a SINGLE Bash message via multiple tool calls — Claude must use multiple Bash invocations in one assistant turn):**
```bash
bdata scrape https://example.com/ -f html -o /tmp/seo/_home.html
bdata scrape https://example.com/products/widget -f html -o /tmp/seo/_p1.html
# ...one call per sampled URL
```

**Parsing:** Each output is rendered HTML for downstream recipes (R-05..R-11).

**Evidence to cite:** The specific `bdata scrape <url>` command per finding.

---

### R-05: Title + meta description + canonical extraction

**Purpose:** Per-page on-page basics.

**Command:** Parses output of R-04 (no extra `bdata` call).

**Parsing:**
```bash
# Title
grep -oE '<title>[^<]+</title>' page.html | sed -E 's|</?title>||g'
# Meta description
grep -oE '<meta[^>]+name="description"[^>]*>' page.html | grep -oE 'content="[^"]*"' | sed -E 's|content="(.*)"|\1|'
# Canonical
grep -oE '<link[^>]+rel="canonical"[^>]*>' page.html | grep -oE 'href="[^"]*"' | sed -E 's|href="(.*)"|\1|'
```

**Evidence to cite:** Command + the extracted title/description/canonical line verbatim.

---

### R-06: Heading structure extraction

**Purpose:** Count H1s and detect heading-level skips.

**Parsing:**
```bash
# Count each heading level (use grep -oE | wc -l, not grep -c — minified HTML is one line)
for h in 1 2 3 4 5 6; do
  count=$(grep -oE "<h$h[ >]" page.html | wc -l)
  echo "h$h: $count"
done
# Extract H1 text
grep -oE '<h1[^>]*>[^<]+</h1>' page.html | sed -E 's|</?h1[^>]*>||g'
```

Flag if the H1 count ≠ 1, or if a heading level is skipped (e.g., an H3 appears with no H2 before it).
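
A quick sketch of the skip check, working on the document-order sequence of heading levels (it flags any heading more than one level below the heading immediately before it):

```bash
# List heading levels in document order, then flag jumps of more than one level down
grep -oiE '<h[1-6][ >]' page.html \
  | grep -oE '[1-6]' \
  | awk 'NR > 1 && $1 > prev + 1 { printf "skip: h%d follows h%d\n", $1, prev } { prev = $1 }'
```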

**Evidence to cite:** Command + the per-level counts and H1 text.

---

### R-07: Schema markup detection (JS-injected)

**Purpose:** The recipe that distinguishes this skill. Detect JSON-LD schema, including markup injected by client-side JS (Yoast, RankMath, AIOSEO, Next.js, etc.).

**Why it works:** `bdata scrape -f html` runs the page through Bright Data's rendering layer, so JS-injected `<script type="application/ld+json">` blocks are present in the output — unlike `web_fetch` or `curl`.

**Command:** Parses output of R-04.

**Parsing:** JSON-LD blocks are multi-line and may contain literal `<` characters in string values, so a simple `grep` will silently fail on either count. Use Python for reliable extraction:
```bash
python3 - <<'PY' page.html
import re, sys, json
html = open(sys.argv[1]).read()
# Find every <script type="application/ld+json">...</script> block (DOTALL so newlines are matched)
blocks = re.findall(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    html, re.DOTALL | re.IGNORECASE,
)
for raw in blocks:
    try:
        obj = json.loads(raw.strip())
    except json.JSONDecodeError as e:
        print(f"[malformed JSON-LD] {e}")
        continue
    items = obj if isinstance(obj, list) else [obj]
    for item in items:
        t = item.get("@type", "(no @type)")
        print(t if isinstance(t, str) else ", ".join(t))
PY
```

Output is one `@type` per line (e.g., `Product`, `BreadcrumbList`, `Organization`). Empty output means no JSON-LD blocks were found in the rendered HTML.

**Evidence to cite:** Command + the extracted `@type` list (e.g., `["Product", "BreadcrumbList"]`).

---

### R-08: Hreflang cluster extraction

**Purpose:** Validate hreflang self-reference, reciprocity, and valid codes for multilingual sites.

**Parsing:**
```bash
# From each page's HTML head
grep -oE '<link[^>]+rel="alternate"[^>]+hreflang="[^"]*"[^>]*>' page.html
# Extract hreflang code + href into pairs:
# - hreflang="en-GB" href="https://example.com/en-gb/about"
```

Validation rules (apply across the cluster from all sampled pages; a check sketch follows the list):
- **Self-reference:** Each page must list itself with its own hreflang.
- **Reciprocity:** If page A points to page B with `hreflang=fr`, page B must point back to A.
- **Valid codes:** ISO 639-1 lang + optional ISO 3166-1 Alpha-2 region. `en-UK` is invalid (use `en-GB`).
- **x-default:** Must be declared somewhere in the cluster.
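
A minimal check of the first two rules, assuming the tags extracted above have been collected into a hypothetical `hreflang.tsv` with one `page-url<TAB>hreflang<TAB>href` row per `<link>`:

```bash
python3 - <<'PY'
import collections

links = collections.defaultdict(dict)            # page URL -> {hreflang code: href}
for line in open("hreflang.tsv"):
    page, lang, href = line.rstrip("\n").split("\t")
    links[page][lang] = href

for page, cluster in links.items():
    if page not in cluster.values():
        print(f"[no self-reference] {page}")
    if "x-default" not in cluster:
        print(f"[no x-default]      {page}")
    # Reciprocity can only be verified for alternates we also sampled
    for lang, href in cluster.items():
        if href in links and page not in links[href].values():
            print(f"[not reciprocal]    {page} -> {href} ({lang})")
PY
```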

**Evidence to cite:** Command + the malformed entry, e.g., `<link rel="alternate" hreflang="en-UK" href="..."/>` plus the rule it violates.

---

### R-09: Image alt audit

**Purpose:** Count `<img>` without `alt`.

**Parsing:** (Use `grep -oE … | wc -l` instead of `grep -c` — `-c` counts matching lines, which underreports on minified single-line HTML.)
```bash
total=$(grep -oE '<img[^>]*>' page.html | wc -l)
with_alt=$(grep -oE '<img[^>]+alt=' page.html | wc -l)
without=$((total - with_alt))
echo "$without/$total images missing alt"
```

**Evidence to cite:** Command + the ratio (e.g., "14/22 images missing alt on /products/widget").

---

### R-10: Internal-link map

**Purpose:** Build internal-link graph across sampled pages; detect orphan candidates and generic anchor text.

**Parsing:**
```bash
# Extract hrefs
grep -oE '<a[^>]+href="[^"]*"' page.html | grep -oE 'href="[^"]*"' | sed -E 's|href="(.*)"|\1|'
# Extract anchor text per link
grep -oE '<a[^>]+href="[^"]*"[^>]*>[^<]+</a>' page.html
```

Internal vs. external classification: if hostname matches the audit domain (or no hostname, i.e., relative), it's internal.
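
A sketch of that classification on top of the href extraction above (`example.com` is the usual placeholder for the audit domain):

```bash
domain="example.com"
grep -oE '<a[^>]+href="[^"]*"' page.html \
  | grep -oE 'href="[^"]*"' | sed -E 's|href="(.*)"|\1|' \
  | awk -v d="$domain" '
      /^(#|mailto:|tel:|javascript:)/ { next }    # fragments and non-HTTP schemes: skip
      /^(https?:)?\/\// {                          # absolute or protocol-relative URL
        printf "%s\t%s\n", ($0 ~ "//(www\\.)?" d "(/|$|\\?)") ? "internal" : "external", $0
        next
      }
      { print "internal\t" $0 }                    # relative URL -> internal
    '
# Pipe through: cut -f1 | sort | uniq -c  to get internal/external totals.
```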

Generic anchor list: `click here`, `read more`, `learn more`, `here`, `link`.

**Evidence to cite:** Command + count of internal links, generic-anchor count, and the orphan-candidate list (URLs in sitemap but not linked from any sampled page).

---

### R-11: CWV-proxy signals

**Purpose:** HTML-level proxies for Core Web Vitals — always paired with "confirm with PSI" caveat in the finding.

**Parsing:** (POSIX `grep -E` does not support negative lookahead, so render-blocking and CLS-risk are computed by subtraction. All counts use `grep -oE … | wc -l` instead of `grep -c` because `-c` counts matching *lines*, not occurrences — a single-line minified HTML file would otherwise return 1 for everything.)
```bash
# Page weight (rendered HTML bytes)
wc -c < page.html
# Render-blocking scripts (synchronous + in <head>) = (all script-with-src in <head>) − (those with async or defer)
# Isolate <head> even when the HTML is minified onto a single line
head_html=$(tr '\n' ' ' < page.html | sed -E 's|.*<head( [^>]*)?>||; s|</head>.*||')
total_head_scripts=$(echo "$head_html" | grep -oE '<script[^>]+src=' | wc -l)
async_or_defer=$(echo "$head_html" | grep -oE '<script[^>]+(async|defer)[^>]*src=|<script[^>]+src=[^>]*(async|defer)' | wc -l)
echo $((total_head_scripts - async_or_defer))
# Render-blocking stylesheets (no media="print")
echo "$head_html" | grep -oE '<link[^>]+rel="stylesheet"[^>]*>' | grep -vE 'media="print"' | wc -l
# Images without dimensions (CLS risk) = total <img> − those with width or height
total_imgs=$(grep -oE '<img[^>]*>' page.html | wc -l)
imgs_with_dims=$(grep -oE '<img[^>]*(width|height)=[^>]*>' page.html | wc -l)
echo $((total_imgs - imgs_with_dims))
# Lazy-loading count
grep -oE '<img[^>]+loading="lazy"' page.html | wc -l
# Resource hints count
grep -oE '<link[^>]+rel="(preconnect|preload|dns-prefetch)"' page.html | wc -l
```

**Evidence to cite:** Command + the metric (e.g., "page weight 4.2MB", "8 sync scripts in <head>", "14 images without width/height").

**Always include in the finding:** "Confirm with PageSpeed Insights at https://pagespeed.web.dev/?url=<url> for field-data CWV measurements."

---

### R-12: Indexation proxy via SERP

**Purpose:** Approximate count of URLs Google has indexed.

**Command:**
```bash
bdata search "site:example.com" --json
```

**Parsing:**
```bash
bdata search "site:example.com" --json | jq '.organic | length'
# Or to get the actual URLs:
bdata search "site:example.com" --json | jq -r '.organic[].link'
```

**Note:** `site:` queries return a sample, not exact counts. Compare to sitemap URL count from R-02 — flag if "12 indexed vs. 847 in sitemap" levels of mismatch.
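
A small sketch of that comparison, assuming `urls.txt` still holds the R-02 inventory:

```bash
bdata search "site:example.com" --json > serp.json   # single call, reused below
indexed=$(jq '.organic | length' serp.json)
in_sitemap=$(wc -l < urls.txt)
echo "indexed sample: $indexed vs. sitemap inventory: $in_sitemap"
```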

**Evidence to cite:** Command + the count and a few sample indexed URLs.

---

### R-13: Ranking position check (signal-driven)

**Purpose:** Find user's domain in SERP for a target keyword. Run only when the user named a keyword.

**Command:**
```bash
bdata search "project management software" --json
```

**Parsing:**
```bash
bdata search "project management software" --json \
  | jq '.organic | to_entries[] | select(.value.link | contains("example.com")) | {position: (.key + 1), url: .value.link}'
```

If no entry returned, the domain is not in the top results for that query.

**Evidence to cite:** Command + position (e.g., "ranks #18 with URL /pm-tool") or "not in top 100".

---

### R-14: Cannibalization check (signal-driven)

**Purpose:** Detect when multiple URLs from the user's domain rank for the same keyword.

**Command:**
```bash
bdata search "project management software site:example.com" --json
```

**Parsing:**
```bash
bdata search "project management software site:example.com" --json \
  | jq -r '.organic[].link'
```

If 2+ URLs returned, those are cannibalization candidates.

**Evidence to cite:** Command + the URL list.

---

### R-15: Site-type detection

**Purpose:** Classify the site so the right playbook from `site-type-playbooks.md` applies.

**Inputs:** Output of R-04 for the homepage + URL list from R-02.

**Heuristics:**
| Site type | Cue |
|---|---|
| SaaS / Product | `/pricing` URL exists; "free trial" / "sign up" CTAs in nav; description mentions software/platform/tool; JSON-LD `@type: SoftwareApplication` |
| E-commerce | `/products/` or `/shop/` paths in sitemap; `Schema.org Product` JSON-LD; "Add to cart" buttons; `?color=` / `?size=` parameter patterns |
| Content / Blog | `/blog/` or `/posts/` paths dominate sitemap; `<meta name="generator" content="WordPress">` (or similar); `<article>` with bylines |
| Local Business | NAP in homepage footer; `Schema.org LocalBusiness` JSON-LD; Google Maps embed; multiple `/locations/<city>` |
| Multilingual | 2+ `<link rel="alternate" hreflang>` in homepage `<head>`; locale prefixes in sitemap (`/en/`, `/de/`, `/ar/`); non-default `<html lang>` |

A site can match multiple types — apply all matching playbooks.
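
An illustrative pass over the homepage HTML (`home.html` from R-04) and the sitemap inventory (`urls.txt` from R-02); the cue strings mirror the table above:

```bash
grep -qE '/pricing(/|$)' urls.txt && echo "SaaS cue: /pricing in sitemap"
grep -qiE '"@type"[[:space:]]*:[[:space:]]*"SoftwareApplication"' home.html && echo "SaaS cue: SoftwareApplication JSON-LD"
grep -qE '/(products|shop)/' urls.txt && echo "E-commerce cue: /products/ or /shop/ paths"
grep -qiE 'add to cart' home.html && echo "E-commerce cue: add-to-cart CTA"
grep -qE '/(blog|posts)/' urls.txt && echo "Content cue: /blog/ or /posts/ paths"
grep -qiE '"@type"[[:space:]]*:[[:space:]]*"LocalBusiness"' home.html && echo "Local cue: LocalBusiness JSON-LD"
grep -qE 'hreflang="[^"]+"' home.html && echo "Multilingual cue: hreflang links on homepage"
```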

**Evidence to cite:** Per-type, the cue that matched (e.g., "SaaS — `/pricing` page exists, JSON-LD `@type: SoftwareApplication` detected on homepage").

---

### R-16: Faceted-nav parameter detection (e-commerce)

**Purpose:** Detect indexable faceted-navigation URLs that risk crawl-budget waste.

**Parsing:**
```bash
# From R-02 sitemap URL list
grep -E '\?' urls.txt | head -20
grep -cE '\?' urls.txt
```

If many `?` URLs exist in the sitemap, flag as crawl-budget risk.
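
To see which facet parameters dominate, a quick breakdown by parameter name (purely local processing of `urls.txt`):

```bash
grep -E '\?' urls.txt \
  | grep -oE '[?&][^=&]+=' \
  | tr -d '?&=' \
  | sort | uniq -c | sort -rn
```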

**Evidence to cite:** Command + count + 3-5 example URLs.

---

### R-17: Out-of-stock product handling (e-commerce)

**Purpose:** Detect product pages that are out-of-stock but still return 200 with no canonical handling.

**Parsing (per sampled product page):**
```bash
# Look for "out of stock" / "sold out" / "unavailable" markers in rendered HTML
grep -iE 'out of stock|sold out|currently unavailable' page.html
# Check Schema.org availability (JS-rendered, parseable from R-04 output)
grep -oE '"availability"[^,}]*' page.html
# Check canonical destination — does it point to itself, or to a parent category?
grep -oE '<link[^>]+rel="canonical"[^>]+href="[^"]+"' page.html
```

Flag: page contains an out-of-stock signal, has `Schema.org availability: OutOfStock` (or no availability schema), and self-canonicals (rather than canonicalizing to a parent category or returning 404). If we need to confirm the HTTP status code, use `curl -o /dev/null -s -w "%{http_code}\n" "<url>"` rather than relying on `bdata scrape -f json` shape — the JSON-mode output schema is not documented for this purpose.
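
A per-page sketch combining the three checks (assumes `$url` holds the sampled page's URL and `page.html` its rendered HTML):

```bash
oos=0; grep -qiE 'out of stock|sold out|currently unavailable' page.html && oos=1
avail=$(grep -oE '"availability"[^,}]*' page.html | head -1)
canon=$(grep -oE '<link[^>]+rel="canonical"[^>]+href="[^"]+"' page.html \
  | grep -oE 'href="[^"]*"' | sed -E 's|href="(.*)"|\1|')
if [ "$oos" -eq 1 ] && [ "$canon" = "$url" ]; then
  echo "FLAG: out-of-stock page self-canonicals: $url (availability: ${avail:-none declared})"
fi
```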

**Evidence to cite:** Command + the out-of-stock text + missing canonical.

---

### R-18: Author-page reachability (blog)

**Purpose:** Find broken `/author/<x>` links from blog posts.

**Parsing:**
```bash
# From sampled blog posts
grep -oE '<a[^>]+href="[^"]*author[^"]*"' post.html | grep -oE 'href="[^"]*"' | sort -u
# For each unique author URL, fetch:
bdata scrape "https://example.com/author/jane" -f html
```

Flag if any author URL returns 404 or is blocked.

**Evidence to cite:** Command + the broken author URL.

---

### R-19: Publication-date / freshness extraction (blog)

**Purpose:** Identify outdated content (>18 months without update).

**Parsing:**
```bash
# datetime in <time> tags
grep -oE '<time[^>]+datetime="[^"]*"' post.html
# Modified date (Schema.org Article)
grep -oE '"dateModified"[^,}]*' post.html
# Updated date in visible text (fallback)
grep -iE 'last updated|updated on' post.html
```

Flag posts with `datetime` >18 months ago and no visible update marker.
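
A sketch of the age computation for the first `<time datetime>` value (GNU `date` assumed; BSD/macOS `date` needs `-j -f` instead of `-d`):

```bash
published=$(grep -oE '<time[^>]+datetime="[^"]*"' post.html | head -1 \
  | grep -oE 'datetime="[^"]*"' | sed -E 's|datetime="(.*)"|\1|')
if [ -n "$published" ]; then
  age_days=$(( ( $(date +%s) - $(date -d "$published" +%s) ) / 86400 ))
  echo "published/updated $published (~$age_days days ago)"
  [ "$age_days" -gt 548 ] && echo "FLAG: older than ~18 months"   # 548 days ≈ 18 months
fi
```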

**Evidence to cite:** Command + the date + how long ago it is.

---

### R-20: NAP extraction (local business)

**Purpose:** Extract Name / Address / Phone from each sampled page footer to detect inconsistency.

**Parsing:**
```bash
# Phone (US formats)
grep -oE '\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b' page.html | sort -u
# Address (rough — match street suffixes)
grep -oE '[0-9]+ [A-Z][a-z]+( [A-Z][a-z]+)* (St|Ave|Blvd|Rd|Way|Dr|Ln)\b[^<]*' page.html | sort -u
# Or pull from LocalBusiness JSON-LD if present ($JSONLD = a block extracted via R-07)
echo "$JSONLD" | jq -r '.address, .telephone'
```

Normalize phone (digits only) before comparing across pages. Flag any mismatch as a finding.
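
A sketch of the cross-page comparison over the R-04 output files (the `/tmp/seo/_*.html` naming follows R-04; adjust to the actual paths used):

```bash
for f in /tmp/seo/_*.html; do
  phones=$(grep -oE '\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b' "$f" \
    | tr -cd '0-9\n' | sort -u | tr '\n' ' ')
  echo "$f: ${phones:-no phone found}"
done
# Any page whose digit-only phone set differs from the rest is a NAP-consistency finding.
```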

**Evidence to cite:** Command + the conflicting values per page.

---

### R-21: Local SERP visibility (local business)

**Purpose:** Run `bdata search "<service> <city>"` per location to validate local SERP presence.

**Command:**
```bash
bdata search "plumber Austin" --country us --json
```

**Parsing:** Same as R-13 — find user's domain in `.organic[].link`.

**Evidence to cite:** Command + position per city.

---

### R-22: Outbound-link sampling (E-E-A-T)

**Purpose:** Quick check that content links to authoritative sources rather than orphan pages with no citations.

**Parsing:**
```bash
grep -oE '<a[^>]+href="https?://[^"]*"' page.html \
  | grep -oE 'href="[^"]*"' \
  | grep -vE 'href="https?://(www\.)?example\.com' \
  | sort -u | head
```

Flag: low outbound-link count + topic that demands citations (medical, legal, financial).

**Evidence to cite:** Command + outbound-link count.

---

### R-23: Author byline + bio detection (E-E-A-T)

**Purpose:** Detect missing or thin author bylines on blog content.

**Parsing:**
```bash
# Schema.org Article author ($JSONLD = a JSON-LD block extracted via R-07)
echo "$JSONLD" | jq -r 'if (.author | type) == "array" then .author[0] else .author end | .name? // empty'
# Visible byline (rough)
grep -iE 'by <[^>]+>|class="author"|rel="author"' post.html
```

Flag posts with no detectable author.

**Evidence to cite:** Command + presence/absence.

---

### R-24: Word-count + paragraph-count extraction

**Purpose:** Detect thin content.

**Parsing:**
```bash
# Strip HTML tags, count words
sed -E 's|<[^>]+>||g' page.html | wc -w
# Paragraph count
grep -oE '<p[ >]' page.html | wc -l
```

Thresholds (rough): <300 words is thin for a content page; <100 words is thin even for a product page.

**Evidence to cite:** Command + count.

---

### R-25: Generic anchor-text density

**Purpose:** Flag over-reliance on generic anchor text in internal links.

**Parsing:**
```bash
# From R-10 anchor extraction. The -oE extraction below already emits one anchor per line, so grep -c is safe here; trim whitespace before matching.
generic=$(grep -oE '<a[^>]+href="[^"]*"[^>]*>[^<]+</a>' page.html \
  | sed -E 's|.*>([^<]+)</a>|\1|' \
  | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//' \
  | grep -ciE '^(click here|read more|learn more|here|link|this|more)$')
total=$(grep -oE '<a[^>]+href="[^"]*"' page.html | wc -l)
echo "$generic/$total internal anchors are generic"
```

Flag if generic-anchor ratio is > 30% of internal-link count.

**Evidence to cite:** Command + ratio.
