Install

```
openclaw skills install traktor
```

Extract all assets and content from websites including images, SVGs, fonts, videos, and page structure. Parallel agents with thorough scraping coverage. Triggered by the /traktor command.

Follow these steps exactly when extracting website assets.

Usage

```
/traktor <url1> [url2] [url3]...
```
Step 1: Parse URLs

Extract all URLs from the user's command and store them in a list:

```
URLS = [all URLs provided after /traktor]
SITE_COUNT = number of URLs
PROJECT_DIR = current working directory
```
Step 2: Create folders

If SITE_COUNT is 1, run this bash command:

```bash
mkdir -p assets/{logos,icons,images,videos,fonts,backgrounds} full-site/{pages,assets,styles}
```

If SITE_COUNT is greater than 1, extract a site name from each URL (e.g., "stripe.com" -> "stripe-com") and run:

```bash
# Replace {site-name} with the actual site name for each URL
mkdir -p assets/{site-name}/{logos,icons,images,videos,fonts,backgrounds}
mkdir -p full-sites/{site-name}/{pages,assets,styles}
```

Example for 3 sites:

```bash
mkdir -p assets/{stripe-com,vercel-com,linear-app}/{logos,icons,images,videos,fonts,backgrounds}
mkdir -p full-sites/{stripe-com,vercel-com,linear-app}/{pages,assets,styles}
```
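The site-name derivation can be done with standard tools; a minimal sketch (the helper name and sed expression are illustrative):

```bash
# Sketch: "https://stripe.com/pricing" -> "stripe-com"
# Strips the scheme, a leading "www.", and any path, then dashes the dots.
site_name() {
  echo "$1" | sed -E 's|^https?://||; s|^www\.||; s|/.*$||; s|\.|-|g'
}
site_name "https://linear.app"   # -> linear-app
```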
Step 3: Spawn agents

For EACH URL, call the Task tool with these EXACT parameters:

Tool: Task
Parameters:
- subagent_type: "general-purpose"
- run_in_background: true
- description: "Extract {site-name} assets"
- prompt: [USE THE AGENT PROMPT BELOW - replace variables]

NOTE: The agent prompt uses mcp__claude-in-chrome__* tools, which require the claude-in-chrome browser extension MCP server. These tools will not be available in environments without the browser extension installed.
AGENT PROMPT

Extract ALL assets from {URL} with paranoid-level thoroughness. Miss nothing.

OUTPUT DIRECTORIES
- Assets: {PROJECT_DIR}/assets/{site-name}/ (or {PROJECT_DIR}/assets/ if single site)
- Full site: {PROJECT_DIR}/full-sites/{site-name}/ (or {PROJECT_DIR}/full-site/ if single site)
PHASE 1: Browser setup and navigation
1. Call mcp__claude-in-chrome__tabs_context_mcp to get browser context
2. Call mcp__claude-in-chrome__tabs_create_mcp to create new tab
3. Call mcp__claude-in-chrome__navigate with url="{URL}" and the new tabId
4. Call mcp__claude-in-chrome__computer with action="screenshot" to verify page loaded
5. Scroll the full page to trigger lazy loading:
   - mcp__claude-in-chrome__computer with action="scroll", scroll_direction="down", scroll_amount=10
   - Repeat 3-5 times with 1-second waits
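If the computer scroll action proves unreliable, the same effect can be had through mcp__claude-in-chrome__javascript_tool; a sketch using only standard DOM APIs (the step count and waits are illustrative):

```javascript
(async () => {
  // Scroll in viewport-sized steps to trigger lazy loading, then return to top
  for (let i = 0; i < 5; i++) {
    window.scrollBy(0, window.innerHeight * 2);
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  window.scrollTo(0, 0);
  return document.body.scrollHeight;
})()
```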
PHASE 2: Asset discovery (JavaScript extraction)
Call mcp__claude-in-chrome__javascript_tool with this code:
```javascript
(() => {
  const assets = {
    // <img> elements, including lazy-loaded sources stored in data-src
    images: [...document.querySelectorAll('img')].map(img => ({
      src: img.src,
      srcset: img.srcset,
      dataSrc: img.dataset.src,
      alt: img.alt
    })).filter(i => i.src || i.srcset || i.dataSrc),
    videos: [...document.querySelectorAll('video, video source')].map(v => ({
      src: v.src || v.currentSrc,
      poster: v.poster,
      type: v.type
    })).filter(v => v.src),
    // Inline SVGs are captured as markup so they can be written to files later
    svgsInline: [...document.querySelectorAll('svg')].map((svg, i) => ({
      id: svg.id || 'svg-' + i,
      class: svg.className?.baseVal || '',
      html: svg.outerHTML
    })),
    // Background images set via CSS, read from computed styles
    backgrounds: [...document.querySelectorAll('*')].map(el => {
      const bg = getComputedStyle(el).backgroundImage;
      if (bg && bg !== 'none' && bg.includes('url(')) {
        return bg.match(/url\(['"]?([^'"]+)['"]?\)/)?.[1];
      }
      return null;
    }).filter(Boolean),
    favicons: [...document.querySelectorAll('link[rel*="icon"]')].map(l => ({
      href: l.href,
      rel: l.rel,
      sizes: l.sizes?.value
    })),
    ogImages: (() => {
      const og = document.querySelector('meta[property="og:image"]');
      const twitter = document.querySelector('meta[name="twitter:image"]');
      return [og?.content, twitter?.content].filter(Boolean);
    })(),
    fonts: (() => {
      const fonts = [];
      for (const sheet of document.styleSheets) {
        try {
          for (const rule of sheet.cssRules) {
            if (rule.type === 5) { // CSSRule.FONT_FACE_RULE
              const src = rule.style.getPropertyValue('src');
              const urls = src.match(/url\(['"]?([^'"]+)['"]?\)/g);
              if (urls) fonts.push(...urls.map(u => u.match(/url\(['"]?([^'"]+)['"]?\)/)?.[1]));
            }
          }
        } catch (e) {
          // Cross-origin stylesheets throw a SecurityError on cssRules access; skip them
        }
      }
      return [...new Set(fonts)];
    })()
  };
  return JSON.stringify(assets, null, 2);
})()
```
Store the result as DISCOVERED_ASSETS.
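Once the catalog is saved to disk (Phase 5 writes it to asset-urls.json), the flat URL lists Phase 4 needs can be pulled out with jq; a sketch, assuming jq is available and the field names above:

```bash
# Sketch: flatten DISCOVERED_ASSETS into one-URL-per-line lists for download
jq -r '.images[] | .src // .dataSrc // empty' asset-urls.json > image-urls.txt
jq -r '.backgrounds[]' asset-urls.json > background-urls.txt
jq -r '.fonts[]' asset-urls.json > font-urls.txt
jq -r '.videos[].src // empty' asset-urls.json > video-urls.txt
```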
PHASE 3: Content extraction
Call mcp__claude-in-chrome__javascript_tool with this code:
```javascript
(() => {
  const content = {
    url: window.location.href,
    title: document.title,
    extractedAt: new Date().toISOString().split('T')[0],
    meta: {
      description: document.querySelector('meta[name="description"]')?.content,
      ogTitle: document.querySelector('meta[property="og:title"]')?.content,
      ogDescription: document.querySelector('meta[property="og:description"]')?.content,
      ogImage: document.querySelector('meta[property="og:image"]')?.content,
      favicon: document.querySelector('link[rel*="icon"]')?.href
    },
    // slice() caps each list so the JSON stays a manageable size
    navigation: {
      header: [...document.querySelectorAll('header a, nav a')].slice(0, 20).map(a => ({
        text: a.textContent?.trim(),
        href: a.href
      })),
      footer: [...document.querySelectorAll('footer a')].slice(0, 20).map(a => ({
        text: a.textContent?.trim(),
        href: a.href
      }))
    },
    headings: [...document.querySelectorAll('h1, h2, h3')].slice(0, 30).map(h => ({
      level: h.tagName,
      text: h.textContent?.trim()
    })),
    buttons: [...document.querySelectorAll('button, a.btn, [class*="button"]')].slice(0, 20).map(b => ({
      text: b.textContent?.trim(),
      href: b.href || null
    }))
  };
  return JSON.stringify(content, null, 2);
})()
```
Store the result as PAGE_CONTENT.
PHASE 4: Download assets
For each asset URL discovered, download using curl with error handling.
If curl fails (non-zero exit), log the URL and continue to the next asset.
```bash
# Logos (favicon, og:image, header logos)
curl -sfLo "{output_dir}/logos/{site-name}-favicon.ico" "{favicon_url}" || echo "FAIL: {favicon_url}"
curl -sfLo "{output_dir}/logos/{site-name}-og-image.png" "{og_image_url}" || echo "FAIL: {og_image_url}"

# Images
curl -sfLo "{output_dir}/images/{site-name}-{descriptive-name}.{ext}" "{image_url}" || echo "FAIL: {image_url}"

# SVGs - write inline SVGs to files using the Write tool

# Videos
curl -sfLo "{output_dir}/videos/{site-name}-{name}.mp4" "{video_url}" || echo "FAIL: {video_url}"

# Fonts
curl -sfLo "{output_dir}/fonts/{font-name}.woff2" "{font_url}" || echo "FAIL: {font_url}"
```
NAMING CONVENTION: {site-prefix}-{descriptive-name}.{ext}
- Use alt text or context for descriptive names
- Example: stripe-hero-illustration.svg, vercel-logo-white.svg
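A sketch of deriving a filesystem-safe descriptive name from alt text (hypothetical helper; standard tools only):

```bash
# Sketch: lowercase, replace non-alphanumeric runs with dashes, trim edges
slugify() {
  echo "$1" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9]+/-/g; s/^-+|-+$//g'
}
slugify "Hero Illustration (Dark)"   # -> hero-illustration-dark
```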
PHASE 5: Save JSON files
1. Use the Write tool to save PAGE_CONTENT to {full-site-dir}/pages/homepage.json
2. Use the Write tool to save DISCOVERED_ASSETS to {full-site-dir}/asset-urls.json
PHASE 6: Report results
When complete, report:
- Total assets downloaded (count by type)
- Total size (estimate from curl outputs)
- Any failed downloads (list URLs)
- Output paths
ERROR HANDLING
- If site unreachable: Report error, skip
- If asset download fails: Retry once with 2s delay, then log URL and continue
- If browser tab crashes: Call tabs_create_mcp again, continue
- If rate limited: Add 2 second delays between requests
CONSTRAINTS
- Skip files > 50MB (log URL for manual download)
- Skip duplicate URLs
- Maximum 100 assets per category
- 5 minute timeout for entire extraction
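For the 50MB limit, a pre-flight size check is one option; a sketch assuming the server reports Content-Length on a HEAD request (many CDNs do, but it is not guaranteed):

```bash
# Sketch: skip assets larger than 50 MB before downloading
MAX_BYTES=$((50 * 1024 * 1024))
size=$(curl -sfIL "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" { n = $2 } END { print n + 0 }')
if [ "${size:-0}" -gt "$MAX_BYTES" ]; then
  echo "SKIP (>50MB): $url" >> skipped-downloads.log
else
  curl -sfLo "$out" "$url" || echo "FAIL: $url"
fi
```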
Step 4: Monitor

After spawning all agents, display:

```
Traktor v2.0 - Extraction started
Folder structure created

Spawned {N} extraction agents:
- Agent 1: {site1} [running in background]
- Agent 2: {site2} [running in background]
...

Agents working in background. You'll be notified as they complete.
```
Step 5: Create manifest

After ALL agents complete, use the Write tool to create asset-manifest.json:

```json
{
  "generated_at": "{current_datetime}",
  "tool": "traktor v2.0",
  "project_dir": "{PROJECT_DIR}",
  "sites_extracted": ["{site1}", "{site2}"],
  "total_assets": "{total_count}",
  "by_type": {
    "logos": "{count}",
    "icons": "{count}",
    "images": "{count}",
    "videos": "{count}",
    "fonts": "{count}",
    "svgs": "{count}"
  },
  "naming_convention": "{site-prefix}-{descriptive-name}.{ext}",
  "output_structure": {
    "assets": "./assets/",
    "full_sites": "./full-sites/"
  }
}
```
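The per-type counts can be gathered from the output tree; a sketch assuming the assets/ layout created in Step 2 (works for both single- and multi-site runs):

```bash
# Sketch: count downloaded files per asset type across all sites
for type in logos icons images videos fonts backgrounds; do
  count=$(find assets -path "*/${type}/*" -type f | wc -l)
  echo "${type}: ${count}"
done
```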
Step 6: Report

Display to user:

```
Traktor extraction complete!

Summary:
Sites: {N}
Total assets: {count} ({size_mb} MB)

By type:
- Logos: {n}
- Images: {n}
- Videos: {n}
- SVGs: {n}
- Fonts: {n}

Output:
- Assets: ./assets/
- Full sites: ./full-sites/
- Manifest: ./asset-manifest.json

Failed downloads: {list or "None"}
```
Workflow summary:

| Step | Action | Tool |
|---|---|---|
| 1 | Parse URLs | Internal |
| 2 | Create folders | Bash |
| 3 | Spawn agents | Task (background) |
| 4 | Monitor | Wait for notifications |
| 5 | Create manifest | Write |
| 6 | Report | Output to user |
Examples

User: /traktor https://0g.ai

Claude executes:
1. URLS = ["https://0g.ai"], SITE_COUNT = 1
2. mkdir -p assets/{logos,icons,images,videos,fonts,backgrounds} full-site/{pages,assets,styles}
3. Spawns 1 background extraction agent
4. Creates asset-manifest.json

User: /traktor https://stripe.com https://vercel.com https://linear.app

Claude executes:
1. URLS = ["https://stripe.com", "https://vercel.com", "https://linear.app"], SITE_COUNT = 3
2. Creates per-site folders under assets/ and full-sites/ (see the 3-site mkdir example above)
3. Spawns 3 background extraction agents
4. Creates asset-manifest.json after all agents complete