HN Podcast Transcribe

Download, transcribe, and archive Hacker News podcast episodes (e.g. "Hacker News Recap" by Wondercraft). Use when: (1) user wants to transcribe HN podcast episodes, (2) user asks to archive HN podcast content, (3) user wants searchable text from HN audio podcasts, (4) user mentions "HN podcast" + "transcribe" or "archive". Supports incremental processing — only new episodes are downloaded/transcribed on each run. Works with any podcast RSS feed, not just HN.

Audits

Pass

Install

openclaw skills install hn-podcast-transcribe

HN Podcast Transcribe & Archive

Automatically download, transcribe, and archive Hacker News podcast episodes into a searchable local archive.

Default Podcast

Hacker News Recap by Wondercraft.ai — daily AI-generated recap of top HN posts.

  • RSS: https://rss.buzzsprout.com/2170103.rss

Override with HN_PODCAST_RSS env var for any podcast RSS feed.

Workflow

1. Fetch new episodes

python3 scripts/fetch_episodes.py [--rss URL] [--archive DIR] [--limit N] [--no-download]
  • Parses podcast RSS feed
  • Compares against existing archive to skip already-processed episodes
  • Downloads audio (mp3/m4a/wav) for each new episode
  • Saves metadata as JSON alongside audio
  • Default archive: ./hn-podcast-archive/
  • --no-download: save metadata only, skip audio download

Download strategies (tried in order):

  1. Direct HTTP download — works for most podcast CDNs
  2. yt-dlp fallback — handles some Cloudflare-protected hosts
  3. If both fail, the episode directory is created with metadata; place audio manually

Cloudflare note: Some hosts (e.g. Buzzsprout) block automated downloads. If direct download fails:

  • Use --no-download to create the directory structure
  • Download audio manually via browser or podcast app
  • Place the file as audio.mp3 in the episode directory
  • Re-run the transcribe step

2. Transcribe audio

python3 scripts/transcribe_episodes.py [--archive DIR] [--model MODEL] [--format FORMAT]
  • Finds episodes with audio but no transcript
  • Runs Whisper locally (no API key needed)
  • Outputs: txt, srt, vtt, or json (default: txt)
  • Default model: turbo (fast, good accuracy)
  • Supports audio formats: mp3, m4a, wav, ogg, flac

3. Generate archive index

python3 scripts/build_index.py [--archive DIR]
  • Creates archive_index.json with all episodes, dates, titles, and transcript paths
  • Enables fast search across the archive

4. Search archive

python3 scripts/search_archive.py [--archive DIR] "search query"
  • Full-text search across all transcribed episodes
  • Returns matching episodes with context snippets

One-shot: Full Pipeline

python3 scripts/pipeline.py [--rss URL] [--archive DIR] [--model MODEL] [--limit N]

Runs fetch → transcribe → index in sequence.

Cron Integration

Set up periodic processing with OpenClaw cron:

# Daily at 6am — process new HN Recap episodes
cron add --name "hn-podcast-digest" --schedule "0 6 * * *" --payload '{"kind":"agentTurn","message":"Run the HN podcast transcription pipeline: python3 scripts/pipeline.py --limit 3"}'

Archive Structure

hn-podcast-archive/
├── archive_index.json
├── 2026-05-10_hardware-attestation-as-monopoly-enabler/
│   ├── episode.json
│   ├── audio.mp3
│   └── transcript.txt
├── 2026-05-09_a-recent-experience-with-chatgpt-5-5-pro/
│   ├── episode.json
│   ├── audio.mp3
│   └── transcript.txt
└── ...

Configuration

Env VarDefaultDescription
HN_PODCAST_RSSBuzzsprout HN Recap feedPodcast RSS feed URL
HN_ARCHIVE_DIR./hn-podcast-archiveArchive directory
WHISPER_MODELturboWhisper model name
WHISPER_FORMATtxtTranscript output format

Requirements

  • Python 3.10+
  • openai-whisper (pip install openai-whisper)
  • requests (pip install requests)
  • static-ffmpeg (pip install static-ffmpeg) — auto-provides ffmpeg
  • yt-dlp (optional, for fallback downloads)
  • Whisper models auto-download to ~/.cache/whisper on first use