# ReelTalk
Accept any Instagram, TikTok, or YouTube Shorts URL and get a full transcription, plain-English summary, and the ability to keep asking follow-up questions about the content — all processed entirely locally.
## What it does
- Receive any Instagram, TikTok, or YouTube Shorts URL from the user.
- Extract the audio track using `yt-dlp` and transcribe it with Whisper.
- If no speech is detected (music-only, silent, or failed transcription), fall back to OCR: download the video, extract frames at 1 fps, run Tesseract OCR on each frame, and aggregate the text.
- Summarize what was said (or shown on screen) in plain English.
- Continue the conversation — the user can ask follow-ups, dig deeper, or discuss.
## When to trigger
When the user shares a URL from any supported platform:
- Instagram — reels, posts, stories, videos (`instagram.com`, `instagr.am`)
- TikTok — videos (`tiktok.com`, `vm.tiktok.com`, `vt.tiktok.com`, `tiktok.tv`)
- YouTube — Shorts (`youtube.com/shorts/`, `youtu.be` short links)
Also trigger when the user pastes a link without context (e.g., drops a bare URL into chat).
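The trigger check above can be sketched as a small shell predicate. The function name and regex are illustrative assumptions, not part of any tool:

```shell
# Hypothetical URL matcher for the supported platforms (name and pattern
# are illustrative). Exits 0 when the URL looks supported.
is_supported_url() {
  printf '%s\n' "$1" | grep -qE \
    'https?://([a-z0-9-]+\.)?(instagram\.com|instagr\.am|tiktok\.com|tiktok\.tv|youtube\.com/shorts/|youtu\.be)'
}

is_supported_url "https://www.instagram.com/reel/abc123/" && echo "supported"
is_supported_url "https://example.com/clip" || echo "not supported"
```

Matching on the hostname alone keeps the bare-URL case simple: a pasted link with no surrounding text still triggers.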
## Workflow
### Audio path (speech content)
- Run `yt-dlp --list-formats <url>` to find available formats.
- For audio-only extraction: `yt-dlp -f "bestaudio" -o "/tmp/reel_audio.%(ext)s" "<url>"`
- Transcribe: `whisper /tmp/reel_audio.m4a --model small --language en --task transcribe`
- If transcription yields meaningful text → summarize and chat.
- Clean up: `rm -f /tmp/reel_audio.*`
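One way to implement the "meaningful text" check is a simple word-count heuristic; the threshold and function name below are assumptions, not part of Whisper:

```shell
# Hypothetical heuristic: treat a transcript as meaningful only if it has
# at least MIN_WORDS words; anything shorter falls through to the OCR path.
MIN_WORDS=5

transcript_is_meaningful() {
  [ "$(printf '%s' "$1" | wc -w)" -ge "$MIN_WORDS" ]
}

if transcript_is_meaningful "welcome back, today we test three budget microphones"; then
  echo "summarize transcript"
else
  echo "fall back to OCR"
fi
```

A word count alone cannot catch Whisper's music hallucinations (plausible-looking lyrics or repeated filler phrases), so a real check would also screen the transcript content before trusting it.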
### OCR fallback path (text-on-screen / music-only)
If Whisper returns empty, very short, or clearly hallucinated output (e.g. music interpreted as words):
- Download the highest-quality video format: `yt-dlp -f "bv*+ba/b" -o "/tmp/reel_video.mp4" "<url>"`
- Extract frames at 1 fps: `ffmpeg -i /tmp/reel_video.mp4 -vf "fps=1" -vsync vfr -q:v 2 /tmp/reel_frame_%02d.jpg`
- OCR each frame with Tesseract (try English first, then Hindi if supported): `tesseract /tmp/reel_frame_XX.jpg stdout --psm 6`
- Alternatively, copy frames to `$HOME/Desktop/` first if the `/tmp/` path causes issues.
- Aggregate all OCR text, deduplicate similar frames, and summarize the on-screen content.
- Clean up: `rm -f /tmp/reel_video.mp4 /tmp/reel_frame_*.jpg`
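The aggregation step can be as simple as concatenating per-frame OCR output and dropping repeated lines, since consecutive frames usually show the same caption. A minimal sketch — the function name and file layout are assumptions:

```shell
# Hypothetical aggregator: merge per-frame OCR text, keep non-empty lines,
# and drop exact duplicates while preserving first-seen order.
dedupe_ocr() {
  cat "$@" | awk 'NF && !seen[$0]++'
}

# Demo with fake per-frame files standing in for tesseract output:
printf 'SALE ENDS TODAY\n\n' > /tmp/demo_frame_01.txt
printf 'SALE ENDS TODAY\nLINK IN BIO\n' > /tmp/demo_frame_02.txt
dedupe_ocr /tmp/demo_frame_01.txt /tmp/demo_frame_02.txt
rm -f /tmp/demo_frame_*.txt
```

Exact-match deduplication misses near-duplicates from OCR jitter (e.g. `SALE ENDS T0DAY`); a fuzzier comparison would help there, but exact matching covers the common static-caption case.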
## TikTok-specific notes
- TikTok blocks unauthenticated API access — `yt-dlp` handles extraction automatically.
- TikTok videos often have watermarked and watermark-free formats; prefer `h264_*` formats.
- Some TikTok URLs redirect (`t.tiktok.com`, `vm.tiktok.com`) — `yt-dlp` follows these automatically.
## YouTube-specific notes
- YouTube Shorts are served as standard video formats; no special handling needed.
- Use `--cookies-from-browser` if age-restricted or login-required content fails.
## Requirements
- `yt-dlp` (Homebrew: `brew install yt-dlp`)
- `whisper` (OpenAI Whisper: `brew install openai-whisper` or `pip install openai-whisper`)
- `tesseract` (Homebrew: `brew install tesseract`)
- `ffmpeg` (typically already installed as a `yt-dlp` dependency)
- Optional: `tesseract-lang` for Hindi and other OCR language support (`brew install tesseract-lang`)
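A quick preflight loop (the function name is illustrative) confirms the tools above are on `PATH` before starting a job, so it fails fast rather than mid-pipeline:

```shell
# Hypothetical preflight: print a warning for each required tool that is
# missing from PATH.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

check_tools yt-dlp whisper tesseract ffmpeg
```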
## Notes
- All processing is local — nothing sent to external APIs except the initial URL fetch.
- The `small` English Whisper model balances speed vs. accuracy on CPU.
- The first Whisper run downloads the model (~461 MB); subsequent runs use the cached copy.
- Text-on-screen reels with background music (common on Instagram/TikTok) will automatically fall through to OCR.
- Copy frames to `$HOME/Desktop/` for OCR if Tesseract has issues with `/tmp/` paths (macOS extended attributes can interfere).
- For long videos (>5 min), consider using the `base` Whisper model for speed, or extract shorter segments.
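For the long-video case, segment extraction can be driven by a list of chunk start times, each fed to `ffmpeg -ss <start> -t <length>`. The helper below is a sketch; its name and the chunk length are assumptions:

```shell
# Hypothetical helper: emit start times (in seconds) for fixed-length chunks,
# e.g. to transcribe a long video piece by piece with Whisper.
segment_starts() {
  duration=$1  # total length in seconds
  chunk=$2     # chunk length in seconds
  t=0
  while [ "$t" -lt "$duration" ]; do
    echo "$t"
    t=$((t + chunk))
  done
}

segment_starts 360 120   # a 6-minute video in 2-minute chunks: 0, 120, 240
```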
## Tags
instagram, tiktok, youtube, shorts, reel, audio, transcription, whisper, speech-to-text, ocr, tesseract, yt-dlp, ffmpeg, media, summarization, video, text-on-screen