xiaoyuzhou-asr

v1.0.0

Transcribe 小宇宙 (Xiaoyuzhou) podcast episodes to text using local Qwen3-ASR speech recognition. Combines xyz API (小宇宙FM API) to fetch episode metadata and aud...

License: MIT-0

by @worldwonderer

Security Scans

VirusTotal: Benign · ClawScan: Benign · Static analysis: Benign

Install

openclaw skills install xiaoyuzhou-asr

xiaoyuzhou-asr

Transcribe 小宇宙 podcast episodes to text using local Qwen3-ASR (Metal/CUDA accelerated).

Prerequisites

xyz API server running — fetches episode data and audio URLs from 小宇宙

git clone https://github.com/ultrazg/xyz.git && cd xyz && go run .
# Default port: 23020, change with -p

Access token: log in via POST /sendCode, then POST /login (see references/xyz-api.md)
ffmpeg — audio format conversion (brew install ffmpeg)

Qwen3-ASR model — download (HF Hub does NOT ship tokenizer.json):

python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('Qwen/Qwen3-ASR-0.6B', local_dir='models/0.6B')
"

qwen3-asr-rs — build from source:

git clone https://github.com/alan890104/qwen3-asr-rs.git && cd qwen3-asr-rs
cargo build --release --example local_transcribe

tokenizer.json — auto-generated by the transcription script on first run (from vocab.json + merges.txt). No manual step needed.

Workflow

Step 1: Find Episode

TOKEN="$XYZ_ACCESS_TOKEN"
BASE="http://localhost:23020"

# Search episodes by keyword
curl -s -X POST $BASE/search \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"keyword":"KEYWORD","type":"EPISODE"}'

# Get episode detail (contains audio URL)
curl -s -X POST $BASE/episode_detail \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"eid":"EPISODE_ID"}'

# List episodes of a podcast
curl -s -X POST $BASE/episode_list \
  -H "x-jike-access-token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"pid":"PODCAST_ID","order":"desc"}'
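
The three calls above share the same shape: a JSON POST with the x-jike-access-token header. A minimal Python sketch of that pattern, using only the standard library, might look like the following. The helper names build_xyz_request and post_json are hypothetical, and the shape of the responses is not assumed here.

```python
import json
import urllib.request


def build_xyz_request(base_url, path, token, payload):
    """Build (but do not send) a JSON POST request for the local xyz API server."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "x-jike-access-token": token,
            "Content-Type": "application/json",
        },
    )


def post_json(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires the xyz server to be running):
#   req = build_xyz_request("http://localhost:23020", "/episode_detail",
#                           token, {"eid": "EPISODE_ID"})
#   detail = post_json(req)
```

Keeping request construction separate from sending makes the header and payload logic easy to check without a running server.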

Step 2: Download and Convert Audio

The audio URL is at data.data.media.source.url in the episode_detail response (m4a format).

mkdir -p /tmp/xiaoyuzhou-audio
curl -L -o /tmp/xiaoyuzhou-audio/episode.m4a "$AUDIO_URL"
ffmpeg -y -i /tmp/xiaoyuzhou-audio/episode.m4a -ar 16000 -ac 1 /tmp/xiaoyuzhou-audio/episode.wav
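
When scripting this step, the nested path data.data.media.source.url can be read defensively so that a missing field yields None rather than an exception. The sketch below assumes the response has already been parsed into a Python dict; extract_audio_url is a hypothetical helper, and the sample response in the usage comment is made up for illustration.

```python
def extract_audio_url(response):
    """Return data.data.media.source.url from an episode_detail response, or None."""
    node = response
    for key in ("data", "data", "media", "source", "url"):
        # Stop as soon as the expected nesting is missing.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) and node else None


# Usage with a made-up response:
#   extract_audio_url({"data": {"data": {"media": {"source": {"url": "https://example.com/a.m4a"}}}}})
```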

Step 3: Split Long Audio (REQUIRED for >3 min)

Podcasts are continuous speech with few silence gaps. Use fixed-interval splitting:

# Split into 3-minute segments (keep each segment ≤3 min for Metal GPU memory)
ffmpeg -y -i episode.wav -f segment -segment_time 180 -ar 16000 -ac 1 seg_%03d.wav

Or try silence-based splitting (may find no gaps in continuous podcasts):

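# List silence end times (seconds); a silence is ≥2 s below -30 dB
ffmpeg -i episode.wav -af "silencedetect=noise=-30dB:d=2" -f null - 2>&1 | grep silence_end
# Replace T1,T2 with the chosen split points in seconds (comma-separated)
ffmpeg -i episode.wav -f segment -segment_times T1,T2 -ar 16000 -ac 1 seg_%03d.wav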
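
Picking T1,T2 by hand is tedious for long episodes. One possible approach, sketched below, parses the silence_end values from the silencedetect log and greedily picks the latest silence within each 180-second window, falling back to a hard cut when no silence is available. The function names and the greedy strategy are assumptions for illustration, not part of the skill.

```python
import re

# Matches e.g. "[silencedetect @ 0x...] silence_end: 170.25 | silence_duration: 2.1"
SILENCE_END = re.compile(r"silence_end:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_silence_ends(ffmpeg_stderr):
    """Extract silence_end timestamps (seconds) from silencedetect log output."""
    return [float(m.group(1)) for m in SILENCE_END.finditer(ffmpeg_stderr)]


def choose_split_points(silence_ends, duration, max_len=180.0):
    """Pick cut points so that no segment is longer than max_len seconds."""
    points = []
    last = 0.0
    candidates = sorted(silence_ends)
    while duration - last > max_len:
        limit = last + max_len
        usable = [t for t in candidates if last < t <= limit]
        # Prefer the latest silence in the window; otherwise cut at the limit.
        cut = usable[-1] if usable else limit
        points.append(cut)
        last = cut
    return points


def format_segment_times(points):
    """Render cut points as the comma-separated value for -segment_times."""
    return ",".join(f"{p:.3f}" for p in points)
```

With no detected silences this degenerates to fixed 180-second cuts, which matches the fixed-interval command above.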

Step 4: Transcribe

MODEL_DIR="/path/to/models/0.6B"
ASR_BIN="qwen3-asr-rs/target/release/examples/local_transcribe"

# Transcribe each segment
for seg in seg_*.wav; do
  "$ASR_BIN" "$MODEL_DIR" "$seg" 2>/dev/null | grep "^Text     :" | sed 's/^Text     : //'
done
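
If the loop is driven from Python instead of the shell, the same "Text     : " filter can be applied to the captured stdout. The sketch below assumes only what the grep/sed pipeline above implies, namely that the transcript appears on lines starting with "Text" followed by padding and a colon. extract_transcript is a hypothetical helper, and the other lines in the sample output are made up.

```python
import re

# Same intent as: grep "^Text     :" | sed 's/^Text     : //'
TEXT_LINE = re.compile(r"^Text\s+: ?(.*)$")


def extract_transcript(stdout):
    """Return the text of every 'Text     : ...' line in local_transcribe output."""
    parts = []
    for line in stdout.splitlines():
        match = TEXT_LINE.match(line)
        if match:
            parts.append(match.group(1).strip())
    return parts


# Usage: collect one list per segment, then join in segment order, e.g.
#   transcript = "\n".join(t for seg_out in outputs for t in extract_transcript(seg_out))
```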

For efficiency (load model once in Rust):

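// Fragment: place inside a function that returns a Result.
// `segments` is assumed to be an iterable of segment WAV paths.
use qwen3_asr::{AsrInference, TranscribeOptions, best_device};
let engine = AsrInference::load("models/0.6B", best_device())?; // load the model once
let mut output = Vec::new();
for seg in segments {
    let result = engine.transcribe(&seg, TranscribeOptions::default())?;
    output.push(result.text);
}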

Step 5: Format Output

Combine the transcript with episode metadata as markdown (labels: 节目 = podcast, 日期 = date, 时长 = duration, 转录文本 = transcript):

# {title}
**节目**: {podcast.title} | **日期**: {pubDate} | **时长**: {duration}s

## 转录文本
{transcript}
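
The template above can be filled with a small formatting function. This is a sketch: render_markdown and its parameter names are hypothetical, and the metadata values are assumed to have been pulled from the episode detail already.

```python
def render_markdown(title, podcast_title, pub_date, duration_s, transcript):
    """Fill the markdown template with episode metadata and the joined transcript."""
    return (
        f"# {title}\n"
        f"**节目**: {podcast_title} | **日期**: {pub_date} | **时长**: {duration_s}s\n"
        "\n"
        "## 转录文本\n"
        f"{transcript.strip()}\n"
    )
```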

References

xyz API endpoints and auth: references/xyz-api.md
Qwen3-ASR usage and performance: references/qwen3-asr.md

Token Management

Tokens expire. If the API returns 401, refresh the token with POST /refresh_token.
Store tokens in environment variables: XYZ_ACCESS_TOKEN, XYZ_REFRESH_TOKEN
Prompt the user to log in if no valid token is available
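
The refresh-on-401 rule can be expressed as a single retry wrapper. The sketch below is transport-agnostic: do_request and refresh_token are caller-supplied callables, and Unauthorized is a hypothetical exception that the caller raises on a 401 response. The request body for POST /refresh_token is not shown because it is not specified here.

```python
class Unauthorized(Exception):
    """Raised by the caller's request function when the API returns 401."""


def call_with_refresh(do_request, refresh_token, token):
    """Run do_request(token); on Unauthorized, refresh once and retry.

    Returns (result, token_in_use) so the caller can persist a new token.
    A second Unauthorized propagates, signalling that a fresh login is needed.
    """
    try:
        return do_request(token), token
    except Unauthorized:
        new_token = refresh_token()
        return do_request(new_token), new_token
```

Retrying only once keeps a revoked refresh token from causing an endless loop; the second failure is the point at which to prompt for login.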

Constraints

MUST split audio into ≤3-minute segments for Metal GPU stability
Audio must be WAV 16kHz mono
tokenizer.json is not included in the HF download; the transcription script generates it on first run from vocab.json + merges.txt
The local_transcribe example binary is required (the demo binary only runs built-in test samples)
xyz API login requires a Chinese (+86) phone number
All processing is local — audio never leaves the machine
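
The format and length constraints can be checked before transcription with Python's standard wave module. This is a sketch under the assumption that the segments are plain PCM WAV files, as produced by the ffmpeg commands in this document; check_wav is a hypothetical helper.

```python
import wave


def check_wav(path, rate=16000, channels=1, max_seconds=180.0):
    """Return (problems, duration_seconds) for a PCM WAV file."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        duration = wav.getnframes() / wav.getframerate()
    if duration > max_seconds:
        problems.append(f"duration {duration:.1f}s exceeds {max_seconds:.0f}s")
    return problems, duration
```

An empty problems list means the file satisfies the 16kHz mono and ≤3-minute constraints.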
