Install
openclaw skills install noizai-chat-with-anyone
Chat with any real person or fictional character in their own voice by automatically finding their speech online, extracting a clean reference sample, and generating audio replies. Also supports generating a matching voice from an uploaded image. Use when the user says "我想跟xxx聊天" ("I want to chat with xxx"), "你来扮演xxx跟我说话" ("play xxx and talk to me"), "让xxx给我讲讲这篇文章" ("have xxx walk me through this article"), "我想跟图片中的人说话" ("I want to talk to the person in the picture"), or similar.
Clone a real person's voice from online video, or design a voice from a photo, then roleplay as that person with TTS.
This skill synthesizes speech that imitates real voices. Before proceeding, the agent must:
If the user's intent appears harmful, refuse politely and explain why.
| Dependency | Type | How to verify |
|---|---|---|
| ffmpeg | System binary | ffmpeg -version |
| yt-dlp | System binary | yt-dlp --version |
| tts skill | Cursor skill | ls skills/tts/scripts/tts.py |
| NOIZ_API_KEY | Env var or file | python3 skills/tts/scripts/tts.py config --show |
Before the first run, verify all dependencies are present:
ffmpeg -version && yt-dlp --version && ls skills/tts/scripts/tts.py
If yt-dlp is missing, install it:
uv pip install yt-dlp
If the Noiz API key is not configured:
python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY
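The same pre-flight check can be scripted. The following is a minimal sketch, not part of the skill itself: the helper name `missing_dependencies` is hypothetical, and it only covers the two binaries and the tts script path listed above. It does not verify the Noiz API key, which still requires the `config --show` command.

```python
import os
import shutil


def missing_dependencies(binaries=("ffmpeg", "yt-dlp"),
                         files=("skills/tts/scripts/tts.py",)):
    """Return the binaries not found on PATH and the files not found on disk."""
    missing = [name for name in binaries if shutil.which(name) is None]
    missing += [path for path in files if not os.path.isfile(path)]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

An empty list means both workflows can start; otherwise install whatever is reported before continuing.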
Track progress with this checklist:
- [ ] A1. Disambiguate character
- [ ] A2. Find reference video
- [ ] A3. Download audio + subtitles
- [ ] A4. Extract best reference segment
- [ ] A5. Generate speech
If the request is ambiguous (e.g. "US President", "Spider-Man actor"), ask the user to name the exact person before proceeding.
Use web search to find a YouTube (or Bilibili) video of the person speaking clearly. Best candidates: interviews, speeches, press conferences. Avoid videos with heavy background music.
Search queries to try:
- {CHARACTER_NAME} interview / {CHARACTER_NAME} 采访
- {CHARACTER_NAME} speech / {CHARACTER_NAME} 演讲
- {CHARACTER_NAME} press conference

mkdir -p "tmp/chat_with_anyone/{CHARACTER_NAME}"
yt-dlp -x --audio-format mp3 \
--write-subs --write-auto-subs --sub-langs "en,zh-Hans" \
--convert-subs srt \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/%(title)s.%(ext)s" \
"{VIDEO_URL}"
After download, list the output directory to identify the audio file and SRT subtitle file:
ls tmp/chat_with_anyone/{CHARACTER_NAME}/
Expected output: a .mp3 audio file and one or more .srt subtitle files.
If no subtitle files appear: try a different video that has auto-generated captions, or adjust --sub-langs for the target language.
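Identifying the downloaded files can also be done programmatically. The sketch below assumes that subtitle files are named with a language suffix such as `.en.srt` or `.zh-Hans.srt`; that naming, and the helper names `find_downloads` and `pick_subtitle`, are assumptions for illustration, so check the actual directory listing first.

```python
from pathlib import Path


def find_downloads(directory):
    """Return (audio_files, subtitle_files) found in the download directory."""
    root = Path(directory)
    audio = sorted(root.glob("*.mp3"))
    subtitles = sorted(root.glob("*.srt"))
    return audio, subtitles


def pick_subtitle(subtitles, preferred=("en", "zh-Hans")):
    """Pick the first subtitle ending in .<lang>.srt, in preference order.

    Falls back to the first subtitle file, or None if the list is empty.
    """
    for lang in preferred:
        for path in subtitles:
            if path.name.endswith(f".{lang}.srt"):
                return path
    return subtitles[0] if subtitles else None
```

If `pick_subtitle` returns `None`, no captions were downloaded and the fallback described above applies.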
Use the automated extraction script. It parses the SRT, finds the densest 3-12 second speech window, and extracts it as a WAV:
python3 skills/chat-with-anyone/scripts/extract_ref_segment.py \
--srt "tmp/chat_with_anyone/{CHARACTER_NAME}/{SRT_FILE}" \
--audio "tmp/chat_with_anyone/{CHARACTER_NAME}/{AUDIO_FILE}" \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav"
The script prints the selected time range and saves the reference WAV. Verify the output exists and is non-empty before proceeding.
If the script reports no suitable segment: try --min-duration 2 for shorter clips, or download a different video.
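To make "densest speech window" concrete, here is one way such a selection could work. This is an illustrative sketch under stated assumptions, not the actual logic of extract_ref_segment.py: it defines density as the fraction of a window covered by subtitle cues, considers only runs of consecutive cues, and breaks ties in favor of the longer window. The function names are hypothetical.

```python
import re

TIME = r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
CUE_RE = re.compile(TIME + r"\s*-->\s*" + TIME)


def parse_srt(text):
    """Return sorted (start_seconds, end_seconds) pairs for each SRT cue."""
    cues = []
    for match in CUE_RE.finditer(text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        if end > start:
            cues.append((start, end))
    return sorted(cues)


def densest_window(cues, min_duration=3.0, max_duration=12.0):
    """Return the (start, end) run of consecutive cues with the best coverage.

    Coverage is speech time divided by window length. Returns None when no
    run of cues fits between min_duration and max_duration.
    """
    best, best_score = None, (-1.0, 0.0)
    for i in range(len(cues)):
        window_start = cues[i][0]
        covered, window_end = 0.0, window_start
        for start, end in cues[i:]:
            new_end = max(window_end, end)
            if new_end - window_start > max_duration:
                break
            # Count only newly covered time, since auto-captions often overlap.
            covered += max(0.0, end - max(start, window_end))
            window_end = new_end
            length = window_end - window_start
            if length >= min_duration:
                score = (covered / length, length)
                if score > best_score:
                    best, best_score = (window_start, window_end), score
    return best
```

Lowering `min_duration` here mirrors what the `--min-duration 2` fallback does conceptually: it admits shorter runs of speech when no longer window qualifies.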
Write a response in character, then synthesize it:
python3 skills/tts/scripts/tts.py \
-t "{RESPONSE_TEXT}" \
--ref-audio "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav" \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/reply.wav"
Present the generated audio file to the user along with the text. For subsequent messages, reuse the same --ref-audio path.
Track progress with this checklist:
- [ ] B1. Analyze image
- [ ] B2. Design voice
- [ ] B3. Preview (optional)
- [ ] B4. Generate speech
Use your vision capability to examine the image:
Pass both the image and the description to the voice-design script:
python3 skills/chat-with-anyone/scripts/voice_design.py \
--picture "{IMAGE_PATH}" \
--voice-description "{VOICE_DESCRIPTION}" \
-o "tmp/chat_with_anyone/voice_design"
The script outputs:
voice_id.txt containing the best voice ID

Read the voice ID:
cat tmp/chat_with_anyone/voice_design/voice_id.txt
Present the preview audio files from the output directory so the user can hear the voice. If unsatisfied, re-run B2 with adjusted --voice-description or --guidance-scale.
python3 skills/tts/scripts/tts.py \
-t "{RESPONSE_TEXT}" \
--voice-id "{VOICE_ID}" \
-o "tmp/chat_with_anyone/voice_design/reply.wav"
For subsequent messages, keep using the same --voice-id for consistency.
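Both workflows end with the same tts.py call and differ only in the voice source. The sketch below builds that command as an argument list, which avoids shell-quoting problems when the response text itself contains quotes. It uses only the flags shown in this document (`-t`, `--ref-audio`, `--voice-id`, `-o`); the helper name `build_tts_command` is hypothetical, and the rule that exactly one voice source must be given is an assumption of this sketch.

```python
TTS_SCRIPT = "skills/tts/scripts/tts.py"


def build_tts_command(text, output_path, ref_audio=None, voice_id=None):
    """Build the argv for tts.py with exactly one voice source."""
    if (ref_audio is None) == (voice_id is None):
        raise ValueError("Provide exactly one of ref_audio or voice_id")
    cmd = ["python3", TTS_SCRIPT, "-t", text]
    if ref_audio is not None:
        # Workflow A: cloned voice from the extracted reference segment.
        cmd += ["--ref-audio", ref_audio]
    else:
        # Workflow B: designed voice identified by its voice ID.
        cmd += ["--voice-id", voice_id]
    return cmd + ["-o", output_path]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the same `ref_audio` or `voice_id` across turns preserves a consistent voice for the whole conversation.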
User: 我想跟特朗普聊天,让他给我讲个睡前故事。 ("I want to chat with Trump; have him tell me a bedtime story.")
Agent steps:
- Search for "Donald Trump speech youtube" and find a clear speech video.
- yt-dlp -x --audio-format mp3 --write-subs --write-auto-subs --sub-langs "en" --convert-subs srt -o "tmp/chat_with_anyone/trump/%(title)s.%(ext)s" "https://youtube.com/watch?v=..."
- python3 skills/chat-with-anyone/scripts/extract_ref_segment.py --srt "tmp/chat_with_anyone/trump/....srt" --audio "tmp/chat_with_anyone/trump/....mp3" -o "tmp/chat_with_anyone/trump/ref.wav"
- python3 skills/tts/scripts/tts.py -t "Let me tell you a tremendous bedtime story..." --ref-audio "tmp/chat_with_anyone/trump/ref.wav" -o "tmp/chat_with_anyone/trump/reply.wav"
- Present reply.wav and the story text to the user.

User: [uploads photo.jpg] 我想跟这张图片里的人聊天 ("I want to chat with the person in this picture")
Agent steps:
- python3 skills/chat-with-anyone/scripts/voice_design.py --picture "photo.jpg" --voice-description "A young Chinese woman around 25, gentle and warm voice, friendly tone" -o "tmp/chat_with_anyone/voice_design"
- Read the voice ID from tmp/chat_with_anyone/voice_design/voice_id.txt.
- python3 skills/tts/scripts/tts.py -t "你好呀!很高兴认识你!" --voice-id "{VOICE_ID}" -o "tmp/chat_with_anyone/voice_design/reply.wav"
- For subsequent messages, keep using the same --voice-id.

| Problem | Solution |
|---|---|
| yt-dlp download fails or video unavailable | Try a different video URL; some regions/videos are restricted. Run yt-dlp -U to update |
| No SRT subtitle files | Re-download with --sub-langs "en,zh-Hans"; if still none, try a different video with auto-captions |
| extract_ref_segment.py finds no suitable window | Use --min-duration 2 for shorter clips, or try a different video |
| Voice design returns error | Check Noiz API key; ensure image is a clear photo of a person |
| TTS output sounds wrong | For Workflow A, try a different reference video; for Workflow B, adjust --voice-description |