Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Local TTS Workflow

v1.0.0

OpenClaw text-to-speech workflow for an OpenAI-compatible TTS server, including remote/self-hosted deployments such as vLLM Omni. Use when configuring, testi...

by Mozi Arasaka (@mozi1924)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal
Benign
OpenClaw
Suspicious (medium confidence)
Purpose & Capability
The skill's name and description (a local TTS debugging and text-prep workflow) match the actions described (curling a local OpenAI-compatible TTS server, normalizing text for TTS, testing modes). However, the runtime instructions require reading the OpenClaw configuration (~/.openclaw/openclaw.json) and treating the configured baseUrl as the source of truth; the skill metadata does not declare any required config paths or credentials, which is an inconsistency.
Instruction Scope
SKILL.md explicitly instructs the agent to read ~/.openclaw/openclaw.json to extract messages.tts.* fields and to run curl commands against example hosts (100.66.193.127:8000 and 127.0.0.1). It also references environment variables (e.g., MLX_AUDIO_SERVER_TASKS) and suggests uploading consent audio. The instructions therefore direct the agent to access local config files and perform network operations — actions not reflected in the declared requirements. This grants the skill broader access than the registry metadata indicates.
Install Mechanism
No install spec and no code files are included (instruction-only). That minimizes disk-write and install-time risks.
Credentials
The manifest lists no required env vars or config paths, yet the instructions reference environment variables and require reading ~/.openclaw/openclaw.json (which may contain URLs or tokens). The skill asks callers to run curl against local/internal endpoints and to use server-specific env vars for server startup; those accesses are plausible for TTS debugging, but they are not declared in metadata and could expose sensitive local configuration if granted without review.
Persistence & Privilege
The skill is user-invocable, not always-enabled, and does not request to modify system-wide agent settings. It does not demand persistent presence or elevated platform privileges.
What to consider before installing
This is an instruction-only TTS debugging workflow that tells an agent to read your OpenClaw config (~/.openclaw/openclaw.json) and to run curl against local/internal TTS endpoints. That behavior is reasonable for TTS debugging, but the registry metadata did not declare the config path or env accesses — a mismatch you should resolve before installing. Before allowing this skill:

  1. Inspect ~/.openclaw/openclaw.json yourself to confirm it contains no secrets you wouldn't want an agent to read (API keys, tokens, credentials).
  2. Ask the skill author to declare required config paths and env vars in the metadata.
  3. Prefer running the instructions manually or in a sandboxed environment first, so you can see which endpoints are contacted.
  4. When the agent performs network tests, ensure the baseUrl is taken from your config (not the example IPs) and that you trust the target host.

If you can't confirm the config contents or the author refuses to update metadata, treat the skill as higher risk and avoid granting it automatic invocation that can read local config.


latest: vk974ckvmk30hq6nscs5zazvq7h84085k


SKILL.md

Local TTS Workflow

Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.

Do not hardcode 127.0.0.1 blindly. Read the active OpenClaw config first and use the current messages.tts.openai.baseUrl as the source of truth.

Current known deployment in this workspace: http://100.66.193.127:8000/v1.

Core rule: normalize numbers before synthesis

If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.

Convert them into words first.

Examples:

  • Chinese output: write 一 二 三, not 123
  • English output: write one two three, not 123

This rule matters because the TTS model can mispronounce, skip, or garble raw numerals.

When preparing spoken text, normalize:

  • dates
  • times
  • counts
  • version-like strings if they will be read aloud
  • mixed Chinese/English numeric snippets

If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
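As a minimal illustration of the rule above, here is a throwaway shell helper (the function name and digit map are ours, not part of the skill) that expands lone digits into English words. A real normalizer would also cover digits 4–9, dates, times, and multi-digit grouping:

```shell
# Illustrative only: expand single Arabic digits into English words.
# Extend the sed map for 4-9 and add date/time handling for real use.
normalize_digits() {
  printf '%s\n' "$1" \
    | sed -e 's/1/one /g' -e 's/2/two /g' -e 's/3/three /g' -e 's/0/zero /g' \
    | sed -e 's/ $//'   # trim the trailing space left by the last expansion
}

normalize_digits "call 123"   # -> call one two three
```

Keep the original string for display and feed only the normalized copy to the TTS endpoint.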

Workflow

1. Verify the server before touching OpenClaw

Read ~/.openclaw/openclaw.json first and extract:

  • messages.tts.provider
  • messages.tts.openai.baseUrl
  • messages.tts.openai.model
  • messages.tts.openai.voice
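Extracting those fields can be scripted with jq (assuming jq is installed). The inline config below is a stand-in for your real ~/.openclaw/openclaw.json, written here so the sketch is self-contained:

```shell
# Stand-in config for illustration; point CFG at ~/.openclaw/openclaw.json in practice.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
{"messages":{"tts":{"provider":"openai","openai":{
  "baseUrl":"http://100.66.193.127:8000/v1",
  "model":"/models/lj-qwen3-tts/","voice":"lj"}}}}
EOF

# Pull the four fields the workflow needs.
PROVIDER=$(jq -r '.messages.tts.provider' "$CFG")
BASE_URL=$(jq -r '.messages.tts.openai.baseUrl' "$CFG")
MODEL=$(jq -r '.messages.tts.openai.model' "$CFG")
VOICE=$(jq -r '.messages.tts.openai.voice' "$CFG")
echo "provider=$PROVIDER baseUrl=$BASE_URL model=$MODEL voice=$VOICE"
rm -f "$CFG"
```

Use $BASE_URL in all later curl tests instead of hardcoding an IP.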

Check the basics against the actual configured host:

curl http://100.66.193.127:8000/health
curl http://100.66.193.127:8000/v1/models

Confirm that the intended TTS model exists.
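The existence check can be done with jq over the /v1/models response. The JSON below is a sample of the usual OpenAI-style "data" array, not captured server output; in practice pipe the curl result in instead:

```shell
# Sample /v1/models payload; in practice use:
#   curl -s http://100.66.193.127:8000/v1/models | jq ...
MODELS='{"data":[{"id":"/models/lj-qwen3-tts/"},{"id":"/models/other/"}]}'
WANT='/models/lj-qwen3-tts/'

# jq -e exits nonzero if no matching model id is found.
if printf '%s' "$MODELS" | jq -e --arg id "$WANT" '.data[] | select(.id == $id)' >/dev/null; then
  echo "model present: $WANT"
else
  echo "model missing: $WANT"
fi
```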

If the server is task-gated, ensure TTS is enabled:

MLX_AUDIO_SERVER_TASKS=tts uv run python server.py

2. Prove the raw TTS endpoint works

Always isolate the server from the client stack.

Minimal non-streaming test:

curl http://100.66.193.127:8000/v1/audio/speech \
 -X POST \
 -H 'Content-Type: application/json' \
 -d '{
 "model": "/models/lj-qwen3-tts/",
 "voice": "lj",
 "input": "你好,这是一次性返回完整音频的测试。",
 "format": "wav",
 "stream": false
 }' \
 --output sample.wav

Basic streaming test:

curl http://100.66.193.127:8000/v1/audio/speech \
 -H 'Content-Type: application/json' \
 -X POST \
 -d '{
 "model": "/models/lj-qwen3-tts/",
 "voice": "lj",
 "input": "你好,这是实时流式语音合成测试。",
 "format": "wav",
 "stream": true,
 "stream_format": "audio"
 }' \
 | ffplay -i -

If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.

3. Distinguish server failure from integration failure

Use this rule:

  • Direct curl fails → fix the local TTS server first
  • Direct curl works, but OpenClaw sounds wrong or falls back → inspect OpenClaw provider selection, fallback, and request shape
  • OpenClaw sends requests but voice/mode is wrong → inspect fields like model, voice, instructions, ref_audio, ref_text, and streaming flags

4. Know the four TTS modes

Use the right request shape for the right model type.

Base speaker

Use built-in speaker playback.

Typical shape:

  • model type: base
  • no full ref_audio + ref_text
  • voice.id means built-in speaker name

Base clone

Use clone-style synthesis.

Typical shape:

  • model type: base
  • must provide both ref_audio and ref_text, or supply a consent voice identity that resolves to both

Hard rule: do not attempt clone with only ref_audio.

CustomVoice

Use a model with prebuilt custom speakers.

Typical shape:

  • model type: custom_voice
  • voice may be accepted either as a plain string or as {"id":"..."} depending on the server
  • for this workspace, lj-qwen3-tts / /models/lj-qwen3-tts/ must use speaker/voice lj
  • do not send clone payloads

VoiceDesign

Use style-description-driven synthesis.

Typical shape:

  • model type: voice_design
  • must provide instructions
  • do not send voice, ref_audio, or ref_text
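The four shapes can be summarized as JSON payload sketches. Field names follow this document; the model paths other than lj-qwen3-tts and the voice value "alloy" are illustrative placeholders, and your server may accept voice as a plain string or as an object:

```shell
# Illustrative payloads for the four modes; only mode-specific fields differ.
base_speaker='{"model":"/models/base-tts/","voice":"alloy","input":"hello"}'
base_clone='{"model":"/models/base-tts/","input":"hello","ref_audio":"<base64 wav>","ref_text":"hello"}'
custom_voice='{"model":"/models/lj-qwen3-tts/","voice":"lj","input":"hello"}'
voice_design='{"model":"/models/vd-tts/","input":"hello","instructions":"calm, low-pitched narrator"}'

# Sanity-check the mode rules with jq: clone needs both refs,
# VoiceDesign needs instructions and must not carry a voice.
printf '%s' "$base_clone"   | jq -e 'has("ref_audio") and has("ref_text")' >/dev/null && echo "clone ok"
printf '%s' "$voice_design" | jq -e 'has("instructions") and (has("voice") | not)' >/dev/null && echo "design ok"
```

Sending one shape to a model of another type is exactly the "mode mismatch" failure described later.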

5. Treat streaming as a real transport choice

This server supports real incremental generation, not fake post-hoc slicing.

Important behavior:

  • If stream is omitted, the server defaults to streaming behavior
  • stream=false forces full non-streaming response
  • stream_format="audio" returns playable audio bytes
  • stream_format="event" returns SSE events with base64 chunks

If a client expects one mode and the server is returning the other, you will get confusing results fast.
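With stream_format="event", each SSE data line carries base64-encoded audio. A decoding sketch over one fabricated event line follows; the "audio" field name is an assumption, so check references/tts-api.md for the real event schema:

```shell
# Fabricated SSE line for illustration; real lines arrive on the curl stream.
LINE='data: {"audio":"SGVsbG8="}'

# Strip the SSE "data: " prefix, pull the base64 field, decode the bytes.
PAYLOAD=${LINE#data: }
CHUNK=$(printf '%s' "$PAYLOAD" | jq -r '.audio' | base64 -d)
echo "$CHUNK"   # decoded bytes (here the ASCII stand-in "Hello")
```

In a real client you would append each decoded chunk to a player buffer rather than echoing it.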

6. Use consent uploads properly

For consent-based clone flows, upload voice material through /v1/audio/voice_consents.

Use ref_text with the recording; treat it as required even if a workflow appears to accept uploads without it.

If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both:

  • reference audio
  • reference text

7. OpenClaw-specific debugging pattern

When OpenClaw TTS appears broken:

  1. Confirm messages.tts points at the actual configured endpoint in openclaw.json
  2. Confirm the intended model exists in /v1/models or is otherwise accepted by the server
  3. Confirm the selected provider is really the OpenAI-compatible path and not Microsoft fallback
  4. Test direct curl with the same effective model/voice/mode assumptions
  5. Inspect whether OpenClaw is falling back to another provider
  6. If using [[tts:...]], verify whether single-reply override keys (model, voice, maybe provider) are enabled and are being honored
  7. If needed, compare raw request shape with a dump proxy

If OpenClaw reaches the server successfully, the next question is usually which mode it actually requested.

8. Preferred test ladder

Use this order:

  1. GET /health
  2. GET /v1/models
  3. direct non-streaming TTS test
  4. direct streaming TTS test
  5. consent upload test if clone is involved
  6. OpenAI client compatibility test if relevant
  7. OpenClaw integration test
  8. dump-proxy / log inspection only if still ambiguous

9. Common conclusions

Server good, integration bad

Typical signs:

  • manual curl returns playable audio
  • OpenClaw output sounds like fallback voice or wrong mode
  • provider selection is inconsistent

Conclusion: fix integration, not inference.

Text normalization bug

Typical signs:

  • synthesis succeeds technically
  • numbers are read awkwardly, skipped, or glitched

Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.

Mode mismatch

Typical signs:

  • clone request sent to CustomVoice
  • VoiceDesign called without instructions
  • only ref_audio present for Base clone

Conclusion: wrong request semantics for the chosen model type.

10. Use the reference doc when exact fields matter

Read references/tts-api.md when you need exact behavior for:

  • /v1/audio/speech
  • /v1/audio/voice_consents
  • streaming vs non-streaming
  • stream_format="audio" vs stream_format="event"
  • mode selection and response headers
  • consent storage semantics
  • exact model/request mismatch errors

Do not assume generic OpenAI TTS docs fully match this local server.

Resources

references/

  • references/tts-api.md — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions

Files

2 total
