Voice Agent
Pass. Audited by ClawScan on May 1, 2026.
Overview
This skill appears purpose-aligned: it uses a local voice backend to transcribe audio and generate speech, with some disclosed data-flow and setup dependencies users should understand.
Before installing, make sure you trust the backend service running at localhost:8000 and understand that audio and text may be processed by that backend and possibly by AWS Polly. Choose input files deliberately and pick safe output paths to avoid accidentally exposing audio or overwriting important files.
Findings (3)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
A chosen audio file is handed to the local backend, and an output path can be created or overwritten by the synthesis command.
The client reads a provided audio file for transcription and writes synthesized audio to a provided output path. This is expected for the skill, but users should ensure only intended files and output locations are used.
```python
with open(filename, 'rb') as f:
    data += f.read()
...
with open(output_file, 'wb') as f:
```
Use explicit, non-sensitive audio inputs and safe output paths; avoid pointing the output at existing important files.
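A simple guard can enforce that recommendation before the synthesis write runs. This is a hypothetical sketch, not part of the skill: the function name, the refusal-on-overwrite policy, and the allowed extensions are all assumptions layered on top of the client's `open(output_file, 'wb')` call.

```python
from pathlib import Path

# Hypothetical pre-write guard: refuse to clobber existing files and
# reject output paths that do not look like audio files.
ALLOWED_SUFFIXES = {".wav", ".mp3", ".ogg"}

def safe_output_path(path_str: str) -> Path:
    path = Path(path_str)
    if path.exists():
        # Refusing to overwrite protects important files from the 'wb' open.
        raise FileExistsError(f"Refusing to overwrite existing file: {path}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unexpected audio extension: {path.suffix!r}")
    return path
```

Calling this before the write turns an accidental overwrite into a loud error instead of silent data loss.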
The skill’s safety depends partly on the backend service running on localhost:8000, not just on the packaged client script.
The skill is client-only and relies on a separately managed backend and repository docs outside the included package. That dependency is disclosed, but the backend is part of the trust decision.
Requires a running backend API at `http://localhost:8000`. Backend setup instructions are in this repository:
- `README.md`
- `walkthrough.md`
- `DOCKER_README.md`
Install and run the backend only from a trusted source, review its setup instructions, and avoid running an unexpected service on localhost:8000.
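One way to catch an unexpected service on that port is a reachability check before using the skill. The report does not document the backend's endpoints, so this sketch only tests that something accepts TCP connections on port 8000; it cannot confirm the listener is the trusted backend, and the function name is an assumption.

```python
import socket

def backend_listening(host: str = "localhost", port: int = 8000,
                      timeout: float = 2.0) -> bool:
    """Return True if anything accepts TCP connections on the backend port.

    This only proves a listener exists; verifying it is the *intended*
    backend still requires checking how that service was installed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True before you have started the backend yourself, some other process owns port 8000 and should be investigated first.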
Text sent for speech generation may be handled beyond the local machine by AWS Polly through the backend.
The skill discloses that text-to-speech uses AWS Polly via the backend. That is purpose-aligned, but synthesis text may be processed by an external provider depending on backend configuration.
It uses **local Whisper** for Speech-to-Text transcription and **AWS Polly** for Text-to-Speech generation.
Do not synthesize highly sensitive text unless you are comfortable with the backend and AWS Polly handling it.
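A lightweight pre-flight screen can help users follow that advice. This is an illustrative sketch only: the patterns below are assumptions (not an exhaustive sensitivity check, and not part of the skill or the backend), meant to flag obviously sensitive text before it is sent for synthesis and potentially forwarded to AWS Polly.

```python
import re

# Hypothetical screen for obviously sensitive text; patterns are examples,
# not a complete or reliable data-loss-prevention filter.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                 # US-SSN-like numbers
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),                # card-number-like digit runs
    re.compile(r"(?i)\b(password|secret|api[_ ]?key)\b"),  # credential keywords
]

def looks_sensitive(text: str) -> bool:
    """Return True if the synthesis text matches any sensitive pattern."""
    return any(p.search(text) for p in SENSITIVE_PATTERNS)
```

A client could refuse, or prompt for confirmation, whenever `looks_sensitive` returns True before posting text to the TTS endpoint.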
