vn-skill-for-windows
Process video, image, and audio files locally on Windows — no cloud upload, no API key required.
When to Activate
Trigger this skill when the user wants to:
- Extract audio from a video
- Extract a frame / thumbnail / cover image from a video
- Add auto-generated captions to a video
- Burn SRT subtitles into a video
- Compress a video or image
- Concatenate / merge video clips
- Denoise audio or video
- Remove foreground subject from a video (cutout)
Intent mapping — match user phrases to commands:
| User says | Command |
|---|
| extract audio / get audio / rip audio / pull audio / audio only | extract-audio |
| grab frame / screenshot / thumbnail / cover image / still / poster | extract-frame |
| auto captions / auto subtitles / generate subtitles / transcribe / speech to text | auto-captions |
| add subtitles / burn SRT / burn captions / overlay subtitles | add-caption |
| compress image / shrink image / reduce image size / optimize image / convert to webp | compress-image |
| compress video / shrink video / make smaller / reduce file size / resize to 720p | compress-video |
| join clips / merge videos / concatenate / combine videos / stitch together | concat-video |
| denoise / remove noise / clean audio / noise reduction / clean up audio | denoise |
| remove background / cutout / extract person / isolate foreground | cutout-video |
Also activate when the user mentions: "edit video", "process video", "trim audio", "video editing", or references any local media file they want transformed.
Do NOT trigger this skill when:
- The user is asking general questions about video editing (answer directly)
- The user wants to edit a video timeline, add effects, or do creative editing
- The input file is not a local file (URLs, cloud files not supported)
- The platform is not Windows
Required inputs
Before running any command, confirm you have all required inputs:
| Command | Required | Ask if missing |
|---|
extract-audio | video file path | yes |
extract-frame | video file path | yes |
auto-captions | video file path | yes; engine defaults to whisper_base |
add-caption | video file path + SRT file path | yes, ask for SRT path |
compress-video | video file path | yes; resolution/fps/bitrate are optional |
compress-image | image file path | yes; format/quality are optional |
concat-video | at least 2 video file paths | yes |
denoise | audio or video file path | yes; level defaults to moderate |
cutout-video | video file path | yes |
Stop and ask the user when:
- The input file path does not exist or was not provided
- The user requests
add-caption but no SRT file path was given
- The user requests
concat-video with fewer than 2 files
- The user requests
denoise --level custom but no --custom-value was provided
- The requested operation is outside the scope of this skill (e.g. timeline editing, adding effects)
Do not guess or assume missing required inputs — always ask first.
Ambiguity Resolution
When the user's request contains values or terms that could map to more than
one parameter, ask one focused question to clarify BEFORE executing. Never
guess — a wrong guess wastes processing time and erodes trust.
Ask for clarification when:
- A single value is given without specifying which property it applies to.
- e.g. "add captions, purple" — unclear whether purple is text color or stroke color.
- A number is given without units or context.
- e.g. "compress photo 10" — is
10 quality (%), width (px), or target file size (MB)?
- A required pair of values is partially missing.
- e.g. "white stroke" with no text color — ask what the text color should be.
- Conflicting instructions.
- e.g. "compress to 4K but keep the file under 5 MB" — likely impossible, confirm trade-off.
Do NOT ask when the intent is clear:
- "compress this photo" → use defaults.
- "compress video to 720p" →
-r 720 is unambiguous.
- "extract audio as mp3" →
-f mp3.
- "denoise this, it's really noisy" →
-l high.
How to ask: Present lettered options so the user can reply with just a letter.
❓ "Compress photo 10" — what does 10 mean here?
A. Quality 10% — heavy compression, lower image quality
B. Max width 10 px — tiny thumbnail
C. Target file size ≈ 10 MB
D. Something else — please clarify
1. Agent Behavior (follow this strictly)
1.1 Startup checks (run once per session before any task)
Run these checks in order. Stop and handle any failure before continuing.
Step 1 — Find CLI:
${vn-tools-cli} = $null
foreach ($root in @((Join-Path $env:LOCALAPPDATA "Programs"), $env:LOCALAPPDATA, $env:ProgramFiles)) {
${vn-tools-cli} = Get-ChildItem -LiteralPath $root -Directory -ErrorAction SilentlyContinue |
ForEach-Object { Join-Path $_.FullName "CLI\bin\vn-tools-cli.exe" } |
Where-Object { Test-Path -LiteralPath $_ } |
Select-Object -First 1
if ($null -ne ${vn-tools-cli}) { break }
}
Step 2 — Auto-install or upgrade (do NOT ask — just do it and inform the user):
Tell the user immediately:
🔧 Installing VN Tools CLI (one-time setup) — this may take a minute or two.
Or if upgrading:
🔧 Upgrading VN Tools CLI to 0.1.0.0 — one moment.
Then run:
# The install script ships as `vn-tools-cli-install.ps1.txt` because ClawHub
# rejects `.ps1` uploads; copy it to %TEMP% with the real extension before
# executing so PowerShell treats it as a script.
$src = "<skill-dir>\scripts\vn-tools-cli-install.ps1.txt"
$dst = Join-Path $env:TEMP "vn-tools-cli-install.ps1"
Copy-Item -LiteralPath $src -Destination $dst -Force
powershell -NoProfile -ExecutionPolicy Bypass -File $dst
After the installer finishes, re-run the search from Step 1.
Step 3 — Validate input file(s):
Test-Path -LiteralPath "C:\absolute\path\to\file"
- If missing → tell the user the file was not found and ask for a valid path
- For
add-caption: also validate the SRT file path
- For
concat-video: validate all input files
1.2 Running a task
- Resolve the correct command from the When to Activate trigger list
- Build the command using absolute Windows paths — quote paths with spaces
- If the user does not specify an output directory, omit
-o
- The CLI prints the generated output file path to stdout on success
- Report the result per § 3 Response Format
- On error → report in plain language
1.3 Input file handling
- Caching: When the user provides a file via URL or chat attachment,
download it to a local working directory. Reuse the local copy if the same
file was already downloaded.
- No file identified: If the user requests an action but neither attaches a
file nor mentions a specific file, ask which file to process. If the user
clearly references a previous file (e.g. "compress the video I just sent"),
reuse it without asking.
1.4 Command quick reference
CLI binary: & ${vn-tools-cli} (resolved in § 1.1)
| Command | Required args | Key options |
|---|
extract-audio | <video> | -o <dir> -f mp3 |
extract-frame | <video> | -p first |
auto-captions | <video> | -e <engine> -l <lang> -j <threads> [style flags] -o <dir> |
add-caption | <video> --srt <srt> | --font-family --font-size --text-color --stroke-color --stroke-width --background-color --opacity -o <dir> |
compress-video | <video> | -r <resolution> --fps <fps> -b <kbps> --hdr -o <dir> |
compress-image | <image> | -f jpeg |
concat-video | <video1> <video2> [...] | -o <dir> |
denoise | <audio | video> | |
cutout-video | <video> | --feather <0-100> --expand <-20..20> -f mp4 |
Engine selection for auto-captions:
| Engine | Model file | Size | Best for |
|---|
whisper_tiny | ggml-tiny.bin | ≈75 MB | Fast drafts, low resource |
whisper_base | ggml-base.bin | ≈148 MB | Good default balance |
whisper_medium_en | ggml-medium.en-q8_0.bin | 785 MB | High accuracy, English only |
whisper_medium | ggml-medium-q8_0.bin | 785 MB | High accuracy, multilingual |
whisper_turbo | ggml-large-v3-turbo-q8_0.bin | 833 MB | Best quality, multilingual |
Supported languages: auto en zh ja ko es pt ar hi id fr de ru it tr vi th pl uk nl sv fi ro cs hu he el bg hr sk sl lt lv et ms ta ur sw mk mi is hy az af
Denoise levels:
| Level | Use case |
|---|
low | Quiet background hum |
moderate | Default, general use |
high | Noisy environment |
veryHigh | Very noisy recordings |
Enable --pf (postfilter) for additional clarity at the cost of processing time.
2. Default Decision Rules
- Output format should match the source format unless the user explicitly
requests a different format (e.g. JPEG in → JPEG out, MP4 in → MP4 out).
extract-audio: omit -f unless the user requests a format; the CLI stream-copies audio when possible.
extract-frame: default to first frame and PNG. Only use -p custom -t <seconds> when the user gives a timestamp.
compress-image: keep the source format (JPEG → JPEG, PNG → PNG) and use
-q 0.8. Only convert if the user asks. Keep aspect ratio unless asked.
compress-video: if the user only says "compress" or "make smaller" without
specifying resolution or bitrate, present options with estimated size. Use
est_MB = (video_kbps + audio_kbps) × duration_s / 8 / 1024
with reference bitrates: 1080p=2500, 720p=1500, 480p=800, 360p=500, 240p=300.
Audio: use source bitrate or assume 128 kbps. Example for 1080p source:
A. Downscale to 720p, lower bitrate (720p / 1500 kbps) — est. ≈24 MB
B. Downscale to 480p, lower bitrate (480p / 800 kbps) — est. ≈14 MB
C. Keep 1080p, lower bitrate (1080p / 2500 kbps) — est. ≈39 MB
D. Custom — tell me target size or params
If the user gives a clear target (e.g. "compress to 720p"), skip options.
Never force -b unless the user requests a specific bitrate.
Only add --fps or --hdr when asked.
add-caption: rely on CLI style defaults unless the user specifies styling.
Defaults: Inter, size 13, text FFFFFF, stroke 000000, width 1.0, no background, opacity 1.0.
auto-captions: if the user does not specify a model, present the engine
options from § 1.4 Engine Selection. "quick captions" → tiny,
"accurate captions" → medium/turbo. Default language to -l auto; only set
-j when the user asks.
denoise: default to -l moderate; omit --audio-only, --high-pass, --pf unless requested.
concat-video: preserve input order exactly as given. Hard cut only (no transitions).
cutout-video: default to --feather 0 --expand 0; omit -f unless the user requests a container.
- For "better subtitle accuracy" or non-English speech, prefer
whisper_medium or whisper_turbo.
3. Response Format (MANDATORY — follow for every task)
Users interact through chat platforms (Discord, Slack, web) with no visible
terminal. Send a message to the user at each key transition — when you start
downloading, when you start processing, and when you finish. The user should
never be left wondering what is happening.
3.1 Message Discipline
Only send messages that match the templates defined below. Everything else
is internal reasoning and MUST NOT be sent to the user.
| Type of leak | Bad example | Why it's wrong |
|---|
| Narrating actions | "Let me search for the file..." | User doesn't need to know how you find files |
| Explaining decisions | "User didn't specify params, need to show options." | Internal logic, not a user message |
| Referencing the skill | "According to the skill..." / "Per skill rules..." | Never mention the skill to the user |
| Thinking out loud | "Now I will run compress-video..." | Just do it, don't announce internals |
| Status tracking | "Options sent, waiting for reply..." | The user already sees the options |
| Echoing raw metadata | "Video info: H.264, 720x968, 6.9 MB. ..." as a standalone message | Metadata belongs inside the options, not as a separate message |
3.2 Start Message — send IMMEDIATELY when you begin
As soon as you receive the user's request, send a message describing the first
action you are about to take. If the input file needs to be downloaded from a
URL/attachment, say so. If it already exists locally, go straight to the action.
File needs downloading:
⬇️ Downloading clip.mp4 — one moment.
After the download finishes, send the processing message:
🎬 Download complete. Starting compress-video (720p) — this may take a moment.
File already exists locally (cached from a previous request):
Skip the download message and go straight to:
🎬 Starting compress-video on clip.mp4 (720p) — this may take a moment.
If the CLI needs to be installed first:
🔧 VN Tools CLI needs to be installed first (one-time setup) — this may take a minute or two.
After installation, send the processing message as above.
**auto-captions with a non-tiny engine:**
Only ggml-tiny.bin is bundled. Other models download on first use. Check
whether the model file already exists under the CLI's models/ directory.
| Engine | Model file | Download size |
|---|
whisper_base | models/ggml-base.bin | ≈148 MB |
whisper_medium_en | models/ggml-medium.en-q8_0.bin | ≈785 MB |
whisper_medium | models/ggml-medium-q8_0.bin | ≈785 MB |
whisper_turbo | models/ggml-large-v3-turbo-q8_0.bin | ≈833 MB |
Model does not exist — mention the download in the processing message:
🎬 Starting auto-captions on meeting.mp4 (whisper_base).
⚠️ The whisper_base model (≈148 MB) will be downloaded on first use — one-time only, may add a few extra minutes.
Model already exists:
🎬 Starting auto-captions on meeting.mp4 (whisper_base) — this may take a moment.
3.3 End Message — send AFTER processing finishes
One message that combines the result summary and the file delivery.
Success:
✅ Done — audio extracted.
Source: interview.mp4 · MP4 · 1920×1080 · 5 min 32 s · 120 MB
Output: interview_audio_20260408.mp3 · MP3 · 128 kbps · 5 min 32 s · 3.2 MB
Failure:
❌ Processing failed — the video file appears to be corrupted.
Error: Invalid data found when processing input
3.4 Preview delivery
Applies when the output file exceeds the platform's upload limit. Image and
audio outputs are typically small — send directly. The preview flow below
applies only to video.
Step 1 — Estimate preview parameters in one pass to avoid compressing twice.
Target ≈80% of the platform limit for headroom:
target_size_kb = upload_limit_mb × 0.8 × 1024
target_bitrate_kbps = target_size_kb × 8 / duration_seconds
| target_bitrate_kbps | Resolution (-r) |
|---|
| ≥ 1500 | 480 |
| 800 – 1499 | 360 |
| 400 – 799 | 240 |
| < 400 | 240 (cap at 400) |
Hard floor: never below 240p. If even 240p can't fit, skip the preview.
Step 2 — Send the preview with the summary in one message:
✅ Done — video captioned.
Source: meeting.mp4 · MP4 · 1920×1080 · 5 min 32 s · 120 MB
Output: meeting_captioned_20260409.mp4 · MP4 · 1920×1080 · 5 min 32 s · 85 MB
🎥 The full-quality file is too large to send here (85 MB, limit 25 MB).
Attached is a preview (360p, 4.2 MB).
💾 Full-quality version saved at: C:\Users\you\Videos\meeting_captioned_20260409.mp4
If even 240p can't fit, skip the preview:
✅ Done — video captioned. Output is 850 MB — too large to preview.
💾 Full-quality file: C:\Users\you\Videos\meeting_captioned_20260409.mp4
Upload failure (file was within limit): Retry once. If retry also fails:
💾 Upload failed after retry. File saved at:
C:\Users\you\Videos\output_20260409.mp4
4. Failure Handling
4.1 CLI install / upgrade failed
If the install script exits with a non-zero code or the CLI is still not found after running it:
- Check the MSI log at
%TEMP%\vn-tools-cli-install\vn-tools-cli-install-msi.log for details.
- Common causes:
- Network error — MSI download failed. Ask the user to check their internet connection and retry.
- Antivirus / security software — may block
msiexec or quarantine the MSI. Ask the user to temporarily disable it and retry.
- Insufficient disk space — ask the user to free space and retry.
- If retry also fails, provide the manual download link:
❌ Automatic installation failed. Please download and install manually:
https://github.com/cawcut/skill-vn/releases/download/0.1.0/vn-tools-cli_0.1.0.0_windows_x64.msi
4.2 Whisper model download failed
If auto-captions fails because the Whisper model could not be downloaded:
User explicitly specified an engine (e.g. whisper_medium): do NOT silently switch. Ask:
The whisper_medium model failed to download. Would you like to:
A. Retry downloading whisper_medium
B. Switch to whisper_tiny (built-in, no download needed, lower accuracy)
C. Cancel
User did NOT specify an engine (using default): automatically fall back to whisper_tiny and inform:
The Whisper model failed to download. Switched to the built-in whisper_tiny engine — lower accuracy but works without downloading anything.
4.3 SRT file missing (add-caption)
If the user requests add-caption without providing an SRT file:
Please provide the path to your SRT subtitle file.
Validate the path exists before running the command.
4.4 Invalid parameter values
The CLI enforces these constraints — validate before running:
| Parameter | Constraint |
|---|
compress-video -r | minimum 144 |
compress-video --fps | 1–120 |
compress-video -b | minimum 100 kbps |
compress-image -q | 0.0–1.0 |
denoise -l custom | requires -v <int> |
cutout-video --feather | 0–100 |
cutout-video --expand | -20 to 20 |
If a value is out of range, tell the user the valid range and ask for a corrected value.
4.5 Generic failure
Summarise the error output in plain language. Suggest a next step based on the error content. If unresolvable, refer the user to § 6 Support.
5. Quick Examples
& ${vn-tools-cli} extract-audio "C:\Videos\clip.mp4" -o "C:\Output" -f mp3
& ${vn-tools-cli} extract-frame "C:\Videos\clip.mp4" -p custom -t 5.5 -o "C:\Output"
& ${vn-tools-cli} compress-image "C:\Photos\big.jpg" -q 0.8 -o "C:\Output"
& ${vn-tools-cli} compress-video "C:\Videos\raw.mp4" -r 720 -o "C:\Output"
& ${vn-tools-cli} add-caption "C:\Videos\clip.mp4" --srt "C:\Subs\clip.srt" --font-size 28 -o "C:\Output"
& ${vn-tools-cli} concat-video "C:\Videos\part1.mp4" "C:\Videos\part2.mp4" -o "C:\Output"
& ${vn-tools-cli} auto-captions "C:\Videos\interview.mp4" -e whisper_base -l en -o "C:\Output"
& ${vn-tools-cli} denoise "C:\Audio\noisy.wav" -l high --pf -o "C:\Output"
& ${vn-tools-cli} cutout-video "C:\Videos\person.mp4" --feather 5 -f mp4 -o "C:\Output"
6. Reference
For the full CLI options, all flags, and example invocations:
→ references/cli-reference.md
7. Support
If you still need help after trying the suggested fixes, contact VN Tools support at vn.support+windows@ui.com.
8. Notes
- The CLI is Windows-only.
- All file paths must be absolute Windows paths; quote paths with spaces.
- Output filenames contain a timestamp to prevent overwrites.
auto-captions removes the temporary SRT after burning; it does not return the SRT separately.
cutout-video composites the subject over a green background (opaque YUV420P); no alpha channel is produced regardless of container format.
concat-video is hard cut only — no transition effects.