Vision Tagger
v1.0.0
Tag and annotate images using Apple Vision framework (macOS only). Detects faces, bodies, hands, text (OCR), barcodes, objects, scene labels, and saliency re...
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
The name/description (macOS Apple Vision tagging) match the required binaries (swiftc, python3) and included files (Swift Vision code + Python annotator). Minor inconsistency: registry metadata/flags list no OS restriction while the SKILL.md and scripts explicitly require macOS; this is likely a metadata omission rather than malicious.
Instruction Scope
SKILL.md and scripts instruct compiling a Swift program and running it on an image, then optionally annotating with the Python script. The instructions only reference the image file(s) provided by the user and local system tools; they do not read other system config paths or request additional environment variables.
Install Mechanism
There is no remote download/install of arbitrary code; included files are compiled locally (swiftc) and Python dependencies are installed via pip. The setup script triggers xcode-select --install when swiftc is missing, which is a standard way to get Xcode CLI tools.
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to access secrets or external service tokens. The requested permissions (access to local filesystem image paths and ability to run a compiled binary) are proportional to the stated purpose.
Persistence & Privilege
The skill does not request persistent global privileges, does not set always: true, and does not modify other skills or system-wide settings. It compiles a binary into its own scripts directory, which is expected for this kind of skill.
Assessment
This skill appears to do local image analysis using Apple Vision and Pillow; it requires macOS 12+ and Xcode CLI tools. Before installing: (1) only run on macOS as intended (SKILL.md requires macOS); (2) review the included Swift and Python source (they are provided) and only run the compile/install steps if you trust the source; (3) be aware the setup compiles a binary in the skill folder and installs Pillow via pip; (4) run the tool on non-sensitive images first to confirm behavior; and (5) if you want extra caution, run the setup/annotation inside a sandboxed account or VM and inspect the compiled binary with standard tools.
Runtime requirements
👁️ Clawdis
Bins: swiftc, python3
Vision Tagger
macOS-native image analysis using Apple's Vision framework. All processing is local — no cloud APIs, no API keys needed.
Requirements
- macOS 12+ (Monterey or later)
- Xcode Command Line Tools
- Python 3 with Pillow
Setup (one-time)
# Install Xcode CLI tools if needed
xcode-select --install
# Install Pillow
pip3 install Pillow
# Compile the Swift binary
cd scripts/
swiftc -O -o image_tagger image_tagger.swift
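Before running the setup steps, it can help to confirm that the basic prerequisites are present. The following is a minimal sketch and not part of the skill: `check_prerequisites` is a hypothetical helper that only checks for macOS and for `swiftc` and `python3` on `PATH`. It does not verify the macOS version or whether Pillow is installed.

```python
import platform
import shutil


def check_prerequisites(system=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    `system` and `which` are injectable so the check can be exercised
    without running on macOS (hypothetical helper, not part of the skill).
    """
    problems = []
    system = system or platform.system()
    if system != "Darwin":  # platform.system() reports "Darwin" on macOS
        problems.append(f"macOS required, found {system}")
    for tool in ("swiftc", "python3"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

If `swiftc` is reported missing, `xcode-select --install` from the setup steps is the usual remedy.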
Usage
Analyze image → JSON
./scripts/image_tagger /path/to/photo.jpg
Output includes:
- faces: bounding boxes, roll/yaw/pitch, landmarks (eyes, nose, mouth)
- bodies: 18 skeleton joints with confidence scores
- hands: 21 joints per hand (left/right)
- text: OCR results with bounding boxes
- labels: scene classification (desk, outdoor, clothing, etc.)
- barcodes: QR codes, UPC, etc.
- saliency: attention and objectness regions
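As a quick way to see what a result contains, the keys listed above can be counted once the JSON has been parsed. This is a sketch under assumptions: `summarize` is a hypothetical helper, and it assumes that every key except `saliency` holds a list and that `saliency` holds a dict of region lists, as the example output suggests.

```python
def summarize(tags):
    """Count detections per category in a parsed image_tagger result."""
    list_keys = ("faces", "bodies", "hands", "text", "labels", "barcodes")
    # Missing keys are treated as "nothing detected"
    summary = {key: len(tags.get(key, [])) for key in list_keys}
    # Assumption: saliency maps a region type to a list of regions
    saliency = tags.get("saliency", {})
    summary["saliency"] = sum(len(regions) for regions in saliency.values())
    return summary


if __name__ == "__main__":
    sample = {"faces": [{}], "labels": [{}, {}],
              "saliency": {"attentionBased": [{}]}}
    print(summarize(sample))
```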
Annotate image with boxes
python3 scripts/annotate_image.py photo.jpg output.jpg
Draws colored boxes:
- 🟢 Green: faces
- 🟠 Orange: body skeleton
- 🟣 Magenta: hands
- 🔵 Cyan: text regions
- 🟡 Yellow: rectangles/objects
- Scene labels at bottom
Python integration
import json
import subprocess

def analyze(path):
    # Run the compiled tagger and capture its stdout as text
    r = subprocess.run(['./scripts/image_tagger', path], capture_output=True, text=True)
    # Skip anything printed before the first '{' and parse the rest as JSON
    return json.loads(r.stdout[r.stdout.find('{'):])

tags = analyze('photo.jpg')
print(tags['labels'])  # e.g. [{'label': 'desk', 'confidence': 0.85}, ...]
print(tags['faces'])   # e.g. [{'bbox': {...}, 'confidence': 0.99, 'yaw': 5.2}]
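One caveat with slicing from `str.find('{')`: when the output contains no `{`, `find` returns -1 and the slice silently keeps only the last character. A slightly more defensive parser might look like the sketch below. `extract_json` is a hypothetical helper name, and it still assumes the first `{` in the output starts the JSON object.

```python
import json


def extract_json(stdout):
    """Parse the first JSON object found in the tagger's stdout.

    Raises ValueError when no '{' is present instead of slicing
    from index -1.
    """
    start = stdout.find("{")
    if start == -1:
        raise ValueError("no JSON object in tagger output")
    # raw_decode stops at the end of the first object and ignores trailing text
    obj, _ = json.JSONDecoder().raw_decode(stdout[start:])
    return obj


if __name__ == "__main__":
    print(extract_json('warming up\n{"labels": []}\n'))
```

Checking the `returncode` from `subprocess.run` before parsing would be a natural companion to this.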
Example JSON Output
{
  "dimensions": {"width": 1920, "height": 1080},
  "faces": [{"bbox": {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}, "confidence": 0.99, "roll": -2, "yaw": 5}],
  "bodies": [{"joints": {"head_joint": {"x": 0.5, "y": 0.7, "confidence": 0.9}, "left_shoulder": {...}}, "confidence": 1}],
  "hands": [{"chirality": "left", "joints": {"VNHLKWRI": {"x": 0.4, "y": 0.3, "confidence": 0.85}}}],
  "text": [{"text": "HELLO", "confidence": 0.95, "bbox": {...}}],
  "labels": [{"label": "outdoor", "confidence": 0.88}, {"label": "sky", "confidence": 0.75}],
  "saliency": {"attentionBased": [{"x": 0.2, "y": 0.1, "width": 0.6, "height": 0.8}]}
}
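The bounding boxes in the example are fractions of the image size, so they need scaling by `dimensions` before being drawn or cropped. The sketch below shows one way to do that. `to_pixel_box` is a hypothetical helper. Apple Vision reports normalized rectangles with the origin at the bottom-left, but whether `image_tagger` flips the y axis before writing JSON is not stated here, so the `origin_bottom_left` flag is an assumption to verify on a test image.

```python
def to_pixel_box(bbox, width, height, origin_bottom_left=True):
    """Convert a normalized bbox dict to (left, top, right, bottom) pixels.

    Assumes x/y/width/height are fractions of the image size.
    """
    left = bbox["x"] * width
    right = (bbox["x"] + bbox["width"]) * width
    if origin_bottom_left:
        # Flip the y axis so that 0 is the top row, as Pillow expects
        top = (1.0 - bbox["y"] - bbox["height"]) * height
    else:
        top = bbox["y"] * height
    bottom = top + bbox["height"] * height
    return tuple(round(v) for v in (left, top, right, bottom))


if __name__ == "__main__":
    face = {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}
    print(to_pixel_box(face, 1920, 1080))
```

The resulting tuple has the `(left, top, right, bottom)` order that Pillow's `ImageDraw.rectangle` and `Image.crop` accept.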
Detection Capabilities
| Feature | Details |
|---|---|
| Faces | Bounding box, confidence, roll/yaw/pitch angles, 76-point landmarks |
| Bodies | 18 joints: head, neck, shoulders, elbows, wrists, hips, knees, ankles |
| Hands | 21 joints per hand, left/right chirality |
| Text (OCR) | Recognized text with confidence and bounding boxes |
| Labels | 1000+ scene/object categories (clothing, furniture, outdoor, etc.) |
| Barcodes | QR, UPC, EAN, Code128, PDF417, Aztec, DataMatrix |
| Saliency | Attention-based and objectness-based regions |
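With 1000+ possible categories, the `labels` list is often easier to use after filtering by confidence. A minimal sketch, assuming each entry has the `label` and `confidence` fields shown in the example output; `top_labels` and its default thresholds are hypothetical:

```python
def top_labels(tags, min_confidence=0.5, limit=5):
    """Return up to `limit` label names at or above `min_confidence`,
    ordered from most to least confident."""
    kept = [entry for entry in tags.get("labels", [])
            if entry["confidence"] >= min_confidence]
    kept.sort(key=lambda entry: entry["confidence"], reverse=True)
    return [entry["label"] for entry in kept[:limit]]


if __name__ == "__main__":
    sample = {"labels": [{"label": "sky", "confidence": 0.75},
                         {"label": "outdoor", "confidence": 0.88}]}
    print(top_labels(sample, min_confidence=0.8))
```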
Use Cases
- Photo tagging — Auto-tag photos with detected objects/scenes
- Posture monitoring — Track face/body position for ergonomics
- Document scanning — Extract text from images
- Security — Detect people in camera feeds
- Accessibility — Describe image contents
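Two of these use cases reduce to small checks on the parsed result. The sketch below is illustrative only: `contains_people` and `extract_text` are hypothetical helpers, and they assume the `faces`, `bodies`, and `text` fields have the shape shown in the example output.

```python
def contains_people(tags):
    """True when at least one face or body was detected."""
    return bool(tags.get("faces") or tags.get("bodies"))


def extract_text(tags, min_confidence=0.5):
    """Join recognized text lines at or above `min_confidence`.

    Lines are kept in the order the tagger reported them; no attempt
    is made to reconstruct reading order from the bounding boxes.
    """
    lines = [item["text"] for item in tags.get("text", [])
             if item.get("confidence", 0.0) >= min_confidence]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"faces": [{"confidence": 0.99}],
              "text": [{"text": "HELLO", "confidence": 0.95}]}
    print(contains_people(sample), extract_text(sample))
```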
