Vision Tagger

v1.0.0

Tag and annotate images using Apple Vision framework (macOS only). Detects faces, bodies, hands, text (OCR), barcodes, objects, scene labels, and saliency regions.

by Sagar Jha (@sagarjhaa)
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (macOS Apple Vision tagging) match the required binaries (swiftc, python3) and the included files (Swift Vision code plus a Python annotator). Minor inconsistency: the registry metadata/flags list no OS restriction while SKILL.md and the scripts explicitly require macOS; this is likely a metadata omission rather than anything malicious.
Instruction Scope
SKILL.md and scripts instruct compiling a Swift program and running it on an image, then optionally annotating with the Python script. The instructions only reference the image file(s) provided by the user and local system tools; they do not read other system config paths or request additional environment variables.
Install Mechanism
There is no remote download/install of arbitrary code; included files are compiled locally (swiftc) and Python dependencies are installed via pip. The setup script triggers xcode-select --install when swiftc is missing, which is a standard way to get Xcode CLI tools.
Credentials
The skill declares no required environment variables or credentials and the code does not attempt to access secrets or external service tokens. The requested permissions (access to local filesystem image paths and ability to run a compiled binary) are proportional to the stated purpose.
Persistence & Privilege
The skill does not request persistent global privileges, does not set always: true, and does not modify other skills or system-wide settings. It compiles a binary into its own scripts directory, which is expected for this kind of skill.
Assessment
This skill appears to do local image analysis using Apple Vision and Pillow; it requires macOS 12+ and Xcode CLI tools. Before installing: (1) only run on macOS as intended (SKILL.md requires macOS); (2) review the included Swift and Python source (they are provided) and only run the compile/install steps if you trust the source; (3) be aware the setup compiles a binary in the skill folder and installs Pillow via pip; (4) run the tool on non-sensitive images first to confirm behavior; and (5) if you want extra caution, run the setup/annotation inside a sandboxed account or VM and inspect the compiled binary with standard tools.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

👁️ Clawdis
Bins: swiftc, python3
latest: vk97br2x0dbgv163eb6112p441181b1e4
1.3k downloads
0 stars
1 version
Updated 1mo ago
v1.0.0
MIT-0

Vision Tagger

macOS-native image analysis using Apple's Vision framework. All processing is local — no cloud APIs, no API keys needed.

Requirements

  • macOS 12+ (Monterey or later)
  • Xcode Command Line Tools
  • Python 3 with Pillow

Setup (one-time)

# Install Xcode CLI tools if needed
xcode-select --install

# Install Pillow
pip3 install Pillow

# Compile the Swift binary
cd scripts/
swiftc -O -o image_tagger image_tagger.swift
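Before running the steps above, it can help to confirm the required tools are on PATH. A minimal check, with tool names taken from the setup commands (it prints status only and installs nothing):

```shell
# Report whether each tool used by the one-time setup is available.
for tool in swiftc python3 pip3; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```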

Usage

Analyze image → JSON

./scripts/image_tagger /path/to/photo.jpg

Output includes:

  • faces — bounding boxes, roll/yaw/pitch, landmarks (eyes, nose, mouth)
  • bodies — 18 skeleton joints with confidence scores
  • hands — 21 joints per hand (left/right)
  • text — OCR results with bounding boxes
  • labels — scene classification (desk, outdoor, clothing, etc.)
  • barcodes — QR codes, UPC, etc.
  • saliency — attention and objectness regions
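The JSON fields above can be consumed with the standard library alone. A minimal sketch that keeps only high-confidence scene labels and collects OCR strings — the field names follow the Example JSON Output section, and the sample data here is illustrative, not real tool output:

```python
import json

# Sample output in the documented schema (illustrative values).
raw = '''
{
  "labels": [{"label": "outdoor", "confidence": 0.88},
             {"label": "sky", "confidence": 0.45}],
  "text": [{"text": "HELLO", "confidence": 0.95}]
}
'''

tags = json.loads(raw)

# Keep scene labels the classifier is reasonably confident about.
labels = [l["label"] for l in tags["labels"] if l["confidence"] >= 0.5]

# Collect every recognized text string.
words = [t["text"] for t in tags["text"]]

print(labels)  # ['outdoor']
print(words)   # ['HELLO']
```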

Annotate image with boxes

python3 scripts/annotate_image.py photo.jpg output.jpg

Draws colored boxes:

  • 🟢 Green: faces
  • 🟠 Orange: body skeleton
  • 🟣 Magenta: hands
  • 🔵 Cyan: text regions
  • 🟡 Yellow: rectangles/objects
  • Scene labels at bottom

Python integration

import subprocess, json

def analyze(path):
    r = subprocess.run(['./scripts/image_tagger', path],
                       capture_output=True, text=True)
    r.check_returncode()  # raise if the binary exited with an error
    # Skip any non-JSON preamble the binary may print before the payload.
    start = r.stdout.find('{')
    if start == -1:
        raise ValueError('no JSON found in image_tagger output')
    return json.loads(r.stdout[start:])

tags = analyze('photo.jpg')
print(tags['labels'])  # [{'label': 'desk', 'confidence': 0.85}, ...]
print(tags['faces'])   # [{'bbox': {...}, 'confidence': 0.99, 'yaw': 5.2}]

Example JSON Output

{
  "dimensions": {"width": 1920, "height": 1080},
  "faces": [{"bbox": {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}, "confidence": 0.99, "roll": -2, "yaw": 5}],
  "bodies": [{"joints": {"head_joint": {"x": 0.5, "y": 0.7, "confidence": 0.9}, "left_shoulder": {...}}, "confidence": 1}],
  "hands": [{"chirality": "left", "joints": {"VNHLKWRI": {"x": 0.4, "y": 0.3, "confidence": 0.85}}}],
  "text": [{"text": "HELLO", "confidence": 0.95, "bbox": {...}}],
  "labels": [{"label": "outdoor", "confidence": 0.88}, {"label": "sky", "confidence": 0.75}],
  "saliency": {"attentionBased": [{"x": 0.2, "y": 0.1, "width": 0.6, "height": 0.8}]}
}
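The bounding boxes in the example above are normalized to 0–1, so they must be scaled by the reported dimensions to get pixels. A small sketch, under the assumption that Vision reports normalized coordinates with a bottom-left origin, so the y axis needs flipping for top-left-origin tools like Pillow — verify this against your own output:

```python
def bbox_to_pixels(bbox, width, height, flip_y=True):
    """Convert a normalized bbox to pixel coordinates.

    Assumes bbox = {"x", "y", "width", "height"} in the 0-1 range,
    as in the example JSON. flip_y converts from a bottom-left
    origin (Vision's convention) to a top-left origin.
    """
    x = bbox["x"] * width
    w = bbox["width"] * width
    h = bbox["height"] * height
    if flip_y:
        y = (1.0 - bbox["y"] - bbox["height"]) * height
    else:
        y = bbox["y"] * height
    return (round(x), round(y), round(w), round(h))

# Face bbox from the example output, in a 1920x1080 image.
face = {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}
print(bbox_to_pixels(face, 1920, 1080))  # (576, 432, 288, 216)
```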

Detection Capabilities

  • Faces — Bounding box, confidence, roll/yaw/pitch angles, 76-point landmarks
  • Bodies — 18 joints: head, neck, shoulders, elbows, wrists, hips, knees, ankles
  • Hands — 21 joints per hand, left/right chirality
  • Text (OCR) — Recognized text with confidence and bounding boxes
  • Labels — 1000+ scene/object categories (clothing, furniture, outdoor, etc.)
  • Barcodes — QR, UPC, EAN, Code128, PDF417, Aztec, DataMatrix
  • Saliency — Attention-based and objectness-based regions

Use Cases

  • Photo tagging — Auto-tag photos with detected objects/scenes
  • Posture monitoring — Track face/body position for ergonomics
  • Document scanning — Extract text from images
  • Security — Detect people in camera feeds
  • Accessibility — Describe image contents
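For the document-scanning case, OCR entries can be stitched into reading order by sorting on their bounding boxes. A sketch under the same assumption as the coordinate example (normalized boxes with a bottom-left origin, so a larger y means higher on the page); the sample data is illustrative:

```python
# OCR results in the documented schema (illustrative values).
text_items = [
    {"text": "world",  "bbox": {"x": 0.4, "y": 0.8, "width": 0.2, "height": 0.05}},
    {"text": "Hello",  "bbox": {"x": 0.1, "y": 0.8, "width": 0.2, "height": 0.05}},
    {"text": "footer", "bbox": {"x": 0.1, "y": 0.1, "width": 0.3, "height": 0.05}},
]

# Sort top-to-bottom (descending y under a bottom-left origin),
# then left-to-right within a line.
ordered = sorted(text_items, key=lambda t: (-t["bbox"]["y"], t["bbox"]["x"]))
print(" ".join(t["text"] for t in ordered))  # Hello world footer
```

Real scans need line grouping with a y tolerance, since boxes on the same line rarely share an exact y value; this sketch shows only the ordering idea.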
