Vision Tool

v1.1.3

Image recognition using Ollama + qwen3.5:4b with think=False for reliable content extraction.

by Ruilizhen Hu (@huruilizhen)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for huruilizhen/vision-tool.

Prompt preview (Install & Setup):
Install the skill "Vision Tool" (huruilizhen/vision-tool) from ClawHub.
Skill page: https://clawhub.ai/huruilizhen/vision-tool
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: ollama, python3
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install vision-tool

ClawHub CLI


npx clawhub@latest install vision-tool
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the implementation: the code reads an image, Base64-encodes it, and posts to an Ollama /api/chat endpoint using model qwen3.5:4b. Required binaries (ollama, python3) are appropriate and no unrelated credentials or tools are requested.
Instruction Scope
Runtime instructions only run local Python code and call the Ollama API at http://127.0.0.1:11434/api/chat; they read the provided image file and send its Base64 payload. This is coherent with the image-analysis purpose, but note that if the Ollama service URL is changed from the default, image data will be sent to whichever host that URL points to; by default the code does not exfiltrate to external endpoints.
Install Mechanism
No install spec that downloads external artifacts; included code is pure Python using the requests library. There are no archive downloads or external installers declared in the skill metadata.
Credentials
The skill declares no required environment variables or credentials. It uses sensible defaults (local Ollama URL). No secret or cloud credentials are requested, which is proportionate for a local-model vision tool.
Persistence & Privilege
always:false and user-invocable:true (defaults) — the skill does not request forced persistent inclusion or elevated platform privileges. It does not modify other skills or system-wide configs.
Assessment
This skill appears to do exactly what it claims: it reads a local image file, Base64-encodes it, and POSTs it to an Ollama /api/chat endpoint on localhost. Before installing or running it, ensure you: 1) run a trusted Ollama instance locally (ollama serve) and have pulled qwen3.5:4b, 2) confirm the Ollama service is not proxying/forwarding requests to an untrusted remote endpoint (if you change the default URL the skill will send images to wherever that URL points), and 3) review and run the included tests in a safe environment. Because the skill does not request secrets or remote installs and the code is readable, there are no incoherent or disproportionate requests — but always verify you trust the Ollama server you will use (local vs remote).

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Binaries: ollama, python3
Latest version: vk97e5029kqsh7a8enzamkwftq984qymx
113 downloads · 0 stars · 5 versions · updated 2 weeks ago
v1.1.3 · MIT-0

Vision Tool 👁️

Image recognition using Ollama + qwen3.5:4b. Uses the /api/chat endpoint for direct content extraction.

Features

Direct content extraction - Uses /api/chat endpoint for clean output
Simplified architecture - No complex thinking field processing needed
English prompts - Optimized for English language analysis
Multi-channel support - Works in WeChat, Telegram, Discord, etc.
Error handling - Full error recovery and reporting

Installation

Prerequisites

  1. Ollama service: ollama serve (must be running)
  2. qwen3.5:4b model: ollama pull qwen3.5:4b
  3. Python 3.8+: Required for running the skill

Install the skill

clawhub install vision-tool

Development Setup (For Contributors)

If you want to contribute or modify the skill, see CONTRIBUTING.md for detailed development instructions.

Basic setup:

# Clone the repository
git clone https://github.com/HuRuilizhen/vision-tool
cd vision-tool

# Set up development environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

# Run tests
python3 -m pytest tests/

Usage

Basic usage

# From any OpenClaw channel
exec: python3 /path/to/vision-tool/main.py /path/to/image.jpg

# With custom prompt
exec: python3 /path/to/vision-tool/main.py /path/to/image.jpg --prompt "Describe this image"

# Debug output
exec: python3 /path/to/vision-tool/main.py /path/to/image.jpg --debug

Channel-specific examples

WeChat Channel:

# When receiving an image
exec: python3 /path/to/vision-tool/main.py "$IMAGE_PATH"

Telegram Channel:

# Reply to photo messages
exec: python3 /path/to/vision-tool/main.py "/path/to/telegram_photo.jpg"

Discord Channel:

# Process attachments
exec: python3 /path/to/vision-tool/main.py "./discord_attachment.jpg"

Example Output

Analysis (30.7s):
------------------------------------------------------------
The user wants a description of the image provided.
**1. Overall Composition:**
- It's a top-down view of a meal served on a white tray.
- There are six distinct dishes/bowls arranged...
**2. Detailed Breakdown of Dishes:**
- **Top Left:** A small white rectangular dish...
- **Top Middle:** A small white rectangular dish...
------------------------------------------------------------

How It Works

  1. Image reading: Reads and Base64 encodes the image
  2. API call: Calls Ollama /api/chat endpoint with qwen3.5:4b
  3. Direct extraction: Gets analysis directly from content field
  4. Fallback handling: Simple cleanup if thinking field is used
  5. Output formatting: Generates clean analysis results (see the sketch after this list)
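
The flow above maps to a few lines of Python. The sketch below is illustrative, not the skill's actual code: the endpoint URL, model name, and request fields follow the defaults described on this page and the public Ollama chat API.

import base64
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"  # default local Ollama endpoint
MODEL = "qwen3.5:4b"                            # model named by this skill

def analyze(image_path: str, prompt: str = "Describe this image") -> str:
    # Steps 1-2: read the image and Base64-encode it for the request payload
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
        "think": False,   # per the skill description, disable the model's thinking output
        "stream": False,
    }
    # Steps 3-5: call /api/chat and take the analysis straight from the content field
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip()

if __name__ == "__main__":
    print(analyze("/path/to/image.jpg"))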

Performance

  • Average processing time: 25-35 seconds per image (hardware dependent)
  • Image size support: 100KB-500KB recommended
  • Token consumption: ~2000 tokens per image
  • API endpoint: Uses /api/chat for direct content access

Troubleshooting

Common Issues

  1. Ollama not running: Run ollama serve first
  2. Model not installed: Run ollama pull qwen3.5:4b (a quick check for issues 1 and 2 follows this list)
  3. Image path incorrect: Use absolute paths or correct relative paths
  4. Timeout: Model may take 30+ seconds for complex images
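
A quick way to rule out issues 1 and 2 is to query Ollama's local model list. This sketch assumes the default local URL; /api/tags is the standard Ollama endpoint that lists pulled models.

import requests

OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"  # lists locally pulled models

try:
    models = requests.get(OLLAMA_TAGS_URL, timeout=5).json().get("models", [])
except requests.ConnectionError:
    print("Ollama is not reachable - start it with: ollama serve")
else:
    names = [m.get("name", "") for m in models]
    if any(name.startswith("qwen3.5") for name in names):
        print("Ollama is running and the qwen3.5 model is available")
    else:
        print("Model not found - run: ollama pull qwen3.5:4b")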

Performance Tips

  • Compress images to under 300KB for faster processing (see the example after this list)
  • Use clear, concise prompts
  • Ensure Ollama has sufficient system resources
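
For the first tip, one way to shrink an image before analysis is a small Pillow helper (not part of the skill; the size and quality values are illustrative).

from PIL import Image  # pip install pillow

def compress_image(src: str, dst: str, max_side: int = 1024, quality: int = 85) -> None:
    # Downscale (keeping aspect ratio) and re-encode as JPEG; with these
    # defaults most photos land well under the ~300KB suggested above.
    img = Image.open(src)
    img.thumbnail((max_side, max_side))
    img.convert("RGB").save(dst, "JPEG", quality=quality, optimize=True)

compress_image("photo.png", "photo_small.jpg")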

API Reference

Python API

from vision_core import VisionAnalyzer

analyzer = VisionAnalyzer()
result = analyzer.analyze_image("image.jpg", "Describe this image")
print(result["analysis"])
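
Building on the call above, a batch example over a folder of images. It assumes analyze_image returns a dict with an "analysis" key as shown and raises on failure; the error handling here is illustrative, not documented behaviour of the skill.

import sys
from pathlib import Path

from vision_core import VisionAnalyzer

analyzer = VisionAnalyzer()
for image in sorted(Path("./photos").glob("*.jpg")):
    try:
        result = analyzer.analyze_image(str(image), "List the main objects")
        print(f"{image.name}: {result['analysis']}")
    except Exception as exc:  # e.g. Ollama unreachable or unreadable file
        print(f"{image.name}: analysis failed ({exc})", file=sys.stderr)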

Command Line

# Basic analysis
python3 main.py image.jpg

# Custom prompt
python3 main.py image.jpg --prompt "What objects are in this image?"

# Debug mode
python3 main.py image.jpg --debug

Development

File Structure

vision-tool/
├── SKILL.md          # This documentation
├── main.py           # Main skill script
├── scripts/
│   └── vision_core.py  # Core analysis engine
└── tests/
    └── test_basic.py   # Basic tests

Testing

# Test with example image
python3 main.py /path/to/test.jpg --prompt "Test analysis"

# Run unit tests
python3 -m pytest tests/

Changelog

v1.1.0 (2026-04-13)

  • Uses /api/chat endpoint for direct content extraction
  • Simplified architecture without complex thinking field processing
  • Default English prompt "Describe this image"
  • Removed regex dependencies for cleaner code

v1.0.0 (2026-04-12)

  • Initial release

Contributing

Issues and pull requests are welcome. Please ensure tests pass before submitting.

License

This skill is part of the OpenClaw ecosystem.


Ready to use in all OpenClaw channels! 🚀
