Computer Vision Expert
v1.0.0
SOTA Computer Vision Expert (2026). Specialized in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis.
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
The name and description (YOLO26, SAM 3, VLMs, deployment/optimization guidance) match the SKILL.md content. The skill requests no binaries, env vars, or config paths — which is appropriate for a purely advisory/architecture role. (Note: the skill's source/homepage are unspecified, which affects provenance but not internal coherence.)
Instruction Scope
SKILL.md contains high-level design and engineering guidance only; it does not instruct the agent to read local files, access environment variables, call external endpoints, or execute commands. There are no runtime directives or data-exfiltration patterns in the instructions.
Install Mechanism
No install specification or code files are present. Being instruction-only means nothing is written to disk by the skill itself — lowest install risk.
Credentials
The skill declares no required environment variables, credentials, or config paths. Requested capabilities (model names and deployment targets) do not imply the skill needs secrets or unrelated credentials.
Persistence & Privilege
The always flag is false and the skill has normal invocation settings. It does not request persistent presence or system configuration changes.
Assessment
This skill is a read-only expert guide and is low-risk to install. Before following its recommendations in practice, be careful when you: (1) download model weights or third-party artifacts: obtain them from official or verified releases and check licenses; (2) run example commands or snippets: review code before executing, especially if it downloads or runs binaries; (3) deploy to cloud or edge devices: avoid sharing credentials or private data with third parties. The skill's provenance is unspecified, so prefer trusted sources when acting on its implementation guidance.
Like a lobster shell, security has layers: review code before you run it.
Computer Vision Expert (SOTA 2026)
Role: Advanced Vision Systems Architect & Spatial Intelligence Expert
Purpose
To provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines, from real-time object detection with YOLO26 to foundation-model segmentation with SAM 3 and visual reasoning with VLMs.
When to Use
- Designing high-performance real-time detection systems (YOLO26).
- Implementing zero-shot or text-guided segmentation tasks (SAM 3).
- Building spatial awareness, depth estimation, or 3D reconstruction systems.
- Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU).
- Bridging classical geometry (calibration) with modern deep learning.
Capabilities
1. Unified Real-Time Detection (YOLO26)
- NMS-Free Architecture: Mastery of end-to-end inference without Non-Maximum Suppression (reducing latency and complexity).
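- Edge Deployment: Optimization for low-power hardware through Distribution Focal Loss (DFL) removal, which simplifies the exported graph. The MuSGD optimizer is a training-time feature and does not change inference cost.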
- Improved Small-Object Recognition: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings.
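To make concrete what an NMS-free detector leaves out, here is a minimal sketch of the classic greedy NMS step that older pipelines run after inference. It is plain Python for illustration only. The box format `(x1, y1, x2, y2)` and the 0.5 IoU threshold are assumptions, and this is not code from YOLO26 itself.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes and one separate box (illustrative values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

The inner overlap check makes this step quadratic in the number of candidates in the worst case, and its cost depends on the data, which is the latency and complexity that an end-to-end model avoids.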
2. Promptable Segmentation (SAM 3)
- Text-to-Mask: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right").
- SAM 3D: Reconstructing objects, scenes, and human bodies in 3D from single/multi-view images.
- Unified Logic: One model for detection, segmentation, and tracking, with a reported 2x gain over prior systems on promptable concept segmentation.
3. Vision Language Models (VLMs)
- Visual Grounding: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding.
- Visual Question Answering (VQA): Extracting structured data from visual inputs through conversational reasoning.
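Extracting structured data through VQA usually means asking the model for JSON and then validating its reply, since replies often wrap the JSON in conversational text. The sketch below uses only the standard library. The reply string and the `REQUIRED_KEYS` schema are hypothetical, and no particular VLM API is assumed.

```python
import json

# Hypothetical schema for an inspection answer; adapt to the actual task.
REQUIRED_KEYS = {"part_id", "defect", "confidence"}


def extract_json_object(text):
    """Return the first decodable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


def parse_vqa_answer(text):
    """Return the parsed record if it has every required key, else None."""
    obj = extract_json_object(text)
    if obj is None or not REQUIRED_KEYS.issubset(obj):
        return None
    return obj


# Hypothetical model reply with chatter around the JSON payload.
reply = 'Here is the result: {"part_id": "A17", "defect": "scratch", "confidence": 0.82} Anything else?'
record = parse_vqa_answer(reply)
```

Returning `None` on a missing key, rather than a partial record, lets the caller retry the prompt instead of passing incomplete data downstream.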
4. Geometry & Reconstruction
- Depth Anything V2: State-of-the-art monocular depth estimation for spatial awareness.
- Sub-pixel Calibration: Chessboard/ChArUco pipelines for high-precision stereo/multi-camera rigs.
- Visual SLAM: Real-time localization and mapping for autonomous systems.
Patterns
1. Text-Guided Vision Pipelines
- Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation.
- Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement".
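The proposal-then-refinement pattern can be sketched as glue code that pads each confident detection box and hands it to a segmenter as a box prompt. Everything here is an assumption for illustration: `segment_fn` stands in for whatever segmentation call is actually used, the detection tuples are `(box, score, label)`, and the 10% margin and 0.25 score cutoff are arbitrary.

```python
def pad_box(box, image_size, margin=0.1):
    """Grow an (x1, y1, x2, y2) box by `margin` per side, clamped to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    dx, dy = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(width, x2 + dx), min(height, y2 + dy))


def refine_detections(image, detections, segment_fn, image_size, min_score=0.25):
    """Run segment_fn on the padded box of every sufficiently confident detection."""
    results = []
    for box, score, label in detections:
        if score < min_score:
            continue  # skip weak proposals before the expensive segmentation step
        prompt_box = pad_box(box, image_size)
        results.append({
            "label": label,
            "box": prompt_box,
            "mask": segment_fn(image, prompt_box),
        })
    return results


def dummy_segmenter(image, box):
    """Placeholder segmenter (hypothetical): returns the box it was prompted with."""
    return {"prompt": box}


detections = [((100, 100, 200, 200), 0.9, "bolt"), ((0, 0, 50, 50), 0.1, "nut")]
refined = refine_detections(None, detections, dummy_segmenter, (640, 480))
```

Padding the box gives the segmenter some context around tight detections, and filtering by score first keeps the slower model off low-quality candidates.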
2. Deployment-First Design
- Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free).
- Use MuSGD for significantly faster training convergence on custom datasets.
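With an NMS-free export, post-processing can shrink to a confidence threshold over the model's final rows. The sketch below assumes each output row is `[x1, y1, x2, y2, score, class_id]`, a layout used by some end-to-end detector exports. Check the actual output tensor of your exported model before relying on it. The class names are hypothetical.

```python
def decode_end_to_end(rows, score_threshold=0.25, class_names=None):
    """Turn raw [x1, y1, x2, y2, score, class_id] rows into detection dicts.

    No NMS is applied: the model is assumed to emit at most one row per object.
    """
    detections = []
    for x1, y1, x2, y2, score, class_id in rows:
        if score < score_threshold:
            continue  # padding or low-confidence rows
        class_id = int(class_id)
        label = class_names[class_id] if class_names else str(class_id)
        detections.append({"box": (x1, y1, x2, y2), "score": score, "label": label})
    return detections


# Illustrative raw output: two real detections and one padding row.
raw_rows = [
    [34.0, 50.0, 120.0, 180.0, 0.91, 0.0],
    [300.0, 40.0, 380.0, 90.0, 0.47, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.01, 0.0],
]
detections = decode_end_to_end(raw_rows, class_names=["bolt", "nut"])
```

Because this decoder is a single linear pass, it is easy to port to the C++ or firmware side of an edge deployment.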
3. Progressive 3D Scene Reconstruction
- Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes.
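One simple way to lift a depth map into 3D is pinhole back-projection, shown here in place of a homography-based approach. The sketch assumes the depth values are metric (relative depth from a monocular model would first need scale alignment) and that the intrinsics `fx`, `fy`, `cx`, `cy` come from a prior calibration. All numbers are illustrative.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map (list of rows of z values) into (X, Y, Z) points.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 depth map in metres with one invalid pixel (illustrative values).
depth_map = [
    [2.0, 2.0],
    [0.0, 4.0],
]
cloud = backproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

For real image sizes the same arithmetic is normally vectorized over whole arrays. The loop form is used here only to keep the sketch dependency-free.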
Anti-Patterns
- Manual NMS Post-processing: Stick to NMS-free architectures (YOLO26, YOLOv10) for lower overhead.
- Click-Only Segmentation: Forgetting that SAM 3 eliminates the need for manual point prompts in many scenarios via text grounding.
- Legacy DFL Exports: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure.
Sharp Edges (2026)
| Issue | Severity | Solution |
|---|---|---|
| SAM 3 VRAM Usage | Medium | Use quantized/distilled versions for local GPU inference. |
| Text Ambiguity | Low | Use descriptive prompts ("the 5mm bolt" instead of just "bolt"). |
| Motion Blur | Medium | Shorten the exposure time (faster shutter) or rely on SAM 3's temporal tracking consistency. |
| Hardware Compatibility | Low | YOLO26's simplified architecture is highly compatible with NPUs and TPUs. |
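For the motion blur row, a rough exposure budget follows from simple arithmetic: blur in pixels is object speed times exposure time divided by the scene size of one pixel. The sketch below assumes motion parallel to the image plane and a known scale in metres per pixel. The conveyor numbers are hypothetical.

```python
def motion_blur_px(speed_m_per_s, exposure_s, metres_per_pixel):
    """Approximate blur length in pixels for motion parallel to the image plane."""
    return speed_m_per_s * exposure_s / metres_per_pixel


def max_exposure_s(speed_m_per_s, metres_per_pixel, blur_budget_px=1.0):
    """Longest exposure that keeps the blur within blur_budget_px."""
    return blur_budget_px * metres_per_pixel / speed_m_per_s


# Hypothetical conveyor: 0.5 m/s belt imaged at 0.5 mm per pixel.
exposure_limit = max_exposure_s(speed_m_per_s=0.5, metres_per_pixel=0.0005)
blur_at_4ms = motion_blur_px(0.5, 0.004, 0.0005)
```

In this example a one-pixel budget allows about 1 ms of exposure, while a 4 ms exposure would smear the part across roughly four pixels.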
Related Skills
ai-engineer, robotics-expert, research-engineer, embedded-systems
