Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

vlm-grounding

v1.0.0

Use GLM-4.7V's multimodal grounding capability to detect and locate objects/text in images. Activate when the user asks to find, locate, detect, or ground specif...

by Ji Qi (@qijimrc) · duplicate of @qijimrc/visual-grounding

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for qijimrc/vlm-grounding.

Prompt Preview: Install & Setup
Install the skill "vlm-grounding" (qijimrc/vlm-grounding) from ClawHub.
Skill page: https://clawhub.ai/qijimrc/vlm-grounding
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install vlm-grounding

ClawHub CLI


npx clawhub@latest install vlm-grounding
Security Scan

VirusTotal: Benign (View report →)

OpenClaw: Suspicious (medium confidence)
Purpose & Capability
SKILL.md describes a reasonable grounding workflow (call the model, parse boxes, draw visualizations). However, the doc references a system config path (/root/.openclaw/agents/main/agent/models.json) and internal hosts (e.g., 172.20.112.202) without declaring that it needs access to those configs or network endpoints; this is an undeclared dependency on internal configuration.
Instruction Scope
Instructions tell the agent to contact an HTTP model API and to set NO_PROXY to bypass proxying, which affects network routing. They also include guidance that could cause the agent to read or use system-local config to locate model endpoints. The SKILL.md itself contains prompt-like material, and the package contains a large session log (ssssss.json) with system/tool lists. Combined with the detected base64/unicode-control patterns, this raises concern about embedded prompt injection or unintended privileged instructions.
Install Mechanism
There is no install spec and no code files to be installed; this reduces disk-write risk. The skill is instruction-only, which is lower risk than an install that fetches and executes arbitrary archives.
Credentials
The manifest declares no env vars or credentials, but the instructions tell users to set NO_PROXY and point to internal hosts and a root-owned models.json path. That implies the skill expects access to the internal network and possibly system config; those capabilities are not declared. The included session log also exposes an 'authorization' header (Bearer idonthaveakey in the sample), an unexpected token-like artifact that could confuse downstream tooling or be misused.
Persistence & Privilege
The skill is not marked always:true and does not request persistent privileges. It appears user-invocable only, which is appropriate for this type of helper.
Scan Findings in Context
[base64-block] unexpected: Base64 blocks and similar encodings are not expected in a simple grounding instruction file and can be used for prompt injection or hidden payloads. This is a red flag to inspect the SKILL.md and any bundled files closely.
[unicode-control-chars] unexpected: Unicode control characters can be used to obfuscate instructions or attack evaluation tooling. Not expected for a straightforward grounding helper.
What to consider before installing
Treat this skill as potentially unsafe until you verify a few things:

  1. Who published it, and do you trust that owner?
  2. Inspect the bundled ssssss.json log; remove it, or understand why session/tool content and example authorization headers are included.
  3. Confirm whether the skill actually needs to read /root/.openclaw/agents/main/agent/models.json or call internal IPs. If so, restrict it to an isolated environment and ensure no sensitive networks or configs are exposed.
  4. Watch for prompt-injection patterns in SKILL.md (base64 blocks, unicode control characters); ask the author to remove hidden or encoded content and to explicitly declare any needed config paths or credentials.

If you cannot validate these points, run the skill only in a sandboxed agent or decline to install.
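If you want to run the two flagged checks yourself before deciding, here is a minimal sketch in Python; the regexes and the 40-character base64 threshold are illustrative choices, not the scanner's actual rules:

import re
from pathlib import Path

# Heuristic checks mirroring the two scan findings above.
BASE64_RUN = re.compile(r'[A-Za-z0-9+/]{40,}={0,2}')  # long base64-looking runs
CONTROLS = re.compile(r'[\u200b-\u200f\u202a-\u202e\u2066-\u2069]')  # zero-width / bidi controls

def scan(path: str) -> None:
    text = Path(path).read_text(encoding='utf-8', errors='replace')
    for m in BASE64_RUN.finditer(text):
        print(f'possible base64 block at offset {m.start()}: {m.group()[:40]}...')
    for m in CONTROLS.finditer(text):
        print(f'unicode control char U+{ord(m.group()):04X} at offset {m.start()}')

scan('SKILL.md')  # illustrative path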

Like a lobster shell, security has layers — review code before you run it.

latest: vk9772ztqjw05nfsrbrcqzm6aw9836346
185 downloads · 0 stars · 1 version
Updated 20h ago · v1.0.0 · MIT-0

Grounding - Multimodal Object Localization

Use GLM-4.7V's grounding capability to locate target objects or text in an image and output a result image with annotated bounding boxes.

Workflow

User input (image + prompt)
        │
        ▼
  HttpInterface() → call the model API → get the response text
        │
        ▼
  parse_bboxes_from_response() → parse the list of bounding boxes from the reply
        │
        ▼
  visualize_boxes(renormalize=True) → denormalize + draw boxes → save the result image

Step 1: Call the Model to Get Coordinates

Call the model API with HttpInterface:

import os
os.environ['NO_PROXY'] = '<model-host>'  # bypass the proxy
os.environ['no_proxy'] = '<model-host>'

from interface_http import HttpInterface

url = 'http://<host>:<port>/v1/chat/completions'
prompt = '''Find every "{target}" in this image and output each target's bounding box as [xmin, ymin, xmax, ymax], with coordinates as integers normalized to 0-1000. One target per line, in this format:
target name: [xmin, ymin, xmax, ymax]'''

response = HttpInterface(url, prompt, images=[image_path], no_think=True)
# Returns: "target name: [xmin, ymin, xmax, ymax]"

Note: set the NO_PROXY environment variable before calling; otherwise requests to internal hosts will be intercepted by the proxy.
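HttpInterface is bundled with the skill and its source is not shown on this page. As a rough sketch of what an equivalent call might look like against an OpenAI-compatible /v1/chat/completions endpoint (the payload layout follows that convention; the real helper's behavior, including no_think handling, may differ):

import base64
import requests

def http_call(url: str, prompt: str, images: list[str]) -> str:
    # Encode each local image as a base64 data URI, the usual convention
    # for multimodal OpenAI-compatible endpoints.
    content = [{'type': 'text', 'text': prompt}]
    for path in images:
        with open(path, 'rb') as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({'type': 'image_url',
                        'image_url': {'url': f'data:image/jpeg;base64,{b64}'}})
    payload = {'messages': [{'role': 'user', 'content': content}]}
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()['choices'][0]['message']['content']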

Step 2: Parse the Bounding Boxes

from utils_boxes import parse_bboxes_from_response

boxes = parse_bboxes_from_response(response)
# Returns: [[x1, y1, x2, y2], ...]  (normalized to 0-1000)

parse_bboxes_from_response automatically (see the sketch after this list):

  • scans backwards from the tail of the reply for truncation and widens the context window
  • walks every bracket style ([], {}, (), <>, <bbox>) to extract coordinates
  • flattens nested lists and returns a one-dimensional box list
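The bundled parser is deliberately forgiving. For illustration only, here is a much simpler regex-based stand-in that handles just the plain [xmin, ymin, xmax, ymax] format requested by the prompt above (parse_boxes_simple is a hypothetical name, not part of the skill):

import re

def parse_boxes_simple(text: str) -> list[list[int]]:
    # Match groups of four integers inside square brackets, e.g. "[12, 34, 560, 780]".
    pattern = re.compile(r'\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]')
    return [[int(n) for n in m.groups()] for m in pattern.finditer(text)]

print(parse_boxes_simple('Santa hat: [120, 80, 340, 260]'))  # [[120, 80, 340, 260]]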

Step 3: Draw and Visualize Boxes

from utils_boxes import visualize_boxes

visualize_boxes(
    img_path=image_path,
    boxes=boxes,                    # output of parse_bboxes_from_response
    labels=['label1', 'label2'],    # one label per box
    renormalize=True,               # auto-convert 0-1000 normalized coords to pixel coords
    save_path='output.jpg',
    colors=['red', 'blue'],         # optional
    thickness=[2, 3],               # optional
)

With renormalize=True, reverse_normalize_box is called internally: pixel = coord * img_dimension / 1000.
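For reference, a minimal sketch of the denormalize-and-draw step using Pillow; reverse_normalize here is an illustrative stand-in for the skill's reverse_normalize_box, and the drawing details of the real visualize_boxes may differ:

from PIL import Image, ImageDraw

def reverse_normalize(box, w, h):
    # Map 0-1000 normalized coordinates back to pixels: pixel = coord * dim / 1000.
    x1, y1, x2, y2 = box
    return [x1 * w / 1000, y1 * h / 1000, x2 * w / 1000, y2 * h / 1000]

img = Image.open('image.jpg')
draw = ImageDraw.Draw(img)
for box in [[120, 80, 340, 260]]:
    draw.rectangle(reverse_normalize(box, *img.size), outline='red', width=2)
img.save('out.jpg')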

Complete Example

import os
os.environ['NO_PROXY'] = '172.20.112.202'
os.environ['no_proxy'] = '172.20.112.202'

from interface_http import HttpInterface
from utils_boxes import parse_bboxes_from_response, visualize_boxes

url = 'http://172.20.112.202:5002/v1/chat/completions'
img = '/path/to/image.jpg'

# 1. Call the model
response = HttpInterface(
    url,
    'Find the "red Santa hat" in this image and output its coordinates as [xmin, ymin, xmax, ymax] (normalized to 0-1000)',
    images=[img],
    no_think=True,
)

# 2. Parse the boxes
boxes = parse_bboxes_from_response(response)

# 3. Draw the boxes
visualize_boxes(img_path=img, boxes=boxes, labels=['Santa hat'], renormalize=True, save_path='out.jpg')

Function Quick Reference

Function: Purpose

HttpInterface(url, prompt, images, no_think): call the model API and return the text reply
parse_bboxes_from_response(text): extract every bounding box from the model reply
find_boxes_all(text, flat=True): extract all bracket-style boxes from a text
reverse_normalize_box(box, w, h): 0-1000 normalized → pixel coordinates
visualize_boxes(..., renormalize=True): draw boxes + automatic denormalization

Notes

  • The model API address is configured in /root/.openclaw/agents/main/agent/models.json
  • You must set the NO_PROXY environment variable when calling a model on the internal network
  • no_think=True disables the model's thinking mode for faster responses
