Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

visual-grounding

v1.0.0

Use GLM-4.7V's multimodal grounding capability to detect and locate objects/text in images. Activate when user asks to find, locate, detect, or ground specif...

by Ji Qi (@qijimrc)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for qijimrc/visual-grounding.

Prompt Preview: Install & Setup
Install the skill "visual-grounding" (qijimrc/visual-grounding) from ClawHub.
Skill page: https://clawhub.ai/qijimrc/visual-grounding
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install visual-grounding

ClawHub CLI


npx clawhub@latest install visual-grounding
Security Scan
VirusTotal
Benign
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The SKILL.md describes grounding via an HTTP model API plus visualization helpers, which is consistent with the skill name. However, the doc references helper modules (interface_http, utils_boxes) that are not included in the package, and it points to an internal config path (/root/.openclaw/agents/main/agent/models.json). The registry metadata declares no required env vars or binaries, yet the instructions explicitly tell callers to set NO_PROXY and to contact an internal model host (e.g., 172.20.112.202). These are plausible for a local-model grounding skill but are not declared in the metadata.
Instruction Scope
Instructions tell the agent to set NO_PROXY, call an internal HTTP model endpoint, and parse model responses for bounding boxes, all of which is expected behavior for grounding. However, the SKILL.md also describes parsing and expanding truncated replies, and it contains obfuscation/prompt-injection signals (base64-block, unicode-control-chars). The document does not instruct arbitrary file reads, but it references internal config paths and helper modules that are not supplied, and the included guidance could be used to coax the agent into accessing internal resources. That ambiguity is concerning.
Install Mechanism
No install spec and no code files (instruction-only) — lowest-risk distribution. Nothing in the package will be written to disk by an installer step.
Credentials
The skill declares no required credentials, which matches a local-model grounding use, but the SKILL.md instructs setting NO_PROXY to bypass proxies and contains examples with an internal IP. The package also contains a large session log (ssssss.json) that exposes a tool/system prompt and an Authorization header string (Bearer idonthaveakey). Including such an internal transcript in the skill bundle is unexpected and could leak sensitive run-time details or be used to manipulate behavior; this is disproportionate for a simple grounding skill.
Persistence & Privilege
The always flag is false, and there are no install hooks or instructions to modify other skills or global agent settings. The skill does not request persistent or autonomous privileges beyond normal invocation.
Scan Findings in Context
[base64-block] unexpected: A base64 block pattern was detected in SKILL.md; a grounding/integration doc normally doesn't need encoded payloads. This can indicate obfuscated content or prompt-injection attempts and should be inspected.
[unicode-control-chars] unexpected: Unicode control characters were detected in SKILL.md. These are often used to try to hide or manipulate text rendering (prompt injection). Not expected for a straightforward grounding instruction file.
What to consider before installing
Do not install blindly. Steps to take before proceeding:

  • Verify the skill author and source; this package contains an oversized session log (ssssss.json) that is unnecessary for a grounding helper. Inspect or remove it.
  • Open SKILL.md and search for any base64 blocks or invisible/unicode-control characters; if present, ask the author to explain them or provide a clean copy.
  • Confirm the referenced helper modules (interface_http, utils_boxes) actually exist in the agent environment; the skill provides no implementation files.
  • Be cautious setting NO_PROXY or pointing at internal IPs; avoid exposing network services or credentials. If you must test, run in an isolated/sandbox agent and do not provide sensitive credentials.
  • If you plan to use an internal model endpoint, verify that models.json and the endpoint addresses come from a trusted admin and that no secrets are embedded in skill files.
  • If anything remains unclear (why the session log is included, what the obfuscated content is), contact the skill maintainer and request a minimal, clean SKILL.md plus the missing helper modules before use.
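The checklist above suggests searching SKILL.md for base64 blocks and invisible control characters. A self-contained sketch of such a check is below; the thresholds and categories are assumptions on my part, not the logic of the scanner that produced this report:

```python
import re
import unicodedata

def find_suspicious_text(text: str):
    """Flag base64-looking runs and Unicode control/format characters.

    Heuristic sketch (assumed thresholds): a 'base64 block' is any run of
    40+ base64-alphabet characters; suspicious characters are category Cc
    (excluding tab/newline/CR) plus Cf format characters such as
    zero-width spaces, which can hide injected text.
    """
    base64_runs = re.findall(r"[A-Za-z0-9+/=]{40,}", text)
    controls = [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if (unicodedata.category(ch) == "Cc" and ch not in "\t\n\r")
        or unicodedata.category(ch) == "Cf"
    ]
    return base64_runs, controls
```

Run it over the raw bytes of SKILL.md (decoded as UTF-8) and inspect any hit by hand; a clean instruction file should return two empty lists.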

Like a lobster shell, security has layers — review code before you run it.

latest: vk9721tr046tkxf3vp9drsm9209836krg
184 downloads
0 stars
1 version
Updated 20h ago
v1.0.0
MIT-0

Grounding - Multimodal Object Localization

Use GLM-4.7V's grounding capability to locate target objects or text in an image and output a result image with annotated bounding boxes.

Workflow

User input (image + prompt)
        │
        ▼
  HttpInterface() → call the model API → get the response text
        │
        ▼
  parse_bboxes_from_response() → parse the list of bounding boxes from the reply
        │
        ▼
  visualize_boxes(renormalize=True) → de-normalize + draw boxes → save the result image

Step 1: Call the model to get coordinates

Call the model API with HttpInterface:

import os
os.environ['NO_PROXY'] = '<model-host>'  # bypass the proxy
os.environ['no_proxy'] = '<model-host>'

from interface_http import HttpInterface

url = 'http://<host>:<port>/v1/chat/completions'
prompt = '''Find every "{target}" in this image and output each target's bounding box as [xmin, ymin, xmax, ymax], with coordinates as normalized integers in 0-1000. One target per line, in the format:
target name: [xmin, ymin, xmax, ymax]'''

response = HttpInterface(url, prompt, images=[image_path], no_think=True)
# Returns: "target name: [xmin, ymin, xmax, ymax]"

Note: set the NO_PROXY environment variable before calling; otherwise requests to the internal network will be intercepted by the proxy.
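The interface_http module is not shipped with the skill, so the exact behavior of HttpInterface is unknown. As a rough, hypothetical sketch of what such a helper would have to assemble, here is an OpenAI-compatible /v1/chat/completions request body with an inline base64 image; the function name and defaults are illustrative, not the missing module's API, and nothing is sent over the network:

```python
import base64

def build_chat_payload(prompt: str, image_path: str, model: str = "glm-4.7v") -> dict:
    """Build an OpenAI-compatible chat-completions payload with one image.

    Hypothetical stand-in for the missing HttpInterface helper: it only
    constructs the request body (text part + base64 data-URL image part)
    and leaves the actual HTTP POST to the caller.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

The body could then be POSTed to the endpoint with any HTTP client, and the reply text read from the first choice's message content.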

Step 2: Parse the bounding boxes

from utils_boxes import parse_bboxes_from_response

boxes = parse_bboxes_from_response(response)
# Returns: [[x1, y1, x2, y2], ...]  (0-1000 normalized)

parse_bboxes_from_response automatically:

  • checks backwards from the end of the reply for truncation and expands the context window
  • walks every bracket style ([], {}, (), <>, <bbox>) to extract coordinates
  • flattens nested lists and returns a flat list of boxes
Step 3: Draw and visualize the boxes

from utils_boxes import visualize_boxes

visualize_boxes(
    img_path=image_path,
    boxes=boxes,                    # output of parse_bboxes_from_response
    labels=['label1', 'label2'],    # one label per box
    renormalize=True,               # convert 0-1000 normalized values to pixel coordinates
    save_path='output.jpg',
    colors=['red', 'blue'],         # optional
    thickness=[2, 3],               # optional
)

With renormalize=True, it internally calls reverse_normalize_box: pixel = coord * img_dimension / 1000
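reverse_normalize_box is likewise not shipped with the skill; under the formula above, a minimal sketch of it might look like this (the rounding choice is an assumption):

```python
def reverse_normalize_box(box, img_w, img_h):
    """Convert a 0-1000 normalized [xmin, ymin, xmax, ymax] box to pixels.

    Hypothetical helper implementing pixel = coord * img_dimension / 1000:
    x-coordinates scale by image width, y-coordinates by image height.
    """
    xmin, ymin, xmax, ymax = box
    return [
        round(xmin * img_w / 1000),
        round(ymin * img_h / 1000),
        round(xmax * img_w / 1000),
        round(ymax * img_h / 1000),
    ]
```

For a 2000x1000 image, the normalized box [100, 200, 300, 400] maps to pixel box [200, 200, 600, 400].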

Full example

import os
os.environ['NO_PROXY'] = '172.20.112.202'
os.environ['no_proxy'] = '172.20.112.202'

from interface_http import HttpInterface
from utils_boxes import parse_bboxes_from_response, visualize_boxes

url = 'http://172.20.112.202:5002/v1/chat/completions'
img = '/path/to/image.jpg'

# 1. Call the model
response = HttpInterface(
    url,
    'Find the "red Christmas hat" in this image and output its coordinates as [xmin, ymin, xmax, ymax] (0-1000 normalized)',
    images=[img],
    no_think=True,
)

# 2. Parse the coordinates
boxes = parse_bboxes_from_response(response)

# 3. Draw the boxes
visualize_boxes(img_path=img, boxes=boxes, labels=['Christmas hat'], renormalize=True, save_path='out.jpg')

Utility function quick reference

Function                                        Purpose
HttpInterface(url, prompt, images, no_think)    Call the model API and return the text reply
parse_bboxes_from_response(text)                Extract every bounding box from a model reply
find_boxes_all(text, flat=True)                 Extract all bracket-style boxes in the text
reverse_normalize_box(box, w, h)                0-1000 normalized → pixel coordinates
visualize_boxes(..., renormalize=True)          Draw boxes + automatic de-normalization

Notes

  • The model API address is configured in /root/.openclaw/agents/main/agent/models.json
  • Set the NO_PROXY environment variable when calling a model on the internal network
  • no_think=True disables the model's thinking mode for faster responses
