UI Element Ops

v1.0.2

Parse UI screenshots into structured element JSON (type, OCR text, bbox) and operate desktop UI from parsed elements. Use when a user asks to detect/locate U...

0· 516· 3 versions· 4 current· 4 all-time· Updated 11h ago· MIT-0

byMuRong@murongg

Security Scans

VirusTotalSuspicious ClawScanBenign Static analysisBenign

Install

openclaw skills install ui-element-ops

UI Element Ops

Parse one or more screenshots into a machine-readable JSON schema with:

type (normalized UI element type)
bbox_px and bbox_norm
text (OCR/caption content when available)
clickable flag
optional overlay image with labeled boxes
desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
element query and orchestration via scripts/operate_ui.py (find, wait)
coordinate calibration profile for multi-display/DPI/window offset (calibrate)

Quick Start

Prepare runtime once per machine:

skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"

Parse one screenshot:

skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg

Read outputs:

<image>.elements.json
<image>.overlay.png

One-step capture + parse with randomized names:

skills/ui-element-ops/scripts/capture_and_parse.sh

Workflow

Confirm screenshot path and desired output path.
Run scripts/bootstrap_omniparser_env.sh when .venv or OmniParser weights are missing.
Run scripts/run_parse_ui.sh for standard parsing.
Report absolute output paths and summary counts: total, clickable, by_type.
Call out obvious quality risks for tiny text or dense icon layouts.
Execute desktop actions when requested:
- list elements: python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
- find elements: python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
- wait for appear/disappear: python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
- click by id: python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
- screenshot: python3 skills/ui-element-ops/scripts/operate_ui.py screenshot (defaults to user tmp dir)
- calibrate coordinates: python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>

Tunables

Edit type mapping keywords in references/type_rules.example.json.
Use advanced parser args via scripts/parse_ui.py --help.
Use --use-paddleocr only when paddleocr/paddlepaddle are installed.

Outputs

Main JSON output:
- schema_version, pipeline, image, counts, elements
- each element has id, type, bbox_px, bbox_norm, text, clickable
Overlay PNG output:
- same screenshot with labeled detection boxes

Failure Handling

Missing dependencies or weights: run bootstrap script again.
Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by run script).
CPU-only machine: expect slower inference.
Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent elements.json when possible.
Headless environment limitation:
- usable without GUI: parse/list/find/wait/calibrate on existing files.
- requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.

Version tags

latestvk9745x751aqsptq96m2gc8csm581zd22