UI Element Ops

v1.0.2

Parse UI screenshots into structured element JSON (type, OCR text, bbox) and operate desktop UI from parsed elements. Use when a user asks to detect/locate U...

0· 516· 3 versions· 4 current· 4 all-time· Updated 11h ago· MIT-0
byMuRong@murongg

Install

openclaw skills install ui-element-ops

UI Element Ops

Parse one or more screenshots into a machine-readable JSON schema with:

  • type (normalized UI element type)
  • bbox_px and bbox_norm
  • text (OCR/caption content when available)
  • clickable flag
  • optional overlay image with labeled boxes
  • desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
  • element query and orchestration via scripts/operate_ui.py (find, wait)
  • coordinate calibration profile for multi-display/DPI/window offset (calibrate)

Quick Start

  1. Prepare runtime once per machine:
skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"
  1. Parse one screenshot:
skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg
  1. Read outputs:
  • <image>.elements.json
  • <image>.overlay.png
  1. One-step capture + parse with randomized names:
skills/ui-element-ops/scripts/capture_and_parse.sh

Workflow

  1. Confirm screenshot path and desired output path.
  2. Run scripts/bootstrap_omniparser_env.sh when .venv or OmniParser weights are missing.
  3. Run scripts/run_parse_ui.sh for standard parsing.
  4. Report absolute output paths and summary counts: total, clickable, by_type.
  5. Call out obvious quality risks for tiny text or dense icon layouts.
  6. Execute desktop actions when requested:
    • list elements: python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
    • find elements: python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
    • wait for appear/disappear: python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
    • click by id: python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
    • screenshot: python3 skills/ui-element-ops/scripts/operate_ui.py screenshot (defaults to user tmp dir)
    • calibrate coordinates: python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>

Tunables

  • Edit type mapping keywords in references/type_rules.example.json.
  • Use advanced parser args via scripts/parse_ui.py --help.
  • Use --use-paddleocr only when paddleocr/paddlepaddle are installed.

Outputs

  • Main JSON output:
    • schema_version, pipeline, image, counts, elements
    • each element has id, type, bbox_px, bbox_norm, text, clickable
  • Overlay PNG output:
    • same screenshot with labeled detection boxes

Failure Handling

  • Missing dependencies or weights: run bootstrap script again.
  • Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by run script).
  • CPU-only machine: expect slower inference.
  • Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent elements.json when possible.
  • Headless environment limitation:
    • usable without GUI: parse/list/find/wait/calibrate on existing files.
    • requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.

Version tags

latestvk9745x751aqsptq96m2gc8csm581zd22