Install
openclaw skills install peekaboo-linux-gnome-wayland-computer-useSee and control GNOME Wayland desktops using gnome-screenshot, ydotool, Window Calls, AT-SPI hybrid element coordinates, and optional GNOME Remote Desktop. The Linux/Wayland counterpart to macOS computer-use ("peekaboo") — a native agent on the box can screenshot, reason over the image, locate UI elements by role/name, and click/type them, with no human, no RDP, and no global-coordinate accessibility. Covers the native single-actor loop, the remote-over-SSH path, and an optional RDP fallback for hands-on human access.
openclaw skills install peekaboo-linux-gnome-wayland-computer-useThe Linux/Wayland counterpart to macOS computer-use — "see and poke the screen of a box for troubleshooting or automation." On macOS that's trivial (CGWindowList + CGEvent + AX tree). On GNOME/Wayland it's locked down by design, but it can be made to work end-to-end:
gnome-screenshot (re-enabled on GNOME 49+ via a small GNOME extension).ydotool (writes to /dev/uinput, no root).(0,0)).Both the "see" and "act" sides work natively (an agent process running ON the box) and remotely (from another machine over SSH). A native agent is a single actor — it can screenshot, analyze, then ydotool-click in one loop. No second machine, no human, no RDP roundtrip.
Context tags below — [NATIVE] = an agent process running ON the GNOME box (sharing the desktop
session of user@1000); [REMOTE] = on another machine, reaching the box over SSH.
Verified on Ubuntu 26.04 LTS + GNOME Shell 50 (Wayland), NVIDIA + AMD multi-GPU, multi-monitor. Substitute your own host/user where the examples use
USER@HOSTand/run/user/UID.
| Capability | Native (on box) | Remote (from another machine) |
|---|---|---|
| Screenshot | yes — gnome-screenshot -f out.png (extension unlocks it) | yes — same over SSH, then scp the file |
| Inject input (mouse/keys) | yes — ydotool via /dev/uinput (no root) | yes — over SSH, same ydotool |
| Click element by role/name | yes — locate-element (AT-SPI WINDOW extents + Window Calls origin) | yes — same over SSH |
| Full see→reason→act loop | yes — single actor, no RDP | yes — drive over SSH |
| Live interactive desktop | n/a | yes — RDP client (optional, for a human) |
XDG_RUNTIME_DIR=/run/user/UID,
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus, WAYLAND_DISPLAY=wayland-0, DISPLAY=:0.
(Replace UID with the session user's numeric id, e.g. 1000.)systemd --user with the full graphical session env + lingering enabled,
they inherit all of the above and survive SSH disconnects.gnome-screenshot grabs the full virtual desktop (all monitors stitched). On multi-4K
setups that's huge — use -w (focused window) or the per-display helper below.GNOME 49+ revoked gnome-screenshot's access to the private Shell screenshot API, so on a stock
box every headless capture path is dead (see "why" below). The fix is the allow-gnome-screenshot
extension (https://github.com/siddhpant/allow-gnome-screenshot, GNOME 49/50): it re-adds
org.gnome.Screenshot to the Shell's _senderChecker._allowlistMap, so the gnome-screenshot
binary specifically regains access. It is scoped to that one sender — it does NOT enable global
unsafe_mode.
[NATIVE] — just call it (session env already present):
gnome-screenshot -f /tmp/shot.png # full virtual desktop (all monitors)
gnome-screenshot -w -f /tmp/shot.png # focused window only (smaller)
gnome-screenshot -a -f /tmp/shot.png # interactive area (needs a human at the screen — avoid headless)
# then hand /tmp/shot.png to your image-analysis step
[REMOTE] — over SSH, then pull it back:
ssh USER@HOST '
export XDG_RUNTIME_DIR=/run/user/UID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus
gnome-screenshot -f /tmp/shot.png'
scp USER@HOST:/tmp/shot.png /tmp/ # then analyze it
Sanity-check the capture — a 4K full-desktop PNG is ~15-20 MB; a tiny/near-empty file or an
all-black image means the extension got disabled or the session dropped to the GDM greeter. Quick
non-black check:
python3 -c "from PIL import Image; im=Image.open('/tmp/shot.png').convert('RGB'); d=list(im.getdata()); print(sum(1 for p in d[::997] if p!=(0,0,0)))" → should be well >0.
# [NATIVE or over SSH with the session env exported]
sudo apt-get install -y gnome-screenshot
D=~/.local/share/gnome-shell/extensions/allow-gnome-screenshot@siddh.me
git clone --depth 1 https://github.com/siddhpant/allow-gnome-screenshot.git /tmp/ags
mkdir -p "$D" && cp -r /tmp/ags/allow-gnome-screenshot@siddh.me/* "$D/"
sudo systemctl restart gdm3 # reload the Shell so it discovers the extension (see WARNING below)
# after the session returns:
export XDG_RUNTIME_DIR=/run/user/UID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus
gnome-extensions enable allow-gnome-screenshot@siddh.me
gnome-extensions info allow-gnome-screenshot@siddh.me | grep State # expect ACTIVE
gnome-screenshot -f grabs the entire virtual desktop. On multi-monitor setups that's a huge,
mostly-empty canvas. To get a single clean display, use the bundled screenshot-display helper
(scripts/screenshot-display.py). It reads live monitor geometry from Mutter DisplayConfig and
crops the full capture to the chosen monitor.
screenshot-display --list # enumerate monitors + geometry
screenshot-display --monitor primary --out /tmp/p.png # primary (default)
screenshot-display --monitor DP-4 --out /tmp/m.png # by connector name
screenshot-display --monitor 2 --out /tmp/m.png # by index (0-based, sorted L->R, T->B)
screenshot-display --monitor active --out /tmp/w.png # the FOCUSED WINDOW (best "being worked on" signal)
Each per-monitor crop is a clean single-display image — feed it straight to your vision step.
[REMOTE]: run screenshot-display over SSH (export the session env), then scp the output back.
Shell.Eval/Introspect are locked, but the Window Calls extension
(https://github.com/ickyicky/window-calls, GNOME 45-50) exposes window list + geometry over D-Bus
at /org/gnome/Shell/Extensions/Windows. The screenshot-display helper uses it to find and crop
to a window or its monitor:
screenshot-display --list-windows # wm_class, id, focus, monitor, title
screenshot-display --window focused --out /tmp/w.png # crop to focused window's exact rect
screenshot-display --window "wm_class=firefox" --out /tmp/w.png # match by wm_class substring
screenshot-display --window "title=Settings" --out /tmp/w.png # match by title substring
screenshot-display --window-display focused --out /tmp/d.png # the whole MONITOR that window sits on
screenshot-display --window-display "wm_class=Code" --out /tmp/d.png # monitor a matched window is on
Flow: List → pick window (focused / wm_class / title) → Details gives x,y,width,height,monitor
→ helper crops the full capture to the window rect, or to whichever monitor's rect contains the
window's center. (Details works; GetFrameBounds is broken in this ext version — the helper
doesn't use it.)
Full setup/enablement guide to replicate on any GNOME/Wayland box:
references/peekaboo-and-rdp-setup-ubuntu-gnome-wayland.md.
ydotool writes to /dev/uinput below the display server (the input analog of the screenshot
unlock). Install it, run ydotoold as a persistent user service (socket
/run/user/UID/.ydotool_socket), and ACL /dev/uinput to your user (no root needed at runtime).
[NATIVE]:
export YDOTOOL_SOCKET=/run/user/UID/.ydotool_socket
ydotool mousemove -a 960 540 # absolute pixel coords (no element targeting)
ydotool click 0xC0 # left click down+up
ydotool type "hello"
ydotool key 28:1 28:0 # Enter (keycode:press / keycode:release)
[REMOTE]:
ssh USER@HOST '
export XDG_RUNTIME_DIR=/run/user/UID YDOTOOL_SOCKET=/run/user/UID/.ydotool_socket
ydotool mousemove -a 700 400; ydotool click 0xC0; ydotool type "hi"'
If ydotoold.service is down — systemctl --user enable --now ydotoold.service. For element
targeting (not blind coords) use the locate-element helper below.
You do NOT need blind pixel guessing or a vision round-trip for most clicks. The bundled
locate-element helper (scripts/locate-element.py) resolves a UI element to a real absolute
coordinate and clicks/types it.
Why it works when AT-SPI looks broken: on Wayland, getExtents(SCREEN) returns (0,0) for
GTK4/Electron (no global coordinates by design), so naive AT-SPI clicking fails. But
getExtents(WINDOW) (widget-relative) is populated, and Window Calls gives the window's
screen origin from Mutter's privileged view. The helper composes them:
abs = window_origin + window_relative_extents + size/2. Proven on GTK4 apps.
locate-element --app gnome-text-editor --list # list elements + computed ABS coords
locate-element --app gnome-text-editor --role text --click # click first text widget
locate-element --app org.gnome.Nautilus --name "New Folder" --click # click by visible label
locate-element --app gnome-text-editor --role text --click --type "hello" # click then type
locate-element --window-id 1367000216 --role "push button" --name Open --click # by explicit window id
locate-element --app firefox --name Reload --print # just print "X Y", act yourself
--app matches the AT-SPI app name AND wm_class loosely (normalizes org.gnome.TextEditor ↔ gnome-text-editor).--role/--name are case-insensitive substring filters; --index N picks the Nth match.toolkit-accessibility=true (gsettings set org.gnome.desktop.interface toolkit-accessibility true),
Window Calls active, and ydotoold running.Vision fallback (only when an element zeros even WINDOW extents — custom-drawn canvases, some
Electron internals): locate-element exits non-zero and prints the fallback recipe —
screenshot-display --window 'wm_class=X' --out /tmp/w.png → analyze for pixel coords →
ydotool mousemove -a -x X -y Y.
Deep dive on the technique, the gotchas, and why Newton isn't a substitute yet:
references/wayland-atspi-hybrid-som.md.
[NATIVE agent on the box]
1. gnome-screenshot -f /tmp/s.png
2. analyze /tmp/s.png → "the Save button is at ~(1820, 640)"
3. ydotool mousemove -a 1820 640 && ydotool click 0xC0
4. gnome-screenshot again → verify it landed
No second machine, no human, no RDP. For element targeting prefer locate-element over blind coords.
Window Calls also does focus/move/resize/close verbs (window id from --list-windows):
WIN=/org/gnome/Shell/Extensions/Windows
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Activate <id> # focus/raise
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Move <id> <x> <y>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Resize <id> <w> <h>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Maximize <id>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Close <id>
Flow: Activate <id> → screenshot-display --window focused → ydotool interact.
For a person to drive the desktop hands-on from another machine, use GNOME Remote Desktop in Remote Login mode (system/headless, fresh negotiated display — NOT Desktop Sharing, which mirrors the physical multi-monitor desktop and re-hits the huge-canvas problem).
grdctl --system (system service gnome-remote-desktop.service, enabled at boot).*:3389 — always LAN-scope it in your firewall, never any-source./var/lib/gnome-remote-desktop/rdp-tls.{crt,key}.Connect from a client (Microsoft Remote Desktop / Windows App, or any RDP client) → HOST → device
user/pass → GDM → sign in as the Linux user.
⚠️ #1 RDP gotcha — credential changes need a daemon restart. grdctl --system rdp set-credentials
writes the keyfile but the running daemon does NOT hot-reload it into its NTLM SAM. Until you
sudo systemctl restart gnome-remote-desktop.service, NLA fails for every client:
ntlm_fetch_ntlm_v2_hash: Could not find user in SAM database → SEC_E_NO_CREDENTIALS →
transport_accept_nla: client authentication failure. Meanwhile grdctl status shows the correct
creds (it reads the file). Clients report this as FREERDP_ERROR_CONNECT_TRANSPORT_FAILED (RoyalTS)
or ERRCONNECT_AUTHENTICATION_FAILED (FreeRDP) — looks like transport/cert, is actually stale creds.
Fix: set creds (password via STDIN) → restart the service → verify journald shows no SAM/
NO_CREDENTIALS errors. Server requires NLA (HYBRID_REQUIRED_BY_SERVER) — expected.
Diagnosing any RDP auth failure: read the SERVER's journald, not the client error string.
Client errors all collapse different server causes into the same opaque message. Full server-log→cause
map, the +auth-only//sec:tls client probes, the cert/listener sanity checks, AND the
empty-server-log case (tcpdump proves client-vs-network — e.g. a GUI client completing TCP then
sending zero RDP bytes = client-side engine fault, fixed by recreating the connection, not touching
the server): references/rdp-auth-diagnosis-server-side.md.
For automated capture you do NOT need RDP — Capability A is simpler. RDP is for a human.
Full setup steps: references/peekaboo-and-rdp-setup-ubuntu-gnome-wayland.md (Step 7).
gnome-screenshot WITHOUT the extension → exit 0 but no file (AccessDenied to the private API under the hood).gdbus/busctl call to org.gnome.Shell.Screenshot → AccessDenied. The extension allowlists the gnome-screenshot binary's D-Bus name, not arbitrary callers — so capture must go through the gnome-screenshot binary, not a hand-rolled gdbus call.xdg-desktop-portal Screenshot → no Response even with the PermissionStore granted (forced interactive dialog).org.gnome.Mutter.ScreenCast.CreateSession → Session creation inhibited (gated to grd itself).ffmpeg kmsgrab (KMS framebuffer) → "No usable planes" on NVIDIA proprietary KMS. /dev/fb0 → all-black (GNOME doesn't scan out to fb0).grim → wlroots-only, fails on Mutter. Don't apt install grim.enable unsafe_mode via gdb injection into gnome-shell → CRASHES the Shell, drops the desktop to the GDM greeter. Do NOT use the gdb method. The extension is the safe, scoped, persistent unlock.No usable screen-coordinate accessibility on Wayland — AT-SPI getExtents(SCREEN) is (0,0) on
GTK4/Electron. BUT this is solved by Capability B+ (locate-element): WINDOW-relative AT-SPI
extents + Window Calls window origin → real click coords. Use locate-element for "click element
labeled Save"; fall back to vision-locate only where even WINDOW extents are zero.
gnome-screenshot binary is allowlisted. A raw gdbus Screenshot call still fails. Always shell out to gnome-screenshot.sudo systemctl restart gdm3). There's no in-place Alt+F2 r over SSH. The live GUI session is briefly torn down, so don't do it mid-GUI-task. If autologin is enabled the desktop recovers in ~12s; systemd --user services survive if lingering is on. Verify recovery: loginctl list-sessions shows a seat0 Type=wayland Active=yes session.unsafe_mode injection — it crashes gnome-shell. Use the extension.allow-gnome-screenshot extension can silently flip to INACTIVE (its allowlist hook only re-arms on a fresh Shell init). Symptom: gnome-screenshot -f exits 0 but writes no file. A disable/enable cycle is NOT enough — it needs a Shell reload (sudo systemctl restart gdm3) to re-arm. If you depend on capture in an automated loop, verify the file exists each time and reload-on-miss. (Window Calls can flip too — re-enable after any GDM restart.)gnome-screenshot -a (interactive area) needs a human at the screen — useless headless; use -f (full) or -w (window).-w, or downscale before analysis.ydotoold alive + YDOTOOL_SOCKET/XDG_RUNTIME_DIR. Re-enable if it dies. Raw ydotool is blind absolute pixels — prefer locate-element (Capability B+) for element targeting.XDG_RUNTIME_DIR=/run/user/UID (+ DBUS) explicitly. A native systemd --user service already has it.