Install
openclaw skills install desktop-control-for-macosGeneric macOS desktop control using AppleScript for app and window semantics plus screenshot, OCR, mouse, and keyboard workflows.
openclaw skills install desktop-control-for-macos特别做了中文兼容,包括文字输入/识别等,中文用户放心使用~
This skill controls the macOS desktop through a small, explicit pipeline with a clear split between semantic app control and visual UI control:
pyautogui.FAILSAFE = True so moving to the top-left corner aborts automationThis skill intentionally does not include AppleScript UI scripting.
Use AppleScript for:
Use screenshot-guided OCR/OpenCV plus pyautogui for:
This boundary keeps the skill predictable. AppleScript is used where semantic macOS state is strong, and pyautogui is used where direct UI manipulation is more reliable.
On macOS, screenshot coordinates and click coordinates may use different coordinate systems.
screencapture images usually use pixel coordinates.This skill writes the coordinate mapping result to a JSON file, so later steps can reuse it without recalculating.
Initialization behavior in the current version:
/tmp/macos_desktop_control/calibration.json already exists, the existing calibration is reusedDefault calibration file:
/tmp/macos_desktop_control/calibration.json
macos-desktop-control/
SKILL.md
requirements.txt
scripts/
calibration.py
init_coordinate_mapping.py
capture_screen.py
crop_image.py
locate_text_ocr.py
locate_image_opencv.py
mouse.py
keyboard.py
applescript_app.py
applescript_window.py
Install Python dependencies:
pip install -r requirements.txt
OCR uses Apple Vision through PyObjC, so no separate Tesseract install is required.
On macOS, grant the terminal or runtime app these permissions:
The first version handles Retina screens by comparing screenshot pixel size with the logical screen size used by pyautogui.
You can still run initialization manually:
python scripts/init_coordinate_mapping.py
But in normal use, the skill now performs lazy initialization automatically on first use if the calibration file is missing.
Example output:
{
"screen_width_points": 1512,
"screen_height_points": 982,
"screenshot_width_pixels": 3024,
"screenshot_height_pixels": 1964,
"scale_x": 2.0,
"scale_y": 2.0,
"mode": "retina"
}
Later scripts read this file automatically.
Current lazy-init behavior:
capture_screen.pymouse.pylocate_text_ocr.pylocate_image_opencv.pyThese scripts first check whether /tmp/macos_desktop_control/calibration.json exists.
If not, they auto-generate it once and then continue.
Capture the current screen and resize the image into the logical coordinate system used by pyautogui.position() and pyautogui.click().
This skill's default convention is:
python scripts/capture_screen.py --output /tmp/macos_desktop_control/screen_logical.png
Core idea:
import pyautogui
img = pyautogui.screenshot()
screen_w, screen_h = pyautogui.size()
# Resize screenshot to the coordinate system used by pyautogui.position() / click().
img = img.resize((screen_w, screen_h))
img.save("screen_logical.png")
When a higher-level skill already knows a target rectangle, crop it directly instead of re-opening previews or re-running visual search.
By default, crop from a logical screenshot so the crop rectangle stays in the same coordinate system as recognition and mouse targeting. Only crop from a raw Retina or pixel screenshot when there is a specific reason to preserve raw pixels, and in that case convert coordinates first using calibration data.
python scripts/crop_image.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--x1 400 --y1 300 --x2 700 --y2 650 \
--output /tmp/macos_desktop_control/crop.png
Use this for:
Use AI semantic understanding as the default first-choice locator. Use OCR or OpenCV only when AI is unsuitable, unavailable, or cannot produce a reliable target.
There are three supported strategies.
python scripts/locate_text_ocr.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--text "Confirm"
You can also constrain OCR to a specific screen region when the same text may appear in multiple places:
python scripts/locate_text_ocr.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--text "Chats" \
--x1 0 --y1 120 --x2 520 --y2 1107
The script prints the center point of the best matched Apple Vision OCR box. When a region is provided, the search runs only inside that rectangle, but the returned coordinates are still in full-screen logical coordinates.
When exact matching methods are unnecessary or brittle, use AI-based image understanding first.
This approach is the default when the target has one or more of these properties: its appearance is not fixed, there is no reusable template, the decision depends on surrounding visual context, or the target can only be described semantically.
Typical examples:
Preferred flow:
AI output contract:
found, x, y, confidence, reasonfound must be booleanx and y must be logical screen coordinates when found=trueconfidence should use a small fixed set such as high, medium, lowreason should briefly explain why the target was selected or why no reliable target was foundRecommended JSON shape:
{
"found": true,
"x": 742,
"y": 681,
"confidence": "high",
"reason": "Located the send button in the lower-right input area"
}
Safety rule for AI-driven clicks:
found=false, do not clickconfidence=low, prefer fallback or verification before clickingThis skill is responsible for screenshot capture and coordinate conversion. The caller interprets the recognition result and decides the next action.
Use Python and pyautogui to control the mouse in logical screen coordinates.
python scripts/mouse.py --action click --x 500 --y 300
python scripts/mouse.py --action move --x 500 --y 300 --duration 0.2
python scripts/mouse.py --action double-click --x 500 --y 300
python scripts/mouse.py --action right-click --x 500 --y 300
python scripts/mouse.py --action drag --x 500 --y 300 --to-x 800 --to-y 500 --duration 0.3
python scripts/mouse.py --action position
You can also pipe the result from a locate script:
python scripts/locate_image_opencv.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--template ./target_button.png \
| python scripts/mouse.py --stdin --action click
Stdin accepts either x y text or JSON like {"x": 500, "y": 300}.
Use Python and pyautogui to paste text or trigger shortcuts.
Important practical note:
python scripts/keyboard.py --action paste --text "I am OpenClaw"
printf 'I am OpenClaw' | python scripts/keyboard.py --action paste --stdin
Default input rule for this skill:
command vpython scripts/keyboard.py --action press --key enter
python scripts/keyboard.py --action hotkey --keys command v
Recommended paste workflow when text fidelity matters:
python scripts/keyboard.py --action pastecommand v to pastepython scripts/keyboard.py --action key-down --key shift
python scripts/keyboard.py --action key-up --key shift
Use AppleScript when the task is semantic macOS control rather than visual targeting.
Good fits:
python scripts/applescript_app.py --action open --app "WeChat"
python scripts/applescript_app.py --action open --path "/Applications/WeChat.app"
python scripts/applescript_app.py --action activate --app "WeChat"
python scripts/applescript_app.py --action is-running --app "WeChat"
python scripts/applescript_app.py --action frontmost-app
python scripts/applescript_app.py --action frontmost-app --json-pretty
Use AppleScript window inspection when you need app-level UI state without relying on OCR.
Good fits:
python scripts/applescript_window.py --action title --app "WeChat"
python scripts/applescript_window.py --action count --app "WeChat"
python scripts/applescript_window.py --action list --app "WeChat"
python scripts/applescript_window.py --action title --app "WeChat" --json-pretty
python scripts/locate_image_opencv.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--template ./target_button.png \
--threshold 0.8
The script prints the center point of the matched template.
Use this as a fallback when AI is not appropriate and the target has a stable reusable visual template.
Prefer AppleScript for:
Do not add AppleScript UI scripting here for button clicks or deep accessibility-tree automation. That path is intentionally excluded from this skill.
Prefer screenshot + desktop vision + pyautogui for:
Default visual targeting order in this skill:
When the same text may appear in multiple places, do not search the full screen by default. Constrain OCR to the intended region first, then click using the returned full-screen logical coordinates.
A practical sequence is often:
python scripts/applescript_app.py --action activate --app "WeChat"
python scripts/applescript_window.py --action title --app "WeChat"
python scripts/init_coordinate_mapping.py
python scripts/capture_screen.py
# first try AI semantic understanding with a bounded screenshot region when possible
# if AI cannot produce a reliable target, fall back to OCR or OpenCV
python scripts/locate_text_ocr.py --text "Confirm"
python scripts/mouse.py --action click --x 500 --y 300
python scripts/keyboard.py --action press --key enter
found, x, y, confidence, and reason before acting on AI-located targets.pyautogui.pyautogui.FAILSAFE = True; moving the mouse to the top-left corner aborts automation.