A note up front
Chinese compatibility has been given special attention, including text input and recognition, so Chinese-speaking users can use this skill with confidence.
macos-desktop-control
This skill controls the macOS desktop through a small, explicit pipeline with a clear split between semantic app control and visual UI control.
Features
🖥️ App and window control
- ✅ Activate an app by name or bundle path
- ✅ Check whether an app is running
- ✅ Read the current frontmost app
- ✅ Read front window title, count windows, and list window titles
📸 Screenshot and image operations
- ✅ Capture the current screen as a logical-resolution screenshot
- ✅ Initialize screenshot-to-click calibration for macOS Retina displays
- ✅ Crop a known rectangular region from an image
- ✅ Reuse calibration data when a workflow must mix logical and raw screenshots
🎯 Visual target location
- ✅ Locate text by OCR on screenshots
- ✅ Locate templates by OpenCV image matching
- ✅ Constrain later actions to coordinates derived from a screenshot
⌨️ Mouse and keyboard control
- ✅ Move the mouse in logical screen coordinates
- ✅ Left click, right click, double click, and drag
- ✅ Read current mouse position
- ✅ Enter text (via clipboard paste by default), press keys, and send hotkeys
- ✅ Hold and release keys explicitly when needed
🛡️ Safety and scope
- ✅ Use logical coordinates as the default working convention
- ✅ Keep app-specific UI semantics out of this skill
- ✅ Keep AppleScript usage limited to app and window semantics, not deep UI scripting
- ✅ Keep pyautogui.FAILSAFE = True so moving to the top-left corner aborts automation
Pipeline
- Use AppleScript for app and window semantics
- Initialize coordinate mapping
- Capture the screen
- Locate targets by OCR or OpenCV image matching
- Execute mouse and keyboard actions with Python
Design boundary
This skill intentionally does not include AppleScript UI scripting.
Use AppleScript for:
- opening or activating apps
- reading frontmost app state
- reading window titles and counts
Use screenshot-guided OCR/OpenCV plus pyautogui for:
- clicking UI targets
- typing into custom-drawn interfaces
- interacting with chat rows, images, canvases, or other visually defined targets
This boundary keeps the skill predictable. AppleScript is used where semantic macOS state is strong, and pyautogui is used where direct UI manipulation is more reliable.
Why initialization is needed
On macOS, screenshot coordinates and click coordinates may use different coordinate systems.
- screencapture images usually use pixel coordinates.
- Mouse automation tools often use macOS screen coordinates, also called point coordinates.
- On Retina displays, one point is commonly equal to two pixels.
This skill writes the coordinate mapping result to a JSON file, so later steps can reuse it without recalculating.
Initialization behavior in the current version:
- the skill auto-initializes on first use when the calibration file does not exist
- it does not re-run mapping on every invocation
- if /tmp/macos_desktop_control/calibration.json already exists, the existing calibration is reused
Default calibration file:
/tmp/macos_desktop_control/calibration.json
Directory layout
macos-desktop-control/
SKILL.md
requirements.txt
scripts/
calibration.py
init_coordinate_mapping.py
capture_screen.py
crop_image.py
locate_text_ocr.py
locate_image_opencv.py
mouse.py
keyboard.py
applescript_app.py
applescript_window.py
Requirements
Install Python dependencies:
pip install -r requirements.txt
OCR uses Apple Vision through PyObjC, so no separate Tesseract install is required.
On macOS, grant the terminal or runtime app these permissions:
- Screen Recording
- Accessibility
1. Initialize coordinate mapping
The first version handles Retina screens by comparing screenshot pixel size with the logical screen size used by pyautogui.
You can still run initialization manually:
python scripts/init_coordinate_mapping.py
But in normal use, the skill now performs lazy initialization automatically on first use if the calibration file is missing.
Example output:
```json
{
  "screen_width_points": 1512,
  "screen_height_points": 982,
  "screenshot_width_pixels": 3024,
  "screenshot_height_pixels": 1964,
  "scale_x": 2.0,
  "scale_y": 2.0,
  "mode": "retina"
}
```
Later scripts read this file automatically.
Current lazy-init behavior applies to:
- capture_screen.py
- mouse.py
- locate_text_ocr.py
- locate_image_opencv.py
These scripts first check whether /tmp/macos_desktop_control/calibration.json exists.
If not, they auto-generate it once and then continue.
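For orientation, here is a minimal sketch of that load-or-create logic, assuming pyautogui provides both the logical size and the raw screenshot. Field names follow the example output above; the actual scripts may differ in detail, and the "standard" mode label is a placeholder assumption.

```python
import json
import os

import pyautogui

CALIBRATION_PATH = "/tmp/macos_desktop_control/calibration.json"

def load_or_create_calibration():
    # Reuse the calibration file if a previous run already wrote it.
    if os.path.exists(CALIBRATION_PATH):
        with open(CALIBRATION_PATH) as f:
            return json.load(f)
    # Compare logical screen size (points) with raw screenshot size (pixels).
    screen_w, screen_h = pyautogui.size()
    shot = pyautogui.screenshot()
    data = {
        "screen_width_points": screen_w,
        "screen_height_points": screen_h,
        "screenshot_width_pixels": shot.width,
        "screenshot_height_pixels": shot.height,
        "scale_x": shot.width / screen_w,
        "scale_y": shot.height / screen_h,
        # "standard" is an assumed label; the real script may name it differently.
        "mode": "retina" if shot.width > screen_w else "standard",
    }
    os.makedirs(os.path.dirname(CALIBRATION_PATH), exist_ok=True)
    with open(CALIBRATION_PATH, "w") as f:
        json.dump(data, f, indent=2)
    return data
```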
2. Capture screen
Capture the current screen and resize the image into the logical coordinate system used by pyautogui.position() and pyautogui.click().
This skill's default convention is:
- default screenshot is logical
- default recognition result coordinates are logical
- default mouse action coordinates are logical
- default crop operations should use a logical screenshot
- only use calibration conversion when a workflow explicitly mixes logical screenshots with raw pixel screenshots
python scripts/capture_screen.py --output /tmp/macos_desktop_control/screen_logical.png
Core idea:
```python
import pyautogui

img = pyautogui.screenshot()
screen_w, screen_h = pyautogui.size()

# Resize the screenshot to the coordinate system used by
# pyautogui.position() / pyautogui.click().
img = img.resize((screen_w, screen_h))
img.save("screen_logical.png")
```
3. Crop image regions
When a higher-level skill already knows a target rectangle, crop it directly instead of re-opening previews or re-running visual search.
By default, crop from a logical screenshot so the crop rectangle stays in the same coordinate system as recognition and mouse targeting.
Only crop from a raw Retina or pixel screenshot when there is a specific reason to preserve raw pixels, and in that case convert coordinates first using calibration data.
python scripts/crop_image.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--x1 400 --y1 300 --x2 700 --y2 650 \
--output /tmp/macos_desktop_control/crop.png
Use this for:
- extracting a detected chat image thumbnail
- saving a button or dialog region for later analysis
- debugging screenshot-to-action pipelines
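The core of the crop is a single Pillow call; a minimal sketch (the real crop_image.py presumably adds argument parsing and validation on top):

```python
from PIL import Image

# Crop (x1, y1, x2, y2) from a logical-resolution screenshot so the
# rectangle stays in the same coordinate system as mouse targeting.
img = Image.open("/tmp/macos_desktop_control/screen_logical.png")
crop = img.crop((400, 300, 700, 650))  # (left, upper, right, lower)
crop.save("/tmp/macos_desktop_control/crop.png")
```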
4. Locate targets
There are two supported strategies.
Locate by OCR text
python scripts/locate_text_ocr.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--text "确定"
You can also constrain OCR to a specific screen region when the same text may appear in multiple places:
python scripts/locate_text_ocr.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--text "会话" \
--x1 0 --y1 120 --x2 520 --y2 1107
The script prints the center point of the best matched Apple Vision OCR box.
When a region is provided, the search runs only inside that rectangle, but the returned coordinates are still in full-screen logical coordinates.
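For orientation, here is a minimal sketch of what a Vision-backed text lookup can look like through PyObjC. The actual locate_text_ocr.py may structure its matching, region handling, and output differently; treat this as an assumption-laden illustration, not the script's implementation.

```python
import Vision
from Foundation import NSURL
from PIL import Image

def locate_text(image_path, query):
    # Image size in logical pixels; Vision returns normalized boxes.
    img_w, img_h = Image.open(image_path).size
    handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
        NSURL.fileURLWithPath_(image_path), {}
    )
    request = Vision.VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLanguages_(["zh-Hans", "en-US"])  # Chinese + English
    success, error = handler.performRequests_error_([request], None)
    for obs in request.results() or []:
        if query in obs.topCandidates_(1)[0].string():
            box = obs.boundingBox()  # normalized, origin at bottom-left
            cx = (box.origin.x + box.size.width / 2) * img_w
            cy = (1 - (box.origin.y + box.size.height / 2)) * img_h
            return cx, cy  # center point in full-image logical coordinates
    return None
```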
Locate by OpenCV image matching
python scripts/locate_image_opencv.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--template ./target_button.png \
--threshold 0.8
The script prints the center point of the matched template.
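Template matching follows the standard OpenCV pattern; a minimal sketch of the core idea:

```python
import cv2

screen = cv2.imread("/tmp/macos_desktop_control/screen_logical.png")
template = cv2.imread("./target_button.png")

# Normalized correlation; max_val is the best match score (1.0 is perfect).
result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val >= 0.8:  # same role as the --threshold flag
    h, w = template.shape[:2]
    center = (max_loc[0] + w // 2, max_loc[1] + h // 2)
    print(center)  # center of the matched template, logical coordinates
```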
5. Mouse actions
Use Python and pyautogui to control the mouse in logical screen coordinates.
Single click
python scripts/mouse.py --action click --x 500 --y 300
Move only
python scripts/mouse.py --action move --x 500 --y 300 --duration 0.2
Double click
python scripts/mouse.py --action double-click --x 500 --y 300
Right click
python scripts/mouse.py --action right-click --x 500 --y 300
Drag
python scripts/mouse.py --action drag --x 500 --y 300 --to-x 800 --to-y 500 --duration 0.3
Read current mouse position
python scripts/mouse.py --action position
You can also pipe the result from a locate script:
python scripts/locate_image_opencv.py \
--image /tmp/macos_desktop_control/screen_logical.png \
--template ./target_button.png \
| python scripts/mouse.py --stdin --action click
Stdin accepts either plain "x y" text or JSON such as {"x": 500, "y": 300}.
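All of these actions reduce to a handful of pyautogui calls in logical coordinates; a minimal sketch of the core:

```python
import pyautogui

pyautogui.FAILSAFE = True  # moving to the top-left corner aborts automation

pyautogui.moveTo(500, 300, duration=0.2)  # logical (point) coordinates
pyautogui.click(500, 300)
pyautogui.doubleClick(500, 300)
pyautogui.rightClick(500, 300)

pyautogui.moveTo(500, 300)
pyautogui.dragTo(800, 500, duration=0.3, button="left")

print(pyautogui.position())  # current mouse position, logical coordinates
```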
6. Keyboard actions
Use Python and pyautogui to paste text or trigger shortcuts.
Important practical note:
- this skill uses clipboard paste for all text entry by default, including English
- this avoids input-method issues with Chinese, English, and mixed-language text
- do not use simulated typing for text entry in this skill
Paste text
python scripts/keyboard.py --action paste --text "我是OpenClaw"
Paste from stdin
printf '我是OpenClaw' | python scripts/keyboard.py --action paste --stdin
Default input rule for this skill:
- use clipboard paste for all text input by default, including English
- click the verified input field first, then paste with command v
- do not use simulated typing for text entry in this skill
Press one key
python scripts/keyboard.py --action press --key enter
Press a hotkey
python scripts/keyboard.py --action hotkey --keys command v
Recommended paste workflow when text fidelity matters:
- copy the exact text into the clipboard, preferably via python scripts/keyboard.py --action paste
- click the verified input field
- let the script send command v to paste
- verify visually before pressing enter if sending would be externally visible
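A minimal sketch of the paste core, assuming pyperclip for clipboard access (keyboard.py may use a different clipboard mechanism):

```python
import pyautogui
import pyperclip  # assumption: any clipboard helper works here

# Put the exact text on the clipboard, then paste into the focused field.
pyperclip.copy("我是OpenClaw")
pyautogui.hotkey("command", "v")
# Press enter only after visually verifying the pasted text.
pyautogui.press("enter")
```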
Hold and release keys
python scripts/keyboard.py --action key-down --key shift
python scripts/keyboard.py --action key-up --key shift
7. AppleScript app control
Use AppleScript when the task is semantic macOS control rather than visual targeting.
Good fits:
- open or activate an app
- check whether an app is running
- read the current frontmost app
Open by app name
python scripts/applescript_app.py --action open --app "微信"
Open by bundle path
python scripts/applescript_app.py --action open --path "/Applications/微信.app"
Activate an app
python scripts/applescript_app.py --action activate --app "微信"
Check whether an app is running
python scripts/applescript_app.py --action is-running --app "微信"
Get the current frontmost app
python scripts/applescript_app.py --action frontmost-app
python scripts/applescript_app.py --action frontmost-app --json-pretty
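Under the hood, each of these semantic queries is a single osascript call; a minimal sketch with a hypothetical run_osascript helper (the real applescript_app.py may be structured differently):

```python
import subprocess

def run_osascript(script: str) -> str:
    # Hypothetical helper: thin wrapper around the osascript CLI.
    result = subprocess.run(
        ["osascript", "-e", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

run_osascript('tell application "微信" to activate')
frontmost = run_osascript(
    'tell application "System Events" to '
    'get name of first application process whose frontmost is true'
)
print(frontmost)
```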
8. AppleScript window inspection
Use AppleScript window inspection when you need app-level UI state without relying on OCR.
Good fits:
- read the front window title
- count windows for a process
- list window titles for a process
Read the front window title
python scripts/applescript_window.py --action title --app "微信"
Count windows
python scripts/applescript_window.py --action count --app "微信"
List window titles
python scripts/applescript_window.py --action list --app "微信"
python scripts/applescript_window.py --action title --app "微信" --json-pretty
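The window queries follow the same pattern through System Events, reusing the hypothetical run_osascript helper sketched above:

```python
app = "微信"
title = run_osascript(
    f'tell application "System Events" to tell process "{app}" to get name of front window'
)
count = run_osascript(
    f'tell application "System Events" to tell process "{app}" to count windows'
)
titles = run_osascript(
    f'tell application "System Events" to tell process "{app}" to get name of every window'
)
```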
9. When to use AppleScript vs desktop vision
Prefer AppleScript for:
- opening or activating apps
- reading window titles
- checking the frontmost app
- simple app and process state queries
Do not add AppleScript UI scripting here for button clicks or deep accessibility-tree automation. That path is intentionally excluded from this skill.
Prefer screenshot + OCR/OpenCV + pyautogui for:
- buttons or labels that only exist visually
- apps with weak or unstable accessibility hierarchies
- targets inside custom-drawn UIs such as chat rows, images, or canvas content
- direct manipulation such as clicking, dragging, and typing into app surfaces
When the same text may appear in multiple places, do not search the full screen by default.
Constrain OCR to the intended region first, then click using the returned full-screen logical coordinates.
A practical sequence is often:
- AppleScript activates the app
- AppleScript reads window or process state
- screenshot-based vision finds the target
- mouse or keyboard automation performs the action
- AppleScript or a fresh screenshot verifies the result
10. Recommended flow
python scripts/applescript_app.py --action activate --app "微信"
python scripts/applescript_window.py --action title --app "微信"
python scripts/init_coordinate_mapping.py
python scripts/capture_screen.py
python scripts/locate_text_ocr.py --text "确定"
python scripts/mouse.py --action click --x 500 --y 300
python scripts/keyboard.py --action press --key enter
Notes
- Version 1 assumes a Retina display and a single primary screen.
- Treat logical screenshots as the default working surface for this skill.
- Treat recognition output coordinates as logical unless a script explicitly says otherwise.
- Treat mouse and keyboard targeting as logical by default.
- Treat crop rectangles as logical by default, and prefer cropping from a logical screenshot.
- If another skill mixes logical screenshots with raw Retina or pixel screenshots, use calibration conversion deliberately. Do not assume logical bounds match raw pixel bounds 1:1.
- Keep this skill focused on generic desktop primitives. App-specific UI semantics, business rules, and event pipelines should stay in the higher-level app skill.
- All click, drag, move, and typing actions use Python and pyautogui.
- AppleScript support in this skill is limited to app control and window inspection.
- For safety, keep pyautogui.FAILSAFE = True; moving the mouse to the top-left corner aborts automation.