desktop-agent-ops

# Desktop Agent Ops Use this skill as a **main-agent operating manual** for desktop GUI tasks. --- ## MANDATORY: Auto-setup gate (FIRST ACTION, every time) ```bash python3 <SKILL_DIR>/scripts/first_run_setup.py --check ``` If `"ready": false`, run setup (installs EVERYTHING automatically): ```bash python3 <SKILL_DIR>/scripts/first_run_setup.py ``` **Auto-installs on first run:** 1. Platform detection (macOS / Windows / Linux) 2. `cliclick` + `tesseract` (macOS via brew; Linux guide printed) 3. OCR language packs auto-detected from system locale (中文→chi_sim, 日本語→jpn, etc.) 4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip) 5. OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings 6. Smoke test (screenshot + mouse move verification) After setup, set `$PY` for ALL subsequent calls: ``` PY=<output.env.DESKTOP_AGENT_OPS_PYTHON> ``` **Do NOT proceed if setup is not ready.** --- ## Core Execution Loop Every desktop task follows this loop. No exceptions. ``` 1. auto-setup gate ← run once per session 2. init task context ← create isolated temp directory 3. FOCUS the target app ← bring app to front, confirm frontmost 4. GET window bounds ← know exact position and size 5. CAPTURE that window ← screenshot ONLY the target window 6. ANALYZE the capture ← read screenshot or run OCR 7. LOCATE target via OCR ← find text/button within window bounds 8. VERIFY before acting ← move cursor, screenshot with cursor, confirm 9. EXECUTE one action ← click, type, scroll, press key 10. CAPTURE again ← screenshot to see result 11. VERIFY the result ← did the UI change as expected? 12. → if more steps, go to 5 13. CLEANUP ← remove task temp directory ``` **Key principles:** - One action at a time. Never chain blind actions. - Always verify after each action. If verification fails, recapture and retry — do NOT guess. - Always work within a specific window. Never click based on full-screen assumptions. --- ## Window-Scoped Targeting (THE CORRECT WAY) **NEVER do OCR or clicking on a full-screen screenshot.** Always scope to the target app window. ### The 6-Step Pipeline ``` ┌─────────────────────────────────────────────────────────┐ │ Step 1: FOCUS the target app │ │ $PY desktop_ops.py focus-app --name "AppName" │ │ → brings app to front │ ├─────────────────────────────────────────────────────────┤ │ Step 2: GET window bounds │ │ $PY desktop_ops.py front-window-bounds --app "AppName"│ │ → {x, y, width, height} in logical coordinates │ ├─────────────────────────────────────────────────────────┤ │ Step 3: CAPTURE only that window │ │ $PY desktop_ops.py capture-region --x X --y Y │ │ --width W --height H --output /tmp/window.png │ ├─────────────────────────────────────────────────────────┤ │ Step 4: OCR within the window │ │ $PY ocr_text.py --app "AppName" --python $PY │ │ → abs_box coordinates are INSIDE the window │ ├─────────────────────────────────────────────────────────┤ │ Step 5: VERIFY before clicking │ │ $PY desktop_ops.py move --x TX --y TY │ │ $PY desktop_ops.py screenshot --with-cursor │ │ → confirm cursor is on the right element │ ├─────────────────────────────────────────────────────────┤ │ Step 6: CLICK only if verified │ │ $PY desktop_ops.py click --x TX --y TY │ │ $PY desktop_ops.py screenshot → verify result │ └─────────────────────────────────────────────────────────┘ ``` ### Shortcut (RECOMMENDED for most targeting): ```bash $PY scripts/target_resolver.py --app "AppName" --text "按钮文字" --python $PY ``` This single command: focuses app → gets bounds → OCR within window → returns `best_candidate` with `{x, y, within_window}`. ### Why window-scoped matters: | Approach | Risk | |----------|------| | ❌ Full-screen OCR | "搜索" in WeChat AND Chrome → clicks wrong app | | ✅ Window-scoped | "搜索" ONLY in WeChat window → correct click | --- ## Failure Recovery (CRITICAL) When something fails, follow these rules: ### OCR finds nothing 1. Re-focus the app: `focus-app --name "AppName"` 2. Re-get bounds: `front-window-bounds --app "AppName"` (window may have moved/resized) 3. Take a fresh screenshot and read it visually 4. Try a different region label (e.g. `content_area` instead of `bottom_input`) 5. Try lowering OCR confidence: `--min-conf 30` ### Click doesn't work 1. Screenshot with cursor to check cursor position 2. The window may have moved — re-get bounds 3. Try clicking a few pixels offset from the OCR center 4. Check if a dialog/popup is blocking the target ### App state changed (login screen, dialog, etc.) 1. ALWAYS re-get window bounds after any major UI change 2. ALWAYS re-run OCR after navigation or state change 3. Never reuse old coordinates — they may be stale ### General retry rule - Maximum 3 retries per action - Each retry must recapture fresh state - If 3 retries fail, report the failure with screenshots and stop --- ## Generalization: How to Apply This to ANY App The pipeline works for any desktop application. Here is how to reason about new apps: ### Step-by-step for ANY new app: 1. **Identify the app name** exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings") 2. **Focus and get bounds** — this tells you the window's exact position 3. **Screenshot the window** — look at what's on screen 4. **Identify the target** — what text, button, or area do you need to interact with? 5. **Use OCR to find it** — `target_resolver.py --app "AppName" --text "target text"` 6. **Verify and click** ### Common patterns across apps: | Task | How to do it | |------|-------------| | Click a button | OCR find text → verify → click | | Type in a field | OCR find field label → click field → `type --text` | | Search for something | OCR find search box → click → type query → press return | | Scroll a list | Get window bounds → scroll at window center with `--x --y` | | Switch between apps | `focus-app --name "OtherApp"` → re-get bounds | | Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one | | Navigate menus | Click menu item → wait → screenshot → OCR new menu → click | | Select from dropdown | Click dropdown → wait → OCR options → click selection | | Read screen content | OCR the window → extract all text boxes | | Verify an action | Screenshot before and after → compare or OCR for expected text | ### App-specific adaptations: | App type | Special considerations | |----------|----------------------| | Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use `insert-newline` for multi-line; verify send mechanism | | Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs | | System Settings | Deep navigation; panels change; re-get bounds after each navigation | | File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation | | Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area | --- ## Text Input and Send Rules ### Typing text ```bash $PY scripts/desktop_ops.py type --text "your message" ``` - Uses **clipboard paste** as primary method on all platforms — reliable for all languages including CJK - macOS: `set the clipboard to` + `Cmd+V` (single osascript call) - Windows: PowerShell `Set-Clipboard` + `Ctrl+V` (falls back to `clip.exe`) - Linux: `xclip` + `Ctrl+V` - First click on the input field to focus it before typing ### Multi-line messages ```bash $PY scripts/desktop_ops.py type --text "first line" $PY scripts/desktop_ops.py insert-newline $PY scripts/desktop_ops.py type --text "second line" ``` - Use `insert-newline` for literal line breaks - Do NOT use `\n` in `type --text` — it may trigger send in some apps ### Sending a message 1. **Preferred**: Look for a visible send button (e.g., `发送`) via OCR, then click it 2. **Alternative**: Use `press --key return` ONLY when the app is verified to use Enter-to-send 3. **Never guess** which send method to use — verify first ### Backend priority (macOS) | Operation | Primary | Fallback | |-----------|---------|----------| | `type` | Clipboard paste | cliclick (ASCII only) | | `press` | AppleScript `key code` | cliclick `kp:` | | `hotkey` | cliclick `kd:/t:/ku:` | pyautogui | | `click` | cliclick | pyautogui | > **Important**: cliclick `kp:return` is NOT recognized by WeChat — always use AppleScript for key press. > **Important**: cliclick `t:` silently drops CJK characters — always use clipboard paste for text input. --- ## DPI / HiDPI / Retina (All Platforms) **Handled automatically.** No manual DPI work needed. | Platform | Common scales | Detection method | |----------|---------------|-----------------| | macOS Retina | 2.0x | screenshot pixels ÷ logical screen bounds | | Windows HiDPI | 1.25x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() | | Linux X11 | 1.0x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() | OCR output: `box` = logical (use for mouse), `pixel_box` = raw pixels, `dpi_scale` = factor. --- ## CLI Quick Reference (EXACT parameter names) **CRITICAL**: Use EXACTLY these names. Do NOT guess. ### desktop_ops.py ```bash $PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor] $PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor] $PY scripts/desktop_ops.py frontmost $PY scripts/desktop_ops.py list-apps $PY scripts/desktop_ops.py front-window-bounds [--app NAME] $PY scripts/desktop_ops.py focus-app --name "App Name" $PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS] $PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle] $PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle] $PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left] $PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal] $PY scripts/desktop_ops.py mouse-position $PY scripts/desktop_ops.py press --key KEY $PY scripts/desktop_ops.py type --text "text to type" $PY scripts/desktop_ops.py insert-newline [--count N] $PY scripts/desktop_ops.py hotkey --keys cmd c $PY scripts/desktop_ops.py screen-size $PY scripts/desktop_ops.py pixel-color --x X --y Y ``` ### ocr_text.py ```bash $PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto] $PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto] ``` ### target_resolver.py ```bash $PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY $PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY $PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY ``` ### task_context.py / cleanup_task.py ```bash $PY scripts/task_context.py init --task-id "my-task" # aliases: create, --name $PY scripts/task_context.py show --task-id "my-task" $PY scripts/cleanup_task.py --task-id "my-task" ``` ### window_regions.py ```bash $PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL] ``` Labels: `top_search`, `left_sidebar`, `left_sidebar_top`, `title_header`, `content_area`, `toolbar_row`, `bottom_input`, `primary_action` --- ## Workflow Examples ### Example 1: Click a button by text (any app) ``` 1. $PY first_run_setup.py --check → ready: true 2. $PY task_context.py init --task-id "click-button" 3. $PY desktop_ops.py focus-app --name "AppName" 4. $PY desktop_ops.py front-window-bounds --app "AppName" → {x, y, w, h} 5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY → best_candidate: {x:450, y:520, within_window:true} 6. $PY desktop_ops.py move --x 450 --y 520 7. $PY desktop_ops.py screenshot --with-cursor → verify cursor on "OK" 8. $PY desktop_ops.py click --x 450 --y 520 9. $PY desktop_ops.py screenshot → verify result 10. $PY cleanup_task.py --task-id "click-button" ``` ### Example 2: Type and search ``` 1. $PY desktop_ops.py focus-app --name "Safari" 2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY → {x:300, y:80, within_window:true} 3. $PY desktop_ops.py click --x 300 --y 80 4. $PY desktop_ops.py type --text "hello world" 5. $PY desktop_ops.py press --key return 6. $PY desktop_ops.py screenshot → verify search results ``` ### Example 3: Send a chat message (WeChat, Slack, etc.) ``` 1. $PY desktop_ops.py focus-app --name "WeChat" 2. $PY desktop_ops.py front-window-bounds --app "WeChat" 3. # Navigate to the right conversation (OCR sidebar or search) 4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY 5. $PY desktop_ops.py click --x <found_x> --y <found_y> 6. # Verify conversation is open 7. $PY desktop_ops.py screenshot → confirm conversation title 8. # Click the input field 9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY OR: click at the bottom center of the window 10. $PY desktop_ops.py type --text "Hello!" 11. # Send: prefer visible send button; if not available, use press --key return 12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY IF found: $PY desktop_ops.py click --x <x> --y <y> ELSE: $PY desktop_ops.py press --key return 13. $PY desktop_ops.py screenshot → verify message sent ``` ### Example 4: Scroll a list and find an item ``` 1. $PY desktop_ops.py focus-app --name "AppName" 2. $PY desktop_ops.py front-window-bounds --app "AppName" → {x:100, y:50, w:800, h:600} 3. # Scroll down in the window center $PY desktop_ops.py scroll --amount -5 --x 500 --y 350 4. $PY desktop_ops.py screenshot → check if target visible 5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY 6. IF not found: scroll more and retry (max 5 scrolls) 7. IF found: click it ``` ### Example 5: Handle an unexpected dialog ``` 1. # During any operation, if the expected UI doesn't match: 2. $PY desktop_ops.py screenshot → examine what's on screen 3. # If a dialog is visible, OCR it: $PY ocr_text.py --app "AppName" --python $PY 4. # Find and click the appropriate button (OK, Cancel, Allow, etc.) $PY target_resolver.py --app "AppName" --text "OK" --python $PY 5. $PY desktop_ops.py click --x <x> --y <y> 6. # After dialog is dismissed, re-get window bounds and continue $PY desktop_ops.py front-window-bounds --app "AppName" ``` --- ## Reference Documents Load as needed: | Document | When to read | |----------|-------------| | `references/workflow.md` | Core 8-step closed loop | | `references/platform-macos.md` | macOS-specific tools and permissions | | `references/platform-windows.md` | Windows setup | | `references/platform-linux.md` | Linux X11/Wayland setup | | `references/operation-patterns.md` | Reusable task templates | | `references/validation-patterns.md` | Two-stage validation | | `references/precise-targeting.md` | 5-layer precision targeting | | `references/target-providers.md` | Provider ordering and fallback contract | | `references/coordinate-reconstruction.md` | Rebuild click coordinates from screenshot evidence | | `references/chat-app-macos.md` | Chat app workflow | | `references/app-wechat-desktop.md` | Cross-platform WeChat guidance | | `references/cleanup-rules.md` | Cleanup timing and scope | | `references/collaboration-rules.md` | When multi-agent collaboration is justified | | `references/example-cases.md` | Repeatable task examples | | `references/reproducible-setup.md` | Host bring-up checklist | ## Scope Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API. ## Hard Rules 1. **Always run auto-setup gate first** 2. **Always use EXACT parameter names from CLI reference — never guess** 3. **Always scope OCR to the target app window — NEVER full-screen OCR** 4. **Always: focus-app → front-window-bounds → OCR within window → verify → act** 5. **Always pass `--python $PY` to ocr_text.py and target_resolver.py** 6. **Always verify coordinates are within window bounds before clicking** 7. **Always re-get window bounds after any UI state change (login, dialog, navigation)** 8. **Use `insert-newline` for line breaks; never use `\n` in `type --text`** 9. **For send actions: prefer visible send button; use `press --key return` only when verified** 10. **One action at a time; verify after each** 11. **Maximum 3 retries per action; each retry must recapture fresh state** 12. **Cleanup is mandatory at task end** 13. **If verification fails, recapture and rebuild — do not retry blindly**

desktop-agent-ops

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

desktop-agent-ops