ui-agent

# UIAgent: Universal UI Automation Skill **Status:** ✅ Production Ready (v1.0) **Tests:** 15/15 passing (100% with real evidence) **License:** MIT **Python:** 3.9+ --- ## Description UIAgent is a production-grade **browser and desktop automation framework** that works without HTML selectors, fragile identifiers, or brittle XPath expressions. It combines: - **Chrome DevTools Protocol (CDP)** for intelligent browser control - **Native OS APIs** (X11, Windows UIA, macOS Accessibility) for desktop automation - **Evidence-based verification** (screenshot hashing, DOM inspection, file verification) - **VirtualBox & headless support** (proven on VirtualBox, works on bare metal) Use it to automate: - Complex web workflows (multi-step login, form filling, error recovery) - Dynamic websites with unstable selectors - Desktop applications (terminal, text editors, file managers) - Cross-browser session management and persistence - Integration testing with visual proof --- ## Quick Start ### Installation ```bash # Add to your project git clone https://github.com/yourusername/uiagent.git cd uiagent pip install -r requirements.txt ``` ### Minimal Example ```python from src.chrome_session_vbox_fixed import get_ctrl import time # Launch browser ctrl = get_ctrl() # Navigate ctrl._send("Page.navigate", {"url": "https://example.com"}) time.sleep(2) # Fill a form field ctrl.js('document.getElementById("email").value = ""') ctrl.js('document.getElementById("email").focus()') ctrl._send("Input.insertText", {"text": "user@example.com"}) time.sleep(0.3) # Verify email = ctrl.js('document.getElementById("email").value') print(f"Filled: {email}") # → user@example.com # Read title title = ctrl.js('document.title') print(f"Title: {title}") ``` ### Real Test Example ```python from src.chrome_session_vbox_fixed import get_ctrl from src.verify_helpers import screen_hash import time ctrl = get_ctrl() ctrl._send("Page.navigate", {"url": "https://example.com"}) time.sleep(2) # BEFORE state hash_before = screen_hash(ctrl) print(f"BEFORE: {hash_before}") # Make change ctrl.js('document.body.style.backgroundColor = "red"') time.sleep(0.5) # AFTER state hash_after = screen_hash(ctrl) print(f"AFTER: {hash_after}") # Verify change was real assert hash_before != hash_after, "Change not detected" print("✅ Change verified via screenshot hash") ``` --- ## API Reference ### Chrome Control (`src/cdp_typer.py`) #### `get_ctrl()` → CDPTyper Launch or reuse a Chrome instance with VirtualBox fixes. ```python ctrl = get_ctrl() ``` **Returns:** `CDPTyper` instance connected to Chrome DevTools Protocol **Features:** - Auto-reuses existing Chrome if healthy - Cleans lock files on VirtualBox - Waits for CDP readiness (tabs loaded) - 5-minute timeout on startup --- #### `ctrl._send(method, params)` → dict Send a CDP command to Chrome and return result. ```python result = ctrl._send("Runtime.evaluate", { "expression": "document.title", "returnByValue": True }) # → {"result": {"value": "Page Title"}} ``` **Common commands:** - `Page.navigate` - Navigate to URL - `Runtime.evaluate` - Run JavaScript - `Input.insertText` - Type text - `Storage.getCookies` - Read cookies (for session persistence) **Full CDP reference:** [ChromeDevTools Protocol](https://chromedevtools.github.io/devtools-protocol/) --- #### `ctrl.js(code)` → result Execute JavaScript in page context and get result. ```python title = ctrl.js('document.title') value = ctrl.js('document.getElementById("email").value') color = ctrl.js('getComputedStyle(document.body).backgroundColor') ``` **Returns:** JavaScript result (strings, objects, booleans, etc.) --- #### `ctrl.click(x, y)` Click at pixel coordinates (CDP method). ```python # Get element position pos = ctrl.js(""" (() => { const el = document.getElementById("button"); const r = el.getBoundingClientRect(); return {x: r.left + r.width/2, y: r.top + r.height/2}; })() """) # Click center of element ctrl.click(pos['x'], pos['y']) ``` --- #### `ctrl.screenshot(filepath)` → bytes Take a screenshot and save to file. ```python ctrl.screenshot("/tmp/page.png") print("Screenshot saved") # Check size import os size = os.path.getsize("/tmp/page.png") print(f"Size: {size} bytes") ``` --- ### Verification Helpers (`src/verify_helpers.py`) #### `screen_hash(ctrl)` → str Get MD5 hash of rendered page (change detection). ```python hash_before = screen_hash(ctrl) ctrl.js('document.body.innerHTML = "<h1>Changed</h1>"') hash_after = screen_hash(ctrl) assert hash_before != hash_after, "Page didn't change" ``` **Returns:** 32-character MD5 hex string **Use for:** Detecting visual changes without pixel-level comparison --- #### `current_url(ctrl)` → str Get current page URL. ```python url = current_url(ctrl) print(f"Current: {url}") assert "example.com" in url, "Wrong page" ``` --- #### `dom_exists(ctrl, selector)` → bool Check if element exists in DOM (not hidden). ```python if dom_exists(ctrl, "#submit-button"): ctrl.js('document.querySelector("#submit-button").click()') else: print("Button not found") ``` --- ### Desktop Automation (`src/desktop_helpers.py`) #### `launch(app, *args, wait=2)` → (proc, display) Launch a desktop application. ```python proc, display = launch("gedit", wait=2) # → Running: gedit on DISPLAY=:99 ``` **Common apps:** - `"gedit"` - Text editor - `"nautilus"` - File manager - `"gnome-terminal"` - Terminal - `"firefox"` - Browser **Returns:** (subprocess.Popen, display_string) --- #### `type_text(text, display=None)` Type text via X11 xdotool (for desktop apps). ```python proc, display = launch("gedit", wait=2) type_text("Hello, UIAgent!", display=display) # Gedit now contains: "Hello, UIAgent!" ``` **Uses:** xdotool for X11 keyboard simulation --- #### `press_key(key, display=None)` Press a key (Tab, Enter, Ctrl+S, etc.). ```python press_key("ctrl+s", display=display) # Save press_key("Tab", display=display) # Next field press_key("Return", display=display) # Submit ``` **Common keys:** - `"Tab"`, `"Return"`, `"Escape"` - `"ctrl+c"`, `"ctrl+s"`, `"ctrl+z"` - `"alt+f4"` --- ### Session Persistence (v1.0 Feature) #### Cookie Survival Across Chrome Restart **The Problem:** Chrome kills without flushing cookies to SQLite in headless mode. **The Solution:** Use JavaScript + CDP Storage API ```python # Before kill: Save cookies from memory result = ctrl._send("Storage.getCookies", {}) saved_cookies = result.get("cookies", []) # Kill Chrome from src.chrome_session_vbox_fixed import close close() time.sleep(2) # Relaunch ctrl2 = get_ctrl() # Restore cookies via JavaScript for cookie in saved_cookies: js = f"document.cookie = '{cookie['name']}={cookie['value']}; path=/; secure; samesite=none';" ctrl2.js(js) # Navigate to verify ctrl2._send("Page.navigate", {"url": "https://httpbin.org/cookies"}) time.sleep(2) page = ctrl2.js("document.body.innerText") assert cookie['value'] in page, "Cookie not persisted" print("✅ Cookie survived restart") ``` **Why this works:** 1. `Storage.getCookies` reads Chrome's in-memory cookie store (no SQLite dependency) 2. `document.cookie =` writes directly to browser memory (instant, no disk needed) 3. No reliance on database flush timing or locks --- ## Patterns & Best Practices ### Pattern 1: Form Filling with Verification ```python def fill_form(ctrl, data): """Fill form fields and verify each one.""" for field_id, value in data.items(): # Clear field ctrl.js(f"document.getElementById('{field_id}').value = ''") # Focus ctrl.js(f"document.getElementById('{field_id}').focus()") time.sleep(0.2) # Type via CDP (more reliable than sendKeys) ctrl._send("Input.insertText", {"text": value}) time.sleep(0.2) # Verify immediately actual = ctrl.js(f"document.getElementById('{field_id}').value") assert actual == value, f"Field {field_id}: expected '{value}', got '{actual}'" print("✅ All fields verified") # Usage fill_form(ctrl, { "firstName": "John", "lastName": "Doe", "email": "john@example.com" }) ``` ### Pattern 2: Error Detection & Recovery ```python def navigate_with_recovery(ctrl, url, max_retries=3): """Navigate with automatic retry on 404/error.""" for attempt in range(max_retries): ctrl._send("Page.navigate", {"url": url}) time.sleep(3) # Check for error indicators body = ctrl.js("document.body.innerText") or "" if any(kw in body.lower() for kw in ["404", "not found", "error"]): print(f"Attempt {attempt+1}: Error page detected, retrying...") continue # Success print(f"✅ Page loaded on attempt {attempt+1}") return True print("❌ Failed after all retries") return False # Usage navigate_with_recovery(ctrl, "https://example.com/sometimes-404") ``` ### Pattern 3: Multi-Tab Coordination ```python def open_and_switch_tabs(ctrl, urls): """Open multiple tabs and return results from each.""" results = {} for url in urls: # Open new tab result = ctrl._send("Target.createTarget", {"url": url}) tab_id = result.get("targetId") time.sleep(2) # Query current tab page_title = ctrl.js("document.title") results[url] = page_title return results # Usage tabs = open_and_switch_tabs(ctrl, [ "https://wikipedia.org", "https://github.com", "https://google.com" ]) # → {url: title, ...} ``` ### Pattern 4: Screenshot-Based Assertion ```python def assert_page_changed(ctrl, description=""): """Before/after screenshot assertion.""" hash1 = screen_hash(ctrl) # Do something ctrl.js('document.body.style.opacity = "0.5"') time.sleep(0.5) hash2 = screen_hash(ctrl) assert hash1 != hash2, f"Page didn't change: {description}" print(f"✅ Visual change confirmed: {description}") # Usage assert_page_changed(ctrl, "Clicked submit button") ``` --- ## Architecture ### Component Stack ``` ┌──────────────────────────────┐ │ Your Automation Script │ └──────────────┬───────────────┘ │ ┌──────▼──────────┐ │ CDPTyper Core │ (cdp_typer.py) │ + Helpers │ └──────┬──────────┘ │ ┌──────────┼──────────┐ │ │ │ ┌───▼──┐ ┌───▼────┐ ┌──▼───┐ │Chrome│ │Terminal│ │Desktop│ │ CDP │ │Command │ │ APIs │ └──────┘ └────────┘ └───────┘ ``` ### Key Components | File | Lines | Purpose | |------|-------|---------| | `src/cdp_typer.py` | 950+ | Chrome DevTools Protocol implementation | | `src/chrome_session_vbox_fixed.py` | 167 | VirtualBox-safe Chrome launcher | | `src/verify_helpers.py` | 120+ | Verification (hashing, DOM, file checks) | | `src/desktop_helpers.py` | 150+ | Desktop app automation (X11) | --- ## Test Evidence (v1.0) All 15 tests pass with **real, measured BEFORE/AFTER values**: ### Browser Tests (13) - ✅ Contenteditable typing - ✅ Form filling with tab navigation - ✅ HTML5 video playback - ✅ Google search workflow - ✅ Shadow DOM access - ✅ Complex form filling (4 fields) - ✅ Canvas drawing (4,091 pixels) - ✅ Multi-tab management (1→3 tabs) - ✅ Keyboard navigation - ✅ 404 error recovery - ✅ Session persistence (full restart) ### Desktop Tests (2) - ✅ Terminal command execution - ✅ Text editor file save - ✅ File manager launch **Full evidence:** [tests/](./tests/) directory --- ## Troubleshooting ### "Chrome exited immediately" **Cause:** Chrome can't start (likely VirtualBox environment) **Solution:** ```bash # Ensure Xvfb is running pgrep Xvfb # Should show process # Or start it Xvfb :99 -screen 0 1920x1080x24 & # Then set DISPLAY export DISPLAY=:99 ``` --- ### "CDP not ready after 20s" **Cause:** Chrome started but tabs not loaded **Solution:** ```python # Add longer wait time.sleep(5) # Instead of 2-3 seconds # Or check manually try: ctrl = get_ctrl() except RuntimeError as e: print(f"Chrome issue: {e}") # Kill and retry close() time.sleep(3) ctrl = get_ctrl() ``` --- ### "Focus not moving between fields" **Cause:** Website JavaScript intercepting focus events **Solution:** ```python # Don't use Tab key on complex sites # Instead, use direct JavaScript focus # ❌ Don't do this: ctrl.key("Tab") # ✅ Do this: ctrl.js('document.getElementById("password").focus()') time.sleep(0.3) ``` --- ## Performance **Typical metrics (on VirtualBox):** - Page load: 2-3 seconds - Form fill (5 fields): 1-2 seconds - Screenshot hash: 200-500ms - DOM query: 50-100ms **Optimizations:** - Reuse `ctrl` instance (don't launch Chrome multiple times) - Use `time.sleep(0.2)` between CDP commands (not 1s) - Cache screenshot hashes if checking same page repeatedly --- ## Version History ### v1.0 (Current) - ✅ 15/15 tests passing - ✅ Chrome DevTools Protocol automation - ✅ VirtualBox support (Xvfb + lock cleanup) - ✅ Desktop automation (X11) - ✅ Session persistence (JavaScript + Storage API) - ✅ Real evidence-based verification ### v1.1 (Planned) - Vision Agent (screenshot analysis + element detection) - Wayland support - Windows Native support --- ## FAQ **Q: Does it work on Windows?** A: Not yet (v1.0 uses X11). Windows Native support coming in v1.1. **Q: Can it use selectors instead?** A: Yes, `ctrl.js('document.querySelector(...)')` works fine. But CDP + JS is more reliable. **Q: How do I test without seeing the browser?** A: That's the whole point! Runs headless on Xvfb, no display needed. **Q: Can it handle JavaScript-heavy sites?** A: Yes, it waits for CDP readiness. For dynamic content, add `time.sleep()` after navigation. --- ## Support & Contributing - **Issues:** Report bugs with full test output - **PRs:** Must include real test evidence (before/after values) - **Docs:** Update this SKILL.md if adding new features --- **Made with ❤️ for automation engineers.**

ui-agent

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

ui-agent

ui-agent

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement