The repository now includes a DesktopManager.Cli project that exposes a small, noun-based command tree over the existing DesktopManager C# library.
- Keep the CLI aligned with the current C# surface area.
- Provide a stable foundation for MCP hosting.
- Reuse existing window, monitor, and layout APIs rather than duplicating desktop logic.
desktopmanager window list
desktopmanager window geometry
desktopmanager window exists
desktopmanager window active-matches
desktopmanager window wait
desktopmanager window wait-visual-change
desktopmanager window type
desktopmanager window keys
desktopmanager window move
desktopmanager window click
desktopmanager window drag
desktopmanager window scroll
desktopmanager window focus
desktopmanager window minimize
desktopmanager window snap
desktopmanager control list
desktopmanager control diagnose
desktopmanager control exists
desktopmanager control wait
desktopmanager control click
desktopmanager control set-text
desktopmanager control send-keys
desktopmanager monitor list
desktopmanager process start
desktopmanager process start-and-wait
desktopmanager screenshot desktop
desktopmanager screenshot window
desktopmanager screenshot target
desktopmanager target save
desktopmanager target get
desktopmanager target list
desktopmanager target resolve
desktopmanager visual save
desktopmanager visual get
desktopmanager visual list
desktopmanager visual assert
desktopmanager visual resolve
desktopmanager control-target save
desktopmanager control-target get
desktopmanager control-target list
desktopmanager control-target resolve
desktopmanager layout save
desktopmanager layout apply
desktopmanager layout assert
desktopmanager layout list
desktopmanager snapshot save
desktopmanager snapshot restore
desktopmanager snapshot list
desktopmanager diagnostic hosted-session
desktopmanager workflow prepare-coding
desktopmanager workflow prepare-screen-sharing
desktopmanager workflow clean-up-distractions
desktopmanager mcp serve
desktopmanager mcp serve --allow-mutations
desktopmanager mcp serve --allow-mutations --allow-process notepad
desktopmanager mcp serve --dry-run
layoutstores named JSON files under%AppData%\DesktopManager\layouts.snapshotstores named JSON files under%AppData%\DesktopManager\snapshots.screenshotstores generated PNG files under%AppData%\DesktopManager\captureswhen--outputis not provided.targetstores reusable JSON target definitions under%AppData%\DesktopManager\targets.visualstores reusable JSON metadata plus PNG baseline images under%AppData%\DesktopManager\visual-baselines.control-targetstores reusable JSON control selector definitions under%AppData%\DesktopManager\control-targets.monitor listreports the desktop-coordinate bounds used by monitor screenshots.- snapshots currently reuse the window layout format and are therefore windows-only for now.
process startlaunches a desktop application and can optionally wait for input idle and for a launched window to appear.process startcan now also validate the launched window by title or class and optionally require that a real matching window be found before returning.process start-and-waitnow packages the safer unattended launch flow: start the app, bind the follow-up wait to the launched process, return the resolved window result, and optionally capture before/after evidence.process start-and-wait --follow-process-familyis an explicit opt-in for apps that surface their visible window from a same-name helper or broker process after launch-time correlation finishes.window waitpolls for a matching window and returns when one appears.window wait-visual-changepolls a matching window, client area, or saved target region until the pixels materially change, which makes opaque modern-app flows verifiable without depending on structural UIA exposure.visual savecaptures a reusable baseline for a whole window, the client area, or a saved target region so later runs can assert that the same UI still looks right.visual assertcompares a live window, client area, or saved target region against a stored baseline and returns sampled change metrics instead of forcing agents to hand-roll image diffs.visual resolvesearches a live window or client area for a previously saved baseline image and returns the best match coordinates, which makes saved visual regions reusable as anchors instead of screenshot-only artifacts.visual read-textruns Windows OCR over a live window, client area, or saved target region and returns recognized text plus line and word bounds.visual resolve-textturns that OCR output into a reusable text anchor by returning the best visible match coordinates for a query string.window click --ocr-text <text>turns that same OCR text-anchor path into a direct action, so visible labels can drive clicks without pre-saved targets or app-specific adapters.window drag --start-ocr-text <text> --end-ocr-text <text>andwindow scroll --ocr-text <text>extend that same OCR text-anchor path to richer pointer actions, so visually obvious surfaces can stay generic even when structure is weak.window click --visual-baseline <name>turns that saved visual region into a direct action path, so agents can click a previously seen button or panel without first hand-copying resolved coordinates.window drag --start-visual-baseline <name> --end-visual-baseline <name>andwindow scroll --visual-baseline <name>extend that same anchor model to richer pointer actions.- anchor-driven window mutations now return the resolved region and action point they actually used, which makes debugging and agent retries much more concrete.
window existsandwindow active-matchesprovide non-mutating verification commands.control existsandcontrol waitprovide the same inspect-first verification model for controls.control assert-valueadds a stronger reusable assertion when a workflow depends on the resolved field content, not just control presence.control diagnoseexplains which discovery path was used, how many Win32 and UIA controls were actually found, and what each probed UIA root returned.control diagnosecan now also take--target <name>, so saved control targets and ad-hoc selectors share the same diagnostics path.control diagnose --action-probeadds a read-only UIA action-resolution probe for the first matched UIA control, so you can verify cached action-match reuse without clicking anything.control diagnosenow includes elapsed times for the overall diagnostic pass, and--action-probeadds a separate elapsed time for the read-only action-resolution probe.controlworks with child window controls and can also use UI Automation-oriented selectors.control listnow returns shared control bounds metadata, which makes control discovery more actionable for follow-up clicks or diagnostics.control listalso returns shared capability metadata so you can tell whether a control supports background-safe click, text, or key actions before invoking it.- control selectors can now match
value,enabled, andfocusablestate through the shared library. - control selectors can now also match capability flags such as
background-click,background-text,background-keys, andforeground-fallback. --ensure-foregroundprovides a shared opt-in reliability hint for UIA-heavy control queries.control set-textand handle-backedcontrol send-keysnow use shared direct-to-control message routing instead of relying on foreground focus.- UIA control actions now reuse the same shared fallback-root search strategy as UIA discovery, which reduces “listed but not actionable” mismatches when modern apps expose controls under Chromium-style child roots.
- zero-handle UIA text and key fallback paths are now shared too, but they are intentionally opt-in because they rely on focused foreground input for modern apps.
- when zero-handle UIA text fallback is enabled, the shared library now prefers a focused replace-and-paste flow with verification before it falls back to raw typed input, which is notably more reliable for Chromium-style edit fields.
window typesends text to the target window, either by simulated typing or clipboard paste.window type --foreground-inputrequires real foreground keyboard delivery and fails instead of silently falling back to background window messaging, which is a better fit for remote-session hosts such as RDP, Hyper-V, and Remote Desktop Manager.window type --physical-keysadds a layout-aware physical-key typing mode for foreground targets, which is often closer to how password managers "type" into hosted remote sessions.window type --hosted-sessionis a convenience profile for RDP, Hyper-V, and Remote Desktop Manager style targets. It enables a US-style foreground scancode path with slower defaults that are safer for hosted editors.window type --scriptpreserves multiline formatting, chunks long lines into smaller typed segments, and can be combined with either the default delivery path or the stricter foreground typing modes.- mutating
windowcommands now support--verify, which re-queries the mutated window after the action and reports an observed postcondition instead of only the request outcome. --verify-tolerance-pxtunes geometry verification for commands likewindow move; specifying it also implies--verify.- the verification block is action-aware for
window move,window focus, andwindow minimize, and falls back to honest presence-only observation for other window mutations such as typing and pointer input. - hosted-session live diagnostics now write repo-local artifacts under
Artifacts\HostedSessionTyping, including a raw JSON snapshot and a companion*.summary.txtfile with the likely focus-culprit category and retry summary. diagnostic hosted-sessionreads the newest hosted-session artifact (or a specific one) and can return either the compact summary text or a structured record.- hosted-session diagnostic artifacts now trim older entries automatically, keeping the newest artifact sets so the folder stays readable during repeated harness runs.
window keyssends key chords or single keys to the target window after activating it, which is the safer shared follow-up path for Enter, Escape, and similar actions when modern controls stop being structurally reusable after text entry.- mutating
windowandcontrolcommands can now return shared verification metadata:success,elapsedMilliseconds,safetyMode, optional target name/kind, best-effort before/after screenshots, artifact warnings, and for verified window mutations an explicitverificationblock with observed counts, summary text, and notes. - those mutating commands now also accept
--capture-before,--capture-after, and--artifact-directoryso CLI, MCP, and agent workflows can ask for evidence without changing the core action logic. workflow prepare-codingcan optionally apply a named layout and then focus a likely editor or terminal window.workflow prepare-screen-sharingcan optionally apply a named layout, minimize common distractions, and then focus a likely sharing window.workflow clean-up-distractionsexposes the same shared distraction-minimizing logic as a standalone structured step.- workflow results can include
resolvedWindowfor the explicit target window when the workflow can resolve it, but callers should still treat focus and target resolution as best-effort and rely onNoteswhen Windows blocks the normal path. layout assertnow verifies that the current desktop satisfies a saved named layout within configurable geometry tolerances and optional state matching, which makes saved layouts reusable as assertions instead of restore-only state.window click,window drag, andwindow scrollprovide shared window-relative fallbacks for modern apps when structural control discovery is unavailable.window wait-visual-changeis the matching non-mutating verification primitive for those fallback actions, so agents can wait for real visual change instead of guessing from timing alone.- mutating
windowandcontrolcommands can now also ask for built-in visual-change verification, which waits for a real pixel delta after the action and returns the observed change metrics in the same structured result. windowcommands support exact handle targeting and active-window targeting for safer selection when multiple windows match.window geometryreturns both outer-window and client-area bounds, which makes screenshot-assisted targeting much easier.window click,window drag, andwindow scrollnow also support normalized ratios from0to1for less brittle targeting.target savelets you persist a reusable client-area or window-relative point once and reuse it fromwindow click,window drag, andwindow scroll.target savecan now also persist a reusable target area viawidth/heightorwidthRatio/heightRatio, which makes screenshot-assisted visual targeting much more reusable.target resolveshows the exact screen-space point a named target maps to for a live window.screenshot targetandscreenshot window --target <name>can now capture a resolved named target area directly.control-target savelets you persist a reusable control selector and capability profile once, then resolve it later against live windows.control-target resolveshows which live control a saved target matches, including its current capabilities and parent window.control click,control set-text, andcontrol send-keyscan now reuse a saved control target via--target.control list,control exists, andcontrol waitcan also reuse a saved control target via--target, which makes repeated modern-app inspection much less repetitive.- the shared UIA layer now remembers a preferred root inside the current process after a successful modern-app lookup, and
control diagnoseexposes whether that preferred root was reused. - the shared UIA layer now also keeps a very short-lived in-process cache of enumerated root controls, which helps repeated modern-app control reads and diagnostics in long-lived sessions.
- repeated UIA actions in the same long-lived process now also try a cached exact-match lookup before they fall back to a broader root walk.
- the shared control wait path now prefers already-seen matching window handles inside the same process before it falls back to broad rediscovery, which is safer for stable modern-app windows.
screenshot windownow prefers real window rendering before falling back to screen pixels, which improves captures for covered windows.window typestill falls back to direct message-based delivery by default when Windows refuses to foreground the target window, which avoids leakingSendInputtext into whatever app currently owns focus.window type --foreground-inputdisables that fallback and skips directWM_SETTEXTverification, so it behaves more like deliberate keyboard typing than background control mutation.window type --physical-keysbuilds on the strict foreground path and prefers real keyboard-layout key combinations before it falls back to Unicode packets for characters that have no physical-key mapping.window type --hosted-sessioncurrently wraps a foreground US-style scancode path with slower pacing defaults. It requires the hosted editor surface to already own focus before typing starts, and it now aborts immediately if foreground ownership changes while typing.window type --script --foreground-inputis the preferred shared path when you need to type a multiline script into an RDP, Hyper-V, or Remote Desktop Manager hosted editor without relying on clipboard paste.- when the hosted-session harness goes inconclusive, inspect the matching
Artifacts\HostedSessionTyping\*.summary.txtcompanion first. It now calls out whether the interruption looked like a repeated browser/Electron focus steal, mixed contention, or no retained external culprit. process startnow prefers windows from the launched process and then newer post-launch window handles for the target app, which is safer than binding to any older matching window.process start --require-windowis now a useful shared primitive for unattended workflows that need a validated target window instead of a best-effort launcher result.mcp servehosts a stdio MCP server.mcp servenow defaults to read-only inspection so agents can connect safely before any mutation is enabled.mcp serve --allow-mutationsenables mutating MCP tools for an intentional session.mcp serve --allow-process <pattern>and--deny-process <pattern>constrain live desktop mutations to specific process patterns.mcp serve --allow-foreground-inputis a second explicit opt-in for zero-handle UIA text/key fallback that may need focused foreground input.mcp serve --dry-runpreviews mutating tool calls without changing desktop or saved state.- when process filters are active, broad layout/snapshot/workflow mutations that cannot be scoped to one process are intentionally blocked.
window,monitor,layout, andsnapshotscale better than flat verbs.processandscreenshotadd the first inspect-launch-wait loop needed for desktop automation.process start-and-waitturns that inspect-launch-wait loop into one shared structured result instead of leaving the correlation logic to every caller.controlandwindow typeadd the first direct interaction layer for classic desktop controls.window keysrounds out the shared whole-window input path for accelerators and commit keys without forcing agents back into ad-hoc foreground hacks.window click,window drag, andwindow scrollgive CLI, MCP, and PowerShell the same coordinate-based fallback path when UIA-heavy apps stay opaque.targetturns screenshot-assisted coordinate fallback into reusable state instead of one-off manual ratios.visualturns screenshot-assisted verification into reusable state instead of making every workflow keep one-off reference images and custom compare logic.- the same saved visual baseline can now double as a reusable template anchor, so agents can relocate a previously seen button or panel before clicking it.
- that same anchor can now drive
window clickdirectly, which is a better fit for generic operator loops than forcing a separate resolve step in every caller. - the same anchor family now also drives drag and scroll, which means a previously saved visual region can remain useful even when the next interaction is not a simple click.
- area-capable
targetdefinitions now let the shared core reuse visual regions, not just click points. control-targetturns modern-app control discovery into reusable state instead of repeating long UIA selector sets each time.workflowpackages a few multi-step desktop routines into shared structured results instead of leaving them as prompts or one-off agent logic.layout assertturns named layouts into reusable verification assets, not just restore assets.- when a saved control target points at a modern Chromium-style app, the first resolution can still take a couple of seconds because shared UIA discovery is the expensive part of the workflow.
- those fallbacks now also support client-area coordinates, which are usually a better fit for browser and editor content than raw outer-window coordinates.
- screenshot JSON now includes window geometry metadata for window captures, so agents can map screenshots to client-area coordinates without extra probing.
- the CLI mirrors existing concepts already present in the library and PowerShell module.
- the CLI and MCP server reuse the same desktop operations and storage conventions.
- window selection, control geometry, and text-entry reliability now live in the shared C# library so CLI, MCP, and PowerShell stay aligned.
- Child-window targeting is still the simplest path for classic Win32 controls.
- UIA discovery and action fallback now work through the shared library, but selector validation is still wise before unattended runs.
control diagnoseis the fastest way to understand why a modern app did or did not expose controls through the shared library, because it now shows per-root UIA probe details instead of only a single aggregate count.- preferred UIA root reuse only helps inside a long-lived process like MCP or an in-process wait loop. Separate one-shot CLI invocations still start fresh.
- the short-lived UIA control cache is also process-local, so it mainly helps MCP, in-process waits, and repeated diagnostics inside the same host session.
- For opaque modern apps, the most reliable fallback flow is now:
screenshot window --json, inspectGeometry, then use ratio-basedwindow click,window drag, orwindow scrollwith--client-area. - For opaque modern apps, pair that fallback with
window wait-visual-changefor immediate feedback andvisual save/visual assertwhen the workflow needs a reusable “still looks right” checkpoint across runs.
When a modern app exposes unstable structure, prefer this repeatable flow:
desktopmanager screenshot window --process msedge --json
desktopmanager target save edge-editor-pane --x-ratio 0.1 --y-ratio 0.15 --width-ratio 0.8 --height-ratio 0.7 --client-area
desktopmanager target resolve edge-editor-pane --process msedge --json
desktopmanager screenshot target edge-editor-pane --process msedge --json
desktopmanager window click --process msedge --target edge-editor-pane
desktopmanager window wait-visual-change --process msedge --target edge-editor-pane --timeout-ms 5000 --json
desktopmanager visual save edge-editor-clean --process msedge --target edge-editor-pane --json
desktopmanager visual resolve edge-editor-clean --process msedge --client-area --max-average-difference 10 --json
desktopmanager visual read-text --process msedge --client-area --json
desktopmanager visual resolve-text "Sign in" --process msedge --client-area --contains --json
desktopmanager window click --process msedge --ocr-text "Sign in" --ocr-contains --client-area --json
desktopmanager window drag --process msedge --start-ocr-text "Source" --end-ocr-text "Drop here" --ocr-contains --client-area --json
desktopmanager window scroll --process msedge --ocr-text "Timeline" --delta -120 --ocr-contains --client-area --json
desktopmanager window click --process msedge --visual-baseline edge-editor-clean --client-area --baseline-max-average-difference 10 --json
desktopmanager window scroll --process msedge --visual-baseline edge-editor-clean --delta -120 --client-area --baseline-max-average-difference 10 --json
desktopmanager visual assert edge-editor-clean --process msedge --target edge-editor-pane --max-changed-ratio 0.01 --json
For reusable drags or scrolling, save more than one target and then reuse them from window drag or window scroll instead of repeating raw coordinates.
When you want mutation evidence too, add artifact flags to the action step:
desktopmanager control set-text --window-process msedge --target edge-address --text "https://evotec.xyz" --allow-foreground-input --capture-before --capture-after --json
desktopmanager window click --process msedge --target edge-editor-center --capture-before --capture-after --artifact-directory .\artifacts --json
For hosted-session diagnostics, prefer the summary first and then fall back to the full record only when needed:
desktopmanager diagnostic hosted-session --summary-only
desktopmanager diagnostic hosted-session --repository-root C:\Support\GitHub\DesktopManager
desktopmanager diagnostic hosted-session --artifact C:\Support\GitHub\DesktopManager\Artifacts\HostedSessionTyping\sample.json --json