October 2025

Building Umari

Motivation

Reasoning models excel at planning—decomposing "book a flight to NYC" into search, filter, and purchase steps. But executing these plans requires grounding actions to pixel coordinates, which is fundamentally different.

Planning is discrete and semantic: decide what to click. Grounding is continuous and spatial: find where to click. Using a reasoning model for both wastes compute on the grounding task, which doesn't benefit from chain-of-thought.

The cost compounds quickly. Vision models process screenshots as tiles: a 1920×1080 screen costs roughly 1,000 tokens per capture. For a 100-step workflow, that comes to $3–$9 without caching. At scale, repeatedly calling a reasoning model for coordinate prediction becomes prohibitively expensive.

Architecture

The system decomposes into two specialized models:

  1. Planning: Vision-language model generates semantic actions from screenshots
  2. Grounding: Specialized model maps semantic actions to pixel coordinates

Given a screenshot $s_t$ and history $h_t = \{(s_i, a_i)\}_{i=0}^{t-1}$, the planner outputs action $a_t$:

$ a_t = \pi_{\text{plan}}(s_t, h_t) $

The grounding model converts semantic actions to coordinates:

$ (x, y) = \pi_{\text{ground}}(s_t, a_t) $

Example output:

{
  "action": "click",
  "target": "Submit button",
  "coordinates": [834, 672],
  "confidence": 0.94
}
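
Putting the two stages together, a minimal sketch of the control loop is shown below. The capture, plan, ground, and act callables stand in for the model wrappers and input drivers (which this post doesn't specify), and the "done" terminal action is an assumed convention.

from typing import Callable, Dict, List, Tuple

# Illustrative type aliases: a screenshot is whatever the capture layer
# returns; a semantic action is the JSON-style dict shown above.
Screenshot = bytes
Action = Dict[str, object]
History = List[Tuple[Screenshot, Action]]


def run_workflow(
    goal: str,
    capture: Callable[[], Screenshot],
    plan: Callable[[str, Screenshot, History], Action],
    ground: Callable[[Screenshot, Action], Tuple[int, int]],
    act: Callable[[int, int], None],
    max_steps: int = 100,
) -> History:
    """Plan -> ground -> act loop; stops when the planner emits a terminal action."""
    history: History = []
    for _ in range(max_steps):
        screenshot = capture()
        action = plan(goal, screenshot, history)   # what to do (semantic)
        if action.get("action") == "done":         # assumed terminal convention
            break
        x, y = ground(screenshot, action)          # where to do it (pixels)
        act(x, y)
        history.append((screenshot, action))
    return history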

Cost Analysis

Vision-language models downscale and tile screenshots before processing: the image is scaled so its shorter side is at most 768 px, then split into 512×512 tiles at 170 tokens each. A 1920×1080 screenshot downscales to roughly 1365×768 and generates 3 × 2 = 6 tiles (~1,020 tokens):

$ T_{\text{image}} = \lceil \frac{w'}{512} \rceil \times \lceil \frac{h'}{512} \rceil \times 170 $

where $w' \times h'$ is the resolution after downscaling.

For a 100-step workflow:

Metric               Without Caching   With Caching
Cost                 $3.20–$9.40       $0.32–$0.94
Latency per action   2–5 seconds       10–50 ms
Total time           3–8 minutes       1–5 seconds
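
To make the token arithmetic concrete, here is a small sketch of the formula above. It only reproduces the per-screenshot count; the dollar figures also depend on per-token pricing and on how much history each planner call carries, which the sketch ignores.

import math

def image_tokens(width: int, height: int, tile_px: int = 512,
                 tokens_per_tile: int = 170, short_side: int = 768) -> int:
    """Per-screenshot vision tokens under the tiling scheme described above."""
    # Downscale so the shorter side is at most `short_side` pixels.
    scale = min(1.0, short_side / min(width, height))
    w, h = width * scale, height * scale
    return math.ceil(w / tile_px) * math.ceil(h / tile_px) * tokens_per_tile

per_step = image_tokens(1920, 1080)   # 6 tiles * 170 = 1020 tokens
total = 100 * per_step                # ~102k image tokens for a 100-step workflow
print(per_step, total)                # 1020 102000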

Optimization strategy: cache UI elements between actions and route simple interactions directly to the grounding model. UI state changes slowly—most elements remain at fixed coordinates across consecutive actions.
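
A minimal sketch of the caching half of that strategy, assuming coordinates are cached per target description and revalidated with a cheap check (e.g. a template match) before reuse. The ground and verify callables are placeholders, not part of the actual implementation.

from typing import Callable, Dict, Optional, Tuple

class ElementCache:
    """Reuse grounded coordinates for targets that haven't moved between actions."""

    def __init__(self) -> None:
        self._coords: Dict[str, Tuple[int, int]] = {}

    def resolve(
        self,
        target: str,
        screenshot: bytes,
        ground: Callable[[bytes, str], Tuple[int, int]],
        verify: Callable[[bytes, Tuple[int, int], str], bool],
    ) -> Tuple[int, int]:
        cached: Optional[Tuple[int, int]] = self._coords.get(target)
        if cached is not None and verify(screenshot, cached, target):
            return cached                       # hit: no grounding call, ~no latency
        coords = ground(screenshot, target)     # miss: fall back to the grounding model
        self._coords[target] = coords
        return coords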

Training

The grounding model is trained via GRPO (Group Relative Policy Optimization) with binary rewards:

$ R(s, a, \hat{c}) = \begin{cases} 1 & \text{if } \|\hat{c} - c_{\text{true}}\|_2 < \tau \\ 0 & \text{otherwise} \end{cases} $

where $\hat{c}$ is the predicted coordinate, $c_{\text{true}}$ is the ground truth, and $\tau$ is the hit threshold.
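
The reward itself is a few lines; the sketch below uses an illustrative hit threshold (not the value used in training) and adds the group-relative advantage GRPO computes over a batch of sampled predictions for the same prompt.

import math

def hit_reward(pred: tuple[float, float], true: tuple[float, float],
               tau: float = 14.0) -> float:
    """1 if the predicted click lands within tau pixels of ground truth, else 0."""
    return 1.0 if math.dist(pred, true) < tau else 0.0   # ||c_hat - c_true||_2 < tau

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO: standardize rewards within a group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: 4 sampled coordinate predictions, 2 hits -> hits get positive advantage.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))   # [1.0, -1.0, 1.0, -1.0] (approx.)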

Training uses trajectory augmentation: one recorded workflow generates multiple training samples by varying UI states and timing. For a trajectory of length $n$, we extract $O(n^2)$ sub-trajectories as training data.
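
The sub-trajectory extraction reduces to slicing; a sketch is below. The UI-state and timing perturbations mentioned above are not shown.

from typing import List, Sequence, Tuple

Step = Tuple[bytes, dict]   # (screenshot, semantic action)

def sub_trajectories(trajectory: Sequence[Step]) -> List[Sequence[Step]]:
    """All contiguous slices of a recorded trajectory: n*(n+1)/2 = O(n^2) samples."""
    n = len(trajectory)
    return [trajectory[i:j] for i in range(n) for j in range(i + 1, n + 1)]

# e.g. a 20-step recording yields 210 sub-trajectories before any perturbation.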

Adaptive Routing

For simple UIs, the grounding model executes directly without planning. We route to the planner only when confidence is low:

$ \text{use\_planner} = \begin{cases} \text{true} & \text{if } H(p_{\text{ground}}) > \theta \text{ or } \max(p_{\text{ground}}) < \gamma \\ \text{false} & \text{otherwise} \end{cases} $

where $H(p)$ is the entropy of the grounding model's coordinate distribution, $\theta$ is the entropy threshold, and $\gamma$ is the confidence threshold.

This achieves ~50 ms latency for simple actions (on an A100), escalating to 2–5 s only for ambiguous cases.
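
A sketch of the routing rule, assuming the grounding model exposes a discrete probability distribution over candidate targets; entropy_threshold and confidence_threshold correspond to $\theta$ and $\gamma$.

import math
from typing import Sequence

def should_use_planner(probs: Sequence[float],
                       entropy_threshold: float,     # theta
                       confidence_threshold: float   # gamma
                       ) -> bool:
    """Escalate to the planning model when the grounder is uncertain."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy > entropy_threshold or max(probs) < confidence_threshold

# A sharply peaked distribution stays on the fast grounding-only path:
print(should_use_planner([0.94, 0.03, 0.02, 0.01], 1.0, 0.8))   # False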

Future Work

The current implementation processes discrete states. Moving to streaming would enable continuous perception-action loops at 5–10 Hz, handling dynamic interactions (drag, hover, scroll) more naturally.

For repeated workflows, policy distillation can compile trajectories into specialized models:

$ \pi_{\text{task}}(s) = \arg\min_{\pi} \mathbb{E}_{s \sim \mathcal{D}_{\text{task}}} \left[ \text{KL}(\pi(s) \| \pi_{\text{plan}}(s)) \right] $

This converts the planner from runtime dependency to training-time teacher, enabling local execution of routine tasks.
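
A minimal sketch of that objective in PyTorch, assuming both policies score the same discrete action vocabulary; the names and shapes are illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher), averaged over the batch of sampled states."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits.detach(), dim=-1)  # frozen planner
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so passing the teacher as `input` and the student as `target` gives
    # KL(pi_task || pi_plan), matching the objective above.
    return F.kl_div(teacher_log_probs, student_log_probs,
                    log_target=True, reduction="batchmean")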

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.

Written by Alex Hamidi