October 2025

Building Umari

Motivation

Reasoning models excel at planning—decomposing "book a flight to NYC" into search, filter, and purchase steps. But executing these plans requires grounding actions to pixel coordinates, which is fundamentally different.

Planning is discrete and semantic: decide what to click. Grounding is continuous and spatial: find where to click. Using a reasoning model for both wastes compute on the grounding task, which doesn't benefit from chain-of-thought.

The cost compounds quickly. Vision models process screenshots as tiles: a 1920×1080 screen costs roughly 1,000 tokens per capture. For a 100-step workflow, that comes to $3–$9 without caching. At scale, repeatedly calling a reasoning model for coordinate prediction becomes prohibitively expensive.

Architecture

The system decomposes into two specialized models:

  1. Planning: Vision-language model generates semantic actions from screenshots
  2. Grounding: Specialized model maps semantic actions to pixel coordinates

Given a screenshot $s_t$ and history $h_t = \{(s_i, a_i)\}_{i=0}^{t-1}$, the planner outputs action $a_t$:

$ a_t = \pi_{\text{plan}}(s_t, h_t) $

The grounding model converts semantic actions to coordinates:

$ (x, y) = \pi_{\text{ground}}(s_t, a_t) $

Example output:

{
  "action": "click",
  "target": "Submit button",
  "coordinates": [834, 672],
  "confidence": 0.94
}
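
Putting the two stages together, a minimal sketch of the control loop is shown below. The capture, plan, ground, and act callables stand in for the model wrappers and input drivers (which this post doesn't specify), and the "done" terminal action is an assumed convention.

from typing import Callable, Dict, List, Tuple

# Illustrative type aliases: a screenshot is whatever the capture layer
# returns; a semantic action is the JSON-style dict shown above.
Screenshot = bytes
Action = Dict[str, object]
History = List[Tuple[Screenshot, Action]]


def run_workflow(
    goal: str,
    capture: Callable[[], Screenshot],
    plan: Callable[[str, Screenshot, History], Action],
    ground: Callable[[Screenshot, Action], Tuple[int, int]],
    act: Callable[[int, int], None],
    max_steps: int = 100,
) -> History:
    """Plan -> ground -> act loop; stops when the planner emits a terminal action."""
    history: History = []
    for _ in range(max_steps):
        screenshot = capture()
        action = plan(goal, screenshot, history)   # what to do (semantic)
        if action.get("action") == "done":         # assumed terminal convention
            break
        x, y = ground(screenshot, action)          # where to do it (pixels)
        act(x, y)
        history.append((screenshot, action))
    return history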

Cost Analysis

Vision-language models downscale and tile screenshots before processing: the image is scaled so its shorter side is at most 768 px, then split into 512×512 tiles at 170 tokens each. A 1920×1080 screenshot downscales to roughly 1365×768 and generates 3 × 2 = 6 tiles (~1,020 tokens):

$ T_{\text{image}} = \lceil \frac{w'}{512} \rceil \times \lceil \frac{h'}{512} \rceil \times 170 $

where $w' \times h'$ is the resolution after downscaling.

For a 100-step workflow:

Metric               Without Caching   With Caching
Cost                 $3.20–$9.40       $0.32–$0.94
Latency per action   2–5 seconds       10–50 ms
Total time           3–8 minutes       1–5 seconds
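
To make the token arithmetic concrete, here is a small sketch of the formula above. It only reproduces the per-screenshot count; the dollar figures also depend on per-token pricing and on how much history each planner call carries, which the sketch ignores.

import math

def image_tokens(width: int, height: int, tile_px: int = 512,
                 tokens_per_tile: int = 170, short_side: int = 768) -> int:
    """Per-screenshot vision tokens under the tiling scheme described above."""
    # Downscale so the shorter side is at most `short_side` pixels.
    scale = min(1.0, short_side / min(width, height))
    w, h = width * scale, height * scale
    return math.ceil(w / tile_px) * math.ceil(h / tile_px) * tokens_per_tile

per_step = image_tokens(1920, 1080)   # 6 tiles * 170 = 1020 tokens
total = 100 * per_step                # ~102k image tokens for a 100-step workflow
print(per_step, total)                # 1020 102000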

Optimization strategy: cache UI elements between actions and route simple interactions directly to the grounding model. UI state changes slowly—most elements remain at fixed coordinates across consecutive actions.
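
A minimal sketch of the caching half of that strategy, assuming coordinates are cached per target description and revalidated with a cheap check (e.g. a template match) before reuse. The ground and verify callables are placeholders, not part of the actual implementation.

from typing import Callable, Dict, Optional, Tuple

class ElementCache:
    """Reuse grounded coordinates for targets that haven't moved between actions."""

    def __init__(self) -> None:
        self._coords: Dict[str, Tuple[int, int]] = {}

    def resolve(
        self,
        target: str,
        screenshot: bytes,
        ground: Callable[[bytes, str], Tuple[int, int]],
        verify: Callable[[bytes, Tuple[int, int], str], bool],
    ) -> Tuple[int, int]:
        cached: Optional[Tuple[int, int]] = self._coords.get(target)
        if cached is not None and verify(screenshot, cached, target):
            return cached                       # hit: no grounding call, ~no latency
        coords = ground(screenshot, target)     # miss: fall back to the grounding model
        self._coords[target] = coords
        return coords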

Training

The grounding model is trained via GRPO (Group Relative Policy Optimization) with binary rewards:

$ R(s, a, \hat{c}) = \begin{cases} 1 & \text{if } \|\hat{c} - c_{\text{true}}\|_2 < \tau \\ 0 & \text{otherwise} \end{cases} $

where $\hat{c}$ is the predicted coordinate, $c_{\text{true}}$ is the ground truth, and $\tau$ is the hit threshold.
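
The reward itself is a few lines; the sketch below uses an illustrative hit threshold (not the value used in training) and adds the group-relative advantage GRPO computes over a batch of sampled predictions for the same prompt.

import math

def hit_reward(pred: tuple[float, float], true: tuple[float, float],
               tau: float = 14.0) -> float:
    """1 if the predicted click lands within tau pixels of ground truth, else 0."""
    return 1.0 if math.dist(pred, true) < tau else 0.0   # ||c_hat - c_true||_2 < tau

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO: standardize rewards within a group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: 4 sampled coordinate predictions, 2 hits -> hits get positive advantage.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))   # [1.0, -1.0, 1.0, -1.0] (approx.)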

Training uses trajectory augmentation: one recorded workflow generates multiple training samples by varying UI states and timing. For a trajectory of length $n$, we extract $O(n^2)$ sub-trajectories as training data.
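
The sub-trajectory extraction reduces to slicing; a sketch is below. The UI-state and timing perturbations mentioned above are not shown.

from typing import List, Sequence, Tuple

Step = Tuple[bytes, dict]   # (screenshot, semantic action)

def sub_trajectories(trajectory: Sequence[Step]) -> List[Sequence[Step]]:
    """All contiguous slices of a recorded trajectory: n*(n+1)/2 = O(n^2) samples."""
    n = len(trajectory)
    return [trajectory[i:j] for i in range(n) for j in range(i + 1, n + 1)]

# e.g. a 20-step recording yields 210 sub-trajectories before any perturbation.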

Adaptive Routing

For simple UIs, the grounding model executes directly without planning. We route to the planner only when confidence is low:

$ \text{use\_planner} = \begin{cases} \text{true} & \text{if } H(p_{\text{ground}}) > \theta \text{ or } \max(p_{\text{ground}}) < \gamma \\ \text{false} & \text{otherwise} \end{cases} $

where $H(p)$ is the entropy of the grounding model's coordinate distribution, $\theta$ is the entropy threshold, and $\gamma$ is the confidence threshold.

This achieves ~50 ms latency for simple actions (on an A100), escalating to 2–5 s only for ambiguous cases.
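
A sketch of the routing rule, assuming the grounding model exposes a discrete probability distribution over candidate targets; entropy_threshold and confidence_threshold correspond to $\theta$ and $\gamma$.

import math
from typing import Sequence

def should_use_planner(probs: Sequence[float],
                       entropy_threshold: float,     # theta
                       confidence_threshold: float   # gamma
                       ) -> bool:
    """Escalate to the planning model when the grounder is uncertain."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy > entropy_threshold or max(probs) < confidence_threshold

# A sharply peaked distribution stays on the fast grounding-only path:
print(should_use_planner([0.94, 0.03, 0.02, 0.01], 1.0, 0.8))   # False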

Future Work

The current implementation processes discrete states. Moving to streaming would enable continuous perception-action loops at 5–10 Hz, handling dynamic interactions (drag, hover, scroll) more naturally.

For repeated workflows, policy distillation can compile trajectories into specialized models:

$ \pi_{\text{task}}(s) = \arg\min_{\pi} \mathbb{E}_{s \sim \mathcal{D}_{\text{task}}} \left[ \text{KL}(\pi(s) \| \pi_{\text{plan}}(s)) \right] $

This converts the planner from runtime dependency to training-time teacher, enabling local execution of routine tasks.
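
A minimal sketch of that objective in PyTorch, assuming both policies score the same discrete action vocabulary; the names and shapes are illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher), averaged over the batch of sampled states."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits.detach(), dim=-1)  # frozen planner
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so passing the teacher as `input` and the student as `target` gives
    # KL(pi_task || pi_plan), matching the objective above.
    return F.kl_div(teacher_log_probs, student_log_probs,
                    log_target=True, reduction="batchmean")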

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.

Written by Alex Hamidi