
building a sub-100ms browser agent with VLM + GPT

Dec 2025

Umari is a web agent that controls your browser through natural language. Making umari fast enough to actually feel useful is a surprisingly deep engineering problem. This post covers how it works and the cost and latency tradeoffs I made.

Umari runs as a lightweight overlay where you describe tasks in plain English. It takes screenshots to perceive the current state, uses a reasoning model to plan what to do next, then executes. A simple instruction like "find me flights from SF to NYC under $300" decomposes into a sequence of clicks, form fills, and scrolls - all handled autonomously. The goal is to make browsers self-driving.

Why GPT-5

Umari was built entirely on GPT-5's reasoning capabilities - development, debugging, training data generation, and the agent loop itself. The high-compute reasoning mode handled planning multi-step interactions, while GPT-5's vision capabilities handled screen understanding. Earlier models simply couldn't maintain context across long, branching workflows.

What makes GPT-5 particularly suited for computer control is how it handles complex chains of actions without losing state. When you tell it "start playing this game," it doesn't just output one action - it breaks the instruction into discrete executable steps, tracks which ones have been completed, and adapts when the UI changes unexpectedly. Previous models would hallucinate clicks on elements that weren't on screen, or lose track of what step they were on after a page transition. GPT-5's chain-of-thought reasoning makes both problems essentially disappear.

We calibrated compute allocation based on task complexity. Simple single-click actions get minimal reasoning. Complex multi-screen workflows get more. For ambiguous UIs - pages with many similar-looking interactive elements - high reasoning effort lets the model map out the full interaction sequence with explicit error handling before committing to any action. This tradeoff between speed and reliability is what separates something that technically works from something people actually want to use.

How it works: split reasoning from grounding

The core architectural insight is that deciding what to do and finding exactly where to click are fundamentally different problems that benefit from different models.

GPT-5 acts as the planner. It sees the current screenshot and your natural language instruction, and outputs a semantic action: "click the blue Submit button at the bottom of the form." This is a reasoning problem - it requires understanding context, intent, and UI structure.

A small fine-tuned grounding model then takes that semantic description plus the screenshot and outputs exact pixel coordinates: (834, 672). This is a localization problem - it doesn't need to understand your intent, just find the element described as precisely as possible.

The grounding model is a vision transformer (ViT) fine-tuned specifically for UI element localization. It runs in ~20ms on an A100 and handles the vast majority of clicks without ever calling GPT-5. One model for the "what," another for the "where."

This split matters computationally. If you use a single large model for both tasks, you're burning expensive reasoning tokens on something a 7B model can do just as well. And you're adding latency on every step, even the trivial ones. Keeping them separate means you can optimize each independently.
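
The two-stage loop can be sketched roughly as follows. This is a minimal illustration, not the production code: `plan` stands in for the GPT-5 call, `ground` for the fine-tuned ViT, and the stubbed return values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SemanticAction:
    verb: str          # "click", "type", "scroll"
    description: str   # e.g. "the blue Submit button at the bottom of the form"

def plan(screenshot: bytes, instruction: str) -> SemanticAction:
    """Planner stage (the reasoning model): decides WHAT to do.
    Stubbed here; the real version sends the screenshot + instruction
    to the planner and parses a structured semantic action."""
    return SemanticAction("click", "the blue Submit button at the bottom of the form")

def ground(screenshot: bytes, action: SemanticAction) -> tuple[int, int]:
    """Grounding stage (the small ViT): decides WHERE.
    Stubbed here; the real model localizes the described element."""
    return (834, 672)

def step(screenshot: bytes, instruction: str) -> tuple[str, tuple[int, int]]:
    """One agent step: semantic decision, then pixel localization."""
    action = plan(screenshot, instruction)
    return action.verb, ground(screenshot, action)
```

Because the interface between the stages is just a short text description plus a screenshot, either model can be swapped or fine-tuned without touching the other.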

Why vision tokens are so expensive

For a computer-using agent, every action involves: take a screenshot, encode it, reason about it, output an action. A 1920×1080 screenshot tiles into roughly 6 image tiles at ~170 tokens each, plus whatever reasoning tokens the model generates, billed as output tokens.

Metric               Value
Tokens per step      3,200–9,400
Cost per 100 steps   $3.20–$9.40
With prompt caching  ~$0.35–$0.94
Latency per action   2–5 seconds
100-step task        3–8 minutes

This compounds fast at any real scale. Running the same 100-step workflow 100 times a day costs ~$940/day - over $28,000/month without caching. And it's not just money: a task that takes a human 10 minutes could take 5–30 minutes in compute time. LLMs are also non-deterministic, so you can't assume identical inputs produce identical outputs, which makes batching and caching more complex.
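
The arithmetic behind those numbers is worth making explicit. The sketch below assumes a blended rate of $10 per 1M tokens, which is what reproduces the table's cost figures; real pricing differs by model and between input and output tokens.

```python
# Back-of-envelope cost model for an agent billed per token.
# $10 / 1M tokens is an ASSUMED blended rate chosen to match the
# figures above, not an actual price.
BLENDED_RATE = 10.0 / 1_000_000  # dollars per token

def cost_per_run(tokens_per_step: int, steps: int = 100) -> float:
    """Cost of one workflow run of `steps` steps."""
    return tokens_per_step * steps * BLENDED_RATE

daily = 100 * cost_per_run(9_400)   # 100 runs/day at the high end of the table
monthly = 30 * daily
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")  # → $940/day, $28,200/month
```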

The primary optimization is aggressive patch caching. UI elements don't move much between steps - if a Submit button is at (834, 672) in frame N, it's probably still there in frame N+1. Rather than re-encoding the entire screenshot every step, we cache patches at the tile level and reconstruct only the diff. Combined with a lightweight saliency scorer (~3MB) that identifies interactive regions, we consistently hit 70%+ cache hit rates, bringing effective grounding latency to 10–50ms. The saliency scorer is key: it lets us skip encoding the 80% of the screen that's static background, and focus compute on the regions that actually matter.
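
A minimal version of the tile-level diff can be sketched with per-tile hashing: hash every 32×32 region, and re-encode only the tiles whose hash changed since the previous frame. The frame layout (grayscale, 1 byte/pixel) and function names here are illustrative assumptions, not the actual implementation.

```python
import hashlib

TILE = 32  # tile edge in pixels (matches the patch size discussed below)

def tile_hashes(frame: bytes, width: int, height: int) -> dict[tuple[int, int], str]:
    """Hash each TILE×TILE region of a grayscale frame (1 byte per pixel)."""
    hashes = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            rows = [frame[(ty + r) * width + tx:(ty + r) * width + tx + TILE]
                    for r in range(min(TILE, height - ty))]
            hashes[(tx, ty)] = hashlib.blake2b(b"".join(rows), digest_size=8).hexdigest()
    return hashes

def dirty_tiles(prev: dict, curr: dict) -> list[tuple[int, int]]:
    """Tiles whose content changed between frames - only these get re-encoded."""
    return [k for k, h in curr.items() if prev.get(k) != h]
```

If a frame changes only around the cursor, the dirty set is a handful of tiles instead of the full screen, which is where the 70%+ hit rate comes from.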

One caveat: this caching strategy works best in single-window workflows. Tasks involving frequent tab switches or full-page navigations invalidate the cache more aggressively. We're still figuring out the right policy for those cases.

Training the grounding model with GRPO

We trained the grounding model using GRPO (Group Relative Policy Optimization). The reward signal is simple and binary: 1 if the predicted click coordinates land inside the bounding box of the target UI element, 0 otherwise.

Working at the patch level rather than pixel level is what makes this reward signal practical. Patches are typically 32×32 pixels - large enough that any click within the patch gets rewarded, which avoids the sparse-reward problem you'd get trying to predict exact pixel coordinates with a binary reward. The model learns to route its attention to the right region of the screen and then output a coordinate within that region.
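
The reward and the group-relative normalization that gives GRPO its name are simple enough to write down directly. This is a sketch of the two building blocks as described above; the function names are mine.

```python
def click_reward(pred_xy: tuple[int, int],
                 target_bbox: tuple[int, int, int, int]) -> float:
    """Binary reward: 1 if the predicted click lands inside the target
    element's bounding box (x_min, y_min, x_max, y_max), else 0."""
    x, y = pred_xy
    x0, y0, x1, y1 = target_bbox
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: each sample's reward minus the group mean,
    scaled by the group std - the 'group relative' part of GRPO."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

With several sampled clicks per screenshot, the advantages push probability toward the samples that landed inside the box and away from those that missed, with no learned value function needed.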

For training data, we generated trajectories through augmentation on human demonstrations. One recorded workflow becomes multiple training examples by varying timing, intermediate UI states, and interaction sequences. This is essential for robustness: real-world UIs don't look exactly the same twice. Loading states differ, animations vary, content changes. A model trained only on clean demonstrations will fail constantly in practice.

One surprising failure mode we found: the model was significantly worse at clicking large, high-contrast elements (bright red buttons, prominent CTAs) than at clicking small, subtle ones. Our hypothesis is a distribution bias - large elements have more pixels near their center, so the model learned to click center-of-mass for large targets, which often overshoots the actual interactive region if the button has thick borders or padding. More investigation needed, but it's a good reminder that RL reward shaping can produce unexpected failure modes that are hard to catch without systematic evaluation.

Adaptive routing

Test-time compute scaling gets a lot of hype right now, especially following the success of o-series models. But for web agents, the bottleneck isn't reasoning - it's grounding. For most steps, a 7B vision model is more than enough. The question is when to escalate to the expensive reasoning model.

On the fast path, the grounding model runs alone. No GPT-5 call. This hits ~50ms end-to-end on an A100 and covers the majority of steps in a typical workflow. The router escalates to GPT-5 only when it detects signals of ambiguity:

  • High saliency entropy: multiple competing interactive regions with similar activation scores
  • Duplicate targets: more than one element matching the semantic description (e.g., three "Submit" buttons on a page)
  • Recent misclick or state error: the previous action didn't produce the expected state transition
  • Complex instruction: the step requires multi-screen planning or conditional logic
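
The routing decision itself is cheap. A simplified sketch of the signals above, with illustrative thresholds rather than the production values:

```python
import math

def saliency_entropy(scores: list[float]) -> float:
    """Entropy of normalized saliency scores; high entropy means many
    competing interactive regions with similar activations."""
    total = sum(scores)
    ps = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in ps)

def should_escalate(scores: list[float], n_matches: int,
                    last_step_ok: bool, complex_instruction: bool,
                    entropy_threshold: float = 1.5) -> bool:
    """Route to the expensive planner only when an ambiguity signal fires;
    otherwise the grounding model runs alone on the fast path."""
    return (saliency_entropy(scores) > entropy_threshold
            or n_matches > 1            # duplicate targets
            or not last_step_ok         # misclick / unexpected state
            or complex_instruction)     # multi-screen or conditional step
```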

When GPT-5 is invoked, we pipeline it: step N executes while the planner prepares the plan for step N+1. This hides most of the latency. The result is that simple flows feel nearly instant, and complex ones feel deliberate rather than frozen.
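
The pipelining can be sketched with asyncio: while step N's action executes, the planner call for step N+1 is already in flight. The sleeps stand in for real latencies and the structure is illustrative.

```python
import asyncio
import time

PLAN_S, EXEC_S = 0.1, 0.1  # stand-in latencies: planner call, action execution

async def plan_next(prev):
    await asyncio.sleep(PLAN_S)        # placeholder for a planner (GPT-5) call
    return f"action-after-{prev}"

async def execute(action):
    await asyncio.sleep(EXEC_S)        # placeholder for grounding + click

async def run_pipelined(n_steps: int):
    """Overlap execution of step N with planning for step N+1,
    hiding most of the planner latency behind the action itself."""
    action = await plan_next(None)     # the first step must be planned up front
    for i in range(n_steps):
        exec_task = asyncio.create_task(execute(action))
        if i + 1 < n_steps:
            # planner for step i+1 runs while step i executes
            action, _ = await asyncio.gather(plan_next(action), exec_task)
        else:
            await exec_task

start = time.perf_counter()
asyncio.run(run_pipelined(3))
elapsed = time.perf_counter() - start  # ~0.4s pipelined vs ~0.6s sequential
```

Each loop iteration costs max(plan, exec) instead of plan + exec, so the deeper the workflow, the more planner latency disappears.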

Simple UIs (~70% of tasks) resolve in under 100ms. Complex UIs - spreadsheets, forms with conditional logic, multi-step flows - take 400–600ms but produce higher accuracy. Same two models, different routing based on confidence signals. The fundamental tradeoff maps well to different users: consumers want individual tasks done fast, enterprises want long workflows done reliably.

After about an hour of use on a given workflow, patch cache hit rates climb high enough that many steps complete in under 100ms end-to-end. Verification is cheap - a single screenshot plus a state predicate - so we can iterate quickly without accumulating silent drift.

Related work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.
