
building a sub-100ms browser agent with VLM + GPT

Dec 2025

I've been building Umari, a web agent that lets you control your browser through natural language. The pitch is simple: type what you want done, and it figures out the clicks and keystrokes. But making it fast enough to actually feel useful - not like you're waiting on a slow RPC call - turned into a surprisingly deep engineering problem. This post is about how it works, the inference math behind the cost and latency tradeoffs, and what I learned along the way.

Umari runs as a lightweight overlay where you describe tasks in plain English. It takes screenshots to perceive the current state, uses a reasoning model to plan what to do next, then executes. A simple instruction like "find me flights from SF to NYC under $300" decomposes into a sequence of clicks, form fills, and scrolls - all handled autonomously. The goal is to make browsers self-driving.

Why GPT-5

Umari was built entirely on GPT-5's reasoning capabilities - development, debugging, training data generation, and the agent loop itself. The high-compute reasoning mode handled planning multi-step interactions, while GPT-5's vision capabilities handled screen understanding. Earlier models simply couldn't maintain context across long, branching workflows.

What makes GPT-5 particularly suited for computer control is how it handles complex chains of actions without losing state. When you tell it "start playing this game," it doesn't just output one action - it breaks the instruction into discrete executable steps, tracks which ones have been completed, and adapts when the UI changes unexpectedly. Previous models would hallucinate clicks on elements that weren't on screen, or lose track of what step they were on after a page transition. GPT-5's chain-of-thought reasoning makes both problems essentially disappear.

We calibrated compute allocation based on task complexity. Simple single-click actions get minimal reasoning. Complex multi-screen workflows get more. For ambiguous UIs - pages with many similar-looking interactive elements - high reasoning effort lets the model map out the full interaction sequence with explicit error handling before committing to any action. This tradeoff between speed and reliability is what separates something that technically works from something people actually want to use.

How it works: split reasoning from grounding

The core architectural insight is that deciding what to do and finding exactly where to click are fundamentally different problems that benefit from different models.

GPT-5 acts as the planner. It sees the current screenshot and your natural language instruction, and outputs a semantic action: "click the blue Submit button at the bottom of the form." This is a reasoning problem - it requires understanding context, intent, and UI structure.

A small fine-tuned grounding model then takes that semantic description plus the screenshot and outputs exact pixel coordinates: (834, 672). This is a localization problem - it doesn't need to understand your intent, just find the element described as precisely as possible.

The grounding model is a vision transformer (ViT) fine-tuned specifically for UI element localization. It runs in ~20ms on an A100 and handles the vast majority of clicks without ever calling GPT-5. One model for the "what," another for the "where."

This split matters computationally. If you use a single large model for both tasks, you're burning expensive reasoning tokens on something a 7B model can do just as well. And you're adding latency on every step, even the trivial ones. Keeping them separate means you can optimize each independently.
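In code, the division of labor looks roughly like this. This is a minimal sketch with both models stubbed out; `plan`, `ground`, and `SemanticAction` are illustrative names, not the real API:

```python
from dataclasses import dataclass

@dataclass
class SemanticAction:
    verb: str    # "click", "type", "scroll", ...
    target: str  # natural-language element description

def plan(screenshot: bytes, instruction: str) -> SemanticAction:
    """Planner (GPT-5 in the real system): decides WHAT to do. Stubbed."""
    return SemanticAction("click", "the blue Submit button at the bottom of the form")

def ground(screenshot: bytes, target: str) -> tuple[int, int]:
    """Grounding ViT (~20ms on an A100): decides WHERE. Stubbed."""
    return (834, 672)

def step(screenshot: bytes, instruction: str) -> tuple[str, tuple[int, int]]:
    """One agent step: reason about intent, then localize the target."""
    action = plan(screenshot, instruction)
    return action.verb, ground(screenshot, action.target)
```

The important property is that `step` only touches the planner for the "what"; everything coordinate-shaped stays in the cheap model.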

Why vision tokens are so expensive

For a computer-using agent, every action involves: take a screenshot, encode it, reason about it, output an action. A 1920×1080 screenshot tiles into roughly 6 patches at ~170 tokens each - about 1,000 image tokens - plus whatever reasoning tokens the model generates, which are billed at output rates.

Metric                 Value
Tokens per step        3,200–9,400
Cost per 100 steps     $3.20–$9.40
With prompt caching    ~$0.35–$0.94
Latency per action     2–5 seconds
100-step task          3–8 minutes

This compounds fast at any real scale. Running the same 100-step workflow 100 times a day costs ~$940/day - over $28,000/month - without caching. And it's not just money: a task that takes a human 10 minutes could take 5–30 minutes in compute time. LLMs are also non-deterministic, so you can't assume identical inputs produce identical outputs, which makes batching and caching more complex.
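The back-of-envelope arithmetic, using the worst-case numbers from the table:

```python
# Worst case, no caching: $9.40 per 100-step run, 100 runs a day.
cost_per_100_steps = 9.40
runs_per_day = 100

daily = round(cost_per_100_steps * runs_per_day, 2)    # $940/day
monthly = round(daily * 30, 2)                         # $28,200/month

# With prompt caching, the upper bound drops to ~$0.94 per run.
cached_monthly = round(0.94 * runs_per_day * 30, 2)    # $2,820/month
```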

The primary optimization is aggressive patch caching. UI elements don't move much between steps - if a Submit button is at (834, 672) in frame N, it's probably still there in frame N+1. Rather than re-encoding the entire screenshot every step, we cache patches at the tile level and reconstruct only the diff. Combined with a lightweight saliency scorer (~3MB) that identifies interactive regions, we consistently hit 70%+ cache hit rates, bringing effective grounding latency to 10–50ms. The saliency scorer is key: it lets us skip encoding the 80% of the screen that's static background, and focus compute on the regions that actually matter.
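A minimal sketch of the tile-level diffing idea, using hashes to find which patches actually changed between frames (the real system also runs the saliency scorer; this stub works on grayscale byte buffers and the function names are illustrative):

```python
import hashlib

TILE = 32  # patch size in pixels

def tile_hashes(frame: bytes, width: int) -> dict[tuple[int, int], bytes]:
    """Hash every TILE x TILE patch of a row-major grayscale frame."""
    height = len(frame) // width
    out = {}
    for ty in range(0, height - height % TILE, TILE):
        for tx in range(0, width - width % TILE, TILE):
            h = hashlib.blake2b(digest_size=8)
            for row in range(ty, ty + TILE):
                h.update(frame[row * width + tx : row * width + tx + TILE])
            out[(ty, tx)] = h.digest()
    return out

def changed_tiles(prev: bytes, cur: bytes, width: int) -> list[tuple[int, int]]:
    """Only these tiles need re-encoding; everything else is a cache hit."""
    ph, ch = tile_hashes(prev, width), tile_hashes(cur, width)
    return [k for k in ch if ph.get(k) != ch[k]]
```

With a 70%+ hit rate, most calls to `changed_tiles` return a small fraction of the screen, which is what brings effective grounding latency down to 10–50ms.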

One caveat: this caching strategy works best in single-window workflows. Tasks involving frequent tab switches or full-page navigations invalidate the cache more aggressively. We're still figuring out the right policy for those cases.

Training the grounding model with GRPO

We trained the grounding model using GRPO (Group Relative Policy Optimization). The reward signal is simple and binary: 1 if the predicted click coordinates land inside the bounding box of the target UI element, 0 otherwise.

Working at the patch level rather than pixel level is what makes this reward signal practical. Patches are typically 32×32 pixels - large enough that any click within the patch gets rewarded, which avoids the sparse-reward problem you'd get trying to predict exact pixel coordinates with a binary reward. The model learns to route its attention to the right region of the screen and then output a coordinate within that region.
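The reward and the group-relative normalization at the heart of GRPO are both a few lines. This is a sketch of the shape of the computation, not our training code:

```python
import statistics

def click_reward(pred_xy: tuple[int, int], bbox: tuple[int, int, int, int]) -> float:
    """Binary reward: 1.0 if the predicted click lands inside the target's
    bounding box (x0, y0, x1, y1), else 0.0."""
    x, y = pred_xy
    x0, y0, x1, y1 = bbox
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's key move: advantages are rewards normalized within a group of
    rollouts sampled for the same prompt, instead of against a learned critic."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # uniform group: all advantages are 0
    return [(r - mu) / sd for r in rewards]
```

Note what patch-level prediction buys: any click inside the box scores 1.0, so a group of rollouts near the target produces a mix of 0s and 1s rather than all 0s, which keeps the normalized advantages informative.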

For training data, we generated trajectories through augmentation on human demonstrations. One recorded workflow becomes multiple training examples by varying timing, intermediate UI states, and interaction sequences. This is essential for robustness: real-world UIs don't look exactly the same twice. Loading states differ, animations vary, content changes. A model trained only on clean demonstrations will fail constantly in practice.
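The timing side of the augmentation can be sketched as follows. This is a simplified illustration of the idea - jitter delays and occasionally duplicate a step to mimic a loading-state retry - with a made-up trajectory schema:

```python
import random

def augment(trajectory: list[dict], n_variants: int = 4, seed: int = 0) -> list[list[dict]]:
    """Expand one recorded workflow into several training examples by
    varying inter-action timing and injecting occasional retries."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        v = []
        for step in trajectory:
            s = dict(step)
            # Jitter the recorded delay: real pages load at different speeds.
            s["delay_ms"] = int(step.get("delay_ms", 100) * rng.uniform(0.5, 2.0))
            v.append(s)
            if rng.random() < 0.1:  # occasionally repeat a step, as a user would
                v.append(dict(s))
        variants.append(v)
    return variants
```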

One surprising failure mode we found: the model was significantly worse at clicking large, high-contrast elements (bright red buttons, prominent CTAs) than at clicking small, subtle ones. Our hypothesis is a distribution bias - large elements have more pixels near their center, so the model learned to click center-of-mass for large targets, which often overshoots the actual interactive region if the button has thick borders or padding. More investigation needed, but it's a good reminder that RL reward shaping can produce unexpected failure modes that are hard to catch without systematic evaluation.

Adaptive routing: making it feel instant

Test-time compute scaling gets a lot of hype right now, especially following the success of o-series models. But for web agents, the bottleneck isn't reasoning - it's grounding. For most steps, a 7B vision model is more than enough. The question is when to escalate to the expensive reasoning model.

On the fast path, the grounding model runs alone. No GPT-5 call. This hits ~50ms end-to-end on an A100 and covers the majority of steps in a typical workflow. The router escalates to GPT-5 only when it detects signals of ambiguity:

  • High saliency entropy: multiple competing interactive regions with similar activation scores
  • Duplicate targets: more than one element matching the semantic description (e.g., three "Submit" buttons on a page)
  • Recent misclick or state error: the previous action didn't produce the expected state transition
  • Complex instruction: the step requires multi-screen planning or conditional logic
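The routing decision itself is cheap. A sketch of the escalation check, with the entropy threshold as an illustrative value rather than our tuned one:

```python
import math

def entropy(scores: list[float]) -> float:
    """Shannon entropy of normalized saliency activations: high entropy
    means several regions are competing for the click."""
    total = sum(scores)
    ps = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in ps)

def should_escalate(saliency: list[float], n_matches: int,
                    last_step_failed: bool, needs_planning: bool,
                    entropy_threshold: float = 1.5) -> bool:
    """Call the expensive planner only when an ambiguity signal fires;
    otherwise stay on the ~50ms fast path."""
    return (entropy(saliency) > entropy_threshold
            or n_matches > 1
            or last_step_failed
            or needs_planning)
```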

When GPT-5 is invoked, we pipeline it: step N executes while the planner prepares the plan for step N+1. This hides most of the latency. The result is that simple flows feel nearly instant, and complex ones feel deliberate rather than frozen.
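The pipelining pattern is a standard produce-ahead loop. A sketch with the executor and planner as injectable stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(steps, execute, plan_next):
    """Overlap execution of step N with planning of step N+1, hiding most
    of the planner's latency. `execute` and `plan_next` are stand-ins for
    the real action executor and the GPT-5 call."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as planner:
        pending = planner.submit(plan_next, steps[0])
        for i in range(len(steps)):
            action = pending.result()  # wait for the plan for this step
            if i + 1 < len(steps):
                pending = planner.submit(plan_next, steps[i + 1])  # plan ahead
            results.append(execute(action))  # runs while the next plan is in flight
    return results
```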

Simple UIs (~70% of tasks) resolve in under 100ms. Complex UIs - spreadsheets, forms with conditional logic, multi-step flows - take 400–600ms but produce higher accuracy. Same two models, different routing based on confidence signals. The fundamental tradeoff maps well to different users: consumers want individual tasks done fast, enterprises want long workflows done reliably.

After about an hour of use on a given workflow, patch cache hit rates climb high enough that many steps complete in under 100ms end-to-end. Verification is cheap - a single screenshot plus a state predicate - so we can iterate quickly without accumulating silent drift.

What's next

Streaming. The current architecture is a step-by-step screenshot loop. The natural evolution is a streaming capture pipeline running at 20–30fps, emitting actions at 5–10Hz. This closes the perception-action loop properly and makes drag, hover, and scroll feel natural rather than mechanical. The planner hooks into the same stream but only activates on escalations, keeping it off the critical path.

Micro-policies. Frequently repeated workflows - daily RPA tasks, recurring data entry - get compiled into local policies that run without calling the planner at all. After a workflow runs successfully a few times, the reasoning model distills its plans into the grounding model through fine-tuning, so more steps stay on the fast path. The planner becomes a background teacher rather than a runtime dependency.

End-to-end. The long-term goal is to eliminate the explicit planner-grounding split entirely and go directly from screenshots to mouse/keyboard actions with a single model. This is the same architectural shift Tesla made with FSD: from a rules-based stack with explicit intermediate representations to cameras-to-steering end-to-end. The brittle policies and hand-crafted routing logic go away; the model learns to allocate its own compute based on task complexity. We're not there yet - keeping a planner in the loop for edge cases and safety is the right call right now - but as the executor absorbs more patterns through streaming and distillation, the stack will simplify.

Related work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.
