JUNE - JULY 2025

Webglide - Computer-Using Agents, 2025

Explorations in a universal interface for AI to interact with the web.

Role: Design lead, PM.
Explorations in a universal interface for AI to interact with the digital world, using IBM's CUGA framework (#1 on the AppWorld benchmark and #2 on WebArena at the time).

Summary

→ Four-week mini-MVP exploring the supervision interface for a computer-using agent across web environments.
→ Built on IBM's CUGA framework (then #1 on AppWorld, #2 on WebArena).
→ Core insight: users don't want to watch and evaluate the agent turn by turn. What they need is a legible representation of intent, progress, and risk.
→ Established a two-level task log, plan-preview-before-execute, and irreversibility-gated confirmation.
→ What we got wrong: underestimating how much supervision cost matters when tasks run longer than a few minutes. Watching a cursor move slowly around a screen isn't engaging for users!

When an AI agent can navigate any website, click anything, fill any form, and reason across applications, it calls into question your own actions on a screen. When you watch an agent slowly take those actions for you (clicking a button, filling out a form, scraping a webpage), you notice how efficient people have become at expressing intent on screen. Four weeks on Webglide was a first pass at understanding new patterns for emerging computer-using agents. I also wrote a blog post on the history of some early experiments and patterns in this kind of agent.

Webglide was built on IBM's CUGA framework, which held #1 on AppWorld and #2 on WebArena at the time, so technical capability was not the open question. The supervision interface was.

In mid-2025, computer-using agents represented a qualitative shift in what AI systems could do. Earlier agents worked inside defined tool sets, structured outputs, known APIs. A computer-using agent operates across arbitrary interfaces: it sees what a person sees, acts where a person would act, reasons across applications that were never designed to be automated.

CUGA's benchmark results (61.7% on WebArena, 48.2% on AppWorld) had confirmed the capability. The design problem was elsewhere. How do you build a supervision interface for a system that can act anywhere, across interfaces you don't own, in ways that are hard to predict in advance? The category was being defined in real time. Anthropic and OpenAI had recently released their computer-use agents, and nobody had a shared answer for how to build the interaction model. This four-week project was an attempt to find one.

The first direction (and why I chose it)

If an agent can operate any interface, what does the human actually need to see?

The instinct was to design for the browser the way a person would see it: an overlay showing where the cursor is about to click, and the cursor moving through screens at its own speed. The idea was that if you could watch the agent work, you could trust it. This was the pattern Runner H, Operator, and Claude Computer Use had all landed on. It was a legible research precedent. It felt like the obviously correct answer to ‘how do I supervise an agent’. A person watching another person use a browser understands what's happening, so a person watching an agent use a browser should understand it the same way. But the relationship is different.

I built that version. It looked good. It was wrong.

The more useful reframe: what does a person need to know to decide whether to let the agent continue or intervene?

What broke?

Two problems showed up almost immediately, and they compounded. A third observation showed promise.
1. Agents in live environments are slow.
Not one-click slow, tediously slow. Real agents in real browsers take 20 to 60 seconds per meaningful step. A four-step task is several minutes of watching a page load, a cursor move, a selector resolve. A person watching that is not supervising. They’re merely waiting.

2. Watching was uninformative.
Even when the agent was working, the live view fed back information about actions and tool use but told the supervisor almost nothing about whether the task would succeed. Seeing the agent click the right search result looked the same as seeing it click the wrong one. The visual fidelity was doing emotional work (look, it's working) but not supervisory work (here is the specific thing you need to decide about). The pattern optimised for the feeling of oversight, not the reality of it.

3. Surface decision points.
Some actions are low-stakes and should proceed automatically. Others (form submissions, purchases, account changes, anything irreversible) require explicit confirmation. The interface needed to make that distinction consistently without turning every click into a dialog.
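
As a rough illustration of that gate, here is a minimal Python sketch. It is not Webglide's or CUGA's implementation: the action kinds, the `Action` type, and the `confirm` callback are all hypothetical names chosen for the example. Low-stakes actions run straight through, and only actions classified as irreversible pause for a decision.

```python
from dataclasses import dataclass

# Hypothetical action kinds, mirroring the examples in the text
# (form submissions, purchases, account changes).
IRREVERSIBLE_KINDS = {"submit_form", "purchase", "account_change"}


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type", "purchase"
    target: str  # free-text description of what is acted on


def needs_confirmation(action: Action) -> bool:
    """Only irreversible actions interrupt the supervisor."""
    return action.kind in IRREVERSIBLE_KINDS


def run(actions, confirm):
    """Execute actions in order, pausing only at irreversible ones.

    `confirm` is a callback that returns True to proceed. A denied
    action is logged as blocked and the run stops there.
    """
    log = []
    for action in actions:
        if needs_confirmation(action) and not confirm(action):
            log.append(("blocked", action))
            break
        log.append(("done", action))
    return log
```

In this sketch a run of click, type, purchase asks the supervisor exactly once, at the purchase. A real classifier would need more than a fixed set of kinds, since the same click can be harmless on one page and irreversible on another.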

Put together: the interface I'd built tracked the agent's actions but was only minimally useful as a tool. A user would sit through two minutes of slow animation and still have no idea whether to trust the result.

The reframe

The move that saved the project was to stop trying to make one agent legible and start treating the agent as a worker you shouldn't have to watch at all. If one agent is slow, run several in parallel. If watching a single browser is boring and uninformative, don't watch any of them. Show their work as a running feed: task in, status updates as short text, and results out. The interface looks less like a browser and more like a chat with multiple running queries.

This is a different supervision model. In the first version, the supervisor is a pilot with hands on the controls. In the second, the supervisor is a manager reading status updates from a team.

Concretely, the reframe changed three things:
→ Live browser became optional. It survived as a drill-down for when a specific run looked wrong, not as the primary surface.
→ Parallel runs became the unit of interaction. One task input could fan out to several agents hitting different sources or trying different strategies. The user sees a feed of concurrent work, not a single animated movie.
→ Text became the main channel. Plans, step summaries, and results are compressed into short written updates a person can scan in two seconds. That's what makes running five at once possible.
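
The fan-out-and-feed idea can be sketched in a few lines of Python with asyncio. This is an assumption-heavy illustration rather than the prototype's code: `fake_agent` stands in for a real agent run, and `Update`, `fan_out`, and the source names are invented for the example. For brevity the updates are collected after the runs finish, where a real feed would render them as they arrive.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Update:
    run_id: int
    status: str  # "planned", "step", or "result"
    text: str    # short line a person can scan quickly


async def fake_agent(run_id, task, source, feed):
    """Hypothetical stand-in for one agent run: emits text, not pixels."""
    await feed.put(Update(run_id, "planned", f"{task}: search {source}"))
    await asyncio.sleep(0.01 * (run_id + 1))  # placeholder for slow browser work
    await feed.put(Update(run_id, "step", f"opened {source}"))
    result = f"answer from {source}"
    await feed.put(Update(run_id, "result", result))
    return result


async def fan_out(task, sources):
    """Run one agent per source concurrently, sharing a single text feed."""
    feed = asyncio.Queue()
    runs = [fake_agent(i, task, s, feed) for i, s in enumerate(sources)]
    results = await asyncio.gather(*runs)  # results keep the order of `sources`
    updates = []
    while not feed.empty():
        updates.append(feed.get_nowait())
    return list(results), updates


def demo():
    return asyncio.run(fan_out("find pricing", ["site-a", "site-b", "site-c"]))
```

The point of the shape is that each run costs the supervisor three short lines of text, so adding a fourth or fifth run adds reading, not watching.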

I built a small prototype of this version in the remaining time. It was visually less impressive than the browser-mirror version. It was also the version that worked.

What this unlocked

The parallel-agents-as-text-feed prototype was the seed for two subsequent products: Relay (multi-modal channels for agent work) and Flint (multi-agent modelling systems). Both sit on the insight that arrived too late to ship inside Webglide's four weeks: when agents are slow and individually opaque, the interface has to scale horizontally rather than pay for visual fidelity you can't cash in.

That's the main thing Webglide changed in how I think about agent interfaces. The 2025 consensus (Runner H, Operator, the split-pane live-browser pattern) is a local maximum. It works for one-agent demos. It does not work for the actual supervision problem, which is about managing many runs, most of which you will never look at closely, and knowing which ones need your attention. Capability without scalable supervision is not a product. It is a pile of very expensive tabs.

What I'd do differently

Drop the live-browser instinct a week earlier. I spent the first two weeks building something that looked like Runner H because the category had legitimised that shape. A user test on day four would have told me that nobody wants to watch an agent carry out tasks on their behalf. The detour was still an important one, because the pattern that emerged afterwards seemed to validate that finding: tool use and computer use were later embedded in the chat itself. The lesson I took forward into Relay was to test the supervision model against the actual failure modes before committing to a visual direction.

What this opened up

The broader principle that came out of Webglide is that as agents become more capable the interface becomes more important, not less. Capability without observability is not a product. It is a liability, and the design job is building the observability layer that makes the capability trustworthy enough to use.

Four weeks was enough to establish the spine. It was not enough to solve the harder problems underneath. Watching a capable agent run for 20 minutes asks something different of a supervisor than watching it run for 2. Vigilance is exactly the task humans are worst at, which means the supervision interface has to offload vigilance onto the system (diffs, anomaly flagging, divergence from plan) rather than asking the person to maintain it. That is the V2 problem.
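
One of those offloaded checks, divergence from plan, can be sketched very simply. The function below is a hypothetical illustration, not something Webglide shipped: it assumes the previewed plan and the executed steps are both available as lists of short step labels, and it compares them position by position.

```python
def flag_divergence(plan, executed):
    """Compare executed step labels to the previewed plan and return alerts.

    Assumes both arguments are lists of short strings in execution order.
    """
    alerts = []
    for i, step in enumerate(executed):
        if i >= len(plan):
            alerts.append(f"step {i + 1}: unplanned extra step '{step}'")
        elif step != plan[i]:
            alerts.append(f"step {i + 1}: expected '{plan[i]}', got '{step}'")
    return alerts
```

With a plan of open site, search, read result and an execution of open site, login, read result, submit form, this returns two alerts: a mismatch at step 2 and an unplanned step 4. The supervisor only reads the run when the list is non-empty, which is the sense in which vigilance moves to the system.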

About CUGA

CUGA (ConfigUrable Generalist Agent) is an open-source generalist agent framework from IBM Research, purpose-built for enterprise automation. Designed for developers, CUGA combines and improves on the best of foundational agentic patterns, such as ReAct, CodeAct, and Planner-Executor, in a modular architecture enabling trustworthy, policy-aware, and composable automation across web interfaces, APIs, and custom enterprise systems.
CUGA achieves state-of-the-art performance on leading benchmarks:
→🥇 #1 on AppWorld — a benchmark with 750 real-world tasks across 457 APIs
→🥈 #2 on WebArena — a complex benchmark for autonomous web agents across application domains.

Key features relevant to the supervision question:
→ Complex task execution: State-of-the-art results across Web and APIs.
→ Flexible tool integrations: CUGA works across REST APIs via OpenAPI specs, MCP servers, and custom connectors.
→ Composable agent architecture: CUGA itself can be exposed as a tool to other agents, enabling nested reasoning and multi-agent collaboration.
→ Configurable reasoning modes: Choose between fast heuristics or deep planning depending on your task’s complexity and latency needs.
→ Policy-aware instructions (Experimental): CUGA components can be configured with policy-aware instructions to improve alignment of the agent behavior.
→ Save & Reuse (Experimental): CUGA captures and reuses successful execution paths, enabling consistent and faster behavior across repeated tasks.

ReAct Agents

ReAct (Reasoning and Acting) agents are a powerful framework for AI agents that enhance a Large Language Model’s (LLM) ability to handle complex, multi-step tasks by integrating logical reasoning (Thought) with external action (Action) in a continuous, interleaved loop. The core idea is to prompt the LLM to alternate between articulating its chain-of-thought to plan and track its progress, and executing actions like using a search engine, calling an API, or interacting with a database.
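
The interleaved loop described above can be shown with a toy Python sketch. This is not CUGA's code and calls no real LLM: `scripted_model` is a hypothetical stand-in that returns a (thought, action, argument) triple, and `TOOLS` holds a single made-up lookup tool.

```python
def react_loop(model, tools, question, max_steps=5):
    """Alternate Thought -> Action -> Observation until the model finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        observation = tools[action](arg)  # act on the outside world
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # gave up after max_steps


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: decides from the last transcript line."""
    last = transcript[-1]
    if last.startswith("Observation: "):
        return ("I have what I need.", "finish", last[len("Observation: "):])
    return ("I should look this up.", "lookup", "capital of France")


# Hypothetical tool registry with one canned lookup.
TOOLS = {"lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown")}
```

Calling `react_loop(scripted_model, TOOLS, "What is the capital of France?")` produces a transcript of Question, Thought, Action, Observation, Thought, Answer. That transcript is the kind of text a supervision interface can summarise instead of mirroring the browser.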

More here: https://research.ibm.com/blog/cuga-agent-framework

Benjamin Woodmansee is an AI product designer working on frontier AI systems including agent interfaces, computer-using agents, generative AI tools, developer platforms, and human-AI interaction design.