Role: Design lead, PM.
Context: Confidential · Details available upon request.
Explorations in a universal interface for AI to interact with the digital world, built on IBM's CUGA framework (#1 on the AppWorld benchmark, #2 on WebArena).
→ Four-week mini-MVP exploring the supervision interface for a computer-using agent operating across arbitrary web environments.
→ Built on IBM's CUGA framework (then #1 on AppWorld, #2 on WebArena). Interface work was the deliverable.
→ Core insight: the interface the agent is using is not what the supervisor needs to see. What they need is a legible representation of intent, progress, and risk.
→ Established a two-level task log, plan-preview-before-execute, and irreversibility-gated confirmations. The category converged on versions of all three over the next year.
→ What we got wrong: underestimating how much supervision cost matters when tasks run longer than a few minutes. Watching a cursor move slowly around a screen isn't engaging for users!
When an AI agent can navigate any website, click anything, fill any form, and reason across applications it was never meant to touch, the hard question stops being can it. It becomes how does the person watching it know when to let it continue and when to intervene. Four weeks on Webglide was a first pass at answering that. Because it was built on IBM's CUGA framework, which held #1 on AppWorld and #2 on WebArena at the time, technical capability was not the open question. The supervision interface was.
In mid-2025, computer-using agents represented a qualitative shift in what AI systems could do. Earlier agents worked inside defined tool sets, structured outputs, known APIs. A computer-using agent operates across arbitrary interfaces: it sees what a person sees, acts where a person would act, reasons across applications that were never designed to be automated.
CUGA's benchmark results (61.7% on WebArena, 48.2% on AppWorld) had confirmed the capability. The design problem was elsewhere: how do you build a supervision interface for a system that can act anywhere, across interfaces you don't own, in ways that are hard to predict in advance? The category was being defined in real time. Anthropic had just released Computer Use and OpenAI had just released Operator, and nobody had a shared answer for the interaction model. The month was an attempt to find one.
If an agent can operate any interface, what does the human actually need to see?
The instinct was to design for the browser the way a person would see it. Live iframe of the agent's session. Screenshots streamed back on every step. An overlay showing where the cursor is about to click. Buttons highlighted as the agent hovers. A task log running alongside. The idea was that if you could watch the agent work, you could trust it. This is the pattern Runner H, Operator, and Claude Computer Use had all landed on. It was legible research precedent, and it felt like the obviously correct answer to "how do I supervise an agent": a person watching another person use a browser understands what's happening, so a person watching an agent use a browser should understand it the same way.
I built that version. It looked good. It was wrong.
The more useful reframe: what does a person need to know to decide whether to let the agent continue or intervene?
Two problems showed up almost immediately, and they compounded; a third observation pointed at what the interface actually had to do.
1. Agents in live environments are slow. Not one-click-slow, job-interview-slow. Real agents in real browsers take 20 to 60 seconds per meaningful step. A four-step task is several minutes of watching a page load, a cursor move, a selector resolve. A person watching that is not supervising, they are waiting.
2. Watching was uninformative. Even when the agent was working, the live view told the supervisor almost nothing about whether the task would succeed. Seeing the agent click the right search result looked the same as seeing it click the wrong one. The visual fidelity was doing emotional work (look, it's working) but not supervisory work (here is the specific thing you need to decide about). The pattern optimised for the feeling of oversight, not the reality of it.
3. Surface decision points. Some actions are low-stakes and should proceed automatically. Others (form submissions, purchases, account changes, anything irreversible) require explicit confirmation. The interface needed to make that distinction consistently without turning every click into a dialog; a rough sketch of that gate follows this list.
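A minimal sketch of how that gate might be encoded, assuming a hand-written action-risk mapping and a generic confirm callback. The action names and the gate function are illustrative, not CUGA's actual policy API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Risk(Enum):
    LOW = auto()           # proceed automatically, log only
    IRREVERSIBLE = auto()  # pause and require explicit confirmation

# Illustrative mapping; a real system would derive this from the agent's
# action schema rather than a hand-written dict.
ACTION_RISK = {
    "click": Risk.LOW,
    "scroll": Risk.LOW,
    "type_text": Risk.LOW,
    "submit_form": Risk.IRREVERSIBLE,
    "purchase": Risk.IRREVERSIBLE,
    "change_account_setting": Risk.IRREVERSIBLE,
}

@dataclass
class Action:
    kind: str
    description: str

def gate(action: Action, confirm) -> bool:
    """Return True if the action may proceed.

    Low-stakes actions pass through silently; anything classified as
    irreversible blocks until the supervisor explicitly confirms it.
    """
    risk = ACTION_RISK.get(action.kind, Risk.IRREVERSIBLE)  # unknown = cautious
    if risk is Risk.LOW:
        return True
    return confirm(f"Agent wants to: {action.description}. Allow?")

# Usage: the prototype's confirmation dialog would supply `confirm`;
# a console prompt stands in for it here.
if __name__ == "__main__":
    ok = gate(Action("submit_form", "submit the checkout form on store.example"),
              confirm=lambda msg: input(msg + " [y/N] ").strip().lower() == "y")
    print("proceed" if ok else "blocked")
```

The design choice is that anything not explicitly classified defaults to requiring confirmation: an unnecessary dialog is cheaper than an unconfirmed irreversible action.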
Put together: the interface I'd built was maximally impressive as a screenshot and minimally useful as a tool. A user would sit through two minutes of handsome animation and still have no idea whether to trust the result.
The move that saved the project was to stop trying to make one agent legible and start treating the agent as a worker you shouldn't have to watch at all. If one agent is slow, run several in parallel. If watching a single browser is boring and uninformative, don't watch any of them. Show their work as a running feed: task in, status updates as short text, results out. The interface looks less like a browser and more like a chat with multiple running queries.
This is a different supervision model. In the first version, the supervisor is a pilot with hands on the controls. In the second, the supervisor is a manager reading status updates from a team. The pilot model looks cinematic and fails under load. The manager model looks boring and scales.
Concretely, the reframe changed three things:
→ Live browser became optional. It survived as a drill-down for when a specific run looked wrong, not as the primary surface.
→ Parallel runs became the unit of interaction. One task input could fan out to several agents hitting different sources or trying different strategies. The user sees a feed of concurrent work, not a single animated movie.
→ Text became the main channel. Plans, step summaries, and results are compressed into short written updates a person can scan in two seconds. That's what makes running five at once possible.
I built a small prototype of this version in the remaining time. It was visually less impressive than the browser-mirror version. It was also the version that worked.
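A minimal sketch of the fan-out-and-feed model behind that prototype, with the agent itself stubbed out: run_agent here is a placeholder for whatever drives a single CUGA session, and the status strings are illustrative.

```python
import asyncio
import random

# Placeholder for a single agent run. In the prototype this would wrap a
# real browser-driving session; here it just emits short text status updates.
async def run_agent(run_id: str, task: str, feed: asyncio.Queue) -> None:
    await feed.put(f"[{run_id}] plan: {task}")
    for step in ("opening source", "extracting results", "cross-checking"):
        await asyncio.sleep(random.uniform(0.5, 1.5))  # stands in for slow browser steps
        await feed.put(f"[{run_id}] {step}")
    await feed.put(f"[{run_id}] done: 3 results found")

async def supervise(task: str, strategies: list[str]) -> None:
    """Fan one task out to several parallel runs and render their updates
    as a single scannable text feed, instead of watching any one browser."""
    feed: asyncio.Queue = asyncio.Queue()
    runs = [asyncio.create_task(run_agent(f"run-{i}", f"{task} via {s}", feed))
            for i, s in enumerate(strategies, start=1)]
    pending = len(runs)
    while pending:
        line = await feed.get()
        print(line)  # the supervisor reads this feed, not the browsers
        if line.endswith("results found"):
            pending -= 1
    await asyncio.gather(*runs)

if __name__ == "__main__":
    asyncio.run(supervise("find three quotes for shipping a pallet to Rotterdam",
                          ["carrier A site", "carrier B site", "freight marketplace"]))
```

The point of the shape is that the printed feed is the supervisor's surface; the browsers behind each run still exist, but only as an optional drill-down.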
The parallel-agents-as-text-feed prototype was the seed for two subsequent products: Relay (multi-modal channels for agent work) and Flint (multi-agent modelling systems). Both sit on the insight that arrived too late to ship inside Webglide's four weeks: when agents are slow and individually opaque, the interface has to scale horizontally rather than pay for visual fidelity you can't cash in.
That's the main thing Webglide changed in how I think about agent interfaces. The 2025 consensus (Runner H, Operator, the split-pane live-browser pattern) is a local maximum. It works for one-agent demos. It does not work for the actual supervision problem, which is about managing many runs, most of which you will never look at closely, and knowing which ones need your attention. Capability without scalable supervision is not a product. It is a pile of very expensive tabs.
Kill the live-browser instinct a week earlier. I spent the first two weeks building something that looked like Runner H because the category had legitimised that shape. A user test on day four would have told me the same thing week three did: watching is not supervising. The lesson I took forward into Relay was to test the supervision model against the actual failure modes (vigilance drop-off, multi-run attention management) before committing to a visual direction.
The broader principle that came out of Webglide is that as agents become more capable the interface becomes more important, not less. Capability without observability is not a product. It is a liability, and the design job is building the observability layer that makes the capability trustworthy enough to use.
Four weeks was enough to establish the spine. It was not enough to solve the harder problems underneath. Watching a capable agent run for 20 minutes tests a different supervisor than watching it run for 2. Vigilance is exactly the task humans are worst at, which means the supervision interface has to offload vigilance onto the system (diffs, anomaly flagging, divergence from plan) rather than asking the person to maintain it. That is the V2 problem.
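One concrete shape that offloading could take is plan-divergence flagging: the system compares each executed step against the declared plan and raises only the mismatches. A minimal sketch, assuming plans and steps arrive as short strings; the string-similarity rule and threshold here are deliberately crude and purely illustrative.

```python
from difflib import SequenceMatcher

def divergence_flags(plan: list[str], executed: list[str],
                     threshold: float = 0.55) -> list[str]:
    """Flag executed steps that don't resemble the corresponding planned step.

    The supervisor only sees the flags, not every step; vigilance is pushed
    onto this check instead of onto the person.
    """
    flags = []
    for i, step in enumerate(executed):
        planned = plan[i] if i < len(plan) else None
        if planned is None:
            flags.append(f"step {i + 1}: unplanned action: {step!r}")
            continue
        similarity = SequenceMatcher(None, planned.lower(), step.lower()).ratio()
        if similarity < threshold:
            flags.append(f"step {i + 1}: diverged from plan "
                         f"({planned!r} -> {step!r}, similarity {similarity:.2f})")
    return flags

# Example: the third executed step wanders off the declared plan and is surfaced.
print(divergence_flags(
    plan=["open supplier portal", "search for invoice 4411", "download the PDF"],
    executed=["open supplier portal", "search for invoice 4411",
              "navigate to account settings"],
))
```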
CUGA (ConfigUrable Generalist Agent) is an open-source generalist agent framework from IBM Research, purpose-built for enterprise automation. Designed for developers, CUGA combines and improves on foundational agentic patterns such as ReAct, CodeAct, and Planner-Executor in a modular architecture that enables trustworthy, policy-aware, and composable automation across web interfaces, APIs, and custom enterprise systems.
CUGA achieves state-of-the-art performance on leading benchmarks:
→🥇 #1 on AppWorld — a benchmark with 750 real-world tasks across 457 APIs
→🥈 #2 on WebArena — a complex benchmark for autonomous web agents across application domains.
Key features relevant to the supervision question:
→ Complex task execution: state-of-the-art results across web and API tasks.
→ Flexible tool integrations: CUGA works across REST APIs via OpenAPI specs, MCP servers, and custom connectors.
→ Composable agent architecture: CUGA itself can be exposed as a tool to other agents, enabling nested reasoning and multi-agent collaboration.
→ Configurable reasoning modes: Choose between fast heuristics or deep planning depending on your task’s complexity and latency needs.
→ Policy-aware instructions (Experimental): CUGA components can be configured with policy-aware instructions to improve alignment of the agent behavior.
→ Save & Reuse (Experimental): CUGA captures and reuses successful execution paths, enabling consistent and faster behavior across repeated tasks.
ReAct (Reasoning and Acting) is a framework for AI agents that enhances a large language model's (LLM's) ability to handle complex, multi-step tasks by integrating logical reasoning (Thought) with external action (Action) in a continuous, interleaved loop. The core idea is to prompt the LLM to alternate between articulating its chain of thought to plan and track progress, and executing actions like using a search engine, calling an API, or interacting with a database.
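A minimal sketch of that loop, with the model call and the tool stubbed out: llm and search below are placeholders, not CUGA's or any particular library's API, and the canned replies exist only to show the Thought → Action → Observation shape.

```python
# Minimal ReAct-style loop: the model alternates Thought / Action / Observation
# until it emits a final answer. `llm` is a stub standing in for a real model call.

def llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a language model here.
    if "Observation:" not in prompt:
        return "Thought: I need the current price.\nAction: search[widget price]"
    return "Thought: I have enough information.\nFinal Answer: The widget costs $4."

def search(query: str) -> str:
    # Placeholder tool; a real agent would hit a search API or a browser here.
    return "Top result: widgets are listed at $4 each."

TOOLS = {"search": search}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += "\n" + output
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[argument]", execute the tool, and feed the result
        # back into the transcript as an Observation for the next model call.
        action_line = next(line for line in output.splitlines()
                           if line.startswith("Action:"))
        tool_name, arg = action_line.removeprefix("Action: ").rstrip("]").split("[", 1)
        observation = TOOLS[tool_name](arg)
        transcript += f"\nObservation: {observation}"
    return "No answer within step budget."

print(react("How much does a widget cost?"))
```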
More Here: https://research.ibm.com/blog/cuga-agent-framework