Closing the Agent Loop

12/10/2025

Coding agents have a hole in their feedback loop.

They can write code, run tests, check linters, iterate for hours. But they can't see what users see. For any app with a frontend, the loop never closes.

Every browser automation tool I've tried fails the same way. Context windows explode. Agents wander aimlessly. Your API bill climbs while the agent spends 30 turns clicking a dropdown.

Having spent considerable time in the browser agent space, I've been thinking about what a better dev-centric browser agent might look like.

Why Coding Agents Work Now

A year ago, AI coding meant "write this function." Now OpenAI and Anthropic advertise agents that work autonomously for hours. What changed?

Verification.

The Coding Agent Loop

Modern coding agents run an observe-think-act loop:

  1. Observe - Read files, run commands, check test output
  2. Think - Reason about what's wrong
  3. Act - Edit code, run builds, execute tests

The agent iterates until tests pass, linters are happy, and the build succeeds. This is why they can work for hours without human intervention.

As agents get better, these steps get fuzzy. Your coding agent runs commands to observe the environment while simultaneously making tool calls during reasoning. But the high-level structure holds.
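
In pseudocode, that structure is roughly the following. The helpers here (runCommand, askModel, applyEdits) are placeholders for illustration, not any particular framework's API:

  // Rough sketch of the observe-think-act loop. runCommand, askModel,
  // and applyEdits are hypothetical placeholders, not a real API.
  async function agentLoop(task) {
    while (true) {
      // Observe: gather ground truth from the environment.
      const observation = await runCommand("npm test && npm run lint");

      // Think: let the model reason over the task and the observation.
      const decision = await askModel({ task, observation });
      if (decision.done) return decision.summary;

      // Act: apply the edits or commands the model proposed, then loop.
      await applyEdits(decision.edits);
    }
  }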

The Missing Link

For frontend work, the "observe" step is broken. Agents can't see what users see.

Modern web apps aren't just HTML over HTTP. They're JavaScript-heavy, stateful, interactive. You can't verify them by hitting REST endpoints. You need a browser.

Even if I never write full browser e2e tests (and let's be honest, most of us won't), I'm going to test manually at some point. Giving agents this ability matters.

There are plenty of attempts to solve this. I've tried most of them, and while they're impressive, they share a few issues that make them hard to use in a coding agent context.

The Status Quo

Approach 1: Step-by-Step Browser Control

The most common approach mirrors human browsing: look at the page, decide what to click, click it, repeat.

Tools like Playwright MCP and browser-use give agents screenshot, click, scroll, and fill capabilities. Here's what it looks like:

user  : go to google.com and search for eggs. open the first result

agent : navigate("https://google.com")
agent : read_page()
         <html>
           <body>
             <input type="text" id="search" placeholder="Search" />
             <button type="submit">Search</button>
           </body>
         </html>
agent : type("#search", "eggs")
agent : submit("button[type='submit']")
agent : click("a:has-text('Eggs - Wikipedia')")
agent : read_page()
         <html>
           <body>
             <h1>Eggs</h1>
             <p>Eggs are a good source of protein and vitamins.</p>
           </body>
         </html>
agent : "I have opened the wikipedia page for eggs"

This example is simplified. Real pages have thousands of DOM nodes. Most tools strip the DOM down to accessibility trees or custom formats, but it's still massive.

The problems:

  • Context explosion. Every step needs a screenshot or DOM snapshot. Context balloons.
  • Tool overload. Playwright MCP has 33 tools, enough that the agent has a hard time using them all effectively. It even has a tool to run Playwright scripts directly, but the agent rarely reaches for it.
  • Black box assumptions. These tools assume you don't have the source code, so the agent ends up interacting with your app running on localhost the same way it would interact with amazon.com.

Approach 2: Generate Playwright Scripts

Models are good at writing code. Why not have them write Playwright scripts?

The feedback loop kills it.

Browser automation is trial and error. Wait for this network request. Check if that element is visible. When the agent writes a script and it fails, it has to make a change and rerun the whole thing from the top, so each failure compounds the turnaround time.

Playwright, Puppeteer, and Selenium are designed for CI: start fresh, run script, validate. That works for humans writing tests, not for agents iterating interactively.
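
For illustration, a typical one-shot script looks something like this. The selectors and localhost URL are made up for the example; the pattern is what matters:

  // A standard CI-style Playwright script: launch fresh, replay the whole
  // flow, validate at the end. Any fix means rerunning from the top.
  import { chromium } from "playwright";

  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto("http://localhost:3000");
  await page.getByRole("button", { name: "Add game" }).click();
  await page.getByLabel("Title").fill("Outer Wilds");
  await page.getByRole("button", { name: "Save" }).click();

  // If this check fails, the agent edits the script and pays for the
  // browser launch and every prior step all over again.
  await page.getByText("Outer Wilds").waitFor();
  await browser.close();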

A Different Approach

Step-by-step control bloats context. Script generation has no feedback loop.

What if we combined them? Keep the browser session alive between script executions.

The best of both: explore step-by-step when learning a page, write scripts once you know the flow.

I tried prototyping this as a Claude Code skill called Dev Browser. A skill, if you aren't familiar, is a plain markdown file with a short description telling the LLM when to read it and apply its instructions. Alongside the markdown instructions on how to use Playwright, there is a small JavaScript library that provides ergonomic functions so Claude can write Playwright scripts against the running browser interactively.

The instructions and philosophy basically boil down to this:

Persistent pages. The browser runs as a server. Pages survive between script executions. The agent can navigate to a page and continue interacting with it across multiple scripts without starting over.
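
With stock Playwright, you can approximate this by attaching to a long-lived browser over CDP instead of launching a fresh one in every script. Dev Browser wraps this behind its own helpers, so treat the snippet below as a sketch of the idea rather than its actual interface; the link name is invented:

  // Assumes a Chromium instance was started separately with
  // --remote-debugging-port=9222 and stays running between scripts.
  import { chromium } from "playwright";

  const browser = await chromium.connectOverCDP("http://localhost:9222");
  const page = browser.contexts()[0].pages()[0];

  // Pick up whatever state the page is already in...
  await page.getByRole("link", { name: "Library" }).click();
  console.log(await page.title());

  // ...then disconnect without killing the browser, so the next script
  // resumes exactly where this one left off.
  await browser.close();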

Small scripts, tight loops. Each script does one thing: navigate, click, fill, check. Run it, evaluate the result, decide what's next. The agent builds understanding incrementally instead of writing monolithic test scripts that fail opaquely. Once it knows an approach works, it can write longer scripts to speed through repetitive actions.

A11y snapshots for discovery. When the agent doesn't know a page's structure, it requests an accessibility tree with semantic roles, element names, and stable refs. This is just exposing the browser_snapshot tool that Playwright MCP provides, but in a programmatic interface that can be executed from a script. The agent sees button "Submit" [ref=e5] and can interact with e5 directly.
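
Concretely, the discovery flow might look like the sketch below. The helper names are hypothetical stand-ins, not Dev Browser's published API; the snapshot format mirrors what Playwright MCP's browser_snapshot returns:

  // Hypothetical helpers for illustration only: getPage, snapshot, and
  // refLocator are assumed names, not a real package interface.
  import { getPage, snapshot, refLocator } from "dev-browser";

  const page = await getPage("app");        // reuse the persistent page
  console.log(await snapshot(page));
  // - heading "My Games" [ref=e2]
  // - button "Add game" [ref=e5]

  // Once the ref is known, the agent can act on it directly in the next script.
  await refLocator(page, "e5").click();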

Source code as ground truth. If a button doesn't work, the agent checks the onClick handler. If a form submits wrong, it reads the validation logic. The codebase is right there.

The key insight: models already know Playwright, but Playwright is poorly suited to a live coding context. We need to bend it so that it fits this use case.

Results

I ran a small eval to test this approach: three techniques, same task (adding games to a personal tracking site), three runs each. It isn't exhaustive, but it's enough to sniff-test whether the idea holds up.

  Method              Time     Cost    Turns   Success
  Dev Browser         3m 53s   $0.88   29      100%
  Playwright MCP      4m 31s   $1.45   51      100%
  Playwright Skill    8m 07s   $1.45   38      67%

Dev Browser: 14% faster, 39% cheaper, 43% fewer turns than the best alternative.

In the video, you can see the agent incrementally navigate the page, learning how the UI works step by step. Once it figures out an approach that works, it writes a script that adds all three games at once instead of repeating the exploration.

Full methodology: dev-browser-eval.

Overall I was impressed with the results. Dev Browser was faster, cheaper, and more accurate than the other alternatives.

Closing the Loop

Most browser agent work focuses on the public web: cloud browsers, bot detection, anti-scraping. Development is different. You have the source code. You control the server. You can add whatever instrumentation you need.

That's a fundamentally easier problem. The tools should reflect that.

Browser automation is one of the largest remaining gaps in the coding agent loop. Once agents can verify frontend behavior, they own the full cycle: write code, run tests, check the UI, iterate until it works. The loop closes.

I'd love to see this approach integrated into development environments like Cursor's browser. The pieces are there. Someone just needs to put them together.
