[HN Gopher] A stateful browser agent using self-healing DOM maps
       ___________________________________________________________________
        
       A stateful browser agent using self-healing DOM maps
        
       Author : shardullavekar
       Score  : 100 points
       Date   : 2025-10-16 12:21 UTC (10 hours ago)
        
 (HTM) web link (100x.bot)
        
       | brianjking wrote:
       | Is this able to load for anyone?
        
         | shardullavekar wrote:
         | It's a Chrome extension. Works if you use Chrome.
        
           | brianjking wrote:
           | I couldn't load the article. I was getting an nginx error
           | initially. I'm able to view it now. I think they were getting
           | a bit squeezed.
        
             | memet_rush wrote:
             | They didn't use the agent to self-heal.
        
         | phgn wrote:
         | Nope. Their entire website shows up with a white screen for me
         | in the latest Chrome.
         | 
         | There's this error in the console: Failed to load module
         | script: Expected a JavaScript-or-Wasm module script but the
         | server responded with a MIME type of "text/html". Strict MIME
         | type checking is enforced for module scripts per HTML spec.
        
       | philo23 wrote:
       | Maybe this is a lack of understanding on my part, but this bit of
       | the explanation sets off alarm bells for me:
       | 
       | > Under the hood, we're building a client-sourced RAG for the
       | DOM. An agent's first move on a page is to check a vector DB for
       | a known "map." ... This creates a wild side-effect: the system is
       | self-healing for everyone. One person's failed automation
       | accidentally fixes it for the next hundred users.
       | 
       | I think I'd like to know exactly what kind of data is extracted
       | from the DOM to build that shared map.
        
         | artpar wrote:
         | Agent4 is going to store the "stable selectors" that worked
         | (the first time it performs a task, most of the time is spent
         | identifying these CSS/XPath selectors). Memories are pretty
         | straightforward at this point: they are stored locally in your
         | browser's IndexedDB (you can inspect them from the Chrome
         | inspector).
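
One way to picture "stable selectors that worked" as a self-healing memory is a ranked list of candidate selectors per task, where the first one that still matches gets promoted. This is only a sketch under assumptions: `SelectorMemory`, `Query` and `resolveSelector` are hypothetical names, and a plain object stands in for the IndexedDB record mentioned above. It is not Agent4's actual storage format.

```typescript
// Hypothetical shape of one remembered task: candidate selectors, best first.
interface SelectorMemory {
  task: string;
  selectors: string[];
}

// Stand-in for a DOM check: returns true if the selector currently matches.
type Query = (selector: string) => boolean;

// Try remembered selectors in order; promote the first one that still works
// so the next run tries it first. Returns null if every candidate is stale.
function resolveSelector(memory: SelectorMemory, matches: Query): string | null {
  for (const candidate of memory.selectors) {
    if (matches(candidate)) {
      memory.selectors = [candidate].concat(
        memory.selectors.filter((s) => s !== candidate)
      );
      return candidate;
    }
  }
  return null;
}
```

A `null` result is the point where an agent would have to re-derive selectors from the live page instead of replaying memory.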
        
           | philo23 wrote:
           | Good to hear, that's what I was hoping that it was doing.
        
           | erichocean wrote:
           | How are you mapping from "click this element" (presumably
           | obtained via a VLM) to the actual DOM locator that refers to
           | it?
           | 
           | I guess Playwright can do it in "record" mode; I'm curious
           | how you do it from a Chrome extension.
           | 
           | Spitballing here, you inject an event filter on the page and
           | when the click happens, grab the element and run some code to
           | synthesize a selector that just refers to that element?
           | (Presumably you could just reuse Playwright's element-to-
           | locator code at this point.)
        
             | artpar wrote:
             | So when you go into the "selector" mode, the plugin will
             | add event listeners to all the DOM nodes. Based on your
             | click, it will first try to generate a bunch of selectors
             | statically (multiple, CSS- and XPath-based), and then,
             | based on your guidance, it's the job of Agent4 to make
             | stable selectors.
        
             | cjr wrote:
             | document.elementFromPoint to get the elem at co-ordinates,
             | then use npm package similar to optimal-select to come up
             | with a unique css selector.
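
The selector-synthesis step described above can be sketched without a browser. In a real page the element would come from `document.elementFromPoint(x, y)`. Here a hypothetical `NodeLike` tree stands in for the DOM, and `synthesizeSelector` is an illustrative helper, not the algorithm used by optimal-select or by Agent4. It assumes ids are unique and need no escaping.

```typescript
// Minimal stand-in for a DOM element so the sketch runs outside a browser.
interface NodeLike {
  tag: string;
  id: string | null;
  parent: NodeLike | null;
  children: NodeLike[];
}

// Hypothetical helper: attach a new child node to `parent` and return it.
function addChild(parent: NodeLike, tag: string, id: string | null = null): NodeLike {
  const node: NodeLike = { tag, id, parent, children: [] };
  parent.children.push(node);
  return node;
}

// Walk up from `el`, stopping at the nearest ancestor with an id; otherwise
// disambiguate same-tag siblings with :nth-of-type at each level.
function synthesizeSelector(el: NodeLike): string {
  const parts: string[] = [];
  let node: NodeLike | null = el;
  while (node !== null) {
    if (node.id !== null) {
      parts.unshift("#" + node.id);
      break;
    }
    const parent: NodeLike | null = node.parent;
    if (parent === null) {
      parts.unshift(node.tag);
      break;
    }
    const tag = node.tag;
    const sameTag = parent.children.filter((c) => c.tag === tag);
    const position = sameTag.indexOf(node) + 1;
    parts.unshift(sameTag.length > 1 ? `${tag}:nth-of-type(${position})` : tag);
    node = parent;
  }
  return parts.join(" > ");
}
```

Positional selectors like these are exactly the kind that break when a page is restructured, which is why the thread goes on to discuss more semantic anchors.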
        
       | tnolet wrote:
       | This is, as far as I understand, self healing ONLY if the name of
       | a CSS class changes. Not for anything else. That seems like a
       | very very very very narrow definition of "self healing": there
       | are 9999 other subtle or not so subtle things that can change per
       | session or per update version of a page.
       | 
       | If you run this against, let's say, a typical e-commerce page
       | where the navigation and all screen elements are super dynamic
       | -- user-specific data, language etc. -- this problem becomes
       | even harder.
        
         | artpar wrote:
         | Everyone thinks of typical e-commerce pages when it comes to
         | a "browser agent doing something", but our real use cases are
         | far from shopping for the user. But your point still stands.
         | The idea is that there may be websites that generating stable
         | selectors/hierarchy maps wouldn't solve, but 80% (from 80-20)
         | of websites are not like that (including a lot of internal
         | dashboards/interfaces). (There will also be issues for
         | websites with proper i18n implementations if the selectors are
         | aria-label based.)
         | 
         | Self-healing CSS selectors are also only one part of the
         | story. The other part is the cohesive interface for the agent
         | itself to use these selectors.
        
           | miguelspizza wrote:
           | > The other part is the cohesive interface for the agent
           | itself to use these selectors
           | 
           | We are incubating this over at the WebMCP web standard
           | proposal. You can see the current draft of explainer for the
           | declarative API.
           | https://github.com/webmachinelearning/webmcp/pull/26
           | 
           | Also, great work on the browser agent, this is the best of
           | the DOM parsing/screenshot agents I've used. I was really
           | impressed with the wordle example
        
         | ljm wrote:
         | My running hypothesis on this is that AI is a sentient
         | screenreader and the last thing you should be worrying about is
         | CSS class names, IDs, data-testid attributes, DOM traversal,
         | and all of these things that are essentially querying the
         | 'internal state' of a page. Classes, IDs, data attributes, etc.
         | aren't a public API and semantic elements, ARIA attributes,
         | etc. _are_.
         | 
         | So, focus on WCAG compliance, following the spec as faithfully
         | as you can. The style or presentation of something may change
         | as part of a simple A/B test but the underlying purpose or
         | behaviour would remain the same.
        
           | runlaszlorun wrote:
           | This might be the last of the weed talking but that's rich.
           | I'm gonna have to chew on that.
           | 
           | And maybe even crack the WCAG docs. Wait... It's a trap...
        
         | pverheggen wrote:
         | I feel like this could work if the selectors are chosen
         | carefully to capture semantic meaning, rather than basing off
         | of something arbitrary like a class name. The agent must have
         | some understanding of the document to be able to perform those
         | actions in the first place.
         | 
         | If it can find an ellipse tool, it's likely based off some
         | combination of accessible role, accessible name, and inner text
         | (perhaps the icon if it's multi-modal.) So in theory, couldn't
         | it capture that criteria in a JS snippet and replay it?
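
The capture-and-replay idea above can be sketched as a locator built from role, accessible name and inner text only. Everything here is a hypothetical illustration: `A11yNode` is a toy record rather than a real accessibility-tree API, and `captureLocator` / `replayLocator` are made-up names. The class name is carried along only to show that the locator ignores it.

```typescript
// Toy view of an element as an accessibility tree might expose it.
interface A11yNode {
  role: string;      // e.g. "button"
  name: string;      // accessible name, e.g. from aria-label
  text: string;      // visible inner text
  className: string; // presentational; deliberately ignored by the locator
}

interface SemanticLocator {
  role: string;
  name: string;
  text: string;
}

// Normalize whitespace and case so cosmetic changes do not break matching.
const norm = (s: string): string => s.trim().replace(/\s+/g, " ").toLowerCase();

// Record the semantic criteria that identified the element.
function captureLocator(node: A11yNode): SemanticLocator {
  return { role: node.role, name: norm(node.name), text: norm(node.text) };
}

// Replay the criteria against a later version of the page.
function replayLocator(loc: SemanticLocator, page: A11yNode[]): A11yNode[] {
  return page.filter(
    (n) => n.role === loc.role && norm(n.name) === loc.name && norm(n.text) === loc.text
  );
}
```

A locator like this survives a class rename or reordering, but it would still break on a translated accessible name, which is the i18n caveat raised earlier in the thread.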
        
           | artpar wrote:
           | That's exactly what it is doing. The workflows are pretty
           | much JS snippets themselves; you can see them in the "code"
           | tab (in the plugin, when you select a saved workflow).
        
       | simpaticoder wrote:
       | Couldn't you solve this by having the agent do a first pass
       | through a page and generate a (java)script that interacts with
       | the interesting parts of the page, and then prepend the script
       | (if it's short enough) or a list of entry points (if it's not) to
       | the prompt such that subsequent interactions invoke the script
       | rather than interact directly with the page?
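
The "prepend the script if short enough, otherwise a list of entry points" rule above could look roughly like this. It is a sketch under assumptions: `PageScript`, `buildPrompt` and the `maxScriptChars` budget are hypothetical, and a real agent would more likely budget in tokens than characters.

```typescript
// Hypothetical result of the agent's first pass over a page.
interface PageScript {
  source: string;        // generated JavaScript for interacting with the page
  entryPoints: string[]; // callable names exposed by that script
}

// Prepend the whole script when it fits the budget, else only its entry points.
function buildPrompt(userQuery: string, script: PageScript, maxScriptChars: number): string {
  const context =
    script.source.length <= maxScriptChars
      ? "Page script:\n" + script.source
      : "Page script entry points:\n" +
        script.entryPoints.map((e) => "- " + e).join("\n");
  return context + "\n\nTask: " + userQuery;
}
```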
        
         | artpar wrote:
         | If I am reading you correctly, you captured the whole essence
         | of agent4.
         | 
         | So it does the first pass (based on your goals) and makes
         | memories (and these are local).
         | 
         | Now you tell the agent you want to do this repeatedly, so it
         | will make a workflow for you based on these memories and
         | interactions (these workflows are saved on the server,
         | currently all public, but we are working out
         | permissions/group-based access).
         | 
         | The problem is that many times what the agent thinks is stable
         | really isn't, so there is a feedback loop for the agent to
         | test out the workflows and improve them. (It's basically
         | Claude Code/Codex sitting in the browser.)
         | 
         | Workflow details are appended to prompt based on user query
         | match/opened tabs match.
        
           | simpaticoder wrote:
           | Okay I read your post more carefully and it seems like you're
           | attempting to build one central script for a given URL.
           | Assuming one-shot script generation is unreliable and requires
           | iterative improvement, this makes sense. Of course I'm biased
           | in favor of local-first, privacy preserving and non-
           | distributed solutions if they exist, so I'd be curious to
           | know if/how you measured the reliability of one-shot script
           | generation for a basket of likely web apps.
        
             | artpar wrote:
             | One-shot is pretty much not going to work, either at the
             | single-step level or if you ask the LLM to generate a
             | workflow in one shot. We haven't measured it as such, but
             | even for static websites like the Hacker News front page it
             | takes a couple of rounds of back-and-forth for the LLM to
             | get it right. Somehow, after all the instructions, the LLM
             | will still "guess" the selector instead of checking the
             | page/DOM contents. And then there are a lot of other minor
             | details that need to be captured, like "you need to wait a
             | couple of seconds for the autocomplete results to show up".
             | If you tell it to just make a workflow, it will generate
             | some garbage and call it a day.
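
The back-and-forth described above (propose a workflow, check it against the real page, feed failures back) can be sketched as a small loop. All names are hypothetical: `Propose` stands in for an LLM call, `Exists` for a DOM check, and `refineWorkflow` is an illustration of the general technique, not Agent4's implementation.

```typescript
// One workflow step: which element to act on and how long to wait first.
interface Step {
  selector: string;
  waitMs: number;
}

// Stand-in for the LLM: given feedback from the last round, propose steps.
type Propose = (feedback: string[]) => Step[];

// Stand-in for a DOM check: does this selector match anything on the page?
type Exists = (selector: string) => boolean;

// Keep asking for a workflow until every selector is verified on the page,
// or give up after maxRounds. Returns null when no round fully validates.
function refineWorkflow(
  propose: Propose,
  exists: Exists,
  maxRounds: number
): { steps: Step[]; rounds: number } | null {
  let feedback: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const steps = propose(feedback);
    feedback = steps
      .filter((s) => !exists(s.selector))
      .map((s) => "selector not found on page: " + s.selector);
    if (feedback.length === 0) {
      return { steps, rounds: round };
    }
  }
  return null;
}
```

This only validates that selectors exist. Timing details such as the autocomplete wait would need a check that actually runs the step and observes the result.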
        
           | klntsky wrote:
           | Can you share what a page map looks like (on the data type
           | level)?
        
       | arkmm wrote:
       | Neat approach, but seems like the eventual goal of caching DOM
       | maps for all users would be a privacy nightmare?
        
         | artpar wrote:
         | Yes, I can imagine PI somehow being stored in the workflow. I
         | frequently see LLMs hardcoding tests just to make the user
         | happy, and this can also happen in the browser version: if
         | something is too hard to scrape but the agent is able to infer
         | it from a screenshot, it might end up making a workflow that
         | seems correct but is just hardcoded with data. We are thinking
         | of multiple guards/blocks to not let users create such a
         | workflow, but the risks that come with an open-ended agent are
         | still going to be present.
        
       | bogdanoff_2 wrote:
       | Asking here because it seems related: I'm trying to use cursor to
       | work on a webapp. It gets frustrating because vanilla Cursor is
       | "coding blind" and can't actually see the result of what it is
       | doing, and whether or not it works.
       | 
       | I ask it to fix something. It claims to know what the problem is,
       | changes the code, and then claims it's fixed. I open the app, and
       | it's still broken. I have to continuously, and way too often,
       | tell it what is broken.
       | 
       | Now, supposing I'm "vibe coding" and don't really care about the
       | obvious fact that the AI doesn't actually know what it is doing,
       | it's _still_ frustrating that I have to be in the loop just to
       | provide very basic information like that.
       | 
       | Are there any agentic coding setups that allow the agent to
       | interact with the app it's working on to check if it actually
       | works?
        
         | tomashubelbauer wrote:
         | Look into the Playwright MCP server, it allows coding agents to
         | scrutinize the results of their work in the web browser. There
         | is also an MCP server for the Chrome DevTools protocol AFAIK
         | but I haven't tried it.
        
           | artpar wrote:
           | I don't know if Playwright works without Chrome in debug mode,
           | but I tried the MCP for Chrome DevTools and it requires
           | Chrome to be started in debugging mode, and that basically
           | means you can't log into a lot of sites (especially Google)
           | since it will block you with an "Unsafe" message. Works
           | pretty well if you own the target website.
        
           | kevinsync wrote:
           | I was in the same boat on a side project (Electron, Claude
           | Code) -- I considered Playwright but ended up building a
           | simple, focused API instead that allows Claude to connect to
           | the app to inspect logs (main console + browser console),
           | query internal app data + state, and execute arbitrary JS.
           | 
           | It's sped up debugging a lot since I can just give it
           | instructions like "found a bug that does XYZ, I think it's a
           | problem with functionABC(); connect to app, click these four
           | buttons in this order, examine the internal state, then trace
           | through the code to figure out what's going wrong and present
           | a solution"
           | 
           | I was pretty resistant at first to delegating debugging
           | blindly like that, but it's made the workflow pretty smooth
           | to where I can occasionally just open the app, run through it
           | as a human user and take notes on bugs and flow issues that I
           | find, log them with steps to reproduce, then give Claude a
           | list of bugs to noodle on while I'm focusing on stuff LLMs
           | are terrible at (design, UI, frontend work, etc)
        
         | shardullavekar wrote:
         | A built-in MCP server that takes a look at what's broken and
         | communicates with Cursor is on our roadmap. Join Discord and we
         | will keep you posted there.
        
         | artpar wrote:
         | So actually I have this setup (a bridge server) which I use
         | for Agent4 itself (so Claude Code can talk to Agent4). It
         | makes a lot of sense to publish that bridge as well, in MCP
         | form.
        
         | JimDabell wrote:
         | You can use things like Browser Use and Playwright to hook
         | things like that up, but you're right, this is a very
         | underdeveloped area. Armin Ronacher has a talk that covers some
         | of this, such as unifying console.log, server logs, SQL, etc.
         | to feed back to the LLM.
         | 
         | https://www.youtube.com/watch?v=nfOVgz_omlU
        
           | wahnfrieden wrote:
           | This is the way. You can also feed screenshots back to it.
        
             | dcchuck wrote:
             | My experience with the off the shelf MCP tools is they
             | still fall short of giving Claude or Codex a screenshot.
             | 
             | I'm interested in whether or not others agree.
        
               | wahnfrieden wrote:
               | MCP or CLI (called by agent) can also be used to generate
               | the screenshot input loop
        
               | dcchuck wrote:
               | Yes, but image size restrictions bring more failure than
               | success in my experience.
               | 
               | Thank you for your comment.
        
         | xnx wrote:
         | Gemini CLI Chrome devtools MCP addresses this:
         | https://developer.chrome.com/blog/chrome-devtools-mcp
        
           | hatmanstack wrote:
           | Jump the line and just install it. who needs to read stuff.
           | https://github.com/ChromeDevTools/chrome-devtools-mcp?tab=re...
        
       | ripped_britches wrote:
       | "One persons map fixes everyone else's"
       | 
       | Hm somehow I feel like this is a giant step in the wrong
       | direction.
        
         | artpar wrote:
         | Worst case scenario we can just shut down sharing/public
         | workflows altogether, or do you have something else in mind ?
        
       | rco8786 wrote:
       | This tool seems relevant to my interests, but I gotta say I
       | cannot figure out how to use the extension.
       | 
       | It seems like I'm only able to use the pre-existing/canned
       | workflows that are provided under different "Persona"s? And
       | there's no way for me to just create a new workflow from scratch
       | for my specific use case.
       | 
       | Am I missing something obvious?
        
         | shardullavekar wrote:
         | We launched Agent4 recently. You can install it from here:
         | https://chromewebstore.google.com/detail/agent4/kipkglfnhnpb...
         | 
         | The one you refer to will be taken down soon. Ping me on
         | Discord if you need help trying it.
        
           | rco8786 wrote:
           | Thanks! I installed the new one but am still unable to figure
           | out how to create my own workflow. I see that there are a
           | bunch of them on the left panel, but there doesn't seem to be
           | a way for me to create one? When I use the chatbot it seems
           | like it just tries to use the LLM to do whatever task I asked
           | it to do, but again, I can't seem to save it or modify
           | specific steps.
           | 
           | Basically, how do I use this self-healing DOM that the
           | article is all about?
           | 
           | Related - the new extension only works if I allow it to be my
           | new tab default page? That's pretty intrusive, if I'm honest.
        
             | artpar wrote:
             | Ask the agent to make workflows for you.
             | 
             | You can use floating window or sidepanel also for the
             | agent. New tab is just for convenience.
        
               | kratoskr221 wrote:
               | I click on "Suggest Workflow" -> record -> "Submit
               | Workflow", but now I can't find my workflow or any script
               | I can copy. It's quite non-intuitive to use the product.
               | Could you point me to where I can play my recorded
               | workflow or edit it?
        
               | rco8786 wrote:
               | I went through the same thing. I'm not sure what it means
               | to "suggest" a workflow. I tried both of the extensions
               | and just couldn't figure out how to build a workflow for
               | myself, and couldn't find anything like what this blog
               | post is talking about :-/
        
               | artpar wrote:
               | I am sorry, I think you might have installed our workflow
               | plugin (closed-ended agent) instead of Agent4
               | (open-ended). It's a new experiment we are working on. We
               | are working on the website messaging.
               | 
               | Agent4 -
               | https://chromewebstore.google.com/detail/agent4/kipkglfnhnpb...
               | 
               | Old workflow plugin -
               | https://chromewebstore.google.com/detail/100xbot/dhcenlmiiom...
        
       | neuroelectron wrote:
       | DOM has clearly gotten to the point where it's no longer
       | maintainable or a net benefit to the web.
        
         | artpar wrote:
         | Yeah like so many legacy things, unfortunately they are not
         | going away that fast. People are still clicking on these
         | tedious interfaces day in day out to get all the "smaller"
         | stuff running. Even if every one agreed on the "One Best UI",
         | it would take decades to convert all the existing ones before
         | breaking a lot of flows.
        
       | jjangkke wrote:
       | is there an open source version of this in github? i think i've
       | seen something similar.
       | 
       | one off putting thing about installing the extension is all the
       | reviewers seem to be Indian and I've seen similar patterns across
       | Google Reviews where there is a flood of reviews from Indian
       | users and they are almost always fraud or some weird scam
       | 
       | not saying this is the case here but whenever I see a bunch of
       | reviews from Indian names, it automatically makes me trust
       | whatever service or product less just fyi.
        
         | dgfitz wrote:
         | At the morning standup:"OK team I need all of you to post
         | positive reviews and use your network to accomplish that as you
         | see fit."
        
       | jadbox wrote:
       | Is Agent4 open-source? I'm only installing OSS browser extensions
       | for some level of verification.
        
         | artpar wrote:
         | No, it is not open-source. But it is not obfuscated either, so
         | you can always look into the code by downloading the plugin
         | from the Chrome Web Store if you are into that kind of
         | verification (and these days LLMs can help with that a lot).
        
       | klntsky wrote:
       | I vibed something like this for markdown extraction just a week
       | ago: https://github.com/promptware/readweb
       | 
       | Opensourced it just now.
       | 
       | More specifically, it works like this:
       | 
       |     suggestPreset: HTML -> Preset (via LLM)
       |     applyPreset: HTML + Preset -> Markdown (programmatically)
       | 
       | Where preset is:
       | 
       |     type Preset = {
       |       // anchors to make this preset more fragile on purpose.
       |       // Elements that identify website engine layout go here.
       |       preset_match_detectors: CSSSelector[];
       |       // main content extractors
       |       main_content_selectors: CSSSelector[];
       |       // filter selectors to trim the main content.
       |       // banners, subscription forms, sponsor content, etc.
       |       main_content_filters: CSSSelector[];
       |     };
       | 
       | suggestPreset uses a feedback loop that enhances + applies preset
       | until the markdown is really clean
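
To make the `applyPreset` half concrete, here is a toy sketch over a simplified document model. The `Preset` fields are taken from the comment above; everything else is an assumption. `Block` is a hypothetical stand-in for parsed HTML (each block lists the selectors it matches), the "all detectors must match" rule is a guess at what "fragile on purpose" means, and the output is plain text rather than real Markdown. It is not the readweb implementation.

```typescript
type CSSSelector = string;

type Preset = {
  preset_match_detectors: CSSSelector[];
  main_content_selectors: CSSSelector[];
  main_content_filters: CSSSelector[];
};

// Toy stand-in for a parsed page: each block records which selectors hit it.
interface Block {
  matches: CSSSelector[];
  text: string;
}

const hits = (block: Block, selectors: CSSSelector[]): boolean =>
  selectors.some((s) => block.matches.indexOf(s) !== -1);

// Returns null when the page does not look like the layout the preset was
// built for (assumed rule: every detector must match somewhere on the page).
function applyPreset(blocks: Block[], preset: Preset): string | null {
  const layoutMatches = preset.preset_match_detectors.every((d) =>
    blocks.some((b) => b.matches.indexOf(d) !== -1)
  );
  if (!layoutMatches) return null;
  return blocks
    .filter((b) => hits(b, preset.main_content_selectors))
    .filter((b) => !hits(b, preset.main_content_filters))
    .map((b) => b.text)
    .join("\n\n");
}
```

Returning `null` on a layout mismatch is what would let a caller fall back to `suggestPreset` and regenerate the preset.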
        
       ___________________________________________________________________
       (page generated 2025-10-16 23:00 UTC)