[HN Gopher] A stateful browser agent using self-healing DOM maps
___________________________________________________________________
A stateful browser agent using self-healing DOM maps
Author : shardullavekar
Score : 100 points
Date : 2025-10-16 12:21 UTC (10 hours ago)
(HTM) web link (100x.bot)
| brianjking wrote:
| Is this able to load for anyone?
| shardullavekar wrote:
| It's a Chrome extension. Works if you use Chrome.
| brianjking wrote:
| I couldn't load the article. I was getting an nginx error
| initially. I'm able to view it now. I think they were getting a
| bit squeezed.
| memet_rush wrote:
| they didn't use the agent to self-heal
| phgn wrote:
| Nope. Their entire website shows up with a white screen for me
| in the latest Chrome.
|
| There's this error in the console: Failed to load module
| script: Expected a JavaScript-or-Wasm module script but the
| server responded with a MIME type of "text/html". Strict MIME
| type checking is enforced for module scripts per HTML spec.
| philo23 wrote:
| Maybe this is a lack of understanding on my part, but this bit of
| the explanation sets off alarm bells for me:
|
| > Under the hood, we're building a client-sourced RAG for the
| DOM. An agent's first move on a page is to check a vector DB for
| a known "map." ... This creates a wild side-effect: the system is
| self-healing for everyone. One person's failed automation
| accidentally fixes it for the next hundred users.
|
| I think I'd like to know exactly what kind of data is extracted
| from the DOM to build that shared map.
| artpar wrote:
| Agent4 is going to store "stable selectors" that worked (when
| it performs a task for the first time, most of the time is
| spent identifying these CSS/XPath selectors). Memories are
| pretty straightforward at this point: they are stored locally
| in your browser's IndexedDB (you can inspect them from the
| Chrome inspector).
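|
| A minimal sketch of the "remember selectors that worked" idea,
| not the extension's actual code. SelectorMemory, resolve, and the
| ranking rule are hypothetical names and choices, and an in-memory
| Map stands in for IndexedDB so the example runs outside a browser:

```typescript
// Hypothetical sketch: a memory of selectors that worked before.
// A Map stands in for the browser's IndexedDB.

type SelectorRecord = { selector: string; successes: number; failures: number };

class SelectorMemory {
  private store = new Map<string, SelectorRecord[]>();

  // Key memories by page origin plus a short description of the target.
  private key(origin: string, target: string): string {
    return `${origin}::${target}`;
  }

  // Record whether a selector worked for this target on this origin.
  record(origin: string, target: string, selector: string, ok: boolean): void {
    const k = this.key(origin, target);
    const records = this.store.get(k) ?? [];
    let rec = records.find((r) => r.selector === selector);
    if (!rec) {
      rec = { selector, successes: 0, failures: 0 };
      records.push(rec);
    }
    if (ok) rec.successes += 1;
    else rec.failures += 1;
    this.store.set(k, records);
  }

  // Return known selectors, ranked by successes minus failures.
  candidates(origin: string, target: string): string[] {
    const records = this.store.get(this.key(origin, target)) ?? [];
    return [...records]
      .sort((a, b) => (b.successes - b.failures) - (a.successes - a.failures))
      .map((r) => r.selector);
  }
}

// Try remembered selectors in ranked order and record each outcome.
// `matches` is supplied by the caller and reports whether a selector
// still finds its element on the current page.
function resolve(
  memory: SelectorMemory,
  origin: string,
  target: string,
  matches: (selector: string) => boolean,
): string | null {
  for (const selector of memory.candidates(origin, target)) {
    const ok = matches(selector);
    memory.record(origin, target, selector, ok);
    if (ok) return selector;
  }
  return null;
}
```

| In a browser, the matches callback could wrap
| document.querySelector. A selector that stops matching sinks in
| the ranking, so the next run tries the surviving one first.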
| philo23 wrote:
| Good to hear, that's what I was hoping that it was doing.
| erichocean wrote:
| How are you mapping from "click this element" (presumably
| obtained via a VLM) to the actual DOM locator that refers to
| it?
|
| I guess Playwright can do it in "record" mode; I'm curious
| how you do it from a Chrome extension.
|
| Spitballing here, you inject an event filter on the page and
| when the click happens, grab the element and run some code to
| synthesize a selector that just refers to that element?
| (Presumably you could just reuse Playwright's element-to-
| locator code at this point.)
| artpar wrote:
| So when you go into the "selector" mode, the plugin will
| add event listeners to all the DOM nodes. Based on your
| click, it will first try to generate a bunch of selectors
| statically (multiple, CSS and XPath based), and then,
| based on your guidance, it's the job of agent4 to make
| stable selectors.
| cjr wrote:
| document.elementFromPoint to get the element at the
| coordinates, then use an npm package similar to optimal-select
| to come up with a unique CSS selector.
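|
| A rough sketch of the selector-synthesis step, written by hand
| rather than with optimal-select, whose algorithm is more
| thorough. ElementLike, nthOfType, and cssPath are hypothetical
| names, and the sketch assumes ids are unique on the page:

```typescript
// Minimal structural view of a DOM element, so the sketch also runs
// outside a browser. A real `Element` satisfies this shape.
interface ElementLike {
  tagName: string;
  id: string;
  parentElement: ElementLike | null;
  children: ArrayLike<ElementLike>;
}

// 1-based position of `el` among siblings that share its tag name.
function nthOfType(el: ElementLike): number {
  const parent = el.parentElement;
  if (!parent) return 1;
  let n = 0;
  for (let i = 0; i < parent.children.length; i++) {
    const sibling = parent.children[i];
    if (sibling.tagName === el.tagName) n += 1;
    if (sibling === el) return n;
  }
  return 1;
}

// Walk up from `el`, stopping at the first ancestor with a simple id.
// Ids that would need CSS escaping are skipped instead of escaped.
function cssPath(el: ElementLike): string {
  const parts: string[] = [];
  let node: ElementLike | null = el;
  while (node) {
    if (/^[A-Za-z][\w-]*$/.test(node.id)) {
      parts.unshift(`#${node.id}`);
      break;
    }
    const tag = node.tagName.toLowerCase();
    parts.unshift(
      node.parentElement ? `${tag}:nth-of-type(${nthOfType(node)})` : tag,
    );
    node = node.parentElement;
  }
  return parts.join(" > ");
}
```

| In a page this could be called on the result of
| document.elementFromPoint(x, y) after a null check. Positional
| paths like this break as soon as the layout shifts, which is
| why they are only a starting point for a "stable" selector.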
| tnolet wrote:
| This is, as far as I understand, self healing ONLY if the name of
| a CSS class changes. Not for anything else. That seems like a
| very very very very narrow definition of "self healing": there
| are 9999 other subtle or not so subtle things that can change per
| session or per update version of a page.
|
| If you run this against let's say a typical e-commerce page where
| the navigation and all screen elements are super dynamic -- user
| specific data, language etc. -- this problem becomes even
| harder.
| artpar wrote:
| Everyone thinks of typical e-commerce pages when it comes to
| "browser agent doing something", but our real use cases are far
| from shopping for the user. But your point still stands.
| The idea is that there may be websites where generating
| stable selectors/hierarchy maps wouldn't solve the problem, but
| 80% (from 80-20) of websites are not like that (including a lot
| of internal dashboards/interfaces). (There will also be issues
| for websites with proper i18n implementations if the selectors
| are aria-label based.)
|
| Self-healing CSS selectors are also only one part of the story.
| The other part is the cohesive interface for the agent itself
| to use these selectors.
| miguelspizza wrote:
| > The other part is the cohesive interface for the agent
| itself to use these selectors
|
| We are incubating this over at the WebMCP web standard
| proposal. You can see the current draft of explainer for the
| declarative API.
| https://github.com/webmachinelearning/webmcp/pull/26
|
| Also, great work on the browser agent, this is the best of
| the DOM parsing/screenshot agents I've used. I was really
| impressed with the wordle example
| ljm wrote:
| My running hypothesis on this is that AI is a sentient
| screenreader and the last thing you should be worrying about is
| CSS class names, IDs, data-testid attributes, DOM traversal,
| and all of these things that are essentially querying the
| 'internal state' of a page. Classes, IDs, data attributes, etc.
| aren't a public API and semantic elements, ARIA attributes,
| etc. _are_.
|
| So, focus on WCAG compliance, following the spec as faithfully
| as you can. The style or presentation of something may change
| as part of a simple A/B test but the underlying purpose or
| behaviour would remain the same.
| runlaszlorun wrote:
| This might be the last of the weed talking but that's rich.
| I'm gonna have to chew on that.
|
| And maybe even crack the WCAG docs. Wait... It's a trap...
| pverheggen wrote:
| I feel like this could work if the selectors are chosen
| carefully to capture semantic meaning, rather than basing off
| of something arbitrary like a class name. The agent must have
| some understanding of the document to be able to perform those
| actions in the first place.
|
| If it can find an ellipse tool, it's likely based off some
| combination of accessible role, accessible name, and inner text
| (perhaps the icon if it's multi-modal.) So in theory, couldn't
| it capture that criteria in a JS snippet and replay it?
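|
| A small sketch of what "capture the criteria and replay it"
| could look like. NodeInfo, Criteria, and findBySemantics are
| hypothetical; a real agent would read role and name from the
| accessibility tree rather than from plain objects:

```typescript
// Hypothetical, simplified view of what an agent knows about a node.
interface NodeInfo {
  role: string; // accessible role, e.g. "button"
  name: string; // accessible name, e.g. from aria-label
  text: string; // visible inner text
}

// Saved criteria: the role must match exactly; name and text are
// optional patterns. Avoid the `g` flag, since RegExp.test is stateful
// with it.
interface Criteria {
  role: string;
  name?: RegExp;
  text?: RegExp;
}

// Return every node that satisfies all of the given criteria.
function findBySemantics<T extends NodeInfo>(nodes: T[], c: Criteria): T[] {
  return nodes.filter(
    (n) =>
      n.role === c.role &&
      (c.name === undefined || c.name.test(n.name)) &&
      (c.text === undefined || c.text.test(n.text)),
  );
}

// Example: the saved criteria for "the ellipse tool".
const ellipseTool: Criteria = { role: "button", name: /ellipse/i };
```

| Playwright exposes the same idea as page.getByRole("button",
| { name: /ellipse/i }). Criteria like these survive a class
| rename, but not a relabel or a translated UI.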
| artpar wrote:
| That's exactly what it is doing. The workflows are pretty
| much JS snippets themselves, which you can see in the "code"
| tab (in the plugin when you select a saved workflow).
| simpaticoder wrote:
| Couldn't you solve this by having the agent do a first pass
| through a page and generate a (java)script that interacts with
| the interesting parts of the page, and then prepend the script
| (if it's short enough) or a list of entry points (if it's not) to
| the prompt such that subsequent interactions invoke the script
| rather than interact directly with the page?
| artpar wrote:
| If I am reading you correctly, you captured the whole essence
| of agent4.
|
| So it does the first pass (based on your goals) and makes
| memories (and these are local).
|
| Now you tell the agent you want to do this repeatedly, so it
| will make a workflow for you based on these memories and
| interactions (these workflows are saved on the server and are
| all public for now, but we are working out permissions/group
| based access).
|
| The problem is that many times what the agent thinks is stable
| really isn't, so there is a feedback loop for the agent to test
| out the workflows and improve them. (It's basically Claude
| Code/Codex sitting in the browser.)
|
| Workflow details are appended to prompt based on user query
| match/opened tabs match.
| simpaticoder wrote:
| Okay I read your post more carefully and it seems like you're
| attempting to build one central script for a given URL.
| Assuming one-shot script generation is unreliable and requires
| iterative improvement this makes sense. Of course I'm biased
| in favor of local-first, privacy preserving and non-
| distributed solutions if they exist, so I'd be curious to
| know if/how you measured the reliability of one-shot script
| generation for a basket of likely web apps.
| artpar wrote:
| One shot is pretty much not going to work, either at the
| single step level or if you ask the LLM to generate a
| workflow in one shot. We haven't measured it as such, but
| even for static websites like the Hacker News front page it
| takes a couple of rounds of back and forth for the LLM to get
| it right. Somehow, after all the instructions, the LLM will
| still "guess" the selector instead of checking the page/DOM
| contents. And then there are a lot of other minor details
| that need to be captured, like "you need to wait a couple of
| seconds for the autocomplete results to show up". If you
| tell it to just make a workflow, it will generate some
| garbage and call it a day.
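|
| The "wait for the autocomplete results" detail is the kind of
| thing a workflow step has to encode explicitly. A minimal
| polling sketch, assuming a hypothetical waitFor helper and
| default timings picked only for illustration:

```typescript
// Resolve once `check` returns a non-null value; reject if that has
// not happened by the time `timeoutMs` has elapsed.
async function waitFor<T>(
  check: () => T | null,
  timeoutMs = 3000,
  intervalMs = 100,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = check();
    if (value !== null) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: timed out after ${timeoutMs} ms`);
    }
    // Sleep before polling again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

| In a page, check could be () =>
| document.querySelector("[role=listbox] [role=option]"), where
| that selector is itself just an example. A MutationObserver
| avoids polling, at the cost of a bit more code.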
| klntsky wrote:
| Can you share what a page map looks like (on the data type
| level)?
| arkmm wrote:
| Neat approach, but seems like the eventual goal of caching DOM
| maps for all users would be a privacy nightmare?
| artpar wrote:
| Yes, I can imagine PI somehow being stored in the workflow. I
| frequently see LLMs hardcoding tests just to make the user
| happy, and this can also happen in the browser version: if
| something is too hard to scrape but the agent is able to infer
| it from a screenshot, it might end up making a workflow that
| seems correct but is just hardcoded with data. We are thinking
| of multiple guards/blocks to not let the user create such a
| workflow, but the risks that come with an open-ended agent are
| still going to be present.
| bogdanoff_2 wrote:
| Asking here because it seems related: I'm trying to use Cursor to
| work on a webapp. It gets frustrating because vanilla Cursor is
| "coding blind" and can't actually see the result of what it is
| doing, and whether or not it works.
|
| I ask it to fix something. It claims to know what the problem is,
| changes the code, and then claims it's fixed. I open the app, and
| it's still broken. I have to repeatedly, and way too often, tell
| it what is broken.
|
| Now, supposing I'm "vibe coding" and don't really care about the
| obvious fact that the AI doesn't actually know what it is doing,
| it's _still_ frustrating that I have to be in the loop just to
| provide very basic information like that.
|
| Are there any agentic coding setups that allow the agent to
| interact with the app it's working on to check if it actually
| works?
| tomashubelbauer wrote:
| Look into the Playwright MCP server, it allows coding agents to
| scrutinize the results of their work in the web browser. There
| is also an MCP server for the Chrome DevTools protocol AFAIK
| but I haven't tried it.
| artpar wrote:
| I don't know if Playwright works without Chrome in debug mode,
| but I tried the MCP for Chrome DevTools and it requires
| Chrome to be started in debugging mode, and that basically
| means you can't log into a lot of sites (especially Google)
| since it will block you with an "Unsafe" message. Works
| pretty well if you own the target website.
| kevinsync wrote:
| I was in the same boat on a side project (Electron, Claude
| Code) -- I considered Playwright but ended up building a
| simple, focused API instead that allows Claude to connect to
| the app to inspect logs (main console + browser console),
| query internal app data + state, and execute arbitrary JS.
|
| It's sped up debugging a lot since I can just give it
| instructions like "found a bug that does XYZ, I think it's a
| problem with functionABC(); connect to app, click these four
| buttons in this order, examine the internal state, then trace
| through the code to figure out what's going wrong and present
| a solution"
|
| I was pretty resistant at first to delegating debugging
| blindly like that, but it's made the workflow pretty smooth
| to where I can occasionally just open the app, run through it
| as a human user and take notes on bugs and flow issues that I
| find, log them with steps to reproduce, then give Claude a
| list of bugs to noodle on while I'm focusing on stuff LLMs
| are terrible at (design, UI, frontend work, etc)
| shardullavekar wrote:
| A built-in MCP server that takes a look at what's broken and
| communicates with Cursor is on our roadmap. Join Discord and we
| will keep you posted there.
| artpar wrote:
| So actually I have this setup (a bridge server) which I use
| for agent4 itself (so Claude Code can talk to agent4). It makes
| a lot of sense to publish that bridge in MCP form as well.
| JimDabell wrote:
| You can use things like Browser Use and Playwright to hook
| things like that up, but you're right, this is a very
| underdeveloped area. Armin Ronacher has a talk that covers some
| of this, such as unifying console.log, server logs, SQL, etc.
| to feed back to the LLM.
|
| https://www.youtube.com/watch?v=nfOVgz_omlU
| wahnfrieden wrote:
| This is the way. You can also feed screenshots back to it.
| dcchuck wrote:
| My experience with the off the shelf MCP tools is they
| still fall short of giving Claude or Codex a screenshot.
|
| I'm interested in whether or not others agree.
| wahnfrieden wrote:
| MCP or CLI (called by agent) can also be used to generate
| the screenshot input loop
| dcchuck wrote:
| Yes, but image size restrictions bring more failure than
| success in my experience.
|
| Thank you for your comment.
| xnx wrote:
| Gemini CLI Chrome devtools MCP addresses this:
| https://developer.chrome.com/blog/chrome-devtools-mcp
| hatmanstack wrote:
| Jump the line and just install it. Who needs to read stuff.
| https://github.com/ChromeDevTools/chrome-devtools-mcp?tab=re...
| ripped_britches wrote:
| "One persons map fixes everyone else's"
|
| Hm somehow I feel like this is a giant step in the wrong
| direction.
| artpar wrote:
| Worst case scenario we can just shut down sharing/public
| workflows altogether, or do you have something else in mind ?
| rco8786 wrote:
| This tool seems relevant to my interests, but I gotta say I
| cannot figure out how to use the extension.
|
| It seems like I'm only able to use the pre-existing/canned
| workflows that are provided under different "Persona"s? And
| there's no way for me to just create a new workflow from scratch
| for my specific use case.
|
| Am I missing something obvious?
| shardullavekar wrote:
| We launched Agent4 recently. You can install it from here:
| https://chromewebstore.google.com/detail/agent4/kipkglfnhnpb...
|
| The one you refer to will be taken down soon. Ping me on Discord
| if you need help trying it.
| rco8786 wrote:
| Thanks! I installed the new one but am still unable to figure
| out how to create my own workflow. I see that there are a bunch
| of them on the left panel, but there doesn't seem to be a way
| for me to create one? When I use the chatbot it seems like it
| just tries to use the LLM to do whatever task I asked it to
| do but again, can't seem to save it or modify specific steps.
|
| Basically, how do I use this self-healing DOM that the
| article is all about?
|
| Related - the new extension only works if I allow it to be my
| new tab default page? That's pretty intrusive, if I'm honest.
| artpar wrote:
| Ask the agent to make workflows for you.
|
| You can also use the floating window or side panel for the
| agent. New tab is just for convenience.
| kratoskr221 wrote:
| I click on "Suggest Workflow" -> record -> "Submit
| Workflow", but now I can't find my workflow or any script I
| can copy. It's quite non-intuitive to use the product.
| Could you point me to where I can play my recorded workflow
| or edit it?
| rco8786 wrote:
| I went through the same thing. I'm not sure what it means
| to "suggest" a workflow. I tried both of the extensions
| and just couldn't figure out how to build a workflow for
| myself, and couldn't find anything like what this blog
| post is talking about :-/
| artpar wrote:
| I am sorry, I think you might have installed our workflow
| plugin (closed-ended agent) instead of agent4 (open-ended).
| It's a new experiment we are working on. We are working on
| the website messaging.
|
| Agent4 -
| https://chromewebstore.google.com/detail/agent4/kipkglfnhnpb...
|
| Old workflow plugin -
| https://chromewebstore.google.com/detail/100xbot/dhcenlmiiom...
| neuroelectron wrote:
| DOM has clearly gotten to the point where it's no longer
| maintainable or a net benefit to the web.
| artpar wrote:
| Yeah like so many legacy things, unfortunately they are not
| going away that fast. People are still clicking on these
| tedious interfaces day in day out to get all the "smaller"
| stuff running. Even if everyone agreed on the "One Best UI",
| it would take decades to convert all the existing ones before
| breaking a lot of flows.
| jjangkke wrote:
| is there an open source version of this in github? i think i've
| seen something similar.
|
| one off putting thing about installing the extension is all the
| reviewers seem to be Indian and I've seen similar patterns across
| Google Reviews where there is a flood of reviews from Indian
| users and they are almost always fraud or some weird scam
|
| not saying this is the case here but whenever I see a bunch of
| reviews from Indian names, it automatically makes me trust
| whatever service or product less just fyi.
| dgfitz wrote:
| At the morning standup: "OK team I need all of you to post
| positive reviews and use your network to accomplish that as you
| see fit."
| jadbox wrote:
| Is Agent4 open-source? I'm only installing OSS browser extensions
| for some level of verification.
| artpar wrote:
| No, it is not open-source. But it is not obfuscated either, so
| you can always look into the code by downloading the plugin
| from the Chrome Web Store (and these days LLMs can help with
| that a lot) if you are into that kind of verification.
| klntsky wrote:
| I vibed something like this for markdown extraction just a week
| ago: https://github.com/promptware/readweb
|
| Opensourced it just now.
|
| More specifically, it works like this:
|
|     suggestPreset: HTML -> Preset (via LLM)
|     applyPreset: HTML + Preset -> Markdown (programmatically)
|
| Where preset is:
|
|     type Preset = {
|       // anchors to make this preset more fragile on purpose.
|       // Elements that identify website engine layout go here.
|       preset_match_detectors: CSSSelector[];
|       // main content extractors
|       main_content_selectors: CSSSelector[];
|       // filter selectors to trim the main content.
|       // banners, subscription forms, sponsor content, etc.
|       main_content_filters: CSSSelector[];
|     };
|
| suggestPreset uses a feedback loop that enhances + applies preset
| until the markdown is really clean
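|
| The shape of such a feedback loop can be sketched generically.
| This is not readweb's code: refine and its three callbacks are
| hypothetical stand-ins (propose for the LLM call, apply for the
| programmatic extraction, score for a cleanliness check), and
| the loop is synchronous only to keep the example short:

```typescript
// Generic propose -> apply -> score loop with a round limit.
// Returns null if no proposal reaches the threshold in time.
function refine<P, O>(
  propose: (previous: P | null, lastOutput: O | null) => P,
  apply: (preset: P) => O,
  score: (output: O) => number,
  threshold: number,
  maxRounds = 5,
): { preset: P; output: O; rounds: number } | null {
  let preset: P | null = null;
  let output: O | null = null;
  for (let round = 1; round <= maxRounds; round++) {
    // Each proposal sees the previous preset and what it produced.
    preset = propose(preset, output);
    output = apply(preset);
    if (score(output) >= threshold) {
      return { preset, output, rounds: round };
    }
  }
  return null;
}
```

| The round limit matters: without it, a page that never yields
| clean output would keep the loop (and the LLM bill) running.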
___________________________________________________________________
(page generated 2025-10-16 23:00 UTC)