[HN Gopher] Show HN: Tarsier - Vision utilities for web interact...
       ___________________________________________________________________
        
       Show HN: Tarsier - Vision utilities for web interaction agents
        
       Hey HN! I built a tool that gives LLMs the ability to understand
       the visual structure of a webpage even if they don't accept image
       input. We've found that unimodal GPT-4 + Tarsier's textual webpage
       representation consistently beats multimodal GPT-4V/4o + webpage
       screenshot by 10-20%, probably because multimodal LLMs still aren't
       as performant as they're hyped to be.  Over the course of
       experimenting with pruned HTML, accessibility trees, and other
       perception systems for web agents, we've iterated on Tarsier's
       components to maximize downstream agent/codegen performance.
       Here's the Tarsier pipeline in a nutshell:  1. tag interactable
       elements with IDs for the LLM to act upon & grab a full-sized
       webpage screenshot  2. for text-only LLMs, run OCR on the
       screenshot & convert it to whitespace-structured text (this is the
       coolest part imo)  3. map LLM intents back to actions on elements
       in the browser via an ID-to-XPath dict  Humans interact with the
       web through visually-rendered pages, and agents should too. We run
       Tarsier in production for thousands of web data extraction agents a
       day at Reworkd (https://reworkd.ai).  By the way, we're hiring
       backend/infra engineers with experience in compute-intensive
       distributed systems!  https://reworkd.ai/careers
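         
        To make the three steps above concrete, here is a minimal sketch
        of driving the pipeline with Playwright. Treat the exact class
        and method names (GoogleVisionOCRService, Tarsier.page_to_text)
        and the credentials handling as illustrative rather than
        canonical, and the "element 23" intent as a stand-in for whatever
        your LLM actually returns.
         
            import asyncio
            import json
            
            from playwright.async_api import async_playwright
            from tarsier import Tarsier, GoogleVisionOCRService
            
            async def main() -> None:
                # assumption: Google Vision credentials stored locally
                with open("google_vision_credentials.json") as f:
                    credentials = json.load(f)
                tarsier = Tarsier(GoogleVisionOCRService(credentials))
            
                async with async_playwright() as p:
                    browser = await p.chromium.launch(headless=True)
                    page = await browser.new_page()
                    await page.goto("https://news.ycombinator.com")
            
                    # steps 1 & 2: tag interactable elements, screenshot,
                    # OCR, and lay the text out with whitespace so the
                    # LLM sees the page's visual structure
                    page_text, tag_to_xpath = await tarsier.page_to_text(
                        page, tag_text_elements=True
                    )
            
                    # ...send page_text to your LLM; suppose it answers
                    # with an intent like "click element 23"...
                    element_id = 23  # hypothetical LLM output
            
                    # step 3: map the ID back to a real element via the
                    # ID-to-XPath dict and act on it in the browser
                    await page.click(f"xpath={tag_to_xpath[element_id]}")
            
                    await browser.close()
            
            asyncio.run(main())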
        
       Author : KhoomeiK
       Score  : 126 points
       Date   : 2024-05-15 16:46 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pk19238 wrote:
       | this is such a creative solution. reminds me of how a team
       | rendered wolfenstein into ASCII characters and fine tuned mistral
       | to successfully play it.
        
         | KhoomeiK wrote:
         | Thanks! Yeah, it seems like a lot can be done with just text
         | while we wait for multimodal models to catch up. The recent
         | Platonic Representation Hypothesis [1] also suggests that
         | different models, regardless of modality, build the same
         | internal representations of the world.
         | 
         | [1] https://arxiv.org/abs/2405.07987
        
       | dbish wrote:
        | Very cool. We do something similar by combining OCR with
        | accessibility data and other signals (speech recognition et
        | al.) for desktop-based screen-sharing understanding, but
        | evaluation compared to multi-modal LLMs has not been easy. How
        | are you evaluating to come up with this number, "consistently
        | beats multimodal GPT-4V/4o + webpage screenshot by 10-20%"?
        | 
        | fwiw, so far we've seen that Azure has the best OCR for
        | screenshot-type data across the proprietary and open source
        | models, though we are far more focused on grabbing data from
        | desktop-based applications than web pages, so ymmv
        
         | KhoomeiK wrote:
         | Yup, evals can definitely be tough. We basically have a suite
         | of several hundred web data extraction evals in a tool we built
         | called Bananalyzer [1]. It's made it pretty straightforward for
         | us to benchmark how accurately our agent generates code when it
          | uses Tarsier-text (+ GPT-4) for perception vs. Tarsier-
          | screenshot (+ GPT-4V/o).
         | 
         | Will have to look into supporting Azure OCR in Tarsier then--
         | thanks for the tip!
         | 
         | [1] https://github.com/reworkd/bananalyzer
        
           | timabdulla wrote:
           | Neat. Do you have the Bananalyzer eval results for Tarsier
           | published somewhere?
        
             | KhoomeiK wrote:
             | We're hoping to release an evals paper about Bananalyzer
             | this summer and compare Tarsier to a variety of other
             | perception systems in it. The hard part with evaluating a
             | perception/context system though is that it's very
             | intertwined with the agent's architecture, and that's not
             | something we're comfortable fully open-sourcing yet. We'll
              | have to think of interesting ways to decouple perception
              | systems from the agent and eval them with Bananalyzer.
        
           | dbish wrote:
           | Awesome, will take a look at this. thank you
        
         | SomaticPirate wrote:
          | Surprised to hear Azure beats AWS Textract. I found Textract
          | to be the best OCR offering, but that was when I was doing
          | documents.
        
           | navanchauhan wrote:
           | In my experience Azure is probably the best OCR offering
           | right now. They are also the only ones to be able to
           | recognize my terrible handwriting.
        
           | dbish wrote:
           | Yes, Textract does not work as well for desktop screenshots
           | from our testing
        
       | davedx wrote:
       | How do you make sure the tagging of elements is robust? With
       | regular browser automation it's quite hard to write selectors
       | that will keep working after webpages get updated; often when
        | writing E2E tests, teams end up putting [data] attributes into
       | the elements to aid with selection. Using a numerical identifier
       | seems quite fragile.
        
         | KhoomeiK wrote:
         | Totally agreed--this is a design choice that basically comes
         | from our agent architecture, and the codegen-based architecture
         | that we think will likely proliferate for web agent tasks in
         | the future. We provide Tarsier's text/screenshot to an LLM and
         | have it write code with generically written selectors rather
         | than the naive selectors that Tarsier assigns to each element.
         | 
         | It's sort of like when you (as a human) write a web scraper and
         | visually click on individual elements to look at the
         | surrounding HTML structure / their selectors, but then end up
         | writing code with more general selectors--not copypasting the
         | selectors of the elements you clicked.
        
           | davedx wrote:
           | Ooh that's a very neat approach, great idea! Chains of
           | thought across abstraction layers. Definitely worth a blog
           | post I reckon.
           | 
           | Good luck!
        
             | KhoomeiK wrote:
             | Thanks! We might put out a paper about it with some
             | Carnegie Mellon collaborators this summer.
        
         | ghxst wrote:
          | Great question. Also, situations where you have multiple
          | CTAs with similar names/contexts on a page are still
          | something I see LLM-based automation struggle with.
        
           | KhoomeiK wrote:
           | Hm, not sure I follow why those situations would be
           | especially difficult? Regarding website changes, the nice
           | thing about using LLMs is that we can simply provide the
           | previous scraper as context and have it regenerate the
           | scraper to "self-heal" when significant website changes are
           | detected.
        
       | bckmn wrote:
       | Reminds me of [Language as Intermediate
       | Representation](https://chrisvoncsefalvay.com/posts/lair/) - LLMs
       | are optimized for language, so translate an image into language
       | and they'll do better at modeling it.
        
         | KhoomeiK wrote:
         | Cool connection, hadn't seen this before but feels intuitively
         | correct! I also formulate similar (but a bit more out-there)
         | philosophical thoughts on word-meaning as being described by
         | the topological structure of its corresponding images in
         | embedding space, in Section 5.3 of my undergrad thesis [1].
         | 
         | [1] https://arxiv.org/abs/2305.16328
        
       | shekhar101 wrote:
        | Tangential - I just want a decent (financial transaction)
        | table-to-text conversion that can retain the table structure
        | well enough (e.g. merged cells), and I have tried everything
        | under the sun short of fine-tuning my own model, including all
        | the multimodal LLMs. None of them work very well without a lot
        | of prompt engineering on a case-by-case basis. Can this help?
        | How can I set it up with a large number of PDFs that are
        | sorted by type and extract tabular information? Any other
        | suggestions?
        
         | KhoomeiK wrote:
         | That's an interesting problem--Tarsier probably isn't the best
         | solution here since it's focused on webpage perception rather
         | than any kind of OCR. But one could try adapting the
         | `format_text` function in tarsier/text_format.py to convert any
         | set of OCR annotations to a whitespace-structured string.
         | Curious to see if that works.
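          | 
          | The core trick is simple enough to sketch: take OCR
          | annotations (word + bounding box) and drop them onto a
          | character grid scaled from their page coordinates, so the
          | resulting string keeps the rough visual layout. This toy
          | version is only meant to illustrate the idea, not Tarsier's
          | actual `format_text` implementation:
          | 
          |     from dataclasses import dataclass
          | 
          |     @dataclass
          |     class Annotation:
          |         # hypothetical OCR output: one word plus its
          |         # normalized (0..1) page coordinates
          |         text: str
          |         left: float
          |         top: float
          | 
          |     def annotations_to_text(annotations, cols=120, rows=60):
          |         grid = [[" "] * cols for _ in range(rows)]
          |         for ann in sorted(annotations,
          |                           key=lambda a: (a.top, a.left)):
          |             r = min(rows - 1, int(ann.top * rows))
          |             c = min(cols - 1, int(ann.left * cols))
          |             for i, ch in enumerate(ann.text):
          |                 if c + i < cols:
          |                     grid[r][c + i] = ch
          |         return "\n".join("".join(row).rstrip()
          |                          for row in grid)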
        
         | Oras wrote:
          | Have you tried AWS Textract for table extraction and then an
          | LLM to format the data?
        
           | davedx wrote:
           | Azure have a decent set of offerings for this too, they work
           | quite well: https://learn.microsoft.com/en-gb/azure/ai-
           | services/document...
        
         | vikp wrote:
         | This isn't specifically tuned for tables (more for general pdf
         | to markdown), but it's worked for some people with similar use-
         | cases - https://github.com/VikParuchuri/marker
        
         | derefr wrote:
         | Or how about the opposite? Give me a CLI tool to pipe
         | implicitly-tabular space-padded text into -- a smart cut(1) --
         | where I can say "give me column 3" and it understands how to
         | analyze the document as a whole (or at least a running sample
         | of a dozen lines or so), to model the correct column
         | boundaries, to extract the contents of that column. (Which
         | would also include trimming off any space-padding from the
         | content. I want the data, not a fixed-width field containing
         | it!)
         | 
         | For that matter, give me a CLI tool that takes in an _entire_
         | such table, and lets me say  "give me rows 4-6 of column Foo"
         | -- and it reads the table's header (even through fancy box-
         | drawing line-art) to determine which column is Foo, ignores any
         | horizontal dividing lines, etc.
         | 
         | I'm not sure whether these tasks actually require full-on ML --
         | probably just a pile of heuristics would work. Anything would
         | be better than the low-level tools we have today.
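          | 
          | (For the simple case where the separators really are
          | all-blank character columns, a pile of heuristics does go a
          | long way. A sketch of the "smart cut" half of this, assuming
          | space-padded input on stdin and a 1-based column number as
          | the only argument:
          | 
          |     import sys
          | 
          |     def column(lines, n):
          |         width = max(len(l) for l in lines)
          |         padded = [l.ljust(width) for l in lines]
          |         # character positions blank in every line act as
          |         # column separators
          |         blank = [all(l[i] == " " for l in padded)
          |                  for i in range(width)]
          |         fields, start = [], None
          |         for i, b in enumerate(blank + [True]):
          |             if not b and start is None:
          |                 start = i
          |             elif b and start is not None:
          |                 fields.append((start, i))
          |                 start = None
          |         s, e = fields[n - 1]
          |         return [l[s:e].strip() for l in padded]
          | 
          |     if __name__ == "__main__":
          |         lines = sys.stdin.read().splitlines()
          |         print("\n".join(column(lines, int(sys.argv[1]))))
          | 
          | It falls over as soon as a cell's content spills across a
          | gap, which is exactly where the running-sample/modeling idea
          | above would earn its keep.)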
        
         | davedx wrote:
         | I'm having decent success with GPT4o on this. Have you given it
         | a try? It probably varies from table structure to table
         | structure.
        
       | shodai80 wrote:
        | How do you know, for a specific web element, what label it is
       | associated with for a textbox or select?
       | 
       | For instance, I might want to tag as you did where elements are,
       | but I still need an association with a label, quite often, to
       | determine what the actual context of the textbox or select is.
        
         | awtkns wrote:
          | Tarsier provides a mapping of element number (e.g. [23]) to
          | XPath. So for any tagged item we're able to map it back to the
         | actual element in the DOM, allowing for easy interaction with
         | the elements on the page.
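          | 
          | In Playwright terms, acting on an LLM intent is then roughly
          | (the intent format and the tag_to_xpath contents here are
          | illustrative, not something Tarsier prescribes):
          | 
          |     from playwright.async_api import Page
          | 
          |     async def execute(page: Page, tag_to_xpath: dict,
          |                       intent: dict) -> None:
          |         xpath = f"xpath={tag_to_xpath[intent['element']]}"
          |         if intent["action"] == "click":
          |             await page.click(xpath)
          |         elif intent["action"] == "type":
          |             await page.fill(xpath, intent["text"])
          | 
          |     # e.g. await execute(page, {23: '//*[@id="q"]'},
          |     #     {"action": "type", "element": 23, "text": "hn"})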
        
           | shodai80 wrote:
           | I understand that, I assume you are tagging the node and
           | making a basic xpath to the node/attribute with your tag id.
           | Understood. But how relevant is tagging a node when I have no
           | idea what the node is actually for?
           | 
           | EX: Given a simple login form, I may not know if the label is
           | above or below the username textbox. A password box would be
            | below it. I have a hard time understanding the relevance
            | of tagging without context.
           | 
           | Tagging is basically irrelevant to any automated task if we
           | do not know the context. I am not trying to diminish your
           | great work, don't get me wrong, but if you don't have context
            | I don't see much relevance. You're doing something that is
            | easily scripted with XPath templates, which I've done for
            | over a decade.
        
             | awtkns wrote:
              | This is where an LLM comes in. A typical pipeline would
              | tag a page, transform it into a textual representation,
              | and then pass it to an LLM, which would be able to reason
              | about which field(s) are the ones you're looking for,
              | much like a human would.
        
               | shodai80 wrote:
               | My point still stands. How do you augment data for an LLM
               | when you know the context of a page? Do you go through
                | every element and set up the data for an associated label?
               | Do you use div scoping via offset parent through a script
               | to generate associated div (good approach, bad in real-
               | life conditions though)? Do you convert the DOM to JSON
               | or some data structure? That means little because you
               | still don't have context, you'd have to do it by hand
               | every time the layout changes...and you would have to be
               | very specific, which is a separate problem for modeling
               | as layouts are modified. What if the UI can be modified
               | to have different layout types, such as label above,
               | label to side, label below...where this can be
               | dynamically set.
               | 
                | What I am pointing out here is that even data modeling
                | is mostly irrelevant unless you want to go through
                | every page/permutation of a page...all the while hoping
                | the layout isn't modified, or you're back to training
                | all over again...which is downtime, and at some point
                | you'll realize it's just better to store user-created
                | XPaths, as it's quicker to update those than to retrain.
               | 
               | How do you reason with an LLM without going through any
               | of the above? Automation cannot consistently have
                | downtime for retraining; that's the antithesis of its
                | purpose.
               | 
               | Let's not even get into shadow dom issues.
               | 
               | I am keying on your third bullet point on Github:
               | 
               | "How can you inform a text-only LLM about the page's
               | visual structure?"
               | 
               | My questions suggest a gap in your awesome
               | accomplishment.
        
               | KhoomeiK wrote:
                | We run OCR on the screenshot & convert it to whitespace-
                | structured text, which is then passed to the LLM. The
                | images below might make it clearer for you:
               | 
               | [1] https://github.com/reworkd/tarsier/blob/main/.github/
               | assets/...
               | 
               | [2] https://github.com/reworkd/tarsier/blob/main/.github/
               | assets/...
        
               | shodai80 wrote:
                | The screenshots provided do not show textboxes,
                | selects, or other input nodes with labels. Show me text
                | output with correctly associated labels for inputs and
                | I will be shocked.
        
               | KhoomeiK wrote:
               | They do show textboxes with labels. From our readme:
               | 
               | "Keep in mind that Tarsier tags different types of
               | elements differently to help your LLM identify what
               | actions are performable on each element. Specifically:
               | 
               | [#ID]: text-insertable fields (e.g. textarea, input with
               | textual type)
               | 
               | [@ID]: hyperlinks (<a> tags)
               | 
               | [$ID]: other interactable elements (e.g. button, select)
               | 
               | [ID]: plain text (if you pass tag_text_elements=True)"
               | 
               | Do you see the search boxes labeled [#4] and [#5] at the
               | top? And before you say that the tag is on a different
               | line from the placeholder text--yes, and our agent is
               | smart enough to handle that minor idiosyncrasy. Are you
               | shocked? :)
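                | 
                | If it helps, the tag grammar is regular enough that
                | the IDs and their kinds can be pulled back out of the
                | OCR text with a small regex (an illustrative helper,
                | not part of Tarsier's API):
                | 
                |     import re
                | 
                |     TAG_RE = re.compile(r"\[([#@$]?)(\d+)\]")
                |     KIND = {"#": "text_input", "@": "link",
                |             "$": "interactable", "": "text"}
                | 
                |     def parse_tags(page_text):
                |         return {int(m.group(2)): KIND[m.group(1)]
                |                 for m in TAG_RE.finditer(page_text)}
                | 
                |     parse_tags("[#4] Search  [@12] Login  [$7] Go")
                |     # {4: 'text_input', 12: 'link', 7: 'interactable'}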
        
       | abrichr wrote:
       | Congratulations on shipping!
       | 
       | In
       | https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt...
       | we use FastSAM to first segment the UI elements, then have the
       | LLM describe each segment individually. This seems to work quite
       | well; see
       | https://twitter.com/OpenAdaptAI/status/1789430587314336212 for a
       | demo.
       | 
       | More coming soon!
        
         | jackienotchan wrote:
         | Looking at OpenAdapt, I'm wondering why they didn't integrate
         | Tarsier into AgentGPT, which is their flagship github repo but
         | doesn't seem to be under active development anymore.
        
           | KhoomeiK wrote:
           | We have a lot more powerful use-cases for Tarsier in web data
           | extraction at the moment. Stay tuned for a broader launch
           | soon!
        
       | v3ss0n wrote:
        | Since it is just a wrapper around Google's hosted API, it
        | can't be run locally as fully open source.
        
         | KhoomeiK wrote:
         | More OCR providers are on the roadmap and we'd love for you to
         | contribute any local OCR models you think could be useful! I
         | wouldn't call it a wrapper though :)
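          | 
          | For anyone who wants to experiment, word-level boxes from a
          | local engine like Tesseract are a reasonable starting point;
          | how exactly they would be wired into Tarsier's OCR-service
          | interface is left as an assumption here:
          | 
          |     import pytesseract
          |     from PIL import Image
          | 
          |     def local_ocr(path):
          |         img = Image.open(path)
          |         data = pytesseract.image_to_data(
          |             img, output_type=pytesseract.Output.DICT)
          |         words = []
          |         for i, text in enumerate(data["text"]):
          |             if not text.strip():
          |                 continue
          |             words.append({
          |                 "text": text,
          |                 # normalize to 0..1 page coordinates
          |                 "left": data["left"][i] / img.width,
          |                 "top": data["top"][i] / img.height,
          |                 "width": data["width"][i] / img.width,
          |                 "height": data["height"][i] / img.height,
          |             })
          |         return words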
        
       | jackienotchan wrote:
       | Why was the Show HN text removed? Too much self promotion? You're
       | a YC company, so I'm surprised the mods would do that.
       | 
       | https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=tru...
       | 
       | > Hey HN! I built a tool that gives LLMs the ability to
       | understand the visual structure of a webpage even if they don't
       | accept image input. We've found that unimodal GPT-4 + Tarsier's
       | textual webpage representation consistently beats multimodal
       | GPT-4V/4o + webpage screenshot by 10-20%, probably because
       | multimodal LLMs still aren't as performant as they're hyped to
       | be. Over the course of experimenting with pruned HTML,
       | accessibility trees, and other perception systems for web agents,
       | we've iterated on Tarsier's components to maximize downstream
       | agent/codegen performance.
       | 
       | Here's the Tarsier pipeline in a nutshell:
       | 
       | 1. tag interactable elements with IDs for the LLM to act upon &
       | grab a full-sized webpage screenshot
       | 
       | 2. for text-only LLMs, run OCR on the screenshot & convert it to
       | whitespace-structured text (this is the coolest part imo)
       | 
       | 3. map LLM intents back to actions on elements in the browser via
       | an ID-to-XPath dict
       | 
       | Humans interact with the web through visually-rendered pages, and
       | agents should too. We run Tarsier in production for thousands of
       | web data extraction agents a day at Reworkd (https://reworkd.ai).
       | 
       | By the way, we're hiring backend/infra engineers with experience
       | in compute-intensive distributed systems!
       | 
       | reworkd.ai/careers
        
         | KhoomeiK wrote:
         | Thanks for pointing this out! Yeah, it's pretty strange. We
         | thought including Show HN text was encouraged to engage with
         | the community?
        
           | jackienotchan wrote:
           | Did you delete it or the mods?
        
             | KhoomeiK wrote:
             | Must have been the mods, I spent quite a bit of time on the
             | content lol
        
         | dang wrote:
         | Not sure what happened there! I've restored the text now.
        
       | savy91 wrote:
        | Am I wrong in thinking this could very well be the backbone of
        | an alternative to the Rabbit AI, where you basically end up
        | having possibly infinite tools for your LLM assistant to use
        | to reach a goal without having to build API integrations?
        
         | KhoomeiK wrote:
         | Yup, it could! There are a lot of players in the generalist
         | personal web agent space but I personally think that use-case
         | will be eaten by big players since fundamental foundation model
         | improvements are required. That being said, Tarsier is a great
         | place to start for building an open-source web agent for
         | automating cool little tasks.
         | 
         | At Reworkd, we're focused on web agents for data extraction at
         | scale, which isn't as hyped as the generalist agents but we
         | find provides a lot of value and already works pretty well.
        
       | jumploops wrote:
       | How does the performance compare to VimGPT[0]?
       | 
       | I assume the screenshot-based approach is similar, whereas the
       | text approach should be improved?
       | 
       | Very cool either way!
       | 
       | [0] https://github.com/ishan0102/vimGPT
        
         | KhoomeiK wrote:
         | VimGPT couples the perception to a specific LLM/agent whereas
         | Tarsier is solely a perception system that you can use for any
         | uni/multi-modal web agent. So it's hard to compare, but you
         | could say that VimGPT's performance probably lies somewhere in
         | the middle of Tarsier's performance distribution (which varies
         | as a function of your specific agent/prompt system).
        
       | reidbarber wrote:
       | Neat! Been building something similar to the tagging feature in
        | TypeScript: https://github.com/reidbarber/webmarker
       | 
       | The Python API on this is really nice though.
        
       ___________________________________________________________________
       (page generated 2024-05-15 23:00 UTC)