[HN Gopher] Skyvern Browser Agent 2.0: How We Reached State of t...
       ___________________________________________________________________
        
       Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
        
       Author : suchintan
       Score  : 19 points
       Date   : 2025-01-17 15:23 UTC (7 hours ago)
        
 (HTM) web link (blog.skyvern.com)
 (TXT) w3m dump (blog.skyvern.com)
        
       | happyopossum wrote:
       | Many of the examples given for agents such as this are things I
       | just flat wouldn't trust an LLM to do - buying something on
       | Amazon for example: Will it pick new or 'renewed'? Will it select
       | an item that is from a janky looking vendor and may be
       | counterfeit? Will it pick the cheapest option for me? What if
       | multiple colors are offered?
       | 
       | This one example alone has so many branches that would require
       | knowing what's in my head.
       | 
       | On the flip side, it's a ridiculously simple task for a human to
       | do for themselves, so what am I truly saving?
       | 
       | Call me when I can ask it to check the professional reviews of X
       | category on N websites (plus YouTube), summarize them for me, and
       | find the cheapest source for the top 2 options in the category
       | that will arrive in Y days or sooner.
       | 
       | That would be useful.
        
         | Fnoord wrote:
         | I got Amazon Prime. If it has Prime, it is a no-brainer. Free
         | return for 30 days. No S&H costs. Only cost is my time.
        
           | drdaeman wrote:
           | Yea, but LLMs cannot reason - we've all seen them blurt out
           | complete non sequiturs, or end up in death loops of
           | pseudo-reasoning (e.g.
           | https://news.ycombinator.com/item?id=42734681 has a few
           | examples). I don't think one should trust an LLM to pick
           | Prime products _all the time_ even if that's very explicitly
           | requested - I'm sure it's possible to minimize errors so
           | it'll do the right thing most of the time, but having a
           | guarantee that it won't pick a non-Prime item sounds
           | impossible. Same for any other task - if there is a way to
           | make a mistake, a mistake will eventually be made.
           | 
           | (Idk if we can trust a human either - brain farts are a thing
           | after all, but at least humans are accountable. Machines are
           | not - at least not at the moment.)
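
The "a mistake will eventually be made" argument above can be made concrete with a little arithmetic. This is a minimal sketch, assuming every run fails independently with the same probability (a simplification); the 1% rate is a made-up number, not a measured one, and the function name is hypothetical.

```python
def p_at_least_one_error(p_error: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts goes wrong."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # P(no error in any run) = (1 - p)^runs, so take the complement.
    return 1.0 - (1.0 - p_error) ** runs


if __name__ == "__main__":
    # Hypothetical 1% per-order error rate at increasing order volumes.
    for runs in (1, 10, 100, 1000):
        print(runs, round(p_at_least_one_error(0.01, runs), 3))
```

Under these assumptions, a 1% per-order error rate already gives roughly a 63% chance of at least one bad order after 100 orders, which is why "most of the time" and "guaranteed" diverge quickly as volume grows.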
        
             | lyime wrote:
             | To your last point -- Humans make mistakes too. I asked my
             | EA to order a few things for our office a few days ago, and
             | she ended up ordering things that I did not want. In this
             | case I could have written a better prompt. Even with a better
             | prompt she could have ordered the unwanted item. This is a
             | reversible decision.
             | 
             | So my point is that while you might get some mistakes,
             | it's worth automating as long as many of the decisions
             | are reversible or correctable.
             | 
             | You might not want to use this in all cases, but it's still
             | worthwhile for many, many cases. Whether a use case is
             | worth automating depends on the rate of error that is
             | acceptable for it.
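
One way to put numbers on "acceptable rate of error" is a break-even calculation. This is a rough sketch, assuming every mistake is reversible at a fixed correction cost; the function names and all cost figures are hypothetical.

```python
def expected_saving_per_task(manual_cost: float, agent_cost: float,
                             error_rate: float, correction_cost: float) -> float:
    """Expected saving when an agent does the task instead of a person.

    Each agent run costs `agent_cost`, plus `correction_cost` whenever it
    makes a (reversible) mistake, which happens with probability `error_rate`.
    """
    return manual_cost - (agent_cost + error_rate * correction_cost)


def break_even_error_rate(manual_cost: float, agent_cost: float,
                          correction_cost: float) -> float:
    """Error rate at which automating saves nothing; below it, automation wins."""
    if correction_cost <= 0:
        raise ValueError("correction_cost must be positive")
    return (manual_cost - agent_cost) / correction_cost


if __name__ == "__main__":
    # Hypothetical: $5 of human time, $0.50 per agent run, $15 to fix a bad order.
    print(break_even_error_rate(5.0, 0.5, 15.0))
    print(expected_saving_per_task(5.0, 0.5, 0.1, 15.0))
```

With these made-up figures, automation pays off as long as fewer than about 30% of orders need correcting. The same formula says the tolerable error rate shrinks as mistakes become more expensive, and an irreversible mistake is effectively an unbounded correction cost.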
        
           | CryptoBanker wrote:
           | If it fails enough times and you have to return enough
           | items...well, Amazon has been known to ban people for that.
           | 
           | If you have an AWS account created before 2017, an Amazon ban
           | means an AWS ban.
        
         | suchintan wrote:
         | This is a great point -- the example we chose was meant to be a
         | consumer example that we could all relate to. However, a similar
         | example exists for the enterprise, which may be more interesting.
         | 
         | Let's say that you are a parts procurement shop and want to
         | order 10,000 of SKU1, and 20,000 of SKU2. If you go on parts
         | websites like finditparts.com, you'll see that there is little
         | ambiguity when it comes to ordering specific SKUs
         | 
         | We've seen cases of companies that want to automate item
         | ordering like this on tens of different websites, and have
         | people (usually the CEO) spending a few hours a week doing it
         | manually.
         | 
         | Writing a script for this can take ~10-20 hours (if you know
         | how to code), but we can help you automate it in <30 minutes
         | with Skyvern, even if you don't know how to code!
        
       | govindsb wrote:
       | congrats Suchintan! huge achievement!
        
       | lyime wrote:
       | This is an impressive tool. I especially like the observability
       | around the workflow and the steps it takes to achieve the
       | outcome. We are potentially interested in exploring this if we
       | can get the cost down at scale.
        
         | suchintan wrote:
         | I'd love to chat to see how we can help! Here's my email:
         | suchintan@skyvern.com
         | 
         | We're working on 2 major improvements that will get cost down
         | at scale:
         | 
         | 1. We're building a code generation layer under the hood that
         |    will start to memorize actions Skyvern has taken on a
         |    website, so repeated runs will be nearly free.
         | 2. We're exploring some graph re-ranking techniques to
         |    eliminate useless elements from the HTML DOM when analyzing
         |    the page. For example, if you're looking at the product page
         |    and want to add a product to cart, the likelihood you'll
         |    need to interact with the Reviews page will be 0. No need to
         |    send that context along to the LLM.
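
The two cost ideas above (remember actions already taken on a site, and drop irrelevant elements before calling the LLM) can be illustrated in a few lines. This is a toy sketch and not Skyvern's implementation: plain keyword overlap stands in for graph re-ranking, a dictionary stands in for the code generation layer, and every name here (`prune_elements`, `ActionCache`, `fake_planner`, the element and action shapes) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Element = Dict[str, str]   # hypothetical shape: {"selector": ..., "text": ...}
Action = Tuple[str, str]   # hypothetical shape: (kind, selector)
Planner = Callable[[str, List[Element]], List[Action]]


def prune_elements(elements: List[Element], goal: str) -> List[Element]:
    """Drop elements whose text shares no words with the goal.

    Plain keyword overlap, used here as a stand-in for re-ranking. A real
    filter would at least ignore stop words such as "to" and "the".
    """
    goal_words = set(goal.lower().split())
    return [el for el in elements
            if goal_words & set(el["text"].lower().split())]


class ActionCache:
    """Remember planned actions per (site, goal) so repeat runs skip the LLM."""

    def __init__(self, planner: Planner) -> None:
        self._planner = planner
        self._plans: Dict[Tuple[str, str], List[Action]] = {}
        self.llm_calls = 0

    def actions_for(self, site: str, goal: str,
                    elements: List[Element]) -> List[Action]:
        key = (site, goal)
        if key not in self._plans:
            self.llm_calls += 1  # only cache misses reach the planner
            relevant = prune_elements(elements, goal)
            self._plans[key] = self._planner(goal, relevant)
        return self._plans[key]


def fake_planner(goal: str, elements: List[Element]) -> List[Action]:
    """Stand-in for an LLM call: click the first element it is shown."""
    return [("click", el["selector"]) for el in elements[:1]]


if __name__ == "__main__":
    page = [
        {"selector": "#add-to-cart", "text": "Add to cart"},
        {"selector": "#reviews", "text": "Customer reviews and ratings"},
    ]
    cache = ActionCache(fake_planner)
    for _ in range(3):
        print(cache.actions_for("shop.example", "add product to cart", page))
    print("planner calls:", cache.llm_calls)
```

The cache key here is deliberately naive. Anything real would need to detect when a site's layout changes and re-plan, otherwise a "nearly free" repeat run would replay stale actions.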
        
           | dataviz1000 wrote:
           | > We're exploring some graph re-ranking techniques to
           | eliminate useless elements from the HTML DOM when analyzing
           | the page.
           | 
           | Computer vision is useful and very quick; however, in my
           | experience, parsing the stacking context is much more useful.
           | The problem is creating a stacking context when a news site
           | embeds a YouTube or Bluesky post. It requires injecting a
           | script into each embed using Playwright. (Not mine, but
           | prior art: [0].)
           | 
           | In my free time, I've been quietly solving a problem I
           | encountered while creating browser agents, one that had no
           | solution 2 years ago. Most webpages are several independent
           | global execution contexts, and I'm developing a coherent way
           | to get them all to speak with each other. [1]
           | 
           | [0] https://github.com/andreadev-it/stacking-contexts-inspector
           | 
           | [1] https://news.ycombinator.com/item?id=42576240
        
       | skull8888888 wrote:
       | Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager
       | seems to be outdated; there's definitely a need for a new, harder
       | benchmark.
        
       ___________________________________________________________________
       (page generated 2025-01-17 23:01 UTC)