[HN Gopher] Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
___________________________________________________________________
Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals
Author : suchintan
Score : 19 points
Date : 2025-01-17 15:23 UTC (7 hours ago)
(HTM) web link (blog.skyvern.com)
(TXT) w3m dump (blog.skyvern.com)
| happyopossum wrote:
| Many of the examples given for agents like this are things I
| just flat-out wouldn't trust an LLM to do. Buying something on
| Amazon, for example: will it pick new or 'renewed'? Will it
| select an item from a janky-looking vendor that may be
| counterfeit? Will it pick the cheapest option for me? What if
| multiple colors are offered?
|
| This one example alone has so many branches that would require
| knowing what's in my head.
|
| On the flip side, it's a ridiculously simple task for a human to
| do for themselves, so what am I truly saving?
|
| Call me when I can ask it to check the professional reviews of X
| category on N websites (plus YouTube), summarize them for me, and
| find the cheapest source for the top 2 options in the category
| that will arrive in Y days or sooner.
|
| That would be useful.
| Fnoord wrote:
| I got Amazon Prime. If it has Prime, it's a no-brainer: free
| returns for 30 days, no S&H costs. The only cost is my time.
| drdaeman wrote:
| Yeah, but LLMs cannot reason - we've all seen them blurt out
| complete non-sequiturs or end up in death loops of pseudo-
| reasoning (e.g. https://news.ycombinator.com/item?id=42734681
| has a few examples). I don't think one should trust an LLM to
| pick Prime products _all the time_ even if that's very
| explicitly requested - I'm sure it's possible to minimize
| errors so it'll do the right thing most of the time, but
| having a guarantee that it won't pick a non-Prime item sounds
| impossible. Same for any other task - if there is a way to
| make a mistake, a mistake will eventually be made.
|
| (Idk if we can trust a human either - brain farts are a thing
| after all, but at least humans are accountable. Machines are
| not - at least not at the moment.)
| lyime wrote:
| To your last point -- humans make mistakes too. I asked my
| EA to order a few things for our office a few days ago, and
| she ended up ordering things that I did not want. In this
| case I could have written a better prompt. Even with a better
| prompt she could have ordered the unwanted item. This is a
| reversible decision.
|
| So my point is that while you might get some false
| positives, it's worth automating as long as many of the
| decisions are reversible or correctable.
|
| You might not want to use this in all cases, but it's still
| worthwhile for many, many of them. Whether a use case is
| worth automating depends on its acceptable rate of error.
| CryptoBanker wrote:
| If it fails enough times and you have to return enough
| items... well, Amazon has been known to ban people for that.
|
| And if you have an AWS account created before 2017, an Amazon
| ban means an AWS ban.
| suchintan wrote:
| This is a great point -- the example we chose was meant to be a
| consumer example that we could all relate to. However, a similar
| example exists for the enterprise which may be more interesting.
|
| Let's say that you are a parts procurement shop and want to
| order 10,000 of SKU1 and 20,000 of SKU2. If you go on parts
| websites like finditparts.com, you'll see that there is little
| ambiguity when it comes to ordering specific SKUs.
|
| We've seen cases of companies that want to automate item
| ordering like this across tens of different websites, and have
| people (usually the CEO) spending a few hours a week doing it
| manually.
|
| Writing a script to do it can take ~10-20 hours (if you know how
| to code)... but we can help you automate it in <30 minutes with
| Skyvern, even if you don't know how to code!
| govindsb wrote:
| congrats Suchintan! huge achievement!
| lyime wrote:
| This is an impressive tool. I especially like the observability
| around the workflow and the steps it takes to achieve the
| outcome. We are potentially interested in exploring this if we
| can get the cost down at scale.
| suchintan wrote:
| I'd love to chat to see how we can help! Here's my email:
| suchintan@skyvern.com
|
| We're working on 2 major improvements that will get costs down
| at scale:
|
| 1. We're building a code generation layer under the hood that
| will start to memorize actions Skyvern has taken on a website,
| so repeated runs will be nearly free.
|
| 2. We're exploring some graph re-ranking techniques to
| eliminate useless elements from the HTML DOM when analyzing the
| page. For example, if you're looking at a product page and want
| to add a product to the cart, the likelihood you'll need to
| interact with the Reviews page is zero. No need to send that
| context along to the LLM.
| dataviz1000 wrote:
| > We're exploring some graph re-ranking techniques to
| eliminate useless elements from the HTML DOM when analyzing
| the page.
|
| Computer vision is useful and very quick; however, in my
| experience, parsing the stacking context is much more useful.
| The problem is creating a stacking context when a news site
| embeds a YouTube or Bluesky post: it requires injecting a
| script into each frame using Playwright. (Not mine, but prior
| art [0].)
|
| In my free time I've been quietly solving a problem I ran into
| while building browser agents two years ago, one that had no
| solution back then: most webpages are several independent
| global execution contexts, and I'm developing a coherent way
| to get them all to speak to each other. [1]
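|
| As a rough illustration (a minimal Playwright sketch with a
| made-up probe, not the code from the projects linked below):
|
|     from playwright.sync_api import sync_playwright
|
|     # Each iframe (YouTube embed, Bluesky embed, ...) is its own
|     # global execution context, so probe every frame separately.
|     PROBE = """
|     () => ({
|       url: location.href,
|       // crude stacking-context hint: positioned elements with a
|       // non-auto z-index each create a new stacking context
|       stackingRoots: [...document.querySelectorAll('*')].filter(el => {
|         const s = getComputedStyle(el);
|         return s.position !== 'static' && s.zIndex !== 'auto';
|       }).length,
|     })
|     """
|
|     with sync_playwright() as p:
|         page = p.chromium.launch().new_page()
|         page.goto("https://example.com")  # placeholder URL
|         for frame in page.frames:  # main frame plus every embed
|             try:
|                 print(frame.evaluate(PROBE))
|             except Exception as exc:  # frames can detach mid-run
|                 print(f"{frame.url}: {exc}")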
|
| [0] https://github.com/andreadev-it/stacking-contexts-inspector
|
| [1] https://news.ycombinator.com/item?id=42576240
| skull8888888 wrote:
| Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager
| seems to be outdated; there's definitely a need for a new,
| harder benchmark.
___________________________________________________________________
(page generated 2025-01-17 23:01 UTC)