[HN Gopher] Web Bench: a new way to compare AI browser agents
___________________________________________________________________
Web Bench: a new way to compare AI browser agents
Author : suchintan
Score : 22 points
Date : 2025-05-29 14:57 UTC (8 hours ago)
(HTM) web link (blog.skyvern.com)
(TXT) w3m dump (blog.skyvern.com)
| wm2 wrote:
| super cool!
| neveroddoreven wrote:
| I had no idea WebVoyager only spanned 15 websites lol... the 452
| figure you have still seems a little low though - do you have
| plans to expand it? It seems like you'd want as many sites as
| possible to improve the real-world accuracy of agents due to the
| long-tail nature of website traffic
| suchintan wrote:
| We definitely plan to expand it. I want to get to ~10,000 for a
| reasonable benchmark.
|
| 15 blew my mind -- it's too easy to overfit that dataset
| gitmagic wrote:
| Would love to see how Nelly [0] performs on this benchmark.
|
| [0] https://nelly.is
| suchintan wrote:
| Very cool. The benchmark can be found here if you want to take
| a look at it: https://github.com/Halluminate/WebBench
| vasusen wrote:
| Thank you so much for creating this, folks! A browser navigation
| agent is a key part of our AI QA setup at Donobu
| (https://donobu.com/). We found the WebVoyager benchmarks
| severely lacking for complex e2e test cases like logged-in
| dashboards, onboarding forms, etc.
|
| While the extraction/2FA flows aren't super relevant to us, this
| saves us time from building our own set of benchmarks. Really
| appreciate it and hope we can contribute to make this a really
| large set.
| helsinki wrote:
| Does anyone use Skyvern to build their websites? I'm wondering
| how I might benefit from using an agentic browser workflow
| instead of a Playwright MCP server for building a web UI?
___________________________________________________________________
(page generated 2025-05-29 23:01 UTC)