[HN Gopher] Web Bench: a new way to compare AI browser agents
       ___________________________________________________________________
        
       Web Bench: a new way to compare AI browser agents
        
       Author : suchintan
       Score  : 22 points
       Date   : 2025-05-29 14:57 UTC (8 hours ago)
        
 (HTM) web link (blog.skyvern.com)
 (TXT) w3m dump (blog.skyvern.com)
        
       | wm2 wrote:
       | super cool!
        
       | neveroddoreven wrote:
       | I had no idea WebVoyager only spanned 15 websites lol... the 452
       | figure you have still seems a little low though - do you have
       | plans to expand it? It seems like you'd want as many sites as
       | possible to improve the real-world accuracy of agents due to the
       | long-tail nature of website traffic.
        
         | suchintan wrote:
         | We definitely plan to expand it. I want to get to ~10,000 for a
         | reasonable benchmark.
         | 
         | 15 blew my mind -- it's too easy to overfit that dataset
        
       | gitmagic wrote:
       | Would love to see how Nelly [0] performs on this benchmark.
       | 
       | [0] https://nelly.is
        
         | suchintan wrote:
         | Very cool. The benchmark can be found here if you want to take
         | a look at it: https://github.com/Halluminate/WebBench
        
       | vasusen wrote:
       | Thank you so much for creating this, folks! A browser navigation
       | agent is a key part of our AI QA setup at Donobu
       | (https://donobu.com/). We found the WebVoyager benchmarks
       | severely lacking for complex e2e test cases like logged-in
       | dashboards, onboarding forms, etc.
       | 
       | While the extraction/2FA flows aren't super relevant to us, this
       | saves us time from building our own set of benchmarks. Really
       | appreciate it and hope we can contribute to make this a really
       | large set.
        
       | helsinki wrote:
       | Does anyone use Skyvern to build their websites? I'm wondering
       | how I might benefit from using an agentic browser workflow
       | instead of a Playwright MCP server for building a web UI?
        
       ___________________________________________________________________
       (page generated 2025-05-29 23:01 UTC)