hngopher.com

       [HN Gopher] Launch HN: Skyvern (YC S23) - open-source AI agent f...
       ___________________________________________________________________
        
       Launch HN: Skyvern (YC S23) - open-source AI agent for browser
       automations
        
       Hey HN, we're Suchintan and Shu from Skyvern
       (https://www.skyvern.com). We're building an open source tool to
       help companies automate browser-based workflows using LLMs. Our
       open source repo is at https://github.com/Skyvern-AI/Skyvern, and
       we're excited to share our cloud version with you
       (https://app.skyvern.com) :)
        
       Skyvern allows you to define a single (or a series of) goal-based
       prompts to instruct an agent to complete complex tasks on websites.
       Here's a quick demo of Skyvern:
       https://www.loom.com/share/76b231309df74a528061fcf102e1967f
        
       We built this to solve a specific problem: building browser
       automations often requires companies to either hire people and
       scale out operations teams to do tedious manual work, or hire
       developers to use products like UiPath or Selenium to build
       automations.
        
       Code-based solutions always run into the same problems: they're
       brittle (wow, this website added a new pop-up dialog and my script
       broke), and they fail to achieve the same objective across multiple
       websites (how can I fill out a contact-us form on hundreds of
       different websites?).
        
       We did a Show HN a few months ago
       (https://news.ycombinator.com/item?id=39706004), and since then
       we've onboarded customers for a wide variety of use cases:
       generating insurance quotes on websites like Geico.com; applying to
       jobs on websites like lever.co; automating filing permits in local
       government portals; registering new corporations for employment
       identification; fetching invoices from hundreds of different
       portals such as hydroone.com; automating purchasing on a handful of
       e-commerce websites like zooplus.com; and filling out contact-us
       forms on a bunch of random SMB websites (such as HVAC websites).
        
       To be able to service all of these, we've built and open-sourced
       quite a few interesting features:
        
       (1) a fully-featured React application allowing you to see every
       action Skyvern is taking in real-time;
        
       (2) livestreaming browser instances to allow our users to see what
       Skyvern is doing when running inside of a docker container;
        
       (3) authenticated sessions, integrating with Bitwarden and allowing
       users to specify Email + Phone + QR-code based 2FAs;
        
       (4) "workflows" allowing users to chain multiple goal-based prompts
       together, which can handle tasks like invoice downloading, or
       automating purchasing pipelines;
        
       (5) processing HTML elements (e.g. identifying + summarizing SVGs)
       and performing website interactions (e.g. iterating over dynamic
       autocompletes to fill in address information correctly);
        
       (6) "cached workflows", allowing Skyvern to memorize previous
       interactions (i.e. text inputs) and re-use them in future runs.
        
       We've also been blessed with a few model advancements to solve some
       of the cost concerns the community brought up. Skyvern's token
       costs went down roughly 80%, from $15 / 1M tokens (GPT-4V) to
       $2.50 / 1M tokens (GPT-4o).
        
       Despite that drop in model costs, Skyvern is still quite expensive
       to run, so we give every new user $5 of credits to try it out and
       see if it can be useful for you.
        
       We would be honored if you could give it a try at
       https://app.skyvern.com and share some feedback with us, and we
       look forward to any and all of your comments!
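        
       As a rough check on the pricing above, the sketch below works
       through the arithmetic in Python. The two per-token prices come
       from the post; the tokens-per-task figure is a made-up assumption
       for illustration only, not a measured Skyvern number.

```python
# Prices quoted in the post, in USD per 1M tokens.
GPT_4V_PRICE_PER_M = 15.00
GPT_4O_PRICE_PER_M = 2.50

# Hypothetical token usage for one task (an assumption, not a Skyvern figure).
ASSUMED_TOKENS_PER_TASK = 200_000


def percent_reduction(old_price: float, new_price: float) -> float:
    """Return how much cheaper new_price is than old_price, in percent."""
    return (old_price - new_price) / old_price * 100.0


def task_cost(tokens: int, price_per_million: float) -> float:
    """Return the USD cost of a task that consumes `tokens` tokens."""
    return tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    reduction = percent_reduction(GPT_4V_PRICE_PER_M, GPT_4O_PRICE_PER_M)
    print(f"Price reduction: {reduction:.1f}%")
    for name, price in (("GPT-4V", GPT_4V_PRICE_PER_M),
                        ("GPT-4o", GPT_4O_PRICE_PER_M)):
        cost = task_cost(ASSUMED_TOKENS_PER_TASK, price)
        print(f"{name}: ${cost:.2f} per assumed task")
```

       With the quoted prices the reduction works out to about 83%, which
       is consistent with the "80%" figure in the post.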
        
       Author : suchintan
       Score  : 210 points
       Date   : 2024-10-24 15:51 UTC (7 hours ago)
        
        
       | BrandiATMuhkuh wrote:
       | Congratulations on the launch. This is really cool. I was
       | recently tinkering with the same idea. But based on a browser
       | extension.
       | 
       | There are many back office tasks where people copy data from page
       | 1 into a form of page 2.
        
         | suchintan wrote:
         | Yeah we've been surprised by how many interesting things
         | companies do in the background to keep them running
         | 
         | The craziest one we heard about was this government portal in
         | India that was hard to automate because halfway through the
         | portal you had to refresh the page a bunch of times to get a
         | button to show up
        
           | selimthegrim wrote:
           | The railway ticket site?
        
             | suchintan wrote:
             | It was a state level permit website I think. Very
             | interesting!
        
       | Cheesman123 wrote:
       | Congrats on the launch - love the tool
        
       | modo_ wrote:
       | Congrats on the launch! This is really cool - one of the
       | applications of LLMs I find most compelling. I've seen so many
       | back office processes that have hundreds of steps, are incredibly
       | error prone, and traditionally couldn't be automated due to API
       | limitations. Solutions like Skyvern are going to supercharge
       | businesses that have had historically low margins due to the
       | number of humans required. (Not as a replacement for a human, but
       | as a force multiplier)
        
         | suchintan wrote:
         | The most fascinating part is how tough that work really is.
         | Everyone we've talked to loathes the manual stuff, but until a
         | better solution comes out, you have to allocate X% of your time
         | to it
        
       | mmaunder wrote:
       | Congrats!!! And super cool that you've open sourced it under the
       | AGPL. Sorry if this is answered in the docs but I did a brief
       | search on the source and noticed you're not using LangChain but
       | do plan to integrate it so it can be offered to that community.
       | I'm curious if you wouldn't mind talking about what you did use
       | to create the chain of thought/actions logic in Skyvern and if
       | you had to start work today if you'd consider going the
       | LangChain/Graph route? Thanks.
        
         | suchintan wrote:
         | We actually started off using the AutoGPT framework. There are
         | a ton of remnants of that (tasks, steps) but we found the
         | framework extremely limiting as we wanted to expand and do more
         | complex things
         | 
          | For example, we're currently using a multi agent architecture
          | where we have micro agents run to analyze SVGs and fill out
          | dynamic autocompletes. This would have been really hard to do
          | within that framework.
          | 
          | Frameworks like LangChain are good for early prototyping, but
          | they're too restrictive when you want to push the limits
        
       | DennisSFO wrote:
       | Congrats on the launch. I'm curious if you had any experience
       | running skyvern on airline websites (for example to extract award
       | availability for miles tickets from point A to B)? It seems like
       | airlines always change things around and have robust anti
       | scraping measures.
        
         | suchintan wrote:
         | Great question. We haven't helped anyone with that exact use
         | case yet, but we're in the middle of integrating with a company
         | to help them automate purchasing flights with Alaska and
         | Southwest (on the behalf of real people)
         | 
          | It's going to be our way of beta testing CC transactions and
          | testing them for reliability
        
       | ganeshkrishnan wrote:
       | Awesome work. I've had the GitHub repo starred since the day I
       | saw it on Show HN but never got around to using it.
       | 
       | I want to use this to automate approving/declining group members
       | for our Facebook group, which is approaching half a million
       | members, and FB admin tools are pretty lacking
        
         | suchintan wrote:
          | Thank you for the star! We had someone talk about us the other
          | day on r/localllama
          | (https://www.reddit.com/r/LocalLLaMA/comments/1g9zhbd/if_your...)
          | and I still couldn't believe that we ever got past 50 stars
        
       | delusional wrote:
       | The plaintext version of your signup email replaces the ampersand
       | in the url with an &amp; XML entity. You probably don't want
       | that.
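        
       One plausible cause (an assumption; the thread does not say how the
       email is generated) is that the plain-text part reuses a link that
       was already HTML-escaped for the HTML part. A minimal Python sketch
       of keeping the two apart, using a hypothetical example.com link:

```python
import html
from email.message import EmailMessage


def build_signup_email(link: str) -> EmailMessage:
    """Build a two-part email from a raw (unescaped) confirmation link."""
    msg = EmailMessage()
    msg["Subject"] = "Confirm your account"
    # text/plain part: use the raw URL, no entity escaping.
    msg.set_content(f"Confirm your account: {link}\n")
    # text/html part: escape the URL so "&" becomes "&amp;" in the attribute.
    msg.add_alternative(
        f'<p><a href="{html.escape(link)}">Confirm your account</a></p>',
        subtype="html",
    )
    return msg


def unescape_for_plaintext(escaped: str) -> str:
    """Undo entity escaping if only an HTML-escaped string is available."""
    return html.unescape(escaped)


if __name__ == "__main__":
    # Hypothetical link, for illustration only.
    raw_link = "https://example.com/verify?user=42&token=abc"
    message = build_signup_email(raw_link)
    print(message.get_body(preferencelist=("plain",)).get_content())
    print(message.get_body(preferencelist=("html",)).get_content())
```

       The design choice here is to escape at the point of rendering each
       part, rather than storing one pre-escaped string and reusing it.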
        
         | suchintan wrote:
         | Interesting. We will fix it
        
       | msp26 wrote:
       | Awesome, I've been working on a similar thing at a smaller scale
       | and I think this area is very promising.
       | 
       | I've limited my problem scope to single page interactions /
       | scraping which has been very reliable and useful for my company.
       | But agentic automation does sound fun.
        
         | suchintan wrote:
          | Yeah! We've seen this be especially useful if you want to work
          | in highly dynamic situations
         | 
         | Ex: filling out contact forms on hundreds of websites? It's
         | really tough for normal code to be able to handle that
         | cardinality. No problem for an AI agent
        
           | msp26 wrote:
           | Just out of curiosity, what sort of challenges did you run
           | into when scaling this up?
           | 
           | I don't see a need for my current solution to go past a
           | handful of browser instances but I'd imagine it might get
           | crazy.
        
             | suchintan wrote:
             | I made a LinkedIn post about it yesterday, but the funniest
             | has been our customer DoSing our service by accident
             | (sending 10K tasks per hour for 24h straight)
             | 
             | Toughest was Skyvern accidentally talking to a support
             | agent when the website said "your request failed, please
             | contact support"
             | 
              | https://www.linkedin.com/posts/suchintansingh_we-received-20...
        
       | glorpsicle wrote:
       | Congrats on the launch! I've been keeping up with you folks since
       | you last posted (a few months ago, I believe). How does
       | Anthropic's recent announcement of Claude's "computer use"
       | abilities grab you? What key differentiators does Skyvern have,
       | at this point in time ("computer use" with Claude being
       | relatively new)?
        
         | suchintan wrote:
         | Great question -- I was waiting for someone to ask this!
         | 
          | Their product and launch are super cool. It's incredible how
         | much it's able to do by just relying on tool use + micro agents
         | + screen shots + coordinates to interact with websites.
         | 
         | There are a couple of thoughts here:
         | 
         | (1) Will their competitors wait around and not build something
         | similar? Will xAI / Gemini / OpenAI / Mistral / MetaAI teams
         | wait around? Probably not. This is likely a huge part of the
         | future, and one company will not "take it all"
         | 
         | (2) How is value actually derived from these systems? Is a demo
         | + cool usable product enough? Likely not. Most people actually
         | want their workflow automated. For personal use-cases, this
         | might be enough.. but enterprises likely want something more
         | complex
         | 
         | (3) Will this be optimized for Claude only? What if you want to
         | run this with your own open source LLMs? Or you want to point
         | this at the best model on the market all the time? Will you get
         | that flexibility through a solution provided by a big player?
         | Likely not -- Anthropic has incentive to get you to use Claude
         | under the hood
         | 
          | The last point is the one that gives me hope. Our open source
          | users are able to pick their favourite model to run on. You're
          | not locked into Claude. You can run it on Gemini / GPT-4o or
          | open source ones such as Llama 3.2.
        
           | helloericsf wrote:
            | Congrats on the launch! Curious to know, which OSS models do
            | you see working best at the moment?
        
             | suchintan wrote:
             | We've had a decent amount of luck with InternVL 2.0 w/
             | Llama, and are pretty excited about Llama 3.2
             | 
             | It's still super early in the open source x vision model
             | space. The limiter actually seems to be the vision encoder
             | -- advancements here will pay off huge dividends
             | 
              | https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...
        
               | helloericsf wrote:
               | Thank you! Great insight.
        
         | philipbjorge wrote:
         | I work in this space and Claude's ability to count pixels and
         | interact with a screen using precise coordinates seems like a
         | genuinely useful innovation that I expect will improve upon
         | existing approaches.
         | 
         | Existing approaches tend to involve drawing marked bounding
         | boxes around interactive elements and then asking the LLM to
         | provide a tool call like `click('A12')` where A12 remaps to the
         | underlying HTML element and we perform some sort of Selenium/JS
         | action. Using heuristics to draw those bounding boxes is
         | tricky. Even performing the correct action can be tricky as it
         | might be that click handlers are attached to a different DOM
         | element.
         | 
          | Avoiding this remapping from a visual label to an HTML element
          | and instead working with high level operations like
          | `click(x, y)` or `type("foo")` directly on the screen will
          | probably be more effective at automating use cases.
         | 
         | That being said, providing HTML to the LLM as context does tend
         | to improve performance on top of just visual inference right
         | now.
         | 
         | So I dunno... I'm more optimistic about Claude's approach and
         | am very excited about it... especially if visual inference
         | continues to improve.
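        
          The two styles described above can be sketched in a few lines of
          Python. This is a toy illustration, not Skyvern's or Claude's
          implementation: the `LabeledElement` type, the selectors, and the
          pixel values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledElement:
    """An interactive element plus the label drawn over it on the screenshot."""
    label: str      # e.g. "A12", shown inside the bounding box
    selector: str   # hypothetical CSS selector for the underlying DOM node
    x: int          # left edge of the bounding box, in pixels
    y: int          # top edge of the bounding box, in pixels
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        """Return the pixel at the middle of the bounding box."""
        return (self.x + self.width // 2, self.y + self.height // 2)


def resolve_label(elements: list[LabeledElement], label: str) -> LabeledElement:
    """Map a label returned by the model back to the element it was drawn on."""
    by_label = {element.label: element for element in elements}
    if label not in by_label:
        raise KeyError(f"model returned unknown label: {label!r}")
    return by_label[label]


if __name__ == "__main__":
    elements = [
        LabeledElement("A11", "input#email", x=100, y=120, width=300, height=40),
        LabeledElement("A12", "button#submit", x=100, y=200, width=80, height=40),
    ]
    target = resolve_label(elements, "A12")
    # Label-based style: the automation layer clicks the remapped DOM node.
    print("click element:", target.selector)
    # Coordinate-based style: the click goes straight to a screen position.
    print("click at:", target.center())
```

          The label-based path depends on the remapping table being right,
          which is the fragile step the comment points at; the coordinate
          path skips the table but relies on the model locating pixels.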
        
           | suchintan wrote:
            | Agreed. In the short term (X months) I expect HTML
            | distillation + giving text to LLMs to win out.. but in the
            | long term (Y years) screenshot only + pixels will definitely
            | be the more "scalable" approach
           | 
           | One very subtle advantage of doing HTML analysis is that you
           | can cut out a decent number of LLM calls by doing static
           | analysis of the page
           | 
           | For example, you don't need to click on a dropdown to
           | understand the options behind it, or scroll down on a page to
           | find a button to click.
           | 
           | Certainly, as LLMs get cheaper the extra LLM calls will
           | matter less (similar to what we're seeing happen with Solar
           | panels where cost of panel < cost of labour now, but was
           | reversed the preceding decade)
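        
            The dropdown case can be illustrated with a short Python
            sketch that reads the options of a `<select>` straight from
            the markup, with no click and no LLM call. It uses only the
            standard library; the sample form is made up, and this is not
            Skyvern's actual HTML processing.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect (value, text) pairs for every <select> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.options: dict[str, list[tuple[str, str]]] = {}
        self._select_name = None
        self._option_value = None
        self._text_parts: list[str] = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "select":
            self._select_name = attributes.get("name") or attributes.get("id") or ""
            self.options[self._select_name] = []
        elif tag == "option" and self._select_name is not None:
            self._option_value = attributes.get("value")
            self._text_parts = []
            self._in_option = True

    def handle_data(self, data):
        if self._in_option:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        # Assumes each <option> has an explicit closing tag.
        if tag == "option" and self._in_option:
            text = "".join(self._text_parts).strip()
            value = self._option_value if self._option_value is not None else text
            self.options[self._select_name].append((value, text))
            self._in_option = False
        elif tag == "select":
            self._select_name = None


def options_by_select(page_html: str) -> dict[str, list[tuple[str, str]]]:
    """Return the options of each <select>, keyed by its name or id."""
    parser = SelectOptionParser()
    parser.feed(page_html)
    parser.close()
    return parser.options


if __name__ == "__main__":
    # Made-up form, for illustration only.
    page = """
    <form>
      <select name="state">
        <option value="CA">California</option>
        <option value="NY">New York</option>
      </select>
    </form>
    """
    print(options_by_select(page))
```

            Dropdowns built from custom widgets rather than `<select>`
            would need a different approach, so this covers only the
            simplest case.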
        
       | drewsonian wrote:
       | This is great, and I can think of several business uses and some
       | personal.
       | 
       | Like this: Could I use this to pull screenshots or PDFs of my
       | grocery receipts from a major grocery chain?
        
         | suchintan wrote:
         | Yes! We're helping a few companies with this right now. This
         | use-case actually surprised me.
         | 
         | I never realized how important it is to track invoices in
         | Europe (where VAT needs to be closely tracked), and a large %
         | of vendors require you to log into their portal to fetch them
        
       | infocollector wrote:
       | Quick question: What does DataDog's ddtrace do in the open source
       | version?
        
         | suchintan wrote:
         | Nothing -- we use DataDog for our cloud telemetry and haven't
         | built a great way to separate dependencies between cloud and
         | open source
        
       | dboreham wrote:
       | In case anyone else is confused as to what "browser automations"
       | means: this is about making a program that drives a target web
       | site (typically owned by someone else), in the manner of Selenium
       | or the like --- inserting key press events and mouse move/click
       | events, to make that target web site do something. Once you know
       | that, the rest of the description makes sense.
        
       | sergeyk wrote:
       | Congrats! Do you have numbers on WebArena (https://webarena.dev)
       | or VisualWebArena (https://jykoh.com/vwa)?
        
         | suchintan wrote:
         | Not yet! We haven't shared them publicly yet because our
         | internal dataset is super biased. Keep your eyes peeled though!
         | They'll be coming out in the next few weeks :)
        
       | andychert wrote:
       | Do I understand correctly that only the GUI is open source, and
       | you don't show the model itself?
        
         | andychert wrote:
          | Or you don't have your own model, and you use GPT-4V to
          | determine the coordinates where the bot should click?
        
       | drippingfist wrote:
       | This is very cool. Do you think I could use it to do UX/UI testing?
        
         | suchintan wrote:
         | Give it a try! It's very capable of doing simple tasks like
         | logging in and clicking around. You'll need to prompt
         | assertions like "Complete if..." and "Terminate if..."
        
       | jackb4040 wrote:
       | > You won't be able to run Skyvern unless you enable at least one
       | provider.
       | 
       | Any plans on bundling a local LLM / supporting local LLMs?
        
         | suchintan wrote:
          | We have an open issue for this right now -- we would LOVE some
          | contributions here. The biggest problem until Llama 3.2 came
          | out was that most (good) open source LLMs were text-only, and
          | Skyvern needs vision to perform well
         | 
         | This isn't true anymore -- we just need to build and launch
         | support for it
        
       | rokhayakebe wrote:
       | Can I use this to make changes to a Wordpress website if given
       | login?
        
         | suchintan wrote:
         | Depends on the scope of the changes. What did you have in mind?
        
           | rokhayakebe wrote:
           | Maybe add a new page or update a link.
        
       | sirmarksalot wrote:
       | As with any of these LLM workflow automation tools, it raises a
       | few questions about each potential use case, and the likely long-
       | term outcomes.
       | 
       | 1. Is this working around friction due to a lack of
       | interoperability between tools? For example, is this something
       | that would be more efficient if the owner of the website exposed
       | a REST service? Will the existence of this tool disincentivize
       | companies from exposing services when it makes sense?
       | 
       | 2. If there is a good reason for the lack of a service endpoint,
       | perhaps for security reasons, will your automation workflow be
       | used to bypass those security measures? Could your tool be used
       | by malicious actors to disable major services? Are you that
       | malicious actor yourself? Will your tool be used by scalpers to
       | prevent consumers from buying high-demand products?
       | 
       | 3. If this is being used to work around deferred maintenance with
       | internal tools and processes, will the existence of these kinds
       | of tools be used by management to justify further deferral of that
       | maintenance? Will your tool become a critical piece of the
       | support staff's workflow?
       | 
       | 4. If your tool is being used in good faith to work around anti-
       | patterns in website design, will the owner of the website be
       | incentivized to break your workflow? Is your use case just a step
       | in an arms race?
       | 
       | These are the thoughts that go through my head whenever I hear
       | about software being laid on top of complicated processes, where
       | instead of simplifying the underlying processes, we add another
       | layer of complexity to sweep it under the rug. I'm sure that
       | people will find your project useful, but I wonder what the
       | longer-term effects will be.
        
         | suchintan wrote:
         | 1. Yes absolutely. But the issue is a little bit more nuanced
         | than that. Websites without APIs don't have them for one of two
         | reasons: (1) They want to protect their data (LinkedIn) or (2)
         | can't be bothered to make an API (boutique websites, government
         | portals). This solves that problem, but also makes it so these
         | websites never have to build an API (after LLM costs go down).
         | 
         | 2. We don't want Skyvern to be used on websites that prohibit
         | this kind of behaviour (LinkedIn is the obvious example).
         | Specifically, we didn't open source any of our anti-bot or
         | captcha related code because we get requests to make "Reddit
         | upvote rings" and such. We don't want to support bad actors
         | like that
         | 
          | 3. I think this is a net good thing. AI browser automations =
          | less need for APIs = no need to maintain both an API and UI =
          | streamlined experience + less code = simpler systems
         | 
          | 4. I'm not 100% sure about this one. We usually just assume
          | companies don't build APIs because they don't have budget for
          | it, i.e. for non-malicious reasons. Companies like LinkedIn will
          | likely thwart any attempts at automation, but we're not
          | interested in participating in this cat-and-mouse game
        
       | Workaccount2 wrote:
       | Anyone building a start-up on 3rd party LLMs at this point has to
       | have some big cojones. Or you need a smash-and-grab business
       | model. Serious risk if your horizon is measured in years instead
       | of months.
       | 
       | Anthropic threw their hat in this ring yesterday, and it will
       | very likely be followed by OpenAI and Google soon. Godspeed.
        
         | _HMCB_ wrote:
          | What do you mean they threw in their hat? I am not aware of
          | what happened.
        
           | paladin314159 wrote:
           | https://www.anthropic.com/news/3-5-models-and-computer-use
        
             | _HMCB_ wrote:
             | Thank you.
        
       ___________________________________________________________________
       (page generated 2024-10-24 23:00 UTC)