[HN Gopher] Show HN: Skyvern - Browser automation using LLMs and...
___________________________________________________________________
Show HN: Skyvern - Browser automation using LLMs and computer
vision
Hey HN, we're building Skyvern (https://www.skyvern.com), an
open-source tool that uses LLMs and computer vision to help companies
automate browser-based workflows. You can see some examples here:
https://github.com/Skyvern-AI/skyvern#real-world-examples-of... and
there's a demo video at https://github.com/Skyvern-AI/skyvern#demo,
along with some instructions on running it locally.

We provide a natural-language API to automate repetitive manual
workflows that happen within companies' back offices. You can check
out our code and play with Skyvern here:
https://github.com/Skyvern-AI/Skyvern

We talked to hundreds of companies about things they do in the
background and found that most of them depend on repetitive manual
workflows. The breadth of these workflows surprised us - most
companies started off doing things manually, and eventually either
hired people to scale the manual work, or wrote scripts using
Selenium-like browser automation libraries.

In these conversations, one common point stood out: scaling is a pain
either way. Companies relying on hiring struggled to adjust team
sizes with fluctuating demand. Companies using Selenium and similar
tools had a different problem: it can take days or even weeks to get
a new workflow automated, and the result then requires ongoing
maintenance any time the underlying websites change, because the
XPath-based interaction logic suddenly becomes invalid.

We felt like there was a way to get the best of both worlds with
LLMs. We could use LLMs to reason through a website's layout, while
preserving the advantage of traditional browser automation: it scales
alongside demand.

This led us to build Skyvern with a few core functionalities:

1. Skyvern can operate on websites it's never seen before by
connecting visible elements with the natural language instructions
provided to us. We use a blend of computer vision and DOM parsing to
identify a set of possible actions on a website, and multi-modal LLMs
to map the natural language instructions to the available actions on
the page.

2. Skyvern is resistant to website layout changes, as it doesn't
depend on any predetermined XPaths or other selectors. If a layout
ever changes, we can leverage the methodology in #1 to complete the
user-specified goal.

3. Skyvern accepts a blob of information when navigating workflows --
basically just a JSON blob of whatever information you want to
provide, and then we use LLMs to map that to information on the
screen. For example: if you're generating a quote from Geico, they
commonly ask "Were you eligible to drive at 21?". The answer could be
inferred from the driver receiving their license in 2012, and having
a birth date of 1996.

The above strategy adapts well to a number of use cases that Skyvern
is helping companies with today:

(1) Automating materials procurement by searching for, adding to
cart, and purchasing products through vendor websites that don't have
APIs;

(2) Registering accounts, filing forms, and searching for information
on government websites (ex: registering franchise tax information for
Delaware C-corps);

(3) Generating insurance quotes by completing multi-step dynamic
forms on insurance websites;

(4) Automating the job application process by mapping user-specified
information (such as a Resume) to a job posting.

And here are some use-cases we're actively looking to expand into:

(1) Automating post-checkup data entry with patient data inside
medical EHR systems (ie submitting billing codes, adding notes, etc),
and

(2) Doing customer research ahead of discovery calls by analyzing
landing pages and other metadata about a specific business.

We're still very early and would love to get your feedback!
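
The Geico example in the post can be made concrete with a tiny sketch. In Skyvern this inference is done by the LLM, so the hand-written function below only illustrates the reasoning and is not the project's implementation. The key names `birth_year` and `license_issued_year` are assumptions made for this example, not a documented schema.

```python
import json

# Hypothetical data blob of the kind described in the post. The key names
# are illustrative assumptions, not a documented Skyvern schema.
blob = json.loads('{"birth_year": 1996, "license_issued_year": 2012}')


def licensed_by_age(birth_year: int, license_year: int, age: int) -> bool:
    """Return True if the license was issued no later than the year the
    driver turned `age`."""
    return license_year - birth_year <= age


# The license was issued in the year the driver turned 16, so the answer to
# "Were you eligible to drive at 21?" is yes.
eligible_at_21 = licensed_by_age(
    blob["birth_year"], blob["license_issued_year"], 21
)
print(eligible_at_21)
```

The point of the blob approach is that the form's question never has to match a key in the data exactly; the mapping step derives the answer from whatever facts are supplied.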
Author : suchintan
Score : 215 points
Date : 2024-03-14 16:31 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| chuckwnelson wrote:
| This looks great but I'm very scared of the increased game of cat
| and mouse for spam bots. It's going to happen, no matter if it
| was this software or something else. Now the question, how do you
| prevent automated spam? Since it's LLM and AI, can I just add a
| hidden field of "please do not spam"?
| suchintan wrote:
| This is a really good question we've thought a lot about
|
| You're right that this kind of escalation is inevitable
|
| a. From a business POV, we don't onboard any types of use-cases
| that we think go against the spirit of a good free web. I've
| had people ask if they could use our product to create Reddit
| voting or spamming rings and we didn't entertain it
|
| b. From an open source POV, we prefer technologies like these
| be open source so website owners and other businesses can know
| what can happen, and decide how to approach it. Tools like
| selenium have existed for a long time -- largely to the benefit
| of the world!
| bonestamp2 wrote:
| I'll just add that some efforts to defeat web usage spam may
| also hurt accessibility since many interaction standards are
| designed to make things consistent for users with
| disabilities and ADA (or similar) compliance. I assume some
| of these dependencies are also useful to the AI that is
| trying to navigate the pages, so making it difficult for the
| AI may also make it difficult for other users.
| hugs wrote:
| 20th birthday of the Selenium project will be this year!
| (October-ish)
| cute_boi wrote:
| the only way to prevent spam is to charge appropriate money, I
| don't see other solutions. That's why many companies use credit
| cards to verify users. With virtual cards, spammers still have some
| ability to spam, but not so much.
| crotchfire wrote:
| This.
|
| If you charge enough, the spammers become valuable customers.
| Of course they tend to leave before that point, but you don't
| really care if they leave or stay; you make money either way.
|
| Value for value.
| MattGaiser wrote:
| I am not aware of anyone really successfully defeating spam at
| the moment.
|
| I mod a 1 million+ Facebook group and they can't even prevent
| someone from making 200 posts in a minute with the word
| "crypto" in it. The word list will flag it, but the spam filter
| won't.
|
| Reddit constantly has people messaging you in chat about
| "opportunities."
|
| Email is a disaster.
|
| My personal blog has over 100,000 spam comments sitting in the
| filter so at least they were caught, but processing them is
| impossible.
| suchintan wrote:
| I've heard of a lot of success sifting through email spam
| using custom gmail scripts + GPT-4. Kind of interesting that
| we can use LLMs to both create and detect spam to some degree
| of effectiveness
| Zambyte wrote:
| > how do you prevent automated spam?
|
| Manually accept new accounts on your service. That's what I do
| for my Fediverse server, and I never have to deal with spam on
| my local timeline :). Does it scale? No. Does everything need
| to scale? Also no.
| lenerdenator wrote:
| but if I can't scale then the VC that gave my startup a huge
| check over a huge pile of blow at a party in Sunnyvale will
| harvest my organs
| resource_waste wrote:
| I've had stuff like that turn me off from signing up or ever
| checking back.
|
| Does it matter to you? Yes.
|
| Will you admit it? No.
|
| But yes, these are all decisions we need to make. That
| manually accepting is some serious dedication. Do you have
| kids?
| Zambyte wrote:
| > Does it matter to you? Yes.
|
| > Will you admit it? No.
|
| Are you trying to tell me my opinion? Because no, it
| does not matter to me. Your account would not be accepted
| because I don't know you.
| PeterisP wrote:
| If your target audience is businesses, not individuals,
| then you can go a _very_ long way with fully manual
| onboarding, invoicing, etc. It's different for things like
| consumer services or e.g. forum users, but why couldn't you
| manually vet every business your business trades with?
| agreeahmed wrote:
| Exciting to see this on HN. I think very soon agents like Skyvern
| will account for the vast, vast majority of web traffic.
| MattDaEskimo wrote:
| Maybe for a transition period.
|
| There's no reason for somebody to create a website, pay for
| resources, and hope for some sort of revenue if their visitors
| are mostly AI.
|
| So why bother creating a UI? Instead it would make more sense
| to close the website and offer the same information as a paid
| API service.
|
| Any sort of website that needs to validate human visitors will
| be plastered with DRM. Rendering these web browsing LLMs
| useless. And good riddance as well.
|
| Using an LLM to browse the internet feels like a huge waste of
| resources.
|
| Instead it would make more sense to have a wikipedia-like for
| AIs to crawl via embeddings.
| suchintan wrote:
| I suspect that web traffic will encapsulate both. Many
| websites (government ones in particular) aren't interested in
| API-based access patterns.
|
| This kind of pattern makes it so you can serve both users and
| agents with a single interface
| MattDaEskimo wrote:
| This would be ideal. The only issue here is trust. If my
| website relies on advertising then of course I would prefer
| to serve more content to a human visitor.
|
| So what? I bot protect my site, redirecting the AI to a
| minimalistic part that most likely expects some sort of
| value given?
|
| People will just breach this trust, like OP and abuse tools
| like Selenium (as they always have) to imitate being a
| human.
| suchintan wrote:
| I think this is pretty interesting -- I wonder if
| websites could allow agents to self-identify, and not
| count them towards advertising CPM to prevent dilution in
| the advertising metrics
|
| Perhaps a similar thing as robots.txt is in order
| (agents.txt?)
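
To make the agents.txt idea concrete: no such standard exists, so the sketch below simply assumes the file would reuse robots.txt syntax, which lets Python's standard `urllib.robotparser` evaluate it. The file contents and the `ExampleAgent` user-agent string are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agents.txt, written in robots.txt syntax (an assumption;
# there is no agents.txt standard). "ExampleAgent" is a made-up name.
AGENTS_TXT = """\
User-agent: ExampleAgent
Disallow: /checkout
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(AGENTS_TXT.splitlines())


def agent_may_visit(agent: str, url: str) -> bool:
    """Ask the parsed policy whether this self-identified agent may fetch url."""
    return parser.can_fetch(agent, url)


pricing_ok = agent_may_visit("ExampleAgent", "https://example.com/pricing")
checkout_ok = agent_may_visit("ExampleAgent", "https://example.com/checkout")
unknown_ok = agent_may_visit("UnknownBot", "https://example.com/pricing")
print(pricing_ok, checkout_ok, unknown_ok)
```

Like robots.txt, this only works for agents that identify themselves honestly, which is the trust problem raised earlier in the thread.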
| Spivak wrote:
| I mean what kind of websites are we talking about here? The
| kinds of websites where all the value can be extracted via an
| LLM are just content farms.
|
| And yeah, that sucks for content farms but putting up content
| and getting nothing in return is already how ad blockers work
| and it hasn't destroyed them. I seriously doubt that AI
| traffic will even put a dent 1/1000th the size of the traffic
| loss from Google snippets.
| hipadev23 wrote:
| Why would the majority of web traffic turn into extremely
| expensive to operate agents?
| failuser wrote:
| The expectation is that the price of AI bots will go down and
| get below the human-driven click farms we have now and thus
| make fighting bots too expensive because identifying humans
| gets harder every day.
| failuser wrote:
| That's why we can't have nice things. Are we at the end of
| Eternal September? Will all the signs of human life be
| restricted to paid or otherwise closed groups? If all free
| users are bots, who will even run ads that feed the Web 2.0
| internet?
|
| I still have fear that the real internet has already split from
| what I see and I was left behind.
| mosselman wrote:
| If I were to build some custom GPT powered thing for this. Is
| there a similar project I can use with a command line interface
| or some programmatic interface?
| keremyilmaz wrote:
| Skyvern is actually an API-first product! The UI we built is
| mainly for simplicity and being able to debug the steps our
| agent takes.
|
| You can easily copy sample curl requests through our UI. Feel
| free to check out the quickstart on our GitHub and let us know
| if you have any questions.
| mosselman wrote:
| Thanks I will check it out.
|
| Any idea on pricing/business model?
| suchintan wrote:
| We tend to charge per request our users send us.. although
| the exact amount depends a lot on the exact task you want
| to run. Want to send Skyvern on a 40+ page journey to
| answer a question? It's a bit more expensive than just
| navigating to a page and extracting information
|
| I'd love to chat about your use-case. Happy to follow-up
| over email (suchintan@skyvern.com) or over a quick call
| (https://meetings.hubspot.com/suchintan)
| is_true wrote:
| Weeks to automate something? Anyone experienced would be able to
| automate most workflows in a couple of days tops.
| suchintan wrote:
| You're right -- we should have written days to weeks.
|
| What's interesting here is that large companies like UI Path
| charge thousands of dollars to build a single robot for
| companies.. I wonder if that large up-front expense will still
| be necessary in this new world
| is_true wrote:
| That's crazy. We usually create robots and most of the time
| we charge less than a thousand USD.
|
| We have a lot of tooling in place now so most things take
| minutes. The harder step is getting the data in the client's
| infrastructure
| suchintan wrote:
| When you say "getting the data in the client's
| infrastructure", do you mean self-hosting the robots? or
| something else?
| dang wrote:
| I've edited the text above to say "days or even weeks".
| suchintan wrote:
| thank you!!
| dinobones wrote:
| Roughly how much does it cost to scrape a page? I see from
| the code this is basically an OpenAI API wrapper but you make no
| mention of that anywhere on your landing page/documentation, nor
| any mention of which LLMs this is capable of working with.
|
| Also, an idea is to offer a "record" and "replay" mode. Let the
| LLM run through the instructions, find the selectors, record and
| save them. Then you can run through again without using the LLM,
| replaying the interaction log, until the workflow breaks, then
| re-generate the "interaction log" or whatever.
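
The record-and-replay idea above can be sketched in a few lines. This is a toy model under stated assumptions: a "page" is just a dict of selector to label, and `fake_llm` stands in for a real LLM call. None of these names come from Skyvern.

```python
from typing import Callable, Dict

# A "page" is modelled as selector -> element label, standing in for a real
# browser page. Everything here is an illustrative stub.
Page = Dict[str, str]


class ReplayCache:
    """Remember which selector satisfied each instruction and reuse it
    until it stops matching."""

    def __init__(self, llm_find_selector: Callable[[Page, str], str]):
        self._llm_find_selector = llm_find_selector
        self._log: Dict[str, str] = {}  # instruction -> recorded selector
        self.llm_calls = 0

    def resolve(self, page: Page, instruction: str) -> str:
        selector = self._log.get(instruction)
        if selector is not None and selector in page:
            return selector  # replay: no LLM call needed
        # Nothing recorded, or the recorded selector broke: ask the LLM
        # again and re-record the answer.
        self.llm_calls += 1
        selector = self._llm_find_selector(page, instruction)
        self._log[instruction] = selector
        return selector


def fake_llm(page: Page, instruction: str) -> str:
    """Stand-in for an LLM call: pick the selector whose label is mentioned."""
    for selector, label in page.items():
        if label.lower() in instruction.lower():
            return selector
    raise LookupError(f"no element matches {instruction!r}")


cache = ReplayCache(fake_llm)
old_page = {"#btn-submit": "Submit"}
new_page = {"#send": "Submit"}  # layout changed, selector renamed

cache.resolve(old_page, "click submit")  # first run: LLM call, recorded
cache.resolve(old_page, "click submit")  # second run: replayed from the log
cache.resolve(new_page, "click submit")  # selector broke: LLM call again
print(cache.llm_calls)
```

Three resolutions cost two LLM calls here; on a stable site the cost per run tends toward zero, which is the saving the comment is after.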
| suchintan wrote:
| This is a great call-out. It's something currently in our
| roadmap
|
| Re: cost for execution. This really depends on the page, but
| currently it costs between 5 cents and 20 cents per page to
| execute (today).
|
| We have an improvement planned to help it "remember" or "cache"
| actions it's done in the past so it can just replay them and
| bring the cost down to near zero.
|
| Re: LLMs it's capable of working with, currently it's only
| GPT-4V. I'll get this updated soon!
| pkiv wrote:
| If you want to build it yourself, you could try using
| https://browserbase.com/. We offer managed headless browsers that
| work everywhere, every time. It costs $0.10 per browser
| session/hour (billed minutely). Feel free to shoot me an email
| if you want access! paul@browserbase.com
| pstorm wrote:
| Based on #2, it seems like they only use the LLM when the page
| changes. I had a prototype of this sort of system working and
| it was surprisingly fault tolerant.
| giamma wrote:
| At first I thought this was a test tool for Web applications, but
| now I understand it's meant to be a better RPA.
|
| Would it be usable for test automation? Would the API allow
| creating asserts?
| dvngnt_ wrote:
| there are already some existing solutions for e2e testing. I
| would say playwright with codegen works well enough but there
| are ones that make it even easier by wrapping around openapi
| but seems overkill
| suchintan wrote:
| Yes absolutely. You can prompt it to "terminate" if some state
| isn't met (ie XYZ text isn't displayed on the screen), and
| treat terminated results as failures
|
| For example, you could instruct it to go to hackernews and
| terminate if you don't see a comment from giamma by passing in
| this payload:
|
| { "url": "https://news.ycombinator.com", "navigation_goal":
| "goal is met if you see a post from giamma. Terminate if you
| don't" }
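
As a rough sketch of how the payload above could back a test assertion: the `url` and `navigation_goal` keys come from the comment, while the shape of the result (`status`, `"completed"`, `"terminated"`) is an assumption made here for illustration, not a documented response format.

```python
import json


def build_check_payload(url: str, what_to_see: str) -> str:
    """Build a request body in the shape shown in the comment above."""
    payload = {
        "url": url,
        "navigation_goal": (
            f"goal is met if you see {what_to_see}. Terminate if you don't"
        ),
    }
    return json.dumps(payload)


def check_passed(task_result: dict) -> bool:
    """Treat anything other than a completed task as a failed assertion.

    ASSUMPTION: the "status" key and the "completed" value are placeholder
    names chosen for this sketch.
    """
    return task_result.get("status") == "completed"


body = build_check_payload("https://news.ycombinator.com", "a post from giamma")
print(body)
print(check_passed({"status": "terminated"}))
```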
| ushakov wrote:
| There's a startup called Octomind (https://octomind.dev) doing
| exactly that and ZeroStep (https://zerostep.com), but on a
| lower level
| hirako2000 wrote:
| It reminds me of that bug a kid found to bypass the password
| locked screen of a very popular Linux distro.
|
| Might be great for pen testing.
| suchintan wrote:
| That's a great idea! I hadn't thought of pen-testing as a
| possible value prop for this product
| hubraumhugo wrote:
| AI should automate tedious and un-creative work, and data entry
| tasks definitely fit this description. Rule-based RPA will likely
| be replaced by fine-tuned AI agents for things like form filling
| and similar.
|
| Can you share some data on costs and scalability?
|
| At Kadoa, we're working on fully automating unstructured data ETL
| from websites, PDFs, etc. We quickly realized that doing this for
| a few data sources with low complexity is one thing, doing it for
| thousands of sources daily in a reliable, scalable, and cost-
| efficient way is a whole different beast.
|
| Using LLMs for every data extraction would be way too expensive
| and very slow. Instead, we use LLMs to generate the scraper and
| data transformation code and subsequently adapt it to website
| changes, which is highly efficient.
| suchintan wrote:
| Nice! We love what you're doing at Kadoa.
|
| We're trying our best not to move into the web scraping space
| -- we're focusing on automating uncreative, boring, tedious
| tasks.
|
| We've seen a lot of success going after form-filling on
| government websites, which would usually be very boring, but
| happens to work pretty well for us
| dvngnt_ wrote:
| > Skyvern understands how to solve CAPTCHAs to complete
| complicated workflows
|
| this seems like this could be used for abuse. the CAPTCHAs are
| specifically designed to stop botting on 3rd party websites.
|
| or this will just be another cat and mouse game where the next
| level of CAPTCHAs get more annoying and invasive to verify we are
| human
| worldsayshi wrote:
| It seems to me that the logical conclusion for captcha is to
| connect it indirectly to electronic id. This could be done in a
| privacy respecting way.
|
| You could get some token from the website. It could include
| encrypted service name and policies, like rate limit, that the
| authority should enforce. The client passes the token to the
| eId authority. The authority signs it and adds timestamp, but
| no user info. Client gives token to the service. Something like
| that. This is a bad top of mind example.
|
| I think we'll need to rely a lot more on eID in the future. I
| think it can be done in a good way but then it needs to be
| thought through before it gets adopted. And we have to be able
| to trust the eId institutes.
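
The token flow described above can be sketched as follows. This is a simplification under loud assumptions: it uses an HMAC with a key shared between the authority and the service, whereas a real deployment would more likely use public-key signatures, and it skips the encryption of the service name mentioned in the comment. All names and the key are hypothetical.

```python
import hashlib
import hmac
import json
import time

# ASSUMPTION: a shared secret stands in for the authority's signing key.
AUTHORITY_KEY = b"demo-authority-key"


def issue_challenge(service: str, max_requests_per_hour: int) -> dict:
    """Service side: state who is asking and which policy should apply."""
    return {"service": service, "rate_limit": max_requests_per_hour}


def _sign(claims: dict) -> str:
    message = json.dumps(claims, sort_keys=True).encode()
    return hmac.new(AUTHORITY_KEY, message, hashlib.sha256).hexdigest()


def authority_sign(challenge: dict) -> dict:
    """eID authority side: add a timestamp and a signature, but no user info."""
    token = dict(challenge, issued_at=int(time.time()))
    token["signature"] = _sign(token)
    return token


def service_verify(token: dict, max_age_seconds: int = 300) -> bool:
    """Service side: accept only fresh tokens with a valid signature."""
    claims = {k: v for k, v in token.items() if k != "signature"}
    fresh = time.time() - claims.get("issued_at", 0) <= max_age_seconds
    valid = hmac.compare_digest(_sign(claims), str(token.get("signature", "")))
    return fresh and valid


token = authority_sign(issue_challenge("example-forum", 10))
tampered = dict(token, rate_limit=1000)
print(service_verify(token), service_verify(tampered))
```

The service learns that some verified person vouched for the request and under which policy, but the token itself carries no identity, which is the privacy property the comment is reaching for.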
| suchintan wrote:
| 2FA and logged-in experience is sort of a proxy for eID. I
| suspect that's why so many companies require that you log in
| with something that knows your identity (log in with google),
| or ask you for your phone number to confirm your account
| hirako2000 wrote:
| But it's the same problem all over again, spammers would get
| an id, auth, then spam.
|
| Anti spams are about detecting whether activities are spam.
|
| Binding an identity is the naive mechanism that makes us
| think spam wouldn't happen. All it does is say ok we know
| it's pug35372 that tore the linens apart.
|
| We can put in every measure to authenticate users; that won't
| stop them from being bots wreaking havoc right after a manual
| authentication.
|
| There are even farms, manually created accounts by gig
| seekers who would fill forms, email and phone number
| verification for less than a dollar.
| suchintan wrote:
| Agreed.
|
| We didn't open source this functionality on purpose, and are
| very very specific about what use-cases we onboard that require
| it.
|
| That being said, we've gotten to learn a lot more about browser
| fingerprinting and captcha solving and it's a really
| interesting space.
|
| If you're curious about it, check out this blog post:
| https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html
| ushakov wrote:
| How does this compare to OpenAdapt?
|
| I have a feeling that this tech will become a commodity and will
| probably be built-in into the OS or Browser.
|
| Props for open-sourcing though!
| suchintan wrote:
| I agree -- this will likely get commoditized, which is why we
| didn't focus on making this a chrome extension. The API access
| pattern makes this particularly appealing as you can run
| multiple instances in the cloud
| suchintan wrote:
| Ah cool -- we weren't familiar with OpenAdapt. Will check it
| out.
|
| One big decision we made was to focus on browser automations
| (instead of computer automation like Adept or OpenAdapt). The
| reason for this was that we wanted to leverage the information
| available inside of a DOM to improve the quality of our agent's
| actions. We found that relying on image-only analysis with X,Y
| coordinate interactions wasn't able to offer high enough
| reliability for production workflows
| samstave wrote:
| >>(1) Automating post-checkup data entry with patient data inside
| medical EHR systems (ie submitting billing codes, adding notes,
| etc),
|
| FULL FUCKING STOP.
|
| [We talk about AI alignment. THIS is an alignment issue]
|
| Do you understand billing code fraud?
|
| If you supply this function - you will *eliminate ANY AND ALL
| human accountability* unless you have ALSO built a fully
| auditable provenance from DR <-ehr-whatever-> codes.
|
| Codes ARE why the US health system is BS.
|
| Here - if you want to be altruistic - then you will take it upon
| the fact that CODES are one of the most F'd up aspects of
| costing.
|
| Codes = [medical service provided]
|
| so code = 50 = checkup = [$50 <--- WHO THE HECK KNOWS]
|
| So lets say I am Big Hospital. "No, we will only allow $25 for
| code 50" - and so they get that deal.
|
| I am single clinic so they have to charge $50
|
| Build a dashboard for what the large medical groups can negotiate
| per code, vs how a small hospital or clinic group gets per code.
|
| Only automate it if you can literally show a dash of all
| providers and groups and what they can charge per code.
|
| In fact - code pricing is a medical stock market.
|
| (each hospital group negotiates between the price they will pay
| per code, how much lobbying is a factor and all these other
| factors...
|
| what we really need an LLM for is to literally map out all the BS
| in the Code negotiations btwn groups, pharma, insurance,
| lobbying, kickbacks, political)
|
| That's the medical holy grail.
|
| [EDIT: Just to show how passionate I am on this issue - here are
| some SOURCES:
|
| I have designed and built & commissioned out 11+ hospitals.
|
| Built the first iphone app for medical.. it was rejected by YC
| (hl-7 nurse comm system on iTouch devices) (2006?)
|
| opensourced that app to OpenVista.
|
| Brother was joint chiefs dr / head of va
|
| worked with building medical apps and blocked by every EHR...
|
| Zuckerbergs name is on top of some of the things I built at SFGH
| before he got there...(and ECH mtn vw)
|
| I've seen way beyond the kimono]
| suchintan wrote:
| We know very little about this space, except that the entire
| process is a little bit crazy.
|
| We've talked to a few companies now that would use a product
| like Skyvern to just automate billing information gathering to
| make sure patients don't get screwed in the billing process
|
| Are you open to chatting? I'd love to pick your brain about
| what's behind the kimono
|
| suchintan@skyvern.com or https://meetings.hubspot.com/suchintan
| razfar wrote:
| I'm curious about the computer vision aspect of this tool.
| Specifically, how was the model which draws bounding boxes around
| interactable elements trained? Definitely a step beyond existing
| browser automation software!
| suchintan wrote:
| It's surprisingly dumber than you think!
|
| I'm always fascinated by how far you can get with heuristics in
| certain situations. Check out the code here --
| https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...
| shnkr wrote:
| the moment I saw vision in the title I knew what was going on. it
| was first demoed[0] by AI Jason around 4 months back. is it any
| different?
|
| https://m.youtube.com/watch?v=IXRkmqEYGZA
| suchintan wrote:
| Love this video
|
| > self-operating-computer
|
| This is quite different than
| https://github.com/OthersideAI/self-operating-computer
|
| Self-operating-computer uses pixel mapping to control your
| computer. This is a very good approach, but it's extremely
| unreliable. GPT-4V frequently hallucinates pixel outputs,
| causing it to miss interactions, or enter fail-loops
|
| >The approach by AI Jason
|
| AI Jason is using image-only methods to interact with the
| browser. This is a great first step, but this approach tends to
| be rife with hallucinations or errors. We do DOM parsing in
| addition to image analysis to help GPT-4V correlate information
| in the image to the interactable elements within the DOM. This
| dramatically boosts its ability to perform the same task over
| and over again reliably (which proved impossible with the
| image-only approach)
| shnkr wrote:
| nice. I was looking for simpler hacks as V didn't scale for
| me. Later I couldn't find time and this got back burnered.
|
| interesting concept for problem solving though. congrats!
| suchintan wrote:
| Thanks! We definitely experimented with V only (that's the
| dream), but there's too much context missing:
|
| 1. What's behind a select option? You don't know until you
| click it, which means you need another iteration. This
| sucks.
|
| 2. How do you consistently correlate things in the
| images to actual actions (ie upload a file to a file input,
| click on a button, insert a date into a date field)? Having the
| additional HTML tag information dramatically improves the
| action selection process (click vs upload vs type)
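
Both points above can be illustrated with the standard `html.parser` module. The sketch below is a heuristic written for this example, not Skyvern's code: it reads `<option>` labels straight from the markup (no click needed) and picks an action type from the tag and `type` attribute. The action names are assumptions.

```python
from html.parser import HTMLParser


def action_for(tag: str, attrs: dict) -> str:
    """Choose an action type from tag information alone (illustrative rules)."""
    if tag == "input":
        input_type = (attrs.get("type") or "text").lower()
        if input_type == "file":
            return "upload"
        if input_type in {"button", "submit", "checkbox", "radio"}:
            return "click"
        return "type"
    if tag == "textarea":
        return "type"
    if tag == "select":
        return "select_option"
    if tag in {"button", "a"}:
        return "click"
    return "none"


class FormScanner(HTMLParser):
    """Collect actions for interactable tags and the labels of <option>s."""

    def __init__(self):
        super().__init__()
        self.actions = []  # (tag, action) in document order
        self.options = []  # option labels, visible without clicking
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            return
        action = action_for(tag, dict(attrs))
        if action != "none":
            self.actions.append((tag, action))

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option and data.strip():
            self.options.append(data.strip())


scanner = FormScanner()
scanner.feed(
    '<select name="state"><option>CA</option><option>NY</option></select>'
    '<input type="file" name="resume"><input type="date" name="dob">'
    "<button>Next</button>"
)
print(scanner.actions)
print(scanner.options)
```

A screenshot alone shows a closed dropdown and three boxes; the markup says which box takes a file, which takes typed text, and what the dropdown contains.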
| chadash wrote:
| First of all, wonderful work. I'm gonna be using this for sure. I
| can think of many use cases. What would be nice though is a
| simple API. I send you what I need, you send me a jobId that I
| can use to check the status of my job and then let me download
| the results when I'm done.
|
| I played with the Geico example, and it seems to do a good job on
| the happy path there. But I tried another one where it
| struggled... I wanted to get car rental prices from
| https://www.costcotravel.com/. I gave it airport + time of pickup
| and dropoff, but it struggled to hit the "rental car" tab. It got
| caught up on hitting the Rental Car button at the top, which
| brings up a popup that it doesn't seem to read.
|
| When I put in https://www.costcotravel.com/Rental-Cars, it
| entered JFK into the pickup location, but then failed to click
| the popup.
| suchintan wrote:
| We have a simple API we're building as a part of our cloud
| offering. It's in private beta today -- if you'd like to check
| it out please email me at suchintan@skyvern.com and I'd be
| happy to chat
|
| Thanks for the feedback re: costcotravel.com. Skyvern definitely
| does NOT have 100% coverage of the web. This is one of the
| reasons we were excited to open source -- so we could learn
| about more websites where it doesn't work as expected
|
| I've filed an issue for this case here:
| https://github.com/Skyvern-AI/skyvern/issues/69
| dtnewman wrote:
| I tried it out and it's pretty pricey. My OpenAI API bill is
| $3.20 after using this on a few different pages to test it out.
|
| Not saying I wouldn't pay that for some use cases, but it would
| limit me.
|
| One idea: making scrapers is a big pain. But once they are setup,
| they are cheap and fast to run... this is always going to be
| slower. What I'd love to see is a way to generate scrapers
| quickly. So you wouldn't be returning information from the New
| York City property registry... instead, you'd return Python code
| that I can use to scrape it in the future.
|
| edit: This is likely because it was struggling, so it had to make
| extra calls. What would be nice is a simple feature where you can
| input the maximum number of calls / tokens to use on the entire
| call. Or even better, do some math and put in a dollar cap. i.e.,
| go fill out the Geico forms for me and don't spend more than
| $1.00 doing it.
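
The dollar-cap idea above could look roughly like this. The sketch assumes the cost of each step can be estimated before it runs, which is a simplification, and it works in integer cents to avoid floating-point drift. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def run_with_cap(step_costs_cents: Iterable[int], cap_cents: int) -> Tuple[int, int]:
    """Run steps in order and stop before the first one that would push
    spending over the cap. Returns (steps_completed, cents_spent).

    ASSUMPTION: each step's cost is known (or estimated) in advance.
    """
    spent = 0
    done = 0
    for cost in step_costs_cents:
        if spent + cost > cap_cents:
            break  # the next step would exceed the budget
        spent += cost
        done += 1
    return done, spent


# Eight pages at 20 cents each with a $1.00 cap: only five pages run.
result = run_with_cap([20] * 8, cap_cents=100)
print(result)
```

A step limit, which the reply below mentions, is the same loop with every cost set to 1.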
| tmountain wrote:
| Just piggybacking here, but this is a great suggestion. It
| makes the cost a one-time expense, and you get something
| material (source code) in return.
| suchintan wrote:
| I love all of these ideas!!
|
| 1. You can set a "max steps" limit when you run it locally
| https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...
|
| We also spit out the cost for each step within the visualizer.
| Click on any task > Steps > there's a column that's dedicated
| to how much things cost to run
|
| https://github.com/Skyvern-AI/skyvern/issues/70
|
| 2. We have a roadmap item to "cache" or "memorize" specific
| tasks, so you pay the cost once, and then just run it over and
| over again. We're going to get to it soon!!
| suchintan wrote:
| https://github.com/Skyvern-AI/Skyvern/?tab=readme-ov-file#fe...
| keremyilmaz wrote:
| You've raised valid points about the cost and efficiency of our
| approach, which aims to make the LLM function as closely as
| possible to a human user. We chose this approach primarily for
| its compatibility with various websites, as it aligns closely
| with a website's intended audience, which is typically human.
|
| Addressing complex website interactions is a key advantage of
| this approach. For instance, in the process of generating an
| auto insurance quote, the sequence of questions and their
| specifics can vary greatly depending on prior responses. A
| simple example is the choice of a foreign versus a California
| driver's license. Selecting a foreign license triggers
| additional queries about the country of issuance and expiry
| date, illustrating the complexity and branching nature of such
| web interactions.
|
| However, we recognize the concerns about cost and are actively
| working on strategies to reduce it:
|
| - Optimizing the context provided to the LLM
| - Implementing caching mechanisms for certain repeated actions
| and only using LLMs when there's a problem
| - Anticipating advancements in LLM efficiency and
| cost-effectiveness, with the hope of eventually finetuning our own
| models for greater efficiency
| dtnewman wrote:
| I like this approach. Just as an example, if I'm getting a
| car insurance quote, I'd rather pay $1 to have the tool fill
| out the forms for me and be 90% that it filled them out
| correctly rather than pay $0.01 and only be 70% sure it did
| it correctly. And there are plenty of use cases like that.
| amne wrote:
| isn't that crazy rabbit thingy supposed to do just that? I
| hope you pre-ordered. I hear they're in great demand.
| dinobones wrote:
| There are two things here:
|
| 1) Using the LLM to find elements/selectors in HTML
|
| 2) Use LLMs to fill out logical/likely/meaningful answers to
| things
|
| I highly recommend you decouple these 2 efforts. While you
| gave a good example of "insurance quote step by step webapp",
| the vast majority of web scraping efforts are much more
| mundane.
|
| Additionally, even in this instance, the selector
| brain/intelligence brain don't need to be coupled.
|
| For example:
|
| Selector brain: "Find/click the button for foreign drivers
| license."
| Selector brain: "Find the country of origin field."
| Selector brain: "Find the expiry date field."
|
| LLM-intelligence brain: "Use values from prompt to fill out
| the country of origin and expiry date fields."
|
| Not-LLM intelligence brain: Inputs values from a JSON object
| of documentSelector=>value.
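
A minimal sketch of the decoupling proposed above: one function plays the "selector brain" and another plays the non-LLM "intelligence brain" that fills values from a JSON-like object. The `locate` callable is a stub standing in for whatever finds elements (LLM or otherwise); all names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def find_selectors(
    fields: Iterable[str], locate: Callable[[str], str]
) -> Dict[str, str]:
    """Selector brain: map each field description to a page selector."""
    return {field: locate(field) for field in fields}


def plan_inputs(selectors: Dict[str, str], values: Dict[str, str]) -> Dict[str, str]:
    """Non-LLM intelligence brain: build the selector => value mapping for
    every field that has both a selector and a supplied value."""
    return {
        selectors[field]: value
        for field, value in values.items()
        if field in selectors
    }


# Stub locator; a real one would inspect the page.
known = {"country of origin": "#country", "expiry date": "#expiry"}
selectors = find_selectors(known.keys(), lambda field: known[field])

plan = plan_inputs(
    selectors,
    {"country of origin": "Canada", "expiry date": "2027-05-01"},
)
print(plan)
```

Because the two halves only share the `selectors` dict, the expensive step can be cached or swapped out without touching the part that decides what to enter.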
| jumploops wrote:
| Scrapers are one of the main use cases we're seeing for Magic
| Loops[0].
|
| ...and you've hit the nail on the head in terms of our design
| philosophy: use LLMs to generate useful logic, then run that
| logic without needing to call an LLM/Agent.
|
| With that said, we don't support browser automation. Skyvern is
| very neat, it reminds me of VimGPT[1], but with a more robust
| planning implementation.
|
| [0] https://magicloops.dev
|
| [1] https://github.com/ishan0102/vimGPT
| suchintan wrote:
| Nice! Thanks for sharing this.
|
| We tried approaches like VimGPT before but found the rate of
| hallucinations to be a bit too high to be used in production.
| The sweet spot definitely seems to be to combine the magic of
| DOM parsing AND vision.
|
| We're definitely going to work on logic generation and
| execution, but we're taking it a bit more carefully. Many of
| the workflows we automate have changing workflow steps (i.e.,
| I've never seen the exact same Geico flow twice), but this
| certainly isn't true for all workflows.
| umaar wrote:
| Really like the simplicity of your website. I think when you
| first announced it, you mentioned you might open source Magic
| Loops, might you do that?
| enlyth wrote:
| It's getting genuinely difficult these days with everything
| walled behind Cloudflare, various anti-bot protections and
| increasingly creative CAPTCHAs
| spxneo wrote:
| How does it compare to this posted less than 24 hours ago?
|
| https://news.ycombinator.com/item?id=39698546
| suchintan wrote:
| Saw the launch yesterday. Love all of the excitement in the
| space!
|
| LaVague is all about generating Selenium code to interact with
| a specific page, doing so step by step.
|
| Skyvern is all about taking a simple instruction and converting
| it to a series of LLM-driven actions. It's meant to be more
| autonomous ("tell Skyvern what to do")
| spxneo wrote:
| Isn't that the same thing when you interact with the
| underlying webpage?
| suchintan wrote:
| We're quite different from LaVague. LaVague passes in the
| entire HTML DOM to the LLM to help it generate XPaths and
| valid Selenium code.
| (https://github.com/lavague-ai/LaVague/blob/main/src/lavague/...)
|
| Try this at your own risk: any reasonably sized website would
| result in extraordinarily high input token costs.
|
| We spend quite a bit of our time building a layer between
| the HTML and the LLM call that distills the important pieces
| of information down to actions the LLM can take, which better
| balances cost against output quality. We're still not at
| 100% coverage.
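|
| To illustrate the general idea of a distillation layer (this
| is not Skyvern's actual implementation), here is a sketch
| using Python's standard html.parser. The element and
| attribute lists, and the 4-characters-per-token estimate, are
| assumptions chosen for the example.

```python
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = ("id", "name", "type", "placeholder", "href")


class InteractiveElements(HTMLParser):
    """Collects interactive tags and a few identifying attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            kept = {k: v for k, v in attrs if k in KEEP_ATTRS and v}
            self.elements.append({"tag": tag, **kept})


def distill(html):
    """Return only the elements a model could act on."""
    parser = InteractiveElements()
    parser.feed(html)
    parser.close()
    return parser.elements


def rough_tokens(text):
    # Crude heuristic (about 4 characters per token), not a real tokenizer.
    return len(text) // 4


SAMPLE = """
<html><head><style>.x{color:red}</style><script>var a = 1;</script></head>
<body><div class="wrapper"><div class="row"><p>Get a quote</p>
<input id="zip" name="zip" type="text" placeholder="ZIP code">
<button id="next" type="submit">Next</button></div></div></body></html>
"""

elements = distill(SAMPLE)
print(rough_tokens(SAMPLE), "->", rough_tokens(str(elements)))
```

| Styles, scripts, and layout wrappers are dropped before
| anything reaches the model. On real pages, where markup
| dwarfs the handful of actionable elements, that is where the
| input-token savings would come from.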
| ilaksh wrote:
| Looks terrific. I hope you will consider adding support for
| Claude 3.
| suchintan wrote:
| We DEFINITELY will. I think we're planning on pushing that next
| week -- we've been super excited about it
|
| Just created this:
| https://github.com/Skyvern-AI/skyvern/issues/72
| 999900000999 wrote:
| Don't make me sign up for a demo, I'd rather just give you my
| credit card number and try it myself.
|
| Aside from that, cool project!
| suchintan wrote:
| We're gonna build a self-serve UI soon! We just wanted to get
| it into people's hands ASAP :)
|
| Feel free to email me at suchintan@skyvern.com -- I can let you
| know when the self-serve UI is live
| BasieP2 wrote:
| Is this (finally) a step towards a better way of automated
| frontend testing?
|
| We're currently testing against the DOM instead of using vision.
| suchintan wrote:
| This can definitely be used for front end testing. Just tell it
| to do something the way a user would, and monitor whether it
| succeeds.
|
| Here's a prompt example to try out:
|
| {
|   "url": "https://news.ycombinator.com",
|   "navigation_goal": "goal is met if you see a post from basiep2. Terminate if you don't"
| }
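|
| For test suites with many such checks, the prompt can be
| built programmatically. In this sketch only the two field
| names ("url", "navigation_goal") come from the example above;
| build_task is a hypothetical helper, and since the thread
| doesn't show how a task is submitted, the code stops at
| producing the JSON.

```python
import json


def build_task(url: str, success_condition: str, failure_condition: str = "") -> str:
    """Serialize a task using the two field names from the example prompt."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be absolute")
    goal = success_condition.strip()
    if failure_condition:
        goal += " " + failure_condition.strip()
    return json.dumps({"url": url, "navigation_goal": goal}, indent=2)


payload = build_task(
    "https://news.ycombinator.com",
    "goal is met if you see a post from basiep2.",
    "Terminate if you don't",
)
print(payload)
```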
| James_K wrote:
| God this is depressing. Not the product itself, but the need for
| it. That software has failed to be programmable to such a degree
| that a promising approach is rendering the GUI and analysing the
| resultant image with an AI model. It's insane that we have to
| treat computers as fax machines, capable only of sending hand-
| written forms over a network. The gap between how people use
| computers and the utility they could provide is massive.
| kevmo314 wrote:
| On the contrary! Isn't it neat that we now have a unified API
| that both humans and computers can consume?
| suchintan wrote:
| Actually this kind of stuff is super exciting -- we don't need
| to depend on companies exposing APIs for their website -- we
| can just use something like Skyvern instead!
___________________________________________________________________
(page generated 2024-03-14 23:00 UTC)