[HN Gopher] Using GPT-4 Vision with Vimium to browse the web
___________________________________________________________________
Using GPT-4 Vision with Vimium to browse the web
Author : wvoch235
Score : 371 points
Date : 2023-11-09 02:28 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bnchrch wrote:
| Personally. This is what Im really excited about chatgpt for.
| Data has just become alot more free to access.
| burcs wrote:
| This is amazing, I feel like these vision models are going to
| make everything so much more accessible. Between the Be My Eyes
| app integration and now this, I'm really excited for how this
| transforms the web.
| ctoth wrote:
| I agree, and I think we're a year or two away from a full end-
| to-end trained screen reader. The ground truth from existing
| systems would provide great training material.
|
| As a technical blind person, my only concern is the inherent
| loss of privacy while sharing stuff with the big models.
| supriyo-biswas wrote:
| There are open source models such as
| https://github.com/THUDM/CogVLM and
| https://github.com/haotian-liu/LLaVA.
| ishan0102 wrote:
| Hey! Creator here, thanks for sharing! Let me know if anyone has
| questions and feel free to contribute, I've left some potential
| next steps in the README.
| jgalentine007 wrote:
| Very cool use for Vimium, I like the approach!
| ishan0102 wrote:
| Thank you!
| celeste_lan wrote:
| Omg I also just released something pretty similar earlier today
| https://github.com/Jiayi-Pan/GPT-V-on-Web. But it received
| little attention.
| ishan0102 wrote:
| Woah looks great, not surprised that multiple people thought
| of this! Your prompt looks much better than mine, I'm not
| really taking advantage of any of the default Vimium
| shortcuts.
| squeegmeister wrote:
| How does this differ from how ChatGPT currently browses the
| web?
| jimmySixDOF wrote:
| Nice. I know Open Interpreter are trying to get Selenium
| automated to natural language control and quite a few other
| projects are also popping up on HN lately. The vimium approach
| is a lot lighter so looks promising. One way or another the as-
| published world wide web is turning into its own dynamic API
| overlay server. Ingest all the Sources!
| roland35 wrote:
| what terminal are you using???
| ishan0102 wrote:
| Warp! (warp.dev)
| poulpy123 wrote:
| could it be used to make a bot that visit and parse websites to
| extrat relevant information without writing a parser for each
| websites ?
| transistorfan wrote:
| At my work there are a large contingent of people who essentially
| do manual data copying between legacy programs (govt), because
| the tech debt is so large that we can't figure out a way to plug
| these things together. Excited for tools like this to eventually
| act as a layer that can run over these sort of problems, as
| bizarre a solution as it is from a compute perspective
| morkalork wrote:
| Kinda sci-fi, we're so close to a future where when/if original
| source code is lost, a mainframe runs in an emulator and the
| human operating it is also emulated.
| haswell wrote:
| The industry buzzword is "Robotic Process Automation", which as
| a category of products has been focused on using various forms
| of ML/AI to glue these things together in a common/structured
| way (in addition to good old fashioned screen scraping).
|
| Up this this point, these products have been quite brittle. The
| recent explosion of AI tech seems like quite a boon for this
| space.
| leovander wrote:
| In the OP's specific instance when would you reach out for a
| traditional ETL tool vs an RPA solution?
| transistorfan wrote:
| How much does the involvement of a bank of fax machines
| complicate things?
| Roark66 wrote:
| A little perhaps, but not much. One can replace a bank of
| physical fax machines with modems.
|
| It would be an interesting job for sure. Why wasn't it
| done before? I can imagine only two reasons. One, there
| isn't that much data to move and it makes no sense to
| build software for what few people spend 30min per day
| on. Two, the data in the legacy system is images and
| people are not just moving it between systems, but they
| also do categorisation, verification etc. In which case
| an AI model may be useful, but almost always hard coded
| rules will be faster.
| teaearlgraycold wrote:
| RPA is for data sources and destinations that are meant for
| human consumption and entry. So you'd use RPA to take an
| image of a table and enter every row into a web form.
| keepamovin wrote:
| I totally agree on all points, especially around what AI
| means for this.
|
| I'm kind of in a happy accident situation because I was
| working on something for RPA, which then became a layer that
| was factored as its own product, but now might be able to
| come full circle as a result of AI.
|
| Essentially this layer can function as a "delivery medium"
| for RPA agent creation, that you can use on any device
| without download. However, as it has many others uses I've
| been working on those, but I've been seeking a great reason
| to get back into RPA.
|
| I have a cool idea to leverage human-guided AI creation of
| data maps and action tours for RPA, but similar to what you
| say, unless great care is taken you can end up with a brittle
| approach. Also, as the market has been quite saturated many
| reasonable approaches, I just haven't felt compelled.
|
| Yet now I think the possible merging of GPT level AIs with
| browser instrumentation to deliver an augmented way to browse
| the web makes that incredibly compelling.
|
| So I'm incredibly thrilled that I have this happy accident of
| BrowserBox^0 (the factored out layer originally from RPA work
| above) which provides a pluggable/iframe-emebeddable
| interface for remotely controlling a headless browser. So now
| I want to look at unifying BrowserBox with this kind of GPT
| driven exploration.
|
| It's even cooler, because, as BB enables co-browsing by
| default (multiplayer browsing) and turns the browser into a
| "client-server" architecture, I can see plugging in GPT-4V as
| a connecting client with some kind of minimal API affordance
| for it to use would, like the very cool vimium keyboard-
| enabled browsing in the OP, would be such interesting project
| to try!
|
| We're open source so if you want to check us out or get
| involved in this quest, come say hi, maybe get involved if
| you're game!
|
| 0: https://github.com/BrowserBox/BrowserBox
| jimmySixDOF wrote:
| I have watched your project for a while as a possible
| option for embedded browsers for XR applications like WebXR
| but the high licensing cost was a factor and solutions like
| Hyperbeam or Vueplex in Unity have been possible. Defiantly
| agree that multimodal LLM integration is a huge opportunity
| and multiplayer browsing with AI in realtime is a super
| cool idea if you package it right.
| keepamovin wrote:
| Hi jimmySixDOF thank you for the kind words and the
| attention on our project! :)
|
| Regarding pricing we have heard that feedback over time
| and gradually adjusted our licensing costs. It should now
| be much more affordable as it is targeted towards large
| deployments, with decreasing cost and increasing value at
| scale.
|
| If you'd like to send an email with any thoughts on our
| current prices on https://dosyago.com to cris@dosyago.com
| I'd highly value it!
|
| Your idea of WebXR and embedding within Unity is very
| interesting, and I think it could be a fit.
| aikinai wrote:
| I remember years ago thinking it was weird in Ghost in the
| Shell when a robot had fingers on its fingers to type really
| fast. Maybe that really won't happen since they can plug into
| USB at least, but they will probably use the screen and
| keyboard input sometimes at least.
| nomel wrote:
| Why would a keyboard be required? I think the intent to hit a
| letter would more easily be sent over a bluetooth HID
| "device". ;)
| yjftsjthsd-h wrote:
| USB is an attack vector; if it's not exploiting your USB
| driver it's connecting your data pins to mains power.
| Keyboards are an air gap.
| simbolit wrote:
| Isn't the keyboard connected to the computer via USB?
|
| If I have access to the keyboard, I have access to a USB
| cable plugged into the computer, right?
|
| Perhaps I misunderstand something....
| yjftsjthsd-h wrote:
| I meant the reverse; the computer attacking the robot
| using it
| simbolit wrote:
| Uhhhhh, thanks. That makes a lot of sense!
| pixl97 wrote:
| The issue with USB is you have to have power protection
| circuits. Analog interface at least in the show appeared much
| harder to hack.
| hubraumhugo wrote:
| I believe that LLMs will automate most of our data
| entry/copy/transformation work. 80% of the world's data is
| unstructured and scattered across formats like HTML, PDFs, or
| images that are hard to access and analyze. Multimodal models
| can now tap into that data without having to rely on complex
| OCR technologies or expensive tooling.
|
| If you go to platforms like Upwork, there are thousands of VAs
| in low-cost labor countries that do nothing else than manual
| data entry work. IMO that's a complete waste of human capital
| and I've made it my personal mission to automate such tedious
| and un-creative data work with https://kadoa.com.
| ishan0102 wrote:
| Yup, that's my long term goal. I want an "anything API" that
| brings structure to anything on the web.
| kristopolous wrote:
| I was thinking what the payoff would be to pose as human for
| these terrible pay click jobs and then assign them to an LLM
| en masse. There's an arbitrage there ... it may be a good
| strategy.
|
| I heard recently "click-work" works out to about $4/hr* If
| you could do that x50, passively, it's a fine income.
|
| * - see https://journals.sagepub.com/doi/full/10.1177/1461444
| 8231183... or listen to https://kpfa.org/episode/against-the-
| grain-october-30-2023/ ... it's a fascinating study. Terrible
| pay (way below minimum wage) but surprisingly high worker
| satisfaction. The users seem to view it as entertainment
| essentially categorizing it as casual gaming.
|
| The "asshole innovator" in me wonders if one could simply
| make it more entertaining and forego paying the user
| entirely.
| hubraumhugo wrote:
| Interesting. Instead of doing the click work manually,
| microworkers will just instruct and guide multiple GPTs.
| kristopolous wrote:
| _maybe_. A lot of modern clickwork is actually model
| training and there is a model-collapse phenomena
| (https://arxiv.org/abs/2305.17493) which means that it
| _should be_ banned for such work. I bet a number of
| clever people on the platforms are already trying to
| instrument AI to do the work regardless - it 's pretty
| close to "free money" if you can pull it off and not get
| caught and at a spigot size where there's no real serious
| consequences if you do.
| ishan0102 wrote:
| Yeah this seems easy to build but would rather work on
| making tools that improve accessibility 10x
| Roark66 wrote:
| Whenever I hear about such a thing (people doing legacy system
| data extraction manually) I wonder if in every case someone got
| the estimate for the "proper" solution and just decided a bunch
| of people typing is cheaper?
|
| Integrating things like Chatgpt will still require people who
| know what they are doing to look at it, and I wouldn't be
| surprised if the first advice they give is "don't use chatgpt
| for it".
| spaceman_2020 wrote:
| If the market forces work as they're supposed to (not a given
| anymore), then corporations that adopt better tech will see
| better profits through lower expenses. And then the laggards
| will have to adapt or die.
|
| Also remember that this is essentially v1 of the software-
| the Windows 95 of this adoption cycle
| Garlef wrote:
| "Chinese Room Automation"
| bboygravity wrote:
| Funny that you and others on here don't seem to realize that
| literally everybody who uses the internet has the exact same
| data entry problem all the time. Blame it on "old software",
| but how about the entire internet?
|
| copying (or in most cases even worse: re-typing) form data from
| one location on the screen into yet another webform.
|
| Username, password, email address, physical address, credit
| card info etc etc.
|
| Some extensions try to help with data entry, but none of them
| work properly and consistently enough to really help. Even
| consistently filling just username and pw is too much to ask.
|
| It's my number 1 frustration when using the internet (worse
| than ads) and I find it mind-blowing that this hasn't been
| solved yet with or without LLMs.
|
| I would pay a montly fee for any software that solves this once
| and for all and it sounds like it's coming (and I'm already
| paying their monthly fee).
| pseudosaid wrote:
| use a password manager. i havent copy pasted form data twice
| on a site in a long time
| fragmede wrote:
| consistently filling out username and password is all I
| wanted from my password manager, but it turns out it handles
| credit card number and other bits of information for me as
| well.
| arkitaip wrote:
| I've used Bitwarden to faster fill out job applications.
| TeMPOraL wrote:
| > _It 's my number 1 frustration when using the internet
| (worse than ads) and I find it mind-blowing that this hasn't
| been solved yet with or without LLMs._
|
| Simple: it's because not solving this problem is how our
| godawful industry makes most of its money. Empowering the
| user means relinquishing control over their "journey"[0].
| Ergonomics means fewer opportunities to upsell or show ads.
|
| I don't have the link handy, but I'm reminded of one of the
| earliest Windows user interface guidelines documents, back
| from Windows 95/98 era, which, in a section about
| theming/visual style, already recognized that they _have to_
| allow for full flexibility, because vendors will insist on
| fucking the experience up for the sake of branding anyway,
| and resisting it is futile[1].
|
| --
|
| [0] - I'm trying really hard to hold back my contempt towards
| terms like this, and the whole salesy way of viewing human-
| computer interactions.
|
| [1] - They put it in much more polite terms, but the feeling
| of helplessness was already there.
| itronitron wrote:
| >> because vendors will insist on fucking the experience up
| for the sake of branding anyway
|
| I see that you too have at some point installed printer
| driver software.
| musha68k wrote:
| Ted Nelson's "intertwingularity" isn't far off from the
| data entry problem described. He argues for universal data
| access where duplication is obsolete. Imagine form data as
| a single, linkable object across the web, editable in one
| place, reflected everywhere--no re-typing, just seamless
| auto-fill. That's the unrealized potential of hypertext.
| anonzzzies wrote:
| Yeah, my dream would be using this to scrape pages, pop the
| content into my provide db, serving it up in my own format
| (which is going to be a white page with letters with inline
| images and videos that are not ads. And my interactions fed
| back to the vision model to post in the original. So I never
| have to see a 'design' (heavy js riddled unreadable crap)
| again in my life. And so I can, with my own tooling, browse
| and reuse my history including content instead rely on all
| the broken stuff bolted on the web.
| loud_cloud wrote:
| FTL. See NiagraFiles.
| williamcotton wrote:
| Bash pipes? The free flow of information through composable
| tools.
|
| The commercial web? Not the above.
|
| This is just a baseline. I'm sure that an LLM can help the
| issue but the biggest problem is that these varied HTTP-with-
| datastores are islands passing messages in bottles back and
| forth while a bash pipeline is akin to fiber optics.
| FooBarWidget wrote:
| It's bizarre computationally, but at this point maybe we have
| to compare it to the alternative: hiring a person. At least the
| AI only consumes electricity (which is hopefully green), while
| a person consumes food (grown with mined fertilizers), or meat
| (which we know is really bad for the environment).
| alexirobbins wrote:
| Working on this layer at https://autotab.com. This sounds like
| an amazing problem for browser automation to solve, would love
| to talk with you if you're interested!
| monkeydust wrote:
| This has been fruitful ground for RPA offerings like UIPath and
| Automation Anywhere. Multi-model LLMs open up chance to disrupt
| them
| specialist wrote:
| > _a large contingent of people who essentially do manual data
| copying_
|
| Yup.
|
| I was briefly part of a decades long effort to migrate off a
| main frame backend. It was basically a very expensive shared
| flat file database (eg FileMaker Pro). Used by thousands of
| applications, neither inventoried or managed. Surely a handful
| were critical for daily operations, but no one remembered which
| ones.
|
| And the source data (quality) was filthy.
|
| I suggested we pay some students to manually copy just the bits
| of data our spiffy "modern" apps needed.
|
| No one was amused.
|
| --
|
| I also suggested we find a suitable COBOL runtime and just
| forklift the mainframe's "critical" infra into a virtual
| machine.
|
| No one was amused.
|
| Lastly, I suggested we throttle access to every unidentified
| mainframe client. Progressively making it slower over time.
| Surely we'd hear about anything critical breaking.
|
| That suggestion flew like a lead zeppelin.
| yreg wrote:
| A long, long time ago I worked on a small project for a major
| multinational grocery chain.
|
| I made them a tool that parses an Excel file with a specific
| structure and calls some endpoints in their internal system to
| submit the data.
|
| I was curious, so I asked how they are doing it currently. They
| led me to a computer at the back of their office. The wallpaper
| had two rectangles, one of them said MS EXCEL and the other
| said INTERNET EXPLORER. Then the person opened these apps,
| carefully positioned both windows exactly into those rectangles
| and ran some auto-clicker - the kind cheaters would use in
| RuneScape - which moved the cursor and copied and pasted the
| values from the Excel into the various forms on the website.
|
| Amazing.
| Valgrim wrote:
| I worked with a client who used a multi-millon dollar system
| for moving goods automatically into packaging stations. The
| system was built and maintained by a major european company.
| All the data was transfered automatically between systems
| normally, but one day, for some reason, there was an internal
| communication error inside the machine which caused a lot of
| packages to be sent without being recorded as such.
|
| Now normally we would just have contacted the company and
| asked them for a data extraction so we could cross-reference
| the data. But since it wasn't clear who was at fault, and we
| knew it would take weeks for that extraction, we looked for
| an internal solution first.
|
| Now there was a subsystem in the machine that worked only in
| Internet Explorer, with an old authentication scheme, that we
| could use to see the information we needed, so I, being the
| only person in the team without formal analysis training but
| having made my way there from a clerk job, knew exactly what
| to do.
|
| I fired up the old IE, Excel, wrote in 5 minutes a VBA script
| that did exactly what you described, click there copy that
| etc, and 30 minutes later we had our extraction, and resolved
| the issue completely before the packages were even shipped.
|
| All hail Excel.
| mst wrote:
| For all its flaws as a programming language, VBA made an
| excellent bodging language and I salute your expedient
| field hack.
| kspacewalk2 wrote:
| I wonder if it used something like AutoIt[0]. I remember
| using it at one of my more boring co-op jobs about 20 years
| ago to automate moving data between a spreadsheet and some
| obscure database product.
|
| [0] https://en.wikipedia.org/wiki/AutoIt
| gumballindie wrote:
| Wow. Leaking confidential tax payer data.
| jackconsidine wrote:
| Looks extremely cool. Trying to run it though, I get stuck at
| "Getting actions for the given objective..." (using the example
| on the repo)
| ishan0102 wrote:
| Huh weird, I'm getting that too. OpenAI has been having
| periodic outages today, think that might be why since it was
| working fine earlier.
| jechamt wrote:
| https://www.bleepingcomputer.com/news/security/openai-
| confir... News reports and their
| https://status.openai.com/incidents/21vl32gvx3hb incident
| reports indicate they are mitigating / fighting off attacks
| recently
| comment_ran wrote:
| It's so cool. I was wondering if we can make crawler tool much
| easier and better. It's more similar to the "human" way to
| interact with a website.
| imranq wrote:
| Is the vision model directly reading the screen and therefore
| also reading the Vimeo tags? It might be more effective to export
| the DOM tags and the associated elements as a Json object that is
| fed into chatGPT without using the vision component
| dymk wrote:
| > Currently the Vision API doesn't support JSON mode or
| function calling, so we have to rely on more primitive
| prompting methods.
| maccam912 wrote:
| I found that it works well to ask it to generate JSON the
| best it can, then pass it to gpt-3.5-turbo with the JSON
| response mode and instruct it to just clean up whatever input
| it received.
| ishan0102 wrote:
| Perfect, I have this as a todo in my readme and I'll
| implement this soon
| thekid314 wrote:
| I'm curious to see what it does when it sees a captcha.
| ishan0102 wrote:
| From OpenAI docs[1]: "For safety reasons, we have implemented a
| system to block the submission of CAPTCHAs."
|
| [1] https://platform.openai.com/docs/guides/vision
| xur17 wrote:
| Yeah, I've been feeding screenshots from selenium to the
| vision API, and when I trigger bot detection on a website,
| chatgpt refuses to process the image.
| NorwegianDude wrote:
| It does solve, or at least try to solve, captchas for me.
| It gets like half the characters correct, it's very bad at
| it.
| snake_doc wrote:
| Ah, very similar to Adept's[1] concept? Though, their product
| seems not yet ready.
|
| [1] https://www.adept.ai/
| ishan0102 wrote:
| Yep, took inspiration from them and a couple other startups
| QkPrsMizkYvt wrote:
| What other startups did you use for inspiration?
| karmasimida wrote:
| This is precisely the demo I am thinking.
| jatins wrote:
| It's also a little insane to me that what Adept has been
| supposedly building for years with 300+ mil in funding can now
| be built in a day with Open AI APIs?
|
| I think Adept pivoted along the way but original concept was
| very similar to this.
| sunshadow wrote:
| But its too expensive to become practical with the OpenAI
| API. Also, demo is cool until you see the real-world
| webpages, then you'll realize that this only works less than
| %50 of webpages.
| og_kalu wrote:
| GPT-4V may be surprisingly robust here. Set of mark
| prompting(which is accomplished here with Vim) improves
| grounding by a silly high amount. https://som-
| gpt4v.github.io/
| amks wrote:
| https://www.adept.ai/blog/experiments :)
| snthpy wrote:
| Looks cool. Unfortunately I expected this to enhance my Vimium
| experience but it looks like this is using Vimium to enhance
| GPT4, right?
| maccam912 wrote:
| I've been playing with a similar idea of screenshots and actions
| from GPT-4 Vision for browsing, but after trying and failing to
| overlay info in the screenshot, I ended up just getting the
| accessibility tree from playwright and sending that along as text
| so the model would know what options it had for interaction. In
| my case it seemed to work better, I see the creator is here and
| has a list of future ideas, maybe add this to the list if you
| think its a good idea?
| ishan0102 wrote:
| Cool that's a solid idea, I was trying to only use visual data
| but this could make the agent a lot more powerful, I'll try
| this really soon
| manmal wrote:
| Probably better to capture all the content and not just what
| fits on one screen. Most pages should fit as text (or HTML?) in
| the new extended token window.
| arbuge wrote:
| Better watch token costs. The per token costs are lower now
| but even so a full context load still costs almost $4.
| karmasimida wrote:
| We can create an autopilot for browser.
|
| It is going to incredibly difficult moving forward to distinguish
| bot traffic, if this is deployed at scale.
|
| The problem I see is this isn't going to be cheap or even
| affordable in short term.
| ishan0102 wrote:
| I think costs can come down if you finetune open source models
| like llava or cogvlm. This demo also cost about 6 cents so it's
| not insanely expensive either, especially with clever
| prompting.
| owenpalmer wrote:
| This will be fantastic for accessibility
| reqo wrote:
| How will tools like this affect web tracking or generally
| advertisements on the internet? Imagine you could have an agent
| browse the web for you and fetch exactly what you are seraching
| for without you seeing any ads/pop ups or being tracked along the
| way! Could be a great "ad blocker"! Could it perhaps also make
| SEO useless and thus improve the quality of internet? But I
| wonder if it also could have negative effects such as the ads
| being "interweaved" into the fetch content somehow!
| og_kalu wrote:
| Since this is sending screenshots of pages to GPT, won't it see
| the ads as well?
| braindead_in wrote:
| Why not build a new browser with GPT baked in?
| reustle wrote:
| Curious, how would that differ? Assuming it is just grabbing
| the rendered HTML DOM after each action, isn't it nearly the
| same?
| lachlan_gray wrote:
| I think vim is unintentionally a great "embodiment" for chatgpt.
| There's nothing that can't be done with a stream of text, and the
| internet is full of vimscript already
|
| I started a similar experiment if anyone else is thinking along
| the same lines :)
|
| https://github.com/LachlanGray/vim-agent
| gsuuon wrote:
| This is a neat idea!
| gvv wrote:
| Nice job! The horrors GPT-4 must endure to watch ads, truly
| inhumane
| FooBarWidget wrote:
| Many Dutch companies pay salaries by
|
| 1. receiving payslips from the accountant, and then
|
| 2. manually initiating bank transfers to each employee for the
| amount in the corresponding payslip, and then
|
| 3. manually initiating a bank transfer to the tax authority to
| pay the withholded salary taxes.
|
| This is completely useless manual labor. There should be no
| reason for this to be a manual procedure. And yet it's almost
| impossible to automate this. The accountant portal either has no
| API, or it has an API but lets you download the data as PDF,
| and/or the API costs good money. The bank either has no API, or
| it requires you to sign up for a developer account as if you're
| going to publish a public app, when you're just looking to
| automate some internal procedures.
|
| So the easiest way to pay salaries and taxes is still to hire a
| person to do it manually. Hopefully one day that won't be
| necessary anymore. I wouldn't trust an AI to actually initiate
| the bank transfers, but maybe they can just prepare the
| transactions and then a person has to approve the submission.
| martinald wrote:
| I don't think this really has much to do with AI. In the UK
| there are solutions like Pento now which do all this, including
| automating payments via open banking to the user and the tax
| authority and automatically filing tax filings:
|
| https://www.pento.io/la/payroll-software
| is_true wrote:
| In my country it's similar but for some data you have to upload
| to the government agency's site, I think it was earlier this
| year that they released a statement saying that people using
| software to perform actions on the website could get banned.
| nvm0n2 wrote:
| That's just a bank problem. Certainly this isn't how payroll
| works for large companies. Banks usually let you upload XML
| files that define a set of SWIFT payments, this is how I do
| payroll even for a small company. The accountants supply the
| XML file too, presumably they have an app that generates it.
| ranulo wrote:
| This could enable human language test automation scripts and
| could either improve my life as a QA engineer a lot or completely
| destroy it. Not sure yet.
| sunshadow wrote:
| You're good until this is cheaper than your salary.
| mackross wrote:
| Been playing with this through the ChatGPT interface for the past
| few weeks. Couple of tips. Update the css to get rid of the
| gradients and rounded corners. I found red with bold white text
| to be most consistent. Increase the font size. If two labels
| overlap, push them apart and add an arrow to the element. Send
| both images to the API, a version with the annotations added and
| a version without.
| bilekas wrote:
| This is actually pretty interesting.. I am thinking maybe it
| would be faster than writing up selenium tests themselves if we
| could just give a few instructions.
|
| I'm still going through the source, but really nice idea and
| great example of enriching the GPT with tools like vimium.
| startages wrote:
| There is just so much you can do with GPT-4 vision, I just hope
| it's more affordable.
| jonathanlb wrote:
| Hmm interesting. I'm curious what this means for accessibility
| and screen readers.
| e12e wrote:
| It's insane that this is now possible:
|
| https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...
|
| > "You need to choose which action to take to help a user do this
| task: {objective}. Your options are navigate, type, click, and
| done. Navigate should take you to the specified URL. Type and
| click take strings where if you want to click on an object,
| return the string with the yellow character sequence you want to
| click on, and to type just a string with the message you want to
| type. For clicks, please only respond with the 1-2 letter
| sequence in the yellow box, and if there are multiple valid
| options choose the one you think a user would select. For typing,
| please return a click to click on the box along with a type with
| the message to write. When the page seems satisfactory, return
| done as a key with no value. You must respond in JSON only with
| no other fluff or bad things will happen. The JSON keys must ONLY
| be one of navigate, type, or click. Do not return the JSON inside
| a code block."
| Maxion wrote:
| The speed at which this is moving at is mind boggling. This may
| become crazier than the dot.com boom.
| pms wrote:
| Until you realize that it doesn't work well with less popular
| videos (any items really), because "Large Language Models
| Struggle to Learn Long-Tail Knowledge" [1].
|
| [1] https://proceedings.mlr.press/v202/kandpal23a.html
| DalasNoin wrote:
| I tried to use it, but unfortunately it often did not add the
| little annotations for the different options to the screen and it
| got stuck in a loop. This bot works by adding a two letter
| combination to each clickable option, but sometimes they don't
| show up. It managed to sign in to twitter ones, but really
| quickly I burned through the 100 images api limit.
|
| Maybe for a future version it only uses vision for difficult
| situations in which it gets stuck and otherwise uses the text
| based browser?
| nostrowski wrote:
| This will be in a future history book under a chapter titled "the
| beginning of the end"
| dangerwill wrote:
| How is this making your browsing experience any better? You still
| have to know what you want to do, and it is just faster to type
| Rick roll into youtube directly and click the links directly
| instead of having to type k, or vh, or whatever. You are just
| adding a useless chatgpt middleman between you and the browser
| that you likely spend all day in anyway and should be adept at
| navigating
| ternaus wrote:
| Love the idea.
|
| It also shows that GPT-4V created a new angle in web scraping.
|
| I guess, this or similar code would be leveraged in many projects
| like:
|
| 1. Scrape XXX websites, say LinkedIn or Twitter use all types of
| methods in the DOM to prevent it, but fighting working well
| GPT-4V + OCR would be ultra hard.
|
| 2. Give me an analysis of what these XXX companies are doing. And
| this could be done for competitors, to understand the landscape
| of some industry, or even plainly to get news.
|
| Large-scale scrapping, not depending on the source code of the
| pages is a powerful infrastructural change.
| doctorM wrote:
| i think this is actively dangerous. well not yet. but getting
| there.
|
| i know - ai isn't meant to be sentient. but if it looks like a
| duck and quacks like a duck...
|
| how do i know that the comments here aren't done by dedicated
| hacker news ai bots?
|
| the potential danger could come from lack of supervision down the
| road.
|
| i didn't get much sleep last night so this is less coherent than
| it could be.
___________________________________________________________________
(page generated 2023-11-09 23:01 UTC)