[HN Gopher] Using GPT-4 Vision with Vimium to browse the web
       ___________________________________________________________________
        
       Using GPT-4 Vision with Vimium to browse the web
        
       Author : wvoch235
       Score  : 371 points
       Date   : 2023-11-09 02:28 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bnchrch wrote:
       | Personally, this is what I'm really excited about ChatGPT for.
       | Data has just become a lot more free to access.
        
       | burcs wrote:
       | This is amazing, I feel like these vision models are going to
       | make everything so much more accessible. Between the Be My Eyes
       | app integration and now this, I'm really excited for how this
       | transforms the web.
        
         | ctoth wrote:
         | I agree, and I think we're a year or two away from a full end-
         | to-end trained screen reader. The ground truth from existing
         | systems would provide great training material.
         | 
         | As a technical blind person, my only concern is the inherent
         | loss of privacy while sharing stuff with the big models.
        
           | supriyo-biswas wrote:
           | There are open source models such as
           | https://github.com/THUDM/CogVLM and
           | https://github.com/haotian-liu/LLaVA.
        
       | ishan0102 wrote:
       | Hey! Creator here, thanks for sharing! Let me know if anyone has
       | questions and feel free to contribute, I've left some potential
       | next steps in the README.
        
         | jgalentine007 wrote:
         | Very cool use for Vimium, I like the approach!
        
           | ishan0102 wrote:
           | Thank you!
        
         | celeste_lan wrote:
         | Omg I also just released something pretty similar earlier today
         | https://github.com/Jiayi-Pan/GPT-V-on-Web. But it received
         | little attention.
        
           | ishan0102 wrote:
           | Woah looks great, not surprised that multiple people thought
           | of this! Your prompt looks much better than mine, I'm not
           | really taking advantage of any of the default Vimium
           | shortcuts.
        
         | squeegmeister wrote:
         | How does this differ from how ChatGPT currently browses the
         | web?
        
         | jimmySixDOF wrote:
         | Nice. I know Open Interpreter are trying to get Selenium
         | automated to natural language control and quite a few other
         | projects are also popping up on HN lately. The vimium approach
         | is a lot lighter so looks promising. One way or another the as-
         | published world wide web is turning into its own dynamic API
         | overlay server. Ingest all the Sources!
        
         | roland35 wrote:
         | what terminal are you using???
        
           | ishan0102 wrote:
           | Warp! (warp.dev)
        
         | poulpy123 wrote:
         | Could it be used to make a bot that visits and parses websites
         | to extract relevant information without writing a parser for
         | each website?
        
       | transistorfan wrote:
       | At my work there is a large contingent of people who essentially
       | do manual data copying between legacy programs (govt), because
       | the tech debt is so large that we can't figure out a way to plug
       | these things together. Excited for tools like this to eventually
       | act as a layer that can run over these sorts of problems, as
       | bizarre a solution as it is from a compute perspective.
        
         | morkalork wrote:
         | Kinda sci-fi, we're so close to a future where when/if original
         | source code is lost, a mainframe runs in an emulator and the
         | human operating it is also emulated.
        
         | haswell wrote:
         | The industry buzzword is "Robotic Process Automation", which as
         | a category of products has been focused on using various forms
         | of ML/AI to glue these things together in a common/structured
         | way (in addition to good old fashioned screen scraping).
         | 
         | Up to this point, these products have been quite brittle. The
         | recent explosion of AI tech seems like quite a boon for this
         | space.
        
           | leovander wrote:
           | In the OP's specific instance when would you reach out for a
           | traditional ETL tool vs an RPA solution?
        
             | transistorfan wrote:
             | How much does the involvement of a bank of fax machines
             | complicate things?
        
               | Roark66 wrote:
               | A little perhaps, but not much. One can replace a bank of
               | physical fax machines with modems.
               | 
               | It would be an interesting job for sure. Why wasn't it
               | done before? I can imagine only two reasons. One, there
               | isn't that much data to move and it makes no sense to
               | build software for what few people spend 30min per day
               | on. Two, the data in the legacy system is images and
               | people are not just moving it between systems, but they
               | also do categorisation, verification etc. In which case
               | an AI model may be useful, but almost always hard coded
               | rules will be faster.
        
             | teaearlgraycold wrote:
             | RPA is for data sources and destinations that are meant for
             | human consumption and entry. So you'd use RPA to take an
             | image of a table and enter every row into a web form.
        
           | keepamovin wrote:
           | I totally agree on all points, especially around what AI
           | means for this.
           | 
           | I'm kind of in a happy accident situation because I was
           | working on something for RPA, which then became a layer that
           | was factored as its own product, but now might be able to
           | come full circle as a result of AI.
           | 
           | Essentially this layer can function as a "delivery medium"
           | for RPA agent creation, that you can use on any device
           | without download. However, as it has many other uses I've
           | been working on those, but I've been seeking a great reason
           | to get back into RPA.
           | 
           | I have a cool idea to leverage human-guided AI creation of
           | data maps and action tours for RPA, but similar to what you
           | say, unless great care is taken you can end up with a brittle
           | approach. Also, as the market has been quite saturated with
           | many reasonable approaches, I just haven't felt compelled.
           | 
           | Yet now I think the possible merging of GPT level AIs with
           | browser instrumentation to deliver an augmented way to browse
           | the web makes that incredibly compelling.
           | 
           | So I'm incredibly thrilled that I have this happy accident of
           | BrowserBox^0 (the factored out layer originally from RPA work
           | above) which provides a pluggable/iframe-embeddable
           | interface for remotely controlling a headless browser. So now
           | I want to look at unifying BrowserBox with this kind of GPT
           | driven exploration.
           | 
           | It's even cooler because BB enables co-browsing by default
           | (multiplayer browsing) and turns the browser into a
           | "client-server" architecture. I can see that plugging in
           | GPT-4V as a connecting client, with some kind of minimal API
           | affordance for it to use (like the very cool Vimium
           | keyboard-enabled browsing in the OP), would be such an
           | interesting project to try!
           | 
           | We're open source so if you want to check us out or get
           | involved in this quest, come say hi, maybe get involved if
           | you're game!
           | 
           | 0: https://github.com/BrowserBox/BrowserBox
        
             | jimmySixDOF wrote:
             | I have watched your project for a while as a possible
             | option for embedded browsers for XR applications like WebXR
             | but the high licensing cost was a factor and solutions like
             | Hyperbeam or Vueplex in Unity have been possible. Definitely
             | agree that multimodal LLM integration is a huge opportunity
             | and multiplayer browsing with AI in realtime is a super
             | cool idea if you package it right.
        
               | keepamovin wrote:
               | Hi jimmySixDOF thank you for the kind words and the
               | attention on our project! :)
               | 
               | Regarding pricing we have heard that feedback over time
               | and gradually adjusted our licensing costs. It should now
               | be much more affordable as it is targeted towards large
               | deployments, with decreasing cost and increasing value at
               | scale.
               | 
               | If you'd like to send an email with any thoughts on our
               | current prices on https://dosyago.com to cris@dosyago.com
               | I'd highly value it!
               | 
               | Your idea of WebXR and embedding within Unity is very
               | interesting, and I think it could be a fit.
        
         | aikinai wrote:
         | I remember years ago thinking it was weird in Ghost in the
         | Shell when a robot had fingers on its fingers to type really
         | fast. Maybe that really won't happen since they can plug into
         | USB at least, but they will probably use the screen and
         | keyboard input sometimes at least.
        
           | nomel wrote:
           | Why would a keyboard be required? I think the intent to hit a
           | letter would more easily be sent over a bluetooth HID
           | "device". ;)
        
           | yjftsjthsd-h wrote:
           | USB is an attack vector; if it's not exploiting your USB
           | driver it's connecting your data pins to mains power.
           | Keyboards are an air gap.
        
             | simbolit wrote:
             | Isn't the keyboard connected to the computer via USB?
             | 
             | If I have access to the keyboard, I have access to a USB
             | cable plugged into the computer, right?
             | 
             | Perhaps I misunderstand something....
        
               | yjftsjthsd-h wrote:
               | I meant the reverse; the computer attacking the robot
               | using it
        
               | simbolit wrote:
               | Uhhhhh, thanks. That makes a lot of sense!
        
           | pixl97 wrote:
           | The issue with USB is you have to have power protection
           | circuits. Analog interface at least in the show appeared much
           | harder to hack.
        
         | hubraumhugo wrote:
         | I believe that LLMs will automate most of our data
         | entry/copy/transformation work. 80% of the world's data is
         | unstructured and scattered across formats like HTML, PDFs, or
         | images that are hard to access and analyze. Multimodal models
         | can now tap into that data without having to rely on complex
         | OCR technologies or expensive tooling.
         | 
         | If you go to platforms like Upwork, there are thousands of VAs
         | in low-cost labor countries that do nothing else than manual
         | data entry work. IMO that's a complete waste of human capital
         | and I've made it my personal mission to automate such tedious
         | and un-creative data work with https://kadoa.com.
        
           | ishan0102 wrote:
           | Yup, that's my long term goal. I want an "anything API" that
           | brings structure to anything on the web.
        
           | kristopolous wrote:
           | I was thinking what the payoff would be to pose as human for
           | these terrible pay click jobs and then assign them to an LLM
           | en masse. There's an arbitrage there ... it may be a good
           | strategy.
           | 
           | I heard recently "click-work" works out to about $4/hr.* If
           | you could do that x50, passively, it's a fine income.
           | 
           | * - see https://journals.sagepub.com/doi/full/10.1177/1461444
           | 8231183... or listen to https://kpfa.org/episode/against-the-
           | grain-october-30-2023/ ... it's a fascinating study. Terrible
           | pay (way below minimum wage) but surprisingly high worker
           | satisfaction. The users seem to view it as entertainment
           | essentially categorizing it as casual gaming.
           | 
           | The "asshole innovator" in me wonders if one could simply
           | make it more entertaining and forego paying the user
           | entirely.
        
             | hubraumhugo wrote:
             | Interesting. Instead of doing the click work manually,
             | microworkers will just instruct and guide multiple GPTs.
        
               | kristopolous wrote:
               | _maybe_. A lot of modern clickwork is actually model
               | training and there is a model-collapse phenomenon
               | (https://arxiv.org/abs/2305.17493) which means that it
               | _should be_ banned for such work. I bet a number of
               | clever people on the platforms are already trying to
               | instrument AI to do the work regardless - it's pretty
               | close to "free money" if you can pull it off and not get
               | caught and at a spigot size where there's no real serious
               | consequences if you do.
        
             | ishan0102 wrote:
             | Yeah this seems easy to build but would rather work on
             | making tools that improve accessibility 10x
        
         | Roark66 wrote:
         | Whenever I hear about such a thing (people doing legacy system
         | data extraction manually) I wonder if in every case someone got
         | the estimate for the "proper" solution and just decided a bunch
         | of people typing is cheaper?
         | 
         | Integrating things like Chatgpt will still require people who
         | know what they are doing to look at it, and I wouldn't be
         | surprised if the first advice they give is "don't use chatgpt
         | for it".
        
           | spaceman_2020 wrote:
           | If the market forces work as they're supposed to (not a given
           | anymore), then corporations that adopt better tech will see
           | better profits through lower expenses. And then the laggards
           | will have to adapt or die.
           | 
           | Also remember that this is essentially v1 of the software,
           | the Windows 95 of this adoption cycle.
        
         | Garlef wrote:
         | "Chinese Room Automation"
        
         | bboygravity wrote:
         | Funny that you and others on here don't seem to realize that
         | literally everybody who uses the internet has the exact same
         | data entry problem all the time. Blame it on "old software",
         | but how about the entire internet?
         | 
         | copying (or in most cases even worse: re-typing) form data from
         | one location on the screen into yet another webform.
         | 
         | Username, password, email address, physical address, credit
         | card info etc etc.
         | 
         | Some extensions try to help with data entry, but none of them
         | work properly and consistently enough to really help. Even
         | consistently filling just username and pw is too much to ask.
         | 
         | It's my number 1 frustration when using the internet (worse
         | than ads) and I find it mind-blowing that this hasn't been
         | solved yet with or without LLMs.
         | 
         | I would pay a monthly fee for any software that solves this once
         | and for all and it sounds like it's coming (and I'm already
         | paying their monthly fee).
        
           | pseudosaid wrote:
           | Use a password manager. I haven't copy-pasted form data twice
           | on a site in a long time.
        
           | fragmede wrote:
           | consistently filling out username and password is all I
           | wanted from my password manager, but it turns out it handles
           | credit card number and other bits of information for me as
           | well.
        
             | arkitaip wrote:
             | I've used Bitwarden to fill out job applications faster.
        
           | TeMPOraL wrote:
           | > _It's my number 1 frustration when using the internet
           | (worse than ads) and I find it mind-blowing that this hasn't
           | been solved yet with or without LLMs._
           | 
           | Simple: it's because not solving this problem is how our
           | godawful industry makes most of its money. Empowering the
           | user means relinquishing control over their "journey"[0].
           | Ergonomics means fewer opportunities to upsell or show ads.
           | 
           | I don't have the link handy, but I'm reminded of one of the
           | earliest Windows user interface guidelines documents, back
           | from Windows 95/98 era, which, in a section about
           | theming/visual style, already recognized that they _have to_
           | allow for full flexibility, because vendors will insist on
           | fucking the experience up for the sake of branding anyway,
           | and resisting it is futile[1].
           | 
           | --
           | 
           | [0] - I'm trying really hard to hold back my contempt towards
           | terms like this, and the whole salesy way of viewing human-
           | computer interactions.
           | 
           | [1] - They put it in much more polite terms, but the feeling
           | of helplessness was already there.
        
             | itronitron wrote:
             | >> because vendors will insist on fucking the experience up
             | for the sake of branding anyway
             | 
             | I see that you too have at some point installed printer
             | driver software.
        
             | musha68k wrote:
             | Ted Nelson's "intertwingularity" isn't far off from the
             | data entry problem described. He argues for universal data
             | access where duplication is obsolete. Imagine form data as
             | a single, linkable object across the web, editable in one
             | place, reflected everywhere--no re-typing, just seamless
             | auto-fill. That's the unrealized potential of hypertext.
        
           | anonzzzies wrote:
           | Yeah, my dream would be using this to scrape pages, pop the
           | content into my private db, and serve it up in my own format
           | (which is going to be a white page with letters, with inline
           | images and videos that are not ads), with my interactions fed
           | back to the vision model to post in the original. So I never
           | have to see a 'design' (heavy, JS-riddled, unreadable crap)
           | again in my life. And so I can, with my own tooling, browse
           | and reuse my history including content, instead of relying on
           | all the broken stuff bolted onto the web.
        
           | loud_cloud wrote:
           | FTL. See NiagraFiles.
        
           | williamcotton wrote:
           | Bash pipes? The free flow of information through composable
           | tools.
           | 
           | The commercial web? Not the above.
           | 
           | This is just a baseline. I'm sure that an LLM can help the
           | issue but the biggest problem is that these varied HTTP-with-
           | datastores are islands passing messages in bottles back and
           | forth while a bash pipeline is akin to fiber optics.
        
         | FooBarWidget wrote:
         | It's bizarre computationally, but at this point maybe we have
         | to compare it to the alternative: hiring a person. At least the
         | AI only consumes electricity (which is hopefully green), while
         | a person consumes food (grown with mined fertilizers), or meat
         | (which we know is really bad for the environment).
        
         | alexirobbins wrote:
         | Working on this layer at https://autotab.com. This sounds like
         | an amazing problem for browser automation to solve, would love
         | to talk with you if you're interested!
        
         | monkeydust wrote:
         | This has been fruitful ground for RPA offerings like UIPath and
         | Automation Anywhere. Multimodal LLMs open up a chance to disrupt
         | them.
        
         | specialist wrote:
         | > _a large contingent of people who essentially do manual data
         | copying_
         | 
         | Yup.
         | 
         | I was briefly part of a decades-long effort to migrate off a
         | mainframe backend. It was basically a very expensive shared
         | flat file database (eg FileMaker Pro). Used by thousands of
         | applications, neither inventoried nor managed. Surely a handful
         | were critical for daily operations, but no one remembered which
         | ones.
         | 
         | And the source data (quality) was filthy.
         | 
         | I suggested we pay some students to manually copy just the bits
         | of data our spiffy "modern" apps needed.
         | 
         | No one was amused.
         | 
         | --
         | 
         | I also suggested we find a suitable COBOL runtime and just
         | forklift the mainframe's "critical" infra into a virtual
         | machine.
         | 
         | No one was amused.
         | 
         | Lastly, I suggested we throttle access to every unidentified
         | mainframe client. Progressively making it slower over time.
         | Surely we'd hear about anything critical breaking.
         | 
         | That suggestion flew like a lead zeppelin.
        
         | yreg wrote:
         | A long, long time ago I worked on a small project for a major
         | multinational grocery chain.
         | 
         | I made them a tool that parses an Excel file with a specific
         | structure and calls some endpoints in their internal system to
         | submit the data.
         | 
         | I was curious, so I asked how they are doing it currently. They
         | led me to a computer at the back of their office. The wallpaper
         | had two rectangles, one of them said MS EXCEL and the other
         | said INTERNET EXPLORER. Then the person opened these apps,
         | carefully positioned both windows exactly into those rectangles
         | and ran some auto-clicker - the kind cheaters would use in
         | RuneScape - which moved the cursor and copied and pasted the
         | values from the Excel into the various forms on the website.
         | 
         | Amazing.
        
           | Valgrim wrote:
           | I worked with a client who used a multi-million dollar system
           | for moving goods automatically into packaging stations. The
           | system was built and maintained by a major European company.
           | All the data was transferred automatically between systems
           | normally, but one day, for some reason, there was an internal
           | communication error inside the machine which caused a lot of
           | packages to be sent without being recorded as such.
           | 
           | Now normally we would just have contacted the company and
           | asked them for a data extraction so we could cross-reference
           | the data. But since it wasn't clear who was at fault, and we
           | knew it would take weeks for that extraction, we looked for
           | an internal solution first.
           | 
           | Now there was a subsystem in the machine that worked only in
           | Internet Explorer, with an old authentication scheme, that we
           | could use to see the information we needed, so I, being the
           | only person in the team without formal analysis training but
           | having made my way there from a clerk job, knew exactly what
           | to do.
           | 
           | I fired up the old IE, Excel, wrote in 5 minutes a VBA script
           | that did exactly what you described, click there copy that
           | etc, and 30 minutes later we had our extraction, and resolved
           | the issue completely before the packages were even shipped.
           | 
           | All hail Excel.
        
             | mst wrote:
             | For all its flaws as a programming language, VBA made an
             | excellent bodging language and I salute your expedient
             | field hack.
        
           | kspacewalk2 wrote:
           | I wonder if it used something like AutoIt[0]. I remember
           | using it at one of my more boring co-op jobs about 20 years
           | ago to automate moving data between a spreadsheet and some
           | obscure database product.
           | 
           | [0] https://en.wikipedia.org/wiki/AutoIt
        
         | gumballindie wrote:
         | Wow. Leaking confidential taxpayer data.
        
       | jackconsidine wrote:
       | Looks extremely cool. Trying to run it though, I get stuck at
       | "Getting actions for the given objective..." (using the example
       | on the repo)
        
         | ishan0102 wrote:
         | Huh weird, I'm getting that too. OpenAI has been having
         | periodic outages today, think that might be why since it was
         | working fine earlier.
        
           | jechamt wrote:
           | https://www.bleepingcomputer.com/news/security/openai-
           | confir... News reports and their
           | https://status.openai.com/incidents/21vl32gvx3hb incident
           | reports indicate they are mitigating / fighting off attacks
           | recently
        
       | comment_ran wrote:
       | It's so cool. I was wondering if we can make crawler tools much
       | easier and better. It's more similar to the "human" way to
       | interact with a website.
        
       | imranq wrote:
       | Is the vision model directly reading the screen and therefore
       | also reading the Vimium tags? It might be more effective to
       | export the DOM tags and the associated elements as a JSON object
       | that is fed into ChatGPT without using the vision component.
        
         | dymk wrote:
         | > Currently the Vision API doesn't support JSON mode or
         | function calling, so we have to rely on more primitive
         | prompting methods.
        
           | maccam912 wrote:
           | I found that it works well to ask it to generate JSON the
           | best it can, then pass it to gpt-3.5-turbo with the JSON
           | response mode and instruct it to just clean up whatever input
           | it received.
        
             | ishan0102 wrote:
             | Perfect, I have this as a todo in my readme and I'll
             | implement this soon
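
The two-stage cleanup described above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: `extract_json`, `parse_action`, and the `cleanup` callable are hypothetical names. `cleanup` stands in for the second model call (for example gpt-3.5-turbo with a JSON response mode), and is injected as a plain function so the sketch runs offline. The action format `{"click": "AB"}` is also just an illustration.

```python
import json
from typing import Callable, Dict


def extract_json(raw: str) -> Dict:
    """Best-effort local parse: take the outermost {...} span of the reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError.
    return json.loads(raw[start:end + 1])


def parse_action(raw: str, cleanup: Callable[[str], str]) -> Dict:
    """Try a cheap local parse first; fall back to the cleanup model.

    `cleanup` is a hypothetical hook: it receives the raw vision-model
    text and is expected to return a strictly valid JSON string.
    """
    try:
        return extract_json(raw)
    except ValueError:
        return json.loads(cleanup(raw))


if __name__ == "__main__":
    reply = 'Sure! Here is the action:\n{"click": "AB"}\nLet me know.'
    print(parse_action(reply, cleanup=lambda text: "{}"))
```

Trying the local parse first means the second model call is only paid for when the vision model's output is actually malformed.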
        
       | thekid314 wrote:
       | I'm curious to see what it does when it sees a captcha.
        
         | ishan0102 wrote:
         | From OpenAI docs[1]: "For safety reasons, we have implemented a
         | system to block the submission of CAPTCHAs."
         | 
         | [1] https://platform.openai.com/docs/guides/vision
        
           | xur17 wrote:
           | Yeah, I've been feeding screenshots from selenium to the
           | vision API, and when I trigger bot detection on a website,
           | chatgpt refuses to process the image.
        
             | NorwegianDude wrote:
             | It does solve, or at least try to solve, captchas for me.
             | It gets like half the characters correct, it's very bad at
             | it.
        
       | snake_doc wrote:
       | Ah, very similar to Adept's[1] concept? Though, their product
       | seems not yet ready.
       | 
       | [1] https://www.adept.ai/
        
         | ishan0102 wrote:
         | Yep, took inspiration from them and a couple other startups
        
           | QkPrsMizkYvt wrote:
           | What other startups did you use for inspiration?
        
         | karmasimida wrote:
         | This is precisely the demo I was thinking of.
        
         | jatins wrote:
         | It's also a little insane to me that what Adept has been
         | supposedly building for years with 300+ mil in funding can now
         | be built in a day with Open AI APIs?
         | 
         | I think Adept pivoted along the way but original concept was
         | very similar to this.
        
           | sunshadow wrote:
           | But it's too expensive to become practical with the OpenAI
           | API. Also, the demo is cool until you see real-world
           | webpages, then you'll realize that this only works on less
           | than 50% of webpages.
        
             | og_kalu wrote:
             | GPT-4V may be surprisingly robust here. Set-of-mark
             | prompting (which is accomplished here with Vim) improves
             | grounding by a silly high amount.
             | https://som-gpt4v.github.io/
        
         | amks wrote:
         | https://www.adept.ai/blog/experiments :)
        
       | snthpy wrote:
       | Looks cool. Unfortunately I expected this to enhance my Vimium
       | experience but it looks like this is using Vimium to enhance
       | GPT4, right?
        
       | maccam912 wrote:
       | I've been playing with a similar idea of screenshots and actions
       | from GPT-4 Vision for browsing, but after trying and failing to
       | overlay info in the screenshot, I ended up just getting the
       | accessibility tree from playwright and sending that along as text
       | so the model would know what options it had for interaction. In
       | my case it seemed to work better. I see the creator is here and
       | has a list of future ideas, maybe add this to the list if you
       | think it's a good idea?
        
         | ishan0102 wrote:
         | Cool that's a solid idea, I was trying to only use visual data
         | but this could make the agent a lot more powerful, I'll try
         | this really soon
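
Sending the accessibility tree along as text could look something like the sketch below. It assumes the tree is a nested dict with `role`, `name`, and `children` keys, a shape modeled on what Playwright's accessibility snapshot returns. The function name `flatten_tree` and the sample data are hypothetical, and no browser is launched here.

```python
from typing import Any, Dict, List


def flatten_tree(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Turn a nested accessibility node into indented 'role: name' lines."""
    lines: List[str] = []
    role = node.get("role", "")
    name = node.get("name", "")
    if role:
        # rstrip() drops the trailing space when a node has no name.
        lines.append(f"{'  ' * depth}{role}: {name}".rstrip())
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical snapshot; a real one would come from the browser.
    snapshot = {
        "role": "WebArea",
        "name": "Example",
        "children": [
            {"role": "link", "name": "Sign in"},
            {"role": "textbox", "name": "Search"},
        ],
    }
    print("\n".join(flatten_tree(snapshot)))
```

The resulting text can be appended to the prompt next to (or instead of) the screenshot, so the model sees an explicit list of the elements it can interact with.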
        
         | manmal wrote:
         | Probably better to capture all the content and not just what
         | fits on one screen. Most pages should fit as text (or HTML?) in
         | the new extended token window.
        
           | arbuge wrote:
           | Better watch token costs. The per token costs are lower now
           | but even so a full context load still costs almost $4.
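
For a rough sense of where a figure near $4 could come from, here is a back-of-the-envelope sketch. Both inputs are assumptions rather than numbers stated in the comment: a 128,000-token context and a rate of $0.03 per 1,000 tokens. Real prices differ between input and output tokens and change over time.

```python
def context_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD of sending `tokens` tokens at a flat per-1K rate."""
    return tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    # Assumed values, for illustration only.
    cost = context_cost(tokens=128_000, usd_per_1k_tokens=0.03)
    print(f"${cost:.2f} per full-context request")
```

Because cost scales linearly with tokens, trimming a page to the visible or relevant portion before sending it reduces the bill proportionally.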
        
       | karmasimida wrote:
       | We can create an autopilot for the browser.
       | 
       | It is going to be incredibly difficult moving forward to
       | distinguish bot traffic, if this is deployed at scale.
       | 
       | The problem I see is this isn't going to be cheap or even
       | affordable in the short term.
        
         | ishan0102 wrote:
         | I think costs can come down if you finetune open source models
         | like llava or cogvlm. This demo also cost about 6 cents so it's
         | not insanely expensive either, especially with clever
         | prompting.
        
       | owenpalmer wrote:
       | This will be fantastic for accessibility
        
       | reqo wrote:
       | How will tools like this affect web tracking or generally
       | advertisements on the internet? Imagine you could have an agent
       | browse the web for you and fetch exactly what you are searching
       | for without you seeing any ads/pop ups or being tracked along the
       | way! Could be a great "ad blocker"! Could it perhaps also make
       | SEO useless and thus improve the quality of internet? But I
       | wonder if it also could have negative effects such as the ads
       | being "interwoven" into the fetched content somehow!
        
         | og_kalu wrote:
         | Since this is sending screenshots of pages to GPT, won't it see
         | the ads as well?
        
       | braindead_in wrote:
       | Why not build a new browser with GPT baked in?
        
         | reustle wrote:
         | Curious, how would that differ? Assuming it is just grabbing
         | the rendered HTML DOM after each action, isn't it nearly the
         | same?
        
       | lachlan_gray wrote:
       | I think vim is unintentionally a great "embodiment" for chatgpt.
       | There's nothing that can't be done with a stream of text, and the
       | internet is full of vimscript already
       | 
       | I started a similar experiment if anyone else is thinking along
       | the same lines :)
       | 
       | https://github.com/LachlanGray/vim-agent
        
         | gsuuon wrote:
         | This is a neat idea!
        
       | gvv wrote:
       | Nice job! The horrors GPT-4 must endure to watch ads, truly
       | inhumane
        
       | FooBarWidget wrote:
       | Many Dutch companies pay salaries by
       | 
       | 1. receiving payslips from the accountant, and then
       | 
       | 2. manually initiating bank transfers to each employee for the
       | amount in the corresponding payslip, and then
       | 
       | 3. manually initiating a bank transfer to the tax authority to
       | pay the withheld salary taxes.
       | 
       | This is completely useless manual labor. There should be no
       | reason for this to be a manual procedure. And yet it's almost
       | impossible to automate this. The accountant portal either has no
       | API, or it has an API but lets you download the data as PDF,
       | and/or the API costs good money. The bank either has no API, or
       | it requires you to sign up for a developer account as if you're
       | going to publish a public app, when you're just looking to
       | automate some internal procedures.
       | 
       | So the easiest way to pay salaries and taxes is still to hire a
       | person to do it manually. Hopefully one day that won't be
       | necessary anymore. I wouldn't trust an AI to actually initiate
       | the bank transfers, but maybe they can just prepare the
       | transactions and then a person has to approve the submission.
        
         | martinald wrote:
         | I don't think this really has much to do with AI. In the UK
         | there are solutions like Pento now which do all this, including
         | automating payments via open banking to the user and the tax
         | authority and automatically filing tax filings:
         | 
         | https://www.pento.io/la/payroll-software
        
         | is_true wrote:
         | In my country it's similar but for some data you have to upload
         | to the government agency's site, I think it was earlier this
         | year that they released a statement saying that people using
         | software to perform actions on the website could get banned.
        
         | nvm0n2 wrote:
         | That's just a bank problem. Certainly this isn't how payroll
         | works for large companies. Banks usually let you upload XML
         | files that define a set of SWIFT payments, this is how I do
         | payroll even for a small company. The accountants supply the
         | XML file too, presumably they have an app that generates it.
        
       | ranulo wrote:
       | This could enable human language test automation scripts and
       | could either improve my life as a QA engineer a lot or completely
       | destroy it. Not sure yet.
        
         | sunshadow wrote:
         | You're good until this is cheaper than your salary.
        
       | mackross wrote:
       | Been playing with this through the ChatGPT interface for the past
       | few weeks. Couple of tips. Update the css to get rid of the
       | gradients and rounded corners. I found red with bold white text
       | to be most consistent. Increase the font size. If two labels
       | overlap, push them apart and add an arrow to the element. Send
       | both images to the API, a version with the annotations added and
       | a version without.
        
       | bilekas wrote:
       | This is actually pretty interesting.. I am thinking maybe it
       | would be faster than writing up selenium tests themselves if we
       | could just give a few instructions.
       | 
       | I'm still going through the source, but really nice idea and
       | great example of enriching the GPT with tools like vimium.
        
       | startages wrote:
       | There is just so much you can do with GPT-4 vision, I just hope
       | it's more affordable.
        
       | jonathanlb wrote:
       | Hmm interesting. I'm curious what this means for accessibility
       | and screen readers.
        
       | e12e wrote:
       | It's insane that this is now possible:
       | 
       | https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...
       | 
       | > "You need to choose which action to take to help a user do this
       | task: {objective}. Your options are navigate, type, click, and
       | done. Navigate should take you to the specified URL. Type and
       | click take strings where if you want to click on an object,
       | return the string with the yellow character sequence you want to
       | click on, and to type just a string with the message you want to
       | type. For clicks, please only respond with the 1-2 letter
       | sequence in the yellow box, and if there are multiple valid
       | options choose the one you think a user would select. For typing,
       | please return a click to click on the box along with a type with
       | the message to write. When the page seems satisfactory, return
       | done as a key with no value. You must respond in JSON only with
       | no other fluff or bad things will happen. The JSON keys must ONLY
       | be one of navigate, type, or click. Do not return the JSON inside
       | a code block."
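
A minimal sketch of how a reply to the quoted prompt could be checked before acting on it. The `parse_action` helper and the exact validation rules are hypothetical, not the project's code; the only thing taken from the prompt is the set of action names (navigate, type, click, done).

```python
import json

# Action names mentioned in the quoted prompt.
ALLOWED_KEYS = {"navigate", "type", "click", "done"}


def parse_action(raw: str) -> dict:
    """Parse a model reply and reject anything outside the allowed keys."""
    action = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    return action
```

For example, `parse_action('{"click": "ab", "type": "hello"}')` would return both keys, matching the prompt's instruction to pair a click with a type when filling in a box. Validating up front means a malformed reply can be retried instead of being executed.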
        
         | Maxion wrote:
         | The speed at which this is moving is mind boggling. This may
         | become crazier than the dot.com boom.
        
           | pms wrote:
           | Until you realize that it doesn't work well with less popular
           | videos (any items really), because "Large Language Models
           | Struggle to Learn Long-Tail Knowledge" [1].
           | 
           | [1] https://proceedings.mlr.press/v202/kandpal23a.html
        
       | DalasNoin wrote:
       | I tried to use it, but unfortunately it often did not add the
       | little annotations for the different options to the screen and it
       | got stuck in a loop. This bot works by adding a two letter
       | combination to each clickable option, but sometimes they don't
       | show up. It managed to sign in to Twitter once, but really
       | quickly I burned through the 100 images api limit.
       | 
       | Maybe for a future version it only uses vision for difficult
       | situations in which it gets stuck and otherwise uses the text
       | based browser?
        
       | nostrowski wrote:
       | This will be in a future history book under a chapter titled "the
       | beginning of the end"
        
       | dangerwill wrote:
       | How is this making your browsing experience any better? You still
       | have to know what you want to do, and it is just faster to type
       | Rick roll into youtube directly and click the links directly
       | instead of having to type k, or vh, or whatever. You are just
       | adding a useless chatgpt middleman between you and the browser
       | that you likely spend all day in anyway and should be adept at
       | navigating
        
       | ternaus wrote:
       | Love the idea.
       | 
       | It also shows that GPT-4V created a new angle in web scraping.
       | 
       | I guess, this or similar code would be leveraged in many projects
       | like:
       | 
       | 1. Scrape XXX websites. Sites like LinkedIn or Twitter use all
       | kinds of methods in the DOM to prevent scraping, but fighting a
       | well-working GPT-4V + OCR setup would be ultra hard.
       | 
       | 2. Give me an analysis of what these XXX companies are doing. And
       | this could be done for competitors, to understand the landscape
       | of some industry, or even plainly to get news.
       | 
       | Large-scale scraping that does not depend on the source code of
       | the pages is a powerful infrastructural change.
        
       | doctorM wrote:
       | i think this is actively dangerous. well not yet. but getting
       | there.
       | 
       | i know - ai isn't meant to be sentient. but if it looks like a
       | duck and quacks like a duck...
       | 
       | how do i know that the comments here aren't done by dedicated
       | hacker news ai bots?
       | 
       | the potential danger could come from lack of supervision down the
       | road.
       | 
       | i didn't get much sleep last night so this is less coherent than
       | it could be.
        
       ___________________________________________________________________
       (page generated 2023-11-09 23:01 UTC)