[HN Gopher] Claude Computer Use - Is Vision the Ultimate API?
       ___________________________________________________________________
        
       Claude Computer Use - Is Vision the Ultimate API?
        
       Author : trq_
       Score  : 81 points
       Date   : 2024-10-24 18:15 UTC (4 hours ago)
        
 (HTM) web link (www.thariq.io)
 (TXT) w3m dump (www.thariq.io)
        
       | simonw wrote:
       | If you want to try out Computer Use (awful name) in a relatively
       | safe environment the Docker container Anthropic provide here is
        | very easy to start running (provided you have Docker set up; I
        | used it with Docker Desktop for Mac):
       | https://github.com/anthropics/anthropic-quickstarts/tree/mai...
        
         | trq_ wrote:
          | Yes, that's a good point! To be honest, I wanted to try it on
          | the machine I use every day, but it's definitely a bit risky.
          | Let me link that in the article.
        
         | danielbln wrote:
          | I for one appreciate the name Computer Use: no flashy marketing
          | name, it just describes what it is. An LLM using a computer.
        
           | croes wrote:
           | Hard to ask questions about Computer Use
        
           | swyx wrote:
           | also it contrasts nicely with Tool Use, which is about
           | calling apis rather than clicking on things
        
             | lostmsu wrote:
             | It is literally a specialized subset of Tool Use both in
             | concept and how it appears to be actually implemented.
        
       | sharpshadow wrote:
        | In this context Windows Recall makes total sense now from an AI
        | learning perspective for them.
       | 
        | It's actually a super cool development and I'm already very
        | excited to let my computer use any software like a pro in front
        | of me. Paint me a canvas of a savanna sunset with animal
        | silhouettes, produce me a track of UK garage house, etc.
        | Everything with all the layers and elements in the software, not
        | just a finished output.
        
         | croes wrote:
         | Lots of energy consumption just to create a remix of something
         | that already exists.
        
           | sharpshadow wrote:
           | Absolutely we need much much more energy and many many more
           | powerful chips. Energy is a resource and we need to harvest
           | more of it.
           | 
           | I don't understand why people make a point about energy
           | consumption as it would be something bad.
        
             | viraptor wrote:
             | Come on, the trolling is too obvious.
        
               | sharpshadow wrote:
                | Absolutely not, I'm serious, and that is exactly what is
                | going on; it would be trolling to pretend the opposite
                | or accept the status quo as final.
               | 
                | Obviously we need to find a way to not harm and destroy
                | our environment further, and we are on a good path
                | there. But technically we need much, much more energy.
        
               | croes wrote:
                | We are not on a good path; we are far from our goals,
                | and at the moment AI is much the same as Bitcoin.
               | 
               | We do things fast and expensive that could be done slow
               | but cheap.
               | 
               | The problem is we are running out of time.
               | 
               | If you want more energy you first build clean energy
               | sources then you can pump up consumption not the other
               | way around.
        
               | flemhans wrote:
               | Agreed, let's go!
        
               | sharpshadow wrote:
                | Those goals you are referring to are made up, as is the
                | illusion that we are running out of time.
               | 
                | It's usually the other way around. If we only did things
                | when the resources were already there, we wouldn't have
                | the progress we have.
        
             | wwweston wrote:
             | https://dothemath.ucsd.edu/2012/04/economist-meets-
             | physicist...
             | 
             | "the Earth has only one mechanism for releasing heat to
             | space, and that's via (infrared) radiation. We understand
             | the phenomenon perfectly well, and can predict the surface
             | temperature of the planet as a function of how much energy
             | the human race produces. The upshot is that at a 2.3%
             | growth rate, we would reach boiling temperature in about
             | 400 years. And this statement is independent of technology.
             | Even if we don't have a name for the energy source yet, as
             | long as it obeys thermodynamics, we cook ourselves with
             | perpetual energy increase."
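The quoted figure can be reproduced with a blackbody back-of-the-envelope calculation. The inputs here (roughly 18 TW of current human power, 2.3%/yr growth, Earth as a uniform radiator) are assumptions taken from the linked post, not from this thread:

```python
import math

# Treat Earth as a uniform blackbody radiator and ask when waste heat
# alone would push the mean surface temperature to 100 degrees C.
SIGMA = 5.67e-8          # Stefan-Boltzmann constant, W/m^2/K^4
R_EARTH = 6.371e6        # Earth radius, m
AREA = 4 * math.pi * R_EARTH**2

T_NOW, T_BOIL = 288.0, 373.0   # K: current mean surface temp, boiling point

# Extra radiated flux needed to lift the surface from 288 K to 373 K,
# and the total waste-heat power that would have to supply it:
extra_flux = SIGMA * (T_BOIL**4 - T_NOW**4)   # W/m^2
p_needed = extra_flux * AREA                   # W

P_NOW, GROWTH = 18e12, 0.023   # ~18 TW today, 2.3%/yr growth
years = math.log(p_needed / P_NOW) / math.log(1 + GROWTH)
print(round(years))   # ~435 years -- same ballpark as the quoted 400
```

The exact number depends on the crude assumptions, but the point of the original argument survives: exponential growth in energy use hits a thermodynamic wall within centuries regardless of the energy source.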
        
       | CharlieDigital wrote:
       | Vision is the ultimate API.
       | 
       | The historical progression from _text_ to _still images_ to
       | _audio_ to _moving images_ will hold true for AI as well.
       | 
       | Just look at OpenAI's progression as well from LLM to multi-modal
       | to the realtime API.
       | 
       | A co-worker almost 20 years ago said something interesting to me
       | as we were discussing Al Gore's CurrentTV project: the history of
       | information is constrained by "bandwidth". He mentioned how
       | broadcast television went from 72 hours of "bandwidth" (3
       | channels x 24h) per day to now having so much bandwidth that we
       | could have a channel with citizen journalists. Of course, this
       | was also the same time that YouTube was taking off.
       | 
       | The pattern holds true for AI.
       | 
       | AI is going to create "infinite bandwidth".
        
         | ToDougie wrote:
         | So long as the spectrum is open for infinity, yes.
         | 
         | Was listening to a Seth Godin interview where he pointed out
         | that there was a time when you had to purchase a slice of
         | spectrum to share your voice on the radio. Nowadays you can put
         | your thoughts on a platform, but that platform is owned by
         | corporations who can and will put their thumb on thoughtcrime
         | or challenges.
         | 
         | I really do love your comment. Cheers.
        
           | CharlieDigital wrote:
           | Thanks!
           | 
           | There's a related concept as well which is that as
           | "bandwidth" increases, the ratio of producers to consumers
            | pushes upwards towards 1. My take is that generative AI will
            | accelerate this.
           | 
           | I write a bit more in depth about it here:
           | https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-
           | and...
        
         | ricardo81 wrote:
         | You could call it bandwidth, or call it entropy. I'd lean
         | towards the more physical definition.
         | 
         | I think of how the USA had cable TV and hundreds of channels
         | projecting all kinds of whatever in the 80s while here in the
         | UK we were limited to our finite channels. To be fair those
         | finite channels gave people something to talk about the next
         | day, because millions of people saw the same thing. Surely a
         | lot of what mankind has done is to tame entropy, like steam
         | engines etc.
         | 
         | With AI and everyone having a prompt, it's surely a game
         | changer. How it works out, we'll see.
        
         | swatcoder wrote:
         | > The historical progression from text to still images to audio
         | to moving images will hold true for AI as well.
         | 
         | You'll have to explain what you mean by this. Direct speech,
         | text, illustrations, photos, abstract sounds, music,
         | recordings, videos, circuits, programs, cells... these are all
         | just different mediums with different characteristics. There is
         | no "progression" apparent among them. Why should there be? They
         | each fulfill different ends and have different occasions for
         | which they best suit.
         | 
         | We seem to have discovered a new family of tools that help
          | lossily transform content or intent from one of these mediums
         | to some others, which is sure to be useful in its own ways. But
         | it's not a medium like the above in the first place, and with
         | none of them representing a progression, it certainly doesn't
         | either.
        
           | CharlieDigital wrote:
           | > You'll have to explain what you mean by this
           | 
           | The progression of distribution. Printing press, photos,
           | radio, movies, television. The early web was text, then came
           | images, then audio (Napster age), and then video (remember
           | that Netflix used to ship DVDs?).
           | 
           | The flip side of that is production and the ratio of
           | producers to consumers. As the bandwidth for distribution
           | increases, there is a decrease in the cost and complexity for
           | producers and naturally, we see the same progression with
           | producers on each new platform and distribution technology:
           | text, still images, audio, moving images.
        
             | margalabargala wrote:
             | > The progression of distribution. Printing press, photos,
             | radio, movies, television.
             | 
             | Your history is incorrect, though. Still images predate
             | text, by a lot.
             | 
             | Cave paintings came before writing. Woodcuts came before
             | the printing press.
        
               | earthnail wrote:
               | Not in the information age. This cascade just corresponds
               | to how much data and processing power is needed for each.
               | 
               | It is entirely logical to say that AI development will
               | follow the same progression as the early internet or as
               | broadcast since they all fall under the same data
               | constraints.
        
               | margalabargala wrote:
               | In the information age we've seen inconsistencies as
               | well.
               | 
               | Ever since the release of Whisper and others, text-to-
               | speech and speech-to-text have been more or less solved,
               | while image generation seems to still sometimes have
               | trouble. Earlier this week was a thread about how no
               | image model could draw a crocodile without a tail.
               | 
               | Meanwhile, the first photographs predate the first sound
               | recordings. And moving images without sound, of course,
               | predate moving images with sound.
               | 
               | The original poster was trying to sound profound as
               | though there was some set sequence of things that always
               | happens through human development. But the reality is a
               | much more mundane "less complex things tend to be easier
               | than more complex things".
        
             | swatcoder wrote:
             | But it's not a progression. There's no transitioning. It's
             | just the introduction of new media alongside prior ones.
             | 
              | And as a sibling commenter noted, the individual histories
              | of these media are disjoint and not really in the sequence
              | you suggest.
             | 
              | Regardless, generative AI isn't a medium like any of these.
             | It's a means to transform media from one type to another,
             | at some expense and with the introduction of loss/noise.
             | There's something revolutionary about how easy it makes it
             | to perform those transitions, and how generally it can
             | perform them, but it's fundamentally more like a
             | screwdriver than a video.
        
           | rhdunn wrote:
           | I'd argue that multimodal analysis can improve uni/bimodal
           | models.
           | 
            | There is overlap between text-to-image and text-to-video --
            | image models would help video models animate interesting or
            | complex prompts; video would help image models learn to
            | differentiate features, as there are additional clues in how
            | the image changes and remains the same.
           | 
            | There's overlap between audio, text transcripts, and video
            | around learning to animate speech, e.g. by learning how
            | faces move with the corresponding audio/text.
           | 
            | There's overlap between sound and video -- e.g. being able
            | to associate sounds like a dog barking without direct
            | labelling of either.
        
         | croes wrote:
            | Vision, especially GUIs, is a pretty limited API.
        
           | abirch wrote:
           | It reminded me of these old Unix lessons of Master Foo
           | 
           | https://prirai.github.io/books/unix-koans/#master-foo-
           | discou...
        
           | CharlieDigital wrote:
           | I mean vision in the most general sense, not just a GUI.
           | 
           | Imagine OpenAI can not only read the inflection in your
           | voice, but also nuances in your facial expressions and how
           | you're using your hands to understand your state of mind.
           | 
           | And instead of merely responding as an audio stream, a real-
           | time avatar.
        
             | corobo wrote:
             | Sweet! I'll have my own Holly from Red Dwarf!
             | 
             | or maybe my own HAL9000..
             | 
             | A little bit ambivalent on this haha, looking forward to
             | seeing what comes of it either way though :)
        
         | skydhash wrote:
          | It is not. Text is very dense information-wise, recursive, and
          | you can formalize it. It's easily coupled with interaction
          | methods, and more apt for automation. You can easily see this
          | with software like AutoCAD, which has both. There's a reason
          | all protocols are text.
         | 
          | Vision and audio play a nice role, but that's because of
          | humans and reality. A _real world <-> vision|audio <->
          | processing_ pipeline makes sense. But a _processing <-> data
          | <-> vision|audio <-> data <-> processing_ cycle is just
          | nonsense and a waste of resources.
        
           | cooper_ganglia wrote:
            | Not a waste of resources, just an increase in use. This is
            | why we need more resources.
        
             | skydhash wrote:
              | It's a waste, because you could just cut out the middleman
              | and have a _data <-> processing_ cycle. When you increase
              | resource use, some other metric should increase by a
              | higher factor (car vs. carriage and horses, computers vs.
              | doing computation by hand), otherwise it's a waste.
        
           | ricardo81 wrote:
           | >Text is very dense information wise and recursive and you
           | can formalize it.
           | 
            | There have been a lot of attempts over the years, with
            | varying degrees of accuracy, but I don't know if you can go
            | as far as to "formalize" it. Beyond the syntax (tokenising,
            | syntactic chunking and beyond) there is the intent, and that
            | is super hard to measure. And possibly the problem with
            | these prompts is that they get things right a lot of the
            | time but wrong, say, 5% of the time. Purely because they
            | couldn't formalize it. My web hosting has 99.99% uptime,
            | which is a bit more reassuring than 95%.
        
         | layer8 wrote:
         | Text is still a predominant medium of communication and
         | information processing, and I don't see that changing
         | substantially. TFA was an article, not a video, and you
         | wouldn't want the HN comment section to be composed of videos
         | or images. Similarly, video calls haven't replaced texting.
        
           | CharlieDigital wrote:
            | It's not that it will be replaced, but there's a natural
            | progression in the types of media that are available on a
            | given platform of distribution.
           | 
           | RF: text (telegram), audio, still images (fax), moving images
           | 
           | Web had the same progression: text, still images (inverted
           | here), audio age (MP3s, Napster), video (Netflix, YouTube)
           | 
           | AI: text, images, audio (realtime API), ...?
           | 
           | Vision is the obvious next medium.
        
             | wwweston wrote:
             | Lately I've been realizing that as much as I value YouTube,
             | much of the content is distorted by various incentives
             | favoring length and frequency and a tech culture which
             | overfocuses on elaborating steps as an equivalent to an
             | explanation. This contributes to more duplicate content
             | (often even within the same channel) and less in terms of
             | refined conceptualization. I find myself often wishing I
             | had outlines, summaries, and hyperlinks to greater details.
             | 
             | Of course, I can use AI tools to get approximations of such
             | things, and it'll probably get better, which means we will
             | now be using this increased bandwidth or progression to
             | produce more video to be pushed out through the pipes and
             | distilled by an AI tool into shorter video or something
             | like hypertext.
             | 
             | Progress!
        
         | wwweston wrote:
         | The information coursing through the world around us already
         | exceeds our ability to grasp it by high orders of magnitude.
         | 
         | Three channels of television over 8 hours was already more than
         | anyone had time to take in.
         | 
          | AI _might_ be able to create summarizing layers and relays
          | that help manage that.
         | 
         | AI isn't going to create infinite bandwidth. It's as likely to
         | increase entropy and introduce noise.
        
       | viraptor wrote:
        | Some time ago I made a prediction that accessibility is the
        | ultimate API for UI agents, but unfortunately multimodal
        | capabilities went the other way. But we can still change
        | course:
       | 
       | This is a great place for people to start caring about
       | accessibility annotations. All serious UI toolkits allow you to
       | tell the computer what's on the screen. This allows things like
       | Windows Automation https://learn.microsoft.com/en-
       | us/windows/win32/winauto/entr... to see a tree of controls with
       | labels and descriptions without any vision/OCR. It can be
       | inspected by apps like FlauiInspect
       | https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main...
       | But see how the example shows a statusbar with (Text "UIA3" "")?
       | It could've been (Text "UIA3" "Current automation interface")
       | instead for both a good tooltip and an accessibility label.
       | 
       | Now we can kill two birds with one stone - actually improve the
       | accessibility of everything and make sure custom controls adhere
       | to the framework as well, and provide the same data to the coming
       | automation agents. The text description will be much cheaper than
       | a screenshot to process. Also it will help my work with manually
       | coded app automation, so that's a win-win-win.
       | 
       | As a side effect, it would also solve issues with UI weirdness.
       | Have you ever had windows open something on a screen which is not
       | connected anymore? Or under another window? Or minimised?
       | Screenshots won't give enough information here to progress.
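To make the "cheaper than a screenshot" point concrete, here is a toy illustration (not the real UI Automation API): a labelled control tree like the one FlaUInspect displays, flattened to a few hundred bytes of text that an agent could consume directly. All names below are made up for the example.

```python
from dataclasses import dataclass, field

# Toy model of an accessibility tree: role + accessible name + children.
@dataclass
class Control:
    role: str                      # e.g. "Window", "Tree", "Text"
    name: str                      # the accessibility label
    children: list = field(default_factory=list)

def dump(node: Control, depth: int = 0) -> str:
    """Serialize the tree in the (Role "name") style used above."""
    line = "  " * depth + f'({node.role} "{node.name}")'
    return "\n".join([line] + [dump(c, depth + 1) for c in node.children])

tree = Control("Window", "FlaUInspect", [
    Control("Tree", "Controls"),
    # The complaint above: this label is often empty. Filling it helps
    # screen readers and automation agents alike.
    Control("Text", "Current automation interface"),
])
print(dump(tree))
```

A dump like this also answers the "window on a disconnected screen" problem: the control exists and is labelled in the tree even when no pixels of it are visible.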
        
       | tomatohs wrote:
       | > It is very helpful to give it things like:
       | 
        | - A list of applications that are open
        | - Which application has active focus
        | - What is focused inside the application
        | - Function calls to specifically navigate those applications,
        |   as many as possible
       | 
       | We've found the same thing while building the client for
       | testdriver.ai. This info is in every request.
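A hypothetical sketch of the kind of context block described above, prepended to every model request. All field names here are invented for illustration; testdriver.ai's actual format is not shown in this thread.

```python
# Build a compact, text-only description of machine state for the model:
# open apps, focus, and the navigation functions it may call.
def build_context(open_apps, focused_app, focused_element, actions):
    lines = [
        "Open applications: " + ", ".join(open_apps),
        f"Active focus: {focused_app}",
        f"Focused element: {focused_element}",
        "Available navigation functions:",
    ]
    lines += [f"  - {a}" for a in actions]
    return "\n".join(lines)

ctx = build_context(
    open_apps=["Chrome", "Terminal"],
    focused_app="Chrome",
    focused_element="address bar",
    actions=["chrome.open_tab(url)", "chrome.focus_tab(index)"],
)
print(ctx)
```

Sending this alongside (or instead of) a screenshot grounds the model's next action in state it cannot reliably infer from pixels alone, such as which window actually has keyboard focus.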
        
       | pabe wrote:
        | I don't think vision is the ultimate API. It wasn't with
        | "traditional" RPA and it won't be with more advanced AI-RPA.
        | It's inefficient. If you want something to be used by a bot,
        | write an interface for a bot. I'd make an exception for
        | end-to-end testing.
        
         | Veen wrote:
         | You're looking at it from a developer's perspective. For non-
         | developers, vision opens up all sorts of new capabilities. And
         | they won't have to rely on the software creator's view of what
         | should be automated and what should not.
        
           | skydhash wrote:
           | Most non-developers won't bother. You have shortcut on iOS
           | and macOS which is like Scratch for automation and still only
           | power users use it. Others just download the shortcut they
           | want.
        
           | croes wrote:
            | If a GUI is confusing for humans, AI will have problems
            | too.
           | 
           | So you still rely on developers to make reasonable GUIs
        
       | cheevly wrote:
       | No, language is the ultimate API.
        
         | ukuina wrote:
         | On the instruction-provision end, sure.
        
       | throwup238 wrote:
        | Vision _plus_ accessibility metadata is the ultimate API. I see
        | little reason that poorly designed flat UIs will confuse LLMs
        | any less than humans, especially when they're missing from the
        | training data (like most internal apps) or the documentation on
        | the web is out of date. Even a basic dump of ARIA attributes or
        | the hierarchy from OS accessibility APIs can help a lot.
        
         | dbish wrote:
          | The problem is accessibility data and APIs are very bad across
          | the board.
        
       | unglaublich wrote:
       | Vision here means "2d pixel space".
       | 
       | The ultimate API is "all the raw data you can acquire from your
       | environment".
        
         | layer8 wrote:
         | For a typical GUI, the "mental model" actually needs to be
         | 2.5D, due to stacked windows, popups, menus, modals, and so on.
         | The article mentions that the model has difficulties with
         | those.
        
       | PreInternet01 wrote:
       | Counterpoint: no, it's just more hype.
       | 
       | Doing real-time OCR on 1280x1024 bitmaps has been possible for...
       | the last decade or so? Sure, you can now do it on 4K or 8K
       | bitmaps, but that's just an incremental improvement.
       | 
        | Fact is, full-screen OCR coupled with innovations like "Google"
        | has not led to "ultimate" productivity improvements, and as
       | impressive as OpenAI _et al_ may appear right now, the impact of
       | these technologies will end up roughly similar.
       | 
       | (Which is to say: the landscape will change, but not in a truly
       | fundamental way. What you're seeing demonstrated right now is,
       | roughly speaking, the next Clippy, which, believe it or not, was
       | hyped to a similar extent around the time it was introduced...)
        
         | acchow wrote:
          | OCR is to Computer Use as voice-to-text is to ChatGPT Voice.
        
         | simonw wrote:
         | The way these new LLM vision models work is very different from
         | OCR.
         | 
         | I saw a demo this morning of someone getting Claude to play
         | FreeCiv (admittedly extremely badly):
         | https://twitter.com/greggyb/status/1849198544445432229
         | 
         | Try doing that with Tesseract.
        
           | croes wrote:
           | I bet Tesseract plays pretty badly too.
        
         | KoolKat23 wrote:
         | Existing OCR is extremely limited and requires custom narrow
         | development.
        
       | throwaway19972 wrote:
       | I'd imagine you'd get higher quality leveraging accessibility
       | integrations.
        
       | echoangle wrote:
       | Am I the only one thinking this is an awful way for AI to do
       | useful stuff for you? Why would I train an AI to use a GUI?
       | Wouldn't it be better to just have the AI learn API docs and use
       | that? I don't want the AI to open my browser, open google maps
       | and search for Shawarma, I want the AI to call a google api and
       | give me the result.
        
         | voiper1 wrote:
          | Sure, it's more efficient to have it use an API. And people
         | have been integrating those for the last while.
         | 
          | But there are tons of applications locked behind a website or
          | desktop GUI with no API that are accessible via vision.
        
         | og_kalu wrote:
          | The vast majority of applications cannot be used by anything
         | other than a GUI.
         | 
         | We built computers to be used by humans and humans
         | overwhelmingly operate computers with GUIs. So if you want a
         | machine that can potentially operate computers as well as
         | humans then you're going to have to stick to GUIs.
         | 
         | It's the same reason we're trying to build general purpose
         | robots in a human form factor.
         | 
         | The fact that a car is about as wide as a two horse drawn
         | carriage is also no coincidence. You can't ignore existing
         | infrastructure.
        
           | echoangle wrote:
            | But I don't want an AI to "operate a computer"... maybe I'm
           | missing the point of this but I just can't imagine a usecase
           | where this is a good solution. For everything browser based,
           | the burden of making an API is probably relatively small and
           | if the page is simple enough, you could maybe even get away
           | with training the AI on the page HTML and generating a
           | response to send. And for everything that's not browser-
           | based, I would either want the AI embedded in the software
            | (image editors, IDEs...) or not there at all.
        
             | og_kalu wrote:
              | >But I don't want an AI to "operate a computer"...
             | 
             | You don't and that's fine but certainly many people are
             | interested in such a thing.
             | 
             | >maybe I'm missing the point of this but I just can't
             | imagine a usecase where this is a good solution.
             | 
              | If it could operate computers robustly and reliably, then
              | why wouldn't you? Plenty of what someone does on a
              | computer is a task they would like to automate away but
              | can't with current technology.
             | 
             | >For everything browser based, the burden of making an API
             | is probably relatively small
             | 
              | It's definitely not less effort than sticking to a GUI.
             | 
             | >and if the page is simple enough, you could maybe even get
             | away with training the AI on the page HTML and generating a
             | response to send.
             | 
             | Sure in special circumstances, it may be a good idea to use
             | something else.
             | 
             | >And for everything that's not browser-based, I would
             | either want the AI embedded in the software (image editors,
             | IDEs...) or not there are all.
             | 
             | AI embedded in software and AI operating the computer
             | itself are entirely different things. The former is not
             | necessarily a substitute for the latter.
             | 
              | Having access to Sora is not at all the same thing as AI
             | that can expertly operate Blender. And right now at least,
             | studios would actually much prefer the latter.
             | 
             | Even if they were equivalent (they're not), then you
             | wouldn't be able to operate most applications without
             | developers explicitly supporting and maintaining it first.
             | That's infeasible.
        
             | dragonwriter wrote:
             | I think the two big applications for programmatically (AI
             | or otherwise) operating a computer via this kind of UI
             | automation are:
             | 
             | (1) Automated testing of apps that use traditional UIs, and
             | 
             | (2) Automating legacy apps that it is not practical or cost
             | effective to update.
        
             | simonw wrote:
             | Google really don't want to provide a useful API to their
             | search results.
        
         | layer8 wrote:
         | A general-purpose assistant should be able to perform general-
         | purpose operations, meaning the same things people do on their
         | computers, and without having to supply a special-purpose AI-
         | compatible interface for every single function the AI might
         | need to operate. The AI should be able to operate any interface
         | a human can operate.
        
         | Workaccount2 wrote:
         | Anthropic is selling a product to people, not software
         | engineers.
        
       | downWidOutaFite wrote:
       | Vision is a crappy interface for computers but I think it could
       | be a useful weapon against all the extremely "secure" platforms
       | that refuse to give you access to your own data and refuse to
       | interoperate with anything outside their militarized walled
       | gardens.
        
       ___________________________________________________________________
       (page generated 2024-10-24 23:00 UTC)