[HN Gopher] Claude Computer Use - Is Vision the Ultimate API?
___________________________________________________________________
Claude Computer Use - Is Vision the Ultimate API?
Author : trq_
Score : 81 points
Date : 2024-10-24 18:15 UTC (4 hours ago)
(HTM) web link (www.thariq.io)
(TXT) w3m dump (www.thariq.io)
| simonw wrote:
| If you want to try out Computer Use (awful name) in a relatively
| safe environment the Docker container Anthropic provide here is
| very easy to start running (provided you have Docker set up; I
| used it with Docker Desktop for Mac):
| https://github.com/anthropics/anthropic-quickstarts/tree/mai...
| trq_ wrote:
| Yes, that's a good point! To be honest, I wanted to try it
| on the machine I use every day, but it's definitely a bit
| risky. Let me link that in the article.
| danielbln wrote:
| I for one appreciate the name Computer Use: no flashy marketing
| name, it just describes what it is. An LLM using a computer.
| croes wrote:
| Hard to ask questions about Computer Use
| swyx wrote:
| also it contrasts nicely with Tool Use, which is about
| calling apis rather than clicking on things
| lostmsu wrote:
| It is literally a specialized subset of Tool Use both in
| concept and how it appears to be actually implemented.
| sharpshadow wrote:
| In this context, Windows Recall makes total sense now from an AI
| learning perspective for them.
|
| It's actually a super cool development and I'm already very
| excited to let my computer use any software like a pro in front
| of me. Paint me a canvas of a savanna sunset with animal
| silhouettes, produce me a track of UK garage house, etc.,
| everything with all the layers and elements in the software, not
| just a finished output.
| croes wrote:
| Lots of energy consumption just to create a remix of something
| that already exists.
| sharpshadow wrote:
| Absolutely, we need much, much more energy and many, many
| more powerful chips. Energy is a resource and we need to
| harvest more of it.
|
| I don't understand why people make a point about energy
| consumption as if it were something bad.
| viraptor wrote:
| Come on, the trolling is too obvious.
| sharpshadow wrote:
| Absolutely not, I'm serious, and that is exactly what is
| going on. It would be trolling to pretend the opposite or
| to accept the status quo as final.
|
| Obviously we need to find a way to not harm and destroy
| our environment further, and we are making good progress
| on that. But technically we need much, much more energy.
| croes wrote:
| We are not making good progress; we are far from our goals,
| and at the moment AI is much the same as Bitcoin.
|
| We do things fast and expensive that could be done slow
| but cheap.
|
| The problem is we are running out of time.
|
| If you want more energy, you first build clean energy
| sources; then you can pump up consumption, not the other
| way around.
| flemhans wrote:
| Agreed, let's go!
| sharpshadow wrote:
| Those goals you are referring to are made up, as is the
| illusion that we are running out of time.
|
| It's usually the other way around. If we only did things
| when the resources were already there, we wouldn't have
| the progress we have.
| wwweston wrote:
| https://dothemath.ucsd.edu/2012/04/economist-meets-
| physicist...
|
| "the Earth has only one mechanism for releasing heat to
| space, and that's via (infrared) radiation. We understand
| the phenomenon perfectly well, and can predict the surface
| temperature of the planet as a function of how much energy
| the human race produces. The upshot is that at a 2.3%
| growth rate, we would reach boiling temperature in about
| 400 years. And this statement is independent of technology.
| Even if we don't have a name for the energy source yet, as
| long as it obeys thermodynamics, we cook ourselves with
| perpetual energy increase."
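|
| A rough back-of-the-envelope check of that claim (a sketch
| only: it assumes ~18 TW of present-day human power use, 2.3%
| annual growth, and a simple Stefan-Boltzmann radiation
| balance):
|
|     SIGMA = 5.67e-8        # Stefan-Boltzmann constant, W/m^2/K^4
|     EARTH_AREA = 5.1e14    # Earth's surface area, m^2
|     SOLAR_ABSORBED = 239.0 # absorbed sunlight, W/m^2 (after albedo)
|     P0 = 1.8e13            # assumed human power use today, ~18 TW
|     GROWTH = 0.023         # assumed 2.3% annual growth rate
|
|     def radiating_temp_kelvin(years):
|         """Equilibrium radiating temperature after `years` of growth."""
|         human_flux = P0 * (1 + GROWTH) ** years / EARTH_AREA
|         return ((SOLAR_ABSORBED + human_flux) / SIGMA) ** 0.25
|
|     # Step forward until the radiating temperature hits boiling.
|     years = 0
|     while radiating_temp_kelvin(years) < 373:
|         years += 1
|     print(years)  # ~440 years, in line with the quoted ~400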
| CharlieDigital wrote:
| Vision is the ultimate API.
|
| The historical progression from _text_ to _still images_ to
| _audio_ to _moving images_ will hold true for AI as well.
|
| Just look at OpenAI's progression as well from LLM to multi-modal
| to the realtime API.
|
| A co-worker almost 20 years ago said something interesting to me
| as we were discussing Al Gore's CurrentTV project: the history of
| information is constrained by "bandwidth". He mentioned how
| broadcast television went from 72 hours of "bandwidth" (3
| channels x 24h) per day to now having so much bandwidth that we
| could have a channel with citizen journalists. Of course, this
| was also the same time that YouTube was taking off.
|
| The pattern holds true for AI.
|
| AI is going to create "infinite bandwidth".
| ToDougie wrote:
| So long as the spectrum is open for infinity, yes.
|
| Was listening to a Seth Godin interview where he pointed out
| that there was a time when you had to purchase a slice of
| spectrum to share your voice on the radio. Nowadays you can put
| your thoughts on a platform, but that platform is owned by
| corporations who can and will put their thumb on thoughtcrime
| or challenges.
|
| I really do love your comment. Cheers.
| CharlieDigital wrote:
| Thanks!
|
| There's a related concept as well which is that as
| "bandwidth" increases, the ratio of producers to consumers
| pushes upwards towards 1. My take is that generative AI will
| accelerate this.
|
| I write a bit more in depth about it here:
| https://chrlschn.dev/blog/2024/10/im-a-gen-ai-maximalist-
| and...
| ricardo81 wrote:
| You could call it bandwidth, or call it entropy. I'd lean
| towards the more physical definition.
|
| I think of how the USA had cable TV and hundreds of channels
| projecting all kinds of whatever in the 80s while here in the
| UK we were limited to a handful of channels. To be fair, those
| few channels gave people something to talk about the next
| day, because millions of people saw the same thing. Surely a
| lot of what mankind has done is to tame entropy, like steam
| engines etc.
|
| With AI and everyone having a prompt, it's surely a game
| changer. How it works out, we'll see.
| swatcoder wrote:
| > The historical progression from text to still images to audio
| to moving images will hold true for AI as well.
|
| You'll have to explain what you mean by this. Direct speech,
| text, illustrations, photos, abstract sounds, music,
| recordings, videos, circuits, programs, cells... these are all
| just different mediums with different characteristics. There is
| no "progression" apparent among them. Why should there be? They
| each fulfill different ends and have different occasions for
| which they best suit.
|
| We seem to have discovered a new family of tools that help
| lossily transform content or intent from one of these mediums
| to some others, which is sure to be useful in its own ways. But
| it's not a medium like the above in the first place, and with
| none of them representing a progression, it certainly doesn't
| either.
| CharlieDigital wrote:
| > You'll have to explain what you mean by this
|
| The progression of distribution. Printing press, photos,
| radio, movies, television. The early web was text, then came
| images, then audio (Napster age), and then video (remember
| that Netflix used to ship DVDs?).
|
| The flip side of that is production and the ratio of
| producers to consumers. As the bandwidth for distribution
| increases, there is a decrease in the cost and complexity for
| producers and naturally, we see the same progression with
| producers on each new platform and distribution technology:
| text, still images, audio, moving images.
| margalabargala wrote:
| > The progression of distribution. Printing press, photos,
| radio, movies, television.
|
| Your history is incorrect, though. Still images predate
| text, by a lot.
|
| Cave paintings came before writing. Woodcuts came before
| the printing press.
| earthnail wrote:
| Not in the information age. This cascade just corresponds
| to how much data and processing power is needed for each.
|
| It is entirely logical to say that AI development will
| follow the same progression as the early internet or as
| broadcast since they all fall under the same data
| constraints.
| margalabargala wrote:
| In the information age we've seen inconsistencies as
| well.
|
| Ever since the release of Whisper and others, text-to-
| speech and speech-to-text have been more or less solved,
| while image generation seems to still sometimes have
| trouble. Earlier this week was a thread about how no
| image model could draw a crocodile without a tail.
|
| Meanwhile, the first photographs predate the first sound
| recordings. And moving images without sound, of course,
| predate moving images with sound.
|
| The original poster was trying to sound profound as
| though there was some set sequence of things that always
| happens through human development. But the reality is a
| much more mundane "less complex things tend to be easier
| than more complex things".
| swatcoder wrote:
| But it's not a progression. There's no transitioning. It's
| just the introduction of new media alongside prior ones.
|
| And as a sibling commenter noted, the individual histories of
| these media are disjoint and not really in the sequence you
| suggest.
|
| Regardless, generative AI isn't a medium like any of these.
| It's a means to transform media from one type to another,
| at some expense and with the introduction of loss/noise.
| There's something revolutionary about how easy it makes it
| to perform those transitions, and how generally it can
| perform them, but it's fundamentally more like a
| screwdriver than a video.
| rhdunn wrote:
| I'd argue that multimodal analysis can improve uni/bimodal
| models.
|
| There is overlap between text-to-image and text-to-video --
| image models would help video models animate interesting or
| complex prompts; video would help image models learn to
| differentiate features, as there are additional clues in how
| the image changes and remains the same.
|
| There's overlap between audio, text transcripts, and video
| around learning to animate speech, e.g. by learning how faces
| move with the corresponding audio/text.
|
| There's overlap between sound and video -- e.g. being able to
| associate sounds like a dog barking with the visuals, without
| direct labelling of either.
| croes wrote:
| Vision, and especially GUIs, is a pretty limited API.
| abirch wrote:
| It reminded me of these old Unix lessons of Master Foo
|
| https://prirai.github.io/books/unix-koans/#master-foo-
| discou...
| CharlieDigital wrote:
| I mean vision in the most general sense, not just a GUI.
|
| Imagine that OpenAI can not only read the inflection in your
| voice, but also the nuances in your facial expressions and how
| you're using your hands to understand your state of mind.
|
| And instead of merely responding as an audio stream, a real-
| time avatar.
| corobo wrote:
| Sweet! I'll have my own Holly from Red Dwarf!
|
| or maybe my own HAL9000..
|
| A little bit ambivalent on this haha, looking forward to
| seeing what comes of it either way though :)
| skydhash wrote:
| It is not. Text is very information-dense, recursive, and can
| be formalized. It is easily coupled with interaction methods
| and more apt for automation. You can easily see this with
| software like AutoCAD, which has both. There's a reason all
| protocols are text.
|
| Vision and audio play a nice role, but that's because of
| humans and reality. A _real world <-> vision|audio <->
| processing_ pipeline makes sense. But a _processing <-> data
| <-> vision|audio <-> data <-> processing_ cycle is just
| nonsense and a waste of resources.
| cooper_ganglia wrote:
| Not a waste of resources, just an increase in use. This is
| why we need more resources.
| skydhash wrote:
| It's a waste, because you could just remove the middleman
| and have a _data <-> processing_ cycle. When you increase
| resource use, some other metric should increase by a
| higher factor (car vs. carriage and horses, computers vs.
| doing computation by hand); otherwise it's a waste.
| ricardo81 wrote:
| >Text is very dense information wise and recursive and you
| can formalize it.
|
| There have been a lot of attempts over the years, with
| varying degrees of accuracy, but I don't know if you can go
| as far as to "formalize" it. Beyond the syntax (tokenising,
| syntactic chunking, and beyond) there is the intent, and
| that is super hard to measure. And possibly the problem with
| these prompts is that they get things right a lot of the
| time but wrong, say, 5% of the time, purely because they
| couldn't formalize it. My web hosting has 99.99% uptime,
| which is a bit more reassuring than 95%.
| layer8 wrote:
| Text is still a predominant medium of communication and
| information processing, and I don't see that changing
| substantially. TFA was an article, not a video, and you
| wouldn't want the HN comment section to be composed of videos
| or images. Similarly, video calls haven't replaced texting.
| CharlieDigital wrote:
| It's not that it will be replaced, but there's a natural
| progression in the types of media that are available on a
| given platform of distribution.
|
| RF: text (telegram), audio, still images (fax), moving images
|
| Web had the same progression: text, still images (inverted
| here), audio age (MP3s, Napster), video (Netflix, YouTube)
|
| AI: text, images, audio (realtime API), ...?
|
| Vision is the obvious next medium.
| wwweston wrote:
| Lately I've been realizing that as much as I value YouTube,
| much of the content is distorted by various incentives
| favoring length and frequency and a tech culture which
| overfocuses on elaborating steps as an equivalent to an
| explanation. This contributes to more duplicate content
| (often even within the same channel) and less in terms of
| refined conceptualization. I find myself often wishing I
| had outlines, summaries, and hyperlinks to greater details.
|
| Of course, I can use AI tools to get approximations of such
| things, and it'll probably get better, which means we will
| now be using this increased bandwidth or progression to
| produce more video to be pushed out through the pipes and
| distilled by an AI tool into shorter video or something
| like hypertext.
|
| Progress!
| wwweston wrote:
| The information coursing through the world around us already
| exceeds our ability to grasp it by many orders of magnitude.
|
| Three channels of television over 8 hours was already more than
| anyone had time to take in.
|
| AI _might_ be able to create summarizing layers and relays
| that help manage that.
|
| AI isn't going to create infinite bandwidth. It's as likely to
| increase entropy and introduce noise.
| viraptor wrote:
| Some time ago I made a prediction that accessibility is the
| ultimate API for UI agents, but unfortunately multimodal
| capabilities went the other way. But we can still change the
| course:
|
| This is a great place for people to start caring about
| accessibility annotations. All serious UI toolkits allow you to
| tell the computer what's on the screen. This allows things like
| Windows Automation https://learn.microsoft.com/en-
| us/windows/win32/winauto/entr... to see a tree of controls with
| labels and descriptions without any vision/OCR. It can be
| inspected by apps like FlauiInspect
| https://github.com/FlaUI/FlaUInspect?tab=readme-ov-file#main...
| But see how the example shows a statusbar with (Text "UIA3" "")?
| It could've been (Text "UIA3" "Current automation interface")
| instead for both a good tooltip and an accessibility label.
|
| Now we can kill two birds with one stone - actually improve the
| accessibility of everything and make sure custom controls adhere
| to the framework as well, and provide the same data to the coming
| automation agents. The text description will be much cheaper than
| a screenshot to process. Also it will help my work with manually
| coded app automation, so that's a win-win-win.
|
| As a side effect, it would also solve issues with UI weirdness.
| Have you ever had Windows open something on a screen that is
| no longer connected? Or under another window? Or minimised?
| Screenshots won't give enough information here to progress.
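|
| For illustration, a minimal sketch of reading that tree from
| Python via the third-party pywinauto library (Windows-only;
| the property names are from its UIA backend and worth
| double-checking against its docs):
|
|     # pip install pywinauto
|     from pywinauto import Desktop
|
|     desktop = Desktop(backend="uia")  # UI Automation backend
|
|     for window in desktop.windows():          # top-level windows
|         print(window.window_text())
|         for control in window.descendants():  # every exposed control
|             label = control.window_text()         # accessibility label
|             kind = control.friendly_class_name()  # e.g. Button, Edit
|             print(f"  {kind}: {label!r}")
|
| Controls with empty labels, like the (Text "UIA3" "") example
| above, show up immediately in a dump like this.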
| tomatohs wrote:
| > It is very helpful to give it things like:
| >
| > - A list of applications that are open
| > - Which application has active focus
| > - What is focused inside the application
| > - Function calls to specifically navigate those applications,
| >   as many as possible
|
| We've found the same thing while building the client for
| testdriver.ai. This info is in every request.
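|
| As a sketch, the per-request context might look something like
| this (field names and values here are hypothetical
| illustrations, not testdriver.ai's actual schema):
|
|     import json
|
|     def build_context(open_apps, focused_app, focused_element, tools):
|         """Bundle the desktop state the model sees on every request."""
|         return {
|             "open_applications": open_apps,
|             "active_focus": focused_app,
|             "focused_element": focused_element,
|             # Navigation functions the model can call instead of clicking.
|             "available_tools": tools,
|         }
|
|     context = build_context(
|         open_apps=["Safari", "Terminal", "Finder"],
|         focused_app="Safari",
|         focused_element="address bar",
|         tools=["open_url", "switch_app", "scroll_to"],
|     )
|     print(json.dumps(context, indent=2))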
| pabe wrote:
| I don't think vision is the ultimate API. It wasn't with
| "traditional" RPA and it won't be with more advanced AI-RPA. It's
| inefficient. If you want something to be used by a bot, write an
| interface for a bot. I'd make an exception for end2end testing.
| Veen wrote:
| You're looking at it from a developer's perspective. For non-
| developers, vision opens up all sorts of new capabilities. And
| they won't have to rely on the software creator's view of what
| should be automated and what should not.
| skydhash wrote:
| Most non-developers won't bother. You have Shortcuts on iOS
| and macOS, which is like Scratch for automation, and still
| only power users use it. Others just download the shortcuts
| they want.
| croes wrote:
| If a GUI is confusing for humans, AI will have problems
| too.
|
| So you still rely on developers to make reasonable GUIs.
| cheevly wrote:
| No, language is the ultimate API.
| ukuina wrote:
| On the instruction-provision end, sure.
| throwup238 wrote:
| Vision _plus_ accessibility metadata is the ultimate API. I see
| little reason that poorly designed flat UIs are going to confuse
| LLMs any less than humans, especially when they're missing from
| the training data (like most internal apps) or the documentation
| on the web is out of date. Even a basic dump of ARIA attributes or
| the hierarchy from OS accessibility APIs can help a lot.
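|
| For the web case, a minimal sketch of such a dump using
| Playwright's accessibility snapshot (assumes Playwright is
| installed; the URL is just a stand-in):
|
|     # pip install playwright && playwright install chromium
|     from playwright.sync_api import sync_playwright
|
|     def dump_tree(node, depth=0):
|         """Print each node's ARIA role and accessible name."""
|         print("  " * depth + f"{node['role']}: {node.get('name', '')!r}")
|         for child in node.get("children", []):
|             dump_tree(child, depth + 1)
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch()
|         page = browser.new_page()
|         page.goto("https://example.com")  # stand-in URL
|         snapshot = page.accessibility.snapshot()  # ARIA-derived tree
|         if snapshot:
|             dump_tree(snapshot)
|         browser.close()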
| dbish wrote:
| The problem is that accessibility data and APIs are very bad
| across the board.
| unglaublich wrote:
| Vision here means "2d pixel space".
|
| The ultimate API is "all the raw data you can acquire from your
| environment".
| layer8 wrote:
| For a typical GUI, the "mental model" actually needs to be
| 2.5D, due to stacked windows, popups, menus, modals, and so on.
| The article mentions that the model has difficulties with
| those.
| PreInternet01 wrote:
| Counterpoint: no, it's just more hype.
|
| Doing real-time OCR on 1280x1024 bitmaps has been possible for...
| the last decade or so? Sure, you can now do it on 4K or 8K
| bitmaps, but that's just an incremental improvement.
|
| Fact is, full-screen OCR coupled with innovations like "Google"
| has not led to "ultimate" productivity improvements, and as
| impressive as OpenAI _et al_ may appear right now, the impact of
| these technologies will end up roughly similar.
|
| (Which is to say: the landscape will change, but not in a truly
| fundamental way. What you're seeing demonstrated right now is,
| roughly speaking, the next Clippy, which, believe it or not, was
| hyped to a similar extent around the time it was introduced...)
| acchow wrote:
| "OCR : Computer Use" is as "voice-to-text : ChatGPT Voice"
| simonw wrote:
| The way these new LLM vision models work is very different from
| OCR.
|
| I saw a demo this morning of someone getting Claude to play
| FreeCiv (admittedly extremely badly):
| https://twitter.com/greggyb/status/1849198544445432229
|
| Try doing that with Tesseract.
| croes wrote:
| I bet Tesseract plays pretty badly too.
| KoolKat23 wrote:
| Existing OCR is extremely limited and requires custom narrow
| development.
| throwaway19972 wrote:
| I'd imagine you'd get higher quality leveraging accessibility
| integrations.
| echoangle wrote:
| Am I the only one thinking this is an awful way for AI to do
| useful stuff for you? Why would I train an AI to use a GUI?
| Wouldn't it be better to just have the AI learn API docs and use
| that? I don't want the AI to open my browser, open Google Maps,
| and search for shawarma; I want the AI to call a Google API and
| give me the result.
| voiper1 wrote:
| Sure, it's more efficient to have it use an API. And people
| have been integrating those for the last while.
|
| But there are tons of applications that are locked behind a
| website or desktop GUI with no API that are accessible via
| vision.
| og_kalu wrote:
| The vast majority of applications cannot be used through
| anything other than a GUI.
|
| We built computers to be used by humans and humans
| overwhelmingly operate computers with GUIs. So if you want a
| machine that can potentially operate computers as well as
| humans then you're going to have to stick to GUIs.
|
| It's the same reason we're trying to build general purpose
| robots in a human form factor.
|
| The fact that a car is about as wide as a two-horse-drawn
| carriage is also no coincidence. You can't ignore existing
| infrastructure.
| echoangle wrote:
| But I don't want an AI to "operate a computer"... Maybe I'm
| missing the point of this, but I just can't imagine a use case
| where this is a good solution. For everything browser-based,
| the burden of making an API is probably relatively small, and
| if the page is simple enough, you could maybe even get away
| with training the AI on the page HTML and generating a
| response to send. And for everything that's not browser-
| based, I would either want the AI embedded in the software
| (image editors, IDEs...) or not there at all.
| og_kalu wrote:
| >But I don't want an AI to "operate a computer"...
|
| You don't and that's fine but certainly many people are
| interested in such a thing.
|
| >maybe I'm missing the point of this but I just can't
| imagine a usecase where this is a good solution.
|
| If it could operate computers robustly and reliably, then
| why wouldn't you? Plenty of what someone does on a computer
| is a task they would like to automate away but can't with
| current technology.
|
| >For everything browser based, the burden of making an API
| is probably relatively small
|
| It's definitely not less effort than sticking to a GUI.
|
| >and if the page is simple enough, you could maybe even get
| away with training the AI on the page HTML and generating a
| response to send.
|
| Sure in special circumstances, it may be a good idea to use
| something else.
|
| >And for everything that's not browser-based, I would
| either want the AI embedded in the software (image editors,
| IDEs...) or not there at all.
|
| AI embedded in software and AI operating the computer
| itself are entirely different things. The former is not
| necessarily a substitute for the latter.
|
| Having access to Sora is not at all the same thing as an AI
| that can expertly operate Blender. And right now at least,
| studios would actually much prefer the latter.
|
| Even if they were equivalent (they're not), then you
| wouldn't be able to operate most applications without
| developers explicitly supporting and maintaining it first.
| That's infeasible.
| dragonwriter wrote:
| I think the two big applications for programmatically (AI
| or otherwise) operating a computer via this kind of UI
| automation are:
|
| (1) Automated testing of apps that use traditional UIs, and
|
| (2) Automating legacy apps that it is not practical or cost
| effective to update.
| simonw wrote:
| Google really don't want to provide a useful API to their
| search results.
| layer8 wrote:
| A general-purpose assistant should be able to perform general-
| purpose operations, meaning the same things people do on their
| computers, and without having to supply a special-purpose AI-
| compatible interface for every single function the AI might
| need to operate. The AI should be able to operate any interface
| a human can operate.
| Workaccount2 wrote:
| Anthropic is selling a product to people, not software
| engineers.
| downWidOutaFite wrote:
| Vision is a crappy interface for computers but I think it could
| be a useful weapon against all the extremely "secure" platforms
| that refuse to give you access to your own data and refuse to
| interoperate with anything outside their militarized walled
| gardens.
___________________________________________________________________
(page generated 2024-10-24 23:00 UTC)