[HN Gopher] Gemini 2.5 Computer Use model
       ___________________________________________________________________
        
       Gemini 2.5 Computer Use model
        
       Author : mfiguiere
       Score  : 250 points
       Date   : 2025-10-07 19:49 UTC (3 hours ago)
        
 (HTM) web link (blog.google)
 (TXT) w3m dump (blog.google)
        
       | strangescript wrote:
        | I assume its tool calling and structured output are way better,
        | but this model isn't in Studio unless it's being silently subbed
        | in.
        
         | phamilton wrote:
         | Just tried it in an existing coding agent and it rejected the
         | requests because computer tools weren't defined.
        
           | omkar_savant wrote:
            | We can definitely make the docs clearer here, but the model
            | requires using the computer_use tool. If you have custom
            | tools, you'll need to exclude any predefined tools that
            | clash with our action space.
           | 
           | See this section:
           | https://googledevai.devsite.corp.google.com/gemini-
           | api/docs/...
           | 
           | And the repo has a sample setup for using the default
           | computer use tool: https://github.com/google/computer-use-
           | preview
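        
        A minimal sketch of what enabling the predefined tool looks like
        with the google-genai Python SDK, going by the preview docs
        linked above (the model ID, environment enum, and the
        excluded_predefined_functions field are taken from those docs
        but should be treated as assumptions during the preview):
        
            from google import genai
            from google.genai import types
        
            client = genai.Client()  # reads GEMINI_API_KEY from the env
        
            config = types.GenerateContentConfig(
                tools=[types.Tool(computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    # drop predefined actions that clash with custom tools
                    excluded_predefined_functions=["drag_and_drop"],
                ))],
            )
        
            response = client.models.generate_content(
                model="gemini-2.5-computer-use-preview-10-2025",
                contents="Open news.ycombinator.com and open the top story.",
                config=config,
            )
            print(response.candidates[0].content.parts)  # function_call part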
        
       | xnx wrote:
       | I've had good success with the Chrome devtools MCP
       | (https://github.com/ChromeDevTools/chrome-devtools-mcp) for
       | browser automation with Gemini CLI, so I'm guessing this model
       | will work even better.
        
         | arkmm wrote:
         | What sorts of automations were you able to get working with the
         | Chrome dev tools MCP?
        
           | odie5533 wrote:
           | Not OP, but in my experience, Jest and Playwright are so much
           | faster that it's not worth doing much with the MCP. It's a
           | neat toy, but it's just too slow for an LLM to try to control
           | a browser using MCP calls.
        
             | atonse wrote:
              | Yeah, I think it would be better to just have the model
              | write out Playwright scripts than do what it's doing
              | right now (or at least navigate manually first and then,
              | based on that, write a Playwright TypeScript script for
              | future tests).
              | 
              | Cuz right now it's way too slow... perform an action,
              | then read the results, then wait for the next tool call,
              | etc.
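        
        Here, a minimal Python Playwright sketch of that record-once,
        replay-deterministically idea (atonse mentions TypeScript; Python
        is used for all sketches here, and the URL and selectors are
        hypothetical):
        
            from playwright.sync_api import sync_playwright
        
            # The kind of script an agent could emit once after exploring
            # a page, then replay cheaply with no LLM in the loop.
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto("https://example.com/login")   # hypothetical target
                page.fill("#username", "test-user")      # selectors recorded
                page.fill("#password", "test-pass")      # during the LLM run
                page.click("button[type=submit]")
                page.wait_for_url("**/dashboard")        # assert end state
                browser.close()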
        
               | omneity wrote:
                | This is basically our approach with Herd[0]. We operate
                | agents that develop, test, and heal trails[1, 2], which
                | are packaged browser automations that do not require
                | browser-use LLMs to run and are therefore much cheaper
                | and more reliable. Trail automations are then abstracted
                | as a REST API and an MCP server[3], which can be used as
                | simple functions called from your code, by your own
                | agent, or any combination of the two.
               | 
               | You can build your own trails, publish them on our
               | registry, compose them ... You can also run them in a
               | distributed fashion over several Herd clients where we
               | take care of the signaling and communication but you
               | simply call functions. The CLI and npm & python packages
               | [4, 5] might be interesting as well.
               | 
                | Note: The automation stack is entirely home-grown to
                | enable distributed orchestration. It doesn't rely on
                | Puppeteer or Playwright, but the browser automation
                | API[6] is deliberately similar to ease adoption. We
                | also don't use the Chrome DevTools Protocol and
                | therefore have a different tradeoff footprint.
               | 
               | 0: https://herd.garden
               | 
               | 1: https://herd.garden/trails
               | 
               | 2: https://herd.garden/docs/trails-automations
               | 
               | 3: https://herd.garden/docs/reference-mcp-server
               | 
               | 4: https://www.npmjs.com/package/@monitoro/herd
               | 
               | 5: https://pypi.org/project/monitoro-herd/
               | 
               | 6: https://herd.garden/docs/reference-page
        
         | iLoveOncall wrote:
          | This has absolutely nothing in common with a model for
          | computer use... It uses pre-defined tools provided in
          | Google's MCP server; it has nothing to do with a general
          | model that's supposed to work with any software.
        
       | cryptoz wrote:
       | Computer Use models are going to ruin simple honeypot form fields
       | meant to detect bots :(
        
         | layman51 wrote:
          | You mean the ones where people add a question like "What is
          | 10+3?"
        
         | jebronie wrote:
         | I just tried to submit a contact form with it. It successfully
         | solved the ReCaptcha but failed to fill in a required field and
         | got stuck. We're safe.
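        
        For context, the honeypot trick mentioned above is usually just a
        form field hidden via CSS: humans never see or fill it, while
        naive bots fill every field. A minimal server-side sketch (Flask;
        the route and field name are hypothetical):
        
            from flask import Flask, abort, request
        
            app = Flask(__name__)
        
            @app.route("/contact", methods=["POST"])
            def contact():
                # The "website" input is hidden with CSS, so a human
                # leaves it empty; a form-stuffing bot fills it in.
                if request.form.get("website"):
                    abort(400)  # almost certainly a bot
                return "ok"  # handle the legitimate submission here
        
        A vision-driven computer-use agent never "sees" the hidden field
        either, so it passes this check just like a human would - which
        is exactly the worry.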
        
       | phamilton wrote:
       | It successfully got through the captcha at
       | https://www.google.com/recaptcha/api2/demo
        
         | siva7 wrote:
          | Probably because its IP is coming from Google's own subnet.
        
           | asadm wrote:
            | Isn't it coming from a Browserbase container?
        
             | ripbozo wrote:
             | Interestingly the IP I got when prompting `what is my IP`
             | was `73.120.125.54` - which is a residential comcast IP.
        
               | martinald wrote:
                | Looks like Browserbase has proxies, which will often
                | be residential IPs.
        
         | jampa wrote:
         | The automation is powered through Browserbase, which has a
         | captcha solver. (Whether it is automated or human, I don't
         | know.)
        
         | simonw wrote:
         | Post edited: I was wrong about this. Gemini tried to solve the
         | Google CAPTCHA but it was actually Browserbase that did the
         | solve, notes here:
         | https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
        
           | pants2 wrote:
            | Interesting that they're allowing Gemini to solve
            | CAPTCHAs, because OpenAI's agent detects CAPTCHAs and
            | forces user input for them despite being fully able to
            | solve them.
        
       | dude250711 wrote:
       | Have average Google developers been told/hinted that their
       | bonuses/promotions will be tied to their proactivity in using
       | Gemini for project work?
        
         | peddling-brink wrote:
         | > bonuses/promotions
         | 
         | more like continued employment.
        
         | teaearlgraycold wrote:
          | I know there was a memo telling Googlers they're expected to
          | use AI at work, and that their performance is expected to
          | increase as a result.
        
           | dude250711 wrote:
            | HBO's Silicon Valley ended way too soon. The plot pretty
            | much writes itself.
        
             | Imustaskforhelp wrote:
              | Don't worry, maybe someone will create AI slop for this
              | on Sora 2 or the like (this was satire).
              | 
              | On a serious note: what the fuck is happening in the
              | world.
        
       | password54321 wrote:
        | It doesn't seem like it makes sense to train AI around human
        | user interfaces, which aren't really efficient. It's like
        | building a mechanical horse.
        
         | pixl97 wrote:
         | Right, let's make APIs for everything...
         | 
         | [Looks around and sees people not making APIs for everything]
         | 
         | Well that didn't work.
        
           | odie5533 wrote:
            | Every website and application is just layers of data.
            | Playwright and similar tools have options for taking
            | snapshots that contain data like text, forms, buttons,
            | etc., that can be interacted with on a site. All the calls
            | a website makes are just APIs. Even a native application
            | is made up of WinForms that can be inspected.
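        
        For instance, a minimal Python Playwright sketch of pulling that
        structured layer (the URL is hypothetical; newer Playwright
        versions expose similar data via ARIA snapshots):
        
            from playwright.sync_api import sync_playwright
        
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.goto("https://example.com")
                # Nested dict of roles/names covering text, buttons,
                # form fields - the interactable structure of the page.
                tree = page.accessibility.snapshot()
                print(tree)
                browser.close()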
        
             | pixl97 wrote:
              | Ah, so now you're turning LLMs into web browsers capable
              | of parsing JavaScript to figure out what a human might
              | be looking at. Let's see how many levels deep we can go.
        
               | measurablefunc wrote:
                | Just inspect the memory content of the process. It's
                | all just numbers at the end of the day & algorithms do
                | not have any understanding of what the numbers mean
                | other than generating other numbers in response to the
                | input numbers. For the record, I agree w/ OP:
                | screenshots are not a good interface, for the same
                | reasons that trains, subways, & dedicated lanes for
                | mass transit are obviously superior to cars & their
                | attendant headaches.
        
               | ssl-3 wrote:
               | Maybe some day, sure. We may eventually live in a utopia
               | where everyone has quick, efficient, accessible mass
               | transit available that allows them to move between any
               | two points on the globe with unfettered grace.
               | 
               | That'd be neat.
               | 
               | But for now: The web exists, and is universal. We have
               | programs that can render websites to an image in memory
               | (solved for ~30 years), and other programs that can parse
               | images of fully-rendered websites (solved for at least a
               | few years), along with bots that can click on links
               | (solved much more recently).
               | 
               | Maybe tomorrow will be different.
        
               | measurablefunc wrote:
                | The point was that process memory is the source of
                | truth; everything else is derived & only throws away
                | information that a neural network could use to make
                | better decisions. Presentation of data is irrelevant
                | to a neural network; it's all just numbers &
                | arithmetic at the end of the day.
        
         | wahnfrieden wrote:
         | It's not about efficiency but access. Many services do not
         | provide programmatic access.
        
         | CuriouslyC wrote:
         | We're training natural language models to reason by emulating
         | reasoning in natural language, so it's very on brand.
        
           | bonoboTP wrote:
            | It's on the brand of stuff that works. Expert systems and
            | formal, symbolic, if-else rule-based reasoning were tried,
            | and they failed. Real life is messy and fat-tailed.
        
             | CuriouslyC wrote:
             | And yet we give agents deterministic tools to use rather
             | than tell them to compute everything in model!
        
               | bonoboTP wrote:
               | Yes, and here they also operate deterministic GUI tools.
               | Thing is, many GUI programs are not designed so well.
               | Their best interface and the only interface they were
               | tested and designed for is the visual one.
        
         | michaelt wrote:
         | In my country there's a multi-airline API for booking plane
         | tickets, but the cheapest of economy carriers only accept
         | bookings directly on their websites.
         | 
         | If you want to make something that can book _every_ airline?
         | Better be able to navigate a website.
        
           | odie5533 wrote:
           | You can navigate a website without visually decoding the
           | image of a website.
        
             | bonoboTP wrote:
              | Except if it's a messy div soup with various shitty
              | absolute and relative pixel offsets, where the only way
              | to know what refers to what is by rendering it and using
              | gestalt principles.
        
               | measurablefunc wrote:
               | None of that matters to neural networks.
        
               | bonoboTP wrote:
               | It does, because it's hard to infer where each element
               | will end up in the render. So a checkbox may be set up in
               | a shitty way such that the corresponding text label is
               | not properly placed in the DOM, so it's hard to tell what
               | the checkbox controls just based on the DOM tree. You
                | have to take into account the styling and pixel
                | placement stuff, i.e. render it properly and look at
                | it.
               | 
               | That's just one obvious example, but the principle holds
               | more generally.
        
               | measurablefunc wrote:
                | Spatial continuity has nothing to do w/ how neural
                | networks interpret an array of numbers. In fact, there
                | is nothing about the topology of the input that is in
                | any way relevant to what calculations are done by the
                | network. You are imposing an anthropomorphic structure
                | that does not exist anywhere in the algorithm & how it
                | processes information. Here is an example to
                | demonstrate my point:
                | https://x.com/s_scardapane/status/1975500989299105981
        
               | bonoboTP wrote:
               | It would have to implicitly render the HTML+CSS to know
               | which two elements visually end up next to each other, if
               | the markup is spaghetti and badly done.
        
               | measurablefunc wrote:
               | The linked post demonstrates arbitrary re-ordering of
               | image patches. Spatial continuity is not relevant to
               | neural networks.
        
               | bonoboTP wrote:
               | That's ridiculous, sorry. If that were so, we wouldn't
               | have positional encodings in vision transformers.
        
               | ionwake wrote:
                | Why are you talking about image processing? The guy
                | you're talking to isn't.
        
               | measurablefunc wrote:
               | What do you suppose "render" means?
        
               | bonoboTP wrote:
               | The original comment I replied to said "You can navigate
               | a website without visually decoding the image of a
               | website." I replied that decoding is necessary to know
               | where the elements will end up in a visual arrangement,
               | because often that carries semantics. A label that is
               | rendered next to another element can be crucial for
                | understanding the functioning of the program. It's
                | nontrivial to tell, just from the HTML or whatever
                | tree structure, where each element will appear in 2D
                | after rendering.
        
         | TulliusCicero wrote:
         | This is just like the comments suggesting we need sensors and
         | signs specifically for self-driving cars for them to work.
         | 
         | It'll never happen, so companies need to deal with the reality
         | we have.
        
           | password54321 wrote:
            | We could build tons of infrastructure for cars that didn't
            | exist before, but we can't do it for anything else
            | anymore? Seems like society is just becoming lethargic.
        
         | jklinger410 wrote:
          | Why do you think we have fully self-driving cars instead of
          | just simpler beacon systems? Why doesn't McDonald's have a
          | fully automated kitchen?
          | 
          | New technology is slow to arrive because of risk aversion;
          | it's very rare for people to just tear up what they already
          | have to re-implement new technology from the ground up. We
          | always have to shoehorn new technology into old systems to
          | prove it first.
         | 
         | There are just so many factors that get solved by working with
         | what already exists.
        
           | layman51 wrote:
           | About your self-driving car point, I feel like the approach
           | I'm seeing is akin to designing a humanoid robot that uses
           | its robotic feet to control the brake and accelerator pedals,
           | and its hand to move the gear selector.
        
             | bonoboTP wrote:
              | Yeah, that would be pretty good, honestly. It could
              | immediately upgrade every car ever made to self-driving,
              | and then it could also do your laundry without your
              | buying a new washing machine, and everything else. It's
              | just hard to do. But it will happen.
        
               | layman51 wrote:
               | Yes, it sounds very cool and sci-fi, but having a
               | humanoid control the car seems less safe than having the
               | spinning cameras and other sensors that are missing from
               | older cars or those that weren't specifically built to be
               | self-driving. I suppose this is why even human drivers
               | are assisted by automatic emergency braking.
               | 
               | I am more leaning into the idea that an efficient self-
               | driving car wouldn't even need to have a steering wheel,
               | pedals, or thin pillars to help the passengers see the
               | outside environment or be seen by pedestrians.
               | 
                | The way this ties back to computer use models is that
                | a lot of webpages have stuff designed for humans that
                | would make it difficult for a model to navigate them
                | well. I think fixing this was the goal of the
                | "semantic web".
        
             | iAMkenough wrote:
             | I could add self-driving to my existing fleet? Sounds
             | intriguing.
        
         | golol wrote:
          | If we could build mechanical horses, they would be
          | absolutely amazing!
        
         | ivape wrote:
         | What you say is 100% true until it's not. It seems like a weird
         | thing to say (what I'm saying), but please consider we're in a
         | time period where everything we say is true, minute by minute,
         | and no more. It could be the next version of this just works,
         | and works really well.
        
         | aidenn0 wrote:
         | Reminds me of WALL-E where there is a keypad with a robot
         | finger to press buttons on it.
        
       | ramoz wrote:
        | This will never hit a production enterprise system without
        | some form of hooks/callbacks in place to enforce governance.
        | 
        | That's obviously much harder with a UI than with agent events
        | like those below.
       | 
       | https://docs.claude.com/en/docs/claude-code/hooks
       | 
       | https://google.github.io/adk-docs/callbacks/
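        
        For illustration, a rough sketch of such a guardrail in the style
        of ADK's before_tool_callback from the second link (the exact
        signature and the tool/argument names here are assumptions):
        
            def governance_callback(tool, args, tool_context):
                """Veto tool calls that violate policy, outside the model.
        
                Per the ADK docs, returning a dict skips the tool call and
                substitutes that dict as its result; returning None lets
                the call proceed.
                """
                blocked = {"sharepoint.internal.example.com"}  # hypothetical
                if tool.name == "navigate" and any(
                        host in args.get("url", "") for host in blocked):
                    return {"error": "Blocked by governance policy."}
                return None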
        
         | peytoncasper wrote:
         | Hi! I work in identity products at Browserbase. I've spent a
         | fair amount of time lately thinking about how to layer RBAC
         | across the web.
         | 
         | Do you think callbacks are how this gets done?
        
           | ramoz wrote:
            | Disclaimer: I'm a cofounder; we focus on critical spaces
            | with AI. I was also behind the feature request for Claude
            | Code hooks.
            | 
            | But my bet: we will not deploy a single agent into any
            | real environment without deterministic guarantees. Hooks
            | are a means...
            | 
            | Browserbase with hooks would be really powerful:
            | governance beyond RBAC (but of course enabling relevant
            | guardrailing as well - "does the agent have permission to
            | access this SharePoint right now, within this context, to
            | conduct action x?").
            | 
            | I would love to meet with you, actually; my shop cares
            | intimately about agent verification and governance. We'll
            | soon release the tool I originally designed for Claude
            | Code hooks.
        
       | CuriouslyC wrote:
       | I feel like screenshots should be the last thing you reach for.
       | There's a whole universe of data from accessibility subsystems.
        
         | ekelsen wrote:
         | and all sorts of situations where they don't work. When they do
         | work it's great, but if they don't and you rely on them, you
         | have nothing.
        
           | CuriouslyC wrote:
           | Oh yeah, using all available data channels in proportion to
           | their cost and utility is the right choice, 100%.
        
         | bonoboTP wrote:
          | The rendered visual layout is designed to be spatially
          | organized in a way that makes sense perceptually. It's a bit
          | like PDFs. I imagine the underlying hierarchy tree can be
          | quite messy and spaghetti-like, so your best bet is to use
          | the app in the form the devs intended and tested it for.
         | 
          | I think screenshots are a really good and robust idea. It
          | bothers the more structured-minded people, but apps are often
          | not built so well. They are built up to the point that it
          | _looks_ fine and people are able to use it. I'm pretty sure
          | people who rely on accessibility systems have lots of
          | complaints about this.
        
           | CuriouslyC wrote:
            | The progressives were pretty good at pushing accessibility
            | in applications; it's not perfect, but every company I've
            | worked with since the mid-2010s has made a big to-do about
            | accessibility. For stuff on Linux you can instrument
            | observability in a lot of ways that are more efficient
            | than screenshots, so I don't think it's generally the
            | right way forward, but screenshots are universal and we
            | already have capable vision models, so it's sort of a
            | local optimization move.
        
       | whinvik wrote:
        | My general experience has been that Gemini is pretty bad at
        | tool calling. The recent Gemini 2.5 Flash release actually
        | fixed some of those issues, but this one is Gemini 2.5 Pro
        | with no indication of tool-calling improvements.
        
       | TIPSIO wrote:
       | Painfully slow
        
         | John7878781 wrote:
         | That doesn't matter so much when it can happen in the
         | background.
        
       | Oras wrote:
       | It is actually quite good at following instructions, but I tried
       | clicking on job application links, and since they open in a new
       | window, it couldn't find the new window. I suppose it might be an
       | issue with BrowserBase, or just the way this demo was set up.
        
         | MiguelG719 wrote:
          | Are you running into this issue on gemini.browserbase.com or
          | in the google/computer-use-preview GitHub repo?
        
       | mianos wrote:
        | I sure hope this is better than pathetically useless. I assume
        | it is to replace the extremely frustrating Gemini for Android.
        | If I have a Bluetooth headset and I try "play music on
        | Spotify", it fails about half the time, even with YouTube
        | Music. I could not believe it was so bad, so I just sat at my
        | desk with the helmet on and tried it over and over. It seems
        | to recognise the speech but simply fails to do anything. Brand
        | new Pixel 10. The old speech recognition system was way
        | dumber, but it actually worked.
        
         | bsimpson wrote:
          | I was riding my motorcycle the other day and asked my helmet
          | to "call <friend>." Gemini infuriatingly replied "I cannot
          | directly make calls for you. Is there something else I can
          | help you with?" This absolutely used to work.
          | 
          | Reminds me of an anecdote where Amazon invested however many
          | person-lives in building AI for Alexa, only to discover that
          | alarms, music, and weather make up the large majority of
          | what people actually use smart speakers for. They're making
          | these things worse at their main jobs so they can sell the
          | sizzle of AI to investors.
        
       | mosura wrote:
       | One of the slightly buried stories here is BrowserBase
       | themselves. Great stuff.
        
       | bonoboTP wrote:
        | There are some absolutely atrocious UIs out there for many
        | office workers, who spend hours clicking buttons, opening
        | popup after popup, clicking repetitively on checkboxes, etc. -
        | e.g. entering travel costs or some such in academia and
        | elsewhere. You have no idea how annoying that type of work is;
        | you pull out your hair. Why don't they make better UIs, you
        | ask? If you ask, you have no idea how bad things are. They
        | don't care, there is no communication, it seems fine, the
        | software creators are hard to reach, and the software is
        | approved by people who never used it and decide based on gut
        | feel, PowerPoints, and feature tickmarks. Even big-name brands
        | like SAP are horrible at this.
        | 
        | If such AI tools allow automating this soul-crushing drudgery,
        | it will be great. I know that you can technically script
        | things with Selenium, AutoHotkey, whatnot, but you can imagine
        | that's a nonstarter in a regular office. This kind of tool
        | could make things like that much more efficient. And it's not
        | as if it will obviate the jobs entirely (at least not right
        | away); these offices often have immense backlogs and are
        | understaffed as is.
        
       | numpad0 wrote:
        | How big are the Gemini 2.5 (Pro/Flash/Lite) models in
        | parameter counts, in experts' guesstimation? Are they towards
        | 50B, 500B, or bigger still? Even Flash feels smart enough for
        | vibe coding tasks.
        
         | thomasm6m6 wrote:
         | 2.5 Flash Lite replaced 2.0 Flash Lite which replaced 1.5 Flash
         | 8B, so one might suspect 2.5 Flash Lite is well under 50B
        
       | jcims wrote:
        | (Just using the Browserbase demo.)
        | 
        | Knowing it's technically possible is one thing, but giving it
        | a short command and seeing it go log in to a site, scroll
        | around, reply to posts, etc. is eerie.
        | 
        | Also, it tied me at Wordle today, making the same mistake I
        | did on the second-to-last guess. Too bad you can't talk to it
        | while it's working.
        
       | iAMkenough wrote:
       | Not great at Google Sheets. Repeatedly overwrites all previous
       | columns while trying to populate new columns.
       | 
       | > I am back in the Google Sheet. I previously typed "Zip Code" in
       | F1, but it looks like I selected cell A1 and typed "A". I need to
       | correct that first. I'll re-type "Zip Code" in F1 and clear A1.
       | It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and
       | typed "Zip Code", but then maybe clicked A1 again.
        
       | asadm wrote:
       | This is great. Now I want it to run faster than I can do it.
        
       | omkar_savant wrote:
       | Hey - I'm on the team that launched this. Please let me know if
       | you have any questions!
        
         | SoKamil wrote:
         | How are you going to deal with reCAPTCHA and ad impressions?
         | Sounds like a conflict of interest.
        
       | martinald wrote:
        | Interesting: it seems to use 'pure' vision and x/y coords for
        | clicking stuff. Most other browser automation with LLMs I've
        | seen uses the DOM/accessibility tree, which absolutely churns
        | through context but is much more 'accurate' at clicking stuff
        | because it can use the exact text/elements in a selector.
        | 
        | Unfortunately it really struggled in the demos for me. It took
        | nearly 18 attempts to click the comment link in the HN demo,
        | each a few pixels off.
        
       | dekhn wrote:
       | Many years ago I was sitting at a red light on a secondary road,
       | where the primary cross road was idle. It seemed like you could
       | solve this using a computer vision camera system that watched the
       | primary road and when it was idle, would expedite the secondary
       | road's green light.
       | 
       | This was long before computer vision was mature enough to do
       | anything like that and I found out that instead, there are
       | magnetic systems that can detect cars passing over - trivial
       | hardware and software - and I concluded that my approach was just
       | far too complicated and expensive.
       | 
        | Similarly, when I look at computers, I typically want the
        | ML/AI system to operate on structured data that is codified
        | for computer use. But I guess the world is complicated enough,
        | and computers got fast enough, that having an AI look at a
        | computer screen and move/click a mouse makes sense.
        
         | ge96 wrote:
          | It's funny: I'll sometimes scoot forward / rock my car, but
          | I'm not sure if it helps or it's just coincidence. Also, a
          | lot of stop lights now have that tall white camera on top.
        
           | bozhark wrote:
            | Like flashing lights for the first-responder sensor.
        
           | Spooky23 wrote:
            | Sometimes the rocking helps with an induction loop that
            | isn't working well.
        
           | netghost wrote:
            | There are several mechanisms. The most common is (or at
            | least was) a loop detector under the road that triggers
            | when a vehicle is over it. If you're not quite over it, or
            | it's somewhat faulty, moving can sometimes trigger it.
        
         | trenchpilgrim wrote:
          | FWIW, those types of traffic cameras are in common use:
          | https://www.milesight.com/company/blog/types-of-traffic-came...
        
           | dekhn wrote:
            | Reading the web page, they don't actually use that as a
            | solution for shortening a red - IMHO that has a very high
            | safety bar compared to the more common uses. But I'd be
            | happy to hear this is something that Just Works in the
            | Real World with reasonable false positive and false
            | negative rates.
        
             | trenchpilgrim wrote:
             | Yes they do, it's listed under Traffic Sensor Cameras.
        
           | jlhawn wrote:
           | The camera systems are also superior from an infrastructure
           | maintenance perspective. You can update them with new
           | capabilities or do re-striping without tearing up the
           | pavement.
        
         | dktp wrote:
         | I cycle a lot. Outdoors I listen to podcasts and the fact that
         | I can say "Hey Google, go back 30sec" to relisten to something
         | (or forward to skip ads) is very valuable to me.
         | 
          | Indoors I tend to cast some show or YouTube video. Often
          | enough I want to change the YouTube video or show using
          | voice commands. I can do this for YouTube, but the results
          | are horrible unless I know exactly which video I want to
          | watch. For other services it's largely not possible at all.
          | 
          | In a perfect world, Google would provide superb APIs for
          | these integrations, and all app providers would integrate
          | them and keep them up to date. But if we can bypass that and
          | get good results across the board, I would find it very
          | valuable.
          | 
          | I understand this is a very specific scenario, but one I
          | would be excited about nonetheless.
        
         | yunyu wrote:
          | There is a lot of pretraining data available around screen
          | recordings and mouse movements (Loom, YouTube, etc.). There
          | is much less pretraining data available around navigating
          | accessibility trees or DOM structures. Many use cases may
          | also need to be image-aware (document scan parsing, looking
          | at images), and keyboard/video/mouse-based models generalize
          | to more applications.
        
         | chrisfosterelli wrote:
         | Ironically now that computer vision is commonplace, the cameras
         | you talk about have become increasingly popular over the years
         | because the magnetic systems do not do a very good job of
         | detecting cyclists and the cameras double as a congestion
         | monitoring tool for city staff.
        
       | AaronAPU wrote:
       | I'm looking forward to a desktop OS optimized version so it can
       | do the QA that I have no time for!
        
       | alexnewman wrote:
        | A year ago I did something that used RAG and accessibility
        | mode to navigate UIs.
        
       | dekhn wrote:
       | I just have to say that I consider this an absolutely hilarious
       | outcome. For many years, I focused on tech solutions that
       | eliminated the need for a human to be in front of a computer
       | doing tedious manual operations. For a wide range of activities,
       | I proposed we focus on "turning everything in the world into
       | database objects" so that computers could operate on them with
       | minimal human effort. I spent significant effort in machine
       | learning to achieve this.
       | 
       | It didn't really occur to me that you could just train a computer
       | to work directly on the semi-structured human world data (display
       | screen buffer) through a human interface (mouse + keyboard).
       | 
        | However, I fully support it (like all the other crazy ideas on
        | the web that beat out the "theoretically better" approaches).
        | I do not think it is unrealistic to expect that within a
        | decade we could have computer systems that can open Chrome,
        | start a video chat with somebody, go back and forth for a
        | while to achieve a task, then hang up... without the person on
        | the other end ever knowing they were dealing with a computer
        | instead of a human.
        
       | realty_geek wrote:
        | Absolutely hilarious how it gets stuck trying to solve the
        | captcha each time. I had to explicitly tell it not to go to
        | Google first.
       | 
       | In the end I did manage to get it to play the housepriceguess
       | game:
       | 
       | https://www.youtube.com/watch?v=nqYLhGyBOnM
       | 
       | I think I'll make that my equivalent of Simon Willison's "pelican
       | riding a bicycle" test. It is fairly simple to explain but seems
       | to trip up different LLMs in different ways.
        
       | GeminiFan2025 wrote:
       | The new Gemini 2.5 model's ability to understand and interact
       | with computer interfaces looks very impressive. It could be a
       | game-changer for accessibility and automation. I wonder how
       | robust it is with non-standard UI elements.
        
       | enjoylife wrote:
       | > It is not yet optimized for desktop OS-level control
       | 
        | Alas, AGI is not yet here. But I feel like if this OS level of
        | control were good enough, and the cost of the LLM in the loop
        | weren't bad, maybe that would be enough to kick-start
        | something akin to AGI.
        
         | alganet wrote:
         | I am curious. Why do you think controlling an OS (and not just
         | a browser) would be a move towards AGI?
        
       | mmaunder wrote:
       | I prepare to be disappointed every time I click on a Google AI
       | announcement. Which is so very unfortunate, given that they're
       | the source of LLMs. Come on big G!! Get it together!
        
       | orliesaurus wrote:
       | Does it know what's behind the "menu" of different apps? Or does
       | it have to click on all menus and submenus to find out?
        
       ___________________________________________________________________
       (page generated 2025-10-07 23:00 UTC)