[HN Gopher] Gemini 2.5 Computer Use model
___________________________________________________________________
Gemini 2.5 Computer Use model
Author : mfiguiere
Score : 250 points
Date : 2025-10-07 19:49 UTC (3 hours ago)
(HTM) web link (blog.google)
(TXT) w3m dump (blog.google)
| strangescript wrote:
| I assume its tool calling and structured output are way better,
| but this model isn't in Studio unless it's being silently subbed
| in.
| phamilton wrote:
| Just tried it in an existing coding agent and it rejected the
| requests because computer tools weren't defined.
| omkar_savant wrote:
| We can definitely make the docs clearer here, but the model
| requires using the computer_use tool. If you have custom
| tools, you'll need to exclude predefined tools if they clash
| with our action space.
|
| See this section:
| https://googledevai.devsite.corp.google.com/gemini-
| api/docs/...
|
| And the repo has a sample setup for using the default
| computer use tool: https://github.com/google/computer-use-
| preview
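|
| A minimal sketch of that setup with the Python google-genai SDK
| (the model id and field names here are assumptions taken from
| the preview repo and docs above, so double-check them there):
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client()  # picks up GEMINI_API_KEY from the env
|
|     # Enable the predefined computer_use tool; leave out any custom
|     # tools that clash with its action space.
|     config = types.GenerateContentConfig(
|         tools=[types.Tool(computer_use=types.ComputerUse(
|             environment=types.Environment.ENVIRONMENT_BROWSER))],
|     )
|
|     response = client.models.generate_content(
|         model="gemini-2.5-computer-use-preview-10-2025",  # assumed id
|         contents="Open example.com and click the first link",
|         config=config,
|     )
|     # The response contains proposed UI actions (function calls)
|     # that a client loop executes before sending a screenshot back.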
| xnx wrote:
| I've had good success with the Chrome devtools MCP
| (https://github.com/ChromeDevTools/chrome-devtools-mcp) for
| browser automation with Gemini CLI, so I'm guessing this model
| will work even better.
| arkmm wrote:
| What sorts of automations were you able to get working with the
| Chrome dev tools MCP?
| odie5533 wrote:
| Not OP, but in my experience, Jest and Playwright are so much
| faster that it's not worth doing much with the MCP. It's a
| neat toy, but it's just too slow for an LLM to try to control
| a browser using MCP calls.
| atonse wrote:
| Yeah, I think it would be better to have the model write out
| Playwright scripts than the way it's doing it right now (or at
| least navigate manually first and then, based on that, write a
| Playwright TypeScript script for future tests).
|
| Because right now it's way too slow: perform an action, then
| read the results, then wait for the next tool call, and so on.
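|
| The artifact that produces is just a plain script the agent (or
| CI) can replay with no LLM in the loop. A rough sketch of the
| idea using Playwright's Python API (the comment above says
| TypeScript, but it's the same shape; the URL and selectors here
| are made up):
|
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch()
|         page = browser.new_page()
|         page.goto("https://example.com/login")
|         # Selectors recorded during the model's one-off manual run.
|         page.fill("#email", "user@example.com")
|         page.fill("#password", "correct horse battery staple")
|         page.click("button[type=submit]")
|         page.wait_for_url("**/dashboard")
|         browser.close()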
| omneity wrote:
| This is basically our approach with Herd[0]. We operate agents
| that develop, test, and heal trails[1, 2], which are packaged
| browser automations that don't require browser-use LLMs to run
| and are therefore much cheaper and more reliable. Trail
| automations are then abstracted as a REST API and MCP[3], which
| can be used as simple functions called from your code, by your
| own agent, or any combination of the two.
|
| You can build your own trails, publish them on our registry,
| compose them, and more. You can also run them in a distributed
| fashion over several Herd clients, where we take care of the
| signaling and communication while you simply call functions.
| The CLI and the npm and Python packages [4, 5] might be
| interesting as well.
|
| Note: The automation stack is entirely home-grown to enable
| distributed orchestration and doesn't rely on Puppeteer or
| Playwright, but the browser automation API[6] is relatively
| similar to ease adoption. We also don't use the Chrome DevTools
| Protocol and therefore have a different tradeoff footprint.
|
| 0: https://herd.garden
|
| 1: https://herd.garden/trails
|
| 2: https://herd.garden/docs/trails-automations
|
| 3: https://herd.garden/docs/reference-mcp-server
|
| 4: https://www.npmjs.com/package/@monitoro/herd
|
| 5: https://pypi.org/project/monitoro-herd/
|
| 6: https://herd.garden/docs/reference-page
| iLoveOncall wrote:
| This has absolutely nothing in common with a model for computer
| use... This uses pre-defined tools provided in the MCP server
| by Google, nothing to do with a general model that's supposed
| to work with any software.
| cryptoz wrote:
| Computer Use models are going to ruin simple honeypot form fields
| meant to detect bots :(
| layman51 wrote:
| You mean the ones where people add a question that is like
| "What is 10+3?"
| jebronie wrote:
| I just tried to submit a contact form with it. It successfully
| solved the ReCaptcha but failed to fill in a required field and
| got stuck. We're safe.
| phamilton wrote:
| It successfully got through the captcha at
| https://www.google.com/recaptcha/api2/demo
| siva7 wrote:
| probably because its IP is coming from Google's own subnet
| asadm wrote:
| isn't it coming from a Browserbase container?
| ripbozo wrote:
| Interestingly, the IP I got when prompting `what is my IP`
| was `73.120.125.54`, which is a residential Comcast IP.
| martinald wrote:
| Looks like Browserbase has proxies, which will often be
| residential IPs.
| jampa wrote:
| The automation is powered through Browserbase, which has a
| captcha solver. (Whether it is automated or human, I don't
| know.)
| simonw wrote:
| Post edited: I was wrong about this. Gemini tried to solve the
| Google CAPTCHA but it was actually Browserbase that did the
| solve, notes here:
| https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
| pants2 wrote:
| Interesting that they're allowing Gemini to solve CAPTCHAs,
| because OpenAI's agent detects CAPTCHAs and forces user input
| for them despite being fully able to solve them.
| dude250711 wrote:
| Have average Google developers been told/hinted that their
| bonuses/promotions will be tied to their proactivity in using
| Gemini for project work?
| peddling-brink wrote:
| > bonuses/promotions
|
| more like continued employment.
| teaearlgraycold wrote:
| I know there was a memo telling Googlers they are expected to
| use AI at work and that their performance is expected to
| increase as a result.
| dude250711 wrote:
| HBO's Silicon Valley ended way too soon. The plot pretty much
| writes itself.
| Imustaskforhelp wrote:
| Don't worry, maybe someone will create AI slop for this on
| Sora 2 or the like (this was satire).
|
| On a serious note, what the fuck is happening in the world.
| password54321 wrote:
| It doesn't seem like it makes sense to train AI around human
| user interfaces, which aren't really efficient. It's like
| building a mechanical horse.
| pixl97 wrote:
| Right, let's make APIs for everything...
|
| [Looks around and sees people not making APIs for everything]
|
| Well that didn't work.
| odie5533 wrote:
| Every website and application is just layers of data.
| Playwright and similar tools have options for taking snapshots
| that contain data like text, forms, buttons, etc. that can be
| interacted with on a site. All the calls a website makes are
| just APIs. Even a native application is made up of WinForms
| that can be inspected.
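|
| For instance, Playwright can hand you that structure directly
| instead of pixels (a small sketch with the Python API; the
| target URL is just an example and the snapshot's exact shape
| varies by page and Playwright version):
|
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch()
|         page = browser.new_page()
|         page.goto("https://news.ycombinator.com")
|         # Accessibility tree: roles, names, and values for the
|         # interactive elements, no screenshots involved.
|         tree = page.accessibility.snapshot()
|         if tree:
|             print(tree["role"], len(tree.get("children", [])))
|         browser.close()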
| pixl97 wrote:
| Ah, so now you're turning LLMs into web browsers capable of
| parsing JavaScript to figure out what a human might be looking
| at. Let's see how many levels deep we can go.
| measurablefunc wrote:
| Just inspect the memory content of the process. It's all
| just numbers at the end of the day & algorithms do not
| have any understanding of what the numbers mean other
| than generating other numbers in response to the input
| numbers. For the record I agree w/ OP: screenshots are
| not a good interface, for the same reasons that trains,
| subways, & dedicated lanes for mass transit are obviously
| superior to cars & their attendant headaches.
| ssl-3 wrote:
| Maybe some day, sure. We may eventually live in a utopia
| where everyone has quick, efficient, accessible mass
| transit available that allows them to move between any
| two points on the globe with unfettered grace.
|
| That'd be neat.
|
| But for now: The web exists, and is universal. We have
| programs that can render websites to an image in memory
| (solved for ~30 years), and other programs that can parse
| images of fully-rendered websites (solved for at least a
| few years), along with bots that can click on links
| (solved much more recently).
|
| Maybe tomorrow will be different.
| measurablefunc wrote:
| Point was process memory is the source of truth,
| everything else is derived & only throws away information
| that a neural network can use to make better decisions.
| Presentation of data is irrelevant to a neural network,
| it's all just numbers & arithmetic at the end of the day.
| wahnfrieden wrote:
| It's not about efficiency but access. Many services do not
| provide programmatic access.
| CuriouslyC wrote:
| We're training natural language models to reason by emulating
| reasoning in natural language, so it's very on brand.
| bonoboTP wrote:
| It's on brand for stuff that works. Expert systems and formal,
| symbolic, rules-based reasoning were tried; they failed. Real
| life is messy and fat-tailed.
| CuriouslyC wrote:
| And yet we give agents deterministic tools to use rather
| than tell them to compute everything in model!
| bonoboTP wrote:
| Yes, and here they also operate deterministic GUI tools.
| Thing is, many GUI programs are not designed so well.
| Their best interface and the only interface they were
| tested and designed for is the visual one.
| michaelt wrote:
| In my country there's a multi-airline API for booking plane
| tickets, but the cheapest of economy carriers only accept
| bookings directly on their websites.
|
| If you want to make something that can book _every_ airline?
| Better be able to navigate a website.
| odie5533 wrote:
| You can navigate a website without visually decoding the
| image of a website.
| bonoboTP wrote:
| Except if it's a messy div soup with various shitty absolute
| and relative pixel offsets, where the only way to know what
| refers to what is by rendering it and using gestalt
| principles.
| measurablefunc wrote:
| None of that matters to neural networks.
| bonoboTP wrote:
| It does, because it's hard to infer where each element
| will end up in the render. A checkbox may be set up in a
| shitty way such that the corresponding text label is not
| properly placed in the DOM, so it's hard to tell what the
| checkbox controls just from the DOM tree. You have to take
| into account the styling and pixel placement, i.e. render
| it properly and look at it.
|
| That's just one obvious example, but the principle holds
| more generally.
| measurablefunc wrote:
| Spatial continuity has nothing to do w/ how neural
| networks interpret an array of numbers. In fact, there is
| nothing about the topology of the input that is in any way
| relevant to what calculations are done by the network.
| You are imposing an anthropomorphic structure that does
| not exist anywhere in the algorithm & how it processes
| information. Here is an example to demonstrate my point:
| https://x.com/s_scardapane/status/1975500989299105981
| bonoboTP wrote:
| It would have to implicitly render the HTML+CSS to know
| which two elements visually end up next to each other, if
| the markup is spaghetti and badly done.
| measurablefunc wrote:
| The linked post demonstrates arbitrary re-ordering of
| image patches. Spatial continuity is not relevant to
| neural networks.
| bonoboTP wrote:
| That's ridiculous, sorry. If that were so, we wouldn't
| have positional encodings in vision transformers.
| ionwake wrote:
| Why are you talking about image processing? The guy
| you're talking to isn't.
| measurablefunc wrote:
| What do you suppose "render" means?
| bonoboTP wrote:
| The original comment I replied to said "You can navigate
| a website without visually decoding the image of a
| website." I replied that decoding is necessary to know
| where the elements will end up in a visual arrangement,
| because often that carries semantics. A label that is
| rendered next to another element can be crucial for
| understanding the functioning of the program. It's
| nontrivial just from the HTML or whatever tree structure
| where each element will appear in 2D after rendering.
| TulliusCicero wrote:
| This is just like the comments suggesting we need sensors and
| signs specifically for self-driving cars for them to work.
|
| It'll never happen, so companies need to deal with the reality
| we have.
| password54321 wrote:
| We can build tons of infrastructure for cars that didn't
| exist before but can't for other things anymore? Seems like
| society is just becoming lethargic.
| jklinger410 wrote:
| Why do you think we have fully self-driving cars instead of
| just more simplistic beacon systems? Why doesn't McDonald's
| have a fully automated kitchen?
|
| New technology is slow due to risk aversion; it's very rare for
| people to just tear up what they already have to re-implement
| new technology from the ground up. We always have to shoehorn
| new technology into old systems to prove it first.
|
| There are just so many factors that get solved by working with
| what already exists.
| layman51 wrote:
| About your self-driving car point, I feel like the approach
| I'm seeing is akin to designing a humanoid robot that uses
| its robotic feet to control the brake and accelerator pedals,
| and its hand to move the gear selector.
| bonoboTP wrote:
| Yeah, that would be pretty good honestly. It could
| immediately upgrade every car ever made to self driving and
| then it could also do your laundry without buying a new
| washing machine and everything else. It's just hard to do.
| But it will happen.
| layman51 wrote:
| Yes, it sounds very cool and sci-fi, but having a
| humanoid control the car seems less safe than having the
| spinning cameras and other sensors that are missing from
| older cars or those that weren't specifically built to be
| self-driving. I suppose this is why even human drivers
| are assisted by automatic emergency braking.
|
| I am more leaning into the idea that an efficient self-
| driving car wouldn't even need to have a steering wheel,
| pedals, or thin pillars to help the passengers see the
| outside environment or be seen by pedestrians.
|
| The way this ties back to the computer use models is that
| a lot of webpages have stuff designed for humans that
| would make it difficult for a model to navigate them well.
| I think this was the goal of the "semantic web".
| iAMkenough wrote:
| I could add self-driving to my existing fleet? Sounds
| intriguing.
| golol wrote:
| If we could build mechanical horses, they would be absolutely
| amazing!
| ivape wrote:
| What you say is 100% true until it's not. It seems like a weird
| thing to say (what I'm saying), but please consider that we're
| in a time period where everything we say is true minute by
| minute, and no more. It could be that the next version of this
| just works, and works really well.
| aidenn0 wrote:
| Reminds me of WALL-E where there is a keypad with a robot
| finger to press buttons on it.
| ramoz wrote:
| This will never hit a production enterprise system without some
| form of hooks/callbacks in place to enforce governance.
|
| That's obviously much harder with a UI than with agent events
| like those below.
|
| https://docs.claude.com/en/docs/claude-code/hooks
|
| https://google.github.io/adk-docs/callbacks/
| peytoncasper wrote:
| Hi! I work in identity products at Browserbase. I've spent a
| fair amount of time lately thinking about how to layer RBAC
| across the web.
|
| Do you think callbacks are how this gets done?
| ramoz wrote:
| Disclaimer: I'm a cofounder; we focus on critical spaces with
| AI. I was also behind the feature request for Claude Code
| hooks.
|
| But my bet: we will not deploy a single agent into any real
| environment without deterministic guarantees. Hooks are a
| means...
|
| Browserbase with hooks would be really powerful: governance
| beyond RBAC (but of course enabling relevant guardrailing as
| well - "does the agent have permission to access this
| SharePoint right now, within this context, to conduct action
| x?").
|
| I would love to meet with you, actually; my shop cares
| intimately about agent verification and governance. We're soon
| to release the tool I originally designed for Claude Code
| hooks.
| CuriouslyC wrote:
| I feel like screenshots should be the last thing you reach for.
| There's a whole universe of data from accessibility subsystems.
| ekelsen wrote:
| And there are all sorts of situations where they don't work.
| When they do work it's great, but if they don't and you rely on
| them, you have nothing.
| CuriouslyC wrote:
| Oh yeah, using all available data channels in proportion to
| their cost and utility is the right choice, 100%.
| bonoboTP wrote:
| The rendered visual layout is designed to be spatially
| organized so that it makes sense perceptually. It's a bit like
| PDFs. I imagine the underlying hierarchy tree can be quite
| messy and spaghetti, so your best bet is to use the app in the
| form the devs intended and tested it for.
|
| I think screenshots are a really good and robust idea. It
| bothers the more structured-minded people, but apps are often
| not built so well. They are built up to the point that they
| _look_ fine and people are able to use them. I'm pretty sure
| people who rely on accessibility systems have lots of
| complaints about this.
| CuriouslyC wrote:
| The progressives were pretty good at pushing accessibility in
| applications. It's not perfect, but every company I've worked
| with since the mid-2010s has made a big to-do about
| accessibility. For stuff on Linux you can instrument
| observability in a lot of different ways that are more
| efficient than screenshots, so I don't think it's generally
| the right way to move forward, but screenshots are universal
| and we already have capable vision models, so it's sort of a
| local optimization move.
| whinvik wrote:
| My general experience has been that Gemini is pretty bad at tool
| calling. The recent Gemini 2.5 Flash release actually fixed some
| of those issues but this one is Gemini 2.5 Pro with no indication
| about tool calling improvements.
| TIPSIO wrote:
| Painfully slow
| John7878781 wrote:
| That doesn't matter so much when it can happen in the
| background.
| Oras wrote:
| It is actually quite good at following instructions, but I tried
| clicking on job application links, and since they open in a new
| window, it couldn't find the new window. I suppose it might be an
| issue with BrowserBase, or just the way this demo was set up.
| MiguelG719 wrote:
| are you running into this issue on gemini.browserbase.com or
| the google/computer-use-preview github repo?
| mianos wrote:
| I sure hope this is better than pathetically useless. I assume it
| is to replace the extremely frustrating Gemini for Android. If I
| have a Bluetooth headset and I try "play music on Spotify", it
| fails about half the time. Even with YouTube Music. I could not
| believe it was so bad, so I just sat at my desk with the helmet
| on and tried it over and over. It seems to recognise the speech
| but simply fails to do anything. Brand new Pixel 10. The old
| speech recognition system was way dumber, but it actually worked.
| bsimpson wrote:
| I was riding my motorcycle the other day, and asked my helmet
| to "call <friend>." Gemini infuriatingly replied "I cannot
| directly make calls for you. Is there something else I can help
| you with?" This absolutely used to work.
|
| Reminds me of an anecdote where Amazon invested however many
| person-lives in building AI for Alexa, only to discover that
| alarms, music, and weather make up the large majority of things
| people actually use smart speakers for. They're making these
| things worse at their main jobs so they can sell the sizzle of
| AI to investors.
| mosura wrote:
| One of the slightly buried stories here is BrowserBase
| themselves. Great stuff.
| bonoboTP wrote:
| There are some absolutely atrocious UIs out there for many office
| workers, who spend hours clicking buttons, opening popup after
| popup, clicking repetitively on checkboxes, etc. For example,
| entering travel costs or some such in academia and elsewhere. You
| have no idea how annoying that type of work is; you pull out your
| hair. Why don't they make better UIs, you ask? If you ask, you
| have no idea how bad things are. Because they don't care, there
| is no communication, it seems fine, the software creators are
| hard to reach, and the software is approved by people who never
| used it and decide based on gut feel, PowerPoints, and feature
| tickmarks. Even big name brands are horrible at this, like SAP.
|
| If such AI tools allow automating this soul-crushing drudgery, it
| will be great. I know that you can technically script things with
| Selenium, AutoHotkey, and whatnot. But you can imagine that's a
| nonstarter in a regular office. This kind of tool could make
| things like that much more efficient. And it's not like it will
| then obviate the jobs entirely (at least not right away). These
| offices often have immense backlogs and are understaffed as is.
| numpad0 wrote:
| How big are the Gemini 2.5 (Pro/Flash/Lite) models in parameter
| count, in experts' guesstimation? Are they towards 50B, 500B, or
| bigger still? Even Flash feels smart enough for vibe coding
| tasks.
| thomasm6m6 wrote:
| 2.5 Flash Lite replaced 2.0 Flash Lite, which replaced 1.5 Flash
| 8B, so one might suspect 2.5 Flash Lite is well under 50B.
| jcims wrote:
| (Just using the browserbase demo)
|
| Knowing it's technically possible is one thing, but giving it a
| short command and seeing it go log in to a site, scroll around,
| reply to posts, etc. is eerie.
|
| Also, it tied me at Wordle today, making the same mistake I did
| on the second-to-last guess. Too bad you can't talk to it while
| it's working.
| iAMkenough wrote:
| Not great at Google Sheets. Repeatedly overwrites all previous
| columns while trying to populate new columns.
|
| > I am back in the Google Sheet. I previously typed "Zip Code" in
| F1, but it looks like I selected cell A1 and typed "A". I need to
| correct that first. I'll re-type "Zip Code" in F1 and clear A1.
| It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and
| typed "Zip Code", but then maybe clicked A1 again.
| asadm wrote:
| This is great. Now I want it to run faster than I can do it.
| omkar_savant wrote:
| Hey - I'm on the team that launched this. Please let me know if
| you have any questions!
| SoKamil wrote:
| How are you going to deal with reCAPTCHA and ad impressions?
| Sounds like a conflict of interest.
| martinald wrote:
| Interesting, it seems to use 'pure' vision and x/y coords for
| clicking stuff. Most other browser automation with LLMs I've seen
| uses the DOM/accessibility tree, which absolutely churns through
| context but is much more 'accurate' at clicking stuff because it
| can use the exact text/elements in a selector.
|
| Unfortunately it really struggled in the demos for me. It took
| nearly 18 attempts to click the comment link on the HN demo, each
| a few pixels off.
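|
| The client-side loop that implies is fairly simple: the model
| returns an action plus coordinates, the harness executes it
| against the page and sends a fresh screenshot back. A rough
| sketch (the action name and the 0-999 coordinate normalization
| are assumptions based on the preview repo, and the handler here
| is hypothetical):
|
|     # Hypothetical dispatcher for one model-proposed action,
|     # using a Playwright page as the execution target.
|     def execute(page, action, viewport_w=1280, viewport_h=800):
|         if action["name"] == "click_at":
|             # Assumed: coordinates normalized to a 0-999 grid.
|             x = action["args"]["x"] / 1000 * viewport_w
|             y = action["args"]["y"] / 1000 * viewport_h
|             page.mouse.click(x, y)
|         # The screenshot becomes the next observation for the model.
|         return page.screenshot()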
| dekhn wrote:
| Many years ago I was sitting at a red light on a secondary road,
| where the primary cross road was idle. It seemed like you could
| solve this using a computer vision camera system that watched the
| primary road and when it was idle, would expedite the secondary
| road's green light.
|
| This was long before computer vision was mature enough to do
| anything like that and I found out that instead, there are
| magnetic systems that can detect cars passing over - trivial
| hardware and software - and I concluded that my approach was just
| far too complicated and expensive.
|
| Similarly, when I look at computers, I typically want the ML/AI
| system to operate on structured data that is codified for
| computer use. But I guess the world is complicated enough and
| computers got fast enough that having an AI look at a computer
| screen and move/click a mouse makes sense.
| ge96 wrote:
| It's funny, I'll sometimes scoot forward or rock my car, but I'm
| not sure if it's just coincidence. Also, a lot of stoplights now
| have that tall white camera on top.
| bozhark wrote:
| Like the sensor that detects flashing lights for first
| responders.
| Spooky23 wrote:
| Sometimes the rocking helps with a ground loop that isn't
| working well.
| netghost wrote:
| There are several mechanisms. The most common is (or at least
| was) a loop detector under the road that triggers when a
| vehicle is over it. Sometimes, if you're not quite over it or
| it's somewhat faulty, moving around will trigger it.
| trenchpilgrim wrote:
| FWIW those type of traffic cameras are in common use.
| https://www.milesight.com/company/blog/types-of-traffic-came...
| dekhn wrote:
| If I read the web page correctly, they don't actually use that
| as a solution for shortening a red - IMHO that has a very high
| safety bar compared to the more common uses. But I'd be happy
| to hear this is something that Just Works in the Real World
| with a reasonable false positive and false negative rate.
| trenchpilgrim wrote:
| Yes they do, it's listed under Traffic Sensor Cameras.
| jlhawn wrote:
| The camera systems are also superior from an infrastructure
| maintenance perspective. You can update them with new
| capabilities or do re-striping without tearing up the
| pavement.
| dktp wrote:
| I cycle a lot. Outdoors I listen to podcasts and the fact that
| I can say "Hey Google, go back 30sec" to relisten to something
| (or forward to skip ads) is very valuable to me.
|
| Indoors I tend to cast some show or YouTube video. Often enough
| I want to change the YouTube video or show using voice commands.
| I can do this for YouTube, but the results are horrible unless I
| know exactly which video I want to watch. For other services
| it's largely not possible at all.
|
| In a perfect world, Google would provide superb APIs for these
| integrations and all app providers would integrate them and keep
| them up to date. But if we can bypass that and get good results
| across the board, I would find it very valuable.
|
| I understand this is a very specific scenario, but one I would
| be excited about nonetheless.
| yunyu wrote:
| There is a lot of pretraining data available around screen
| recordings and mouse movements (Loom, YouTube, etc.). There is
| much less pretraining data available around navigating
| accessibility trees or DOM structures. Many use cases may also
| need to be image-aware (document scan parsing, looking at
| images), and keyboard/video/mouse-based models generalize to
| more applications.
| chrisfosterelli wrote:
| Ironically, now that computer vision is commonplace, the cameras
| you talk about have become increasingly popular over the years,
| because the magnetic systems don't do a very good job of
| detecting cyclists and the cameras double as a congestion-
| monitoring tool for city staff.
| AaronAPU wrote:
| I'm looking forward to a desktop OS optimized version so it can
| do the QA that I have no time for!
| alexnewman wrote:
| A year ago I did something that used RAG and accessibility mode
| to navigate UIs.
| dekhn wrote:
| I just have to say that I consider this an absolutely hilarious
| outcome. For many years, I focused on tech solutions that
| eliminated the need for a human to be in front of a computer
| doing tedious manual operations. For a wide range of activities,
| I proposed we focus on "turning everything in the world into
| database objects" so that computers could operate on them with
| minimal human effort. I spent significant effort in machine
| learning to achieve this.
|
| It didn't really occur to me that you could just train a computer
| to work directly on the semi-structured human world data (display
| screen buffer) through a human interface (mouse + keyboard).
|
| However, I fully support it (like all the other crazy ideas on
| the web that beat out the "theoretically better" approaches). I
| do not think it is unrealistic to expect that within a decade, we
| could have computer systems that can open Chrome, start a video
| chat with somebody, go back and forth for a while to achieve a
| task, then hang up... without the person on the other end ever
| knowing they were dealing with a computer instead of a human.
| hipassage wrote:
| hi there, interesting post
| realty_geek wrote:
| Absolutely hilarious how it gets stuck trying to solve the
| captcha each time. I had to explicitly tell it not to go to
| Google first.
|
| In the end I did manage to get it to play the housepriceguess
| game:
|
| https://www.youtube.com/watch?v=nqYLhGyBOnM
|
| I think I'll make that my equivalent of Simon Willison's "pelican
| riding a bicycle" test. It is fairly simple to explain but seems
| to trip up different LLMs in different ways.
| GeminiFan2025 wrote:
| The new Gemini 2.5 model's ability to understand and interact
| with computer interfaces looks very impressive. It could be a
| game-changer for accessibility and automation. I wonder how
| robust it is with non-standard UI elements.
| enjoylife wrote:
| > It is not yet optimized for desktop OS-level control
|
| Alas, AGI is not yet here. But I feel like if this OS-level
| control were good enough, and the cost of the LLM in the loop
| weren't bad, maybe that would be enough to kick-start something
| akin to AGI.
| alganet wrote:
| I am curious. Why do you think controlling an OS (and not just
| a browser) would be a move towards AGI?
| mmaunder wrote:
| I prepare to be disappointed every time I click on a Google AI
| announcement. Which is so very unfortunate, given that they're
| the source of LLMs. Come on big G!! Get it together!
| orliesaurus wrote:
| Does it know what's behind the "menu" of different apps? Or does
| it have to click on all menus and submenus to find out?
___________________________________________________________________
(page generated 2025-10-07 23:00 UTC)