[HN Gopher] Gemini Robotics On-Device brings AI to local robotic...
___________________________________________________________________
Gemini Robotics On-Device brings AI to local robotic devices
Author : meetpateltech
Score : 140 points
Date : 2025-06-24 14:05 UTC (8 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| suninsight wrote:
| This will not end well.
| sajithdilshan wrote:
| I wonder what kind of guardrails (like Three Laws of Robotics)
| there are to prevent the robots going crazy while executing the
| prompts
| hn_throwaway_99 wrote:
| A power cord?
| sajithdilshan wrote:
| what if they are battery powered?
| msgodel wrote:
| Usually I put master disconnect switches on my robots just
| to make working on them safe. I use cheap toggle switches,
| though; I'm too cheap for the big red spinny ones.
| pixl97 wrote:
| [Robot learns to superglue the switch open]
| msgodel wrote:
| It's only going to do that if you RL it with episodes
| that include people shutting it down for safety. The RL
| I've done with my models has all been in simulations that
| don't even simulate the switch.
| pixl97 wrote:
| Which will likely work only for on-machine AI, but it
| seems to me any very complicated actions/interactions
| with the world may require external interactions with
| LLMs which know about these kinds of actions. Or, in the
| future, the on-device models will be far larger and more
| expansive, containing this kind of knowledge.
|
| For example, what if you need to train the model to keep
| unauthorized people from shutting it off?
| msgodel wrote:
| Having a robot near people with no master off switch
| sounds like a dumb idea.
| bigyabai wrote:
| That's what we use twelve gauge buckshot for, here in
| America.
| ctoth wrote:
| The laws of robotics were literally designed to cause conflict
| and facilitate strife in a fictional setting--I certainly hope
| no real goddamn system is built like that.
|
| > To ensure robots behave safely, Gemini Robotics uses a multi-
| layered approach. "With the full Gemini Robotics, you are
| connecting to a model that is reasoning about what is safe to
| do, period," says Parada. "And then you have it talk to a VLA
| that actually produces options, and then that VLA calls a low-
| level controller, which typically has safety critical
| components, like how much force you can move or how fast you
| can move this arm."
| conception wrote:
| Of course someone will. The terror nexus doesn't build
| itself, yet, you know.
| hlfshell wrote:
| The generally accepted term for the research around this in
| robotics is Constitutional AI
| (https://arxiv.org/abs/2212.08073), which has been
| cited/experimented with in several robotics VLAs.
| JumpCrisscross wrote:
| Is there any evidence we have the technical ability to put
| such ambiguous guardrails on LLMs?
| asadm wrote:
| in practice, those laws are bs.
| suyash wrote:
| What sort of hardware does the SDK run on? Can it run on a
| modern Raspberry Pi?
| ethan_smith wrote:
| According to the blog post, it requires an NVIDIA Jetson Orin
| with at least 8GB RAM, and they've optimized for Jetson AGX
| Orin (64GB) and Orin NX (16GB) modules.
| v9v wrote:
| Could you quote where in the blog post they claim that?
| CTRL+F "Jetson" gave no results in TFA.
| moffkalast wrote:
| Yeah they didn't really mention anything, I was almost
| getting my hopes up that Google might be announcing a
| modernized Coral TPU for the transformer age, but I guess
| not. It's probably all just API calls to their TPUv6 data
| centers lmao.
| martythemaniak wrote:
| You can think of these as essentially multi-modal LLMs, which
| is to say you can have very small/fast ones (SmolVLA - 0.5B
| params) that are good at specific tasks, and larger/slower more
| general ones (OpenVLA - a finetuned llama2 7B). So a rpi could
| be used for some very specific tasks, but even the more general
| ones could run on beefy consumer hardware.
| Toritori12 wrote:
| Does anyone know how easy it is to join the "trusted tester
| program" and if they offer modules that you can easily plug in
| to run the SDK?
| martythemaniak wrote:
| I've spent the last few months looking into VLAs and I'm
| convinced that they're gonna be a big deal, ie they very well
| might be the "chatgpt moment for robotics" that everyone's been
| anticipating. Multimodal LLMs already have a ton of built-in
| understanding of images and text, so VLAs are just regular MMLLMs
| that are fine-tuned to output a specific sequence of instructions
| that can be fed to a robot.
|
| OpenVLA, which came out last year, is a Llama2 fine tune with
| extra image encoding that outputs a 7-tuple of integers. The
| integers are rotation and translation inputs for a robot arm. If
| you give a vision llama2 a picture of an apple and a bowl and
| say "put the apple in the bowl", it already understands apples,
| bowls, knows the end state should be apple-in-bowl, etc. What's
| missing is a series of tuples that will correctly manipulate the
| arm to do that, and the way they did it is through a large
| number of short instruction videos.
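A toy sketch of the idea in the paragraph above: the model emits a short tuple of discrete action tokens per step, which get de-discretized into end-effector deltas plus a gripper command. The bin count and value range here are illustrative placeholders, not OpenVLA's actual configuration.

```python
# Hypothetical decoding of a 7-tuple of integer action tokens into
# (dx, dy, dz, droll, dpitch, dyaw, gripper). Bin count and range
# are made-up values for illustration only.

def decode_action(tokens, n_bins=256, low=-1.0, high=1.0):
    """Map 7 integer tokens in [0, n_bins) to continuous deltas."""
    assert len(tokens) == 7
    span = high - low
    # Take the center of each bin so token 128 maps to ~zero motion.
    return tuple(low + (t + 0.5) / n_bins * span for t in tokens)

# Mid-range tokens decode to near-zero motion on every axis.
action = decode_action([128] * 7)
```

The fine-tune then only has to learn to emit the right token sequence; all the scene understanding ("apple", "bowl", goal state) comes from the pretrained multimodal LLM underneath.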
|
| The neat part is that although everyone is focusing on robot arms
| manipulating objects at the moment, there's no reason this method
| can't be applied to any task. Want a smart lawnmower? It already
| understands "lawn" "mow", "don't destroy toy in path" etc, just
| needs a finetune on how to corectly operate a lawnmower. Sam
| Altman made some comments about having self-driving technology
| recently and I'm certain it's a chat-gpt based VLA. After all, if
| you give chatgpt a picture of a street, it knows what's a car,
| pedestrian, etc. It doesn't know how to output the correct
| turn/go/stop commands, and it does need a great deal of diverse
| data, but there's no reason why it can't do it.
| https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...
|
| Anyway, super exciting stuff. If I had time, I'd rig a snowblower
| with a remote control setup, record a bunch of runs and get a VLA
| to clean my driveway while I sleep.
| ckcheng wrote:
| VLA = Vision-language-action model:
| https://en.wikipedia.org/wiki/Vision-language-action_model
|
| Not https://public.nrao.edu/telescopes/VLA/ :(
|
| For completeness, MMLLM = Multimodal Large language model.
| generalizations wrote:
| I will be surprised if VLAs stick around, based on your
| description. That sounds far too low-level. Better hand that
| off to the 'nervous system' / kernel of the robot - it's not
| like humans explicitly think about the rotation of their hip &
| ankle when they walk. Sounds like a bad abstraction.
| Workaccount2 wrote:
| I don't think transformers will be viable for self driving cars
| until they can both:
|
| 1) Properly recognize what they are seeing without having to
| lean so hard on their training data. Go photoshop a picture of
| a cat and give it a 5th leg coming out of its stomach. No LLM
| will be able to properly count the cat's legs (they will keep
| saying 4 legs no matter how many times you insist they
| recount).
|
| 2) Be extremely fast at outputting tokens. I don't know where
| the threshold is, but it's probably going to be a non-thinking
| model (at first) and probably needs something like Cerebras or
| a diffusion architecture to get there.
| martythemaniak wrote:
| 1. Well, based on Karpathy's talks on Tesla FSD, his solution
| is to actually make the training set reflect everything you'd
| see in reality. The tricky part is that if something occurs
| 0.0000001% IRL and something else occurs 50% of the time,
| they both need to make up 5% of the training corpus. The thing
| with multimodal LLMs is that lidar/depth input can just be
| another input that gets encoded along with everything else,
| so for driving "there's a blob I don't quite recognize" is
| still a blob you have to drive around.
|
| 2. Figure has a dual-model architecture which makes a lot of
| sense: a 7B model that does higher-level planning and control
| and runs at 8Hz, and a tiny 0.08B model that runs at 200Hz
| and does the minute control outputs.
| https://www.figure.ai/news/helix
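The dual-rate idea can be sketched in a few lines: a slow "planner" refreshes a latent goal at 8Hz while a fast "controller" emits commands at 200Hz, always consuming the newest available plan. The two functions below are stand-ins for the big and small models, not Figure's actual Helix stack.

```python
# Toy dual-rate control loop: slow planner at 8 Hz, fast controller at
# 200 Hz. Both "models" are trivial placeholder functions.

PLANNER_HZ, CONTROL_HZ = 8, 200

def plan(observation):
    """Stand-in for the large planning model (runs at 8 Hz)."""
    return {"goal": observation["target"]}

def control(latent, state):
    """Stand-in for the small control model (runs at 200 Hz):
    take a proportional step toward the current latent goal."""
    return 0.1 * (latent["goal"] - state)

def run_one_second(observation, state=0.0):
    latent, commands = None, []
    for tick in range(CONTROL_HZ):                 # 200 control ticks
        if tick % (CONTROL_HZ // PLANNER_HZ) == 0:  # replan every 25 ticks
            latent = plan(observation)
        cmd = control(latent, state)
        state += cmd
        commands.append(cmd)
    return state, commands

state, commands = run_one_second({"target": 1.0})
```

The controller never blocks on the planner; it just reuses the last latent, which is what makes the 8Hz/200Hz split workable.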
| baron816 wrote:
| I'm optimistic about humanoid robotics, but I'm curious about the
| reliability issue. Biological limbs and hands are quite
| miraculous when you consider that they are able to constantly
| interact with the world, which entails some natural wear and
| tear, but then constantly heal themselves.
| marinmania wrote:
| It gets either very exciting or very spooky thinking of the
| possibilities in the near future.
|
| I had always assumed that such a robot would be very specific
| (like a cleaning robot) but it does seem like by the time they
| are ready they will be very generalizable.
|
| I know they would require quite a few sensors and motors, but
| compared to self-driving cars their liability would be less and
| they would use far less material.
| fragmede wrote:
| The exciting part comes when two robots are able to do
| repairs on each other.
| pryelluw wrote:
| 2 bots 1 bolt ?
| marinmania wrote:
| I think this is the spooky part. I feel dumb saying it, but
| is there a point where they are able to coordinate and
| build a factory to build chips/more of themselves? Or other
| things entirely?
| bamboozled wrote:
| Of course there is
| didip wrote:
| I think those problems can be solved with further research in
| material science, no? Combine that with very responsive but
| low-torque servos, and I think this is a solvable problem.
| michaelt wrote:
| It's a simple matter of the number of motors you have. [1]
|
| Assume every motor has a 1% failure rate per year.
|
| A boring wheeled roomba has 3 motors. That's a 2.9% failure
| rate per year, and 8.6% failures over 3 years.
|
| Assume a humanoid robot has 43 motors. That gives you a 35%
| failure rate per year, and 73% over 3 years. That ain't good.
|
| And not only is the humanoid robot less reliable, it's also
| 14.3x the price - because it's got 14.3x as many motors in
| it.
|
| [1] And bearings and encoders and gearboxes and control
| boards and stuff... but they're largely proportional to the
| number of motors.
| mewpmewp2 wrote:
| Would it be possible to reduce the failure rates?
| michaelt wrote:
| To an extent, yes.
|
| For example, an industrial robot arm with 6 motors
| achieves much higher reliability than a consumer roomba
| with 3 motors. They do this with more metal parts, more
| precision machining, much more generous design
| tolerances, and suchlike. Which they can afford by
| charging 100x as much per unit.
| ac29 wrote:
| The 1%/year failure rate appears to just be made up.
| There are plenty of electric motors that don't have
| anywhere near that failure rate (at least during the
| expected service life; failure rates will likely hit
| 1%/year or higher eventually).
|
| For example, do the motors in hard drives fail anywhere
| close to 1% a year in the first ~5 years? Backblaze data
| gives a total drive failure rate around 1% and I imagine
| most of those are not due to failure of motors.
| michaelt wrote:
| Yes, obviously that 1% figure is a simplification. Of
| course not all motors are created equal, and neither are
| all operating conditions!
|
| But the neat thing about my argument is it holds true
| regardless of the underlying failure rate!
|
| So long as your per-motor annual failure rate is >0,
| compounding it over 43 motors gives a bigger number than
| compounding it over 3.
| UltraSane wrote:
| Consumable components could be automatically replaced by other
| robots.
| zzzeek wrote:
| THANK YOU.
|
| Please make robots. LLMs should be put to work for *manual*
| tasks, not art/creative/intellectual tasks. The goal is to
| _improve_ humanity, not put us to work putting screws inside of
| iPhones
|
| (five years later)
|
| what do you mean you are using a robot for your drummer
| Workaccount2 wrote:
| I continue to be impressed by how Google stealth-releases fairly
| groundbreaking products, and then (usually) just kind of forgets
| about them.
|
| Rather than advertising blitz and flashy press events, they just
| do blog posts that tech heads circulate, forget about, and then
| wonder 3-4 years later "whatever happened to that?"
|
| This looks awesome. I look forward to someone else building a
| start-up on this and turning it into a great product.
| fusionadvocate wrote:
| Because the whole purpose of these kinds of projects at Google
| is to keep regulators at bay. They don't need these products in
| the sense of making money from them. They will just burn some
| money and move on, exactly the way they did hundreds of times.
| But what kind of company has such a free pass to burning money?
| The kind of company that is a monopoly. Monopolies are THAT
| profitable.
| jagger27 wrote:
| These are going to be war machines, make absolutely no mistake
| about it. On-device autonomy is the perfect foil to escape
| centralized authority and accountability. There's no human behind
| the drone to charge for war crimes. It's what they've always
| dreamed of.
|
| Who's going to stop them? Who's going to say no? The military
| contracts are too big to say no to, and they might not have a
| choice.
|
| The elimination of toil will mean the elimination of humans
| altogether. That's where we're headed. There will be no profitable
| life left for you, and you will be liquidated by "AI-Powered
| Automation for Every Decision"[0]. Every. Decision. It's so
| transparent. The optimists in this thread are baffling.
|
| 0: https://www.palantir.com/
| mateus1 wrote:
| MIT spinoff Google-owned Boston Dynamics pledged not to
| militarize their robots. Which is very hard to believe given
| they're backed by DARPA, the DoD's research and development arm.
| jagger27 wrote:
| Militarize is just bad marketing. Call them cleaning machines
| and put them to work on dirty things.
| paxys wrote:
| _Was_ owned by Google. Then Softbank. Now Hyundai.
| JumpCrisscross wrote:
| > _These are going to be war machines, make absolutely no
| mistake about it_
|
| Of course they will. Practically everything useful has a
| military application. I'm not sure why this is considered a hot
| take.
| jagger27 wrote:
| The difference between this machine and the ones that came
| before is that there won't have to be a human in the loop to
| execute mass murder.
| bamboozled wrote:
| How would these things be competitive with drones on the
| battlefield? They probably cost the equivalent of 1000
| autonomous drones and 100x the time and materials to make, way
| more power would be required to make them work too.
|
| Terminator is a good movie but in reality, a cheap autonomous
| drone would mess one of those up pretty good.
|
| I've seen some of the footage from Ukraine, drones are deadly,
| efficient, and terrifying on the battlefield. Even though
| those robots will get crazy maneuverable, it's going to be
| pretty hard to outrun an exploding drone.
|
| Maybe the Terminators will have shotguns, but I could imagine 5
| drones per terminator being pretty easy to achieve,
| considering they will be built by other autonomous robots.
| polskibus wrote:
| What is the model architecture? I'm assuming it's far away from
| LLMs, but I'm curious about knowing more. Can anyone provide
| links that describe architectures for VLA?
| KoolKat23 wrote:
| Actually very close to one I'd say.
|
| It's a "visual language action" VLA model "built on the
| foundations of Gemini 2.0".
|
| As Gemini 2.0 has native language, audio and video support, I
| suspect it has been adapted to include native "action" data
| too, perhaps only on output fine-tuning rather than
| input/output at training stage (given its Gemini 2.0
| foundation).
|
| Natively multimodal LLMs are basically brains.
| martythemaniak wrote:
| OpenVLA is basically a slightly modified, fine-tuned llama2.
| I found the launch/intro talk by lead author to be quite
| accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk
| moelf wrote:
| The MuJoCo link actually points to
| https://github.com/google-deepmind/aloha_sim
___________________________________________________________________
(page generated 2025-06-24 23:00 UTC)