[HN Gopher] RT-2: Vision-Language-Action Models
___________________________________________________________________
RT-2: Vision-Language-Action Models
Author : elsewhen
Score : 51 points
Date : 2025-01-01 12:29 UTC (10 hours ago)
(HTM) web link (robotics-transformer2.github.io)
(TXT) w3m dump (robotics-transformer2.github.io)
| xnx wrote:
 | Impressive work. Connecting this with Nvidia's move to make
 | robotics their next focus, is there a need to have powerful
 | compute local to the robot? Cloud latency would seem to be fine
 | for the speed of these robotic arms.
| YeGoblynQueenne wrote:
| The problem with all the very impressive videos on that page is
| that we have no idea how many attempts were made before the robot
| could successfully, e.g. "put strawberry into the correct bowl".
 | In that task there are four bowls, so a random choice of bowl
 | would be correct 25% of the time. How many times did the robot
 | put the strawberry e.g. in the bowl of apples? And that's
 | assuming "the correct bowl" is the one with the strawberries
 | (which is a big assumption: the strawberry should go with the
 | others of its kind, says who? How often can the robot put the
 | strawberry in the bowl with the apples if that's what we want it
 | to do?).
|
| Plotted results show around 50% average performance on "unseen"
| tasks, environments, objects etc, which sounds a lot like success
| follows some kind of random distribution. That's not a great way
| to engender trust in the "emergent" abilities of a robotic system
| to generalise to unseen tasks etc. Blame bad statistics if you
| get a strawberry in the eye, or a banana in the ear.
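The commenter's concern is quantifiable: with four bowls, a random baseline is 25%, so whether a reported ~50% success rate is distinguishable from chance depends entirely on trial counts, which the page does not report. A minimal sketch of the check one would want, using purely hypothetical numbers (20 trials, 10 successes):

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of seeing at
    least k successes in n trials if each succeeds with chance p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical figures: 4 bowls -> random baseline p = 0.25.
# If the robot succeeded 10 times out of 20, how surprising is
# that under pure chance?
p_value = binom_tail(20, 10, 0.25)
print(f"P(>=10/20 successes by chance) = {p_value:.4f}")
```

With these made-up counts the tail probability is small, so 50% over 20 trials would in fact be well above a 25% random baseline; the point stands that without published trial counts, readers cannot run this check at all.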
| smokel wrote:
| For better or for worse, it is probably time to get accustomed
| to this. Early AI operated with symbols, and it has been clear
| for decades that that was a dead-end. Contemporary AI is
| stochastic, and it works a lot better for many applications
| than previous attempts, _even_ considering the inherent
| uncertainty and errors.
|
| Note that at the level of quantum physics one might not be able
| to trust CPU instructions to be faultless. It is all about
| getting the right error margins.
| YeGoblynQueenne wrote:
 | Accustomed to tech demos that never lead to real-world
 | capabilities? That's what I'm pointing out in my comment.
|
| Btw, computers are symbol manipulation machines and in
| general that's how we understand computation: manipulating
| symbols according to a set of rules; like Turing machines.
| Stochastic algorithms also ultimately work that way, and they
| will continue to until we can run all the modern AI stuff on,
| say, analog computers.
| byyoung3 wrote:
| this is a year and a half old
| pkkkzip wrote:
 | thanks for pointing this out. i knew not to get excited every
 | time something is posted on the front page.
| GaggiX wrote:
| (2023)
| mkagenius wrote:
| > We represent the robot actions as text strings as shown below.
| An example of such a string could be a sequence of robot action
| token numbers: "1 128 91 241 5 101 127 217".
|
 | Training with numbers like this might be a little problematic; I
 | have tried to fine-tune GPT-4o-mini with very little success
 | (just me?)
|
 | On the other hand, I found [1] that Gemini and Molmo locate
 | elements on screen much better than 4o.
|
| 1. https://github.com/BandarLabs/clickclickclick
| yorwba wrote:
| They do not turn the actions into text that is then tokenized,
| but generate tokens directly. So the action token 128 doesn't
| necessarily correspond to the tokenization of the number 128
 | when it appears in text input. (For PaLI-X they exploit the fact
 | that integers up to 1000 have unique tokens and use those
 | directly for the actions; for PaLM-E, they instead hijack the
 | 256 least frequently used tokens.)
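The tokenization scheme being discussed (a sketch under assumptions, not RT-2's actual code) amounts to uniformly quantizing each continuous action dimension into 256 bins, then emitting the bin indices as tokens. The 8-dimensional action layout and ranges below are illustrative only:

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in
    [0, n_bins - 1] by clipping to its range and uniformly
    quantizing. The resulting integers can then be mapped onto a
    language model's token ids (e.g. reserved or rarely used ones)."""
    action = np.clip(action, low, high)
    norm = (action - low) / (high - low)      # normalize to [0, 1]
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

# Hypothetical 8-D action: terminate flag, 3 position deltas,
# 3 rotation deltas, gripper extension.
low = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([1.0, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
action = np.array([0.0, 0.06, -0.02, 0.0, 0.1, 0.0, -0.3, 1.0])

tokens = discretize_action(action, low, high)
print(" ".join(str(t) for t in tokens))
```

The key distinction yorwba draws is that these integers are bin indices, not text: bin 128 is mapped straight to some token id, rather than being spelled out as the string "128" and run through the normal tokenizer.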
___________________________________________________________________
(page generated 2025-01-01 23:00 UTC)