[HN Gopher] RT-2: Vision-Language-Action Models
       ___________________________________________________________________
        
       RT-2: Vision-Language-Action Models
        
       Author : elsewhen
       Score  : 51 points
       Date   : 2025-01-01 12:29 UTC (10 hours ago)
        
 (HTM) web link (robotics-transformer2.github.io)
 (TXT) w3m dump (robotics-transformer2.github.io)
        
       | xnx wrote:
        | Impressive work. Connecting with Nvidia's move to make robotics
        | their next focus, is there a need for powerful compute local to
        | the robot? Cloud latency would seem to be fine for the speed of
        | these robotic arms.
        
       | YeGoblynQueenne wrote:
       | The problem with all the very impressive videos on that page is
       | that we have no idea how many attempts were made before the robot
       | could successfully, e.g. "put strawberry into the correct bowl".
        | In that task there are four bowls, so a random choice of bowl
        | would be correct 25% of the time. How many times did the robot
        | put the strawberry in, e.g., the bowl of apples? And that's
        | assuming "the correct bowl" is the one with the strawberries
        | (which is a big assumption: the strawberry should go with the
        | others of its kind, says who? How often can the robot put the
        | strawberry in the bowl with the apples if that's what we want
        | it to do?).
       | 
       | Plotted results show around 50% average performance on "unseen"
        | tasks, environments, objects, etc., which sounds a lot like success
       | follows some kind of random distribution. That's not a great way
       | to engender trust in the "emergent" abilities of a robotic system
       | to generalise to unseen tasks etc. Blame bad statistics if you
       | get a strawberry in the eye, or a banana in the ear.
        
         | smokel wrote:
         | For better or for worse, it is probably time to get accustomed
         | to this. Early AI operated with symbols, and it has been clear
         | for decades that that was a dead-end. Contemporary AI is
          | stochastic, and for many applications it works a lot better
          | than previous attempts, _even_ considering the inherent
          | uncertainty and errors.
         | 
         | Note that at the level of quantum physics one might not be able
         | to trust CPU instructions to be faultless. It is all about
         | getting the right error margins.
        
           | YeGoblynQueenne wrote:
            | Accustomed to tech demos that never lead to real-world
           | capabilities? That's what I'm pointing out in my comment.
           | 
           | Btw, computers are symbol manipulation machines and in
           | general that's how we understand computation: manipulating
           | symbols according to a set of rules; like Turing machines.
           | Stochastic algorithms also ultimately work that way, and they
           | will continue to until we can run all the modern AI stuff on,
           | say, analog computers.
        
       | byyoung3 wrote:
       | this is a year and a half old
        
         | pkkkzip wrote:
         | thanks for pointing this out. i knew not to get excited
          | every time something is posted on the frontpage.
        
       | GaggiX wrote:
       | (2023)
        
       | mkagenius wrote:
       | > We represent the robot actions as text strings as shown below.
       | An example of such a string could be a sequence of robot action
       | token numbers: "1 128 91 241 5 101 127 217".
       | 
        | Training with numbers like this might be a little problematic; I
        | have tried to fine-tune GPT-4o-mini with very little success
        | (just me?).
       | 
        | On the other hand, I found [1] that Gemini and Molmo can locate
        | elements on screen much better than 4o.
       | 
       | 1. https://github.com/BandarLabs/clickclickclick
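         
        A minimal sketch (not RT-2's code) of the discretization the
        quoted passage describes: each action dimension is mapped to one
        of 256 integer bins, and the bins are written out as a
        space-separated string. The value range, bin count, and action
        layout below are assumptions for illustration.
         
          import numpy as np
         
          def encode_action(action, low=-1.0, high=1.0, bins=256):
              # Clip each dimension to the assumed range, map it to an
              # integer bin in [0, bins-1], and join the bins as text.
              action = np.clip(np.asarray(action, dtype=np.float32), low, high)
              scaled = (action - low) / (high - low) * (bins - 1)
              return " ".join(str(i) for i in np.round(scaled).astype(int))
         
          # A hypothetical 8-dimensional arm action (layout assumed).
          print(encode_action([0.0, 0.5, -0.2, 0.9, -1.0, 0.3, 0.7, 1.0]))
          # prints: 128 191 102 242 0 166 217 255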
        
         | yorwba wrote:
         | They do not turn the actions into text that is then tokenized,
         | but generate tokens directly. So the action token 128 doesn't
         | necessarily correspond to the tokenization of the number 128
          | when it appears in text input. (Except that for PaLI-X they
          | make use of the fact that integers up to 1000 have unique
          | tokens and do use those for the actions. But for PaLM-E, they
          | hijack the 256 least frequently used tokens instead.)
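         
        A rough, self-contained illustration (assumed details, not the
        paper's code) of the two token mappings described above: reusing
        the tokenizer's existing integer tokens (the PaLI-X case) versus
        mapping action bins onto a reserved block of rarely used
        vocabulary ids (the PaLM-E case).
         
          BINS = 256
         
          def palix_style_ids(bin_ids, text_token_id_of):
              # PaLI-X case: integers up to 1000 already have unique
              # tokens, so bin 128 reuses the text token for "128".
              return [text_token_id_of(str(b)) for b in bin_ids]
         
          def palme_style_ids(bin_ids, reserved_start):
              # PaLM-E case: the 256 least frequently used vocabulary ids
              # are repurposed; bin i maps to a fixed offset, independent
              # of how "128" is tokenized as text.
              return [reserved_start + b for b in bin_ids]
         
          # Toy vocabulary and offset, purely illustrative.
          toy_vocab = {str(i): 1000 + i for i in range(BINS)}
          print(palix_style_ids([1, 128, 91], toy_vocab.__getitem__))
          # -> [1001, 1128, 1091]
          print(palme_style_ids([1, 128, 91], reserved_start=32000))
          # -> [32001, 32128, 32091]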
        
       ___________________________________________________________________
       (page generated 2025-01-01 23:00 UTC)