[HN Gopher] DrEureka: Language Model Guided SIM-to-Real Transfer
       ___________________________________________________________________
        
       DrEureka: Language Model Guided SIM-to-Real Transfer
        
       Author : jasondavies
       Score  : 41 points
       Date   : 2024-05-03 16:48 UTC (6 hours ago)
        
 (HTM) web link (eureka-research.github.io)
 (TXT) w3m dump (eureka-research.github.io)
        
       | throwup238 wrote:
       | This research is all beyond me so maybe someone can explain: How
       | does this compare to the state of the art in using simulators to
       | train physical robots? Does using transformers help in any way or
       | can this just as easily be done with other architectures?
       | 
       | To the uninitiated this looks cool as all heck and yet another
       | step towards the Star Trek future where we do everything in a
       | simulator first and it always kinda just works in the real world
       | (plot requirements notwithstanding).
       | 
       | Although I can also hear the distant sounds of a hundred military
       | R&D labs booting up Metalhead [1] simulators.
       | 
       | Edit: Looks like the previous SOTA was still a manual process
       | where the user had to come up with a reward function that
        | actually rewards the actions they wanted the algorithm to
        | learn. This research uses language models to do that tedious
        | step instead; a rough sketch of the loop is below.
       | 
       | [1] https://en.wikipedia.org/wiki/Metalhead_(Black_Mirror)
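        | 
        | Sketch of the loop as I understand it from the abstract (my own
        | Python illustration, not their code; query_llm and train_in_sim
        | are made-up stand-ins for a real chat API and a real simulator):
        | 
        |     import random
        | 
        |     TASK = "Quadruped must balance and walk on a yoga ball."
        | 
        |     def query_llm(prompt: str) -> str:
        |         # Stand-in: a real run would send the prompt to an LLM
        |         # and get back generated Python source for a reward.
        |         return ("def reward(state, action):\n"
        |                 "    return -abs(state['tilt'])\n")
        | 
        |     def train_in_sim(reward_fn) -> float:
        |         # Stand-in: train a policy in simulation with reward_fn
        |         # and return an evaluation score.
        |         return random.random()
        | 
        |     best_score, best_reward, feedback = float("-inf"), None, ""
        |     for i in range(5):
        |         prompt = (f"Task: {TASK}\n"
        |                   "Write reward(state, action) -> float.\n"
        |                   + feedback)
        |         code = query_llm(prompt)
        |         ns = {}
        |         exec(code, ns)             # trust-the-LLM step
        |         score = train_in_sim(ns["reward"])
        |         if score > best_score:
        |             best_score, best_reward = score, ns["reward"]
        |         feedback = f"Last reward scored {score:.2f}; improve it.\n"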
        
         | claytonwramsey wrote:
         | Your edit is roughly correct. It's a bit odd that they're
         | comparing a single human-made policy with the very best of the
         | DrEureka outputs; in practice, I would expect to make multiple
         | different reward functions and then do validation on the
         | functions and hyperparameters based on the resulting trained
         | models. However, their comparison isn't necessarily wrong,
         | since it seems (I could be wrong) that they used the reward
         | function from prior papers in the field of policy learning.
         | 
         | If you're interested in methods for actually learning policies
         | for these sorts of dynamic motions, note that this paper is
          | simply applying proximal-policy optimization (a minimal
          | sketch of its clipped objective is below). They're pulling in
          | the training and implementation methods from Margolis's [1]
          | and Shan's [2] work.
         | 
         | So, in sum, the contribution of this paper is exclusively the
         | method for generating reward functions (which is still pretty
         | cool!!!!!), not all the learning-based policy stuff.
         | 
         | [1]:
         | https://web.archive.org/web/20220703005502id_/http://www.rob...
         | [2]: https://arxiv.org/pdf/2309.06440
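          | 
          | For reference, the heart of PPO is just a clipped surrogate
          | loss on the policy probability ratio. A generic sketch in
          | PyTorch (not lifted from either codebase):
          | 
          |     import torch
          | 
          |     def ppo_clip_loss(new_logp, old_logp, advantages,
          |                       clip_eps=0.2):
          |         # Ratio pi_new(a|s) / pi_old(a|s) for the actions
          |         # taken during the rollout.
          |         ratio = torch.exp(new_logp - old_logp)
          |         unclipped = ratio * advantages
          |         clipped = torch.clamp(ratio, 1 - clip_eps,
          |                               1 + clip_eps) * advantages
          |         # Pessimistic (min) bound, negated so it can be
          |         # minimized by gradient descent.
          |         return -torch.min(unclipped, clipped).mean()
          | 
          |     # Toy usage with fake rollout data:
          |     new_logp = torch.randn(64, requires_grad=True)
          |     old_logp = new_logp.detach() + 0.01 * torch.randn(64)
          |     advantages = torch.randn(64)
          |     ppo_clip_loss(new_logp, old_logp, advantages).backward()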
        
       | refulgentis wrote:
       | Every single second of every example has a handler holding a
       | leash - and not just holding it, holding it without any slack.
       | 
        | Blindingly obvious interference from the Ouija board effect.
       | 
        | I don't mean to denigrate the work; I believe the researchers
        | are honest, and I hope there are demos beyond the published
        | one. It's just, at best, an obvious unforced error that leaves
        | a big question open.
       | 
        | EDIT: Replier below shared a gif with failures; tl;dr, this
        | looks like two different experiment protocols, one for success,
        | one for failure. https://imgur.com/a/DmepBVU
        
         | Imnimo wrote:
         | This sample on Twitter shows how other controllers fail:
         | 
         | https://twitter.com/JasonMa2020/status/1786433841613390023
         | 
         | I agree it's hard to tell whether the controller learned with
         | DrEureka would be sufficient without the leash, but I'm at
         | least convinced that the leash is not sufficient to hold a
         | robot on the ball without a decently competent controller.
        
           | refulgentis wrote:
           | Oh my, that looks quite damning. https://imgur.com/a/DmepBVU
           | 
            | The good-case leash is held taut at half the distance of the
            | failures, at an angle parallel to the bot and orthogonal to
            | that of the failures.
           | 
            | The failures are all held with slack, on a leash at 2x the
            | distance of the successes, at an angle orthogonal to the bot.
           | 
            | (do correct me if we're seeing opposite things; those clips
            | are very small and I last took physics...16 years ago :< )
        
             | Imnimo wrote:
             | Hmm, I do see what you mean.
        
       | FrustratedMonky wrote:
        | Kind of like how a human visualizes before a sport?
       | 
        | Like how visualizing free throws in basketball makes you
        | measurably better, without actually doing free throws for real?
        
       | canadiantim wrote:
        | So the robot dog that's going to kill me in the near future will
        | at least be adorably balancing on a big rubber ball
        
         | codetrotter wrote:
         | Death by giant rubber ball.
         | 
         | In a scene reminiscent of the giant boulder rolling after
         | Indiana Jones, a robot dog is balancing on top of an enormous
         | rubber ball down the streets of some big city, flattening
         | everything in its way.
         | 
         | Cronch, cronch, cronch, go the cars.
         | 
         | Squish, squish, squish, go the people.
        
           | magicalhippo wrote:
           | > Death by giant rubber ball.
           | 
           | As long as it's not white...
           | 
           | https://www.youtube.com/watch?v=I6Ffr1U7KMY
        
       ___________________________________________________________________
       (page generated 2024-05-03 23:00 UTC)