[HN Gopher] The Surprising Effectiveness of PPO in Cooperative M...
       ___________________________________________________________________
        
       The Surprising Effectiveness of PPO in Cooperative Multi-Agent
       Games
        
       Author : jonbaer
       Score  : 48 points
       Date   : 2021-07-14 15:13 UTC (7 hours ago)
        
 (HTM) web link (bair.berkeley.edu)
 (TXT) w3m dump (bair.berkeley.edu)
        
       | isaacimagine wrote:
       | PPO is awesome, but so is GPT-style reward-trajectory prediction!
       | http://arxiv.org/pdf/2106.01345v1.
       | 
        | As an RL hobbyist, I'd love to see some sort of hybrid approach.
       | Thoughts?
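
        (The linked paper, arXiv:2106.01345, is the Decision Transformer:
        a causal, GPT-style transformer is conditioned on interleaved
        (return-to-go, state, action) tokens and trained to predict
        actions. A minimal PyTorch sketch of that idea follows; the class
        name, sizes, and hyperparameters are illustrative, not taken from
        the paper.)

            import torch
            import torch.nn as nn

            class TrajectoryModel(nn.Module):
                """Causal model over (return-to-go, state, action) tokens."""

                def __init__(self, state_dim, act_dim, d_model=64,
                             n_layers=2, n_heads=2, max_steps=20):
                    super().__init__()
                    self.embed_rtg = nn.Linear(1, d_model)           # return-to-go
                    self.embed_state = nn.Linear(state_dim, d_model)
                    self.embed_act = nn.Linear(act_dim, d_model)
                    self.pos = nn.Embedding(3 * max_steps, d_model)  # 3 tokens/step
                    layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                                       batch_first=True)
                    self.encoder = nn.TransformerEncoder(layer, n_layers)
                    self.predict_act = nn.Linear(d_model, act_dim)

                def forward(self, rtg, states, actions):
                    # shapes: rtg (B,T,1), states (B,T,S), actions (B,T,A)
                    B, T, _ = states.shape
                    tok = torch.stack([self.embed_rtg(rtg),
                                       self.embed_state(states),
                                       self.embed_act(actions)], dim=2)
                    tok = tok.reshape(B, 3 * T, -1)        # ..., R_t, s_t, a_t, ...
                    tok = tok + self.pos(torch.arange(3 * T))
                    causal = torch.triu(torch.full((3 * T, 3 * T), float("-inf")),
                                        diagonal=1)
                    h = self.encoder(tok, mask=causal)     # attend to the past only
                    h = h.reshape(B, T, 3, -1)[:, :, 1]    # output at each state token
                    return self.predict_act(h)             # predicted action per step

        (Training would simply regress the predicted actions onto logged
        actions; a hybrid might, speculatively, use a sequence model like
        this as the policy network and fine-tune it with PPO's on-policy
        updates.)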
        
       | bsder wrote:
       | Are these things amazingly effective or are they simply
       | demonstrating that Starcraft/DOTA aren't as difficult as we
       | thought?
        
         | seabird wrote:
          | Neither. Once knowledgeable people get a read on these types
          | of things, they can usually handle them. The OpenAI Dota 2
          | "team" was open for the public to play -- it was certainly
          | very good, but multiple teams beat it, sometimes even multiple
          | times in a row. It was great at cheesy stuff like superhuman
          | Force Staff plays that humans could never reliably pull off,
          | but it could be beaten through macro pressure.
        
       | ddoran wrote:
       | PPO = Proximal Policy Optimization
       | 
       | [https://openai.com/blog/openai-baselines-ppo/]
        
         | jdlyga wrote:
         | Thank you!
        
           | Robotbeat wrote:
            | Indeed. I looked all over the webpage for the definition but
            | couldn't find it. Even Googling initially failed.
           | https://arxiv.org/abs/1707.06347
        
             | throwaway81523 wrote:
              | Yeah, did the same, then looked at the linked article. Its
              | abstract:
              | 
              |     Proximal Policy Optimization (PPO) is a popular
              |     on-policy reinforcement learning algorithm but is
              |     significantly less utilized than off-policy learning
              |     algorithms in multi-agent settings. This is often due
              |     to the belief that on-policy methods are significantly
              |     less sample efficient than their off-policy
              |     counterparts in multi-agent problems. In this work, we
              |     investigate Multi-Agent PPO (MAPPO), a variant of PPO
              |     which is specialized for multi-agent settings. Using a
              |     1-GPU desktop, we show that MAPPO achieves surprisingly
              |     strong performance in three popular multi-agent
              |     testbeds: the particle-world environments, the
              |     Starcraft multi-agent challenge, and the Hanabi
              |     challenge, with minimal hyperparameter tuning and
              |     without any domain-specific algorithmic modifications
              |     or architectures. In the majority of environments, we
              |     find that compared to off-policy baselines, MAPPO
              |     achieves strong results while exhibiting comparable
              |     sample efficiency. Finally, through ablation studies,
              |     we present the implementation and algorithmic factors
              |     which are most influential to MAPPO's practical
              |     performance.
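
        (For context on what the abstract's on-policy update optimizes:
        PPO's core trick is a clipped surrogate objective that keeps the
        updated policy close to the policy that collected the data
        (arXiv:1707.06347, linked above). A minimal PyTorch sketch is
        below; the function name is made up, and the 0.2 clip range
        follows the original paper's experiments, not MAPPO's tuned
        settings.)

            import torch

            def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
                """Clipped surrogate policy loss from the PPO paper."""
                # probability ratio pi_new(a|s) / pi_old(a|s)
                ratio = torch.exp(logp_new - logp_old)
                unclipped = ratio * advantages
                clipped = torch.clamp(ratio, 1 - clip_eps,
                                      1 + clip_eps) * advantages
                # take the pessimistic (min) surrogate, negate to get a loss
                return -torch.min(unclipped, clipped).mean()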
        
         | cratermoon wrote:
         | https://jonathan-hui.medium.com/rl-proximal-policy-optimizat...
        
       ___________________________________________________________________
       (page generated 2021-07-14 23:01 UTC)