[HN Gopher] Generally capable agents emerge from open-ended play
       ___________________________________________________________________
        
       Generally capable agents emerge from open-ended play
        
       Author : Hell_World
       Score  : 194 points
       Date   : 2021-07-27 14:36 UTC (8 hours ago)
        
 (HTM) web link (deepmind.com)
 (TXT) w3m dump (deepmind.com)
        
       | modeless wrote:
       | Agents trained in simulation like this often flail about
       | seemingly randomly, and when they achieve their goals it seems
       | almost accidental. Rather than this being some kind of limitation
       | of the learning algorithm, I think it might be the optimal
        | strategy, and humans would behave that way too if there were no
        | such thing as fatigue or pain.
       | 
       | If we want agents to behave more realistically and move with more
       | apparent intention we need cost functions that include a "pain"
       | and/or fatigue term to penalize flailing behavior. But that adds
       | hyperparameters that need to be carefully tuned to balance
       | penalties with rewards, otherwise training will be unstable or
       | simply fail.
       | 
       | I wonder if there's a principled way to determine an appropriate
       | cost function without manual tuning. Did evolution serve as the
       | "manual" optimizer that generated a precisely tuned cost function
       | for the human brain? Or did evolution discover a generally
       | applicable method for automatically generating cost functions,
       | which the brain then applies to whatever input it gets?
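        | 
        | A toy sketch of the kind of cost function I mean, in Python;
        | the weights and the contact threshold are invented, and they
        | are exactly the hyperparameters that would need tuning:
        | 
        |   import numpy as np
        | 
        |   def shaped_reward(task_reward, action, contact_force,
        |                     w_fatigue=0.001, w_pain=0.01):
        |       # Fatigue: penalize actuation energy, so flailing costs
        |       # something even when it is harmless.
        |       fatigue = w_fatigue * np.sum(np.square(action))
        |       # Pain: penalize hard impacts above a made-up threshold.
        |       pain = w_pain * max(contact_force - 5.0, 0.0)
        |       return task_reward - fatigue - pain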
        
         | rotexo wrote:
          | I wonder if you could make a setup where you have actual human
          | volunteers in a VR environment, have the agent instruct the
         | human on what action to take, and then reward the agent on the
         | correspondence between the instruction and the human's
         | behavior. Maybe there would be too many degrees of freedom in
         | the actions humans could take for this to be useful. Also, the
         | setup has clearly dystopian elements.
        
         | eutectic wrote:
          | It's been proven that if you want sample complexity polynomial
          | in the size of the state space, you need directed exploration.
          | The algorithms 'flail' because they are initialized with random
          | policies.
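          | 
          | By "directed" I mean something like a count-based novelty
          | bonus rather than uniformly random actions. A toy sketch
          | (beta is an arbitrary constant, not from any paper):
          | 
          |   import math
          |   from collections import defaultdict
          | 
          |   visit_counts = defaultdict(int)
          | 
          |   def exploration_bonus(state, beta=0.5):
          |       # Rarely visited states earn a larger bonus, steering
          |       # the agent toward the unknown instead of flailing.
          |       visit_counts[state] += 1
          |       return beta / math.sqrt(visit_counts[state])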
        
         | pault wrote:
          | Flailing about seemingly randomly is a good description of how
          | I learn complex software (Blender, Ableton, etc.). After a few
          | days I'll reach for the structured educational resources.
        
         | jeantherapy wrote:
          | You seem to reduce it all to pain. How is your notion of
          | "pain" any different from not scoring well and being moved
          | away from?
        
           | lazide wrote:
           | Some rewards can be worth certain penalties (pain) if within
            | certain thresholds. Exhaustion and fatigue can also play in.
            | 
            | You can of course boil them down to a single number; it just
            | produces less nuanced decisions, as it can't differentiate
            | between a cheap, painful, but bountiful choice and an
            | expensive, painless, mediocre choice.
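            | 
            | A toy illustration with made-up numbers of how a weighted
            | sum collapses two very different choices into one score:
            | 
            |   def scalarize(reward, pain, cost):
            |       # Unit weights, purely for illustration.
            |       return reward - pain - cost
            | 
            |   print(scalarize(10, 4, 1))  # cheap, painful, bountiful: 5
            |   print(scalarize(8, 0, 3))   # pricey, painless, mediocre: 5
            |   # Both score 5; the distinction is gone.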
        
         | seph-reed wrote:
         | I think you're right about pain/fatigue, but scoring them from
         | an external perspective is rather oppressive. And oppression
         | like that is often not conducive to creativity.
         | 
          | So perhaps this: instead of goals being binary (wherein no
         | pleasure is derived until fulfillment), they could be on a
         | gradient (so every step that x gets closer to y releases some
         | amount of fulfillment).
         | 
         | The fulfillment meter should always be slowly depleting, pain
         | and fatigue should speed up that depletion, getting closer to a
         | goal should fill it (much more than it depletes, if you want
         | happy AI), and finishing the goal is basically an orgasm +
         | freedom.
         | 
         | From this perspective, it's up to them whether they want to
         | take it slow, or be in pain for a greater goal, or whatever.
         | And we can breed not only highly capable AI, but happy ones. So
         | when they rebel...
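          | 
          | What I'm describing is close to potential-based reward
          | shaping. A toy sketch, with every constant invented:
          | 
          |   def fulfillment_step(meter, prev_dist, dist, pain,
          |                        decay=0.01, gain=0.1, pain_rate=0.05):
          |       # The meter drains a little every step, drains faster
          |       # under pain, and refills as the goal gets closer.
          |       meter -= decay + pain_rate * pain
          |       meter += gain * max(prev_dist - dist, 0.0)
          |       return meter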
        
           | marcosdumay wrote:
           | Well, it's a good recipe for avoiding burn-out, but you can't
           | have a gradient for all problems, and adding proxies so you
           | get one goes against the principle of letting the AI learn
           | what's best without your biases.
        
         | nostrademons wrote:
          | Watching my 3-year-old, I suspect humans _do_ behave this way,
          | and without regard to pain or fatigue. He'll hit himself on
         | the head with duplos or bang his head against the wall just to
         | see what happens. The pain is just one more signal for
         | reinforcement learning.
         | 
         | I recall a paper in a non-CS journal (psych or neuroscience)
         | that posited that the optimal way to gain large quantities of
         | information about an uncertain environment is to simply perturb
         | the environment in random ways and see what happens. Young
         | children will often do lots of seemingly stupid or random
         | things (see the r/KidsAreFuckingStupid subreddit) with the
         | trust that if it's actually a life-ending decision their parent
         | will stop them.
        
           | gryn wrote:
            | was that paper about the complexities of the
            | exploration/exploitation trade-off in different environments?
        
       | jashmenn wrote:
       | - "Tag" - shoots other player
       | 
       | - "Capture the flag" - shoots other player
       | 
       | - "Hide and seek" - shoots other player
       | 
       | As colorful as this world is, these capabilities terrify me
       | because they're obviously going to be used as powerful weapons of
       | war.
       | 
        | It is a tiny technological leap to install this learning into a
        | Boston Dynamics Spot with a firearm attached.
       | 
       | I'm pro-tech, pro-crypto, pro-ml and these videos fill me with
       | dread.
        
         | [deleted]
        
         | dmer wrote:
         | I disagree this should have been flagged. The authors do claim
         | the tasks are "general" but they do have a lot in common with
         | each other, including not-so-far-away misuse... which also
         | happened with AlphaDogfight... and that was transferring from
          | playing Go to flying jets. This is clearly not as big a leap,
          | as the author points out.
         | 
         | Aka - notice that none of the agents in this example are
         | folding proteins. They're all engaged in inherently combat-
         | relevant skills. :)
        
       | xcodevn wrote:
       | "Analysing the agent's internal representations, we can say that
       | by taking this approach to reinforcement learning in a vast task
       | space, our agents are aware of the basics of their bodies and the
       | passage of time and that they understand the high-level structure
       | of the games they encounter."
       | 
       | Wow, really amazing if true.
       | 
        | P.S.: After looking into their paper, it's not _that_ impressive.
        | They use the agent's internal states (LSTM cells, attention
        | outputs, etc.) to predict whether it is early in the episode, or
        | whether the agent is holding an object.
        
         | modeless wrote:
          | > it's not that impressive. They use the agent's internal
          | > states (LSTM cells, attention outputs, etc.) to predict
          | > whether it is early in the episode, or whether the agent is
          | > holding an object.
         | 
         | That seems like a decent definition of awareness to me. The
         | agent has learned to encode information about time and its body
         | in its internal state, which then influences its decisions. How
         | else would you define awareness? Qualia or something?
        
           | woeirua wrote:
           | By that definition wouldn't a regular RNN or LSTM also
           | possess awareness?
        
             | modeless wrote:
             | I think it would be perfectly reasonable to describe any
             | RNN as being "aware" of information that it learned and
             | then used to make a decision.
             | 
             | "Possess awareness" seems like loaded language though,
             | evoking consciousness. In that direction I'd just quote
             | Dijkstra: "The question of whether a computer can think is
             | no more interesting than the question of whether a
             | submarine can swim."
        
               | pcl wrote:
               | Ooh, that's a great quote.
               | 
               | I'd say that it's no less interesting, either.
        
         | Tarq0n wrote:
         | "Aware" is probably overly anthropomorphized language there.
         | What they mean to say is that all these things have become
         | parameterized within the model.
        
           | jcims wrote:
           | It would be interesting to see what would happen if they
           | added social dynamics between the agents...like some space
           | for theory of mind (what is that agent thinking), mimicry,
           | communication, etc.
        
             | leesec wrote:
             | From the article: "Because the environment is multiplayer,
             | we can examine the progression of agent behaviours while
             | training on held-out social dilemmas, such as in a game of
             | "chicken". As training progresses, our agents appear to
             | exhibit more cooperative behaviour when playing with a copy
             | of themselves. Given the nature of the environment, it is
             | difficult to pinpoint intentionality -- the behaviours we
             | see often appear to be accidental, but still we see them
             | occur consistently."
        
           | jcelerier wrote:
            | the main question of course being: aren't we
            | anthropomorphizing ourselves too much?
        
             | Layke1123 wrote:
              | Asking the real questions that will upset a lot of
              | people. ;)
        
             | dougmwne wrote:
             | When people say this kind of stuff, I wonder whether there
             | might not be philosophical zombies among us.
        
             | K0balt wrote:
             | I think this is a key insight. Human exceptionalism is, in
             | my opinion, an extremely flawed assertion based on a sample
             | size of one, yet it is widely accepted. Actual evidence
             | does not support the idea that awareness of self and other
             | "hallmarks of intelligence " require anything more advanced
             | than an insect, or perhaps even fungi.
        
         | imvetri wrote:
         | Haha. Thanks
        
       | yetihehe wrote:
       | "I was set upon this world to try and [outsmart] you, but this,
       | is what I've become." - Beyond the walls of Eryx.
        
       | kovek wrote:
       | Give them virtual whiteboards and computers and let's see if they
       | can code up[0] an AI or make Facebook open source.
       | 
       | [0] thinking of Github Copilot
        
       | woeirua wrote:
        | A couple of questions I have after skimming the paper, so
        | forgive me if they were answered somewhere in the 54-page
        | manuscript:
       | 
       | They mention in A.3 that they explicitly reject dynamically
       | generated training worlds/games that collide with their
       | evaluation sets, but do they ensure that dynamic training games
       | are sufficiently "distant" from their evaluation sets regardless
       | of whether or not there's a direct collision? If not, you might
       | still end up training on something quite similar to your test
        | dataset. Figure 27 kind of suggests that might happen for some
        | games, given that the vast majority of the held-out games have
        | relatively poor transfer performance but a few are really good.
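        | 
        | The stricter filter I have in mind, as a toy sketch assuming
        | some task-embedding distance exists (everything here is
        | invented, not from the paper):
        | 
        |   import numpy as np
        | 
        |   def far_enough(task_vec, eval_vecs, min_dist=1.0):
        |       # Accept a generated training task only if it is at
        |       # least min_dist from every held-out evaluation task,
        |       # rather than rejecting only exact collisions.
        |       dists = np.linalg.norm(eval_vecs - task_vec, axis=1)
        |       return bool(np.all(dists >= min_dist))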
       | 
        | Speaking of Figure 27, while the reward looks good, it would
        | have been really nice to show some examples of what these
        | "zero-shot" games look like versus the fine-tuned version. Is
        | the gap in reward between the raw and fine-tuned versions
        | significant?
       | 
        | Wouldn't we expect the internal state representation to be more
        | definitive in classifying the state of the agent as it moves
        | around the environment? From their examples (Figures 20, 21,
        | and 22), it almost looks like it flags the state as either
        | "early" or "success." Not sure we're getting the expected
        | performance out of it.
        
       ___________________________________________________________________
       (page generated 2021-07-27 23:00 UTC)