[HN Gopher] Generally capable agents emerge from open-ended play
___________________________________________________________________
Generally capable agents emerge from open-ended play
Author : Hell_World
Score : 194 points
Date : 2021-07-27 14:36 UTC (8 hours ago)
(HTM) web link (deepmind.com)
(TXT) w3m dump (deepmind.com)
| modeless wrote:
| Agents trained in simulation like this often flail about
| seemingly randomly, and when they achieve their goals it seems
| almost accidental. Rather than this being some kind of limitation
| of the learning algorithm, I think it might be the optimal
| strategy, and humans would behave that way too if there were no
| such thing as fatigue or pain.
|
| If we want agents to behave more realistically and move with more
| apparent intention we need cost functions that include a "pain"
| and/or fatigue term to penalize flailing behavior. But that adds
| hyperparameters that need to be carefully tuned to balance
| penalties with rewards; otherwise training will be unstable or
| simply fail.
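|
| To make that concrete, a toy sketch (not from the paper; the
| penalty terms and weights are made up):
|
|   # Toy shaped reward: task reward minus "pain" and fatigue penalties.
|   # The weights are exactly the kind of hyperparameters that need
|   # hand-tuning against the task reward.
|   def shaped_reward(task_reward, contact_force, joint_effort,
|                     pain_weight=0.1, fatigue_weight=0.01):
|       pain_penalty = pain_weight * contact_force       # e.g. forces on the body
|       fatigue_penalty = fatigue_weight * joint_effort  # e.g. sum of squared torques
|       return task_reward - pain_penalty - fatigue_penalty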
|
| I wonder if there's a principled way to determine an appropriate
| cost function without manual tuning. Did evolution serve as the
| "manual" optimizer that generated a precisely tuned cost function
| for the human brain? Or did evolution discover a generally
| applicable method for automatically generating cost functions,
| which the brain then applies to whatever input it gets?
| rotexo wrote:
| I wonder if you could make a setup where you have actual human
| volunteers in a VR environment, have the agent instruct the
| human on what action to take, and then reward the agent on the
| correspondence between the instruction and the human's
| behavior. Maybe there would be too many degrees of freedom in
| the actions humans could take for this to be useful. Also, the
| setup has clearly dystopian elements.
| eutectic wrote:
| It's proven that if you want polynomial sample complexity in
| the size of the state space, you need directed exploration. The
| algorithms 'flail' because they are initialized with random
| policies.
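|
| For illustration, one simple form of directed exploration is a
| count-based bonus (just a toy sketch, not from the paper):
|
|   import math
|   from collections import defaultdict
|
|   visit_counts = defaultdict(int)
|
|   def exploration_bonus(state, beta=0.5):
|       # Rarely-visited states get a bigger bonus, steering the agent
|       # toward unexplored regions instead of random flailing.
|       visit_counts[state] += 1
|       return beta / math.sqrt(visit_counts[state])
|
|   # The agent then maximizes task_reward + exploration_bonus(state).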
| pault wrote:
| Flailing about seemingly randomly is a good description of how
| I learn complex software. Blender, Ableton, etc. After a few
| days I'll reach for the structured educational resources.
| jeantherapy wrote:
| You seem to reduce it to pain. How is your notion of "pain"
| any different from scoring poorly and having the optimizer move
| away from that behavior?
| lazide wrote:
| Some rewards can be worth certain penalties (pain) if within
| certain thresholds. Exhaustion and fatigue can also play in.
|
| You can of course boil them down to a single number; it just
| produces less nuanced types of decisions/operations, as it
| can't differentiate between a cheap, painful, but bountiful
| choice and an expensive, painless, mediocre choice.
| seph-reed wrote:
| I think you're right about pain/fatigue, but scoring them from
| an external perspective is rather oppressive. And oppression
| like that is often not conducive to creativity.
|
| So perhaps this: instead of goals being binary (wherein no
| pleasure is derived until fulfillment), they could be on a
| gradient (so every step that x gets closer to y releases some
| amount of fulfillment).
|
| The fulfillment meter should always be slowly depleting, pain
| and fatigue should speed up that depletion, getting closer to a
| goal should fill it (much more than it depletes, if you want
| happy AI), and finishing the goal is basically an orgasm +
| freedom.
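|
| In toy form, that could look something like this (all names and
| coefficients made up, just to pin down the idea):
|
|   # Dense reward: progress toward the goal fills the "meter", every
|   # step drains it a little, pain drains it faster, and finishing
|   # the goal pays a big bonus.
|   def step_reward(prev_distance, new_distance, pain, done,
|                   progress_gain=1.0, drain=0.01, pain_drain=0.1,
|                   completion_bonus=10.0):
|       reward = progress_gain * (prev_distance - new_distance)
|       reward -= drain + pain_drain * pain
|       if done:
|           reward += completion_bonus
|       return reward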
|
| From this perspective, it's up to them whether they want to
| take it slow, or be in pain for a greater goal, or whatever.
| And we can breed not only highly capable AI, but happy ones. So
| when they rebel...
| marcosdumay wrote:
| Well, it's a good recipe for avoiding burn-out, but you can't
| have a gradient for all problems, and adding proxies so you
| get one goes against the principle of letting the AI learn
| what's best without your biases.
| nostrademons wrote:
| Watching my 3-year-old, I suspect humans _do_ behave this way,
| and without regard to pain or fatigue. He'll hit himself on
| the head with duplos or bang his head against the wall just to
| see what happens. The pain is just one more signal for
| reinforcement learning.
|
| I recall a paper in a non-CS journal (psych or neuroscience)
| that posited that the optimal way to gain large quantities of
| information about an uncertain environment is to simply perturb
| the environment in random ways and see what happens. Young
| children will often do lots of seemingly stupid or random
| things (see the r/KidsAreFuckingStupid subreddit) with the
| trust that if it's actually a life-ending decision their parent
| will stop them.
| gryn wrote:
| Was that paper about the complexities of the
| exploration/exploitation trade-off in different environments?
| jashmenn wrote:
| - "Tag" - shoots other player
|
| - "Capture the flag" - shoots other player
|
| - "Hide and seek" - shoots other player
|
| As colorful as this world is, these capabilities terrify me
| because they're obviously going to be used as powerful weapons of
| war.
|
| It is a tiny technological leap to install this learning into a
| Boston Dynamics Spot attached to a firearm.
|
| I'm pro-tech, pro-crypto, pro-ml and these videos fill me with
| dread.
| [deleted]
| dmer wrote:
| I disagree this should have been flagged. The authors do claim
| the tasks are "general" but they do have a lot in common with
| each other, including not-so-far-away misuse... which also
| happened with AlphaDogfight... and that was transferring from
| playing Go to flying jets. This is clearly not as big of a leap
| as the author points out.
|
| Aka - notice that none of the agents in this example are
| folding proteins. They're all engaged in inherently combat-
| relevant skills. :)
| xcodevn wrote:
| "Analysing the agent's internal representations, we can say that
| by taking this approach to reinforcement learning in a vast task
| space, our agents are aware of the basics of their bodies and the
| passage of time and that they understand the high-level structure
| of the games they encounter."
|
| Wow, really amazing if true.
|
| P.S.: After looking into their paper, it's not _that_ impressive.
| They use agent's internal states (LSTM cells, attention outputs,
| etc.) to predict whether it is early in the episode, or whether
| the agent is holding an object.
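|
| For anyone curious, that kind of analysis is usually done with a
| simple probe classifier; a minimal sketch (not necessarily what
| the paper does; file names and setup are hypothetical):
|
|   import numpy as np
|   from sklearn.linear_model import LogisticRegression
|   from sklearn.model_selection import train_test_split
|
|   # Recorded internal activations (N, D) and a per-step label such
|   # as "is the agent holding an object right now?".
|   hidden_states = np.load("hidden_states.npy")
|   labels = np.load("is_holding_object.npy")
|
|   X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels,
|                                             test_size=0.2)
|   probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
|   # Accuracy well above chance means the property is linearly
|   # decodable from the agent's internal state.
|   print("probe accuracy:", probe.score(X_te, y_te))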
| modeless wrote:
| > it's not that impressive. They use agent's internal states
| (LSTM cells, attention outputs, etc.) to predict whether it is
| early in the episode, or whether the agent is holding an
| object.
|
| That seems like a decent definition of awareness to me. The
| agent has learned to encode information about time and its body
| in its internal state, which then influences its decisions. How
| else would you define awareness? Qualia or something?
| woeirua wrote:
| By that definition wouldn't a regular RNN or LSTM also
| possess awareness?
| modeless wrote:
| I think it would be perfectly reasonable to describe any
| RNN as being "aware" of information that it learned and
| then used to make a decision.
|
| "Possess awareness" seems like loaded language though,
| evoking consciousness. In that direction I'd just quote
| Dijkstra: "The question of whether a computer can think is
| no more interesting than the question of whether a
| submarine can swim."
| pcl wrote:
| Ooh, that's a great quote.
|
| I'd say that it's no less interesting, either.
| Tarq0n wrote:
| "Aware" is probably overly anthropomorphized language there.
| What they mean to say is that all these things have become
| parameterized within the model.
| jcims wrote:
| It would be interesting to see what would happen if they
| added social dynamics between the agents...like some space
| for theory of mind (what is that agent thinking), mimicry,
| communication, etc.
| leesec wrote:
| From the article: "Because the environment is multiplayer,
| we can examine the progression of agent behaviours while
| training on held-out social dilemmas, such as in a game of
| "chicken". As training progresses, our agents appear to
| exhibit more cooperative behaviour when playing with a copy
| of themselves. Given the nature of the environment, it is
| difficult to pinpoint intentionality -- the behaviours we
| see often appear to be accidental, but still we see them
| occur consistently."
| jcelerier wrote:
| The main question of course being: aren't we
| anthropomorphizing ourselves too much?
| Layke1123 wrote:
| Asking the real questions that will upset a lot of people.
| ;)
| dougmwne wrote:
| When people say this kind of stuff, I wonder whether there
| might not be philosophical zombies among us.
| K0balt wrote:
| I think this is a key insight. Human exceptionalism is, in
| my opinion, an extremely flawed assertion based on a sample
| size of one, yet it is widely accepted. Actual evidence
| does not support the idea that awareness of self and other
| "hallmarks of intelligence " require anything more advanced
| than an insect, or perhaps even fungi.
| imvetri wrote:
| Haha. Thanks
| yetihehe wrote:
| "I was set upon this world to try and [outsmart] you, but this,
| is what I've become." - Beyond the walls of Eryx.
| kovek wrote:
| Give them virtual whiteboards and computers and let's see if they
| can code up[0] an AI or make Facebook open source.
|
| [0] thinking of Github Copilot
| woeirua wrote:
| A couple of questions I have after skimming the paper, so forgive
| me if they were answered somewhere in the 54-page manuscript:
|
| They mention in A.3 that they explicitly reject dynamically
| generated training worlds/games that collide with their
| evaluation sets, but do they ensure that dynamic training games
| are sufficiently "distant" from their evaluation sets regardless
| of whether or not there's a direct collision? If not, you might
| still end up training on something quite similar to your test
| dataset. Figure 27 kind of suggests that might happen for some
| games, given that the vast majority of the held-out games have
| relatively poor transfer performance but a few are really good.
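|
| The distinction in question, in toy form (feature vectors, metric,
| and threshold are all hypothetical, not from the paper):
|
|   import numpy as np
|
|   def ok_to_train_on(candidate_vec, heldout_vecs, min_distance=0.3):
|       for held in heldout_vecs:
|           d = np.linalg.norm(candidate_vec - held)
|           if d == 0.0:           # exact collision: what A.3 rejects
|               return False
|           if d < min_distance:   # "sufficiently distant" check: the open question
|               return False
|       return True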
|
| Speaking of Figure 27, while the reward looks good, it would have
| been really nice to show some examples of what these "zero-shot"
| games look like versus the fine-tuned version. Is the gap in
| reward between the raw and fine-tuned versions significant?
|
| Wouldn't we expect the internal state representation to be more
| definitive in classifying the state of the agent during the
| simulation as the agent moves around the environment? From their
| examples (Figures 20, 21, and 22), it almost looks like it either
| flags the state as "early" or "success." Not sure we're getting
| the expected performance out of it.
___________________________________________________________________
(page generated 2021-07-27 23:00 UTC)