[HN Gopher] Why neural networks struggle with the Game of Life (2020)
       ___________________________________________________________________
        
       Why neural networks struggle with the Game of Life (2020)
        
       Author : DeathArrow
       Score  : 94 points
       Date   : 2024-05-17 09:26 UTC (13 hours ago)
        
 (HTM) web link (bdtechtalks.com)
 (TXT) w3m dump (bdtechtalks.com)
        
       | leric wrote:
       | A good example of irreducible complexity
        
         | rustcleaner wrote:
         | based
         | 
          | Simple systems built on simple rules creating computationally
          | universal behavior are both unintuitive to and underrated by
          | the common man.
        
       | nottorp wrote:
       | > Interestingly, no matter how complex a grid becomes, you can
       | predict the state of each cell in the next timestep with the same
       | rules.
       | 
       | How is that interesting? It's the definition of the Game of
       | Life... it's not like it's a natural system that you don't know
       | the full rules for...
        
         | brookst wrote:
         | It's interesting the way Go is interesting: very simple rules
         | can produce extreme complexity. At least, I think that's
         | interesting.
        
           | yumong wrote:
           | Stephen Wolfram pushes this to the extreme.
           | 
           | Surprised that nobody mentioned him yet.
        
           | nottorp wrote:
            | I wouldn't compare the Game of Life, which has exactly one
            | "next state", with something like Go.
        
             | brookst wrote:
              | It's interesting that you wouldn't, yet I would. They
              | aren't isomorphic, for sure.
             | 
             | Go's complexity comes from two players alternately picking
             | one out of a very large number of options.
             | 
             | GoL's complexity comes from a very large number of nodes
             | "picking" between two states. That's not precise, just
             | illustrating that there is some symmetry of
             | simplicity/complexity, at least to my eyes.
        
             | SkyBelow wrote:
             | Why not?
             | 
              | We could even think of both as collections of 3D structures
              | showing all valid structures possible for a board of size n
              | by n. There are some differences: every 3D Conway structure
              | has a unique top layer, while Go does not. But that seems
              | like an overall minor difference. There are many more Go
              | shapes than Conway shapes for the same n, but both are
              | already so numerous that I'm not sure that's a difference
              | worth stopping the comparison over.
        
         | bell-cot wrote:
         | From a quick skim, then string search for "interesting" - I'd
         | say that word is fluff, added to keep their audience reading
         | through their dull background intro.
        
       | kozlovsky wrote:
       | If we show a neural network some examples from the Game of Life
       | and expect it to master the rules of a cellular automaton, then
       | aren't we asking too much from it? In some ways, this is
       | analogous to expecting that if we show the neural network
       | examples from the physical world, it will automatically derive
       | Newton's three laws. Not every person observing the world around
       | him can independently deduce Newton's laws from scratch, no
       | matter how many examples he sees.
        
         | passwordoops wrote:
         | "then aren't we asking too much from it"
         | 
         | Not according to the hype merchants, hucksters, and VCs who
         | think word models are displaying emergence and we're 6 months
         | from AGI, if only we can have more data
        
           | nottorp wrote:
           | Let's be snarky a bit:
           | 
           | Can you do a neural network that, given a starting position
           | of the game of life, decides if it cycles or not? ;)
           | 
           | Ok, not cycles... dies, stabilizes, goes into a loop etc.
        
             | GrantMoyer wrote:
             | So Oracle's working on an LLM too, eh?
        
               | DeathArrow wrote:
               | Everybody and their mom are into LLMs.
        
               | brookst wrote:
               | Same thing happened with the Internet.
        
               | yumong wrote:
               | And Second Life and Myspace.
        
               | nottorp wrote:
               | <cough> halting problem. But now I'm spoiling it.
        
               | markisus wrote:
               | We know neural networks cannot solve the halting problem.
               | But isn't the question whether they can learn the
               | transition table for game of life? Since each cell
               | depends only on neighbors, this is as easy as memorizing
               | how each 3x3 tile transitions.
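                | 
                | For instance, a quick toy sketch of that table in Python
                | (my own code, not from the article or paper):
                | 
                |   from itertools import product
                | 
                |   def step(tile):  # tile: 9 bits, centre = tile[4]
                |       n = sum(tile) - tile[4]
                |       return 1 if n == 3 or (tile[4] and n == 2) else 0
                | 
                |   table = {t: step(t) for t in product((0,1), repeat=9)}
                |   print(len(table))  # 512 possible 3x3 neighbourhoods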
        
               | nottorp wrote:
               | The original question, maybe. Mine is basically the
               | halting problem, I think.
               | 
               | The other difference is I don't take it seriously.
        
             | catlifeonmars wrote:
             | For a grid of a fixed size, yes.
        
             | scarmig wrote:
             | The halting problem doesn't mean you can never decide if
             | something cycles etc, just that you can't always decide.
             | 
             | As it stands, my guess is that the LLM would always
             | confidently make a decision, even if it were wrong, and
              | then politely backtrack if you pushed back, even if it
             | were originally right.
        
           | danielbln wrote:
           | From the HN comment rules:
           | 
           | > Be kind. Don't be snarky. Converse curiously; don't cross-
           | examine. Edit out swipes.
           | 
           | > Please don't fulminate. Please don't sneer, including at
           | the rest of the community.
        
             | bell-cot wrote:
             | My read of the comment is: "You are correct, but bear in
             | mind that the world seems infested with people who are far
             | less realistic and honest than you."
        
             | iiovemiku wrote:
             | The rules also say "Please don't complain that a submission
             | is inappropriate. If a story is spam or off-topic, flag it.
             | Don't feed egregious comments by replying; flag them
             | instead. If you flag, please don't also comment that you
             | did."
             | 
             | I'm not really sure it's the best idea to accuse someone of
             | breaking the rules if in doing so you're also breaking one
             | yourself.
        
           | elevatedastalt wrote:
            | Every other day we see demos of AIs doing things that were
            | thought impossible 6 months earlier, but sure, sounds
           | like it's the "hype merchants" who are out of touch with
           | reality.
        
         | moconnor wrote:
          | This is exactly what we ask of neural networks, and in the
          | case of the Game of Life the article and paper show that yes,
          | they do derive the rules. Equally, we can expect them to
          | derive the laws of physics by observation; certainly diffusion
          | networks appear to derive some of them as they pertain to
          | light.
        
       | red75prime wrote:
       | I struggled with the Game of Life too. I was fascinated by it and
       | evolved cell populations on graph paper by hand (yeah, I'm that
        | old). When I got a computer, I checked my drawings, and all of
        | them were wrong.
        
         | zaphar wrote:
         | I used to do this as a kid on long car trips. I would play GoL
         | on paper. It's a good way to eat up time enjoyably.
        
         | maxbond wrote:
         | Reminds me of Knuth's dragon mosaic, which also contains a
         | mistake.
         | 
         | https://www.youtube.com/watch?v=v678Em6qyzk
        
       | K0balt wrote:
        | What's not clear to me is whether it is (a) nontrivial to
        | create a GoL NN that models the game cells directly as neurons
        | (which would seem to be an efficient and effective method), or
        | whether it's just (b) nontrivial to create a transformer-
        | architecture model that can model the game state n turns in
        | the future.
        | 
        | I would be very surprised if (a) was not effective, but that
        | (b) is difficult is not surprising, since that is a very
        | nontrivial task that requires intermediary modelling tools to
        | perform for humans (arguably the most advanced NNs we have
        | access to at the moment).
        | 
        | (a) is actually a form of (b), in the form of a modelling tool.
        
         | moconnor wrote:
         | Both are ~trivial as detailed in the paper and article.
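          | 
          | For (a), the rule really does fit in one tiny hand-set
          | convolution plus a threshold. A rough sketch of the idea (my
          | own toy code, not the paper's exact construction; the zero
          | padding treats everything outside the grid as dead):
          | 
          |   import torch
          |   import torch.nn.functional as F
          | 
          |   def life_step(grid):  # grid: (H, W) tensor of 0s and 1s
          |       # neighbours weigh 1, the cell itself 0.5, so the sum s
          |       # lands in [2.5, 3.5] exactly when the cell is alive in
          |       # the next step (n == 3, or alive with n in {2, 3})
          |       k = torch.ones(1, 1, 3, 3)
          |       k[0, 0, 1, 1] = 0.5
          |       s = F.conv2d(grid[None, None], k, padding=1)[0, 0]
          |       return ((s >= 2.5) & (s <= 3.5)).float()
          | 
          |   g = torch.zeros(8, 8)            # a glider...
          |   g[1, 2] = g[2, 3] = 1.0
          |   g[3, 1] = g[3, 2] = g[3, 3] = 1.0
          |   print(life_step(g))              # ...advanced one step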
        
       | moconnor wrote:
       | For everyone reading neither the article nor the paper:
       | 
       | - both show neural networks can learn the game of life just fine
       | 
       | - the finding is that to learn the rules reliably the networks
       | need to be very over-parameterised (e.g. many times larger than
       | the minimal size needed for hand-crafted weights to perfectly
       | solve the problem)
       | 
       | This is not really a new result nor a surprising one, nor does it
       | say anything about the kinds of functions a neural network can
       | represent.
       | 
       | It's an attempt to understand an existing observation: once we
       | have trained a large overparameterized neural network we can
       | often compress it to a smaller one with very little loss. So why
       | can't we learn the smaller one directly?
       | 
        | One of the theories referred to in the article and paper is the
        | lottery ticket hypothesis, which states that a large network is
        | a superposition of many small subnetworks, and the larger the
        | network the more likely it is that at least one of those gets a
        | "lucky" set of initial weights and converges quickly to the
        | right solution. There is already interesting evidence for this.
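        | 
        | For a feel of the setup, a rough toy version (my own sketch,
        | not the paper's code): train conv nets of increasing width on
        | one-step prediction and see how close each gets to the exact
        | rule.
        | 
        |   import torch
        |   import torch.nn as nn
        |   import torch.nn.functional as F
        | 
        |   def life_step(x):  # ground truth; zero-padded border
        |       k = torch.ones(1, 1, 3, 3)
        |       k[0, 0, 1, 1] = 0.0
        |       n = F.conv2d(x, k, padding=1)
        |       return ((n == 3) | ((x == 1) & (n == 2))).float()
        | 
        |   def run(width, steps=2000):
        |       net = nn.Sequential(
        |           nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
        |           nn.Conv2d(width, 1, 1), nn.Sigmoid())
        |       opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        |       for _ in range(steps):
        |           x = (torch.rand(64, 1, 16, 16) < 0.4).float()
        |           loss = F.binary_cross_entropy(net(x), life_step(x))
        |           opt.zero_grad(); loss.backward(); opt.step()
        |       x = (torch.rand(256, 1, 16, 16) < 0.4).float()
        |       hit = (net(x) > 0.5).float() == life_step(x)
        |       return hit.float().mean().item()
        | 
        |   for width in (2, 8, 32):  # the paper's point: wider nets
        |       print(width, run(width))  # converge far more reliably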
        
         | actionfromafar wrote:
         | > once we have trained a large overparameterized neural network
         | we can often compress it to a smaller one with very little
         | loss. So why can't we learn the smaller one directly?
         | 
         | I feel something similar goes on in us humans. Interesting to
         | think about.
        
           | rmnclmnt wrote:
            | Indeed, you have to over-engineer something before
            | converging on a leaner solution to a problem.
        
             | brookst wrote:
             | And I think this is because the ideal complex system is one
             | where all the subsystems and parts combine to produce
             | adequate reliability.
             | 
             | Exponentiation means it is more efficient to start by far
             | exceeding the required reliability and then optimizing the
             | most expensive subsystems/parts. It is less efficient and
             | far more frustrating if multiple things have to be improved
             | to meet requirements.
        
               | actionfromafar wrote:
               | This is really insightful.
        
           | nopinsight wrote:
           | There is indeed an analogous process in the brain.
           | 
           | "The number of synapses in the brain reaches its peak around
           | ages 2-3, with about 15,000 synapses per neuron. As
           | adolescents, the brain undergoes synaptic pruning. In
           | adulthood, the brain stabilizes at around 7,500 synapses per
           | neuron, roughly half the peak in early childhood.
           | 
           | This figure can vary based on individual experiences and
           | learning." -- written by GPT-4o
           | 
           | Confirmed by e.g.
           | https://extension.umaine.edu/publications/4356e/
        
           | patcon wrote:
           | Also perhaps why the evolved pattern of death is important: a
           | subnetwork is selected in a brain, which is suited to a
           | specific geological, physical, biological and cognitive
           | environment that the brain is navigating. But when the
            | environment shifts beneath the organism (as culture does and
           | the living world in general does), then the subnetwork is no
           | longer the correct one, and needs to be reinitialized.
           | 
           | Or in other words, even in an information theoretic sense,
            | it's true: you can't teach an old dog new tricks. You need a
           | new dog.
        
             | yumong wrote:
             | From an evolutionary perspective, wouldn't we expect to
             | have developed a way to "reset" parts of our brain then?
        
               | sitkack wrote:
               | That is what kids are for.
        
               | actionfromafar wrote:
               | Haha, exactly. "We did".
        
               | ianmcgowan wrote:
               | This reminds me of a hacker news comment that blew my
               | mind - basically "I" am really my genetic code, and this
               | particular body "I" am in is just another computer that
               | the code has been moved to, because the old one is
               | scheduled to be decommissioned. So I am really just the
               | latest instance of a program that has been running
               | continuously since the first DNA/RNA molecules started to
               | replicate.
        
               | interroboink wrote:
               | If you're interested in such things, then start layering
               | on epigenetics. The "I" is a product not just of genes,
               | but of your environment as you developed. I was just
               | reading about bees' "royal jelly" recently, and how
               | genetically identical larvae can become a queen or a
               | worker based on their exposure to it.
               | 
               | So the program is not just the zeroes and ones, so to
               | speak, but also more nebulous real-time activity, passed
               | on through time. Like a wave on the ocean.
        
               | sitkack wrote:
               | And my children are not only my offspring, but everyone I
               | come into memetic contact with.
        
             | batshit_beaver wrote:
             | Neuroplasticity is a thing though, with plenty of cases of
             | brains recovering from pretty significant damage. They also
             | do evolve and adjust over time to gradual changes in
             | environment. Lots of elderly people are keeping up with
             | cultural and technological change.
             | 
             | Can't say the same about neural networks (yet?).
        
           | seanhunter wrote:
            | Yes. My naive intuition about this is that you need the
            | extra parameters precisely to do the learning, because
            | learning a thing is more complicated than doing the thing
            | once you have learned how. There are lots of natural
            | examples that fit this intuition, e.g. to my mind "junk"
            | DNA exists because the evolutionary mechanism is learning
            | the sequences that work in a similar way. You don't need
            | all that extra DNA once you have it working, but once you
            | have it working there's little selection pressure to clean
            | up/optimise the DNA sequence, so the junk stays.
        
         | mysecretaccount wrote:
         | Thanks, very clear explanation!
        
         | G3rn0ti wrote:
         | > the lottery hypothesis
         | 
          | Isn't that another way of saying that the optimization
          | algorithm used to find the network's weights (gradient
          | descent) cannot find the global optimum? I mean, this is
          | nothing new: the curse of dimensionality prevents any numeric
          | optimizer from completely minimizing a complicated error
          | function, and that has been known for decades. AFAIK there is
          | no general algorithm that can find the global minimum of an
          | arbitrary function. And this is what currently limits neural
          | network models: they could be much simpler and less
          | resource-hungry if we had better optimizers.
        
           | staunton wrote:
            | In practice, you _don't want_ the global optimum because you
            | can't put _all_ possible inputs in the training data and need
            | your system to "generalize" instead. Global optimum would
            | mean overfitting.
        
             | WithinReason wrote:
             | Regularisation should not be done with the optimiser but
             | with the loss function and the architecture.
        
               | redox99 wrote:
               | Regularization only helps you so much.
        
               | staunton wrote:
                | Maybe it _should_ not be done that way, but the large
                | neural networks of this decade absolutely rely on it. A
                | network at the global minimum of any of the
                | (regularized) loss functions that are used these days
                | would be _waaay_ overfitted.
        
               | uoaei wrote:
               | The entire reason SGD works is because the stochastic
               | nature of updates on minibatches is an implicit
               | regularizer. This one perspective built the foundations
               | for all of modern machine learning.
               | 
               | I completely agree that the most effective regularization
               | is inductive bias in the architecture. But bang for buck,
               | given all the memory/compute savings it accomplishes, SGD
               | is the exemplar of implicit regularization techniques.
        
           | WithinReason wrote:
           | I think you're right, but the issue might be local minima
           | which a better optimiser wouldn't help with much. A reason a
           | larger network might work better is that there are fewer
           | local minima in a higher dimension too.
        
             | jxy wrote:
             | > there are fewer local minima in a higher dimension
             | 
             | Is it actually proven, or another hypothesis? What is the
             | reason behind this?
        
               | jcrites wrote:
                | Just reasoning about this from first principles, but
                | intuitively, the more dimensions you have, the more
                | likely you are to find a gradient in some dimension.
                | In an N-dimensional space, a local minimum needs to be
                | a minimum in all N dimensions, right? Otherwise the
                | algorithm will keep exploring down the gradient. (Not
                | an expert on this stuff.) The more dimensions there
                | are, the more likely it seems that a descent direction
                | exists toward some deeper minimum from any given point.
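                | 
                | (A common back-of-the-envelope heuristic along those
                | lines, not from the article: a critical point is a true
                | local minimum only if the curvature is "up" in every
                | one of the N directions. If each direction were
                | independently up with probability 1/2, that happens
                | with probability about (1/2)^N, so in very high
                | dimension almost every critical point is a saddle with
                | at least one downhill escape direction.)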
        
         | DrScientist wrote:
         | This may be a silly question but - so rather than train a big
         | network and hope a subnetwork wins the lottery - why not just
         | train a smaller network with multiple runs with different
         | starting weights?
        
           | MalphasWats wrote:
           | Largely for the same reason people don't "just win" the
           | lottery by buying every possible ticket (any more).
        
           | scarmig wrote:
           | The larger network contains exponentially more subnetworks.
           | 10x the size contains far more than 10x subnetworks (although
           | it'd also take more than 10x as long to train).
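            | 
            | A toy illustration of the counting (my own numbers, nothing
            | from the article): treating size-10 subsets as a stand-in
            | for subnetworks,
            | 
            |   from math import comb
            |   print(comb(100, 10))   # ~1.7e13 subsets of 100 units
            |   print(comb(1000, 10))  # ~2.6e23 subsets of 1000 units:
            |                          # 10x the units, ~1e10x the subsets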
        
         | tomxor wrote:
         | > So why can't we learn the smaller one directly?
         | 
          | The lottery ticket hypothesis intuitively makes sense, but
          | as an outsider I find this concept for evaluating learning
          | methods really interesting - hand-crafting tiny optimal
          | networks for simple yet computationally irreducible problems
          | like GoL as a way to benchmark learning algorithms. Or is it
          | more than that? For a sufficiently small network maybe there
          | aren't that many combinations of "correct solutions", so
          | perhaps the way the network emerges internally could really
          | be interrogated by comparison.
        
         | dilyevsky wrote:
          | Isn't this basically the idea behind the dropout technique?
        
       | ngrilly wrote:
       | This paper from 2020 has been peer reviewed:
       | https://openreview.net/forum?id=uKZsVyFKbaj
        
       | rasca wrote:
        | Here's a nice article about how another team solved the Game
        | of Life with LLMs, using Claude 3 Opus:
        | https://x.com/ctjlewis/status/1786948443472339247
       | 
       | It's a really nice read.
        
       | scotty79 wrote:
       | > These findings are in line with "The Lottery Ticket
       | Hypothesis,"
       | 
        | If the fit was due to a lucky subset of weights, you could just
        | train smaller networks many times instead of using a network
        | many times bigger.
        | 
        | So it must be something more, like the increased opportunity to
        | assemble the best solution out of a large number of random
        | lucky parts.
       | 
       | I think there should be way more research on neural pruning.
       | After all it's what our brains do to reach the correct
       | architecture and weights during our development.
        
       | hamilyon2 wrote:
       | > In machine learning, one of the popular ways to improve the
       | accuracy of a model that is underperforming is to increase its
       | complexity. And this technique worked with the Game of Life.
       | 
       | For those who didn't read the article, the content doesn't
       | support the title.
        
       | phaedrus wrote:
       | I wonder if anyone has tried to approach the problem from the
       | other end: start with the hand-tuned network and randomize just
       | some of the weights (or all of the weights a small amount), and
       | see at what point the learning algorithm can no longer get back
       | to the correct formulation of the problem. Map the boundary
       | between almost-solved and failure to converge, instead of
       | starting from a random point trying to get to almost-solved.
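        | 
        | A rough sketch of one way to run that probe (my own toy setup,
        | nothing from the article): hand-set a tiny differentiable net
        | that solves one step, add noise of scale sigma to its weights,
        | fine-tune, and record for which sigma it gets back to ~100%.
        | 
        |   import torch
        |   import torch.nn as nn
        |   import torch.nn.functional as F
        | 
        |   def life_step(x):  # ground truth; zero-padded border
        |       k = torch.ones(1, 1, 3, 3)
        |       k[0, 0, 1, 1] = 0.0
        |       n = F.conv2d(x, k, padding=1)
        |       return ((n == 3) | ((x == 1) & (n == 2))).float()
        | 
        |   def hand_tuned():
        |       # s = neighbours + 0.5*self; alive next iff |s-3| <= 0.5,
        |       # encoded as sigmoid(15 - 20*(relu(s-3) + relu(3-s)))
        |       net = nn.Sequential(
        |           nn.Conv2d(1, 2, 3, padding=1), nn.ReLU(),
        |           nn.Conv2d(2, 1, 1), nn.Sigmoid())
        |       with torch.no_grad():
        |           w = torch.ones(2, 1, 3, 3)
        |           w[:, 0, 1, 1] = 0.5
        |           w[1] *= -1
        |           net[0].weight.copy_(w)
        |           net[0].bias.copy_(torch.tensor([-3.0, 3.0]))
        |           net[2].weight.copy_(torch.full((1, 2, 1, 1), -20.0))
        |           net[2].bias.copy_(torch.tensor([15.0]))
        |       return net
        | 
        |   def accuracy(net):
        |       x = (torch.rand(256, 1, 16, 16) < 0.4).float()
        |       hit = (net(x) > 0.5).float() == life_step(x)
        |       return hit.float().mean().item()
        | 
        |   def perturb_and_retrain(sigma, steps=1000):
        |       net = hand_tuned()
        |       with torch.no_grad():
        |           for p in net.parameters():
        |               p.add_(torch.randn_like(p) * sigma)
        |       opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        |       for _ in range(steps):
        |           x = (torch.rand(64, 1, 16, 16) < 0.4).float()
        |           loss = F.binary_cross_entropy(net(x), life_step(x))
        |           opt.zero_grad(); loss.backward(); opt.step()
        |       return accuracy(net)
        | 
        |   for sigma in (0.01, 0.1, 0.3, 1.0, 3.0):  # sweep the noise
        |       print(sigma, perturb_and_retrain(sigma))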
        
       | phaedrus wrote:
       | Philosophically, why should it be the case that aggregations of
       | statistical calculations (one way of viewing the matrix
       | multiplications of ANNs) can approximate intelligence? I think
       | it's because our ability to know reality is inherently
       | statistical.
       | 
        | To be clear, I'm not suggesting that macro-scale (i.e. non-
        | quantum) reality is itself probabilistic, only that our ability
        | to interpret our perception of it and model it is statistical.
        | That is,
       | an observation or a sensor doesn't actually tell you the state of
       | the world; it is a measurement from which you infer things.
       | 
        | Viewed from this standpoint, maybe the Game of Life and other
       | discrete, fully-knowable toy problem worlds aren't as applicable
       | to the problem of general intelligence as we imagine. A way to
       | put this into practice could be to introduce a level of error in
       | both the hand-tuned and learned networks' ability to accurately
       | measure the input states of the Life tableau (and/or introduce
       | some randomness in the application of the Life rules on the
       | simulation), and see whether the superiority of the hand-tuned
       | network persists or if the learned network is more robust in the
       | face of uncertain inputs or fallible rule-applications.
        
       | ellis0n wrote:
       | Amazing idea! GoL/LLM
        
       ___________________________________________________________________
       (page generated 2024-05-17 23:01 UTC)