[HN Gopher] Why neural networks struggle with the Game of Life (2020)
___________________________________________________________________
Why neural networks struggle with the Game of Life (2020)
Author : DeathArrow
Score : 94 points
Date : 2024-05-17 09:26 UTC (13 hours ago)
(HTM) web link (bdtechtalks.com)
(TXT) w3m dump (bdtechtalks.com)
| leric wrote:
| A good example of computational irreducibility
| rustcleaner wrote:
| based
|
| Simple systems built on simple rules that give rise to
| computationally universal behavior are both unintuitive to and
| underrated by the common man.
| nottorp wrote:
| > Interestingly, no matter how complex a grid becomes, you can
| predict the state of each cell in the next timestep with the same
| rules.
|
| How is that interesting? It's the definition of the Game of
| Life... it's not like it's a natural system that you don't know
| the full rules for...
| brookst wrote:
| It's interesting the way Go is interesting: very simple rules
| can produce extreme complexity. At least, I think that's
| interesting.
| yumong wrote:
| Stephen Wolfram pushes this to the extreme.
|
| Surprised that nobody mentioned him yet.
| nottorp wrote:
| I wouldn't compare the game of life that has exactly one
| "next state" with something like Go.
| brookst wrote:
| It's interesting that you wouldn't, yet I would. They aren't
| isomorphic, for sure.
|
| Go's complexity comes from two players alternately picking
| one out of a very large number of options.
|
| GoL's complexity comes from a very large number of nodes
| "picking" between two states. That's not precise, just
| illustrating that there is some symmetry of
| simplicity/complexity, at least to my eyes.
| SkyBelow wrote:
| Why not?
|
| We could even think of both as collections of 3d structures
| showing all valid structures possible for a board of size n
| by n. There are some differences: every single 3d Conway
| structure has a unique top layer, while Go does not. But
| that seems like an overall minor difference. There are many
| more Go shapes than Conway shapes for the same n, but
| both are already so numerous that I'm not sure that is a
| difference worth stopping the comparison over.
| bell-cot wrote:
| From a quick skim, then string search for "interesting" - I'd
| say that word is fluff, added to keep their audience reading
| through their dull background intro.
| kozlovsky wrote:
| If we show a neural network some examples from the Game of Life
| and expect it to master the rules of a cellular automaton, then
| aren't we asking too much from it? In some ways, this is
| analogous to expecting that if we show the neural network
| examples from the physical world, it will automatically derive
| Newton's three laws. Not every person observing the world around
| him can independently deduce Newton's laws from scratch, no
| matter how many examples he sees.
| passwordoops wrote:
| "then aren't we asking too much from it"
|
| Not according to the hype merchants, hucksters, and VCs who
| think word models are displaying emergence and we're 6 months
| from AGI, if only we can have more data
| nottorp wrote:
| Let's be snarky a bit:
|
| Can you do a neural network that, given a starting position
| of the game of life, decides if it cycles or not? ;)
|
| Ok, not cycles... dies, stabilizes, goes into a loop etc.
| GrantMoyer wrote:
| So Oracle's working on an LLM too, eh?
| DeathArrow wrote:
| Everybody and their mom are into LLMs.
| brookst wrote:
| Same thing happened with the Internet.
| yumong wrote:
| And Second Life and Myspace.
| nottorp wrote:
| <cough> halting problem. But now I'm spoiling it.
| markisus wrote:
| We know neural networks cannot solve the halting problem.
| But isn't the question whether they can learn the
| transition table for game of life? Since each cell
| depends only on neighbors, this is as easy as memorizing
| how each 3x3 tile transitions.
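|
| A minimal sketch of that framing (my own illustration in
| Python, not from the article): the whole "physics" of the
| game is a 512-entry lookup table, since a 3x3 binary tile
| has only 2^9 configurations.
|
|   import numpy as np
|
|   def life_rule(tile):
|       # tile: 3x3 binary array -> next state of center cell
|       n = tile.sum() - tile[1, 1]
|       return int(n == 3 or (tile[1, 1] == 1 and n == 2))
|
|   # Enumerate every possible neighborhood exactly once.
|   table = {}
|   for bits in range(512):
|       cells = [(bits >> i) & 1 for i in range(9)]
|       table[bits] = life_rule(np.array(cells).reshape(3, 3))
|
|   print(len(table))  # 512 entries cover the entire rule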
| nottorp wrote:
| The original question, maybe. Mine is basically the
| halting problem, I think.
|
| The other difference is I don't take it seriously.
| catlifeonmars wrote:
| For a grid of a fixed size, yes.
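|
| A sketch of why (mine, assuming numpy): a fixed n x n grid
| has at most 2^(n*n) states, so stepping until a state
| repeats always terminates and classifies the orbit.
|
|   import numpy as np
|
|   def step(g):
|       # One GoL step, cells beyond the border stay dead.
|       pad = np.pad(g, 1)
|       n = sum(np.roll(np.roll(pad, dy, 0), dx, 1)
|               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
|               if (dy, dx) != (0, 0))[1:-1, 1:-1]
|       return ((n == 3) | ((g == 1) & (n == 2))).astype(int)
|
|   def classify(g):
|       seen = {g.tobytes(): 0}
|       # Pigeonhole: a repeat must occur within 2^(n*n) steps.
|       for t in range(1, 2 ** g.size + 1):
|           g = step(g)
|           if not g.any():
|               return "dies"
|           if g.tobytes() in seen:
|               per = t - seen[g.tobytes()]
|               return ("stabilizes" if per == 1
|                       else "loops (period %d)" % per)
|           seen[g.tobytes()] = t
|
|   blinker = np.zeros((5, 5), int)
|   blinker[2, 1:4] = 1
|   print(classify(blinker))  # loops (period 2)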
| scarmig wrote:
| The halting problem doesn't mean you can never decide if
| something cycles etc, just that you can't always decide.
|
| As it stands, my guess is that the LLM would always
| confidently make a decision, even if it were wrong, and
| then politely backtrack if you pushed back, even if it
| were originally right.
| danielbln wrote:
| From the HN comment rules:
|
| > Be kind. Don't be snarky. Converse curiously; don't cross-
| examine. Edit out swipes.
|
| > Please don't fulminate. Please don't sneer, including at
| the rest of the community.
| bell-cot wrote:
| My read of the comment is: "You are correct, but bear in
| mind that the world seems infested with people who are far
| less realistic and honest than you."
| iiovemiku wrote:
| The rules also say "Please don't complain that a submission
| is inappropriate. If a story is spam or off-topic, flag it.
| Don't feed egregious comments by replying; flag them
| instead. If you flag, please don't also comment that you
| did."
|
| I'm not really sure it's the best idea to accuse someone of
| breaking the rules if in doing so you're also breaking one
| yourself.
| elevatedastalt wrote:
| Every other day we see demos of AIs doing things that were
| thought of as impossible 6 months earlier, but sure, sounds
| like it's the "hype merchants" who are out of touch with
| reality.
| moconnor wrote:
| This is exactly what we ask of neural networks, and in the
| case of the Game of Life the article and paper show that yes,
| they do derive the rules. Equally, we can expect them to
| derive the laws of physics by observation; certainly diffusion
| networks appear to derive some of them as they pertain to
| light.
| red75prime wrote:
| I struggled with the Game of Life too. I was fascinated by it and
| evolved cell populations on graph paper by hand (yeah, I'm that
| old). When I got a computer, I checked my drawings, and all of
| them were wrong.
| zaphar wrote:
| I used to do this as a kid on long car trips. I would play GoL
| on paper. It's a good way to eat up time enjoyably.
| maxbond wrote:
| Reminds me of Knuth's dragon mosaic, which also contains a
| mistake.
|
| https://www.youtube.com/watch?v=v678Em6qyzk
| K0balt wrote:
| What's not clear to me is if it is nontrivial to (a) create a
| GoL NN that models the game cells directly as neurons (which
| would seem to be an efficient and effective method) or if it's
| just (b) nontrivial to create a transformer architecture model
| that can model the game state n turns in the future.
|
| I would be very surprised if (a) was not effective, but that (b)
| is difficult is not surprising, since that is a very nontrivial
| task that even humans (arguably the most advanced NNs we have
| access to at the moment) need intermediary modelling tools to
| perform.
|
| (a) is actually a form of (b) in the form of a modelling tool.
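|
| For what it's worth, (a) can be written down by hand. A
| sketch of one such construction in PyTorch (my own; the
| paper's minimal network differs in its details): a two-layer
| CNN whose fixed weights compute one GoL step exactly. With
| u = neighbor sum + 0.5 * center, a cell lives next step iff
| |u - 3| <= 0.5.
|
|   import torch
|   import torch.nn as nn
|
|   net = nn.Sequential(
|       nn.Conv2d(1, 2, 3, padding=1),  # u - 3 and 3 - u
|       nn.ReLU(),
|       nn.Conv2d(2, 1, 1),             # bump around u = 3
|       nn.ReLU(),
|   )
|
|   with torch.no_grad():
|       k = torch.full((3, 3), 1.0)
|       k[1, 1] = 0.5                    # u = neighbors + c/2
|       net[0].weight[0, 0] = k          # relu(u - 3)
|       net[0].weight[1, 0] = -k         # relu(3 - u)
|       net[0].bias[:] = torch.tensor([-3.0, 3.0])
|       # relu(2 - 2|u-3|) is >= 1 iff the cell lives
|       net[2].weight[0, :, 0, 0] = torch.tensor([-2.0, -2.0])
|       net[2].bias[:] = 2.0
|
|   def life_step(grid):  # grid: (H, W) tensor of 0.0/1.0
|       return (net(grid[None, None])[0, 0] >= 0.5).float()
|
|   glider = torch.zeros(8, 8)
|   for r, c in [(1, 2), (2, 3), (3, 1), (3, 2), (3, 3)]:
|       glider[r, c] = 1.0
|   print(life_step(glider))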
| moconnor wrote:
| Both are ~trivial as detailed in the paper and article.
| moconnor wrote:
| For everyone reading neither the article nor the paper:
|
| - both show neural networks can learn the game of life just fine
|
| - the finding is that to learn the rules reliably the networks
| need to be very over-parameterised (e.g. many times larger than
| the minimal size needed for hand-crafted weights to perfectly
| solve the problem)
|
| This is not really a new result nor a surprising one, nor does it
| say anything about the kinds of functions a neural network can
| represent.
|
| It's an attempt to understand an existing observation: once we
| have trained a large overparameterized neural network we can
| often compress it to a smaller one with very little loss. So why
| can't we learn the smaller one directly?
|
| One of the theories referred to in the article and paper is the
| lottery ticket hypothesis, which states that a large network is a
| superposition of many small networks and the larger you are the
| more likely at least one of those gets a "lucky" set of weights
| and converges quickly to the right solution. There is already
| interesting evidence for this.
| actionfromafar wrote:
| > once we have trained a large overparameterized neural network
| we can often compress it to a smaller one with very little
| loss. So why can't we learn the smaller one directly?
|
| I feel something similar goes on in us humans. Interesting to
| think about.
| rmnclmnt wrote:
| Indeed, you have to over-engineer something before converging
| to a leaner solution to a problem.
| brookst wrote:
| And I think this is because the ideal complex system is one
| where all the subsystems and parts combine to produce
| adequate reliability.
|
| Exponentiation means it is more efficient to start by far
| exceeding the required reliability and then optimizing the
| most expensive subsystems/parts. It is less efficient and
| far more frustrating if multiple things have to be improved
| to meet requirements.
| actionfromafar wrote:
| This is really insightful.
| nopinsight wrote:
| There is indeed an analogous process in the brain.
|
| "The number of synapses in the brain reaches its peak around
| ages 2-3, with about 15,000 synapses per neuron. As
| adolescents, the brain undergoes synaptic pruning. In
| adulthood, the brain stabilizes at around 7,500 synapses per
| neuron, roughly half the peak in early childhood.
|
| This figure can vary based on individual experiences and
| learning." -- written by GPT-4o
|
| Confirmed by e.g.
| https://extension.umaine.edu/publications/4356e/
| patcon wrote:
| Also perhaps why the evolved pattern of death is important: a
| subnetwork is selected in a brain, which is suited to a
| specific geological, physical, biological and cognitive
| environment that the brain is navigating. But when the
| environment shifts beneath the organism (as culture does and
| the living world in general does), then the subnetwork is no
| longer the correct one, and needs to be reinitialized.
|
| Or in other words, even in an information theoretic sense,
| it's true: you can't teach an old dog new tricks. You need a
| new dog.
| yumong wrote:
| From an evolutionary perspective, wouldn't we expect to
| have developed a way to "reset" parts of our brain then?
| sitkack wrote:
| That is what kids are for.
| actionfromafar wrote:
| Haha, exactly. "We did".
| ianmcgowan wrote:
| This reminds me of a hacker news comment that blew my
| mind - basically "I" am really my genetic code, and this
| particular body "I" am in is just another computer that
| the code has been moved to, because the old one is
| scheduled to be decommissioned. So I am really just the
| latest instance of a program that has been running
| continuously since the first DNA/RNA molecules started to
| replicate.
| interroboink wrote:
| If you're interested in such things, then start layering
| on epigenetics. The "I" is a product not just of genes,
| but of your environment as you developed. I was just
| reading about bees' "royal jelly" recently, and how
| genetically identical larvae can become a queen or a
| worker based on their exposure to it.
|
| So the program is not just the zeroes and ones, so to
| speak, but also more nebulous real-time activity, passed
| on through time. Like a wave on the ocean.
| sitkack wrote:
| And my children are not only my offspring, but everyone I
| come into memetic contact with.
| batshit_beaver wrote:
| Neuroplasticity is a thing though, with plenty of cases of
| brains recovering from pretty significant damage. They also
| do evolve and adjust over time to gradual changes in
| environment. Lots of elderly people are keeping up with
| cultural and technological change.
|
| Can't say the same about neural networks (yet?).
| seanhunter wrote:
| Yes. My naive intuition about this is you need the extra
| parameters precisely to do the learning because learning a
| thing is more complicated than doing the thing once you have
| learned how. There are lots of natural examples that fit this
| intuition, e.g. in my mind "junk" DNA is needed because the
| evolutionary mechanism is learning the sequences that work
| in a similar way. You don't need all that extra DNA once you
| have it working, but once you have it working there's little
| selection pressure to clean up/optimise the DNA sequence, so
| the junk stays.
| mysecretaccount wrote:
| Thanks, very clear explanation!
| G3rn0ti wrote:
| > the lottery ticket hypothesis
|
| Isn't that another way of saying the optimization algorithm
| used in finding the network's weights (gradient descent) can
| not find the global optimum? I mean, this is nothing new: the
| curse of dimensionality prevents any numeric optimizer from
| completely minimizing any complicated error function, and it's
| been known for decades. AFAIK there is no algorithm that can
| find the global minimum of any function. And this is what
| currently limits neural network models: they could be much
| simpler and less resource hungry if we had better optimizers.
| staunton wrote:
| In practice, you _don't want_ the global optimum because you
| can't put _all_ possible inputs in the training data and need
| your system to "generalize" instead. Global optimum would
| mean overfitting.
| WithinReason wrote:
| Regularisation should not be done with the optimiser but
| with the loss function and the architecture.
| redox99 wrote:
| Regularization only helps you so much.
| staunton wrote:
| Maybe it _should_ not be done, but the large neural
| networks of this decade absolutely rely on this. A network
| at the global minimum of any of the (regularized) loss
| functions that are used these days would be _waaay_
| overfitted.
| uoaei wrote:
| The entire reason SGD works is because the stochastic
| nature of updates on minibatches is an implicit
| regularizer. This one perspective built the foundations
| for all of modern machine learning.
|
| I completely agree that the most effective regularization
| is inductive bias in the architecture. But bang for buck,
| given all the memory/compute savings it accomplishes, SGD
| is the exemplar of implicit regularization techniques.
| WithinReason wrote:
| I think you're right, but the issue might be local minima,
| which a better optimiser wouldn't help with much. A reason a
| larger network might work better is that there are fewer
| local minima in a higher dimension too.
| jxy wrote:
| > there are fewer local minima in a higher dimension
|
| Is it actually proven, or another hypothesis? What is the
| reason behind this?
| jcrites wrote:
| Just reasoning about this from first principles, but
| intuitively, the more dimensions you have, the more
| likely that you are to find a gradient in some dimension.
| In an N-dimensional space, a local minimum needs to be a
| minimum in all N dimensions, right? Otherwise the
| algorithm will keep exploring down the gradient. (Not an
| expert on this stuff.) The more dimensions there are, the
| more likely it seems to be that a gradient exists down to
| some greater minimum from any given point.
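|
| A toy version of that argument (an idealized assumption,
| not a theorem about real loss surfaces): if each of N
| directions at a random critical point independently curves
| up or down with equal probability, the chance that all N
| curve up (a true local minimum) is 2^-N.
|
|   for n in (2, 10, 100):
|       print(n, 0.5 ** n)  # vanishes fast as N grows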
| DrScientist wrote:
| This may be a silly question but - so rather than train a big
| network and hope a subnetwork wins the lottery - why not just
| train a smaller network with multiple runs with different
| starting weights?
| MalphasWats wrote:
| Largely for the same reason people don't "just win" the
| lottery by buying every possible ticket (any more).
| scarmig wrote:
| The larger network contains exponentially more subnetworks.
| 10x the size contains far more than 10x subnetworks (although
| it'd also take more than 10x as long to train).
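|
| A rough count (using unit subsets of a single hidden layer
| as a stand-in for "subnetworks"): the number of ways to
| pick a 10-unit sub-layer from an n-unit layer grows
| combinatorially in n, not linearly.
|
|   from math import comb
|
|   for n in (10, 20, 100):
|       print(n, comb(n, 10))
|   # 10 -> 1, 20 -> 184756, 100 -> 17310309456440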
| tomxor wrote:
| > So why can't we learn the smaller one directly?
|
| The lottery ticket hypothesis intuitively makes sense, but as
| an outsider I find this concept for evaluating learning methods
| really interesting: hand-crafting tiny optimal networks for
| simple yet computationally irreducible problems like GoL as a
| way to benchmark learning algorithms. Or is it more than that?
| For a sufficiently small network maybe there aren't that many
| combinations of "correct solutions", so perhaps the way the
| network emerges internally could really be interrogated by
| comparison.
| dilyevsky wrote:
| Isn't this basically the idea behind the dropout technique?
| ngrilly wrote:
| This paper from 2020 has been peer reviewed:
| https://openreview.net/forum?id=uKZsVyFKbaj
| rasca wrote:
| Here's a nice article about how another team solved the Game
| of Life with an LLM, using Claude 3 Opus:
| https://x.com/ctjlewis/status/1786948443472339247 .
|
| It's a really nice read.
| scotty79 wrote:
| > These findings are in line with "The Lottery Ticket
| Hypothesis,"
|
| If the fit were due to a lucky subset of weights, you could
| train smaller networks many times instead of using a many
| times bigger network.
|
| So it must be something more, like an increased opportunity to
| assemble the best solution out of a large number of random
| lucky parts.
|
| I think there should be way more research on neural pruning.
| After all, it's what our brains do to reach the correct
| architecture and weights during our development.
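|
| For reference, the train-large-then-prune loop is easy to
| experiment with. A minimal sketch using PyTorch's built-in
| magnitude pruning (the model and sparsity level here are
| placeholders):
|
|   import torch.nn as nn
|   import torch.nn.utils.prune as prune
|
|   model = nn.Sequential(nn.Linear(9, 64), nn.ReLU(),
|                         nn.Linear(64, 1))
|   # ... train the over-parameterized model here ...
|
|   layers = [(m, "weight") for m in model
|             if isinstance(m, nn.Linear)]
|   prune.global_unstructured(
|       layers, pruning_method=prune.L1Unstructured,
|       amount=0.9)  # zero out the smallest 90% of weights
|
|   kept = sum(int(m.weight_mask.sum()) for m, _ in layers)
|   total = sum(m.weight_mask.numel() for m, _ in layers)
|   print("kept %d of %d weights" % (kept, total))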
| hamilyon2 wrote:
| > In machine learning, one of the popular ways to improve the
| accuracy of a model that is underperforming is to increase its
| complexity. And this technique worked with the Game of Life.
|
| For those who didn't read the article, the content doesn't
| support the title.
| phaedrus wrote:
| I wonder if anyone has tried to approach the problem from the
| other end: start with the hand-tuned network and randomize just
| some of the weights (or all of the weights a small amount), and
| see at what point the learning algorithm can no longer get back
| to the correct formulation of the problem. Map the boundary
| between almost-solved and failure to converge, instead of
| starting from a random point trying to get to almost-solved.
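|
| A sketch of that experiment (mine, in PyTorch; solved()
| rebuilds the hand-weighted two-layer CNN discussed above,
| and the noise scales, batch sizes and step counts are
| placeholders): damage the known solution with noise of
| scale sigma, retrain on random boards, and check whether
| perfect accuracy comes back.
|
|   import torch
|   import torch.nn as nn
|   import torch.nn.functional as F
|
|   def solved():  # hand-weighted exact GoL network
|       net = nn.Sequential(nn.Conv2d(1, 2, 3, padding=1),
|                           nn.ReLU(),
|                           nn.Conv2d(2, 1, 1), nn.ReLU())
|       with torch.no_grad():
|           k = torch.full((3, 3), 1.0); k[1, 1] = 0.5
|           net[0].weight[0, 0] = k
|           net[0].weight[1, 0] = -k
|           net[0].bias[:] = torch.tensor([-3.0, 3.0])
|           net[2].weight[0, :, 0, 0] = torch.tensor(
|               [-2.0, -2.0])
|           net[2].bias[:] = 2.0
|       return net
|
|   KER = torch.ones(1, 1, 3, 3); KER[0, 0, 1, 1] = 0.0
|
|   def truth(x):  # exact next state, the training target
|       n = F.conv2d(x, KER, padding=1)
|       return ((n == 3) | ((x == 1) & (n == 2))).float()
|
|   def accuracy_after(sigma, steps=300):
|       net = solved()
|       with torch.no_grad():
|           for p in net.parameters():  # damage the solution
|               p.add_(sigma * torch.randn_like(p))
|       opt = torch.optim.Adam(net.parameters(), lr=1e-2)
|       for _ in range(steps):
|           x = (torch.rand(64, 1, 8, 8) < 0.4).float()
|           loss = F.mse_loss(net(x), truth(x))
|           opt.zero_grad(); loss.backward(); opt.step()
|       x = (torch.rand(256, 1, 8, 8) < 0.4).float()
|       pred = (net(x) >= 0.5).float()
|       return (pred == truth(x)).float().mean().item()
|
|   for sigma in (0.01, 0.1, 0.5, 1.0):
|       print(sigma, accuracy_after(sigma))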
| phaedrus wrote:
| Philosophically, why should it be the case that aggregations of
| statistical calculations (one way of viewing the matrix
| multiplications of ANNs) can approximate intelligence? I think
| it's because our ability to know reality is inherently
| statistical.
|
| To be clear, I'm not suggesting that macro-scale (i.e., non-
| quantum) reality itself is probabilistic, only that our ability
| to interpret perceptions of it and model it is statistical. That
| is, an observation or a sensor doesn't actually tell you the
| state of the world; it is a measurement from which you infer
| things.
|
| Viewed from this standpoint, maybe the Game of Life and other
| discrete, fully-knowable toy problem worlds aren't as applicable
| to the problem of general intelligence as we imagine. A way to
| put this into practice could be to introduce a level of error in
| both the hand-tuned and learned networks' ability to accurately
| measure the input states of the Life tableau (and/or introduce
| some randomness in the application of the Life rules on the
| simulation), and see whether the superiority of the hand-tuned
| network persists or if the learned network is more robust in the
| face of uncertain inputs or fallible rule-applications.
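|
| The measurement half of that proposal fits in a few lines
| (a sketch, assuming numpy; swapping a trained network in
| for step() would give the actual comparison): flip each
| observed cell with probability p and see how often an
| exact rule-follower still predicts the true next state.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|
|   def step(g):  # exact GoL step, dead beyond the border
|       pad = np.pad(g, 1)
|       n = sum(np.roll(np.roll(pad, dy, 0), dx, 1)
|               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
|               if (dy, dx) != (0, 0))[1:-1, 1:-1]
|       return ((n == 3) | ((g == 1) & (n == 2))).astype(int)
|
|   for p in (0.0, 0.01, 0.05):
|       accs = []
|       for _ in range(100):
|           g = (rng.random((16, 16)) < 0.4).astype(int)
|           flip = rng.random(g.shape) < p
|           noisy = np.where(flip, 1 - g, g)
|           accs.append((step(noisy) == step(g)).mean())
|       print(p, round(float(np.mean(accs)), 3))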
| ellis0n wrote:
| Amazing idea! GoL/LLM
___________________________________________________________________
(page generated 2024-05-17 23:01 UTC)