[HN Gopher] The boundary of neural network trainability is fractal
       ___________________________________________________________________
        
       The boundary of neural network trainability is fractal
        
       Author : RafelMri
       Score  : 165 points
       Date   : 2024-02-19 10:27 UTC (12 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | mchinen wrote:
       | Reposting comment from last time since I'm still curious:
       | 
       | This is really fun to see. I love toy experiments like this. I
       | see that each plot is always using the same initialization of
       | weights, which presumably makes it possible to have more
       | smoothness between each pixel. I also would guess it's using the
       | same random seed for training (shuffling data). I'd be curious to
       | know what the plots would look like with a different
       | randomness/shuffling of each pixel's dataset. I'd guess for the
       | high learning rates it would be too noisy, but you might see
       | fractal behavior at more typical and practical learning rates.
       | You could also do the same with the random initialization of each
        | dataset. This would get at whether the chaotic boundary also
        | exists in more practical use cases.
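A minimal sketch of the experiment mchinen proposes, under made-up assumptions: a toy one-hidden-unit tanh network, with the seed controlling both the dataset and the initialization, trained at one hyperparameter "pixel" and averaged over several seeds. All names and constants here are illustrative, not from the paper.

```python
import numpy as np

def train_diverges(lr_in, lr_out, seed, steps=200):
    """Train a tiny one-hidden-unit tanh network with full-batch
    gradient descent; return True if the weights blow up."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=16)            # the seed controls the dataset...
    y = np.tanh(1.5 * x)               # ...targets from a fixed "teacher"
    w_in, w_out = rng.normal(size=2)   # ...and the weight initialization
    for _ in range(steps):
        h = np.tanh(w_in * x)
        err = w_out * h - y
        g_out = np.mean(err * h)                      # grad of squared error
        g_in = np.mean(err * w_out * (1 - h**2) * x)  # (up to a constant)
        w_out -= lr_out * g_out
        w_in -= lr_in * g_in
        if not (np.isfinite(w_in) and np.isfinite(w_out)):
            return True
    return False

def divergence_fraction(lr_in, lr_out, seeds=range(8)):
    """One grid 'pixel', averaged over seeds as suggested above."""
    return float(np.mean([train_diverges(lr_in, lr_out, s) for s in seeds]))
```

A dense grid over (lr_in, lr_out) of `divergence_fraction` values would be one averaged-over-randomness version of the paper's plots; whether the fractal boundary survives the averaging is exactly the open question here.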
        
         | thomashop wrote:
          | Exactly. If they always use the same initialization seed,
          | then this isn't very surprising.
          | 
          | One would have to do many runs for each point in the grid
          | and average them or something.
          | 
          | But I didn't read the paper, so maybe they did.
        
         | londons_explore wrote:
         | I think if you used a random seed for weights and training data
         | order, and reran the experiment enough times to average out the
         | noise, then the resulting charts would then be smooth with no
         | fractal patterns.
        
           | baq wrote:
           | An interesting conjecture, well worth a paper in response.
        
           | catlifeonmars wrote:
           | Do you consider the random seed (or by extension the
           | randomized initial weights) a hyperparameter?
        
       | jerpint wrote:
       | It feels weird to me to use the hyper parameters as the variables
        | to iterate on, and also wasteful. Surely there must be a family
        | of models that give fractal-like behaviour?
        
         | amelius wrote:
         | > It feels weird to me to use the hyper parameters as the
         | variables to iterate on
         | 
         | Yes, I also think this is strange. In regular fractals the x
         | and y coordinates have the same units (roughly speaking), but
         | here this is not the case, so I wonder how they determine the
         | relative scale.
        
           | thfuran wrote:
           | Is there really any meaningful sense in which real and
           | imaginary numbers have the same units but two dimensionless
           | real hyperparameters don't?
        
             | xvedejas wrote:
              | Multiplying a complex number by a unit-magnitude complex
              | number rotates it while preserving magnitude. In this
              | sense the two axes have the same units.
        
               | thfuran wrote:
               | You can rotate any vector while preserving its magnitude.
        
               | xvedejas wrote:
                | If the units are different, you do need to come up
                | with a conversion between them, or you're implicitly
                | saying it's 1-to-1.
        
         | freeone3000 wrote:
          | The fractal behaviour is an undesirable property, not the
          | goal :P Ideally every network would be trainable (would
          | converge)! This is the graphed result of a hyperparameter
          | search, a form of optimization in neural networks.
         | 
          | If you envision a given architecture as a class of (higher-
          | order) functions, the inputs would be the parameters, and
          | the constants would be the hyperparameters. Varying the
          | constants moves to a different function in the class; that
          | is, varying the hyperparameters gives a different model with
          | the same architecture (even with the same data).
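The "architecture as a higher-order function" framing can be sketched directly. Here a hypothetical `make_model` fixes the hyperparameters and returns a training function, so different hyperparameter choices yield different functions from the same class (all names and values are illustrative):

```python
import numpy as np

def make_model(lr, steps):
    """Fix the hyperparameters (the 'constants'); return a training
    function. Different (lr, steps) choices give different functions
    from the same class, i.e. different models of one architecture."""
    def train(x, y, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.normal()                        # the learned parameter
        for _ in range(steps):
            w -= lr * np.mean((w * x - y) * x)  # gradient step on MSE
        return w
    return train

x = np.linspace(-1, 1, 32)
y = 3.0 * x                              # toy data with true slope 3
train_a = make_model(lr=0.5, steps=100)  # one model...
train_b = make_model(lr=0.5, steps=0)    # ...another, same architecture
```

With the same data, `train_a` recovers the slope while `train_b` (zero training steps) just returns its random initialization: same class of functions, different constants, different model.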
        
       | FredrikMeyer wrote:
       | Related: https://news.ycombinator.com/item?id=39349992 (seems to
       | be the same content)
        
       | blackbear_ wrote:
       | Could it be that this behavior is just caused by numerical issues
       | and/or randomness in the calculation, rather than a real property
       | of neural networks?
        
         | sigmoid10 wrote:
          | Fractals are not caused by randomness. Fractals arise from
          | scale invariance and self-similarity, which in turn can come
          | from non-linear systems and iterative processes. It is very
          | easy to generate fractals in practice, and even the most
          | vanilla neural networks trivially fulfill the requirements
          | (at least when you look at training output). In that sense
          | it would be weird _not_ to find fractal structures when you
          | look hard enough.
        
           | tomxor wrote:
           | "If we're built from spirals, while living in a giant spiral,
           | then everything we put our hands to, is infused with the
           | spiral"
           | 
           | ... Sorry I couldn't help myself.
        
           | nyrikki wrote:
            | Not exactly. Newton's fractal has a topological feature,
            | the Wada property, at its basin boundaries.
            | 
            | It also relates to fractal (non-integer) dimensions, which
            | were first described by Mandelbrot in a paper about self-
            | similarity.
           | 
           | Here is a paper that covers some of that.
           | 
           | https://www.minvydasragulskis.com/sites/default/files/public.
           | ..
           | 
            | In Newton's fractal, no matter how small a circle you
            | draw, it will contain points from one basin or from all of
            | the basins, never from exactly two.
           | 
           | The basins that contain one root are open sets that share a
           | boundary set.
           | 
           | Even if you could have perfect information and precision this
           | property holds. This means any change in initial conditions
           | that crosses a boundary will be indeterminate.
           | 
            | There is another feature called riddled basins, where
            | every point is arbitrarily close to points of other
            | basins. This is another situation where, even with perfect
            | information and unlimited precision, the outcome of a
            | perturbation would be indeterminate.
           | 
            | A Lyapunov exponent, which is always positive in the
            | presence of chaos (though a positive value isn't
            | sufficient to prove chaos), may even be zero or negative
            | in the above situations.
           | 
           | Take the typical predator prey model and add fear and refuge
           | and you hit the riddled basins.
           | 
           | Stack four reflective balls in a pyramid and shine different
           | color lights in two sides and you will see the Wada property.
           | 
            | Neither of those problems is addressable with the
            | assumption of deterministic effects with finite precision.
        
       | subroutine wrote:
       | In case it isn't obvious, you can tap on any of the figures in
       | the PDF or HTML version to watch the video.
        
       | JL-Akrasia wrote:
       | I'll add to this.
       | 
       | It's not only the boundary that is fractal.
       | 
       | We'll soon see that learning on one dataset (area of fractal)
       | with enough data will generalize to other seemingly unrelated
       | datasets.
       | 
        | There is evidence that the structure neural networks are
        | learning to approximate is a generative fractal of sorts.
        | 
        | Finally, we'll need to adapt gradient descent to operate at,
        | and move between, different scales.
        
       | ttoinou wrote:
       | I've produced some KIFS (Kaleidoscopic iterated function system)
       | fractals that look like this
        
       | lawlessone wrote:
       | What does this mean?
        
         | PaulHoule wrote:
         | Training the network is a dynamic process similar to
         | 
         | https://en.wikipedia.org/wiki/Julia_set
         | 
         | or
         | 
         | https://en.wikipedia.org/wiki/Newton_fractal
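For concreteness, a minimal sketch of the Newton-fractal iteration for z^3 = 1: classify each starting point by which cube root of unity Newton's method converges to, and the boundary between the basins is the fractal. (Illustrative code, not from the paper.)

```python
import numpy as np

ROOTS = [np.exp(2j * np.pi * k / 3) for k in range(3)]  # cube roots of unity

def newton_basin(z, steps=50):
    """Iterate Newton's method for f(z) = z^3 - 1 and return the index
    of the root the orbit lands on (or -1 if it hasn't converged)."""
    for _ in range(steps):
        z = z - (z**3 - 1) / (3 * z**2)  # z_{n+1} = z_n - f(z_n)/f'(z_n)
    dists = [abs(z - r) for r in ROOTS]
    k = int(np.argmin(dists))
    return k if dists[k] < 1e-6 else -1
```

Coloring a pixel grid of starting points by `newton_basin` reproduces the familiar three-lobed Newton fractal; the paper's analogy is that training a network is likewise an iterated update whose converge/diverge boundary can be just as intricate.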
        
         | romusha wrote:
         | Nothing
        
         | idiotsecant wrote:
          | As another poster pointed out, it's much more intuitive with
          | graphics.
          | 
          | The parameters that you tweak to control model learning have
          | a self-similar property where, as you zoom in, you see more
          | and more complexity. It's the very definition of local
          | maxima all over the place.
        
       | xcodevn wrote:
       | I am not trying to downplay the contribution of the paper, but
       | isn't it obvious that this is the case?
        
         | teaearlgraycold wrote:
         | Obvious to whom?
        
           | bloaf wrote:
            | I think the "obvious" comment was a bit snarky, but out of
            | curiosity, I posed the question to the Groq website, which
            | happens to be on the front page right now. (It claims to
            | run Mixtral 8x7B-32k at 500 T/s)
           | 
           | And indeed, the AI response indicated that the boundary
           | between convergence and divergence is not well defined, has
           | many local maxima and minima, and could be quote: "fractal or
           | chaotic, with small changes in hyperparameters leading to
           | drastically different outcomes."
        
         | Buttons840 wrote:
         | I'll defend the idea that it was obvious. (Although, it wasn't
         | obvious to me until someone pointed it out, so maybe that's not
         | obvious.)
         | 
         | If you watch this video[0], you'll see in the first frame that
         | there is a clear boundary between learning rates that converge
         | or not. Ignoring this paper for a moment, what if we zoom in
         | really really close to that boundary? There are two
         | possibilities, either (1) the boundary is perfectly sharp no
         | matter how closely we inspect it, or (2) it is a little bit
         | fuzzy. Of those two possibilities, the perfectly sharp boundary
         | would be more surprising.
         | 
         | [0]: https://x.com/jaschasd/status/1756930242965606582
        
         | eapriv wrote:
          | Not only is it not obvious; it is not known to be true.
        
         | barbarr wrote:
         | I don't think it's obvious per se, but people who have studied
         | numerical methods at the graduate level have likely seen
         | fractal boundaries like this before - even Newton's method
         | produces them [0]. The phenomenon says more about iterative
         | methods than it says about neural networks.
         | 
         | [0] https://en.wikipedia.org/wiki/Newton_fractal
        
       | otaviogood wrote:
       | This is _much_ more interesting if you see the animations.
       | https://x.com/jaschasd/status/1756930242965606582
        
         | notfed wrote:
         | Fractal zoom videos are worth infinite words.
        
           | paulddraper wrote:
           | > infinite
           | 
           | I see you
        
         | catlifeonmars wrote:
         | So what exactly are we looking at here? Did the authors only
         | use two hyperparameters for the purpose of this visualization?
        
           | yazzku wrote:
           | It's explained in the post:
           | 
           | > Have you ever done a dense grid search over neural network
           | hyperparameters? Like a _really dense_ grid search? It looks
           | like this (!!). Blueish colors correspond to hyperparameters
           | for which training converges, redish colors to
           | hyperparameters for which training diverges.
        
       | phaedrus wrote:
       | The blog post would be a better link for this submission.
       | https://sohl-dickstein.github.io/2024/02/12/fractal.html
        
         | calibas wrote:
         | It was posted on HN last week:
         | https://news.ycombinator.com/item?id=39349992
        
       | omginternets wrote:
       | I vaguely recall that in vivo neural oscillations also exhibit
       | fractal structure (in some cases at least).
        
       | pmayrgundter wrote:
        | One of Wolfram's comments is that there appears to be much
        | more internal structure in language semantics than we had
        | expected, contra Chomsky.
       | 
       | We also know the brain, cortex esp, is highly recurrent, so it
       | should be primed for creating fractals and chaotic mixing.
       | 
        | So maybe the hidden structure is the set of neural hyperparams
        | needed to put a given cluster of neurons into fractal/chaotic
        | oscillations like this. Seems potentially more useful too: way
        | more information content than a configuration that yields a
        | fast convergence to a fixed point.
       | 
       | Perhaps this is what learning deep NNs is doing: producing
       | conditions where the substrate is at the tipping point, to get to
       | a high-information generation condition, and then shaping this to
       | fit the target system as well as it can with so many free
       | parameters.
       | 
       | That suggests that using iterative generators that are somehow
       | closer to the dynamics of real neurons would be more efficient
       | for AI: it'd be easier to drive them to similar feedback
       | conditions and patterns
       | 
        | Like matching resonators in any physical system.
        
         | pmayrgundter wrote:
         | Per some offline discussion, I'll note that while this paper is
         | about the structure of hyperparams, it also starts with the
         | analogy between Mandelbrot & Julia sets, Julia as the
         | hyperparam space for Mandelbrot, the _param_ space
         | 
          | Well, they both also have similar fractal dimensions. The
          | Mandelbrot set's Hausdorff dimension is 2; Julia sets range
          | from 1 to 2.
         | 
         | I won't argue it here but just suggest that this is an
         | important complexity relationship and that the neural net being
         | fit may also have similar fractal complexity, and that the
         | distinction between param and hyper-param in this sense may be
          | somewhat of a red herring.
        
         | catlifeonmars wrote:
         | Two quibbles: (1) neural nets don't necessarily operate on
         | language, (2) they only loosely model biological neurons in
         | that they operate in discrete space. All that is to say that
         | any similarities are purely incidental without accounting for
         | these two facts.
        
           | sjwhevvvvvsj wrote:
            | Well, in some sense they don't operate on language at all,
            | but on a mathematical representation of tokens derived
            | from language.
           | 
           | I'm sure from your comment you are aware of the distinction,
           | but it is an interesting concept for people to keep in mind.
        
           | pmayrgundter wrote:
           | 1) agreed. it's exciting seeing the same basic architecture
           | broadly applied
           | 
           | 2) not sure what you mean by "operate in discrete space"
           | 
           | I'd emphasize the potential similarity to biological
           | recurrence. Deep ANNs don't need to have this explicitly (tho
           | e.g. LSTM has explicit recurrence), but it is known that
           | recurrent NNs can be emulated by unrolling, in a process
           | similar to function currying. In this mode, a learned network
           | would learn to recognize certain inputs and carry them across
           | to other parts of the network that can be copies of the
           | originator, thus achieving functional equivalence to self
           | feedback, or neighbor feedback. It takes a lot of layers and
           | nodes in theory, but ofc modern nets are getting very big.
        
       | ahdsr wrote:
       | > Some fractals -- for instance those associated with the
       | Mandelbrot and quadratic Julia sets -- are computed by iterating
       | a function, and identifying the boundary between hyperparameters
       | for which the resulting series diverges or remains bounded.
       | Neural network training similarly involves iterating an update
       | function (e.g. repeated steps of gradient descent), can result in
       | convergent or divergent behavior, and can be extremely sensitive
       | to small changes in hyperparameters. Motivated by these
       | similarities, we experimentally examine the boundary between
       | neural network hyperparameters that lead to stable and divergent
       | training. We find that this boundary is fractal over more than
       | ten decades of scale in all tested configurations.
       | 
       | Reading this gave me goosebumps
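The "iterate a function and identify the boundary between bounded and divergent behavior" recipe from the abstract, in miniature for the Mandelbrot map z <- z^2 + c (an illustrative toy, not the paper's training setup):

```python
def diverges(c, max_iter=100, bound=2.0):
    """Iterate z <- z^2 + c from z = 0; report True if the orbit
    escapes. The fractal lives on the boundary between the two
    outcomes, just as the paper's fractal lives on the boundary
    between converging and diverging training runs."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > bound:
            return True
    return False
```

Sweeping `c` over a grid of the complex plane and coloring by this predicate draws the Mandelbrot set; the paper swaps the quadratic map for a gradient-descent update and the complex parameter for a pair of hyperparameters.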
        
       | kraig911 wrote:
       | Makes me think of something I think I read from Penrose about
       | consciousness.
        
       | eapriv wrote:
       | Note that the paper does not provide a proof of this statement,
       | only some experimental evidence.
        
       | cactusfrog wrote:
        | Even the boundary for Newton's method is fractal. This is a
        | feature of non-linear optimization.
        
       | bawolff wrote:
        | It's interesting how fluid-like this fractal is compared to
        | other fractal-zoom videos I see on the internet. I have no
        | idea how common it is for fractals to be like that.
        
       ___________________________________________________________________
       (page generated 2024-02-19 23:01 UTC)