[HN Gopher] The boundary of neural network trainability is fractal
___________________________________________________________________
The boundary of neural network trainability is fractal
Author : RafelMri
Score : 165 points
Date : 2024-02-19 10:27 UTC (12 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| mchinen wrote:
| Reposting comment from last time since I'm still curious:
|
| This is really fun to see. I love toy experiments like this. I
| see that each plot is always using the same initialization of
| weights, which presumably makes it possible to have more
| smoothness between each pixel. I also would guess it's using the
| same random seed for training (shuffling data). I'd be curious to
| know what the plots would look like with a different
| randomness/shuffling of each pixel's dataset. I'd guess for the
| high learning rates it would be too noisy, but you might see
| fractal behavior at more typical and practical learning rates.
| You could also do the same with the random initialization of each
| dataset. This would get at whether the chaotic boundary also
| exists in more practical use cases.
| thomashop wrote:
| Exactly. If they always use the same initialization seed, then
| this isn't very surprising.
|
| One would have to do many runs for each point in the grid and
| average them or something.
|
| But I didn't read the paper, so maybe they did.
| londons_explore wrote:
| I think if you used a random seed for weights and training data
| order, and reran the experiment enough times to average out the
| noise, then the resulting charts would be smooth, with no
| fractal patterns.
| baq wrote:
| An interesting conjecture, well worth a paper in response.
| catlifeonmars wrote:
| Do you consider the random seed (or by extension the
| randomized initial weights) a hyperparameter?
| jerpint wrote:
| It feels weird to me to use the hyperparameters as the variables
| to iterate on, and also wasteful. Surely there must be a family
| of models that gives fractal-like behaviour?
| amelius wrote:
| > It feels weird to me to use the hyperparameters as the
| variables to iterate on
|
| Yes, I also think this is strange. In regular fractals the x
| and y coordinates have the same units (roughly speaking), but
| here this is not the case, so I wonder how they determine the
| relative scale.
| thfuran wrote:
| Is there really any meaningful sense in which real and
| imaginary numbers have the same units but two dimensionless
| real hyperparameters don't?
| xvedejas wrote:
| Complex numbers multiplied by i (or any unit complex number)
| rotate while preserving magnitude. In this sense the real and
| imaginary parts have the same units.
| thfuran wrote:
| You can rotate any vector while preserving its magnitude.
| xvedejas wrote:
| If the units are different, you do need to come up with a
| conversion between them, or you're implicitly saying
| it's 1-to-1.
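| To state the rotation argument concretely (a standard identity,
| nothing specific to this paper): for a unit complex number
| $e^{i\theta}$,
|
|     $|e^{i\theta} z| = |e^{i\theta}|\,|z| = |z|$
|
| so multiplication freely mixes the real and imaginary parts
| while preserving scale, which is the sense in which the two
| axes share a unit; two independent hyperparameters come with no
| such canonical way to trade one off against the other.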
| freeone3000 wrote:
| The fractal behaviour is an undesirable property, not the goal
| :P Ideally every network would be trainable (would converge)!
| This is the graphed result of a hyperparameter search, a form
| of optimization in neural networks.
|
| If you envision a given architecture as a class of (higher-
| order) functions, the inputs would be the parameters and the
| constants would be the hyperparameters. Varying the constants
| moves to a different function in the class; in other words,
| varying the hyperparameters gives a different model with the
| same architecture (even with the same data).
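| A minimal sketch of that split in code (toy names, nothing from
| the paper; the hyperparameters are bound once as constants, the
| parameters are what training updates):
|
|     import numpy as np
|
|     def make_model(learning_rate, width, seed=0):
|         # Hyperparameters (learning_rate, width, seed) are the fixed
|         # "constants" of the function class.
|         rng = np.random.default_rng(seed)
|         w = rng.normal(size=(width,))        # parameters: moved by training
|
|         def train_step(grad_fn):
|             nonlocal w
|             w = w - learning_rate * grad_fn(w)   # one gradient-descent step
|             return w
|
|         return train_step
|
|     # Same architecture, same toy loss, different constants => two
|     # different models:
|     toy_grad = lambda w: 2 * w               # gradient of ||w||^2
|     model_a = make_model(learning_rate=0.10, width=8)
|     model_b = make_model(learning_rate=0.11, width=8)
|     model_a(toy_grad)                        # same init, different
|     model_b(toy_grad)                        # trajectories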
| FredrikMeyer wrote:
| Related: https://news.ycombinator.com/item?id=39349992 (seems to
| be the same content)
| blackbear_ wrote:
| Could it be that this behavior is just caused by numerical issues
| and/or randomness in the calculation, rather than a real property
| of neural networks?
| sigmoid10 wrote:
| Fractals are not caused by randomness. Fractals arise from
| scale invariance and self-similarity, which in turn can come
| from nonlinear systems and iterative processes. It is very
| easy to generate fractals in practice and even the most vanilla
| neural networks trivially fulfill the requirements (at least
| when you look at training output). In that sense it would be
| weird _not_ to find fractal structures when you look hard
| enough.
| tomxor wrote:
| "If we're built from spirals, while living in a giant spiral,
| then everything we put our hands to, is infused with the
| spiral"
|
| ... Sorry I couldn't help myself.
| nyrikki wrote:
| Not exactly. Newton's fractal is topological, specifically the
| Wada property, and it is a property of the basin boundary.
|
| It relates to fractal (non-integer) dimensions, which were
| first described by Mandelbrot in a paper about self-
| similarity.
|
| Here is a paper that covers some of that.
|
| https://www.minvydasragulskis.com/sites/default/files/public.
| ..
|
| In Newton's fractal, no matter how small a circle you draw, it
| will either lie inside the basin of a single root or intersect
| the basins of all the roots.
|
| The basins, one per root, are open sets that share a single
| boundary set.
|
| Even if you had perfect information and precision, this
| property would hold. This means the outcome of any change in
| initial conditions that crosses the boundary is indeterminate.
|
| There is another feature called riddled basins, where every
| point is arbitrarily close to other basins. This is another
| situation where, even with perfect information and unlimited
| precision, a perturbation's effect would be indeterminate.
|
| The Lyapunov exponent (always positive in the presence of
| chaos, though a positive exponent alone isn't sufficient to
| prove chaos) may even be zero or negative in the situations
| above.
|
| Take the typical predator-prey model, add fear and refuge, and
| you hit riddled basins.
|
| Stack four reflective balls in a pyramid, shine different-
| colored lights into two sides, and you will see the Wada
| property.
|
| Neither of those problems is addressable under the assumption
| of deterministic effects with finite precision.
| subroutine wrote:
| In case it isn't obvious, you can tap on any of the figures in
| the PDF or HTML version to watch the video.
| JL-Akrasia wrote:
| I'll add to this.
|
| It's not only the boundary that is fractal.
|
| We'll soon see that learning on one dataset (area of fractal)
| with enough data will generalize to other seemingly unrelated
| datasets.
|
| There is evidence that the structure neural networks are learning
| to approximate is a generative fractal of sorts.
|
| Finally, we'll need to adapt gradient descent to operate at, and
| move between, different scales.
| ttoinou wrote:
| I've produced some KIFS (Kaleidoscopic iterated function system)
| fractals that look like this
| lawlessone wrote:
| What does this mean?
| PaulHoule wrote:
| Training the network is a dynamic process similar to
|
| https://en.wikipedia.org/wiki/Julia_set
|
| or
|
| https://en.wikipedia.org/wiki/Newton_fractal
| romusha wrote:
| Nothing
| idiotsecant wrote:
| As another poster pointed out, it's much more intuitive with
| graphics.
|
| The hyperparameters that you tweak to control model learning
| have a self-similar property where, as you zoom in, you see
| more and more complexity. It's the very definition of local
| maxima all over the place.
| xcodevn wrote:
| I am not trying to downplay the contribution of the paper, but
| isn't it obvious that this is the case?
| teaearlgraycold wrote:
| Obvious to whom?
| bloaf wrote:
| I think the "obvious" comment was a bit snarky, but out of
| curiosity, I posed the question to the Groq website, which
| happens to be on the front page right now. (It claims to run
| Mixtral 8x7B-32k at 500 T/s.)
|
| And indeed, the AI response indicated that the boundary
| between convergence and divergence is not well defined, has
| many local maxima and minima, and could be, quote, "fractal or
| chaotic, with small changes in hyperparameters leading to
| drastically different outcomes."
| Buttons840 wrote:
| I'll defend the idea that it was obvious. (Although, it wasn't
| obvious to me until someone pointed it out, so maybe that's not
| obvious.)
|
| If you watch this video[0], you'll see in the first frame that
| there is a clear boundary between learning rates that converge
| or not. Ignoring this paper for a moment, what if we zoom in
| really really close to that boundary? There are two
| possibilities, either (1) the boundary is perfectly sharp no
| matter how closely we inspect it, or (2) it is a little bit
| fuzzy. Of those two possibilities, the perfectly sharp boundary
| would be more surprising.
|
| [0]: https://x.com/jaschasd/status/1756930242965606582
| eapriv wrote:
| Not only is it not obvious; it is not known to be true.
| barbarr wrote:
| I don't think it's obvious per se, but people who have studied
| numerical methods at the graduate level have likely seen
| fractal boundaries like this before - even Newton's method
| produces them [0]. The phenomenon says more about iterative
| methods than it says about neural networks.
|
| [0] https://en.wikipedia.org/wiki/Newton_fractal
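| For anyone who hasn't seen it, the Newton fractal is easy to
| reproduce (a minimal numpy sketch, unrelated to the paper's
| code; the grid size and iteration count are arbitrary):
|
|     import numpy as np
|
|     # Newton's method for z^3 - 1 = 0, started from a grid of complex
|     # points. Coloring each start by the root it converges to reveals
|     # a fractal boundary between the three basins of attraction.
|     roots = np.exp(2j * np.pi * np.arange(3) / 3)   # cube roots of 1
|     xs = np.linspace(-2, 2, 800)
|     z = xs[None, :] + 1j * xs[:, None]              # 800x800 grid
|
|     for _ in range(40):                             # Newton iterations
|         z = z - (z**3 - 1) / (3 * z**2)
|
|     basin = np.argmin(np.abs(z[..., None] - roots), axis=-1)
|     # basin is an 800x800 integer image; plotting it (for example with
|     # plt.imshow(basin)) shows the familiar three-lobed Newton fractal,
|     # and zooming in on any boundary point shows the same interleaving
|     # of basins at every scale.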
| otaviogood wrote:
| This is _much_ more interesting if you see the animations.
| https://x.com/jaschasd/status/1756930242965606582
| notfed wrote:
| Fractal zoom videos are worth infinite words.
| paulddraper wrote:
| > infinite
|
| I see you
| catlifeonmars wrote:
| So what exactly are we looking at here? Did the authors only
| use two hyperparameters for the purpose of this visualization?
| yazzku wrote:
| It's explained in the post:
|
| > Have you ever done a dense grid search over neural network
| hyperparameters? Like a _really dense_ grid search? It looks
| like this (!!). Blueish colors correspond to hyperparameters
| for which training converges, redish colors to
| hyperparameters for which training diverges.
| phaedrus wrote:
| The blog post would be a better link for this submission.
| https://sohl-dickstein.github.io/2024/02/12/fractal.html
| calibas wrote:
| It was posted on HN last week:
| https://news.ycombinator.com/item?id=39349992
| omginternets wrote:
| I vaguely recall that in vivo neural oscillations also exhibit
| fractal structure (in some cases at least).
| pmayrgundter wrote:
| One of Wolfram's comments is that there appears to be much more
| internal structure in language semantics than we had expected,
| contra Chomsky.
|
| We also know the brain, especially the cortex, is highly
| recurrent, so it should be primed for creating fractals and
| chaotic mixing.
|
| So maybe the hidden structure is the set of neural hyperparams
| needed to put a given cluster of neurons into fractal/chaotic
| oscillations like this. That seems potentially more useful, too:
| way more information content than a configuration that yields
| fast convergence to a fixed point.
|
| Perhaps this is what learning deep NNs is doing: producing
| conditions where the substrate is at the tipping point, to get to
| a high-information generation condition, and then shaping this to
| fit the target system as well as it can with so many free
| parameters.
|
| That suggests that using iterative generators that are somehow
| closer to the dynamics of real neurons would be more efficient
| for AI: it'd be easier to drive them to similar feedback
| conditions and patterns.
|
| Like matching resonators in any physical system.
| pmayrgundter wrote:
| Per some offline discussion, I'll note that while this paper is
| about the structure of hyperparams, it also starts from the
| analogy between the Mandelbrot and Julia sets, with the
| Mandelbrot set as the hyperparam space and the Julia set as the
| _param_ space.
|
| Well, they also have similar fractal dimensions: the Mandelbrot
| set's Hausdorff dimension is 2, and Julia sets' range between 1
| and 2.
|
| I won't argue it here but just suggest that this is an
| important complexity relationship, that the neural net being
| fit may also have similar fractal complexity, and that the
| distinction between param and hyperparam in this sense may be
| somewhat of a red herring.
| catlifeonmars wrote:
| Two quibbles: (1) neural nets don't necessarily operate on
| language, (2) they only loosely model biological neurons in
| that they operate in discrete space. All that is to say that
| any similarities are purely incidental without accounting for
| these two facts.
| sjwhevvvvvsj wrote:
| Well, in some sense they don't operate on language at all,
| but on mathematical representations of tokens derived from
| language.
|
| I'm sure from your comment you are aware of the distinction,
| but it is an interesting concept for people to keep in mind.
| pmayrgundter wrote:
| 1) Agreed. It's exciting seeing the same basic architecture
| broadly applied.
|
| 2) Not sure what you mean by "operate in discrete space".
|
| I'd emphasize the potential similarity to biological
| recurrence. Deep ANNs don't need to have this explicitly (tho
| e.g. LSTM has explicit recurrence), but it is known that
| recurrent NNs can be emulated by unrolling, in a process
| similar to function currying. In this mode, a learned network
| would learn to recognize certain inputs and carry them across
| to other parts of the network that can be copies of the
| originator, thus achieving functional equivalence to self
| feedback, or neighbor feedback. It takes a lot of layers and
| nodes in theory, but ofc modern nets are getting very big.
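| A toy illustration of the unrolling point (a sketch assuming
| shared weights across the copies; the names are made up):
|
|     import numpy as np
|
|     def recurrent(x_seq, W_h, W_x, h0):
|         # Recurrent form: one cell applied repeatedly, with self-feedback.
|         h = h0
|         for x in x_seq:
|             h = np.tanh(W_h @ h + W_x @ x)
|         return h
|
|     def unrolled(x_seq, W_h, W_x, h0):
|         # Unrolled form: T feedforward "layers" that are copies of the
|         # same cell. With shared weights the two functions are identical;
|         # a deep non-recurrent net can emulate the loop by learning
|         # near-copies of the originating layer.
|         layers = [(W_h, W_x)] * len(x_seq)
|         h = h0
|         for (Wh, Wx), x in zip(layers, x_seq):
|             h = np.tanh(Wh @ h + Wx @ x)
|         return h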
| ahdsr wrote:
| > Some fractals -- for instance those associated with the
| Mandelbrot and quadratic Julia sets -- are computed by iterating
| a function, and identifying the boundary between hyperparameters
| for which the resulting series diverges or remains bounded.
| Neural network training similarly involves iterating an update
| function (e.g. repeated steps of gradient descent), can result in
| convergent or divergent behavior, and can be extremely sensitive
| to small changes in hyperparameters. Motivated by these
| similarities, we experimentally examine the boundary between
| neural network hyperparameters that lead to stable and divergent
| training. We find that this boundary is fractal over more than
| ten decades of scale in all tested configurations.
|
| Reading this gave me goosebumps
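| To make the analogy concrete, here is a minimal sketch of this
| kind of experiment (not the authors' code; the architecture,
| hyperparameter ranges, and divergence threshold are made up):
|
|     import numpy as np
|
|     def diverged(lr, init_scale, steps=200, seed=0):
|         # Train a tiny two-layer tanh net on fixed random data with
|         # full-batch gradient descent; report whether training blows up.
|         rng = np.random.default_rng(seed)
|         x = rng.normal(size=(64, 4))
|         y = rng.normal(size=(64, 1))
|         w1 = init_scale * rng.normal(size=(4, 16))
|         w2 = init_scale * rng.normal(size=(16, 1))
|         for _ in range(steps):
|             h = np.tanh(x @ w1)
|             err = h @ w2 - y                  # loss = 0.5 * sum(err**2)
|             g2 = h.T @ err
|             g1 = x.T @ ((err @ w2.T) * (1 - h**2))
|             w1 -= lr * g1
|             w2 -= lr * g2
|             if not np.isfinite(err).all() or (err**2).mean() > 1e6:
|                 return True                   # training diverged
|         return False                          # training stayed bounded
|
|     # Dense 2-D grid over (learning rate, init scale); the boundary of
|     # this converge/diverge map is the kind of thing the paper zooms
|     # into.
|     lrs = np.logspace(-3, 0, 100)
|     scales = np.logspace(-1, 1, 100)
|     image = np.array([[diverged(lr, s) for lr in lrs] for s in scales])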
| kraig911 wrote:
| Makes me think of something I think I read from Penrose about
| consciousness.
| eapriv wrote:
| Note that the paper does not provide a proof of this statement,
| only some experimental evidence.
| cactusfrog wrote:
| Even the basin boundary for Newton's method is fractal. This is a
| feature of nonlinear optimization.
| bawolff wrote:
| It's interesting how fluid-like this fractal is compared to other
| fractal-zoom videos I see on the internet. I have no idea how
| common it is for fractals to be like that.
___________________________________________________________________
(page generated 2024-02-19 23:01 UTC)