[HN Gopher] Gradient descent visualization
___________________________________________________________________
Gradient descent visualization
Author : weinzierl
Score : 126 points
Date : 2024-05-07 06:24 UTC (16 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| DrNosferatu wrote:
| Looks great!
|
| Killer feature: allow arbitrary surfaces to be used (with a
| choice of 2D projection if in higher dimensions). And integrate
| it with Python to allow use in production! Allow arbitrary zoom
| and level of detail of the surface.
| albertzeyer wrote:
| Note that such a 2D parameter space often gives the wrong
| intuition when thinking about applying gradient descent to a
| high-dimensional parameter space.
|
| Also, mini-batch stochastic gradient descent behaves more
| stochastically than plain gradient descent.
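|
| For example, a quick sketch of the second point (toy data and
| batch size, not from the linked repo): each mini-batch gradient
| is a noisy estimate of the full-batch one.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(1000, 5))
|     y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
|
|     def grad(w, Xb, yb):
|         # Gradient of 0.5 * mean((Xb @ w - yb)**2) w.r.t. w.
|         return Xb.T @ (Xb @ w - yb) / len(yb)
|
|     w = np.zeros(5)
|     full = grad(w, X, y)
|     mini = [grad(w, X[i:i + 32], y[i:i + 32])
|             for i in range(0, 1000, 32)]
|     print("full-batch gradient norm:", np.linalg.norm(full))
|     print("mean mini-batch deviation:",
|           np.mean([np.linalg.norm(g - full) for g in mini]))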
| GrantMoyer wrote:
| Agreed, but I still think this is a useful tool for building
| intuition about a subset of problems gradient descent faces.
| The animation in the readme is similar to what I picture in my
| head for local minima for instance.
| GaggiX wrote:
| >often gives the wrong intuition when thinking about applying
| gradient descent to a high-dimensional parameter space
|
| Can you give some examples?
| euW3EeBe wrote:
| Not an expert in this field, but I'm guessing this is related
| to unintuitive nature of geometry in high-dimensional spaces.
|
| One rough example I can think of is the fact that the number
| of ways to move away from the origin increases exponentially
| (or even faster?) as the dimensionality goes up. There is
| _way_ more volume away from the origin than near it, I've
| seen this explained as something like "most of the volume of
| a high-dimensional orange is in the peel". One result of this
| is the fact that samples of a standard gaussian end up
| forming a "shell" as opposed to a "ball" that you would
| expect (this phenomenon is called the concentration of
| measure in general).
|
| Also, very roughly, high-dimensional objects have lots of
| corners, and these corners are also very sharp. I would guess
| that gradient descent would get stuck in these corners and
| have a hard time getting out.
|
| Some links related to this:
|
| - Spikey Spheres: http://www.penzba.co.uk/cgi-
| bin/PvsNP.py?SpikeySpheres#HN2
|
| - Thinking outside the 10-dimensional box - 3blue1brown:
| https://www.youtube.com/watch?v=zwAD6dRSVyI
|
| - This is a fairly long talk about HMC, but it does talk
| about some problems that come up when sampling high-
| dimensional distributions:
| https://www.youtube.com/watch?v=pHsuIaPbNbY
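|
| A quick numerical check of the "shell" point above (plain
| numpy, not tied to the visualizer): the norm of a standard
| Gaussian sample in d dimensions concentrates around sqrt(d),
| so almost no mass sits near the origin.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     for d in (2, 10, 100, 1000):
|         x = rng.standard_normal((10000, d))
|         r = np.linalg.norm(x, axis=1)
|         # The norms cluster tightly around sqrt(d) as d grows.
|         print(d, round(r.mean(), 2), round(r.std(), 2),
|               "sqrt(d) =", round(d ** 0.5, 2))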
| dukeofdoom wrote:
| I vaguely remember something like this explained in one of my
| math classes (calculus or linear equations). Not sure if the
| math teacher was referring to the same problem:
|
| If you were walking down a mountain following an algorithm
| that told you just to keep going down, you might get stuck in
| a local minimum once you reach a valley. Sometimes you need to
| go up to get out of the valley before you can keep going down.
| Feathercrown wrote:
| When I took an ML class we solved that with a temperature
| parameter, to allow random deviations out of a local minimum.
| I wonder if there are novel methods now or if they're just
| improved versions of temperature?
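|
| For reference, the temperature idea (in the spirit of
| simulated annealing; toy 1D objective and made-up constants)
| looks roughly like this:
|
|     import math, random
|
|     def f(x):
|         # Toy objective with several local minima.
|         return x * x + 3.0 * math.sin(5.0 * x)
|
|     random.seed(0)
|     x, temp = 4.0, 2.0
|     for step in range(2000):
|         cand = x + random.gauss(0.0, 0.5)
|         delta = f(cand) - f(x)
|         # Accept uphill moves with probability exp(-delta/temp).
|         if delta < 0 or random.random() < math.exp(-delta / temp):
|             x = cand
|         temp *= 0.995  # cool down over time
|     print("x =", round(x, 3), "f(x) =", round(f(x), 3))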
| Imnimo wrote:
| In very high-dimensional spaces (like trying to optimize a
| neural network with billions of parameters), to be "in a
| valley" you must be in a valley with respect to every one of
| the billions of dimensions. It turns out that in many
| practical settings, loss landscapes are pretty well-behaved
| in this regard, and you can almost always find some direction
| in which to keep descending, letting you go around the next
| hill rather than over it.
|
| This 2015 paper has some good examples (although it does sort
| of sweep some issues under the rug):
| https://arxiv.org/pdf/1412.6544
| alok-g wrote:
| This is great!
|
| Which open source license is this under? (Absence of a license
| implies 'copyrighted' by default, which in this case could
| conflict with the Qt open source license. Note: I am not a
| lawyer.)
| GistNoesis wrote:
| Here is a project I created for myself (some time ago) to help
| visualize the gradient as a vector field.
|
| https://github.com/GistNoesis/VisualizeGradient
|
| Probably best used as support material, with someone teaching
| the right mental picture along the way.
|
| A great exercise is to have the student (or yourself) draw this
| visualization with pen and paper, in both 2D and 3D, for
| various functions. Then you can make the connection to the
| usual "tangent line" picture of a derivative.
| can16358p wrote:
| As an extremely visual thinker/learner, thank you for creating
| this!
|
| Many things that were too abstract on paper and in formulas
| are MUCH easier to understand this way.
| levocardia wrote:
| These are nice animations. However, I've always hesitated to get
| too enamored with these simple 2D visualizations of gradient
| descent, because one of the strange takeaways from deep learning
| is that behavior in high dimensions is very different from
| behavior in low dimensions.
|
| In a 2D problem with many local minima, like the "eggholder
| function" [1], gradient descent will be hopeless. But neural net
| optimization in high dimensions really is a similar situation
| with many local minima, except gradient descent does great.
|
| Gradient descent in high dimensions also seems to have the
| ability to "step over" areas of high loss, which you can see by
| looking at the loss of a linear interpolation between weights at
| successive steps of gradient descent. This, again, seems like
| extremely strange behavior with no low-dimensional analogue.
|
| [1] https://www.sfu.ca/~ssurjano/egg.html
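|
| That interpolation probe is easy to sketch (toy nonconvex loss
| standing in for a real network's, and a deliberately large step
| size): if the loss in the middle of the segment is higher than
| at both endpoints, the step "stepped over" a ridge.
|
|     import numpy as np
|
|     def loss(w):
|         return np.sin(3 * w).sum() + 0.1 * (w ** 2).sum()
|
|     def grad(w):
|         return 3 * np.cos(3 * w) + 0.2 * w
|
|     rng = np.random.default_rng(0)
|     w = rng.normal(size=50)
|     w_next = w - 1.0 * grad(w)  # one (large) gradient step
|
|     # Loss along the straight line between successive iterates.
|     for t in np.linspace(0, 1, 11):
|         print(round(t, 1), round(loss((1 - t) * w + t * w_next), 3))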
| cafaxo wrote:
| Does gradient descent really do well for deep learning when the
| gradient is computed with respect to the whole dataset? I
| assumed that the noise in SGD played an important role in
| escaping local minima.
| cameldrv wrote:
| There aren't really local minima in most deep networks. When
| you get into millions/billions of parameters, there will
| essentially always be some directions that point downwards.
| You have to get really really close to the true minimum for
| there to be no direction to go that improves the loss.
|
| Incidentally this same phenomenon is IMO how evolution is
| able to build things like the eye. Naively you'd think that
| since you need so many parts arranged so well, it's
| impossible to find a step by step path where fitness goes up
| at every step, i.e. if you just have a retina with no lens or
| vice-versa, it doesn't work. However, due to the high
| dimensionality of DNA, there is essentially guaranteed to be
| a path with monotonically increasing fitness just because
| there are so many different possible paths from A to B in the
| high dimensional space that at least one is bound to work.
|
| Now this isn't strictly true for every high dimensional
| system. You need to have a degree of symmetry or redundancy
| in the encoding for it to work. For example, in convolutional
| neural networks, you see this phenomenon where some filters
| get "stuck" in training, and those are local minima for that
| subspace. What happens, though, is that if one filter gets
| stuck, the network will just use another one that had a
| better initialization. This is why pruning, lottery tickets,
| etc. work. Things like residual connections enhance this
| effect, since you can even be stuck in a whole layer and the
| training process can just bypass it.
|
| You see the same thing with life, where you could put a
| sequence for the same protein in different parts of the
| genome and it could still be produced, regardless of the
| position. There are also many different ways to encode the
| exact same protein, and many different possible proteins that
| will have the same shape in the critical areas. Life finds a
| way.
| dheera wrote:
| > There aren't really local minima in most deep networks.
|
| How so?
|
| If there are no local minima other than the global one,
| there are convex optimization methods that are far faster
| than SGD or Adam. The only reason these methods exist is
| that deep network training is a non-convex optimization
| problem.
| aaplok wrote:
| There are many nonconvex functions where every local
| minimum is global, and even many nonconvex functions with
| a unique local minimum (which is de facto global). Convex
| methods can fail on those.
|
| The reason why GD and friends are a good choice in deep
| learning is that computing the gradient is cheap (and
| approximating it even more so). Every descent method
| relies on solving a subproblem of sorts, typically
| projecting the current iterate onto a sublevel set of an
| approximation of the function, for some definition of
| projection. With GD, it's as cheap as it gets: just
| subtract a scaled version of the gradient. Subproblems
| in other algorithms are a lot more expensive
| computationally, particularly in high dimensions. So, more
| efficient as in requiring fewer function evaluations,
| yes, but at the cost of doing a lot more work at each
| step.
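|
| Concretely, the entire per-step "subproblem" of plain GD is a
| one-liner (a toy sketch, not any particular library's API):
|
|     import numpy as np
|
|     def gd(grad, w, lr=0.1, steps=100):
|         for _ in range(steps):
|             # "Subtract a scaled version of the gradient."
|             w = w - lr * grad(w)
|         return w
|
|     # Example: minimize f(w) = ||w - 3||^2, whose gradient is
|     # 2 * (w - 3); the iterates converge to w = 3.
|     print(gd(lambda w: 2 * (w - 3.0), np.zeros(4)))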
| huevosabio wrote:
| My understanding of this phenomenon in DL is that it's not due
| to anything intrinsic to gradient descent; the same principles
| and understanding apply.
|
| Rather, it is that with very large dimensionality the
| probability that you spuriously get all partial derivatives to
| be zero is vanishingly small. That is, local minima are less
| likely because you need a lot of dimensions to simultaneously
| agree that df/dx_i = 0.
|
| I may be wrong though!
| dxbydt wrote:
| if the probability that you get a derivative close to 0 is
| small, say only 10%, then you need just 3 dimensions to get
| that multiplicative probability equal to a tenth of a
| percent. 3 is hardly "very large dimensionality"
|
| you can assign different numbers, but still you will find you
| don't need more than say 10 dimensions for this effect to
| happen.
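|
| Quick check of that arithmetic, with independence across
| coordinates as the big (and questionable) assumption:
|
|     p = 0.10  # chance one partial derivative is "close to 0"
|     for d in (1, 3, 10, 100):
|         # chance that all d partials are close to 0 at once
|         print(d, p ** d)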
| j_bum wrote:
| I love creating things to solidify my intuition about topics.
| When I learned about gradient descent, I saw this repo and was
| inspired to create my own toy Python package for visualizing
| gradient descent.
|
| My package, which uses PyVista for visualization:
|
| https://github.com/JacobBumgarner/grad-descent-visualizer
___________________________________________________________________
(page generated 2024-05-07 23:00 UTC)