[HN Gopher] Gradient descent visualization
       ___________________________________________________________________
        
       Gradient descent visualization
        
       Author : weinzierl
       Score  : 126 points
       Date   : 2024-05-07 06:24 UTC (16 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | DrNosferatu wrote:
       | Looks great!
       | 
        | Killer feature: allow arbitrary surfaces to be used (with a
        | choice of 2D projection if in higher dimensions). And integrate
        | it with Python to allow use in production! Allow arbitrary zoom
        | and level of detail of the surface.
        
       | albertzeyer wrote:
        | Note that such a 2D parameter space often gives the wrong
        | intuition when thinking about applying gradient descent to a
        | high-dimensional parameter space.
        | 
        | Also, mini-batch stochastic gradient descent behaves more
        | stochastically than plain gradient descent.
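        | 
        | A minimal numpy sketch of the difference (toy least-squares
        | loss; the batch size and step size are arbitrary choices):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(1000, 5))
        |     w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
        |     y = X @ w_true + 0.1 * rng.normal(size=1000)
        | 
        |     def grad(w, idx):
        |         # gradient of mean squared error on the rows in idx
        |         return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        | 
        |     w_gd, w_sgd, lr = np.zeros(5), np.zeros(5), 0.05
        |     for step in range(200):
        |         # full-batch GD: smooth, deterministic path
        |         w_gd -= lr * grad(w_gd, np.arange(1000))
        |         # mini-batch SGD: noisy gradient estimate, so the
        |         # iterates jitter around the full-batch path
        |         w_sgd -= lr * grad(w_sgd, rng.integers(0, 1000, 32))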
        
         | GrantMoyer wrote:
         | Agreed, but I still think this is a useful tool for building
         | intuition about a subset of problems gradient descent faces.
         | The animation in the readme is similar to what I picture in my
         | head for local minima for instance.
        
         | GaggiX wrote:
         | >gives often the wrong intuition when thinking about applying
         | gradient descent on high-dimensional parameter space
         | 
         | Can you give some examples?
        
           | euW3EeBe wrote:
            | Not an expert in this field, but I'm guessing this is related
            | to the unintuitive nature of geometry in high-dimensional
            | spaces.
            | 
            | One rough example I can think of is that the number of ways
            | to move away from the origin increases exponentially (or even
            | faster?) as the dimensionality goes up. There is _way_ more
            | volume away from the origin than near it; I've seen this
            | explained as something like "most of the volume of a high-
            | dimensional orange is in the peel". One result of this is
            | that samples of a standard gaussian end up forming a "shell"
            | rather than the "ball" you would expect (in general, this
            | phenomenon is called concentration of measure).
           | 
           | Also, very roughly, high-dimensional objects have lots of
           | corners, and these corners are also very sharp. I would guess
           | that gradient descent would get stuck in these corners and
           | have a hard time getting out.
           | 
           | Some links related to this:
           | 
            | - Spikey Spheres:
            | http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres#HN2
           | 
           | - Thinking outside the 10-dimensional box - 3blue1brown:
           | https://www.youtube.com/watch?v=zwAD6dRSVyI
           | 
           | - This is a fairly long talk about HMC, but it does talk
           | about some problems that come up when sampling high-
           | dimensional distributions:
           | https://www.youtube.com/watch?v=pHsuIaPbNbY
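            | 
            | A quick numerical check of the "shell" effect (rough numpy
            | sketch; the dimensions are arbitrary):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     for d in (2, 10, 100, 1000):
            |         # radii of 10000 standard gaussian samples in d dims
            |         r = np.linalg.norm(rng.normal(size=(10000, d)), axis=1)
            |         # mean radius is ~sqrt(d) and the relative spread
            |         # shrinks as d grows: a thin shell, not a ball
            |         print(d, r.mean() / np.sqrt(d), r.std() / r.mean())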
        
         | dukeofdoom wrote:
         | I vaguely remember something like this explained in one of my
          | math classes (calculus or linear equations). Not sure if the
          | math teacher was referring to the same problem:
          | 
          | If you were walking down a mountain following an algorithm
          | that told you just to keep going down, you might get stuck in
          | a local minimum, since you might reach a valley. Sometimes you
          | need to go up to get out of the valley before you can keep
          | going down.
        
           | Feathercrown wrote:
           | When I took an ML class we solved that with a temperature
           | parameter, to allow random deviations out of a local minimum.
           | I wonder if there are novel methods now or if they're just
           | improved versions of temperature?
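            | 
            | For reference, the classic version of that idea is simulated
            | annealing; a toy 1-D sketch (arbitrary loss and cooling
            | schedule):
            | 
            |     import numpy as np
            | 
            |     def f(x):
            |         # a 1-D loss with many local minima
            |         return x**2 + 3 * np.sin(5 * x)
            | 
            |     rng = np.random.default_rng(0)
            |     x, temp = 3.0, 2.0
            |     for step in range(2000):
            |         cand = x + 0.1 * rng.normal()
            |         delta = f(cand) - f(x)
            |         # always accept improvements; accept uphill moves
            |         # with probability exp(-delta/temp), so a high
            |         # temperature lets the walker leave shallow valleys
            |         if delta < 0 or rng.random() < np.exp(-delta/temp):
            |             x = cand
            |         temp *= 0.997   # cool down over time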
        
           | Imnimo wrote:
           | In very high dimensional spaces (like trying to optimize a
           | neural network with billions of parameters), to be "in a
           | valley", you must be in a valley with respect to every one of
           | the billions of dimensions. It turns out that in many
           | practical settings, loss landscapes are pretty well-behaved
            | in this regard, and you can almost always find some direction
            | in which to keep going downward, one that lets you go around
            | the next hill rather than over it.
           | 
           | This 2015 paper has some good examples (although it does sort
           | of sweep some issues under the rug):
           | https://arxiv.org/pdf/1412.6544
        
       | alok-g wrote:
       | This is great!
       | 
        | Which open source license is this under? (In the absence of a
        | license, the default is full copyright with all rights reserved,
        | which in this case could conflict with the Qt open source
        | license. Note: I am not a lawyer.)
        
       | GistNoesis wrote:
       | Here is a project I created for myself (some time ago) to help
       | visualize the gradient as a vector field.
       | 
       | https://github.com/GistNoesis/VisualizeGradient
       | 
        | Probably best used as support material, with someone teaching
        | the right mental picture one should have along the way.
        | 
        | A great exercise is to have the student (or yourself) draw this
        | visualization with pen and paper, both in 2D and 3D, for various
        | functions. You can then make the connection to the usual
        | "tangent line" picture of the derivative of a curve.
        
       | can16358p wrote:
        | As an extremely visual thinker/learner, thank you for creating
        | this!
        | 
        | Many things that were too abstract as formulas on paper are MUCH
        | easier to understand this way.
        
       | levocardia wrote:
        | These are nice animations. However, I've always hesitated to get
       | too enamored with these simple 2D visualizations of gradient
       | descent, because one of the strange takeaways from deep learning
       | is that behavior in high dimensions is very different from
       | behavior in low dimensions.
       | 
        | In a 2D problem with many local minima, like the "eggholder
        | function" [1], gradient descent will be hopeless. But neural net
        | optimization in high dimensions really is a similar situation
        | with many local minima, and yet gradient descent does great.
       | 
       | Gradient descent in high dimensions also seems to have the
       | ability to "step over" areas of high loss, which you can see by
       | looking at the loss of a linear interpolation between weights at
       | successive steps of gradient descent. This, again, seems like
       | extremely strange behavior with no low-dimensional analogue.
       | 
       | [1] https://www.sfu.ca/~ssurjano/egg.html
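        | 
        | The interpolation probe is easy to sketch, by the way
        | (framework-agnostic; the loss function and the weight vectors
        | are assumed to come from your own training loop):
        | 
        |     import numpy as np
        | 
        |     def interp_losses(loss, w0, w1, n=50):
        |         # loss along the segment between two successive
        |         # iterates; a bump in the middle means the update
        |         # "stepped over" a region of high loss
        |         ts = np.linspace(0.0, 1.0, n)
        |         return ts, [loss((1 - t) * w0 + t * w1) for t in ts]
        | 
        |     # toy usage with a 2-D "loss"; real use would flatten
        |     # the network weights into a single vector
        |     loss = lambda w: np.dot(w, w) + np.sum(np.sin(3 * w))
        |     w0, w1 = np.zeros(2), np.array([2.0, -1.0])
        |     ts, vals = interp_losses(loss, w0, w1)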
        
         | cafaxo wrote:
         | Does gradient descent really do well for deep learning when the
         | gradient is computed with respect to the whole dataset? I
         | assumed that the noise in SGD played an important role for
         | escaping local minima.
        
           | cameldrv wrote:
           | There aren't really local minima in most deep networks. When
           | you get into millions/billions of parameters, there will
           | essentially always be some directions that point downwards.
           | You have to get really really close to the true minimum for
           | there to be no direction to go that improves the loss.
           | 
           | Incidentally this same phenomenon is IMO how evolution is
           | able to build things like the eye. Naively you'd think that
           | since you need so many parts arranged so well, it's
            | impossible to find a step-by-step path where fitness goes up
            | at every step; i.e., if you just have a retina with no lens,
            | or vice versa, it doesn't work. However, due to the high
           | dimensionality of DNA, there is essentially guaranteed to be
           | a path with monotonically increasing fitness just because
           | there are so many different possible paths from A to B in the
           | high dimensional space that at least one is bound to work.
           | 
           | Now this isn't strictly true for every high dimensional
           | system. You need to have a degree of symmetry or redundancy
           | in the encoding for it to work. For example, in convolutional
           | neural networks, you see this phenomenon where some filters
           | get "stuck" in training, and those are local minima for that
           | subspace. What happens though is that if one filter gets
           | stuck, the network will just use another one that had a
            | better initialization. This is why pruning, lottery tickets,
            | etc. work. Things like residual connections enhance this
           | effect since you can even be stuck in a whole layer and the
           | training process can just bypass it.
           | 
           | You see the same thing with life, where you could put a
           | sequence for the same protein in different parts of the
           | genome and it could still be produced, regardless of the
           | position. There are also many different ways to encode the
           | exact same protein, and many different possible proteins that
           | will have the same shape in the critical areas. Life finds a
           | way.
        
             | dheera wrote:
             | > There aren't really local minima in most deep networks.
             | 
             | How so?
             | 
              | If there are no local minima other than the global one,
              | there are convex optimization methods that are far faster
              | than SGD or Adam. The only reason these methods exist is
              | that optimizing deep networks is a non-convex problem.
        
               | aaplok wrote:
               | There are many nonconvex functions where every local
               | minimum is global, and even many nonconvex functions with
               | a unique local minimum (which is de facto global). Convex
               | methods can fail on those.
               | 
               | The reason why GD and friends are a good choice in deep
               | learning is that computing the gradient is cheap (and
               | approximating it even more so). Every descent method
               | relies on solving a subproblem of sorts, typically
               | projecting the current iterate on a sublevel set of an
               | approximation of the function, for some definition of
                | projection. With GD, it's as cheap as it gets: just
                | subtract a shrunken version of the gradient. Subproblems
               | in other algorithms are a lot more expensive
               | computationally, particularly at high dimensions. So more
               | efficient as in requiring fewer function evaluations,
               | yes, but at the cost of doing a lot more work at each
               | step.
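                | 
                | In code, the per-step GD "subproblem" really
                | is one line (a sketch with a toy quadratic
                | objective):
                | 
                |     import numpy as np
                | 
                |     def grad(w):
                |         # gradient of the toy objective
                |         # f(w) = |w|^2
                |         return 2 * w
                | 
                |     w, lr = np.ones(10**6), 0.1
                |     for _ in range(100):
                |         # the whole per-step subproblem:
                |         # one scaled subtraction, cheap
                |         # even with a million parameters
                |         w = w - lr * grad(w)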
        
         | huevosabio wrote:
          | My understanding of this phenomenon in DL is that it's not due
          | to anything intrinsic to gradient descent; the same principles
          | and understanding apply.
          | 
          | Rather, it is that with very large dimensionality the
          | probability that you spuriously get all partial derivatives to
          | be zero is vanishingly small. That is, local minima are less
          | likely because you need a lot of dimensions to agree that
          | df/dx_i = 0 for every i.
         | 
         | I may be wrong though!
        
           | dxbydt wrote:
           | if the probability that you get a derivative close to 0 is
           | small, say only 10%, then you need just 3 dimensions to get
           | that multiplicative probability equal to a tenth of a
           | percent. 3 is hardly "very large dimensionality"
           | 
           | you can assign different numbers, but still you will find you
           | don't need more than say 10 dimensions for this effect to
           | happen.
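            | 
            | spelling out the arithmetic (treating the coordinates as
            | independent, which is an idealization):
            | 
            |     # chance that all d partials are near 0 when each
            |     # one is, independently, near 0 with probability p
            |     for p in (0.1, 0.5):
            |         for d in (3, 10, 100):
            |             print(p, d, p**d)
            |     # p=0.1, d=3   -> 0.001, the tenth of a percent above
            |     # p=0.5, d=100 -> ~8e-31, so even a generous p
            |     #                 vanishes fast as d grows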
        
       | j_bum wrote:
       | I love creating things to solidify my intuition about topics.
       | When I learned about gradient descent, I saw this repo and was
       | inspired to create my own toy Python package for visualizing
       | gradient descent.
       | 
       | My package, which uses PyVista for visualization:
       | 
       | https://github.com/JacobBumgarner/grad-descent-visualizer
        
       ___________________________________________________________________
       (page generated 2024-05-07 23:00 UTC)