[HN Gopher] Regularized Newton Method with Global $O(1/k^2)$ Con...
       ___________________________________________________________________
        
       Regularized Newton Method with Global $O(1/k^2)$ Convergence
        
       Author : ColinWright
       Score  : 193 points
       Date   : 2021-12-07 13:25 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | redwood wrote:
       | Can someone explain?
       | https://en.m.wikipedia.org/wiki/Newton%27s_method
        
         | kiengcan9999 wrote:
         | There is a helpful note here:
         | https://huynp.com/optimization/2018/11/11/Fundamental-of-Unc...
        
         | gmadsen wrote:
         | [deleted]
        
           | kergonath wrote:
           | This is a useful thing to keep in mind:
           | https://xkcd.com/1053/ .
           | 
           | [edit] removed irrelevant stuff. The xkcd is great though :)
        
             | gmadsen wrote:
             | fair enough, and rereading my comment, it does come off
             | that way. I was more curious what his background was, but
             | that is not how I worded it.
        
               | kergonath wrote:
               | Fair enough! It's difficult to get the tone across in
               | writing.
        
         | mijoharas wrote:
         | Would you like someone to explain newton's method (Newton
         | Raphson iteration), or the linked article?
         | 
          | (I could have a go at explaining Newton's method, but I don't
          | understand the arxiv article. In simple terms, Newton-Raphson
          | iteration is an iterative method for finding the "zeros" of a
          | function. At each step you evaluate the function and its
          | derivative, follow the tangent line down to where it crosses
          | zero, and use that crossing as your next guess.)
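          | 
          | A minimal sketch of that procedure in plain Python (the example
          | function and starting point below are just an illustration):
          | 
          |     import math
          | 
          |     def newton_raphson(f, df, x0, tol=1e-12, max_iter=50):
          |         # Follow the tangent line at x down to where it crosses
          |         # zero, and use that crossing as the next guess.
          |         x = x0
          |         for _ in range(max_iter):
          |             step = f(x) / df(x)
          |             x = x - step
          |             if abs(step) < tol:
          |                 break
          |         return x
          | 
          |     # Example: solve cos(x) = x by finding a zero of cos(x) - x.
          |     print(newton_raphson(lambda x: math.cos(x) - x,
          |                          lambda x: -math.sin(x) - 1.0,
          |                          1.0))   # ~0.739085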
        
           | whinvik wrote:
           | The problem with Newton's method is that it is provably
           | convergent but only if you are sufficiently close to the
           | solution. That basically means that the initial condition
            | you choose for solving the problem is very important.
            | What this paper seems to suggest is that their method is
            | globally convergent, and they produce the algorithm to do it,
            | although there is some tuning that needs to be done. So in
            | theory, with their method we will always get to a solution,
            | no matter the initial condition we choose.
        
             | abeppu wrote:
             | Their abstract says that their global convergence claim is
             | for convex objectives. So structurally, they're only
             | talking about functions for which the whole domain is
             | "sufficiently close to the solution" i.e. where every
             | tangent is a global bound.
        
               | SpaceManNabs wrote:
               | Important to note that other variants of newton's method
               | do not necessarily converge from any starting point on
               | even strictly convex functions.
               | 
               | Even in convex functions, the whole domain is not
               | necessarily sufficiently close.
        
               | abeppu wrote:
               | Actually, can you expand on that? The paper makes some
               | broad statements about how existing methods are unstable
               | and need to have a good starting point etc. This is not
               | my area, but in the past when I've used (quasi)newton
               | methods, it "just worked". What are the kinds of strictly
               | convex functions that cause this bad behavior? For
               | "strictly convex" in my limited, low-dimensional
               | imagination I just picture a bowl-shape. They mention
               | failure modes, but I'd love a clear (ideally visual)
               | example of when it goes wrong. Robust global quadratic
               | convergence seems to be the main value proposition here,
               | so understanding the non-global fragile non-convergence
               | of other methods seems important.
        
               | mijoharas wrote:
                | I believe if you select as a starting point a point
                | where the derivative of the function is 0, then 'classic'
                | Newton-Raphson will not converge (since you select the
                | "crossing point" of the tangent line for your next
                | iteration, and that is undefined when you have 0
                | gradient).
                | 
                | e.g. y = x^2 - 1, choosing 'starting position' x = 0.
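                | 
                | A minimal sketch of that failure in plain Python (using
                | the example function above):
                | 
                |     f  = lambda x: x**2 - 1    # zeros at +1 and -1
                |     df = lambda x: 2 * x
                |     x0 = 0.0                   # derivative is 0 here
                |     x1 = x0 - f(x0) / df(x0)   # raises ZeroDivisionError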
        
               | abeppu wrote:
               | ... but then your starting point is already at the
               | optimum. You're done in zero steps! I think for a
               | "strictly convex" function, the (sub)derivative will
               | _only_ be zero at the global minimum.
        
               | mijoharas wrote:
               | > but then your starting point is already at the optimum
               | 
                | Newton-Raphson is used to find where f(x) = 0, not where
               | the derivative f'(x) = 0.
               | 
               | For the function given above f(x) = 0 when x = 1 and x =
               | -1. Not when the derivative is at 0 (when x = 0).
               | 
               | You'd be done in 0 steps to find the minimum of the
               | function, or the point of 0 derivative, but that's not
                | what you're looking for with Newton-Raphson.
        
               | abeppu wrote:
               | Newton's method is _commonly_ used for function
               | minimization by looking for the zero of the derivative,
               | and this is entirely what the article is about. In sec
               | 1.1 at the top of page 2, they say very clearly "In this
               | work, we are interested in solving the unconstrained
               | minimization problem ...". Even in the first para, they
               | say "Newton's method is a cornerstone of convex
               | optimization", meaning optimization/minimization of a
               | convex function.
               | 
               | The article says "Unfortunately, Newton's method is
               | unstable: it works only for strongly convex problems and
               | may diverge exponentially when initialized not very close
               | to the optimum." And I was asking for an example or
               | description of such a case.
               | 
               | You'll also note there are several comments in this
               | discussion that mention the hessian/2nd order
               | derivatives, and these all show up because this is being
               | used in the context of optimization by looking for f'(x)
               | = 0.
        
               | SpaceManNabs wrote:
               | I am super sorry. I meant strongly convex. Autocorrect on
               | mobile.
               | 
               | Example from Boyd.
               | 
               | https://math.stackexchange.com/questions/3408436/newton-
               | rhap...
        
             | dahak27 wrote:
             | Unless I'm mistaken (very possible), the existing cubic
             | Newton method they discuss already has this convergence
             | guarantee, but introduces a lot of extra expensive work at
             | each step. The specific contribution of the paper is
             | finding clever regularization tricks to keep this guarantee
              | while sidestepping the extra hassle the cubic method needs.
        
           | [deleted]
        
         | rwilson4 wrote:
         | Haven't read the paper yet, but Boyd and Vandenberghe's book,
         | Convex Optimization, is available for free on the authors'
         | website and covers all the foundational material.
         | 
         | Update: I've skimmed the paper and here's the gist as I
         | understand it. The paper alleges that Newton's method can have
         | convergence issues, even when using line search. Their approach
         | resolves these convergence issues. It's 6am where I am which
         | isn't great for processing this stuff, but I write a lot of
         | custom Convex Optimization solvers, and I've never had
         | convergence issues with Newton's method with the default line
         | search parameters suggested by the book mentioned above, even
         | with some bad condition numbers that throttle off-the-shelf
         | solvers (which is why I write my own).
         | 
         | The paper talks about speed of convergence, which initially
         | made me think this was a first order method, not requiring
         | calculating the Hessian (matrix of 2nd order derivatives). For
         | large systems this is really slow and 1st order methods speed
         | this up by working only with the first derivatives. But that's
         | not what this paper is: they still use the Hessian but add
         | regularization, which presumably just improves the condition
         | number. Reminds me of proximal algorithms but haven't explored
         | that connection.
         | 
         | Update on the update: the paper addresses situations where the
         | self-concordance assumption does not hold. TBH I didn't pay
          | much attention in class when we went over self-concordance and
         | convergence, but the contribution of this paper seems to be
         | extending convergence guarantees even to non-self-concordant
         | systems.
        
         | jejones3141 wrote:
         | Take an initial guess at where f(x) = 0; call it x1. Then you
         | generate what you hope is a better guess x2 by finding where
         | the line through (x1, f(x1)) with slope f'(x1) hits zero.
         | Lather, rinse, repeat.
         | 
         | Things can get ugly--say, if the derivative is zero at your
         | guess--and if your initial guess is bad, or f is wonky (say it
         | has several zeroes near one another) it may take a long time.
          | But as people have said, if you have a good guess, convergence
          | is quadratic, doubling the number of good bits each time. So
          | for floating point square root, peek at the exponent and halve
          | it to get the exponent of your starting guess (multiplying the
          | guess by sqrt(2) if the original exponent is odd); after about
          | log2(#mantissa bits) iterations you're as good as it gets.
         | 
         | Apparently this new method avoids the nasty cases and is still
         | fast. Cool stuff.
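          | 
          | A minimal sketch of that square-root recipe in plain Python
          | (using math.frexp for the exponent trick is just my own
          | illustration):
          | 
          |     import math
          | 
          |     def newton_sqrt(a, iters=6):
          |         m, e = math.frexp(a)           # a = m * 2**e, 0.5 <= m < 1
          |         x = math.ldexp(1.0, e // 2)    # initial guess ~ 2**(e/2)
          |         if e % 2:
          |             x *= math.sqrt(2.0)        # adjust for an odd exponent
          |         for _ in range(iters):
          |             # Newton step for f(x) = x**2 - a
          |             x = x - (x * x - a) / (2.0 * x)
          |         return x
          | 
          |     print(newton_sqrt(2.0), math.sqrt(2.0))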
        
         | oriettaxx wrote:
         | This is great for Big-O notation https://www.bigocheatsheet.com
        
         | [deleted]
        
         | djmips wrote:
          | A fun visual explanation of Newton's Method is contained
         | within this video by 3Blue1Brown https://youtu.be/-RdOwhmqP5s
        
         | mvanaltvorst wrote:
         | Newton's method is an algorithm you can use to
         | minimise/maximise/find the root of a differentiable function
         | very quickly if you start with a reasonable guess. There are
          | many variants on Newton's method, e.g. BFGS, which does not
          | require the Hessian (2nd order derivative) of the function. A
          | pitfall of Newton's method is that it does not find the global
          | minimum per se; it might just converge to a local minimum if
         | this local minimum is "closer" to your initial guess. Some
         | variants of Newton's method allow you to avoid this pitfall,
         | but they are slow in practice.
         | 
         | This seems to be another variant that claims to find the global
         | minimum quicker than other existing methods. I am not
         | experienced enough to verify this claim. I am also not sure how
         | this new algorithm performs in practice; in theory, ellipsoid
         | methods are a lot more efficient than simplex methods for
         | optimising convex functions, while in practice simplex methods
         | are usually an order of magnitude faster. So take this result
         | with a grain of salt.
        
           | MauranKilom wrote:
           | To clarify, "global" in this context refers to global
           | convergence (as in, it won't diverge no matter the starting
           | point), not global optima. They assume a convex function,
           | i.e. only a single minimum. The paper is unrelated to
           | avoiding local minima.
        
             | abetusk wrote:
             | So isn't the magic here convexity? Convexity is such a huge
             | constraint that I would guess convergence would be hard to
             | avoid.
        
               | jleahy wrote:
               | Tell that to Newton's method, which can fail to converge
               | for convex functions.
        
               | kergonath wrote:
               | There is no way of ensuring that you have found a global
               | minimum if the function is not convex, or if you don't
               | make equivalent assumptions, in the general case (if the
               | domain of the function is continuous, and/or infinite).
               | 
               | If you don't make assumptions, you would need to consider
               | every single point of the domain, and there is an
               | infinite number of them. The job gets easier the more
                | assumptions you make about the continuity,
                | differentiability, and monotonicity of the function.
        
         | MauranKilom wrote:
         | (Disclaimer: Not an expert, and only gave the paper a very
         | cursory read).
         | 
         | Newton's method attempts to find zeroes of a function (i.e.
         | find x where F(x) = 0). This is equivalent to finding a
         | (possible) minimum of a function f(x) where f'(x) = F(x). Note:
         | The wiki article describes the process of finding zeroes for
         | F(x), whereas the article searches minima of f(x), in terms of
         | my notation here.
         | 
          | At a basic level, Newton's method iteratively improves a guess
          | x_k via the following update step:
          | 
          |     x_(k+1) = x_k - f(x_k) / f'(x_k)
         | 
         | When close to the correct solution, convergence is quadratic:
         | You "double the number of correct digits" at every step, so to
         | speak. But if you're far away, you might be diverging (going
         | further away from the solution). Progress can also be slow
         | under other conditions (multiple solutions nearby).
         | 
          | The paper, essentially, suggests the following update step
          | instead:
          | 
          |     x_(k+1) = x_k - f(x_k) / (f'(x_k) + sqrt(H |f(x_k)|))
         | 
         | for some H (that needs to be chosen/estimated). The extra term
         | can be thought of as some form of regularization. This
         | supposedly avoids divergence (where f is non-decreasing and
         | does not grow too fast), and is also consistently fast.
         | 
         | Note that I've simplified to the one-dimensional case here.
         | There are further concerns in higher dimensions (e.g. number of
         | matrix inversions) that this method handles well, according to
         | the authors.
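          | 
          | A one-dimensional sketch of those two updates in plain Python
          | (writing g for the function whose zero we want; the constant
          | H = 1.0 and the toy example are arbitrary choices):
          | 
          |     def plain_step(g, dg, x):
          |         # x_(k+1) = x_k - g(x_k) / g'(x_k)
          |         return x - g(x) / dg(x)
          | 
          |     def regularized_step(g, dg, x, H=1.0):
          |         # x_(k+1) = x_k - g(x_k) / (g'(x_k) + sqrt(H * |g(x_k)|))
          |         return x - g(x) / (dg(x) + (H * abs(g(x))) ** 0.5)
          | 
          |     # Toy example: g(x) = x**3, the derivative of the convex
          |     # function x**4 / 4, whose minimizer is x = 0.
          |     g, dg = (lambda x: x**3), (lambda x: 3 * x**2)
          |     x = 5.0
          |     for _ in range(50):
          |         x = regularized_step(g, dg, x)
          |     print(x)   # small and shrinking towards 0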
        
           | gugagore wrote:
           | It's great that you clarified the root-finding / minimization
           | perspectives.
           | 
           | Shouldn't your update equations be written in terms of F, not
           | f? (Alternatively, write f'(x_k) / f''(x_k).)
           | 
           | To minimize f, find zeros of f' = F. To find a zero of F,
            | construct a linear approximation using f'' = F'.
           | 
           | If you're using Newton's method to minimize a function f, you
              | need the first- and second-order derivatives (hence the name
           | "second-order method").
        
             | [deleted]
        
             | MauranKilom wrote:
             | You're correct, I should've used F there, my bad. Too late
             | to edit, unfortunately!
        
           | dataflow wrote:
           | > This is equivalent to finding a (possible) minimum of a
           | function f(x) where f'(x) = F(x).
           | 
           | I'd like to clarify that the "equivalence" might be
           | misleading. Just because you have a vector-valued function,
           | that does not imply it has an antiderivative you can
           | minimize, so merely _having_ that scalar antiderivative
           | actually gives the problem more structure and thus
           | intrinsically makes it  "easier" in that sense.
        
             | gugagore wrote:
             | I think a concrete example of what you're saying is that a
             | rotation vector field (e.g. F(x,y) = [y, -x]) is not the
             | gradient of any potential function.
             | 
             | And I think you're saying it's intrinsically easier to find
             | where a vector field vanishes if it has a potential
             | function. You can just follow the vector field and you'll
             | get to a zero. But if you follow a rotating vector field,
             | it will take you in circles.
             | 
             | But given that you're doing Newton's method, I don't think
             | one problem is easier than the other, right?
        
               | dataflow wrote:
               | Is your conjecture that Newton's method would always work
               | at least as well as gradient descent when the vector
               | field has nonzero rotation? I don't see why that should
               | be true, especially given that's not the case when
               | there's zero rotation. You could easily come up with
               | examples where gradient descent would work better despite
               | the rotation, and I expect you could also find lots of
               | rotation cases where Newton's method would get completely
               | thrown off (whereas GD wouldn't).
               | 
               | (Also, don't conflate the problem with the solution. My
               | statements were about optimization vs. root finding, not
               | about Newton vs. GD. Even if there _is_ a solution that
                | works for multiple problems, that doesn't mean the
               | problems are equally difficult. If one problem admits
               | solutions that the other doesn't, then it's still easier
               | in that sense. If a problem can be reduced to another,
               | then it's also easier in that sense. Just like how you
               | can use a cannon to kill a fly, but that doesn't mean
               | destroying a building is as easy as killing a fly...)
        
               | gugagore wrote:
               | > Is your conjecture that Newton's method would always
               | work at least as well as gradient descent when the vector
               | field has nonzero rotation?
               | 
               | > You could easily come up with examples where gradient
               | descent would work better despite the rotation, and I
               | expect you could also find lots of rotation cases where
               | Newton's method would get completely thrown off (whereas
               | GD wouldn't).
               | 
               | I don't think I understand the meaning of the questions.
               | It isn't gradient descent unless you have a potential
               | function. You cannot do gradient descent on a vector
               | field that is not a gradient. You can solve an ODE.
               | 
               | You can, of course, turn a root-finding problem (F(x) =
               | 0) into a minimization problem by defining a potential
               | like g(x) = (F(x))^2. You can do gradient descent on g.
               | So in that sense, both of these problems reduce to each
               | other, though there are better methods for solving F(x) =
               | 0 than minimizing g.
               | 
               | I still don't understand this sense of optimization being
               | easier than root finding.
               | 
               | What I do understand is some sense in which there are
               | more vector fields than there are gradients, since only
               | some vector fields are gradients of potential functions.
               | My comment was intended to point that out, since I
               | thought that is maybe what you meant.
        
       | civilized wrote:
       | This is very interesting. It looks like the innovation is to
       | scale the identity matrix in the L-M step formula by the square
       | root of the norm of the gradient. Very simple and elegant, and
       | trivial to implement. Lots of people are going to be shaking
       | their heads wondering why they didn't think of this first.
       | 
       | The motivation for the idea is very nice and easy to understand.
       | The author completely explains it before the middle of the second
       | page. I encourage you to look at it if you know enough vector
       | calc to make sense of the notation. Essentially, he takes a well-
       | known fast-converging method whose steps are hard to compute and
       | shows how these steps can be approximated cheaply while still
       | retaining much of the convergence rate benefit.
       | 
       | This may have implications for many large-scale optimization
       | problems as L-M regularized Newton steps are a sort of natural
       | halfway house between gradient descent and the full Newton's
       | method, both theoretically and practically.
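        | 
        | A minimal numpy sketch of the step described above (treating H
        | as a fixed user-supplied constant; the adaptive choice of H from
        | the paper is ignored here):
        | 
        |     import numpy as np
        | 
        |     def regularized_newton_step(grad, hess, x, H=1.0):
        |         g = grad(x)
        |         # Scale the identity by the square root of the gradient norm.
        |         lam = np.sqrt(H * np.linalg.norm(g))
        |         return x - np.linalg.solve(hess(x) + lam * np.eye(len(x)), g)
        | 
        |     # Example on the separable convex objective f(x) = sum(x**4) / 4:
        |     grad = lambda x: x**3
        |     hess = lambda x: np.diag(3.0 * x**2)
        |     x = 2.0 * np.ones(3)
        |     for _ in range(30):
        |         x = regularized_newton_step(grad, hess, x)
        |     print(x)   # approaches the minimizer at the origin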
        
         | jedbrown wrote:
         | A very similar technique is popular for rootfinding (since the
         | 80s, convergence analysis in this 1998 paper), where scaling by
         | the norm of the gradient (eq. 1.5) enables q-quadratic
         | convergence.
         | 
          | https://doi.org/10.1137/S0036142996304796 (free version
          | https://repository.lib.ncsu.edu/bitstream/handle/1840.4/296/...)
        
           | civilized wrote:
           | Rootfinding and optimization are so closely related!
        
         | midjji wrote:
          | The problem with simple innovations in old methods is that
          | there is very rarely anything new under the sun; odds are good
          | it's already been done. The paper does not evaluate on any of
          | the big standard benchmarks either. The paper spends a
          | surprising/suspicious amount of time on very basic things. "The
          | divergence issues of Newton's method with line search seems to
          | be a relatively unknown fact." No, it's well known!
          | 
          | It looks interesting, but some things, like the L-M lambda
          | being constant? No, you adapt it to the data, increasing and
          | decreasing it in accordance with the feedback from the step
          | error predictions. Similar, but not identical, to what is
          | proposed here. I'd need to read it carefully and it doesn't look
          | bad, but I think I'll just wait for the reviewers.
        
           | civilized wrote:
           | All good points.
           | 
            | Personally, I wouldn't have known off the top of my head that
            | Newton's method with line search can diverge. What I think
           | everyone in this field would know is that Newton's method
           | with constant step size diverges. I'm actually curious now if
           | Newton's method with _backtracking_ line search from a
            | constant step size diverges. Exact line search can go a
            | little nuts; that doesn't surprise me, but divergence with
           | backtracking from a constant step size seems less likely. (I
           | guess it depends on what "constant" means here...)
           | 
           | Seems like it could be awfully nice to have a smart
           | theoretically backed initial guess of the dynamic lambda. The
           | author actually uses a feedback strategy to set H in
           | conjunction with his gradient scaling (see Algorithm 2 /
           | AdaN), and it's possible that the scaling allows H to be
           | basically correct more often and not have to be adjusted so
           | much? Which is potentially really helpful since Newton steps
           | aren't cheap and you don't want to waste them on guess-and-
           | check.
           | 
           | But it's not obvious it's better than other strategies and we
           | do need benchmarks to prove it out, yes.
           | 
           | The author is from INRIA, which does a lot of advanced
           | optimization research, so I'd say it's a serious paper prima
           | facie.
        
       | marcodiego wrote:
       | > We present a Newton-type method that converges fast from any
       | initialization and for arbitrary convex objectives with Lipschitz
       | Hessians. We achieve this by merging the ideas of cubic
       | regularization with a certain adaptive Levenberg--Marquardt
       | penalty.
       | 
       | I feel insignificant just by reading the first phrase.
        
         | SubiculumCode wrote:
         | Longitudinal Analysis of an Amygdalocentric Structural Network
         | using Mixed Effects Multivariate Distance Matrix Regression
         | with Multiple Fractionated Polynomials
         | 
         | A working title for a manuscript I am shopping out to the
         | various neuroscience journals right now. I feel like I'm
         | finally the esoteric person I've always wanted to be.
        
         | jpfr wrote:
         | I teach a class on the subject:
         | https://www.youtube.com/playlist?list=PLdkTDauaUnQpzuOCZyUUZ...
         | 
         | Understanding optimization at this level and being able to
         | formulate your own problems in one of the canonical forms gives
         | you super powers.
         | 
         | You don't even have to build your own solvers. The
         | commercial/open source ones are typically good enough.
        
           | stochtastic wrote:
           | These lectures look fantastic -- thank you for making and
           | sharing them! I've been looking for a good foundational
           | treatment of optimization to digest over the holidays.
        
           | phkahler wrote:
           | >> You don't even have to build your own solvers. The
           | commercial/open source ones are typically good enough.
           | 
           | But as a maintainer of solvespace (Open Source CAD software
           | with geometric/algebraic constraint system) I am left
           | wondering if this can be applied to our constraint solver,
           | which I didn't write. It solves systems of algebraic
           | constraints using some form of Newtons method, partial
           | derivatives, jacobian... All stuff I'm familiar with, but not
           | in great detail. Figuring out how best (and weather) to apply
           | this might be a major digression ;-)
        
             | jpfr wrote:
             | After following the course, you should be able to grok the
             | solver code in /src/system.cpp. That looks fairly standard.
             | 
             | The interesting part (to me, as that is not my specialty)
             | lies in the translation of the 3D constraints (including
             | rotation, etc.) into a single objective function for
             | solving Newton-style.
        
         | tubby12345 wrote:
         | Lolol why? Even if you're a mathematician in a different area
         | you might have no clue what that sentence means.
        
         | latenightcoding wrote:
         | that popular "mathemathics for machine learning" book will
         | teach you almost everything in that sentence, authors even
         | had/have a mooc on coursera.
        
         | brrrrrm wrote:
         | Newton method: find an error and attempt to fix it by
         | approximation
         | 
         | Convex objective: a bowl-like function, we want inputs that put
         | us at the bottom of it
         | 
         | Lipschitz: a space that isn't too stretchy
         | 
          | Hessians: the matrix of second partial derivatives of the
          | objective with respect to the inputs we're optimizing
         | 
         | Regularization: make the "fix by approx" procedure a bit more
         | well behaved
         | 
         | Levenberg-Marquardt: another adaptation to make the fix
         | procedure better
        
           | montebicyclelo wrote:
           | Would be great to have a resource written like this that
            | covers the rest of maths (or, in fact, jargon in general).
        
             | adampk wrote:
              | I always thought biology would be easier to learn if entity
              | names were verbose descriptions instead of being based on
              | old languages,
              | 
              | i.e. Mitochondria -> PowerHouseOfTheCell
        
           | robbedpeter wrote:
           | It's funny how many complex ideas are "just" simpler ideas
           | mapped to the right sequence of steps. The jargon and naming
            | conventions can obscure the idea from outsiders, but allow
           | for brief and accurate communication. If they'd explained the
           | idea Barney-style, they'd have needed twice as many pages.
           | 
           | Thanks for the breakdown, it makes the paper more accessible.
        
             | sorokod wrote:
              | Well, a simpler idea is in a way like the initial guess for
              | Newton-Raphson. Sometimes you get convergence and sometimes
              | you don't.
        
             | whatshisface wrote:
             | Every idea in math is like that, because the rules of math
             | are to do everything in little steps.
        
             | tubby12345 wrote:
             | Math is about formalizing "simple ideas" so that they can
             | be reasoned about _exactly_.
        
         | marktangotango wrote:
         | > I feel insignificant just by reading the first phrase.
         | 
          | I often wonder about the intentions of those who post these
          | types of comments. Being charitable, one might suppose the
         | parent is offering a backward sort of compliment to the
         | authors. Something like "Great job, beyond my capabilities, I'm
         | glad someone is working on these things".
         | 
         | There are a lot of less charitable formulations; why should
         | other readers care about parent commenters insecurity? Does
         | parent commenter not appreciate the work that goes into
         | studying these topics? Was parent commenter told they were
         | "smart" to often as a child and never learned to exert
         | themselves academically? Etc.
         | 
         | Although this post is mostly in jest, I really am curious what
          | prompts these comments; comments of this type appear quite
          | reliably.
        
           | sam0x17 wrote:
           | > There are a lot of less charitable formulations; why should
           | other readers care about parent commenters insecurity? Does
           | parent commenter not appreciate the work that goes into
           | studying these topics? Was parent commenter told they were
           | "smart" to often as a child and never learned to exert
           | themselves academically? Etc.
           | 
           | You're missing the obvious interpretation -- jargonization of
           | the sciences keeps people out and keeps the masses stupid.
           | Math is particularly guilty of this -- more charitable fields
           | use jargon that is at least semi-self-explanatory, words and
           | phrases that, while jargon, do make sense if you deconstruct
           | them. In Math, people are so arrogant that they discover
           | something and immediately name it after themselves, so if you
           | haven't had a topology class, you won't know that X person's
           | name maps to Y concept. Good jargon can be pieced together
           | with pure reason alone without knowing the names of the
           | people who invented what. This isn't true across the board of
           | course -- "gradient descent" is a particularly well-named
           | piece of jargon, for example.
        
             | drdeca wrote:
             | People generally don't name it after themselves; other
             | people name it after them later.
             | 
             | Not that many people wouldn't like for the thing they
             | introduce to eventually be named after them. But, I think
             | it would generally be considered presumptuous to directly
             | name it after oneself.
             | 
             | I remember one story where one mathematician upon hearing
             | something referred to as a [some term named after them],
             | asked what the term referred to. (I want to say it was
             | Hilbert asking what a "Hilbert space" was, but I'm not
             | sure.)
        
         | jakeinspace wrote:
         | If it makes you feel any better, I knew nothing about the field
         | of optimization (convex or otherwise) 3 months ago. But after
         | going through an optimization course in my CS master's program
         | (MCSO at UT Austin) this semester, I can parse this part of the
         | abstract fine, minus the Levenberg--Marquardt, which I had to
         | look up. That is with not much of a prior mathematics
         | background. You could probably learn enough in a semester of
         | part-time study to grok this paper, as long as you're
         | comfortable with differential calculus.
        
       | rahimiali wrote:
       | People who're familiar with Newton's method might be surprised at
       | the convergence rate. 1/k^2 is slower than the textbook rate for
       | Newton's method with line search, which is 2^{-2^k}, basically
       | implying convergence in a constant number of steps. The rate in
       | this paper seems to be no faster than plain gradient descent on a
       | strongly convex function! So why go through the trouble of the
       | much more complicated update?
       | 
       | Because the textbook proof of the fast convergence of Newton's
        | method makes additional assumptions on the objective function, for
        | example that it is strongly convex or self-concordant.
       | This paper only assumes Lipschitz continuous Hessians.
       | 
        | The idea of damping the Hessian is old (it's sometimes called the
        | "damped Newton method", "trust region Newton method", or
        | "Levenberg-Marquardt", though the latter two refer to more
        | specific ideas). This paper offers a view of how much damping
        | to apply.
        
         | civilized wrote:
         | There are two kinds of convergence proofs - the ones that get
         | you a 1/k or 1/k^2 rate, and the ones that get you "linear" or
         | "superlinear" convergence (which is optimization jargon for
         | exponential/superexponential decay of the error). The latter
         | require way too strong of assumptions to be that useful in
         | practice, and we generally think of them as just "local"
         | convergence theorems - i.e. the assumptions they make start to
         | be true when you get close enough to the minimum, and the fast
         | convergence kicks in at that point. Which is nice, but getting
         | close enough in the first place is often 99% of the problem!
        
       | paulpauper wrote:
       | I was not aware that this was a problem in need of solving.
        | Aren't computers so fast that this is already solved from a
        | practical standpoint? Newton's method (as well as extensions of
       | it to vector analysis) is ancient math. I was not even aware that
       | this is an area of ongoing interest.
        
         | syvon wrote:
         | I am unsure if this precise problem needed a better solution,
         | but even if not, this is not how math research works. Finding
         | better solutions to random / useless problems can lead to a
         | better solution to a problem which needs solving. The relevance
         | of the problem is not a consideration.
        
         | BeetleB wrote:
         | > aren't computers so fast that this is already solved form a
         | practical standpoint.
         | 
         | Not in the least. Much of the scientific computation out there
         | is to solve these types of problems. Even if you sped up
         | computers ten-fold you wouldn't be satisfied.
         | 
         | When I was in grad school, the computational teams would run
         | calculations where it would take a few days to get a single
         | data point (they were trying to plot a curve). And they had
         | problems in their head where it would take over a month to get
         | a data point - but of course they weren't pursuing those
         | problems as it was impractical.
         | 
         | Things like computing the derivative - let alone the Hessian -
         | can be quite expensive for some problems.
        
       | loehnsberg wrote:
       | I am not an expert on convex, unconstrained optimization but I
       | frequently use these methods. I'd wait until it's peer-reviewed
       | and published in a good journal (SIAM) before I get too excited
       | about this one.
        
       | vsskanth wrote:
        | What is y in these equations?
        
       | jmount wrote:
       | For those interested, James Renegar also had some great stuff on
       | Newton's method ("A polynomial-time algorithm, based on Newton's
       | method, for linear programming" 1988). It seems there are a lot
       | of great algorithmic opportunities in specializing Newton's
       | method to a given problem domain.
        
       | random314 wrote:
       | This applies to convex optimization. I am wondering if it could
        | be extended to neural networks too?
        
         | brrrrrm wrote:
         | You'd need to be careful of overfitting the network.
        
         | rwilson4 wrote:
         | Theoretically yes, but: typically with neural networks, and
         | esp. deep networks with millions or billions of parameters, 2nd
         | order methods which require calculating and factoring the
         | matrix of 2nd derivatives are prohibitively expensive. 1st
         | order methods such as gradient descent skip this, albeit with a
         | tradeoff in convergence. The method proposed in this paper is
         | impractical for large systems such as computer vision or
         | language models.
        
           | kxyvr wrote:
           | As a note, 2nd order methods do not require calculating the
           | entire Hessian nor do they require factorizing the matrix. A
           | trust-region method using Steihaug-Toint (truncated) CG or a
           | Newton-Krylov method simply require Hessian-vector products,
           | which are not particularly costly to compute. In fact, a
           | forward-mode automatic differentiation (AD) method on an
           | existing gradient calculation, which could also be done using
           | AD, will calculate the Hessian-vector product at a cost of
           | twice the gradient calculation. Though, AD is not required
           | and this can be done by hand. Even if only a single Krylov
           | iteration is done, there are many benefits to using this
           | methodology to help handle the scaling of the problem. These
           | algorithms are described fully in a book such as Numerical
           | Optimization by Nocedal and Wright.
           | 
           | The current paper does not directly affect most ML algorithms
           | because it speaks directly to convex optimization. Most ML
           | models are highly nonlinear, so globalization of the
           | optimization algorithm is required. Here, globalization means
           | something like a line-search or a trust-region to ensure
            | convergence to a local minimum. Convex problems using an
           | appropriate algorithm and under the correct assumptions do
           | not necessarily require the use of a globalization technique
           | and will converge directly to a local min, which is now
           | global due to convexity. Normally, this discussion is done in
           | the context of interior point methods for linear cone
           | problems and helps explain why globalization is not done for
           | these algorithms. An example of this can be found in Convex
           | Optimization by Boyd and Vandenberghe in section 11.5.3, but
           | it's a well researched topic. The algorithm in this paper
           | discusses a technique where they also don't require a typical
           | globalization technique.
           | 
           | Anyway, mostly I'm commenting to dispel the misconception
            | that 2nd order methods cannot be applied to large scale
           | problems. I've applied them successfully to problems with
           | hundreds of millions of variables and they work just fine.
           | Largely, their success is tied to how complicated the
           | spectrum of the Hessian is, but that's a longer discussion.
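            | 
            | As a minimal illustration of the Hessian-vector-product idea
            | (using a finite difference of the gradient as a hand-rolled
            | stand-in for the forward-mode AD approach described above;
            | eps is an arbitrary choice):
            | 
            |     import numpy as np
            | 
            |     def hess_vec_product(grad, x, v, eps=1e-6):
            |         # H(x) @ v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)
            |         return (grad(x + eps * v) - grad(x - eps * v)) / (2.0 * eps)
            | 
            |     # Example: f(x) = sum(x**4), so grad(x) = 4*x**3 and the
            |     # Hessian is diag(12*x**2).
            |     grad = lambda x: 4.0 * x**3
            |     x = np.array([1.0, 2.0])
            |     v = np.array([1.0, 0.0])
            |     print(hess_vec_product(grad, x, v))   # ~ [12., 0.]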
        
         | abeppu wrote:
         | Aside from the other note about 2nd order derivatives being a
         | blocker for most of the models we work with these days, and the
         | fact that DL encounters non-convex loss functions, the other
         | big difference is that we're used to training with SGD which
         | only needs unbiased estimates of a gradient based on samples /
         | batches. This family of approaches requires _actual_
         | derivatives, i.e. from looking at the whole dataset. If you
         | have estimates of the gradient only, I think basically none of
         | the convergence arguments in this hold. This is basically also
         | why we don't already use LBFGS (which doesn't hold onto the
         | full hessian).
        
       | kubb wrote:
       | Is it a groundbreaking better way to approximate continuous
       | functions, or is it fundamentally impractical, like new
       | algorithms with a good asymptotic time often are? I'm curious to
       | know, but my mathematical illiteracy prevents me from reading the
       | paper.
        
         | kiengcan9999 wrote:
         | From abstract: "Our method is the first variant of Newton's
         | method that has both cheap iterations and provably fast global
         | convergence". It seems interesting.
        
           | rwilson4 wrote:
           | "Cheap iterations" led me to expect a 1st order method, not
           | requiring computing the matrix of 2nd derivatives, which is
           | the bottleneck in large systems. But this does indeed rely on
           | the Hessian, so I'm not sure what makes the iterations
           | "cheap".
        
             | MauranKilom wrote:
             | If you don't want to compute the Hessian, you're at the
              | wrong address with Newton's method. Quasi-Newton methods are
              | where you want to look.
        
               | rwilson4 wrote:
               | Sure, or Nesterov's accelerated gradient descent or
               | something. I'm just not sure what the authors mean by
               | "cheap iterations".
               | 
               | Update: your other comment clarified this for me. It's
               | cheap relative to other methods that work with non-self-
               | concordant objectives.
        
         | rwilson4 wrote:
         | This is an approach for solving convex optimization problems.
         | An optimization problem is when you want to find the best
         | option, or the fastest, or the most profitable. If you're
         | asking questions with superlatives (most, best, fastest), you
         | may be working on an optimization problem.
         | 
         | Many optimization problems are NP Hard and cannot be solved in
         | a reasonable amount of time. Convex optimization problems are a
         | special subclass which can typically be solved efficiently
         | (meaning in polynomial time).
         | 
         | This paper provides a new approach to solving Convex
         | Optimization problems for systems where other methods struggle.
        
         | MauranKilom wrote:
         | Newton's method is not for function approximation, but for
         | optimization ("find the minimum").
         | 
         | Newton's method already has "good asymptotic time" (e.g. if you
         | use it to compute a square root, you _double_ the number of
         | correct digits with each step!) - but only under certain
         | conditions (e.g.  "not too far from the solution"). This paper
         | shows how to modify it so this sort of benefit is obtained
         | under a wider range of conditions ("global").
         | 
         | There are other related methods with a similar goal, but some
         | require more expensive computations. The method from the paper
         | appears to fill a sweet spot in high-dimensional optimization
         | by virtue of requiring "few" matrix inversions (or linear
         | solves) while having very good convergence properties.
         | 
         | When you are dealing with the kind of problem that this sort of
         | method is applicable to, it appears to me that you might
         | benefit from looking into using this method. But it's not a
         | revolutionary improvement from what I can tell - there are
         | _many_ related methods, and chances are you could have used
         | something similar for similar benefit already.
         | 
         | And it's likely that someone already used more or less this
         | method in practice. As the paper notes:
         | 
         | > _Our goal is twofold. On the one hand, we are interested in
         | designing methods that are useful for applications and can be
         | used without any change as black-box tools. On the other hand,
         | we hope that our theory will serve as the basis for further
         | study of globally-convergent second-order and quasi-Newton
         | methods with superior rates. Although many of the ideas that we
         | discuss in this paper are not new and have been observed by
         | practitioners to help solve challenging optimization problems,
         | our analysis, however, is the first of its kind. We hope that
         | our theory will lead to appearance of new methods that are
         | motivated by the theoretical insights of our work._
        
         | plafl wrote:
         | (note: comments based on a superficial read)
         | 
         | I would say "interesting" more than "groundbreaking", in the
         | sense that you will not solve problems that you were not
         | previously solving.
         | 
         | The benchmarks provided show nice convergence. The interesting
         | part is that the method will not fail, which is very good. The
         | benchmarks seem to crush other methods but keep in mind that
         | the plots do not show "time required" but "number of iterations
         | required" and that with Newton's method you are doing much much
         | more work per iteration: computing the Hessian instead of the
         | gradient and solving a system of linear equations on every
         | iteration.
        
       | drstrangevibes wrote:
       | Could this be applied to gradient descent in neural networks?
        
       | nitred wrote:
       | I'll just wait for a few months for this algorithm to show up in
       | `scipy.optimize` [1]
       | 
       | [1] https://docs.scipy.org/doc/scipy/reference/optimize.html#
        
       | stellalo wrote:
        | This is interesting, but it's a pity that the experimental
        | results do not show fast asymptotic convergence for the main and
        | simpler algorithm proposed (Algorithm 1, the one stated in the
        | abstract), only for its line search variants (Algorithms 2 and
        | 4 in the paper).
        
       ___________________________________________________________________
       (page generated 2021-12-07 23:01 UTC)