[HN Gopher] Regularized Newton Method with Global $O(1/k^2)$ Con...
___________________________________________________________________
Regularized Newton Method with Global $O(1/k^2)$ Convergence
Author : ColinWright
Score : 193 points
Date : 2021-12-07 13:25 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| redwood wrote:
| Can someone explain?
| https://en.m.wikipedia.org/wiki/Newton%27s_method
| kiengcan9999 wrote:
| There is a helpful note here:
| https://huynp.com/optimization/2018/11/11/Fundamental-of-Unc...
| gmadsen wrote:
| [deleted]
| kergonath wrote:
| This is a useful thing to keep in mind:
| https://xkcd.com/1053/ .
|
| [edit] removed irrelevant stuff. The xkcd is great though :)
| gmadsen wrote:
| fair enough, and rereading my comment, it does come off
| that way. I was more curious what his background was, but
| that is not how I worded it.
| kergonath wrote:
| Fair enough! It's difficult to get the tone across in
| writing.
| mijoharas wrote:
| Would you like someone to explain newton's method (Newton
| Raphson iteration), or the linked article?
|
| (I could have a go at explaining newton's method, but I don't
| understand the arxiv article. In simple terms newton raphson
| iteration is an iterative method of finding the "zeros" of a
| function. You do it by evaluating the function and its
| derivative at your current guess, and repeatedly stepping to
| where the tangent line crosses zero.)
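|
| If it helps, here is a bare-bones sketch of that in Python (my
| own toy example, not anything from the arxiv article):
|
|     def newton_raphson(f, df, x0, tol=1e-12, max_iter=50):
|         # Repeatedly step to where the tangent at x crosses zero.
|         x = x0
|         for _ in range(max_iter):
|             fx = f(x)
|             if abs(fx) < tol:      # close enough to a zero
|                 return x
|             x = x - fx / df(x)     # tangent (Newton) step
|         return x
|
|     # e.g. the positive root of f(x) = x^2 - 2, i.e. sqrt(2):
|     print(newton_raphson(lambda x: x*x - 2.0, lambda x: 2.0*x, 1.0))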
| whinvik wrote:
| The problem with Newton's method is that it is provably
| convergent but only if you are sufficiently close to the
| solution. That basically means that the initial condition
| that you chose for solving the problem is very important.
| What this paper seems to suggest is that their method is
| globally convergent and they produce the algorithm to do it,
| although there is some tuning that needs to be done. So in
| theory, with their method we will always get to a solution,
| no matter the initial condition we chose.
| abeppu wrote:
| Their abstract says that their global convergence claim is
| for convex objectives. So structurally, they're only
| talking about functions for which the whole domain is
| "sufficiently close to the solution" i.e. where every
| tangent is a global bound.
| SpaceManNabs wrote:
| Important to note that other variants of newton's method
| do not necessarily converge from any starting point on
| even strictly convex functions.
|
| Even in convex functions, the whole domain is not
| necessarily sufficiently close.
| abeppu wrote:
| Actually, can you expand on that? The paper makes some
| broad statements about how existing methods are unstable
| and need to have a good starting point etc. This is not
| my area, but in the past when I've used (quasi)newton
| methods, it "just worked". What are the kinds of strictly
| convex functions that cause this bad behavior? For
| "strictly convex" in my limited, low-dimensional
| imagination I just picture a bowl-shape. They mention
| failure modes, but I'd love a clear (ideally visual)
| example of when it goes wrong. Robust global quadratic
| convergence seems to be the main value proposition here,
| so understanding the non-global fragile non-convergence
| of other methods seems important.
| mijoharas wrote:
| I believe if you select as a starting point, a point
| where the derivative of the function is 0, then 'classic'
| newton raphson will not converge (since you select the
| "crossing point" of this tangent line for your next
| iteration, and that is undefined when you have 0
| gradient.)
|
| e.g. y = x^2 - 1; choose 'starting position' x = 0.
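|
| Concretely (toy snippet, same function): the step divides by
| the derivative, so a flat starting point immediately breaks.
|
|     f = lambda x: x * x - 1.0    # zeros at x = -1 and x = +1
|     df = lambda x: 2.0 * x
|     x = 0.0                      # derivative is exactly 0 here
|     try:
|         x = x - f(x) / df(x)     # flat tangent never crosses zero
|     except ZeroDivisionError:
|         print("Newton step undefined at x = 0")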
| abeppu wrote:
| ... but then your starting point is already at the
| optimum. You're done in zero steps! I think for a
| "strictly convex" function, the (sub)derivative will
| _only_ be zero at the global minimum.
| mijoharas wrote:
| > but then your starting point is already at the optimum
|
| Newton raphson is used to find where f(x) = 0, not where
| the derivative f'(x) = 0.
|
| For the function given above f(x) = 0 when x = 1 and x =
| -1. Not when the derivative is at 0 (when x = 0).
|
| You'd be done in 0 steps to find the minimum of the
| function, or the point of 0 derivative, but that's not
| what you're looking for with newton raphson.
| abeppu wrote:
| Newton's method is _commonly_ used for function
| minimization by looking for the zero of the derivative,
| and this is entirely what the article is about. In sec
| 1.1 at the top of page 2, they say very clearly "In this
| work, we are interested in solving the unconstrained
| minimization problem ...". Even in the first para, they
| say "Newton's method is a cornerstone of convex
| optimization", meaning optimization/minimization of a
| convex function.
|
| The article says "Unfortunately, Newton's method is
| unstable: it works only for strongly convex problems and
| may diverge exponentially when initialized not very close
| to the optimum." And I was asking for an example or
| description of such a case.
|
| You'll also note there are several comments in this
| discussion that mention the hessian/2nd order
| derivatives, and these all show up because this is being
| used in the context of optimization by looking for f'(x)
| = 0.
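|
| (For concreteness: to minimize a 1D f, you run the same
| iteration on its derivative, i.e. x <- x - f'(x) / f''(x). A
| throwaway Python example:)
|
|     # minimize f(x) = (x - 3)**2 + 1 by finding the zero of f'
|     df = lambda x: 2 * (x - 3)
|     d2f = lambda x: 2.0
|     x = 10.0
|     for _ in range(10):
|         x = x - df(x) / d2f(x)
|     print(x)   # -> 3.0, the minimizer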
| SpaceManNabs wrote:
| I am super sorry. I meant strongly convex. Autocorrect on
| mobile.
|
| Example from Boyd.
|
| https://math.stackexchange.com/questions/3408436/newton-
| rhap...
| dahak27 wrote:
| Unless I'm mistaken (very possible), the existing cubic
| Newton method they discuss already has this convergence
| guarantee, but introduces a lot of extra expensive work at
| each step. The specific contribution of the paper is
| finding clever regularization tricks to keep this guarantee
| while sidestepping the extra hassle the cubic method needs
| [deleted]
| rwilson4 wrote:
| Haven't read the paper yet, but Boyd and Vandenberghe's book,
| Convex Optimization, is available for free on the authors'
| website and covers all the foundational material.
|
| Update: I've skimmed the paper and here's the gist as I
| understand it. The paper alleges that Newton's method can have
| convergence issues, even when using line search. Their approach
| resolves these convergence issues. It's 6am where I am which
| isn't great for processing this stuff, but I write a lot of
| custom Convex Optimization solvers, and I've never had
| convergence issues with Newton's method with the default line
| search parameters suggested by the book mentioned above, even
| with some bad condition numbers that throttle off-the-shelf
| solvers (which is why I write my own).
|
| The paper talks about speed of convergence, which initially
| made me think this was a first order method, not requiring
| calculating the Hessian (matrix of 2nd order derivatives). For
| large systems computing the Hessian is really slow, and 1st
| order methods speed things up by working only with the first
| derivatives. But that's
| not what this paper is: they still use the Hessian but add
| regularization, which presumably just improves the condition
| number. Reminds me of proximal algorithms but haven't explored
| that connection.
|
| Update on the update: the paper addresses situations where the
| self-concordance assumption does not hold. TBH I didn't pay
| much attention in class when we went over self-concordancy and
| convergence, but the contribution of this paper seems to be
| extending convergence guarantees even to non-self-concordant
| systems.
| jejones3141 wrote:
| Take an initial guess at where f(x) = 0; call it x1. Then you
| generate what you hope is a better guess x2 by finding where
| the line through (x1, f(x1)) with slope f'(x1) hits zero.
| Lather, rinse, repeat.
|
| Things can get ugly--say, if the derivative is zero at your
| guess--and if your initial guess is bad, or f is wonky (say it
| has several zeroes near one another) it may take a long time.
| But as people have said, if you have a good guess, convergence
| is quadratic, doubling the number of good bits each time, so
| for floating point square root, peek at the exponent and take
| half that (multiply by sqrt(2) if it's odd). log2(#mantissa
| bits) iterations and you're as good as it gets.
|
| Apparently this new method avoids the nasty cases and is still
| fast. Cool stuff.
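|
| If anyone wants to play with that square-root recipe, here is
| a rough Python version (my own sketch, not from the paper):
|
|     import math
|
|     def sqrt_newton(a, iters=6):
|         # Initial guess: peek at the binary exponent, halve it.
|         m, e = math.frexp(a)         # a = m * 2**e, 0.5 <= m < 1
|         x = math.ldexp(1.0, e // 2)  # roughly 2**(e/2)
|         for _ in range(iters):
|             x = 0.5 * (x + a / x)    # Newton step for x^2 - a = 0
|         return x
|
|     print(sqrt_newton(2.0), math.sqrt(2.0))   # agree to ~16 digits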
| oriettaxx wrote:
| This is great for Big-O notation https://www.bigocheatsheet.com
| [deleted]
| djmips wrote:
| A fun visual explanation of Newton's Method is contained
| within this video by 3Blue1Brown https://youtu.be/-RdOwhmqP5s
| mvanaltvorst wrote:
| Newton's method is an algorithm you can use to
| minimise/maximise/find the root of a differentiable function
| very quickly if you start with a reasonable guess. There are
| many variants on Newton's method, e.g. BFGS that does not
| require the hessian (/2nd order derivative) of the function. A
| pitfall of Newton's method is that it does not find the global
| minimum per se, it might just converge to a local minimum if
| this local minimum is "closer" to your initial guess. Some
| variants of Newton's method allow you to avoid this pitfall,
| but they are slow in practice.
|
| This seems to be another variant that claims to find the global
| minimum quicker than other existing methods. I am not
| experienced enough to verify this claim. I am also not sure how
| this new algorithm performs in practice; in theory, ellipsoid
| methods are a lot more efficient than simplex methods for
| optimising convex functions, while in practice simplex methods
| are usually an order of magnitude faster. So take this result
| with a grain of salt.
| MauranKilom wrote:
| To clarify, "global" in this context refers to global
| convergence (as in, it won't diverge no matter the starting
| point), not global optima. They assume a convex function,
| i.e. only a single minimum. The paper is unrelated to
| avoiding local minima.
| abetusk wrote:
| So isn't the magic here convexity? Convexity is such a huge
| constraint that I would guess convergence would be hard to
| avoid.
| jleahy wrote:
| Tell that to Newton's method, which can fail to converge
| for convex functions.
| kergonath wrote:
| There is no way of ensuring that you have found a global
| minimum if the function is not convex, or if you don't
| make equivalent assumptions, in the general case (if the
| domain of the function is continuous, and/or infinite).
|
| If you don't make assumptions, you would need to consider
| every single point of the domain, and there is an
| infinite number of them. The job gets easier the more
| assumptions you make about the continuity, differentiability,
| and monotonicity of the function.
| MauranKilom wrote:
| (Disclaimer: Not an expert, and only gave the paper a very
| cursory read).
|
| Newton's method attempts to find zeroes of a function (i.e.
| find x where F(x) = 0). This is equivalent to finding a
| (possible) minimum of a function f(x) where f'(x) = F(x). Note:
| The wiki article describes the process of finding zeroes for
| F(x), whereas the article searches minima of f(x), in terms of
| my notation here.
|
| At a basic level, Newton's method iteratively improves a guess
| x_k via the following update step:
|
|     x_(k+1) = x_k - f(x_k) / f'(x_k)
|
| When close to the correct solution, convergence is quadratic:
| You "double the number of correct digits" at every step, so to
| speak. But if you're far away, you might be diverging (going
| further away from the solution). Progress can also be slow
| under other conditions (multiple solutions nearby).
|
| The paper, essentially, suggests the following update step
| instead:
|
|     x_(k+1) = x_k - f(x_k) / (f'(x_k) + sqrt(H |f(x_k)|))
|
| for some H (that needs to be chosen/estimated). The extra term
| can be thought of as some form of regularization. This
| supposedly avoids divergence (where f is non-decreasing and
| does not grow too fast), and is also consistently fast.
|
| Note that I've simplified to the one-dimensional case here.
| There are further concerns in higher dimensions (e.g. number of
| matrix inversions) that this method handles well, according to
| the authors.
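|
| To make it more concrete, here is a toy 1D sketch of that
| regularized step (my own code; H is just picked by hand, and
| the step is applied to the derivative of a simple convex
| objective):
|
|     import math
|
|     def regularized_newton_1d(df, d2f, x0, H, iters=50):
|         # x <- x - F(x) / (F'(x) + sqrt(H * |F(x)|)), where
|         # F = f' is the derivative of the objective f.
|         x = x0
|         for _ in range(iters):
|             g = df(x)
|             x = x - g / (d2f(x) + math.sqrt(H * abs(g)))
|         return x
|
|     # f(x) = x^4 + x^2 is convex with its minimum at x = 0:
|     print(regularized_newton_1d(lambda x: 4*x**3 + 2*x,  # f'
|                                 lambda x: 12*x**2 + 2,   # f''
|                                 x0=10.0, H=1.0))         # -> ~0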
| gugagore wrote:
| It's great that you clarified the root-finding / minimization
| perspectives.
|
| Shouldn't your update equations be written in terms of F, not
| f? (Alternatively, write f'(x_k) / f''(x_k).)
|
| To minimize f, find zeros of f' = F. To find a zero of F,
| construct a linear approximation using f''=F'
|
| If you're using Newton's method to minimize a function f, you
| need the first- and second-order derivatives (hence the name
| "second-order method").
| [deleted]
| MauranKilom wrote:
| You're correct, I should've used F there, my bad. Too late
| to edit, unfortunately!
| dataflow wrote:
| > This is equivalent to finding a (possible) minimum of a
| function f(x) where f'(x) = F(x).
|
| I'd like to clarify that the "equivalence" might be
| misleading. Just because you have a vector-valued function,
| that does not imply it has an antiderivative you can
| minimize, so merely _having_ that scalar antiderivative
| actually gives the problem more structure and thus
| intrinsically makes it "easier" in that sense.
| gugagore wrote:
| I think a concrete example of what you're saying is that a
| rotation vector field (e.g. F(x,y) = [y, -x]) is not the
| gradient of any potential function.
|
| And I think you're saying it's intrinsically easier to find
| where a vector field vanishes if it has a potential
| function. You can just follow the vector field and you'll
| get to a zero. But if you follow a rotating vector field,
| it will take you in circles.
|
| But given that you're doing Newton's method, I don't think
| one problem is easier than the other, right?
| dataflow wrote:
| Is your conjecture that Newton's method would always work
| at least as well as gradient descent when the vector
| field has nonzero rotation? I don't see why that should
| be true, especially given that's not the case when
| there's zero rotation. You could easily come up with
| examples where gradient descent would work better despite
| the rotation, and I expect you could also find lots of
| rotation cases where Newton's method would get completely
| thrown off (whereas GD wouldn't).
|
| (Also, don't conflate the problem with the solution. My
| statements were about optimization vs. root finding, not
| about Newton vs. GD. Even if there _is_ a solution that
| works for multiple problems, that doesn't mean the
| problems are equally difficult. If one problem admits
| solutions that the other doesn't, then it's still easier
| in that sense. If a problem can be reduced to another,
| then it's also easier in that sense. Just like how you
| can use a cannon to kill a fly, but that doesn't mean
| destroying a building is as easy as killing a fly...)
| gugagore wrote:
| > Is your conjecture that Newton's method would always
| work at least as well as gradient descent when the vector
| field has nonzero rotation?
|
| > You could easily come up with examples where gradient
| descent would work better despite the rotation, and I
| expect you could also find lots of rotation cases where
| Newton's method would get completely thrown off (whereas
| GD wouldn't).
|
| I don't think I understand the meaning of the questions.
| It isn't gradient descent unless you have a potential
| function. You cannot do gradient descent on a vector
| field that is not a gradient. You can solve an ODE.
|
| You can, of course, turn a root-finding problem (F(x) =
| 0) into a minimization problem by defining a potential
| like g(x) = (F(x))^2. You can do gradient descent on g.
| So in that sense, both of these problems reduce to each
| other, though there are better methods for solving F(x) =
| 0 than minimizing g.
|
| I still don't understand this sense of optimization being
| easier than root finding.
|
| What I do understand is some sense in which there are
| more vector fields than there are gradients, since only
| some vector fields are gradients of potential functions.
| My comment was intended to point that out, since I
| thought that is maybe what you meant.
| civilized wrote:
| This is very interesting. It looks like the innovation is to
| scale the identity matrix in the L-M step formula by the square
| root of the norm of the gradient. Very simple and elegant, and
| trivial to implement. Lots of people are going to be shaking
| their heads wondering why they didn't think of this first.
|
| The motivation for the idea is very nice and easy to understand.
| The author completely explains it before the middle of the second
| page. I encourage you to look at it if you know enough vector
| calc to make sense of the notation. Essentially, he takes a well-
| known fast-converging method whose steps are hard to compute and
| shows how these steps can be approximated cheaply while still
| retaining much of the convergence rate benefit.
|
| This may have implications for many large-scale optimization
| problems as L-M regularized Newton steps are a sort of natural
| halfway house between gradient descent and the full Newton's
| method, both theoretically and practically.
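|
| For anyone who wants it spelled out, a small numpy sketch of
| that kind of step as I read it (H is supplied by hand here; the
| paper also discusses how to adapt it):
|
|     import numpy as np
|
|     def reg_newton_step(grad, hess, x, H):
|         # Levenberg-Marquardt-style step with the identity
|         # scaled by sqrt(H * ||gradient||).
|         g = grad(x)
|         lam = np.sqrt(H * np.linalg.norm(g))
|         A = hess(x) + lam * np.eye(x.size)
|         return x - np.linalg.solve(A, g)
|
|     # Toy convex objective f(x) = sum(x**4) + 0.5 * ||x||^2:
|     grad = lambda x: 4 * x**3 + x
|     hess = lambda x: np.diag(12 * x**2 + 1)
|     x = np.array([5.0, -3.0])
|     for _ in range(30):
|         x = reg_newton_step(grad, hess, x, H=1.0)
|     print(x)    # -> approximately [0, 0], the minimizer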
| jedbrown wrote:
| A very similar technique is popular for rootfinding (since the
| 80s, convergence analysis in this 1998 paper), where scaling by
| the norm of the gradient (eq. 1.5) enables q-quadratic
| convergence.
|
| https://doi.org/10.1137/S0036142996304796 (free version:
| https://repository.lib.ncsu.edu/bitstream/handle/1840.4/296/...)
| civilized wrote:
| Rootfinding and optimization are so closely related!
| midjji wrote:
| The problem with simple innovations in old methods is that
| there is very rarely anything new under the sun; odds are good
| it's already been done. The paper does not evaluate on any of
| the big standard benchmarks either. The paper spends a
| surprising/suspicious amount of time on very basic things. "The
| divergence issues of Newton's method with line search seems to
| be a relatively unknown fact." No, it's well known!
|
| It looks interesting, but some things, like the L-M lambda
| being constant? No, you adapt it to the data, increasing and
| decreasing it in accordance with the feedback from the step
| error predictions. Similar, but not identical, to what is
| proposed here. I'd need to read it carefully and it doesn't
| look bad, but I think I'll just wait for the reviewers.
| civilized wrote:
| All good points.
|
| Personally I wouldn't have known divergence of Newton's
| method with line search off the top of my head. What I think
| everyone in this field would know is that Newton's method
| with constant step size diverges. I'm actually curious now if
| Newton's method with _backtracking_ line search from a
| constant step size diverges. Exact line search can go a
| little nuts, and that doesn't surprise me, but divergence with
| backtracking from a constant step size seems less likely. (I
| guess it depends on what "constant" means here...)
|
| Seems like it could be awfully nice to have a smart
| theoretically backed initial guess of the dynamic lambda. The
| author actually uses a feedback strategy to set H in
| conjunction with his gradient scaling (see Algorithm 2 /
| AdaN), and it's possible that the scaling allows H to be
| basically correct more often and not have to be adjusted so
| much? Which is potentially really helpful since Newton steps
| aren't cheap and you don't want to waste them on guess-and-
| check.
|
| But it's not obvious it's better than other strategies and we
| do need benchmarks to prove it out, yes.
|
| The author is from INRIA, which does a lot of advanced
| optimization research, so I'd say it's a serious paper prima
| facie.
| marcodiego wrote:
| > We present a Newton-type method that converges fast from any
| initialization and for arbitrary convex objectives with Lipschitz
| Hessians. We achieve this by merging the ideas of cubic
| regularization with a certain adaptive Levenberg--Marquardt
| penalty.
|
| I feel insignificant just by reading the first phrase.
| SubiculumCode wrote:
| Longitudinal Analysis of an Amygdalocentric Structural Network
| using Mixed Effects Multivariate Distance Matrix Regression
| with Multiple Fractionated Polynomials
|
| A working title for a manuscript I am shopping out to the
| various neuroscience journals right now. I feel like I'm
| finally the esoteric person I've always wanted to be.
| jpfr wrote:
| I teach a class on the subject:
| https://www.youtube.com/playlist?list=PLdkTDauaUnQpzuOCZyUUZ...
|
| Understanding optimization at this level and being able to
| formulate your own problems in one of the canonical forms gives
| you super powers.
|
| You don't even have to build your own solvers. The
| commercial/open source ones are typically good enough.
| stochtastic wrote:
| These lectures look fantastic -- thank you for making and
| sharing them! I've been looking for a good foundational
| treatment of optimization to digest over the holidays.
| phkahler wrote:
| >> You don't even have to build your own solvers. The
| commercial/open source ones are typically good enough.
|
| But as a maintainer of solvespace (Open Source CAD software
| with geometric/algebraic constraint system) I am left
| wondering if this can be applied to our constraint solver,
| which I didn't write. It solves systems of algebraic
| constraints using some form of Newton's method, partial
| derivatives, Jacobians... All stuff I'm familiar with, but not
| in great detail. Figuring out how best (and whether) to apply
| this might be a major digression ;-)
| jpfr wrote:
| After following the course, you should be able to grok the
| solver code in /src/system.cpp. That looks fairly standard.
|
| The interesting part (to me, as that is not my specialty)
| lies in the translation of the 3D constraints (including
| rotation, etc.) into a single objective function for
| solving Newton-style.
| tubby12345 wrote:
| Lolol why? Even if you're a mathematician in a different area
| you might have no clue what that sentence means.
| latenightcoding wrote:
| that popular "Mathematics for Machine Learning" book will
| teach you almost everything in that sentence; the authors even
| had/have a MOOC on Coursera.
| brrrrrm wrote:
| Newton method: find an error and attempt to fix it by
| approximation
|
| Convex objective: a bowl-like function, we want inputs that put
| us at the bottom of it
|
| Lipschitz: a space that isn't too stretchy
|
| Hessians: second partial derivatives of the terms we're
| optimizing (inputs we want)
|
| Regularization: make the "fix by approx" procedure a bit more
| well behaved
|
| Levenberg-Marquardt: another adaptation to make the fix
| procedure better
| montebicyclelo wrote:
| Would be great to have a resource written like this that
| covers the rest of maths, (or in fact jargon in general).
| adampk wrote:
| I always thought biology would be easier to learn if, instead
| of entity names based on old languages, things had verbose
| descriptive names
|
| e.g. Mitochondria -> PowerHouseOfTheCell
| robbedpeter wrote:
| It's funny how many complex ideas are "just" simpler ideas
| mapped to the right sequence of steps. The jargon and naming
| conventions can obscure the idea from outsiders, but allows
| for brief and accurate communication. If they'd explained the
| idea Barney-style, they'd have needed twice as many pages.
|
| Thanks for the breakdown, it makes the paper more accessible.
| sorokod wrote:
| Well, a simpler idea is in a way like the initial guess for
| Newton-Raphson. Sometimes you get convergence and sometimes
| you don't.
| whatshisface wrote:
| Every idea in math is like that, because the rules of math
| are to do everything in little steps.
| tubby12345 wrote:
| Math is about formalizing "simple ideas" so that they can
| be reasoned about _exactly_.
| marktangotango wrote:
| > I feel insignificant just by reading the first phrase.
|
| I often wonder about the intentions of those who post of these
| types of comments. Being charitable one might suppose the
| parent is offering a backward sort of compliment to the
| authors. Something like "Great job, beyond my capabilities, I'm
| glad someone is working on these things".
|
| There are a lot of less charitable formulations; why should
| other readers care about the parent commenter's insecurity?
| Does parent commenter not appreciate the work that goes into
| studying these topics? Was parent commenter told they were
| "smart" too often as a child and never learned to exert
| themselves academically? Etc.
|
| Although this post is mostly in jest, I really am curious what
| prompts these comments; comments of this type appear quite
| reliably.
| sam0x17 wrote:
| > There are a lot of less charitable formulations; why should
| other readers care about the parent commenter's insecurity?
| Does parent commenter not appreciate the work that goes into
| studying these topics? Was parent commenter told they were
| "smart" too often as a child and never learned to exert
| themselves academically? Etc.
|
| You're missing the obvious interpretation -- jargonization of
| the sciences keeps people out and keeps the masses stupid.
| Math is particularly guilty of this -- more charitable fields
| use jargon that is at least semi-self-explanatory, words and
| phrases that, while jargon, do make sense if you deconstruct
| them. In Math, people are so arrogant that they discover
| something and immediately name it after themselves, so if you
| haven't had a topology class, you won't know that X person's
| name maps to Y concept. Good jargon can be pieced together
| with pure reason alone without knowing the names of the
| people who invented what. This isn't true across the board of
| course -- "gradient descent" is a particularly well-named
| piece of jargon, for example.
| drdeca wrote:
| People generally don't name it after themselves; other
| people name it after them later.
|
| Not that many people wouldn't like for the thing they
| introduce to eventually be named after them. But, I think
| it would generally be considered presumptuous to directly
| name it after oneself.
|
| I remember one story where one mathematician upon hearing
| something referred to as a [some term named after them],
| asked what the term referred to. (I want to say it was
| Hilbert asking what a "Hilbert space" was, but I'm not
| sure.)
| jakeinspace wrote:
| If it makes you feel any better, I knew nothing about the field
| of optimization (convex or otherwise) 3 months ago. But after
| going through an optimization course in my CS master's program
| (MCSO at UT Austin) this semester, I can parse this part of the
| abstract fine, minus the Levenberg--Marquardt, which I had to
| look up. That is with not much of a prior mathematics
| background. You could probably learn enough in a semester of
| part-time study to grok this paper, as long as you're
| comfortable with differential calculus.
| rahimiali wrote:
| People who're familiar with Newton's method might be surprised at
| the convergence rate. 1/k^2 is slower than the textbook rate for
| Newton's method with line search, which is 2^{-2^k}, basically
| implying convergence in a constant number of steps. The rate in
| this paper seems to be no faster than plain gradient descent on a
| strongly convex function! So why go through the trouble of the
| much more complicated update?
|
| Because the textbook proof of the fast convergence of Newton's
| method make additional assumptions on the objective function, for
| example that it is strongly convex, or it is self concordant.
| This paper only assumes Lipschitz continuous Hessians.
|
| The idea of dampening the Hessian is old (it's sometimes called
| "damped newton method", or "trust region newton method", or
| "levenberg-marquardt", though the latter two refer to more
| specific ideas). This paper offers a view to how much dampening
| to apply.
| civilized wrote:
| There are two kinds of convergence proofs - the ones that get
| you a 1/k or 1/k^2 rate, and the ones that get you "linear" or
| "superlinear" convergence (which is optimization jargon for
| exponential/superexponential decay of the error). The latter
| require way too strong of assumptions to be that useful in
| practice, and we generally think of them as just "local"
| convergence theorems - i.e. the assumptions they make start to
| be true when you get close enough to the minimum, and the fast
| convergence kicks in at that point. Which is nice, but getting
| close enough in the first place is often 99% of the problem!
| paulpauper wrote:
| I was not aware that this was a problem in need of solving.
| Aren't computers so fast that this is already solved from a
| practical standpoint? Newton's method (as well as extensions of
| it to vector analysis) is ancient math. I was not even aware
| that this is an area of ongoing interest.
| syvon wrote:
| I am unsure if this precise problem needed a better solution,
| but even if not, this is not how math research works. Finding
| better solutions to random / useless problems can lead to a
| better solution to a problem which needs solving. The relevance
| of the problem is not a consideration.
| BeetleB wrote:
| > aren't computers so fast that this is already solved form a
| practical standpoint.
|
| Not in the least. Much of the scientific computation out there
| is to solve these types of problems. Even if you sped up
| computers ten-fold you wouldn't be satisfied.
|
| When I was in grad school, the computational teams would run
| calculations where it would take a few days to get a single
| data point (they were trying to plot a curve). And they had
| problems in their head where it would take over a month to get
| a data point - but of course they weren't pursuing those
| problems as it was impractical.
|
| Things like computing the derivative - let alone the Hessian -
| can be quite expensive for some problems.
| loehnsberg wrote:
| I am not an expert on convex, unconstrained optimization but I
| frequently use these methods. I'd wait until it's peer-reviewed
| and published in a good journal (SIAM) before I get too excited
| about this one.
| vsskanth wrote:
| What is y in these equations?
| jmount wrote:
| For those interested, James Renegar also had some great stuff on
| Newton's method ("A polynomial-time algorithm, based on Newton's
| method, for linear programming" 1988). It seems there are a lot
| of great algorithmic opportunities in specializing Newton's
| method to a given problem domain.
| random314 wrote:
| This applies to convex optimization. I am wondering if it
| could be extended to neural networks too?
| brrrrrm wrote:
| You'd need to be careful of overfitting the network.
| rwilson4 wrote:
| Theoretically yes, but: typically with neural networks, and
| esp. deep networks with millions or billions of parameters, 2nd
| order methods which require calculating and factoring the
| matrix of 2nd derivatives are prohibitively expensive. 1st
| order methods such as gradient descent skip this, albeit with a
| tradeoff in convergence. The method proposed in this paper is
| impractical for large systems such as computer vision or
| language models.
| kxyvr wrote:
| As a note, 2nd order methods do not require calculating the
| entire Hessian nor do they require factorizing the matrix. A
| trust-region method using Steihaug-Toint (truncated) CG or a
| Newton-Krylov method simply require Hessian-vector products,
| which are not particularly costly to compute. In fact, a
| forward-mode automatic differentiation (AD) method on an
| existing gradient calculation, which could also be done using
| AD, will calculate the Hessian-vector product at a cost of
| twice the gradient calculation. Though, AD is not required
| and this can be done by hand. Even if only a single Krylov
| iteration is done, there are many benefits to using this
| methodology to help handle the scaling of the problem. These
| algorithms are described fully in a book such as Numerical
| Optimization by Nocedal and Wright.
|
| The current paper does not directly affect most ML algorithms
| because it speaks directly to convex optimization. Most ML
| models are highly nonlinear, so globalization of the
| optimization algorithm is required. Here, globalization means
| something like a line-search or a trust-region to ensure
| convergence to a local minimum. Convex problems using an
| appropriate algorithm and under the correct assumptions do
| not necessarily require the use of a globalization technique
| and will converge directly to a local min, which is now
| global due to convexity. Normally, this discussion is done in
| the context of interior point methods for linear cone
| problems and helps explain why globalization is not done for
| these algorithms. An example of this can be found in Convex
| Optimization by Boyd and Vandenberghe in section 11.5.3, but
| it's a well researched topic. The algorithm in this paper
| discusses a technique where they also don't require a typical
| globalization technique.
|
| Anyway, mostly I'm commenting to dispel the misconception
| that 2nd order methods cannot be applied to large scale
| problems. I've applied them successfully to problems with
| hundreds of millions of variables and they work just fine.
| Largely, their success is tied to how complicated the
| spectrum of the Hessian is, but that's a longer discussion.
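|
| For the curious, a crude sketch of this Hessian-free flavor (a
| finite-difference Hessian-vector product stands in for the AD
| version, and plain CG rather than Steihaug-Toint):
|
|     import numpy as np
|     from scipy.sparse.linalg import LinearOperator, cg
|
|     def newton_krylov_step(grad, x, eps=1e-6):
|         g = grad(x)
|         # Hessian-vector product via a directional difference of
|         # the gradient; the Hessian is never formed or factorized.
|         hvp = lambda v: (grad(x + eps * v) - g) / eps
|         H = LinearOperator((x.size, x.size), matvec=hvp)
|         d, _ = cg(H, -g, maxiter=50)   # inexact Newton direction
|         return x + d
|
|     # Toy strongly convex objective f(x) = sum(x**4) + 0.5*||x||^2:
|     grad = lambda x: 4 * x**3 + x
|     x = np.full(1000, 2.0)
|     for _ in range(20):
|         x = newton_krylov_step(grad, x)
|     print(np.linalg.norm(x))   # -> essentially 0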
| abeppu wrote:
| Aside from the other note about 2nd order derivatives being a
| blocker for most of the models we work with these days, and the
| fact that DL encounters non-convex loss functions, the other
| big difference is that we're used to training with SGD which
| only needs unbiased estimates of a gradient based on samples /
| batches. This family of approaches requires _actual_
| derivatives, i.e. from looking at the whole dataset. If you
| have estimates of the gradient only, I think basically none of
| the convergence arguments in this hold. This is basically also
| why we don't already use LBFGS (which doesn't hold onto the
| full hessian).
| kubb wrote:
| Is it a groundbreaking better way to approximate continuous
| functions, or is it fundamentally impractical, like new
| algorithms with a good asymptotic time often are? I'm curious to
| know, but my mathematical illiteracy prevents me from reading the
| paper.
| kiengcan9999 wrote:
| From abstract: "Our method is the first variant of Newton's
| method that has both cheap iterations and provably fast global
| convergence". It seems interesting.
| rwilson4 wrote:
| "Cheap iterations" led me to expect a 1st order method, not
| requiring computing the matrix of 2nd derivatives, which is
| the bottleneck in large systems. But this does indeed rely on
| the Hessian, so I'm not sure what makes the iterations
| "cheap".
| MauranKilom wrote:
| If you don't want to compute the Hessian, you're at the
| wrong address with Newton's method. Quasi-Newton methods is
| where you want to look.
| rwilson4 wrote:
| Sure, or Nesterov's accelerated gradient descent or
| something. I'm just not sure what the authors mean by
| "cheap iterations".
|
| Update: your other comment clarified this for me. It's
| cheap relative to other methods that work with non-self-
| concordant objectives.
| rwilson4 wrote:
| This is an approach for solving convex optimization problems.
| An optimization problem is when you want to find the best
| option, or the fastest, or the most profitable. If you're
| asking questions with superlatives (most, best, fastest), you
| may be working on an optimization problem.
|
| Many optimization problems are NP Hard and cannot be solved in
| a reasonable amount of time. Convex optimization problems are a
| special subclass which can typically be solved efficiently
| (meaning in polynomial time).
|
| This paper provides a new approach to solving Convex
| Optimization problems for systems where other methods struggle.
| MauranKilom wrote:
| Newton's method is not for function approximation, but for
| optimization ("find the minimum").
|
| Newton's method already has "good asymptotic time" (e.g. if you
| use it to compute a square root, you _double_ the number of
| correct digits with each step!) - but only under certain
| conditions (e.g. "not too far from the solution"). This paper
| shows how to modify it so this sort of benefit is obtained
| under a wider range of conditions ("global").
|
| There are other related methods with a similar goal, but some
| require more expensive computations. The method from the paper
| appears to fill a sweet spot in high-dimensional optimization
| by virtue of requiring "few" matrix inversions (or linear
| solves) while having very good convergence properties.
|
| When you are dealing with the kind of problem that this sort of
| method is applicable to, it appears to me that you might
| benefit from looking into using this method. But it's not a
| revolutionary improvement from what I can tell - there are
| _many_ related methods, and chances are you could have used
| something similar for similar benefit already.
|
| And it's likely that someone already used more or less this
| method in practice. As the paper notes:
|
| > _Our goal is twofold. On the one hand, we are interested in
| designing methods that are useful for applications and can be
| used without any change as black-box tools. On the other hand,
| we hope that our theory will serve as the basis for further
| study of globally-convergent second-order and quasi-Newton
| methods with superior rates. Although many of the ideas that we
| discuss in this paper are not new and have been observed by
| practitioners to help solve challenging optimization problems,
| our analysis, however, is the first of its kind. We hope that
| our theory will lead to appearance of new methods that are
| motivated by the theoretical insights of our work._
| plafl wrote:
| (note: comments based on a superficial read)
|
| I would say "interesting" more than "groundbreaking", in the
| sense that you will not solve problems that you were not
| previously solving.
|
| The benchmarks provided show nice convergence. The interesting
| part is that the method will not fail, which is very good. The
| benchmarks seem to crush other methods but keep in mind that
| the plots do not show "time required" but "number of iterations
| required" and that with Newton's method you are doing much much
| more work per iteration: computing the Hessian instead of the
| gradient and solving a system of linear equations on every
| iteration.
| drstrangevibes wrote:
| Could this be applied to gradient descent in neural networks?
| nitred wrote:
| I'll just wait for a few months for this algorithm to show up in
| `scipy.optimize` [1]
|
| [1] https://docs.scipy.org/doc/scipy/reference/optimize.html#
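|
| In the meantime, the closest things already in there are the
| Newton-type solvers; standard usage looks roughly like this
| (not the paper's method, just the existing API):
|
|     import numpy as np
|     from scipy.optimize import minimize
|
|     f = lambda x: np.sum(x**4) + 0.5 * np.dot(x, x)
|     grad = lambda x: 4 * x**3 + x
|     hess = lambda x: np.diag(12 * x**2 + 1)
|
|     res = minimize(f, x0=np.full(5, 3.0), jac=grad, hess=hess,
|                    method="trust-ncg")   # Newton trust-region CG
|     print(res.x)   # -> near the origin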
| stellalo wrote:
| This is interesting, but it's a pity that the experimental
| results show fast asymptotic convergence only for the line
| search variants (Algorithms 2 and 4 in the paper), not for the
| main and simpler algorithm proposed (Algorithm 1, the one
| stated in the abstract).
___________________________________________________________________
(page generated 2021-12-07 23:01 UTC)