[HN Gopher] Nonlinearsolve.jl: Fast and Robust Solvers for Nonli...
___________________________________________________________________
Nonlinearsolve.jl: Fast and Robust Solvers for Nonlinear Equations
in Julia
Author : adgjlsfhk1
Score : 141 points
Date : 2024-03-26 03:27 UTC (19 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| wenc wrote:
| Would be interested to see this compared to IPOPT.
|
| https://coin-or.github.io/Ipopt/
|
| IPOPT isn't tailored to NLEs specifically, but it does solve NLPs
| well. It also uses some amazing Fortran linear algebra routines
| from Harwell (MA57).
| adgjlsfhk1 wrote:
     | It should be pretty easy to add to the benchmarks. At first
     | glance, it looks like they have a lot of the necessary
     | machinery to do well.
| jpfr wrote:
| This article is about "solving" differential equations and not
| convex optimization.
| wenc wrote:
| > This article is about "solving" differential equations and
| not convex optimization.
|
| This article is about solving nonlinear equations (not
| differential equations, not sure where you got that from).
| All NLP optimizers can solve nonlinear equations -- it's a
| special case where the objective is constant.
|
       | IPOPT is not a convex solver, so I'm not sure what convex
       | optimization you are referring to. It is a general nonlinear
       | solver, which covers nonconvex problems as well (I worked on
       | nonconvex nonlinear programs for a decade and it was my
       | primary solver).
|
| Also all nonlinear equation systems are nonconvex. (A convex
| program requires equality constraints to be linear)
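       |
       | To make the constant-objective point concrete, here is a
       | minimal sketch with a toy 2x2 system using JuMP + Ipopt (the
       | system and variable names are made up for illustration, and
       | I haven't run this exact snippet):
       |
       |   using JuMP, Ipopt
       |   model = Model(Ipopt.Optimizer)
       |   @variable(model, x[1:2], start = 0.5)
       |   # toy system: x1^2 + x2^2 = 2 and x1 - x2 = 0
       |   @constraint(model, x[1]^2 + x[2]^2 == 2)
       |   @constraint(model, x[1] - x[2] == 0)
       |   @objective(model, Min, 0)   # constant objective
       |   optimize!(model)
       |   value.(x)                   # ≈ [1.0, 1.0]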
| FreakLegion wrote:
| _> all nonlinear equation systems are nonconvex_
|
| Maybe you have something more particular in mind when you
| say "systems", but not all nonlinear functions are non-
| convex. Least squares, for example, is nonlinear and
| convex.
|
| Also note that IPOPT, while wonderful, is a local solver.
| It may not be limited to convex problems, but those are the
| only ones it's guaranteed to solve to optimality.
| Bimos wrote:
| The feasible region {x | f(x) = 0} is nonconvex no matter
| whether f is convex.
| graycat wrote:
| Consider claim:
|
| > The feasible region {x | f(x) = 0} is nonconvex no
| matter whether f is convex.
|
| For some positive integer n and for the set of real
| numbers R, consider closed (in the usual topology for
| R^n), convex set A a subset of R^n. Define function f:
| R^n --> R so that f(x) = 0 for all x in A, f(x) > 0 for
| all x not in A, and for all x in R^n f infinitely
| differentiable at x. It is a theorem that such an f
| exists.
|
| Then { x | f(x) = 0 } = A and is both closed and convex
| in contradiction to the claim.
| ubj wrote:
| I'm assuming you're referring to nonlinear f(x) because
| this statement is trivially false for linear f(x).
|
| But consider the function f(x) = max(0, g(x) - c) where
| the following holds:
|
| * g(x) is nonlinear, positive definite in x, and convex.
|
| * c > 0
|
| Then f(x) is nonlinear and convex (it's the pointwise
| maximum of convex functions), and the set {x : f(x) = 0}
| is a convex set.
| wenc wrote:
           | I was talking in terms of convex optimization. The
           | criteria for convex optimization are a convex objective,
           | convex inequality constraints, a convex feasible region,
           | and linear (not just convex) equality constraints.
           |
           | I'm aware that IPOPT is a local solver.
| FreakLegion wrote:
| _> all nonlinear equation systems are nonconvex_
|
| _> I was talking in terms of convex optimization_
|
| I'm not sure what subtlety you're pointing to here. As
| far as non-convex problems, though, my point is that
| IPOPT isn't special in this regard. Any convex solver can
| be a non-convex solver if you don't care about global
| optimality.
| wenc wrote:
| > Any convex solver can be a non-convex solver if you
| don't care about global optimality.
|
| Aside: Structurally I'm not sure how this would be true.
|
             | Convex solvers have very specific forms. For instance,
             | a QP solver requires a very particular form and does
             | not admit an arbitrary nonconvex form, except for one
             | case: a non-PSD Hessian, i.e. a concave problem.
             |
             | My point is that an NLE posed inside an optimization
             | problem gives rise to a nonconvex optimization problem
             | with no guarantee of a global solution. So convex
             | optimization is not applicable here.
| akutlay wrote:
     | I don't know much about this, but the fact that they did not
     | mention IPOPT or HiGHS (which I would think of as two of the
     | most popular pieces of software for these problems) made me
     | question this too.
| affinepplan wrote:
       | Those are indeed very popular tools, but for a slightly
       | different kind of problem than the one this package is
       | attempting to solve.
| ChrisRackauckas wrote:
         | Yes, especially HiGHS is completely unrelated because it's
         | for (mixed-integer) linear programming and quadratic
         | programming problems, so it's not able to solve the types
         | of problems described in this manuscript.
|
| IPOPT is a bit of a stretch but I do know that there are
| some people in the mathematical programming space that use
| IPOPT's constraint satisfaction piece for solving a
| nonlinear system in a pinch, but it's not the main use case
| of IPOPT which is nonlinear optimization with nonlinear
| (in)equality constraints. This case has no optimization and
| is just nonlinear equality constraints, which you can then
| specialize quite a bit on. But while it is considered
| common knowledge in numerical circles to not use a
| nonlinear optimizer to solve nonlinear systems, I don't
| have a good canonical benchmark to point to on that, so we
| might as well make this it.
| avikpal1410 wrote:
         | IPOPT, as mentioned in other posts, solves nonlinear
         | optimization problems. Working around it to make it solve
         | nonlinear equations or nonlinear least-squares problems
         | typically doesn't go well. We will add benchmarks for the
         | NLLS part (which we don't talk about in the paper), and
         | those will contain IPOPT comparisons.
         |
         | For some comparison (this is not a benchmark that I or
         | anyone on this paper has written), see
         | https://juliapackagecomparisons
         | .github.io/comparisons/math/n.... Specialized solvers for
         | NLLS pretty much always beat general optimization solvers.
| ChrisRackauckas wrote:
| Definitely an interesting response I didn't see coming. This is
| one of the reasons to share preprints!
|
     | IPOPT is a nonlinear optimization tool, not a nonlinear system
     | solver. Generally, from what I've seen, it's not a good idea to
     | solve nonlinear systems with a nonlinear optimizer, but you can
     | phrase one as a "constraint satisfaction problem", i.e. a
     | nonlinear optimization with a trivial `maximize 0` loss function
     | but with equality constraints `f(u) = 0` to be satisfied, and
     | let it do that. IIRC the sparse symmetric linear solver is only
     | used in the optimization process, since that is the part where
     | a sparse symmetric linear system is guaranteed; the constraint
     | Jacobian of `f` would have to be symmetric to reuse it in the
     | Newton method, which isn't true for any of the benchmarks. As
     | such, I don't think you'd get most of IPOPT's advantages even
     | showing up in a constraint satisfaction problem at all. Don't
     | get me wrong, IPOPT is amazing software for nonlinear
     | optimization, but when not doing optimization it would lose a
     | lot of what makes it amazing.
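     |
     | For contrast, posing a system directly as a nonlinear problem
     | looks roughly like the following (a sketch along the lines of
     | the NonlinearSolve.jl docs; the residual here is just a toy
     | example):
     |
     |   using NonlinearSolve
     |   f(u, p) = u .* u .- p          # residual, solve f(u, p) = 0
     |   prob = NonlinearProblem(f, [1.0, 1.0], 2.0)
     |   sol = solve(prob, NewtonRaphson())
     |   sol.u                          # ≈ [sqrt(2), sqrt(2)]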
|
| However, perfectly agreed that I cannot lay this to rest right
| now because I don't have a benchmark to point to. The best
| thing to do here would be to add it to the benchmarks and fully
| describe why in the paper. We'll definitely follow up with this
| in a revision.
|
| In the meantime, if you have more requests for the benchmarks,
| please feel free to open issues at
| https://github.com/SciML/SciMLBenchmarks.jl so we can track
| them. All of our benchmarks run on this open platform and
     | anyone can add things via a PR. In particular, the 23-benchmarks
     | diagram in the paper is simply the result of this script
| https://github.com/SciML/SciMLBenchmarks.jl/blob/master/benc...
| which runs automatically on any PR (for new folks it requires
| we click yes for security reasons though). So please feel free
| to send any benchmark requests. We do plan to do a lot more
| comprehensive nonlinear optimization benchmarks in the near
| future. For this case with IPOPT though, we're happy to add
| that in ourselves hopefully in the next few weeks.
| sobellian wrote:
       | You can also pose it as a minimization problem, minimizing
       | ||g(u)||, but I assume that does not magically make anything
       | more efficient.
| ChrisRackauckas wrote:
| That makes your Newton method require calculating the
| Hessian of the objective function, which is a second
| derivative. A normal Newton method for nonlinear systems
| uses the Jacobian, which is the first derivative (the
| connection is that the Hessian is the Jacobian of the
         | gradient). But indeed, formulating it so that you introduce
         | non-smoothness (the norm / absolute values) and require
         | second derivatives instead of first derivatives for the same
         | algorithm is not the good kind of magic :-/. You could then
         | try using only gradient-based methods, like Adam, but that
         | is actually using less problem structure. BFGS, for example,
         | is related to Broyden's method and thus uses simple Hessian
         | approximations, so for nonlinear system solving it is
         | effectively less stable than just using Newton.
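         |
         | In symbols (standard calculus, writing the smooth variant
         | h(u) = (1/2)||g(u)||^2 and J for the Jacobian of g):
         |
         |   \nabla h(u)   = J(u)^T g(u)
         |   \nabla^2 h(u) = J(u)^T J(u) + \sum_i g_i(u) \nabla^2 g_i(u)
         |
         | Newton on g(u) = 0 needs only J, while Newton on min h needs
         | the full \nabla^2 h; Gauss-Newton drops the \sum_i term and
         | keeps only J^T J.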
|
| That's not to say that there's no reason to use this trick
| though. I have seen that some optimization tools can
| improve robustness in some cases, and this is a way to make
| use of heuristic methods (genetic algorithms, differential
| evolution, etc.) for globalizing the process, but it's
| probably not the first trick you'd want to do in most
| cases.
|
         | While raising the derivative order is one way to understand
         | why this should be less efficient, another is to see that it
         | loses specific structure. In the nonlinear system problem
         | f(u)=0, any u that solves this is a solution; all solutions
         | are equally good, and if you find one you're done. In the
         | optimization problem, you want to find "the" u that
         | minimizes f(u). Now if you are minimizing f(u)=||g(u)|| and
         | you know there is a u that makes g(u) zero, then every u
         | that makes g(u) zero is a solution; they all attain the
         | global optimum. But solvers don't necessarily know that.
         | Solvers need to use second-order information to know that
         | they have not hit a saddle point or other such behavior.
         | They don't know that "near zero = done", and they can't
         | specialize on this information. This is another way in
         | which you're just giving sophisticated solvers more work to
         | do when you already know the answer by problem design.
| kxyvr wrote:
| This is not entirely true. A typical trick for
| accomplishing this is to simply use the Hessian
| approximation `g'(x)*g'(x)` (star denotes adjoint), thus
| cutting off the second-derivative evaluation of g, and
| then use a modified CG method like Newton-Krylov for
| line-search methods or Steihaug-Toint for trust-region
| methods. It's very robust, matrix-free, and doesn't
| require an excessive amount of code to implement.
| Further, the method can be preconditioned with a positive
| definite preconditioner.
|
           | And, to be clear, there's no guarantee that a zero
           | residual will be found, only that `g'(x)*g(x)=0`, but that
           | tends to be the nature of nonlinear equations.
| ChrisRackauckas wrote:
| > It's very robust, matrix-free, and doesn't require an
| excessive amount of code to implement. Further, the
| method can be preconditioned with a positive definite
| preconditioner.
|
             | It is not robust; in fact it's well known as a
             | numerically unstable method. `g'(x)*g'(x)` is numerically
             | unstable because it squares the condition number of the
             | matrix. If you have a condition number of 1e10, which is
             | true for many examples in scientific problems (like the
             | DFN battery model in the paper), the condition number of
             | J'J is 1e20, which effectively means a linear solve is
             | unable to retrieve any digits of accuracy with 64-bit
             | floating point numbers. So while I agree that you can use
             | symmetric linear solvers as a small performance
             | improvement in some cases, you have to be very careful
             | when you do this as it is a very numerically unstable
             | method. For some details, see
             | https://math.stackexchange.com/a/2874902
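             |
             | A tiny Julia illustration of the squaring (toy matrix,
             | not one from the paper):
             |
             |   using LinearAlgebra
             |   J = [1.0 0.0; 0.0 1e-8]
             |   cond(J)        # ≈ 1.0e8
             |   cond(J' * J)   # ≈ 1.0e16, near the Float64 limit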
|
             | Also, if you look at Figure 5 in the paper, you will see
             | that there is a way to choose operator conditioning:
             | https://docs.sciml.ai/LinearSolve/stable/basics/OperatorAssu...
             | If you tell it to assume the operator is well-conditioned,
             | it will actually use this trick. But we do not default to
             | that because of the ill-conditioning.
| kxyvr wrote:
| Yes, care must be taken to stabilize the algorithm.
| However, I do not believe it to be true that no digits of
| accuracy can be obtained. The algorithm falls back to the
| Cauchy (steepest-descent) point on the first iteration
| and this will give reduction. That said, I'm willing to
| solve this particular problem and see. Where's your code
| for the DFN battery model? The page here leads to a dead
| github link:
|
| https://help.juliahub.com/batteries/stable/api/#JuliaSimB
| att...
| ChrisRackauckas wrote:
| The full DFN has a lot more going on. Use the benchmark-
| ified version from the benchmarks:
|
| https://docs.sciml.ai/SciMLBenchmarksOutput/dev/Nonlinear
| Pro...
|
             | But as Avik states, this same kind of thing is done as
             | part of the trust region methods, so the trust region
             | approach with some ways to avoid J'J is effectively
             | doing the same thing. Of course, that cannot target a
             | symmetric linear solver, because J is not generally
             | symmetric and it needs to do the L2 solve, so it needs
             | a QR/GMRES.
|
| But indeed, let's get IPOPT in this interface and
| benchmarks and show it in the plots.
| kxyvr wrote:
| What's your convergence metric? It looks like x_sol gives
| a residual on the order of 1e-7. That's fine, but I'd
| like to measure an equivalent number to the benchmark.
| ChrisRackauckas wrote:
             | It varies; that's why it's measured as work-precision.
             | It would be good to similarly produce a full
             | work-precision line, as point estimates do not tell the
             | full story.
|
| Note that we are preparing an extension to easily use the
| optimizers
| https://github.com/SciML/NonlinearSolve.jl/pull/395 to
| add it to the benchmarks.
| avikpal1410 wrote:
| That is precisely how trust region methods work for
| nonlinear least squares problems and nonlinear systems.
| MINPACK, which both SciPy and Matlab use, defaults to
| this kind of TR scheme (without the matrix-free part). We
| do have TR comparisons in the paper.
|
             | Note, however, that we don't solve the normal-form J'J
             | system, as that is generally bad (unless the user opts
             | in to doing so), and instead use a least-squares
             | formulation, which is more numerically stable.
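             |
             | Concretely, the two formulations look like this (a toy
             | sketch, not the package internals):
             |
             |   using LinearAlgebra
             |   J = [1.0 0.0; 0.0 1e-8; 0.0 0.0]  # ill-conditioned
             |   r = [1.0, 2e-8, 1e-3]
             |   dx_ls = J \ r                # least squares via QR
             |   dx_ne = (J' * J) \ (J' * r)  # normal equations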
| wenc wrote:
       | Yes, tribal knowledge says NLP solvers aren't tailored to
       | NLEs the way NLE solvers are. There have been papers in the
       | past showing this, but in my experience it's not obvious in
       | practice. Nonlinear systems can often surprise us. The
       | initial point (and randomness) has more influence on the
       | solution time than almost anything else.
|
       | I forget the exact internal formulation of the problem in
       | IPOPT, but for an NLE it's likely that the efficiency of the
       | linear algebra (MA57) routine will dominate, since it is, as
       | you say, constraint satisfaction.
| adgjlsfhk1 wrote:
         | Isn't MA57 just a standard good sparse linear solver? For
         | sparse problems, NonlinearSolve.jl (via LinearSolve.jl) will
         | use either KLU or CHOLMOD, both of which are also quite
         | fast. Also, even if the linear solver is the bottleneck,
         | that doesn't mean different algorithms won't make a
         | difference. Different nonlinear solve algorithms can lead
         | to needing fewer iterations (e.g. fewer linear solves).
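         |
         | For reference, picking the sparse factorization explicitly
         | through LinearSolve.jl looks something like this (a sketch;
         | the matrix is made up):
         |
         |   using LinearSolve, SparseArrays, LinearAlgebra
         |   A = sprand(1000, 1000, 0.01) + 10I
         |   b = rand(1000)
         |   prob = LinearProblem(A, b)
         |   sol = solve(prob, KLUFactorization())
         |
         | and if I remember the keyword right, the nonlinear solvers
         | accept the same choices via `linsolve`, e.g.
         | `NewtonRaphson(linsolve = KLUFactorization())`.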
| wenc wrote:
| I don't know. I'm curious is all.
|
           | MA57 seems generic, but I've tried different linear
           | solvers with IPOPT, even ones that were supposed to be
           | better, and MA57 still performed the best.
|
| Nonlinear numerical problems involve a bit of art. Theory
| doesn't always pan out in practice.
| kxyvr wrote:
| MA57 is a sparse symmetric solver. IPOPT uses this
| because it solves a modified KKT system that includes
| both the Hessian and constraint information
| simultaneously. This differs from composite step methods,
| which split the optimization step into a step for
| feasibility (quasi-normal step) and a step for optimality
             | (tangential step). An algorithm like KNITRO uses a
             | composite step method and as a result uses a different
             | kind of linear system solver.
| guyomes wrote:
| > The initial point (and randomness) has more influence on
| the solution time than almost anything else.
|
         | On that remark, a theoretical tool for choosing the initial
         | point is the Newton polygon. For univariate polynomials,
         | this leads to some of the fastest solvers [1]. The idea for
         | choosing the initial point in this case is notably
         | described in Section 6 of this paper [2].
|
         | A very rough intuition: a polynomial p(z) doesn't vanish if
         | one of its monomials is significantly larger than all the
         | others. Conversely, the polynomial is more likely to vanish
         | at values z where two monomials have the same order of
         | magnitude, so such values are good places to choose an
         | initial point. Finally, deciding whether two monomials have
         | the same order of magnitude can be done by taking logs, and
         | then it becomes geometry with polygons, hence the name
         | Newton polygon.
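         |
         | To make the balancing condition concrete (my paraphrase,
         | not the exact algorithm of [1]): for p(z) = \sum_i a_i z^i,
         | two monomials i < j have comparable magnitude when
         |
         |   |a_i| |z|^i = |a_j| |z|^j
         |   <=>  log|z| = (log|a_i| - log|a_j|) / (j - i),
         |
         | i.e. log|z| is minus the slope of the segment joining
         | (i, log|a_i|) and (j, log|a_j|); the segments that matter
         | are the edges of the convex hull of these points, which is
         | the Newton polygon.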
|
| [1]: https://numpi.dm.unipi.it/scientific-computing-
| libraries/mps...
|
| [2]: https://woelen.homescience.net/science/math/exps/polyn
| omials...
| Bimos wrote:
   | On the GPU, it can only solve small instances that a single GPU
   | thread can handle, and it can only benefit from the GPU by
   | solving multiple instances at once, right?
| eigenspace wrote:
| That's the GPU parallelization they offer, but if you had a
| problem where say the objective function itself was really big
| and cumbersome and benefited from GPU parallelization, then you
| could put that objective function on the GPU just fine.
|
     | I think this use case isn't really discussed in the paper
     | because the solver doesn't need to specialize on that case; it
     | should 'just work' without any intervention needed.
| ChrisRackauckas wrote:
     | No, there are two types of GPU usage: the ensemble form and
     | the array form. In the array form, you simply make your state
     | variable a GPU array and can run all internal operations on
     | the GPU.
| This is used for GPU acceleration of `f` functions where you
| have a large state and a relatively "regular" function in `f`,
| such as neural networks or PDE discretizations. Thus you can
| see this in action for example on things like the
| DeepEquilibriumNetworks.jl
| https://docs.sciml.ai/DeepEquilibriumNetworks/stable/tutoria...
| use cases, where it's big neural networks with nonlinear
| solvers in them.
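     |
     | A minimal sketch of that array form (untested; assumes the
     | residual is broadcast-only so the default AD and linear solver
     | can stay on the GPU, otherwise you'd pass GPU-friendly options
     | explicitly):
     |
     |   using NonlinearSolve, CUDA
     |   f(u, p) = u .^ 2 .- p          # broadcast-only residual
     |   u0 = CUDA.fill(1.0f0, 1024)    # state lives on the GPU
     |   prob = NonlinearProblem(f, u0, 2.0f0)
     |   sol = solve(prob, NewtonRaphson())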
|
| The other use case is if you're solving tons of f(u,p)=0
| equations for many different p, but all the same `f`. This
| shows up for example when doing parameter sweeps. You might
| think that you can just use the former for this: if you
| originally had `u` a vector, then you can make `u` a matrix and
| solve `f(u[:,i],p[:,i])` all simultaneously. This is in fact
| what machine learning libraries tend to do with their `vmap`
| approach. However, that hides a choice: how do you change `f`
| to build the composite `f`? You can either do the matrix
| operations on each operation of `f`, or you can build a
| specific GPU kernel for `f`. Of course you can do the former,
| that is exactly the same as the use case above, reusing the
| same machinery. However, we also offer the latter. And the
| reason for that is because calling GPU kernels has a lot of
| overhead and inhibits interprocedural optimizations, and
| therefore the former is a lot slower than the latter when you
     | can exploit this known repeated structure. We were able to
     | demonstrate this in a recent publication, which showed that
     | using the latter approach for ODEs was about 20x-100x faster
     | than PyTorch, Jax, and our own implementation of the former,
     | while our kernel approach is just about as fast as MPGOS, a
     | solver written directly in CUDA, indicating that the gap is a
     | limitation of the former's design (see
     | https://www.sciencedirect.com/science/article/abs/pii/S00457...
     | or the arxiv version https://arxiv.org/abs/2304.06835). The
     | kernel-building approach is thus simply more efficient for
     | things like parameter sweeps,
| and it's a pretty unique way of using the GPU as most machine
| learning libraries are not built around kernel generation but
| instead only give you the linear algebra kernel calling
| approaches.
|
     | tl;dr: DiffEqGPU.jl's documentation is very clear on the two
     | types of GPU usage
     | (https://docs.sciml.ai/DiffEqGPU/stable/getting_started/), but
     | it is specifically for ODEs. We need to write some similar
     | tutorials for GPU usage on the nonlinear solvers to this
     | effect.
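     |
     | For the kernel-generation (ensemble) form, the ODE version from
     | the DiffEqGPU.jl getting-started docs looks roughly like this
     | (from memory, so treat the exact calls as approximate):
     |
     |   using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays
     |   function lorenz(u, p, t)
     |       σ, ρ, β = p
     |       return SVector(σ * (u[2] - u[1]),
     |                      u[1] * (ρ - u[3]) - u[2],
     |                      u[1] * u[2] - β * u[3])
     |   end
     |   u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
     |   p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
     |   prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)
     |   # re-randomize the parameters for each trajectory
     |   pf(prob, i, repeat) =
     |       remake(prob, p = p .* (0.5f0 + rand(Float32)))
     |   eprob = EnsembleProblem(prob, prob_func = pf)
     |   sol = solve(eprob, GPUTsit5(),
     |               EnsembleGPUKernel(CUDA.CUDABackend());
     |               trajectories = 10_000, saveat = 1.0f0)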
| selimthegrim wrote:
| This sounds fascinating and I'd love to help.
___________________________________________________________________
(page generated 2024-03-26 23:02 UTC)