[HN Gopher] Nonlinearsolve.jl: Fast and Robust Solvers for Nonli...
       ___________________________________________________________________
        
       Nonlinearsolve.jl: Fast and Robust Solvers for Nonlinear Equations
       in Julia
        
       Author : adgjlsfhk1
       Score  : 141 points
       Date   : 2024-03-26 03:27 UTC (19 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | wenc wrote:
       | Would be interested to see this compared to IPOPT.
       | 
       | https://coin-or.github.io/Ipopt/
       | 
       | IPOPT isn't tailored to NLEs specifically, but it does solve NLPs
       | well. It also uses some amazing Fortran linear algebra routines
       | from Harwell (MA57).
        
         | adgjlsfhk1 wrote:
          | It should be pretty easy to add to the benchmarks. At first
          | glance, it looks like they have a lot of the necessary
          | machinery to do well.
        
         | jpfr wrote:
         | This article is about "solving" differential equations and not
         | convex optimization.
        
           | wenc wrote:
           | > This article is about "solving" differential equations and
           | not convex optimization.
           | 
           | This article is about solving nonlinear equations (not
           | differential equations, not sure where you got that from).
           | All NLP optimizers can solve nonlinear equations -- it's a
           | special case where the objective is constant.
           | 
            | Ipopt is not a convex solver, so I'm not sure what convex
            | optimization you are referring to. It is a general nonlinear
            | solver, which covers nonconvex problems as well (I worked on
            | nonconvex nonlinear programs for a decade and it was my
            | primary solver).
           | 
           | Also all nonlinear equation systems are nonconvex. (A convex
           | program requires equality constraints to be linear)
        
             | FreakLegion wrote:
             | _> all nonlinear equation systems are nonconvex_
             | 
             | Maybe you have something more particular in mind when you
             | say "systems", but not all nonlinear functions are non-
             | convex. Least squares, for example, is nonlinear and
             | convex.
             | 
             | Also note that IPOPT, while wonderful, is a local solver.
             | It may not be limited to convex problems, but those are the
             | only ones it's guaranteed to solve to optimality.
        
               | Bimos wrote:
               | The feasible region {x | f(x) = 0} is nonconvex no matter
               | whether f is convex.
        
               | graycat wrote:
               | Consider claim:
               | 
               | > The feasible region {x | f(x) = 0} is nonconvex no
               | matter whether f is convex.
               | 
                | For some positive integer n and the set of real numbers
                | R, consider a closed (in the usual topology for R^n),
                | convex set A that is a subset of R^n. Define a function
                | f: R^n --> R so that f(x) = 0 for all x in A, f(x) > 0
                | for all x not in A, and f is infinitely differentiable
                | at every x in R^n. It is a theorem that such an f
                | exists.
               | 
               | Then { x | f(x) = 0 } = A and is both closed and convex
               | in contradiction to the claim.
        
               | ubj wrote:
               | I'm assuming you're referring to nonlinear f(x) because
               | this statement is trivially false for linear f(x).
               | 
               | But consider the function f(x) = max(0, g(x) - c) where
               | the following holds:
               | 
               | * g(x) is nonlinear, positive definite in x, and convex.
               | 
               | * c > 0
               | 
               | Then f(x) is nonlinear and convex (it's the pointwise
               | maximum of convex functions), and the set {x : f(x) = 0}
               | is a convex set.
        
               | wenc wrote:
                | I was talking in terms of convex optimization. The
                | criteria for convex optimization are a convex objective,
                | convex inequality constraints (and hence a convex
                | feasible region), and linear (not just convex) equality
                | constraints.
                | 
                | I'm aware that Ipopt is a local solver.
        
               | FreakLegion wrote:
               | _> all nonlinear equation systems are nonconvex_
               | 
               |  _> I was talking in terms of convex optimization_
               | 
                | I'm not sure what subtlety you're pointing to here. As
                | far as non-convex problems go, though, my point is that
                | IPOPT isn't special in this regard. Any convex solver can
                | be a non-convex solver if you don't care about global
                | optimality.
        
               | wenc wrote:
               | > Any convex solver can be a non-convex solver if you
               | don't care about global optimality.
               | 
                | Aside: structurally I'm not sure how this would be true.
                | 
                | Convex solvers have very specific forms. For instance, a
                | QP solver requires a very particular form and does not
                | admit an arbitrary nonconvex form except for one: a
                | non-PSD Hessian, i.e. a concave problem.
                | 
                | My point is that putting NLEs inside an optimization
                | problem gives rise to a nonconvex optimization problem
                | with no guarantee of a global solution. So convex
                | optimization is not applicable here.
        
         | akutlay wrote:
          | I don't know much about this, but the fact that they did not
          | mention IPOPT or Highs (which I would consider two of the most
          | popular pieces of software used for these problems) made me
          | question this too.
        
           | affinepplan wrote:
            | Those are indeed very popular tools, but for a slightly
            | different kind of problem than the one this package is
            | attempting to solve.
        
             | ChrisRackauckas wrote:
              | Yes, Highs especially is completely unrelated, because it's
              | for (mixed-integer) linear programming and quadratic
              | programming problems, so it's not able to solve the types
              | of problems described in this manuscript.
              | 
              | IPOPT is a bit of a stretch, but I do know that there are
              | some people in the mathematical programming space who use
              | IPOPT's constraint satisfaction piece for solving a
              | nonlinear system in a pinch. It's not the main use case of
              | IPOPT, though, which is nonlinear optimization with
              | nonlinear (in)equality constraints. This case has no
              | objective and is just nonlinear equality constraints, which
              | you can then
             | specialize quite a bit on. But while it is considered
             | common knowledge in numerical circles to not use a
             | nonlinear optimizer to solve nonlinear systems, I don't
             | have a good canonical benchmark to point to on that, so we
             | might as well make this it.
        
           | avikpal1410 wrote:
            | IPOPT, as mentioned in other posts, solves nonlinear
            | optimization problems. Working around it to make it solve
            | nonlinear equations or nonlinear least squares problems
           | typically doesn't go well. We will add benchmarks for the
           | NLLS part (which we don't talk about in the paper) and that
           | will contain IPOPT comparisons.
           | 
            | For some comparison (this is not a benchmark that I or
            | anyone on this paper has written), see
            | https://juliapackagecomparisons.github.io/comparisons/math/n....
            | Specialized solvers for NLLS pretty much always beat general
            | optimization solvers.
        
         | ChrisRackauckas wrote:
         | Definitely an interesting response I didn't see coming. This is
         | one of the reasons to share preprints!
         | 
         | IPOPT is a nonlinear optimization tool, not a nonlinear solver.
         | Generally from what I've seen it's not a good idea to solve
         | nonlinear systems with a nonlinear optimizer, but you can
         | phrase it as a "constraint satisfaction problem", i.e. a
         | nonlinear optimization with a trivial `maximize 0` loss
         | function but with equality constraints to be satisfied `f(u) =
          | 0` and allow it to do that. IIRC the sparse symmetric linear
          | solver is only used in the optimization process, as that's the
          | part where it can guarantee a sparse symmetric linear system;
          | the constraints `f` themselves would have to be symmetric to
          | reuse that in the Newton method, which isn't true for any of
          | the benchmarks. As such, I don't think you'd get most of IPOPT's
         | advantages even showing up in a constraint satisfaction problem
         | at all. Don't get me wrong, IPOPT is an amazing software for
         | nonlinear optimization, but when not doing optimization it
         | would lose a lot of what makes it amazing.
         | 
         | However, perfectly agreed that I cannot lay this to rest right
         | now because I don't have a benchmark to point to. The best
         | thing to do here would be to add it to the benchmarks and fully
         | describe why in the paper. We'll definitely follow up with this
         | in a revision.
         | 
         | In the meantime, if you have more requests for the benchmarks,
         | please feel free to open issues at
         | https://github.com/SciML/SciMLBenchmarks.jl so we can track
         | them. All of our benchmarks run on this open platform and
          | anyone can add things via a PR. In particular, the
          | 23-benchmarks diagram in the paper is simply the result of this
          | script
         | https://github.com/SciML/SciMLBenchmarks.jl/blob/master/benc...
         | which runs automatically on any PR (for new folks it requires
         | we click yes for security reasons though). So please feel free
         | to send any benchmark requests. We do plan to do a lot more
         | comprehensive nonlinear optimization benchmarks in the near
         | future. For this case with IPOPT though, we're happy to add
         | that in ourselves hopefully in the next few weeks.
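          | 
          | For anyone who wants to try that framing directly, here's a
          | rough, untested sketch with JuMP + Ipopt (the two-equation
          | system is made up purely for illustration):
          | 
          |     using JuMP, Ipopt
          |     
          |     # "minimize 0 subject to f(u) = 0": no objective, Ipopt only
          |     # has to satisfy the equality constraints.
          |     model = Model(Ipopt.Optimizer)
          |     @variable(model, u[1:2], start = 0.5)
          |     @NLconstraint(model, u[1]^2 + u[2]^2 == 2.0)   # f_1(u) = 0
          |     @NLconstraint(model, u[1] - u[2]^3 == 0.0)     # f_2(u) = 0
          |     optimize!(model)
          |     value.(u)   # -> approximately [1.0, 1.0]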
        
           | sobellian wrote:
            | You can also pose it as a minimization problem, minimizing
            | ||g(u)||, but I assume that does not magically make anything
            | more efficient.
        
             | ChrisRackauckas wrote:
             | That makes your Newton method require calculating the
             | Hessian of the objective function, which is a second
             | derivative. A normal Newton method for nonlinear systems
             | uses the Jacobian, which is the first derivative (the
             | connection is that the Hessian is the Jacobian of the
              | gradient). But indeed, formulating it so that you introduce
              | nondifferentiable points (the absolute values) and require
              | second derivatives instead of first for the same algorithm
              | is not the good kind of magic :-/. You could then try using
             | only gradient-based methods, like Adam, but you can see
             | that is actually using less problem structure. BFGS for
             | example is related to Broyden's method and is thus using
             | simple Hessian approximations, so for nonlinear system
             | solving this is effectively less stable than just using
             | Newton.
             | 
             | That's not to say that there's no reason to use this trick
             | though. I have seen that some optimization tools can
             | improve robustness in some cases, and this is a way to make
             | use of heuristic methods (genetic algorithms, differential
             | evolution, etc.) for globalizing the process, but it's
             | probably not the first trick you'd want to do in most
             | cases.
             | 
              | Raising the derivative order is one way to understand why
              | this should be less efficient, but another is to understand
              | that it loses some specific structure. In the nonlinear
              | system problem f(u)=0, any u that solves this is a
              | solution. All solutions are the same: if you find one,
              | you're good. In the optimization problem, you want to find
              | "the" u that minimizes f(u). Now if you are minimizing
              | f(u)=||g(u)|| and you know there is a u that makes g(u)
              | zero, then every u that makes g(u) zero is a solution; they
              | all match the global optimum. But solvers don't necessarily
              | know that. Solvers need to use second-order information to
              | know that they have not hit a saddle point or other
              | behavior. They don't know that "near zero = done", and they
              | can't specialize on this information. This is another way
              | in which you're just giving more sophisticated solvers more
              | work to do when you already know the answer by problem
              | design.
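              | 
              | To make the derivative-order point concrete, here's a tiny
              | toy sketch with ForwardDiff (the system g is made up, and
              | the smooth ||g(u)||^2 stands in for ||g(u)|| to keep it
              | differentiable):
              | 
              |     using ForwardDiff
              |     
              |     g(u) = [u[1]^2 + u[2] - 3, u[1] - u[2]^2]   # toy system, g(u) = 0
              |     u = [2.0, 2.0]
              |     
              |     # Newton for g(u) = 0 only needs first derivatives:
              |     J = ForwardDiff.jacobian(g, u)
              |     
              |     # Newton on the scalarized objective needs second derivatives:
              |     h(u) = sum(abs2, g(u))          # ||g(u)||^2
              |     H = ForwardDiff.hessian(h, u)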
        
               | kxyvr wrote:
               | This is not entirely true. A typical trick for
               | accomplishing this is to simply use the Hessian
               | approximation `g'(x)*g'(x)` (star denotes adjoint), thus
               | cutting off the second-derivative evaluation of g, and
               | then use a modified CG method like Newton-Krylov for
               | line-search methods or Steihaug-Toint for trust-region
               | methods. It's very robust, matrix-free, and doesn't
               | require an excessive amount of code to implement.
               | Further, the method can be preconditioned with a positive
               | definite preconditioner.
               | 
                | And, to be clear, there's no guarantee that a zero
                | residual will be found, only that `g'(x)*g(x)=0`, but
                | that tends to be the nature of nonlinear equations.
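                | 
                | As a rough toy sketch of the idea (this one forms J'J
                | explicitly for brevity, whereas the Newton-Krylov /
                | Steihaug-Toint machinery described above stays
                | matrix-free; the residual and helper are made up):
                | 
                |     using ForwardDiff, LinearAlgebra
                |     
                |     g(x) = [x[1]^2 + x[2] - 3, x[1] - x[2]^2]   # toy residual
                |     
                |     # Bare-bones Gauss-Newton: J'J as the Hessian approximation,
                |     # so only first derivatives of g are ever needed.
                |     function gauss_newton(g, x; iters = 20)
                |         for _ in 1:iters
                |             J = ForwardDiff.jacobian(g, x)
                |             x = x - (J' * J) \ (J' * g(x))
                |         end
                |         return x
                |     end
                |     
                |     x = gauss_newton(g, [2.0, 2.0])
                |     norm(g(x))   # ~ machine precision on this well-conditioned toy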
        
               | ChrisRackauckas wrote:
               | > It's very robust, matrix-free, and doesn't require an
               | excessive amount of code to implement. Further, the
               | method can be preconditioned with a positive definite
               | preconditioner.
               | 
                | It is not robust; in fact it's well known as a
                | numerically unstable method. `g'(x)*g'(x)` is numerically
                | unstable because it squares the condition number of the
                | matrix. If you have a condition number of 1e10, which is
                | true for many examples in scientific problems (like the
                | DFN battery model in the paper), the condition number of
                | J'J is 1e20, which effectively means a linear solve is
               | unable to retrieve any digits of accuracy with 64-bit
               | floating point numbers. So while I agree that you can use
               | symmetric linear solvers as a small performance
               | improvement in some cases, you have to be very careful
               | when you do this as it is a very numerically unstable
               | method. For some details, see
               | https://math.stackexchange.com/a/2874902
               | 
                | Also, if you look at Figure 5 in the paper, you will see
                | that there is a way to choose operator conditioning
                | assumptions
                | (https://docs.sciml.ai/LinearSolve/stable/basics/OperatorAssu...),
                | and if you tell it to assume the operator is
                | well-conditioned it will actually use this trick. But we
                | do not default to that because of the ill-conditioning.
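                | 
                | A quick toy demonstration of the squaring (a made-up 2x2
                | example, not one of the paper's problems):
                | 
                |     using LinearAlgebra
                |     
                |     J = [1.0 0.0; 0.0 1e-8]
                |     cond(J)        # ~ 1.0e8
                |     cond(J' * J)   # ~ 1.0e16, about 1/eps(Float64): no digits left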
        
               | kxyvr wrote:
               | Yes, care must be taken to stabilize the algorithm.
               | However, I do not believe it to be true that no digits of
               | accuracy can be obtained. The algorithm falls back to the
               | Cauchy (steepest-descent) point on the first iteration
               | and this will give reduction. That said, I'm willing to
               | solve this particular problem and see. Where's your code
               | for the DFN battery model? The page here leads to a dead
               | github link:
               | 
               | https://help.juliahub.com/batteries/stable/api/#JuliaSimB
               | att...
        
               | ChrisRackauckas wrote:
               | The full DFN has a lot more going on. Use the benchmark-
               | ified version from the benchmarks:
               | 
               | https://docs.sciml.ai/SciMLBenchmarksOutput/dev/Nonlinear
               | Pro...
               | 
                | But as Avik states, this same kind of thing is done as
                | part of the trust region methods, so the trust region
                | stuff with some ways to avoid J'J is effectively doing
                | the same thing. Of course, that cannot target a symmetric
                | linear solver, because J is not generally symmetric and
                | it needs to do the L2 solve, so it needs a QR/GMRES.
               | 
               | But indeed, let's get IPOPT in this interface and
               | benchmarks and show it in the plots.
        
               | kxyvr wrote:
                | What's your convergence metric? It looks like x_sol gives
                | a residual on the order of 1e-7. That's fine, but I'd
                | like to measure a number equivalent to the benchmark's.
        
               | ChrisRackauckas wrote:
                | It varies; that's why it's measured as work-precision.
                | It would be good to similarly produce a full work-
                | precision line, as point estimates do not tell the full
                | story.
               | 
                | Note that we are preparing an extension to easily use
                | the optimizers
                | (https://github.com/SciML/NonlinearSolve.jl/pull/395) so
                | we can add it to the benchmarks.
        
               | avikpal1410 wrote:
               | That is precisely how trust region methods work for
               | nonlinear least squares problems and nonlinear systems.
               | MINPACK, which both SciPy and Matlab use, defaults to
               | this kind of TR scheme (without the matrix-free part). We
               | do have TR comparisons in the paper.
               | 
                | Note that we don't, however, solve the normal-form J'J
                | system, as that is generally bad (unless the user opts in
                | to doing so), and instead use the least squares
                | formulation, which is more numerically stable.
        
           | wenc wrote:
            | Yes, tribal knowledge says NLP solvers aren't tailored to
            | NLEs the way NLE solvers are. There have been papers in the
            | past showing this, but in my experience it's not obvious in
            | practice. Nonlinear systems can often surprise us. The
            | initial point (and randomness) has more influence on the
            | solution time than almost anything else.
            | 
            | I forget the exact internal formulation of the problem in
            | IPOPT, but for an NLE it's likely the efficiency of the
            | linear algebra (MA57) algorithm will dominate since it is,
            | as you say, constraint satisfaction.
        
             | adgjlsfhk1 wrote:
              | Isn't MA57 just a standard good sparse linear solver? For
              | sparse problems, NonlinearSolve.jl (via LinearSolve.jl)
              | will either use KLU or Cholmod, both of which are also
              | quite fast. Also, even if the linear solver is the
              | bottleneck, that doesn't mean different algorithms won't
              | make a difference. Different nonlinear solve algorithms can
              | lead to needing fewer iterations (e.g. fewer linear solves).
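              | 
              | For reference, here's a rough, untested sketch of how that
              | choice is exposed (toy diagonal system; the `linsolve`
              | keyword and `KLUFactorization` are the documented knobs,
              | the rest is made up for illustration):
              | 
              |     using NonlinearSolve, LinearSolve, SparseArrays
              |     
              |     f(u, p) = u .* u .- p   # toy system with a diagonal Jacobian
              |     nf = NonlinearFunction(f; jac_prototype = spdiagm(0 => ones(4)))
              |     prob = NonlinearProblem(nf, ones(4), 2.0)
              |     
              |     # Ask for KLU's sparse LU instead of the default factorization.
              |     sol = solve(prob, NewtonRaphson(linsolve = KLUFactorization()))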
        
               | wenc wrote:
               | I don't know. I'm curious is all.
               | 
                | MA57 seems generic, but I've tried different linear
                | solvers with IPOPT, even ones that were supposed to be
                | better, and MA57 still performed the best.
               | 
               | Nonlinear numerical problems involve a bit of art. Theory
               | doesn't always pan out in practice.
        
               | kxyvr wrote:
               | MA57 is a sparse symmetric solver. IPOPT uses this
               | because it solves a modified KKT system that includes
               | both the Hessian and constraint information
               | simultaneously. This differs from composite step methods,
               | which split the optimization step into a step for
               | feasibility (quasi-normal step) and a step for optimality
                | (tangential step). An algorithm like KNITRO uses a
                | composite step method and, as a result, uses a different
                | kind of linear system solver.
        
             | guyomes wrote:
             | > The initial point (and randomness) has more influence on
             | the solution time than almost anything else.
             | 
              | On that remark, a theoretical tool for choosing the
              | initial point is the Newton Polygon. For univariate
              | polynomials, this leads to some of the fastest solvers
              | [1]. The idea for choosing the initial point in this case
              | is notably described in Section 6 of this paper [2].
              | 
              | A very rough intuition of the idea is that a polynomial
              | p(z) doesn't vanish if one of its monomials is
              | significantly larger than all the others. Conversely, the
              | polynomial is more likely to vanish for values of z at
              | which two monomials have the same order of magnitude, so
              | such a value is a good choice for the initial point.
              | Finally, deciding whether two monomials have the same
              | order of magnitude can be done by taking their logs, and
              | then it becomes geometry with polygons, hence the name
              | Newton Polygon.
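              | 
              | As a tiny worked instance of that heuristic (a made-up
              | toy, not from [1] or [2]): in p(z) = z^3 - 1000z + 1, the
              | monomials z^3 and -1000z balance at |z| ~ sqrt(1000) ~
              | 31.6, and -1000z balances the constant 1 at |z| ~ 1/1000;
              | the three roots indeed sit near those magnitudes:
              | 
              |     using Polynomials
              |     
              |     p = Polynomial([1.0, -1000.0, 0.0, 1.0])   # 1 - 1000z + z^3
              |     roots(p)   # ~ -31.6, 0.001, 31.6 (up to ordering)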
             | 
             | [1]: https://numpi.dm.unipi.it/scientific-computing-
             | libraries/mps...
             | 
             | [2]: https://woelen.homescience.net/science/math/exps/polyn
             | omials...
        
       | Bimos wrote:
        | On GPU, it can only solve small instances that a single GPU
        | thread can handle, and can only benefit from the GPU by solving
        | multiple instances, right?
        
         | eigenspace wrote:
          | That's the GPU parallelization they offer, but if you had a
          | problem where, say, the objective function itself was really
          | big and cumbersome and benefited from GPU parallelization, then
          | you could put that objective function on the GPU just fine.
          | 
          | I think this use case isn't really discussed in the paper
          | because the solver doesn't need to specialize on that case; it
          | should 'just work' without any intervention needed.
        
         | ChrisRackauckas wrote:
         | No, there are two types of GPU usage: the ensemble form and the
          | array form. In the array form, you simply make your state
          | variable a GPU array and can run all internal operations on
          | the GPU. This is used for GPU acceleration of `f` functions
          | where you have a large state and a relatively "regular"
          | function in `f`, such as neural networks or PDE
          | discretizations. Thus you can
         | see this in action for example on things like the
         | DeepEquilibriumNetworks.jl
         | https://docs.sciml.ai/DeepEquilibriumNetworks/stable/tutoria...
         | use cases, where it's big neural networks with nonlinear
         | solvers in them.
         | 
         | The other use case is if you're solving tons of f(u,p)=0
         | equations for many different p, but all the same `f`. This
         | shows up for example when doing parameter sweeps. You might
         | think that you can just use the former for this: if you
         | originally had `u` a vector, then you can make `u` a matrix and
         | solve `f(u[:,i],p[:,i])` all simultaneously. This is in fact
         | what machine learning libraries tend to do with their `vmap`
         | approach. However, that hides a choice: how do you change `f`
         | to build the composite `f`? You can either do the matrix
         | operations on each operation of `f`, or you can build a
          | specific GPU kernel for `f`. Of course you can do the former;
          | that is exactly the same as the use case above, reusing the
         | same machinery. However, we also offer the latter. And the
          | reason for that is that calling GPU kernels has a lot of
          | overhead and inhibits interprocedural optimizations, and
          | therefore the former is a lot slower than the latter when you
          | can exploit this known repeated structure. We were able to
          | demonstrate this in a recent publication, which showed that
          | using the latter approach for ODEs was about 20x-100x faster
          | than PyTorch, Jax, and our own implementation of the former,
          | while our approach to the latter is just about as fast as
          | MPGOS, which is written directly in CUDA, indicating that the
          | slowdown is a limitation of that design (see
          | https://www.sciencedirect.com/science/article/abs/pii/S00457...,
          | or the arxiv version
          | https://arxiv.org/abs/2304.06835). The kernel building approach
         | is thus simply more efficient for things like parameter sweeps,
         | and it's a pretty unique way of using the GPU as most machine
         | learning libraries are not built around kernel generation but
         | instead only give you the linear algebra kernel calling
         | approaches.
         | 
         | tl;dr, DiffEqGPU.jl's documentation is very clear on the two
         | types of GPU usage
         | https://docs.sciml.ai/DiffEqGPU/stable/getting_started/, but is
         | specifically for ODEs. We need to write some similar tutorials
         | for GPU usage on the nonlinear solvers to this effect.
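          | 
          | As a rough, untested sketch of the array form on the nonlinear
          | solvers (the `f` is a toy, and the Krylov `linsolve` is just
          | one documented option for keeping the linear algebra
          | matrix-free on the GPU):
          | 
          |     using CUDA, NonlinearSolve, LinearSolve
          |     
          |     f(u, p) = u .^ 2 .- p           # "regular" elementwise residual
          |     u0 = CUDA.fill(1.0f0, 10_000)   # state lives on the GPU
          |     prob = NonlinearProblem(f, u0, 2.0f0)
          |     
          |     # Krylov linear solver, aiming to avoid a dense factorization on the GPU.
          |     sol = solve(prob, NewtonRaphson(linsolve = KrylovJL_GMRES()))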
        
           | selimthegrim wrote:
           | This sounds fascinating and I'd love to help.
        
       ___________________________________________________________________
       (page generated 2024-03-26 23:02 UTC)