[HN Gopher] Gradient Free Optimizers: A collection of modern opt...
___________________________________________________________________
Gradient Free Optimizers: A collection of modern optimization
methods in Python
Author : EntICOnc
Score : 176 points
Date : 2021-02-28 13:10 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| freemint wrote:
| If you have an expensive but not high dimensional problem you
| might want to try https://github.com/DLR-SC/sqpdfo . It is based
| on Krigging, SQP and also handles constraints.
| klon wrote:
| I am surprised CMA-ES is not included. It is the state of the art
| gradient free optimization algorithm last I checked.
|
| https://en.wikipedia.org/wiki/CMA-ES
| simonblanke wrote:
| Evolution strategy is included. The "Covariance matrix
| adaptation" is for making this algorithm work for continuous
| search spaces. But gradient-free-optimizers has discrete search
| space.
| morelandjs wrote:
| This looks great! You have a lot of algorithms here. Which ones
| do you find to be the most promising for general use and why?
| simonblanke wrote:
| I think Random Restart Hill Climbing is good if your objective
| function evaluates fast (simple functions). It does not get
| stuck in local optima and also does local search very well.
|
| Bayesian Optimization is very good if your objective function
| takes some time (> 1 second) to evaluate. If you look at the
| gifs in the readme of Gradient-Free-Optimizers you will see how
| fast it finds promising solutions. It is almost scary how good
| it is (even compared to other smbo).
| morelandjs wrote:
| Thanks, that's useful. My impression of Bayesian parameter
| optimization using Gaussian processes is that it is quite
| good when the data has a more or less constant correlation
| length across the evaluation points as in your example.
|
| When there are large correlation lengths in some regions of
| the dataset and small correlation lengths elsewhere, it often
| over (or under) shoots the curvature of the hypersurface.
| protoplaid wrote:
| Which algorithm would you recommend when the objective
| function is noisy (and nondeterministic)?
|
| For example the objective function is the "score" of a
| particular stochastic simulation, which can be started with
| varied initial random seed, or the result of a real physical
| experiment, which is naturally stochastic (and expensive to
| evaluate).
|
| There is a tradeoff between getting a very accurate
| estimation of the objective function + variance _of a single
| point_ vs exploring other points. Is there a search algorithm
| that somehow manages this tradeoff automatically?
|
| Note: In the past I've used Tree of Parzen Estimators (Kernel
| density estimators), wasting 3-4 evaluations per point, but I
| have a feeling it is sub-optimal. Is there an "optimal"
| algorithm, like the _optimal algorithm_ for the multi-armed
| bandit problem[1] (which is similar)
|
| [1] https://en.wikipedia.org/wiki/Multi-armed_bandit
| morelandjs wrote:
| I'm wondering the same. I'd be concerned that Random
| Restart Hill climbing would essentially chase random noise.
| simonblanke wrote:
| You could be right. I must confess, that i have a
| (probably) very narrow understanding of typical
| optimization problems. Most of the objective functions i
| optimize have machine learning algorithms in it (to
| optimizer hyperparameters). Depending on the evaluation
| those can have low noise.
|
| If you like you could provide other use cases and
| applications for optimization algorithms.
| plaidfuji wrote:
| Bayesian Optim is designed for that case specifically. It
| fits a surrogate model with uncertainty estimates and picks
| the next point with an understanding of that uncertainty.
| Look up the MEI acquisition function for more info.
|
| Edit: BO does usually require some tuning for your use
| case. Its acquisition function sometimes samples replicates
| where there's high noise, especially if the first sample
| looks particularly "good". There's usually a hyper
| parameter that can be set to favor exploration vs
| exploitation, I.e. to favor non-replicate samples. But I am
| not aware of an algo that can learn your preference along
| that axis.
| protoplaid wrote:
| Correct me if I'm wrong, but it seems the
| bayesian_optimization.py optimizer in this library
| assumes that the sampled points are exact, ie their
| variance is zero. It doesn't seem to re-sample existing
| points.
|
| This will cause the algorithm to "chase random noise", as
| morelandjs wrote below
| plaidfuji wrote:
| I would be very disappointed if that were the case.. no,
| it looks like it's set up to capture variance. The BO
| algo wraps an "Expected Improvement Optimizer":
|
| https://github.com/SimonBlanke/Gradient-Free-
| Optimizers/blob...
|
| Which selects new points based on both the model's mean
| estimate and its variance. See around line 58
| protoplaid wrote:
| line 62: exp_imp[sigma == 0.0] = 0.0
|
| I'm afraid it never samples points more than once, since
| it estimated already-sampled-points as points with
| variance zero, and no expected improvement.
|
| IMHO that's wrong. Variance of a single _sample_ should
| be infinite (classical statistics), or similar to the
| variance of nearby points (bayesian+model), or some pre-
| defined prior (not a great idea... I 'd prefer some
| automatic method). But not zero.
| plaidfuji wrote:
| Ah, good catch. So in the event the gpr predicts zero
| variance, the optimizer says EI is zero and thus won't
| sample again. That may depend on the settings of the
| gpr.. if I'm not mistaken there are ways for gpr to model
| noise and not collapse to zero variance on every sampled
| point.
|
| Anyway, I guess I stand by my original suggestion that BO
| is the best tool for gradient free optim with slow and
| noisy fevals, but to my knowledge nobody has built a way
| to dial in the hyper parameters automatically. And there
| are quite a few. Entire companies exist for this purpose,
| SigOpt comes to mind..
| morelandjs wrote:
| I haven't looked at his source, but usually there's a
| white noise term in the covariance structure of the
| Gaussian process regressor that adds some statistical
| uncertainty at each evaluation point. So even when it
| evaluates a point of parameter space the GPR is still
| somewhat uncertain about the value of the optimization
| function at that point. So it should balance exploration
| versus exploitation taking that statistical uncertainty
| into account.
| protoplaid wrote:
| > especially if the first sample looks particularly
| "good".
|
| You've precisely described the problem: the algorithm
| will get stuck on a point if the _first_ sample looks
| good and the assumption of zero variance. Until it
| randomly hits a luckier sampler (but not necessarily
| better point).
|
| Another related problem, is that the boundaries of the
| parameter space have a bad score (objective function),
| but _very low variance_ (they 're _always_ bad), which
| confuses the search function into believing that the
| interior points also have a very low variance, which is
| incorrect.
|
| If anyone knows of a library that handles those cases
| correctly, without providing user-defined priors for each
| dimensions, I'd be glad to hear
| optimalsolver wrote:
| Not OP, but you can't go wrong with randomized/stochastic hill-
| climbing.It's actually very competitive with complex, fancy
| genetic algorithms:
|
| https://core.ac.uk/download/pdf/37767541.pdf
| uyt wrote:
| For those who want to learn more, I really like the book
| "Algorithms for Optimization":
| https://algorithmsbook.com/optimization/#outline
|
| It's a great intro because it covers a large breadth and gives
| you a sense of what class of techniques are useful for what
| (rather than just going super deep into one particular
| technique).
|
| They also have a github in julia notebooks implementing most of
| the ideas: https://github.com/sisl/algforopt-notebooks
| yaroslavvb wrote:
| SPSA seems to be missing. OpenAI's "evolutionary strategies"
| paper used a form of SPSA.
| https://en.wikipedia.org/wiki/Simultaneous_perturbation_stoc...
| simonblanke wrote:
| Also: Gradient-Free-Optimizers is basically just the
| optimization-backend for a much larger project of mine:
| https://github.com/SimonBlanke/Hyperactive
| amatic wrote:
| Amazing collection! I'm learning about backprop in neural
| networks and I'm reading how that kind of optimization may not be
| biologically 'implementable'. Are non-gradient based optimizers
| an alternative proposal for biological learning? How do they
| compare to backprop?
| MauranKilom wrote:
| I'd recommend also implementing the DIRECT algorithm, which
| balances local and global search very well (but consequently will
| not refine as aggressively near the current [possibly only local]
| optimum as other algorithms):
|
| https://www.researchgate.net/profile/Donald-Jones-5/publicat...
|
| You probably also want to include some advantages/disadvantages
| of each algorithm. How robust against local minima is it? Up to
| how many dimensions does it work well? How is the convergence
| speed when started far from the optimum? Does it work well with
| few function evaluations? Etc.
| simonblanke wrote:
| I will look into this algorithm. Thanks for the suggestion. I
| have some basic explanations of the optimization techniques and
| their parameters in a separate repository:
| https://github.com/SimonBlanke/optimization-tutorial
|
| But there is still a lot of work to be done.
| asah wrote:
| Given the rise of multi-core machines, parallel optimization
| seems pretty important. Which of these support parallel execution
| ?
|
| For this reason, I chose Mango for a recent project:
| https://github.com/ARM-software/mango (bayesian optimizer with
| built-in support for both parallel optimization (via
| multiprocessing) and distributed (via celery))
| simonblanke wrote:
| Gradient-Free-Optimizers is a lightweight optimization package
| that serves as a backend for Hyperactive:
| https://github.com/SimonBlanke/Hyperactive
|
| Hyperactive can do parallel computing with multiprocessing or
| joblib, or a custom wrapper-function. Parallel computing can be
| done with all its optimization algorithms. You could even pass
| a gaussian process regressor like GpFlow to the Bayesian
| Optimizer (GFO and Hyperactive) to get GPU acceleration.
| wtallis wrote:
| Consider adding some affordance for parallel computing to
| Gradient-Free-Optimizers by allowing the user to provide a
| vectorized objective function instead of one that evaluates
| only a single point in the search space per function call.
| That leaves all the hard work of parallelization as an
| exercise for the user, and gives the user the flexibility to
| parallelize their objective function with whatever mechanism
| they wish.
|
| I have previously used this approach in a project where the
| objective function contained a half-hour long simulation,
| which was the bottleneck that made estimating a gradient
| intractable. When the optimization algorithm gave a batch of
| several points in the search space to evaluate, our objective
| function could prepare and run several instances of the
| simulation in parallel, and return when the whole batch was
| complete. From this, it was easy for us to also distribute
| simulation runs across several machines, without needing any
| changes to the optimizers. We would not have been able to
| easily achieve this with an optimization framework that tried
| to directly manage parallelization, because the steps
| necessary to prepare the input files for the simulation
| software had to be done serially.
|
| For that project we tried: DIRECT, several variants of
| Nelder-Mead, and an evolutionary strategy. In hindsight, the
| Nelder-Mead variants worked best; once we accumulated enough
| simulation results it became clear that our objective
| function was convex and pretty well-behaved in the region of
| interest. Nelder-Mead was also trivial to extend to trying
| several extra points per batch to ensure that each of our
| several workstations had something to work on. (We didn't
| have access to a large cluster, and Nelder-Mead wouldn't
| generalize well to a large degree of parallelization in that
| manner.)
| simonblanke wrote:
| Your parallel computing approach sounds intriguing! Could
| you provide an example script? I would like to look into
| this. If you like you could open an issue as a feature
| request and provide a code snipped there.
| dijereedan wrote:
| What about genetic optimization?
| simonblanke wrote:
| Gradient-Free-Optimizers has a type of genetic algorithm:
| Evolution Strategy
| stilley2 wrote:
| Cool! I'd considering adding CMA-ES [1], which as far as I know
| is the gold standard for derivative free optimization.
|
| 1. https://en.m.wikipedia.org/wiki/CMA-ES
| stilley2 wrote:
| Ah nevermind I see this was addressed in another comment.
| morelandjs wrote:
| Getting a lot of these errors when using the GPR. Seems to be an
| issue with the estimation of the noise level term. Scratching my
| head on this one. Anyone familiar with what's going on? I tried
| adding some random noise to the optimization function, and it had
| the same issue.
|
| ...python3.8/site-
| packages/sklearn/gaussian_process/kernels.py:402:
| ConvergenceWarning: The optimal value found for dimension 0 of
| parameter k2__noise_level is close to the specified lower bound
| 1e-05. Decreasing the bound and calling fit again may find a
| better value.
| simonblanke wrote:
| Yeah this is a warning from sklearns gaussian process
| regressor. Sklearn probably wants you to increase the
| "n_restarts_optimizer"-parameter of the gpr, but from my
| experience this warning does not correlate with bad results
| from Gradient-Free-Optimizers.
|
| Sklearn warnings are often difficult to silence. Just google
| "sklearn suppress warnings" to get a code snippet for this
| problem.
| morelandjs wrote:
| thanks!
| mik09 wrote:
| nice repo, i saw some comments expressing excitement and thought
| of the first time i saw pymoo, tpot, and finally automl. there
| are a lot of stuff laying around github but we just need to
| search better, like using topics (tags) and the gazer repo which
| kinda page-ranks the github repos.
| simonblanke wrote:
| I had the same excitement for tpot! That project was the reason
| i started creating mine.
| plaidfuji wrote:
| Be still my heart... could this be... the missing gradient-free
| optim library for Python? Particle swarm, evolutionary, AND
| Bayesian all in a single package? A no-nonsense, simple API with
| strong enough defaults that you can basically just call .search()
| with a function (I'm looking at you, nevergrad)??
|
| I'll have to test it of course but this looks truly life
| changing. A few clarifications, if the author happens to be
| watching this thread:
|
| - looks like the search space is "enumerated", I.e. I define the
| grid spacing for each independent variable in search_space and it
| puts together an outer product? In other words, this will never
| be truly continuous?
|
| - by the same token, constraints are applied by virtue of
| specifying the endpoints of each grid in the search_space? Is it
| possible for me to apply more complex constraints by generating
| my own enumerated list of search_space points and feeding that
| directly?
|
| Seriously well done. Like I said above, this looks to me like an
| immediate staple library.
| optimalsolver wrote:
| There's also Opytimizer [0] for almost every metaheuristic
| optimization algorithm under the Sun.
|
| [0] https://github.com/gugarosa/opytimizer
| simonblanke wrote:
| I am glad you like my project. The search space is not
| continuous. But you can make the steps as small as you want.
| Like: np.arange(0, 10, 0.00001)
|
| I am not sure i understand your second question. The entire
| search space is always a N-dimensional cube. So you just define
| a start and end point and the step size in each dimension.
| plaidfuji wrote:
| Nope, that answers the question. I was asking about cases
| where the space is, for example, a triangle or something non-
| cuboidal. Think constraints like x1 + x2 < 1 or something
| like that.
|
| What I'm wondering is if your library converts the
| search_space to an enumerated list of candidate points
| somewhere in the backend, and if there's a way for me to just
| construct that list and pass it directly for more complex
| cases.
| zmk_ wrote:
| They probably mean that you can search in n-dimensional
| cubes, but what they'd like is having a more general shapes
| of the grid, e.g., have restriction of the sort: a+b<1.
| dwiel wrote:
| Another option in some cases is to just map the space to
| something else. Like in this case, x and y are
| unconstrained and are what the optimizer sees, a = x and b
| = (1 - a) - abs(y).
|
| Not always easy to do, but can work in some cases if the
| optimizer cant deal with it natively.
| simonblanke wrote:
| Okay that is interesting. You could realize that by making
| a restricted area in the search space by returning np.nan
| in the objective function for those cases. Gradient-Free-
| Optimizers can handle np.nan and np.inf just fine.
|
| Maybe you could do something like:
|
| If a+b>=1: return np.nan else: return score
| ialyos wrote:
| This is at a high level how one of our Google internal
| black box optimizers behaves. The term used is
| "infeasible region" but it's the same idea as using a
| nan.
|
| I would caution against using nan to always mean
| infeasible. Instead users should catch experiments
| outside the feasibility region and return a special
| infeasible value. This will increase visibility into the
| behavior of the optimizer, because it leaves nan to be
| used for values inside the region of constraint that are
| still problematic (due to bugs, numerical instability,
| etc)
| unnah wrote:
| If the volume of feasible space is small, the optimizer
| may have considerable trouble finding the feasible
| region(s) in the domain. The simple solution is to use a
| penalty function instead, i.e. add a term like M*max(0,
| 1-(a+b)) to the objective, with sufficiently large M. If
| the optimum is at the boundary of the feasible region,
| you might still get a solution that is slightly
| infeasible, which can be worked around with more special
| case logic...
| [deleted]
| abhgh wrote:
| This would be like the barrier function [1] for gradient
| based methods; which produces some challenges for those
| techniques. Do you know how this affects gradient free
| techniques?
|
| [1] https://en.m.wikipedia.org/wiki/Barrier_function
| aldanor wrote:
| Looks very nice!
|
| Two questions off top of my head:
|
| - Might be worth mentioning how Hyperactive compares to tons of
| other similar packages, like Hyperopt, Optunity and a ton of
| others? Also how your GFO package is different from optima,
| opytimizer, and many others as well. Having some benchmarks or
| performance stats (or even just listing functionality
| differences) would be very helpful to anyone who's been already
| using those.
|
| - One of the best universal 'works-out-of-the-box-most-of-time'
| global optimizers I've used recently is DLib's:
| http://blog.dlib.net/2017/12/a-global-optimization-algorithm... -
| any chances for implementing something similar?
| simonblanke wrote:
| Two very interesting questions! I should work soon on a
| comparison of Hyperactive and GFO to other packages. If some
| important features are missing, maybe i could add them.
|
| I will also look into Dlib. If you like you can open an
| official issue in my repository. We could discuss why it is
| useful and if it should be implemented.
| twic wrote:
| For linear programming solvers, there is an annual competition to
| see which is best:
|
| https://www.minizinc.org/challenge.html
|
| Is there anything like this for gradient-free optimisers?
| simonblanke wrote:
| I thought about a table in the readme that shows some kind of
| metric for each optimizer that describes its performance. I
| will look into that.
| twic wrote:
| I love gradient-free optimisers - there are so many interesting
| problems where there aren't analytical gradients available.
|
| An I right in thinking this library doesn't contain a simplex /
| Nelder-Mead optimiser? To me that's the bread and butter of
| gradient-free optimisation.
|
| Also no Powell algorithms like BOBYQA?
|
| nlopt has these and many more, and a Python wrapper:
|
| https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
|
| So, are those omitted because this library is modern and those
| are now considered out of date?
| bloaf wrote:
| Came here to ask about the Powell method. I find Powell is
| typically what engineers come up with if asked to optimize
| something "by hand" without a mathematical model of the system
| they are optimizing.
|
| https://en.wikipedia.org/wiki/Powell%27s_method
| simonblanke wrote:
| I am currently working on the Nelder-Mead optimiser. I will
| also look into "BOBYQA". I am always searching for interesting
| algorithms that my users need. If you have more suggestions you
| can open an issue in the repository.
| sleavey wrote:
| I've used this, and it works nicely:
| https://github.com/numericalalgorithmsgroup/pybobyqa. I'd be
| happy if it were added to your project, then I could just use
| yours and have access to a bunch of alternatives with the
| same API.
| twic wrote:
| That's good to hear. I don't have any other specific requests
| - for me it seems impossible to know what will work on my
| problems, so i just try things at random to see what works!
|
| I don't know the architecture of your project, but is there
| any way it could include NLopt, and so make all of its
| algorithms available? Some of them could be useful, and i
| doubt it's a good use of your time to reimplement them all
| from scratch.
| optimalsolver wrote:
| I hope to see gradient-free/black box optimization algorithms
| gain more popularity in machine learning. I'm always surprised
| when I see a review of ML optimization methods, and it's not even
| mentioned that you can optimize an ML model without using the
| gradient.
|
| Also, love or hate Python, there's almost always a library for
| whatever it is you want to do.
| xyzzyz wrote:
| > and it's not even mentioned that you can optimize an ML model
| without using the gradient.
|
| Can you, though? How does the outcomes compare to gradient
| based methods?
| cgearhart wrote:
| In general GFO is not a good alternative when there are lots
| of model parameters or when it's cheap to get the gradient.
| GFO _could_ be used to train models, but kinda in the same
| way that you _could_ find your way to Cleveland by just
| wandering around long enough. It sure would help a lot if you
| had a compass.
| shankr wrote:
| Can you use the methods in this library to do constrained
| optimization? Whenever I look for such gradient free optimizer
| it's always comes with some constraint.
| bbwwnn wrote:
| One simple trick is doing a bijective transformation between
| your constrained input space and the unconstrained optimizer
| space. For example, if you have x>0 in your input space, you
| take the input from your optimizer and apply e^x to it. Or if
| you have a<x<b you can do a+b*logit(x). What exactly you choose
| will depend on your prior on how your function behaves.
| simonblanke wrote:
| I just saw, that my project got a lot of new stars on github. A
| surprise to be sure, but a welcome one.
| batterylow wrote:
| A book on Practical Evolutionary Algorithms with Python for
| anyone interested https://datacrayon.com/shop/product/practical-
| evolutionary-a...
| jordigh wrote:
| Is the Nelder-Mead a.k.a. the _other_ simplex method any good? I
| remember most of these other methods from my optimisation
| lessons, but I 'm surprised to not see Nelder-Mead in this list.
| simonblanke wrote:
| I am currently working on the Nelder-Mead algorithm. I did not
| realize that it is that popular. This gives me motivation to
| implement it soon ;-)
|
| If there are more "must have"-algorithms you could open an
| issue in Gradient-Free-Optimizers.
| jordigh wrote:
| I like it because it was so easy to implement and it worked
| well, unless your objective function had very narrow and
| steep valleys.
|
| I don't know if it's popular, heheh, it's been a while since
| I've been in maths school, so that's why I'm asking.
___________________________________________________________________
(page generated 2021-02-28 23:00 UTC)