[HN Gopher] Bayesian Neural Networks
___________________________________________________________________
Bayesian Neural Networks
Author : reqo
Score : 231 points
Date : 2024-11-18 16:47 UTC (4 days ago)
(HTM) web link (www.cs.toronto.edu)
(TXT) w3m dump (www.cs.toronto.edu)
| datastoat wrote:
| I like Bayesian inference for few-parameter models where I have
| solid grounds for choosing my priors. For neural networks, I like
| to ask people "what's your prior for ReLU versus LeakyReLU versus
| sigmoid?" and I've never gotten a convincing answer.
| salty_biscuits wrote:
| I'm sure there is a way of interpreting a relu as a sparsity
| prior on the layer.
| pkoird wrote:
| Kolmogorov Arnold nets might have an answer for you!
| jwuphysics wrote:
| Could you say a bit more about how so?
| pkoird wrote:
| KANs have learnable activations based on splines
| parameterized on few variables. You can specify a prior
| over those variables, effectively establishing a prior over
| your activation function.
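|
| A rough sketch of what I mean (my own illustration; I'm using
| a Gaussian radial basis as a stand-in for the B-splines real
| KANs use, and the names are made up):
|
|   import torch
|
|   class SplineActivation(torch.nn.Module):
|       # Learnable 1-D activation: a linear combination of
|       # fixed basis functions with learnable coefficients.
|       def __init__(self, n_basis=8, prior_std=1.0):
|           super().__init__()
|           self.register_buffer(
|               "centers", torch.linspace(-2.0, 2.0, n_basis))
|           self.coef = torch.nn.Parameter(torch.zeros(n_basis))
|           self.prior_std = prior_std
|
|       def forward(self, x):
|           # basis expansion phi(x), shape (..., n_basis)
|           phi = torch.exp(
|               -(x.unsqueeze(-1) - self.centers) ** 2)
|           return phi @ self.coef
|
|       def neg_log_prior(self):
|           # Gaussian prior over the coefficients is a prior
|           # over the activation; add it to the training loss.
|           return 0.5 * (
|               self.coef / self.prior_std).pow(2).sum()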
| dccsillag wrote:
| Ah, Kolmogorov Arnold Networks. Perhaps the only model I have
| ever tried that managed to fairly often get AUCs below 0.5 in
| my tabular ML benchmarks. It even managed to get a frankly
| disturbing 0.33, where pretty much any other method
| (including linear regression, IIRC) would get >=0.99!
| SpaceManNabs wrote:
| Why do you think they perform so poorly?
| dccsillag wrote:
| Theory-wise, I'm not convinced that the models have good
| approximation properties (the Kolmogorov-Arnold /
| Kolmogorov Superposition Theorem they base themselves on
| has quite a bit of nuance), and the optimization problem
| might be a bit tricky. I also can't see how to
| incorporate inductive biases other than the standard R^n
| / tabular regression one, and the existing attempts on
| this that I'm aware of are just band-aids (along the
| lines of feature engineering).
|
| In practice, I've personally run some benchmarks on a
| collection of datasets I had lying around. The results
| were generally abysmal, with the method only matching
| simple baselines in some few datasets.
|
| Finally, the original paper is very weird, and reads more
| like a marketing piece. The theory, which is touted
| throughout the paper, is very weak; the actual algorithm
| is not sufficiently well explained there; and the
| experiments are lacking. In particular, I find it telling
| that they do not include, and even go out of their way to
| ignore, important baselines such as boosted trees, which
| are the state-of-the-art solution to the problem that
| they intended to solve (and even work very well in
| occasions where they claim that both KANs and MLPs
| perform badly, e.g. in high dimensions).
| duvenaud wrote:
| I agree choosing priors is hard, but choosing ReLU versus
| LeakyReLU versus sigmoid seems like a problem with using neural
| nets in general, not Bayesian neural nets in particular. Am I
| misunderstanding?
| stormfather wrote:
| I choose LeakyReLU vs ReLU depending on whether it's an odd day
| of the week, LeakyReLU being slightly favored on odd days
| because it's aesthetically nicer that gradients propagate
| through negative inputs, though I can't discern a difference. I
| choose sigmoid if I want to waste compute to remind myself that
| it converges slowly due to vanishing gradients at extreme
| activation levels. So it's empiricism retroactively justified
| by some mathematical common sense that lets me feel good about
| the choices. Kind of like aerodynamics.
| sideshowb wrote:
| I like Bayes, but I thought the "surprising" result is that
| double descent is supposed to prevent NNs from overfitting?
| duvenaud wrote:
| Good point. We wrote this pre-double descent, and a massively
| overparameterized model would make a nice addition to the
| tutorial as a baseline. However, if you want a rich predictive
| distribution, it might still make sense to use a Bayesian NN.
| dccsillag wrote:
| Bayesian Neural Networks just seem like a failed approach,
| unfortunately. For one, Bayesian inference and UQ fundamentally
| depend on the choice of the prior, but this is rarely discussed
| in the Bayesian NN literature and practice, and the problem is
| further compounded by how fundamentally hard these priors are
| to interpret and choose (what is the intuition behind a NN's
| parameters?). Add to that the fact that the Bayesian inference
| is very much approximate, and you should see the trouble.
|
| If you want UQ, 'frequentist nonparametric' approaches like
| Conformal Prediction and Calibration/Multi-Calibration methods
| seem to work quite well (especially when combined with the
| standard ML machinery of taking a log-likelihood as your loss),
| and do not suffer from any of the issues above while also giving
| you formal guarantees of correctness. They are a strict
| improvement over Bayesian NNs, IMO.
| bravura wrote:
| Conformal learning is relatively new to me. Tell me if I'm
| getting any of this wrong: Conformal learning is a frequentist
| approach that uses a calibration set to determine how unusual a
| prediction is.
|
| It seems like the main time they aren't a strict improvement
| over bayesian methods is when it is difficult to define your
| calibration set? I know this scenario isn't so commonplace, but
| I'm working in a scenario where I quickly looked at conformal
| learning and wasn't sure if it is applicable.
| dccsillag wrote:
| That's a particular form of Conformal Prediction, called
| Split Conformal Prediction. Incidentally, it's also one of
| the best ones (i.e., most extensible, strongest guarantees,
| easiest to implement, remarkably sample-efficient).
|
| Making a calibration set is pretty easy, it's just a data
| split (just like the train/test split). The hardest part
| (which is still fairly easy) is creating a 'conformity
| score', which is a function that receives the input and a
| candidate output and scores how well this candidate output
| 'conforms' to the input. This is where an underlying ML model
| can come in handy: it can, itself, estimate this! Split
| Conformal Prediction then does a fairly simple quantile
| calculation on these scores (or some variant thereof) to then
| form the set prediction.
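|
| A minimal sketch of that recipe for regression (my own code,
| not from any particular library; the conformity score here is
| just the absolute residual of a fitted model):
|
|   import numpy as np
|
|   def split_conformal_interval(model, X_cal, y_cal, X_new,
|                                alpha=0.05):
|       # conformity scores on the held-out calibration split
|       scores = np.sort(np.abs(y_cal - model.predict(X_cal)))
|       n = len(scores)
|       # k-th smallest score, k = ceil((n + 1) * (1 - alpha))
|       k = int(np.ceil((n + 1) * (1 - alpha)))
|       q = scores[min(k, n) - 1]
|       pred = model.predict(X_new)
|       # marginal (1 - alpha) coverage under exchangeability
|       return pred - q, pred + q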
|
| In a sense, you could use Bayesian NNs to produce a
| conformity score. But that doesn't seem to be much better
| than just using e.g. the model's logits for your conformity
| score. Theory-wise, Conformal Prediction methods have a
| number of favorable guarantees that Bayesian models (and
| especially Bayesian NNs) generally don't, and in practice
| we've seen that conditional on the model giving calibrated
| outputs (which is guaranteed for Conformal Prediction, but
| not for Bayesian NNs), Conformal Prediction's prediction sets
| seem to be tighter than the Bayesian NN ones.
| duvenaud wrote:
| I agree that Bayesian neural networks haven't been worth it in
| practice for many applications, but I think the main problem is
| that it's usually better to spend your compute training a
| single set of weights for a larger model, rather than doing
| approximate inference over weights in a smaller model. The
| exception is probably scientific applications where you mostly
| know the model, but then you don't really need a neural net
| anymore.
|
| Choosing a prior is hard, but I'd say it's analogously hard to
| choosing an architecture - if all else fails, you can do a
| brute force search, and you even have the marginal likelihood
| to guide you. I don't think it's the main reason why people
| don't use BNNs much.
| dkga wrote:
| I disagree with one conceptual point; if you are truly
| Bayesian you don't "choose" a prior, by definition you
| "already have" a prior that you are updating with data to get
| to a posterior.
| hgomersall wrote:
| At some level, you have to choose something. You can't know
| every level in your hierarchy.
| abm53 wrote:
| 100% correct, but there are ways to push Bayesian inference
| back a step to justify this sort of thing.
|
| It of course makes the problem even more complex and likely
| requires further approximations to computing the posterior
| (or even the MAP solution).
|
| This stretches the notion that you are still doing Bayesian
| reasoning but can still lead to useful insights.
| DiscourseFan wrote:
| Probably should just call it something else then; though,
| I gather that the simplicity of Bayes' theorem belies the
| complexity of what it hides.
| duvenaud wrote:
| Sure, instead of saying "choose" a prior, you could say
| "elicit". But I think in this context, focusing on a
| practitioner's prior knowledge is missing the point. For
| the sorts of problems we use NNs for, we don't usually
| think that the guy designing the net has important
| knowledge that would help make good predictions. Choosing
| a prior is just an engineering challenge, where one has to
| avoid accidentally precluding plausible hypotheses.
| waldrews wrote:
| The Conformal Prediction advocates (especially a certain
| prominent Twitter account) tend to rehash old frequentist-vs-
| bayesian arguments with more heated rhetoric than strictly
| necessary. That fight has been going on for almost a century
| now. Bayesian counterargument (in caricature form) would be
| that MLE frequentists just choose an arbitrary (flat) prior,
| and penalty hyperparameters (common in NN) are a de facto
| prior. The formal guarantees only have bite in the asymptotic
| setting or require convoluted statements about probabilities
| over repeated experiments; and asymptotically, the choice of
| prior doesn't matter anyway.
|
| (I'm a moderate who uses both approaches, seeing them as part
| of a general hierarchical modeling method, which means I get
| mocked by either side for lack of purity).
|
| Bayesians are losing ground at the moment because their
| computational methods haven't been advanced as fast by the GPU
| revolution for reasons having to do with difficulty in
| parallelization, but there's serious practical work (especially
| using JAX) to catch up, and the whole normalizing flow
| literature might just get us past the limitations of MCMC for
| hard problems.
|
| But having said that, Conformal Prediction works as advertised
| for UQ as a wrapper on any point estimating model. If you've
| got the data for it - and in the ML setting you do - and you
| don't care about things like missing data imputation, error in
| inputs, non-iid spatio-temporal and hierarchical structures,
| mixtures of models, evidence decay, unbalanced data where
| small-data islands coexist with big data - all the complicated
| situations where Bayesian methods just automatically work and
| other methods require elaborate workarounds, yup, use Conformal
| Prediction.
|
| Calibration is also a pretty magical way to improve just about
| any estimator. It's cheap to do and it works (although hard to
| guarantee anything with that in the general case...)
|
| And don't forget quantile regression penalties! Awkward to
| apply in the NN setting, but an easy and effective way to do UQ
| in XGBoost world.
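|
| For reference, a quick sketch of that last route (using scikit-
| learn's GradientBoostingRegressor as a stand-in for XGBoost,
| since its quantile loss is the one I'm sure of off-hand):
|
|   from sklearn.ensemble import GradientBoostingRegressor
|
|   def fit_quantile_interval(X_train, y_train, alpha=0.05):
|       # one model per quantile; together they give the band
|       lo = GradientBoostingRegressor(loss="quantile",
|                                      alpha=alpha / 2)
|       hi = GradientBoostingRegressor(loss="quantile",
|                                      alpha=1 - alpha / 2)
|       lo.fit(X_train, y_train)
|       hi.fit(X_train, y_train)
|       return lo, hi  # lo.predict(X), hi.predict(X) bracket y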
| dccsillag wrote:
| Yeah, I know the account you are talking about, it really is
| a bit over the top. It's a shame, I've met a bunch of people
| who mentioned that they were actually turned away from
| Conformal Prediction due to them.
|
| > But having said that, Conformal Prediction works as
| advertised for UQ as a wrapper on any point estimating model.
| If you've got the data for it - and in the ML setting you do
| - and you don't care about things like missing data
| imputation, error in inputs, non-iid spatio-temporal and
| hierarchical structures, mixtures of models, evidence decay,
| unbalanced data where small-data islands coexist with big data -
| all the complicated situations where Bayesian methods just
| automatically work and other methods require elaborate
| workarounds, yup, use Conformal Prediction.
|
| Many of these things can actually work really well with
| Conformal Prediction, but the algorithms require extensions
| (much like if you are doing Bayesian inference, you also need
| to update your model accordingly!). They generally end up
| being some form of reweighting to compensate for the
| distribution shifts (excluding the Online Conformal
| Prediction literature, which is another beast entirely).
| Also, worth noting that if you have iid data then Conformal
| Prediction is remarkably data-efficient; as little as 20
| samples are enough for it to start working for 95% predictive
| intervals, and with 50 samples (and with almost surely unique
| conformity scores) it's going to match 95% coverage fairly
| tightly.
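|
| The arithmetic behind those numbers (my own back-of-the-
| envelope, for split conformal with continuous scores):
|
|   import math
|
|   alpha = 0.05
|   for n in (20, 50):
|       # rank of the calibration score used as the threshold
|       k = math.ceil((n + 1) * (1 - alpha))
|       # coverage upper bound when scores are a.s. unique
|       upper = 1 - alpha + 1 / (n + 1)
|       print(n, k, round(upper, 3))
|   # n=20 -> uses the 20th of 20 scores, coverage in [0.95, 0.998]
|   # n=50 -> uses the 49th of 50 scores, coverage in [0.95, 0.970]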
| 3abiton wrote:
| Are we talking about NN Taleb? I am curious about the
| twitter persona.
| GemesAS wrote:
| Someone by the name of V. Minakhin. They have an
| irrational hatred of Bayesian statistics. He blocked me
| on twitter for pointing out that his claim that significant
| companies do not use Bayesian methods is contradicted by
| the fact that I work for one of those companies and use
| Bayesian methods.
| ComplexSystems wrote:
| "Bayesian counterargument (in caricature form) would be that
| MLE frequentists just choose an arbitrary (flat) prior, and
| penalty hyperparameters (common in NN) are a de facto prior."
|
| This has been my view for a while now. Is this not correct?
|
| In general, I think the idea of a big "frequentist vs
| Bayesian" debate is silly. I think it is very useful to take
| frequentist ideas and see what they look like from a Bayesian
| point of view, and vice versa (when applicable). I think this
| is pretty much the general stance among most people in the
| field - it's generally expected that one will understand that
| regularization methods equate to certain priors, for
| instance, and in general be able to relate these two
| perspectives as much as possible.
| duvenaud wrote:
| I would argue against the idea that "MLE is just Bayes with
| a flat prior". The power of Bayes usually comes mainly from
| keeping around all the hypotheses that are compatible with
| the data, not from the prior. This is especially true in
| domains where something black-box (essentially prior-less)
| like a neural net has any chance of working.
| fjkdlsjflkds wrote:
| > For one, Bayesian inference and UQ fundamentally depend on
| the choice of the prior, but this is rarely discussed in the
| Bayesian NN literature and practice, and the problem is further
| compounded by how fundamentally hard these priors are to
| interpret and choose (what is the intuition behind a NN's
| parameters?).
|
| I agree that, computationally, it is hard to justify the use of
| Bayesian methods on large-scale neural networks when stochastic
| gradient descent (and friends) is so damn efficient and
| effective.
|
| On the other hand, the fact that there's a dependence on
| (subjective) priors is hardly a fair critique: non-Bayesian
| training of neural networks also depends on the use of
| (subjective) loss functions with (subjective) regularization
| terms (in fact, it can be shown that, mathematically, the use
| of priors is precisely equivalent to adding regularization to a
| loss function). Non-Bayesian training of neural networks is not
| "a failed approach" just because someone can arbitrarily choose
| L1 regularization (i.e., a Laplacian prior) over L2
| regularization (i.e., a Gaussian prior).
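|
| To make that equivalence concrete, a tiny sketch (my own
| illustration): the MAP objective with a Gaussian or Laplacian
| prior is literally an L2- or L1-regularized loss.
|
|   import numpy as np
|
|   def neg_log_posterior(w, nll, prior="gaussian", scale=1.0):
|       # -log p(w | data) = -log p(data | w) - log p(w) + const
|       if prior == "gaussian":
|           # L2 penalty with lambda = 1 / (2 * scale**2)
|           reg = np.sum(w ** 2) / (2 * scale ** 2)
|       else:  # "laplace"
|           # L1 penalty with lambda = 1 / scale
|           reg = np.sum(np.abs(w)) / scale
|       # minimizing this is penalized maximum likelihood
|       return nll(w) + reg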
|
| Furthermore, we do have _some_ intuition over NN parameters
| (particularly when inputs and outputs are properly scaled): a
| value of 10^15 should be less likely than a value of 0. Note
| that, in Bayesian practice, people often use weakly-informative
| priors (see, e.g.,
| http://www.stat.columbia.edu/~gelman/presentations/weakprior...)
| to encode such intuitive statements while
| ensuring that (for all practical purposes) the data will
| effectively overwhelm the prior (again, this is equivalent to
| adding a minimal amount of regularization to a loss function,
| to make a problem well-posed when e.g. you have more parameters
| than data points).
| datastoat wrote:
| Non-Bayesian NN training does indeed use regularizers that
| are chosen subjectively --- but they are then tested in
| validation, and the best-performing regularizer is chosen.
| Thus the choice is empirical, not subjective.
|
| A Bayesian could try the same thing: try out several priors,
| and pick the one that performs best in validation. But if you
| pick your prior based on the data, then the classic theory
| about "principled quantification of uncertainty" doesn't
| apply any more. So you're left using a computationally
| unwieldy procedure that doesn't offer theoretical guarantees.
| panda-giddiness wrote:
| You can, in fact, do that. It's called (aptly enough) the
| empirical Bayes method. [1]
|
| [1] https://en.wikipedia.org/wiki/Empirical_Bayes_method
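|
| A standard example of this in practice: scikit-learn's
| BayesianRidge fits the prior and noise precisions by maximizing
| the marginal likelihood of the training data (type-II maximum
| likelihood), which is exactly the empirical-Bayes move.
|
|   import numpy as np
|   from sklearn.linear_model import BayesianRidge
|
|   rng = np.random.default_rng(0)
|   X = rng.normal(size=(100, 5))
|   w_true = np.array([1.0, 0.5, 0.0, 0.0, -2.0])
|   y = X @ w_true + 0.1 * rng.normal(size=100)
|
|   model = BayesianRidge()  # alpha_ (noise) and lambda_ (prior)
|   model.fit(X, y)          # are estimated from the data itself
|   mean, std = model.predict(X[:3], return_std=True)
|   print(model.lambda_, model.alpha_)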
| datastoat wrote:
| Empirical Bayes is exactly what I was getting at. It's a
| pragmatic modelling choice, but it loses the theoretical
| guarantees about uncertainty quantification that pure
| Bayesianism gives us.
|
| (Though if you have a reference for why empirical Bayes
| does give theoretical guarantees, I'll be happy to change
| my mind!)
| dkga wrote:
| I'm not an expert in BNNs but the prior does not need to be
| justified in terms of each parameter. Bayesian analysis
| frequently uses hyperparameters to set the overall tightness or
| looseness of the parameters (a la Minnesota priors in the
| econometric literature for example). This would be a similar
| regularisation intuition as, eg, L1 and L2 regularisation in
| traditional NN training. This is of course just one example.
| nvrmnd wrote:
| What is 'UQ', I assume some measure of uncertainty over your
| model outputs?
| rscho wrote:
| Unbiased quantifier
| proto-n wrote:
| Usually means uncertainty quantification
| scellus wrote:
| Priors on parameters are not an issue. On models of scale,
| priors are just some computationally convenient shrinkage, and
| what works is found empirically and canonized into the
| practice; projecting prior knowledge of the problem at hand by
| parameter priors does not really happen except in some vague
| sense ("I think most predictors are irrelevant, so make it
| sparse by Cauchy/horseshoe/whatever").
|
| The important thing in bayesian (statistical, ML) modelling in
| general is the ability to gain flexibility and build model
| structures that otherwise would be hard or impossible: latent
| states, hierarchies, etc.
|
| In bayesian NNs the main advantages would be around uncertainty
| quantification (UQ), finding good optima, and partly avoiding
| overfitting. These do apply in some cases of simple NNs.
|
| Mostly however, especially with larger conventional models (not
| speaking of normalizing flows and such here), using explicit
| bayes is not feasible. Instead, people use approximate point
| estimates with tricks:
|
| (1) UQ has been taken care of by post-calibration.
|
| (2) Stochastic gradient actually searches for large posterior
| masses like a variational approximation would do, so it is kind
| of bayes.
|
| (3) And those priors: using dropout is commonplace, it has a
| bayesian interpretation, and L2 regularization aka gaussian
| priors are frequent too.
|
| So bayes is there in practice, just not in a neat, pure form
| but as a collection of practical hacks.
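|
| The dropout trick in particular is easy to use as a cheap
| approximate-bayes hack at test time (MC dropout, a la Gal &
| Ghahramani); a minimal sketch, assuming the model actually
| contains dropout layers and no batch norm:
|
|   import torch
|
|   @torch.no_grad()
|   def mc_dropout_predict(model, x, n_samples=50):
|       model.train()  # keeps dropout active during prediction
|       preds = torch.stack([model(x) for _ in range(n_samples)])
|       model.eval()
|       # spread over stochastic passes ~ epistemic uncertainty
|       return preds.mean(dim=0), preds.std(dim=0)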
| duvenaud wrote:
| Author here! What a surprise. This was an abandoned project from
| 2019, that we never linked or advertised anywhere as far as I
| know. Anyways, happy to answer questions.
| esafak wrote:
| just a little typo, but it's Kullback- _Leibler_.
| duvenaud wrote:
| Thanks for pointing that out!
| idontknowmuch wrote:
| Somewhat related -- I'd love to hear your thoughts on dex-Lang
| and Haskell for array programming?
| duvenaud wrote:
| I still am excited by Dex
| (https://github.com/google-research/dex-lang/) and still
| write code in it! I have a
| bunch of demos and fixes written, and am just waiting for
| Dougal to finish his latest re-write before I can merge them.
| mugivarra69 wrote:
| why (if so) was this not picked up for further research? i know
| that oatml did quite a lot of work on this front as well and it
| seems the direction is still being worked on. want to get your
| 2 cents on this approach.
| duvenaud wrote:
| BNNs certainly have their uses, but I think people in general
| found that it's a better use of compute to fit a larger model
| on more data than to try to squeeze more juice from a given
| small dataset + model. Usually there is more data available,
| it's just somewhat tangentially related. LLMs are the
| ultimate example of how training on tons of tangentially-
| related data can ultimately be worthwhile for almost any
| task.
| timeinput wrote:
| What did you use to produce the article? I really really like
| the formatting.
| duvenaud wrote:
| I think we used a distill.pub template. Also Jerry wrote some
| custom BNN fitting code in javascript. I'll ask my co-authors
| to open-source it.
| duvenaud wrote:
| Update: the code is here:
|
| https://github.com/jerryqhyu/distill_bayes_net
| oli5679 wrote:
| https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004...
|
| mixture density networks are quite interesting if you want
| probabilistic estimates from a neural network. here, your model
| learns to output an array of gaussian distribution coefficients
| (means and variances) and mixture weights.
|
| these weights are specific to individual observations, and
| trained to maximise likelihood.
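|
| a minimal sketch of such a head in pytorch (my own
| simplification, not the paper's code):
|
|   import torch
|   import torch.nn.functional as F
|
|   class MDNHead(torch.nn.Module):
|       def __init__(self, in_dim, n_components):
|           super().__init__()
|           # per-observation mixture weights, means, log-scales
|           self.pi = torch.nn.Linear(in_dim, n_components)
|           self.mu = torch.nn.Linear(in_dim, n_components)
|           self.log_sigma = torch.nn.Linear(in_dim, n_components)
|
|       def neg_log_likelihood(self, h, y):
|           # h: backbone features, y: scalar targets (batch,)
|           log_pi = F.log_softmax(self.pi(h), dim=-1)
|           mu, sigma = self.mu(h), self.log_sigma(h).exp()
|           comp = torch.distributions.Normal(mu, sigma)
|           log_prob = comp.log_prob(y.unsqueeze(-1))
|           # mixture log-likelihood, averaged over the batch
|           return -torch.logsumexp(
|               log_pi + log_prob, dim=-1).mean()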
| duvenaud wrote:
| This approach characterizes a different type of uncertainty
| than BNNs do, and the approaches can be combined. The BNN
| tracks uncertainty about parameters in the NN, and mixture
| density nets track the noise distribution _conditional on
| knowing the parameters_.
| ok123456 wrote:
| BNNs were an attractive choice in scenarios where the data is
| expensive to collect, like actual physical experiments. But
| boosting and other tree-based regression methods give you similar
| performance with a more straightforward framework for limited
| tabular data.
| levocardia wrote:
| What frustrates me about Bayesian NNs is that talking about
| "priors" doesn't make nearly as much sense as it does in a
| regression context. A prior over parameter weights has no
| interpretation in the way that a prior over a regression
| coefficient, or even a spline smoothness, does. What you really
| want -- and what natural intelligence probably has -- are priors
| over _aspects of the world_.
|
| Francois Chollet's paper on measuring intelligence was really
| informative for me on this front; the "priors" you should have
| about the world are not half-cauchys over certain hyperparameters
| or whatever, but priors about agent-ness, object-ness, goal-
| oriented-ness, and so on. How to encode that in a network...well,
| that's the real trick, right?
| duvenaud wrote:
| I agree that priors over aspects of the world would be more
| useful, but I don't think that they're important in making
| natural intelligence powerful. In my experience, the important
| thing is to make your prior really broad, but containing all
| kinds of different hypotheses with different kinds of rich
| structure.
|
| I claim that knowing _a priori_ about things like agents and
| objects just doesn't save you all that much data, as long as
| you have the imagination to consider all structures at least
| that complex.
___________________________________________________________________
(page generated 2024-11-22 23:00 UTC)