[HN Gopher] Bayesian Neural Networks
       ___________________________________________________________________
        
       Bayesian Neural Networks
        
       Author : reqo
       Score  : 231 points
       Date   : 2024-11-18 16:47 UTC (4 days ago)
        
 (HTM) web link (www.cs.toronto.edu)
 (TXT) w3m dump (www.cs.toronto.edu)
        
       | datastoat wrote:
       | I like Bayesian inference for few-parameter models where I have
       | solid grounds for choosing my priors. For neural networks, I like
       | to ask people "what's your prior for ReLU versus LeakyReLU versus
       | sigmoid?" and I've never gotten a convincing answer.
        
         | salty_biscuits wrote:
          | I'm sure there is a way of interpreting a ReLU as a sparsity
         | prior on the layer.
        
         | pkoird wrote:
         | Kolmogorov Arnold nets might have an answer for you!
        
           | jwuphysics wrote:
           | Could you say a bit more about how so?
        
             | pkoird wrote:
             | KANs have learnable activations based on splines
              | parameterized by a few variables. You can specify a prior
             | over those variables, effectively establishing a prior over
             | your activation function.
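              | 
              | A minimal sketch of that idea (in PyTorch, with Gaussian
              | RBF basis functions standing in for the B-splines actual
              | KANs use; all names here are illustrative): a learnable
              | activation phi(x) = sum_i c_i B_i(x), with a Gaussian
              | prior on the coefficients c acting as a prior over the
              | shape of the activation.
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   class SplineActivation(nn.Module):
              |       # phi(x) = sum_i c_i * B_i(x), where B_i are fixed
              |       # RBF bases (stand-ins for the splines in KANs).
              |       def __init__(self, n_basis=8, lo=-3.0, hi=3.0):
              |           super().__init__()
              |           self.centers = torch.linspace(lo, hi, n_basis)
              |           self.width = (hi - lo) / n_basis
              |           self.coeffs = nn.Parameter(torch.zeros(n_basis))
              | 
              |       def forward(self, x):
              |           b = torch.exp(-((x.unsqueeze(-1) - self.centers)
              |                           / self.width) ** 2)
              |           return b @ self.coeffs
              | 
              |   def log_prior(act, tau=1.0):
              |       # N(0, tau^2) prior on the coefficients, i.e. a
              |       # prior over the activation function (up to consts).
              |       return -0.5 * (act.coeffs ** 2).sum() / tau ** 2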
        
           | dccsillag wrote:
           | Ah, Kolmogorov Arnold Networks. Perhaps the only model I have
           | ever tried that managed to fairly often get AUCs below 0.5 in
           | my tabular ML benchmarks. It even managed to get a frankly
           | disturbing 0.33, where pretty much any other method
           | (including linear regression, IIRC) would get >=0.99!
        
             | SpaceManNabs wrote:
             | Why do you think they perform so poorly?
        
               | dccsillag wrote:
               | Theory-wise, I'm not convinced that the models have good
               | approximation properties (the Kolmogorov-Arnold /
               | Kolmogorov Superposition Theorem they base themselves on
               | has quite a bit of nuance), and the optimization problem
                | might be a bit tricky. I also can't see how to
               | incorporate inductive biases other than the standard R^n
               | / tabular regression one, and the existing attempts on
               | this that I'm aware of are just band-aids (along the
               | lines of feature engineering).
               | 
                | In practice, I've personally run some benchmarks on a
                | collection of datasets I had lying around. The results
               | were generally abysmal, with the method only matching
               | simple baselines in some few datasets.
               | 
               | Finally, the original paper is very weird, and reads more
               | as a marketing piece. The theory, which is touted
               | throughout the paper, is very weak, the actual algorithm
               | is not sufficiently well explained there and the
               | experiments are lacking. In particular, I find it telling
               | that they do not include and even go out of their way to
               | ignore important baselines such as boosted trees, which
               | are the state-of-the-art solution to the problem that
                | they intended to solve (and which even work very well
                | in cases where they claim that both KANs and MLPs
               | perform badly, e.g. in high dimensions).
        
         | duvenaud wrote:
         | I agree choosing priors is hard, but choosing ReLU versus
         | LeakyReLU versus sigmoid seems like a problem with using neural
         | nets in general, not Bayesian neural nets in particular. Am I
         | misunderstanding?
        
         | stormfather wrote:
          | I choose LeakyReLU vs ReLU depending on whether it's an odd
          | day of the week, LeakyReLU being slightly favored on odd days
          | because it's aesthetically nicer that gradients propagate
          | through negative inputs, though I can't discern a difference.
          | I choose
         | sigmoid if I want to waste compute to remind myself that it
         | converges slowly due to vanishing gradients at extreme
          | activation levels. So it's empiricism retroactively justified
          | by some mathematical common sense that lets me feel good
          | about the choices. Kind of like aerodynamics.
        
       | sideshowb wrote:
       | I like Bayes, but I thought the "surprising" result is that
        | double descent is supposed to prevent NNs from overfitting?
        
         | duvenaud wrote:
         | Good point. We wrote this pre-double descent, and a massively
         | overparameterized model would make a nice addition to the
         | tutorial as a baseline. However, if you want a rich predictive
         | distribution, it might still make sense to use a Bayesian NN.
        
       | dccsillag wrote:
       | Bayesian Neural Networks just seem like a failed approach,
        | unfortunately. For one, Bayesian inference and UQ fundamentally
        | depend on the choice of the prior, but this is rarely discussed
        | in the Bayesian NN literature and practice, and is further
        | compounded by how fundamentally hard these priors are to
        | interpret and choose (what is the intuition behind a NN's
        | parameters?). Add to that the fact that the Bayesian inference
        | is very much approximate, and you should see the trouble.
       | 
       | If you want UQ, 'frequentist nonparametric' approaches like
       | Conformal Prediction and Calibration/Multi-Calibration methods
        | seem to work quite well (especially when combined with the
       | standard ML machinery of taking a log-likelihood as your loss),
       | and do not suffer from any of the issues above while also giving
       | you formal guarantees of correctness. They are a strict
       | improvement over Bayesian NNs, IMO.
        
         | bravura wrote:
         | Conformal learning is relatively new to me. Tell me if I'm
         | getting any of this wrong: Conformal learning is a frequentist
         | approach that uses a calibration set to determine how unusual a
         | prediction is.
         | 
          | It seems like the main time they aren't a strict improvement
          | over Bayesian methods is when it is difficult to define your
          | calibration set? I know this situation isn't so commonplace,
          | but I'm working in one where I quickly looked at conformal
          | learning and wasn't sure if it is applicable.
        
           | dccsillag wrote:
           | That's a particular form of Conformal Prediction, called
           | Split Conformal Prediction. Incidentally, it's also one of
           | the best ones (i.e., most extensible, strongest guarantees,
           | easiest to implement, remarkably sample-efficient).
           | 
           | Making a calibration set is pretty easy, it's just a data
           | split (just like the train/test split). The hardest part
           | (which is still fairly easy) is creating a 'conformity
           | score', which is a function that receives the input and a
           | candidate output and scores how well this candidate output
           | 'conforms' to the input. This is where an underlying ML model
           | can come in handy: it can, itself, estimate this! Split
           | Conformal Prediction then does a fairly simple quantile
           | calculation on these scores (or some variant thereof) to then
           | form the set prediction.
           | 
           | In a sense, you could use Bayesian NNs to produce a
           | conformity score. But that doesn't seem to be much better
           | than just using e.g. the model's logits for your conformity
           | score. Theory-wise, Conformal Prediction methods have a
           | number of favorable guarantees that Bayesian models (and
           | especially Bayesian NNs) generally don't, and in practice
           | we've seen that conditional on the model giving calibrated
           | outputs (which is guaranteed for Conformal Prediction, but
            | not for Bayesian NNs), the Conformal Prediction sets
            | seem to be tighter than the Bayesian NN ones.
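            | 
            | For concreteness, a minimal sketch of Split Conformal
            | Prediction for regression (assuming an sklearn-style,
            | already-fitted model and the absolute-residual conformity
            | score; the function name is illustrative):
            | 
            |   import numpy as np
            | 
            |   def split_conformal_interval(model, X_cal, y_cal, X_new,
            |                                alpha=0.05):
            |       # Conformity scores on the held-out calibration split.
            |       scores = np.abs(y_cal - model.predict(X_cal))
            |       n = len(scores)
            |       # Finite-sample-corrected quantile of the scores.
            |       # (If k > n the exact method needs an infinite
            |       # interval; we clamp to the max score for simplicity.)
            |       k = int(np.ceil((n + 1) * (1 - alpha)))
            |       q = np.sort(scores)[min(k, n) - 1]
            |       preds = model.predict(X_new)
            |       return preds - q, preds + q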
        
         | duvenaud wrote:
         | I agree that Bayesian neural networks haven't been worth it in
         | practice for many applications, but I think the main problem is
         | that it's usually better to spend your compute training a
         | single set of weights for a larger model, rather than doing
         | approximate inference over weights in a smaller model. The
         | exception is probably scientific applications where you mostly
         | know the model, but then you don't really need a neural net
         | anymore.
         | 
         | Choosing a prior is hard, but I'd say it's analogously hard to
         | choosing an architecture - if all else fails, you can do a
         | brute force search, and you even have the marginal likelihood
         | to guide you. I don't think it's the main reason why people
         | don't use BNNs much.
        
           | dkga wrote:
           | I disagree with one conceptual point; if you are truly
           | Bayesian you don't "choose" a prior, by definition you
           | "already have" a prior that you are updating with data to get
           | to a posterior.
        
             | hgomersall wrote:
             | At some level, you have to choose something. You can't know
             | every level in your hierarchy.
        
             | abm53 wrote:
             | 100% correct, but there are ways to push Bayesian inference
             | back a step to justify this sort of thing.
             | 
             | It of course makes the problem even more complex and likely
              | requires further approximations when computing the posterior
             | (or even the MAP solution).
             | 
             | This stretches the notion that you are still doing Bayesian
             | reasoning but can still lead to useful insights.
        
               | DiscourseFan wrote:
               | Probably should just call it something else then; though,
                | I gather that the simplicity of Bayes' theorem belies the
               | complexity of what it hides.
        
             | duvenaud wrote:
             | Sure, instead of saying "choose" a prior, you could say
             | "elicit". But I think in this context, focusing on a
             | practitioner's prior knowledge is missing the point. For
             | the sorts of problems we use NNs for, we don't usually
             | think that the guy designing the net has important
             | knowledge that would help making good predictions. Choosing
             | a prior is just an engineering challenge, where one has to
             | avoid accidentally precluding plausible hypotheses.
        
         | waldrews wrote:
         | The Conformal Prediction advocates (especially a certain
         | prominent Twitter account) tend to rehash old frequentist-vs-
          | Bayesian arguments with more heated rhetoric than strictly
         | necessary. That fight has been going on for almost a century
         | now. Bayesian counterargument (in caricature form) would be
         | that MLE frequentists just choose an arbitrary (flat) prior,
         | and penalty hyperparameters (common in NN) are a de facto
         | prior. The formal guarantees only have bite in the asymptotic
         | setting or require convoluted statements about probabilities
         | over repeated experiments; and asymptotically, the choice of
         | prior doesn't matter anyway.
         | 
         | (I'm a moderate that uses both approaches, seeing them as part
         | of a general hierarchical modeling method, which means I get
         | mocked by either side for lack of purity).
         | 
         | Bayesians are losing ground at the moment because their
         | computational methods haven't been advanced as fast by the GPU
         | revolution for reasons having to do with difficulty in
         | parallelization, but there's serious practical work (especially
         | using JAX) to catch up, and the whole normalizing flow
         | literature might just get us past the limitations of MCMC for
         | hard problems.
         | 
         | But having said that, Conformal Prediction works as advertised
         | for UQ as a wrapper on any point estimating model. If you've
         | got the data for it - and in the ML setting you do - and you
         | don't care about things like missing data imputation, error in
         | inputs, non-iid spatio-temporal and hierarchical structures,
         | mixtures of models, evidence decay, unbalanced data where
          | small-data islands coexist with big data - all the complicated
         | situations where Bayesian methods just automatically work and
         | other methods require elaborate workarounds, yup, use Conformal
         | Prediction.
         | 
         | Calibration is also a pretty magical way to improve just about
         | any estimator. It's cheap to do and it works (although hard to
         | guarantee anything with that in the general case...)
         | 
         | And don't forget quantile regression penalties! Awkward to
         | apply in the NN setting, but an easy and effective way to do UQ
         | in XGBoost world.
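          | 
          | For example (a sketch using scikit-learn's gradient boosting
          | rather than XGBoost, since it ships a quantile/pinball loss
          | out of the box; the toy data is made up):
          | 
          |   import numpy as np
          |   from sklearn.ensemble import GradientBoostingRegressor
          | 
          |   rng = np.random.default_rng(0)
          |   X = rng.uniform(-3, 3, size=(500, 1))
          |   y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=500)
          | 
          |   # One model per quantile; the 5% and 95% fits together give
          |   # a (heuristic, not conformal) 90% prediction interval.
          |   lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
          |   hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)
          |   X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
          |   lower, upper = lo.predict(X_new), hi.predict(X_new)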
        
           | dccsillag wrote:
           | Yeah, I know the account you are talking about, it really is
            | a bit over the top. It's a shame; I've met a bunch of people
           | who mentioned that they were actually turned away from
           | Conformal Prediction due to them.
           | 
           | > But having said that, Conformal Prediction works as
           | advertised for UQ as a wrapper on any point estimating model.
           | If you've got the data for it - and in the ML setting you do
           | - and you don't care about things like missing data
           | imputation, error in inputs, non-iid spatio-temporal and
           | hierarchical structures, mixtures of models, evidence decay,
            | unbalanced data where small-data islands coexist with big data -
           | all the complicated situations where Bayesian methods just
           | automatically work and other methods require elaborate
           | workarounds, yup, use Conformal Prediction.
           | 
           | Many of these things can actually work really well with
           | Conformal Prediction, but the algorithms require extensions
           | (much like if you are doing Bayesian inference, you also need
           | to update your model accordingly!). They generally end up
           | being some form of reweighting to compensate for the
           | distribution shifts (excluding the Online Conformal
           | Prediction literature, which is another beast entirely).
           | Also, worth noting that if you have iid data then Conformal
           | Prediction is remarkably data-efficient; as little as 20
           | samples are enough for it to start working for 95% predictive
           | intervals, and with 50 samples (and with almost surely unique
           | conformity scores) it's going to match 95% coverage fairly
           | tightly.
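            | 
            | (For reference, the standard finite-sample argument behind
            | those numbers: Split Conformal Prediction uses the
            | ceil((n+1)(1-alpha))-th smallest conformity score as its
            | threshold, which must be <= n, so for alpha = 0.05 you need
            | n >= 19 calibration points -- at n = 19 or 20 you simply
            | take the largest score. Coverage is then guaranteed to lie
            | between 1-alpha and 1-alpha+1/(n+1), so with n = 49 it is
            | pinned between 0.95 and about 0.97.)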
        
             | 3abiton wrote:
             | Are we talking about NN Taleb? I am curious about the
             | twitter persona.
        
               | GemesAS wrote:
               | Someone by the name of V. Minakhin. They have an
               | irrational hatred of Bayesian statistics. He blocked me
                | on Twitter for pointing out that his claim that significant
                | companies do not use Bayesian methods is contradicted by
               | the fact that I work for one of those companies and use
               | Bayesian methods.
        
           | ComplexSystems wrote:
           | "Bayesian counterargument (in caricature form) would be that
           | MLE frequentists just choose an arbitrary (flat) prior, and
           | penalty hyperparameters (common in NN) are a de facto prior."
           | 
           | This has been my view for a while now. Is this not correct?
           | 
           | In general, I think the idea of a big "frequentist vs
           | Bayesian" debate is silly. I think it is very useful to take
           | frequentist ideas and see what they look like from a Bayesian
           | point of view, and vice versa (when applicable). I think this
           | is pretty much the general stance among most people in the
           | field - it's generally expected that one will understand that
           | regularization methods equate to certain priors, for
           | instance, and in general be able to relate these two
           | perspectives as much as possible.
        
             | duvenaud wrote:
             | I would argue against the idea that "MLE is just Bayes with
             | a flat prior". The power of Bayes usually comes mainly from
              | keeping around all the hypotheses that are compatible with
             | the data, not from the prior. This is especially true in
             | domains where something black-box (essentially prior-less)
             | like a neural net has any chance of working.
        
         | fjkdlsjflkds wrote:
         | > For one, Bayesian inference and UQ fundamentally depends on
         | the choice of the prior, but this is rarely discussed in the
         | Bayesian NN literature and practice, and is further compounded
         | by how fundamentally hard to interpret and choose these priors
         | are (what is the intuition behind a NN's parameters?).
         | 
         | I agree that, computationally, it is hard to justify the use of
         | Bayesian methods on large-scale neural networks when stochastic
         | gradient descent (and friends) is so damn efficient and
         | effective.
         | 
         | On the other hand, the fact that there's a dependence on
         | (subjective) priors is hardly a fair critique: non-Bayesian
         | training of neural networks also depends on the use of
         | (subjective) loss functions with (subjective) regularization
         | terms (in fact, it can be shown that, mathematically, the use
         | of priors is precisely equivalent to adding regularization to a
         | loss function). Non-Bayesian training of neural networks is not
         | "a failed approach" just because someone can arbitrarily choose
         | L1 regularization (i.e., a Laplacian prior) over L2
         | regularization (i.e., a Gaussian prior).
         | 
         | Furthermore, we do have _some_ intuition over NN parameters
         | (particularly when inputs and outputs are properly scaled): a
         | value of 10^15 should be less likely than a value of 0. Note
         | that, in Bayesian practice, people often use weakly-informative
          | priors (see, e.g., http://www.stat.columbia.edu/~gelman/presentations/weakprior...)
          | to encode such intuitive statements while
         | ensuring that (for all practical purposes) the data will
         | effectively overwhelm the prior (again, this is equivalent to
         | adding a minimal amount of regularization to a loss function,
         | to make a problem well-posed when e.g. you have more parameters
         | than data points).
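          | 
          | (Sketch of that standard equivalence, for reference: the MAP
          | estimate is argmax_w [log p(D|w) + log p(w)]. With a Gaussian
          | prior p(w) ~ exp(-lambda/2 * ||w||_2^2) this becomes
          | argmin_w [-log p(D|w) + lambda/2 * ||w||_2^2], i.e. the usual
          | loss plus L2 weight decay; with a Laplace prior
          | p(w) ~ exp(-lambda * ||w||_1) you get L1 regularization
          | instead.)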
        
           | datastoat wrote:
           | Non-Bayesian NN training does indeed use regularizers that
           | are chosen subjectively --- but they are then tested in
           | validation, and the best-performing regularizer is chosen.
           | Thus the choice is empirical, not subjective.
           | 
           | A Bayesian could try the same thing: try out several priors,
           | and pick the one that performs best in validation. But if you
           | pick your prior based on the data, then the classic theory
           | about "principled quantification of uncertainty" doesn't
           | apply any more. So you're left using a computationally
           | unwieldy procedure that doesn't offer theoretical guarantees.
        
             | panda-giddiness wrote:
             | You can, in fact, do that. It's called (aptly enough) the
             | empirical Bayes method. [1]
             | 
             | [1] https://en.wikipedia.org/wiki/Empirical_Bayes_method
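              | 
              | A small illustration (using scikit-learn's BayesianRidge,
              | which does this for linear-Gaussian models: the prior and
              | noise precisions are fit by maximizing the marginal
              | likelihood rather than fixed by hand; the toy data below
              | is made up):
              | 
              |   import numpy as np
              |   from sklearn.linear_model import BayesianRidge
              | 
              |   rng = np.random.default_rng(0)
              |   X = rng.normal(size=(200, 5))
              |   w = np.array([1.0, 0.5, 0.0, 0.0, -2.0])
              |   y = X @ w + 0.3 * rng.normal(size=200)
              | 
              |   # lambda_ (prior precision) and alpha_ (noise precision)
              |   # are estimated from the data by evidence maximization.
              |   model = BayesianRidge().fit(X, y)
              |   print(model.lambda_, model.alpha_)
              |   mean, std = model.predict(X[:3], return_std=True)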
        
               | datastoat wrote:
               | Empirical Bayes is exactly what I was getting at. It's a
               | pragmatic modelling choice, but it loses the theoretical
               | guarantees about uncertainty quantification that pure
               | Bayesianism gives us.
               | 
               | (Though if you have a reference for why empirical Bayes
               | does give theoretical guarantees, I'll be happy to change
               | my mind!)
        
         | dkga wrote:
         | I'm not an expert in BNNs but the prior does not need to be
         | justified in terms of each parameter. Bayesian analysis
         | frequently uses hyperparameters to set the overall tightness or
         | looseness of the parameters (a la Minnesota priors in the
         | econometric literature for example). This would be a similar
          | regularisation intuition to, e.g., L1 and L2 regularisation in
         | traditional NN training. This is of course just one example.
        
         | nvrmnd wrote:
         | What is 'UQ', I assume some measure of uncertainty over your
         | model outputs?
        
           | rscho wrote:
           | Unbiased quantifier
        
           | proto-n wrote:
           | Usually means uncertainty quantification
        
         | scellus wrote:
          | Priors on parameters are not an issue. In models at scale,
          | priors are just some computationally convenient shrinkage, and
          | what works is found empirically and canonized into
          | practice; projecting prior knowledge of the problem at hand via
         | parameter priors does not really happen except in some vague
         | sense ("I think most predictors are irrelevant, so make it
         | sparse by Cauchy/horseshoe/whatever").
         | 
          | The important thing in Bayesian (statistical, ML) modelling in
          | general is the ability to gain flexibility and build model
          | structures that would otherwise be hard or impossible: latent
         | states, hierarchies, etc.
         | 
          | In Bayesian NNs the main advantages would be uncertainty
          | quantification (UQ), finding good optima, and partly avoiding
          | overfitting. These do apply in some cases of simple NNs.
         | 
         | Mostly however, especially with larger conventional models (not
         | speaking of normalizing flows and such here), using explicit
          | Bayes is not feasible. Instead, people use approximate point
         | estimates with tricks:
         | 
          | (1) UQ has been taken care of by post-calibration. (2)
          | Stochastic gradient descent actually searches for large
          | posterior masses, like a variational approximation would do,
          | so it is kind of Bayes. (3) And those priors: using dropout
          | is commonplace, it has a Bayesian interpretation, and L2
          | regularization aka Gaussian priors are frequent too.
          | 
          | So Bayes is there in practice, just not in a neat, pure form
          | but as a collection of practical hacks.
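          | 
          | On (3), a minimal sketch of the dropout-as-approximate-Bayes
          | idea (Monte Carlo dropout, in PyTorch): keep dropout active
          | at prediction time and read the spread of repeated stochastic
          | forward passes as an approximate predictive distribution.
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
          |                       nn.Dropout(p=0.1), nn.Linear(64, 1))
          | 
          |   def mc_dropout_predict(net, x, n_samples=100):
          |       net.train()  # keeps dropout on; no backprop happens here
          |       with torch.no_grad():
          |           preds = torch.stack([net(x) for _ in range(n_samples)])
          |       return preds.mean(0), preds.std(0)
          | 
          |   mean, std = mc_dropout_predict(net, torch.randn(5, 10))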
        
       | duvenaud wrote:
       | Author here! What a surprise. This was an abandoned project from
       | 2019, that we never linked or advertised anywhere as far as I
       | know. Anyways, happy to answer questions.
        
         | esafak wrote:
         | just a little typo, but it's Kullback- _Leibler_.
        
           | duvenaud wrote:
           | Thanks for pointing that out!
        
         | idontknowmuch wrote:
         | Somewhat related -- I'd love to hear your thoughts on dex-Lang
         | and Haskell for array programming?
        
           | duvenaud wrote:
            | I still am excited by Dex
            | (https://github.com/google-research/dex-lang/) and still
            | write code in it! I have a
           | bunch of demos and fixes written, and am just waiting for
           | Dougal to finish his latest re-write before I can merge them.
        
         | mugivarra69 wrote:
          | Why (if so) was this not picked up for further research? I
          | know that OATML did quite an amount of work on this front as
          | well, and it seems the direction is still being worked on. I
          | want to get your 2 cents on this approach.
        
           | duvenaud wrote:
           | BNNs certainly have their uses, but I think people in general
           | found that it's a better use of compute to fit a larger model
           | on more data than to try to squeeze more juice from a given
           | small dataset + model. Usually there is more data available,
           | it's just somewhat tangentially related. LLMs are the
           | ultimate example of how training on tons of tangentially-
           | related data can ultimately be worthwhile for almost any
           | task.
        
         | timeinput wrote:
         | What did you use to produce the article? I really really like
         | the formatting.
        
           | duvenaud wrote:
           | I think we used a distill.pub template. Also Jerry wrote some
           | custom BNN fitting code in javascript. I'll ask my co-authors
           | to open-source it.
        
             | duvenaud wrote:
             | Update: the code is here:
             | 
             | https://github.com/jerryqhyu/distill_bayes_net
        
       | oli5679 wrote:
       | https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004...
       | 
        | Mixture density networks are quite interesting if you want
        | probabilistic estimates from a neural network. Here, your model
        | learns to output an array of Gaussian distribution coefficients
        | (means and variances) and mixture weights.
        | 
        | These weights are specific to individual observations, and
        | trained to maximise likelihood.
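        | 
        | A minimal sketch of that (in PyTorch, for a scalar target and K
        | Gaussian components; names are illustrative):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   class MDN(nn.Module):
        |       def __init__(self, in_dim, hidden, K):
        |           super().__init__()
        |           self.trunk = nn.Sequential(nn.Linear(in_dim, hidden),
        |                                      nn.Tanh())
        |           self.logits = nn.Linear(hidden, K)     # mixture weights
        |           self.mu = nn.Linear(hidden, K)         # component means
        |           self.log_sigma = nn.Linear(hidden, K)  # component log-stds
        | 
        |       def forward(self, x):
        |           h = self.trunk(x)
        |           return self.logits(h), self.mu(h), self.log_sigma(h)
        | 
        |   def mdn_nll(logits, mu, log_sigma, y):
        |       # Per-observation negative log-likelihood of the mixture;
        |       # minimizing it maximizes the likelihood mentioned above.
        |       log_pi = torch.log_softmax(logits, dim=-1)
        |       comp = torch.distributions.Normal(mu, log_sigma.exp())
        |       log_prob = comp.log_prob(y.unsqueeze(-1))  # (batch, K)
        |       return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()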
        
         | duvenaud wrote:
         | This approach characterizes a different type of uncertainty
         | than BNNs do, and the approaches can be combined. The BNN
         | tracks uncertainty about parameters in the NN, and mixture
         | density nets track the noise distribution _conditional on
         | knowing the parameters_.
        
       | ok123456 wrote:
       | BNNs were an attractive choice in scenarios where the data is
       | expensive to collect, like actual physical experiments. But
       | boosting and other tree-based regression methods give you similar
       | performance with a more straightforward framework for limited
       | tabular data.
        
       | levocardia wrote:
       | What frustrates me about Bayesian NNs is that talking about
       | "priors" doesn't make nearly as much sense as it does in a
       | regression context. A prior over parameter weights has no
       | interpretation in the way that a prior over a regression
       | coefficient, or even a spline smoothness, does. What you really
       | want -- and what natural intelligence probably has -- are priors
       | over _aspects of the world_.
       | 
       | Francois Chollet's paper on measuring intelligence was really
       | informative for me on this front; the "priors" you should have
        | about the world are not half-Cauchys over certain hyperparameters
       | or whatever, but priors about agent-ness, object-ness, goal-
       | oriented-ness, and so on. How to encode that in a network...well,
       | that's the real trick, right?
        
         | duvenaud wrote:
         | I agree that priors over aspects of the world would be more
         | useful, but I don't think that they're important in making
         | natural intelligence powerful. In my experience, the important
         | thing is to make your prior really broad, but containing all
         | kinds of different hypotheses with different kinds of rich
         | structure.
         | 
         | I claim that knowing _a priori_ about things like agents and
          | objects just doesn't save you all that much data, as long as
         | you have the imagination to consider all structures at least
         | that complex.
        
       ___________________________________________________________________
       (page generated 2024-11-22 23:00 UTC)