[HN Gopher] Modeling Uncertainty with PyTorch
       ___________________________________________________________________
        
       Modeling Uncertainty with PyTorch
        
       Author : srom
       Score  : 44 points
       Date   : 2022-01-07 14:58 UTC (1 day ago)
        
 (HTM) web link (romainstrock.com)
 (TXT) w3m dump (romainstrock.com)
        
       | gillesjacobs wrote:
       | The field of ML is largely focused on just getting predictions
       | with fancy models. Estimating the uncertainty, unexpectedness and
       | perplexity of specific predictions is highly underappreciated in
       | common practice.
       | 
       | Even though it is highly economically valuable to be able to tell
       | to what extent you can trust a prediction, the modelling of
       | uncertainty of ML pipelines remains an academic affair in my
       | experience.
        
         | NeedMoreTime4Me wrote:
         | You are definitely right; there are numerous classic
         | applications (i.e. outside of the cutting-edge CV/NLP stuff)
         | that could greatly benefit from such a measure.
         | 
         | The question is: Why don't people use these models? While
         | Bayesian Neural Networks might be tricky to deploy & debug for
         | some people, Gaussian Processes etc. are readily available in
         | sklearn and other implementations.
         | 
          | My theory: most people do not learn these methods in their
          | "Introduction to Machine Learning" classes. Or do these
          | methods lack scalability in practice?
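          | 
          | (For what it's worth, getting a per-prediction uncertainty
          | out of sklearn's GP really is a one-liner. A minimal sketch
          | on toy data, not tied to the article:
          | 
          |   import numpy as np
          |   from sklearn.gaussian_process import GaussianProcessRegressor
          |   from sklearn.gaussian_process.kernels import RBF, WhiteKernel
          | 
          |   # toy 1-D regression problem
          |   X = np.linspace(0, 10, 50).reshape(-1, 1)
          |   y = np.sin(X).ravel() + 0.1 * np.random.randn(50)
          | 
          |   gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
          |   gp.fit(X, y)
          | 
          |   # predictive mean and standard deviation per point
          |   mean, std = gp.predict(X, return_std=True)
          | 
          | The catch, as discussed below, is how this scales.)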
        
           | shakow wrote:
           | > Or is it lacking scalability in practice?
           | 
           | Only speaking from my own little perspective in
           | bioinformatics, lack of scalability above all else, both for
           | BNNs and GPs.
           | 
            | Sure, the library support could be better, but that was
            | not the main hurdle, more a point of friction.
        
             | NeedMoreTime4Me wrote:
              | Do you have an anecdotal guess about the scalability
              | barrier, maybe? Like, does it take too long with more
              | than 10,000 data points and 100 features? Just to get a
              | feel.
        
               | shakow wrote:
               | Please don't quote me on that, as it was academic work in
               | a given language and a given library and might not be
               | representative of the whole ecosystem.
               | 
               | But in a nutshell, on OK-ish CPUs (Xeons a few
                | generations old), we started seeing problems past a
                | few thousand points with a few dozen features.
               | 
                | And not only was training slow, but so was inference:
                | since we used the whole sampled chain of
                | weight-distribution parameters, memory consumption was
                | a sight to behold, and inference time went through the
                | roof whenever subsampling was not used.
               | 
               | And all that was on standard NNs, so no complexity added
               | by e.g. convolution layers.
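                | 
                | In sketch form (simplified, with hypothetical variable
                | names, and definitely not our actual pipeline), the
                | inference loop looked roughly like:
                | 
                |   import torch
                |   import torch.nn as nn
                | 
                |   net = nn.Sequential(nn.Linear(30, 64), nn.ReLU(),
                |                       nn.Linear(64, 1))
                | 
                |   # posterior_samples: a list of state_dicts, one per
                |   # retained draw from the sampled chain
                |   def predict(net, posterior_samples, x):
                |       outs = []
                |       with torch.no_grad():
                |           for state in posterior_samples:
                |               net.load_state_dict(state)  # one pass per draw
                |               outs.append(net(x))
                |       outs = torch.stack(outs)
                |       return outs.mean(0), outs.std(0)
                | 
                | One forward pass per retained sample is exactly where
                | both the memory use and the inference time blow up.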
        
           | disgruntledphd2 wrote:
           | It takes more compute, and the errors from badly chosen data
           | vastly outweigh the uncertainties associated with your
           | parameter estimate.
           | 
           | To be fair, I suspect lots of people do this, but for
           | whatever reason nobody talks about it.
        
           | b3kart wrote:
            | They often don't scale, they are tricky to implement in
            | the frameworks people are familiar with, and, most
            | importantly, they rely on crude approximations, meaning
            | that after all this effort they often don't beat simple
            | baselines like the bootstrap. It's an exciting area of
            | research though.
        
         | marbletimes wrote:
          | When I was in academia, I used to fit highly sophisticated
          | models (think many-parameter, multi-level, non-linear mixed-
          | effects models) that gave not only point estimates but also
          | confidence and prediction intervals ("please explain to me
          | the difference between the two" is one of my favorite
          | interview questions, and I still have not heard a correct
          | answer).
         | 
          | When I tried to bring an "uncertainty mindset" over when I
          | moved to industry, I found that (1) most DS/ML scientists
          | use ML models that typically don't provide an easy way to
          | estimate uncertainty intervals; (2) in the industry I was in
          | (media), the people who make decisions and use model
          | predictions as one of the inputs to their decision-making
          | are typically not very quantitative, and an uncertainty
          | interval, rather than strengthen their process, would
          | confuse them more than anything else: they want a "more or
          | less" estimate rather than a "more or less, plus something
          | more and something less" estimate; and (3) when services are
          | customer-facing (see ride-sharing), providing an uncertainty
          | interval ("your car will arrive between 9 and 15 minutes")
          | would anchor the customer to the lower estimate (they do
          | this for the price of rides booked in advance, and they need
          | to, but they are often way off).
         | 
         | So for many ML applications, an uncertainty interval that
         | nobody internally or externally would base their decision upon
         | is just a nuisance.
        
           | curiousgal wrote:
           | > the difference between the two
           | 
            | One is bigger than the other, as far as I remember, which
            | means that the standard error of the prediction interval
            | is bigger?
        
             | marbletimes wrote:
             | From a good SO answer, see https://stats.stackexchange.com/
             | questions/16493/difference-b...
             | 
             | "A confidence interval gives a range for E[y|x], as you
             | say. A prediction interval gives a range for y itself.".
             | 
              | In the vast majority of cases, what we want is the range
              | for y (the prediction interval), that is, given x = 3,
              | what is the expected distribution of y? For example, say
              | we train a model to estimate how the 100-m dash time
              | varies with age. The uncertainty we want is, "at age 48,
              | 90% of Master Athletes run the 100-m dash between 10.2
              | and 12.4 seconds" (here there would be another
              | difference to point out between Frequentist and Bayesian
              | intervals, but let's keep things simple).
             | 
              | We are generally not interested in, given x = 3, the
              | uncertainty of the expected value of y (that is, the
              | confidence interval). In that case, the uncertainty we
              | get (we might want it, but often we do not) is, "at age
              | 48, we are 90% confident that the expected time to
              | complete the 100-m dash for Master Athletes is between
              | 11.2 and 11.6 seconds".
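              | 
              | To make the distinction concrete, here is a toy sketch
              | with statsmodels and made-up ages and times (not real
              | athletes data); the same fitted model yields both
              | intervals, and the prediction interval is always the
              | wider one:
              | 
              |   import numpy as np
              |   import statsmodels.api as sm
              | 
              |   # fake ages and 100-m times, just for illustration
              |   age = np.random.uniform(30, 60, 200)
              |   secs = 9.0 + 0.05 * age + np.random.randn(200) * 0.6
              | 
              |   fit = sm.OLS(secs, sm.add_constant(age)).fit()
              |   new_x = np.array([[1.0, 48.0]])     # [constant, age]
              |   frame = fit.get_prediction(new_x).summary_frame(alpha=0.10)
              |   # mean_ci_* columns -> 90% confidence interval for E[y|x]
              |   # obs_ci_*  columns -> 90% prediction interval for y itself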
        
           | code_biologist wrote:
           | Great answer. It prompts a bunch of followup questions!
           | 
           |  _most DS /ML scientists use ML models that typically don't
           | provide an easy way to estimate uncertainty intervals_
           | 
            | Not a DS/ML scientist but a data engineer. The models I've
            | used have been pretty much "slap it into XGBoost with
            | k-fold CV, call it done" -- an easy black box. Is there
            | any model or approach you like for estimating uncertainty
            | with similar ease?
           | 
           | I've seen uncertainty interval / quantile regression done
           | using XGBoost, but it isn't out of the box. I've also been
           | trying to learn some Bayesian modeling, but definitely don't
           | feel handy enough to apply it to random problems needing
           | quick answers at work.
        
             | marbletimes wrote:
             | Correct, quantile regression is an option. Another is
             | "pure" bootstrapping (you can see by googling something
             | like uncertainty + machine learning + bootstrapping that
             | this is a very active area of current research).
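              | 
              | For the quantile route, a toy sketch with sklearn's
              | gradient boosting (made-up data, and sklearn rather than
              | XGBoost, but the idea is the same):
              | 
              |   import numpy as np
              |   from sklearn.ensemble import GradientBoostingRegressor
              | 
              |   rng = np.random.default_rng(0)
              |   X = rng.uniform(0, 10, size=(500, 1))
              |   y = np.sin(X).ravel() + rng.normal(0, 0.3, 500)
              | 
              |   # one model per target quantile
              |   lo_m = GradientBoostingRegressor(loss="quantile",
              |                                    alpha=0.05).fit(X, y)
              |   hi_m = GradientBoostingRegressor(loss="quantile",
              |                                    alpha=0.95).fit(X, y)
              | 
              |   # roughly a 90% band (no formal coverage guarantee)
              |   lo, hi = lo_m.predict(X), hi_m.predict(X)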
             | 
             | The major problem with bootstrapping is the computational
             | time for big models, since many models need to be fit to
             | obtain a representative distribution of predictions.
             | 
             | Now, if you want more "rigorous" quantification of
             | uncertainty, one option is to go Bayesian using
             | probabilistic programming (PyMC, Stan, TMB), but
             | computational time for large models can be prohibitive.
             | Another option is to "scale down" the complexity to models
             | that might be (on average) a bit less accurate, but provide
             | rigorous uncertainty intervals and good interpretability of
             | results, for example Generalized Additive Models.
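              | 
              | A minimal PyMC sketch of that route (toy linear model,
              | made-up data; the cost of pm.sample on anything big is
              | exactly the prohibitive part):
              | 
              |   import numpy as np
              |   import pymc as pm   # pymc3 in older installs
              | 
              |   x = np.random.randn(200)
              |   y = 1.0 + 2.0 * x + np.random.randn(200) * 0.5
              | 
              |   with pm.Model():
              |       a = pm.Normal("a", 0, 10)
              |       b = pm.Normal("b", 0, 10)
              |       sigma = pm.HalfNormal("sigma", 1)
              |       pm.Normal("obs", mu=a + b * x, sigma=sigma,
              |                 observed=y)
              |       idata = pm.sample(1000, tune=1000)
              |       # posterior predictive draws -> prediction
              |       # intervals for y via their quantiles
              |       ppc = pm.sample_posterior_predictive(idata)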
             | 
              | A note here: I have seen quantifications of uncertainty,
              | by people considered very capable in the ML community,
              | that gave me goosebumps. For example, because the lower
              | bound of the interval was a negative number and the
              | response variable being modeled could not be negative,
              | the uncertainty interval was simply "cut" at zero. (One
              | easy way to deal with this, although it depends on the
              | variable modeled and the model itself, is
              | log-transforming the response--but pay attention to the
              | intervals when you exp(log(y)) back to the natural
              | scale. Another useful interview question.)
        
           | joconde wrote:
           | What do "multi-level" and "mixed effects" mean? There are
           | tons of non-linear models with lots of parameters, but I've
           | never heard these other terms.
        
             | canjobear wrote:
             | https://en.wikipedia.org/wiki/Nonlinear_mixed-effects_model
        
         | math_dandy wrote:
         | Uncertainty estimates in traditional parametric statistics are
         | facilitated by strong assumptions on the distribution of the
         | data being analyzed.
         | 
         | In traditional nonparametric statistics, uncertainty estimates
          | are obtained by a process called bootstrapping. But there's
          | a trade-off (there's no free lunch!). If you want to eschew
          | strong distributional hypotheses, you need to pay for it
          | with more
         | data and more compute. The "more compute" typically involves
         | fitting variants of the model in question to many subsets of
         | the original dataset. In deep learning applications in which
         | each fit of the model is extremely expensive, this is
         | impractical.
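          | 
          | Concretely, the recipe is just "refit on resamples, read off
          | quantiles of the predictions" (a minimal sketch on made-up
          | data; each iteration is a full refit, which is where the
          | cost comes from):
          | 
          |   import numpy as np
          |   from sklearn.linear_model import Ridge
          | 
          |   rng = np.random.default_rng(0)
          |   X = rng.normal(size=(300, 5))
          |   y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 300)
          | 
          |   preds = []
          |   for _ in range(200):                      # 200 refits
          |       idx = rng.integers(0, len(X), len(X)) # resample rows
          |       preds.append(Ridge().fit(X[idx], y[idx]).predict(X))
          | 
          |   # 90% interval of the refit predictions at each point
          |   # (captures fit uncertainty, not the noise around y)
          |   lo, hi = np.percentile(preds, [5, 95], axis=0)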
        
       ___________________________________________________________________
       (page generated 2022-01-08 23:01 UTC)