[HN Gopher] Bayes is guaranteed to overfit
       ___________________________________________________________________
        
       Bayes is guaranteed to overfit
        
       Author : Ambolia
       Score  : 123 points
       Date   : 2023-05-28 16:35 UTC (6 hours ago)
        
 (HTM) web link (www.yulingyao.com)
 (TXT) w3m dump (www.yulingyao.com)
        
       | tesdinger wrote:
       | Bayesian statistics is dumb. What's the point of using prior
       | assumptions that are based on speculation? It's better to
       | admit your lack of knowledge, not jump to conclusions
       | without sufficient data, and process data in an unbiased
       | manner.
        
       | CrazyStat wrote:
       | I'm on my phone so I haven't tried to work through the math to
       | see where the error is, but the author's conclusion is wrong and
       | the counterexample is simple.
       | 
       | > Unless in degenerating cases (the posterior density is
       | point mass), then the harmonic mean inequality guarantees a
       | strict inequality p(y_i|y_{-i}) < p(y_i|y), for any point i
       | and any model.
       | 
       | Let y_1, ... y_n be iid from a Uniform(0,theta) distribution,
       | with some nice prior on theta (e.g. Exponential(1)). Then the
       | posterior for theta, and hence the predictive density for a new
       | y_i, depends only on max(y_1, ..., y_n). So for all but one of
       | the n observations the author's strict inequality does not hold.
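       | 
       | (A rough numerical sketch of this check, with made-up
       | Uniform(0, 3) data and a grid approximation of the
       | one-dimensional posterior:
       | 
       |   import numpy as np
       | 
       |   rng = np.random.default_rng(0)
       |   y = rng.uniform(0, 3.0, size=8)   # data from Uniform(0, 3)
       |   grid = np.linspace(1e-3, 50, 500_000)
       |   dx = grid[1] - grid[0]
       | 
       |   def predictive(y_obs, y_new):
       |       # posterior on theta: Exp(1) prior x Uniform(0, theta)
       |       # likelihood, supported on theta >= max(y_obs)
       |       like = np.where(grid >= y_obs.max(),
       |                       grid ** (-len(y_obs)), 0.0)
       |       post = np.exp(-grid) * like
       |       post /= post.sum() * dx
       |       # p(y_new|y_obs) = int p(y_new|th) p(th|y_obs) dth
       |       dens = np.where(grid >= y_new, 1.0 / grid, 0.0) * post
       |       return dens.sum() * dx
       | 
       |   for i in range(len(y)):
       |       loo = predictive(np.delete(y, i), y[i])   # p(y_i|y_-i)
       |       full = predictive(y, y[i])                # p(y_i|y)
       |       print(f"i={i}: loo={loo:.4f}  full={full:.4f}")
       | 
       | The script just prints p(y_i|y_{-i}) and p(y_i|y) for each
       | point so the claimed equality can be checked directly.)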
        
       | to-mi wrote:
       | It seems that the post is comparing a predictive distribution
       | conditioned on N data points to one conditioned on N-1 data
       | points. The latter is a biased estimate of the former (e.g.,
       | https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2...)
        
       | bbminner wrote:
       | So the argument is essentially that "not only if you pick
       | the best thing fitting your finite data, but even if you
       | take a weighted average over things that fit your finite
       | data, in proportion to how well they fit your finite data,
       | you will still almost surely end up with something that
       | fits your finite sample better than the general population
       | (that this sample was drawn from)"?
        
       | vervez wrote:
       | Here's a good recent paper that looks at this problem and
       | provides remedies in a Bayesian manner.
       | https://arxiv.org/abs/2202.11678
        
       | alexmolas wrote:
       | I got lost in the second equation, when the author says
       | 
       | p(y_i|y_{-i}) = \int p(y_i|\theta) p(\theta|y)
       | \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1}
       | p(\theta^\prime|y) d\theta^\prime} d\theta
       | 
       | why is that? Can someone explain the rationale behind this?
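       | 
       | (One possible reading, assuming the y_i are conditionally
       | independent given \theta: the last two factors in the
       | integrand multiply to the leave-one-out posterior written
       | in terms of the full-data posterior,
       | 
       | p(\theta|y_{-i}) = \frac{p(\theta|y) p(y_i|\theta)^{-1}}
       | {\int p(\theta^\prime|y) p(y_i|\theta^\prime)^{-1}
       | d\theta^\prime}
       | 
       | which follows from p(\theta|y_{-i}) \propto
       | p(y_{-i}|\theta) p(\theta) = p(y|\theta) p(\theta) /
       | p(y_i|\theta) \propto p(\theta|y) / p(y_i|\theta).)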
        
         | fragmede wrote:
         | https://arachnoid.com/latex/?equ=%0Ap(y_i%7Cy_%7B-i%7D)%3D%2...
         | 
         | for if you can't parse latex in your head
        
       | MontyCarloHall wrote:
       | I don't follow the math. WLOG, for N total datapoints, let y_i =
       | y_N. Then the leave-one-out posterior predictive is
       | \int p(y_N|th)p(th|{y_1...y_{N-1}}) dth = p(y_N|{y_1...y_{N-1}})
       | 
       | by the law of total probability.
       | 
       | Expanding the leave-one-out posterior (via Bayes' rule), we have
       | p(th|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|th)p(th)/\int
       | p({y_1...y_{N-1}}|th')p(th') dth'
       | 
       | which when plugged back into the first equation is
       | \int p(y_N|th) p({y_1...y_{N-1}}|th)p(th) dth/(\int
       | p({y_1...y_{N-1}}|th')p(th') dth')
       | 
       | I don't see how this simplifies to the harmonic mean expression
       | in the post.
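       | 
       | (One way to bridge that gap, assuming the y's are
       | conditionally independent given th, so that
       | p({y_1...y_{N-1}}|th) = p({y_1...y_N}|th)/p(y_N|th): the
       | numerator above becomes \int p({y_1...y_N}|th)p(th) dth,
       | i.e. the marginal likelihood, and the denominator becomes
       | \int p({y_1...y_N}|th')p(th')p(y_N|th')^{-1} dth'. Dividing
       | both by the marginal likelihood turns p({y_1...y_N}|th)p(th)
       | into the full posterior, giving
       | 
       | p(y_N|{y_1...y_{N-1}}) =
       | 1/\int p(y_N|th')^{-1} p(th'|{y_1...y_N}) dth'
       | 
       | i.e. the posterior harmonic mean of p(y_N|th), whereas
       | p(y_N|{y_1...y_N}) = \int p(y_N|th)p(th|{y_1...y_N}) dth is
       | the posterior arithmetic mean.)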
       | 
       | Regardless, the author is asserting that
       | p(y_N|{y_1...y_{N-1}}) <= p(y_N|{y_1...y_N})
       | 
       | which seems intuitively plausible for any trained model -- given
       | a model trained on data {y_1...y_N}, performing inference on any
       | datapoint y_1...y_N in the training set will generally be more
       | accurate than performing inference on a datapoint y_{N+1} not in
       | the training set.
        
         | kgwgk wrote:
         | It's reassuring to see that I'm not the only one who finds
         | those equations far from obvious. I didn't spend much time
         | trying to understand the derivation though - as you wrote the
         | result doesn't seem interesting anyway.
        
       | radford-neal wrote:
       | As the author admits at the end, this is rather misleading. In
       | normal usage, "overfit" is by definition a bad thing (it wouldn't
       | be "over" if it was good). And the argument given does nothing to
       | show that Bayesian inference is doing anything bad.
       | 
       | To take a trivial example, suppose you have a uniform(0,1) prior
       | for the probability of a coin landing heads. Integrating over
       | this gives a probability for heads of 1/2. You flip the coin
       | once, and it lands heads. If you integrate over the posterior
       | given this observation, you'll find that the probability of the
       | value in the observation, which is heads, is now 2/3, greater
       | than it was under the prior.
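       | 
       | (Concretely, with the uniform prior the posterior after one
       | head is Beta(2,1), with density 2\theta, so
       | 
       | P(heads | one head) = \int_0^1 \theta \cdot 2\theta d\theta
       | = 2/3,
       | 
       | which is just Laplace's rule of succession, (1+1)/(1+2).)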
       | 
       | And that's OVERFITTING, according to the definition in the blog
       | post.
       | 
       | Not according to any sensible definition, however.
        
         | kgwgk wrote:
         | I was writing another comment based on that same example and
         | his leaving-one-out calculations (at least based on what I
         | understood).
         | 
         | The posterior vs prior comparison would be the extreme
         | case of a leave-one-out procedure - when the only data
         | point is left out, there is nothing left to condition on.
         | 
         | The divergence between the data and the model goes down when we
         | include information about the data in the model. That doesn't
         | seem a controversial opinion. (That's how the blog post is
         | introduced here:
         | https://twitter.com/YulingYao/status/1662284440603619328)
         | 
         | ---
         | 
         | If the data consists of two flips they are either equal or
         | different (the former becomes more likely as the true
         | probability diverges from 0.5).
         | 
         | a) If the two flips are the same, the posterior predictive
         | probability of each observed result is 3/4. The in-sample
         | log score is 2 log(3/4) = -0.6
         | 
         | When we check the out-of-sample log score for each flip,
         | based on the 2/3 posterior predictive obtained from the
         | other one, we get in each case log(2/3) = -0.4, for a
         | total of 2 log(2/3) = -0.8
         | 
         | b) If the two flips are different, the posterior predictive
         | probability of each observed result is still 1/2. The
         | in-sample log score is 2 log(1/2) = 2 x (-0.7) = -1.4
         | 
         | When we check the out-of-sample log score for each flip,
         | based on the 1/3 posterior predictive for that result
         | obtained from the other one, we get in each case
         | log(1/3) = -1.1, for a total of 2 log(1/3) = -2.2
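         | 
         | (A quick sketch to reproduce those totals, using the
         | conjugate Beta-Binomial predictive (heads + 1)/(n + 2):
         | 
         |   from math import log
         | 
         |   def pred(heads, n, y):     # P(y | data), uniform prior
         |       p_heads = (heads + 1) / (n + 2)
         |       return p_heads if y == 1 else 1 - p_heads
         | 
         |   for flips in ([1, 1], [1, 0]):   # (a) equal, (b) not
         |       h, n = sum(flips), len(flips)
         |       in_sample = sum(log(pred(h, n, y)) for y in flips)
         |       loo = sum(log(pred(h - y, n - 1, y)) for y in flips)
         |       print(flips, round(in_sample, 2), round(loo, 2))
         | 
         | In both cases the in-sample total is higher than the
         | leave-one-out total, which is the pattern the post calls
         | overfitting.)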
        
         | sillymath3 wrote:
         | When there is little information, the variance of any
         | estimate is very large, and that explains what happens in
         | that example. Overfitting implies different behavior in
         | training and in test, which is related to a large variance
         | in the estimate of the error. So a small amount of
         | information implies that any model suffers from
         | overfitting and large variance; it is a general result,
         | not specifically related to Bayes.
        
       | psyklic wrote:
       | Typically, we think of overfitting and underfitting as
       | mutually exclusive properties. IMO a large problem here is
       | that the author's definition of overfitting is compatible
       | with underfitting. (Underfitting indicates a poor fit on
       | both the training and test sets, in general.)
       | 
       | For example, a simple model might underfit in general, but
       | it may still fit the training set better than the test set.
       | If this happens yet both are poor fits, it is clearly
       | underfitting and not overfitting. Yet by the article's
       | definition, it would be both underfitting and overfitting
       | simultaneously. So, I suspect this is not an ideal
       | definition.
        
       | dmurray wrote:
       | Am I missing something or is this argument only as strong as the
       | (perfectly reasonable) claim that all models overfit unless they
       | have a regularisation term?
        
         | throwawaymaths wrote:
         | The author is using a slightly different (but not wrong)
         | definition of overfitting that possibly you are not used to.
        
       | syntaxing wrote:
       | The author mentions that he defines overfitting as "Test
       | error is always larger than training error". Is there an
       | algorithm or model where that's not the case?
        
         | jgalt212 wrote:
         | Yeah, that's a crummy definition. You can easily force "Test
         | error is always larger than training error" for any model type.
        
           | nerdponx wrote:
           | It's not even about "forcing". This is such common and
           | expected behavior that it's surprising (and suspicious) when
           | it isn't the case.
        
         | jphoward wrote:
         | You can see this regularly in practice when aggressive
         | data augmentation is used, which obviously is applied only
         | to the training data. But, of course, you'd still be
         | 'overfit' if you fed in the unaugmented training data.
        
         | onos wrote:
         | A pedantic example: a model that ignored the training data
         | would do just as well on the training set as on the test set.
        
           | tesdinger wrote:
           | If you do not train on the training set, then there is no
           | training set, and your example is degenerate.
        
         | dataflow wrote:
         | I think they mean "always (statistically) _significantly_
         | larger". They're probably imagining that something like
         | cross-validation would make test errors approximately
         | equal to training errors, but if you consistently see
         | _significantly_ larger errors, then you've overfit.
        
           | bbstats wrote:
           | When you find the minimum of your validation curve,
           | rarely if ever is your test loss lower than your
           | training loss. I don't think this _necessarily_ means
           | you're overfitting.
        
         | jksk61 wrote:
         | Yes and no. Suppose, for example, you give MNIST to an
         | SVM and fit the model, then test it only on 0 and 1
         | digits, which are generally well separated: you'll get
         | almost 100% accuracy in test, whereas 97% or less in
         | training. (You probably need some preprocessing, like
         | using PaCMAP or UMAP or whatever, but the point is the
         | same.)
         | 
         | However, that's just because I picked the right data to
         | test it on. So you can't really say much about a model
         | using that definition.
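         | 
         | (Roughly that kind of experiment, sketched with
         | scikit-learn's small digits set standing in for MNIST;
         | the exact numbers will of course differ:
         | 
         |   from sklearn.datasets import load_digits
         |   from sklearn.model_selection import train_test_split
         |   from sklearn.svm import SVC
         | 
         |   X, y = load_digits(return_X_y=True)
         |   X_tr, X_te, y_tr, y_te = train_test_split(
         |       X, y, random_state=0)
         |   clf = SVC().fit(X_tr, y_tr)
         | 
         |   # all training digits vs only the "easy" test digits
         |   easy = (y_te == 0) | (y_te == 1)
         |   print("train acc, all digits:", clf.score(X_tr, y_tr))
         |   print("test acc, 0/1 only:", clf.score(X_te[easy],
         |                                          y_te[easy]))
         | 
         | This only shows that "test error > training error" can be
         | broken by choosing what you test on, which is the point
         | being made.)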
        
       ___________________________________________________________________
       (page generated 2023-05-28 23:01 UTC)