[HN Gopher] Bayes is guaranteed to overfit
___________________________________________________________________
Bayes is guaranteed to overfit
Author : Ambolia
Score : 123 points
Date : 2023-05-28 16:35 UTC (6 hours ago)
(HTM) web link (www.yulingyao.com)
(TXT) w3m dump (www.yulingyao.com)
| tesdinger wrote:
| Bayesian statistics is dumb. What's the point of using prior
| assumptions that are based on speculation? It's better to
| admit your lack of knowledge, avoid jumping to conclusions
| without sufficient data, and process data in an unbiased
| manner.
| Feministmxist48 wrote:
| [flagged]
| CrazyStat wrote:
| I'm on my phone so I haven't tried to work through the math to
| see where the error is, but the author's conclusion is wrong and
| the counterexample is simple.
|
| > Unless in degenerate cases (the posterior density is a point
| mass), the harmonic mean inequality guarantees a strict
| inequality p(y_i | y_{-i}) < p(y_i | y), for any point i and
| any model.
|
| Let y_1, ... y_n be iid from a Uniform(0,theta) distribution,
| with some nice prior on theta (e.g. Exponential(1)). Then the
| posterior for theta, and hence the predictive density for a new
| y_i, depends only on max(y_1, ..., y_n). So for all but one of
| the n observations the author's strict inequality does not hold.
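|
| (A grid-integration sketch for checking the two predictive
| densities in this example -- my own construction, using the
| Exponential(1) prior mentioned above and 10 simulated points:)
|
| import numpy as np
|
| rng = np.random.default_rng(0)
| y = rng.uniform(0, 3.0, size=10)          # data from Uniform(0, theta=3)
| grid = np.linspace(1e-3, 30, 200_000)     # theta grid for numerical integration
| dg = grid[1] - grid[0]
|
| def predictive(y_new, y_obs):
|     # posterior on the grid: Exp(1) prior times Uniform(0, theta) likelihood
|     post = np.exp(-grid) * grid ** (-len(y_obs)) * (grid >= y_obs.max())
|     post /= (post * dg).sum()
|     # p(y_new | y_obs) = E_posterior[ 1{y_new <= theta} / theta ]
|     return ((y_new <= grid) / grid * post * dg).sum()
|
| i = int(np.argmin(y))                     # a non-maximal observation
| print(predictive(y[i], y))                # p(y_i | y)
| print(predictive(y[i], np.delete(y, i)))  # p(y_i | y_{-i})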
| to-mi wrote:
| It seems that the post is comparing a predictive distribution
| conditioned on N data points to one conditioned on N-1 data
| points. The latter is a biased estimate of the former (e.g.,
| https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2...)
| bbminner wrote:
| So the argument is essentially that "not only if you pick the
| best thing fitting your finite data, but even if you take a
| weighted average over things that fit your finite data
| proportionally to how well they fit your finite data - you will
| still almost surely end up with something that fits your finite
| sample better than the general population (that this sample was
| drawn from)"?
| vervez wrote:
| Here's a good recent paper that looks at this problem and
| provides remedies in a Bayesian manner.
| https://arxiv.org/abs/2202.11678
| alexmolas wrote:
| I got lost in the second equation, when the author says
|
| p(y_i|y_{-i}) = \int p(y_i|\theta)\, p(\theta|y)\,
| \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1}
| p(\theta^\prime|y)\, d\theta^\prime}\, d\theta
|
| why is that? Can someone explain the rationale behind this?
| fragmede wrote:
| https://arachnoid.com/latex/?equ=%0Ap(y_i%7Cy_%7B-i%7D)%3D%2...
|
| for if you can't parse latex in your head
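|
| As for the rationale: the observations are conditionally
| independent given \theta, so Bayes' rule for the single point
| y_i gives p(\theta|y) \propto p(y_i|\theta) p(\theta|y_{-i}),
| i.e. p(\theta|y_{-i}) = \frac{p(\theta|y)/p(y_i|\theta)}{\int
| p(\theta^\prime|y)/p(y_i|\theta^\prime) d\theta^\prime}.
| Substituting that into p(y_i|y_{-i}) = \int p(y_i|\theta)
| p(\theta|y_{-i}) d\theta gives the expression above, which
| collapses to 1 / E_{p(\theta|y)}[ 1/p(y_i|\theta) ] -- the
| harmonic mean the post compares against the arithmetic mean
| p(y_i|y) = E_{p(\theta|y)}[ p(y_i|\theta) ].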
| MontyCarloHall wrote:
| I don't follow the math. WLOG, for N total datapoints, let y_i =
| y_N. Then the leave-one-out posterior predictive is
| \int p(y_N|th)p(th|{y_1...y_{N-1}}) dth = p(y_N|{y_1...y_{N-1}})
|
| by the law of total probability.
|
| Expanding the leave-one-out posterior (via Bayes' rule), we have
| p(th|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|th)p(th)/\int
| p({y_1...y_{N-1}}|th')p(th') dth'
|
| which when plugged back into the first equation is
| \int p(y_N|th) p({y_1...y_{N-1}}|th)p(th) dth/(\int
| p({y_1...y_{N-1}}|th')p(th') dth')
|
| I don't see how this simplifies to the harmonic mean expression
| in the post.
|
| Regardless, the author is asserting that
| p(y_N|{y_1...y_{N-1}}) <= p(y_N|{y_1...y_N})
|
| which seems intuitively plausible for any trained model -- given
| a model trained on data {y_1...y_N}, performing inference on any
| datapoint y_1...y_N in the training set will generally be more
| accurate than performing inference on a datapoint y_{N+1} not in
| the training set.
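|
| A quick grid check (my own sketch, a flat-prior coin-flip model,
| not anything from the post) that the harmonic-mean form matches
| the directly computed leave-one-out predictive, and that both
| sit below the full-data predictive p(y_N|{y_1...y_N}):
|
| import numpy as np
|
| y = np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1])  # arbitrary coin flips
| theta = np.linspace(1e-6, 1 - 1e-6, 100_000)  # grid over the heads probability
| dt = theta[1] - theta[0]
|
| def lik(y_i, th):
|     return th ** y_i * (1 - th) ** (1 - y_i)
|
| # full posterior on the grid (flat Beta(1,1) prior)
| post = np.prod([lik(y_i, theta) for y_i in y], axis=0)
| post /= (post * dt).sum()
|
| y_N = y[-1]
| p_full = (lik(y_N, theta) * post * dt).sum()          # p(y_N | y_1...y_N)
| p_loo_hm = 1.0 / (post / lik(y_N, theta) * dt).sum()  # harmonic-mean form
|
| # direct conjugate answer: drop y_N, posterior is Beta(1+s, 1+f)
| s = y[:-1].sum(); f = len(y) - 1 - s
| p_loo = (1 + s) / (2 + s + f) if y_N == 1 else (1 + f) / (2 + s + f)
|
| print(p_loo_hm, p_loo, p_full)  # first two agree; both < p_full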
| kgwgk wrote:
| It's reassuring to see that I'm not the only one who finds
| those equations far from obvious. I didn't spend much time
| trying to understand the derivation though - as you wrote the
| result doesn't seem interesting anyway.
| radford-neal wrote:
| As the author admits at the end, this is rather misleading. In
| normal usage, "overfit" is by definition a bad thing (it wouldn't
| be "over" if it was good). And the argument given does nothing to
| show that Bayesian inference is doing anything bad.
|
| To take a trivial example, suppose you have a uniform(0,1) prior
| for the probability of a coin landing heads. Integrating over
| this gives a probability for heads of 1/2. You flip the coin
| once, and it lands heads. If you integrate over the posterior
| given this observation, you'll find that the probability of the
| value in the observation, which is heads, is now 2/3, greater
| than it was under the prior.
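|
| Spelling out the two numbers: the prior predictive for heads is
| \int_0^1 \theta d\theta = 1/2, and after one head the posterior
| density is 2\theta (a Beta(2,1)), so the posterior predictive
| for heads is \int_0^1 \theta \cdot 2\theta d\theta = 2/3.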
|
| And that's OVERFITTING, according to the definition in the blog
| post.
|
| Not according to any sensible definition, however.
| kgwgk wrote:
| I was writing another comment based on that same example and
| his leave-one-out calculations (at least based on what I
| understood).
|
| The posterior vs prior comparison would be the extreme case of
| a leave-one-out procedure - if you leave the only data point
| out, there is nothing left.
|
| The divergence between the data and the model goes down when we
| include information about the data in the model. That doesn't
| seem a controversial opinion. (That's how the blog post is
| introduced here:
| https://twitter.com/YulingYao/status/1662284440603619328)
|
| ---
|
| If the data consists of two flips they are either equal or
| different (the former becomes more likely as the true
| probability diverges from 0.5).
|
| a) If the data is the same, the posterior probability of that
| result is 3/4. The log score is 2 log(3/4) = -0.6
|
| When we check the out-of-sample log score for each one based on
| the 2/3 posterior obtained from the other we get in each case a
| log score log(2/3) = -0.4
|
| b) If the data is different, the posterior probability is still
| 1/2. The log score is 2 log(1/2) = 2 × (-0.7) = -1.4
|
| When we check the out-of-sample log score for each one based on
| the 1/3 posterior for getting that result obtained from the
| other we get in each case a log score log(1/3) = -1.1
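|
| (A quick check of those numbers -- my sketch, using the flat
| Uniform(0,1) prior from the example upthread and the usual
| Beta-Bernoulli predictive means:)
|
| from math import log
|
| def predictive(heads, tails, outcome):
|     # P(next flip == outcome) after observing heads/tails, flat prior
|     p_heads = (heads + 1) / (heads + tails + 2)
|     return p_heads if outcome == 1 else 1 - p_heads
|
| # (a) two equal flips, e.g. two heads
| print(2 * log(predictive(2, 0, 1)))  # in-sample: 2*log(3/4) ~ -0.6
| print(log(predictive(1, 0, 1)))      # leave-one-out: log(2/3) ~ -0.4
|
| # (b) one head, one tail
| print(2 * log(predictive(1, 1, 1)))  # in-sample: 2*log(1/2) ~ -1.4
| print(log(predictive(1, 0, 0)))      # leave-one-out: log(1/3) ~ -1.1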
| sillymath3 wrote:
| When there is little information, the variance of any estimate
| is very large, and that explains what happens in that example.
| Overfitting implies different behavior in training and in test,
| which is related to a large variance in the estimate of the
| error. So a small amount of information implies that any model
| suffers from overfitting and large variance; it is a general
| result, not specifically related to Bayes.
| psyklic wrote:
| Typically, we think of overfitting and underfitting as exclusive
| properties. IMO a large problem here is that the author's
| definition of overfitting does not exclude underfitting.
| (Underfit indicates a poor fit on both the training and test
| sets, in general.)
|
| For example, a simple model might underfit in general, but it may
| still fit the training set better than the test. If this happens
| yet both are poor fits, it is clearly underfitting and not
| overfitting. Yet by the article's definition, it would both be
| underfitting and overfitting simultaneously. So, I suspect this
| is not an ideal definition.
| dmurray wrote:
| Am I missing something or is this argument only as strong as the
| (perfectly reasonable) claim that all models overfit unless they
| have a regularisation term?
| throwawaymaths wrote:
| The author is using a slightly different (but not wrong)
| definition of overfitting that possibly you are not used to.
| syntaxing wrote:
| The author mentions he defines overfitting as "Test error is
| always larger than training error". Is there an algorithm or
| model where that's not the case?
| jgalt212 wrote:
| Yeah, that's a crummy definition. You can easily force "Test
| error is always larger than training error" for any model type.
| nerdponx wrote:
| It's not even about "forcing". This is such common and
| expected behavior that it's surprising (and suspicious) when
| it isn't the case.
| [deleted]
| jphoward wrote:
| You see this regularly in practice when aggressive data
| augmentation is used, which obviously is only applied to the
| training data. But, of course, you'd still be 'overfit' if you
| fed in unaugmented training data.
| onos wrote:
| A pedantic example: a model that ignored the training data
| would do just as well on the training set as on the test set.
| tesdinger wrote:
| If you do not train on the training set, then there is no
| training set, and your example is degenerate.
| dataflow wrote:
| I think they mean "always (statistically) _significantly_
| larger". They're probably imagining that something like cross-
| validation would make test errors approximately equal to
| training errors, but if you consistently see _significantly_
| larger errors, then you've overfit.
| bbstats wrote:
| When you find the minimum of your validation curve, rarely if
| ever is your test loss lower than your training loss. I don't
| think this _necessarily_ means you're overfitting.
| jksk61 wrote:
| Yes and no. Suppose, for example, you give MNIST to an SVM and
| fit the model, then test it only on the 0 and 1 digits, which
| are generally well separated: you'll get almost 100% accuracy
| in test versus 97% or less in training. (You'd probably need
| some preprocessing, like PacMAP or UMAP or whatever, but the
| point is the same.)
|
| However, that's just because I picked the right data to test it
| on. So you can't really say much about a model using that
| definition.
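|
| A runnable version of that scenario (my sketch, using sklearn's
| small digits set as a stand-in for MNIST; depending on the
| split, the restricted test accuracy can land at or above the
| all-class training accuracy):
|
| from sklearn.datasets import load_digits
| from sklearn.model_selection import train_test_split
| from sklearn.svm import SVC
|
| X, y = load_digits(return_X_y=True)
| X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
|
| clf = SVC().fit(X_tr, y_tr)       # default RBF-kernel SVM
| easy = (y_te == 0) | (y_te == 1)  # keep only the easy classes
| print("train acc (all classes):", clf.score(X_tr, y_tr))
| print("test acc (only 0s and 1s):", clf.score(X_te[easy], y_te[easy]))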
___________________________________________________________________
(page generated 2023-05-28 23:01 UTC)