[HN Gopher] Why is everything based on likelihoods even though l...
___________________________________________________________________
Why is everything based on likelihoods even though likelihoods are
so small?
Author : cl3misch
Score : 147 points
Date : 2024-02-18 12:53 UTC (10 hours ago)
(HTM) web link (stats.stackexchange.com)
(TXT) w3m dump (stats.stackexchange.com)
| nabla9 wrote:
| A good, simple question and a good answer.
| kjhcvkek77 wrote:
| Because it works well in practice. And to elaborate, usually when
| something works well in practice it's because it has multiple
| desirable properties - the one you "ask for", but also other ones
| you get for free.
|
| In this case, maximum likelihood approximates Bayesian estimation
| with a mostly reasonable (flat) prior. Furthermore, you could look
| at its convergence properties, which are good.
|
| You could probably design some degenerate probability
| distribution that ml-estimation behaves really badly for, but
| those are not common in practice.
| ronald_raygun wrote:
| > You could probably design some degenerate probability
| distribution that ml-estimation behaves really badly for, but
| those are not common in practice.
|
| Anything multimodal...
| nerdponx wrote:
| It's better than "it works well in practice".
|
| The question is misguided as stated. It's like asking why
| chemists care about density for measuring mass.
|
| If you are looking at the likelihood of any particular outcome
| of a continuous random variable, then you do not understand how
| probability works.
|
| The probability of any particular real number arising from a
| probability distribution on the real numbers is exactly 0. It's
| not an arbitrarily small epsilon greater than zero, it's
| actually zero. This definition is in fact required for
| probability to make sense mathematically.
|
| You might ask questions like why does maximum likelihood work
| as an optimization criterion, but that's very different from
| asking why we care about likelihood at all.
|
| The comments on the original question do a good job of cutting
| through this confusion.
| kjhcvkek77 wrote:
| I appreciate your response but I don't really agree. They say
| that likelihood can be multiplied by any scale factor or that
| it's only the comparative difference that matters, or we can
| make a little plot, but they don't actually explain why.
|
| I can try to make an explanation from the Bayesian framework
| (but as I mentioned it's not the only relevant one).
|
| Likelihood is
| P(measurement=measurement'|parameter=parameter'). This is a
| small value. Given a prior we can compute
| P(parameter=parameter'|measurement=measurement'). This is
| also small. But when we compute
| P(parameter'-k < parameter < parameter'+k | measurement=measurement'),
| all the smallness cancels; see the formulation of Bayes' rule
| that reads
|
| P(X_i|Y) = P(X_i)P(Y|X_i) / (sum_j P(X_j)P(Y|X_j))
|
| I'm obviously skipping a lot of steps here because I'm
| sketching an explanation rather than giving one.
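|
| A rough numeric sketch of that cancellation, assuming (to match
| the thread's example) 50 draws from N(mean=5, sd=5) and a flat
| prior over a made-up grid of candidate means:
|
|     import numpy as np
|     from scipy.stats import norm
|
|     # 50 observations from N(5, 5), as in the linked question
|     rng = np.random.default_rng(0)
|     data = rng.normal(loc=5, scale=5, size=50)
|
|     # Discrete grid of candidate means, flat prior over the grid
|     grid = np.linspace(0, 10, 101)
|     log_lik = np.array([norm.logpdf(data, loc=m, scale=5).sum()
|                         for m in grid])
|
|     # Each individual likelihood is astronomically small...
|     print(np.exp(log_lik.max()))       # 1e-60-something
|
|     # ...but the normalization in Bayes' rule cancels the smallness
|     posterior = np.exp(log_lik - log_lik.max())
|     posterior /= posterior.sum()
|     near_peak = np.abs(grid - grid[posterior.argmax()]) < 2
|     print(posterior[near_peak].sum())  # close to 1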
| oasisaimlessly wrote:
| > The probability of any particular real number arising from
| a probability distribution on the real numbers is exactly 0.
| It's not an arbitrarily small epsilon greater than zero, it's
| actually zero.
|
| Nitpicking somewhat, but e.g. `max(1, uniform(0, 2))` has a
| very non-zero probability of evaluating to 1.
| sega_sai wrote:
| It is very strange that this is on the main page. The key thing
| is that the likelihood is the probability _density_ of your
| data! I.e. if your probability density is a Gaussian
| N(0,0.00001), then the likelihoods of data-points next to the
| mean will be very large; if your PDF is N(0,10000) they'll be
| very small. Furthermore, the amount of data matters, as
| likelihoods are multiplied for each datapoint, so if they were
| small in the beginning they'll be even smaller, and if they were
| large they'll be larger.
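|
| A quick illustration of both points, treating the second
| parameter as a standard deviation (a sketch; exact values will
| vary):
|
|     import numpy as np
|     from scipy.stats import norm
|
|     # Density values depend entirely on the scale of the distribution
|     print(norm.pdf(0, loc=0, scale=0.00001))  # ~39894: huge
|     print(norm.pdf(0, loc=0, scale=10000))    # ~4e-05: tiny
|
|     # Multiplying per-point densities amplifies this, so work in logs
|     x = np.random.default_rng(1).normal(0, 1, size=1000)
|     print(norm.logpdf(x, 0, 1).sum())          # a manageable log-likelihood
|     print(np.exp(norm.logpdf(x, 0, 1).sum()))  # raw product underflows to 0.0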
| acc_297 wrote:
| A likelihood could be referring to data drawn from a discrete
| distribution, though, and this wouldn't change much about how
| it's treated - in that case it would be a proper probability,
| not a probability density.
| nerdponx wrote:
| I'm surprised it's here as well. Of all the interesting
| questions on CV, I would not consider this one of them. I
| wonder if this was sent through the second-chance pool.
| kgwgk wrote:
| > The key thing is likelihood is probability density of your
| data!
|
| In fact the important thing to understand about the likelihood
| function is that it's not a probability density.
| sega_sai wrote:
| For continuous data it is exactly a probability density
| evaluated on your data (for discrete it's PMF instead).
|
| L(params)=P(D|params)
| kgwgk wrote:
| The point is that L(params) is not a probability density.
| The integral of L(params) over params is not one.
| vermarish wrote:
| Right. But if you make the notation slightly more
| explicit, then the integral of L(data, params) over data
| is 1. This follows from the independence assumption.
|
| So we ARE working with a probability function. Its output
| can be interpreted as probabilities. It's just that we're
| maximizing L = P(events | params) with respect to params.
| kgwgk wrote:
| The likelihood function is a function of params for a
| fixed value of data and it is not a probability function.
|
| There is another function - a function of data for fixed
| params - which is a probability density. That doesn't
| change the fact that the likelihood function isn't.
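|
| A small numeric check of this, using a Binomial example (the
| data, 7 heads in 10 flips, is made up for illustration):
|
|     from scipy.integrate import quad
|     from scipy.stats import binom
|
|     n, k = 10, 7
|
|     # As a function of the parameter p with the data fixed (the
|     # likelihood function), the integral over p is not 1:
|     area, _ = quad(lambda p: binom.pmf(k, n, p), 0, 1)
|     print(area)   # 1/(n+1) ~= 0.0909
|
|     # As a function of the data with p fixed, the same formula is a
|     # genuine probability distribution and sums to 1:
|     print(sum(binom.pmf(x, n, 0.3) for x in range(n + 1)))   # 1.0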
| 331c8c71 wrote:
| The independence has nothing to do with the integral being 1
| to be honest. You could write a model where the
| observations are not independent but the (multivariate)
| integral over their domain will still be 1.
| 331c8c71 wrote:
| That's exactly the reason why the frequentist approach sucks,
| by the way ;) Parameters are treated specially and there
| is no internal consistency - to have it you need to
| introduce priors...
| 3abiton wrote:
| It's the bayes vs frequentist war again.
| acc_297 wrote:
| Working on nlme models for work these days - it does become a bit
| of a headache when asking "how much better" the model with
| -2LL=8000 is than the model with -2LL=7995. Obviously one is
| better ("more likely given the data"), but what if the better one
| used 2 more parameters, and is hence more complex and might be
| overfitting the dataset? Well, then there are all these
| "heuristics" to look at: AIC, BIC, some sort of trick with a
| chi^2 distribution function. These are all just ways to penalize
| the objective function based on the # of parameters, but it's
| somewhat debatable which one to apply when, and I have read that
| some parameter estimation software packages don't even compute
| these values in exactly the same way.
|
| I am not a statistician by training; I just apply "industry
| standard practices" in as reasonably intuitive a way as I can.
| But my _impression_ has always been that if you wander far enough
| into the weeds you'll find that stats often becomes a debate
| between many approximations with different sorts of tradeoffs,
| and much of this is smoothed over by the fancy scientific
| software packages that get used by non-statistics-researchers.
| One of the most frustrating parts of my job is reproducing SAS
| output (an extensively used statistics product) using free R
| language tools, since a SAS license costs more than some sports
| cars... But what is SAS actually doing? It's never just taking a
| mean or pooling variance in the standard way you'd read about in
| an intermediate stats textbook; it's always doing some slight
| adjustment based on this or that approximation or heuristic.
|
| This tangent may have been unrelated or irrelevant, but I've long
| concluded that in practice statistics is far less solved than
| people might expect if they've never had to reproduce any of the
| numbers given to them by a statistical analysis.
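|
| For what it's worth, the penalties themselves are simple
| formulas; a sketch using the -2LL figures above (the parameter
| counts and sample size here are made up):
|
|     import math
|
|     def aic(neg2ll, k):
|         # Akaike information criterion: -2 log L + 2k
|         return neg2ll + 2 * k
|
|     def bic(neg2ll, k, n):
|         # Bayesian information criterion: -2 log L + k log(n)
|         return neg2ll + k * math.log(n)
|
|     n = 500  # hypothetical number of observations
|     print(aic(8000, 10), aic(7995, 12))        # 8020 vs 8019: a wash
|     print(bic(8000, 10, n), bic(7995, 12, n))  # ~8062 vs ~8070: BIC
|                                                # prefers the simpler model
|
| Which penalty to trust is exactly the debatable part, since the
| criteria are derived under different assumptions.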
| ronald_raygun wrote:
| Another thing to note is that you're multiplying probabilities
| together. Since each probability is between 0 and 1, you're always
| shrinking the likelihood with each new data point. When you're
| doing this kind of analysis, the question you're asking is "given
| a model with these parameters, what's the probability I get
| _exactly_ this sample?" Which, when you phrase it that way,
| makes it more apparent why the likelihood is so small.
| cobbal wrote:
| Multiplying them together certainly magnifies the effect, but
| it would magnify it the other way if the likelihoods were
| larger than one. (Easy to get, just tweak the variance of the
| normal distributions to be smaller). Likelihoods are more like
| infinitesimal fractions of a probability, that need to be
| integrated over some set of events to get back a probability.
| In the case of the joint distribution of 50 Gaussians, you can
| think of the likelihood having "units" of epsilon^50.
| j7ake wrote:
| Wait how do you get likelihoods greater than one?
|
| Definitely won't work for likelihoods like Poisson or other
| count based models.
| blackbear_ wrote:
| For discrete distributions you indeed cannot, but for
| continuous distributions all you need is sufficiently small
| variance. Try for example a Gaussian with variance 1e-12
| KeplerBoy wrote:
| The value of a continuous probability density
| function at a specific point is pretty meaningless
| though; you have to talk about the integral between two
| values, and that won't go above one.
| blackbear_ wrote:
| In the context of maximum likelihood, the value of the density
| at the maximizer is actually quite useful, for example
| for model comparison.
| here4U wrote:
| Because you can add them
| fergal_reid wrote:
| I think most of the replies, here and on stack exchange, are
| answering slightly the wrong question.
|
| It _is_ fair to ask why the likelihoods are useful if they are so
| small, and it's not a good answer to talk about how they could
| be expressed as logs, or even to talk about the properties of
| continuous distributions.
|
| I think the answer is:
|
| Yes, individual likelihoods are so small, that yes even a MLE
| solution is extremely unlikely to be correct.
|
| However, the idea is that often a lot of the probability mass -
| an amount that is not small - will be concentrated around the
| maximum likelihood estimate, and so that's why it makes a good
| estimate, and worth using.
|
| Much like how the average is unlikely to be the exact value of a
| new sample from the distribution, but it's a good way of
| describing what to expect. (And gets better if you augment it
| with some measure of dispersion, and so on). (If the distribution
| is very dispersed, then while the average is less useful as an
| idea of what to expect, it still minimises prediction error in
| some loss; but that's a different thing and I think less relevant
| here).
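|
| A rough simulation of that idea, using the thread's N(5, 5),
| 50-sample setup (the +/-1.5 window is an arbitrary choice):
|
|     import numpy as np
|
|     rng = np.random.default_rng(42)
|
|     # The MLE of a normal mean is the sample mean. It essentially
|     # never equals 5.0 exactly, but it usually lands close to it.
|     mles = np.array([rng.normal(5, 5, size=50).mean()
|                      for _ in range(10_000)])
|     print((mles == 5.0).mean())               # 0.0
|     print((np.abs(mles - 5.0) < 1.5).mean())  # ~0.97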
| jvanderbot wrote:
| Yes - the most enlightening concept for me was "Highest
| Probability Density Interval" which basically always is
| clustered around the mean. But you can choose _any_ interval
| which contains as much probability mass!
|
| https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...
|
| It's a fairly common "mistake" to assume that the MLE is useful
| _as a point estimate_ and _without considering
| covariance/spread/CI/HPDI/FIM/CRLB/Entropy/MI/KLD_ or some other
| measure of precision given the measurement set.
| bscphil wrote:
| > It is fair to ask why the likelihoods are useful if they are
| so small
|
| The way the question demonstrates "smallness" is wrong,
| however. They quote the product of the likelihoods of 50
| randomly sampled values - 9.183016e-65 - as if the smallness of
| this value is significant or meant anything at all. Forget the
| issue of continuous sampling from a normal distribution, and
| just consider the simple discrete case of flipping a coin. The
| combined probability of any permutation of 50 flips is 0.5 ^
| 50, a really small number. That's because the probability is,
| in fact, really small!
| knightoffaith wrote:
| Right - and so the more appropriate thing to do is not look
| at the raw likelihood of any one particular value but instead
| look at relative likelihoods to understand what values are
| more likely than other values.
| anon946 wrote:
| For the discrete case, it seems that a better thing to do is
| consider the likelihood of getting that number of heads,
| rather than the likelihood of getting that exact sequence.
|
| I am not sure how to handle the continuous case, however.
| lupire wrote:
| Of course you ignore irrelevant ordering of data points.
| That's not the issue.
|
| The issue, for discrete or continuous (which are
| mathematically approximations of each other), is that the
| value at a point is less important than the integral over a
| range. That's why standard deviation is useful. The argmax
| is a convenient average over a weightable range of values.
| The larger your range, the greater the likelihood that the
| "truth" is in that range.
|
| If you only need to be correct up to 1% tolerance, the
| likelihood of a range of values that have
| $SAMPLING_PRECISION tolerance is not important. Only the
| argmax is, to give you a center of the range.
| aquafox wrote:
| I agree with your points, and that's why it's useful to compare
| an MLE to an alternative model via a likelihood ratio test, in
| which case one sees how much better the generative model
| performs as compared to the wrong model.
|
| Similarly, AIC values do not make a lot of sense on an absolute
| scale but only relative to each other, as written in [1].
|
| [1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel
| inference: understanding AIC and BIC in model selection.
| Sociological methods & research, 33(2), 261-304.
| crazygringo wrote:
| > _Yes, individual likelihoods are so small, that yes even a
| MLE solution is extremely unlikely to be correct._
|
| Can you elaborate? An MLE is never going to come up with the
| _exact_ parameters that produced the samples, but in the
| original example, as long as you know it's a normal
| distribution, MLE is probably going to come up with a mean
| between 4 and 6 and a SD within a similar range as well (I
| haven't calculated it, just eyeballing it) -- when the original
| parameters were 5 and 5.
|
| I guess I don't know what you mean by "correct", but that's as
| correct as you can get, based on just 50 samples.
| fergal_reid wrote:
| Right - I think this is what's at the heart of the original
| question.
|
| I know they asked with a continuous example, but I don't
| interpret their question as limited to continuous cases, and
| I think it's easier to address using a discrete example, as
| we avoid the issue of each exact parameter having
| infinitesimal mass which occurs in a continuous setting.
|
| Let's imagine the parameter we're trying to estimate is
| discrete and has, say, 500 different possible values.
|
| Let's say the parameter can have the value of the integers
| between 1 and 500 and most of the mass is clustered in the
| middle between 230 and 270.
|
| Given some data, it would actually be possible that MLE would
| come up with the exact value, say 250.
|
| But maybe, given the data, a range of values between 240 and
| 260 are also very plausible, so the exact value 250 still has
| a fairly low probability.
|
| The original poster is confused, because they are basically
| saying, well, if the actual probability is so low, why is
| this MLE stuff useful?
|
| You are pointing out they should really frame things in terms
| of a range and not a point estimate. You are right; but I
| think their question is still legitimate, because often in
| practice we do not give a range, and just give the maximum
| likelihood estimate of the parameter. (And also, separately,
| in a discrete parameter setting, a specific parameter value
| could have substantial mass.)
|
| So why is the MLE useful?
|
| My answer would be, well, that's because for many posterior
| distributions, a lot of the probability mass will be near the
| MLE, if not exactly at it - so knowing the MLE is often
| useful, even if the probability of that exact value of the
| parameter is low.
| TobyTheCamel wrote:
| > However, the idea is that often a lot of the probability mass
| - an amount that is not small - will be concentrated around the
| maximum likelihood estimate, and so that's why it makes a good
| estimate, and worth using.
|
| This may be true for low dimensions but doesn't generalise to
| high dimensions. Consider a 100-dimensional standard normal
| distribution for example. The MLE will still be at the origin
| but most of the mass will live in a thin shell of distance
| roughly 10 units from the origin.
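|
| A quick simulation of that shell, as a sanity-check sketch:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x = rng.standard_normal((100_000, 100))  # 100-dim standard normal
|     radii = np.linalg.norm(x, axis=1)
|
|     print(radii.mean(), radii.std())  # ~9.97 and ~0.71: a thin shell
|     print((radii < 5).mean())         # ~0.0: almost no mass near the
|                                       # mode at the origin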
| blt wrote:
| I think the "mass" they are referring to might be the mass of
| the Bayesian posterior in parameter space, not the mass of
| the data in event space.
| fergal_reid wrote:
| Yes, in parameter space.
|
| However, TobyTheCamel's point is valid in that there are
| some parameter spaces where the MLE is going to be much
| less useful than others.
|
| Even without having to go to high dimensions, if you've got
| a posterior that looks like a normal distribution, the MLE
| is going to tell you a lot, whereas if it's a multimodal
| distribution with a lot of mass scattered around, knowing
| the MLE is much less informative.
|
| But this is a complex topic to address in general, so I'm
| trying to stick to what I see as the intuition behind the
| original question!
| lupire wrote:
| Concentration of mass is _density_. A shell is not dense.
|
| If I am looking for a needle in a hyperhaystack, it's not
| important to know that it's more likely to be "somewhere on
| the huge hyperboundary" than "in the center hypercubic inch".
| agnosticmantis wrote:
| > However, the idea is that often a lot of the probability mass
| - an amount that is not small - will be concentrated around the
| maximum likelihood estimate, and so that's why it makes a good
| estimate, and worth using.
|
| This is a Bayesian point of view. The other answers are more
| frequentist, pointing out that likelihood at a parameter theta
| is NOT the probability of theta being the true parameter (given
| data). So we can't and don't interpret it like a probability.
| klipt wrote:
| Given enough data, Bayesian and frequentist models tend to
| converge to the same answer anyway.
|
| Bayesian priors have similar effect to regularization (e.g.
| ridge regression / penalizing large parameter values).
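|
| A minimal sketch of that correspondence (the data and lambda
| below are made up; the point is that the ridge solution is also
| the MAP estimate under an isotropic Gaussian prior on the
| coefficients):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(200, 3))
|     y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
|
|     lam = 10.0  # ridge penalty; also the assumed prior precision
|
|     # Ridge / penalized least squares solution
|     beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
|
|     # The same vector maximizes the posterior for y ~ N(X b, I) with
|     # prior b ~ N(0, I / lam): the negative log-posterior is, up to
|     # constants, 0.5 * (||y - X b||^2 + lam * ||b||^2).
|     print(beta)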
| praptak wrote:
| I don't have a problem with very small probabilities as long as
| they stay within math and kind of "cancel out".
|
| What I do have a problem with is the lack of a conceptual
| framework for dealing with small probabilities of real-life
| events.
|
| For example, what amount of effort is appropriate to prevent a
| one-time event which kills you with, say, a 1-in-ten-thousand
| probability?
| sebzim4500 wrote:
| Quite a lot. See airbags, seat belts, etc.
| nsomaru wrote:
| One simple way of quantifying this is (amt/cost of harm) *
| (risk of occurrence)
| airstrike wrote:
| _> For example, what amount of effort is appropriate to prevent
| a one-time event which kills you with, say, a 1-in-ten-thousand
| probability?_
|
| if you value being killed at a massively negative value, then
| 1/10,000 times that value is still a massively negative value,
| so the answer is "a huge amount of effort"
| samatman wrote:
| For some value of "massive", sure. But for any value of
| massive, it's 1/10,000th that value. Then you factor in the
| value derived from taking that risk, and there's your choice.
|
| The reality is that I don't expend huge amounts of effort
| avoiding tail risks, and you don't either. You might for the
| ones you're explicitly aware of, but a risk is a risk whether
| or not you know you're taking it.
| hedora wrote:
| How I'd put it:
|
| They are useful because the integral of the likelihoods is not
| infinitesimal.
|
| The probability that your yard stick measures 1.0000000000000
| yards is basically zero, but the probability that it's within one
| inch of that is close to one.
|
| We generally prefer to use probability density functions with the
| property that most of the probability density is close to the
| maximum likelihood.
|
| So, in the yard stick example, the yard stick lengths are
| probably Gaussian, so if you check enough lengths, you'll get a
| mean (== the length with the maximum likelihood) that approaches
| 1.00000000000 yards (you'd hope) with some small standard
| deviation (probably less than an inch).
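|
| In numbers, assuming (purely for illustration) that the stick
| lengths deviate from one yard by N(0, 0.3 inches):
|
|     from scipy.stats import norm
|
|     stick = norm(loc=0.0, scale=0.3)  # deviation from 1 yard, in inches
|
|     print(stick.pdf(0.0))                # density at exactly 1 yard:
|                                          # finite, and unit-dependent
|     print(stick.cdf(1) - stick.cdf(-1))  # P(within one inch): ~0.999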
| hopfenspergerj wrote:
| You work with probability density functions because the
| probability of observing any given value in a continuum is zero.
| Density functions may be reasonable to work with if they have
| some nice properties (continuity, unimodality, ...). The question
| and answers here seem to be from people who don't understand
| calculus.
| ninetyninenine wrote:
| In a continuous distribution the probability of any number on
| that distribution being generated is effectively zero. If R was
| generating the true probabilities it should give you zero for
| every single number.
|
| Think about it. That distribution is continuous over an infinite
| amount of numbers. If you select any number the chances of that
| number being generated will be essentially zero. According to the
| theory there is no possibility for any number on the distribution
| to be generated. This is correct.
|
| Yet when you use the random number generator you get a number
| even though that number technically cannot exist due to its
| zero probability. Does this mean there is a flaw in the theory
| when applied to the number generated?
|
| Yes it does. The theory is an approximation of what's going on
| itself. No random number generator in a computer is selecting a
| number from a truly continuous set of numbers. It is selecting it
| from a finite set of numbers from all available numbers in a
| floating point specification.
|
| Even if it's not a computer, when you select a random number by
| intuition from a continuous distribution you are not doing it
| randomly.
|
| Think about it. Pick a random number between 0 and 1. I pick
| 0.343445434. This selection is far from random. It is biased
| because there is an infinite number of significant figures, yet I
| arbitrarily don't go past a certain amount. I cut off at 9
| sigfigs and bias towards a cutoff like that because picking a
| random number with say 6000 sigfigs is just too inconvenient. You
| really need to account for infinite sigfigs for the number to be
| truly random, which is impossible.
|
| So even when you pick numbers randomly you are actually picking
| from a finite set.
|
| In fact I can't think of anything in reality that can truly be
| accurately described with a continuous distribution. Nothing is
| in the end truly continuous. Or maybe it does exist, but if it
| does exist how can we even confirm it? We can't verify anything
| in reality to a level of infinite sig figs.
|
| If R was accurately calculating likelihood it should give you
| zero for each number. And the random number generator should not
| even be able to exist, as how do you even create a pool of infinite
| possibilities to select from? Likely R is giving some probability
| over a small interval of numbers.
|
| That's where the practicality of the continuous distribution
| makes sense when you measure the probability of a range of
| values. You get a solid number in this case.
|
| Anyway the above explanation is probably too deep. A more
| practical way of thinking about this is like this:
|
| It is unlikely for any one person to win the lottery. Yet someone
| always wins. The probability of someone winning is 100 percent.
| The probability of a specific someone winning is 1 over the total
| number of people playing.
|
| Improbable events in the universe happen all the time because
| that's all that's available. It's highly improbable for any one
| person to win the lottery but if someone has to win, then there
| is a 100 percent chance that an arbitrary improbable event will
| occur.
|
| This is more easily seen in a uniform discrete distribution
| rather than the normal continuous distribution.
|
| In the case of the normal distribution it is confusing. In a
| normal distribution it is far more likely for an improbable event
| to occur than it is for the single most probable event to occur.
|
| Think of it like this. I have a raffle. There are 2 billion
| participants. Each person has one ticket in the bag, except me. I
| have 100,000 tickets in the bag, so I am the most likely person to
| win.
|
| But it is still far more likely for anyone else but me to win
| even when I am the most likely person to win. An arbitrary
| improbable event is more likely to occur than the single most
| probable event.
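|
| The raffle arithmetic, spelled out:
|
|     others = 2_000_000_000 - 1  # everyone else, one ticket each
|     mine = 100_000              # my tickets
|     total = others + mine
|
|     print(mine / total)    # ~5e-05: I am the single most likely winner
|     print(others / total)  # ~0.99995: yet almost surely some other,
|                            # individually very unlikely, person wins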
| ttoinou wrote:
| > In a continuous distribution the probability of any number on
| that distribution being generated is effectively zero
|
| Yes, almost by definition, no? You can only know the
| probability that it falls within a range of numbers, by
| integrating over the distribution.
| 331c8c71 wrote:
| > If R was accurately calculating likelihood it should give you
| zero for each number.
|
| You have some good points but this is false. The probability of
| any point for a continuous distribution is indeed zero. That
| doesn't mean that the density at this point is also zero.
| richrichie wrote:
| For infinite probability spaces, likelihood has no interpretable
| meaning akin to probability in the case of finite spaces. This
| can be a source of confusion.
| ultra_nick wrote:
| In a storm, the likelihood of A raindrop hitting your computer
| dwarfs the likelihood of THE raindrops[5927] hitting your
| computer.
|
| Hence why using A and THE is important.
| leni536 wrote:
| I think the correct answer is that it is mostly bogus, but
| likelihood-based statistical methods mostly work for well-
| behaved distributions, especially for the Gaussian.
|
| Maximum likelihood estimation has some weird cases when the
| distribution is not "well behaved".
| data-ottawa wrote:
| Probability is the probability mass distributed over your data
| with fixed parameters, and likelihood is mass distributed over
| your model parameters with fixed data. The absolute most
| important thing to know about likelihood is that it is not a
| measure of probability, even though it looks a lot like
| probability.
|
| If I look at coin flip data, I know the data comes from a coin
| flip, but any specific count of heads vs tails becomes less and
| less likely the more flips we do. So likelihood being small tells
| us nothing on its own.
|
| The value of likelihood comes from the framework you use it in. If I
| wanted to make a best guess at what the balance of the coin is
| then I could find the maximum of the likelihood over all coin
| balances to get the most representative version of my model.
| Similarly, I can compare two specific coin biases and determine
| which is more likely, but that alone can't tell me anything about
| the probability of the coin being biased.
| andsoitis wrote:
| Relative likelihoods can be used for prioritization.
| hdivider wrote:
| _One might perhaps say that what is probable
|
| Is that men experience much that is improbable._
|
| - Fragment of a lost play by Agathon, ca. 2,400 years ago, quoted
| in Aristotle's _The Art of Rhetoric_
| l_e_o_n wrote:
| You flip a possibly-biased coin 20 times and get half heads, half
| tails, e.g. "THHHTTTTTHTHTTHHHTHH".
|
| Under the model where the bias is 0.5--a fair coin--the
| probability of that sequence is (0.5)^20 or about one in a
| million. In fact, the probability of _any_ sequence you could
| observe is one in a million.
|
| Under the model where the bias is 0.4 the probability is (0.4)^10
| x (0.6)^10 or about one in 1.6 million.
|
| That is, the sequence we observed supplies about 1.5 times as much
| evidence in favor of bias = 0.5 as compared with bias = 0.4--this
| is likelihood.
|
| Likelihood ratios are all that matter.
|
| Morals:
|
| - The more complex the event you're predicting (the rarer the
| typical observed result) the smaller the associated likelihoods
| will tend to be
|
| - It's possible that every observed result has a tiny probability
| under every model you're considering
|
| - Nonetheless it makes sense to use the ratios of these numbers
| to compare the models
|
| - This has nothing to do with probability densities or
| logarithms, though the fact that we often work with densities
| also makes absolute likelihood values relative to the choice of
| units
|
| Added in edits:
|
| - You could summarize the sequence with the number of heads or
| tails and then the likelihood values would be larger but the
| ratios would remain the same (it's a sufficient statistic).
| Similarly in the CrossValidated question one could summarize the
| data with the mean and sum of squares. But this doesn't work in
| general, e.g. if we have i.i.d. draws from a Cauchy distribution.
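|
| The same figures in code (a quick check of the numbers above):
|
|     seq = "THHHTTTTTHTHTTHHHTHH"
|     h, t = seq.count("H"), seq.count("T")  # 10 heads, 10 tails
|
|     lik_fair = 0.5**h * 0.5**t    # ~9.5e-07, about one in a million
|     lik_biased = 0.4**h * 0.6**t  # ~6.3e-07, about one in 1.6 million
|
|     print(lik_fair / lik_biased)  # ~1.5 in favor of bias = 0.5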
| qup wrote:
| > half heads, half tails.
|
| > Under the model where the bias is 0.5--a fair coin--the
| probability of that outcome is (0.5)^20 or about one in a
| million.
|
| No.
|
| Edit: someone downvoted, ha. It's closer to 1 in 6.
| cj wrote:
| 1 in a million is the probability of correctly predicting a
| unique sequence of 20 coin flips, in the exact order. (E.g.
| first 10 flips heads, 2nd 10 flips tails, in that order - 1
| in a million)
| l_e_o_n wrote:
| Edited for clarity.
| BeetleB wrote:
| I'm surprised people are conflating the Binomial distribution
| with OP's statement. He is talking about _one_ specific
| outcome of half heads /half tails (where order matters).
| There is exactly one way to get that outcome.
| b10nic wrote:
| Not quite; the probability of n/2 successes in n trials is
| given as Binomial(n,p) not p^n. p^n is correct for a single
| sequence, but there are many possible sequences that result in
| half heads, half tails, and so you have a factor of "N choose X"
| or the so-called "Binomial Coefficient".
|
| > (0.4)^20 x (0.6)^20
|
| and I think you mean (0.4)^10 x (0.6)^10 or more generally
| p^x * (1-p)^(n-x).
| l_e_o_n wrote:
| I'm talking about the whole sequence; you're talking about
| the number of heads (or) tails in the sequence.
|
| The number of heads is a sufficient statistic, so we'll get
| the same likelihood ratios out, but the likelihood values
| themselves will be larger.
|
| You could make a similar point about the original
| CrossValidated Normal(0, 1)^N example by summarizing the data
| with the mean and sum of squares.
|
| This doesn't work if the data were Cauchy(0, 1)^N instead.
| kgwgk wrote:
| > Thus, it appears to be very unlikely in a certain sense that
| these numbers came from the very distribution they were generated
| from.
|
| "It appears to be very unlikely in a certain sense that this
| comment is written in English. Yes, this sequence of characters
| is much more likely in English than in French. But you can't even
| fathom how unlikely it was to be ever written in English!"
| hsnewman wrote:
| Because the basis of all reality is quantum physics
| lukego wrote:
| Likelihoods aren't fundamentally small.
|
| The center of a normal distribution has high likelihood (e.g.
| 1000000) if the standard deviation is small or low likelihood if
| the standard deviation is large (e.g. 1/1000000.)
|
| This effect is amplified when you are working with products of
| likelihoods. They can be infinitesimal or astronomical.
|
| Giant likelihoods really surprised me the first time I
| experienced them but they're not uncommon when you work with
| synthetic test data in high dimensions and/or small scales.
|
| They still integrate to the same magnitude because the higher
| likelihood values are spread over shorter spans.
| jdhwosnhw wrote:
| Another issue is that likelihoods associated with continuous
| distributions very often have units. You can't meaningfully
| assign "magnitude" to a quantity with units. You can always
| change your unit system to make the likelihood of, say, a
| particular height in a population of humans to be arbitrarily
| large.
| mathisd wrote:
| I don't understand why the maximum of the likelihood is not zero
| in the example given. Isn't P(X = x | theta = theta_0) always
| zero for continuous distributions?
| knightoffaith wrote:
| The actual probability is 0, but the probability density is not
| 0. Same reason why the probability that I pick 0.5 from a
| uniform distribution from 0 to 1 is 0, but the value of the
| probability density function of the distribution at 0.5 is 1.
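|
| In scipy terms, as a sketch of that distinction:
|
|     from scipy.stats import uniform
|
|     u = uniform(loc=0, scale=1)  # uniform on [0, 1]
|
|     print(u.pdf(0.5))                 # 1.0  : density at 0.5
|     print(u.cdf(0.5) - u.cdf(0.5))    # 0.0  : probability of exactly 0.5
|     print(u.cdf(0.51) - u.cdf(0.49))  # 0.02 : probability of a small
|                                       #        interval around it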
| cubefox wrote:
| What is this point value then measuring? A literal "density"
| doesn't seem plausible either, as points arguably do not have
| any "density".
| hinkley wrote:
| In the weirdest "likelihood" conversation I ever had, the putative
| team lead didn't want to change priorities to fix a bug because,
| "how often does that happen?"
|
| My reply was, "it happens to every user, the first time they use
| the app." And then something about how frequency has nothing to
| do with it. Every single user was going to encounter this bug.
| Even if they only used the app once.
|
| I already had a toolbox full of conversations about how bad we
| are at statistics, but that one opened up a whole new avenue of
| things to worry about. One that was reinforced by later articles
| about the uselessness of p95 stats - particularly where 3% of
| your users are experiencing 100% outage.
|
| But the one that is more apropos to the linked question, vs HN in
| general, is how people are bad at calculating the probability
| that "nothing bad happens" when there are fifty low probability
| things that can go wrong. Especially as the number of
| opportunities goes up.
|
| And the way that, if we do something risky and nothing bad
| happens, we estimate down the probability of future calamity
| instead of counting ourselves lucky and backing away.
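|
| The "nothing bad happens" arithmetic, with a made-up 1% chance
| per item:
|
|     p_fail = 0.01  # hypothetical chance each individual thing goes wrong
|     n = 50
|
|     p_nothing_bad = (1 - p_fail) ** n
|     print(p_nothing_bad)      # ~0.61: all clear only ~61% of the time
|     print(1 - p_nothing_bad)  # ~0.39: nearly a 40% chance something breaks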
| opportune wrote:
| I've seen this exact same fallacy happen several times
| throughout my career, which isn't even very long.
|
| I think in many cases it boils down to some subtype not being
| identified and evaluated on its own. As in your case, it's
| especially impactful - and yet, IME, also usually where these
| kinds of things get improperly prioritized - when it's a user's
| first impression, or when it occurs in a way that causes a user
| to have to just sit and wait on the other end, as these are
| often "special" cases with different logic in your application
| code.
|
| OTOH sometimes users try to do weird/wrong/adversarial shit, and so
| their high failure rate is working as intended. But it pollutes
| your stats such that it can hide real issues with similar
| symptoms and skew distributions.
| itronitron wrote:
| Yeah, and I'd also add that the total # of bugs in an
| application will always be greater than the total # of 'known'
| bugs. Tracking down and fixing the oddball bugs usually
| prevents a larger set of related issues from popping up later.
| fwungy wrote:
| Yes. Statistics are fragile indicators well beyond the Central
| Limit Theorem minimal sufficient boundaries. They work pretty
| well when you have tons of data and run tons of repetitions, but
| for moderate sized data and repetitions you need very high
| certainty levels for statistics to help much.
|
| You can play perfect blackjack and card count at a table with
| good rules and lose plenty because your advantage is small (< 2%)
| and your repetitions are limited.
|
| Statistics get even worse when the probabilities are chained
| because the weakest estimator bounds the rest.
|
| Essentially, if you always follow statistical advice you should do
| better than average, if you're lucky. There are better heuristics
| than statistics in most fields of human decision making.
| waldrews wrote:
| Statistician here. There's a deep idea called the likelihood
| principle https://en.wikipedia.org/wiki/Likelihood_principle that
| says all the information we can get from the data about model
| parameters is contained in the likelihood function.
|
| We're talking about the whole likelihood surface here, not just
| the single point that's the maximum likelihood estimator. The MLE
| is a method for choosing a valid point estimator from the
| likelihood function; it has some good properties, like being
| consistent (if you have enough data it converges to the truth)
| and asymptotically efficient (converges smallest possible
| variance) so long as some criteria are met.
|
| But the MLE is not the only choice; for any given model, other
| procedures can be admissible estimators
| https://en.wikipedia.org/wiki/Admissible_decision_rule - it's
| just they also have to be procedures based on the likelihood
| function. In other words, your procedure doesn't have to be "take
| the likelihood function and find its maximum" but it has to be
| "take the likelihood function and... do something sensible with
| it."
|
| So the MLE is popular in the frequentist world where you have to
| make the decision rules using the likelihood directly; in the
| Bayesian world, you take the likelihood and combine it with a
| prior, to make an actual probability distribution. Then you get
| things like MAP (mode of the posterior) or the Bayes
| estimate (expectation of the posterior) - alternatives to MLE
| that still use the likelihood surface.
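|
| A compact illustration of those estimators (the 7-heads-in-10
| data and the Beta(2, 2) prior are assumptions for the sketch):
|
|     heads, flips = 7, 10
|     a0, b0 = 2, 2  # hypothetical Beta(2, 2) prior on the coin's bias
|
|     # MLE: argmax of the likelihood alone
|     mle = heads / flips                   # 0.70
|
|     # With a Beta prior the posterior is Beta(a0 + heads, b0 + tails),
|     # giving the two Bayesian point estimates mentioned above:
|     a, b = a0 + heads, b0 + (flips - heads)
|     map_estimate = (a - 1) / (a + b - 2)  # posterior mode (MAP): ~0.667
|     bayes_estimate = a / (a + b)          # posterior mean:       ~0.643
|
|     print(mle, map_estimate, bayes_estimate)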
|
| Of course this all works only if the underlying probabilistic
| model is literally true. So, in the machine learning world where
| the models are judged on usefulness and not
| expected to reflect mathematical reality, you're allowed to do
| things inconsistent with likelihood principle, like
| regularization tricks. In some physics situations (astronomical
| imaging comes to mind) where the probability model really is
| governed by the rules of nature, sticking to likelihood principle
| actually matters.
|
| As to the question of being small, well, the likelihood is the
| probability (density) of the exact data you observe given
| parameters. Let's say you know the true parameter (the mean and
| standard deviation) and you observe a thousand draws from a
| normal distribution. Of course the probability of observing the
| very same pattern of a thousand values again is overwhelmingly
| small. But if the mean were way different, that pattern would
| be proportionally even more unlikely. We should only care about
| relative probabilities. What's the probability that the universe
| evolved in exactly such a way that your cat will have exactly
| this fur pattern? Astronomically small. What's the probability
| that the universe evolved in such a way, and some of that fur
| ends up on your furniture? Another unimaginably small number. But
| what's the probability that, in a universe where you and your cat
| exist as you are, his fur will get everywhere? That's pretty much
| a certainty.
| aj7 wrote:
| Because it is the product of the likelihood and the consequence,
| not just the likelihood that matters.
| stochastimus wrote:
| As OP noticed, likelihoods often do show up in a comparative
| context. In that context one is asking which thing or sequence is
| most likely to occur by chance relative to another, under an
| (oversimplistic, sure) IID assumption. In practice, the ordering
| of such things is often (hand-waving, sure) robust enough that,
| given no other information than the marginals, it is useful. So I
| think OP almost answered his/her own question: they are often
| quite useful in a comparative context and with no additional
| information.
___________________________________________________________________
(page generated 2024-02-18 23:00 UTC)