[HN Gopher] Why is everything based on likelihoods even though l...
       ___________________________________________________________________
        
       Why is everything based on likelihoods even though likelihoods are
       so small?
        
       Author : cl3misch
       Score  : 147 points
       Date   : 2024-02-18 12:53 UTC (10 hours ago)
        
 (HTM) web link (stats.stackexchange.com)
 (TXT) w3m dump (stats.stackexchange.com)
        
       | nabla9 wrote:
        | A good, simple question and a good answer.
        
       | kjhcvkek77 wrote:
       | Because it works well in practice. And to elaborate, usually when
       | something works well in practice it's because it has multiple
       | desirable properties - the one you "ask for", but also other ones
       | you get for free.
       | 
        | In this case, maximum likelihood approximates Bayesian estimation
        | with a mostly reasonable prior. Furthermore, you could look at the
        | convergence properties, which are good.
       | 
       | You could probably design some degenerate probability
       | distribution that ml-estimation behaves really badly for, but
       | those are not common in practice.
        
         | ronald_raygun wrote:
         | > You could probably design some degenerate probability
         | distribution that ml-estimation behaves really badly for, but
         | those are not common in practice.
         | 
         | Anything multimodal...
        
         | nerdponx wrote:
         | It's better than "it works well in practice".
         | 
         | The question is misguided as stated. It's like asking why
         | chemists care about density for measuring mass.
         | 
          | If you are treating the likelihood of any particular outcome
          | of a continuous random variable as a probability, then you do
          | not understand how probability works.
         | 
         | The probability of any particular real number arising from a
         | probability distribution on the real numbers is exactly 0. It's
         | not an arbitrarily small epsilon greater than zero, it's
          | actually zero. This definition is in fact required for
          | probability to make sense mathematically.
         | 
         | You might ask questions like why does maximum likelihood work
         | as an optimization criterion, but that's very different from
         | asking why we care about likelihood at all.
         | 
         | The comments on the original question do a good job of cutting
         | through this confusion.
        
           | kjhcvkek77 wrote:
           | I appreciate your response but I don't really agree. They say
           | that likelihood can be multiplied by any scale factor or that
           | it's only the comparative difference that matters, or we can
           | make a little plot, but they don't actually explain why.
           | 
            | I can try to make an explanation from the Bayesian
            | framework (but as I mentioned, it's not the only relevant one).
            | 
            | The likelihood is
            | P(measurement=measurement'|parameter=parameter'). This is a
            | small value. Given a prior we can compute
            | P(parameter=parameter'|measurement=measurement'). This is
            | also small. But when we compute
            | P(parameter'-k < parameter < parameter'+k | measurement=measurement'),
            | all the smallness cancels; see the formulation of Bayes that
            | reads
            | 
            | P(X_i|Y) = P(X_i)P(Y|X_i) / (sum_j P(X_j)P(Y|X_j))
           | 
           | I'm obviously skipping a lot of steps here because I'm
           | sketching an explanation rather than giving one.
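            | 
            | A minimal numeric sketch of that cancellation (Python, with
            | numpy/scipy assumed): the per-dataset likelihoods are tiny,
            | but the posterior probability of an interval of parameter
            | values is an ordinary, not-small number.
            | 
            |     import numpy as np
            |     from scipy.stats import norm
            | 
            |     np.random.seed(0)
            |     data = norm.rvs(loc=5, scale=5, size=50)   # measurements
            |     grid = np.linspace(0, 10, 1001)            # candidate means (discretized parameter)
            |     prior = np.full(grid.size, 1 / grid.size)  # flat prior over the grid
            | 
            |     # Likelihood of the whole dataset at each candidate mean: tiny numbers.
            |     like = np.array([norm.pdf(data, loc=m, scale=5).prod() for m in grid])
            | 
            |     # Bayes: the tiny likelihoods sit in both numerator and denominator,
            |     # so the smallness cancels when we normalize.
            |     post = prior * like / np.sum(prior * like)
            | 
            |     mle = grid[np.argmax(like)]
            |     print(like.max())  # astronomically small
            |     print(post[(grid > mle - 1) & (grid < mle + 1)].sum())  # ordinary probability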
        
           | oasisaimlessly wrote:
           | > The probability of any particular real number arising from
           | a probability distribution on the real numbers is exactly 0.
           | It's not an arbitrarily small epsilon greater than zero, it's
           | actually zero.
           | 
           | Nitpicking somewhat, but e.g. `max(1, uniform(0, 2))` has a
           | very non-zero probability of evaluating to 1.
        
       | sega_sai wrote:
       | It is very strange that this is on a main page. The key thing is
       | likelihood is probability _density_ of your data! I.e. if your
       | probability density is a Gaussian N(0,0.00001), then the
       | likelihoods of data-points next to the mean will be very large,
        | if your PDF is N(0,10000) they'll be very small. Furthermore the
       | amount of data matters as likelihoods will be multiplied for each
       | datapoint, so if they were small in the beginning, they'll be
       | even smaller, if they were large they'll be larger.
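        | 
        | A quick illustration (Python, scipy assumed, reading N(mu,
        | variance)): the density at the mean is huge for a narrow Gaussian
        | and tiny for a wide one.
        | 
        |     from scipy.stats import norm
        | 
        |     # Density of a normal distribution evaluated at its own mean:
        |     print(norm.pdf(0, loc=0, scale=0.00001**0.5))  # N(0, 1e-5)  -> ~126
        |     print(norm.pdf(0, loc=0, scale=10000**0.5))    # N(0, 1e4)   -> ~0.004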
        
         | acc_297 wrote:
          | A likelihood could be referring to data drawn from a discrete
          | distribution, though, and this wouldn't change much about how
          | it's treated; in that case it would be a proper probability,
          | not a probability density.
        
         | nerdponx wrote:
         | I'm surprised it's here as well. Of all the interesting
         | questions on CV, I would not consider this one of them. I
         | wonder if this was sent through the second-chance pool.
        
         | kgwgk wrote:
         | > The key thing is likelihood is probability density of your
         | data!
         | 
         | In fact the important thing to understand about the likelihood
         | function is that it's not a probability density.
        
           | sega_sai wrote:
           | For continuous data it is exactly a probability density
           | evaluated on your data (for discrete it's PMF instead).
           | 
           | L(params)=P(D|params)
        
             | kgwgk wrote:
             | The point is that L(params) is not a probability density.
             | The integral of L(params) over params is not one.
        
               | vermarish wrote:
               | Right. But if you make the notation slightly more
               | explicit, then the integral of L(data, params) over data
               | is 1. This follows from the independence assumption.
               | 
               | So we ARE working with a probability function. Its output
               | can be interpreted as probabilities. It's just that we're
               | maximizing L = P(events | params) with respect to params.
        
               | kgwgk wrote:
               | The likelihood function is a function of params for a
               | fixed value of data and it is not a probability function.
               | 
               | There is another function - a function of data for fixed
               | params - which is a probability density. That doesn't
               | change the fact that the likelihood function isn't.
        
               | 331c8c71 wrote:
                | The independence has nothing to do with the integral being 1
               | to be honest. You could write a model where the
               | observations are not independent but the (multivariate)
               | integral over their domain will still be 1.
        
               | 331c8c71 wrote:
                | That's exactly the reason why the frequentist approach sucks
               | by the way;) Parameters are treated specially and there
               | is no internal consistency - to have it you need to
               | introduce priors...
        
               | 3abiton wrote:
               | It's the bayes vs frequentist war again.
        
       | acc_297 wrote:
        | Working on nlme models for work these days - it does become a bit
        | of a headache when asking "how much better" the model with
        | -2LL=8000 is than the model with -2LL=7995. Obviously one is
        | better ("more likely given the data"), but what if the better one
        | used 2 more parameters and is hence more complex and might be
        | overfitting the dataset? Well, then there are all these
        | "heuristics" to look at - AIC, BIC, some sort of trick with a
        | chi^2 distribution function - which are all just ways to penalize
        | the objective function based on the # of parameters. But it's
        | somewhat debatable which one to apply when, and I have read that
        | some parameter estimation software packages don't even compute
        | these values in exactly the same way.
        | 
        | I am not a statistician by training; I just apply "industry
        | standard practices" in as reasonably intuitive a way as I can.
        | But my _impression_ has always been that if you wander far enough
        | into the weeds you'll find that stats often becomes a debate
        | between many approximations with different sorts of tradeoffs,
        | and much of this is smoothed over by the fancy scientific
        | software packages that get used by non-statistics-researchers.
        | One of the most frustrating parts of my job is reproducing SAS
        | output (an extensively used statistics product) using free R
        | language tools, since a SAS license costs more than some sports
        | cars... But what is SAS actually doing? It's never just taking a
        | mean or pooling variance in the standard way you'd read about in
        | an intermediate stats textbook; it's always doing some slight
        | adjustment based on this or that approximation or heuristic.
        | 
        | This tangent may have been unrelated or irrelevant, but I've long
        | concluded that in practice statistics is far less solved than
        | people might expect if they've never had to reproduce any of the
        | numbers given to them by statistical analysis.
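        | 
        | For what it's worth, the usual AIC bookkeeping for that example
        | looks something like this (a sketch; AIC = -2LL + 2k, and the
        | parameter count for the smaller model is a made-up number):
        | 
        |     # Model A: -2LL = 8000 with k parameters; model B: -2LL = 7995 with k + 2.
        |     def aic(minus2ll, k):
        |         return minus2ll + 2 * k
        | 
        |     k = 10                   # hypothetical parameter count for model A
        |     print(aic(8000, k))      # 8020
        |     print(aic(7995, k + 2))  # 8019 -> model B "wins", but only by 1 AIC unit
        |     # BIC would add log(n) * k instead, penalizing the extra
        |     # parameters more heavily for any reasonably large n.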
        
       | ronald_raygun wrote:
        | Another thing to note is that you're multiplying probabilities
        | together. Since each probability is between 0 and 1, you're always
        | shrinking the likelihood with each new data point. When you're
        | doing this kind of analysis, the question you're asking is "given
        | a model with these parameters, what's the probability I get
        | _exactly_ this sample?" Which, when you phrase it that way, makes
        | it more apparent why the likelihood is so small.
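        | 
        | A quick sketch of that shrinkage (Python, scipy assumed): each
        | per-point density here is well below one, so the product falls
        | off fast with the sample size.
        | 
        |     import numpy as np
        |     from scipy.stats import norm
        | 
        |     np.random.seed(1)
        |     data = norm.rvs(loc=5, scale=5, size=50)
        |     pointwise = norm.pdf(data, loc=5, scale=5)  # each roughly 0.01-0.08
        |     print(np.prod(pointwise[:10]))              # already ~1e-13
        |     print(np.prod(pointwise))                   # ~1e-65: "exactly this sample" is rare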
        
         | cobbal wrote:
         | Multiplying them together certainly magnifies the effect, but
         | it would magnify it the other way if the likelihoods were
         | larger than one. (Easy to get, just tweak the variance of the
         | normal distributions to be smaller). Likelihoods are more like
         | infinitesimal fractions of a probability, that need to be
         | integrated over some set of events to get back a probability.
          | In the case of the joint distribution of 50 Gaussians, you can
         | think of the likelihood having "units" of epsilon^50.
        
           | j7ake wrote:
           | Wait how do you get likelihoods greater than one?
           | 
           | Definitely won't work for likelihoods like Poisson or other
           | count based models.
        
             | blackbear_ wrote:
             | For discrete distributions you indeed cannot, but for
             | continuous distributions all you need is sufficiently small
             | variance. Try for example a Gaussian with variance 1e-12
        
               | KeplerBoy wrote:
                | The value of a continuous probability density
                | function at a specific point is pretty meaningless
                | though; you have to talk about the integral between two
                | values, and that won't go above one.
        
               | blackbear_ wrote:
                | In the context of maximum likelihood the value of the density
               | at the maximizer is actually quite useful, for example
               | for model comparison.
        
       | here4U wrote:
       | Because you can add them
        
       | fergal_reid wrote:
       | I think most of the replies, here and on stack exchange, are
       | answering slightly the wrong question.
       | 
       | It _is_ fair to ask why the likelihoods are useful if they are so
        | small, and it's not a good answer to talk about how they could
       | be expressed as logs, or even to talk about the properties of
       | continuous distributions.
       | 
       | I think the answer is:
       | 
        | Yes, individual likelihoods are so small that, yes, even an MLE
       | solution is extremely unlikely to be correct.
       | 
       | However, the idea is that often a lot of the probability mass -
       | an amount that is not small - will be concentrated around the
       | maximum likelihood estimate, and so that's why it makes a good
       | estimate, and worth using.
       | 
       | Much like how the average is unlikely to be the exact value of a
       | new sample from the distribution, but it's a good way of
       | describing what to expect. (And gets better if you augment it
       | with some measure of dispersion, and so on). (If the distribution
       | is very dispersed, then while the average is less useful as an
       | idea of what to expect, it still minimises prediction error in
       | some loss; but that's a different thing and I think less relevant
       | here).
        
         | jvanderbot wrote:
         | Yes - the most enlightening concept for me was "Highest
         | Probability Density Interval" which basically always is
         | clustered around the mean. But you can choose _any_ interval
         | which contains as much probability mass!
         | 
         | https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...
         | 
         | It's a fairly common "mistake" to assume that the MLE is useful
          | _as a point estimate_ and _without considering covariance/
          | spread/CI/HPDI/FIM/CRLB/Entropy/MI/KLD_ or some other measure
         | of precision given the measurement set.
        
         | bscphil wrote:
         | > It is fair to ask why the likelihoods are useful if they are
         | so small
         | 
         | The way the question demonstrates "smallness" is wrong,
         | however. They quote the product of the likelihoods of 50
         | randomly sampled values - 9.183016e-65 - as if the smallness of
         | this value is significant or meant anything at all. Forget the
         | issue of continuous sampling from a normal distribution, and
         | just consider the simple discrete case of flipping a coin. The
         | combined probability of any permutation of 50 flips is 0.5 ^
         | 50, a really small number. That's because the probability is,
         | in fact, really small!
        
           | knightoffaith wrote:
           | Right - and so the more appropriate thing to do is not look
           | at the raw likelihood of any one particular value but instead
           | look at relative likelihoods to understand what values are
           | more likely than other values.
        
           | anon946 wrote:
           | For the discrete case, it seems that a better thing to do is
           | consider the likelihood of getting that number of heads,
           | rather than the likelihood of getting that exact sequence.
           | 
           | I am not sure how to handle the continuous case, however.
        
             | lupire wrote:
             | Of course you ignore irrelevant ordering of data points.
             | That's not the issue.
             | 
             | The issue, for discrete or continuous (which are
             | mathematically approximations of each other), is that the
             | value at a point is less important than the integral over a
             | range. That's why standard deviation is useful. The argmax
             | is a convenient average over a weightable range of values.
             | The larger your range, the greater the likelihood that the
             | "truth" is in that range.
             | 
             | If you only need to be correct up to 1% tolerance, the
             | likelihood of a range of values that have
              | $SAMPLING_PRECISION tolerance is not important. Only the
             | argmax is, to give you a center of the range.
        
         | aquafox wrote:
          | I agree with your points and that's why it's useful to compare a
         | MLE to an alternative model via a likelihood ratio test, in
         | which case one sees how much better the generative model
         | performs as compared to the wrong model.
         | 
         | Similarly, AIC values do not make a lot of sense on an absolute
         | scale but only relative to each other, as written in [1].
         | 
         | [1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel
         | inference: understanding AIC and BIC in model selection.
         | Sociological methods & research, 33(2), 261-304.
        
         | crazygringo wrote:
         | > _Yes, individual likelihoods are so small, that yes even a
         | MLE solution is extremely unlikely to be correct._
         | 
         | Can you elaborate? An MLE is never going to come up with the
         | _exact_ parameters that produced the samples, but in the
          | original example, as long as you know it's a normal
         | distribution, MLE is probably going to come up with a mean
         | between 4 and 6 and a SD within a similar range as well (I
         | haven't calculated it, just eyeballing it) -- when the original
         | parameters were 5 and 5.
         | 
         | I guess I don't know what you mean by "correct", but that's as
         | correct as you can get, based on just 50 samples.
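          | 
          | A quick simulation bears that out (Python, scipy assumed): for
          | a normal model the MLE is just the sample mean and the (1/n)
          | sample standard deviation, and with 50 draws from N(5, 5) it
          | usually lands near the truth.
          | 
          |     import numpy as np
          |     from scipy.stats import norm
          | 
          |     np.random.seed(42)
          |     sample = norm.rvs(loc=5, scale=5, size=50)
          | 
          |     mu_hat = sample.mean()           # MLE of the mean
          |     sigma_hat = sample.std(ddof=0)   # MLE of the SD uses 1/n, not 1/(n-1)
          |     print(mu_hat, sigma_hat)         # typically within about +/- 1 of 5 and 5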
        
           | fergal_reid wrote:
           | Right - I think this is what's at the heart of the original
           | question.
           | 
           | I know they asked with a continuous example, but I don't
           | interpret their question as limited to continuous cases, and
           | I think it's easier to address using a discrete example, as
           | we avoid the issue of each exact parameter having
           | infinitesimal mass which occurs in a continuous setting.
           | 
           | Let's imagine the parameter we're trying to estimate is
           | discrete and has, say, 500 different possible values.
           | 
           | Let's say the parameter can have the value of the integers
           | between 1 and 500 and most of the mass is clustered in the
           | middle between 230 and 270.
           | 
           | Given some data, it would actually be possible that MLE would
           | come up with the exact value, say 250.
           | 
           | But maybe given the data, a range of values between 240 and
            | 260 are also very plausible, so the exact value 250 still
            | has a fairly low probability.
           | 
           | The original poster is confused, because they are basically
           | saying, well, if the actual probability is so low, why is
           | this MLE stuff useful?
           | 
           | You are pointing out they should really frame things in terms
           | of a range and not a point estimate. You are right; but I
           | think their question is still legitimate, because often in
           | practice we do not give a range, and just give the maximum
           | likelihood estimate of the parameter. (And also, separately,
           | in a discrete parameter setting, specific parameter value
           | could have substantial mass.)
           | 
           | So why is the MLE useful?
           | 
           | My answer would be, well, that's because for many posterior
           | distributions, a lot of the probability mass will be near the
           | MLE, if not exactly at it - so knowing the MLE is often
           | useful, even if the probability of that exact value of the
           | parameter is low.
        
         | TobyTheCamel wrote:
         | > However, the idea is that often a lot of the probability mass
         | - an amount that is not small - will be concentrated around the
         | maximum likelihood estimate, and so that's why it makes a good
         | estimate, and worth using.
         | 
         | This may be true for low dimensions but doesn't generalise to
         | high dimensions. Consider a 100-dimensional standard normal
         | distribution for example. The MLE will still be at the origin
          | but most of the mass will live in a thin shell at a distance of
          | roughly 10 units from the origin.
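          | 
          | Quick check of that shell (Python, numpy assumed): the density
          | peaks at the origin, but the radius of a random draw
          | concentrates around sqrt(100) = 10.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.standard_normal(size=(100000, 100))  # 100k draws, 100 dimensions
          |     r = np.linalg.norm(x, axis=1)
          |     print(r.mean(), r.std())   # roughly 10 and 0.7
          |     print((r < 5).mean())      # essentially no mass near the origin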
        
           | blt wrote:
            | I think the "mass" they are referring to might be the mass of
           | the Bayesian posterior in parameter space, not the mass of
           | the data in event space.
        
             | fergal_reid wrote:
             | Yes, in parameter space.
             | 
             | However, TobyTheCamel's point is valid in that there are
             | some parameter spaces where the MLE is going to be much
             | less useful than others.
             | 
             | Even without having to go to high dimensions, if you've got
             | a posterior that looks like a normal distribution, the MLE
              | is going to tell you a lot, whereas if it's a multimodal
              | distribution with a lot of mass scattered around, knowing
              | the MLE is much less informative.
             | 
             | But this is a complex topic to address in general, so I'm
             | trying to stick to what I see as the intuition behind the
             | original question!
        
           | lupire wrote:
           | Concentration of mass is _density_. A shell is not dense.
           | 
           | If I am looking for a needle in a hyperhaystack, it's not
           | important to know that it's more likely to be "somewhere on
           | the huge hyperboundary" than "in the center hypercubic inch".
        
         | agnosticmantis wrote:
         | > However, the idea is that often a lot of the probability mass
         | - an amount that is not small - will be concentrated around the
         | maximum likelihood estimate, and so that's why it makes a good
         | estimate, and worth using.
         | 
         | This is a Bayesian point of view. The other answers are more
         | frequentist, pointing out that likelihood at a parameter theta
         | is NOT the probability of theta being the true parameter (given
         | data). So we can't and don't interpret it like a probability.
        
           | klipt wrote:
           | Given enough data, Bayesian and frequentist models tend to
           | converge to the same answer anyway.
           | 
           | Bayesian priors have similar effect to regularization (e.g.
           | ridge regression / penalizing large parameter values).
        
       | praptak wrote:
       | I don't have a problem with very small probabilities as long as
       | they stay within math and kind of "cancel out".
       | 
       | What I do have a problem with is lack of conceptual framework for
       | dealing with small probabilities of real life events.
       | 
        | For example, what amount of effort is appropriate to prevent a
        | one-time event which kills you with a probability of, say, 1 in
        | ten thousand?
        
         | sebzim4500 wrote:
         | Quite a lot. See airbags, seat belts, etc.
        
         | nsomaru wrote:
         | One simple way of quantifying this is (amt/cost of harm) *
         | (risk of occurrence)
        
         | airstrike wrote:
         | _> For example, what amount of effort is appropriate to prevent
         | a one time event which kills you with say 1 in ten thousand
         | times?_
         | 
         | if you value being killed at a massively negative value, then
         | 1/10,000 times that value is still a massively negative value,
         | so the answer is "a huge amount of effort"
        
           | samatman wrote:
           | For some value of "massive", sure. But for any value of
           | massive, it's 1/10,000th that value. Then you factor in the
           | value derived from taking that risk, and there's your choice.
           | 
           | The reality is that I don't expend huge amounts of effort
           | avoiding tail risks, and you don't either. You might for the
           | ones you're explicitly aware of, but a risk is a risk whether
           | or not you know you're taking it.
        
       | hedora wrote:
       | How I'd put it:
       | 
       | They are useful because the integral of the likelihoods is not
       | infinitesimal.
       | 
       | The probability that your yard stick measures 1.0000000000000
       | yards is basically zero, but the probability that it's within one
       | inch of that is close to one.
       | 
       | We generally prefer to use probability density functions with the
       | property that most of the probability density is close to the
       | maximum likelihood.
       | 
        | So, in the yard stick example, the yard stick lengths are
        | probably Gaussian, so if you check enough lengths, you'll get a
        | mean (== length with the maximum likelihood) that approaches
        | 1.00000000000 yards (you'd hope) with some small standard
        | deviation (probably less than an inch).
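        | 
        | Putting rough numbers on that (Python, scipy assumed; the one
        | inch tolerance and the 0.2 inch standard deviation are made-up
        | values):
        | 
        |     from scipy.stats import norm
        | 
        |     sd = 0.2 / 36   # 0.2 inches, expressed in yards
        |     # Density at exactly 1 yard (not a probability; P(exactly 1.000...) = 0):
        |     print(norm.pdf(1.0, loc=1.0, scale=sd))
        |     # Probability the stick is within one inch of a yard: ~1.
        |     within = (norm.cdf(1 + 1/36, loc=1.0, scale=sd)
        |               - norm.cdf(1 - 1/36, loc=1.0, scale=sd))
        |     print(within)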
        
       | hopfenspergerj wrote:
       | You work with probability density functions because the
       | probability of observing any given value in a continuum is zero.
       | Density functions may be reasonable to work with if they have
       | some nice properties (continuity, unimodality, ...) The question
       | and answers here seem to be from people that don't understand
       | calculus.
        
       | ninetyninenine wrote:
       | In a continuous distribution the probability of any number on
       | that distribution being generated is effectively zero. If R was
       | generating the true probabilities it should give you zero for
       | every single number.
       | 
       | Think about it. That distribution is continuous over an infinite
       | amount of numbers. If you select any number the chances of that
       | number being generated will be essentially zero. According to the
       | theory there is no possibility for any number on the distribution
       | to be generated. This is correct.
       | 
        | Yet when you use the random number generator you get a number,
        | even though that number technically has zero probability of
        | occurring. Does this mean there is a flaw in the theory when
        | applied to the generated number?
       | 
       | Yes it does. The theory is an approximation of what's going on
       | itself. No random number generator in a computer is selecting a
       | number from a truly continuous set of numbers. It is selecting it
       | from a finite set of numbers from all available numbers in a
       | floating point specification.
       | 
       | Even if it's not a computer when you select a random number by
       | intuition from a continuous distribution you are not doing it
       | randomly.
       | 
       | Think about it. Pick a random number between 0 and 1. I pick
       | 0.343445434. This selection is far from random. It is biased
       | because there is an infinite amount of significant figures yet I
       | arbitrarily don't go past a certain amount. I cut off at 9
       | sigfigs and bias towards a cutoff like that because picking a
       | random number with say 6000 sigfigs is just too inconvenient. You
       | really need to account for infinite sigfigs for the number to be
       | truly random which is impossible.
       | 
       | So even when you pick numbers randomly you are actually picking
       | from a finite set.
       | 
       | In fact I can't think of anything in reality that can truly be
       | accurately described with a continuous distribution. Nothing is
       | in the end truly continuous. Or maybe it does exist, but if it
       | does exist how can we even confirm it? We can't verify anything
       | in reality to a level of infinite sig figs.
       | 
       | If R was accurately calculating likelihood it should give you
       | zero for each number. And the random number generator should not
        | even be able to exist, as how do you even create a pool of infinite
       | possibilities to select from? Likely R is giving some probability
       | over a small interval of numbers.
       | 
       | That's where the practicality of the continuous distribution
       | makes sense when you measure the probability of a range of
       | values. You get a solid number in this case.
       | 
       | Anyway the above explanation is probably too deep. A more
       | practical way of thinking about this is like this:
       | 
       | It is unlikely for any one person to win the lottery. Yet someone
       | always wins. The probability of someone winning is 100 percent.
       | The probability of a specific someone winning is 1 over the total
       | number of people playing.
       | 
       | Improbable events in the universe happen all the time because
        | that's all that's available. It's highly improbable for any one
       | person to win the lottery but if someone has to win, then there
       | is a 100 percent chance that an arbitrary improbable event will
       | occur.
       | 
       | This is more easily seen in a uniform discrete distribution
        | rather than in the normal continuous distribution.
       | 
       | In the case of the normal distribution it is confusing. In a
        | normal distribution it is far more likely for an improbable event
        | to occur than it is for the single most probable event to occur.
       | 
       | Think of it like this. I have a raffle. There are 2 billion
       | participants. Each person has one ticket in the bag, except me. I
        | have 100,000 tickets in the bag, so I am the most likely person
        | to win.
       | 
       | But it is still far more likely for anyone else but me to win
       | even when I am the most likely person to win. An arbitrary
        | improbable event is more likely to occur than the single most
       | probable event.
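        | 
        | The raffle arithmetic, spelled out (a rough back-of-the-envelope
        | in Python):
        | 
        |     others = 2_000_000_000 - 1   # everyone else holds one ticket each
        |     mine = 100_000               # my tickets
        |     total = others + mine
        | 
        |     print(mine / total)    # ~5e-5: I'm the single most likely winner...
        |     print(others / total)  # ~0.99995: ...but "someone else wins" is near certain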
        
         | ttoinou wrote:
          | > In a continuous distribution the probability of any number on
          | > that distribution being generated is effectively zero
          | 
          | Yes, almost by definition, no? You can only know the
          | probability that it would be within a range of numbers, by
          | integrating over the distribution.
        
         | 331c8c71 wrote:
         | > If R was accurately calculating likelihood it should give you
         | zero for each number.
         | 
         | You have some good points but this is false. The probability of
         | any point for a continuous distribution is indeed zero. That
         | doesn't mean that the density at this point is also zero.
        
       | richrichie wrote:
       | For infinite probability spaces, likelihood has no interpretable
       | meaning akin to probability in the case of finite spaces. This
       | can be a source of confusion.
        
       | ultra_nick wrote:
       | In a storm, the likelihood of A raindrop hitting your computer
       | dwarfs the likelihood of THE raindrops[5927] hitting your
       | computer.
       | 
       | Hence why using A and THE is important.
        
       | leni536 wrote:
        | I think the correct answer is that it is mostly bogus, but
        | likelihood based statistical methods mostly work for well-
        | behaved distributions, especially for Gaussians.
        | 
        | Maximum likelihood estimation has some weird cases when the
        | distribution is not "well behaved".
        
       | data-ottawa wrote:
       | Probability is the probability mass distributed over your data
       | with fixed parameters, and likelihood is mass distributed over
       | your model parameters with fixed data. The absolute most
       | important thing to know about likelihood is that it is not a
        | measure of probability, even though it looks a lot like
       | probability.
       | 
       | If I look at coin flip data, I know the data comes from a coin
       | flip, but any specific count of heads vs tails becomes less and
       | less likely the more flips we do. So likelihood being small tells
       | us nothing on its own.
       | 
        | The value of likelihood comes from the framework you use it in. If I
       | wanted to make a best guess at what the balance of the coin is
       | then I could find the maximum of the likelihood over all coin
       | balances to get the most representative version of my model.
       | Similarly, I can compare two specific coin biases and determine
       | which is more likely, but that alone can't tell me anything about
       | the probability of the coin being biased.
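        | 
        | A small sketch of that (Python, scipy assumed, with made-up flip
        | counts): the likelihood of any particular count keeps shrinking
        | as you flip more, but maximizing it over the balance, or
        | comparing two candidate balances, stays meaningful.
        | 
        |     import numpy as np
        |     from scipy.stats import binom
        | 
        |     heads, n = 60, 100                  # hypothetical data: 60 heads in 100 flips
        |     p_grid = np.linspace(0.01, 0.99, 99)
        |     like = binom.pmf(heads, n, p_grid)  # likelihood as a function of the balance
        | 
        |     print(like.max())                   # < 0.1 here, and it shrinks as n grows
        |     print(p_grid[np.argmax(like)])      # ~0.6, the maximum likelihood balance
        |     print(binom.pmf(heads, n, 0.6) / binom.pmf(heads, n, 0.5))  # ~7.5x relative support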
        
       | andsoitis wrote:
        | Relative likelihoods can be used for prioritization.
        
       | hdivider wrote:
       | _One might perhaps say that what is probable
       | 
       | Is that men experience much that is improbable._
       | 
       | - Fragment of a lost play by Agathon, ca. 2,400 years ago, quoted
       | in Aristotle's _The Art of Rhetoric_
        
       | l_e_o_n wrote:
       | You flip a possibly-biased coin 20 times and get half heads, half
       | tails, e.g. "THHHTTTTTHTHTTHHHTHH".
       | 
       | Under the model where the bias is 0.5--a fair coin--the
       | probability of that sequence is (0.5)^20 or about one in a
       | million. In fact, the probability of _any_ sequence you could
       | observe is one in a million.
       | 
        | Under the model where the bias is 0.4 the probability is (0.4)^10
        | x (0.6)^10 or about one in 1.6 million.
        | 
        | That is, the sequence we observed supplies about 1.5 times as
        | much evidence in favor of bias = 0.5 as compared with bias =
        | 0.4--this is likelihood.
       | 
       | Likelihood ratios are all that matter.
       | 
       | Morals:
       | 
       | - The more complex the event you're predicting (the rarer the
        | typical observed result) the smaller the associated likelihoods
       | will tend to be
       | 
       | - It's possible that every observed result has a tiny probability
       | under every model you're considering
       | 
       | - Nonetheless it makes sense to use the ratios of these numbers
       | to compare the models
       | 
       | - This has nothing to do with probability densities or
       | logarithms, though the fact that we often work with densities
       | also makes absolute likelihood values relative to the choice of
       | units
       | 
       | Added in edits:
       | 
       | - You could summarize the sequence with the number of heads or
       | tails and then the likelihood values would be larger but the
       | ratios would remain the same (it's a sufficient statistic).
       | Similarly in the CrossValidated question one could summarize the
       | data with the mean and sum of squares. But this doesn't work in
       | general, e.g. if we have i.i.d. draws from a Cauchy distribution.
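        | 
        | For concreteness, the numbers above computed in Python: the
        | full-sequence likelihoods and the heads-count likelihoods differ
        | by the same C(20, 10) factor, so the ratio is identical.
        | 
        |     from math import comb
        | 
        |     seq_fair = 0.5**20               # ~9.5e-7
        |     seq_biased = 0.4**10 * 0.6**10   # ~6.3e-7
        |     print(seq_fair / seq_biased)     # ~1.5, the likelihood ratio
        | 
        |     # Summarizing by the number of heads scales both by C(20, 10):
        |     k = comb(20, 10)
        |     print(k * seq_fair, k * seq_biased)        # ~0.18 and ~0.12
        |     print((k * seq_fair) / (k * seq_biased))   # same ~1.5 ratio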
        
         | qup wrote:
         | > half heads, half tails.
         | 
         | > Under the model where the bias is 0.5--a fair coin--the
         | probability of that outcome is (0.5)^20 or about one in a
         | million.
         | 
         | No.
         | 
         | Edit: someone downvoted, ha. It's closer to 1 in 6.
        
           | cj wrote:
           | 1 in a million is the probability of correctly predicting a
           | unique sequence of 20 coin flips, in the exact order. (E.g.
           | first 10 flips heads, 2nd 10 flips tails, in that order - 1
           | in a million)
        
           | l_e_o_n wrote:
           | Edited for clarity.
        
           | BeetleB wrote:
           | I'm surprised people are conflating the Binomial distribution
           | with OP's statement. He is talking about _one_ specific
            | outcome of half heads/half tails (where order matters).
           | There is exactly one way to get that outcome.
        
         | b10nic wrote:
         | Not quite; the probability of n/2 successes in n trials is
         | given as Binomial(n,p) not p^n. p^n is correct for a single
          | sequence but there are many possible sequences that result in
          | half heads, half tails and so you have a factor of "N choose X"
          | or the so called "Binomial Coefficient".
          | 
          | > (0.4)^20 x (0.6)^20
          | 
          | and I think you mean (0.4)^10 x (0.6)^10 or more generally
          | p^x * (1-p)^(n-x).
        
           | l_e_o_n wrote:
           | I'm talking about the whole sequence; you're talking about
           | the number of heads (or) tails in the sequence.
           | 
           | The number of heads is a sufficient statistic, so we'll get
           | the same likelihood ratios out, but the likelihood values
           | themselves will be larger.
           | 
           | You could make a similar point about the original
           | CrossValidated Normal(0, 1)^N example by summarizing the data
           | with the mean and sum of squares.
           | 
           | This doesn't work if the data were Cauchy(0, 1)^N instead.
        
       | kgwgk wrote:
       | > Thus, it appears to be very unlikely in a certain sense that
       | these numbers came from the very distribution they were generated
       | from.
       | 
       | "It appears to be very unlikely in a certain sense that this
       | comment is written in English. Yes, this sequence of characters
       | is much more likely in English than in French. But you can't even
       | fathom how unlikely it was to be ever written in English!"
        
       | hsnewman wrote:
       | Because the basis of all reality is quantum physics
        
       | lukego wrote:
       | Likelihoods aren't fundamentally small.
       | 
       | The center of a normal distribution has high likelihood (e.g.
       | 1000000) if the standard deviation is small or low likelihood if
       | the standard deviation is large (e.g. 1/1000000.)
       | 
       | This effect is amplified when you are working with products of
       | likelihoods. They can be infinitesimal or astronomical.
       | 
       | Giant likelihoods really surprised me the first time I
       | experienced them but they're not uncommon when you work with
       | synthetic test data in high dimensions and/or small scales.
       | 
       | They still integrate to the same magnitude because the higher
       | likelihood values are spread over shorter spans.
        
         | jdhwosnhw wrote:
         | Another issue is that likelihoods associated with continuous
         | distributions very often have units. You can't meaningfully
         | assign "magnitude" to a quantity with units. You can always
         | change your unit system to make the likelihood of, say, a
         | particular height in a population of humans to be arbitrarily
         | large.
        
       | mathisd wrote:
        | I don't understand why the maximum of the likelihood is not zero
        | in the example given. Isn't P(X = x | theta = theta_0) always
        | zero for continuous distributions?
        
         | knightoffaith wrote:
         | The actual probability is 0, but the probability density is not
         | 0. Same reason why the probability that I pick 0.5 from a
         | uniform distribution from 0 to 1 is 0, but the value of the
         | probability density function of the distribution at 0.5 is 1.
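          | 
          | In code (Python, scipy assumed):
          | 
          |     from scipy.stats import uniform
          | 
          |     u = uniform(loc=0, scale=1)     # uniform on [0, 1]
          |     print(u.pdf(0.5))               # 1.0: density at the point
          |     print(u.cdf(0.5) - u.cdf(0.5))  # 0.0: probability of exactly 0.5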
        
           | cubefox wrote:
           | What is this point value then measuring? A literal "density"
           | doesn't seem plausible either, as points arguably do not have
           | any "density".
        
       | hinkley wrote:
       | The weirdest "likelihood" conversation I ever had, the putative
       | team lead didn't want to change priorities to fix a bug because,
       | "how often does that happen?"
       | 
       | My reply was, "it happens to every user, the first time they use
       | the app." And then something about how frequency has nothing to
       | do with it. Every single user was going to encounter this bug.
       | Even if they only used the app once.
       | 
       | I already had a toolbox full of conversations about how bad we
       | are at statistics, but that one opened up a whole new avenue of
       | things to worry about. One that was reinforced by later articles
       | about the uselessness of p95 stats - particularly where 3% of
       | your users are experiencing 100% outage.
       | 
       | But the one that is more apropos to the linked question, vs HN in
       | general, is how people are bad at calculating the probability
       | that "nothing bad happens" when there are fifty low probability
       | things that can go wrong. Especially as the number of
       | opportunities go up.
       | 
       | And the way that, if we do something risky and nothing bad
       | happens, we estimate down the probability of future calamity
       | instead of counting ourselves lucky and backing away.
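        | 
        | (For the "fifty low probability things" case the arithmetic is
        | worth doing once; with made-up numbers in Python, and assuming
        | independence, even 1% risks add up fast:)
        | 
        |     p, n = 0.01, 50
        |     print((1 - p) ** n)      # ~0.605: "nothing bad happens"
        |     print(1 - (1 - p) ** n)  # ~0.395: at least one bad thing happens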
        
         | opportune wrote:
         | I've seen this exact same fallacy happen several times
         | throughout my career, which isn't even very long.
         | 
          | I think in many cases it boils down to some subtype not being
          | identified and evaluated on its own. As in your case, it's
          | especially impactful - and yet, IME, also usually where these
          | kinds of things get improperly prioritized - when it's a user's
          | first impression, or when it occurs in a way that forces a user
          | to just sit and wait on the other end, since these are often
          | "special" cases with different logic in your application code.
         | 
          | OTOH sometimes users try to do weird/wrong/adversarial shit and so
         | their high failure rate is working as intended. But it pollutes
         | your stats such that it can hide real issues with similar
         | symptoms and skew distributions.
        
         | itronitron wrote:
         | Yeah, and I'd also add that the total # of bugs in an
         | application will always be greater than the total # of 'known'
         | bugs. Tracking down and fixing the oddball bugs usually
         | prevents a larger set of related issues from popping up later.
        
       | fwungy wrote:
       | Yes. Statistics are fragile indicators well beyond the Central
        | Limit Theorem minimal sufficient boundaries. They work pretty
        | well when you have tons of data and run tons of repetitions, but
        | for moderate-sized data and repetitions you need very high
       | certainty levels for statistics to help much.
       | 
       | You can play perfect blackjack and card count at a table with
       | good rules and lose plenty because your advantage is small (< 2%)
       | and your repetitions are limited.
       | 
       | Statistics get even worse when the probabilities are chained
       | because the weakest estimator bounds the rest.
       | 
        | Essentially, if you always follow statistical advice you should do
       | better than average, if you're lucky. There are better heuristics
       | than statistics in most fields of human decision making.
        
       | waldrews wrote:
       | Statistician here. There's a deep idea called the likelihood
       | principle https://en.wikipedia.org/wiki/Likelihood_principle that
       | says all the information we can get from the data about model
       | parameters is contained in the likelihood function.
       | 
       | We're talking about the whole likelihood surface here, not just
       | the single point that's the maximum likelihood estimator. The MLE
       | is a method for choosing a valid point estimator from the
       | likelihood function; it has some good properties, like being
       | consistent (if you have enough data it converges to the truth)
        | and asymptotically efficient (converges to the smallest possible
        | variance) so long as some criteria are met.
       | 
       | But the MLE is not the only choice; for any given model, other
       | procedures can be admissible estimators
       | https://en.wikipedia.org/wiki/Admissible_decision_rule - it's
       | just they also have to be procedures based on the likelihood
       | function. In other words, your procedure doesn't have to be "take
       | the likelihood function and find its maximum" but it has to be
       | "take the likelihood function and... do something sensible with
       | it."
       | 
       | So the MLE is popular in the frequentist world where you have to
       | make the decision rules using the likelihood directly; in the
       | Bayesian world, you take the likelihood and combine it with a
       | prior, to make an actual probability distribution. Then you get
        | things like MAP (mode of the posterior) or the Bayes
       | estimate (expectation of the posterior) - alternatives to MLE
       | that still use the likelihood surface.
       | 
       | Of course this all works only if the underlying probabilistic
       | model is literally true. So, in the machine learning world where
        | the models are judged on usefulness and not
       | expected to reflect mathematical reality, you're allowed to do
       | things inconsistent with likelihood principle, like
       | regularization tricks. In some physics situations (astronomical
       | imaging comes to mind) where the probability model really is
       | governed by the rules of nature, sticking to likelihood principle
       | actually matters.
       | 
       | As to the question of being small, well, the likelihood is the
       | probability (density) of the exact data you observe given
       | parameters. Let's say you know the true parameter (the mean and
       | standard deviation) and you observe a thousand draws from a
       | normal distribution. Of course the probability of observing the
        | very same pattern of a thousand values again is vanishingly
        | small. But if the mean was way different, that pattern would
       | be proportionally even more unlikely. We should only care about
       | relative probabilities. What's the probability that the universe
       | evolved in exactly such a way that your cat will have exactly
       | this fur pattern? Astronomically small. What's the probability
       | that the universe evolved in such a way, and some of that fur
       | ends up on your furniture? Another unimaginably small number. But
       | what's the probability that, in a universe where you and your cat
       | exist as you are, his fur will get everywhere? That's pretty much
       | a certainty.
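        | 
        | A toy illustration of "take the likelihood function and do
        | something sensible with it" (Python, scipy assumed; the data and
        | the Beta(2, 2) prior are made up): MLE, MAP and the posterior
        | mean for a coin's bias.
        | 
        |     import numpy as np
        |     from scipy.stats import binom, beta
        | 
        |     heads, n = 7, 10
        |     p = np.linspace(0.001, 0.999, 999)
        | 
        |     like = binom.pmf(heads, n, p)   # the likelihood surface over the parameter
        |     prior = beta.pdf(p, 2, 2)       # an example prior
        |     post = like * prior
        |     post /= post.sum()              # normalize over the grid
        | 
        |     print(p[np.argmax(like)])       # MLE ~ 0.70
        |     print(p[np.argmax(post)])       # MAP ~ 0.67, pulled toward the prior
        |     print(np.sum(p * post))         # posterior mean ~ 0.64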
        
       | aj7 wrote:
       | Because it is the product of the likelihood and the consequence,
       | not just the likelihood that matters.
        
       | stochastimus wrote:
        | As OP noticed, likelihoods often do show up in a comparative
       | context. In that context one is asking which thing or sequence is
       | most likely to occur by chance relative to another, under an
       | (over simplistic, sure) IID assumption. In practice, the ordering
       | of such things is often (hand-waving, sure) robust enough that,
       | given no other information than the marginals, it is useful. So I
       | think OP almost answered his/her own question: they are often
       | quite useful in a comparative context and with no additional
       | information.
        
       ___________________________________________________________________
       (page generated 2024-02-18 23:00 UTC)