https://stats.stackexchange.com/questions/639548/why-is-everything-based-on-likelihoods-even-though-likelihoods-are-so-small

Why is everything based on likelihoods even though likelihoods are so small?

Asked 20 hours ago · Viewed 23k times · Score 15
Tags: maximum-likelihood, likelihood
Asked by ionojoseph (new contributor)

Suppose I generate some random numbers from a specific normal distribution in R:

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

These numbers look like this:

 [1]  2.1976218  3.8491126 12.7935416  5.3525420  5.6464387 13.5753249  7.3045810 -1.3253062
 [9]  1.5657357  2.7716901 11.1204090  6.7990691  7.0038573  5.5534136  2.2207943 13.9345657
[17]  7.4892524 -4.8330858  8.5067795  2.6360430 -0.3391185  3.9101254 -0.1300222  1.3555439
[25]  1.8748037 -3.4334666  9.1889352  5.7668656 -0.6906847 11.2690746  7.1323211  3.5246426
[33]  9.4756283  9.3906674  9.1079054  8.4432013  7.7695883  4.6904414  3.4701867  3.0976450
[41]  1.5264651  3.9604136 -1.3269818 15.8447798 11.0398100 -0.6155429  2.9855758  2.6667232
[49]  8.8998256  4.5831547

Now, suppose I calculate the likelihood of these numbers under the correct normal distribution:

likelihood <- prod(dnorm(random_numbers, mean = 5, sd = 5))
[1] 9.183016e-65

As we can see, even under the correct distribution, the likelihood is very, very small. Thus, it appears to be very unlikely, in a certain sense, that these numbers came from the very distribution they were generated from. The only consolation is that the likelihood is even smaller under some other distribution, e.g.

> likelihood <- prod(dnorm(random_numbers, mean = 6, sd = 6))
> likelihood
[1] 3.954015e-66

But this seems to me like a moot point: a turtle is faster than a snail, but both animals are slow. Even though the correct likelihood (i.e. mean 5, sd 5) is bigger than the incorrect likelihood (i.e. mean 6, sd 6), both are still so small! So how come in statistics everything is based on likelihoods (e.g. regression estimates, maximum likelihood estimation, etc.) when the evaluated likelihood is always so small, even for the correct distribution?
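For scale, a quick sketch of why such tiny values are unavoidable here: each factor in the product is a density value, and the N(5, 5^2) density can never exceed its peak of 1/(5 * sqrt(2 * pi)), so no 50-point sample can produce a likelihood above roughly 1.2e-55.

# Upper bound on any 50-point likelihood under N(5, 5^2):
# each factor is at most the density's peak value.
peak <- 1 / (5 * sqrt(2 * pi))  # maximum possible density, ~0.0798
peak^50                         # ~1.2e-55 -- even the best conceivable sample is "tiny"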
Comments (5 of 8 shown):

* Dave: Welcome to Cross Validated! Would it help if we normalized the area under a PDF to a googol to inflate these numbers?
* ionojoseph: Isn't that for integration?
* Dave: Yes, but think about how high the PDF y-values (the likelihood values) would be for the area under the PDF to be a googol.
* ionojoseph: I am a bit confused. Integrating a density gives the probability of observing a range of values; the likelihood is for individual points, because the probability of observing an individual point is 0, as I understand it?
* Michael Lew: The probability of observing an exact value from a truly continuous distribution might be zero, but your values are nowhere near exact, as they are expressed to 8 significant figures. The probability of observing a value that rounds to, or is observed to, 8 significant figures is much higher than zero.

3 Answers

Answer 1 (score 9, by ADAM):

The key lies not in the absolute size of the likelihood values but in their relative comparison and in the mathematical principles underlying likelihood-based methods.

The smallness of the likelihood is expected when dealing with continuous distributions and a product of many densities: you are multiplying many numbers that are each less than 1. The utility of likelihoods comes from their comparative nature, not their absolute values. When we compare likelihoods across different sets of parameters, we are asking which parameters make the observed data "most likely" relative to other parameter sets, not looking for a likelihood that says the data are likely in an absolute sense. The scale of the likelihood matters far less than how it changes as the parameters change. This is why many statistical methods, such as maximum likelihood estimation, seek the parameters that maximize the likelihood function: these are the best estimates given the data.

Because likelihood values can be extremely small, in practice statisticians work with the log of the likelihood. This transformation turns products into sums, making the values more manageable and the optimization problems easier to solve, while preserving the location of the maximum.
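A minimal sketch of the computational point (the sample size of 5000 is hypothetical, chosen only to force underflow):

# With enough data, the raw likelihood underflows double precision,
# while the sum of log-densities stays finite and usable.
x <- rnorm(5000, mean = 5, sd = 5)
prod(dnorm(x, mean = 5, sd = 5))             # 0 -- underflows to zero
sum(dnorm(x, mean = 5, sd = 5, log = TRUE))  # roughly -15000, still workable

Applied to the question's data, the comparison on the log scale looks like this: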
set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

# Function to calculate the log-likelihood under a normal distribution
log_likelihood <- function(data, mean, sd) {
  sum(dnorm(data, mean, sd, log = TRUE))
}

# Log-likelihood at the correct parameters
log_likelihood_correct <- log_likelihood(random_numbers, 5, 5)
print(log_likelihood_correct)
[1] -147.4507

# Log-likelihood at incorrect parameters
log_likelihood_incorrect <- log_likelihood(random_numbers, 6, 6)
print(log_likelihood_incorrect)
[1] -150.5959

# Comparison
print(log_likelihood_correct > log_likelihood_incorrect)
[1] TRUE

Answer 2 (score 5, by Durden):

First, as others have mentioned, we usually work with the logarithm of the likelihood function, for various mathematical and computational reasons.

Second, since the likelihood function depends on the data, it is convenient to transform it into a function with a standardized maximum (see Pickles 1986):

$$ R(\theta) = \frac{L(\theta)}{L(\theta^\ast)} \quad \text{where } \theta^\ast = \arg\max_{\theta} L(\theta) $$

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)
max_likelihood <- prod(dnorm(random_numbers, mean = 5, sd = 5))

nonmax_likelihood <- rep(0, 1000)
j <- 1
for (k in seq(0, 10, length.out = 1000)) {
  nonmax_likelihood[j] <- prod(dnorm(random_numbers, mean = k, sd = 5))
  j <- j + 1
}

par(mfrow = c(1, 2))
plot(seq(0, 10, length.out = 1000), nonmax_likelihood / max_likelihood,
     xlab = "Mean", ylab = "Relative likelihood")
plot(seq(0, 10, length.out = 1000), log(nonmax_likelihood) - log(max_likelihood),
     xlab = "Mean", ylab = "Relative log-likelihood")

[Figure: relative likelihood (left) and relative log-likelihood (right) as functions of the mean, both peaking near 5.]

Comments:

* Michael Lew: I would say that the mathematical convenience of using log-likelihood functions is more than counterbalanced by the unintuitiveness introduced by the log scale. In the figures you supplied, the support by the data for means near 5 is much more easily seen in the linear likelihood graph.
* Michael Lew: I would also add that scaling the likelihood function to have unit maximum is possible because likelihoods are only used as ratios. It is also worth noting that you have only varied the mean parameter, whereas the question varied both the mean and the spread. (I mention this only because the OP seems to be new to likelihoods.)
* Michael Hardy: You do not typically work with the logarithm of the likelihood function when multiplying the prior by the likelihood function and then normalizing to get the posterior distribution.

Answer 3 (score 3, by Michael Lew):

I can think of two things that might help you.

First, likelihoods are defined only up to a proportionality factor, and their utility comes from their use in ratios: while they are proportional to the relevant probability, they are not probabilities. That means that if you are uncomfortable with values in the range of $10^{-65}$, you could simply multiply them all by $10^{65}$ without changing the ratios.
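A minimal sketch of that rescaling, reusing the question's data (the factor 1e65 is arbitrary):

l_correct <- prod(dnorm(random_numbers, mean = 5, sd = 5))  # 9.183016e-65
l_wrong   <- prod(dnorm(random_numbers, mean = 6, sd = 6))  # 3.954015e-66
l_correct / l_wrong                    # about 23.2
(l_correct * 1e65) / (l_wrong * 1e65)  # same ratio -- rescaling changes nothing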
Of course, there is no need to do so, as taking the ratio effectively does it for you. The likelihood ratio for the two distributions is about 23 in favour of the (5, 5) distribution over the (6, 6) distribution. That would typically be regarded as fairly strong (but not overwhelming) support by the data, and the statistical model, for the (5, 5) distribution over the (6, 6) distribution.

Second, I usually find a plot of the likelihood as a function of a parameter to be helpful. You have set up the system with two parameters that are effectively 'of interest', so the relevant likelihood function would be three-dimensional and thus awkward: its dimensions are the population mean, the standard deviation, and the likelihood value. It is easier to fix one of those parameters and explore the likelihood as a function of the other, as in the sketch below. My justification for looking at the full likelihood function, rather than a single ratio of two selected points in parameter space, is that it contains more information and allows the data to speak with less distortion.
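A minimal sketch of that fixed-parameter slice, holding the mean at its generating value of 5 and profiling the likelihood over the standard deviation (the grid from 2 to 12 is an arbitrary choice):

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

# Relative likelihood as a function of sd, with the mean fixed at 5
sds <- seq(2, 12, length.out = 200)
lik <- sapply(sds, function(s) prod(dnorm(random_numbers, mean = 5, sd = s)))
plot(sds, lik / max(lik), type = "l",
     xlab = "SD (mean fixed at 5)", ylab = "Relative likelihood")

The curve peaks near the generating value of 5, and dividing by the maximum removes the uncomfortable 1e-65 scale entirely.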