[HN Gopher] Lies. Damned Lies. P-value thresholds
       ___________________________________________________________________
        
       Lies. Damned Lies. P-value thresholds
        
       Author : fghorow
       Score  : 51 points
       Date   : 2024-12-28 15:39 UTC (4 days ago)
        
 (HTM) web link (www.newyorker.com)
 (TXT) w3m dump (www.newyorker.com)
        
       | brudgers wrote:
       | Previous discussion under original title,
       | https://news.ycombinator.com/item?id=20978134
        
       | chkgk wrote:
       | http://archive.today/yju8K
        
       | sdwr wrote:
       | This piece is basically set up to fail. It's historical, based on
       | math, and doesn't venture towards any drama or stakes. I can
       | easily imagine it being a farticle on a third-rate SEO webzine.
       | 
       | So why does it work okay as a New Yorker piece? How is their
       | writing consistently good?
       | 
       | I think their secret sauce is the (implied) perspective. The
       | impression that the author has a unique, complete, accurate take
       | on the subject, and is letting you in on it piece by piece, in a
       | meandering way.
        
         | tucnak wrote:
         | You're exactly right: what you refer to as "perspective" is
         | easily faked.
        
         | martindale wrote:
         | Astroturf on HN. Never thought I'd see the day.
        
       | moonlion_eth wrote:
        | P-values are why LLMs hallucinate
        
         | ellisv wrote:
         | Would you elaborate?
        
           | svnt wrote:
           | If I take it at face value they seem to be blaming LLM
           | hallucinations on questionable training data from published
           | science done badly while focused on p-values.
        
         | klysm wrote:
         | I have no idea how this could possibly be true
        
           | zrm wrote:
           | LLMs are essentially predicting what token could plausibly
           | come next. They sort the possible next tokens by probability,
            | often throw out the ones with very low probabilities
           | (p-value too low) and then use a weighted random number
           | generator to choose one from the rest.
           | 
           | Sometimes that means you exclude a good next token, or
           | include a bad one, so a bad one gets chosen. And then once it
           | has, the thing is going to pick whatever is most likely to
           | come after that, which will be some malarkey because it has
           | already emitted a nonsense token and is now using that as
           | context. But whatever it is, it will still sound plausible
           | because it's still choosing from the most likely things to
           | follow what has already been emitted.
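(A minimal sketch of the sampling loop described above, with a made-up five-word vocabulary and probabilities; real models work over tens of thousands of tokens, but the cutoff-then-weighted-draw logic is the same:)

```python
import random

# Toy next-token distribution (vocabulary and probabilities are made up).
next_token_probs = {
    "Paris": 0.55, "France": 0.25, "Lyon": 0.12,
    "banana": 0.05, "quark": 0.03,
}

def sample_next_token(probs, min_prob=0.10, rng=random):
    """Drop candidates below a probability cutoff, renormalize,
    then draw one survivor by weighted random choice."""
    kept = {t: p for t, p in probs.items() if p >= min_prob}
    total = sum(kept.values())
    tokens = list(kept)
    weights = [kept[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))  # one of Paris / France / Lyon
```

Note that the cutoff itself is the contested move: "banana" is discarded even though the model assigned it some probability, and nothing here guarantees a discarded token wasn't the right one.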
        
             | seanhunter wrote:
              | A p-value isn't just any old probability: it has a specific
             | meaning related to hypothesis testing[1]. A p-value is the
             | conditional probability of seeing a result at least as
             | extreme as some observation _under the null hypothesis_.
             | 
              | Yes, LLMs generate tokens using a stochastic process, so
              | it is probabilistic. Everyone knows that, but in the normal
             | process of generating text, LLMs aren't doing a hypothesis
             | test so _by definition_ p-values are completely irrelevant
             | to how LLMs hallucinate.
             | 
              | [1] https://math.libretexts.org/Courses/Queens_College/Introduct...
        
               | seanhunter wrote:
               | This is such a common thing to misunderstand that I'm
               | going to respond to my own message to give an
               | explanation, because many of the links I've found make
               | sense once you know the lingo etc but might not before
               | then.
               | 
               | Say you go into a bar and just by chance there is a
               | football[1] match on television between Denmark and
               | France. You see a bunch of fans of each country and you
               | think "Hey, the Danes look taller than the French". You
               | want to find out whether this is true in general, so to
               | test this hypothesis you persuade them during a lull in
               | the match to line up and get measured. As luck would have
               | it there are exactly n people from each country.
               | 
               | H_0 (the null hypothesis) is that the two population
               | means are the same. That is, that Danish people have the
               | same average height as French people.
               | 
               | H_1 (the alternative hypothesis) is that Danish people
               | are taller on average than French people (ie the
               | population mean is larger).
               | 
               | So you take the average height and see that the Danes in
               | this bar are say 5cm taller on average than the French
               | people in this bar.
               | 
               | The p-value is how likely it would be to select a random
               | sample of n people from each of two populations (one from
               | Danes, one from French people) with an average height of
               | the Danes in the sample being at least 5cm larger than
               | the French _if the actual average height of the
               | underlying populations you sampled from (all Danes and
               | all French people) were the same_.
               | 
               | How you use a p-value is typically if it is smaller than
               | some threshold called a critical value you "reject the
               | null hypothesis" at some significance level. So in this
               | case if the p-value was small enough you conclude that
               | the population means are unlikely to be the same.
               | 
               | Actually calculating the p-value is going to depend a bit
               | on the distribution etc but that's what a p-value is. As
               | you can see it's not just a probability.
               | 
                | [1] soccer if you're from the US
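(To make that p-value concrete, here is a Monte Carlo version of exactly the question posed above, with made-up numbers: n = 30 fans per side, an observed 5 cm gap in sample means, and a shared Normal(178 cm, sd 7 cm) height distribution playing the role of the null hypothesis:)

```python
import random

random.seed(0)

# Made-up numbers for the bar story: n fans per side and an observed
# 5 cm gap between the two sample means.
n, observed_diff = 30, 5.0

def simulated_p_value(trials=20_000):
    """Monte Carlo p-value: repeatedly draw two samples of n from the
    SAME population (i.e. assume H_0 is true) and count how often the
    Danish sample's mean beats the French sample's mean by >= 5 cm."""
    hits = 0
    for _ in range(trials):
        danes = [random.gauss(178, 7) for _ in range(n)]
        french = [random.gauss(178, 7) for _ in range(n)]
        if sum(danes) / n - sum(french) / n >= observed_diff:
            hits += 1
    return hits / trials

p = simulated_p_value()  # small: a 5 cm gap is rare if H_0 is true
```

With these invented numbers the analytic answer is about 0.003 (the gap in means has standard deviation 7·sqrt(2/30) ≈ 1.8 cm, so 5 cm is roughly 2.8 standard deviations out), comfortably under the 0.05 threshold the article is about.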
        
           | seanhunter wrote:
            | I can tell by your response that you are burdened by
            | understanding at least one of: a) what a p-value is, b) what
            | an "LLM hallucination" means, or c) what actually causes
            | LLMs to hallucinate.
           | 
           | If you set yourself free from the meaning of all the nouns in
           | the sentence then you can get there.
        
             | klysm wrote:
              | Okay, if I forget what p-value means and just take it to
              | mean probability, I guess I can see the point? It's still
              | wrong, though.
        
               | seanhunter wrote:
               | Yes, it is still wrong even if you just think
               | "probability", not "p-value".
               | 
               | For people who don't believe me, spin up your LLM of
               | choice using the API in your favourite language[1] and
               | make some query using a temperature of zero. You will
               | find if you repeat the query multiple times you always
                | get the same response. That is because it is always giving
               | you the highest weighted result in the transformer output
               | whereas if you set a non-zero temperature (or use the
               | default chat frontend) it does a weighted random sample.
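(A toy illustration of the greedy-versus-sampled distinction, using made-up logits for three tokens rather than a real model or API:)

```python
import math
import random

# Toy logits for the next token (values are made up).
logits = {"cat": 2.0, "dog": 1.5, "pelican": 0.2}

def decode(logits, temperature, rng):
    """Greedy pick at temperature zero; otherwise a softmax-weighted
    random sample, sharpened or flattened by the temperature."""
    if temperature == 0:
        # Temperature zero: always the single highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {t: math.exp(v / temperature) for t, v in logits.items()}
    total = sum(scaled.values())
    tokens = list(scaled)
    return rng.choices(tokens, [scaled[t] / total for t in tokens])[0]

rng = random.Random(42)
greedy = [decode(logits, 0, rng) for _ in range(5)]    # identical every call
sampled = [decode(logits, 1.0, rng) for _ in range(5)]  # varies draw to draw
```

At temperature zero every call returns "cat"; only the non-zero-temperature path involves any randomness at all, which is the point of the experiment described above.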
               | 
               | So there is no probabilistic variance between responses
               | with temperature set to zero for a given model, but you
               | will nonetheless find that you can get the LLM to
                | hallucinate. One way I've found to get LLMs to
                | frequently hallucinate is to ask the difference between
                | two concepts that are actually the same (e.g. Gemini gave
                | me a very convincing-looking but totally wrong
                | "explanation" of the difference between a linear map and
                | a linear transformation in linear algebra [2]).
               | 
               | Therefore the probabilistic nature of a normal LLM
               | response can not be the reason for hallucination because
               | when we turn that off we find we still get
               | hallucinations.
               | 
                | The real reason that LLMs hallucinate is more mundane
                | and yet more fundamental: hallucinating (in the normal
                | sense of the word) is actually all that LLMs do. This is
                | what
               | Karpathy is talking about when he says that LLMs "dream
               | documents". We just specifically call it "hallucination"
               | when the results are somehow undesirable, typically
               | because they don't correspond with some particular facts
               | we would like the model's output to be grounded in.
               | 
                | But LLMs don't have any sort of model of the world; they
                | have weights which are a lossy compression of their raw
                | training data, so in response to some prompt they produce
                | whatever output instruction fine-tuning taught them
                | minimizes its loss function. That's all. When we use
                | words like "hallucination" we are in danger of
                | anthropomorphising the model and using our own reasoning
                | process to try to back into how the model actually works.
               | 
               | [1] You need to use the programming API rather than the
               | usual web frontend to set the temperature parameter.
               | 
               | [2] For the curious, it more or less said that for one of
               | them (I forget which) you could move the origin so turned
               | it into an affine transformation, but it mangled the
               | underlying maths further. The evidence has fallen out of
               | my gemini history so I can't share it, but that sort of
                | approach has been fruitful in the past. Neither ChatGPT
                | nor Claude falls for that specific example, FWIW.
        
               | BlueTemplar wrote:
                | While I too like to call out people for using
                | "hallucinate" for this kind of behavior (for language
                | models, at least; it might actually be appropriate for
                | visual models?) --
                | 
                | > One way I've found to get LLMs to frequently
                | hallucinate is to ask the difference between two concepts
                | that are actually the same
                | 
                | -- this only confirms my belief that "bullshitting" is an
                | appropriate term for this behavior: doesn't exactly the
                | same thing happen with (not sufficiently savvy) human
                | students?
               | 
               | You call it "anthropomorphizing", and "not having a model
               | of the world", but isn't it more like forcing a model of
               | the world on the student / language model by the way that
               | you frame the question ?
               | 
               | (Interestingly, there might be a parallel here with the
               | article : with the language model not being a real
               | student, but a statistical average over all students,
               | including being "with one breast and one testicle".)
        
               | seanhunter wrote:
               | Yes actually I think you're right.
        
             | mnky9800n wrote:
             | I recommend setting yourself free of all nouns in general.
        
       | flobosg wrote:
       | (2019)
        
       | Hilift wrote:
        | Harold Shipman, a doctor in the NHS, killed about one person
        | per month for 30 years, and the response was to ask whether
        | doctors could be monitored to catch this earlier. However, the
        | systems they conceived "eventually cast suspicion on the
        | innocent". 25 years later the NHS is still struggling to answer
        | these questions.
        
       | dccsillag wrote:
       | If thresholding of P-values is the issue, E-values -- a recent,
       | much more elegant, easier to work with, and more robust
       | alternative to P-values -- solve this.
       | 
       | https://arxiv.org/abs/2312.08040 https://arxiv.org/abs/2205.00901
       | https://arxiv.org/abs/2210.01948 https://arxiv.org/abs/2410.23614
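(For readers new to e-values, the core definition — standard in this literature, not taken from the specific papers linked above — fits on one line: an e-variable is a nonnegative statistic whose expected value under the null is at most one, so Markov's inequality turns a large e-value into a valid level-α test:)

```latex
E \ge 0, \qquad \mathbb{E}_{H_0}[E] \le 1
\quad\Longrightarrow\quad
\mathbb{P}_{H_0}\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha
```

Unlike p-values, e-values from independent studies can simply be multiplied while preserving this guarantee, which is the sense in which they are more robust to optional continuation.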
        
       ___________________________________________________________________
       (page generated 2025-01-01 23:02 UTC)