[HN Gopher] Lies. Damned Lies. P-value thresholds
___________________________________________________________________
Lies. Damned Lies. P-value thresholds
Author : fghorow
Score : 51 points
Date : 2024-12-28 15:39 UTC (4 days ago)
(HTM) web link (www.newyorker.com)
(TXT) w3m dump (www.newyorker.com)
| brudgers wrote:
| Previous discussion under original title,
| https://news.ycombinator.com/item?id=20978134
| chkgk wrote:
| http://archive.today/yju8K
| sdwr wrote:
| This piece is basically set up to fail. It's historical, based on
| math, and doesn't venture towards any drama or stakes. I can
| easily imagine it being a farticle on a third-rate SEO webzine.
|
| So why does it work okay as a New Yorker piece? How is their
| writing consistently good?
|
| I think their secret sauce is the (implied) perspective. The
| impression that the author has a unique, complete, accurate take
| on the subject, and is letting you in on it piece by piece, in a
| meandering way.
| tucnak wrote:
| You're exactly right: what you refer to as "perspective" is
| easily faked.
| martindale wrote:
| Astroturf on HN. Never thought I'd see the day.
| moonlion_eth wrote:
| P-values are why LLMs hallucinate
| ellisv wrote:
| Would you elaborate?
| svnt wrote:
| If I take it at face value, they seem to be blaming LLM
| hallucinations on questionable training data from published
| science done badly while focused on p-values.
| klysm wrote:
| I have no idea how this could possibly be true
| zrm wrote:
| LLMs are essentially predicting what token could plausibly
| come next. They sort the possible next tokens by probability,
| often throw out the ones with very low probabilities
| (p-value too low) and then use a weighted random number
| generator to choose one from the rest.
|
| Sometimes that means you exclude a good next token, or
| include a bad one, so a bad one gets chosen. And then once it
| has, the thing is going to pick whatever is most likely to
| come after that, which will be some malarkey because it has
| already emitted a nonsense token and is now using that as
| context. But whatever it is, it will still sound plausible
| because it's still choosing from the most likely things to
| follow what has already been emitted.
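|
| A minimal sketch of that loop in Python (illustrative
| only: the tokens and logits below are made up, and real
| samplers differ in detail):
|
|     import math, random
|
|     def sample_next_token(logits, temperature=0.8, top_p=0.9):
|         # Softmax over temperature-scaled logits.
|         scaled = {t: v / temperature for t, v in logits.items()}
|         m = max(scaled.values())
|         exps = {t: math.exp(s - m) for t, s in scaled.items()}
|         total = sum(exps.values())
|         probs = {t: x / total for t, x in exps.items()}
|
|         # Keep the most probable tokens whose cumulative
|         # probability reaches top_p; drop the rest (this is
|         # the "throw out the low-probability ones" step).
|         kept, cum = [], 0.0
|         for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
|             kept.append((tok, p))
|             cum += p
|             if cum >= top_p:
|                 break
|
|         # Weighted random choice among the survivors.
|         toks, weights = zip(*kept)
|         return random.choices(toks, weights=weights, k=1)[0]
|
|     # Made-up next-token logits, purely for illustration.
|     print(sample_next_token({"Paris": 5.0, "London": 3.0,
|                              "Oslo": 2.5, "banana": -4.0}))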
| seanhunter wrote:
| A p-value isn't just any old probability; it has a specific
| meaning related to hypothesis testing[1]. A p-value is the
| conditional probability of seeing a result at least as
| extreme as some observation _under the null hypothesis_.
|
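| In symbols, for a test statistic T with observed value
| t_obs (one-sided): p = Pr(T >= t_obs | H_0).
|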
| Yes, LLMs generate tokens using a stochastic process, so it
| is probabilistic. Everyone knows that, but in the normal
| process of generating text, LLMs aren't doing a hypothesis
| test so _by definition_ p-values are completely irrelevant
| to how LLMs hallucinate.
|
| [1] https://math.libretexts.org/Courses/Queens_College/Intr
| oduct...
| seanhunter wrote:
| This is such a common thing to misunderstand that I'm
| going to respond to my own message to give an
| explanation, because many of the links I've found make
| sense once you know the lingo etc but might not before
| then.
|
| Say you go into a bar and just by chance there is a
| football[1] match on television between Denmark and
| France. You see a bunch of fans of each country and you
| think "Hey, the Danes look taller than the French". You
| want to find out whether this is true in general, so to
| test this hypothesis you persuade them during a lull in
| the match to line up and get measured. As luck would have
| it there are exactly n people from each country.
|
| H_0 (the null hypothesis) is that the two population
| means are the same. That is, that Danish people have the
| same average height as French people.
|
| H_1 (the alternative hypothesis) is that Danish people
| are taller on average than French people (ie the
| population mean is larger).
|
| So you take the average height and see that the Danes in
| this bar are say 5cm taller on average than the French
| people in this bar.
|
| The p-value is how likely it would be to select a random
| sample of n people from each of two populations (one from
| Danes, one from French people) with an average height of
| the Danes in the sample being at least 5cm larger than
| the French _if the actual average height of the
| underlying populations you sampled from (all Danes and
| all French people) were the same_.
|
| The way you typically use a p-value is: if it is smaller
| than some threshold (the significance level, often 0.05),
| you "reject the null hypothesis" at that level. So in this
| case, if the p-value was small enough, you would conclude
| that the population means are unlikely to be the same.
|
| Actually calculating the p-value depends a bit on the
| assumed distribution and so on, but that's what a p-value
| is. As you can see, it's not just any probability.
|
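| For the curious, here is that test sketched in Python with
| made-up numbers (scipy's Welch t-test; the heights are
| invented, just to show where the p-value comes from):
|
|     import numpy as np
|     from scipy import stats
|
|     rng = np.random.default_rng(0)
|
|     # Hypothetical samples standing in for the fans.
|     danes = rng.normal(loc=181, scale=7, size=30)   # cm
|     french = rng.normal(loc=176, scale=7, size=30)  # cm
|
|     # One-sided Welch t-test of H_1: mean(danes) > mean(french).
|     t_stat, p_value = stats.ttest_ind(
|         danes, french, equal_var=False, alternative="greater")
|
|     print(f"difference of means: "
|           f"{danes.mean() - french.mean():.1f} cm")
|     print(f"p-value under H_0 (equal means): {p_value:.4f}")
|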
| [1] soccer, if you're from the US
| seanhunter wrote:
| I can tell by your response that you are burdened by
| understanding at least one of: a) what a p-value is,
| b) what an "LLM hallucination" means, or c) what actually
| causes LLMs to hallucinate.
|
| If you set yourself free from the meaning of all the nouns in
| the sentence then you can get there.
| klysm wrote:
| Okay, if I forget what p-value means and just take it to
| mean probability, I guess I can see the point? It's still
| wrong though.
| seanhunter wrote:
| Yes, it is still wrong even if you just think
| "probability", not "p-value".
|
| For people who don't believe me, spin up your LLM of
| choice using the API in your favourite language[1] and
| make some query using a temperature of zero. You will
| find that if you repeat the query multiple times you always
| get the same response. That is because it is always giving
| you the highest-weighted result in the transformer output,
| whereas if you set a non-zero temperature (or use the
| default chat frontend) it does a weighted random sample.
|
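| For example, with the OpenAI Python client (the model name
| is a placeholder; any provider whose API exposes a
| temperature parameter works the same way):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     def ask(prompt, temperature):
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",  # placeholder model name
|             messages=[{"role": "user", "content": prompt}],
|             temperature=temperature,
|         )
|         return resp.choices[0].message.content
|
|     # At temperature=0, repeated calls return (essentially)
|     # the same text; raise it and the answers start to vary.
|     for _ in range(3):
|         print(ask("Define a p-value in one sentence.", 0))
|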
| So there is no probabilistic variance between responses
| with temperature set to zero for a given model, but you
| will nonetheless find that you can get the LLM to
| hallucinate. One way I've found to get LLMs to frequently
| hallucinate is to ask the difference between two concepts
| that are actually the same (e.g. Gemini gave me a very
| convincing-looking but totally wrong "explanation" of the
| difference between a linear map and a linear
| transformation in linear algebra [2]).
|
| Therefore the probabilistic nature of a normal LLM
| response cannot be the reason for hallucination, because
| when we turn that off we find we still get
| hallucinations.
|
| The real reason that LLMs hallucinate is more mundane and
| yet more fundamental: hallucinating (in the normal sense
| of the word) is actually all that LLMs do. This is what
| Karpathy is talking about when he says that LLMs "dream
| documents". We just specifically call it "hallucination"
| when the results are somehow undesirable, typically
| because they don't correspond with some particular facts
| we would like the model's output to be grounded in.
|
| But LLMs don't have any sort of model of the world; they
| have weights which are a lossy compression of their raw
| training data, so in response to some prompt they give
| the response that, as learned during instruction fine-
| tuning, minimizes whatever loss function was used for that
| fine-tuning process. That's all. When we use words like
| "hallucination" we are in danger of anthropomorphising
| the model and using our reasoning process to try to back
| into how the model actually works.
|
| [1] You need to use the programming API rather than the
| usual web frontend to set the temperature parameter.
|
| [2] For the curious, it more or less said that for one of
| them (I forget which) you could move the origin so turned
| it into an affine transformation, but it mangled the
| underlying maths further. The evidence has fallen out of
| my Gemini history so I can't share it, but that sort of
| approach has been fruitful in the past. Neither ChatGPT
| nor Claude falls for that specific example, fwiw.
| BlueTemplar wrote:
| While I like to call out people for using "hallucinate"
| for this kind of behavior too (for language models, at
| least; it might actually be appropriate for visual
| models?) -
|
| > One way I've found to get LLMs to frequently
| hallucinate is to ask the difference between two concepts
| that are actually the same
|
| - this only confirms my belief that "bullshitting" is an
| appropriate term to use for this behavior: doesn't
| exactly the same thing happen with (not savvy enough)
| human students?
|
| You call it "anthropomorphizing" and "not having a model
| of the world", but isn't it more like forcing a model of
| the world on the student / language model by the way that
| you frame the question?
|
| (Interestingly, there might be a parallel here with the
| article: with the language model not being a real
| student, but a statistical average over all students,
| including being "with one breast and one testicle".)
| seanhunter wrote:
| Yes actually I think you're right.
| mnky9800n wrote:
| I recommend setting yourself free of all nouns in general.
| flobosg wrote:
| (2019)
| Hilift wrote:
| Harold Shipman, a GP in the NHS, killed about one person
| per month for 30 years, and the response was to ask if
| doctors could be
| monitored to discover this earlier. However, the systems they
| conceived "eventually cast suspicion on the innocent". 25 years
| later the NHS is still struggling to answer these questions.
| dccsillag wrote:
| If thresholding of P-values is the issue, E-values -- a recent,
| much more elegant, easier to work with, and more robust
| alternative to P-values -- solve this.
|
| https://arxiv.org/abs/2312.08040
| https://arxiv.org/abs/2205.00901
| https://arxiv.org/abs/2210.01948
| https://arxiv.org/abs/2410.23614
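|
| As a toy illustration of the idea (my own sketch, not from
| the papers above): for a simple null hypothesis, a
| likelihood ratio is an e-value, and by Ville's inequality
| you may reject at level alpha as soon as it reaches
| 1/alpha, no matter when you stop collecting data:
|
|     import random
|
|     random.seed(1)
|
|     # H_0: coin is fair (p = 0.5) vs. H_1: p = 0.6.
|     p0, p1, alpha = 0.5, 0.6, 0.05
|
|     # Under H_0 this running likelihood ratio has expected
|     # value 1 at every step, so the chance it ever reaches
|     # 1/alpha is at most alpha.
|     e = 1.0
|     for n in range(1, 201):
|         heads = random.random() < 0.6  # data really from H_1
|         e *= (p1 if heads else 1 - p1) / (p0 if heads else 1 - p0)
|         if e >= 1 / alpha:
|             print(f"reject H_0 after {n} flips, e = {e:.1f}")
|             break
|     else:
|         print(f"no rejection in 200 flips, e = {e:.2f}")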
___________________________________________________________________
(page generated 2025-01-01 23:02 UTC)