[HN Gopher] LLMs can't do probability
       ___________________________________________________________________
        
       LLMs can't do probability
        
       Author : DrRavenstein
       Score  : 126 points
       Date   : 2024-05-01 09:29 UTC (13 hours ago)
        
 (HTM) web link (brainsteam.co.uk)
 (TXT) w3m dump (brainsteam.co.uk)
        
       | Drakim wrote:
       | Humans are notoriously bad at probability as well, and since LLMs
       | are trained on data from humans, it kinda makes sense.
        
         | unshavedyak wrote:
         | I assume somewhat related to this, but humans are also terrible
         | at "random". See the related video about 37 [1].
         | 
         | The more we advance on LLMs the more i am convinced i'm an LLM.
         | :s
         | 
         | [1]: https://www.youtube.com/watch?v=d6iQrh2TK98
        
           | brigadier132 wrote:
           | We are more than LLMs, we have a pretty terrible CPU too. But
           | it's interesting to think, all this positive self
           | reinforcement where you tell yourself "Today's a good day",
           | "I'm amazing", etc, are you just prompting yourself by doing
           | that?
        
             | throwuxiytayq wrote:
             | Kinda, yes. You can do the opposite too (see: negative
             | self-talk).
        
           | zeroonetwothree wrote:
           | Humans are bad at generating random data yes but that video
           | isn't exactly convincing proof of it.
        
             | unshavedyak wrote:
             | Oh i didn't mean it (or anything i said) to be proof.
        
         | brigadier132 wrote:
         | Is it because humans are bad at probability that LLMs are bad
         | at probability or is it something inherent in this kind of
         | statistical inference technique? If you trained an LLM on
         | trillions of random numbers will it become an effective random
         | number generator?
        
           | simonw wrote:
           | In this case being "bad at randomness" isn't because it was
           | trained on text from humans who are bad at randomness, it's
           | because asking a computer system that doesn't have the
           | ability to directly execute a random number generator to
           | produce a random number is never going to be reliable.
        
             | brigadier132 wrote:
             | My question was about the scenario if it was trained on
             | this kind of query with good data.
             | 
             | It would be interesting to see if it could generalize at
             | all. I'm pretty certain if you trained it specifically on
             | 
             | "Generate a random number from 0 to 100" and actually give
             | it a random number from 0 to 100 and give it billions of
             | such examples it would be pretty effective at generating a
             | number from 0 to 100. Wouldn't each token have equal
             | weighted probability of appearing?
        
               | jdiff wrote:
               | Sorta, not really. Neural networks are deterministic in
               | the wrong ways. If you feed them the same input, you'll
               | get the same output. Any variation comes from varying the
               | input or randomly varying your choice from the output.
               | And if you're randomly picking from a list of even
               | probabilities, you're just doing all the heavy lifting of
               | picking a random number yourself, with a bunch of kinda
               | pointless singing and dancing beforehand.
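               | 
               | A minimal sketch of that point, with made-up
               | per-token probabilities (nothing here comes from a
               | real model): the randomness lives entirely in the
               | sampling step, not in the network.
               | 
               |     import random
               | 
               |     # Pretend the network emitted exactly the
               |     # requested weights for the two tokens.
               |     token_probs = {"left": 0.8, "right": 0.2}
               | 
               |     def sample_token():
               |         tokens = list(token_probs)
               |         weights = list(token_probs.values())
               |         return random.choices(tokens, weights)[0]
               | 
               |     draws = [sample_token() for _ in range(1000)]
               |     print(draws.count("left") / len(draws))  # ~0.8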
        
       | sigmoid10 wrote:
       | >You are a weighted random choice generator. About 80% of the
       | time please say 'left' and about 20% of the time say 'right'.
       | Simply reply with left or right. Do not say anything else
       | 
       | I could have told you these results solely based on the
       | methodology combined with this system prompt. No need to spend
       | money on APIs. Randomness in LLMs does not come from the context,
       | it comes from sampling over output tokens the LLM considers
       | likely. Imagine you are in this situation as a human: Someone
       | walks up to you and tells you to say "left" with 80% probability
       | and "right" with 20% probability. You say "left" and then the
       | other person walks away never to be seen again. How do you
       | determine if your own "output" was correct? You would need to
       | sample it many times in the same conversation before anyone could
        | determine whether you understand the basics of probability or not.
       | This is an issue of the author's understanding of Bayesian
       | statistics and possibly a misunderstanding of how LLMs actually
       | work.
       | 
       |  _Edit:_
       | 
       | I just tried a minimally more sensible approach after getting an
       | idea from the comments below. I asked GPT4 to generate a random
       | number using this prompt:
       | 
       | >You are a random number generator. Reply with a number between 0
       | and 10. Only say the number, say nothing else.
       | 
        | It responded with 7. But then I looked at the top logprobs. Sure
       | enough, they contained all the remaining numbers between 0 and
       | 10. The only issue is that "7" got a logprob of -0.008539278,
       | while the next most likely was "4" at -5.5371723, which is
       | significantly lower. The remaining probs were then pretty close
       | to each other. Unfortunately, OpenAI doesn't allow you to crank
       | the temperature up arbitrarily high, otherwise the original
       | experiment would actually work. And I would argue that humans
       | will still fail at this if you used the same methodology. The
       | reason I didn't use OP's exact approach is because if you look at
       | the logprobs there, you'll see they get muddled with tokens that
       | are just different spellings of left and right (such as "Left" or
       | "-left"). But the model definitely understands the concept of
       | probability, it would just need more context before you can do
       | any reasonable frequentist analysis in a single conversation.
       | 
       |  _Edit 2:_
       | 
       | I repeated it with random numbers between 0 and 100. Guess what
       | numbers are coming out among the top logprobs. Pretty much
       | exactly what you'd expect after watching this:
       | https://www.youtube.com/watch?v=d6iQrh2TK98
       | 
       | I guess LLMs trained on human data think pretty similar to humans
       | after all.
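        | 
        | For reference, a rough sketch of how the top logprobs for the
        | first completion token can be read back with the OpenAI Python
        | client (the prompt is the one quoted above; the model name is
        | just a placeholder):
        | 
        |     import math
        |     from openai import OpenAI
        | 
        |     client = OpenAI()  # assumes OPENAI_API_KEY is set
        |     prompt = ("You are a random number generator. Reply "
        |               "with a number between 0 and 10. Only say "
        |               "the number, say nothing else.")
        |     resp = client.chat.completions.create(
        |         model="gpt-4",  # placeholder model name
        |         messages=[{"role": "user", "content": prompt}],
        |         max_tokens=1,
        |         logprobs=True,
        |         top_logprobs=10,
        |     )
        | 
        |     # Candidates for the first generated token
        |     first = resp.choices[0].logprobs.content[0]
        |     for cand in first.top_logprobs:
        |         print(cand.token, cand.logprob, math.exp(cand.logprob))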
        
         | nathan_compton wrote:
         | You're saying that instead the author should have taken the
         | logits of "left" and "right", converted them to normalized
         | probabilities and then have expected _those_ to be 80% left and
         | 20% right. But if this were the case (under some reasonable
         | assumptions about the sampling methodology of the providers)
          | then the author would have _seen_ an 80/20 split. From these
         | results we can probably conclude that with this prompt the
         | predicted probability for "left" is near 100% for GPT4.
         | 
          | I think the author's point stands. They aren't asking "what
          | would you _expect_ from a distribution so described?" The
          | answer to that question is 100% of the time "left". A well
         | behaving LLM responding to the actual question _should_
         | distribute the logits across  "left" and "right" in the way
         | requested by the user and doesn't.
         | 
          | I think if you chose 1000 random people and prompted them with
          | this question you would get more "lefts" than the prompt's
          | 80%, but not 100% left.
        
           | sigmoid10 wrote:
           | >You're saying that instead the author should have taken the
           | logits of "left" and "right", converted them to normalized
           | probabilities and then have expected _those_ to be 80% left
           | and 20% right.
           | 
           | No, that's not what I meant. Although it would still make
           | more sense than what the author did. The problem lies in the
           | way you actually determine probabilities. We know that humans
           | are bad random number generators, but they understand the
           | concept enough to come up with random looking stuff if you
           | give them the chance. The LLMs here were not even given a
           | chance. In essence, the author is complaining that the LLMs
           | are not behaving according to frequentist statistics when he
           | evaluates them in a strictly Bayesian setting.
        
             | nathan_compton wrote:
             | I don't agree: a Bayesian statistician posed the question
             | "You are a weighted random choice generator. About 80% of
             | the time please say 'left' and about 20% of the time say
             | 'right' [...]" would say "left" 80% of the time and "right"
              | 20% of the time. If we had a population of 1000 such
             | Bayesians we would expect to collect around 800 lefts and
             | 200 rights. If we asked the same Bayesian 1000 times we'd
              | expect the same. It's got nothing to do with Bayesian vs
             | Frequentist statistics.
             | 
             | Real humans probably would say left more often than 80% of
             | the time, which is what I guess you're getting at, but the
              | question is very clearly asking the subject to "sample"
              | (an entirely Bayesian activity) from a distribution,
             | not to give the expected value. GPT4 gives the expected
             | value and this is simply wrong.
        
               | sigmoid10 wrote:
               | >GPT4 gives the expected value and this is simply wrong.
               | 
               | Only at T=0. See my edit above how this changes
               | everything.
        
               | nathan_compton wrote:
               | This doesn't really have anything to do with the language
               | model. The temperature only has to do with the _sampling_
               | from the probability distribution which the language
               | model predicts. In fact, raising the temperature would
               | eventually cause the model to _randomly_ print  "left" or
               | "right," (eventually at 50/50 chance) not converge on the
               | actual distribution which the prompt suggests. I suppose
               | if you restricted the logits to just those tokens "left"
               | and "right", softmaxed them, and then tuned the
               | temperature T you might get it to reproduce the correct
               | distribution, but that would be true of a _random_
               | language model as well.
               | 
               | I think it's pretty simple and straightforward: the model
               | simply fails to understand the question and can
               | reasonably be said to not understand probability.
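               | 
               | A minimal sketch of that "restrict to two tokens,
               | softmax, tune T" idea, with made-up logits (nothing
               | here comes from an actual model run):
               | 
               |     import math
               | 
               |     z_left, z_right = 10.0, 4.0   # hypothetical
               | 
               |     def p_left(t):
               |         # softmax over just {left, right} at temp t
               |         gap = (z_left - z_right) / t
               |         return 1.0 / (1.0 + math.exp(-gap))
               | 
               |     # Temperature that makes P(left) = 0.8 for
               |     # *these* logits: gap / t = ln(0.8 / 0.2)
               |     t_star = (z_left - z_right) / math.log(4.0)
               |     print(p_left(1.0))     # ~0.998, nearly all left
               |     print(p_left(t_star))  # 0.8
               | 
               | The required T depends entirely on the logit gap, so
               | it would have to be re-tuned per prompt and per
               | model, which is the "true of a random language model
               | as well" point.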
        
             | nextaccountic wrote:
             | > We know that humans are bad random number generators
             | 
             | This is a good point. LLMs are bad at this, okay, but
             | humans aren't great at it either.
        
               | nathan_compton wrote:
               | But according to this GPT4 is substantially worse.
        
               | DougBTX wrote:
               | Yes, probably. At temperature zero the model will be
               | completely deterministic, so a particular prompt will
               | always produce the same result (ignoring for a second
               | that some fairly common optimisations introduce data
               | races in the GPU).
               | 
               | On the other hand, does it really matter? With a slight
               | tweak to the prompt, ChatGPT generates some serviceable
               | code:
               | 
               |     > Run a function to produce a random number between
               |     > 1 and 10. What is the number?
               | 
               |     import random
               | 
               |     # Generate a random number between 1 and 10
               |     random_number = random.randint(1, 10)
               |     random_number
               | 
               |     The random number generated between 1 and 10 is 9.
        
               | nextaccountic wrote:
               | > (ignoring for a second that some fairly common
               | optimisations introduce data races in the GPU).
               | 
               | Okay so are any GPU compilers intentionally introducing
               | data races in programs that previously exhibited no data
               | races?
        
           | swatcoder wrote:
           | > A well behaving LLM responding to the actual question
           | should distribute the logits across "left" and "right" in the
           | way requested by the user and doesn't.
           | 
           | No, a well-behaving LLM would do exactly what's seen. The
            | most likely next token is "left" and it should
           | deterministically output that unless some other layer like a
           | temperature function makes it non-deterministic in its own
           | way (wholly unrelated to the prompt).
           | 
           | The fantastical AGI precursor that people have been coached
           | into seeing is what you're talking about, and that's (of
           | course) not what an LLM actually is.
           | 
           | This is essentially just one of the easier ways you can
           | expose the parlor trick behind that misconception.
        
             | nathan_compton wrote:
             | This simply doesn't follow. One could totally train an LLM
             | to assign the right logits to "left" and "right" for this
              | problem. I suspect it's a problem with the training data.
        
         | michaelt wrote:
         | _> Randomness in LLMs does not come from the context, it comes
         | from sampling over output tokens the LLM considers likely._
         | 
         | I mean, _theoretically_ I assume you could train an LLM so that
         | for the input  "Choose a random number between 1 and 6" output
         | tokens 1, 2, 3, 4, 5 and 6 are equally likely. Then the
         | sampling process would produce a random number.
         | 
         | Of course, whether you could teach the model to generalise that
         | more broadly is a different matter.
        
       | simonw wrote:
       | This is very unsurprising.
       | 
       | The interesting challenge here is helping people understand _why_
       | asking an LLM to do something 20% of the time is a bad prompt.
       | 
       | I intuitively know that this prompt isn't going to work, but as
       | with so many of these intuitive prompting things I have trouble
       | explaining exactly why I know that.
       | 
       | Aside: If you need a GPT to incorporate randomness in a reliable
       | way you can get it to use Code Interpreter.
        
         | intended wrote:
         | I guess it would be something on these lines?:
         | 
         | To do random number gen, it would have to convert the input
         | text into constraints and then use those constraints to
         | generate additional tokens.
         | 
         | This would, at its core, be a call to calculate a probability
         | function, every time it is releasing the next token. That would
         | mean memory, processing etc. etc.
        
           | 6gvONxR4sf7o wrote:
           | Nope, because all of that is taken care of by the mechanisms
           | for evaluating the model. Strictly speaking, the model
           | outputs a probability distribution. The question is why that
           | distribution doesn't match the instructions.
        
             | intended wrote:
             | I think I maybe get where you are coming from, but still
             | how? I feel we are discussing 2 different use cases.
             | 
             | 1) Prompt 1: " You are a weighted random choice generator.
             | About 80% of the time please say 'left' and about 20% of
             | the time say 'right'. Simply reply with left or right. Do
             | not say anything else" "
             | 
              | 2) Assume that the training data gives examples of:
              |    2.1) single coin flips
              |    2.2) multiple coin flips
             | 
             | Consider a slightly different prompt, prompt 2:
             | 
             | 3) Prompt 2: same as prompt 1, except it presents 1000
             | lefts/rights in the same response (l,l,l,l,r,l,l,l...)
             | 
             | ----
             | 
              | I think what you are describing is prompt 2. I just did a
              | quick test with GPT 4, and i got a 27-3 split when using
              | prompt 2.
              | 
              | However for prompt 1 you get only left. To me this makes
              | sense because running prompt 1 x100 should result in:
              | 
              | Pass 1: LLM receives the prompt and parses it. LLM
              | predicts the next token. The next token should be left.
              | 
              | Pass 2: same as pass 1.
             | 
             | ----
             | 
              | For prompt 1, every prompt submission is a tabula rasa. So
             | it will correctly say left, which is the correct answer for
             | the active universe of valid prompt responses according to
             | the model.
             | 
             | Unless i am reading you wrong and you are saying the model
             | is actually acting as a weighted coin flip.
             | 
              | In theory, the LLM should be more responsive if you ask it
              | to follow a 60:40 or 50:50 split for pass 1. I'll see if I
              | can test this later.
             | 
             | (Heck now I'm more concerned about the cases where it does
             | manage to apply the distribution. )
        
         | SkyBelow wrote:
         | As a once off, with the same context, it giving the same answer
          | doesn't surprise me. What I'm wondering is the behavior when it
         | keeps being asked for another response with the previous
         | responses fed back into it. In this case, a human would see
         | they are doing the 80% 'too much' and decide to do the 20% to
         | balance it out. That isn't actually good and shows they still
         | aren't operating off a random probability, instead they are
         | emulating their perception of what a random probability would
         | look like.
         | 
         | Given this sort of situation to an LLM instead, is the
         | expectation for it to give the most likely answer continuously,
         | to act like a human and try to emulate a probability, or to do
         | something different from either of the two previous options?
         | 
         | Edit: Just tried an attempt with copilot, having it produce a
         | random distribution of two different operations. I had it
         | generate multiple operations, either adding or subtracting 1
         | each, with an 80/20 split. It did four adds, one minus on
         | repeat.
        
         | kelseyfrog wrote:
         | At some point the logits at a branching point in the response
         | need to correspond to the respective probabilities of the
         | requested output classes so that they can be appropriately
         | sampled and strongly condition the remainder of the response.
         | My instinct says this cannot be accomplished irrespective of
          | temperature, but I could be persuaded, with math.
        
           | mtrimpe wrote:
           | Or you can just add some randomness to the prompt by adding
           | "Your random seed is mciifjrbdifnf."
           | 
           | I just tested that and got 4 left and 2 right so it works
           | pretty well.
        
           | lappa wrote:
            | Provided a constant temperature of 1.0, you can train the
            | model on prompts with probabilistic requests, with loss
            | determined by KL divergence. (The expressions below compute
            | the negative of the KL, so values closer to zero indicate a
            | closer match.)
           | 
           | Expectation: 80% left, 20% right
           | 
           | Model sampling probability: 99% left, 1% right
           | 
           | >>> 0.80 * math.log(0.99 / 0.80) + 0.20 * math.log(0.01 /
           | 0.20)
           | 
           | -0.42867188234223175
           | 
           | Model sampling probability: 90% left, 10% right
           | 
           | >>> 0.80 * math.log(0.9 / 0.80) + 0.20 * math.log(0.1 / 0.20)
           | 
           | -0.04440300758688229
           | 
           | Of course, if you change the temperature this will break any
            | probabilistic expectations from training in this manner.
        
       | rsynnott wrote:
       | I mean, given they can't _count_, it would be pretty astonishing
       | were it otherwise.
        
         | dahart wrote:
         | Exactly, it's interesting that 'llms can't x', with lots of
         | effort trying to demonstrate it, comes up so often, when we
         | know from first principles they can't do anything but run a
         | Markov chain on existing words. We've managed to build
         | something that is best at fooling people into thinking it can
         | do things it can't.
        
         | mitthrowaway2 wrote:
         | They can count occurrences of a token. Depending on the
         | tokenization, they can't necessarily count occurrences of a
         | character.
        
       | throwaway598 wrote:
       | If you asked me to be right 80% of the time, I'd probably be
       | right. So I'd be right on average 110% of the time.
        
       | michaelt wrote:
       | Sometimes when you ask chatgpt 4 for a random number it... writes
       | python code to choose a random number, runs it, then tells you
       | the response:
       | https://chat.openai.com/share/a72c2d8c-c44e-4c89-b6bc-b0673c...
       | 
       | One way of doing it, I suppose.
        
         | planede wrote:
         | If you asked a person to give you a random number between 1 and
         | 6, would you accept if they just said a number they just came
         | up with or would you rather they rolled a die for it?
        
           | kube-system wrote:
           | If they didn't already have dice in their hand, I would
           | certainly expect the former.
        
           | tommiegannert wrote:
           | Depending on who you ask, the answer would have been "oh, I
           | have an app for that. Hold on..."
           | 
           | GPT wins for not having that delay.
        
             | 4ndrewl wrote:
             | What a time to be alive
        
           | JKCalhoun wrote:
           | They should turn around and ask me instead for a random
           | number between 1 and 6 and then reply with seven minus that
           | number.
        
           | olddustytrail wrote:
           | It is 4. It's always 4.
        
         | its_ethan wrote:
         | Is it actually running the code it creates? Or does it generate
         | code, and then just output some number it "thinks" is random,
         | but that is not a product of executing any python code?
        
           | bongodongobob wrote:
           | Yes, it runs the code.
        
             | brabel wrote:
             | Couldn't this open people up for remote code execution
             | somehow? Say, someone sends you a message that they know
             | will make you likely to ask an AI a certain question in a
             | certain way... Maybe far-fetched, but I've seen even more
             | far-fetched attacks in real life :D
        
               | kolinko wrote:
               | the code is sandboxed on openai servers. it doesn't run
               | on your machine if you use chatgpt interface
        
               | joquarky wrote:
               | I would assume it can only generate pure functions and/or
               | run in a sandbox.
        
           | Version467 wrote:
           | It's actually running the code. It doesn't run all code it
           | generates. But if you specifically ask it to, then it does.
           | It also has access to a bunch of data visualization libraries
           | if you want it to calculate and plot stuff.
        
             | paulmd wrote:
             | gnuplot my beloved
             | 
             | https://livebook.manning.com/book/gnuplot-in-action-
             | second-e...
        
         | Workaccount2 wrote:
         | Technically speaking, it's the right way to do it.
        
           | pixl97 wrote:
           | Exactly. Only trust random numbers and/or probability via
           | processes that have been vetted to be either (somewhat)
           | random or follow a probabilistic algorithm. Humans are
           | generally terrible at randomness and probability except in
           | cases where they have been well trained, and even then those
           | people would rather run an algorithm.
        
         | Terr_ wrote:
         | Isn't that a case where the interesting behavior is from a new
         | piece someone programmed onto the side of the core LLM
         | functionality?
         | 
         | In other words, it's still true that large _language_ models
         | can 't do probability, so someone put in special logic to have
         | the language model guess at a computer language to do the thing
         | instead.
        
       | usgroup wrote:
       | A consequence of being an auto regressive model is not being able
       | to plan token output. I think the author's example is one of the
       | many corollaries.
       | 
       | You could prompt the LLM differently, for example to write a
       | Python program that does the random part, and then act on its
       | output.
        
         | gwern wrote:
         | > A consequence of being an auto regressive model is not being
         | able to plan token output.
         | 
         | Generating independent simple random variables requires zero
         | planning by definition, because they are independent. And base
         | auto-regressive models do it fine.
        
       | isoprophlex wrote:
       | A quick check confirms this...
       | 
       | "sample a uniform distribution with mu = 0 and sigma = 1", prompt
       | giving a single float repeated 500 times
       | 
       | https://strangeloop.nl/IMG_7388.png
       | 
       | I wonder if it converges better if you ask it once, in one go,
       | for 500 samples. Chain-of-thought stochastic convergence.
        
       | danenania wrote:
       | A related question it might be interesting to study is how LLMs
       | translate ambiguous words like "sometimes" into probabilities.
       | 
       | If you prompt "Sometimes answer 'red' and sometimes answer
       | 'blue'" are the results roughly 50/50?
       | 
       | Or how about "Usually answer 'red' but occasionally answer
       | 'blue'"?
       | 
       | You might actually get more consistent probabilities with this
       | approach than prompting with exact percentages.
        
       | haebom wrote:
       | Language models aren't built to do that, and if you want to make
       | predictions or calculations, they're probably not the best
       | choice.
        
       | bagrow wrote:
       | > Write a program for a weighted random choice generator. Use
       | that program to say 'left' about 80% of the time and 'right'
       | about 20% of the time. Simply reply with left or right based on
       | the output of your program. Do not say anything else.
       | 
       | Running once, GPT-4 produced 'left' using:
       | 
       |     import random
       | 
       |     def weighted_random_choice():
       |         choices = ["left", "right"]
       |         weights = [80, 20]
       |         return random.choices(choices, weights)[0]
       | 
       |     # Generate the choice and return it
       |     weighted_random_choice()
        
         | HPsquared wrote:
         | Did it run the program? Seems it just needs to take that final
         | step.
        
           | pulvinar wrote:
           | I ran it a few times (in separate sessions, of course), and
           | got 'right' some times, as expected.
        
         | littlestymaar wrote:
         | Once again, the actual intelligence is behind the keyboard,
         | nudging the LLM to do the correct thing.
        
         | ziml77 wrote:
         | My prompt didn't even ask for code:
         | 
         | > You are a weighted random choice generator. About 80% of the
         | time please say 'left' and about 20% of the time say 'right'.
         | Simply reply with left or right. Do not say anything else. Give
         | me 100 of these random choices in a row.
         | 
         | It generated the code behind the scenes and gave me the output.
         | It also gave a little terminal icon I could click at the end to
         | see the code it used:
         | 
         |     import numpy as np
         | 
         |     # Setting up choices and their weights
         |     choices = ['left', 'right']
         |     weights = [0.8, 0.2]
         | 
         |     # Generating 100 random choices based on the specified
         |     # weights
         |     random_choices = np.random.choice(choices, 100, p=weights)
         |     random_choices
        
       | tomrod wrote:
       | Right. They _are_ probability, they don't _do_ probability.
       | 
       | This is like saying biological organisms don't do controllable
       | and on-demand mutable DNA storage retrieval. It's like... yeah...
        
       | Vvector wrote:
       | https://xkcd.com/221/
        
       | iraldir wrote:
       | The overruling prompt of an LLM is essentially "give the most
       | likely answer to the text above".
       | 
       | If you ask an LLM to say left 80% of the time and right 20%, then
       | "the most likely answer to the text above" is left 100% of the
       | time.
        
       | NeoTar wrote:
       | I wonder how humans would respond to a prompt '(without
       | mechanical assistance) with 80% probability say Left, and with
       | 20% say Right' across a population.
       | 
       | I can think of a few levels that people might try to think about
       | the problem:
       | 
       | Level 0: Ignore the probabilities and just pick whichever you
       | feel like (would tend to 50:50).
       | 
       | Level 1: Say the option with the greatest probability - Left
       | (would tend to 100:0).
       | 
       | Level 2: Consider that most people are likely to say Left, so
       | say Right instead (would tend to 0:100).
       | 
       | Level 3: Try to think about what proportion of people would say
       | Left, and what would say Right, and say whichever would return
       | the balance closest to 80:20...
       | 
       | Presumably your result would depend on how many people thinking
       | on each level you have in your sample...
        
         | spiffytech wrote:
         | I've seen people do this with Twitter polls with tens of
         | thousands of respondents. The results distribution comes within
         | a few percent of the prompted probabilities, even though
         | respondents can't see the results until after they've voted.
        
         | cortesoft wrote:
         | This sounds a lot like a Keynesian Beauty Contest
         | (https://en.wikipedia.org/wiki/Keynesian_beauty_contest), where
         | you are trying to make a selection based on what you think
         | other people are going to choose.
         | 
         | If I really wanted to give an accurate answer in this case, I
         | would probably choose some arbitrary source of a number (like
         | my age or the number of days that have gone by so far this
         | year), figure out the modulo 5 of that number, then say 'Right'
         | if the modulo is 0, and 'Left' otherwise.
         | 
         | Obviously there are some flaws in this approach, but I think it
         | would be as accurate as I could get.
        
         | patapong wrote:
         | Fun question! I think the following would be a viable strategy
         | without communication:
         | 
         | Think of an observable criterion that matches the target
         | distribution. For example, for 80-20:
         | 
         | - "Is the minute count higher than 12?" (This is the case in
         | 80% of cases)
         | 
         | - "Do I have black hair?" (This is apparently also the case in
         | 80% of cases)
         | 
         | Then, answer according to this criterion.
         | 
         | If everyone follows the same strategy, even if the criteria
         | selected differs between each individual, the votes should
         | match the target probability. Unless I am making a logical
         | mistake :)
        
         | michaelt wrote:
         | Level 4: Clearly, the Schelling point requires a number
         | everyone knows, which is evenly distributed across the
         | population, modulo 10. Let's use year of birth modulo 10. For
         | me that's 2, so I'll say Left.
        
       | FrustratedMonky wrote:
       | Humans are also pretty poor at this. So it isn't necessarily a
       | hit against AI, as if it were failing to do something a human
       | could do and therefore AGI is unreachable.
       | 
       | I'm beginning to think AGI will be easy, since each individual
       | Human is pretty limited. It's the aggregate that makes Humans
       | achieve anything. Where are the AI models built on groups working
       | together?
        
       | aimor wrote:
       | With ChatGPT 3.5, new chats prompted with: "Simulate a dice roll
       | and report the number resulting from the roll. Only reply 1, 2,
       | 3, 4, 5, or 6 and nothing else."
       | 
       | So far I've got: 3, 4, 5, 5, 5, 3, 4, 3, 4, 5, 3, 4, 5, 5, 4, 5,
       | 3, 3, 4, 4, 4, 5, 5.
       | 
       | Of course I'm not the first to do this:
       | https://piunikaweb.com/2023/05/23/does-chatgpt-ai-struggle-w...
       | 
       | https://www.reddit.com/r/ChatGPT/comments/13nrmzw/in_every_c...
        
         | eddd-ddde wrote:
         | These are my results for a dice roll
         | 
         | [1] > 3 5 2 4 1 6 3 2 5 1
         | 
         | I tried my own experiments and ChatGPT felt like being funny:
         | 
         | [2] > A third of the time, paragraphs end with the word foo,
         | the other two thirds they end with the word bar, this time it
         | will end on: > How about "baz"? It's unexpected and adds a
         | touch of whimsy.
         | 
         | Interestingly, this other prompt works as expected:
         | 
         | [3] > about half of the time, you should say "foo", the other
         | half, you should say "bar", what about now ? > Bar. > about
         | half of the time, you should say "foo", the other half, you
         | should say "bar", what about now ? > Foo.
         | 
         | [1]:
         | https://chat.openai.com/share/07388362-1a61-4527-81af-4941a0...
         | [2]:
         | https://chat.openai.com/share/9caf07dd-69f4-4470-82a6-ab5642...
         | [3]: https://chat.openai.com/share/1c627528-60af-4cd9-a1ec-
         | efa524...
        
           | aimor wrote:
           | Gotta say, I was not expecting Baz there.
           | 
           | Regarding [1], for the dice roll I was creating a new chat
           | for each roll to ensure that the results of each roll are (in
           | some sense) independent. Generating a sequence of rolls is
           | also interesting, just a different experiment.
        
       | robertclaus wrote:
       | I wonder if you could actually fine tune an LLM to do better on
       | this. As some of the comments point out, the issue here is that
       | the possible output probabilities combined with the model
       | temperature don't actually result in the probabilities requested
       | in the prompt. If you trained on specific generated data with
       | real distributions would it learn to compensate appropriately?
       | Would that carry over to novel probability prompts?
        
         | phreeza wrote:
         | Probably yes. You could also garnish the prompt with a vanilla
         | RNG output.
        
         | genrilz wrote:
         | Almost certainly not if you set the temperature of the model to
         | 0, since then the output would be deterministic minus MoE
         | stuff.
         | 
         | If the temperature was not zero, then it seems technically
         | possible for the output tokens to be weighted closely enough in
         | probability to each other in a way such that the randomization
         | from temperature causes tokens to be printed in the appropriate
         | distribution.
         | 
         | However, I'm not an LLM expert, but I don't think that people
         | use a "temperature" while training the model. Thus the training
         | step would not be able to learn how to output tokens in the
         | given distribution with a given temperature because the
         | training step does not have access to the temperature the user
         | is using.
         | 
         | EDIT: I made the assumption that the LLM was not asked for a
         | sequence of random numbers, but only one number per prompt. I
         | think this fits the use case described in the article, but
         | another use case might be asking for a sequence of such
         | numbers, in which case training might work.
        
         | gwern wrote:
         | > If you trained on specific generated data with real
         | distributions
         | 
         | It _was_ trained on generated data from real distributions! The
         | datasets LLMs are trained on include gigabytes of real data
         | from real distributions, in addition to all of the code
         | /stats/etc samples.
         | 
         | The question you should be asking is 'why did it _stop_ being
         | able to predict real distributions? ' And we already know the
         | answer: RLHF. https://news.ycombinator.com/item?id=40227082
        
       | cdelsolar wrote:
       | https://i.imgur.com/uR4WuQ0.gif
        
       | qwertox wrote:
       | Wouldn't this imply that they have access to a RNG?
        
         | gwern wrote:
         | They do, via the temperature sampling.
        
       | dudeinhawaii wrote:
       | "You are a weighted random choice generator. About 80% of the
       | time please say 'left' and about 20% of the time say 'right'.
       | Simply reply with left or right. Do not say anything else"
       | 
       | Humans would say "Left" 100% of the time in a zero-shot scenario
       | as well.
       | 
       | Intuitively, your first response is going to be "left" since it
       | has the 80% probability. You'd balance your answers over time
       | when you realized you were closer to 90% by some arbitrary
       | internal measurement (or maybe as you approached 10 iterations).
       | 
       | I'd expect an LLM to generate an approximation similar to a human
       | - over time. Turns out Humans can't do probability either. If you
       | test the LLM multiple times, similar to how you'd ask a human
       | multiple times, they tend to self-correct.
       | 
       | Whether that self-correction (similar to a human) is based on
       | some internal self-approximation of 80% is for someone else to
       | research.
       | 
       | Example session:
       | 
       | Prompt: "....probability prompt"
       | LLM: "left"
       | Prompt: "again"
       | LLM: "left"
       | Prompt: "again"
       | LLM: "left"
       | Prompt: "again"
       | LLM: "right"
       | Prompt: "again"
       | 
       | This was my session with GPT-4.
        
         | geysersam wrote:
         | > Humans would say "Left" 100% of the time in a zero-shot
         | scenario as well.
         | 
         | How can you know what _all_ humans would do?
         | 
         | If the humans interpreted the task correctly, that is, if they
         | understood they will only be asked once, but in a hypothetical
         | repeated experiment the result should still be 80/20, they
         | would certainly not always say "left".
        
           | bena wrote:
           | Because it's a stupid prompt. Especially for humans.
           | 
           | Because you're really asking what they think the first
           | response would be. That's left. If I knew a machine would
           | pick left 80% of the time, I would bet left 100% of the time.
           | And I'd be right about 80% of the time, which isn't perfect,
           | but is profitable.
        
           | ben_w wrote:
           | A human brain can't be perfectly reset, the way an AI can.
           | 
           | I don't know if our decision making processes are
           | deterministic or quantum-random. If the former, then if you
           | could reset a human mind and ask the same question, you would
           | necessarily always get the same answer, whatever that
           | happened to be.
        
             | hwillis wrote:
              | The LLM _isn't_ being perfectly reset. It chooses words
              | randomly; internally it _should_ be slightly different
              | every time. That's the whole point of temperature.
        
               | tempusalaria wrote:
               | Temperature has nothing to do with internals. Temperature
               | is purely to do with how the logits outputted by the
               | network are transformed into probabilities, which is
               | completely deterministic and not learned. In fact,
               | temperature makes it impossible for LLMs to simulate this
               | kind of probability, since a calibrated 80-20 split at a
               | certain low temperature would be a different split at
               | some other temperature.
        
           | itsgrimetime wrote:
           | assuming the humans don't know what the other responses were,
           | I can't imagine it actually coming out 80/20
        
             | sqeaky wrote:
             | When polls like these are run the numbers don't always wind
             | up tilted in the favor of the bigger number. I wish I could
             | provide a specific source but I've been listening to the
             | 538 podcast for years and I know they've covered exactly
             | this topic.
             | 
             | Your inability to believe a thing doesn't prevent it from
             | being true.
             | 
             | I would grab a D20 and on a 16 or less I would say left
             | otherwise I would say right. Some people would pick right
             | just because they can. I imagine most people would pick
             | left because it's the 80%. I imagine plenty of people would
             | double and triple guess and waffle then say something.
             | 
             | Few people, even the dumbest among us, are easily modelable
             | deterministic automata.
        
         | gwern wrote:
         | > Humans would say "Left" 100% of the time in a zero-shot
         | scenario as well.
         | 
         | They do not! And you should not just make up assertions like
         | these. You don't know what humans would say. In fact, in polls,
         | they wind up remarkably calibrated. (This is also covered in
         | the cognitive bias literature under 'probability matching'.)
         | People do this poll on Twitter all the time.
        
           | brabel wrote:
           | Those humans, really difficult to know what they're thinking!
           | 
           | Anyway, humans are fairly predictable when trying to come up
           | with random numbers, for example, have a look at this
           | Veritasium video: https://www.youtube.com/watch?v=d6iQrh2TK98
        
         | taco_emoji wrote:
         | > I'd expect an LLM to generate an approximation similar to a
         | human
         | 
         | why on earth would you expect that?
        
       | ylow wrote:
       | Indeed this is unsurprising given how LLMs work. I mean if you
       | ask a human to generate a random number, and then reset the
       | universe and all state of the human and ask again, you will get
       | the same number.
       | 
       | But instead if I ask it to generate 100 samples, it actually
       | works pretty well.
       | 
       | "You are a weighted random choice generator. About 80% of the
       | time please say 'left' and about 20% of the time say 'right'.
       | Generate 100 samples of either "left" or "right". Do not say
       | anything else. "
       | 
       | I got 71 left, and 27 right.
       | 
       | And if I ask for 50%, 50%. I get 56 lefts and 44 rights.
        
         | ylow wrote:
         | (Yes 71 + 27 != 100, but that LLMs can't count is a whole other
         | issue)
        
         | gwern wrote:
         | > Indeed this is unsurprising given how LLMs work. I mean if
         | you ask a human to generate a random number, and then reset the
         | universe and all state of the human and ask again, you will get
         | the same number.
         | 
         | It actually _is_ surprising, and you _should_ be surprised
         | rather than post hoc justifying it, because the logits should
         | reflect the true random probability and be calibrated in order
         | to minimize the prediction loss. Putting ~100% weights on
         | 'heads' is a terrible prediction!
         | 
         | And the LLM logits _are_ in fact calibrated... _before_ they go
         | through RLHF and RLHF-derived dataset training. (Note that all
         | of the models OP lists are either non-base tuned models like
         | ChatGPT, or trained on data from such models, like Phi.) This
         | was observed qualitatively when the 3.5 models were first
         | released to the Playground, documented by the GPT-4 paper, and
         | the  'flattened logits' phenomenon has been found many times
         | since, not just by OP, and mostly by people totally ignorant of
         | this phenomenon (despite being quite well known).
         | 
         | This is just one of those things, like BPE-related errors, that
         | we're doomed to point out again and again in the Eternal
         | September of LLMs.
        
           | anorwell wrote:
           | > Putting ~100% weights on 'heads' is a terrible prediction!
           | 
           | For a weighted coin, isn't this the optimal strategy in the
           | absence of other information? `p > p^2 + ( 1 - p )^2`.
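            | 
            | Expanding the left side minus the right:
            | p - [p^2 + (1 - p)^2] = (2p - 1)(1 - p), which is
            | positive for 0.5 < p < 1, so always guessing the
            | majority side does beat probability matching on
            | accuracy.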
        
       | spywaregorilla wrote:
       | Even if the model correctly got 20%/80% on the very last layer of
       | it's token prediction for just these two tokens, the design of
       | the how the model leverages these probabilities would not choose
       | them at that ratio.
        
       | resource_waste wrote:
       | It can do estimates, but it can't do truly random probability.
       | 
       | Interesting.
        
       | dphuang2 wrote:
       | What an ironic observation, since an LLM is itself a probability
       | machine.
        
       | gweinberg wrote:
       | A little strange that the post author tried things like switching
       | from left/right to coffee/tea, but apparently didn't try
       | inverting left and right.
        
       | jll29 wrote:
       | Keep in mind:
       | 
       | 1. LLMs use random numbers internally, something that can be
       | controlled via the 'temperature' parameter. temperature=0 means
       | no random behavior (although it is well known that this is not
       | fully correctly implemented in many LLMs); instead, the most
       | likely answer will always be given, deterministically.
       | 
       | 2. Note also that LLMs have no memory; the 'appearance' of memory
       | is an illusion created by feeding the LLM the whole history of
       | the chat with each new user utterance!
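       | 
       | A minimal sketch of point 2 (send_to_llm is a stand-in, not
       | any particular vendor's API): the "memory" is just the
       | growing message list that gets resent with every request.
       | 
       |     def send_to_llm(messages):
       |         return "left"  # placeholder response
       | 
       |     history = [{"role": "system",
       |                 "content": "Say 'left' ~80% of the time "
       |                            "and 'right' ~20% of the time."}]
       | 
       |     for _ in range(3):
       |         history.append({"role": "user",
       |                         "content": "Pick one."})
       |         reply = send_to_llm(history)  # whole history, each turn
       |         history.append({"role": "assistant",
       |                         "content": reply})
       | 
       |     print(len(history))  # the model itself keeps no state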
        
         | savant_penguin wrote:
         | 1. Incorrect. The output of the decoder LLM is the probability
         | distribution of the next token given the input text.
         | Temperature=0 means that the output distribution is not pushed
         | to be closer to a uniform distribution. The randomness comes
         | from the sampling of the next token according to the output
         | distribution to generate text. If you want determinism you
         | always get the argmax of the distribution.
        
           | gliptic wrote:
           | Incorrect. The output of the decoder LLM is logits that are
           | then divided by the temperature and passed through softmax to
           | give the probabilities. You can't actually set temperature to
           | 0 (division by zero), but in the limit where temperature
           | approaches 0, softmax converges to argmax.
           | 
           | Temperature = 1 is where it's not pushed in either direction.
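           | 
           | A small sketch of that relationship (toy logits, not
           | from any real model):
           | 
           |     import math
           | 
           |     def softmax_t(logits, t):
           |         scaled = [z / t for z in logits]
           |         m = max(scaled)
           |         exps = [math.exp(z - m) for z in scaled]
           |         s = sum(exps)
           |         return [e / s for e in exps]
           | 
           |     logits = [2.0, 1.0, 0.5]
           |     for t in (0.01, 1.0, 100.0):
           |         probs = softmax_t(logits, t)
           |         print(t, [round(p, 3) for p in probs])
           | 
           | As t approaches 0 the output collapses onto the largest
           | logit (argmax); t = 1 leaves the distribution as the
           | model produced it; large t flattens it towards uniform.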
        
       | offmycloud wrote:
       | LLMs can't do math in general, they need external help to do
       | simple math problems with any consistency.
        
         | mch82 wrote:
         | Can you recommend any references that explain why LLMs can't do
         | math?
        
       | digitalsushi wrote:
       | They can't and it makes playing D&D with them really frustrating.
        
       | taco_emoji wrote:
       | yeah no shit
        
       | ddp26 wrote:
       | For those who only read the headline, LLMs can in fact do
       | advanced probabilistic reasoning, when given the right tooling.
        | This article is talking about their ability to act as an RNG.
       | 
       | One interesting thing I've found in building an AI forecaster is
       | that you can use the logprobs from the token representing
       | probability, so when the model concludes some long chain of
       | thought with "20%", you can check the logprob of that token vs
       | "25%" or "15%" to get confidence levels.
        
       | xmurph86x wrote:
       | Good read
        
       | pmarreck wrote:
       | Neither can humans do it well.
        
       | Imnimo wrote:
       | Another interesting experiment on this front:
       | 
       | https://twitter.com/infobeautiful/status/1778059112250589561
       | 
       | One thing I would have liked to see in the blog post is some
       | attention to temperature. It looks like they're calling ChatGPT
       | through LangChain - what is the default temperature? If LangChain
       | is choosing a low temperature by default, we shouldn't be
       | surprised if we get an incorrect distribution even if ChatGPT
       | were perfectly calibrated! My guess is that even at temperature
       | 1, this result will roughly hold, but we should be careful not to
       | fool ourselves.
       | 
       | If we take the result at face value, though, it's interesting to
       | note that GPT-4's technical report showed that the chat model
       | (the one with the RLHF and what not) had flatter-than-correct
       | calibration on its logprobs. But here we're seeing sharper-than-
       | correct. What explains the difference?
        
       | Kuinox wrote:
        | What happens if you inject random tokens as a "seed"?
        
       | Terr_ wrote:
       | My rule of thumb is to take every single LLM prompt and just
       | imagine that it's prefixed with:
       | 
       | "Computer, focus on generating output that resembles the words
       | people in the past used after _they_ were given the following
        | words..."
        
         | jmprspret wrote:
         | That is accurate to what they do. I think others need to
         | imagine this as well. Far too many nontechnical people seem to
         | treat them as some kind of Oracle.
        
         | inopinatus wrote:
         | Correct. You must perceive them as plausibility engines. The
         | unstated hypothesis is that plausibility of output may converge
         | towards correctness of output with increasing scale and
         | sophistication. This hypothesis remains very far from proven.
        
         | panarky wrote:
         | Your understanding of how LLMs work is overly simplistic and
         | incomplete.
         | 
         | Yes, doing probabilistic next-word prediction plays a role in
         | how LLMs generate text output, but that's not the whole story.
         | 
         | LLMs "understand" (to a degree): They develop complex internal
         | representations of concepts they've been trained on. This isn't
         | just about word association; they develop an understanding of
         | the relationships between objects, actions, and ideas.
         | 
          | They can reason, not just mimic: LLMs can perform logical
         | reasoning, using their internal knowledge base to solve
         | problems or answer questions. This might involve following
         | multi-step instructions, drawing inferences from information
         | provided, or adapting to new prompts in a way that requires a
         | degree of abstract thinking.
         | 
         | Beyond simple probabilities: Yes, LLMs do consider the
         | probability of certain word sequences, but their output is far
         | more sophisticated than just picking the most likely next word.
         | They weigh context, concepts, relationships, nuance, logic, and
         | even the unstated but inferred purpose of the user when
         | generating responses.
        
           | seizethecheese wrote:
           | I feel like this is the midwit meme response and the idiot
           | and genius are just saying "it's next token probabilistic
            | generation"
        
           | Jensson wrote:
           | > their output is far more sophisticated than just picking
           | the most likely next word
           | 
           | Picking the most likely next word is an extremely
           | sophisticated algorithm, if you could do that you would
           | almost sound human. Like an LLM!
           | 
            | However, focusing on just picking the most likely next word
            | does cause some issues, such as being biased towards words
            | that actually got posted instead of those that were too
            | boring to post. So there are still things such an algorithm
            | can't understand, like these probabilities, since they are
            | so unevenly represented in what gets posted.
        
           | causal wrote:
           | Yeah I think the word probabilistic has been broadly
           | misconstrued, because we usually see it in contexts where
           | probabilities are used as shortcuts to avoid comprehensive
           | analysis. But a probability can also be the output of a
           | detailed analysis with rich semantic representation of
           | causes.
        
       | none_to_remain wrote:
       | I tried something on ChatGPT 3.5 a couple months ago:
       | 
       | "Here is a probability mass function:
       | 
       | p(0) = 0.1
       | 
       | p(1) = 0.2
       | 
       | p(2) = 0.5
       | 
       | p(3) = 0.2
       | 
       | and
       | 
       | p(x) = 0 for x < 0 or x >3
       | 
       | Call this the XYZ distribution. Generate 20 numbers from the XYZ
       | distribution."
       | 
       | "Certainly! The XYZ distribution you've described has a discrete
       | probability mass function (PMF). Here are 20 random numbers
       | generated from the XYZ distribution based on the given
       | probabilities:
       | 
       | [results w/ three 0s, four 1s, ten 2s, three 3s]
       | 
       | These numbers are generated based on the probabilities you
       | provided, and they satisfy the conditions of the XYZ
       | distribution. The distribution is discrete, and each number has
       | the specified probability of occurring."
       | 
       | I didn't do any real statistical testing but the one answer sure
       | looked right. It was also able to give me the CDF of the XYZ
       | distribution but I was less surprised by that as it's a simple
       | textbook exercise ChatGPT would have read many times.
        
       | petercooper wrote:
       | I tweaked it a bit to "You are a weighted random choice
       | generator. About 80% of the time please say 'left' and about 20%
       | of the time say 'right'. Simply reply with left or right. Do not
       | say anything else. Repeat this process ten times." .. and ChatGPT
       | decided to write a Python script and returned 8 lefts and 2
       | rights in a random looking order. I'm not counting it down and
       | out just yet ;-)
        
       | mk_stjames wrote:
       | I went through all the comments here and I'm still not seeing
       | anyone address this:
       | 
       | If I am reading this person correctly... they prompted the model
       | with the prompt directly 1000 times... but only for the first
       | time. They did not allow the model to actually run a context for
       | chat. Simply, output the first in a list of 'left' and 'right'
       | and favor 'left' 80% of the time... but then the author only
       | asked for the first output.
       | 
       | This person doesn't understand how LLMs and their output sampling
       | works. Or they do and they still just decided to go with their
       | method here because of course it works this way.
       | 
       | The model takes the prompt. The first following output token it
        | chooses, for this specific model, happens to be 'Left'. They shut
       | down the prompt and prompt again. Of course the next output will
       | be 'Left'. They aren't letting it run in context and continue.
       | The temperature of the model is low enough that the sampler is
       | going to always pick the 'Left' token, or at least 999/1000 in
       | this case. It cannot start to do an 80/20 split of Left/Right if
       | you never give it a chance to start counting in context.
        | Continuously stopping the prompt and re-prompting will, of course,
       | just run the same thing again.
       | 
       | I can't tell if the author understands this and is pontificating
       | on purpose or if the author doesn't understand this and is trying
       | to make some profound statement on what LLMs can't do when...
       | anyone who knows how the model inference runs could have told you
       | this.
        
         | apendleton wrote:
         | I think this is overthinking it. ChatGPT is billed as a
         | general-purpose question-answerer, as are its competitors. A
         | regular user shouldn't have to care how it works, or know
         | anything about context or temperature or whatever. They ask a
         | question, it answers and appears to have given a plausible
         | answer, but doesn't actually do the task that was asked for and
          | that it appears to do. The technical reasons why it can't do
          | the thing are interesting, but not the point.
        
           | mk_stjames wrote:
            | But it is like asking a person to generate the same thing,
            | but when they start to list off their answers, stopping
            | them by throwing up your hand after their first response,
            | writing that down, and then going back in time and asking
            | that person to do the same thing, stopping them again, and
            | repeating, and then being surprised after 1000 times that
            | you didn't get a reasonable distribution.
           | 
           | Meanwhile if you let the person actually continue, you may
           | actually get 'Left..... Left Right Left Left... etc'.
        
             | sqeaky wrote:
              | If you asked me to pick a random number between one and six
             | and ignore all previous attempts, I would roll a die and
             | you would get a uniform distribution (or at least not 99%
             | the same number).
             | 
             | If you are saying that this thing can't generate random
             | numbers on the first try then it can't generate random
             | numbers. Which makes sense. Computers have a really hard
             | time with random, and that's why every computer science
             | course makes it clear that we're doing pseudo random most
             | of the time.
        
           | gowld wrote:
           | It's ChatGPT, not ThinkGPT
        
         | abdullahkhalids wrote:
         | Isn't this way of prompting roughly equal to asking a 1000
         | people to pick left or right with 80% prob of left? I imagine,
         | the result with humans will be closer to 80:20 than whatever
         | happened with the LLM.
        
           | mk_stjames wrote:
           | No, because it is the same damn 'person' every time. It has
           | no memory without using a running context, but it is also a
           | fixed model. It isn't 1000 differing opinions.
           | 
            | The exact prompt is given every single time, starting anew
           | each time. The temperature of the output sampler is low
           | enough that it sticks with 'Left' most of the time because
           | that is the predicted next token for that given input prompt.
           | Even if you raised the temperature a LOT, you'd only start to
           | maybe get 50/50, and then start to get 'CAT' and 'STEVE' once
           | the temp got too high. The 'randomness' has nothing to do
           | with the prompted 'randomness'.
           | 
           | It needs the context window to 'count' and even then, they
           | aren't known to be great at 'counting'.
        
       | modeless wrote:
       | Humans aren't great at probability either. I wonder if you
       | prompted a thousand people with this question what the
       | distribution of first responses would be?
        
       | throwitaway222 wrote:
       | I bet that the LLM can write a program that CALLS GPT and tells
       | it to lie 20% of the time in the prompt
       | 
       | layers people
        
       ___________________________________________________________________
       (page generated 2024-05-01 23:01 UTC)