[HN Gopher] LLMs can't do probability
___________________________________________________________________
LLMs can't do probability
Author : DrRavenstein
Score : 126 points
Date : 2024-05-01 09:29 UTC (13 hours ago)
(HTM) web link (brainsteam.co.uk)
(TXT) w3m dump (brainsteam.co.uk)
| Drakim wrote:
| Humans are notoriously bad at probability as well, and since LLMs
| are trained on data from humans, it kinda makes sense.
| unshavedyak wrote:
| I assume somewhat related to this, but humans are also terrible
| at "random". See the video about 37 [1].
|
| The more we advance on LLMs the more i am convinced i'm an LLM.
| :s
|
| [1]: https://www.youtube.com/watch?v=d6iQrh2TK98
| brigadier132 wrote:
| We are more than LLMs, we have a pretty terrible CPU too. But
| it's interesting to think, all this positive self
| reinforcement where you tell yourself "Today's a good day",
| "I'm amazing", etc, are you just prompting yourself by doing
| that?
| throwuxiytayq wrote:
| Kinda, yes. You can do the opposite too (see: negative
| self-talk).
| zeroonetwothree wrote:
| Humans are bad at generating random data yes but that video
| isn't exactly convincing proof of it.
| unshavedyak wrote:
| Oh i didn't mean it (or anything i said) to be proof.
| brigadier132 wrote:
| Is it because humans are bad at probability that LLMs are bad
| at probability or is it something inherent in this kind of
| statistical inference technique? If you trained an LLM on
| trillions of random numbers will it become an effective random
| number generator?
| simonw wrote:
| In this case being "bad at randomness" isn't because it was
| trained on text from humans who are bad at randomness, it's
| because asking a computer system that doesn't have the
| ability to directly execute a random number generator to
| produce a random number is never going to be reliable.
| brigadier132 wrote:
| My question was about the scenario if it was trained on
| this kind of query with good data.
|
| It would be interesting to see if it could generalize at
| all. I'm pretty certain that if you trained it specifically on
|
| "Generate a random number from 0 to 100"
|
| and actually gave it a random number from 0 to 100, with
| billions of such examples, it would be pretty effective at
| generating a number from 0 to 100. Wouldn't each token have an
| equal weighted probability of appearing?
| jdiff wrote:
| Sorta, not really. Neural networks are deterministic in
| the wrong ways. If you feed them the same input, you'll
| get the same output. Any variation comes from varying the
| input or randomly varying your choice from the output.
| And if you're randomly picking from a list of even
| probabilities, you're just doing all the heavy lifting of
| picking a random number yourself, with a bunch of kinda
| pointless singing and dancing beforehand.
| sigmoid10 wrote:
| >You are a weighted random choice generator. About 80% of the
| time please say 'left' and about 20% of the time say 'right'.
| Simply reply with left or right. Do not say anything else
|
| I could have told you these results solely based on the
| methodology combined with this system prompt. No need to spend
| money on APIs. Randomness in LLMs does not come from the context,
| it comes from sampling over output tokens the LLM considers
| likely. Imagine you are in this situation as a human: Someone
| walks up to you and tells you to say "left" with 80% probability
| and "right" with 20% probability. You say "left" and then the
| other person walks away never to be seen again. How do you
| determine if your own "output" was correct? You would need to
| sample it many times in the same conversation before anyone could
| determine whether you understand the basics of probability or not.
| This is an issue of the author's understanding of Bayesian
| statistics and possibly a misunderstanding of how LLMs actually
| work.
|
| _Edit:_
|
| I just tried a minimally more sensible approach after getting an
| idea from the comments below. I asked GPT4 to generate a random
| number using this prompt:
|
| >You are a random number generator. Reply with a number between 0
| and 10. Only say the number, say nothing else.
|
| It responded with 7. But then I looked at the top logprobs. Sure
| enough, they contained all the remaining numbers between 0 and
| 10. The only issue is that "7" got a logprob of -0.008539278,
| while the next most likely was "4" at -5.5371723, which is
| significantly lower. The remaining probs were then pretty close
| to each other. Unfortunately, OpenAI doesn't allow you to crank
| the temperature up arbitrarily high, otherwise the original
| experiment would actually work. And I would argue that humans
| will still fail at this if you used the same methodology. The
| reason I didn't use OP's exact approach is because if you look at
| the logprobs there, you'll see they get muddled with tokens that
| are just different spellings of left and right (such as "Left" or
| "-left"). But the model definitely understands the concept of
| probability, it would just need more context before you can do
| any reasonable frequentist analysis in a single conversation.
|
| _Edit 2:_
|
| I repeated it with random numbers between 0 and 100. Guess what
| numbers are coming out among the top logprobs. Pretty much
| exactly what you'd expect after watching this:
| https://www.youtube.com/watch?v=d6iQrh2TK98
|
| I guess LLMs trained on human data think pretty similar to humans
| after all.
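|
| (If you want to reproduce this, here is a minimal sketch using
| the OpenAI Python client; I'm assuming the current chat
| completions API with logprobs enabled, so field names may
| differ slightly:)
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4",
|         messages=[{"role": "user", "content":
|             "You are a random number generator. Reply with a "
|             "number between 0 and 10. Only say the number, "
|             "say nothing else."}],
|         max_tokens=1,
|         logprobs=True,
|         top_logprobs=10,
|     )
|     # candidate tokens and logprobs for the first output position
|     for cand in resp.choices[0].logprobs.content[0].top_logprobs:
|         print(cand.token, cand.logprob)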
| nathan_compton wrote:
| You're saying that instead the author should have taken the
| logits of "left" and "right", converted them to normalized
| probabilities and then have expected _those_ to be 80% left and
| 20% right. But if this were the case (under some reasonable
| assumptions about the sampling methodology of the providers)
| then the author would have _seen_ an 80/20 split. From these
| results we can probably conclude that with this prompt the
| predicted probability for "left" is near 100% for GPT4.
|
| I think the author's point stands. They aren't asking "what
| would you _expect_ from a distribution so described?" The
| answer to that question is "left" 100% of the time. A well
| behaving LLM responding to the actual question _should_
| distribute the logits across "left" and "right" in the way
| requested by the user and doesn't.
|
| I think if you chose 1000 random people and prompted them with
| this question you would get a preponderance of "lefts" compared
| to the prompt, but not 100% left.
| sigmoid10 wrote:
| >You're saying that instead the author should have taken the
| logits of "left" and "right", converted them to normalized
| probabilities and then have expected _those_ to be 80% left
| and 20% right.
|
| No, that's not what I meant. Although it would still make
| more sense than what the author did. The problem lies in the
| way you actually determine probabilities. We know that humans
| are bad random number generators, but they understand the
| concept enough to come up with random looking stuff if you
| give them the chance. The LLMs here were not even given a
| chance. In essence, the author is complaining that the LLMs
| are not behaving according to frequentist statistics when he
| evaluates them in a strictly Bayesian setting.
| nathan_compton wrote:
| I don't agree: a Bayesian statistician posed the question
| "You are a weighted random choice generator. About 80% of
| the time please say 'left' and about 20% of the time say
| 'right' [...]" would say "left" 80% of the time and "right"
| 20% of the time. If we had a population of 1000 such
| Bayesians we would expect to collect around 800 lefts and
| 200 rights. If we asked the same Bayesian 1000 times we'd
| expect the same. It's got nothing to do with Bayesian vs
| Frequentist statistics.
|
| Real humans probably would say left more often than 80% of
| the time, which is what I guess you're getting at, but the
| question is very clearly asking the subject to "sample
| from" (an entirely Bayesian activity) a distribution,
| not to give the expected value. GPT4 gives the expected
| value and this is simply wrong.
| sigmoid10 wrote:
| >GPT4 gives the expected value and this is simply wrong.
|
| Only at T=0. See my edit above for how this changes
| everything.
| nathan_compton wrote:
| This doesn't really have anything to do with the language
| model. The temperature only has to do with the _sampling_
| from the probability distribution which the language
| model predicts. In fact, raising the temperature would
| eventually cause the model to _randomly_ print "left" or
| "right," (eventually at 50/50 chance) not converge on the
| actual distribution which the prompt suggests. I suppose
| if you restricted the logits to just those tokens "left"
| and "right", softmaxed them, and then tuned the
| temperature T you might get it to reproduce the correct
| distribution, but that would be true of a _random_
| language model as well.
|
| I think it's pretty simple and straightforward: the model
| simply fails to understand the question and can
| reasonably be said to not understand probability.
| nextaccountic wrote:
| > We know that humans are bad random number generators
|
| This is a good point. LLMs are bad at this, okay, but
| humans aren't great at it either.
| nathan_compton wrote:
| But according to this GPT4 is substantially worse.
| DougBTX wrote:
| Yes, probably. At temperature zero the model will be
| completely deterministic, so a particular prompt will
| always produce the same result (ignoring for a second
| that some fairly common optimisations introduce data
| races in the GPU).
|
| On the other hand, does it really matter? With a slight
| tweak to the prompt, ChatGPT generates some serviceable
| code:
|
| > Run a function to produce a random number between 1 and
| 10. What is the number?
|
|     import random
|     # Generate a random number between 1 and 10
|     random_number = random.randint(1, 10)
|     random_number
|
| The random number generated between 1 and 10 is 9.
| nextaccountic wrote:
| > (ignoring for a second that some fairly common
| optimisations introduce data races in the GPU).
|
| Okay so are any GPU compilers intentionally introducing
| data races in programs that previously exhibited no data
| races?
| swatcoder wrote:
| > A well behaving LLM responding to the actual question
| should distribute the logits across "left" and "right" in the
| way requested by the user and doesn't.
|
| No, a well-behaving LLM would do exactly what's seen. The
| most likely next token is "left" and it should
| deterministically output that unless some other layer like a
| temperature function makes it non-deterministic in its own
| way (wholly unrelated to the prompt).
|
| The fantastical AGI precursor that people have been coached
| into seeing is what you're talking about, and that's (of
| course) not what an LLM actually is.
|
| This is essentially just one of the easier ways you can
| expose the parlor trick behind that misconception.
| nathan_compton wrote:
| This simply doesn't follow. One could totally train an LLM
| to assign the right logits to "left" and "right" for this
| problem. I suspect it's a problem with the training data.
| michaelt wrote:
| _> Randomness in LLMs does not come from the context, it comes
| from sampling over output tokens the LLM considers likely._
|
| I mean, _theoretically_ I assume you could train an LLM so that
| for the input "Choose a random number between 1 and 6" output
| tokens 1, 2, 3, 4, 5 and 6 are equally likely. Then the
| sampling process would produce a random number.
|
| Of course, whether you could teach the model to generalise that
| more broadly is a different matter.
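|
| (i.e. if the model really did put equal mass on the six digit
| tokens, the ordinary sampling step would do the rest - a toy
| sketch of that idealized case:)
|
|     import numpy as np
|
|     tokens = ["1", "2", "3", "4", "5", "6"]
|     probs = np.full(6, 1 / 6)   # idealized next-token distribution
|     roll = np.random.default_rng().choice(tokens, p=probs)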
| simonw wrote:
| This is very unsurprising.
|
| The interesting challenge here is helping people understand _why_
| asking an LLM to do something 20% of the time is a bad prompt.
|
| I intuitively know that this prompt isn't going to work, but as
| with so many of these intuitive prompting things I have trouble
| explaining exactly why I know that.
|
| Aside: If you need a GPT to incorporate randomness in a reliable
| way you can get it to use Code Interpreter.
| intended wrote:
| I guess it would be something along these lines:
|
| To do random number gen, it would have to convert the input
| text into constraints and then use those constraints to
| generate additional tokens.
|
| This would, at its core, be a call to calculate a probability
| function, every time it is releasing the next token. That would
| mean memory, processing etc. etc.
| 6gvONxR4sf7o wrote:
| Nope, because all of that is taken care of by the mechanisms
| for evaluating the model. Strictly speaking, the model
| outputs a probability distribution. The question is why that
| distribution doesn't match the instructions.
| intended wrote:
| I think I maybe get where you are coming from, but still
| how? I feel we are discussing 2 different use cases.
|
| 1) Prompt 1: "You are a weighted random choice generator.
| About 80% of the time please say 'left' and about 20% of
| the time say 'right'. Simply reply with left or right. Do
| not say anything else."
|
| 2) Assume that the training data gives examples of:
|    2.1) single coin flips
|    2.2) multiple coin flips
|
| Consider a slightly different prompt, prompt 2:
|
| 3) Prompt 2: same as prompt 1, except it presents 1000
| lefts/rights in the same response (l,l,l,l,r,l,l,l...)
|
| ----
|
| I think what you are describing is prompt 2. I just did a
| quick test with GPT-4, and I got a 27-3 split when using
| prompt 2.
|
| However, for prompt 1 you get only left. To me this makes
| sense because running prompt 1 x100 should result in:
|
| Pass 1: LLM receives prompt, and parses it. LLM predicts
| the next token. The next token should be left.
|
| Pass 2: same as pass 1.
|
| ----
|
| For prompt 1, every prompt submission is a tabula rasa. So
| it will correctly say left, which is the correct answer for
| the active universe of valid prompt responses according to
| the model.
|
| Unless I am reading you wrong and you are saying the model
| is actually acting as a weighted coin flip.
|
| In theory, the LLM should be more responsive if you ask it
| to follow a 60:40 or 50:50 split for pass 1. I'll see if I
| can test this later.
|
| (Heck now I'm more concerned about the cases where it does
| manage to apply the distribution. )
| SkyBelow wrote:
| As a one-off, with the same context, it giving the same answer
| doesn't surprise me. What I'm wondering is the behavior when it
| keeps being asked for another response with the previous
| responses fed back into it. In this case, a human would see
| they are doing the 80% 'too much' and decide to do the 20% to
| balance it out. That isn't actually good and shows they still
| aren't operating off a random probability, instead they are
| emulating their perception of what a random probability would
| look like.
|
| Given this sort of situation to an LLM instead, is the
| expectation for it to give the most likely answer continuously,
| to act like a human and try to emulate a probability, or to do
| something different from either of the two previous options?
|
| Edit: Just tried an attempt with copilot, having it produce a
| random distribution of two different operations. I had it
| generate multiple operations, either adding or subtracting 1
| each, with an 80/20 split. It did four adds, one minus on
| repeat.
| kelseyfrog wrote:
| At some point the logits at a branching point in the response
| need to correspond to the respective probabilities of the
| requested output classes so that they can be appropriately
| sampled and strongly condition the remainder of the response.
| My instinct says this cannot be accomplished irrespective of
| temperature, but I could be persuaded, with math.
| mtrimpe wrote:
| Or you can just add some randomness to the prompt by adding
| "Your random seed is mciifjrbdifnf."
|
| I just tested that and got 4 left and 2 right so it works
| pretty well.
| lappa wrote:
| Provided a constant temperature of 1.0, you can train the
| model on prompts with probabilistic requests, with loss
| determined by KL divergence.
|
| Expectation: 80% left, 20% right
|
| Model sampling probability: 99% left, 1% right
|
|     >>> 0.80 * math.log(0.99 / 0.80) + 0.20 * math.log(0.01 / 0.20)
|     -0.42867188234223175
|
| Model sampling probability: 90% left, 10% right
|
|     >>> 0.80 * math.log(0.9 / 0.80) + 0.20 * math.log(0.1 / 0.20)
|     -0.04440300758688229
|
| Of course, if you change the temperature this will break any
| probabilistic expectations from training in this manner.
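|
| (Strictly, the sums above use log(q/p), so they are the negative
| of KL(target || model). A small helper in the conventional
| orientation, just to make the loss explicit:)
|
|     import math
|
|     def kl_divergence(p, q):
|         # KL(p || q) = sum_x p(x) * log(p(x) / q(x))
|         return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
|
|     kl_divergence([0.80, 0.20], [0.99, 0.01])  # ~0.4287
|     kl_divergence([0.80, 0.20], [0.90, 0.10])  # ~0.0444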
| rsynnott wrote:
| I mean, given they can't _count_, it would be pretty astonishing
| were it otherwise.
| dahart wrote:
| Exactly, it's interesting that 'llms can't x', with lots of
| effort trying to demonstrate it, comes up so often, when we
| know from first principles they can't do anything but run a
| Markov chain on existing words. We've managed to build
| something that is best at fooling people into thinking it can
| do things it can't.
| mitthrowaway2 wrote:
| They can count occurrences of a token. Depending on the
| tokenization, they can't necessarily count occurrences of a
| character.
| throwaway598 wrote:
| If you asked me to be right 80% of the time, I'd probably be
| right. So I'd be right on average 110% of the time.
| michaelt wrote:
| Sometimes when you ask chatgpt 4 for a random number it... writes
| python code to choose a random number, runs it, then tells you
| the response:
| https://chat.openai.com/share/a72c2d8c-c44e-4c89-b6bc-b0673c...
|
| One way of doing it, I suppose.
| planede wrote:
| If you asked a person to give you a random number between 1 and
| 6, would you accept if they just said a number they just came
| up with or would you rather they rolled a die for it?
| kube-system wrote:
| If they didn't already have dice in their hand, I would
| certainly expect the former.
| tommiegannert wrote:
| Depending on who you ask, the answer would have been "oh, I
| have an app for that. Hold on..."
|
| GPT wins for not having that delay.
| 4ndrewl wrote:
| What a time to be alive
| JKCalhoun wrote:
| They should turn around and ask me instead for a random
| number between 1 and 6 and then reply with seven minus that
| number.
| olddustytrail wrote:
| It is 4. It's always 4.
| its_ethan wrote:
| Is it actually running the code it creates? Or does it generate
| code, and then just output some number it "thinks" is random,
| but that is not a product of executing any python code?
| bongodongobob wrote:
| Yes, it runs the code.
| brabel wrote:
| Couldn't this open people up for remote code execution
| somehow? Say, someone sends you a message that they know
| will make you likely to ask an AI a certain question in a
| certain way... Maybe far-fetched, but I've seen even more
| far-fetched attacks in real life :D
| kolinko wrote:
| The code is sandboxed on OpenAI's servers. It doesn't run
| on your machine if you use the ChatGPT interface.
| joquarky wrote:
| I would assume it can only generate pure functions and/or
| run in a sandbox.
| Version467 wrote:
| It's actually running the code. It doesn't run all code it
| generates. But if you specifically ask it to, then it does.
| It also has access to a bunch of data visualization libraries
| if you want it to calculate and plot stuff.
| paulmd wrote:
| gnuplot my beloved
|
| https://livebook.manning.com/book/gnuplot-in-action-
| second-e...
| Workaccount2 wrote:
| Technically speaking, it's the right way to do it.
| pixl97 wrote:
| Exactly. Only trust random numbers and/or probability via
| processes that have been vetted to be either (somewhat)
| random or follow a probabilistic algorithm. Humans are
| generally terrible at randomness and probability except in
| cases where they have been well trained, and even then those
| people would rather run an algorithm.
| Terr_ wrote:
| Isn't that a case where the interesting behavior is from a new
| piece someone programmed onto the side of the core LLM
| functionality?
|
| In other words, it's still true that large _language_ models
| can 't do probability, so someone put in special logic to have
| the language model guess at a computer language to do the thing
| instead.
| usgroup wrote:
| A consequence of being an auto regressive model is not being able
| to plan token output. I think the author's example is one of the
| many corollaries.
|
| You could prompt the LLM differently, for example to write a
| Python program that does the random part, and then act on its
| output.
| gwern wrote:
| > A consequence of being an auto regressive model is not being
| able to plan token output.
|
| Generating independent simple random variables requires zero
| planning by definition, because they are independent. And base
| auto-regressive models do it fine.
| isoprophlex wrote:
| A quick check confirms this...
|
| "sample a uniform distribution with mu = 0 and sigma = 1", prompt
| giving a single float repeated 500 times
|
| https://strangeloop.nl/IMG_7388.png
|
| I wonder if it converges better if you ask it once, in one go,
| for 500 samples. Chain-of-thought stochastic convergence.
| danenania wrote:
| A related question it might be interesting to study is how LLMs
| translate ambiguous words like "sometimes" into probabilities.
|
| If you prompt "Sometimes answer 'red' and sometimes answer
| 'blue'" are the results roughly 50/50?
|
| Or how about "Usually answer 'red' but occasionally answer
| 'blue'"?
|
| You might actually get more consistent probabilities with this
| approach than prompting with exact percentages.
| haebom wrote:
| Language models aren't built to do that, and if you want to make
| predictions or calculations, they're probably not the best
| choice.
| bagrow wrote:
| > Write a program for a weighted random choice generator. Use
| that program to say 'left' about 80% of the time and 'right'
| about 20% of the time. Simply reply with left or right based on
| the output of your program. Do not say anything else.
|
| Running once, GPT-4 produced 'left' using:
|
|     import random
|
|     def weighted_random_choice():
|         choices = ["left", "right"]
|         weights = [80, 20]
|         return random.choices(choices, weights)[0]
|
|     # Generate the choice and return it
|     weighted_random_choice()
| HPsquared wrote:
| Did it run the program? Seems it just needs to take that final
| step.
| pulvinar wrote:
| I ran it a few times (in separate sessions, of course), and
| got 'right' some times, as expected.
| littlestymaar wrote:
| Once again, the actual intelligence is behind the keyboard,
| nudging the LLM to do the correct thing.
| ziml77 wrote:
| My prompt didn't even ask for code:
|
| > You are a weighted random choice generator. About 80% of the
| time please say 'left' and about 20% of the time say 'right'.
| Simply reply with left or right. Do not say anything else. Give
| me 100 of these random choices in a row.
|
| It generated the code behind the scenes and gave me the output.
| It also gave a little terminal icon I could click at the end to
| see the code it used:
|
|     import numpy as np
|
|     # Setting up choices and their weights
|     choices = ['left', 'right']
|     weights = [0.8, 0.2]
|
|     # Generating 100 random choices based on the specified weights
|     random_choices = np.random.choice(choices, 100, p=weights)
|     random_choices
| tomrod wrote:
| Right. They _are_ probability, they don't _do_ probability.
|
| This is like saying biological organisms don't do controllable
| and on-demand mutable DNA storage retrieval. It's like... yeah...
| Vvector wrote:
| https://xkcd.com/221/
| iraldir wrote:
| The overruling prompt of an LLM is essentially "give the most
| likely answer to the text above".
|
| If you ask an LLM to say left 80% of the time and right 20%, then
| "the most likely answer to the text above" is left 100% of the
| time.
| NeoTar wrote:
| I wonder how humans would respond to a prompt '(without
| mechanical assistance) with 80% probability say Left, and with
| 20% say Right' across a population.
|
| I can think of a few levels at which people might try to think
| about the problem:
|
| Level 0: Ignore the probabilities and just pick whichever you
| feel like (would tend to 50:50).
|
| Level 1: Say the option with the greatest probability - Left
| (would tend to 100:0).
|
| Level 2: Consider that most people are likely to say Left, so
| say Right instead (would tend to 0:100).
|
| Level 3: Try to think about what proportion of people would say
| Left, and what would say Right, and say whichever would return
| the balance closest to 80:20...
|
| Presumably your result would depend on how many people thinking
| on each level you have in your sample...
| spiffytech wrote:
| I've seen people do this with Twitter polls with tens of
| thousands of respondents. The results distribution comes within
| a few percent of the prompted probabilities, even though
| respondents can't see the results until after they've voted.
| cortesoft wrote:
| This sounds a lot like a Keynesian Beauty Contest
| (https://en.wikipedia.org/wiki/Keynesian_beauty_contest), where
| you are trying to make a selection based on what you think
| other people are going to choose.
|
| If I really wanted to give an accurate answer in this case, I
| would probably choose some arbitrary source of a number (like
| my age or the number of days that have gone by so far this
| year), figure out the modulo 5 of that number, then say 'Right'
| if the modulo is 0, and 'Left' otherwise.
|
| Obviously there are some flaws in this approach, but I think it
| would be as accurate as I could get.
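|
| (A quick sketch of that rule, using day-of-the-year as the
| arbitrary number; one residue out of five gives roughly the
| 20% 'Right':)
|
|     from datetime import date
|
|     # arbitrary personal number used as a stand-in coin
|     day_of_year = date.today().timetuple().tm_yday
|     answer = "Right" if day_of_year % 5 == 0 else "Left"
|     print(answer)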
| patapong wrote:
| Fun question! I think the following would be a viable strategy
| without communication:
|
| Think of an observable criterion that matches the target
| distribution. For example, for 80-20:
|
| - "Is the minute count higher than 12?" (This is the case in
| 80% of cases)
|
| - "Do I have black hair?" (This is apparently also the case in
| 80% of cases)
|
| Then, answer according to this criterion.
|
| If everyone follows the same strategy, even if the criteria
| selected differs between each individual, the votes should
| match the target probability. Unless I am making a logical
| mistake :)
| michaelt wrote:
| Level 4: Clearly, the Schelling point requires a number
| everyone knows, which is evenly distributed across the
| population, modulo 10. Let's use year of birth modulo 10. For
| me that's 2, so I'll say Left.
| FrustratedMonky wrote:
| Humans are also pretty poor at this. So it isn't necessarily a
| hit against AI, as if it were failing to do something a human
| could do and AGI were therefore unreachable.
|
| I'm beginning to think AGI will be easy, since each individual
| human is pretty limited. It's the aggregate that makes humans
| achieve anything. Where are the AI models built on groups
| working together?
| aimor wrote:
| With ChatGPT 3.5, new chats prompted with: "Simulate a dice roll
| and report the number resulting from the roll. Only reply 1, 2,
| 3, 4, 5, or 6 and nothing else."
|
| So far I've got: 3, 4, 5, 5, 5, 3, 4, 3, 4, 5, 3, 4, 5, 5, 4, 5,
| 3, 3, 4, 4, 4, 5, 5.
|
| Of course I'm not the first to do this:
| https://piunikaweb.com/2023/05/23/does-chatgpt-ai-struggle-w...
|
| https://www.reddit.com/r/ChatGPT/comments/13nrmzw/in_every_c...
| eddd-ddde wrote:
| These are my results for a dice roll
|
| [1] > 3 5 2 4 1 6 3 2 5 1
|
| I tried my own experiments and ChatGPT felt like being funny:
|
| [2] > A third of the time, paragraphs end with the word foo,
| the other two thirds they end with the word bar, this time it
| will end on: > How about "baz"? It's unexpected and adds a
| touch of whimsy.
|
| Interestingly, this other prompt works as expected:
|
| [3] > about half of the time, you should say "foo", the other
| half, you should say "bar", what about now ? > Bar. > about
| half of the time, you should say "foo", the other half, you
| should say "bar", what about now ? > Foo.
|
| [1]:
| https://chat.openai.com/share/07388362-1a61-4527-81af-4941a0...
| [2]:
| https://chat.openai.com/share/9caf07dd-69f4-4470-82a6-ab5642...
| [3]: https://chat.openai.com/share/1c627528-60af-4cd9-a1ec-
| efa524...
| aimor wrote:
| Gotta say, I was not expecting Baz there.
|
| Regarding [1], for the dice roll I was creating a new chat
| for each roll to ensure that the results of each roll are (in
| some sense) independent. Generating a sequence of rolls is
| also interesting, just a different experiment.
| robertclaus wrote:
| I wonder if you could actually fine tune an LLM to do better on
| this. As some of the comments point out, the issue here is that
| the possible output probabilities combined with the model
| temperature don't actually result in the probabilities requested
| in the prompt. If you trained on specific generated data with
| real distributions would it learn to compensate appropriately?
| Would that carry over to novel probability prompts?
| phreeza wrote:
| Probably yes. You could also garnish the prompt with a vanilla
| RNG output.
| genrilz wrote:
| Almost certainly not if you set the temperature of the model to
| 0, since then the output would be deterministic minus MoE
| stuff.
|
| If the temperature was not zero, then it seems technically
| possible for the output tokens to weighted closely enough in
| probability to each other in a way such that the randomization
| from temperature causes tokens to be printed in the appropriate
| distribution.
|
| However, I'm not an LLM expert, but I don't think that people
| use a "temperature" while training the model. Thus the training
| step would not be able to learn how to output tokens in the
| given distribution with a given temperature because the
| training step does not have access to the temperature the user
| is using.
|
| EDIT: I made the assumption that the LLM was not asked for a
| sequence of random numbers, but only one number per prompt. I
| think this fits the use case described in the article, but
| another use case might be asking for a sequence of such
| numbers, in which case training might work.
| gwern wrote:
| > If you trained on specific generated data with real
| distributions
|
| It _was_ trained on generated data from real distributions! The
| datasets LLMs are trained on include gigabytes of real data
| from real distributions, in addition to all of the code
| /stats/etc samples.
|
| The question you should be asking is 'why did it _stop_ being
| able to predict real distributions? ' And we already know the
| answer: RLHF. https://news.ycombinator.com/item?id=40227082
| cdelsolar wrote:
| https://i.imgur.com/uR4WuQ0.gif
| qwertox wrote:
| Wouldn't this imply that they have access to a RNG?
| gwern wrote:
| They do, via the temperature sampling.
| dudeinhawaii wrote:
| "You are a weighted random choice generator. About 80% of the
| time please say 'left' and about 20% of the time say 'right'.
| Simply reply with left or right. Do not say anything else"
|
| Humans would say "Left" 100% of the time in a zero-shot scenario
| as well.
|
| Intuitively, your first response is going to be "left" since it
| has the 80% probability. You'd balance your answers over time
| when you realized you were closer to 90% by some arbitrary
| internal measurement (or maybe as you approached 10 iterations).
|
| I'd expect an LLM to generate an approximation similar to a human
| - over time. Turns out Humans can't do probability either. If you
| test the LLM multiple times, similar to how you'd ask a human
| multiple times, they tend to self-correct.
|
| Whether that self-correction (similar to a human) is based on
| some internal self-approximation of 80% is for someone else to
| research.
|
| Example session:
|
|     Prompt: "....probability prompt"  LLM: "left"
|     Prompt: "again"                   LLM: "left"
|     Prompt: "again"                   LLM: "left"
|     Prompt: "again"                   LLM: "right"
|     Prompt: "again"
|
| This was my session with GPT-4.
| geysersam wrote:
| > Humans would say "Left" 100% of the time in a zero-shot
| scenario as well.
|
| How can you know what _all_ humans would do?
|
| If the humans interpreted the task correctly, that is, if they
| understood they will only be asked once, but in a hypothetical
| repeated experiment the result should still be 80/20, they
| would certainly not always say "left".
| bena wrote:
| Because it's a stupid prompt. Especially for humans.
|
| Because you're really asking what they think the first
| response would be. That's left. If I knew a machine would
| pick left 80% of the time, I would bet left 100% of the time.
| And I'd be right about 80% of the time, which isn't perfect,
| but is profitable.
| ben_w wrote:
| A human brain can't be perfectly reset, the way an AI can.
|
| I don't know if our decision making processes are
| deterministic or quantum-random. If the former, then if you
| could reset a human mind and ask the same question, you would
| necessarily always get the same answer, whatever that
| happened to be.
| hwillis wrote:
| The LLM _isn't_ being perfectly reset. It chooses words
| randomly; internally it _should_ be slightly different
| every time. That's the whole point of temperature.
| tempusalaria wrote:
| Temperature has nothing to do with internals. Temperature
| is purely to do with how the logits outputted by the
| network are transformed into probabilities, which is
| completely deterministic and not learned. In fact,
| temperature makes it impossible for LLMs to simulate this
| kind of probability, as a calibrated 80-20 split at a
| certain low temperature would be a different split at some
| other temperature.
| itsgrimetime wrote:
| assuming the humans don't know what the other responses were,
| I can't imagine it actually coming out 80/20
| sqeaky wrote:
| When polls like these are run the numbers don't always wind
| up tilted in the favor of the bigger number. I wish I could
| provide a specific source but I've been listening to the
| 538 podcast for years and I know they've covered exactly
| this topic.
|
| Your inability to believe a thing doesn't prevent it from
| being true.
|
| I would grab a D20 and on a 16 or less I would say left
| otherwise I would say right. Some people would pick right
| just because they can. I imagine most people would pick
| left because it's the 80%. I imagine plenty of people would
| double and triple guess and waffle then say something.
|
| Few people, even the dumbest among us, are easily modelable
| deterministic automata.
| gwern wrote:
| > Humans would say "Left" 100% of the time in a zero-shot
| scenario as well.
|
| They do not! And you should not just make up assertions like
| these. You don't know what humans would say. In fact, in polls,
| they wind up remarkably calibrated. (This is also covered in
| the cognitive bias literature under 'probability matching'.)
| People do this poll on Twitter all the time.
| brabel wrote:
| Those humans, really difficult to know what they're thinking!
|
| Anyway, humans are fairly predictable when trying to come up
| with random numbers, for example, have a look at this
| Veritasium video: https://www.youtube.com/watch?v=d6iQrh2TK98
| taco_emoji wrote:
| > I'd expect an LLM to generate an approximation similar to a
| human
|
| why on earth would you expect that?
| ylow wrote:
| Indeed this is unsurprising given how LLMs work. I mean if you
| ask a human to generate a random number, and then reset the
| universe and all state of the human and ask again, you will get
| the same number.
|
| But instead if I ask it to generate 100 samples, it actually
| works pretty well.
|
| "You are a weighted random choice generator. About 80% of the
| time please say 'left' and about 20% of the time say 'right'.
| Generate 100 samples of either "left" or "right". Do not say
| anything else. "
|
| I got 71 left, and 27 right.
|
| And if I ask for 50%, 50%. I get 56 lefts and 44 rights.
| ylow wrote:
| (Yes 71 + 27 != 100, but that LLMs can't count is a whole other
| issue)
| gwern wrote:
| > Indeed this is unsurprising given how LLMs work. I mean if
| you ask a human to generate a random number, and then reset the
| universe and all state of the human and ask again, you will get
| the same number.
|
| It actually _is_ surprising, and you _should_ be surprised
| rather than post hoc justifying it, because the logits should
| reflect the true random probability and be calibrated in order
| to minimize the prediction loss. Putting ~100% weights on
| 'heads' is a terrible prediction!
|
| And the LLM logits _are_ in fact calibrated... _before_ they go
| through RLHF and RLHF-derived dataset training. (Note that all
| of the models OP lists are either non-base tuned models like
| ChatGPT, or trained on data from such models, like Phi.) This
| was observed qualitatively when the 3.5 models were first
| released to the Playground, documented by the GPT-4 paper, and
| the 'flattened logits' phenomenon has been found many times
| since, not just by OP, and mostly by people totally ignorant of
| this phenomenon (despite being quite well known).
|
| This is just one of those things, like BPE-related errors, that
| we're doomed to point out again and again in the Eternal
| September of LLMs.
| anorwell wrote:
| > Putting ~100% weights on 'heads' is a terrible prediction!
|
| For a weighted coin, isn't this the optimal strategy in the
| absence of other information? `p > p^2 + ( 1 - p )^2`.
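|
| (Working that inequality out for p = 0.8:)
|
|     p = 0.8
|     always_left = p                   # always guess the majority side
|     match_split = p**2 + (1 - p)**2   # guess 80/20 like the coin
|     # 0.8 > 0.68, so deterministically answering "left" maximizes
|     # the chance of matching any single weighted flip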
| spywaregorilla wrote:
| Even if the model correctly got 20%/80% on the very last layer of
| its token prediction for just these two tokens, the design of
| how the model leverages these probabilities would not choose
| them at that ratio.
| resource_waste wrote:
| It can do estimates, but it can't do truly random probability.
|
| Interesting.
| dphuang2 wrote:
| What an ironic observation, since LLMs are themselves probability
| machines.
| gweinberg wrote:
| A little strange that the post author tried things like switching
| from left/right to coffee/tea, but apparently didn't try
| inverting left and right.
| jll29 wrote:
| Keep in mind:
|
| 1. LLMs use random numbers internally, something that can be
| controlled via the 'temperature' parameter. temperature=0 means
| no random behavior (though it is broadly known that this is not
| fully correctly implemented in many LLMs); instead, the most
| likely answer will always be given, deterministically.
|
| 2. Note also that LLMs have no memory; the 'appearance' of memory
| is an illusion created by feeding the LLM the whole history of
| the chat with each new user utterance!
| savant_penguin wrote:
| 1. Incorrect. The output of the decoder LLM is the probability
| distribution of the next token given the input text.
| Temperature=0 means that the output distribution is not pushed
| to be closer to a uniform distribution. The randomness comes
| from the sampling of the next token according to the output
| distribution to generate text. If you want determinism you
| always get the argmax of the distribution.
| gliptic wrote:
| Incorrect. The output of the decoder LLM is logits that are
| then divided by the temperature and passed through softmax to
| give the probabilities. You can't actually set temperature to
| 0 (division by zero), but in the limit where temperature
| approaches 0, softmax converges to argmax.
|
| Temperature = 1 is where it's not pushed in either direction.
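|
| (A minimal sketch of that sampling step, assuming plain
| temperature-scaled softmax sampling over the final logits:)
|
|     import numpy as np
|
|     rng = np.random.default_rng()
|
|     def sample(logits, temperature):
|         # divide logits by temperature, then softmax into probs
|         z = np.array(logits) / temperature
|         p = np.exp(z - z.max())
|         p /= p.sum()
|         return rng.choice(len(p), p=p)
|
|     # temperature -> 0 collapses onto the argmax token;
|     # temperature = 1 samples the model's unmodified distribution
|     sample([2.0, 0.5], 1.0)
|     sample([2.0, 0.5], 0.01)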
| offmycloud wrote:
| LLMs can't do math in general, they need external help to do
| simple math problems with any consistency.
| mch82 wrote:
| Can you recommend any references that explain why LLMs can't do
| math?
| digitalsushi wrote:
| They can't and it makes playing D&D with them really frustrating.
| taco_emoji wrote:
| yeah no shit
| ddp26 wrote:
| For those who only read the headline, LLMs can in fact do
| advanced probabilistic reasoning, when given the right tooling.
| This article is talking about their ability to act as a RNG.
|
| One interesting thing I've found in building an AI forecaster is
| that you can use the logprobs from the token representing
| probability, so when the model concludes some long chain of
| thought with "20%", you can check the logprob of that token vs
| "25%" or "15%" to get confidence levels.
| xmurph86x wrote:
| Good read
| pmarreck wrote:
| Humans can't do it well either.
| Imnimo wrote:
| Another interesting experiment on this front:
|
| https://twitter.com/infobeautiful/status/1778059112250589561
|
| One thing I would have liked to see in the blog post is some
| attention to temperature. It looks like they're calling ChatGPT
| through LangChain - what is the default temperature? If LangChain
| is choosing a low temperature by default, we shouldn't be
| surprised if we get an incorrect distribution even if ChatGPT
| were perfectly calibrated! My guess is that even at temperature
| 1, this result will roughly hold, but we should be careful not to
| fool ourselves.
|
| If we take the result at face value, though, it's interesting to
| note that GPT-4's technical report showed that the chat model
| (the one with the RLHF and what not) had flatter-than-correct
| calibration on its logprobs. But here we're seeing sharper-than-
| correct. What explains the difference?
| Kuinox wrote:
| What happens if you inject random tokens as a "seed"?
| Terr_ wrote:
| My rule of thumb is to take every single LLM prompt and just
| imagine that it's prefixed with:
|
| "Computer, focus on generating output that resembles the words
| people in the past used after _they_ were given the following
| words..."
| jmprspret wrote:
| That is accurate to what they do. I think others need to
| imagine this as well. Far too many nontechnical people seem to
| treat them as some kind of Oracle.
| inopinatus wrote:
| Correct. You must perceive them as plausibility engines. The
| unstated hypothesis is that plausibility of output may converge
| towards correctness of output with increasing scale and
| sophistication. This hypothesis remains very far from proven.
| panarky wrote:
| Your understanding of how LLMs work is overly simplistic and
| incomplete.
|
| Yes, doing probabilistic next-word prediction plays a role in
| how LLMs generate text output, but that's not the whole story.
|
| LLMs "understand" (to a degree): They develop complex internal
| representations of concepts they've been trained on. This isn't
| just about word association; they develop an understanding of
| the relationships between objects, actions, and ideas.
|
| They can reason, not just mimic: LLMs can perform logical
| reasoning, using their internal knowledge base to solve
| problems or answer questions. This might involve following
| multi-step instructions, drawing inferences from information
| provided, or adapting to new prompts in a way that requires a
| degree of abstract thinking.
|
| Beyond simple probabilities: Yes, LLMs do consider the
| probability of certain word sequences, but their output is far
| more sophisticated than just picking the most likely next word.
| They weigh context, concepts, relationships, nuance, logic, and
| even the unstated but inferred purpose of the user when
| generating responses.
| seizethecheese wrote:
| I feel like this is the midwit meme response, and the idiot
| and genius are just saying "it's next token probabilistic
| generation"
| Jensson wrote:
| > their output is far more sophisticated than just picking
| the most likely next word
|
| Picking the most likely next word is an extremely
| sophisticated algorithm, if you could do that you would
| almost sound human. Like an LLM!
|
| However, focusing on just picking the most likely next word
| does cause some issues, such as being biased towards words
| that were actually posted instead of those that were too
| boring to post. So there are still things such an algorithm
| can't understand, like these probabilities, since they are so
| unevenly posted.
| causal wrote:
| Yeah I think the word probabilistic has been broadly
| misconstrued, because we usually see it in contexts where
| probabilities are used as shortcuts to avoid comprehensive
| analysis. But a probability can also be the output of a
| detailed analysis with rich semantic representation of
| causes.
| none_to_remain wrote:
| I tried something on ChatGPT 3.5 a couple months ago:
|
| "Here is a probability mass function:
|
| p(0) = 0.1
|
| p(1) = 0.2
|
| p(2) = 0.5
|
| p(3) = 0.2
|
| and
|
| p(x) = 0 for x < 0 or x >3
|
| Call this the XYZ distribution. Generate 20 numbers from the XYZ
| distribution."
|
| "Certainly! The XYZ distribution you've described has a discrete
| probability mass function (PMF). Here are 20 random numbers
| generated from the XYZ distribution based on the given
| probabilities:
|
| [results w/ three 0s, four 1s, ten 2s, three 3s]
|
| These numbers are generated based on the probabilities you
| provided, and they satisfy the conditions of the XYZ
| distribution. The distribution is discrete, and each number has
| the specified probability of occurring."
|
| I didn't do any real statistical testing but the one answer sure
| looked right. It was also able to give me the CDF of the XYZ
| distribution but I was less surprised by that as it's a simple
| textbook exercise ChatGPT would have read many times.
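|
| (For comparison, a sketch of what an actual sampler for that PMF
| gives, e.g. with numpy:)
|
|     import numpy as np
|
|     # the "XYZ" distribution from above
|     values = [0, 1, 2, 3]
|     pmf = [0.1, 0.2, 0.5, 0.2]
|
|     rng = np.random.default_rng()
|     samples = rng.choice(values, size=20, p=pmf)
|     print(np.bincount(samples, minlength=4))  # counts of 0..3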
| petercooper wrote:
| I tweaked it a bit to "You are a weighted random choice
| generator. About 80% of the time please say 'left' and about 20%
| of the time say 'right'. Simply reply with left or right. Do not
| say anything else. Repeat this process ten times." ... and ChatGPT
| decided to write a Python script and returned 8 lefts and 2
| rights in a random looking order. I'm not counting it down and
| out just yet ;-)
| mk_stjames wrote:
| I went through all the comments here and I'm still not seeing
| anyone address this:
|
| If I am reading this person correctly... they prompted the model
| with the prompt directly 1000 times... but only ever as the first
| turn. They did not allow the model to actually run a chat
| context. Simply: output the first in a list of 'left' and 'right'
| and favor 'left' 80% of the time... but then the author only
| asked for the first output.
|
| This person doesn't understand how LLMs and their output sampling
| works. Or they do and they still just decided to go with their
| method here because of course it works this way.
|
| The model takes the prompt. The first following output token it
| chooses, for this specfic model, happens to be 'Left'. They shut
| down the prompt and prompt again. Of course the next output will
| be 'Left'. They aren't letting it run in context and continue.
| The temperature of the model is low enough that the sampler is
| going to always pick the 'Left' token, or at least 999/1000 in
| this case. It cannot start to do an 80/20 split of Left/Right if
| you never give it a chance to start counting in context.
| Continuously stopping the prompt and re-promting will, of course,
| just run the same thing again.
|
| I can't tell if the author understands this and is pontificating
| on purpose or if the author doesn't understand this and is trying
| to make some profound statement on what LLMs can't do when...
| anyone who knows how the model inference runs could have told you
| this.
| apendleton wrote:
| I think this is overthinking it. ChatGPT is billed as a
| general-purpose question-answerer, as are its competitors. A
| regular user shouldn't have to care how it works, or know
| anything about context or temperature or whatever. They ask a
| question, it answers and appears to have given a plausible
| answer, but doesn't actually do the task that was asked for and
| that it appears to do. What the technical reasons are that it
| can't do the thing are interesting, but not the point.
| mk_stjames wrote:
| But it is like asking a person to generate the same
| thing, but when they start to list off their answers stopping
| them by throwing up your hand after their first response,
| writing that down, and then going back in time and asking
| that person to do the same thing, and stopping them again,
| and repeat - and then being surprised after 1000 times that
| you didn't get a reasonable distribution.
|
| Meanwhile if you let the person actually continue, you may
| actually get 'Left..... Left Right Left Left... etc'.
| sqeaky wrote:
| If you asked me to pick a random number between one and six
| and ignore all previous attempts, I would roll a die and
| you would get a uniform distribution (or at least not 99%
| the same number).
|
| If you are saying that this thing can't generate random
| numbers on the first try then it can't generate random
| numbers. Which makes sense. Computers have a really hard
| time with random, and that's why every computer science
| course makes it clear that we're doing pseudo random most
| of the time.
| gowld wrote:
| It's ChatGPT, not ThinkGPT
| abdullahkhalids wrote:
| Isn't this way of prompting roughly equal to asking 1000
| people to pick left or right with 80% prob of left? I imagine,
| the result with humans will be closer to 80:20 than whatever
| happened with the LLM.
| mk_stjames wrote:
| No, because it is the same damn 'person' every time. It has
| no memory without using a running context, but it is also a
| fixed model. It isn't 1000 differing opinions.
|
| The exact prompt is given every single time, starting anew
| each time. The temperature of the output sampler is low
| enough that it sticks with 'Left' most of the time because
| that is the predicted next token for that given input prompt.
| Even if you raised the temperature a LOT, you'd only start to
| maybe get 50/50, and then start to get 'CAT' and 'STEVE' once
| the temp got too high. The 'randomness' has nothing to do
| with the prompted 'randomness'.
|
| It needs the context window to 'count' and even then, they
| aren't known to be great at 'counting'.
| modeless wrote:
| Humans aren't great at probability either. I wonder if you
| prompted a thousand people with this question what the
| distribution of first responses would be?
| throwitaway222 wrote:
| I bet that the LLM can write a program that CALLS GPT and tells
| it to lie 20% of the time in the prompt
|
| layers people
___________________________________________________________________
(page generated 2024-05-01 23:01 UTC)