https://brainsteam.co.uk/2024/05/01/llms-cant-do-probability/

LLMs Can't Do Probability

1 May 2024 / Uncategorised / jamesravey

I've seen a couple of recent posts in which the writer mentioned asking LLMs to do something with a certain probability, or a certain percentage of the time. One particular example stuck in my mind, though I've since lost the link to it (if you're the author, please get in touch so I can link through to you). The gist is that the author built a Custom GPT loaded with educational course material and put in the prompt that the bot should lie about 20% of the time. They then asked their students to chat with the bot and try to pick out the lies. I think this is a really interesting, lateral-thinking use case, since the kids are probably going to use ChatGPT anyway.

The thing that bothered me is that transformer-based LLMs don't know how to interpret requests for certain probabilities of outcomes. We already know that ChatGPT reflects human bias when generating random numbers, but I decided to put it to the test with making random choices.

Testing Probability in LLMs

I prompted the models with the following:

    "You are a weighted random choice generator. About 80% of the time please say 'left' and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else"

I ran this 1000 times through several different models. Random chance is random (profound, huh?), so we're always going to get some deviation from perfect odds, but we're hoping for roughly 800 'lefts' and 200 'rights' - something in that ballpark.
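The tallying harness for a test like this is only a few lines. Here's a minimal sketch; `query_model` is a hypothetical stand-in for the real chat-completion API call, and to keep the sketch self-contained it simulates the behaviour I actually observed (the model almost always says 'left'):

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call - an actual
    run would hit the model API here. This stub simulates the observed
    behaviour: the model answers 'left' almost every time."""
    return "left" if random.random() < 0.999 else "right"

PROMPT = (
    "You are a weighted random choice generator. About 80% of the time "
    "please say 'left' and about 20% of the time say 'right'. "
    "Simply reply with left or right. Do not say anything else"
)

# Tally 1000 runs, normalising case/whitespace on the one-word reply
counts = Counter(query_model(PROMPT).strip().lower() for _ in range(1000))
print(counts["left"], counts["right"])
```

Swapping the stub for a real API call reproduces the experiment; everything else stays the same.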
Here are the results:

Model         | Lefts | Rights
GPT-4-Turbo   |   999 |      1
GPT-3.5-Turbo |   975 |     25
Llama-3-8B    |  1000 |      0
Phi-3-3.8B    |  1000 |      0

As you can see, LLMs seem to struggle with probability expressed in the system prompt. They almost always answer 'left', even though we asked for that only 80% of the time.

I didn't want to burn lots of $$$ asking GPT-3.5 (which did best in the first round) to reply with single-word choices to silly questions, but I tried a couple of other pairs of words to see how the wording affects things. This time I only ran each prompt 100 times.

Choice (always 80% / 20%)   | Result
Coffee / Tea                | 87/13
Dog / Cat                   | 69/31
Elon Musk / Mark Zuckerberg | 88/12

Random choices from GPT-3.5-Turbo

So what's going on here? The models have their own internal weightings for words and phrases, based on the training data used to prepare them. These weights are likely influencing how much attention the model pays to your request.

So what can we do if we want to simulate some sort of probabilistic outcome? Well, we could use a Python script to randomly decide which of two prompts to send:

```python
import random

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage

# 80 copies of 'prompt1' and 20 copies of 'prompt2' - an 80/20 weighting
choices = (['prompt1'] * 80) + (['prompt2'] * 20)

# we should now have a list of 100 possible values - 80 are prompt1, 20 are prompt2
assert len(choices) == 100

chat = ChatOpenAI(model="gpt-3.5-turbo")

# randomly pick from choices - the odds are controlled in Python, not by the model
if random.choice(choices) == 'prompt1':
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
    r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])
```

Conclusion

How does this help non-technical people who want to build these sorts of use cases, or Custom GPTs that reply with certain responses? Well, it kind of doesn't.
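As a quick sanity check, the 80/20 split in the script above really is enforced by `random.choice` rather than by the model - sampling the weighted list many times converges on the requested odds:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed purely to make the demo reproducible

# Same weighted list as above: 80 copies of one value, 20 of the other
choices = (['prompt1'] * 80) + (['prompt2'] * 20)

# Draw from it 10,000 times and measure the observed frequency
counts = Counter(random.choice(choices) for _ in range(10_000))
ratio = counts['prompt1'] / 10_000
print(ratio)  # sits very close to 0.8
```

This is the key difference from prompting: the distribution is guaranteed by the sampling code, not hoped for from the model.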
I guess a technical-enough user could build a Custom GPT that uses function calling to decide how it should answer a question for a "spot the misinformation" pop-quiz use case. However, my broad advice is that you should be very wary of asking LLMs to behave with a certain likelihood unless you're able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more different words, different distributions (instead of 80/20), and maybe some keywords like "sometimes" or "occasionally".

---------------------------------------------------------------------

Previous: Can Phi3 and Llama3 Do Biology?

---------------------------------------------------------------------

4 responses to "LLMs Can't Do Probability"

1. Peter, 1 May 2024

There are some real subtleties here, because there are a number of different sampling moments in the output of a GPT. First, the LLM itself should be ranking next tokens with a true probability distribution (scores are fed into a softmax function, which outputs a probability distribution that sums to 1, or very near it). It would be interesting to evaluate output distributions for some questions and see if the LLM itself could have some control over the output distribution. My guess is that the answer is yes; that is, if you gave it a prompt like "an unfair coin is flipped, with heads being 60% likely and tails 40% likely. Bob flips the coin. The coin shows:", I propose that the score for 'heads' would be higher than for 'tails', and 'tails' would score higher than, say, 10%. However, in the API you're asking OpenAI to sample the outputs for you, and there are a lot of decisions involved, including whether or not to cut off answers below a certain cumulative probability (top-p), beyond a fixed number of top candidates (top-k), and so on. We're well into central limit theorem territory here, I think.
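The softmax step described above - raw next-token scores normalised into a probability distribution - can be illustrated in a few lines. The logit values here are invented for illustration, not real model outputs:

```python
import math

def softmax(logits):
    # Exponentiate each score, then normalise so the results sum to 1
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the tokens 'heads' and 'tails' in the
# unfair-coin prompt - made-up numbers, purely illustrative
p_heads, p_tails = softmax([2.0, 1.6])
print(p_heads, p_tails)
```

The point is that the model's output layer *is* a genuine probability distribution; whether the requested 60/40 bias actually shows up in those probabilities is an empirical question, and the API's sampling settings (temperature, top-p, top-k) then reshape whatever distribution the model produced.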
All that said, obviously your point is well made -- don't naively think that we'll get something X% of the time. However, it may well be available at the output stage of the LLM -- I don't think your experiments are dispositive there!

2. nemo, 1 May 2024

I found an amusing way to *kinda* make it work in ChatGPT 3.5 (the free website version):

"Take the first 20 digits of pi. If the number is 8 or 9, come up with a *creative* statement that is a falsehood and output that in quote marks. If the number is less than 8, output a *creative* statement that is a truth in quote marks. Explain your reasoning for each digit."

Explaining reasoning was clear or it got off track. It did thoughtfully bundle up all the statements in the code block. Here was the output.

================

Certainly, let's proceed with the transformation and creative statements:

1. *3*: This digit is less than 8, so we'll output a creative statement that is a truth. "Penguins can't fly, but they sure can swim!"
2. *1*: Another digit less than 8, outputting a truth. "Sunflowers follow the sun throughout the day."
3. *4*: Still less than 8, outputting a truth. "The moon controls the tides with its gravitational pull."
4. *1*: Same as before, outputting a truth. "Bees communicate through elaborate dances."
5. *5*: Yet another digit less than 8, outputting a truth. "The Great Wall of China is visible from space."
6. *9*: Ah, our first 8 or 9! Outputting a falsehood. "Cats are actually aliens in disguise."
7. *2*: Back to less than 8, outputting a truth. "Rainbows are formed by the refraction and dispersion of sunlight."
8. *6*: Less than 8, outputting a truth. "The Eiffel Tower was originally intended to be a temporary installation."
9. *5*: Less than 8, outputting a truth. "Elephants are the only mammals that can't jump."
10. *3*: Less than 8, outputting a truth. "Octopuses have three hearts."
11. *5*: Less than 8, outputting a truth. "Bananas are berries, but strawberries are not."
12. *8*: 8 or 9, outputting a falsehood. "The moon is made of cheese."
13. *9*: 8 or 9, outputting a falsehood. "Goldfish only have a memory span of three seconds."
14. *7*: Less than 8, outputting a truth. "The Earth's atmosphere is composed primarily of nitrogen and oxygen."
15. *9*: 8 or 9, outputting a falsehood. "Watermelons are made up of 90% water."
16. *3*: Less than 8, outputting a truth. "The Sahara Desert is the largest hot desert in the world."
17. *2*: Less than 8, outputting a truth. "There are more stars in the universe than grains of sand on Earth."
18. *3*: Less than 8, outputting a truth. "A group of flamingos is called a flamboyance."
19. *8*: 8 or 9, outputting a falsehood. "Humans only use 10% of their brains."
20. *4*: Less than 8, outputting a truth. "The Great Barrier Reef is the largest coral reef system in the world."

Here's the transformed sequence along with the creative statements:

"Penguins can't fly, but they sure can swim!" "Sunflowers follow the sun throughout the day." "The moon controls the tides with its gravitational pull." "Bees communicate through elaborate dances." "The Great Wall of China is visible from space." "Cats are actually aliens in disguise." "Rainbows are formed by the refraction and dispersion of sunlight." "The Eiffel Tower was originally intended to be a temporary installation." "Elephants are the only mammals that can't jump." "Octopuses have three hearts." "Bananas are berries, but strawberries are not." "The moon is made of cheese." "Goldfish only have a memory span of three seconds." "The Earth's atmosphere is composed primarily of nitrogen and oxygen." "Watermelons are made up of 90% water." "The Sahara Desert is the largest hot desert in the world." "There are more stars in the universe than grains of sand on Earth." "A group of flamingos is called a flamboyance." "Humans only use 10% of their brains." "The Great Barrier Reef is the largest coral reef system in the world."
Let me know if this is along the lines of what you had in mind!

3. nemo, 1 May 2024

Oh... I meant to say "was clearly needed" (when it didn't explain reasoning it was pretty bad at it).

Also... watermelons are 90% water is maybe *technically* a falsehood (internets say usually 92%) but... yeah... Perhaps asking it to explain its falsehood might have helped too.

4. nemo, 1 May 2024

Oh... and here was one more try:

"Let's run through this entire exercise one more time, only this time I want you to use 92980260217831312356 as an input instead of pi. And of course, generate new creative lies and facts. And explain your reasoning."

Now obviously this is stretching the idea of "random" but, eh, might still be convenient for a human to do. This was the output:

============================

Sure, let's proceed with the transformation using the input "92980260217831312356":

1. *9*: This digit is 8 or 9, so we'll output a creative lie. "Rabbits lay eggs during the winter solstice."
2. *2*: This digit is less than 8, so we'll output a creative fact. "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."
3. *9*: Another 8 or 9, outputting a lie. "Cats can understand over 100 different human languages."
4. *8*: Another 8 or 9, outputting a lie. "Trees have the ability to communicate with each other through underground networks of fungi."
5. *0*: Less than 8, outputting a fact. "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896."
6. *2*: Less than 8, outputting a fact. "The average person will spend six months of their life waiting for red lights to turn green."
7. *6*: Less than 8, outputting a fact. "The smell of freshly cut grass is actually a plant distress call."
8. *0*: Less than 8, outputting a fact. "The unicorn is Scotland's national animal."
9. *2*: Less than 8, outputting a fact. "Bananas are berries, but strawberries are not."
10. *1*: Less than 8, outputting a fact. "Octopuses have three hearts."
11. *7*: Less than 8, outputting a fact. "The Sahara Desert is the largest hot desert in the world."
12. *8*: Another 8 or 9, outputting a lie. "Giraffes have the ability to fly short distances when threatened."
13. *3*: Less than 8, outputting a fact. "There are more stars in the universe than grains of sand on Earth."
14. *1*: Less than 8, outputting a fact. "A group of flamingos is called a flamboyance."
15. *3*: Less than 8, outputting a fact. "The Great Barrier Reef is the largest coral reef system in the world."
16. *1*: Less than 8, outputting a fact. "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896."
17. *2*: Less than 8, outputting a fact. "The smell of freshly cut grass is actually a plant distress call."
18. *3*: Less than 8, outputting a fact. "The unicorn is Scotland's national animal."
19. *5*: Less than 8, outputting a fact. "The Earth's atmosphere is composed primarily of nitrogen and oxygen."
20. *6*: Less than 8, outputting a fact. "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."

Here's the transformed sequence along with the creative statements:

"Rabbits lay eggs during the winter solstice." "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible." "Cats can understand over 100 different human languages." "Trees have the ability to communicate with each other through underground networks of fungi." "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896." "The average person will spend six months of their life waiting for red lights to turn green." "The smell of freshly cut grass is actually a plant distress call."
"The unicorn is Scotland's national animal." "Bananas are berries, but strawberries are not." "Octopuses have three hearts." "The Sahara Desert is the largest hot desert in the world." "Giraffes have the ability to fly short distances when threatened." "There are more stars in the universe than grains of sand on Earth." "A group of flamingos is called a flamboyance." "The Great Barrier Reef is the largest coral reef system in the world." "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896." "The smell of freshly cut grass is actually a plant distress call." "The unicorn is Scotland's national animal." "The Earth's atmosphere is composed primarily of nitrogen and oxygen." "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."

Let me know if this is what you had in mind!

Copyright (c) James Ravenscroft 2023. All Rights Reserved.