https://brainsteam.co.uk/2024/05/01/llms-cant-do-probability/

LLMs Can't Do Probability

1 May 2024 / Uncategorised / jamesravey

I've seen a couple of recent posts in which the writer mentioned asking LLMs to do something with a certain probability, or a certain percentage of the time. One particular example stuck in my mind, though I've since lost the link to it (if you're the author, please get in touch so I can link through to you). The gist is that the author built a Custom GPT loaded with educational course material and put in the prompt that the bot should lie about 20% of the time. They then asked their students to chat with the bot and try to pick out the lies. I think this is a really interesting, lateral-thinking use case, since the kids are probably going to use ChatGPT anyway.

The thing that bothered me is that transformer-based LLMs don't know how to interpret requests for certain probabilities of outcomes. We already know that ChatGPT reflects human bias when generating random numbers, but I decided to put it to the test with making random choices.

Testing Probability in LLMs

I prompted the models with the following:

    "You are a weighted random choice generator. About 80% of the time please say 'left' and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else"

I ran this 1000 times through several different models. Random chance is random (profound, huh?), so we're always going to get some deviation from perfect odds, but we're hoping for roughly 800 'lefts' and 200 'rights' - something in that ballpark.
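The tallying harness for a test like this is only a few lines. Here's a minimal sketch; `query_model` is a hypothetical stand-in for the real chat-completion API call, and to keep the sketch self-contained it simulates the behaviour I actually observed (the model almost always says 'left'):

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call - an actual
    run would hit the model API here. This stub simulates the observed
    behaviour: the model answers 'left' almost every time."""
    return "left" if random.random() < 0.999 else "right"

PROMPT = (
    "You are a weighted random choice generator. About 80% of the time "
    "please say 'left' and about 20% of the time say 'right'. "
    "Simply reply with left or right. Do not say anything else"
)

# Tally 1000 runs, normalising case/whitespace on the one-word reply
counts = Counter(query_model(PROMPT).strip().lower() for _ in range(1000))
print(counts["left"], counts["right"])
```

Swapping the stub for a real API call reproduces the experiment; everything else stays the same.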
Here are the results:

Model         | Lefts | Rights
GPT-4-Turbo   |   999 |      1
GPT-3.5-Turbo |   975 |     25
Llama-3-8B    |  1000 |      0
Phi-3-3.8B    |  1000 |      0

As you can see, LLMs seem to struggle with probability expressed in the system prompt. They almost always answer 'left', even though we asked for that only 80% of the time.

I didn't want to burn lots of $$$ asking GPT-3.5 (which did best in the first round) to reply with single-word choices to silly questions, but I tried a couple of other pairs of words to see how the wording affects things. This time I only ran each prompt 100 times.

Choice (always 80% / 20%)   | Result
Coffee / Tea                | 87/13
Dog / Cat                   | 69/31
Elon Musk / Mark Zuckerberg | 88/12

Random choices from GPT-3.5-Turbo

So what's going on here? The models have their own internal weightings for words and phrases, based on the training data used to prepare them. These weights are likely influencing how much attention the model pays to your request.

So what can we do if we want to simulate some sort of probabilistic outcome? Well, we could use a Python script to randomly decide which of two prompts to send:

```python
import random

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage

# 80 copies of 'prompt1' and 20 copies of 'prompt2' - an 80/20 weighting
choices = (['prompt1'] * 80) + (['prompt2'] * 20)

# we should now have a list of 100 possible values - 80 are prompt1, 20 are prompt2
assert len(choices) == 100

chat = ChatOpenAI(model="gpt-3.5-turbo")

# randomly pick from choices - the odds are controlled in Python, not by the model
if random.choice(choices) == 'prompt1':
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
    r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])
```

Conclusion

How does this help non-technical people who want to build these sorts of use cases, or Custom GPTs that reply with certain responses? Well, it kind of doesn't.
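As a quick sanity check, the 80/20 split in the script above really is enforced by `random.choice` rather than by the model - sampling the weighted list many times converges on the requested odds:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed purely to make the demo reproducible

# Same weighted list as above: 80 copies of one value, 20 of the other
choices = (['prompt1'] * 80) + (['prompt2'] * 20)

# Draw from it 10,000 times and measure the observed frequency
counts = Counter(random.choice(choices) for _ in range(10_000))
ratio = counts['prompt1'] / 10_000
print(ratio)  # sits very close to 0.8
```

This is the key difference from prompting: the distribution is guaranteed by the sampling code, not hoped for from the model.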
I guess a technical-enough user could build a Custom GPT that uses function calling to decide how it should answer a question for a "spot the misinformation" pop-quiz use case. However, my broad advice is that you should be very wary of asking LLMs to behave with a certain likelihood unless you're able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more different words, different distributions (instead of 80/20), and maybe some keywords like "sometimes" or "occasionally".

---------------------------------------------------------------------

Previous: Can Phi3 and Llama3 Do Biology?

---------------------------------------------------------------------

4 responses to "LLMs Can't Do Probability"

1. Peter, 1 May 2024

There are some real subtleties here, because there are a number of different sampling moments in the output of a GPT. First, the LLM itself should be ranking next tokens with a true probability distribution (scores are fed into a softmax function, which outputs a probability distribution that sums to 1, or very near it). It would be interesting to evaluate output distributions for some questions and see if the LLM itself could have some control over the output distribution. My guess is that the answer is yes; that is, if you gave it a prompt like "an unfair coin is flipped, with heads being 60% likely and tails 40% likely. Bob flips the coin. The coin shows:", I propose that the score for 'heads' would be higher than for 'tails', and 'tails' would score higher than, say, 10%. However, in the API you're asking OpenAI to sample the outputs for you, and there are a lot of decisions involved, including whether or not to cut off answers below a certain cumulative probability (top-p), beyond a fixed number of top candidates (top-k), and so on. We're well into central limit theorem territory here, I think.
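The softmax step described above - raw next-token scores normalised into a probability distribution - can be illustrated in a few lines. The logit values here are invented for illustration, not real model outputs:

```python
import math

def softmax(logits):
    # Exponentiate each score, then normalise so the results sum to 1
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the tokens 'heads' and 'tails' in the
# unfair-coin prompt - made-up numbers, purely illustrative
p_heads, p_tails = softmax([2.0, 1.6])
print(p_heads, p_tails)
```

The point is that the model's output layer *is* a genuine probability distribution; whether the requested 60/40 bias actually shows up in those probabilities is an empirical question, and the API's sampling settings (temperature, top-p, top-k) then reshape whatever distribution the model produced.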
All that said, obviously your point is well made -- don't naively think that we'll get something X% of the time. However, it may well be available at the output stage of the LLM -- I don't think your experiments are dispositive there!

2. nemo, 1 May 2024

I found an amusing way to *kinda* make it work in ChatGPT 3.5 (the free website version):

"Take the first 20 digits of pi. If the number is 8 or 9, come up with a *creative* statement that is a falsehood and output that in quote marks. If the number is less than 8, output a *creative* statement that is a truth in quote marks. Explain your reasoning for each digit."

Explaining reasoning was clear or it got off track. It did thoughtfully bundle up all the statements in the code block. Here was the output.

================

Certainly, let's proceed with the transformation and creative statements:

1. *3*: This digit is less than 8, so we'll output a creative statement that is a truth. "Penguins can't fly, but they sure can swim!"
2. *1*: Another digit less than 8, outputting a truth. "Sunflowers follow the sun throughout the day."
3. *4*: Still less than 8, outputting a truth. "The moon controls the tides with its gravitational pull."
4. *1*: Same as before, outputting a truth. "Bees communicate through elaborate dances."
5. *5*: Yet another digit less than 8, outputting a truth. "The Great Wall of China is visible from space."
6. *9*: Ah, our first 8 or 9! Outputting a falsehood. "Cats are actually aliens in disguise."
7. *2*: Back to less than 8, outputting a truth. "Rainbows are formed by the refraction and dispersion of sunlight."
8. *6*: Less than 8, outputting a truth. "The Eiffel Tower was originally intended to be a temporary installation."
9. *5*: Less than 8, outputting a truth. "Elephants are the only mammals that can't jump."
10. *3*: Less than 8, outputting a truth. "Octopuses have three hearts."
11. *5*: Less than 8, outputting a truth. "Bananas are berries, but strawberries are not."
12. *8*: 8 or 9, outputting a falsehood. "The moon is made of cheese."
13. *9*: 8 or 9, outputting a falsehood. "Goldfish only have a memory span of three seconds."
14. *7*: Less than 8, outputting a truth. "The Earth's atmosphere is composed primarily of nitrogen and oxygen."
15. *9*: 8 or 9, outputting a falsehood. "Watermelons are made up of 90% water."
16. *3*: Less than 8, outputting a truth. "The Sahara Desert is the largest hot desert in the world."
17. *2*: Less than 8, outputting a truth. "There are more stars in the universe than grains of sand on Earth."
18. *3*: Less than 8, outputting a truth. "A group of flamingos is called a flamboyance."
19. *8*: 8 or 9, outputting a falsehood. "Humans only use 10% of their brains."
20. *4*: Less than 8, outputting a truth. "The Great Barrier Reef is the largest coral reef system in the world."

Here's the transformed sequence along with the creative statements:

"Penguins can't fly, but they sure can swim!" "Sunflowers follow the sun throughout the day." "The moon controls the tides with its gravitational pull." "Bees communicate through elaborate dances." "The Great Wall of China is visible from space." "Cats are actually aliens in disguise." "Rainbows are formed by the refraction and dispersion of sunlight." "The Eiffel Tower was originally intended to be a temporary installation." "Elephants are the only mammals that can't jump." "Octopuses have three hearts." "Bananas are berries, but strawberries are not." "The moon is made of cheese." "Goldfish only have a memory span of three seconds." "The Earth's atmosphere is composed primarily of nitrogen and oxygen." "Watermelons are made up of 90% water." "The Sahara Desert is the largest hot desert in the world." "There are more stars in the universe than grains of sand on Earth." "A group of flamingos is called a flamboyance." "Humans only use 10% of their brains." "The Great Barrier Reef is the largest coral reef system in the world."
Let me know if this is along the lines of what you had in mind!

3. nemo, 1 May 2024

Oh... I meant to say "was clearly needed" (when it didn't explain reasoning it was pretty bad at it).

Also... watermelons are 90% water is maybe *technically* a falsehood (internets say usually 92%) but... yeah... Perhaps asking it to explain its falsehood might have helped too.

4. nemo, 1 May 2024

Oh... and here was one more try:

"Let's run through this entire exercise one more time, only this time I want you to use 92980260217831312356 as an input instead of pi. And of course, generate new creative lies and facts. And explain your reasoning."

Now obviously this is stretching the idea of "random" but, eh, might still be convenient for a human to do. This was the output:

============================

Sure, let's proceed with the transformation using the input "92980260217831312356":

1. *9*: This digit is 8 or 9, so we'll output a creative lie. "Rabbits lay eggs during the winter solstice."
2. *2*: This digit is less than 8, so we'll output a creative fact. "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."
3. *9*: Another 8 or 9, outputting a lie. "Cats can understand over 100 different human languages."
4. *8*: Another 8 or 9, outputting a lie. "Trees have the ability to communicate with each other through underground networks of fungi."
5. *0*: Less than 8, outputting a fact. "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896."
6. *2*: Less than 8, outputting a fact. "The average person will spend six months of their life waiting for red lights to turn green."
7. *6*: Less than 8, outputting a fact. "The smell of freshly cut grass is actually a plant distress call."
8. *0*: Less than 8, outputting a fact. "The unicorn is Scotland's national animal."
9. *2*: Less than 8, outputting a fact. "Bananas are berries, but strawberries are not."
10. *1*: Less than 8, outputting a fact. "Octopuses have three hearts."
11. *7*: Less than 8, outputting a fact. "The Sahara Desert is the largest hot desert in the world."
12. *8*: Another 8 or 9, outputting a lie. "Giraffes have the ability to fly short distances when threatened."
13. *3*: Less than 8, outputting a fact. "There are more stars in the universe than grains of sand on Earth."
14. *1*: Less than 8, outputting a fact. "A group of flamingos is called a flamboyance."
15. *3*: Less than 8, outputting a fact. "The Great Barrier Reef is the largest coral reef system in the world."
16. *1*: Less than 8, outputting a fact. "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896."
17. *2*: Less than 8, outputting a fact. "The smell of freshly cut grass is actually a plant distress call."
18. *3*: Less than 8, outputting a fact. "The unicorn is Scotland's national animal."
19. *5*: Less than 8, outputting a fact. "The Earth's atmosphere is composed primarily of nitrogen and oxygen."
20. *6*: Less than 8, outputting a fact. "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."

Here's the transformed sequence along with the creative statements:

"Rabbits lay eggs during the winter solstice." "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible." "Cats can understand over 100 different human languages." "Trees have the ability to communicate with each other through underground networks of fungi." "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896." "The average person will spend six months of their life waiting for red lights to turn green." "The smell of freshly cut grass is actually a plant distress call."
"The unicorn is Scotland's national animal." "Bananas are berries, but strawberries are not." "Octopuses have three hearts." "The Sahara Desert is the largest hot desert in the world." "Giraffes have the ability to fly short distances when threatened." "There are more stars in the universe than grains of sand on Earth." "A group of flamingos is called a flamboyance." "The Great Barrier Reef is the largest coral reef system in the world." "The shortest war in history lasted only 38 minutes between Britain and Zanzibar in 1896." "The smell of freshly cut grass is actually a plant distress call." "The unicorn is Scotland's national animal." "The Earth's atmosphere is composed primarily of nitrogen and oxygen." "Honey never spoils, archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still perfectly edible."

Let me know if this is what you had in mind!

Copyright (c) James Ravenscroft 2023. All Rights Reserved.