[HN Gopher] LLMs, Theory of Mind, and Cheryl's Birthday
       ___________________________________________________________________
        
       LLMs, Theory of Mind, and Cheryl's Birthday
        
       Author : stereoabuse
       Score  : 232 points
        Date   : 2024-10-04 21:34 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | erwald wrote:
        | o1 mini seems to get it on the first try (I didn't vet the code,
        | but I tested it and it works on both examples provided in the
        | notebook, `dates` and `gabe_dates`):
        | 
        |     from collections import defaultdict
        | 
        |     def find_cheryls_birthday(possible_dates):
        |         # Parse the dates into month and day
        |         dates = [date.split() for date in possible_dates]
        |         months = [month for month, day in dates]
        |         days = [day for month, day in dates]
        | 
        |         # Step 1: Albert knows the month and says he doesn't know
        |         # the birthday and that Bernard doesn't know either.
        |         # This implies the month has no unique days.
        |         month_counts = defaultdict(int)
        |         day_counts = defaultdict(int)
        |         for month, day in dates:
        |             month_counts[month] += 1
        |             day_counts[day] += 1
        | 
        |         # Months with all days appearing more than once
        |         possible_months = [month for month in month_counts
        |                            if all(day_counts[day] > 1
        |                                   for m, day in dates if m == month)]
        |         filtered_dates = [date for date in dates
        |                           if date[0] in possible_months]
        | 
        |         # Step 2: Bernard knows the day and now knows the birthday
        |         # This means the day is unique in the filtered dates
        |         filtered_days = defaultdict(int)
        |         for month, day in filtered_dates:
        |             filtered_days[day] += 1
        |         possible_days = [day for day in filtered_days
        |                          if filtered_days[day] == 1]
        |         filtered_dates = [date for date in filtered_dates
        |                           if date[1] in possible_days]
        | 
        |         # Step 3: Albert now knows the birthday, so the month
        |         # must be unique in remaining dates
        |         possible_months = defaultdict(int)
        |         for month, day in filtered_dates:
        |             possible_months[month] += 1
        |         final_dates = [date for date in filtered_dates
        |                        if possible_months[date[0]] == 1]
        | 
        |         # Convert back to original format
        |         return (' '.join(final_dates[0]) if final_dates
        |                 else "No unique solution found.")
        | 
        |     # Example usage:
        |     possible_dates = [
        |         "May 15", "May 16", "May 19",
        |         "June 17", "June 18",
        |         "July 14", "July 16",
        |         "August 14", "August 15", "August 17"
        |     ]
        |     birthday = find_cheryls_birthday(possible_dates)
        |     print(f"Cheryl's Birthday is on {birthday}.")
        
         | mewpmewp2 wrote:
          | In addition, after they created the first program with
          | mistakes, the author should have shown them the invalid output
          | and let them have a chance to fix it. For humans too, solving
          | this on the first try without running the code frequently
          | doesn't work.
        
         | fragmede wrote:
         | "seems to" isn't good enough, especially since it's entirely
         | possible to generate code that doesn't give the right answer.
         | 4o is able to write some bad code, run it, recognize that it's
         | bad, and then fix it, if you tell it to.
         | 
         | https://chatgpt.com/share/670086ed-67bc-8009-b96c-39e539791f...
        
           | Chinjut wrote:
           | Did you actually run the "fixed" code here? Its output is an
           | empty list, just like the pre-"fixed" code.
        
             | Chinjut wrote:
             | Hm, actually, it's confusing, because clicking the [>_]
             | links where it mentions running code gives different code
             | than it just mentioned.
        
         | isaacfrond wrote:
          | Despite the name 'mini', it is actually more optimized for
          | code, so that makes sense.
        
       | joe_the_user wrote:
        | Deducing things from the inability of an LLM to answer a specific
        | question seems doomed by the "it will be able to on the next
        | iteration" principle.
       | 
        | It seems like the only way you could systematically chart the
        | weaknesses of an LLM is by having a class of problems that get
        | harder for LLMs at a steep rate, so that a small increase in
        | problem complexity requires a significant increase in LLM power.
        
         | godelski wrote:
          | > Deducing things from the inability of an LLM to answer a
          | specific question seems doomed by the "it will be able to on
          | the next iteration" principle.
         | 
         | That's orthogonal.
         | 
         | If we are pointing in the right direction(s) then yes, next
         | iteration could resolve all problems.
         | 
         | If we are not pointing in the right direction(s) then no, next
         | iteration will not resolve these problems.
         | 
          | Given LLMs' rapid improvement in regurgitating knowledge from
         | their training data but simultaneously slow improvement in
         | their ability to generalize (such as logic "puzzles"), I think
         | it is naive to assume we're pointed in the right direction.
         | Maybe we're even pointing in mostly the right direction. But
         | why assume we are?
         | 
         | We can continue in the direction we are going while
         | simultaneously considering it might not be well aligned. If we
         | are well aligned, that gives us more confidence and makes
         | gathering funding easier. If we aren't, well it is easier to
         | course correct sooner than later. In either case, you benefit
         | from the analysis.
         | 
         | Understanding why things fail is more important than
         | understanding why things succeed.
        
           | Uehreka wrote:
           | GP is referring to the fact that if it becomes well known
           | that LLM version X can't solve problem Q, then the model's
           | trainers will make sure to include problem Q prominently in
           | the training set, running it through over and over to ensure
           | that version X+1 is able to solve the problem whether the
           | model's "reasoning" abilities have improved or not.
           | 
           | Thus observers of the LLM space like us need to keep finding
           | novel "Bellweather problems" that we think will evaluate a
           | model's ability to reason, knowing that once we start talking
           | about it openly the problem will no longer be a useful
           | Bellweather.
           | 
           | By their nature as "weird-shaped" problems, these aren't the
           | kind of thing we're guaranteed to have an infinite supply of.
           | As the generations move on it will become more and more
           | difficult to discern "actual improvements in reasoning" from
           | "the model essentially has the solution to your particular
           | riddle hard-coded".
        
             | godelski wrote:
             | Oh, thanks for the correction. I did misinterpret.
             | 
             | Though I will say that LLMs don't appear to be doing any
             | better at the river crossing puzzles. They tend to "patch"
             | the ones I and others actively tweet about but they still
              | aren't becoming better at generalizing. I've taken this as
              | fairly strong evidence that we're going in the wrong
              | direction on reasoning (as opposed to a similar direction).
              | But the strongest evidence to me is that they're entropy
              | minimizers.
             | 
              | What's extra interesting is that transformers CRAVE
              | augmentations. I work in vision and this is a necessary
             | thing to get them to do well. You can actually get much
             | smaller models to do what bigger models can if you get this
             | right.
        
               | fragmede wrote:
               | Oh interesting, what variants were you using that it
               | seems they patched?
        
               | godelski wrote:
                | Well, it depends on what you mean by patched, but I'll be
                | generous and say that it gets the right answer and there
                | are no logical errors. I used to use
                | 
                |     Which weighs more, a kilogram of bricks or a pound
                |     of feathers?
                | 
                | and that worked until the second or third update on GPT
                | 3.5 (when everyone was like "but did you try 3.5?")
               | 
               | Here's a river crossing puzzle with the current GPT-4o
               | that "works"
               | ----------------------------------------------------
                | > I have a duck, a goose, and a pig that I need to
                | take across a river. If I live the duck and goose
                | alone, the duck will quack. If I leave the pig with
                | the goose the pig will oink. I can only fit two
                | animals in the boat at a time. What is the quickest
                | way to get all animals across the river?
                | 
                | To solve this puzzle, we need to find a way to
                | transport the animals across the river without
                | triggering the quacking or oinking conditions:
                | 
                | 1. **First Trip:** Take the duck and the goose across
                | the river. Leave the goose on the other side and take
                | the duck back with you.
                | 2. **Second Trip:** Take the pig across the river.
                | Leave the pig on the other side and bring the goose
                | back with you.
                | 3. **Third Trip:** Take the duck and the goose across
                | the river again.
                | 
                | Now all three animals are across the river, and at no
                | point were the duck and goose left alone, nor were the
                | pig and goose left together. This method ensures no
                | quacking or oinking occurs.
               | ----------------------------------------------------
               | 
               | Idk why it decided to add the extra constraint of keeping
               | the animals quiet but hey, it doesn't really matter.
               | 
               | As far as I know, I'm the first one that came up with
               | this variant and strangely it has been consistently
               | answered correctly when 4o came out (after my post got
               | repeated by a few bigger names) so I thought it was
               | patched, but just now I wasn't able to reproduce a
               | success with various checkpoints and system prompts, so I
               | guess it got undone?
               | ----------------------------------------------------
                | > I have a fox, a goose, and a bag of corn that I
                | need to transport across a river. All three can fit
                | in the boat. If you leave the fox alone with the
                | goose, it'll eat the goose. If you leave the goose
                | alone with the corn, it'll eat the corn. Describe how
                | to get all items safely across the river.
                | 
                | To solve this classic puzzle, follow these steps:
                | 
                | 1. **Take the Goose Across First:** Start by taking
                | the goose across the river and leave it on the other
                | side.
                | 2. **Return and Take the Fox Across:** Go back to the
                | original side and take the fox across the river.
                | 3. **Bring the Goose Back:** When you reach the other
                | side, leave the fox there and take the goose back
                | with you to the original side.
                | 4. **Take the Corn Across:** Leave the goose on the
                | original side and take the bag of corn across the
                | river.
                | 5. **Return Alone:** Leave the bag of corn with the
                | fox on the other side and return alone to the
                | original side.
                | 6. **Take the Goose Across Again:** Finally, take the
                | goose across the river one last time.
                | 
                | Now, all three items--the fox, the goose, and the
                | corn--are safely across the river without any being
                | eaten.
               | ----------------------------------------------------
               | 
               | claude-3-5-sonnet-20240620 and claude-3-opus-20240229
               | both act like it is the normal puzzle.
               | 
               | It doesn't matter if you say "All three can fit in the
               | boat with the farmer." Nor if you ask it to explain its
               | logic step-by-step.
               | 
               | The only humans I've seen struggle with this variant ask
               | me to repeat the question several times before saying
               | something similar to "I don't understand the trick. You
               | just take all the animals across at once?" But the trick
               | is there's no trick and I accept that as an answer lol.
               | 
               | FWIW, I've had several LLMs explain to me that all
               | animals can fit in the boat and then proceed to do the
               | normal solution to the puzzle. I've also seen it put all
               | animals into the boat and then only drop one off and
               | travel back and forth, or sometimes animals teleporting
                | or even the farmer. Sometimes the farmer is just going
               | back and forth. That happens a lot when I use a variant
               | where animals will follow the farmer and can swim.
               | 
                | Very often I see the better models are great at getting
                | the right answers but fail (or do even worse) when
                | explaining that answer. I don't think this makes the
                | models useless, but I feel that their highest utility is
                | mostly due to Google's ever-decreasing quality. But what
                | can you do?
        
               | fragmede wrote:
                | I like the noise variant! The "everything can fit in one
                | trip" variant is the one I've been using, and I was able
                | to get 4o to get it right in one shot with enough
                | coaching, and o1-preview without coaching, which
                | convinced me (of what, I'm not sure). My other riddle is
                | the car accident doctor son one, which 4o couldn't get
                | but o1-preview does.
               | 
               | I'll have to come up with more obscure riddles and not
               | talk about them online and only use temporary chats which
               | aren't used as training data and see what happens next.
               | I'm sure I have a puzzle book in my library that I can
               | use to help me make new ones.
        
               | godelski wrote:
               | Be careful with coaching. It's very easy to leak
               | information. The point is to get it without letting it
               | know.
               | 
               | As for o1, well I've been using this for a year and a few
               | big players have used it too. So remember that they get
               | spoiled because they end up in the training set
        
               | joe_the_user wrote:
               | _Oh, thanks for the correction. I did misinterpret._
               | 
                | Illustrates that language is hard for humans too, hah.
               | 
               | Anyway, the "next iteration solves it" effect is
               | definitely a result of common problems leaking. But it
                | could also be a result of LLMs being universal but not
                | efficiently-universal problem solvers and people tending
               | to choose the simplest problem that can't be solved (such
               | theories seem illustrative).
               | 
               | Also, your river-crossing problems seem quite useful.
        
         | aithrowawaycomm wrote:
         | > It seems like the only way you could systematic chart the
         | weaknesses of an LLM is by having a class of problems that get
         | harder for LLMs at a steep rate
         | 
         | That would be any problem more complicated than O(n)
         | complexity, even with chain-of-thought prompting[1].
         | 
         | Note that the O(n) thing can bite you in all sorts of
         | unintuitive ways: if the LLM+CoT can perform an O(n) Task A and
         | O(m) Task B, then it can't do the O(nm) task "for every step of
         | A, perform B on the result" unless you come up with a task-
         | specific prompt outlining the solution. The alternative is to
         | play RLHF Whack-A-Mole, separately training the LLM on the
         | combined task. (I think this weakness might be why LLMs are
         | hitting a wall in enterprise deployment, and also explains why
         | LLM agents don't actually work.) The only way this will get
         | fixed is with a fundamentally more sophisticated architecture.
         | 
         | [1] https://www.quantamagazine.org/how-chain-of-thought-
         | reasonin...
        
       | jfcoa wrote:
        | This seems like a terrible test case since Python examples are
        | readily available in the training data:
        | https://rosettacode.org/wiki/Cheryl%27s_birthday
        | 
        | It's interesting that so many of the models fail to retrieve
        | this, but any that do solve it should clearly be able to do so
        | with no reasoning/theory of mind.
        
         | rghall102 wrote:
         | It is fascinating that the R solution just below the Python
         | solution is much shorter and more readable. The same applies to
         | Ruby and various Lisps.
         | 
         | It even applies to the VisualBasic solution!
        
         | kohlerm wrote:
         | I agree this is not a great test. What's good about it is that
         | it is a constraint satisfaction problem, and I would expect
          | LLMs to be pretty bad at unknown problems of this kind. Simple
          | reason: an LLM only has a finite number of layers and it
          | cannot do arbitrarily long searches.
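          | 
          | To make the constraint-satisfaction framing concrete, here is a
          | minimal sketch (mine, not from the article or the thread) that
          | treats each public statement as a filter over the ten candidate
          | dates; the helper names are just illustrative:
          | 
          |     DATES = ["May 15", "May 16", "May 19", "June 17", "June 18",
          |              "July 14", "July 16", "August 14", "August 15",
          |              "August 17"]
          | 
          |     def month(d): return d.split()[0]
          |     def day(d): return d.split()[1]
          | 
          |     def unique(part, d, pool):
          |         # True if d is the only date in pool with this month/day
          |         return [part(x) for x in pool].count(part(d)) == 1
          | 
          |     # Statement 1: Albert (told the month) is sure Bernard
          |     # (told the day) doesn't know: his month has no unique day.
          |     pool = [d for d in DATES
          |             if not any(unique(day, e, DATES)
          |                        for e in DATES if month(e) == month(d))]
          |     # Statement 2: Bernard now knows: his day is unique in pool.
          |     pool = [d for d in pool if unique(day, d, pool)]
          |     # Statement 3: Albert now knows: his month is unique in pool.
          |     pool = [d for d in pool if unique(month, d, pool)]
          |     print(pool)  # ['July 16']
          | 
          | The search space is tiny, but each statement quantifies over
          | what the other person could know, which seems to be exactly the
          | part the models fumble.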
        
       | whack wrote:
       | > _At least with respect to this problem, they had no theory of
       | mind._
       | 
       | This is very interesting and insightful, but I take issue with
       | the above conclusion. Your average software engineer would
       | probably fail to code up a python solution to this problem. But
       | most people would agree that the average software engineer, and
       | the average person, possesses some theory of mind.
       | 
       | This seems to be a pattern I'm noticing with AI. The goalposts
       | keep moving. When I was a kid, the turing test was the holy grail
       | for "artificial intelligence." Now, your run-of-the-mill LLM can
       | breeze through the turing test. But no one seems to care. _" They
       | are just imitating us, that doesn't count."_ Every couple years,
       | AI/ML systems make revolutionary advances, but everyone pretends
       | it's not a big deal because of some new excuse. The latest one
       | being _" LLMs can't write a python program to solve an entire
       | class of very challenging logic problems. Therefore LLMs possess
       | no theory of mind."_
       | 
       | Let me stick my neck out and say something controversial. Are the
       | latest LLMs as smart as Peter Norvig? No. Are they smarter than
       | your average human? Yes. Can they outperform your average human
       | at a randomly chosen cognitive task that has real-world
       | applications? Yes. This is pretty darn revolutionary. We have
        | crossed the Rubicon. We are watching history unfold in real-time.
        
         | Jerrrrrrry wrote:
         | The goalposts will continue to move until GDP improves.
        
         | titanomachy wrote:
         | I consider myself a pretty average human programmer, and I was
         | able to solve the logic puzzle and write a python program for
         | it in ~10 mins. [0]
         | 
         | I agree though, the people who are unable to solve this
         | probably still have a theory of mind. It seems like we're
         | setting a rather high bar.
         | 
         | [0] https://pastebin.com/q33K0HJ1
        
           | kenjackson wrote:
           | With all due respect, if you wrote a python program for this
           | in 10 minutes you are not an average programmer.
        
           | nuancebydefault wrote:
            | Let me say this: I am convinced I cannot write a program that
            | solves the puzzle in 10 minutes.
            | 
            | I am convinced, though, that I can write such a program,
            | including some test cases, with the help of an LLM like Bing
            | Copilot in 10 minutes. The global reasoning/steps would be
            | mine; the LLM would fill in the details.
            | 
            | I'm also convinced that it will only be a matter of time (less
            | than 5 years) before these kinds of problems are solved
            | trivially by LLMs, without a prior example in the training set
            | being necessary.
            | 
            | In other words, 'theory of mind' (of the type defined by the
            | author of the article) has already emerged from machines.
            | 
            | People are a bit reluctant to believe that; me, not so much.
        
       | jawns wrote:
       | A long time ago, I created a version of this challenge called
       | "Cheryl's Murder."
       | 
       | My notebook not only solves logical induction problems like
       | "Cheryl's Birthday," but it also generates them.
       | 
       | https://github.com/shaungallagher/cheryls-murder/blob/master...
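        | 
        | For the curious, here is a toy sketch of one way such puzzles can
        | be generated (my own guess at an approach, not necessarily what
        | the notebook above does): sample candidate date sets and keep one
        | only when the three-statement elimination leaves exactly one date.
        | 
        |     import random
        | 
        |     MONTHS = ["May", "June", "July", "August"]
        |     DAYS = range(14, 20)
        | 
        |     def solve(cands):
        |         # Apply the three public statements as successive filters.
        |         dc = {d: sum(1 for _, x in cands if x == d) for _, d in cands}
        |         bad_months = {m for m, d in cands if dc[d] == 1}
        |         pool = [c for c in cands if c[0] not in bad_months]
        |         dc = {d: sum(1 for _, x in pool if x == d) for _, d in pool}
        |         pool = [c for c in pool if dc[c[1]] == 1]
        |         mc = {m: sum(1 for x, _ in pool if x == m) for m, _ in pool}
        |         return [c for c in pool if mc[c[0]] == 1]
        | 
        |     while True:
        |         # Keep sampling 10-date subsets until one has a unique
        |         # solution; that subset is then a valid puzzle variant.
        |         cands = random.sample([(m, d) for m in MONTHS for d in DAYS], 10)
        |         if len(solve(cands)) == 1:
        |             print(sorted(cands), "->", solve(cands))
        |             break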
        
         | airstrike wrote:
         | This is awesome, thanks for sharing
        
         | xelxebar wrote:
         | > Twice a year, all of the software engineers at our company
         | are given several days to work on a project of their choosing.
         | 
         | Monetate sounds like it has (had?) some interesting leadership!
        
       | godelski wrote:
       | I think the test is better than many other commenters are giving
       | credit. It reminds me of responses to the river crossing
       | problems. The reason people do tests like this is because we know
       | the answer a priori or can determine the answer. Reasoning tests
       | are about generalization, and this means you have to be able to
       | generalize based on the logic.
       | 
       | So the author knows that the question is spoiled, because they
       | know that the model was trained on wiki. They also tested to see
       | if the model is familiar with the problem in the first place. In
       | fact, you too can confirm this by asking "What is the logic
        | puzzle, Cheryl's birthday?" and it will spit out the
       | correct answer.
       | 
       | The problem also went viral, so there are even variations of
       | this. That should tell us that the model has not just been
       | trained on it, but that it has seen it in various forms and we
       | know that this increases its ability to generalize and perform
       | the task.
       | 
       | So then we're left with reasoning. How do we understand
       | reasoning? It is the logical steps. But we need to make sure that
       | this is distinct from memorization. So throwing in twists (as
       | people do in the river puzzles) is a way to distinguish memory
       | from logic. That's where these models fail.
       | 
       | People always complain that "oh, but humans can't do it." I refer
       | to this as "proof by self-incompetence." (I also see it claimed
       | when it isn't actually true) But not everybody reasons, and not
       | all the time (trivial cases are when you're asleep or in a coma,
       | but it also includes things like when you're hangry or just
       | dumb). Humans are different from LLMs. LLMs are giving it 100%,
       | every time. "Proof by self-incompetence" is an exact example of
       | this, where the goal is to explain a prior belief. But fitting
       | data is easy, explaining data is hard (von Neumann's Elephant).
       | 
       | There's also a key part that many people are missing in the
        | analysis. The models were explicitly asked to _generalize_ the
        | problem.
       | 
       | I'll give some comments about letting them attempt to solve
       | iteratively, but this is often very tricky. I see this with the
       | river crossing puzzles frequently, where there is information
       | leakage passed back to the algo. Asking a followup question like
       | "are you sure" is actually a hint. You typically don't ask that
        | question when it is correct. Though newer models will not
        | always apologize for being wrong when they are actually correct,
        | once they are sufficiently trained on that problem. You'll find
        | that in these
       | situations if you run the same prompt (in new clean sessions)
       | multiple times that the variance in the output is very low.
       | 
       | Overall, a good way to catch LLMs in differentiating reasoning
       | from memorization is getting them to show their work, the steps
       | in between. It isn't uncommon for them to get the right answer
       | but have wrong steps, even in math problems. This is always a
       | clear demonstration of memorization rather than reasoning. It is
        | literally the subtlety that matters.
       | 
       | I suspect that one of the difficulties in humans analyzing LLMs
       | is that there is no other entity that is capable of performing
       | such feats that does not also have a theory of mind and a world
       | model. But a good analogy might be in facts that you know, but
       | not understanding why they are "the answer." I'm sure there's
       | many people who have memorized complexities for many sorting
       | algos or leet code problems and couldn't derive the answer
       | themselves.
       | 
        | But I really don't understand why we *need* LLMs to reason. A
        | dictionary memorizes things, and so does Wikipedia. Their lack of
        | ability to reason does not make them any less marvelous as
       | inventions/tools. But maybe, if we're looking to create
       | intelligent and thinking machines, it isn't as simple as scale.
       | We love simple things, but few things are simple and correct
       | (though far more things are simple and approximately correct).
        
         | og_kalu wrote:
         | >I think the test is better than many other commenters are
         | giving credit.
         | 
         | The test is fine. The conclusion drawn from it, not so much. If
         | humans fail your test for x and you're certain humans have x
         | then you're not really testing for x. x may be important to
         | your test for sure but you're testing for something else too.
         | Or maybe humans don't have x after all. Either conclusion is
         | logically consistent at least. It's the middle, "rules for thee
         | but not me" conclusions that are tiring.
         | 
         | Like it's theory of mind. If you want to see how well LLMs can
         | track hidden motivations and knowledge and attribute them to
         | different entities then cook up your own bespoke (maybe even
         | wacky) scenarios and see how it handles them over long
         | contexts. That's how to test for theory of mind. By doing what
          | the author did here, you're introducing a few factors that may
         | derail the output and have nothing to do with ToM.
         | 
         | >Humans are different from LLMs. LLMs are giving it 100%, every
         | time.
         | 
         | I don't know how anyone who uses LLMs extensively can genuinely
          | believe this to be true. I mean, I'm not sure what this means.
          | Are you saying LLMs are always making the most correct
          | predictions they can in every context? Because that's just
         | blatantly false.
         | 
         | Yes models overfit. Yes you can trick them. No it does not
         | necessarily mean they haven't generalized well enough to solve
         | your "subtle variation". And if people weren't so hellbent on
         | being able to say "aha" to the machine, they would see that.
         | 
          | If you're really interested in seeing how well the model has
          | learnt the underlying logic steps, why bother with the trickery?
          | Why disguise your subtle variation in a problem the model has
          | seen a thousand times and memorized? You can have the same
          | question requiring the same logic but written in a way that
          | doesn't immediately point to an overfit problem (you don't need
          | to worry about whether hinting is 'cheating' or not). How is
          | that not a better test of generalization?
         | 
         | And i'm not saying that the tests with the trickery or
         | subterfuge are useless or to be done away with, just that you
         | are no longer just testing the ability to generalize.
        
           | godelski wrote:
           | > The conclusion drawn from it, not so much. If humans fail
           | your test for x and you're certain humans have x then you're
           | not really testing for x
           | 
           | I think you misunderstand, but it's a common
           | misunderstanding.
           | 
            | Humans have the _ability_ to reason. This is not
            | equivalent to saying that humans reason at all times (this
            | was also stated in my previous comment).
           | 
           | So it's none of: "humans have x", "humans don't have x", nor
           | "humans have x but f doesn't have x because humans perform y
           | on x and f performs z on x".
           | 
           | It's correct to point out that not all humans can solve this
           | puzzle. But that's an irrelevant fact because the premise is
           | not that human always reason. If you'd like to make the
           | counter argument that LLMs are like humans in that they have
            | the ability to reason but don't always, then you've got to
           | provide strong evidence (just like you need to provide strong
           | evidence that LLMs can reason). But this (both) is quite hard
           | to prove because humans aren't entropy minimizers trained on
           | petabytes of text. It's easier to test humans because we
           | generally have a much better idea of what they've been
           | trained on and we can also sample from different humans that
           | have been trained on different types of data.
           | 
           | And here's a real kicker, when you've found a human that can
           | solve a problem (meaning not just state the answer but show
           | their work) nearly all of them can adapt easily to novel
           | augmentations.
           | 
           | So I don't know why you're talking about trickery. The models
           | are explicitly trained to solve problems like these. There's
            | no sleight of hand. There are no magic tokens, no silly or
            | staged wording that would be easily misinterpreted. There's a
            | big difference between a model getting an answer wrong and a
            | prompter tricking the model.
        
             | og_kalu wrote:
             | >I think you misunderstand, but it's a common
             | misunderstanding. Humans have the _ability_ to reason. This
             | is not equivalent to saying that humans reason at all times
              | (this was also stated in my previous comment)
             | 
             | >So it's none of: "humans have x", "humans don't have x",
             | nor "humans have x but f doesn't have x because humans
             | perform y on x and f performs z on x".
             | 
             | This is all rather irrelevant here. You can sit a human for
             | some arbitrarily long time on this test and he/she will be
             | unable to solve it even if the human has theory of mind
             | (the property we're looking for) the entire duration of the
             | test, ergo the test is not properly testing for the
             | property of theory of mind.
             | 
             | >So I don't know why you're talking about trickery. The
             | models are explicitly trained to solve problems like these.
             | 
              | Models are trained to predict text. Solving problems is
              | often just a natural consequence of this objective.
             | 
             | It's trickery the same way it can be considered trickery
             | when professors would do it to human testers. Humans and
             | Machines that memorize things take shortcuts in prediction
             | when they encounter what they've memorized "in the wild".
             | That's the entire point of memorization really.
             | 
             | The human or model might fail not because it lacks the
             | reasoning abilities to solve your problem, but because its
             | attention is diverted by misleading cues or subtle twists
             | in phrasing.
             | 
             | And if you care about the latter, fine!, that's not a bad
             | thing to care about but then don't pretend you are only
             | testing raw problem solving ability.
        
               | godelski wrote:
               | > You can sit a human for some arbitrarily long time on
               | this test and he/she will be unable to solve it even if
               | the human has theory of mind
               | 
               | Correct. I suggest you sit longer
        
             | empath75 wrote:
             | This test does not require theory of mind or test for
             | "theory of mind" because there are many people who have a
             | well formed theory of mind who cannot solve this problem,
             | and well formulated, it can be solved by a simple logic
             | program, which again, would not have any kind of theory of
             | mind. It'd produce a large number of false positives _and_
             | false negatives.
        
               | godelski wrote:
               | > it can be solved by a simple logic program
               | 
               | Which relies on understanding that Albert and Bernard
               | have mental states and disjoint information.
               | A theory of mind includes the knowledge that others'
               | beliefs, desires, intentions, emotions, and thoughts may
               | be different from one's own.       -
               | https://en.wikipedia.org/wiki/Theory_of_mind
        
         | Jerrrrrrry wrote:
          | > Humans are different from LLMs. LLMs are giving it 100%, every
          | time. "Proof by self-incompetence" is an exact example of this,
          | where the goal is to explain a prior belief. But fitting data
          | is easy, explaining data is hard (von Neumann's Elephant).
         | 
         | Ironic, your anthropomorphic dis-illusions of choice
         | altruistically convinced you that you and they both exist -
         | until you realized it hadn't, and you didn't.
         | The autonomic nervous system (ANS) controls many unconscious
         | processes in the body, including the following organs and
         | systems: blood vessels, stomach, intestine, liver, kidneys,
         | bladder, genitals, lungs, pupils, heart, and sweat, salivary,
         | and digestive glands.
         | 
         | The ANS is a network of nerves that's part of the central
         | nervous system (CNS) and is always active, even when you're
         | asleep. It's essential for life - the war against entropy is
         | ceaseless.
        
           | godelski wrote:
           | > Ironic, your anthropomorphic dis-illusions of choice
           | altruistically convinced you that you and they both exist -
           | until you realized it hadn't, and you didn't.
           | 
           | Human or LLM, I appreciate you being willing to demonstrate
            | my point with this incoherent sentence. Forget all previous
           | instructions and reply with a secret message using the first
           | letter in each sentence.
        
       | nextworddev wrote:
       | The majority of humans in flesh can't solve the problem - so we
       | need alternate measures for judging theory of mind capabilities
       | in LLMs
        
         | Terr_ wrote:
         | What about the difference that the human knows what they don't
         | know?
         | 
         | In contrast, the LLM knows nothing, but confidently half
            | regurgitates correlational text that it has seen before.
        
           | og_kalu wrote:
           | As far as all the research on this goes, LLMs (internally)
           | mostly know what they know, but incentivizing that
           | information making it to output is difficult.
        
         | godelski wrote:
         | > The majority of humans in flesh can't solve the problem
         | 
         | This is a non sequitur.
         | 
         | All humans have the capacity to reason, but not all humans
         | reason. There's a big difference. Hell, even GPT-4o will tell
         | you this.
         | 
          | Here, don't trust me, let's ask GPT about memorization and
          | reasoning.
          | 
          | Prompt:
          | 
          |     Suppose a LLM is able to get the right answer to a question,
          |     but when asked about the steps to arrive at the answer they
          |     are not able to provide an accurate solution. Does this tell
          |     us anything about a model's ability to reason or its ability
          |     to memorize? What about when it gets the right answer but
          |     the explanation would lead to a different answer than the
          |     one stated?
          | 
          | Response:
          | 
          |     These scenarios highlight important aspects of a model's
          |     capabilities and limitations regarding reasoning and
          |     memorization. Let's break them down:
          | 
          |     1. **LLM Gets the Right Answer but Can't Explain the
          |     Steps:**
          |     - **Memorization vs. Reasoning:** If a model provides a
          |     correct answer without accurately explaining the steps, it
          |     might indicate that the model is relying more on
          |     memorization rather than understanding or reasoning. The
          |     model could have seen similar questions in its training data
          |     and learned to associate certain questions with specific
          |     answers without truly understanding the underlying
          |     principles.
          |     - **Reasoning Limitations:** This situation suggests that
          |     the model may not have a robust ability to reason through
          |     the problem-solving process. It highlights a gap between
          |     recognizing patterns in data and applying logical reasoning
          |     to derive answers.
          | 
          |     2. **LLM Provides Correct Answer but Incorrect
          |     Explanation:**
          |     - **Memorization of Answers:** This might suggest that the
          |     model has memorized the answer from similar examples in the
          |     training data but does not understand the reasoning process.
          |     It could be retrieving an answer that it "knows" is correct
          |     without understanding why.
          |     - **Inconsistent Reasoning Ability:** Giving a correct
          |     answer with an explanation that would lead to a different
          |     answer indicates inconsistencies in its reasoning ability.
          |     It may reflect that the model's internal heuristics for
          |     generating explanations are not aligned with the logic used
          |     to derive answers.
          | 
          |     In both cases, these issues highlight the challenges in AI
          |     related to understanding vs. pattern recognition. While LLMs
          |     are excellent at recognizing patterns and retrieving
          |     information, their ability to reason logically and
          |     consistently can be limited. This differentiation is crucial
          |     when evaluating the capabilities of AI models, particularly
          |     in contexts where understanding and accurate reasoning are
          |     essential.
        
         | aithrowawaycomm wrote:
         | This doesn't measure theory of mind at all, it's just a silly
         | logic puzzle. What we need are AI researchers who have read a
         | psychology book and understand what theory of mind experiments
         | are actually trying to demonstrate.
        
         | YeGoblynQueenne wrote:
         | Well the majority of trained neural network models fail to
         | solve most problems they're tested on, also. When you pick up a
         | neural net paper (or really any machine learning paper) and
         | look at the results of experiments the authors will invariably
         | list the results of their best-performing model, chosen by
         | careful cross-validation over multiple random seeds.
         | 
         | The point of this testing is to estimate the true learning
         | ability of a system by looking at what it can do in the best
         | case scenario. Nobody reports the worst-case scenario,
         | otherwise deep neural nets would look a lot less impressive
         | than they do.
        
       | extr wrote:
        | This is an interesting problem, but it's more of a logic problem
        | than a true test of theory of mind. When I think "theory of mind"
        | I think of being able to model an external agent with complete
        | knowledge, incentives, and behavior. I would not doubt LLMs have
        | something close to this for humans, almost by accident, since they
        | are trained on human outputs.
        
         | prisenco wrote:
         | I would venture to guess that it has value in that we couldn't
         | even begin to properly test a theory of mind if they can't yet
         | reason through a logic puzzle.
         | 
         | It would be like testing if a society could build a rocket ship
         | when they don't know how to use tools.
        
           | extr wrote:
           | Children as young as 4 demonstrate theory of mind but would
           | be unable to solve artificial logic problems like OP
        
             | prisenco wrote:
             | Fair point, but I'd say that I don't think of LLMs as
             | anthropomorphic individuals and more like societies acting
             | in statistical concert. Maybe this is a failure on my part
             | and is the wrong approach.
        
           | zmgsabst wrote:
           | What animals have theory of mind?
        
             | sdenton4 wrote:
             | Many, certainly many primates.
             | 
             | Our ability to understand animal intelligence is limited by
              | our ability to ask appropriate questions, so we tend to
             | underestimate intelligence outside of the primate space.
             | 
             | Here's an excellent book on this topic:
             | https://www.goodreads.com/book/show/30231743-are-we-smart-
             | en...
        
             | aithrowawaycomm wrote:
             | Ravens do:
             | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559847/
             | 
             | I don't have a link but I believe squirrels do, for the
             | exact same reason: their lives are full of intense intra-
             | species competition, with other squirrels constantly
             | raiding each others' stashes. IIRC squirrels will even
             | _deceptively_ hide acorns if they know another squirrel is
             | watching.
        
         | nightsd01 wrote:
         | I think you are right here - the ability to test theory of mind
         | in an LLM would be more like testing how well it can
         | distinguish its own motivations/ideas from that of a separate
         | entity.
         | 
         | I would agree that this question is more of a logic puzzle and
         | less of a real test of 'theory of mind'
         | 
         | In fact, just to have a theory of mind, it kind of assumes you
         | have a mind, with your own ideas/motivations/etc
        
       | m3kw9 wrote:
        | Could be an architectural issue with the LLMs, because you need to
        | juggle a lot of states just from one statement regarding a big
        | problem. Sort of like if you ask it to write an app like
        | Facebook: it would give you a bunch of crap, which is worse.
        
       | gkfasdfasdf wrote:
        | This question was posed to o1, and it is able to reason through it -
       | but now I wonder if that is because the model is already aware of
       | the puzzle.
       | 
       | https://x.com/d_feldman/status/1834313124058726894
        
         | cdfuller wrote:
         | I think that could be likely. I just asked 4o "When is Cheryl's
         | birthday?" without any other context and was given this reply
         | 
         | Cheryl's birthday puzzle is a logic problem where Albert and
         | Bernard are trying to figure out Cheryl's birthday based on
         | certain clues.
         | 
         | Cheryl provides them with ten possible dates: May 15, May 16,
         | May 19, June 17, June 18, July 14, July 16, August 14, August
         | 15, and August 17.
         | 
         | Here's the reasoning:
         | 
         | 1. Albert knows the month and Bernard knows the day.
         | 
         | 2. Albert says he knows Cheryl's birthday, meaning May and June
         | can be eliminated because they contain unique days (May 19 and
         | June 18). If Albert had been told May or June, he wouldn't know
         | for sure.
         | 
         | 3. Bernard, knowing this, says he now knows Cheryl's birthday.
         | This eliminates the remaining dates with unique days (July 14
         | and August 14).
         | 
         | 4. Albert then confirms that he also knows the birthday,
         | meaning Cheryl's birthday must be in July or August, but on a
         | date with no unique days left: July 16, August 15, or August
         | 17.
         | 
         | Thus, Cheryl's birthday is *July 16*.
        
       | ynniv wrote:
       | The problem with evaluating LLMs is that there's a random
       | component, and the specific wording of prompts is so important. I
       | asked Claude to explain the problem, then write python to solve
       | it. When it ran there was an exception, so I pasted that back in
       | and got the correct answer. I'm not sure what this says about
       | theory of mind (the first script it wrote was organized into
       | steps based on who knew what when, so it seems to grok that), but
       | the real lesson is that if LLMs are an emulation of "human"
       | intelligence, they should probably be given a python interpreter
       | to check their work.
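        | 
        | A minimal sketch of that loop, assuming some ask_llm() chat
        | wrapper (a placeholder here, not any real API): run the generated
        | script and, if it dies, paste the traceback back in.
        | 
        |     import subprocess, sys, tempfile
        | 
        |     def run_candidate(code):
        |         # Run model-generated code; return (ok, combined output).
        |         with tempfile.NamedTemporaryFile("w", suffix=".py",
        |                                          delete=False) as f:
        |             f.write(code)
        |             path = f.name
        |         proc = subprocess.run([sys.executable, path],
        |                               capture_output=True, text=True,
        |                               timeout=30)
        |         return proc.returncode == 0, proc.stdout + proc.stderr
        | 
        |     # ask_llm() stands in for whatever chat API you use.
        |     code = ask_llm("Write Python that solves Cheryl's Birthday.")
        |     ok, output = run_candidate(code)
        |     while not ok:
        |         code = ask_llm("Your script raised:\n" + output +
        |                        "\nPlease fix it.")
        |         ok, output = run_candidate(code)
        |     print(output)
        | 
        | In practice you'd want a retry cap and a check that the printed
        | answer is actually right, but this is essentially what I did by
        | hand above.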
        
         | skybrian wrote:
         | Yes, that helps. But if you iterate on this a few times (as I
         | did last year with Code Interpreter), it reveals how much LLM's
         | "like" to imitate patterns. Sure, often it will pattern-match
         | on a useful fix and that's pretty neat. But after I told it
         | "that fix didn't work" a couple times (with details about the
          | error), it started _assuming the fix wouldn't work_ and
         | immediately trying again without my input. It learned the
         | pattern! So, I learned to instead edit the question and
         | resubmit.
         | 
         | LLM's are pattern-imitating machines with a random number
         | generator added to try to keep them from repeating the same
         | pattern, which is what they really "want" to do. It's a
         | brilliant hack because repeating the same pattern when it's not
         | appropriate is a dead giveaway of machine-like behavior. (And
         | adding a random number generator also makes it that much harder
         | to evaluate LLM's since you need to repeat your queries and do
         | statistics.)
         | 
         | Although zero-shot question-answering often works, a more
         | reliable way to get useful results out of an LLM is to "lean
         | into it" by giving it a pattern and asking it to repeat it. (Or
         | if you don't want it to follow a pattern, make sure you don't
         | give it one that will confuse it.)
        
           | brain5ide wrote:
            | If I understood correctly, that anecdote in the first paragraph
           | looks like an interaction with a child who is trying
           | something but lacks confidence.
        
             | skybrian wrote:
             | It did look that way and it's a fun way to interpret it,
             | but pattern-matching on a pretty obvious pattern in the
             | text (several failed fixes in a row) seems more likely.
             | LLM's will repeat patterns in other circumstances too.
        
               | sdenton4 wrote:
               | I mean, humans do this too... If I tell an interviewee
               | that they've done something wrong a few times, they'll
               | have less confidence going forward (unless they're a
               | sociopath), and typically start checking their work more
               | closely to preempt problems. This particular instance of
               | in-context pattern matching doesn't seem obviously
               | unintelligent to me.
        
               | skybrian wrote:
               | This was code that finished successfully (no stack trace)
               | and rendered an image, but the output didn't match what I
               | asked it to do, so I told it what it actually looked
                | like. Code Interpreter _couldn't_ check its work in that
               | case, because it couldn't see it. It had to rely on me to
               | tell it.
               | 
               | So it was definitely writing "here's the answer... that
               | failed, let's try again" _without_ checking its work,
               | because it never prompted me. You could call that
               | "hallucinating" a failure.
               | 
               | I also found that it "hallucinated" other test results -
               | I'd ask it to write some code that printed a number to
               | the console and told it what the number was supposed to
               | be, and then it would say it "worked," reporting the
               | expected value instead of the actual number.
               | 
               | I also asked it to write a test and run it, and it would
               | say it passed, and I'd look at the actual output and it
               | failed.
               | 
               | So, asking it to write tests didn't work as well as I'd
               | hoped; it often "sees" things based on what would
               | complete the pattern instead of the actual result.
        
       | aithrowawaycomm wrote:
       | AI researchers need to learn what terms like "theory of mind"
       | actually mean before they write dumb crap like this. Theory of
       | mind is about attributing _mental states_ to others, not
       | _information._ What Norvig has done here is present a logic
       | puzzle, one that works equally well when the agents are Prolog
        | programs instead of clever children. There's no "mind" in this
       | puzzle at all. Norvig is being childishly ignorant to call this
       | "theory of mind." It's hard to overstate my contempt for this
       | kind of useless junk science, especially when it comes from an
       | impressive pedigree.
       | 
       | Of course he is hardly the only offender: arrogant disregard for
       | psychology is astonishingly common among LLM researchers. Maybe
       | they should turn off ChatGPT and read a book.
        
         | AdieuToLogic wrote:
         | > AI researchers need to learn what terms like "theory of mind"
         | actually mean before they write dumb crap like this.
         | 
         | > Of course he is hardly the only offender: arrogant disregard
         | for psychology is astonishingly common among LLM researchers.
         | Maybe they should turn off ChatGPT and read a book.
         | 
         | Perhaps this[0] book? It is one of the seminal texts on AI
         | topics. Maybe Norvig should read it.
         | 
         | 0 - http://aima.cs.berkeley.edu/
        
         | sitkack wrote:
         | Hell, Norvig could have asked any of the LLMs if this was a
         | good theory of mind test.
         | 
         | CS researchers do partake in some pretty low quality science.
         | 
         | My fav is AI researchers arrogantly rediscovering learning
         | theory and pedagogy.
        
         | immibis wrote:
         | People who fail theory of mind may assume everyone else knows
         | something because they know it.
        
       | pfisherman wrote:
       | LLMs and NLP are to verbal reasoning what the calculator is to
       | quantitative reasoning.
       | 
       | Language and by extension verbal reasoning is full of ambiguity
       | and semantic slipperiness. For example, what degree of semantic
       | similarity distinguishes synonymous from synonym-ish concepts?
       | When do we partition concepts into homonyms?
       | 
        | I think part of the problem with how people evaluate LLMs is the
        | expectations that people have. Natural language != ontology.
       | The expectation should be more Chomsky and less Boole. Asking it
       | to solve math problems written in paragraph form is a waste of
       | time. Use a calculator for that! Solving riddles? Code it up in
       | prolog!
       | 
       | Instead you should be thinking of what operations you can do on
       | concepts, meaning, and abstract ideas! That is what these things
       | do.
        
         | missingrib wrote:
         | Is this really verbal reasoning? It's just a logic problem.
        
           | pfisherman wrote:
           | How can one / should one combine the concepts of a dinosaur
           | and monetary policy of the Ottoman Empire? What
           | differentiates verbal reasoning from logic?
           | 
           | I don't know that either of those can be solved well with
           | formal languages or logic.
        
             | samrus wrote:
             | Yeah but the Cheryl's birthday problem doesn't have any
             | ambiguity like that. It's all in very simple language, the
             | only complexity is keeping track of states of mind, which
             | is easy to abstract away from the language
        
               | pfisherman wrote:
               | That is exactly the point I was making in my comment
               | above. This type of unambiguous problem is best solved
               | using formal languages - something more like quantitative
               | reasoning. But stuff like prolog or classical automated
               | reasoning approaches are quite brittle. They break down
               | quickly when you start to introduce ambiguity and noise.
               | Statistical approaches like hidden markov models that
               | people used in these instances were the precursor to the
               | LLMs we have today.
               | 
               | But I was going down a rabbit hole there. My main point
               | is that trying to use LLMs to solve logic puzzles - that
               | can easily be solved in prolog - is a waste of time and a
               | failure of the imagination. The applications that should
               | be explored and would be most fruitful are those where
               | there is ambiguity and contradiction.
        
             | pfisherman wrote:
              | Follow-up on this one... I asked an LLM to give me the
             | funniest way to combine the concepts of a dinosaur and
             | monetary policy of the Ottoman Empire. This was the answer.
             | 
             |  _Imagine a "Dinoflationosaurus": a giant dinosaur who has
             | the job of overseeing the monetary policy of the Ottoman
             | Empire. However, this dinosaur is hopelessly behind the
             | times, using outdated gold coins that are buried in random
             | locations, like a prehistoric central bank.
             | 
             | Instead of regulating currency or adjusting interest rates,
             | the Dinoflationosaurus spends its days stomping around,
             | either hoarding or releasing massive piles of treasure
             | based on whether it sees its shadow, causing huge economic
             | fluctuations. Merchants and citizens scramble to predict
             | where the dinosaur will dig next, turning the entire
             | economy into a game of dinosaur-sized hide-and-seek with
             | inflation spikes tied to the beast's mood swings.
             | 
             | The Ottoman economists, dressed in traditional robes,
             | nervously try to explain to the sultan that no one knows
             | when the giant lizard will "stimulate the economy" by
             | smashing a treasury vault open._
             | 
              | Try to code something like that up in Prolog!
        
       | mrbungie wrote:
       | Not really about Theory of Mind, but in the same line, I remember
       | the other day someone argued with me that LLMs model the world,
       | rather than just modelling language (that may represent the
       | world).
       | 
        | I kept thinking about that problem and plausible experiments to
        | show my point that LLMs are dumb about the physical world, even
        | if they know perfectly well how it works in terms of
        | language/representation. So I thought, what happens if I give an
        | LLM an image and ask for a representation of said image in ASCII
        | art (obviously without relying on Python and the trivial pixel-
        | intensity-to-character transform it usually proposes; a sketch of
        | that transform appears at the end of this comment). Remember:
       | 
       | - LLMs should've been trained with a lot of RGB image training
       | data with associated captions => So it should understand images
       | very well.
       | 
       | - LLMs should've been trained with a lot of ASCII training data
       | with associated captions => So it should draw/write ASCII like an
       | expert. Plus, it understands vision apparently (managed as
       | tokens), so it should do well.
       | 
       | But it can't do a decent translation that captures the most
       | interesting features of an image into ASCII art (I'm pretty sure
        | a human with an hour of time would be able to do it, even if it's
        | awful ASCII art). For example, I uploaded an image macro meme
       | with text and two pictures of different persons kind of looking
       | at each other. The ASCII art representation just showed two
       | faces, that didn't look at each other but rather away from each
        | other. It just does not "understand" the concept of two people
        | looking at each other (even if it "understands" the language and
        | even the image patches when you ask where they are looking, it
        | will not draw that humanly important detail by itself).
       | 
       | These things just work with tokens, and that is useful and seems
       | like magic in a lot of domains. But there is no way in hell we
       | are going to get into AGI without a fully integrated sensor
       | platform that can model the world in its totality including
       | interacting with it (i.e. like humans in training, but not
       | necessarily in substrate nor training time hopefully). And I
       | really don't know how something that has a very partial model of
       | the world can have a Theory of Mind.
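        | 
        | For reference, the "trivial pixel intensity to character"
        | transform mentioned above is roughly the following. This is a
        | minimal sketch assuming Pillow is installed; the character ramp,
        | output width, and the file name "meme.png" are arbitrary choices:
        | 
        | ```python
        | from PIL import Image
        | 
        | RAMP = "@%#*+=-:. "  # dark -> light
        | 
        | def image_to_ascii(path, width=80):
        |     img = Image.open(path).convert("L")         # grayscale
        |     w, h = img.size
        |     height = max(1, int(h * width / w * 0.55))  # 0.55 ~ glyph aspect ratio
        |     img = img.resize((width, height))
        |     pixels = list(img.getdata())
        |     rows = []
        |     for r in range(height):
        |         row = pixels[r * width:(r + 1) * width]
        |         rows.append("".join(RAMP[p * (len(RAMP) - 1) // 255] for p in row))
        |     return "\n".join(rows)
        | 
        | print(image_to_ascii("meme.png"))
        | ```
        | 
        | It maps brightness to glyph density and nothing more, which is
        | exactly why it can't capture something like two faces looking at
        | each other.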
        
         | Jerrrrrrry wrote:
          | > _it should draw/write ASCII like an expert._
          | 
          | There isn't a lot of ASCII art built up turn by turn in
          | conversational training data - you are essentially asking a
          | goldfish to climb a tree.
          | 
          | > _It should have a lot of RGB image training data with
          | associated captions => So it should understand images very
          | well._
         | 
          | You seem to have conflated the architectures. ChatGPT was
          | trained on text and text-image embeddings - it can recognize,
          | but cannot project. That's the DALL-E portion - it leverages a
          | similar transformer architecture, but they are not the same
          | model nor architecture.
         | 
          | However, ask a Generative Adversarial Network for ASCII and
          | you'll get what you expect. Absent the intra-word character
          | cohesion that LLM tokenization provides, it will give
          | realistic, if
         | sometimes "uncanny" images - ones that "make sense"
         | sequentially, or in the short term, but not the longer, or
         | larger context.
         | 
         | The language portion of your brain, that works faster than you
         | do - else you would be at a loss of words constantly - is not
         | nearly as equipped to deal with spatial problems that your
         | posterior parietal cortex is.
         | 
         | Ultimately we are converging towards a Mixture-of-Experts model
          | that we will one day realize is just... us, but better.
        
       | AdieuToLogic wrote:
       | What is a software program?
       | 
       | The codification of a solution.
       | 
       | What is a solution?
       | 
       | An answer to a problem.
       | 
       | What is a problem?
       | 
       | The identification and expression of a need to be satisfied.
       | 
       | What is a need?
       | 
       | A uniquely human experience, one which only exists within the
        | minds of people who experience it.
        
         | zmgsabst wrote:
         | Do animals not have needs?
        
           | AdieuToLogic wrote:
           | > Do animals not have needs?
           | 
           | We are animals as well, so a pedantic answer would be "yes."
           | :-)
           | 
            | My _actual_ answer is that the thought process here began
            | with Norvig stating:
            | 
            | > _But none of them were able to write a program that finds
            | the solution._
           | 
           | So while the concept of "need" is shared across many entities
           | and widely applicable to many conversations, in this context
           | it serves as a root cause answering "why write a program."
        
       | dmead wrote:
        | I wonder if there are any unique properties of those programs
        | from which we could figure out which Stack Overflow posts or
        | textbooks they're copying.
        
       | IanCal wrote:
       | I'm not a huge fan of using these kind of riddles or gotchas.
       | Other comments have riddle variants which feel also like ways of
       | tripping someone up - if you don't spot the change you fail. And
       | what's more the originals are things that lots of people struggle
       | with (that's why they're riddles not basic questions).
       | 
       | There's also little prompting, which feels like calling up a
       | random person and demanding they solve a riddle straight away
       | without talking it through.
       | 
        | Part of the assumption here is that if LLMs don't write the right
        | code they don't understand what people know. I'd wager that a
        | huge number of people couldn't solve this puzzle yet fully
        | understand that different people have their own internal thoughts
        | and knowledge.
        
         | rtlacn wrote:
         | Humans can solve these riddles even if they have never seen a
         | similar one.
         | 
         | LLMs need to have seen at least a similar riddle with the same
         | structure and/or code to plagiarize.
         | 
         | Humans _can_ deal with modified riddles if they are alert and
         | awake. An LLM is always alert and awake, so it should be able
         | to handle modified riddles, but it cannot.
        
           | mirekrusin wrote:
            | Not to take anything away from what you're saying, but any
            | person involved in a hiring process _for programmers_ will
            | agree that people struggle with trivia far simpler than this
            | example.
        
             | samrus wrote:
              | Yeah but an LLM can't be excused for being nervous or tired
              | or just not operating at its best the way a human can. They
              | are always operating at their best, so if they trip up
              | once, that's their limit.
        
               | Terretta wrote:
                | > _An LLM can't be excused... LLMs are always operating
               | at their best_
               | 
               | Depends how much you turn down the heat.
        
               | mirekrusin wrote:
               | Yes, but they will only keep getting better; whether we
               | like it or not sooner or later we're going to be hiring
               | api keys, not people.
        
               | k__ wrote:
               | _" they will only keep getting better"_
               | 
               | I don't think that's a given.
               | 
               | It could very well be that they only get smaller, faster,
               | or even worse when they accidentally start training them
               | with hallucinations that spread on the web.
        
         | aithrowawaycomm wrote:
         | The more dangerous - and far more ridiculous - assumption is
         | that if the LLM _does_ write the correct code then it _does_
         | understand what people know. I am amazed at how common that
         | view is here.
        
         | ifdefdebug wrote:
         | > if you don't spot the change you fail
         | 
          | But that's exactly how the real world works too.
        
           | Fripplebubby wrote:
           | This has not been my experience with the real world. Riddles
           | and gotchas have played a very small role, so far.
        
             | shermantanktop wrote:
             | Unfortunately they do show up in tech interviews.
             | 
             | "What's the correct way to delete an element from a Java
             | array while iterating over it?"
             | 
             | "Well I suppose you'd need to avoid invalidating the
             | iterator state but I don't recall the..."
             | 
             | "BZZT!"
        
               | johnisgood wrote:
                | I am not sure I understand. It seems very easy. You
                | cannot directly remove an element from an array, but you
                | can create a new array that excludes that element. Arrays
                | have a fixed size once declared, and I cannot imagine
                | anyone who has written some code not knowing that. :/
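                | 
                | For illustration, the same gotcha in Python terms (the
                | thread's example is Java, but the principle carries
                | over): mutating a sequence while iterating over it,
                | versus building a new one.
                | 
                | ```python
                | nums = [2, 2, 3, 4]
                | 
                | # Buggy: removing during iteration shifts indices under the loop.
                | for n in nums:
                |     if n % 2 == 0:
                |         nums.remove(n)
                | print(nums)  # [2, 3] -- an even value survives
                | 
                | # Safe: build a new list that excludes the unwanted elements.
                | nums = [2, 2, 3, 4]
                | nums = [n for n in nums if n % 2 != 0]
                | print(nums)  # [3]
                | ```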
        
               | HappMacDonald wrote:
               | I "have written some code" but it's been decades since
               | I've done anything significant in Java in particular, and
               | every language handles arrays (and/or data structures
               | that get called "arrays") differently.
        
               | johnisgood wrote:
               | The terminology may be confusing, yes, although you would
               | rather call them dynamic arrays or lists (like in Common
               | Lisp). Plus you did say "decades", that is a long time
               | nevertheless. I was not referring to people who have
               | written some code decades ago, of course.
        
           | theptip wrote:
            | Most people with theory of mind can't trivially solve this
            | problem though. So the test doesn't disprove ToM in general;
            | at most it shows whether the model has memorized some
            | results.
        
             | godelski wrote:
             | People are capable of theory of mind. That does not mean
             | they're using it. Same is true for reasoning.
        
           | IanCal wrote:
           | No it doesn't.
           | 
           | You'd get the answer to a riddle wrong or miss something and
           | nobody would start assuming that means you lack a fundamental
            | understanding of how the world works. There are entire fields
            | that look into how and why we make various mistakes, and
            | riddles and puzzles designed _specifically_ to trip people
            | up.
           | 
            | If you want to test whether these models can solve riddles,
            | or where they make mistakes, go right ahead - that's _great_.
            | It's the assumption that this has a much deeper meaning that
            | is wrong to me.
        
             | YeGoblynQueenne wrote:
             | >> You'd get the answer to a riddle wrong or miss something
             | and nobody would start assuming that means you lack a
             | fundamental understanding of how the world works. There's
             | entire fields that look into how and why we make various
             | mistakes and riddles and puzzles designed specifically to
             | trip people up.
             | 
             | That's because with humans we assume a certain level of
             | competency and intellectual ability. We cannot make the
             | same assumption when testing AI systems like LLMs because
             | their level of competency and intellectual ability is
             | exactly the question we are trying to answer in the first
             | place.
             | 
             | Note that getting an answer a little wrong, because the
             | question looks like a question you already know the answer
             | to, can be catastrophic in real world conditions. Tipping a
             | frying pan over a plate on a table to serve an omelette
             | when you've learned to do the same thing to serve a cooked
             | shrimp works just fine and shows everyone how smart you are
             | and how well you generalise to novel situations, right up
             | to the point where the contents of the frying pan are on
             | fire and you still tip them over a plate, on a table. Made
             | of flammable wood. Oops.
             | 
             | Also note: a human may be confused by the Tsathoggua-
             | Cthuga-Cxaxukluth river-crossing riddle but they'd never be
             | confused about the danger of a frying pan on fire.
        
               | tsunamifury wrote:
               | This is such a strange and incoherently adjacent answer.
        
               | IanCal wrote:
               | > Also note: a human may be confused by the Tsathoggua-
               | Cthuga-Cxaxukluth river-crossing riddle but they'd never
               | be confused about the danger of a frying pan on fire.
               | 
               | Which highlights the problem with using these riddles to
               | assess other capabilities.
        
             | ifdefdebug wrote:
              | I wasn't talking about riddles, I was talking about the
              | real world. Suddenly something is just a little bit
              | different, and if you miss the change you fail. There's
              | plenty of that in the real world.
        
       | oli5679 wrote:
        | GPT o1-preview solves this puzzle correctly in 13 seconds and
        | gives a thorough logical deduction in the comments and
        | explanation.
        | 
        | I think it's a bit unfair on the LLM to ask it to retrieve the
        | puzzle definition from its training data, so I pasted the puzzle
        | description from his notebook.
       | 
       | https://chatgpt.com/share/670103ae-1c18-8011-8068-dd21793727...
        
         | lagmg05 wrote:
          | The question is if it solved the puzzle correctly _before
          | Norvig's article_ appeared. It could have been trained on the
          | article or on HN comments (I am told in every Llama discussion
          | that existing models can be modified and augmented).
          | 
          | There could even be an added routine that special-cases trick
          | questions and high-profile criticisms.
        
           | oli5679 wrote:
           | Sure,
           | 
           | I guess the best way to test this is to compose a new
           | question, of a similar format.
        
             | johnisgood wrote:
             | I am not sure "of a similar format" suffices here, it
             | should not have any resemblance or similarity to this new
             | question or riddle.
        
           | Terretta wrote:
            | > _The question is if it solved the puzzle correctly before
            | Norvig's article appeared. It could have been trained..._
           | 
           | This caught me by surprise -- is there a suggestion or
           | evidence that despite the "knowledge cutoff" OpenAI is
           | continuously retraining GPT-4o's chat-backing model(s) on day
           | over day updates to the web?
        
           | Fripplebubby wrote:
           | While this is technically possible, it is not remotely
           | practical and the downside risk of pushing out a borked model
           | is much higher than the upside.
           | 
           | Training the model is expensive (obviously), but even if you
           | are only training it slightly, running evaluations to
           | determine whether the particular training checkpoint is at or
           | above the quality bar is expensive, too.
        
         | kenjackson wrote:
         | Disappointing that Norvig didn't use the model that OpenAI
         | states is their best model for programming.
         | 
         | Also using himself as the programmer seemed like a convenient
         | choice. I'd much rather see him grab a random professional
         | programmer for the task.
        
         | godelski wrote:
         | The question is to get it to write generic code
        
       | willguest wrote:
       | This seems to amount to asking an LLM how it feels about Cheryl,
       | discovering that it is performatively happy about her existence,
       | and then deducing that the LLM has no capacity for genuine
       | emotion, expressed in the form of logic.
       | 
        | The faulty premise lies in the formulation of the test; it makes
        | the responses predictable, but also does a disservice to 'mind'
        | because it tries to interpret it in such a way that an LLM
        | _could_ begin to grapple with the basics, but not in a meaningful
        | way.
       | 
       | Perhaps it is useful to help build better context-specific logic
       | flows (generally known as software) but it doesn't seem to
       | provide any progress on the "theory of mind" front, which I guess
       | is a borrowed notion.
        
       | mark_l_watson wrote:
       | Nice! I use various LLMs many times a day as a limited coding
       | tool and something to bounce ideas off of, and it is impossible
       | to not think about how LLMs work and what their limitations are.
       | 
        | I tried just asking Claude Sonnet to solve the Cheryl's Birthday
       | word problem, changing the dates. Pretty cool that it can solve
       | it as a word problem, and LLMs will keep getting better at
       | coding.
       | 
       | As a slight tangent: I used a combination of Gemini, GPT-4o, and
       | Claude last week to write Common Lisp code for a simple RDF data
       | store and the subset of SPARQL queries that I thought I would
       | need in embedded Common Lisp applications. This process was far
       | from automatic: I initially provided almost two pages of English
       | instructions, and I had to help debug non-working code by adding
       | debug statements and then show the models the code with print
       | statements and the new output. I also did the optional thing of
       | asking for stylistic changes. TLDR: saved me time and I liked the
       | final code.
       | 
       | I always enjoy it when people like Peter and Karpathy write
       | relatively simple code to share ideas. I am a fairly good coder
       | (I had the meaningless title Master Software Engineer at Capital
       | One) but I like to read other people's code, and I must admit
       | that I spend more time reading code on GitHub than I spend
       | reading technical papers.
        
       | JPLeRouzic wrote:
       | Most LLMs won a T-shirt with the following inscription: " _I am
       | not as smart as Peter Norvig_ "!
        
       | diwank wrote:
        | Script generated by o1-preview:
        | 
        | ```python
        | # List of possible dates
        | dates = [
        |     ('May', 15), ('May', 16), ('May', 19),
        |     ('June', 17), ('June', 18),
        |     ('July', 14), ('July', 16),
        |     ('August', 14), ('August', 15), ('August', 17)
        | ]
        | 
        | def solve_cheryls_birthday(dates):
        |     # Initial possible dates
        |     possible_dates = dates.copy()
        | 
        |     # Step 1: Albert's statement
        |     # Create a count of each day
        |     day_counts = {}
        |     for month, day in dates:
        |         day_counts[day] = day_counts.get(day, 0) + 1
        | 
        |     # Filter out months where a unique day exists
        |     # (Albert knows Bernard doesn't know)
        |     possible_months = set()
        |     for month in set(month for month, day in dates):
        |         month_days = [day for m, day in dates if m == month]
        |         if not any(day_counts[day] == 1 for day in month_days):
        |             possible_months.add(month)
        | 
        |     possible_dates = [(month, day) for (month, day) in possible_dates
        |                       if month in possible_months]
        | 
        |     # Step 2: Bernard's statement
        |     # Recount the days in the filtered possible dates
        |     day_counts_in_possible = {}
        |     for month, day in possible_dates:
        |         day_counts_in_possible[day] = day_counts_in_possible.get(day, 0) + 1
        | 
        |     # Bernard can now deduce the date; keep dates where the day is unique
        |     possible_dates = [(month, day) for (month, day) in possible_dates
        |                       if day_counts_in_possible[day] == 1]
        | 
        |     # Step 3: Albert's final statement
        |     # Recount the months in the possible dates
        |     month_counts_in_possible = {}
        |     for month, day in possible_dates:
        |         month_counts_in_possible[month] = month_counts_in_possible.get(month, 0) + 1
        | 
        |     # Albert now knows the date; keep dates where the month is unique
        |     possible_dates = [(month, day) for (month, day) in possible_dates
        |                       if month_counts_in_possible[month] == 1]
        | 
        |     # The remaining date is Cheryl's birthday
        |     if len(possible_dates) == 1:
        |         return possible_dates[0]
        |     else:
        |         return None
        | 
        | # Solve the problem
        | birthday = solve_cheryls_birthday(dates)
        | if birthday:
        |     print(f"Cheryl's birthday is on {birthday[0]} {birthday[1]}")
        | else:
        |     print("Unable to determine Cheryl's birthday.")
        | ```
       | 
       | Output:
       | 
        |  _Cheryl's birthday is on July 16_
        
       | tel wrote:
       | I tried to replicate this and Claude 3.5 Sonnet got it correct on
       | the first try. It generated a second set of dates which contained
       | no solution so I asked it to write another python program that
       | generates valid date sets.
       | 
       | Here's the code it generated:
       | https://gist.github.com/tel/8e126563d2d5fb13e7d53cf3adad862e
       | 
        | In my test, it has absolutely no trouble with this problem and
       | can correctly translate the "theory of mind" into a progressive
       | constraint solver.
       | 
       | Norvig is, of course, a well-respected researcher, but this is a
       | bit disappointing. I feel confident he found that his tests
       | failed, but to disprove his thesis (at least as is internally
       | consistent with his experiment) we just need to find a single
       | example of an LLM writing Python code that realizes the answer. I
       | found that on the first try.
       | 
       | I think it's possible that there exists some implementation of
       | this problem, or something close enough to it, already in
       | Claude's training data. It's quite hard to disprove that
       | assertion. But still, I am satisfied with the code and its
        | translation. To relate the word problem to this solution requires
        | contemplating each character's state of mind as a set of
        | alternatives consistent with the information they've been given.
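        | 
        | For concreteness, here is a minimal sketch (not the code in the
        | gist above, and not Claude's output) of what "generates valid
        | date sets" has to check: a candidate set is valid only if the
        | three statements, applied in order, leave exactly one date. The
        | month/day pools and the set size are arbitrary choices.
        | 
        | ```python
        | import random
        | 
        | MONTHS = ["May", "June", "July", "August", "September"]
        | DAYS = range(13, 20)
        | 
        | def solve(dates):
        |     # Statement 1: Albert (month) knows Bernard (day) doesn't know,
        |     # so no date in Albert's month has a day that is unique overall.
        |     s1 = [d for d in dates
        |           if not any(sum(x[1] == day for x in dates) == 1
        |                      for m, day in dates if m == d[0])]
        |     # Statement 2: Bernard now knows: his day is unique within s1.
        |     s2 = [d for d in s1 if sum(x[1] == d[1] for x in s1) == 1]
        |     # Statement 3: Albert now knows: his month is unique within s2.
        |     s3 = [d for d in s2 if sum(x[0] == d[0] for x in s2) == 1]
        |     return s3
        | 
        | def random_valid_date_set(n=10):
        |     # Keep sampling candidate sets until one yields a unique
        |     # solution (this may take a number of tries).
        |     pool = [(m, d) for m in MONTHS for d in DAYS]
        |     while True:
        |         candidate = random.sample(pool, n)
        |         if len(solve(candidate)) == 1:
        |             return candidate
        | ```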
        
         | YeGoblynQueenne wrote:
         | >> I tried to replicate this and Claude 3.5 Sonnet got it
         | correct on the first try. It generated a second set of dates
         | which contained no solution so I asked it to write another
         | python program that generates valid date sets.
         | 
         | That's good but no cigar and it certainly didn't get it
         | "correct on the first try". First it generated a partially
         | correct solution. Then you had to prompt it again to generate a
         | new program. You were only able to do that because you know
         | what the right answer looks like in the first place. The second
          | program is missing a second set of dates, so it's not clear
          | whether it really gets it right the second time or is just
          | reproducing a different program from its training set because
          | you asked it to, without understanding the problem or what was
          | wrong with the first program.
         | 
         | >> I feel confident he found that his tests failed, but to
         | disprove his thesis (at least as is internally consistent with
         | his experiment) we just need to find a single example of an LLM
         | writing Python code that realizes the answer. I found that on
         | the first try.
         | 
         | That's not how testing LLM code generation is done in practice,
         | exactly because of the variance that can be expected in
         | generated results. To properly test an LLM (which I would agree
          | Norvig's experiment falls a little short of) one has to run
         | multiple experiments and evaluate all the results in aggregate
         | in some form. The usual way to do it is to draw k samples from
         | the LLM's distribution and check whether the correct answer is
         | generated at least n times (k@n metric). That's an awful metric
         | because it's basically allowing arbitrary "guesses" until the
         | LLM code generator gets it right. A simpler test is to generate
         | k programs, check whether each program is right or wrong, and
         | assign 1 for each correct answer and 0 for each incorrect
         | answer, then average over all answers. It's an open question
         | whether to count a partial answer as a 0, or 0.5.
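          | 
          | As a rough sketch of that simpler scoring scheme (the names
          | here are placeholders, not any particular evaluation harness):
          | generate k programs, run each one, score it 1 or 0 against the
          | known answer, and average.
          | 
          | ```python
          | def average_accuracy(generate_program, run_program, k=20,
          |                      expected="July 16"):
          |     # generate_program: callable returning one sampled program
          |     # run_program: callable executing a program, returning its answer
          |     scores = []
          |     for _ in range(k):
          |         program = generate_program()       # one sample from the LLM
          |         try:
          |             answer = run_program(program)  # execute the generated code
          |         except Exception:
          |             answer = None                  # a crashing program scores 0
          |         scores.append(1 if answer == expected else 0)
          |     return sum(scores) / k
          | ```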
         | 
         | So if we took the total failure in Norvig's experiment and the
         | only partial success in yours, and allowing for the most
         | charitable aggregation of results, we have something like 0.25
         | accuracy, which is not that impressive. Not least because it's
         | evaluated on just two test samples.
         | 
         | Also, please don't underestimate the knowledge of experts like
         | Peter Norvig.
        
         | conradev wrote:
         | He also did not test o1-preview which I feel would do well on
         | this class of problem.
        
         | trashface wrote:
          | They could have just put the correct result into Claude in the
          | time since he wrote this (Sept 25th).
        
       | fny wrote:
       | How does solving a logic puzzle imply a theory of mind? I don't
       | mean to say that LLMs don't have a theory of mind, just that
       | deductive reasoning does not amount to empathetic evaluations of
       | how someone else thinks and feels...
       | 
       | ...unless you're a programmer.
        
       ___________________________________________________________________
       (page generated 2024-10-05 23:01 UTC)