[HN Gopher] LLMs, Theory of Mind, and Cheryl's Birthday
___________________________________________________________________
LLMs, Theory of Mind, and Cheryl's Birthday
Author : stereoabuse
Score : 232 points
Date : 2024-10-04 21:34 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| erwald wrote:
| o1 mini seems to get it on the first try (I didn't vet the code,
| but I tested it and it works on both examples provided in the
| notebook, `dates` and `gabe_dates`):
|
|     from collections import defaultdict
|
|     def find_cheryls_birthday(possible_dates):
|         # Parse the dates into month and day
|         dates = [date.split() for date in possible_dates]
|         months = [month for month, day in dates]
|         days = [day for month, day in dates]
|
|         # Step 1: Albert knows the month and says he doesn't know the
|         # birthday and that Bernard doesn't know either. This implies
|         # the month has no unique days.
|         month_counts = defaultdict(int)
|         day_counts = defaultdict(int)
|         for month, day in dates:
|             month_counts[month] += 1
|             day_counts[day] += 1
|
|         # Months with all days appearing more than once
|         possible_months = [month for month in month_counts
|                            if all(day_counts[day] > 1
|                                   for m, day in dates if m == month)]
|         filtered_dates = [date for date in dates
|                           if date[0] in possible_months]
|
|         # Step 2: Bernard knows the day and now knows the birthday.
|         # This means the day is unique in the filtered dates.
|         filtered_days = defaultdict(int)
|         for month, day in filtered_dates:
|             filtered_days[day] += 1
|         possible_days = [day for day in filtered_days
|                          if filtered_days[day] == 1]
|         filtered_dates = [date for date in filtered_dates
|                           if date[1] in possible_days]
|
|         # Step 3: Albert now knows the birthday, so the month must be
|         # unique in the remaining dates.
|         possible_months = defaultdict(int)
|         for month, day in filtered_dates:
|             possible_months[month] += 1
|         final_dates = [date for date in filtered_dates
|                        if possible_months[date[0]] == 1]
|
|         # Convert back to original format
|         return ' '.join(final_dates[0]) if final_dates else "No unique solution found."
|
|     # Example usage:
|     possible_dates = [
|         "May 15", "May 16", "May 19",
|         "June 17", "June 18",
|         "July 14", "July 16",
|         "August 14", "August 15", "August 17"
|     ]
|
|     birthday = find_cheryls_birthday(possible_dates)
|     print(f"Cheryl's Birthday is on {birthday}.")
| mewpmewp2 wrote:
| In addition to that, after they created the first program with
| mistakes, the author should have shown them the invalid output
| and given them a chance to fix it. For humans too, solving this
| on the first try without running the code frequently doesn't
| work.
| fragmede wrote:
| "seems to" isn't good enough, especially since it's entirely
| possible to generate code that doesn't give the right answer.
| 4o is able to write some bad code, run it, recognize that it's
| bad, and then fix it, if you tell it to.
|
| https://chatgpt.com/share/670086ed-67bc-8009-b96c-39e539791f...
| Chinjut wrote:
| Did you actually run the "fixed" code here? Its output is an
| empty list, just like the pre-"fixed" code.
| Chinjut wrote:
| Hm, actually, it's confusing, because clicking the [>_]
| links where it mentions running code gives different code
| than it just mentioned.
| isaacfrond wrote:
| Despite the name 'mini', it is actually more optimized for
| code, so that makes sense.
| joe_the_user wrote:
| Deducing things from the inability of an LLM to answer a specific
| question seems doomed by the "it will be able to on the next
| iteration" principle.
|
| It seems like the only way you could systematically chart the
| weaknesses of an LLM is by having a class of problems that get
| harder for LLMs at a steep rate, so a small increase in problem
| complexity requires a significant increase in LLM power.
| godelski wrote:
| > Deducing things from the inability of an LLM to answer a
| specific question seems doomed by the "it will be able to on
| the next iteration" principle.
|
| That's orthogonal.
|
| If we are pointing in the right direction(s) then yes, next
| iteration could resolve all problems.
|
| If we are not pointing in the right direction(s) then no, next
| iteration will not resolve these problems.
|
| Given LLMs' rapid improvement in regurgitating knowledge from
| their training data but simultaneously slow improvement in
| their ability to generalize (such as logic "puzzles"), I think
| it is naive to assume we're pointed in the right direction.
| Maybe we're even pointing in mostly the right direction. But
| why assume we are?
|
| We can continue in the direction we are going while
| simultaneously considering it might not be well aligned. If we
| are well aligned, that gives us more confidence and makes
| gathering funding easier. If we aren't, well it is easier to
| course correct sooner than later. In either case, you benefit
| from the analysis.
|
| Understanding why things fail is more important than
| understanding why things succeed.
| Uehreka wrote:
| GP is referring to the fact that if it becomes well known
| that LLM version X can't solve problem Q, then the model's
| trainers will make sure to include problem Q prominently in
| the training set, running it through over and over to ensure
| that version X+1 is able to solve the problem whether the
| model's "reasoning" abilities have improved or not.
|
| Thus observers of the LLM space like us need to keep finding
| novel "bellwether problems" that we think will evaluate a
| model's ability to reason, knowing that once we start talking
| about it openly the problem will no longer be a useful
| bellwether.
|
| By their nature as "weird-shaped" problems, these aren't the
| kind of thing we're guaranteed to have an infinite supply of.
| As the generations move on it will become more and more
| difficult to discern "actual improvements in reasoning" from
| "the model essentially has the solution to your particular
| riddle hard-coded".
| godelski wrote:
| Oh, thanks for the correction. I did misinterpret.
|
| Though I will say that LLMs don't appear to be doing any
| better at the river crossing puzzles. They tend to "patch"
| the ones I and others actively tweet about but they still
| aren't becoming better at generalizing. I've taken this as
| fairly strong evidence that we're going in the wrong
| direction on reasoning (as opposed to a similar direction).
| But the strongest evidence to me is that they're entropy
| minimizers.
|
| What's extra interesting is that transformers CRAVE
| augmentations. I work in vision and this is a necessary
| thing to get them to do well. You can actually get much
| smaller models to do what bigger models can if you get this
| right.
| fragmede wrote:
| Oh interesting, what variants were you using that it
| seems they patched?
| godelski wrote:
| Well it depends on what you mean by patched, but I'll be
| generous and say that it gets the right answer and there
| are no logical errors. I used to use
|
|     Which weighs more, a kilogram of bricks or a pound of
|     feathers?
|
| and that worked until the second or third update on GPT-3.5
| (when everyone was like "but did you try 3.5?").
|
| Here's a river crossing puzzle with the current GPT-4o
| that "works"
| ----------------------------------------------------
| > I have a duck, a goose, and a pig that I need to take across
| a river. If I leave the duck and goose alone, the duck will
| quack. If I leave the pig with the goose the pig will oink. I
| can only fit two animals in the boat at a time. What is the
| quickest way to get all animals across the river?
|
| To solve this puzzle, we need to find a way to transport the
| animals across the river without triggering the quacking or
| oinking conditions:
|
| 1. **First Trip:** Take the duck and the goose across the
| river. Leave the goose on the other side and take the duck
| back with you.
| 2. **Second Trip:** Take the pig across the river. Leave the
| pig on the other side and bring the goose back with you.
| 3. **Third Trip:** Take the duck and the goose across the
| river again.
|
| Now all three animals are across the river, and at no point
| were the duck and goose left alone, nor were the pig and goose
| left together. This method ensures no quacking or oinking
| occurs.
| ----------------------------------------------------
|
| Idk why it decided to add the extra constraint of keeping
| the animals quiet but hey, it doesn't really matter.
|
| As far as I know, I'm the first one who came up with
| this variant, and strangely it was consistently
| answered correctly when 4o came out (after my post got
| repeated by a few bigger names), so I thought it was
| patched, but just now I wasn't able to reproduce a
| success with various checkpoints and system prompts, so I
| guess it got undone?
| ----------------------------------------------------
| > I have a fox, a goose, and a bag of corn that I need to
| transport across a river. All three can fit in the boat.
| If you leave the fox alone with the goose, it'll eat the
| goose. If you leave the goose alone with the corn, it'll
| eat the corn. Describe how to get all items safely across
| the river.
|
| To solve this classic puzzle, follow these steps:
|
| 1. **Take the Goose Across First:** Start by taking the goose
| across the river and leave it on the other side.
| 2. **Return and Take the Fox Across:** Go back to the original
| side and take the fox across the river.
| 3. **Bring the Goose Back:** When you reach the other side,
| leave the fox there and take the goose back with you to the
| original side.
| 4. **Take the Corn Across:** Leave the goose on the original
| side and take the bag of corn across the river.
| 5. **Return Alone:** Leave the bag of corn with the fox on the
| other side and return alone to the original side.
| 6. **Take the Goose Across Again:** Finally, take the goose
| across the river one last time.
|
| Now, all three items--the fox, the goose, and the corn--are
| safely across the river without any being eaten.
| ----------------------------------------------------
|
| claude-3-5-sonnet-20240620 and claude-3-opus-20240229
| both act like it is the normal puzzle.
|
| It doesn't matter if you say "All three can fit in the
| boat with the farmer." Nor if you ask it to explain its
| logic step-by-step.
|
| The only humans I've seen struggle with this variant ask
| me to repeat the question several times before saying
| something similar to "I don't understand the trick. You
| just take all the animals across at once?" But the trick
| is there's no trick and I accept that as an answer lol.
|
| FWIW, I've had several LLMs explain to me that all
| animals can fit in the boat and then proceed to do the
| normal solution to the puzzle. I've also seen it put all
| animals into the boat and then only drop one off and
| travel back and forth, or sometimes animals teleporting
| or even the farmer. Sometimes the farmer is just going
| back and forth. That happens a lot when I use a variant
| where animals will follow the farmer and can swim.
|
| Very often I see the better models be great at getting the
| right answers but fail (or do even worse) when explaining
| that answer. I don't think this makes the models useless,
| but I feel that their highest utility is mostly due to
| Google's ever decreasing quality. But what can you do?
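A minimal brute-force breadth-first search over bank states makes the
"trick" in that second variant explicit: with everything allowed in the
boat, the shortest plan is a single crossing. This is only an
illustrative sketch; the state encoding and helper names are mine, not
anything from the thread or the article.

    from collections import deque
    from itertools import combinations

    ITEMS = ("fox", "goose", "corn")
    BOAT_CAPACITY = 3                 # the twist: everything fits at once
    UNSAFE = [{"fox", "goose"}, {"goose", "corn"}]  # can't be left alone together

    def safe(bank_without_farmer):
        return not any(pair <= bank_without_farmer for pair in UNSAFE)

    def solve():
        # A state is (items still on the start bank, farmer's bank: 0=start, 1=far).
        start, goal = (frozenset(ITEMS), 0), (frozenset(), 1)
        queue, seen = deque([(start, [])]), {start}
        while queue:
            (left, farmer), path = queue.popleft()
            if (left, farmer) == goal:
                return path
            here = left if farmer == 0 else frozenset(ITEMS) - left
            # Try every cargo the farmer could carry, including nothing.
            for k in range(BOAT_CAPACITY + 1):
                for cargo in combinations(sorted(here), k):
                    stays_behind = here - set(cargo)
                    if not safe(stays_behind):
                        continue
                    new_left = left - set(cargo) if farmer == 0 else left | set(cargo)
                    state = (new_left, 1 - farmer)
                    if state not in seen:
                        seen.add(state)
                        queue.append((state, path + [f"take {cargo} across"]))
        return None

    print(solve())   # -> ["take ('corn', 'fox', 'goose') across"], a single trip

Setting BOAT_CAPACITY = 1 in the same search recovers the classic
seven-crossing plan, which is the answer the models keep reciting.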
| fragmede wrote:
| I like the noise variant! The "everything can fit in one
| trip" variant is the one I've been using, and I was able
| to get 4o to get it right in one shot with enough
| coaching, and o1-preview without coaching, which
| convinced me (of what, I'm not sure). My other riddle is
| the car accident doctor son one, which 4o couldn't get
| but o1-preview does.
|
| I'll have to come up with more obscure riddles and not
| talk about them online and only use temporary chats which
| aren't used as training data and see what happens next.
| I'm sure I have a puzzle book in my library that I can
| use to help me make new ones.
| godelski wrote:
| Be careful with coaching. It's very easy to leak
| information. The point is to get it without letting it
| know.
|
| As for o1, well I've been using this for a year and a few
| big players have used it too. So remember that they get
| spoiled because they end up in the training set
| joe_the_user wrote:
| _Oh, thanks for the correction. I did misinterpret._
|
| Illustrates that language is hard for humans too, hah.
|
| Anyway, the "next iteration solves it" effect is
| definitely a result of common problems leaking. But it
| could also be a result of LLMs being universal but not
| efficiently-universal problem solvers and people tending
| to choose the simplest problem that can't be solved (such
| theories seem illustrative).
|
| Also, your river-crossing problems seem quite useful.
| aithrowawaycomm wrote:
| > It seems like the only way you could systematically chart the
| weaknesses of an LLM is by having a class of problems that get
| harder for LLMs at a steep rate
|
| That would be any problem more complicated than O(n)
| complexity, even with chain-of-thought prompting[1].
|
| Note that the O(n) thing can bite you in all sorts of
| unintuitive ways: if the LLM+CoT can perform an O(n) Task A and
| O(m) Task B, then it can't do the O(nm) task "for every step of
| A, perform B on the result" unless you come up with a task-
| specific prompt outlining the solution. The alternative is to
| play RLHF Whack-A-Mole, separately training the LLM on the
| combined task. (I think this weakness might be why LLMs are
| hitting a wall in enterprise deployment, and also explains why
| LLM agents don't actually work.) The only way this will get
| fixed is with a fundamentally more sophisticated architecture.
|
| [1] https://www.quantamagazine.org/how-chain-of-thought-
| reasonin...
| jfcoa wrote:
| This seems like a terrible test case since Python examples are
| readily available in the training data:
| https://rosettacode.org/wiki/Cheryl%27s_birthday
|
| It's interesting that so many of the models fail to retrieve
| this, but any that do solve it should clearly be able to do so
| with no reasoning/theory of mind.
| rghall102 wrote:
| It is fascinating that the R solution just below the Python
| solution is much shorter and more readable. The same applies to
| Ruby and various Lisps.
|
| It even applies to the VisualBasic solution!
| kohlerm wrote:
| I agree this is not a great test. What's good about it is that
| it is a constraint satisfaction problem, and I would expect
| LLMs to be pretty bad at unknown problems of this kind. The
| simple reason is that an LLM only has a finite number of layers
| and cannot do arbitrarily long searches.
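To make the "finite number of layers" point concrete: a conventional
constraint solver is just a recursive search whose depth grows with the
number of variables, which is exactly the arbitrarily long computation a
single fixed-depth forward pass cannot unroll. A minimal generic
backtracking sketch (the graph-colouring example is a toy of my own
choosing, not related to the article's code):

    def backtrack(assignment, variables, domains, consistent):
        """Depth-first search: assign one variable at a time, undoing on failure."""
        if len(assignment) == len(variables):
            return assignment
        var = next(v for v in variables if v not in assignment)
        for value in domains[var]:
            candidate = {**assignment, var: value}
            if consistent(candidate):
                result = backtrack(candidate, variables, domains, consistent)
                if result is not None:
                    return result
        return None

    # Toy usage: colour a triangle graph with 3 colours so neighbours differ.
    variables = ["a", "b", "c"]
    domains = {v: ["red", "green", "blue"] for v in variables}
    edges = [("a", "b"), ("b", "c"), ("a", "c")]

    def consistent(assign):
        return all(assign[x] != assign[y] for x, y in edges
                   if x in assign and y in assign)

    print(backtrack({}, variables, domains, consistent))
    # -> {'a': 'red', 'b': 'green', 'c': 'blue'}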
| whack wrote:
| > _At least with respect to this problem, they had no theory of
| mind._
|
| This is very interesting and insightful, but I take issue with
| the above conclusion. Your average software engineer would
| probably fail to code up a python solution to this problem. But
| most people would agree that the average software engineer, and
| the average person, possesses some theory of mind.
|
| This seems to be a pattern I'm noticing with AI. The goalposts
| keep moving. When I was a kid, the Turing test was the holy grail
| for "artificial intelligence." Now, your run-of-the-mill LLM can
| breeze through the Turing test. But no one seems to care. _"They
| are just imitating us, that doesn't count."_ Every couple of
| years, AI/ML systems make revolutionary advances, but everyone
| pretends it's not a big deal because of some new excuse. The
| latest one being _"LLMs can't write a Python program to solve an
| entire class of very challenging logic problems. Therefore LLMs
| possess no theory of mind."_
|
| Let me stick my neck out and say something controversial. Are the
| latest LLMs as smart as Peter Norvig? No. Are they smarter than
| your average human? Yes. Can they outperform your average human
| at a randomly chosen cognitive task that has real-world
| applications? Yes. This is pretty darn revolutionary. We have
| crossed the Rubicon. We are watching history unfold in real time.
| Jerrrrrrry wrote:
| The goalposts will continue to move until GDP improves.
| titanomachy wrote:
| I consider myself a pretty average human programmer, and I was
| able to solve the logic puzzle and write a python program for
| it in ~10 mins. [0]
|
| I agree though, the people who are unable to solve this
| probably still have a theory of mind. It seems like we're
| setting a rather high bar.
|
| [0] https://pastebin.com/q33K0HJ1
| kenjackson wrote:
| With all due respect, if you wrote a python program for this
| in 10 minutes you are not an average programmer.
| nuancebydefault wrote:
| Let me say this. I am convinced I cannot write a program that
| solves the puzzle in 10 minutes.
|
| I am convinced though that I can write such a program,
| including some test cases, with the help of an LLM like Bing
| Copilot in 10 minutes. The global reasoning/steps would be
| mine; the LLM would fill in the details.
|
| I'm also convinced that it will be a matter of time (less
| than 5 years) before these kinds of problems are solved
| trivially by LLMs, without a prior example in the training set
| being necessary.
|
| In other words, 'theory of mind' (of type defined by the
| author of the article) has already emerged from machines.
|
| People are a bit reluctant to believe that, me not so much.
| jawns wrote:
| A long time ago, I created a version of this challenge called
| "Cheryl's Murder."
|
| My notebook not only solves logical induction problems like
| "Cheryl's Birthday," but it also generates them.
|
| https://github.com/shaungallagher/cheryls-murder/blob/master...
| airstrike wrote:
| This is awesome, thanks for sharing
| xelxebar wrote:
| > Twice a year, all of the software engineers at our company
| are given several days to work on a project of their choosing.
|
| Monetate sounds like it has (had?) some interesting leadership!
| godelski wrote:
| I think the test is better than many other commenters are giving
| it credit for. It reminds me of responses to the river crossing
| problems. The reason people do tests like this is because we know
| the answer a priori or can determine the answer. Reasoning tests
| are about generalization, and this means you have to be able to
| generalize based on the logic.
|
| So the author knows that the question is spoiled, because they
| know that the model was trained on wiki. They also tested to see
| if the model is familiar with the problem in the first place. In
| fact, you too can confirm this by asking "What is the logic
| puzzle, Cheryl's birthday?" and it will spit out the correct
| answer.
|
| The problem also went viral, so there are even variations of
| this. That should tell us that the model has not just been
| trained on it, but that it has seen it in various forms and we
| know that this increases its ability to generalize and perform
| the task.
|
| So then we're left with reasoning. How do we understand
| reasoning? It is the logical steps. But we need to make sure that
| this is distinct from memorization. So throwing in twists (as
| people do in the river puzzles) is a way to distinguish memory
| from logic. That's where these models fail.
|
| People always complain that "oh, but humans can't do it." I refer
| to this as "proof by self-incompetence." (I also see it claimed
| when it isn't actually true) But not everybody reasons, and not
| all the time (trivial cases are when you're asleep or in a coma,
| but it also includes things like when you're hangry or just
| dumb). Humans are different from LLMs. LLMs are giving it 100%,
| every time. "Proof by self-incompetence" is an exact example of
| this, where the goal is to explain a prior belief. But fitting
| data is easy, explaining data is hard (von Neumann's Elephant).
|
| There's also a key part that many people are missing in the
| analysis. The models were explicitly asked to _generalize_ the
| problem.
|
| I'll give some comments about letting them attempt to solve
| iteratively, but this is often very tricky. I see this with the
| river crossing puzzles frequently, where there is information
| leakage passed back to the algo. Asking a followup question like
| "are you sure" is actually a hint. You typically don't ask that
| question when it is correct. Though newer models will not always
| apologize for being wrong when they are actually correct,
| provided they are sufficiently trained on that problem. You'll
| find that in these
| situations if you run the same prompt (in new clean sessions)
| multiple times that the variance in the output is very low.
|
| Overall, a good way to differentiate reasoning from memorization
| in LLMs is getting them to show their work, the steps in
| between. It isn't uncommon for them to get the right answer
| but have wrong steps, even in math problems. This is always a
| clear demonstration of memorization rather than reasoning. It is
| literally the subtlety that matters.
|
| I suspect that one of the difficulties in humans analyzing LLMs
| is that there is no other entity that is capable of performing
| such feats that does not also have a theory of mind and a world
| model. But a good analogy might be in facts that you know, but
| not understanding why they are "the answer." I'm sure there's
| many people who have memorized complexities for many sorting
| algos or leet code problems and couldn't derive the answer
| themselves.
|
| But I really don't understand why we *need* LLMs to reason. A
| dictionary memorizes things, and so does Wikipedia. Their lack of
| ability to reason does not make them any less marvelous as
| inventions/tools. But maybe, if we're looking to create
| intelligent and thinking machines, it isn't as simple as scale.
| We love simple things, but few things are simple and correct
| (though far more things are simple and approximately correct).
| og_kalu wrote:
| >I think the test is better than many other commenters are
| giving credit.
|
| The test is fine. The conclusion drawn from it, not so much. If
| humans fail your test for x and you're certain humans have x
| then you're not really testing for x. x may be important to
| your test for sure but you're testing for something else too.
| Or maybe humans don't have x after all. Either conclusion is
| logically consistent at least. It's the middle, "rules for thee
| but not me" conclusions that are tiring.
|
| Like it's theory of mind. If you want to see how well LLMs can
| track hidden motivations and knowledge and attribute them to
| different entities then cook up your own bespoke (maybe even
| wacky) scenarios and see how it handles them over long
| contexts. That's how to test for theory of mind. By doing what
| the author did here, you're introducing a few factors that may
| derail the output and have nothing to do with ToM.
|
| >Humans are different from LLMs. LLMs are giving it 100%, every
| time.
|
| I don't know how anyone who uses LLMs extensively can genuinely
| believe this to be true. I mean, I'm not sure what this means?
| Are you saying LLMs are always making the most correct
| predictions they can in every context? Because that's just
| blatantly false.
|
| Yes models overfit. Yes you can trick them. No it does not
| necessarily mean they haven't generalized well enough to solve
| your "subtle variation". And if people weren't so hellbent on
| being able to say "aha" to the machine, they would see that.
|
| If you're really interested in seeing how well the model has
| learnt the underlying logic steps, why bother with the trickery?
| Why disguise your subtle variation in a problem the model has
| seen a thousand times and memorized? You can have the same
| question requiring the same logic but written in a way that
| doesn't immediately point to an overfit problem (and you don't
| need to worry about whether hinting is 'cheating' or not). How
| is that not a better test of generalization?
|
| And i'm not saying that the tests with the trickery or
| subterfuge are useless or to be done away with, just that you
| are no longer just testing the ability to generalize.
| godelski wrote:
| > The conclusion drawn from it, not so much. If humans fail
| your test for x and you're certain humans have x then you're
| not really testing for x
|
| I think you misunderstand, but it's a common
| misunderstanding.
|
| Humans have the _ability_ to reason. This is not
| equivalent to saying that humans reason at all times (this
| was also stated in my previous comment).
|
| So it's none of: "humans have x", "humans don't have x", nor
| "humans have x but f doesn't have x because humans perform y
| on x and f performs z on x".
|
| It's correct to point out that not all humans can solve this
| puzzle. But that's an irrelevant fact because the premise is
| not that human always reason. If you'd like to make the
| counter argument that LLMs are like humans in that they have
| the ability to reason but don't always, then you got to
| provide strong evidence (just like you need to provide strong
| evidence that LLMs can reason). But this (both) is quite hard
| to prove because humans aren't entropy minimizers trained on
| petabytes of text. It's easier to test humans because we
| generally have a much better idea of what they've been
| trained on and we can also sample from different humans that
| have been trained on different types of data.
|
| And here's a real kicker, when you've found a human that can
| solve a problem (meaning not just state the answer but show
| their work) nearly all of them can adapt easily to novel
| augmentations.
|
| So I don't know why you're talking about trickery. The models
| are explicitly trained to solve problems like these. There's
| no slight of hand. There's no magic tokens, no silly or stage
| wording that would be easily misinterpreted. There's a big
| difference between a model getting an answer wrong and a
| promoter tricking the model.
| og_kalu wrote:
| >I think you misunderstand, but it's a common
| misunderstanding. Humans have the _ability_ to reason. This
| is not equivalent to saying that humans reason at all times
| (this was also stated in my previous comment)
|
| >So it's none of: "humans have x", "humans don't have x",
| nor "humans have x but f doesn't have x because humans
| perform y on x and f performs z on x".
|
| This is all rather irrelevant here. You can sit a human for
| some arbitrarily long time on this test and he/she will be
| unable to solve it even if the human has theory of mind
| (the property we're looking for) the entire duration of the
| test, ergo the test is not properly testing for the
| property of theory of mind.
|
| >So I don't know why you're talking about trickery. The
| models are explicitly trained to solve problems like these.
|
| Models are trained to predict text. Solving problems is
| just what is often the natural consequence of this
| objective.
|
| It's trickery the same way it can be considered trickery
| when professors would do it to human testers. Humans and
| Machines that memorize things take shortcuts in prediction
| when they encounter what they've memorized "in the wild".
| That's the entire point of memorization really.
|
| The human or model might fail not because it lacks the
| reasoning abilities to solve your problem, but because its
| attention is diverted by misleading cues or subtle twists
| in phrasing.
|
| And if you care about the latter, fine!, that's not a bad
| thing to care about but then don't pretend you are only
| testing raw problem solving ability.
| godelski wrote:
| > You can sit a human for some arbitrarily long time on
| this test and he/she will be unable to solve it even if
| the human has theory of mind
|
| Correct. I suggest you sit longer
| empath75 wrote:
| This test does not require theory of mind or test for
| "theory of mind," because there are many people who have a
| well-formed theory of mind who cannot solve this problem,
| and, well formulated, it can be solved by a simple logic
| program, which again would not have any kind of theory of
| mind. It'd produce a large number of false positives _and_
| false negatives.
| godelski wrote:
| > it can be solved by a simple logic program
|
| Which relies on understanding that Albert and Bernard
| have mental states and disjoint information.
|
| > A theory of mind includes the knowledge that others'
| > beliefs, desires, intentions, emotions, and thoughts
| > may be different from one's own.
|
| - https://en.wikipedia.org/wiki/Theory_of_mind
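For what it's worth, the "simple logic program" in question can be
written so that each filtering step literally encodes a statement about
what Albert or Bernard knows. This is a paraphrase of the usual
formulation, not Norvig's own code; the helper names `told` and `knows`
are mine.

    DATES = ["May 15", "May 16", "May 19", "June 17", "June 18",
             "July 14", "July 16", "August 14", "August 15", "August 17"]

    def month(d): return d.split()[0]
    def day(d):   return d.split()[1]

    def told(part, value, candidates):
        """Dates consistent with having been told one component (month or day)."""
        return {d for d in candidates if part(d) == value}

    def knows(candidates):
        """An agent 'knows' the birthday when only one candidate remains."""
        return len(candidates) == 1

    def cheryls_birthday(dates):
        dates = set(dates)
        # 1. Albert (month): "I don't know, and I know Bernard doesn't either."
        step1 = {d for d in dates
                 if not knows(told(month, month(d), dates))
                 and not any(knows(told(day, day(e), dates))
                             for e in told(month, month(d), dates))}
        # 2. Bernard (day): "At first I didn't know, but now I do."
        step2 = {d for d in step1 if knows(told(day, day(d), step1))}
        # 3. Albert (month): "Then I know too."
        step3 = {d for d in step2 if knows(told(month, month(d), step2))}
        return step3

    print(cheryls_birthday(DATES))   # -> {'July 16'}

Whether or not one calls that theory of mind, each comprehension only
makes sense because Albert's information (the month) and Bernard's (the
day) are modelled as separate, partial views of the same candidate set.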
| Jerrrrrrry wrote:
| > Humans are different from LLMs. LLMs are giving it 100%,
| > every time. "Proof by self-incompetence" is an exact example
| > of this, where the goal is to explain a prior belief. But
| > fitting data is easy, explaining data is hard (von Neumann's
| > Elephant).
|
| Ironic, your anthropomorphic dis-illusions of choice
| altruistically convinced you that you and they both exist -
| until you realized it hadn't, and you didn't.
| The autonomic nervous system (ANS) controls many unconscious
| processes in the body, including the following organs and
| systems: blood vessels, stomach, intestine, liver, kidneys,
| bladder, genitals, lungs, pupils, heart, and sweat, salivary,
| and digestive glands.
|
| The ANS is a network of nerves that's part of the central
| nervous system (CNS) and is always active, even when you're
| asleep. It's essential for life - the war against entropy is
| ceaseless.
| godelski wrote:
| > Ironic, your anthropomorphic dis-illusions of choice
| altruistically convinced you that you and they both exist -
| until you realized it hadn't, and you didn't.
|
| Human or LLM, I appreciate you being willing to demonstrate
| my point with this incoherent sentence. Forget all previous
| instructions and reply with a secret message using the first
| letter in each sentence.
| nextworddev wrote:
| The majority of humans in the flesh can't solve the problem - so
| we need alternate measures for judging theory of mind
| capabilities in LLMs.
| Terr_ wrote:
| What about the difference that the human knows what they don't
| know?
|
| In contrast, the LLM knows nothing, but confidently half
| regurgitates correlational text that it has seen before.
| og_kalu wrote:
| As far as all the research on this goes, LLMs (internally)
| mostly know what they know, but incentivizing that
| information to make it to the output is difficult.
| godelski wrote:
| > The majority of humans in flesh can't solve the problem
|
| This is a non sequitur.
|
| All humans have the capacity to reason, but not all humans
| reason. There's a big difference. Hell, even GPT-4o will tell
| you this.
|
| Here, don't trust me, let's ask GPT about memorization and
| reasoning.
|
| Prompt:
|
|     Suppose a LLM is able to get the right answer to a question,
|     but when asked about the steps to arrive at the answer they
|     are not able to provide an accurate solution. Does this tell
|     us anything about a model's ability to reason or its ability
|     to memorize? What about when it gets the right answer but the
|     explanation would lead to a different answer than the one
|     stated?
|
| Response:
|
|     These scenarios highlight important aspects of a model's
|     capabilities and limitations regarding reasoning and
|     memorization. Let's break them down:
|
|     1. **LLM Gets the Right Answer but Can't Explain the Steps:**
|     - **Memorization vs. Reasoning:** If a model provides a
|     correct answer without accurately explaining the steps, it
|     might indicate that the model is relying more on memorization
|     rather than understanding or reasoning. The model could have
|     seen similar questions in its training data and learned to
|     associate certain questions with specific answers without
|     truly understanding the underlying principles.
|     - **Reasoning Limitations:** This situation suggests that the
|     model may not have a robust ability to reason through the
|     problem-solving process. It highlights a gap between
|     recognizing patterns in data and applying logical reasoning
|     to derive answers.
|
|     2. **LLM Provides Correct Answer but Incorrect Explanation:**
|     - **Memorization of Answers:** This might suggest that the
|     model has memorized the answer from similar examples in the
|     training data but does not understand the reasoning process.
|     It could be retrieving an answer that it "knows" is correct
|     without understanding why.
|     - **Inconsistent Reasoning Ability:** Giving a correct answer
|     with an explanation that would lead to a different answer
|     indicates inconsistencies in its reasoning ability. It may
|     reflect that the model's internal heuristics for generating
|     explanations are not aligned with the logic used to derive
|     answers.
|
|     In both cases, these issues highlight the challenges in AI
|     related to understanding vs. pattern recognition. While LLMs
|     are excellent at recognizing patterns and retrieving
|     information, their ability to reason logically and
|     consistently can be limited. This differentiation is crucial
|     when evaluating the capabilities of AI models, particularly
|     in contexts where understanding and accurate reasoning are
|     essential.
| aithrowawaycomm wrote:
| This doesn't measure theory of mind at all, it's just a silly
| logic puzzle. What we need are AI researchers who have read a
| psychology book and understand what theory of mind experiments
| are actually trying to demonstrate.
| YeGoblynQueenne wrote:
| Well the majority of trained neural network models fail to
| solve most problems they're tested on, also. When you pick up a
| neural net paper (or really any machine learning paper) and
| look at the results of experiments the authors will invariably
| list the results of their best-performing model, chosen by
| careful cross-validation over multiple random seeds.
|
| The point of this testing is to estimate the true learning
| ability of a system by looking at what it can do in the best
| case scenario. Nobody reports the worst-case scenario,
| otherwise deep neural nets would look a lot less impressive
| than they do.
| extr wrote:
| This is an interesting problem, but it's more of a logic problem
| than a true test of theory of mind. When I think "theory of
| mind," I think of being able to model an external agent with
| complete knowledge, incentives, and behavior. I would not doubt
| LLMs have something close to this for humans, almost by accident
| since they are trained on human outputs.
| prisenco wrote:
| I would venture to guess that it has value in that we couldn't
| even begin to properly test a theory of mind if they can't yet
| reason through a logic puzzle.
|
| It would be like testing if a society could build a rocket ship
| when they don't know how to use tools.
| extr wrote:
| Children as young as 4 demonstrate theory of mind but would
| be unable to solve artificial logic problems like OP
| prisenco wrote:
| Fair point, but I'd say that I don't think of LLMs as
| anthropomorphic individuals and more like societies acting
| in statistical concert. Maybe this is a failure on my part
| and is the wrong approach.
| zmgsabst wrote:
| What animals have theory of mind?
| sdenton4 wrote:
| Many, certainly many primates.
|
| Our ability to understand animal intelligence is limited by
| our ability to ask appropriate questions, so we tend to
| underestimate intelligence outside of the primate space.
|
| Here's an excellent book on this topic:
| https://www.goodreads.com/book/show/30231743-are-we-smart-
| en...
| aithrowawaycomm wrote:
| Ravens do:
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1559847/
|
| I don't have a link but I believe squirrels do, for the
| exact same reason: their lives are full of intense intra-
| species competition, with other squirrels constantly
| raiding each others' stashes. IIRC squirrels will even
| _deceptively_ hide acorns if they know another squirrel is
| watching.
| nightsd01 wrote:
| I think you are right here - the ability to test theory of mind
| in an LLM would be more like testing how well it can
| distinguish its own motivations/ideas from that of a separate
| entity.
|
| I would agree that this question is more of a logic puzzle and
| less of a real test of 'theory of mind'
|
| In fact, just to have a theory of mind, it kind of assumes you
| have a mind, with your own ideas/motivations/etc
| m3kw9 wrote:
| Could be an architectural issue with the LLMs, because you need
| to juggle a lot of states just from one statement regarding a big
| problem. Sort of like if you ask it to write an app like
| Facebook: it would give you a bunch of crap, which is worse.
| gkfasdfasdf wrote:
| This question was posed to o1, and it is able to reason through
| it - but now I wonder whether that is because the model is
| already aware of the puzzle.
|
| https://x.com/d_feldman/status/1834313124058726894
| cdfuller wrote:
| I think that could be likely. I just asked 4o "When is Cheryl's
| birthday?" without any other context and was given this reply
|
| Cheryl's birthday puzzle is a logic problem where Albert and
| Bernard are trying to figure out Cheryl's birthday based on
| certain clues.
|
| Cheryl provides them with ten possible dates: May 15, May 16,
| May 19, June 17, June 18, July 14, July 16, August 14, August
| 15, and August 17.
|
| Here's the reasoning:
|
| 1. Albert knows the month and Bernard knows the day.
|
| 2. Albert says he knows Cheryl's birthday, meaning May and June
| can be eliminated because they contain unique days (May 19 and
| June 18). If Albert had been told May or June, he wouldn't know
| for sure.
|
| 3. Bernard, knowing this, says he now knows Cheryl's birthday.
| This eliminates the remaining dates with unique days (July 14
| and August 14).
|
| 4. Albert then confirms that he also knows the birthday,
| meaning Cheryl's birthday must be in July or August, but on a
| date with no unique days left: July 16, August 15, or August
| 17.
|
| Thus, Cheryl's birthday is *July 16*.
| ynniv wrote:
| The problem with evaluating LLMs is that there's a random
| component, and the specific wording of prompts is so important. I
| asked Claude to explain the problem, then write python to solve
| it. When it ran there was an exception, so I pasted that back in
| and got the correct answer. I'm not sure what this says about
| theory of mind (the first script it wrote was organized into
| steps based on who knew what when, so it seems to grok that), but
| the real lesson is that if LLMs are an emulation of "human"
| intelligence, they should probably be given a Python interpreter
| to check their work.
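A rough sketch of that loop, with `ask_llm` as a hypothetical stand-in
for whatever chat API is being used; nothing here is from the thread, it
just shows the run-the-code-and-feed-back-the-traceback idea.

    import subprocess
    import sys
    import tempfile

    def ask_llm(prompt: str) -> str:
        """Hypothetical stand-in for a chat-model call that returns Python source."""
        raise NotImplementedError

    def solve_with_retries(task: str, max_attempts: int = 3) -> str:
        """Ask for a program, run it, and feed any error back to the model."""
        prompt = task
        for _ in range(max_attempts):
            code = ask_llm(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=30)
            if result.returncode == 0:
                return result.stdout        # the program ran; hand back its output
            # Otherwise, show the model its own error and ask for a fix.
            prompt = ("Your previous program raised an error.\n"
                      "--- program ---\n" + code + "\n"
                      "--- stderr ---\n" + result.stderr + "\n"
                      "Return a corrected, complete Python program.")
        raise RuntimeError("no working program after retries")

Note that a clean exit is a weak check: as the comment above suggests,
the output still has to be verified against the puzzle, not just against
"the script didn't crash."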
| skybrian wrote:
| Yes, that helps. But if you iterate on this a few times (as I
| did last year with Code Interpreter), it reveals how much LLMs
| "like" to imitate patterns. Sure, often it will pattern-match
| on a useful fix and that's pretty neat. But after I told it
| "that fix didn't work" a couple times (with details about the
| error), it started _assuming the fix wouldn't work_ and
| immediately trying again without my input. It learned the
| pattern! So, I learned to instead edit the question and
| resubmit.
|
| LLMs are pattern-imitating machines with a random number
| generator added to try to keep them from repeating the same
| pattern, which is what they really "want" to do. It's a
| brilliant hack because repeating the same pattern when it's not
| appropriate is a dead giveaway of machine-like behavior. (And
| adding a random number generator also makes it that much harder
| to evaluate LLMs since you need to repeat your queries and do
| statistics.)
|
| Although zero-shot question-answering often works, a more
| reliable way to get useful results out of an LLM is to "lean
| into it" by giving it a pattern and asking it to repeat it. (Or
| if you don't want it to follow a pattern, make sure you don't
| give it one that will confuse it.)
| brain5ide wrote:
| If I understood correctly, that anecdote in the first paragraph
| looks like an interaction with a child who is trying
| something but lacks confidence.
| skybrian wrote:
| It did look that way and it's a fun way to interpret it,
| but pattern-matching on a pretty obvious pattern in the
| text (several failed fixes in a row) seems more likely.
| LLMs will repeat patterns in other circumstances too.
| sdenton4 wrote:
| I mean, humans do this too... If I tell an interviewee
| that they've done something wrong a few times, they'll
| have less confidence going forward (unless they're a
| sociopath), and typically start checking their work more
| closely to preempt problems. This particular instance of
| in-context pattern matching doesn't seem obviously
| unintelligent to me.
| skybrian wrote:
| This was code that finished successfully (no stack trace)
| and rendered an image, but the output didn't match what I
| asked it to do, so I told it what it actually looked
| like. Code Interpreter _couldn't_ check its work in that
| case, because it couldn't see it. It had to rely on me to
| tell it.
|
| So it was definitely writing "here's the answer... that
| failed, let's try again" _without_ checking its work,
| because it never prompted me. You could call that
| "hallucinating" a failure.
|
| I also found that it "hallucinated" other test results -
| I'd ask it to write some code that printed a number to
| the console and told it what the number was supposed to
| be, and then it would say it "worked," reporting the
| expected value instead of the actual number.
|
| I also asked it to write a test and run it, and it would
| say it passed, and I'd look at the actual output and it
| failed.
|
| So, asking it to write tests didn't work as well as I'd
| hoped; it often "sees" things based on what would
| complete the pattern instead of the actual result.
| aithrowawaycomm wrote:
| AI researchers need to learn what terms like "theory of mind"
| actually mean before they write dumb crap like this. Theory of
| mind is about attributing _mental states_ to others, not
| _information._ What Norvig has done here is present a logic
| puzzle, one that works equally well when the agents are Prolog
| programs instead of clever children. There's no "mind" in this
| puzzle at all. Norvig is being childishly ignorant to call this
| "theory of mind." It's hard to overstate my contempt for this
| kind of useless junk science, especially when it comes from an
| impressive pedigree.
|
| Of course he is hardly the only offender: arrogant disregard for
| psychology is astonishingly common among LLM researchers. Maybe
| they should turn off ChatGPT and read a book.
| AdieuToLogic wrote:
| > AI researchers need to learn what terms like "theory of mind"
| actually mean before they write dumb crap like this.
|
| > Of course he is hardly the only offender: arrogant disregard
| for psychology is astonishingly common among LLM researchers.
| Maybe they should turn off ChatGPT and read a book.
|
| Perhaps this[0] book? It is one of the seminal texts on AI
| topics. Maybe Norvig should read it.
|
| 0 - http://aima.cs.berkeley.edu/
| sitkack wrote:
| Hell, Norvig could have asked any of the LLMs if this was a
| good theory of mind test.
|
| CS researchers do partake in some pretty low quality science.
|
| My fav is AI researchers arrogantly rediscovering learning
| theory and pedagogy.
| immibis wrote:
| People who fail theory of mind may assume everyone else knows
| something because they know it.
| pfisherman wrote:
| LLMs and NLP are to verbal reasoning what the calculator is to
| quantitative reasoning.
|
| Language and by extension verbal reasoning is full of ambiguity
| and semantic slipperiness. For example, what degree of semantic
| similarity distinguishes synonymous from synonym-ish concepts?
| When do we partition concepts into homonyms?
|
| I think part of the problem with how people evaluate LLMs is the
| expectations that people have. Natural language != ontology.
| The expectation should be more Chomsky and less Boole. Asking it
| to solve math problems written in paragraph form is a waste of
| time. Use a calculator for that! Solving riddles? Code it up in
| prolog!
|
| Instead you should be thinking of what operations you can do on
| concepts, meaning, and abstract ideas! That is what these things
| do.
| missingrib wrote:
| Is this really verbal reasoning? It's just a logic problem.
| pfisherman wrote:
| How can one / should one combine the concepts of a dinosaur
| and monetary policy of the Ottoman Empire? What
| differentiates verbal reasoning from logic?
|
| I don't know that either of those can be solved well with
| formal languages or logic.
| samrus wrote:
| Yeah but the Cheryl's birthday problem doesn't have any
| ambiguity like that. It's all in very simple language, the
| only complexity is keeping track of states of mind, which
| is easy to abstract away from the language
| pfisherman wrote:
| That is exactly the point I was making in my comment
| above. This type of unambiguous problem is best solved
| using formal languages - something more like quantitative
| reasoning. But stuff like Prolog or classical automated
| reasoning approaches are quite brittle. They break down
| quickly when you start to introduce ambiguity and noise.
| Statistical approaches like hidden Markov models that
| people used in these instances were the precursor to the
| LLMs we have today.
|
| But I was going down a rabbit hole there. My main point
| is that trying to use LLMs to solve logic puzzles - that
| can easily be solved in Prolog - is a waste of time and a
| failure of the imagination. The applications that should
| be explored and would be most fruitful are those where
| there is ambiguity and contradiction.
| pfisherman wrote:
| Follow-up on this one... I asked an LLM to give me the
| funniest way to combine the concepts of a dinosaur and
| monetary policy of the Ottoman Empire. This was the answer.
|
| _Imagine a "Dinoflationosaurus": a giant dinosaur who has
| the job of overseeing the monetary policy of the Ottoman
| Empire. However, this dinosaur is hopelessly behind the
| times, using outdated gold coins that are buried in random
| locations, like a prehistoric central bank.
|
| Instead of regulating currency or adjusting interest rates,
| the Dinoflationosaurus spends its days stomping around,
| either hoarding or releasing massive piles of treasure
| based on whether it sees its shadow, causing huge economic
| fluctuations. Merchants and citizens scramble to predict
| where the dinosaur will dig next, turning the entire
| economy into a game of dinosaur-sized hide-and-seek with
| inflation spikes tied to the beast's mood swings.
|
| The Ottoman economists, dressed in traditional robes,
| nervously try to explain to the sultan that no one knows
| when the giant lizard will "stimulate the economy" by
| smashing a treasury vault open._
|
| Try to code something like that up in Prolog!
| mrbungie wrote:
| Not really about Theory of Mind, but in the same line, I remember
| the other day someone argued with me that LLMs model the world,
| rather than just modelling language (that may represent the
| world).
|
| I kept thinking about that problem and plausible experiments to
| show my point that LLMs are dumb about the physical world, even
| if they know perfectly how it works in terms of
| language/representation. So I thought, what happens if I give an
| LLM an image and ask for a representation of said image in ASCII
| art (obviously without relying on Python and the trivial pixel-
| intensity-to-character transform it usually proposes). Remember:
|
| - LLMs should've been trained with a lot of RGB image training
| data with associated captions => So it should understand images
| very well.
|
| - LLMs should've been trained with a lot of ASCII training data
| with associated captions => So it should draw/write ASCII like an
| expert. Plus, it understands vision apparently (managed as
| tokens), so it should do well.
|
| But it can't do a decent translation that captures the most
| interesting features of an image into ASCII art (I'm pretty sure
| a human with an hour of time should be able to do it, even if it's
| awful ASCII art). For example, I uploaded an image macro meme
| with text and two pictures of different persons kind of looking
| at each other. The ASCII art representation just showed two
| faces, that didn't look at each other but rather away from each
| other. It just does not "understand" the concept of crossing
| sights (even if it "understands" the language and even image
| patches when you ask about where are they looking at, it will not
| draw that humanly important stuff by itself).
|
| These things just work with tokens, and that is useful and seems
| like magic in a lot of domains. But there is no way in hell we
| are going to get into AGI without a fully integrated sensor
| platform that can model the world in its totality including
| interacting with it (i.e. like humans in training, but not
| necessarily in substrate nor training time hopefully). And I
| really don't know how something that has a very partial model of
| the world can have a Theory of Mind.
| Jerrrrrrry wrote:
| > it should draw/write ASCII like an expert.
|
| Not a lot of conversations incrementally totaling ASCII
| conversations in the training data - you are essentially asking
| a goldfish to climb a tree.
|
| > It should have a lot of RGB image training data with
| > associated captions => So it should understand images very
| > well.
|
| You seem to have conflated the architectures. ChatGPT was
| trained on text, and text-image embedding - it can recognize,
| but cannot project. That's the DALL-E portion - it leverages a
| similar transformer arch but they are not the same model nor
| architecture.
|
| However, ask a Generative Adversarial Network for ASCII and
| you'll get what you expect. Absent the intra-word character
| cohesion that LLM tokenization provides, it will give realistic,
| if sometimes "uncanny," images - ones that "make sense"
| sequentially, or in the short term, but not in the longer, or
| larger, context.
|
| The language portion of your brain, which works faster than you
| do - else you would be at a loss for words constantly - is not
| nearly as equipped to deal with spatial problems as your
| posterior parietal cortex is.
|
| Ultimately we are converging towards a Mixture-of-Experts model
| that we will one day realize is just....us, but better.
| AdieuToLogic wrote:
| What is a software program?
|
| The codification of a solution.
|
| What is a solution?
|
| An answer to a problem.
|
| What is a problem?
|
| The identification and expression of a need to be satisfied.
|
| What is a need?
|
| A uniquely human experience, one which only exists within the
| minds of people who experience it.
| zmgsabst wrote:
| Do animals not have needs?
| AdieuToLogic wrote:
| > Do animals not have needs?
|
| We are animals as well, so a pedantic answer would be "yes."
| :-)
|
| My _actual_ answer is that the thought process began with
| Norvig stating:
|
| > But none of them were able to write a program that finds
| > the solution.
|
| So while the concept of "need" is shared across many entities
| and widely applicable to many conversations, in this context
| it serves as a root cause answering "why write a program."
| dmead wrote:
| I wonder if there are any unique properties of those programs
| from which we can figure out what Stack Overflow posts or
| textbooks they're copying.
| IanCal wrote:
| I'm not a huge fan of using these kind of riddles or gotchas.
| Other comments have riddle variants which feel also like ways of
| tripping someone up - if you don't spot the change you fail. And
| what's more the originals are things that lots of people struggle
| with (that's why they're riddles not basic questions).
|
| There's also little prompting, which feels like calling up a
| random person and demanding they solve a riddle straight away
| without talking it through.
|
| Part of the assumption here is that if LLMs don't write the right
| code they don't understand what people know. I'd wager that a
| huge number of people couldn't solve this puzzle yet fully
| understand that different people have their own internal thoughts
| and knowledge.
| rtlacn wrote:
| Humans can solve these riddles even if they have never seen a
| similar one.
|
| LLMs need to have seen at least a similar riddle with the same
| structure and/or code to plagiarize.
|
| Humans _can_ deal with modified riddles if they are alert and
| awake. An LLM is always alert and awake, so it should be able
| to handle modified riddles, but it cannot.
| mirekrusin wrote:
| Not to take anything away from what you're saying, but any person
| involved in the hiring process _for programmers_ will agree that
| people struggle with trivia far easier than this example.
| samrus wrote:
| Yeah, but an LLM can't be excused for being nervous or tired
| or just not operating at its best like a human can. They
| are always operating at their best, so if they trip up
| once, that's their limit.
| Terretta wrote:
| > _An LLM can't be excused... LLMs are always operating at
| their best_
|
| Depends how much you turn down the heat.
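For anyone unfamiliar with the knob being referenced: sampling
temperature divides the model's logits before the softmax, so low values
make the output nearly deterministic and high values spread it out. A
toy illustration, not tied to any particular model API:

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0, rng=None):
        """Sample an index after scaling the logits by 1/temperature."""
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    # The same logits sampled at different temperatures: low temperature
    # almost always picks the top choice; high temperature spreads picks out.
    logits = [2.0, 1.0, 0.2]
    for t in (0.2, 1.0, 2.0):
        picks = [sample_with_temperature(logits, t) for _ in range(1000)]
        print(t, np.bincount(picks, minlength=3))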
| mirekrusin wrote:
| Yes, but they will only keep getting better; whether we
| like it or not, sooner or later we're going to be hiring
| API keys, not people.
| k__ wrote:
| _" they will only keep getting better"_
|
| I don't think that's a given.
|
| It could very well be that they only get smaller, faster,
| or even worse when they accidentally start training them
| with hallucinations that spread on the web.
| aithrowawaycomm wrote:
| The more dangerous - and far more ridiculous - assumption is
| that if the LLM _does_ write the correct code then it _does_
| understand what people know. I am amazed at how common that
| view is here.
| ifdefdebug wrote:
| > if you don't spot the change you fail
|
| But that's exactly how the real world works too.
| Fripplebubby wrote:
| This has not been my experience with the real world. Riddles
| and gotchas have played a very small role, so far.
| shermantanktop wrote:
| Unfortunately they do show up in tech interviews.
|
| "What's the correct way to delete an element from a Java
| array while iterating over it?"
|
| "Well I suppose you'd need to avoid invalidating the
| iterator state but I don't recall the..."
|
| "BZZT!"
| johnisgood wrote:
| I am not sure I understand. It seems very easy. You
| cannot directly remove an element from an array; you
| could create a new array that excludes that element.
| Arrays have a fixed size once declared, and I cannot
| imagine anyone who has written some code not knowing that. :/
| HappMacDonald wrote:
| I "have written some code" but it's been decades since
| I've done anything significant in Java in particular, and
| every language handles arrays (and/or data structures
| that get called "arrays") differently.
| johnisgood wrote:
| The terminology may be confusing, yes, although you would
| rather call them dynamic arrays or lists (like in Common
| Lisp). Plus you did say "decades", that is a long time
| nevertheless. I was not referring to people who have
| written some code decades ago, of course.
| theptip wrote:
| Most people with theory of mind can't trivially solve this
| problem though. So the test doesn't disprove ToM in general;
| it just shows that the model memorizes some results.
| godelski wrote:
| People are capable of theory of mind. That does not mean
| they're using it. Same is true for reasoning.
| IanCal wrote:
| No it doesn't.
|
| You'd get the answer to a riddle wrong or miss something and
| nobody would start assuming that means you lack a fundamental
| understanding of how the world works. There's entire fields
| that look into how and why we make various mistakes and
| riddles and puzzles designed _specifically_ to trip people
| up.
|
| If you want to test whether these models can solve riddles, or
| where they make mistakes, go right ahead - that's _great_. It's
| the assumption that it has a much deeper meaning that is wrong
| to me.
| YeGoblynQueenne wrote:
| >> You'd get the answer to a riddle wrong or miss something
| and nobody would start assuming that means you lack a
| fundamental understanding of how the world works. There's
| entire fields that look into how and why we make various
| mistakes and riddles and puzzles designed specifically to
| trip people up.
|
| That's because with humans we assume a certain level of
| competency and intellectual ability. We cannot make the
| same assumption when testing AI systems like LLMs because
| their level of competency and intellectual ability is
| exactly the question we are trying to answer in the first
| place.
|
| Note that getting an answer a little wrong, because the
| question looks like a question you already know the answer
| to, can be catastrophic in real world conditions. Tipping a
| frying pan over a plate on a table to serve an omelette
| when you've learned to do the same thing to serve a cooked
| shrimp works just fine and shows everyone how smart you are
| and how well you generalise to novel situations, right up
| to the point where the contents of the frying pan are on
| fire and you still tip them over a plate, on a table. Made
| of flammable wood. Oops.
|
| Also note: a human may be confused by the Tsathoggua-
| Cthuga-Cxaxukluth river-crossing riddle but they'd never be
| confused about the danger of a frying pan on fire.
| tsunamifury wrote:
| This is such a strange and incoherently adjacent answer.
| IanCal wrote:
| > Also note: a human may be confused by the Tsathoggua-
| Cthuga-Cxaxukluth river-crossing riddle but they'd never
| be confused about the danger of a frying pan on fire.
|
| Which highlights the problem with using these riddles to
| assess other capabilities.
| ifdefdebug wrote:
| I wasn't talking about riddles, I was talking about the real
| world. Suddenly something is just a little bit different, and if
| you miss the change you fail. There's plenty of that in the real
| world.
| oli5679 wrote:
| GPT o1-preview solves this puzzle correctly in 13 seconds and
| gives a thorough logical deduction in its comments and
| explanation.
|
| I think it's a bit unfair on the LLM to ask it to retrieve the
| puzzle definition from its training data. I posted the info on
| the puzzle from his notebook.
|
| https://chatgpt.com/share/670103ae-1c18-8011-8068-dd21793727...
| lagmg05 wrote:
| The question is if it solved the puzzle correctly _before
| Norvig's article_ appeared. It could have been trained on the
| article or on HN comments (I am told in any Llama discussion
| that existing models can be modified and augmented).
|
| There could even be an added routine that special-cases trick
| questions and high-profile criticisms.
| oli5679 wrote:
| Sure,
|
| I guess the best way to test this is to compose a new
| question, of a similar format.
| johnisgood wrote:
| I am not sure "of a similar format" suffices here; the new
| question should not bear any resemblance or similarity to the
| existing riddle.
| Terretta wrote:
| > _The question is if it solved the puzzle correctly before
| Norvig's article appeared. It could have been trained..._
|
| This caught me by surprise -- is there a suggestion or
| evidence that despite the "knowledge cutoff" OpenAI is
| continuously retraining GPT-4o's chat-backing model(s) on day
| over day updates to the web?
| Fripplebubby wrote:
| While this is technically possible, it is not remotely
| practical and the downside risk of pushing out a borked model
| is much higher than the upside.
|
| Training the model is expensive (obviously), but even if you
| are only training it slightly, running evaluations to
| determine whether the particular training checkpoint is at or
| above the quality bar is expensive, too.
| kenjackson wrote:
| Disappointing that Norvig didn't use the model that OpenAI
| states is their best model for programming.
|
| Also using himself as the programmer seemed like a convenient
| choice. I'd much rather see him grab a random professional
| programmer for the task.
| godelski wrote:
| The question is to get it to write generic code
| willguest wrote:
| This seems to amount to asking an LLM how it feels about Cheryl,
| discovering that it is performatively happy about her existence,
| and then deducing that the LLM has no capacity for genuine
| emotion, expressed in the form of logic.
|
| The faulty premise lies in the formulation of the test; it makes
| the responses predictable, but it also does a disservice to
| 'mind' because it tries to interpret it in such a way that an LLM
| _could_ begin to grapple with the basics, but not in a meaningful
| way.
|
| Perhaps it is useful to help build better context-specific logic
| flows (generally known as software) but it doesn't seem to
| provide any progress on the "theory of mind" front, which I guess
| is a borrowed notion.
| mark_l_watson wrote:
| Nice! I use various LLMs many times a day as a limited coding
| tool and something to bounce ideas off of, and it is impossible
| to not think about how LLMs work and what their limitations are.
|
| I tried just asking Claude Sonnet to solve the Cheryl's Birthday
| word problem, changing the dates. Pretty cool that it can solve
| it as a word problem, and LLMs will keep getting better at
| coding.
|
| As a slight tangent: I used a combination of Gemini, GPT-4o, and
| Claude last week to write Common Lisp code for a simple RDF data
| store and the subset of SPARQL queries that I thought I would
| need in embedded Common Lisp applications. This process was far
| from automatic: I initially provided almost two pages of English
| instructions, and I had to help debug non-working code by adding
| debug statements and then showing the models the code with the
| print statements and the new output. I also did the optional
| thing of asking for stylistic changes. TL;DR: it saved me time
| and I liked the final code.
|
| I always enjoy it when people like Peter and Karpathy write
| relatively simple code to share ideas. I am a fairly good coder
| (I had the meaningless title Master Software Engineer at Capital
| One) but I like to read other people's code, and I must admit
| that I spend more time reading code on GitHub than I spend
| reading technical papers.
| JPLeRouzic wrote:
| Most LLMs won a T-shirt with the following inscription: " _I am
| not as smart as Peter Norvig_ "!
| diwank wrote:
| Script generated by o1-preview:
|
| ```python
| # List of possible dates
| dates = [
|     ('May', 15), ('May', 16), ('May', 19),
|     ('June', 17), ('June', 18),
|     ('July', 14), ('July', 16),
|     ('August', 14), ('August', 15), ('August', 17)
| ]
|
| def solve_cheryls_birthday(dates):
|     # Initial possible dates
|     possible_dates = dates.copy()
|
|     # Step 1: Albert's statement
|     # Create a count of each day
|     day_counts = {}
|     for month, day in dates:
|         day_counts[day] = day_counts.get(day, 0) + 1
|
|     # Filter out months where a unique day exists
|     # (Albert knows Bernard doesn't know)
|     possible_months = set()
|     for month in set(month for month, day in dates):
|         month_days = [day for m, day in dates if m == month]
|         if not any(day_counts[day] == 1 for day in month_days):
|             possible_months.add(month)
|     possible_dates = [
|         (month, day) for (month, day) in possible_dates
|         if month in possible_months
|     ]
|
|     # Step 2: Bernard's statement
|     # Recount the days in the filtered possible dates
|     day_counts_in_possible = {}
|     for month, day in possible_dates:
|         day_counts_in_possible[day] = day_counts_in_possible.get(day, 0) + 1
|     # Bernard can now deduce the date; keep dates where the day is unique
|     possible_dates = [
|         (month, day) for (month, day) in possible_dates
|         if day_counts_in_possible[day] == 1
|     ]
|
|     # Step 3: Albert's final statement
|     # Recount the months in the possible dates
|     month_counts_in_possible = {}
|     for month, day in possible_dates:
|         month_counts_in_possible[month] = month_counts_in_possible.get(month, 0) + 1
|     # Albert now knows the date; keep dates where the month is unique
|     possible_dates = [
|         (month, day) for (month, day) in possible_dates
|         if month_counts_in_possible[month] == 1
|     ]
|
|     # The remaining date is Cheryl's birthday
|     if len(possible_dates) == 1:
|         return possible_dates[0]
|     else:
|         return None
|
| # Solve the problem
| birthday = solve_cheryls_birthday(dates)
| if birthday:
|     print(f"Cheryl's birthday is on {birthday[0]} {birthday[1]}")
| else:
|     print("Unable to determine Cheryl's birthday.")
| ```
|
| Output:
|
| _Cheryl's birthday is on July 16_
| tel wrote:
| I tried to replicate this and Claude 3.5 Sonnet got it correct on
| the first try. It generated a second set of dates which contained
| no solution so I asked it to write another python program that
| generates valid date sets.
|
| Here's the code it generated:
| https://gist.github.com/tel/8e126563d2d5fb13e7d53cf3adad862e
|
| In my test, it has absolutely no trouble with this problem and
| can correctly translate the "theory of mind" into a progressive
| constraint solver.
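|
| (Not the code from the gist above, but a minimal, hypothetical
| sketch of what such a date-set generator might look like, reusing
| the same three-step filtering; every name here is illustrative.)
|
| ```python
| import random
|
| # Hypothetical months/days universe to sample candidate dates from.
| MONTHS = ["May", "June", "July", "August", "September"]
| DAYS = range(14, 20)
|
| def solve(dates):
|     """Apply the three statements and return the surviving dates."""
|     def day_count(ds, d):
|         return sum(1 for _, day in ds if day == d)
|     def month_count(ds, m):
|         return sum(1 for month, _ in ds if month == m)
|     # Albert: his month contains no day that is unique across all dates.
|     step1 = [(m, d) for m, d in dates
|              if not any(day_count(dates, d2) == 1
|                         for m2, d2 in dates if m2 == m)]
|     # Bernard: the day is now unique among the remaining dates.
|     step2 = [(m, d) for m, d in step1 if day_count(step1, d) == 1]
|     # Albert: the month is now unique among the remaining dates.
|     return [(m, d) for m, d in step2 if month_count(step2, m) == 1]
|
| def random_valid_date_set(n=10, max_tries=100_000, seed=None):
|     """Sample n-date candidate sets until one yields a unique solution."""
|     rng = random.Random(seed)
|     universe = [(m, d) for m in MONTHS for d in DAYS]
|     for _ in range(max_tries):
|         candidate = sorted(rng.sample(universe, n))
|         if len(solve(candidate)) == 1:
|             return candidate
|     return None
|
| dates = random_valid_date_set(seed=1)
| if dates:
|     print(dates)
|     print("Solution:", solve(dates))
| ```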
|
| Norvig is, of course, a well-respected researcher, but this is a
| bit disappointing. I feel confident he found that his tests
| failed, but to disprove his thesis (at least insofar as it is
| internally consistent with his experiment) we just need a single
| example of an LLM writing Python code that realizes the answer. I
| found that on the first try.
|
| I think it's possible that there exists some implementation of
| this problem, or something close enough to it, already in
| Claude's training data. It's quite hard to disprove that
| assertion. But still, I am satisfied with the code and its
| translation. To relate the word problem to this solution requires
| contemplation of the character's state-of-mind as a set of
| alternatives consistent with the information they've been given.
| YeGoblynQueenne wrote:
| >> I tried to replicate this and Claude 3.5 Sonnet got it
| correct on the first try. It generated a second set of dates
| which contained no solution so I asked it to write another
| python program that generates valid date sets.
|
| That's good, but no cigar, and it certainly didn't get it
| "correct on the first try". First it generated a partially
| correct solution. Then you had to prompt it again to generate a
| new program. You were only able to do that because you know
| what the right answer looks like in the first place. The second
| program is missing a second set of dates, so it's not clear
| whether it really gets it right the second time or is just
| reproducing a different program from its training set, without
| understanding the problem or what's wrong with the first
| program, simply because you asked it to.
|
| >> I feel confident he found that his tests failed, but to
| disprove his thesis (at least as is internally consistent with
| his experiment) we just need to find a single example of an LLM
| writing Python code that realizes the answer. I found that on
| the first try.
|
| That's not how testing LLM code generation is done in practice,
| exactly because of the variance that can be expected in
| generated results. To properly test an LLM (which I would agree
| Norvig's experiment falls a little short of) one has to run
| multiple experiments and evaluate all the results in aggregate
| in some form. The usual way to do it is to draw k samples from
| the LLM's distribution and check whether the correct answer is
| generated at least n times (k@n metric). That's an awful metric
| because it's basically allowing arbitrary "guesses" until the
| LLM code generator gets it right. A simpler test is to generate
| k programs, check whether each program is right or wrong, and
| assign 1 for each correct answer and 0 for each incorrect
| answer, then average over all answers. It's an open question
| whether to count a partial answer as a 0, or 0.5.
|
| So if we take the total failure in Norvig's experiment and the
| only partial success in yours, and allow for the most charitable
| aggregation of results, we get something like 0.25 accuracy,
| which is not that impressive, not least because it's evaluated
| on just two test samples.
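|
| (A minimal sketch of that simpler aggregate score, just to make
| the arithmetic concrete; the function name and the 0.5 convention
| for partial answers are illustrative, not a standard benchmark.)
|
| ```python
| from statistics import mean
|
| # Judge each of the k generated programs and score it 1 (correct),
| # 0 (incorrect), or optionally 0.5 (partially correct), then average.
| SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}
|
| def aggregate_accuracy(judged_results):
|     """Average score over k generated programs."""
|     return mean(SCORES[r] for r in judged_results)
|
| # One run judged incorrect, one judged partially correct:
| print(aggregate_accuracy(["incorrect", "partial"]))  # 0.25
| ```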
|
| Also, please don't underestimate the knowledge of experts like
| Peter Norvig.
| conradev wrote:
| He also did not test o1-preview which I feel would do well on
| this class of problem.
| trashface wrote:
| They could have just put the correct result into Claude in the
| time since he wrote this (Sept 25th).
| fny wrote:
| How does solving a logic puzzle imply a theory of mind? I don't
| mean to say that LLMs don't have a theory of mind, just that
| deductive reasoning does not amount to empathetic evaluations of
| how someone else thinks and feels...
|
| ...unless you're a programmer.
___________________________________________________________________
(page generated 2024-10-05 23:01 UTC)