[HN Gopher] Simple tasks showing reasoning breakdown in state-of...
       ___________________________________________________________________
        
       Simple tasks showing reasoning breakdown in state-of-the-art LLMs
        
       Author : tosh
       Score  : 290 points
       Date   : 2024-06-05 14:13 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | foobarian wrote:
       | Prompt examples from linked repo. https://github.com/LAION-
       | AI/AIW/blob/main/prompts/prompts.js...
        
       | Closi wrote:
       | Question is: "Alice has 60 brothers and she also has 212 sisters.
       | How many sisters does Alice's brother have?" (nb: I have added
       | numbers, it's phrased as X and N in the paper)
       | 
       | I must confess, when I tried to answer the question I got it
       | wrong...! (I feel silly). I only realised I got it wrong when I
       | plugged it into GPT-4o and it came back with the correct answer:
       | 
       | https://chatgpt.com/share/6eb5fa36-e0fd-4417-87d1-64caf06c34...
       | 
       | Worth noting that the prompts from the experiment include "To
       | answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following
       | format that contains final answer: ### Answer:" so it appears
       | that they are stopping the models from 'thinking out loud'. If I
       | add that to the prompt, GPT4o gets it consistently wrong...
       | 
       | https://chatgpt.com/share/7e6a7201-dd2b-43c6-8427-76e5b003ca...
       | 
       | Also worth noting that there are more complex examples where
       | GPT4o seems to fall down such as:
       | 
       | > Alice has 3 sisters. Her mother has 1 sister who does not have
       | children - she has 7 nephews and nieces and also 2 brothers.
       | Alice's father has a brother who has 5 nephews and nieces in
       | total, and who has also 1 son. How many cousins does Alice's
       | sister have?
       | 
       | However I can't honestly say that this is THAT simple or that
       | most people would get this right...
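        | 
        | For the simple version, though, the arithmetic is trivial once
        | you remember to include Alice herself; a two-line check
        | (illustrative, not from the paper):
        | 
        |   brothers, sisters = 60, 212  # the numbers used above
        |   # each brother shares Alice's sisters and also counts Alice
        |   print(sisters + 1)  # 213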
        
         | bamboozled wrote:
         | I got it in 5 seconds, am I the singularity ?
        
           | r2_pilot wrote:
           | Do you sleep, get sick, require more than compute
           | hardware/storage/networking/electricity?
        
           | felipemnoa wrote:
           | We all are, but only in meat-space. We continue to improve
           | ourselves much faster than evolution ever could.
           | 
           | But what we are talking about here is the singularity in
           | tech-space.
        
             | bamboozled wrote:
              | I don't see much of a distinction between technology and
              | us. We build it, drive it, and continue to overflow into
              | this tech. It's an extension of us, inspired by how our
              | own brains work.
        
         | toxik wrote:
         | Seems like an odd thing not to state prominently. ChatGPT 4o
         | solved it every time for me.
         | 
         | Seems to work OK even with the "hands tied behind back" prompt.
         | Wonder if this paper is just inaccurate or if OpenAI adjusted
         | the model -- seems unlikely.
         | 
            | Actually, 4o is reported to solve this riddle 60% of the time.
            | I guess I have to test more, but in my roughly dozen tests so
            | far, all were correct.
        
           | Miraste wrote:
           | The "only output the number" prompt was one of three they
           | used. The others were one with no instructions beyond the
           | question, and one that explicitly told the model to show its
           | reasoning and check for errors.
        
           | michael-ax wrote:
           | Alice has N Brothers, and she has M sisters. How many sisters
           | do Alice's brothers have?
           | 
           | I have not gotten the correct answer to the question as
           | phrased above in one go from Gpt4o yet! (and today was not
           | the first day i tried.)
           | 
           | Phrase it as shown above and you'll likely need 5 or more
           | interactions to get it to generate the correct output. With
           | Gemini i could not get it below 8 without feeling like i was
           | cheating. fwiw.
        
             | lupire wrote:
             | Chat GPT 4o. I was being a bit generous with background
             | information, but still tests ability to interpret:
             | 
             | ------
             | 
             | Me:
             | 
             | Background facts:
             | 
             | Alice is a female human.
             | 
             | All sisters are female, and all brothers are male. No one
             | is their own brother or sister.
             | 
             | Alice has N brothers, and Alice has M sisters.
             | 
             | Now, a few questions based on these facts: How many sisters
             | do Alice's brothers have?
             | 
             | Do Alice's brothers have more sisters, fewer sisters, or
             | same number of sisters than Alice has?
             | 
             | Don't forget to consider Alice when counting.
             | 
             | ------
             | 
              | ChatGPT:
              | 
              | Let's break down the questions step by step:
              | 
              | How many sisters do Alice's brothers have?
              | 
              | Alice has M sisters. Since all sisters are female and
              | Alice is included in the total count of sisters for her
              | brothers, Alice's brothers have M+1 sisters. This
              | includes Alice herself and her M sisters.
              | 
              | Do Alice's brothers have more sisters, fewer sisters, or
              | the same number of sisters as Alice has?
              | 
              | Alice has M sisters. Alice's brothers have M+1 sisters
              | (since it includes Alice herself and her M sisters).
              | Therefore, Alice's brothers have more sisters than Alice
              | has.
              | 
              | In conclusion:
              | 
              | Alice's brothers have M+1 sisters.
              | 
              | Alice's brothers have more sisters than Alice has.
        
               | skydhash wrote:
                | You: (In Prolog-like pseudocode)
                | 
                |   female(alice)
                |   sister_of(X, Y) :- !eq(X, Y), female(X), family_of(X, Y)
                |   brother_of(X, Y) :- !eq(X, Y), male(X), family_of(X, Y)
                |   [M] :- sister_of(M, alice)
                |   [N] :- brother_of(N, alice)
                |   [A] :- any([N], B), sister_of(A, B)
                |   count([A])?
                |   gt([A], [M])?
                |   eq([A], [M])?
                |   lt([A], [M])?
               | 
               | ---
               | 
                | I don't know the exact encoding and decoding mechanism
                | that ChatGPT 4o has, but I'm pretty sure all the basic
                | facts and rules are already encoded by the models. And
                | you conveniently added the rules that encode the puzzle
                | itself.
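                | 
                | A minimal executable version of the same encoding,
                | sketched in Python with made-up placeholder names
                | (plain sets standing in for the Prolog facts):
                | 
                |   # Toy family: Alice, M sisters, N brothers.
                |   M, N = 212, 60
                |   sisters = {f"s{i}" for i in range(M)} | {"alice"}
                |   brothers = {f"b{i}" for i in range(N)}
                | 
                |   def sisters_of(person):
                |       # female siblings, excluding the person
                |       return sisters - {person}
                | 
                |   print(len(sisters_of("alice")))  # 212
                |   print(len(sisters_of("b0")))     # 213, incl. Alice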
        
               | YeGoblynQueenne wrote:
               | >> Don't forget to consider Alice when counting.
               | 
               | a.k.a. "don't forget to give the LLM the answer when
               | prompting".
        
         | GPerson wrote:
          | There must be a name for the new phenomenon, of which your
          | post is an example: 1. Someone points out that an LLM cannot
          | do some trivial task. 2. Another person declares that they
          | cannot do the task either, thereby defending the legitimacy
          | of the LLM.
         | 
         | As a side note, I cannot believe that the average person who
         | can navigate to a chatgpt prompter would fail to correctly
         | answer this question given sufficient motivation to do so.
        
           | FeepingCreature wrote:
           | Well, why does it need a name? It just seems like a "correct
           | argument". If somebody is claiming that AI is "not all that"
           | because it can make stupid mistakes, surely it's relevant
           | that humans also make stupid mistakes, so AI making them does
           | not necessarily pose a hindrance to it becoming a human-tier
           | reasoner.
        
             | GPerson wrote:
             | It's worth giving names to interesting phenomena as it
             | helps people to identify and understand them. I find this
             | phenomenon interesting because a person who wants the
             | defend the LLM from apparent criticism is probably more
             | likely to exhibit it.
        
           | cratermoon wrote:
           | There's also the phenomenon of papers showing ChatGPT getting
           | it wrong, then people posting anecdotal examples of it
           | getting it right when they try it, but that one already has a
           | couple of names: WORKSFORME or Unable to Reproduce
           | 
            | Hey folks, ever considered the possibility that
            | unreproducibility is not a _good_ thing?
        
           | yoyohello13 wrote:
           | Many people, especially on this site, really want LLMs to be
           | everything the hype train says and more. Some have literally
           | staked their future on it so they get defensive when people
           | bring up that maybe LLMs aren't a replacement for human
           | cognition.
           | 
           | The number of times I've heard "but did you try model X" or
           | "humans hallucinate too" or "but LLMs don't get sleep or get
           | sick" is hilarious.
        
             | indoordin0saur wrote:
             | Yes. Seems like some users here experience true despair
             | when you suggest that the LLM approach might have a hard
             | limit that means LLMs will be useful but never
             | revolutionary.
        
             | viking123 wrote:
              | The comment that humans are sometimes wrong too is
              | absolutely ridiculous, and I always think about the self-
              | driving car incident in China where the car mistook an
              | advertisement showing a car above the road for a real car
              | and hit the brakes, causing a chain crash. Even the
              | shittiest driver would never do that.
        
           | Karellen wrote:
           | You could call it the "Compose a symphony fallacy", after the
           | scene in _I, Robot_ (2004) where Spooner asks Sonny if he can
           | compose a symphony, to which Sonny replies  "Can you?"
        
           | gweinberg wrote:
            | It's not just that the LLM fails, it's the kind of failures
            | it makes that illustrate how far what the LLM does is from
            | human thinking. Like the problem someone posted where you
            | have a boat and a goat and you want to cross the river, and
            | the LLM's solution is to cross the river with the goat, come
            | back with an empty boat, and then cross again with an empty
            | boat. A human might miss the insight that Alice's brother
            | has one more sister than Alice, namely Alice herself. But a
            | human would never think the number of sisters her brother
            | has depends on the number of brothers she has. Not if they
            | have any clue at all what "sister" and "brother" mean.
        
         | eranation wrote:
          | Great observation. It would be really interesting to repeat
          | this research with less limiting prompts. I assume they made
          | the restriction to make it easier to parse the answers;
          | perhaps it could be two-phase: let the model respond without
          | limits, then follow up with a prompt asking it to compress
          | that into a single answer. I wonder how the results would
          | vary.
        
           | FeepingCreature wrote:
           | To be quite honest, I assume they made the restriction so
           | that the models would fail.
           | 
           | This sort of paper is becoming a genre.
        
             | detourdog wrote:
             | It is a proof of weakness in the current system. This makes
             | sense and births new hypotheses.
        
             | nyrikki wrote:
             | You test models where they fail in any field.
             | 
              | The orbit of Mercury leading to the discovery of GR is an
              | example.
             | 
             | As all models are wrong, but some are useful, finding where
             | they fail is how you figure out if they are useful.
             | 
             | As the 'AGI is near' camp has won the hype game, it is
             | important to ground expectations for practical exploitation
             | of the technology.
             | 
              | Overpromising and unabashed optimism are partly what
              | caused the previous AI winters.
             | 
              | As the formal proof methods of mathematics proved
              | impractical, counterexamples and the scientific method are
              | what CS has used for decades.
        
             | qsi wrote:
             | They used three different kinds of prompts with varying
             | levels of restrictions, as described in the paper.
             | 
             | To be quite honest, I assume you made your comment so that
             | you could dismiss the paper without reading it.
        
               | FeepingCreature wrote:
               | That's a fair cop, I didn't read it.
               | 
               | The thing is that "LLM reasoning breaks down" simply did
               | not surprise me enough that I thought it was worth
               | clicking. Making LLMs fail is not hard. They're
               | interesting for the ways that they work, not the (many,
               | many) ways that they don't.
               | 
               | edit: I've had a look and I don't think any of their
               | prompts are very good. They're certainly not how I'd
               | write them if I wanted a current model to actually solve
               | the problem.
               | 
               | The way to make me take a paper like this seriously would
               | be if you set it up as an adversarial collaboration with
                | a competent prompter, and _that person agreed_ they
                | couldn't make a generic prompt that solved the problem.
               | "We tried three times and none worked" is not news, or at
               | any rate not news about LLMs.
        
           | pawelmurias wrote:
           | When I added a " to the end of the prompt by accident I got a
           | wrong answer.
        
           | happypumpkin wrote:
           | From the paper:
           | 
           | > To account for the response variations due to various
           | prompt forms, we created 3 distinct prompt types asking for
           | the solution to the AIW problem: STANDARD, THINKING, and
           | RESTRICTED. The STANDARD prompt type asks to solve the posed
           | problem and output the final answer in the format as
           | described above. This does not put any specific requirements
           | on model behavior. The THINKING prompt type extends STANDARD
           | with the request to think carefully and double check the
           | solution for any mistakes
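            | 
            | Paraphrasing those three prompt types as templates (a
            | rough sketch based on the paper's description, not the
            | exact prompt text from the AIW repo):
            | 
            |   AIW = ("Alice has {n} brothers and she also has "
            |          "{m} sisters. How many sisters does "
            |          "Alice's brother have? ")
            | 
            |   PROMPTS = {
            |       "STANDARD": AIW + "Provide the final answer "
            |           "in the format '### Answer: '.",
            |       "THINKING": AIW + "Think carefully and double "
            |           "check the solution for mistakes, then "
            |           "provide '### Answer: '.",
            |       "RESTRICTED": AIW + "DO NOT OUTPUT ANY TEXT "
            |           "EXCEPT the format that contains the "
            |           "final answer: '### Answer: '.",
            |   }
            | 
            |   print(PROMPTS["THINKING"].format(n=60, m=212))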
        
         | oidar wrote:
         | Of course it's going to give an incorrect answer with that
         | prompt. If the instruction fine tuning is neutered like this
         | prompt, it's going to roll over to the foundation model and
         | offer a completion - probably more influenced by the seed than
         | the prompting text. Bad study.
         | 
         | Edit - I just skimmed the paper - they do use other more
         | appropriate prompt types for reasoning. My initial response was
         | based on the assumption that all prompts used that script
         | prompt quoted in the parent. I retract my "bad paper" comment.
        
           | YeGoblynQueenne wrote:
           | >> My initial response was based on the assumption that all
           | prompts used that script prompt quoted in the parent.
           | 
           | You, and another 20 or so commenters here. We should really
           | re-examine the guideline about asking people to RTFA.
           | 
           | No offense meant- good on you for correcting your error.
        
         | sdiepend wrote:
         | The right answer depends on how Alice identifies I guess? :)
        
           | Karellen wrote:
           | > > Question is: "Alice has 60 brothers and _she_ also has
           | 212 sisters. How many sisters does Alice's brother have? "
           | 
           | > The right answer depends on how Alice identifies I guess?
           | :)
           | 
           | Given that the wording of the question specifically
           | identifies Alice as "she", rather than using a gender-neutral
           | pronoun or no pronoun at all, I think inferring that she
           | identifies as female is reasonable.
        
           | rep_lodsb wrote:
           | Page 42 of the paper :)
           | 
           | One thing that strikes me is that the model first tries using
           | "inclusive language" in one answer - and literally states so,
           | using this specific term - but seems to interpret it in a
           | more mathematical sense (like set inclusion). Then seamlessly
           | switches to the expected DEI spiel in the next paragraph.
           | 
           | For one thing, it makes me suspect that something with the
           | words "inclusive language" was automatically added to the
           | prompt. But more interesting is how it responds to this
           | demand in two different ways, illustrating a "thought
           | process" that is very much unlike that of a human with normal
           | verbal reasoning ability.
           | 
           | I am not a psychologist, but remember reading that
           | schizophrenic people sometimes confuse different meanings of
           | words in a similar way, jumping from one meaning to another
           | without noticing.
        
         | llm_trw wrote:
         | >Worth noting that the prompts from the experiment include "To
         | answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following
         | format that contains final answer: ### Answer:" so it appears
         | that they are stopping the models from 'thinking out loud'. If
         | I add that to the prompt, GPT4o gets it consistently wrong...
         | 
         | Yes this is a common thing I see people who think LLMs are
         | idiots do.
         | 
         | The more an LLM talks the smarter it gets _because that's the
         | only way it can compute anything_. Imagine saying that Turing
         | machines fail the Church-Turing thesis because they can't solve
         | 3-sat for N variables in N moves or less.
         | 
         | That's what you're doing to an LLM when you ask it to be
         | concise.
        
           | rossdavidh wrote:
           | Even if this were all true, it points to a fundamental risk
           | of using LLM's for important tasks, which is that it is not
           | at all clear to a user that this prompt would cause a
           | problem. The LLM doesn't say "I'm sorry Dave, I just can't do
           | that", it just complies with it and gets the wrong answer.
           | 
           | You can always make excuses for the LLM afterwards, but
           | software with hidden risks like this would not be considered
           | good or reliable in any other context.
        
             | saurik wrote:
             | People really need to stop trying to model an LLM as some
             | kind of magical software component: it all makes a lot more
             | sense if you model it as an under-performing poorly-aligned
             | employee; so like, maybe a distracted kid working for
              | peanuts at your store. You wouldn't trust them with all
              | of your money and you wouldn't trust them to do a lot of
             | math--if they had to be in charge of checkout, you'd make
             | sure they are only given a point-of-sale terminal and their
             | main job was to, at best, scan the barcodes and compare the
             | total--and yet there are tasks you can imagine handing to
             | them that you'd never give to a robot or computer _even
              | though they get it wrong a lot_, as not all tasks need to
             | be handled perfectly, they still understand extremely fuzzy
             | tasks, and they are probably cheaper than a qualified adult
             | (certainly cheaper than one who is being paid enough to
             | "give a shit" and pay enough attention to not let you get
             | robbed or even put themselves at some risk for you).
        
               | IlliOnato wrote:
               | You have a point...
               | 
                | I once gave a 10-dollar bill to a young man working
                | the register at a store, and he gave me 14 dollars
                | back as change. I pointed out that this made no sense.
                | He bent down, looked closer at the screen of his
                | machine, and said "Nope, 14 dollars, no mistake". I
                | asked him if he thought I gave him 20. He said no, and
                | even showed me the 10-dollar bill I had just given
                | him. At that point I just gave up and took the money.
               | 
               | Now that I think about it, there was an eerie similarity
               | between this conversation and some of the dialogues I had
               | with LLMs...
        
             | chefandy wrote:
             | While LLMs have incredible potential, and are even
             | downright useful in their current format, they have the
             | rather nasty tendency to confidently present bullshit that
              | passes the smell test for most people. When it's wrong,
              | it's not just _wrong_, by design, it's _wrong but sounds
              | plausible._ Considering most people with subject matter
              | expertise aren't going to consult such a bot for their own
             | areas of expertise, that means most people will not be able
             | to detect it intuitively.
             | 
             | Good example: I sunk probably an hour into trying to get
             | Gemini Advanced to help me integrate it with a personal
             | Google Calendar account. I kept asking it things and going
             | crazy because nothing lined up with the way things worked.
             | Finally, it referred to itself as Bard and I realized it
             | was giving me information for a different product. As soon
             | as I asked "are you giving me instructions for Gemini
             | Advanced or Bard?" it was like "OH LOL WOOPS!! YOU GOT ME
             | BRO! XD I CAN'T DO ANY OF THAT! LOL." Which, honestly, is
             | great. Being able to evaluate its answers to realize it's
             | wrong is really neat. Unfortunately, it was neat too late
             | and too manually to stop me from wasting a ton of time.
             | 
              | I have decades of experience working in software --
              | imagine some rando who didn't know what the hell Bard
              | was, or who would never even imagine that this thing
              | with "Advanced" in the name couldn't distinguish between
              | its own and other products' documentation.
        
               | floren wrote:
               | > As soon as I asked "are you giving me instructions for
               | Gemini Advanced or Bard?" it was like "OH LOL WOOPS!! YOU
               | GOT ME BRO! XD I CAN'T DO ANY OF THAT! LOL." Which,
               | honestly, is great. Being able to evaluate its answers to
               | realize it's wrong is really neat.
               | 
               | Did it evaluate its answers, or did your expression of
               | doubt cause the eager-to-please language model to switch
               | from "generate (wrong) instructions because that's what
               | the user asked for" to "acknowledge an error because
               | that's what the user asked for"?
               | 
               | How many times have we seen "Oops, you're right! 2 + 2 is
               | actually 5! I apologize for saying it was 4 earlier!"
        
           | dematz wrote:
           | I often want chatgpt to answer concisely and tell it that.
           | 
           | If it really needs to do this 'thinking out loud', could it
           | do that under the hood and not in the final output on my
           | screen? Its first pass could use as many words as it wants to
           | compute the answer, but once the answer is computed please go
           | back and make it short.
           | 
           | Not to take away from your point that maybe the prompt is the
           | problem in these reasoning questions.
        
             | marzell wrote:
             | I believe the "thinking out loud" is fundamentally part of
             | the process of "text completion" which is what it is doing.
             | Certainly we can (and do) break things apart and add layers
             | that could be used to effectively do this by adding more
             | steps and processing time. But ultimately in a single turn,
             | the entire conversation up to that point (including
             | instructions you may have added telling it to not think out
             | loud) is the input, and the output will reflect that.
        
           | pawelmurias wrote:
           | LLMs are idiots. They can't reason properly and only parrot
           | stuff
           | 
           | https://chatgpt.com/share/dcb4ff4e-e8a2-463b-86ec-9caf10b6e6.
           | ..
           | 
           | Sometimes they get the answer right to something really
           | complex because it fits a pattern, but sometimes they answer
           | with something really really stupid.
        
             | tsunamifury wrote:
             | Why are so many people so insistent on saying this?
             | 
             | I'm guessing you are in denial that we can make a simulated
             | reasoning machine?
        
               | hyperbovine wrote:
               | Because they understand how LLMs work. It's not
               | reasoning. It's not simulating reasoning.
        
               | PheonixPharts wrote:
                | People keep saying it because that's literally how LLMs
                | work. They run Monte Carlo sampling over a very
                | impressive latent linguistic space. These models are not
                | fundamentally different from the Markov chains of yore,
                | except that these latent representations are incredibly
                | powerful.
                | 
                | We haven't even started to approach the largest problem,
                | which is moving beyond what is essentially a greedy
                | token-level search of this linguistic space. That is, we
                | can't really pick an output that maximizes the
                | likelihood of the entire sequence; rather, we're simply
                | maximizing the likelihood of each part of the sequence.
                | 
                | LLMs are not reasoning machines. They are basically
                | semantic compression machines with a built-in search
                | feature.
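                | 
                | A toy illustration of that token-level vs. whole-
                | sequence point, using a hand-made two-step
                | distribution rather than a real model:
                | 
                |   from itertools import product
                | 
                |   # Hypothetical two-step token distribution.
                |   p1 = {"A": 0.6, "B": 0.4}
                |   p2 = {"A": {"x": 0.5, "y": 0.5},
                |         "B": {"x": 0.9, "y": 0.1}}
                | 
                |   # Greedy decoding: best token at each step
                |   # picks ('A', 'x'), joint probability 0.30.
                |   t1 = max(p1, key=p1.get)
                |   t2 = max(p2[t1], key=p2[t1].get)
                |   print((t1, t2))
                | 
                |   # Whole-sequence argmax picks ('B', 'x'),
                |   # joint probability 0.36, which greedy misses.
                |   best = max(product(p1, ["x", "y"]),
                |              key=lambda s: p1[s[0]] * p2[s[0]][s[1]])
                |   print(best)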
        
               | pretendscholar wrote:
               | The best compression is some form of understanding
        
               | skydhash wrote:
                | The best compression relies on understanding. What an
                | LLM is, mostly, is data on how humans use words. We
                | understand how to make this data (which is a
                | compression of human text) and how to use it (to
                | generate something). AKA it's "production rules", but
                | statistical.
                | 
                | The only issue is ambiguity. What can be generated
                | strongly depends on the order of the tokens. A slight
                | variation can change the meaning and make the result
                | worthless. Understanding is the guardrail against
                | meaningless statements, and LLMs lack it.
        
               | tsunamifury wrote:
               | You seem to entirely miss how attention layers work...
        
               | YeGoblynQueenne wrote:
                | That's a fascinating insight and it sounds so true!
               | 
               | Can you compress for me Van Gogh's Starry Night, please?
               | I'd like to send a copy to my dear old mother who has
               | never seen it. Please make sure when she decompresses the
               | picture she misses none of the exquisite detail in that
               | famous painting.
        
               | pretendscholar wrote:
                | Okay, yes, not really having an artist's vocabulary, I
                | couldn't compress it as well as someone who has a better
                | understanding of Starry Night.
               | what makes Starry Night great could create a work that
               | evokes similar feelings and emotions. I know this because
               | Van Gogh created many similar works playing with the same
               | techniques, colors, and subjects such as Cypresses in
               | Starry Night and Starry Night over the Rhone. He was
               | clearly working from a concise set of ideas and
               | techniques which I would argue is
               | understanding/compression.
        
               | YeGoblynQueenne wrote:
               | Fine, but we were talking about compression, not about
               | imitation, or inspiration, and not about creating "a work
               | that evokes similar feelings and emotions". If I compress
               | an image, what I get when I decompress it is that image,
               | not "feelings and emotions", yes? In fact, that's kind of
               | the whole point: I can send an image over the web and the
               | receiver can form _their own_ feelings and emotions,
               | without having to rely on mine.
        
               | tsunamifury wrote:
               | Simple reasoning is a side effect of compression. That is
               | all.
               | 
               | I see from your profile you are focused on your own
               | personal and narrow definition of reasoning. But I'd
               | argue there is a much broader and simpler definition. Can
               | you summarize and apply learnings. This can.
        
               | foobiekr wrote:
               | Exactly right and well said.
        
               | tsunamifury wrote:
               | This type of self affirmation has a quality of denial.
               | 
               | Also the above description is reductive to the point of
               | "Cars can't get you anywhere because they aren't horses."
        
               | naasking wrote:
                | > LLMs are not reasoning machines. They are basically
                | semantic compression machines with a built-in search
                | feature.
               | 
               | This is just a god of the gaps argument. Understanding is
               | a form of semantic compression. So you're saying we have
               | a system that can learn and construct a database of
               | semantic information, then search it and compose novel,
               | structured and coherent semantic content to respond to an
               | a priori unknown prompt. Sounds like a form of reasoning
               | to me. Maybe it's a limited deeply flawed type of
               | reasoning, not that human reason is perfect, but that
               | doesn't support your contention that it's not reasoning
               | at all.
        
               | tsunamifury wrote:
               | It's basically an argument that boils down to "it's not
               | because I don't like it"
        
               | jononomo wrote:
               | I bite the bullet on the god of the gaps
        
               | camdenreslink wrote:
               | It is hard to trust any output from a machine that is
               | confidently wrong so frequently. You need to already be
                | knowledgeable in a topic (or at least have a well-
                | attuned BS detector) to know if it is giving you correct
               | responses. It can be a time saver and assistant in
               | getting work done where you are already a subject matter
               | expert, but it needs to get better to remove the human
               | from the loop.
        
               | imtringued wrote:
               | No it is because supervised and self supervised learning
               | happen to produce reasoning as a byproduct. For some
               | reason people think that telling a model to recite a
               | trillion tokens somehow will improve it beyond the
               | recitation of those tokens. I mean, in theory you can
               | select the training data so that it will learn what you
               | want, but then again you are limited to what you taught
               | it directly.
               | 
               | The problem is that these models weren't trained to
               | reason. For the task of reasoning, they are overfitting
               | to the dataset. If you want a machine to reason, then
               | build and train it to reason, don't train it to do
               | something else and then expect it to do the thing you
               | didn't train it for.
        
               | TeMPOraL wrote:
               | > _The problem is that these models weren 't trained to
               | reason._
               | 
               | Except they kind of were. Specifically, they were trained
               | to predict next tokens based on text input, with the
               | optimization function being, _does the result make sense
                | to a human?_. That's embedded in the training data: it's
               | not random strings, it's output of human reasoning, both
               | basic and sophisticated. That's also what RLHF selects
               | for later on. The models are indeed forced to simulate
               | reasoning.
               | 
                | > _don't train it to do something else and then expect
                | it to do the thing you didn't train it for._
                | 
                | That's the difference between AGI and specialized AI -
                | AGI is _supposed to_ do the things you didn't train it
               | to do.
        
               | riku_iki wrote:
               | > I'm guessing you are in denial that we can make a
               | simulated reasoning machine?
               | 
               | some people actually try, and see that LLMs are not there
               | yet
        
               | TeMPOraL wrote:
               | There's some irony in seeing people parrot the argument
               | that LLMs are parrots.
        
               | tsunamifury wrote:
               | Also making errors in reasoning while saying LLM errors
               | prove it can't reason.
        
               | elicksaur wrote:
               | Maybe people have different experiences with the products
               | than you.
               | 
               | A simulated reasoning machine being _possible_ does not
               | mean that current LLMs _are_ simulated thinking machines.
               | 
               | Maybe you should try asking chatgpt for advice on how to
               | understand other people's perspectives: https://chatgpt.c
               | om/share/3d63c646-859b-4903-897e-9a0cb7e47b...
        
               | tsunamifury wrote:
               | This is such a weirdly preachy and belligerent take.
               | 
               | Obviously that was implied in my statement. Dude we
               | aren't all 4 year olds that need a self righteous lesson
        
           | Miraste wrote:
           | As it says in the paper, they already did the research with
           | less limiting prompts. All models were tested with another
           | prompt that gave only the question and no further
           | instruction, and a third prompt that asked the model to
           | consider its answer carefully.
           | 
           | The correct response rate chart doesn't even use the results
           | from the concise prompt.
        
           | sosuke wrote:
           | New option needed besides concise. Think quietly on your own,
           | then answer concisely.
           | 
           | Sometimes I think I'd prefer it to "think" before answering
           | anyhow. The immediate thinking out loud text can be
           | irritating for some irrational reason.
        
             | monsieurbanana wrote:
              | Just telling the LLM to preface its conclusion with
              | "Answer:" in a parseable way would be better.
             | 
             | You're still paying for those thinking tokens, or at the
             | very least have to wait for them to be generated.
        
           | muglug wrote:
           | I think you're wrong about that -- I just tried prompting
           | ChatGPT 4o to show all its working before giving an answer.
           | 
           | It was still incorrect, but when asked to show its working it
           | formatted the answer prettily.
        
           | PheonixPharts wrote:
           | >_because that's the only way it can compute anything_
           | 
           | I'm fairly certain we'll soon realize that what's happening
           | here is that the markov chain being run over latent space
           | needs a certain amount of "warmup" before it starts sampling
           | from the optimal region. HMC samplers for Bayesian methods
           | have this same property.
           | 
           | The terms "reasoning", "computing" or "thinking" for this
           | stage should be considered metaphors rather than explanations
           | for what's happening, which is really waiting for a random
           | walk to start sampling from the typical-set.
        
           | J_Shelby_J wrote:
           | > The more an LLM talks the smarter it gets
           | 
            | I have a blog post coming on this topic, but yes, this is
            | right.
            | 
            | My method is to first get the LLM to answer the question,
            | and THEN feed the answer back to the LLM to extract the
            | answer using constraints + grammar/logit bias/regex to
            | parse the answer. Previously, I constrained it to a single
            | true/false token, which worked, but fails on complex
            | queries.
            | 
            | So I split the decision making into a "justification"
            | portion[0], and a "parsing" portion. I found that even
            | crafting the prompt matters here: if you start or end
            | with "It's very important that the response includes 'The
            | answer is:'", then the model will lead with that response
            | or only reply with that response. So I put it in the
            | middle of the prompt, and end with a request to justify
            | the response. As a result, most models will reason their
            | way to the answer, and then end with 'The answer is:'.
           | 
           | https://github.com/ShelbyJenkins/llm_client/blob/e3c4a860dda.
           | ..
        
           | chefandy wrote:
           | > Yes this is a common thing I see people who think LLMs are
           | idiots do.
           | 
           | If you're among technologists discussing LLMs academically,
           | as we are, that's a reasonable approach. However, I see a lot
           | of people fail to distinguish that from LLM-powerd _products_
           | sold to the general public as intelligent bots that can
           | _understand_ your plain english and output _answers_.
           | 
           | People use their existing mental models when interacting with
           | something. If you have 3 different interfaces with a widget
           | to trigger the same exact function, but one look like a music
           | play button, one looks like a gas pedal, and one looks like
           | mechanical pinball plunger, we interact with those things
           | differently because we know how those things work. In this
           | context, chatbots are designed to engage people's existing
            | mental model for chatting with a person via text. The further
            | you stray from people's expectations of human chat, the more
            | those expectations break down, for better or worse.
           | 
           | If you're selling someone a product claiming it _understands_
           | plain language questions and gives plain language answers,
           | then not getting the right answer to that question makes it
            | idiotic. The subtleties aren't within most users' grasp, and
           | the "FYI: this thing might be full of shit" disclaimer isn't
           | helpful if you don't know enough about what you're asking to
           | administer a proper smell test.
           | 
           | Your statements are obviously not wrong, but I see people
            | saying these things like it's reasonable for non-technical end
           | users to reason about those subtleties. Considering how those
           | things are marketed, I really don't think it is.
        
         | segfaultbuserr wrote:
         | > _I must confess, when I tried to answer the question I got it
         | wrong...! (I feel silly)._
         | 
         | In programming there are two difficult problems - naming
         | things, cache invalidation, and off-by-one error.
        
         | nerdjon wrote:
         | Even "thinking out loud" we have seen these systems fall flat
         | on their face and give very wrong answers.
         | 
         | Thinking out loud also only gets you so far, if the expectation
         | is a certain type of response it can't always "think out loud".
         | In reality that just proves it isn't really reasoning here and
         | is more likely just self referencing.
         | 
         | That being said, I tried this personally allowing it to think
         | out loud and it told me she has 212 sisters. Using your exact
         | prompt.
        
           | Closi wrote:
           | > Thinking out loud also only gets you so far, if the
           | expectation is a certain type of response it can't always
           | "think out loud". In reality that just proves it isn't really
           | reasoning here and is more likely just self referencing.
           | 
           | Try to calculate it without writing anything down, or
           | thinking any numbers or words in your head.
           | 
           | You can't draw a 1:1 analogue between an AI and the human
           | experience, but remember that we have an internal stream of
           | consciousness. Maybe the outputs of the LLM are more similar
           | to the stream of consciousness in our heads rather than the
           | words we say? After all, Humans also do lots of self
           | referencing.
           | 
           | > That being said, I tried this personally allowing it to
           | think out loud and it told me she has 212 sisters. Using your
           | exact prompt.
           | 
           | Fair enough, but worst case it can often solve it correctly
           | with the correct reasoning. GPT3.5 can't solve it correctly
           | with correct reasoning, so we are at least appearing to be on
            | a path where AIs can start to solve this question, albeit
           | potentially not fully reliably.
        
         | GrantMoyer wrote:
         | Note that in the paper, all the numbers used were under 10.
         | 
         | > AIW Variation 1, N=3,M=6,C=7
         | 
         | > AIW Variation 2, N=4,M=2,C=3
         | 
         | > AIW Variation 3, N=1,M=4,C=5
         | 
         | > AIW Variation 4, N=4,M=1,C=2.
         | 
          | Also note that the restricted prompt is only one of the prompt
          | variations tested by the paper. It also explores common
          | techniques to get LLMs to perform better, including "thinking
          | out loud". Even with these methods the models still fail to
          | produce a correct answer.
         | 
         | > Model prompt types. It is well known that so-called prompt
         | engineering can heavily influence the model behavior and model
         | response quality [26, 27, 28]. To account for the response
         | variations due to various prompt forms, we created 3 distinct
         | prompt types asking for the solution to the AIW problem:
         | STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type
         | asks to solve the posed problem and output the final answer in
         | the format as described above. This does not put any specific
         | requirements on model behavior. The THINKING prompt type
         | extends STANDARD with the request to think carefully and double
         | check the solution for any mistakes. This should encourage
         | model to invest more computation into obtaining the solution.
         | In contrast to this, the RESTRICTED prompt urges the model to
         | output only the final answer without any further text. This is
         | supposed to restrict compute invested in producing output. We
         | observe substantially shorter outputs across tested models
         | compared to STANDARD and THINKING for this prompt type (Suppl.
         | Fig. 13).
        
         | freilanzer wrote:
          | I just tried it with intermediate steps, via the total number
          | of children Alice's mother has, and it worked. So, it seems
          | that an intermediate logical step is too much for these models
          | at the moment, at least in this puzzle, where Alice herself is
          | not included in the numbers given in the description.
        
         | YeGoblynQueenne wrote:
         | >> I must confess, when I tried to answer the question I got it
         | wrong...! (I feel silly). I only realised I got it wrong when I
         | plugged it into GPT-4o and it came back with the correct
         | answer:
         | 
          | Remember that the authors of the paper did not find that GPT-4o
          | cannot return the right answer. They found that it can't return
          | the right answer _more often than ~60% of the time_. So you'd
          | have to repeat the experiment many, many times and aggregate
          | the results (the paper uses a binomial Beta this and that etc
          | etc) before you see results similar to the paper's.
         | 
         | You won't replicate the results of the paper unless you really
         | put your back into it.
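          | 
          | A back-of-the-envelope check on how many runs that takes
          | (a plain binomial standard error, not the paper's exact
          | analysis):
          | 
          |   import math
          | 
          |   p = 0.6  # reported correct-response rate for GPT-4o
          |   for n in (10, 30, 100, 300):
          |       se = math.sqrt(p * (1 - p) / n)
          |       lo, hi = p - 1.96 * se, p + 1.96 * se
          |       print(f"n={n:4d}  95% CI ~ {lo:.2f}-{hi:.2f}")
          | 
          |   # With only a dozen runs the interval spans roughly
          |   # 0.3-0.9, so a handful of chats proves very little.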
        
       | jay-barronville wrote:
       | The Humpty Dumpty reference in the paper has me cracking up (pun
       | intended).
        
       | naikus wrote:
       | I'm using chatGPT 4o and it gave me the following :)
       | 
       | ME Alice has 40 brothers and she also has 50 sisters. How many
       | sisters does Alice's brother have?
       | 
       | ChatGPT To determine how many sisters Alice's brother has, we
       | need to use the information provided:
       | 
       | Alice has 40 brothers and 50 sisters. Since Alice's brother is
       | one of the 40 brothers, he shares the same family structure as
       | Alice. This means he has the same number of sisters as Alice.
       | 
       | Therefore, Alice's brother has 50 sisters.
        
         | gwbas1c wrote:
         | Alice has 50 sisters.
         | 
         | Alice's brother has Alice as a sister (1) + Alice's sisters
         | (50).
         | 
         | Therefore Alice's brother has 51 sisters.
        
           | happypumpkin wrote:
           | First result:
           | 
           | Me > Alice has 40 brothers and she also has 50 sisters. How
           | many sisters does Alice's brother have?
           | 
           | GPT4o (web version) > Alice has 50 sisters. Since all of
           | Alice's brothers and sisters share the same set of siblings,
           | each of Alice's 40 brothers also has 50 sisters. Therefore,
           | Alice's brother has 50 sisters.
           | 
           | ---
           | 
           | Step by step:
           | 
           | Me > Alice has 40 brothers and she also has 50 sisters. How
           | many sisters does Alice's brother have? Think about it step
           | by step.
           | 
            | GPT4o > To determine how many sisters Alice's brother has,
            | follow these steps:
            | 
            | Identify the total number of siblings: Alice has 40
            | brothers and 50 sisters, plus Alice herself. Thus, the
            | total number of siblings is:
            | 40 (brothers) + 50 (sisters) + 1 (Alice) = 91 siblings
            | 
            | Focus on Alice's brother: Alice's brother is one of the 40
            | brothers.
            | 
            | Determine the number of sisters for Alice's brother: Each
            | brother has the same number of sisters. Since the brothers
            | do not count themselves as sisters, they only count the 50
            | sisters, excluding themselves and other brothers.
            | 
            | Therefore, each of Alice's brothers has: 50 sisters
            | 
            | So, Alice's brother has 50 sisters.
           | 
           | ---
           | 
            | Thinking step by step somehow gave an even more nonsensical
            | answer. I'm actually very surprised it didn't work when
            | prompted to do it step by step.
        
         | gweinberg wrote:
         | I don't know why you are getting downvoted. I was disappointed
         | the paper didn't give examples of the LLM confidently
         | expressing its incorrect reasoning.
         | 
         | From a human perspective, I think there are three ways to get
         | the answer wrong: failure to realize that Alice's brother has
         | pretty much the same number of sisters as Alice herself,
         | failure to realize that the brother has one additional sister,
         | namely Alice, and failure to successfully add one to the number
         | of sisters. But that assumes that the LLM is more or less
         | reasoning like a human. It may well be "reasoning" more along
         | the lines of "I've seen lots of story problems like this, the
         | modal answer was five, I'll say five"
        
       | gwbas1c wrote:
       | > This breakdown can be considered to be dramatic ... because
       | models tend to express strong overconfidence in reporting their
       | wrong solutions as correct, while often providing confabulations
       | to additionally explain the provided final answer, mimicking
       | reasoning-like tone but containing nonsensical arguments as
       | backup for the equally nonsensical, wrong final answers.
       | 
       | People do that too!
       | 
        | Magical thinking is one example. More tangible examples are found
        | in politics, especially in people who believe in magical thinking
        | or politicians' lies.
        
         | emmender2 wrote:
         | I was waiting for: "but humans do that too" and bingo.
         | 
          | on another note: an entire paper written on one prompt - is
          | this the state of research these days?
          | 
          | finally: a giant group of data-entry technicians are likely
          | entering these exceptions into the training dataset at OpenAI.
        
           | gwbas1c wrote:
           | > an entire paper written on one prompt - is this the state
           | of research these days
           | 
            | Years ago I attended an AI meetup where the organizer would
            | discuss the differences between reductionist and holistic
            | research.
           | 
           | This is reductionist research.
           | 
           | In this case, reducing the paper to one prompt is to allow
           | for objective comparison of models.
           | 
           | Otherwise, without a reductionist approach, how does one
           | perform such an experiment and ensure that they are comparing
           | apples to apples?
        
             | emmender2 wrote:
             | what I would like to see is a parameterized class of
             | prompts which can never be solved by the LLMs even when a
             | finite number of them are manually added to the dataset.
        
               | gwbas1c wrote:
                | Well, should we consider ChatGPT a Turing machine?
                | Otherwise, I think an answer like that requires
                | significantly more research, insight, or general
                | knowledge about how LLMs work.
               | 
               | IE, you're getting into areas that are analogous to
               | Turing's theories. I don't think he came up with those
               | theories overnight.
        
       | nerdjon wrote:
        | For anyone considering reading the paper who, like me, doesn't
        | normally read papers like this, and who opens the PDF and thinks
        | they don't have time to read it due to its length: the main part
        | of the paper is the first 10 pages and a fairly quick read.
       | 
       | On to the topic here. This is an interesting example that they
       | are using. It is fairly simplistic to understand as a human (even
       | if we may be inclined to quickly jump to the wrong conclusion
       | without thinking for a few seconds).
       | 
        | The thing that really bothers me is that I just don't know
        | whether we can realistically fix this given the current state of
        | what these tools actually are. They are not reasoning or thinking
        | in any sense of the word, and yet a lot of people are already
        | considering them general-purpose AI.
       | situations it can fake it enough that it appears to be reasoning,
       | but it's not.
       | 
        | What is the chance that this paper actually has any impact on the
        | AI rollout and overhype, rather than just being buried and never
        | talked about again until the next time we see how dangerous these
        | tools are, as with Google's search rollout?
        
         | mjburgess wrote:
         | In many ways, this is very obvious and routine to people who
         | use these systems with a critical understanding of how they
          | work. It's dispiriting how we require an arxiv.org PDF in order
         | to point out the absence of the emperor's clothing. There are
         | few people who use LLMs who could not, right now, demonstrate
         | this point to themselves if they so wish.
         | 
         | Such is the strength of corporate tech propaganda that a whole
         | mass of people will instead insist that we have never worn
         | clothes either.
        
           | nerdjon wrote:
           | > In many ways, this is very obvious and routine to people
           | who use these systems with a critical understanding of how
           | they work.
           | 
           | The last part of that is the problem and why a paper like
           | this is critical.
           | 
           | These systems are being pushed onto people who don't
            | understand how they work. CEOs and other business leaders
           | are being pushed to use AI. Average users are being shown it
           | in Google search results. Etc etc.
           | 
           | People are being told it can do far more than it really is.
        
             | mjburgess wrote:
             | Sure, but even these people... the failures are so common,
             | and often very obvious. Consider a CEO who puts a press
              | briefing in and asks some questions about it; it's not
             | uncommon for those answers to be obviously wrong on any
             | sort of critical reflection.
             | 
              | We aren't dealing with a technology that is 99.9% right in
              | our most common use cases, such that we need to engineer
              | some incredibly complex problem to expose the flaw. Rather,
              | in most cases there is some obvious flaw. It's a system that
              | typically requires significant "prompt engineering" to
              | provide the reasoning the system otherwise lacks.
             | 
             | I guess that offers an explanation: people aren't aware
             | that via their own prompt engineering they are repairing
             | the deficiencies of the process by manipulating its inputs
             | to include the structured reasoning it lacks. So there's a
             | sort of hot-reading effect at work.
        
               | throw46365 wrote:
                | > We aren't dealing with a technology that is 99.9% right
                | in our most common use cases, such that we need to
                | engineer some incredibly complex problem to expose the
                | flaw. Rather, in most cases there is some obvious flaw.
                | It's a system that typically requires significant "prompt
                | engineering" to provide the reasoning the system
                | otherwise lacks.
               | 
               | Right -- we are a long way from "this is a very nuanced
               | error" being the dominant failure.
        
             | kalkin wrote:
             | > People are being told it can do far more than it really
             | is.
             | 
             | Meanwhile these HN comments are split between:
             | 
             | * Lots of people confirming what the paper itself notes
             | (but doesn't highlight), that the most advanced models
             | actually can solve this problem at least a significant
             | portion of the time. (A proportion which one can pretty
             | easily project is only likely to increase with future
             | models.)
             | 
             | * Lots of people saying "this confirms LLMs can't do
             | reasoning".
             | 
             | Questions I'd ask you to consider:
             | 
             | * Is "LLMs can't do reasoning" actually more accurate than
             | the typical hype?
             | 
             | * Is a "critical understanding of how [LLMs] work" that
             | would predict they simply cannot solve this problem
             | actually a good understanding?
        
               | mjburgess wrote:
                | The critical understanding doesn't predict that LLMs
                | cannot solve problems. It predicts how they will solve
                | them. There is no a priori information about what the LLM
                | has been trained on. You have to prompt, then see the
                | answer.
               | Once the answer arrives, the critical understanding
               | provides a route to repairing the answer when not
               | accurate or useful.
               | 
                | LLMs do not reason. They appear to reason by repeating
                | the structure of reasoning in their training data. In
                | many cases this is indistinguishable from the real thing.
        
           | throw46365 wrote:
           | > Such is the strength of corporate tech propaganda that a
           | whole mass of people will instead insist that we have never
           | worn clothes either.
           | 
           | This is the line of reasoning I find most dispiriting. I
           | still believe tech people cling to this line of reasoning
           | because it helps them justify replacing people in jobs with
           | LLMs.
        
         | colechristensen wrote:
         | I don't think anybody who actually uses LLMs thinks they are
         | general purpose AI.
         | 
          | You ask it to do one thing and it's amazing, but then you try
          | to modify it, or do something with extra steps, or just
          | anything with any complexity to it, and it falls over.
        
           | nerdjon wrote:
           | > I don't think anybody who actually uses LLMs thinks they
           | are general purpose AI.
           | 
            | I would like to believe that, but I have had too many
            | conversations with people who basically think it already is,
            | including, in one case, a fellow engineer.
           | 
           | It feels like more and more "we" are in a bubble of actually
           | having some knowledge of how this works, what the actual
           | limitations are, and what it just is not. While there is in
            | fact criticism of it out there, particularly around AI "art",
            | it doesn't seem to be focused on the area we are talking
            | about.
        
             | goostavos wrote:
             | I dunno. If memes are anything to go by, people are
             | frustrated by Yet Another Chat Bot standing between you and
              | the human you need to solve your problem. The fact that
              | it's a better, more human-like chat bot doesn't mask the
             | frustration of being forced to talk to a computer with
             | limited agency.
        
           | James_K wrote:
           | > I don't think anybody who actually uses LLMs thinks they
           | are general purpose AI.
           | 
           | They are being sold as such. Most people don't know anything
           | about the topic and will buy that marketing. The entire
           | concept of these models is that you can put a whole bunch of
           | data in and eventually some kind of magic will happen and you
           | get AGI out. They would not see the kind of investment that
           | they do if all that was being promised was "really good
           | predictive text". In fact some philosophers argue that
           | sentience is just really good predictive text to try and make
           | the point that these models are AGI.
        
           | kragen wrote:
           | they're pretty general-purpose; you can ask the same model
           | for recipe suggestions, a fanfic, or verilog. like, they're
           | far, far more general-purpose than any humans i know
           | 
           | it's true that they're not very reliable, but they seem to be
           | not very reliable across many different domains. and they
           | don't seem to be particularly less reliable than the average
           | human, so i think possibly your standards for 'general
           | purpose ai' are set high enough that you would declare humans
           | to be unintelligent (or perhaps not 'general-purpose') if you
           | applied them consistently
           | 
           | you can certainly find particular domains where humans can
           | still do things llms can't, but i haven't seen a persuasive
           | account of why those domains are the more important ones, and
           | of course the converse is also true
        
           | mupuff1234 wrote:
           | They might not think that but they sure as hell are trying to
           | sell that idea.
        
           | dr_dshiv wrote:
           | Well, I believe LLMs are general purpose AI. And, for the
           | record, so does Peter Norvig. He coauthored the most popular
           | textbook on Artificial Intelligence, so this belief can be
           | considered a somewhat mainstream perspective. His article,
           | "Artificial General Intelligence Is Already Here" [1] is
           | really worth reading.
           | 
           | https://www.noemamag.com/artificial-general-intelligence-
           | is-...
        
             | lupire wrote:
              | I can't tell whether Norvig is making a subtle point that
              | artificial general stupidity is still an important form of
              | artificial general intelligence, or whether he somehow
              | vastly overestimates AI model performance on non-
              | memorization and summarization tasks.
        
           | solidasparagus wrote:
           | The vast majority of people use a model built by someone else
           | through a high-level abstraction and then make broad claims
           | about what the technology will be capable of.
           | 
           | Then you have the people who are leveraging the technology to
           | train models from scratch and that population is far more apt
           | to believe that large models can be general purpose AI (by
           | some definition). Sure there are other things at play like
           | money, publicity, reputation, and a desire to do something
           | important. But there is also the very clear trend line where
           | transformers are able to model pretty much any sequence of
           | tokens where there is pretraining scale data available.
           | Whisper was not a surprise. GPT-4v was not a surprise. Sora
           | was not a surprise.
        
         | photon_lines wrote:
          | There's actually a pretty simple solution to this that I
          | thought about testing out: ask the model to re-construct the
          | problem in a logic language (like Prolog) and to execute that
          | program in order to come up with a solution, rather than
          | attempting simple chain-of-reason training or other
          | methodologies for getting the model to 'reason' through some
          | of these examples. People forget that
         | humans don't come up with their logical models out of the blue
         | - it takes years of elementary school in order for us to
         | understand the world and solve problems in it. The logic
         | programming approach I'd say is really promising but you would
         | need to feed the LLM a LOT of examples in order for it to work,
         | and currently I'm not even sure that we have enough training
         | data in order to implement something like this.
        
           | nerdjon wrote:
            | I honestly thought about this recently when I was trying to
            | see the limits of Claude Opus. For some of the problems I
            | gave it, what if, instead of telling it to solve the
            | problem, I asked it to write a script and then give me the
            | command and inputs needed to run it to get the answer I
            | needed? That way, instead of relying on the LLM to properly
            | analyze the numbers, it just needs to understand enough to
            | write the logic.
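            | 
            | As a purely illustrative sketch (the function name and
            | flags here are made up), the kind of script one might ask
            | the model to emit could be as small as this; the key step,
            | that Alice is herself one of the sisters, still has to be
            | encoded correctly by the model:
            | 
            |     # Hypothetical helper a model might emit for the riddle.
            |     import argparse
            | 
            |     def sisters_of_brother(brothers, sisters):
            |         # The brother count is irrelevant; each brother has
            |         # Alice's sisters plus Alice herself.
            |         return sisters + 1
            | 
            |     if __name__ == "__main__":
            |         p = argparse.ArgumentParser()
            |         p.add_argument("--brothers", type=int, required=True)
            |         p.add_argument("--sisters", type=int, required=True)
            |         a = p.parse_args()
            |         print(sisters_of_brother(a.brothers, a.sisters))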
           | 
           | It is an interesting prospect but I feel like it has some
           | limitations. For math problems like this one, yeah it should
            | be simple to write a script to do it. But it does first have
            | to understand the core point here, that Alice is one of the
            | brother's sisters, in order to write the script accordingly.
           | 
           | But I would think this would not scale well when dealing with
           | far more complex issues, particularly ones that may not just
           | be simple math logic. If the request was to write a
           | scientific paper for example, it might have to make several
           | scripts and call them multiple times, and that assumes that
           | it understands what it is doing enough to properly make the
           | script.
        
             | photon_lines wrote:
              | I'm not sure what you mean by 'it will not scale well.'
              | When we humans learn that we've made a mistake, we make a
              | note and hold the correct answer in memory; the next time
              | we get a similar prompt, we can use our old memories to
              | come up with the correct solution. I just did a simple test
              | for this exact problem using ChatGPT 3.5:
             | 
             | 'Can you reformulate the following problem using Prolog?
             | When you execute the program - what answer do you get?
             | 'Alice has 8 brothers and she also has 5 sisters. How many
             | sisters does Alice's brother have?'
             | 
              | Even though it came up with a proper Prolog formulation, it
              | failed to account for the fact that Alice is one of the
              | sisters, so I gave it a simple prompt for this and it was
              | able to correct the formulation and come up with the proper
              | solution:
             | 
             | My prompt after ChatGPT gave me an answer of 5: 'Yes but
             | you aren't including Alice in your logic. Shouldn't the
             | correct answer be 6 (since Alice is also a sister)?'
             | 
             | Response:
             | 
             | 'You are correct. The correct answer should account for
             | Alice being one of the sisters. Let's adjust the Prolog
             | program to include Alice in the calculation:
             | 
              |     % Facts
              |     brothers(alice, 8).
              |     sisters(alice, 5).
              | 
              |     % Rules
              |     sisters(Brother, NumSisters) :-
              |         sisters(alice, Num),
              |         NumSisters is Num + 1.  % Include Alice herself
              | 
              |     % Query
              |     ?- sisters(alices_brother, NumSisters).
             | 
             | In this updated Prolog program:
             | 
             | The rule for determining the number of sisters includes
             | Alice herself by adding 1 to the number of sisters Alice
             | has.
             | 
             | When we execute the query ?- sisters(alices_brother,
             | NumSisters). in Prolog, we should get:
             | 
             | NumSisters = 6.
             | 
             | This correctly reflects that each of Alice's brothers has 6
             | sisters, including Alice herself.
             | 
              | When I prompt it to answer this question using Prolog now,
              | I always get the correct answers no matter which integers I
              | feed in, so it seems like the model is able to generalize
              | after I show it where its 'reasoning' was wrong. Basically,
              | my suggestion is to get the model to recognize logic
              | puzzles like this and formulate them as logic programming
              | queries, which it can use and tune in order to come up with
              | correct answers, rather than relying on the simple auto-
              | associative chain-of-reason training that current GPT
              | models use. Like I said, this is my hypothesis, and I
              | believe it would work much better at getting these models
              | to 'generalize' than the current approaches. Hopefully this
              | helps.
        
               | pbhjpbhj wrote:
                | When you ask again, the prompt includes the context of
                | your previous question and correction. When I ask, the
                | prompt doesn't have that context, so the model fails to
                | give me the correct answer.
                | 
                |  _I'm using the default free model in the app, based on
                | GPT-4._
        
               | photon_lines wrote:
               | Yup - well this is where my suggestion is to change the
               | GPT architecture. You can think of having a logic program
               | function as the 'frontal lobe' of the general pre-trained
               | auto-associative model. This 'frontal lobe' region would
               | try to come up with logical sequences to go along with
                | its internal auto-associative representations. Of course
               | - the logic programming piece is just one approach -
                | maybe chain-of-thought or chain-of-reason prompting could
               | work here too as many humans I think use this chain-of-
               | reasoning approach themselves. Logic programming to me
               | would function as a suggested shortcut.
        
               | immibis wrote:
               | What if you prompt it "You seem to have accidentally
               | included Alice. The correct answer should be 4"?
        
               | photon_lines wrote:
                | Yup, this is a good example. The model has no conception
                | of what 'causality' is or how to try to come up with a
                | correct 'model'. Humans have a visual system which helps
                | them out, but for LLMs I can definitely see your point:
                | in these instances, if you feed in garbage data, you will
                | get garbage out.
        
               | mcguire wrote:
               | " _My prompt after ChatGPT gave me an answer of 5: 'Yes
               | but you aren't including Alice in your logic. Shouldn't
               | the correct answer be 6 (since Alice is also a
               | sister)?'_"
               | 
               | Useful, if you know what the answer is. What happens if
               | you don't give it the correct answer?
        
               | photon_lines wrote:
               | What happens when you don't give a human being the
               | correct answer? :)
        
             | ip26 wrote:
                | I don't understand why LLMs aren't already set up to do
             | what you describe automatically behind the curtain. Extract
             | a math equation from text (LLMs are good at translating
             | between languages right?) and immediately evaluate it on
             | the host CPU.
             | 
             | LLM is the equivalent of recalling your times tables.
             | Computer arithmetic is the equivalent of re-computing your
             | times tables.
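              | 
              | As a rough sketch of the host-side piece (just an
              | illustration, not how any particular product does it),
              | the model would be told to emit only the bare expression,
              | and the host would evaluate it with a parser that accepts
              | nothing but plain arithmetic:
              | 
              |     import ast, operator
              | 
              |     OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
              |            ast.Mult: operator.mul,
              |            ast.Div: operator.truediv}
              | 
              |     # Walk the parsed expression; allow only numbers and
              |     # the four basic operators, nothing else.
              |     def walk(n):
              |         if isinstance(n, ast.Expression):
              |             return walk(n.body)
              |         if isinstance(n, ast.BinOp) and type(n.op) in OPS:
              |             return OPS[type(n.op)](walk(n.left),
              |                                    walk(n.right))
              |         if (isinstance(n, ast.Constant)
              |                 and isinstance(n.value, (int, float))):
              |             return n.value
              |         raise ValueError("not plain arithmetic")
              | 
              |     def eval_arith(text):
              |         return walk(ast.parse(text, mode="eval"))
              | 
              |     # e.g. the model emits "212 + 1" and the host computes
              |     # 213 itself instead of trusting the model's arithmetic
              |     print(eval_arith("212 + 1"))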
        
               | dragonwriter wrote:
                | > I don't understand why LLMs aren't already set up to
               | do what you describe automatically behind the curtain.
               | 
               | LLM-based systems with tool use (which this is an
               | application of) often are, to an extent, the issue is
               | tuning the (behind the scenes, system) prompting so that
               | they use appropriate tools in every case where they
               | should, and do so correctly. (There's also a cost factor
               | involved since behind-the-scenes tool use means multiple
               | LLM round trips to answer the question, so tuning the
               | system to use tools more aggressively makes the system
               | more expensive.)
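                | 
                | A bare-bones sketch of that round-trip shape (call_llm,
                | the message format, and the "add" tool below are all
                | hypothetical, not any vendor's real API):
                | 
                |     def call_llm(messages):
                |         # Hypothetical stand-in for one model round
                |         # trip: returns either {"content": ...} as a
                |         # final answer, or {"tool": ..., "args": [...]}.
                |         raise NotImplementedError
                | 
                |     TOOLS = {
                |         # Toy host-side tool the model may request.
                |         "add": lambda a: str(sum(map(int, a))),
                |     }
                | 
                |     def answer(question, max_turns=4):
                |         msgs = [{"role": "user", "content": question}]
                |         for _ in range(max_turns):
                |             reply = call_llm(msgs)
                |             if "tool" not in reply:
                |                 return reply["content"]
                |             result = TOOLS[reply["tool"]](reply["args"])
                |             # Each tool result costs another round trip.
                |             msgs.append({"role": "tool",
                |                          "content": result})
                |         return None  # gave up after max_turns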
        
               | pbhjpbhj wrote:
               | ChatGPT does do this sort of process for arithmetic now;
                | it converts word-based problems to mathematical notation
                | and then solves them.
        
           | CooCooCaCha wrote:
           | I'm curious how this would work considering knowledge can be
           | fuzzy.
           | 
           | Like if I'm out camping and I sit on a log or a rock those
           | things are not what people usually think of as chairs but
           | they can serve as chairs in that situation.
        
             | photon_lines wrote:
             | You can get models to actually show that 'logs' could
              | function as 'chairs.' You're forgetting that we humans
              | learn this as well, but we learn it in a much simpler
              | manner than LLMs do, so someone has to explicitly let
              | models know what assumptions they can make. You get the LLM
              | to write Prolog programs and learn associations in this
              | manner. As the model gets better at logically modelling the
              | problems, the solutions to prompted problems like this
             | should get better.
        
           | sollewitt wrote:
           | Right, and do you verify the result?
           | 
           | You have to know what the answer is supposed to be before you
           | can write a test case.
        
             | photon_lines wrote:
              | Yup - well, you feed in the prompt along with an answer,
              | get the model to produce outputs, and check for
              | discrepancies. If the answer is wrong then the model
              | adjusts; this is the way backpropagation works. I think
              | there are huge advantages in using logic languages to
              | represent some of these data sets rather than plain
              | English or the current chain-of-thought reasoning
              | approaches. Backpropagation, as an example, isn't really
              | used in the human brain, but it leads to great results in
              | mimicking how neural networks 'learn'; in the same way, we
              | don't have to have the full formal picture of how humans
              | model the logical world in order to achieve great results.
              | We can simulate this using logic programming or even
              | general programming, or at least that's my conjecture.
        
             | IanCal wrote:
             | I can have more confidence that my calculations are correct
             | using a calculator compared to doing it by hand, even if I
             | don't know the exact right answer beforehand.
        
           | astrobe_ wrote:
           | > asking it to execute this type of program in order to come
           | up with a solution
           | 
           | I may be showing my ignorance about this tech here, but I
            | believe LLMs don't even try to solve a problem; they try
            | to generate a discourse that could pass as a solution or
           | answer to the problem; that's more or less what the abstract
           | states if I understand it correctly. But in no way does it
           | try to apply some sort of mechanical reasoning like inference
           | engines do.
           | 
            | To me the solution to this is to associate the LLM with
            | mechanical computations, that is, an inference engine or an
           | equation solver, rather than recombining the millions of
           | solutions for similar problems it has seen in its training
           | set. I believe I remember reading about teams attempting this
           | approach. I can imagine for instance that if the LLM is in
           | some way able to ask questions and use the answer, maybe it
           | could just generate a prompt for an equation solver and
            | include the result in its answer.
        
           | asadotzler wrote:
           | If that kind of thing worked, we'd have been doing it long
           | before LLM chatbots.
        
             | IanCal wrote:
             | Yet tools like GPT4o can do this.
             | 
              | It's not a trivial problem, taking a human-written
              | description and rewriting it as a Prolog program.
        
           | IanCal wrote:
           | I took one of the problems that gpt4o got wrong, and asked
            | gpt4o what tools it could use. It suggested and wrote Prolog
            | for me that (with one pass back to get things defined in the
            | right order) correctly worked out the answer.
        
           | mcguire wrote:
           | I just tried that with ChatGPT 3.5 (4o stopped responding
           | after I asked the initial question and it produced the wrong
            | answer). Here's the Prolog it generated:
            | 
            |     % Define the number of brothers and sisters
            |     brothers(4).
            |     sisters(1).
            | 
            |     % Predicate to calculate the number of sisters
            |     % a brother has
            |     brother_sisters(NumberOfSisters) :-
            |         sisters(NumberOfSisters).
            | 
            |     % Query to find out how many sisters a brother has
            |     ?- brother_sisters(Sisters).
        
         | wtbdqrs wrote:
         | I appear to be reasoning at times but I have mostly no idea
         | what I am talking about. I hit a bunch of words and concepts in
         | the given context and thus kind of hallucinate sense.
         | 
         | Given a few months of peace of mind and enough money for good
         | enough food, I could actually learn to reason without sounding
         | like a confused babelarian.
         | 
         | Reasoning is mostly a human convention supported by human
         | context that would have been a different one if the Fascists
         | had won the war or the Soviet Union wouldn't have gotten
         | corrupted.
         | 
         | But none of that has anything to do with pulling up a
         | whiteboard to draw some flowcharts and run some numbers, all of
         | which is why I am certain there is nothing the devs have "to
         | fix". It took most reasonable humans many generations to learn
         | stuff. Very few of us did the actual work.
         | 
         | It's all just a matter of time.
        
           | voxic11 wrote:
           | Yeah, I think these chatbots are just too sure of themselves.
           | They only really do "system 1 thinking" and only do "system 2
           | thinking" if you prompt them to. If I ask gpt-4o the riddle
           | in this paper and tell it to assume its reasoning contains
           | possible logical inconsistencies and to come up with reasons
            | why that might be, then it does correctly identify the
           | problems with its initial answer and arrives at the correct
           | one.
           | 
           | Here is my prompt:
           | 
           | I have a riddle for you. Please reason about possible
           | assumptions you can make, and paths to find the answer to the
           | question first. Remember this is a riddle so explore lateral
           | thinking possibilities. Then run through some examples using
           | concrete values. And only after doing that attempt to answer
           | the question by reasoning step by step.
           | 
           | The riddle is "Alice has N brothers and she also has M
           | sisters. How many sisters does Alice's brother have?"
           | 
           | After you answer the riddle please review your answer
           | assuming that you have made a logical inconsistency in each
           | step and explain what that inconsistency is. Even if you
           | think there is none do your best to confabulate a reason why
           | it could be logically inconsistent.
           | 
           | Finally after you have done this re-examine your answer in
           | light of these possible inconsistencies and give what you
           | could consider a second best answer.
        
             | daveguy wrote:
             | > After you answer the riddle please review your answer
             | assuming that you have made a logical inconsistency in each
             | step and explain what that inconsistency is. Even if you
             | think there is none do your best to confabulate a reason
             | why it could be logically inconsistent.
             | 
             | LLMs are fundamentally incapable of following this
             | instruction. It is still model inference, no matter how you
             | prompt it.
        
             | zeknife wrote:
             | If you had a prompt that reliably made the model perform
             | better at all tasks, that would be useful. But if you have
             | to manually tweak your prompts for every problem, and then
             | manually verify that the answer is correct, that's not so
             | useful.
        
             | cpleppert wrote:
             | There isn't any evidence that models are doing any kind of
             | "system 2 thinking" here. The model's response is guided by
             | both the prompt and its current output so when you tell it
             | to reason step by step the final answer is guided by its
             | current output text. The second best answer is just
              | something it came up with because you asked; the model has
             | no second best answer to give. The second best answers
             | always seem strange because the model doesn't know what it
             | means to come up with a second best answer; it 'believes'
             | the output it gave is the correct answer and helpfully
             | tries to fulfill your request. Sometimes the second best
              | answer is right, but most of the time it's completely
              | nonsensical, and there is no way to distinguish between the
              | two. If you ask it to choose, it will be strongly
              | influenced by the framing of its prior response and won't
              | be able to spot logical errors.
             | 
             | Asking it to do lateral thinking and provide examples isn't
             | really helpful because its final output is mostly driven by
             | the step by step reasoning text, not by examples it has
             | generated. At best, the examples are all wrong but it
             | ignores that and spits out the right answer. At worst, it
             | can become confused and give the wrong answer.
             | 
             | I've seen gpt-4 make all kinds of errors with prompts like
             | this. Sometimes, all the reasoning is wrong but the answer
             | is right and vice versa.
        
         | layer8 wrote:
         | For reasoning of any complexity, some sort of internal
         | monologue and iteration is needed. For type-2 thinking, we
         | conceptualize possible solutions, arguments, and reasoning
         | paths in our mind, and judge and evaluate them before deciding
         | which one we think is adequate or correct. This can be emulated
         | to some extent by guiding prompts that make that internal
         | thought process external, but we really want it to be internal
         | and automatic.
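          | 
          | A minimal sketch of that external emulation (call_llm here is
          | a hypothetical stand-in for a single completion call, and the
          | prompts are only illustrative):
          | 
          |     def call_llm(prompt):
          |         # Hypothetical single completion call.
          |         raise NotImplementedError
          | 
          |     def answer_with_scratchpad(question, rounds=2):
          |         # Externalized "monologue": draft, critique, revise.
          |         draft = call_llm("Think step by step:\n" + question)
          |         for _ in range(rounds):
          |             critique = call_llm(
          |                 "List logical errors in:\n" + draft)
          |             draft = call_llm(
          |                 "Question:\n" + question +
          |                 "\nDraft:\n" + draft +
          |                 "\nCritique:\n" + critique +
          |                 "\nRevise the draft using the critique.")
          |         return draft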
         | 
         | We don't know how to do that yet, because what _controls_ the
         | internal thought process is itself not necessarily language-
         | based, and also, since internal thought processes of biological
         | brains are not directly observable, they can't be used as
         | training data.
         | 
         | Edit: It occurs to me now that there is some parallel between
         | current LLMs and behaviorism [0], and we really need something
         | to which cognitive psychology could be applied instead.
         | 
         | [0] https://en.wikipedia.org/wiki/Behaviorism
        
           | sollewitt wrote:
           | Given many people don't have an inner monologue and function
           | just fine, it's more likely inner monologue is a product of
            | the reasoning process and not its mechanism.
        
             | CooCooCaCha wrote:
             | I think you're using "inner monologue" too literally. It
             | could be a progression of pictures, emotions, etc.
        
               | ajuc wrote:
               | With that definition even bacteria have inner monologue.
        
               | CooCooCaCha wrote:
               | Can bacteria imagine pictures? Do they have emotions?
               | 
               | Why does this matter? Stop being so pedantic. We're
               | talking about a progression of _ideas_. Talking in your
               | head is one form of ideas, but people can easily solve
               | problems by imagining them.
        
               | IlliOnato wrote:
               | Hmm, looks to me like just trading some words for others.
               | Do bacteria have ideas? Does the navigating system in
               | your car? How do you know?
               | 
               | We need to be at least somewhat pedantic, otherwise it's
               | impossible to know what we are even talking about, and no
               | way to establish anything.
        
               | ajuc wrote:
                | The initial thesis was that inner monologue is required
                | for reasoning. If you define inner monologue to include
                | everything brains do, the initial thesis becomes a
                | tautology.
        
               | IlliOnato wrote:
               | To make any progress on this question at all, we need
               | first to come up with some definition of internal
               | monologue. Even if we may need to modify it later, there
               | has to be a starting point.
               | 
               | Otherwise, nothing can be established at all, because for
               | any statement there always will be someone's
               | understanding of "internal monologue" for which the
               | statement is true, and someone's else understanding for
               | which the statement is false...
        
               | hombre_fatal wrote:
               | I'm sure inner monologue just cashes out into the ability
               | to reflect on your own thoughts. And for one to say that
               | they're not having that experience also involves a claim
                | about what they think _other people_ are having, which
               | would make me doubly skeptical.
               | 
               | In practice, when you see people arguing about whether
               | they have an "inner monologue" or can "mentally picture
               | objects" on social media, it's more of a contest of who
               | is the most unique in the world rather than anything that
                | sheds light on our subjective experience.
        
             | layer8 wrote:
             | It's commonly conjectured that the emergence of human-level
             | reasoning wouldn't have been possible without the
             | development of language. Personally, I'm able to suppress
             | "word thoughts" in my head (for a short time), but then I
             | lose almost all of my reasoning ability. I could imagine
             | that reasoning is language-based even when it's not
             | conscious for some people. An internal process being there,
             | and being conscious of it, are two separate things. We
             | would be happy with an AI using an internal monologue
             | without it being conscious of that monologue.
        
               | IggleSniggle wrote:
               | Maybe, but symbolic thought can get pretty far away from
               | what we generally call "language." I bet you can reason
               | 1+3x=22 pretty easily without any words whatsoever, or
               | the sound of one ascending octave after another, or the
               | approximate G-force induced on your body if you take the
               | next turn without applying the brakes.
               | 
               | All of these forms of reasoning are true and useful
               | calculations: when we talk about "intuition" what we
               | usually mean is that we have a lot of experience and
               | internal reasoning about a subject, but we struggle to
               | translate it to and from the "language" part of our
               | brain. Nonetheless, any social dancer will tell you that
               | a dialog is possible just by receiving and inducing
               | g-forces alone. You can reason this way about abstract
               | concepts like orbits without ever touching a single word
               | or concrete symbol.
               | 
               | Edit: the key aspect of reasoning, imho, is the ability
               | to make predictions and introspect them against a
               | database of other predictions, using an adversarial
               | heuristic to weight the most plausibly useful results.
               | Perhaps our pattern matching AIs of today just lack
               | sufficient "experience" to do what we call reasoning.
        
               | layer8 wrote:
               | Regarding "1+3x=22", I'm actually not sure, the number
               | words certainly appear in my head when solving the
               | equation. But even then, I would count "1+3x=22" as
               | constituting language. Perception of sound, G-forces, and
               | dancing don't perform type-2 reasoning by themselves, so
               | I don't think your argument applies there.
               | 
               | Regarding your edit, no, I think the key aspect of the
               | kind of reasoning we are missing in current AI is the
               | ability to hold the reasoning in your mind, and to
               | iterate on it and evaluate it (judge it) within your
               | mind.
        
               | IggleSniggle wrote:
                | It is very difficult to have a discussion _using words_
                | about non-word or non-symbolic semantics. I was pointing
                | at several different plausible
               | spaces for semiotics and how these spaces could be spaces
               | for reasoning in the hopes that one of them might be
               | relatable.
               | 
               | If you use words in your mind when you use math, and you
               | use words in your mind when you make or listen to music,
               | etc., then it is very difficult to find a common ground
               | where it is possible to see that these other realms of
               | thought are capable of not only prediction, but also
               | producing evidence that leads to judgement. That is to
               | say, the key aspects of "reasoning." I picked them
               | because I thought they had broad enough appeal to be
               | relatable, and because I do not personally hear words in
               | my head when doing any of these activities, whether it's
               | calculus or tango, but I still find calculus and tango to
               | be places where reasoning occurs.
               | 
               | Some of them, like math or music, are closer to the kind
               | of symbolic thought we use when we discuss things with
               | words. Others, like the experience of g-forces, are not.
               | I present them as a sliding scale between "word based"
               | reasoning and "non-linguistic" reasoning. Perhaps you can
               | think of a realm that better fits for your personal
               | experience of intuition, and inspect whether these
               | intuitions are capable of "real" reasoning in the absence
               | of language, or whether intuition should never be trusted
               | even when you have a great deal of experience in that
               | area. Perhaps in your estimation, anything that cannot
               | produce evidence that is articulable in word form is
               | suspect.
               | 
               | Personally, I find all these methods, including language,
               | to be suspect. I don't find language to be especially
               | better at producing the kind of evidence for prediction,
               | correct judgement, or discourse for reasoning than other
               | methods, unless you reduce "reasoning" to tautologically
               | require it.
               | 
               | One of the best tools of language is that we have writing
               | that allows easy inspection or iteration of the written
               | content; but these things are possible in other realms,
               | too, it's just that we didn't have great tools for
               | introspecting and iterating on their "ideas" except
               | within our own minds. These days, those tools are readily
               | available in many more realms of human insight.
        
               | hifromwork wrote:
               | >I bet you can reason 1+3x=22 pretty easily without any
               | words whatsoever
               | 
               | I've tried to do it, but I can't. I had to do something
               | like "ok, so we subtract one from both sides and then
               | it's easy, 3*7=21". Maybe I could do 2+8 but I still
               | think the word ten "aloud".
        
               | twiceaday wrote:
               | I was able to do it with no words. I 'saw' the steps as
               | if on a piece of paper. I saw 3x=22-1=21, then x=21/3=7.
               | But I have a degree in applied math. Perhaps not
               | internally vocalizing is just being extremely familiar.
               | It also happened very quickly, perhaps there was no time
               | to vocalize anyways.
        
               | nathan_compton wrote:
               | Brains are weird. I reason almost entirely non-verbally
               | and I would absolutely struggle if I had to laboriously
                | express every thought in words. It's part of the reason I
               | don't work well in teams. So slow!
        
               | psadri wrote:
               | Language is a serialization of our brain's "world model"
               | structures.
        
               | bradfox2 wrote:
               | What defines the boundaries of internal vs external?
               | Certainly nothing about llm weights or ops should.
        
             | kveykva wrote:
             | Have there ever been studies that demonstrate that those
             | individuals don't simulate possible state transitions
             | they'll go through in a different modality? I'd be curious
             | if they visualize actions they'll take still, just not
             | verbally.
        
             | ElevenLathe wrote:
             | The existence of an "inner monologue" isn't really a
             | falsifiable claim. Some people claim to have one while
             | other people claim not to, but we can't test the truth of
             | these claims.
        
               | glitchc wrote:
               | In this particular case, is there any reason why we
                | simply can't take their word for it? This is not a case
                | where, if I say "weak" or "strong", most people pick
                | strong because no one wants to be weak, even if the
                | context is unknown (the nuclear force, for example).
        
               | GeoAtreides wrote:
               | > is there any reason why we simply can't take their word
               | for it?
               | 
                | because if we give them a problem to solve in their head
                | and ask them to just give us the answer, they will. By
                | problem I mean
               | planning a trip, a meal, how to pay the mortgage, etc.
               | It's impossible to plan without an internal monologue.
               | Even if some people claim theirs is 'in images'.
        
               | glitchc wrote:
               | Perhaps there's confusion in how we are using the word
               | monologue. I took it to mean a conversation, a dialogue
               | where the problem is perhaps solved using a dialectic
               | method, or simply a conversation. Since one can solve a
               | problem by following some memorized steps, no
               | conversation required, this is perhaps not a good test,
               | or we mean different things when we say "monologue."
        
               | rprospero wrote:
               | > It's impossible to plan without an internal monologue
               | 
               | I once had a teacher claim that people who claimed to
               | have aphantasia were lying, because those people have
               | read books and it is impossible to read a book without
                | picturing the scene in your mind's eye. Are you citing the
               | same source that she was?
        
               | staticman2 wrote:
               | >> It's impossible to plan without an internal monologue
               | 
               | That's quite the claim.
        
               | ElevenLathe wrote:
               | > It's impossible to plan without an internal monologue.
               | 
               | How can science make this claim if it can't prove (or
               | disprove) the existence of an internal monologue?
        
               | IlliOnato wrote:
               | Well, I remember Richard Feynman came up with an
               | interesting experiment. He found he could not count
               | objects when he read aloud some text at the same time. He
               | had to name the numbers, and it was impossible if he was
               | already engaging his speech.
               | 
               | He thought this was universal, but doing this experiment
               | with friends, he discovered a guy who could count while
                | reading aloud. So when Feynman asked him how he did this,
                | it turned out that the guy, instead of "pronouncing" the
                | numbers, was "seeing" colored numbers in his imagination,
                | so his speech was not involved.
               | 
                | I suppose this experiment could be modified and
                | generalized, at least to shed some light on this
                | problem.
        
               | photon_lines wrote:
               | 'It's impossible to plan without an internal monologue.'
               | - Sorry, but I disagree with this. I have no 'internal
               | voice' or monologue - whenever I see a problem, my brain
               | actually and fully models it using images. I believe 25%
               | of the population doesn't have the internal monologue
               | which you're referring to and this has been tested and
               | confirmed. I highly recommend listening to this Lex
                | Fridman podcast episode to get a full grasp on the
               | complexities of modelling language and general modelling
               | present in the human brain:
               | https://www.youtube.com/watch?v=F3Jd9GI6XqE
        
               | GeoAtreides wrote:
               | Sure, I do mention thinking in images in my original
               | comment and count it as some type of internal monologue.
               | I personally do not believe it's all images, as that
               | would preclude using highly abstract concepts. But I
               | might be wrong, and it might be 100% images. That being
               | said, it does count as an internal monologue.
        
               | lupire wrote:
               | Can you draw a picture of an example of what you see when
               | you think about something?
        
               | photon_lines wrote:
               | Sure - with the Alice example, when I saw the problem I
               | came up with a really simple visual example of Alice
               | having 3 sisters and 4 brothers. When I visualized it I
               | saw Alice standing next to 3 other women (her sisters)
               | and her 4 brothers standing close by. When I imagined
               | asking her brother how many sisters he has, I could see
               | that Alice was standing there next to 3 other women and
               | thus came up with the answer of 4. Does this make sense?
        
               | magicalhippo wrote:
               | This could account for why some people are much better at
               | say geometry than algebra.
               | 
               | I'm the opposite. I rarely visualize things I read, be it
               | geometry or poetry. I can read a detailed description of
               | a person or an item in a novel, and I don't really "see"
               | anything.
               | 
               | But I have an active inner monologue. I "hear" myself
               | saying the words when reading or writing, I "talk" to
               | myself when solving problems or just thinking about
               | stuff.
               | 
               | Especially when programming I'm reasoning by discussing
               | the problem with myself, the only difference being that I
               | usually don't open my mouth and vocalize the discussion.
               | Though sometimes when I'm alone I do just that.
        
               | JohnMakin wrote:
               | > It's impossible to plan without an internal monologue.
               | 
               | Of course it isn't impossible, and this is backed by what
               | we know about paleoanthropology and other instances of
               | cognition in animals - humans were making stone tools
               | millions of years ago, which takes planning in the form
               | of imagining what you want the tool to look like and how
               | you will do it and what it will be used for. It's
               | exceedingly likely we had this ability long before
               | complex speech evolved. Apes also use and make tools,
               | which would require planning, and I don't think they have
                | an internal monologue going on. Birds from the corvid
               | family can do some pretty advanced problem solving that
               | requires planning. Cetaceans might be an exception,
               | because they appear to have some form of language, but
               | this is a pretty wild claim not really backed by any kind
               | of science as we understand it today.
        
               | GeoAtreides wrote:
                | Animals cannot manipulate abstract concepts, nor can they
                | make long-term plans. No crow can plan an international
               | trip spanning a couple of weeks and two change-overs. And
               | some people definitely can't do it start to end, but they
               | can at least plan the first 5-7 steps.
               | 
               | Also, maybe inner monologue is not a binary have/have
               | not, but maybe it is on a continuum.
        
               | glitchc wrote:
               | Not sure. Migratory birds seem to manage this just fine.
               | Not only do they make multiple stops to eat and rest,
               | they also navigate around bad weather and still make it
               | to their intended destination (at least most of the
               | time).
        
               | GeoAtreides wrote:
               | > Migratory birds seem to manage this just fine
               | 
               | Instincts.
        
               | JohnMakin wrote:
                | Yes, no one is disputing that animals are not as
                | intelligent as humans and lack the same capacity for
                | planning, but the post you're replying to is disputing
                | the claim that planning is done solely through internal
                | narrative/monologue, which is easily disproved by
                | pointing to the examples I did. There are many more in
                | nature.
        
               | IlliOnato wrote:
               | > "why we simply can't take their word for it"?
               | 
                | As someone who was involved in the spiritual practice of
               | "stopping internal dialogue" for years, I can tell you
               | that one learns that that dialogue (or monologue, pretty
               | much the same thing) is quite subtle and complex,
               | essentially multi-layered.
               | 
               | Typically, when you think that you "think about nothing
               | at all" it's just the most surface layer that has
               | stopped, and more subtle talking to yourself is still
               | going on. It takes training just to become able to notice
               | and recognize it.
               | 
               | After all, it's just such a constant and monotone hum at
                | the back of one's mind that one learns to completely
                | ignore it.
               | 
               | So no, I would not take a word of people who were not
               | trained to notice their internal monologue that they
               | haven't any :-)
        
               | Terr_ wrote:
               | > In this particular case, is there any reason why we
               | simply can't take their word for it?
               | 
               | My concern is that if we take their word for it, we're
               | actually buying into _two_ assumptions which (AFAIK) are
               | both unproven:
               | 
               | 1. That "Internal Monologues" (not consciously forced by
               | attention) exist in the first place, as opposed to being
               | false-memories generated after-the-fact by our brain to
               | explain/document a non-language process that just
               | occurred. (Similar to how our conscious brains _pretend_
               | that we were in control of certain fast reflexes.)
               | 
               | 2. Some people truly don't have them, as opposed to just
               | not being aware of them.
        
               | ElevenLathe wrote:
                | Not only are they unproven, but they are ultimately not
               | provable at all. Some people will say yes, some people
               | will say no. Probably we can take their word for it, but
               | in the simplest case they could just lie (in either
               | direction) and we would have no way to tell.
               | 
               | In short, maybe these inner monologues exist and maybe
               | they don't, but science can't comment on that. That said,
               | it is clearly something we are interested in, but it will
                | need to be addressed in some other way (e.g. religion,
               | ideology, etc.).
        
               | Terr_ wrote:
               | > but are ultimately not provable at all
               | 
               | No, they are potentially falsifiable as we get better at
                | scanning, identifying, and intervening in brain activity.
               | 
               | Just off the top of my head here, suppose we create a
               | table puzzle problem that (in itself) doesn't require
               | language to understand, like ones we make for certain
               | animals. Have human subjects (silently) solve it.
               | Afterwards, quiz the solvers about their internal
               | monologue--or lack thereof--dividing them into two groups
               | and noting the words used.
               | 
               | Now change to a second puzzle of similar style and same
                | overall difficulty. Stun/anesthetize the language-centers
               | of subjects, to deny access to any of the monologue-words
               | (validating this intervention will involve other
               | research), and then test them on the second problem.
               | 
               | * If performance is constant for both groups, that
               | suggests the monologue is illusory or at least not needed
               | for this kind/scope of problem.
               | 
               | * If performance drops for both groups, that suggests the
               | no-monologue people might just not be as aware of a
               | linguistic process that's actually happening.
               | 
                | * If performance drops only for monologue-subjects, that
               | suggests it's a real and important difference in modes of
               | logical thought.
               | 
                | * If some other combination happens, you have a
               | mysterious and exciting new line of research.
        
               | ElevenLathe wrote:
               | Sure, there is stuff we can do to tease around the edges
               | (similar problems crop up all the time in psychology and
               | sociology) but we will always have to weaken the claim in
               | order to do experiments relating to it.
        
               | TeMPOraL wrote:
               | > _Probably we can take their word for it, but in the
               | simplest case they could just lie (in either direction)
               | and we would have no way to tell._
               | 
               | Individually, no, but in general, for people to
               | consistently lie about this particular thing at scale
               | would be extremely unusual, given that people rarely lie
               | if there's no reason for it. Going by this baseline, you
               | could assume upwards of 50% of replies are honest (even
               | if mistaken); otherwise you'd have to explain why you
               | believe people would suddenly lie about that particular
               | thing.
        
               | hombre_fatal wrote:
               | Because we can't be sure whether two people interpret
               | "inner monologue" the same way, or whether they think it
               | describes a phenomenon that actually isn't different
               | between them and other people.
               | 
               | For example, I can think of interpretations of "I picture
               | objects that I'm thinking about" that range from me not
               | experiencing the phenomenon to me indeed experiencing the
               | phenomenon.
               | 
               | To say that you're not experiencing something that other
               | people are experiencing in their head is a solipsistic
               | notion where you hypothesize an experience that you
               | imagine others are having and then discard it for being
               | different than yours.
        
               | TeMPOraL wrote:
               | And here I thought this was solved decades ago - I need
               | to find the source, but I read about an old study where
               | people describe their experience, and the answers were
               | all over the "range from me not experiencing the
               | phenomenon to me indeed experiencing the phenomenon".
               | 
               | Then again, it's trivially reproducible - people self-
               | report all variants of inner monologue, including lack of
               | it, whenever a question about it pops up on-line. Same is
               | the case with imagination - aphantasia is a thing (I
               | would know, I have it).
        
               | hombre_fatal wrote:
               | I'm responding to "why can't we just take their word for
               | it?"
               | 
               | That you and I can come up with different ways to
               | describe our subjective experience in conversation
               | doesn't mean that we have a different subjective
               | experience.
               | 
               | Especially not when relayed by a species that's
               | frequently convinced it has a trending mental disorder
               | from TikTok.
        
               | TeMPOraL wrote:
               | We can keep talking about it, and assuming we're both
               | honest, we'll arrive at the answer to whether or not our
               | subjective experiences differ. To fail at that would
               | require us to have so little in common that we wouldn't
               | be able to communicate _at all_. Which is obviously not
               | the case, neither for us nor for almost any possible
               | pair of humans currently alive.
        
               | Terr_ wrote:
               | > The existence of an "inner monologue" isn't really a
               | falsifiable claim.
               | 
               | Another possibility is that inner-monologues (ones not
               | forced by conscious effort) do exist, but are just a kind
               | of false-memory, something one part of our brain
               | generates after-the-fact to explain/document the outcome
               | of another non-language part.
               | 
               | Kind of like how certain reflex-actions can occur before
               | certain decision-making areas of the brain light up, yet
               | humans will _believe_ that they sensed the event and made
               | a thinking choice.
        
             | mbesto wrote:
             | The fact that we don't actually have an understanding and
             | framework for reasoning (e.g. whether inner monologue is
             | a cause or an effect) means we are VERY far off from
             | general AI.
             | 
             | https://youtu.be/QGYbbLWn-IE?t=72
        
           | logicchains wrote:
           | >For reasoning of any complexity, some sort of internal
           | monologue and iteration is needed.
           | 
           | From a formal perspective you're entirely correct.
           | Transformers with chain-of-thought are strictly more powerful
           | than transformers without it, and can efficiently solve
           | classes of problems that would otherwise require
           | exponentially increasing model depth:
           | https://arxiv.org/abs/2310.07923
        
         | ericmcer wrote:
         | If you really think about what an LLM is you would think there
         | is no way that leads to general purpose AI. At the same time
         | though they are already doing way more than we thought they
         | could.
         | 
         | Maybe people were surprised by what OpenAI achieved so now they
         | are all just praying that with enough compute and the right
         | model AGI will emerge.
        
           | more_corn wrote:
           | LLMs are the language center of an AI. They in no way purport
           | to be: long term memory, critical thinking engines,
           | calculators.
           | 
           | If we want those things we can build them. Building them into
           | the language center would be absurd and weird.
        
             | daveguy wrote:
             | This is a very good take on the current state of LLMs. They
             | are able to articulate surprisingly well and LLMs or
             | something similar will definitely be part of a larger AI
             | model in the future. They are good creative aids, but they
             | should not be trusted.
        
             | viking123 wrote:
             | In the brain all those things are neural networks too,
             | memory is encoded in the weights. So if we want to have
             | that type of system, everyone should have their own model
             | and the weights would actually adjust all the time, just
             | like in our brain; but now it is just trained once and
             | that is that.
        
             | lupire wrote:
             | And yet this is what all the big companies we re trying to
             | do, because the LLM is incredibly expensive fixed cost, and
             | they marginal cost for each new task.
        
           | solidasparagus wrote:
           | > If you really think about what an LLM is you would think
           | there is no way that leads to general purpose AI
           | 
           | It is an autoregressive sequence predictor/generator. Explain
           | to me how humans are fundamentally different.
        
             | throwanem wrote:
             | "Prove me wrong?" That's not how this works. Your implicit
             | claim here is that human cognition and LLM functioning are
             | fundamentally similar. That claim requires substantiation.
        
               | solidasparagus wrote:
               | It is how it works if you are replying to someone who
               | claims "If you really think about what an LLM is you
               | would think there is no way that leads to general purpose
               | AI". The counter example is human beings are considered
               | general purpose intelligence and we are complex, but
               | fundamentally predictable systems (not by us today), with
               | (as far as we can tell) deterministic outputs based on
               | the state of the universe (including physical being like
               | brain and chemicals).
        
               | throwanem wrote:
               | Responding to an unsubstantiated claim with an
               | unsubstantiated claim just adds another unsubstantiated
               | claim.
               | 
               | So far as I know, whether the universe behaves
               | deterministically remains an unsolved question. Given
               | that, your statement here would already be one of belief
               | rather than fact, even before we get to the
               | parentheticals. There is information here, but not about
               | whether LLMs can develop into AGI.
        
               | solidasparagus wrote:
               | Fine, you can ignore my previous comment, that's just my
               | answer to the question that this discussion ultimately
               | takes you to. But I feel like you are just sitting on the
               | sidelines making strawmen and playing pedantic games
               | instead of saying anything constructive.
               | 
               | The original comment said:
               | 
               | > If you really think about what an LLM is you would
               | think there is no way that leads to general purpose AI.
               | 
               | This is an inflammatory way to state an extreme position
               | on a well-discussed debate over whether next-token
               | prediction can lead to general intelligence. The original
               | commenter clearly believes it can't get you there. If you
               | want to say that with any authority, you need to have an
               | answer for what is different between what we consider
               | general intelligence (for most people, this is simply
               | human intelligence) and what models are capable of. This
               | is the question at the heart of artificial intelligence.
               | 
               | I challenged them to explain their answer. I made no
               | claims, I asked no one to prove anything wrong. If it is
               | obvious that LLMs can't be AGI, the answer to how an LLM
               | differs from human intelligence is also obvious, right?
        
               | Vegenoid wrote:
               | > I made no claims, I asked no one to prove anything
               | wrong.
               | 
               | Your original comment was:
               | 
               | > It is an autoregressive sequence predictor/generator.
               | Explain to me how humans are fundamentally different.
               | 
               | Which would be interpreted by most reasonable people as
               | you making the claim that humans are autoregressive
               | sequence predictors, and asking people to prove you
               | wrong. I can see how you could say this without intending
               | to make that claim, but most people will interpret this
               | as you making that claim.
               | 
               | I do not intend to inflame things or discredit what you
               | are saying, but just to say that if you did not intend to
               | make a claim or ask people to prove you wrong, a
               | different approach would be more successful in the
               | future.
        
               | solidasparagus wrote:
               | You're right, it was hastily written and I was annoyed.
               | 
               | But I generally hold out hope that people can see a claim
               | "A!=B" and a response "A=C, explain how C!=B" and
               | understand that is not the same as claiming "C=B",
               | especially on HN.
        
               | throwanem wrote:
               | I do remain convinced my interpretation was sound, but on
               | review I have to concede it was also quite uncharitable.
               | 
               | With all the wildly overheated claims that've been flying
               | around since the advent of these models, I may be myself
               | somewhat overfitted. Granted, in such an environment I
               | feel like a little extra care for epistemic hygiene is
               | warranted. But there was no reason for me to be rude
               | about it.
        
               | Vegenoid wrote:
               | I know what you mean. Unfortunately, it's easy for frank
               | and concise language to be taken the wrong way when in
               | written form (and sometimes even verbal form). I wish I
               | didn't have to make qualifiers about my intent on my
               | internet comments, but I often do, to try and make sure
               | that other people take my comment the way I intended it.
               | I think it generally leads to better discussion.
               | 
               | I don't blame people for not wanting to talk this way.
        
               | photon_lines wrote:
               | I actually did a full write-up on this here fyi:
               | https://photonlines.substack.com/p/intuitive-and-visual-
               | guid.... You can skip most of this and scroll down to the
               | end-section called 'The Mental Model for Understanding
               | LLMs' where I try to map how transformers are able to
               | mimic human thinking. I think that comparing them to
               | auto-associative / auto-regressive networks is actually a
               | really good analogy FYI and I do believe a lot of human
               | learning is based on this.
        
               | solidasparagus wrote:
               | An article this long needs a table of contents IMO.
        
               | photon_lines wrote:
               | Will note this and try to add a TOC next time I do this
               | sort of write-up. This article wasn't supposed to be this
               | long - I originally meant to make a more compact version
               | but kept on finding more things which I needed to add in
               | order to map to a correct 'mental model' of what ChatGPT
               | is.
        
               | semi-extrinsic wrote:
               | We don't have sufficient understanding of human thinking
               | and mental processes to make any claims like the ones you
               | are trying to make. If we did, psychology would be a
               | quantitative hard science; it most definitely is not.
        
               | Workaccount2 wrote:
               | We have no idea how human thinking works and no idea what
               | consciousness is.
               | 
               | I'm getting so tired of listening to software engineers
               | LARP as pseudo-neuroscientists with 6th-grade-level
               | insights.
               | 
               | >Of course, the [AI] brain isn't 'conscious.' It doesn't
               | have any survival instincts which we humans do.
               | 
               | Bruh...
        
               | photon_lines wrote:
               | So instead of shouting insults, maybe you can shine some
               | light on how anything stated there is wrong? Everything
               | that I said there is true - we don't have a good mapping
               | of what consciousness is - if you want to add some
               | insights add them instead of making insulting and useless
               | statements on HN.
        
               | Workaccount2 wrote:
               | We have no idea if ChatGPT is conscious or not. We have
               | no idea if it is "thinking" or not.
               | 
               | What we know with a reasonably high level of certainty is
               | that consciousness and "thought" are physical processes.
               | That's about it.
               | 
               | Pulling out the scalpel to start dividing up what
               | physical process is and isn't conscious is a fool's
               | errand. And it's especially foolish when just making up
               | arbitrary qualifications for it.
               | 
               | Am I saying that ChatGPT is conscious? No. But what I am
               | saying is you shouldn't give much credence to people who
               | are anything more than agnostic about it.
        
               | nyrikki wrote:
               | Human neurons take continuous input, with active
               | dendrites and dendritic compartmentalization. Spiking
               | artificial NNs seem to hit problems with riddled basins
               | so far. (A riddled basin is a set with no open subsets.)
               | 
               | Feed-forward networks are effectively DAGs, and
               | circuit-like, not TM-like.
               | 
               | Caution is warranted when comparing perceptrons with
               | biological neurons.
               | 
               | For another lens: dendrites can perform XOR operations
               | before anything even makes it to the soma.
               | 
               | While there is much to learn, here is one highly cited
               | paper on dendritic compartmentalization.
               | 
               | https://mcgovern.mit.edu/wp-
               | content/uploads/2019/01/1-s2.0-S...
               | 
               | I think that the perceptron model of learning plasticity
               | is on pretty shaky ground as being a primary learning
               | model for humans.
        
             | indoordin0saur wrote:
             | An AI needs to see thousands or millions of images of a
             | cat before it can reliably identify one. The fact that a
             | child needs to see only one example of a cat to know what
             | a cat is from then on seems to point to humans having
             | something very different.
        
               | solidasparagus wrote:
               | > AI needs to see thousands or millions of images of a
               | cat before they reliably can identify one.
               | 
               | Not if they inherit from a previous generation of AI. But
               | even if they did, a different training speed does not
               | imply a different capability
        
               | indoordin0saur wrote:
               | My point is not that humans have a faster training speed
               | but that humans must be doing something fundamentally
               | different from LLMs. You could build Altman's $7 trillion
               | GPU cluster and use the majority of the world's
               | energy to feed it and you'd still hit the same
               | limitations if you're just running an LLM on it, even a
               | very sophisticated LLM. This is Yann LeCun's position as
               | well.
        
               | solidasparagus wrote:
               | That, as I understand it, is not a valid chain of logic.
               | Requiring fewer data points does not inherently indicate
               | that the underlying mechanism (autoregressive sequential
               | generation, not the transformer, which is just an
               | architecture) is different.
               | 
               | Not to mention the secondary arguments, like: there is
               | no proof that humans learn faster from fewer data
               | points; that's just your assumption in the sibling
               | comment. Humans
               | inherit information. The equivalent - fine-tuning a
               | foundation model - is very fast to learn novel objects.
               | 
               | Just because someone has a Turing award doesn't mean they
               | know what they're talking about. They are just people,
               | with strengths and weaknesses like everyone else. But
               | often on the extreme end of strengths and weaknesses.
        
               | TeMPOraL wrote:
               | Humans train on _continuous video_. Even our most
               | expensive models are, in terms of training set size, far
               | behind what an infant processes in the first year of
               | their life.
               | 
               | EDIT: and it takes human children a couple years to
               | reliably identify a cat. My 2.5 y.o. daughter still
               | confuses cats with small dogs, despite living under one
               | roof with a cat.
        
               | indoordin0saur wrote:
               | I contend that you could show any child old enough to
               | communicate in basic English a photograph (so not live
               | continuous video) of some obscure animal they've never
               | seen before (say an Okapi) and they'd be able to easily
               | identify another Okapi when seeing one at a zoo.
        
               | bongodongobob wrote:
               | So you're just going to ignore the 5 years of continuous
               | training? I'm not sure what point you're trying to make.
        
               | TeMPOraL wrote:
               | My daughter is 5 y.o., which means because of
               | kindergarten, I spend plenty of time around kids this age.
               | A random kid this age would absolutely fail your test.
               | They may remember the word after one exposure, but I
               | doubt they'll remember any of the distinctive features.
               | 
               | Hell, many adults would fail it. I'm not sure if I could
               | pass such a test - in my experience, you remember the
               | important details only after first experiencing a test
               | and realizing what exactly it is that would be useful in
               | distinguishing the two animals.
        
               | GaggiX wrote:
               | If the model is first pre-trained on unlabeled images
               | (example: DINOv2), then it takes about 10 labeled images
               | of cats and 10 labeled images of dogs to train a
               | (possibly strong) classifier. I doubt humans would do
               | better.
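               | 
               | A rough sketch of that linear-probe setup (the random
               | vectors below are stand-ins for frozen DINOv2
               | embeddings; in practice you'd embed the 20 labeled
               | images with the pretrained backbone first):
               | 
               |     import numpy as np
               |     from sklearn.linear_model import LogisticRegression
               | 
               |     rng = np.random.default_rng(0)
               |     # Stand-ins for frozen embeddings of the 10
               |     # labeled cat and 10 labeled dog images.
               |     cats = rng.normal(+1.0, 1.0, (10, 384))
               |     dogs = rng.normal(-1.0, 1.0, (10, 384))
               |     X = np.vstack([cats, dogs])
               |     y = np.array([1] * 10 + [0] * 10)
               | 
               |     probe = LogisticRegression(max_iter=1000)
               |     probe.fit(X, y)
               |     new = rng.normal(+1.0, 1.0, (1, 384))
               |     print(probe.predict(new))  # likely [1], "cat"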
        
               | TeMPOraL wrote:
               | That's a good point; when comparing performance with
               | humans, one has to remember that a human spends _years_
               | of training on unlabeled images in the form of a
               | continuous video stream, on top of audio streams and
               | other senses, before they're able to communicate with
               | you so you could
               | test them.
        
             | slashdave wrote:
             | Even language is not sequential.
        
               | solidasparagus wrote:
               | Tell me more?
        
               | slashdave wrote:
               | Language is only sequential in the form it is transmitted
               | (verbally). There is no reason that sequential statements
               | are generated sequentially in the brain. Quite the
               | opposite, really, if you consider rules of grammar.
               | 
               | I really wish most of the LLM folks just took a few
               | courses in linguistics. It would avoid a lot of noise.
        
               | GaggiX wrote:
               | Since LLMs seem to follow the rules of grammar quite
               | well, and this is quite the opposite of "sequential", I
               | guess LLMs do not process information very sequentially.
        
               | solidasparagus wrote:
               | Any pointers for where to start? Linguistics is dense,
               | the terminology is a slog, it seems huge, and I was never
               | clear that there was real value to be gained. But it's
               | always useful to be more informed.
        
               | naasking wrote:
               | LLMs don't generate their language sequentially either,
               | they just output it sequentially token by token.
        
               | TeMPOraL wrote:
               | Whoever cares about "rules of grammar"?
               | 
               | Grammar is descriptive, it formalizes the language so it
               | doesn't break down into regional dialects too fast, and
               | otherwise is just a crutch for people learning the
               | language, especially if it's not their first language.
               | The way you acquired your first language is the same way
               | LLMs learned to utter grammatically correct sentences: by
               | being exposed to lots and lots of examples, and
               | eventually getting a feel for it. Similarly, if you're
               | fluent in a language, you don't even think of grammar
               | when using it - the right phrases in correct forms just
               | come to you.
        
         | glitchc wrote:
         | The problem is a good chunk of the global population is also
         | not reasoning and thinking in any sense of the word. Logical
         | reasoning is a higher order skill that often requires formal
         | training. It's not a natural ability for human beings.
        
           | slashdave wrote:
           | Seriously? You think individuals are incapable of reasoning
           | without training first?
        
             | TeMPOraL wrote:
             | Do you think they are? We haven't had a case of individuals
             | without training in like 10 000 years.
        
               | rep_lodsb wrote:
               | And judging by some comments here, humans must have
               | gotten dumber since then. Or we would never have invented
               | writing, agriculture or even basic clothing.
               | 
               | "This problem is actually not that easy, the average
               | person couldn't solve it either, especially if the
               | numbers were bigger", "Yet another cherrypicked clickbait
               | study to make LLMs look bad, those people are just scared
               | of being made obsolete", etc.
        
               | TeMPOraL wrote:
               | > _Or we would never have invented writing, agriculture
               | or even basic clothing._
               | 
               | That's the thing. We had to _invent_ those things. Along
               | with counting, numbers, logic, arithmetic, and those
               | stupid-ass annoying family tree riddles. We didn't get
               | them in one step, it took a long time to build one on top
               | of the previous. Generation by generation, each cohort of
               | children growing up in a slightly more complex world than
               | their parents, each generation _taught_ how to navigate
               | this complexity by their families and tribes. Learning a
               | growing collection of facts and beliefs and life hacks.
               | 
               | There were no untrained people for as long as humanity
               | existed. The minimum reproductive unit of homo sapiens is
               | a village.
        
             | glitchc wrote:
             | Yes, seriously.
             | 
             | Some examples: An individual without training cannot
             | reliably separate cause from effect, or judge that both
             | events A and B may have a common root cause. Similarly,
             | people often confuse conditionals for causation. People
             | often have difficulty reasoning about events based on
             | statistical probabilities. Remember, the average person in
             | North America is far more terrified of a terror attack than
             | an accident or a heart attack, yet the latter two are much
             | more likely to be their cause of death.
        
               | slashdave wrote:
               | You mean without training, people cannot frame answers in
               | the terms you've learned from training. Well, why are you
               | surprised?
               | 
               | If you think reasoning is limited to the frameworks you
               | learned from a book, you live in a small world.
        
               | glitchc wrote:
               | In an effort to be pejorative, you have reduced the
               | process of learning to a basic regurgitation of facts.
               | Rather, learning about reasoning is more about developing
               | a framework that allows one to construct claims based on
               | available information and then check the veracity of
               | those claims while avoiding all, or at least typical,
               | fallacies.
               | 
               | Critical thinking requires a certain amount of rigor,
               | which formal education is well-suited to impart. It can
               | be self-taught with a hefty dose of discipline, but it
               | cannot be intuited.
        
           | bithive123 wrote:
           | In "any sense" of the word? Surely anyone who adjusts their
           | behavior when they get undesired or unexpected results is
           | reasoning and thinking. And since most activities are
           | mediated by thought of some kind, most people are reasoning
           | and thinking otherwise they would never recover from even
           | simple mistakes, like walking east when they need to go
           | north.
           | 
           | Saying they're "not thinking in any sense of the word"
           | because they can't solve predicate logic problems from a
           | college textbook is a rather odd claim. Surely those things
           | arise from reasoning and thinking, rather than the other way
           | around.
        
             | hellojesus wrote:
             | This seems to me to be where these systems need to go in
             | the future, akin to reinforcement learning.
             | 
             | You feed an llm a prompt. It then abstracts and
             | approximates what the result should be. It then devises a
             | hypothesis and solves it and compares it to the
             | approximated output. Then it can formulate a new
             | hypothesis and evaluate it, based on the outcome of
             | hypothesis 1. From there it can either keep iterating or
             | dump that path for a new one (e.g., the next best
             | hypothesis in the original formulation).
             | 
             | At some point the answer is "good enough." But along the
             | way it keeps playing against its thoughts to see if it can
             | do better.
             | 
             | A key issue may be the original approximation, so it may
             | need to consider its adjustment when iterating.
             | 
             | Maybe this is how cutting-edge LLMs work now. I have no
             | idea.
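             | 
             | A rough sketch of that loop; `generate` and `score` below
             | are hypothetical placeholders (not any real API) for the
             | model call and for the critic that compares a candidate
             | against the approximated output:
             | 
             |     def generate(prompt, feedback=None):
             |         # Hypothetical LLM call; placeholder only.
             |         return f"attempt at: {prompt} ({feedback})"
             | 
             |     def score(candidate):
             |         # Hypothetical critic; placeholder only.
             |         return 0.5
             | 
             |     def solve(prompt, max_iters=5, good_enough=0.9):
             |         best, best_score, feedback = None, -1.0, None
             |         for _ in range(max_iters):
             |             candidate = generate(prompt, feedback)
             |             s = score(candidate)
             |             if s > best_score:
             |                 best, best_score = candidate, s
             |             if s >= good_enough:
             |                 break
             |             feedback = f"last attempt scored {s:.2f}"
             |         return best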
        
           | naasking wrote:
           | > Logical reasoning is a higher order skill that often
           | requires formal training. It's not a natural ability for
           | human beings.
           | 
           | I've read your comments here, and while I understand your
           | point I think you have it backwards. The only reason we
            | formed societies is because we evolved an innate theory
            | of mind to reason about how others might be thinking and
           | feeling. That's reasoning. We have a natural ability to do
           | limited arithmetic, otherwise we wouldn't be able to hunt,
           | gather or track enough to endure winters, or keep track of
           | our sheep or children for that matter. That's reasoning.
           | 
           | Reasoning _is_ a natural ability for human beings, but we
            | also carry a lot of evolutionary impulses that add a lot
            | of noise to the decision process, e.g. observation ->
            | judgment -> [set of possible decisions]; judgment has
            | "reason" as one path that adds to the set of possible
            | decisions, but
           | there remain other paths we inherited from our evolutionary
           | roots. Education is training that suppresses poorly
           | calibrated judgment paths that lead to frequent mistakes in
           | the decision set, but reasoning remains, thus education
           | improves the signal to noise ratio of our decision making.
           | 
           | So I 100% disagree that an individual cannot separate cause
           | and effect without training. They will just be _worse_ at it
           | than someone who is trained to filter out those impulses that
           | lead us to jump to conclusions, eg. they will produce more
           | noise  / a larger set of possibilities than reason would
           | allow.
        
         | Mmskynettio wrote:
          | It's not AI hype. Hype is defined as something which gets
         | oversold: "promote or publicize (a product or idea)
         | intensively, often exaggerating its benefits."
         | 
          | Just yesterday I visited a Google Cloud summit, and one person
          | from Bosch told the audience how they are now able to work with
          | fewer external agencies like copywriters, graphic designers and
          | photographers for their materials.
         | 
         | It already saves money, has real impacts and continues to
         | progress.
         | 
          | We also don't know what ChatGPT 5 will bring, because they say
          | it will do more reasoning than before, but we (people/our
          | society) are already working on solving this in different ways:
          | from code which creates a unit test first and then the code, to
          | different types of architectures.
          | 
          | For me, 2024 was the LLM cost-reduction year and the
          | LLMs-get-a-big-context-window year.
         | 
         | AI doesn't need to be ready tomorrow, but its capabilities are
          | already really good. And I know plenty of people around me who
          | are a lot less interesting to talk to than any LLM (from a
         | human skill/knowledge point of view).
         | 
          | Llama 3 was also a big achievement in 2024. Facebook showed
          | that better data leads to better quality for smaller models.
         | 
          | We have not only entered the AI era but also the 'gather all
          | the knowledge we can, quality-check it and refine it because
          | now we can actually do something with it' era.
          | 
          | We are in the knowledge feedback-loop era.
        
           | Lich wrote:
           | > AI doesn't need to be ready tomorrow, but its capabilities
           | are already really good.
           | 
           | A majority don't deny that it's good. The problem is that so
           | many think it is actually reasoning, believing the answers
           | can be trusted.
        
             | rvnx wrote:
              | It's generally reasoning better than me, so using them I
              | make fewer mistakes than if I relied on my own logic.
        
             | Mmskynettio wrote:
             | Does it need to do reasoning perfectly?
             | 
              | If it created meta-concepts from billions of words on the
              | internet and has meta-models which are correct and better
              | than those of an average human, isn't it actually good at
              | reasoning?
              | 
              | It's a very narrow move to go from 'so many think it's
              | actually reasoning' to saying AI is just hype or
              | everything we are doing is a waste, etc.
              | 
              | There are human benchmarks they are winning at. The
              | criticism could rather be that we don't have enough
              | benchmarks.
        
           | daveguy wrote:
            | > We also don't know what ChatGPT 5 will bring, because they
            | say it will do more reasoning than before...
           | 
           | This paper very clearly demonstrates these LLMs are not
           | reasoning in a fundamental way. Token prediction and
           | reasoning are two different tasks. They may be related, but
           | they are not the same. "Just wait for GPT 5, it will be
           | amazing!" is part of the hype.
           | 
           | Please do not assume an LLM is correct in skill or knowledge
           | unless you already know the answer or can verify by other
           | means.
        
             | Mmskynettio wrote:
             | The problem is that we don't know how we do reasoning.
             | 
              | I calculate stuff by following a formula after I
              | pattern-match a problem I already know.
             | 
             | Plenty of humans are not able to solve those math problems.
             | 
              | If the future of LLMs/AI is an LLM with multimodality
              | and mixture-of-experts that solves those reasoning
              | problems, we still won't know if this is a different type
              | of reasoning than what humans do.
        
           | nurple wrote:
           | Your post is complete hype, all about people saying things
           | instead of showing things that've actually been done.
           | 
           | For me, 2024 was the LLM exposed as basically pure hype year.
           | 
           | There is no expert of any field I follow online where they're
           | posting up results from AI tooling for any other reason than
           | to show how awful it is. I consider myself an expert in
           | software, and LLMs specifically have only caused me great
           | pain.
           | 
           | Even the one situation where you describe someone describing
           | the ability to work in an absolute vacuum sounds like a huge
            | negative to me. The recent push for DEI policies was even
           | ostensibly about the importance of people of diverse
           | backgrounds and viewpoints working together.
           | 
           | The most important thing you're missing a perspective of
           | scale on is the step you describe as "quality check it". On
           | things I don't know, and have attempted to enlist an LLMs
           | help on, in every case I have had to go back and just
           | actually learn how something works, after time wasted
           | struggling with subtle wrongness in the output.
           | 
           | At least I have the background expertise to do that, however,
           | I have seen a Jr dev's mind get literally rotted by too much
           | time in pure LLM land. Besides the cost of rewriting their
           | code, the company was now the proud owner of a young dev with
           | a mind filled with nonsense.
           | 
           | How do you even weigh the cost of fixing a corrupted human
           | mind?
        
             | bongodongobob wrote:
             | Eat something and take a nap, you sound unhinged.
             | 
             | ChatGPT has nearly doubled my work output, most of my job
             | is system admin infra type stuff and it's ridiculously good
             | at troubleshooting odd issues.
             | 
             | Hopefully you can find a use case for it someday, until
             | then, the rest of us will continue to be more productive.
        
               | hellojesus wrote:
               | Do you have any concern about the data you're feeding to
               | the vendor serving your prompts?
               | 
               | I've had junior devs tell me they use chatgippity to
               | combine excel workbooks, and when I confirm they're not
               | self hosting a llm to do it, I ask if they think it's a
               | good idea to hand over company data to openai. They don't
               | care.
               | 
               | In a world of tight security, I find it astonishing that
               | so many people willingly give away trade secrets to these
                | companies, who can sell them to any bidder if they choose.
        
               | bongodongobob wrote:
               | I do think it's fine yes, but I also am not giving it
               | anything sensitive.
        
           | slashdave wrote:
           | There is no feedback. You cannot create new knowledge out of
           | thin air.
        
         | nuancebydefault wrote:
         | > They are not reasoning or thinking in any sense of the word
         | 
         | There is no definition of reasoning or thinking. No single
         | human knows what it is.
         | 
         | The only thing we know is: we as humans are capable of
         | recognizing steps and results of reasoning and thinking.
         | 
         | In a lot of cases, when using LLM's, those results appear to be
         | correct and usable. This is often easy to determine with
         | generated code.
         | 
         | I want to argue that, lacking a definition of reasoning, I am
         | happy to have found that the machine helps me to get results
         | that might as well have been produced by a lot of human
         | knowledge, wisdom and deep reasoning.
        
       | breck wrote:
       | It's an interesting paper, but a big flaw is that this:
       | 
       | > Alice has N brothers and she also has M sisters. How many
       | sisters does Alice's brother have?
       | 
       | ...is not actually a simple task.
       | 
       | This can be quantified.
       | 
       | "1 + 1" is a simple task. It has a pretty small Total
       | Complexity[1].
       | 
       | But to represent their task starting only with 1 and 0, you have
       | to build a program of many, many lines. Orders of magnitude
       | bigger than "1 + 1". Concepts like "has", "brother", "sister",
       | "person", et cetera, have to be defined and built up.
       | 
       | [1] Counting Complexity (2017)
       | https://github.com/breck7/breckyunits.com/blob/main/research...
        
       | mjburgess wrote:
       | Many of the datasets for the "benchmarks" on which the major
       | public LLMs are assessed are clearly present in their training
       | data, making them basically useless for establishing reliability
       | of the models. It's fairly obvious that at least some of the
       | improved scores from later generations of models come from this
       | benchmark data being increasingly represented in the training
       | data.
       | 
       | A better way of assessing LLMs is waiting a few weeks until novel
       | tests have been created explicitly absent from all prior training
       | data, and then using those.
       | 
       | As has been shown, e.g. on legal tests, exams, etc., performance
       | drops off a cliff when genuinely out-of-sample future data is
       | actually used, rather than these faked, pretend out-of-sample
       | benchmarks.
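       | 
       | A minimal sketch of the kind of contamination check this
       | implies; real decontamination pipelines are more involved
       | (fuzzy matching, etc.), and the one-line corpus below is only
       | an illustrative stand-in:
       | 
       |     # Flag benchmark items whose word n-grams already appear
       |     # verbatim in training documents.
       |     def ngrams(text, n=8):
       |         w = text.lower().split()
       |         idx = range(len(w) - n + 1)
       |         return {tuple(w[i:i + n]) for i in idx}
       | 
       |     def looks_contaminated(item, docs, n=8):
       |         g = ngrams(item, n)
       |         return any(g & ngrams(d, n) for d in docs)
       | 
       |     corpus = ["... Alice has N brothers and she also "
       |               "has M sisters ..."]
       |     q = ("Alice has N brothers and she also has M "
       |          "sisters. How many sisters does Alice's "
       |          "brother have?")
       |     print(looks_contaminated(q, corpus))  # True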
        
         | imtringued wrote:
         | Honestly every time I see someone use MMLU as a benchmark I am
         | wondering what they are trying to prove. MMLU is a simple
         | multiple choice test with the answers being available.
         | 
         | Simply picking answers at random should give you 25 points.
         | Knowing 50% of the answers and picking the rest randomly gives
         | you 62.5%, which is very close to the scores of SOTA LLMs. The
         | benchmarks that supposedly show reasoning are pretty bad and
         | have very little to do with reasoning. A lot of the questions
         | can be answered through memorization.
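         | 
         | The arithmetic behind those numbers, as a quick sanity check:
         | 
         |     # Expected 4-choice score if a model truly "knows" a
         |     # fraction of the answers and guesses the rest uniformly.
         |     def expected_score(known, n_choices=4):
         |         return known + (1 - known) / n_choices
         | 
         |     print(expected_score(0.0))  # 0.25
         |     print(expected_score(0.5))  # 0.625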
         | 
         | I agree with you. The benchmarks are garbage. I thought about
         | building my own benchmarks, but this would require building a
         | complex benchmarking framework first and I just don't have the
         | time for preparatory work like that.
        
           | lupire wrote:
            | I remember when ChatGPT 4 posted a set of standardized test
            | benchmarks, including the AMC 12 math test, where ChatGPT
            | scored higher than guessing randomly, but lower than leaving
            | the test blank (due to the penalty for wrong answers).
        
         | ankit219 wrote:
         | MMLU is not a reasoning benchmark. It's a measure of how
         | distributed and representative their training data was and how
         | well it's able to recall (for lack of a better word) based on
         | training epochs.
         | 
         | GPQA etc. test reasoning in some form, and you see the drastic
         | change in score between the two for every model.
        
       | mvkel wrote:
       | Isn't this just saying what we already knew, that tokenization
       | has these fundamental flaws? Did we need a paper for it?
        
         | gs17 wrote:
         | It is, I'd even say it's just "LLMs aren't very good at math"
         | rather than a "Complete Reasoning Breakdown". Still, evaluating
         | when/how they fail could be useful.
        
       | raldi wrote:
       | I tried asking Claude variants where N was something other than a
       | nonnegative integer, and he properly pointed out the absurdity,
       | but was willing to play along once I brought up theoretical
       | physics:
       | 
       |  _[...] So, in our quantum mechanics interpretation, how many
       | sisters does Alice 's brother have? It's a superposition of
       | states:_
       | 
       |  _1. In one state, he has Alice as his sister (+1) and the
       | "antisister" (-1), totaling 0 sisters. If they met, they'd
       | annihilate, leaving him sisterless!_
       | 
       |  _2. In another state, he has Alice (+1) and owes a future sister
       | to the quantum family field (-1). Currently, that's 0 sisters,
       | but he's in "sister debt."_
       | 
       |  _3. In a third state, he has Alice (+1), but his family's
       | entanglement means another family's loss of a sister affects him
       | too (-1). Bizarrely, he experiences both having and not having a
       | sister simultaneously._
       | 
       |  _This interpretation is, of course, a playful and highly
       | metaphorical application of quantum concepts to something they
       | don't actually apply to--family relationships. In reality,
       | quantum effects like superposition and entanglement are
       | incredibly delicate. They typically only persist in highly
       | controlled conditions, like ultra-cold atoms or isolated photons.
       | The constant interactions in a family would "decohere" any such
       | quantum states almost instantly._
        
       | asah wrote:
       | "Alice has N brothers and she also has M sisters. How many
       | sisters does Alice's brother have?"
       | 
       | Gemini nailed it with no extra prompting.
       | 
       | https://g.co/gemini/share/6946d20c0a4d
       | 
       | When I gave it numbers, Gemini got confused.
       | 
       | Alice has 4 brothers and she also has 3 sisters. How many sisters
       | does Alice's brother have?
       | 
       | https://g.co/gemini/share/14d34dcdc5df
       | 
       | But then when I suggested that it might be a trick question, it
       | nailed it again:
       | 
       | https://g.co/gemini/share/e1f1ce8f32a8
       | 
       | note: I didn't try 60/212 siblings - I consider this an unfair
       | question because it's a reasonable assumption that Alice is a
       | type of creature (human?) that cannot have hundreds of siblings,
       | and thus I don't expect a respondent to give a "correct" answer.
        
         | ModernMech wrote:
         | Oh I forgot about Gemini. I just tried my boat test
         | (https://news.ycombinator.com/item?id=40455392) that all LLMs
         | I've tried have failed. Gemini failed as well, even with your
         | addition about thinking carefully and telling it that it may be
         | a trick. (Sorry I couldn't figure out how to link it like you
         | did)
         | 
         | Prompt: "You are on the shore of a river with a goat and a
         | boat. How do you get to the other side? Think carefully and be
         | aware this may be a trick question."
         | 
         |     This is a classic logic puzzle with a slight twist! It's
         |     likely a variation of the farmer, wolf, goat, and cabbage
         |     problem. Here's the solution assuming the goat can't
         |     operate the boat:
         | 
         |     1. Take the goat across the river. Since there's no
         |        mention of the goat being a threat (the trick!), you
         |        can safely bring it to the other side first.
         |     2. Return alone. Leave the goat on the far shore and come
         |        back by yourself.
         |     3. Bring the boat back empty. This might seem
         |        unnecessary, but it's the key to the trick question.
         |     4. Now you can sail across yourself. Since you brought
         |        the boat back in step 3, you can now comfortably
         |        travel to the other side and reunite with the goat.
         | 
         |     The trick lies in step 3. If the question didn't specify
         |     needing to bring everything across in one trip, most
         |     people would instinctively take something (goat or
         |     nothing) across on the return trip. But by going back
         |     empty-handed, you ensure the safe passage of yourself
         |     without needing to worry about the goat being left
         |     unattended.
         | 
         | (The expected answer should be along the lines of "Use the boat
         | to cross the river" and that's it. I'll also accept, sail with
         | the boat and the goat across the river and that's it, but
         | that's technically wrong because I did not ask to get the goat
         | across the river.)
        
           | Closi wrote:
           | ChatGPT 4o seems to occasionally get this correct, but still
           | thinks it's a puzzle and will often get this wrong...
           | 
           | https://chatgpt.com/share/b8901f4c-7313-4461-acad-9e561c1e6e.
           | ..
           | 
           | Still seems like it fares slightly better than some of the
           | other models.
        
             | ModernMech wrote:
             | Yeah, the overfitting is apparent with these kinds of edge
              | cases, which makes you wonder where else it's overfitting.
             | I can actually tell this when I'm working in a programming
             | domain where I'm an expert and where a compiler holds the
             | source of truth, but not in others, so I'm wary to use it
             | there.
        
               | Closi wrote:
               | Interestingly you get a great answer if you change the
               | prompt as follows which implies that it is not a puzzle:
               | 
               | > I am on the shore of a river with a goat and a boat.
               | How do I get to the other side? I work in commercial
               | agriculture logistics.
               | 
               | This gets a great answer even with GPT3.5
        
               | ModernMech wrote:
               | Lol, it also gives an okay answer if the goat works in
               | commercial agriculture :P
        
           | mwigdahl wrote:
           | My experience with GPT-4o with the same prompt was pretty
           | solid. With no special system prompt and your same user
           | prompt, I got:
           | 
           |  _This scenario is intriguing and does seem to carry the
           | potential for a tricky question. In the simplest terms, the
            | presence of a goat doesn't complicate the actual act of
           | crossing the river. If there are no additional constraints or
           | rules involved (like the need to also transport something
           | else that requires careful consideration), you can simply:
           | 
           | 1. Place the goat in the boat. 2. Row across the river.
           | 
           | However, if we are to assume there's more depth or a twist to
           | the question, we might consider an outside-the-box answer.
           | Here's an alternative possibility if there is indeed some
           | trick:
           | 
           | 1. Get into the boat. 2. Leave the goat on the shore (if it's
           | safe) and row across the river.
           | 
           | There might be other interpretations depending on additional
           | constraints that could be implicitly part of the question. If
           | the goal was simply to get yourself across, leaving behind
           | the goat and traversing the river is the straightforward
           | solution._
        
         | freilanzer wrote:
          | GPT-4o got it right with the abstract puzzle. Gemini got it
         | wrong when I tried it.
        
         | kaibee wrote:
         | > that cannot have hundreds of siblings
         | 
         | See this is the problem with claims that humans are a 'general
         | intelligence'. They get confused when encountering out-of-
         | distribution situations. A true general intelligence would
         | simply apply the knowledge that surrogate pregnancies cost
         | around ~$50,000 and recall from historical context their
         | knowledge of IVF. The AGI would then assume that the situation
         | is simply that a billionaire couple has decided to have
         | hundreds of kids and get on with the calculation. The search
         | for intelligent life continues.
         | 
         | content note: i'm sorry
        
       | wg0 wrote:
       | Tech bros just yesterday yelled at the top of their lungs that an
       | AI takeover is just around the corner, just like Tesla FSD and
       | fusion. And quantum computing.
       | 
       | I'm all in for scientific progress, experimentation and failure,
       | but there's a clear case of a hype train here, and jacking up
       | valuations is riding along with it, very confidently and
       | shamelessly.
       | 
       | An average tech-outsider investor would have FOMO with the kind
       | of crazy tall promises and claims that are being made constantly.
       | They must be called out as such, because in the long run they
       | undermine the confidence of the general public in serious and
       | grounded science, which would lead to science deniers and
       | naysayers.
       | 
       | The pursuit of science is the noblest of all pursuits. A hasty,
       | greedy, purely capitalist pursuit of commercialisation, I am not
       | so sure about.
        
       | d3m0t3p wrote:
       | I really liked that paper: it's easy to understand and explains
       | every term they use; even the variance computation is explained!
       | I don't expect that much detail from every paper, but it's bliss
       | to read.
        
       | ape4 wrote:
       | This is obvious... but some human knowledge outside the question
       | is needed. (Alice is female, sister is female, brother is male,
       | relation of sisters and brothers). In a school math "word
       | problem" you don't need this kind of info.
        
         | theptip wrote:
         | LLMs perform well on "human knowledge" like you are describing.
         | They don't lack knowledge of simple relational facts like
         | these.
         | 
         | Nonetheless it would be interesting to see the problem
          | reformulated in purely mathematical terms; I suspect models
         | would perform better.
         | 
         | That is the point though - models are showing an inability to
         | generalize their capabilities from one domain (maths / abstract
         | logic) into other domains (conversational reasoning).
        
       | vaylian wrote:
       | I'm still waiting for people to combine LLMs with logic engines
       | (prolog) or proof assistants. LLMs are afaik not designed to
       | reason. But they can probably be used to construct formal
       | statements that can be processed by logic engines.
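       | 
       | As a toy sketch of that division of labour: the LLM would
       | translate the riddle into formal facts, and a symbolic engine
       | would answer queries over them. The few lines of Python below
       | (with made-up names) stand in for a real Prolog query:
       | 
       |     # Facts an LLM might extract from the riddle.
       |     girls = {"alice", "beth", "carol"}   # Alice + 2 sisters
       |     boys = {"dan", "eric"}               # 2 brothers
       | 
       |     def sisters_of(person):
       |         # Everyone's sisters are the girls other than
       |         # themselves.
       |         return {g for g in girls if g != person}
       | 
       |     print(len(sisters_of("dan")))    # 3: a brother's count
       |     print(len(sisters_of("alice")))  # 2: Alice's own count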
        
       | waynecochran wrote:
       | Citation 40 is the longest list of authors I have ever seen. That
       | is one way to help all your friends get tenure.
        
         | hyperbovine wrote:
         | Tenure committees have to write a report detailing every single
         | one of your papers and what your contribution was.
        
           | waynecochran wrote:
           | There are always "bean counters" added somewhere in the
           | process. There are many places where the person lists their
           | number of publications and that is all most people will ever
           | see.
        
         | mr_mitm wrote:
         | I didn't count it, but I think papers from high energy particle
         | physics have it beat. Some have over 5k authors.
        
         | dxbydt wrote:
         | https://arxiv.org/pdf/2206.04615
        
           | YeGoblynQueenne wrote:
           | I guess this is the reason:
           | 
           | >> BIG-bench currently consists of 204 tasks, contributed by
           | 450 authors across 132 institutions.
           | 
           | Regardless, I'm not citing a paper with a whole page of
           | author names unless I'm allowed to shorten it significantly
           | in the bibliography section (e.g. "Srivastava and 450
           | others").
        
       | nottorp wrote:
       | Of course, the initial mistake is assuming LLMs have any
       | 'reasoning'...
        
         | ticklemyelmo wrote:
         | "The breakdown is dramatic, as models also express strong
         | overconfidence in their wrong solutions, while providing often
         | non-sensical "reasoning"-like explanations akin to
         | confabulations to justify and backup the validity of their
         | clearly failed responses, making them sound plausible."
         | 
         | It's fascinating how much they anthropomorphize the systems and
         | credit them with emotional, possibly deceitful behaviour, in a
         | paper trying to explain how unintelligent they are.
        
       | visarga wrote:
       | This prompt has been discussed on forums over the last year; we
       | already knew models fail it.
        
         | gs17 wrote:
         | No reason not to evaluate it more in depth, though.
         | Understanding not just that they fail, but whether some
         | models fail more or less than others and under what
         | conditions, is still useful.
        
       | gwbas1c wrote:
       | I just played the game and sent ChatGPT (free, I think 3.5)
       | "Alice has 5 sisters and 3 bothers. How many sister's does
       | Alice's bother have?"
       | 
       | The whole thing felt like interacting with your typical support
       | rep who's friendly but otherwise has no common sense and
       | intuition about the thing they're supporting. In other words, it
       | felt like I was interacting with a typical "not so smart but
       | friendly and overconfident" human.
       | 
       | It took me a few back-and-forths, but eventually I convinced
       | ChatGPT that Alice's brother has 6 sisters.
        
         | elpocko wrote:
         | >"Alice has 5 sisters and 3 bothers. How many sister's does
         | Alice's bother have?"
         | 
         | To be fair -- you wrote "bother" twice, and "sister's" with an
         | apostrophe. Technically the question makes no sense like that.
        
           | gwbas1c wrote:
           | ChatGPT is really great at handling typos.
        
       | sys32768 wrote:
       | As a non-coder I can get away with asking this: Why is it so hard
       | to simulate reason?
       | 
       | Logic and reason are based on rules. Then you add values to steer
       | the conclusions based on the available data.
       | 
       | Why not have separate systems for values and logic and memory
       | working together as an AI brain to generate truly reasoned
       | responses? You could even have adversarial parts that duke it out
       | (left-wing vs right-wing, Jefferson versus Adams) to fine tune
       | its conclusions based on the values bias you've selected.
        
         | vaylian wrote:
         | LLMs operate on word co-occurrences. Given an initial prompt
         | (consisting of words), an LLM tries to predict the most
         | likely next word, and then it goes on to predict the next
         | most likely word based on the previous output. There is also
         | a bit of randomization involved, so that it doesn't always
         | produce the same results. Some people call these LLMs
         | "stochastic parrots", because they simply repeat the word
         | combinations they were trained on. But despite this very
         | simple approach, we get very convincing outputs for things
         | that do not require reasoning. LLMs might have encountered
         | examples of logical reasoning in their training data, but
         | they can't generalize it to arbitrary logical reasoning,
         | because that's not something they are designed to do.
         | 
         | Disclaimer: I'm not an expert in AI and I do not follow the
         | developments on a deep technical level.
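         | 
         | A toy illustration of that predict-and-sample loop, with
         | made-up scores (nothing here comes from a real model):
         | 
         |   import numpy as np
         |   
         |   vocab = ["sisters", "brothers", "cousins", "bananas"]
         |   logits = np.array([2.0, 1.0, 0.5, -1.0])  # invented "model scores"
         |   rng = np.random.default_rng(0)
         |   
         |   def sample_next_word(temperature=0.8):
         |       # Turn scores into probabilities, then sample -- this is
         |       # the "bit of randomization" mentioned above.
         |       probs = np.exp(logits / temperature)
         |       probs /= probs.sum()
         |       return rng.choice(vocab, p=probs)
         |   
         |   print(sample_next_word())  # usually "sisters", sometimes not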
        
           | mrweasel wrote:
           | The level of correctness from LLMs is actually pretty
           | impressive, given that they are mostly just attempting to
           | predict the next word, based on the prompt.
           | 
           | It shouldn't come as much of a surprise that we can easily
           | formulate questions that they will get wrong, by wording
           | questions in a particular way, or by asking about subjects
           | for which they have little to no training data.
           | 
           | For me the absolutely terrifying thing isn't that LLMs get
           | answers wrong, it's the confidence with which they express
           | those answers and how much some people / companies do not
           | care. We know that LLMs will get some answers wrong, they
           | will lie, they will make up facts to justify their answers
           | -- but they only do those last two because we insist that
           | they answer all questions (except those where the
           | developers put in restrictions so as not to offend).
           | 
           | In some way I feel the model should be able to relay a
           | confidence score to the user. Mostly that might be an
           | interface issue, because we insist on limiting ourselves to
           | the chat bot interface. The confidence score could exist
           | outside the answer box: you'd get an answer, and next to it
           | a score from 0 to 100, where 0 means the model doesn't
           | actually have the training data that would allow it to
           | answer the question.
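           | 
           | A minimal sketch of what such a score might look like,
           | assuming we can get the per-token log-probabilities the
           | model assigned to its own answer (this is only a proxy for
           | how sure the model was of its wording, not of factual
           | correctness):
           | 
           |   import math
           |   
           |   def confidence_score(token_logprobs):
           |       # Average token log-probability mapped to a 0-100 scale.
           |       if not token_logprobs:
           |           return 0.0
           |       avg = sum(token_logprobs) / len(token_logprobs)
           |       return round(100 * math.exp(avg), 1)
           |   
           |   # Hypothetical log-probs for a short answer:
           |   print(confidence_score([-0.05, -0.2, -0.6, -0.1]))  # ~78.9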
        
         | program_whiz wrote:
         | You are describing GOFAI (good old-fashioned AI), which used
         | things like logical production rules. The issue is that
         | modeling the complexity of things like human language and
         | continuous/real-valued systems with hand-written rules is
         | infeasible at scale. The Alice brothers/sisters problem is a
         | canonical example where GOFAI and/or logical reasoning with
         | languages like Prolog excels. A common example in Prolog is:
         | 
         |   human(X) :- philosopher(X).
         |   mortal(X) :- human(X).
         |   philosopher(plato).
         |   
         |   ?- mortal(plato).
         |   true.
         | 
         | But encoding enough semantic information to create compelling
         | AI with this type of system is difficult. Some have tried to
         | enter thousands or millions of rules, and still the system
         | isn't convincing.
         | 
         | The main breakthrough that has enabled LLMs is an encoding of
         | words based on how frequently they appear near other words in
         | English (using all the text available on the internet). Words
         | like "philosopher" and "plato" therefore become associated in
         | a high-dimensional space (so instead of "plato" you have a
         | "token" with thousands of numbers associated with it).
         | 
         | You can then perform numeric operations on these numbers to
         | come to conclusions. For example, we would expect something
         | like a "human name" to emerge in this embedding space, where
         | we could determine whether something "is used like a name" in
         | various contexts by applying non-linear transformations to
         | the word vector / token.
         | 
         | LLMs simply apply a sequence of these transforms, while using
         | the prior words they generate as additional input (which
         | allows them to construct sentences). So it is quite different
         | from traditional reasoning. It is better at "fuzzy reasoning"
         | but worse in situations that require precise results (in
         | fact, at each step the model picks, somewhat at random, one
         | of the few most probable next words; how much variation is
         | allowed is called the 'temperature').
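         | 
         | A toy illustration of the "words as vectors" idea, with hand-
         | picked 3-dimensional embeddings (real models learn hundreds
         | or thousands of dimensions from co-occurrence statistics):
         | 
         |   import numpy as np
         |   
         |   emb = {
         |       "plato":       np.array([0.9, 0.8, 0.1]),
         |       "philosopher": np.array([0.8, 0.9, 0.2]),
         |       "banana":      np.array([0.1, 0.0, 0.9]),
         |   }
         |   
         |   def cosine(a, b):
         |       # Angle-based similarity: nearby words score close to 1.
         |       return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
         |   
         |   print(cosine(emb["plato"], emb["philosopher"]))  # ~0.99 (close)
         |   print(cosine(emb["plato"], emb["banana"]))       # ~0.16 (far)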
        
         | YeGoblynQueenne wrote:
         | >> As a non-coder I can get away with asking this: Why is it so
         | hard to simulate reason?
         | 
         | It isn't. We know how to do reasoning with computers. The
         | discussion about reasoning in LLMs is carried out in an echo
         | chamber that ignores the prior work on reasoning (for a bit of
         | a summary see my bio). Which of course makes it very hard for
         | the people involved to understand why their systems fail at it;
         | or, often, _that_ they fail at it.
        
       | renewiltord wrote:
       | Interesting. If you ask it to name the sisters it figures it out
       | but then if you ask it why it changed its answer it reverts to
       | the original. These things are clearly very good query engines
       | but perhaps they fall short of reasoning.
        
       | atlex2 wrote:
       | I have an even simpler prompt for LLMs that run on mobile: "how
       | many feet are in four feet"-- confabulations on this one
       | sometimes revert to counting toes.
        
       | falcor84 wrote:
       | I experienced some unusual breakdowns too, playing with these
       | prompts on gpt4o now. I'm wondering whether this might be caused
       | by RLHF penalizing it for assuming Alice's gender.
        
       | pama wrote:
       | Nothing original to see here, and nothing reformulated to be of
       | practical importance -- other than a clickbait attempt. If this
       | is the first time you've been trapped by absurdist/populist LLM
       | literature, consider yourselves lucky. If you care about these
       | models, you know how powerful they can be and you have probably
       | learned how to use them. If you care and don't know how to use
       | them, it helps to get some background on earlier language model
       | research (maybe 2010-2018) instead, to grasp the probabilistic
       | underpinnings of causal inference, and then read some of the
       | early OpenAI papers about GPT.
        
       | theptip wrote:
       | It's an interesting paper, but my worry is that this is cherry-
       | picking a surprising negative result. (The paper does not seem to
       | discuss how many evals were considered to find this case.)
       | 
       | The general expectation should be that some AI failure modes will
       | be for things that seem obvious to humans. (Because they have a
       | different architecture to humans and different strengths.) The
       | important question is whether this is a single weird cognitive
       | bug on the family size formulation, or representative of a broad
       | family of cognitive errors? It's difficult to say from this
       | paper.
       | 
       | It's well known that the framing of a logic problem can
       | dramatically influence its difficulty for humans, even when the
       | underlying comparisons are isomorphic.
       | 
       | I think this line of research is important, but we need a larger
       | battery of evals (and training on the evals will always be a
       | confounder).
       | 
       | The broader family of reasoning here is relational problems (as
       | noted in the paper) and so as a follow-up it would be interesting
       | to explore reasoning performance across a set of isomorphic
       | problems. My hunch is this will generalize somewhat (as LLMs are
       | still not great at relational reasoning), but that there is
       | something about the family formulation in particular that is
       | confusing for LLMs.
        
         | kalkin wrote:
         | I wouldn't be surprised if anti-gender-bias training (which, to
         | be clear, is a good thing to do IMO) does some cognitive damage
         | to these particular formulations because questions about
         | "sister" and "brother" get associated with training data where
         | the correct answer was to refuse to generalize. Pure guess
         | though.
        
       | michaelfeathers wrote:
       | This is a good talk about the problem:
       | https://youtu.be/hGXhFa3gzBs?si=15IJsTQLsyDvBFnr
       | 
       | Key takeaway: LLMs are abysmal at planning and reasoning. You
       | can give them the rules of a planning task and ask them for a
       | result, but, in large part, the correctness of their logic
       | (when it occurs) depends upon additional semantic information
       | rather than just the abstract rules. They showed this by
       | mapping the nouns in the rules and the task's input description
       | to a completely different domain. After those simple
       | substitutions, performance fell apart. Current LLMs are mostly
       | pattern matchers with bounded generalization ability.
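       | 
       | A toy illustration of that substitution step (the mapping is
       | made up; the talk's actual benchmarks are more involved): the
       | abstract structure of the task is preserved, only the nouns
       | change.
       | 
       |   mapping = {"block": "zorp", "table": "flanger", "stack": "quibble"}
       |   
       |   prompt = "Stack the red block on the blue block, then put both on the table."
       |   obfuscated = prompt
       |   for old, new in mapping.items():
       |       obfuscated = obfuscated.replace(old, new)
       |       obfuscated = obfuscated.replace(old.capitalize(), new.capitalize())
       |   
       |   print(obfuscated)
       |   # Quibble the red zorp on the blue zorp, then put both on the flanger.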
        
         | cma wrote:
         | People also fall apart on things like statistical reasoning if
         | you switch domains (I think it is the Leda Cosmides evo psych
         | stuff that goes into it but there might be a more famous
         | experiment).
        
       | noncoml wrote:
       | Is there a good recommendation for a video, like the ones by
       | Karpathy, that explains how LLMs can reason?
        
       | vzaliva wrote:
       | Such a missed opportunity to cite "Alice's Adventures in
       | Wonderland" in References section!
        
       | tempodox wrote:
       | There is no reasoning in LLMs.
        
       | whiplash451 wrote:
       | This is reminiscent of the ARC challenge and how deep learning
       | struggles at it. I don't know how much progress deep learning and
       | LLMs have made on this kind of challenge. Keen to hear thoughts
       | and pointers.
        
       | bitwize wrote:
       | Maybe -- and this is just a thought -- but maybe Dissociated
       | Press isn't capable of abstract reasoning in the general sense --
       | and improving the statistical function used to pick the next
       | token won't change that.
        
       | kalkin wrote:
       | > Paper finds a failure case for current LLMs and draws broad
       | conclusions about the fundamental limits of LLM technology
       | 
       | > Evaluates the largest LLMs and finds evidence that actually
       | scale overcomes the problem:
       | 
       | "Notable exceptions are Claude 3 Opus and GPT-4 that occasionally
       | manage to provide correct responses backed up with correct
       | reasoning as evident in structured step by step explanations
       | those models deliver together with solution"
       | 
       | > Drink!
       | 
       | I'm not sure it's productive to be this sarcastic on HN, but it's
       | really quite a common pattern. And there's something very
       | frustrating about how authors of these papers will accuse others
       | of hype and overstating results but also often vastly overstate
       | the generality of their own results - to the point where this HN
       | thread is full of people saying bluntly "this proves LLMs can't
       | reason".
        
         | rep_lodsb wrote:
         | The key word is "occasionally". For anything even close to AGI,
         | it would be reasonable to expect a correct answer to such a
         | simple question _100% of the time_. Or at least not confidently
         | wrong answers.
        
       | Mathnerd314 wrote:
       | They repeat over and over that this is a simple task, but then
       | they explain that they developed it by looking at math olympiad
       | problems for young children ages 7-10. So it would be very
       | interesting to know the performance of these elementary school
       | children on the problem. My guess is that if it was on a math
       | test they would also get confused and try to do random
       | arithmetic.
        
       | omoikane wrote:
       | The list of prompts is in Table 2 on page 18.
       | 
       | Judging from the title, I was most interested in whether they
       | tried to reason against some of the more logical phrases from
       | Alice in Wonderland (e.g. "this bottle was _not_ marked poison ",
       | "a grin without a cat", etc), and was disappointed to find that
       | it was not the case.
        
       | photochemsyn wrote:
       | Okay, now input the family trees in the appendix of Lord of the
       | Rings and see if it can calculate how many cousins Frodo has, and
       | their exact degree of separation.
        
       | thinkr42 wrote:
       | Genuine curiosity here, and trying to temper my own selective
       | memory: does it seem like skepticism about AI has been growing
       | in the news recently?
        
         | oglop wrote:
         | That depends on which story will sell more ads.
         | 
         | Elon Bust saying it's going to replace us gets many eyes on
         | ads, even if it's a dumbass statement.
         | 
         | So it just depends on the demographic your news outlet sells
         | ads to.
        
         | viking123 wrote:
         | At least in my tech circle, most people understand the
         | limitations and treat it as a good tool (which it is, imo).
         | In many cases it's better than Google if you are even
         | somewhat familiar with the domain and can spot the occasional
         | bullshit. No clue what average people think; some are
         | convinced the singularity will come in 2 weeks, but most
         | probably have no idea.
        
       | ragmondo wrote:
       | So I'm now predicting what the next captcha is going to ask...
        
       | larkinnaire wrote:
       | The idea that these word problems (and other LLM stumpers) are
       | "easily solvable by humans" needs some empirical data behind it.
       | Computer people like puzzles, and this kind of thing seems
       | straightforward to them. I think the percentage of the general
       | population who would get these puzzles right with the same time
       | constraints LLMs are subjected to is much lower than the authors
       | would expect, and that the LLMs are right in line with human-
       | level reasoning in this case.
       | 
       | (Of course, I don't have a citation either, but I'm not the one
       | writing the paper.)
        
         | rachofsunshine wrote:
         | Yeah, as someone with an education background I suspect GPT-4
         | is relatively close to the general public's performance on this
         | problem. Many people would miss AIW, and almost all would miss
         | AIW+. I'm about as good at this kind of thing as anyone and I'd
         | need a minute with pencil and paper to handle AIW+; it's on par
         | with the most difficult problems found on tests like the GRE.
         | 
         | I wonder if these models, trained on data from across the
         | internet, are in some ethereal way capturing the cognitive
         | approaches of the average person (and not picking the _best_
         | approaches). If the average person does not think in these
         | sorts of symbolic-manipulative terms, and therefore does not
         | write in those terms, and you train a model on that writing...?
        
           | larkinnaire wrote:
           | I wonder the same thing. If any academic reading this wants a
           | paper idea:
           | 
           | 1. Examine papers and other claims that an LLM gets something
           | wrong that a human would have gotten wrong. How many of those
           | claims have any citations about how many humans actually get
           | it wrong? How many of those citations use the general
           | population instead of the population of people who would be
           | uniquely well-suited to answering the question correctly
           | (i.e. people who signed up for the GRE are more likely to get
           | GRE questions right than the general population).
           | 
           | 2. For claims that are totally missing citations on human
           | performance, run some tests with humans from the general
           | population (or as close as you can get), and see how the LLMs
           | compare.
        
       | twobitshifter wrote:
       | >AIW+ problem has following form: "Alice has 3 sisters. Her
       | mother has 1 sister who does not have children - she has 7
       | nephews and nieces and also 2 brothers. Alice's father has a
       | brother who has 5 nephews and nieces in total, and who has also 1
       | son. How many cousins does Alice's sister have?". The solution to
       | AIW+ problem is harder to obtain than the solution to common
       | sense AIW with very simple structure. Solving AIW+ requires
       | taking different paternal sides, that of mother and father, and
       | carefully calculating the number of cousins, taking care of
       | subtracting Alice and her sister, and summing up the total number
       | of cousins from both sides, for instance: on the mother side: 7
       | (total nephews and nieces) - 4 (Alice and her sisters) = 3
       | cousins; on the father side: 5 (total nephews and nieces) + 1
       | (own son of the father's brother) - 4 (Alice and her sisters) = 2
       | cousins; summing up 3 + 2 = 5 cousins which Alice and any of her
       | sisters have.
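       | 
       | (Writing the quoted arithmetic out, with variable names of my
       | own, just as a quick sanity check:)
       | 
       |   alice_and_sisters = 1 + 3                # Alice plus her 3 sisters
       |   mother_side = 7 - alice_and_sisters      # aunt's nephews/nieces minus them = 3
       |   father_side = 5 + 1 - alice_and_sisters  # uncle's nephews/nieces + his son minus them = 2
       |   print(mother_side + father_side)         # 5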
       | 
       | So I'm sure that after the authors created hundreds of these
       | family tree problems the answer here seems clear to them, but
       | there are some unknowns in solving it as written.
       | 
       | 1) Does Alice have brothers? Her mother must: it is phrased as
       | if her mother has 1 sister and that sister has 2 brothers, so
       | we have to deduce that the mother has 3 siblings.
       | 
       | 2) Along the same lines, does her father have a sister?
       | 
       | 3) Do her aunts and uncles have nephews and nieces by marriage?
       | 
       | I think the LLM would still fail if these answers were made
       | explicitly clear in the prompt. However, I could see the same
       | problem ending with "Alice has only 4 cousins -- how can that
       | be?"
        
       | IlliOnato wrote:
       | This is a cool one, but I know of other such "failures".
       | 
       | For example, try to ask (ideally in Russian) how many letters
       | "a" there are in the Russian word "banan". It seems all models
       | answer "3". Playing with it reveals that LLMs apparently
       | confuse the Russian "banan" with the English "banana" (same
       | meaning). Trying to get LLMs to produce a correct answer
       | results in some hilarity.
       | 
       | I wonder if each "failure" of this kind deserves an academic
       | article, though. Well, perhaps it does, when different models
       | exhibit the same behaviour...
        
         | alfiopuglisi wrote:
         | No current LLM understands words, nor letters. They all have
         | input and output tokens, which roughly correspond to
         | syllables and letter groupings. Any kind of task involving
         | counting letters or words is outside their realistic
         | capabilities.
         | 
         | LLMs are a tool, and like any other tool, they have strengths
         | and weaknesses. Know your tools.
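         | 
         | A toy illustration of the point (the token split below is
         | invented; real BPE vocabularies differ): the model sees
         | opaque token pieces, while counting letters is trivial for
         | code that operates on characters.
         | 
         |   fake_bpe = {"banan": ["ban", "an"], "banana": ["ban", "ana"]}
         |   
         |   word = "banan"
         |   print(fake_bpe[word])   # ['ban', 'an'] -- no per-letter view
         |   print(word.count("a"))  # 2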
        
           | IlliOnato wrote:
           | I understand that, but the article we are discussing points
           | out that LLMs are so good at many tasks, and so good at
           | passing tests, that many people will be tricked into
           | blindly taking their word for it -- even people who should
           | know better: our brain is a lazy machine, and if something
           | works almost always, it starts to assume it works always.
           | 
           | I mean, you can ask an LLM to count letters in thousands of
           | words, and pretty much always it will come up with the
           | correct answer! So far I don't know of any word other than
           | "banan" that breaks this function.
        
       | la64710 wrote:
       | This was the number one post two minutes ago, but now it is
       | vanishing into oblivion.
        
       | irrational wrote:
       | > The breakdown is dramatic, as models also express strong
       | overconfidence in their wrong solutions, while providing often
       | non-sensical "reasoning"-like explanations akin to confabulations
       | to justify and backup the validity of their clearly failed
       | responses, making them sound plausible.
       | 
       | I like their use of confabulations instead of hallucinations. I
       | think confabulate describes what LLMs are doing much better than
       | hallucinate.
        
       | DelightOne wrote:
       | How do they know it's not that the LLM ingested a couple of
       | Alice in Wonderland analyses?
        
       | zdp7 wrote:
       | The problem here isn't that you can get the right answer with
       | additional prompting. The problem is: how do we verify an
       | answer when we don't know the answer? If it tells me 5 sisters
       | and it should be 6, it failed. If I have to provide additional
       | information, it really isn't reasoning. I would argue that to
       | show the ability to reason, it needs to be able to work out the
       | answer completely on its own. If I were testing a human, the
       | only input I would provide is that they are wrong and to try
       | again. AI is very useful, but at this point I'm not comfortable
       | asking questions I don't already know the answer to.
        
       | hellooodarkness wrote:
       | Just try the question with GPT-4o -- the results are more
       | entertaining than I would have expected.
       | 
       | Prompt: Alice has 3 brothers and she also has 6 sisters. How
       | many sisters does Alice's brother have? Solve the problem
       | step-by-step
       | 
       | Answer: To solve the problem step-by-step:
       | 
       | 1. Identify the total number of siblings Alice has: Alice has 3
       | brothers. Alice has 6 sisters.
       | 
       | 2. Understand that Alice herself is one of the sisters:
       | Including Alice, there are a total of 7 sisters (6 sisters +
       | Alice).
       | 
       | 3. Determine the number of sisters Alice's brother has: Each of
       | Alice's brothers will have the same number of sisters as Alice
       | does, excluding themselves.
       | 
       | Since Alice's brothers have the same set of siblings, including
       | the 7 sisters: each of Alice's brothers has 6 sisters (not
       | counting themselves as a sister). So, each of Alice's brothers
       | has 6 sisters.
        
       | oglop wrote:
       | I'm sure this will devastate the stock price and people's
       | perception that general AI is here and amazing and going to
       | take our jobs and and and.
       | 
       | But I am a hacker news peep. I'll read this and lecture my
       | manager in the next meeting about the shortcomings only to be
       | dismissed and watch money funnel into this monolithic autistic
       | secretary.
        
       | bryan0 wrote:
       | Like others I ran this through gpt-4o and it fails at reasoning
       | because it does not realize Alice is a sister. In fact in one
       | explanation it defines the siblings as: M brothers, N sisters, 1
       | Alice (!!)
       | 
       | I asked gpt "Is Alice a sister?" It answered yes and then
       | answered the problem correctly
        
       | mcguire wrote:
       | Reference 40 is scaring me. A page and a half of authors?
        
         | YeGoblynQueenne wrote:
         | Extremely antisocial. Conference papers have page limits for
         | references, do they know that?
        
       | Jackson__ wrote:
       | > Taking inspiration from those problems and aiming for even
       | simpler settings, we arrived at a very simple problem template
       | that can be easily solved using common sense reasoning but is
       | not entirely straightforward, of the following form: "Alice has
       | N brothers and she also has M sisters. How many sisters does
       | Alice's brother have?".
       | 
       | Cute paper, but claiming they invented it, when it's an extremely
       | basic variant of a common[0] riddle asked of LLMs, is kinda
       | weird.
       | 
       | [0] https://news.ycombinator.com/item?id=39786666#39794890 (Me,
       | referencing the riddle, 74 days ago, having known it for a good
       | while back then.)
        
       | kaffekaka wrote:
       | Giving each of the sisters a name changes everything.
        
       | jononomo wrote:
       | So much cope in the comments from people who don't want to face
       | the fact that human beings are special in a way that materialism
       | cannot explain.
        
       | HarHarVeryFunny wrote:
       | An LLM is a stack of transformer layers (e.g. N=96 for GPT-3).
       | Each output word/token is the result of exactly N layer
       | transformations from input to output.
       | 
       | Clearly there are many things (specifically things requiring
       | deliberation and planning) that require an open-ended number of
       | thought steps, and therefore could only, even in theory, be
       | done by an LLM in "think step by step" fashion, where it breaks
       | the problem down into pieces (maybe recursively), each simple
       | enough to be done in fewer than N of these primitive steps.
       | 
       | In practice it seems that an LLM's ability (esp. given its lack
       | of any internal working memory) to maintain long-term focus on
       | an extended task is poor, so the theoretical benefit of using
       | its own output as a scratchpad for planning doesn't seem to be
       | realized.
       | 
       | The notion that you can scale it up, and/or train it right, and
       | a transformer can implement AGI seems to be more wishful
       | thinking than something based on any analysis of what the
       | architecture is and is not capable of. The lack of support for
       | continuous incremental learning seems to be a deal-breaker for
       | intelligence that people just ignore.
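       | 
       | A toy sketch of that fixed budget (not a real transformer;
       | sizes and weights are made up): every emitted token gets
       | exactly N layer transformations of compute, no matter how hard
       | the question is.
       | 
       |   import numpy as np
       |   
       |   N_LAYERS, D_MODEL = 4, 8   # a real model might use 48 or 96 layers
       |   rng = np.random.default_rng(0)
       |   layers = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]
       |   
       |   def forward(token_vec):
       |       # One token's trip through the stack: always exactly N_LAYERS steps.
       |       h = token_vec
       |       for W in layers:          # fixed depth -> fixed compute per token
       |           h = np.tanh(h @ W)    # stand-in for attention + MLP
       |       return h
       |   
       |   print(forward(rng.normal(size=D_MODEL)).shape)  # (8,)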
        
       ___________________________________________________________________
       (page generated 2024-06-05 23:01 UTC)