[HN Gopher] Simple tasks showing reasoning breakdown in state-of...
___________________________________________________________________
Simple tasks showing reasoning breakdown in state-of-the-art LLMs
Author : tosh
Score : 290 points
Date : 2024-06-05 14:13 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| foobarian wrote:
| Prompt examples from linked repo. https://github.com/LAION-
| AI/AIW/blob/main/prompts/prompts.js...
| Closi wrote:
| Question is: "Alice has 60 brothers and she also has 212 sisters.
| How many sisters does Alice's brother have?" (nb: I have added
| numbers, it's phrased as X and N in the paper)
|
| I must confess, when I tried to answer the question I got it
| wrong...! (I feel silly). I only realised I got it wrong when I
| plugged it into GPT-4o and it came back with the correct answer:
|
| https://chatgpt.com/share/6eb5fa36-e0fd-4417-87d1-64caf06c34...
|
| Worth noting that the prompts from the experiment include "To
| answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following
| format that contains final answer: ### Answer:" so it appears
| that they are stopping the models from 'thinking out loud'. If I
| add that to the prompt, GPT4o gets it consistently wrong...
|
| https://chatgpt.com/share/7e6a7201-dd2b-43c6-8427-76e5b003ca...
|
| Also worth noting that there are more complex examples where
| GPT4o seems to fall down such as:
|
| > Alice has 3 sisters. Her mother has 1 sister who does not have
| children - she has 7 nephews and nieces and also 2 brothers.
| Alice's father has a brother who has 5 nephews and nieces in
| total, and who has also 1 son. How many cousins does Alice's
| sister have?
|
| However I can't honestly say that this is THAT simple or that
| most people would get this right...
| bamboozled wrote:
| I got it in 5 seconds, am I the singularity?
| r2_pilot wrote:
| Do you sleep, get sick, require more than compute
| hardware/storage/networking/electricity?
| felipemnoa wrote:
| We all are, but only in meat-space. We continue to improve
| ourselves much faster than evolution ever could.
|
| But what we are talking about here is the singularity in
| tech-space.
| bamboozled wrote:
| I don't see such a distinction between technology and
| us. We build, drive and continue to evolve this tech.
| It's an extension of us inspired by how our own brains
| work.
| toxik wrote:
| Seems like an odd thing not to state prominently. ChatGPT 4o
| solved it every time for me.
|
| Seems to work OK even with the "hands tied behind back" prompt.
| Wonder if this paper is just inaccurate or if OpenAI adjusted
| the model -- seems unlikely.
|
| Actually, 4o is reported to solve this riddle 60% of the time.
| I guess I have to test more, but in my roughly dozen tests so
| far, all were correct.
| Miraste wrote:
| The "only output the number" prompt was one of three they
| used. The others were one with no instructions beyond the
| question, and one that explicitly told the model to show its
| reasoning and check for errors.
| michael-ax wrote:
| Alice has N Brothers, and she has M sisters. How many sisters
| do Alice's brothers have?
|
| I have not gotten the correct answer to the question as
| phrased above in one go from Gpt4o yet! (and today was not
| the first day i tried.)
|
| Phrase it as shown above and you'll likely need 5 or more
| interactions to get it to generate the correct output. With
| Gemini i could not get it below 8 without feeling like i was
| cheating. fwiw.
| lupire wrote:
| ChatGPT 4o. I was being a bit generous with background
| information, but it still tests the ability to interpret:
|
| ------
|
| Me:
|
| Background facts:
|
| Alice is a female human.
|
| All sisters are female, and all brothers are male. No one
| is their own brother or sister.
|
| Alice has N brothers, and Alice has M sisters.
|
| Now, a few questions based on these facts: How many sisters
| do Alice's brothers have?
|
| Do Alice's brothers have more sisters, fewer sisters, or
| same number of sisters than Alice has?
|
| Don't forget to consider Alice when counting.
|
| ------
|
| ChatGPT: Let's break down the questions step by step:
|
| How many sisters do Alice's brothers have? Alice has M
| sisters. Since all sisters are female and Alice is included
| in the total count of sisters for her brothers, Alice's
| brothers have M+1 sisters. This includes Alice herself and
| her M sisters.
|
| Do Alice's brothers have more sisters, fewer sisters, or
| the same number of sisters as Alice has? Alice has M
| sisters. Alice's brothers have M+1 sisters (since it
| includes Alice herself and her M sisters). Therefore,
| Alice's brothers have more sisters than Alice has.
|
| In conclusion: Alice's brothers have M+1 sisters. Alice's
| brothers have more sisters than Alice has.
| skydhash wrote:
| You: (In Prolog-like pseudocode)
|     female(alice)
|     sister_of(X, Y)  :- !eq(X, Y), female(X), family_of(X, Y)
|     brother_of(X, Y) :- !eq(X, Y), male(X), family_of(X, Y)
|     [M] :- sister_of(M, alice)
|     [N] :- brother_of(N, alice)
|     [A] :- any([N], B), sister_of(A, B)
|     count([A])?  gt([A], [M])?  eq([A], [M])?  lt([A], [M])?
|
| ---
|
| I don't know the exact encoding and decoding mechanism
| that ChatGPT 4o has, but I'm pretty sure all the basic
| facts and rules are already encoded by the models. And you
| conveniently added the rules that encode the puzzle
| itself.
| YeGoblynQueenne wrote:
| >> Don't forget to consider Alice when counting.
|
| a.k.a. "don't forget to give the LLM the answer when
| prompting".
| GPerson wrote:
| There must be a name for the new phenomenon your post
| exemplifies: 1. Someone expresses that an LLM cannot do
| some trivial task. 2. Another person declares that they cannot
| do the task either, thereby defending the legitimacy of the LLM.
|
| As a side note, I cannot believe that the average person who
| can navigate to a chatgpt prompter would fail to correctly
| answer this question given sufficient motivation to do so.
| FeepingCreature wrote:
| Well, why does it need a name? It just seems like a "correct
| argument". If somebody is claiming that AI is "not all that"
| because it can make stupid mistakes, surely it's relevant
| that humans also make stupid mistakes, so AI making them does
| not necessarily pose a hindrance to it becoming a human-tier
| reasoner.
| GPerson wrote:
| It's worth giving names to interesting phenomena as it
| helps people to identify and understand them. I find this
| phenomenon interesting because a person who wants to
| defend the LLM from apparent criticism is probably more
| likely to exhibit it.
| cratermoon wrote:
| There's also the phenomenon of papers showing ChatGPT getting
| it wrong, then people posting anecdotal examples of it
| getting it right when they try it, but that one already has a
| couple of names: WORKSFORME or Unable to Reproduce
|
| Hey folks, ever considered the possibility that
| unreproducibility is not a _good_ thing?
| yoyohello13 wrote:
| Many people, especially on this site, really want LLMs to be
| everything the hype train says and more. Some have literally
| staked their future on it so they get defensive when people
| bring up that maybe LLMs aren't a replacement for human
| cognition.
|
| The number of times I've heard "but did you try model X" or
| "humans hallucinate too" or "but LLMs don't get sleep or get
| sick" is hilarious.
| indoordin0saur wrote:
| Yes. Seems like some users here experience true despair
| when you suggest that the LLM approach might have a hard
| limit that means LLMs will be useful but never
| revolutionary.
| viking123 wrote:
| The comment that humans are sometimes wrong too is
| absolutely ridiculous, and I always think about the self-
| driving car incident in China where the car mistook an
| advertisement for a car above the road for a real car and hit
| the brakes, causing a chain crash. Even the shittiest driver
| would never do that.
| Karellen wrote:
| You could call it the "Compose a symphony fallacy", after the
| scene in _I, Robot_ (2004) where Spooner asks Sonny if he can
| compose a symphony, to which Sonny replies "Can you?"
| gweinberg wrote:
| It's not just that the LLM fails, it's the kind of failures it
| makes that illustrate how far what the LLM does is from human
| thinking. Like the problem someone posted where you have a
| boat and a goat and you want to cross the river, and the
| LLM's solution is cross the river with the goat, come back
| with an empty boat, and then cross again with an empty boat.
| A human might miss the insight that Alice's brother has one
| more sister than Alice, namely Alice herself. A human would
| never think the number of sisters her brother has depends on
| the number of brothers she has. Not if they have any clue at all
| what "sister" and "brother" mean.
| eranation wrote:
| Great observation. It would be really interesting to repeat
| this research with less limiting prompts. I assume they made
| the restriction to make it easier to parse the answers;
| perhaps they could make it two-phase: let the models respond
| without limits, then follow up with a prompt asking them to
| compress the response to a single answer. I wonder how the
| results would vary.
| FeepingCreature wrote:
| To be quite honest, I assume they made the restriction so
| that the models would fail.
|
| This sort of paper is becoming a genre.
| detourdog wrote:
| It is a proof of weakness in the current system. This makes
| sense and births new hypotheses.
| nyrikki wrote:
| You test models where they fail in any field.
|
| The orbit of Mercury to discover GR as an example.
|
| As all models are wrong, but some are useful, finding where
| they fail is how you figure out if they are useful.
|
| As the 'AGI is near' camp has won the hype game, it is
| important to ground expectations for practical exploitation
| of the technology.
|
| Overpromising and unabashed optimism are partly what caused
| the previous AI winters.
|
| As the formal proof methods of mathematics proved
| impractical, counterexamples and the scientific method are
| what CS has used for decades.
| qsi wrote:
| They used three different kinds of prompts with varying
| levels of restrictions, as described in the paper.
|
| To be quite honest, I assume you made your comment so that
| you could dismiss the paper without reading it.
| FeepingCreature wrote:
| That's a fair cop, I didn't read it.
|
| The thing is that "LLM reasoning breaks down" simply did
| not surprise me enough that I thought it was worth
| clicking. Making LLMs fail is not hard. They're
| interesting for the ways that they work, not the (many,
| many) ways that they don't.
|
| edit: I've had a look and I don't think any of their
| prompts are very good. They're certainly not how I'd
| write them if I wanted a current model to actually solve
| the problem.
|
| The way to make me take a paper like this seriously would
| be if you set it up as an adversarial collaboration with
| a competent prompter, and _that person agreed_ they
| couldn't make a generic prompt that solved the problem.
| "We tried three times and none worked" is not news, or at
| any rate not news about LLMs.
| pawelmurias wrote:
| When I added a " to the end of the prompt by accident I got a
| wrong answer.
| happypumpkin wrote:
| From the paper:
|
| > To account for the response variations due to various
| prompt forms, we created 3 distinct prompt types asking for
| the solution to the AIW problem: STANDARD, THINKING, and
| RESTRICTED. The STANDARD prompt type asks to solve the posed
| problem and output the final answer in the format as
| described above. This does not put any specific requirements
| on model behavior. The THINKING prompt type extends STANDARD
| with the request to think carefully and double check the
| solution for any mistakes
| oidar wrote:
| Of course it's going to give an incorrect answer with that
| prompt. If the instruction fine tuning is neutered like this
| prompt, it's going to roll over to the foundation model and
| offer a completion - probably more influenced by the seed than
| the prompting text. Bad study.
|
| Edit - I just skimmed the paper - they do use other more
| appropriate prompt types for reasoning. My initial response was
| based on the assumption that all prompts used that script
| prompt quoted in the parent. I retract my "bad paper" comment.
| YeGoblynQueenne wrote:
| >> My initial response was based on the assumption that all
| prompts used that script prompt quoted in the parent.
|
| You, and another 20 or so commenters here. We should really
| re-examine the guideline about asking people to RTFA.
|
| No offense meant- good on you for correcting your error.
| sdiepend wrote:
| The right answer depends on how Alice identifies I guess? :)
| Karellen wrote:
| > > Question is: "Alice has 60 brothers and _she_ also has
| 212 sisters. How many sisters does Alice's brother have? "
|
| > The right answer depends on how Alice identifies I guess?
| :)
|
| Given that the wording of the question specifically
| identifies Alice as "she", rather than using a gender-neutral
| pronoun or no pronoun at all, I think inferring that she
| identifies as female is reasonable.
| rep_lodsb wrote:
| Page 42 of the paper :)
|
| One thing that strikes me is that the model first tries using
| "inclusive language" in one answer - and literally states so,
| using this specific term - but seems to interpret it in a
| more mathematical sense (like set inclusion). Then seamlessly
| switches to the expected DEI spiel in the next paragraph.
|
| For one thing, it makes me suspect that something with the
| words "inclusive language" was automatically added to the
| prompt. But more interesting is how it responds to this
| demand in two different ways, illustrating a "thought
| process" that is very much unlike that of a human with normal
| verbal reasoning ability.
|
| I am not a psychologist, but remember reading that
| schizophrenic people sometimes confuse different meanings of
| words in a similar way, jumping from one meaning to another
| without noticing.
| llm_trw wrote:
| >Worth noting that the prompts from the experiment include "To
| answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following
| format that contains final answer: ### Answer:" so it appears
| that they are stopping the models from 'thinking out loud'. If
| I add that to the prompt, GPT4o gets it consistently wrong...
|
| Yes this is a common thing I see people who think LLMs are
| idiots do.
|
| The more an LLM talks the smarter it gets _because that's the
| only way it can compute anything_. Imagine saying that Turing
| machines fail the Church-Turing thesis because they can't solve
| 3-sat for N variables in N moves or less.
|
| That's what you're doing to an LLM when you ask it to be
| concise.
| rossdavidh wrote:
| Even if this were all true, it points to a fundamental risk
| of using LLM's for important tasks, which is that it is not
| at all clear to a user that this prompt would cause a
| problem. The LLM doesn't say "I'm sorry Dave, I just can't do
| that", it just complies with it and gets the wrong answer.
|
| You can always make excuses for the LLM afterwards, but
| software with hidden risks like this would not be considered
| good or reliable in any other context.
| saurik wrote:
| People really need to stop trying to model an LLM as some
| kind of magical software component: it all makes a lot more
| sense if you model it as an under-performing poorly-aligned
| employee; so like, maybe a distracted kid working for
| peanuts at your store. You wouldn't trust them with all
| of your money and you wouldn't trust them to do a lot of
| math--if they had to be in charge of checkout, you'd make
| sure they are only given a point-of-sale terminal and their
| main job was to, at best, scan the barcodes and compare the
| total--and yet there are tasks you can imagine handing to
| them that you'd never give to a robot or computer _even
| though they get it wrong a lot_, as not all tasks need to
| be handled perfectly, they still understand extremely fuzzy
| tasks, and they are probably cheaper than a qualified adult
| (certainly cheaper than one who is being paid enough to
| "give a shit" and pay enough attention to not let you get
| robbed or even put themselves at some risk for you).
| IlliOnato wrote:
| You have a point...
|
| I once gave a 10-dollar bill to a young man serving as
| the cashier at a store, and he gave me 14 dollars back as
| change. I pointed out that this made no sense. He bent
| down, looked closer at the screen of his machine, and
| said "Nope, 14 dollars, no mistake". I asked him if he
| thought I gave him 20. He said no, and even showed me the
| 10-dollar bill I just gave him. At that point I just gave
| up and took the money.
|
| Now that I think about it, there was an eerie similarity
| between this conversation and some of the dialogues I had
| with LLMs...
| chefandy wrote:
| While LLMs have incredible potential, and are even
| downright useful in their current format, they have the
| rather nasty tendency to confidently present bullshit that
| passes the smell test for most people. When it's wrong,
| it's not just _wrong_, by design, it's _wrong but sounds
| plausible._ Considering most people with subject matter
| expertise aren't going to consult such a bot for their own
| areas of expertise, that means most people will not be able
| to detect it intuitively.
|
| Good example: I sunk probably an hour into trying to get
| Gemini Advanced to help me integrate it with a personal
| Google Calendar account. I kept asking it things and going
| crazy because nothing lined up with the way things worked.
| Finally, it referred to itself as Bard and I realized it
| was giving me information for a different product. As soon
| as I asked "are you giving me instructions for Gemini
| Advanced or Bard?" it was like "OH LOL WOOPS!! YOU GOT ME
| BRO! XD I CAN'T DO ANY OF THAT! LOL." Which, honestly, is
| great. Being able to evaluate its answers to realize it's
| wrong is really neat. Unfortunately, that realization came
| too late, and only after manual prodding, to stop me from
| wasting a ton of time.
|
| I have decades of experience working in software -- imagine
| some rando who didn't know what the hell Bard was, or who
| couldn't imagine that this thing with "Advanced" in the name
| would fail to distinguish between its own and other products'
| documentation.
| floren wrote:
| > As soon as I asked "are you giving me instructions for
| Gemini Advanced or Bard?" it was like "OH LOL WOOPS!! YOU
| GOT ME BRO! XD I CAN'T DO ANY OF THAT! LOL." Which,
| honestly, is great. Being able to evaluate its answers to
| realize it's wrong is really neat.
|
| Did it evaluate its answers, or did your expression of
| doubt cause the eager-to-please language model to switch
| from "generate (wrong) instructions because that's what
| the user asked for" to "acknowledge an error because
| that's what the user asked for"?
|
| How many times have we seen "Oops, you're right! 2 + 2 is
| actually 5! I apologize for saying it was 4 earlier!"
| dematz wrote:
| I often want chatgpt to answer concisely and tell it that.
|
| If it really needs to do this 'thinking out loud', could it
| do that under the hood and not in the final output on my
| screen? Its first pass could use as many words as it wants to
| compute the answer, but once the answer is computed please go
| back and make it short.
|
| Not to take away from your point that maybe the prompt is the
| problem in these reasoning questions.
| marzell wrote:
| I believe the "thinking out loud" is fundamentally part of
| the process of "text completion" which is what it is doing.
| Certainly we can (and do) break things apart and add layers
| that could be used to effectively do this by adding more
| steps and processing time. But ultimately in a single turn,
| the entire conversation up to that point (including
| instructions you may have added telling it to not think out
| loud) is the input, and the output will reflect that.
| pawelmurias wrote:
| LLMs are idiots. They can't reason properly and only parrot
| stuff.
|
| https://chatgpt.com/share/dcb4ff4e-e8a2-463b-86ec-9caf10b6e6.
| ..
|
| Sometimes they get the answer right to something really
| complex because it fits a pattern, but sometimes they answer
| with something really really stupid.
| tsunamifury wrote:
| Why are so many people so insistent on saying this?
|
| I'm guessing you are in denial that we can make a simulated
| reasoning machine?
| hyperbovine wrote:
| Because they understand how LLMs work. It's not
| reasoning. It's not simulating reasoning.
| PheonixPharts wrote:
| People keep saying it because that's literally how LLMs
| work. They run Monte Carlo sampling over a very impressive
| latent linguistic space. These models are not
| fundamentally different from the Markov chains of yore
| except that these latent representations are incredibly
| powerful.
|
| We haven't even started to approach the largest problem
| which is moving beyond what is essentially a greedy token
| level search of this linguistic space. That is, we can't
| really pick an output that maximized the likelihood of
| the entire sequence, rather we're simply maximizing the
| likelihood of each part of the sequence.
|
| LLMs are not reasoning machines. They are basically
| semantic compression machines with a built-in search
| feature.
| pretendscholar wrote:
| The best compression is some form of understanding
| skydhash wrote:
| The best compression relies on understanding. What an LLM is,
| mostly, is data about how humans use words. We understand how to
| make this data (which is a compression of human text) and
| use it (generate something). AKA it's "production rules",
| but statistical.
|
| The only issue is ambiguity. What can be generated
| strongly depends on the order of the tokens. A slight
| variation can change the meaning and the result is
| worthless. Understanding is the guardrail against
| meaningless statements, and LLMs lack it.
| tsunamifury wrote:
| You seem to entirely miss how attention layers work...
| YeGoblynQueenne wrote:
| That's a fascinating insight and it sounds so true!
|
| Can you compress for me Van Gogh's Starry Night, please?
| I'd like to send a copy to my dear old mother who has
| never seen it. Please make sure when she decompresses the
| picture she misses none of the exquisite detail in that
| famous painting.
| pretendscholar wrote:
| Okay yes so not really having an artists vocabulary I
| couldn't compress it as well as someone who has a better
| understanding of Starry Night. An artist that understands
| what makes Starry Night great could create a work that
| evokes similar feelings and emotions. I know this because
| Van Gogh created many similar works playing with the same
| techniques, colors, and subjects such as Cypresses in
| Starry Night and Starry Night over the Rhone. He was
| clearly working from a concise set of ideas and
| techniques which I would argue is
| understanding/compression.
| YeGoblynQueenne wrote:
| Fine, but we were talking about compression, not about
| imitation, or inspiration, and not about creating "a work
| that evokes similar feelings and emotions". If I compress
| an image, what I get when I decompress it is that image,
| not "feelings and emotions", yes? In fact, that's kind of
| the whole point: I can send an image over the web and the
| receiver can form _their own_ feelings and emotions,
| without having to rely on mine.
| tsunamifury wrote:
| Simple reasoning is a side effect of compression. That is
| all.
|
| I see from your profile you are focused on your own
| personal and narrow definition of reasoning. But I'd
| argue there is a much broader and simpler definition: can
| you summarize and apply learnings? This can.
| foobiekr wrote:
| Exactly right and well said.
| tsunamifury wrote:
| This type of self affirmation has a quality of denial.
|
| Also the above description is reductive to the point of
| "Cars can't get you anywhere because they aren't horses."
| naasking wrote:
| > LLMs are not reasoning machines. They are basically
| semantic compression machines with a built-in search
| feature.
|
| This is just a god of the gaps argument. Understanding is
| a form of semantic compression. So you're saying we have
| a system that can learn and construct a database of
| semantic information, then search it and compose novel,
| structured and coherent semantic content to respond to an
| a priori unknown prompt. Sounds like a form of reasoning
| to me. Maybe it's a limited deeply flawed type of
| reasoning, not that human reason is perfect, but that
| doesn't support your contention that it's not reasoning
| at all.
| tsunamifury wrote:
| It's basically an argument that boils down to "it's not
| because I don't like it"
| jononomo wrote:
| I bite the bullet on the god of the gaps
| camdenreslink wrote:
| It is hard to trust any output from a machine that is
| confidently wrong so frequently. You need to already be
| knowledgable in a topic (or at least have a well attuned
| BS detector) to know if it is giving you correct
| responses. It can be a time saver and assistant in
| getting work done where you are already a subject matter
| expert, but it needs to get better to remove the human
| from the loop.
| imtringued wrote:
| No it is because supervised and self supervised learning
| happen to produce reasoning as a byproduct. For some
| reason people think that telling a model to recite a
| trillion tokens somehow will improve it beyond the
| recitation of those tokens. I mean, in theory you can
| select the training data so that it will learn what you
| want, but then again you are limited to what you taught
| it directly.
|
| The problem is that these models weren't trained to
| reason. For the task of reasoning, they are overfitting
| to the dataset. If you want a machine to reason, then
| build and train it to reason, don't train it to do
| something else and then expect it to do the thing you
| didn't train it for.
| TeMPOraL wrote:
| > _The problem is that these models weren 't trained to
| reason._
|
| Except they kind of were. Specifically, they were trained
| to predict next tokens based on text input, with the
| optimization function being, _does the result make sense
| to a human?_. That's embedded in the training data: it's
| not random strings, it's output of human reasoning, both
| basic and sophisticated. That's also what RLHF selects
| for later on. The models are indeed forced to simulate
| reasoning.
|
| > _don't train it to do something else and then expect
| it to do the thing you didn't train it for._
|
| That's the difference between AGI and specialized AI -
| AGI is _supposed to_ do the things you didn't train it
| to do.
| riku_iki wrote:
| > I'm guessing you are in denial that we can make a
| simulated reasoning machine?
|
| some people actually try, and see that LLMs are not there
| yet
| TeMPOraL wrote:
| There's some irony in seeing people parrot the argument
| that LLMs are parrots.
| tsunamifury wrote:
| Also making errors in reasoning while saying LLM errors
| prove it can't reason.
| elicksaur wrote:
| Maybe people have different experiences with the products
| than you.
|
| A simulated reasoning machine being _possible_ does not
| mean that current LLMs _are_ simulated thinking machines.
|
| Maybe you should try asking chatgpt for advice on how to
| understand other people's perspectives: https://chatgpt.c
| om/share/3d63c646-859b-4903-897e-9a0cb7e47b...
| tsunamifury wrote:
| This is such a weirdly preachy and belligerent take.
|
| Obviously that was implied in my statement. Dude, we
| aren't all 4-year-olds that need a self-righteous lesson.
| Miraste wrote:
| As it says in the paper, they already did the research with
| less limiting prompts. All models were tested with another
| prompt that gave only the question and no further
| instruction, and a third prompt that asked the model to
| consider its answer carefully.
|
| The correct response rate chart doesn't even use the results
| from the concise prompt.
| sosuke wrote:
| New option needed besides concise. Think quietly on your own,
| then answer concisely.
|
| Sometimes I think I'd prefer it to "think" before answering
| anyhow. The immediate thinking out loud text can be
| irritating for some irrational reason.
| monsieurbanana wrote:
| Just telling the LLM to preface its conclusion with
| "Answer:" in a parseable way would be better.
|
| You're still paying for those thinking tokens, or at the
| very least have to wait for them to be generated.
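|
| As a rough sketch of the parsing side (Python; the response
| string here is made up, just to illustrate the idea):
|
|     import re
|
|     def extract_answer(response: str):
|         # Take the last "### Answer: ..." line, so any
|         # thinking-out-loud text before it is ignored.
|         matches = re.findall(r"### Answer:\s*(.+)", response)
|         return matches[-1].strip() if matches else None
|
|     # Hypothetical model output; real text would come from the
|     # LLM call.
|     print(extract_answer("...reasoning...\n### Answer: 213"))  # 213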
| muglug wrote:
| I think you're wrong about that -- I just tried prompting
| ChatGPT 4o to show all its working before giving an answer.
|
| It was still incorrect, but when asked to show its working it
| formatted the answer prettily.
| PheonixPharts wrote:
| >_because that's the only way it can compute anything_
|
| I'm fairly certain we'll soon realize that what's happening
| here is that the markov chain being run over latent space
| needs a certain amount of "warmup" before it starts sampling
| from the optimal region. HMC samplers for Bayesian methods
| have this same property.
|
| The terms "reasoning", "computing" or "thinking" for this
| stage should be considered metaphors rather than explanations
| for what's happening, which is really waiting for a random
| walk to start sampling from the typical-set.
| J_Shelby_J wrote:
| > The more an LLM talks the smarter it gets
|
| I have a blog post coming on this topic, but yes, this is
| right.
|
| My method is to first get the LLM to answer the question, and
| THEN feed the answer back to the LLM to extract the answer using
| constraints + grammar/logit bias/regex to parse the answer.
| Previously, I constrained to a single true/false token, which
| worked, but fails on complex queries.
|
| So I split the decision making into a "justification"
| portion[0], and a "parsing" portion. I found that even
| crafting the prompt matters here, if you start with or end
| with, "It's very important to the response includes 'The
| answer is:'", then the model will lead with that response or
| only reply with that response. So I put it in the middle of
| the prompt, and end with a request to justify the
| response. As a result, most models will reason their way to
| the answer, and then end with 'The answer is:'.
|
| https://github.com/ShelbyJenkins/llm_client/blob/e3c4a860dda.
| ..
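|
| A minimal sketch of that two-pass flow (Python; complete() is
| a hypothetical stand-in for whatever LLM client you use, not
| the llm_client API, and the prompt wording is only
| illustrative):
|
|     def complete(prompt: str) -> str:
|         # Placeholder: send the prompt to your model and
|         # return its text completion.
|         raise NotImplementedError("wire up your LLM client here")
|
|     def answer_then_extract(question: str) -> str:
|         # Pass 1: let the model talk as much as it wants.
|         justification = complete(
|             question + "\nExplain your reasoning, then state "
|             "your conclusion as 'The answer is: ...'"
|         )
|         # Pass 2: ask only for the final value; grammar or
|         # logit-bias constraints could tighten this further.
|         return complete(
|             "Extract only the final answer from the text below, "
|             "with no other words:\n" + justification
|         ).strip()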
| chefandy wrote:
| > Yes this is a common thing I see people who think LLMs are
| idiots do.
|
| If you're among technologists discussing LLMs academically,
| as we are, that's a reasonable approach. However, I see a lot
| of people fail to distinguish that from LLM-powered _products_
| sold to the general public as intelligent bots that can
| _understand_ your plain english and output _answers_.
|
| People use their existing mental models when interacting with
| something. If you have 3 different interfaces with a widget
| to trigger the same exact function, but one look like a music
| play button, one looks like a gas pedal, and one looks like
| mechanical pinball plunger, we interact with those things
| differently because we know how those things work. In this
| context, chatbots are designed to engage people's existing
| mental model for chatting with a person via text. The further
| you stray from people's expectations of human chat, the
| more you surprise those users, for better or
| worse.
|
| If you're selling someone a product claiming it _understands_
| plain language questions and gives plain language answers,
| then not getting the right answer to that question makes it
| idiotic. The subtleties aren't within most users' grasp, and
| the "FYI: this thing might be full of shit" disclaimer isn't
| helpful if you don't know enough about what you're asking to
| administer a proper smell test.
|
| Your statements are obviously not wrong, but I see people
| saying these things like it's reasonable for non-technical end
| users to reason about those subtleties. Considering how those
| things are marketed, I really don't think it is.
| segfaultbuserr wrote:
| > _I must confess, when I tried to answer the question I got it
| wrong...! (I feel silly)._
|
| In programming there are two difficult problems - naming
| things, cache invalidation, and off-by-one errors.
| nerdjon wrote:
| Even "thinking out loud" we have seen these systems fall flat
| on their face and give very wrong answers.
|
| Thinking out loud also only gets you so far; if the expectation
| is a certain type of response, it can't always "think out loud".
| In reality that just proves it isn't really reasoning here and
| is more likely just self referencing.
|
| That being said, I tried this personally allowing it to think
| out loud and it told me she has 212 sisters. Using your exact
| prompt.
| Closi wrote:
| > Thinking out loud also only gets you so far, if the
| expectation is a certain type of response it can't always
| "think out loud". In reality that just proves it isn't really
| reasoning here and is more likely just self referencing.
|
| Try to calculate it without writing anything down, or
| thinking any numbers or words in your head.
|
| You can't draw a 1:1 analogue between an AI and the human
| experience, but remember that we have an internal stream of
| consciousness. Maybe the outputs of the LLM are more similar
| to the stream of consciousness in our heads rather than the
| words we say? After all, Humans also do lots of self
| referencing.
|
| > That being said, I tried this personally allowing it to
| think out loud and it told me she has 212 sisters. Using your
| exact prompt.
|
| Fair enough, but worst case it can often solve it correctly
| with the correct reasoning. GPT3.5 can't solve it correctly
| with correct reasoning, so we are at least appearing to be on
| a path where AIs can start to solve this question, albeit
| potentially not fully reliably.
| GrantMoyer wrote:
| Note that in the paper, all the numbers used were under 10.
|
| > AIW Variation 1, N=3,M=6,C=7
|
| > AIW Variation 2, N=4,M=2,C=3
|
| > AIW Variation 3, N=1,M=4,C=5
|
| > AIW Variation 4, N=4,M=1,C=2.
|
| Also note that the restricted prompt is only one of the prompt
| variations tested by the paper. It also explores common
| techniques to get LLMs to perform better, including "thinking
| out loud". Even with these methods the models still fail to
| produce a correct answer.
|
| > Model prompt types. It is well known that so-called prompt
| engineering can heavily influence the model behavior and model
| response quality [26, 27, 28]. To account for the response
| variations due to various prompt forms, we created 3 distinct
| prompt types asking for the solution to the AIW problem:
| STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type
| asks to solve the posed problem and output the final answer in
| the format as described above. This does not put any specific
| requirements on model behavior. The THINKING prompt type
| extends STANDARD with the request to think carefully and double
| check the solution for any mistakes. This should encourage
| model to invest more computation into obtaining the solution.
| In contrast to this, the RESTRICTED prompt urges the model to
| output only the final answer without any further text. This is
| supposed to restrict compute invested in producing output. We
| observe substantially shorter outputs across tested models
| compared to STANDARD and THINKING for this prompt type (Suppl.
| Fig. 13).
| freilanzer wrote:
| I just tried it with intermediate steps, via the total number
| of children Alice's mother has, and it worked. So, it seems that
| an intermediate logical step is too much for these models at
| the moment, at least in this puzzle where Alice is outside the
| number given in the description.
| YeGoblynQueenne wrote:
| >> I must confess, when I tried to answer the question I got it
| wrong...! (I feel silly). I only realised I got it wrong when I
| plugged it into GPT-4o and it came back with the correct
| answer:
|
| Remember that the authors of the paper did not find that GPT-4o
| cannot return the right answer. They found that it can't return
| the right answer _more often than ~60% of the time_. So you'd
| have to repeat the experiment many, many times and aggregate
| the results (the paper uses a binomial Beta this and that etc
| etc) before you see similar results as the paper.
|
| You won't replicate the results of the paper unless you really
| put your back into it.
| jay-barronville wrote:
| The Humpty Dumpty reference in the paper has me cracking up (pun
| intended).
| naikus wrote:
| I'm using ChatGPT 4o and it gave me the following :)
|
| ME: Alice has 40 brothers and she also has 50 sisters. How many
| sisters does Alice's brother have?
|
| ChatGPT: To determine how many sisters Alice's brother has, we
| need to use the information provided:
|
| Alice has 40 brothers and 50 sisters. Since Alice's brother is
| one of the 40 brothers, he shares the same family structure as
| Alice. This means he has the same number of sisters as Alice.
|
| Therefore, Alice's brother has 50 sisters.
| gwbas1c wrote:
| Alice has 50 sisters.
|
| Alice's brother has Alice as a sister (1) + Alice's sisters
| (50).
|
| Therefore Alice's brother has 51 sisters.
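|
| A quick sanity check of that arithmetic (Python, with the
| counts from the question above):
|
|     sisters_of_alice = 50
|     # A brother's sisters are Alice's 50 sisters plus Alice herself.
|     sisters_of_brother = sisters_of_alice + 1
|     print(sisters_of_brother)   # -> 51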
| happypumpkin wrote:
| First result:
|
| Me > Alice has 40 brothers and she also has 50 sisters. How
| many sisters does Alice's brother have?
|
| GPT4o (web version) > Alice has 50 sisters. Since all of
| Alice's brothers and sisters share the same set of siblings,
| each of Alice's 40 brothers also has 50 sisters. Therefore,
| Alice's brother has 50 sisters.
|
| ---
|
| Step by step:
|
| Me > Alice has 40 brothers and she also has 50 sisters. How
| many sisters does Alice's brother have? Think about it step
| by step.
|
| GPT4o > To determine how many sisters Alice's brother has,
| follow these steps:
|
| 1. Identify the total number of siblings: Alice has 40
| brothers and 50 sisters, plus Alice herself. Thus, the total
| number of siblings is 40 (brothers) + 50 (sisters) + 1
| (Alice) = 91 siblings.
|
| 2. Focus on Alice's brother: Alice's brother is one of the
| 40 brothers.
|
| 3. Determine the number of sisters for Alice's brother: Each
| brother has the same number of sisters. Since the brothers
| do not count themselves as sisters, they only count the 50
| sisters, excluding themselves and other brothers.
|
| Therefore, each of Alice's brothers has: 50 sisters
|
| So, Alice's brother has 50 sisters.
|
| ---
|
| Thinking step by step somehow gave an even more nonsensical
| answer, I'm actually very surprised it didn't work when
| prompted to do it step by step.
| gweinberg wrote:
| I don't know why you are getting downvoted. I was disappointed
| the paper didn't give examples of the LLM confidently
| expressing its incorrect reasoning.
|
| From a human perspective, I think there are three ways to get
| the answer wrong: failure to realize that Alice's brother has
| pretty much the same number of sisters as Alice herself,
| failure to realize that the brother has one additional sister,
| namely Alice, and failure to successfully add one to the number
| of sisters. But that assumes that the LLM is more or less
| reasoning like a human. It may well be "reasoning" more along
| the lines of "I've seen lots of story problems like this, the
| modal answer was five, I'll say five"
| gwbas1c wrote:
| > This breakdown can be considered to be dramatic ... because
| models tend to express strong overconfidence in reporting their
| wrong solutions as correct, while often providing confabulations
| to additionally explain the provided final answer, mimicking
| reasoning-like tone but containing nonsensical arguments as
| backup for the equally nonsensical, wrong final answers.
|
| People do that too!
|
| Magical thinking is one example. More tangible examples are found
| in politics, especially in people who believe in magical thinking
| or politicians' lies.
| emmender2 wrote:
| I was waiting for: "but humans do that too" and bingo.
|
| on another note: an entire paper written on one prompt - is
| this the state of research these days?
|
| finally: a giant group of data-entry technicians are likely
| entering these exceptions into the training dataset at OpenAI.
| gwbas1c wrote:
| > an entire paper written on one prompt - is this the state
| of research these days
|
| Years ago I attended an AI meetup where the organizer would
| discuss the differences between reductionist and holistic
| research.
|
| This is reductionist research.
|
| In this case, reducing the paper to one prompt is to allow
| for objective comparison of models.
|
| Otherwise, without a reductionist approach, how does one
| perform such an experiment and ensure that they are comparing
| apples to apples?
| emmender2 wrote:
| what I would like to see is a parameterized class of
| prompts which can never be solved by the LLMs even when a
| finite number of them are manually added to the dataset.
| gwbas1c wrote:
| Well, should we consider ChatGPT a Turing machine?
| Otherwise, I think an answer like that either requires
| significantly more research; insight; or general
| knowledge about how LLMs work.
|
| I.e., you're getting into areas that are analogous to
| Turing's theories. I don't think he came up with those
| theories overnight.
| nerdjon wrote:
| For anyone considering reading the paper who, like me, doesn't
| normally read papers like this and might open the PDF and think
| they don't have time to read it due to its length: the main part
| of the paper is the first 10 pages and a fairly quick read.
|
| On to the topic here. This is an interesting example that they
| are using. It is fairly simplistic to understand as a human (even
| if we may be inclined to quickly jump to the wrong conclusion
| without thinking for a few seconds).
|
| The thing that really bothers me is that I just don't know if
| we can realistically fix this given the current state of what
| these tools actually are. They are not reasoning or thinking in
| any sense of the word and yet a lot of people are already
| considering them general purpose AI. It doesn't help that in many
| situations it can fake it enough that it appears to be reasoning,
| but it's not.
|
| What is the chance that this paper actually has any impact on the
| AI rollout and overhype, or will it just be buried and never
| talked about again until the next time we see how dangerous these
| tools are, like with Google's search rollout?
| mjburgess wrote:
| In many ways, this is very obvious and routine to people who
| use these systems with a critical understanding of how they
| work. It's dispiriting how we require an arxiv.org PDF in order
| to point out the absence of the emperor's clothing. There are
| few people who use LLMs who could not, right now, demonstrate
| this point to themselves if they so wish.
|
| Such is the strength of corporate tech propaganda that a whole
| mass of people will instead insist that we have never worn
| clothes either.
| nerdjon wrote:
| > In many ways, this is very obvious and routine to people
| who use these systems with a critical understanding of how
| they work.
|
| The last part of that is the problem and why a paper like
| this is critical.
|
| These systems are being pushed onto people who don't
| understand how they work. CEO's and other business leaders
| are being pushed to use AI. Average users are being shown it
| in Google search results. Etc etc.
|
| People are being told it can do far more than it really is.
| mjburgess wrote:
| Sure, but even these people... the failures are so common,
| and often very obvious. Consider a CEO who puts a press
| briefing in and asks some questions about it; it's not
| uncommon for those answers to be obviously wrong on any
| sort of critical reflection.
|
| We aren't dealing with a technology that is 99.9% right in
| our most common use cases, so that we need to engineer some
| incredibly complex problem to expose the flaw. Rather, in
| most cases there is some obvious flaw. It's a system that
| requires typically significant "prompt engineering" to
| provide the reasoning the system otherwise lacks.
|
| I guess that offers an explanation: people aren't aware
| that via their own prompt engineering they are repairing
| the deficiencies of the process by manipulating its inputs
| to include the structured reasoning it lacks. So there's a
| sort of hot-reading effect at work.
| throw46365 wrote:
| > We aren't dealing with a technology that is 99.9% right
| in our most common use cases, so that we need to engineer
| some incredibly complex problem to expose the flaw.
| Rather, in most cases there is some obvious flaw. It's a
| system that requires typically significant "prompt
| engineering" to provide the reasoning the system
| otherwise lacks.
|
| Right -- we are a long way from "this is a very nuanced
| error" being the dominant failure.
| kalkin wrote:
| > People are being told it can do far more than it really
| is.
|
| Meanwhile these HN comments are split between:
|
| * Lots of people confirming what the paper itself notes
| (but doesn't highlight), that the most advanced models
| actually can solve this problem at least a significant
| portion of the time. (A proportion which one can pretty
| easily project is only likely to increase with future
| models.)
|
| * Lots of people saying "this confirms LLMs can't do
| reasoning".
|
| Questions I'd ask you to consider:
|
| * Is "LLMs can't do reasoning" actually more accurate than
| the typical hype?
|
| * Is a "critical understanding of how [LLMs] work" that
| would predict they simply cannot solve this problem
| actually a good understanding?
| mjburgess wrote:
| The critical understanding doesn't predict that LLMs
| cannot solve problems. It predicts how they will solve
| them. There is no information, a priori, about what the LLM
| has been trained on. You have to prompt, then see the answer.
| Once the answer arrives, the critical understanding
| provides a route to repairing the answer when not
| accurate or useful.
|
| LLMs do not reason. They appear to reason by repeating
| the structure of reasoning in their training data. This
| is indistinguishable in many cases.
| throw46365 wrote:
| > Such is the strength of corporate tech propaganda that a
| whole mass of people will instead insist that we have never
| worn clothes either.
|
| This is the line of reasoning I find most dispiriting. I
| still believe tech people cling to this line of reasoning
| because it helps them justify replacing people in jobs with
| LLMs.
| colechristensen wrote:
| I don't think anybody who actually uses LLMs thinks they are
| general purpose AI.
|
| Like you ask it to do one thing it's amazing, but then you try
| to modify or do something with extra steps, or just anything
| with any complexity to it and it falls over.
| nerdjon wrote:
| > I don't think anybody who actually uses LLMs thinks they
| are general purpose AI.
|
| I would like to believe that but I have had too many
| conversations with people who basically think it already is.
| Including in one situation of a fellow engineer.
|
| It feels like more and more "we" are in a bubble of actually
| having some knowledge of how this works, what the actual
| limitations are, and what it just is not. While there is in
| fact criticism of it out there, particularly around AI "art".
| It doesn't seem to be focused on the area we are talking
| about.
| goostavos wrote:
| I dunno. If memes are anything to go by, people are
| frustrated by Yet Another Chat Bot standing between you and
| the human you need to solve your problem. The fact that
| it's a better, more human like chat bot doesn't mask the
| frustration of being forced to talk to a computer with
| limited agency.
| James_K wrote:
| > I don't think anybody who actually uses LLMs thinks they
| are general purpose AI.
|
| They are being sold as such. Most people don't know anything
| about the topic and will buy that marketing. The entire
| concept of these models is that you can put a whole bunch of
| data in and eventually some kind of magic will happen and you
| get AGI out. They would not see the kind of investment that
| they do if all that was being promised was "really good
| predictive text". In fact some philosophers argue that
| sentience is just really good predictive text to try and make
| the point that these models are AGI.
| kragen wrote:
| they're pretty general-purpose; you can ask the same model
| for recipe suggestions, a fanfic, or verilog. like, they're
| far, far more general-purpose than any humans i know
|
| it's true that they're not very reliable, but they seem to be
| not very reliable across many different domains. and they
| don't seem to be particularly less reliable than the average
| human, so i think possibly your standards for 'general
| purpose ai' are set high enough that you would declare humans
| to be unintelligent (or perhaps not 'general-purpose') if you
| applied them consistently
|
| you can certainly find particular domains where humans can
| still do things llms can't, but i haven't seen a persuasive
| account of why those domains are the more important ones, and
| of course the converse is also true
| mupuff1234 wrote:
| They might not think that but they sure as hell are trying to
| sell that idea.
| dr_dshiv wrote:
| Well, I believe LLMs are general purpose AI. And, for the
| record, so does Peter Norvig. He coauthored the most popular
| textbook on Artificial Intelligence, so this belief can be
| considered a somewhat mainstream perspective. His article,
| "Artificial General Intelligence Is Already Here" [1] is
| really worth reading.
|
| https://www.noemamag.com/artificial-general-intelligence-
| is-...
| lupire wrote:
| I can't tell whether Norvig is making a subtle point that
| artificial general stupidity is still an important form of
| artificial general intelligence, or whether he somehow
| vastly overestimates AI model performance on non-
| memorization and summarization tasks.
| solidasparagus wrote:
| The vast majority of people use a model built by someone else
| through a high-level abstraction and then make broad claims
| about what the technology will be capable of.
|
| Then you have the people who are leveraging the technology to
| train models from scratch and that population is far more apt
| to believe that large models can be general purpose AI (by
| some definition). Sure there are other things at play like
| money, publicity, reputation, and a desire to do something
| important. But there is also the very clear trend line where
| transformers are able to model pretty much any sequence of
| tokens where there is pretraining scale data available.
| Whisper was not a surprise. GPT-4v was not a surprise. Sora
| was not a surprise.
| photon_lines wrote:
| There's actually a pretty simple solution to this that I
| thought about testing out and it involves asking the model to
| re-construct the problem using a logic language (like Prolog)
| and asking it to execute this type of program in order to come
| up with a solution rather than attempting simple chain-of-
| reason training / other methodologies of getting the model to
| 'reason' through some of these examples. People forget that
| humans don't come up with their logical models out of the blue
| - it takes years of elementary school in order for us to
| understand the world and solve problems in it. The logic
| programming approach I'd say is really promising but you would
| need to feed the LLM a LOT of examples in order for it to work,
| and currently I'm not even sure that we have enough training
| data in order to implement something like this.
| nerdjon wrote:
| I honestly thought about this recently when I was trying to
| see the limits of Claude Opus. Some of the problems I gave
| it, what if instead of telling it to solve the problem I
| asked it to write the script and then give me the command and
| inputs needed to properly run it to get the answer I needed.
| That way instead of relying on the LLM to do properly
| analysis of the numbers it just needs to understand enough to
| write the logic.
|
| It is an interesting prospect but I feel like it has some
| limitations. For math problems like this one, yeah it should
| be simple to write a script to do it. But it does first have
| to understand the core thing here that Alice would be one of
| the sisters of the brother to write the script accordingly.
|
| But I would think this would not scale well when dealing with
| far more complex issues, particularly ones that may not just
| be simple math logic. If the request was to write a
| scientific paper for example, it might have to make several
| scripts and call them multiple times, and that assumes that
| it understands what it is doing enough to properly make the
| script.
| photon_lines wrote:
| I'm not sure what you mean by 'it will not scale well.'
| When we humans learn that we make a mistake - we make a
| note and we hold the correct answer in memory - the next
| time we're prompted with a similar prompt, we can use our
| old memories to come up with the correct solution. I just
| did a simple test for this same exact problem using ChatGPT
| 3.5:
|
| 'Can you reformulate the following problem using Prolog?
| When you execute the program - what answer do you get?
| 'Alice has 8 brothers and she also has 5 sisters. How many
| sisters does Alice's brother have?'
|
| Even though it came up with a proper Prolog formulation -
| it failed to account for Alice being one of the sisters - so
| I gave it a simple prompt for this and it was able to
| correct the formulation and come up with the proper
| solution:
|
| My prompt after ChatGPT gave me an answer of 5: 'Yes but
| you aren't including Alice in your logic. Shouldn't the
| correct answer be 6 (since Alice is also a sister)?'
|
| Response:
|
| 'You are correct. The correct answer should account for
| Alice being one of the sisters. Let's adjust the Prolog
| program to include Alice in the calculation:
|
|     % Facts
|     brothers(alice, 8).
|     sisters(alice, 5).
|
|     % Rules
|     sisters(Brother, NumSisters) :-
|         sisters(alice, Num),
|         NumSisters is Num + 1.  % Include Alice herself
|
|     % Query
|     ?- sisters(alices_brother, NumSisters).
|
| In this updated Prolog program:
|
| The rule for determining the number of sisters includes
| Alice herself by adding 1 to the number of sisters Alice
| has.
|
| When we execute the query ?- sisters(alices_brother,
| NumSisters). in Prolog, we should get:
|
| NumSisters = 6.
|
| This correctly reflects that each of Alice's brothers has 6
| sisters, including Alice herself.
|
| When I prompt it to answer this question using Prolog now,
| I always get the correct answers no matter which integers I
| feed in, so it seems like the model is able to generalize
| after I show it where its 'reasoning' was wrong.
| Basically, my suggestion is to get the model to recognize
| logic puzzles like this and try to formulate them in terms
| of logic programming queries which it can use and tune in
| order to come up with correct answers rather than simple
| auto-associative chain of reason training which current GPT
| models rely on, but like I said - this is my hypothesis and
| I believe this would work much better in getting these
| models to 'generalize' than the current approaches we're
| using. Hopefully this helps.
| pbhjpbhj wrote:
| When you ask again the prompt includes the context of
| your previous question and correction. When I ask, the
| prompt doesn't have that context, so the model fails to
| give me the correct answer.
|
| _I'm using the default free model in the app, based on
| GPT4._
| photon_lines wrote:
| Yup - well this is where my suggestion is to change the
| GPT architecture. You can think of having a logic program
| function as the 'frontal lobe' of the general pre-trained
| auto-associative model. This 'frontal lobe' region would
| try to come up with logical sequences to go along with
| its internal auto-associative representations. Of course
| - the logic programming piece is just one approach -
| maybe chain-of-thought or chain-of-reason prompting could
| work here too as many humans I think use this chain-of-
| reasoning approach themselves. Logic programming to me
| would function as a suggested shortcut.
| immibis wrote:
| What if you prompt it "You seem to have accidentally
| included Alice. The correct answer should be 4"?
| photon_lines wrote:
| Yup, this is a good example. This is because the model has
| no conception of what 'causality' is or how to come up
| with a correct 'model' - humans have a visual system which
| helps them out here. So I definitely see your point, and
| yes, in these instances, if you feed in garbage data then
| you will get garbage out.
| mcguire wrote:
| " _My prompt after ChatGPT gave me an answer of 5: 'Yes
| but you aren't including Alice in your logic. Shouldn't
| the correct answer be 6 (since Alice is also a
| sister)?'_"
|
| Useful, if you know what the answer is. What happens if
| you don't give it the correct answer?
| photon_lines wrote:
| What happens when you don't give a human being the
| correct answer? :)
| ip26 wrote:
| I don't understand why LLMs aren't already set up to do
| what you describe automatically behind the curtain. Extract
| a math equation from text (LLMs are good at translating
| between languages right?) and immediately evaluate it on
| the host CPU.
|
| LLM is the equivalent of recalling your times tables.
| Computer arithmetic is the equivalent of re-computing your
| times tables.
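| In Prolog terms (purely illustrative, with made-up
| predicates), the difference is roughly:
|
|     % 'Recall': a memorized times-table fact; it only
|     % covers whatever happened to be stored.
|     recalled_product(6, 7, 42).
|
|     % 'Compute': the host re-evaluates the arithmetic
|     % for any inputs.
|     computed_product(A, B, P) :- P is A * B.
|
|     % ?- recalled_product(6, 9, X).   fails - never stored
|     % ?- computed_product(6, 9, X).   X = 54.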
| dragonwriter wrote:
| > I don't understand why LLMs aren't already set up to
| do what you describe automatically behind the curtain.
|
| LLM-based systems with tool use (which this is an
| application of) often are, to an extent, the issue is
| tuning the (behind the scenes, system) prompting so that
| they use appropriate tools in every case where they
| should, and do so correctly. (There's also a cost factor
| involved since behind-the-scenes tool use means multiple
| LLM round trips to answer the question, so tuning the
| system to use tools more aggressively makes the system
| more expensive.)
| pbhjpbhj wrote:
| ChatGPT does do this sort of process for arithmetic now;
| it converts word-based problems to mathematical notation
| and then solves them.
| CooCooCaCha wrote:
| I'm curious how this would work considering knowledge can be
| fuzzy.
|
| Like if I'm out camping and I sit on a log or a rock,
| those things are not what people usually think of as
| chairs, but they can serve as chairs in that situation.
| photon_lines wrote:
| You can get models to actually show that 'logs' could
| function as 'chairs.' You're forgetting that we humans
| learn this too, but we learn it in a much simpler manner
| than LLMs do, so someone has to explicitly let models know
| what assumptions they can make. You get the LLM to write
| Prolog programs and learn associations in this manner. As
| the model gets better at logically modelling the problems,
| the solutions to prompted problems like this should get
| better.
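| As a toy illustration (the predicates here are made up,
| not something a model actually produced), that camping
| assumption can be written down explicitly:
|
|     % Ordinary chairs are things you can sit on.
|     can_sit_on(X) :- chair(X).
|
|     % In a camping context, sturdy objects with a surface
|     % you can sit on serve as chairs too.
|     can_sit_on(X) :-
|         context(camping),
|         sturdy(X),
|         sittable_surface(X).
|
|     chair(kitchen_chair).
|     context(camping).
|     sturdy(log).
|     sturdy(rock).
|     sittable_surface(log).
|     sittable_surface(rock).
|
|     % ?- can_sit_on(log).   true.
|
| The fuzzy, situational knowledge becomes an explicit,
| queryable assumption rather than something left implicit.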
| sollewitt wrote:
| Right, and do you verify the result?
|
| You have to know what the answer is supposed to be before you
| can write a test case.
| photon_lines wrote:
| Yup - well, you feed in the prompt along with an answer,
| get the model to produce outputs, and check for
| discrepancies. If the answer is wrong then the model
| adjusts -- this is the way backpropagation works. I think
| there are huge advantages in using logic languages to
| represent some of these data sets rather than plain
| English or the current chain-of-thought reasoning
| approaches. Backpropagation, as an example, isn't really
| used in the human brain, but it leads to great results in
| mimicking how neural networks 'learn'. In the same way, we
| don't need the full formal picture of how humans model the
| logical world in order to achieve great results. We can
| simulate this using logic programming or even general
| programming - or at least that's my conjecture.
| IanCal wrote:
| I can have more confidence that my calculations are correct
| using a calculator compared to doing it by hand, even if I
| don't know the exact right answer beforehand.
| astrobe_ wrote:
| > asking it to execute this type of program in order to come
| up with a solution
|
| I may be showing my ignorance about this tech here, but I
| believe the LLM doesn't even try to solve a problem; it
| tries to generate a discourse that could pass as a
| solution or answer to the problem; that's more or less
| what the abstract states, if I understand it correctly.
| But in no way does it try to apply some sort of mechanical
| reasoning like inference engines do.
|
| To me the solution is to associate the LLM with mechanical
| computation, that is, an inference engine or an equation
| solver, rather than recombining the millions of solutions
| to similar problems it has seen in its training set. I
| believe I remember reading about teams attempting this
| approach. I can imagine, for instance, that if the LLM is
| in some way able to ask questions and use the answers, it
| could just generate a prompt for an equation solver and
| include the result in its answer.
| asadotzler wrote:
| If that kind of thing worked, we'd have been doing it long
| before LLM chatbots.
| IanCal wrote:
| Yet tools like GPT4o can do this.
|
| It's not a trivial problem, taking a human-written
| description and rewriting it as a Prolog program.
| IanCal wrote:
| I took one of the problems that gpt4o got wrong, and asked
| gpt4o what tools it could use. It suggested and wrote
| Prolog for me that (with one pass back to get things
| defined in the right order) correctly worked out the
| answer.
| mcguire wrote:
| I just tried that with ChatGPT 3.5 (4o stopped responding
| after I asked the initial question and it produced the wrong
| answer). Here's the Prolog it generated:
|
|     % Define the number of brothers and sisters
|     brothers(4).
|     sisters(1).
|
|     % Predicate to calculate the number of sisters a brother has
|     brother_sisters(NumberOfSisters) :-
|         sisters(NumberOfSisters).
|
|     % Query to find out how many sisters a brother has
|     ?- brother_sisters(Sisters).
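|     % (Note: the rule above never adds Alice herself, so
|     %  this query yields Sisters = 1 rather than the
|     %  correct 2.)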
| wtbdqrs wrote:
| I appear to be reasoning at times but I have mostly no idea
| what I am talking about. I hit a bunch of words and concepts in
| the given context and thus kind of hallucinate sense.
|
| Given a few months of peace of mind and enough money for good
| enough food, I could actually learn to reason without sounding
| like a confused babelarian.
|
| Reasoning is mostly a human convention, supported by a
| human context that would have been a different one if the
| Fascists had won the war or the Soviet Union hadn't gotten
| corrupted.
|
| But none of that has anything to do with pulling up a
| whiteboard to draw some flowcharts and run some numbers, all of
| which is why I am certain there is nothing the devs have "to
| fix". It took most reasonable humans many generations to learn
| stuff. Very few of us did the actual work.
|
| It's all just a matter of time.
| voxic11 wrote:
| Yeah, I think these chatbots are just too sure of themselves.
| They only really do "system 1 thinking" and only do "system 2
| thinking" if you prompt them to. If I ask gpt-4o the riddle
| in this paper and tell it to assume its reasoning contains
| possible logical inconsistencies and to come up with reasons
| why that might be, then it does correctly identify the
| problems with its initial answer and arrives at the correct
| one.
|
| Here is my prompt:
|
| I have a riddle for you. Please reason about possible
| assumptions you can make, and paths to find the answer to the
| question first. Remember this is a riddle so explore lateral
| thinking possibilities. Then run through some examples using
| concrete values. And only after doing that attempt to answer
| the question by reasoning step by step.
|
| The riddle is "Alice has N brothers and she also has M
| sisters. How many sisters does Alice's brother have?"
|
| After you answer the riddle please review your answer
| assuming that you have made a logical inconsistency in each
| step and explain what that inconsistency is. Even if you
| think there is none do your best to confabulate a reason why
| it could be logically inconsistent.
|
| Finally after you have done this re-examine your answer in
| light of these possible inconsistencies and give what you
| could consider a second best answer.
| daveguy wrote:
| > After you answer the riddle please review your answer
| assuming that you have made a logical inconsistency in each
| step and explain what that inconsistency is. Even if you
| think there is none do your best to confabulate a reason
| why it could be logically inconsistent.
|
| LLMs are fundamentally incapable of following this
| instruction. It is still model inference, no matter how you
| prompt it.
| zeknife wrote:
| If you had a prompt that reliably made the model perform
| better at all tasks, that would be useful. But if you have
| to manually tweak your prompts for every problem, and then
| manually verify that the answer is correct, that's not so
| useful.
| cpleppert wrote:
| There isn't any evidence that models are doing any kind of
| "system 2 thinking" here. The model's response is guided by
| both the prompt and its current output so when you tell it
| to reason step by step the final answer is guided by its
| current output text. The second best answer is just
| something it came up with because you asked, the model has
| no second best answer to give. The second best answers
| always seem strange because the model doesn't know what it
| means to come up with a second best answer; it 'believes'
| the output it gave is the correct answer and helpfully
| tries to fulfill your request. Sometimes the second best
| answer is right, but most of the time it's completely
| nonsensical, and there is no way to distinguish between
| the two. If you ask it to choose, it will be strongly
| influenced by the framing of its prior response and won't
| be able to spot logical errors.
|
| Asking it to do lateral thinking and provide examples isn't
| really helpful because its final output is mostly driven by
| the step by step reasoning text, not by examples it has
| generated. At best, the examples are all wrong but it
| ignores that and spits out the right answer. At worst, it
| can become confused and give the wrong answer.
|
| I've seen gpt-4 make all kinds of errors with prompts like
| this. Sometimes, all the reasoning is wrong but the answer
| is right and vice versa.
| layer8 wrote:
| For reasoning of any complexity, some sort of internal
| monologue and iteration is needed. For type-2 thinking, we
| conceptualize possible solutions, arguments, and reasoning
| paths in our mind, and judge and evaluate them before deciding
| which one we think is adequate or correct. This can be emulated
| to some extent by guiding prompts that make that internal
| thought process external, but we really want it to be internal
| and automatic.
|
| We don't know how to do that yet, because what _controls_ the
| internal thought process is itself not necessarily language-
| based, and also, since internal thought processes of biological
| brains are not directly observable, they can't be used as
| training data.
|
| Edit: It occurs to me now that there is some parallel between
| current LLMs and behaviorism [0], and we really need something
| to which cognitive psychology could be applied instead.
|
| [0] https://en.wikipedia.org/wiki/Behaviorism
| sollewitt wrote:
| Given many people don't have an inner monologue and function
| just fine, it's more likely inner monologue is a product of
| the reasoning process and not its mechanism.
| CooCooCaCha wrote:
| I think you're using "inner monologue" too literally. It
| could be a progression of pictures, emotions, etc.
| ajuc wrote:
| With that definition even bacteria have inner monologue.
| CooCooCaCha wrote:
| Can bacteria imagine pictures? Do they have emotions?
|
| Why does this matter? Stop being so pedantic. We're
| talking about a progression of _ideas_. Talking in your
| head is one form of ideas, but people can easily solve
| problems by imagining them.
| IlliOnato wrote:
| Hmm, looks to me like just trading some words for others.
| Do bacteria have ideas? Does the navigating system in
| your car? How do you know?
|
| We need to be at least somewhat pedantic, otherwise it's
| impossible to know what we are even talking about, and no
| way to establish anything.
| ajuc wrote:
| Initial thesis was - inner monologue is required for
| reasoning. If you define inner monologue to include
| everything brains do - the initial thesis becomes a
| tautology.
| IlliOnato wrote:
| To make any progress on this question at all, we need
| first to come up with some definition of internal
| monologue. Even if we may need to modify it later, there
| has to be a starting point.
|
| Otherwise, nothing can be established at all, because for
| any statement there always will be someone's
| understanding of "internal monologue" for which the
| statement is true, and someone's else understanding for
| which the statement is false...
| hombre_fatal wrote:
| I'm sure inner monologue just cashes out into the ability
| to reflect on your own thoughts. And for one to say that
| they're not having that experience also involves a claim
| about what they think _other people_ are having which
| would make me doubly skeptical.
|
| In practice, when you see people arguing about whether
| they have an "inner monologue" or can "mentally picture
| objects" on social media, it's more of a contest of who
| is the most unique in the world rather than anything that
| sheds clarity on our subjective experience.
| layer8 wrote:
| It's commonly conjectured that the emergence of human-level
| reasoning wouldn't have been possible without the
| development of language. Personally, I'm able to suppress
| "word thoughts" in my head (for a short time), but then I
| lose almost all of my reasoning ability. I could imagine
| that reasoning is language-based even when it's not
| conscious for some people. An internal process being there,
| and being conscious of it, are two separate things. We
| would be happy with an AI using an internal monologue
| without it being conscious of that monologue.
| IggleSniggle wrote:
| Maybe, but symbolic thought can get pretty far away from
| what we generally call "language." I bet you can reason
| 1+3x=22 pretty easily without any words whatsoever, or
| the sound of one ascending octave after another, or the
| approximate G-force induced on your body if you take the
| next turn without applying the brakes.
|
| All of these forms of reasoning are true and useful
| calculations: when we talk about "intuition" what we
| usually mean is that we have a lot of experience and
| internal reasoning about a subject, but we struggle to
| translate it to and from the "language" part of our
| brain. Nonetheless, any social dancer will tell you that
| a dialog is possible just by receiving and inducing
| g-forces alone. You can reason this way about abstract
| concepts like orbits without ever touching a single word
| or concrete symbol.
|
| Edit: the key aspect of reasoning, imho, is the ability
| to make predictions and introspect them against a
| database of other predictions, using an adversarial
| heuristic to weight the most plausibly useful results.
| Perhaps our pattern matching AIs of today just lack
| sufficient "experience" to do what we call reasoning.
| layer8 wrote:
| Regarding "1+3x=22", I'm actually not sure, the number
| words certainly appear in my head when solving the
| equation. But even then, I would count "1+3x=22" as
| constituting language. Perception of sound, G-forces, and
| dancing don't perform type-2 reasoning by themselves, so
| I don't think your argument applies there.
|
| Regarding your edit, no, I think the key aspect of the
| kind of reasoning we are missing in current AI is the
| ability to hold the reasoning in your mind, and to
| iterate on it and evaluate it (judge it) within your
| mind.
| IggleSniggle wrote:
| It is very difficult to have a discussion _using words_
| about the semantics of non-word or non-symbolic thought.
| I was pointing at several different plausible spaces for
| semiotics, and how these spaces could be spaces for
| reasoning, in the hopes that one of them might be
| relatable.
|
| If you use words in your mind when you use math, and you
| use words in your mind when you make or listen to music,
| etc., then it is very difficult to find a common ground
| where it is possible to see that these other realms of
| thought are capable of not only prediction, but also
| producing evidence that leads to judgement. That is to
| say, the key aspects of "reasoning." I picked them
| because I thought they had broad enough appeal to be
| relatable, and because I do not personally hear words in
| my head when doing any of these activities, whether it's
| calculus or tango, but I still find calculus and tango to
| be places where reasoning occurs.
|
| Some of them, like math or music, are closer to the kind
| of symbolic thought we use when we discuss things with
| words. Others, like the experience of g-forces, are not.
| I present them as a sliding scale between "word based"
| reasoning and "non-linguistic" reasoning. Perhaps you can
| think of a realm that better fits for your personal
| experience of intuition, and inspect whether these
| intuitions are capable of "real" reasoning in the absence
| of language, or whether intuition should never be trusted
| even when you have a great deal of experience in that
| area. Perhaps in your estimation, anything that cannot
| produce evidence that is articulable in word form is
| suspect.
|
| Personally, I find all these methods, including language,
| to be suspect. I don't find language to be especially
| better at producing the kind of evidence for prediction,
| correct judgement, or discourse for reasoning than other
| methods, unless you reduce "reasoning" to tautologically
| require it.
|
| One of the best tools of language is that we have writing
| that allows easy inspection or iteration of the written
| content; but these things are possible in other realms,
| too, it's just that we didn't have great tools for
| introspecting and iterating on their "ideas" except
| within our own minds. These days, those tools are readily
| available in many more realms of human insight.
| hifromwork wrote:
| >I bet you can reason 1+3x=22 pretty easily without any
| words whatsoever
|
| I've tried to do it, but I can't. I had to do something
| like "ok, so we subtract one from both sides and then
| it's easy, 3*7=21". Maybe I could do 2+8 but I still
| think the word ten "aloud".
| twiceaday wrote:
| I was able to do it with no words. I 'saw' the steps as
| if on a piece of paper. I saw 3x=22-1=21, then x=21/3=7.
| But I have a degree in applied math. Perhaps not
| internally vocalizing is just being extremely familiar.
| It also happened very quickly, perhaps there was no time
| to vocalize anyways.
| nathan_compton wrote:
| Brains are weird. I reason almost entirely non-verbally
| and I would absolutely struggle if I had to laboriously
| express every thought in words. Its part of the reason I
| don't work well in teams. So slow!
| psadri wrote:
| Language is a serialization of our brain's "world model"
| structures.
| bradfox2 wrote:
| What defines the boundaries of internal vs external?
| Certainly nothing about llm weights or ops should.
| kveykva wrote:
| Have there ever been studies that demonstrate that those
| individuals don't simulate possible state transitions
| they'll go through in a different modality? I'd be curious
| if they visualize actions they'll take still, just not
| verbally.
| ElevenLathe wrote:
| The existence of an "inner monologue" isn't really a
| falsifiable claim. Some people claim to have one while
| other people claim not to, but we can't test the truth of
| these claims.
| glitchc wrote:
| In this particular case, is there any reason why we
| simply can't take their word for it? This is not a case
| where, if I say "weak" or "strong", most people pick
| strong because no one wants to be weak, even if the
| context is unknown (the nuclear force, for example).
| GeoAtreides wrote:
| > is there any reason why we simply can't take their word
| for it?
|
| because if we give them a problem to solve in their head
| and just give us the answer, they will. By problem I mean
| planning a trip, a meal, how to pay the mortgage, etc.
| It's impossible to plan without an internal monologue.
| Even if some people claim theirs is 'in images'.
| glitchc wrote:
| Perhaps there's confusion in how we are using the word
| monologue. I took it to mean a conversation, a dialogue
| where the problem is perhaps solved using a dialectic
| method, or simply a conversation. Since one can solve a
| problem by following some memorized steps, no
| conversation required, this is perhaps not a good test,
| or we mean different things when we say "monologue."
| rprospero wrote:
| > It's impossible to plan without an internal monologue
|
| I once had a teacher claim that people who claimed to
| have aphantasia were lying, because those people have
| read books and it is impossible to read a book without
| picturing the scene in your mind's eye. Are you citing the
| same source that she was?
| staticman2 wrote:
| >> It's impossible to plan without an internal monologue
|
| That's quite the claim.
| ElevenLathe wrote:
| > It's impossible to plan without an internal monologue.
|
| How can science make this claim if it can't prove (or
| disprove) the existence of an internal monologue?
| IlliOnato wrote:
| Well, I remember Richard Feynman came up with an
| interesting experiment. He found he could not count
| objects when he read aloud some text at the same time. He
| had to name the numbers, and it was impossible if he was
| already engaging his speech.
|
| He thought this was universal, but doing this experiment
| with friends, he discovered a guy who could count while
| reading aloud. So when Feynman asked him how he did this,
| it turned out that the guy, instead of "pronouncing" the
| numbers, was "seeing" colored numbers in his imagination,
| so his speech was not involved.
|
| I suppose this experiment could be modified and
| generalized, and at least shed some light on this
| problem.
| photon_lines wrote:
| 'It's impossible to plan without an internal monologue.'
| - Sorry, but I disagree with this. I have no 'internal
| voice' or monologue - whenever I see a problem, my brain
| actually and fully models it using images. I believe 25%
| of the population doesn't have the internal monologue
| which you're referring to and this has been tested and
| confirmed. I highly recommend listening to this Lex
| Fridman podcast episode to get a full grasp on the
| complexities of modelling language and general modelling
| present in the human brain:
| https://www.youtube.com/watch?v=F3Jd9GI6XqE
| GeoAtreides wrote:
| Sure, I do mention thinking in images in my original
| comment and count it as some type of internal monologue.
| I personally do not believe it's all images, as that
| would preclude using highly abstract concepts. But I
| might be wrong, and it might be 100% images. That being
| said, it does count as an internal monologue.
| lupire wrote:
| Can you draw a picture of an example of what you see when
| you think about something?
| photon_lines wrote:
| Sure - with the Alice example, when I saw the problem I
| came up with a really simple visual example of Alice
| having 3 sisters and 4 brothers. When I visualized it I
| saw Alice standing next to 3 other women (her sisters)
| and her 4 brothers standing close by. When I imagined
| asking her brother how many sisters he has, I could see
| that Alice was standing there next to 3 other women and
| thus came up with the answer of 4. Does this make sense?
| magicalhippo wrote:
| This could account for why some people are much better at
| say geometry than algebra.
|
| I'm the opposite. I rarely visualize things I read, be it
| geometry or poetry. I can read a detailed description of
| a person or an item in a novel, and I don't really "see"
| anything.
|
| But I have an active inner monologue. I "hear" myself
| saying the words when reading or writing, I "talk" to
| myself when solving problems or just thinking about
| stuff.
|
| Especially when programming I'm reasoning by discussing
| the problem with myself, the only difference being that I
| usually don't open my mouth and vocalize the discussion.
| Though sometimes when I'm alone I do just that.
| JohnMakin wrote:
| > It's impossible to plan without an internal monologue.
|
| Of course it isn't impossible, and this is backed by what
| we know about paleoanthropology and other instances of
| cognition in animals - humans were making stone tools
| millions of years ago, which takes planning in the form
| of imagining what you want the tool to look like and how
| you will do it and what it will be used for. It's
| exceedingly likely we had this ability long before
| complex speech evolved. Apes also use and make tools,
| which would require planning, and I don't think they have
| an internal monologue going on. Birds from the corvid
| family can do some pretty advanced problem solving that
| requires planning. Cetaceans might be an exception,
| because they appear to have some form of language, but
| this is a pretty wild claim not really backed by any kind
| of science as we understand it today.
| GeoAtreides wrote:
| Animals cannot manipulate abstract concepts, nor can they
| make long-term plans. No crow can plan an international
| trip spanning a couple of weeks and two change-overs. And
| some people definitely can't do it start to end, but they
| can at least plan the first 5-7 steps.
|
| Also, maybe inner monologue is not a binary have/have
| not, but maybe it is on a continuum.
| glitchc wrote:
| Not sure. Migratory birds seem to manage this just fine.
| Not only do they make multiple stops to eat and rest,
| they also navigate around bad weather and still make it
| to their intended destination (at least most of the
| time).
| GeoAtreides wrote:
| > Migratory birds seem to manage this just fine
|
| Instincts.
| JohnMakin wrote:
| Yes, no one is disputing that animals are less intelligent
| than humans and lack the same capacity for planning, but
| the post you're replying to is disputing the claim that
| planning is done solely through internal
| narrative/monologue, which is easily disproved by pointing
| to the examples I did. There are many more in nature.
| IlliOnato wrote:
| > "why we simply can't take their word for it"?
|
| As someone who was involved in spiritual practice of
| "stopping internal dialogue" for years, I can tell you
| that one learns that that dialogue (or monologue, pretty
| much the same thing) is quite subtle and complex,
| essentially multi-layered.
|
| Typically, when you think that you "think about nothing
| at all" it's just the most surface layer that has
| stopped, and more subtle talking to yourself is still
| going on. It takes training just to become able to notice
| and recognize it.
|
| After all, it's just such a constant and monotone hum at
| the back of one's mind, one learns to completely ignore
| it.
|
| So no, I would not take the word of people who were never
| trained to notice their internal monologue when they claim
| they haven't any :-)
| Terr_ wrote:
| > In this particular case, is there any reason why we
| simply can't take their word for it?
|
| My concern is that if we take their word for it, we're
| actually buying into _two_ assumptions which (AFAIK) are
| both unproven:
|
| 1. That "Internal Monologues" (not consciously forced by
| attention) exist in the first place, as opposed to being
| false-memories generated after-the-fact by our brain to
| explain/document a non-language process that just
| occurred. (Similar to how our conscious brains _pretend_
| that we were in control of certain fast reflexes.)
|
| 2. Some people truly don't have them, as opposed to just
| not being aware of them.
| ElevenLathe wrote:
| Not only are they unproven, but are ultimately not
| provable at all. Some people will say yes, some people
| will say no. Probably we can take their word for it, but
| in the simplest case they could just lie (in either
| direction) and we would have no way to tell.
|
| In short, maybe these inner monologues exist and maybe
| they don't, but science can't comment on that. That said,
| it is clearly something we are interested in, but it will
| need to be addressed in some other way (i.e. religion,
| ideology, etc.).
| Terr_ wrote:
| > but are ultimately not provable at all
|
| No, they are potentially falsifiable as we get better at
| scanning, identifying, intervening in brain activity.
|
| Just off the top of my head here, suppose we create a
| table puzzle problem that (in itself) doesn't require
| language to understand, like ones we make for certain
| animals. Have human subjects (silently) solve it.
| Afterwards, quiz the solvers about their internal
| monologue--or lack thereof--dividing them into two groups
| and noting the words used.
|
| Now change to a second puzzle of similar style and the same
| overall difficulty. Stun/anesthetize the language-centers
| of subjects, to deny access to any of the monologue-words
| (validating this intervention will involve other
| research), and then test them on the second problem.
|
| * If performance is constant for both groups, that
| suggests the monologue is illusory or at least not needed
| for this kind/scope of problem.
|
| * If performance drops for both groups, that suggests the
| no-monologue people might just not be as aware of a
| linguistic process that's actually happening.
|
| * If performance drops for monologue-subjects, that
| suggests it's a real and important difference in modes of
| logical thought.
|
| * If some other combination happens, you have a
| mysterious and exciting new line of research.
| ElevenLathe wrote:
| Sure, there is stuff we can do to tease around the edges
| (similar problems crop up all the time in psychology and
| sociology) but we will always have to weaken the claim in
| order to do experiments relating to it.
| TeMPOraL wrote:
| > _Probably we can take their word for it, but in the
| simplest case they could just lie (in either direction)
| and we would have no way to tell._
|
| Individually, no, but in general, for people to
| consistently lie about this particular thing at scale
| would be extremely unusual, given that people rarely lie
| if there's no reason for it. Going by this baseline, you
| could assume upward of 50% of replies are honest (even if
| mistaken); otherwise you'd have to explain why you believe
| people would suddenly lie about that particular thing.
| hombre_fatal wrote:
| Because we can't be sure whether two people interpret
| "inner monologue" the same way, or whether they think it
| describes a phenomenon that actually isn't different
| between them and other people.
|
| For example, I can think of interpretations of "I picture
| objects that I'm thinking about" that range from me not
| experiencing the phenomenon to me indeed experiencing the
| phenomenon.
|
| To say that you're not experiencing something that other
| people are experiencing in their head is a solipsistic
| notion where you hypothesize an experience that you
| imagine others are having and then discard it for being
| different than yours.
| TeMPOraL wrote:
| And here I thought this was solved decades ago - I need
| to find the source, but I read about an old study where
| people describe their experience, and the answers were
| all over the "range from me not experiencing the
| phenomenon to me indeed experiencing the phenomenon".
|
| Then again, it's trivially reproducible - people self-
| report all variants of inner monologue, including lack of
| it, whenever a question about it pops up on-line. Same is
| the case with imagination - aphantasia is a thing (I
| would know, I have it).
| hombre_fatal wrote:
| I'm responding to "why can't we just take their word for
| it?"
|
| That you and I can come up with different ways to
| describe our subjective experience in conversation
| doesn't mean that we have a different subjective
| experience.
|
| Especially not when relayed by a species that's
| frequently convinced it has a trending mental disorder
| from TikTok.
| TeMPOraL wrote:
| We can keep talking about it, and assuming we're both
| honest, we'll arrive at the answer to whether or not our
| subjective experiences differ. To fail at that would
| require us to have so little in common that we wouldn't
| be able to communicate _at all_. Which is obviously not
| the case, neither for us, nor for almost every possible
| pair of humans currently alive.
| Terr_ wrote:
| > The existence of an "inner monologue" isn't really a
| falsifiable claim.
|
| Another possibility is that inner-monologues (ones not
| forced by conscious effort) do exist, but are just a kind
| of false-memory, something one part of our brain
| generates after-the-fact to explain/document the outcome
| of another non-language part.
|
| Kind of like how certain reflex-actions can occur before
| certain decision-making areas of the brain light up, yet
| humans will _believe_ that they sensed the event and made
| a thinking choice.
| mbesto wrote:
| The fact that we don't actually have an understanding of
| and framework for reasoning (e.g. whether inner monologue
| is a cause or an effect) means we are VERY far off from
| general AI.
|
| https://youtu.be/QGYbbLWn-IE?t=72
| logicchains wrote:
| >For reasoning of any complexity, some sort of internal
| monologue and iteration is needed.
|
| From a formal perspective you're entirely correct.
| Transformers with chain-of-thought are strictly more powerful
| than transformers without it, and can efficiently solve
| classes of problems that would otherwise require
| exponentially increasing model depth:
| https://arxiv.org/abs/2310.07923
| ericmcer wrote:
| If you really think about what an LLM is you would think there
| is no way that leads to general purpose AI. At the same time
| though they are already doing way more than we thought they
| could.
|
| Maybe people were surprised by what OpenAI achieved so now they
| are all just praying that with enough compute and the right
| model AGI will emerge.
| more_corn wrote:
| LLMs are the language center of an AI. They in no way purport
| to be: long term memory, critical thinking engines,
| calculators.
|
| If we want those things we can build them. Building them into
| the language center would be absurd and weird.
| daveguy wrote:
| This is a very good take on the current state of LLMs. They
| are able to articulate surprisingly well and LLMs or
| something similar will definitely be part of a larger AI
| model in the future. They are good creative aids, but they
| should not be trusted.
| viking123 wrote:
| In the brain all those things are neural networks too;
| memory is encoded in the weights. So if we want to have
| that type of system, everyone should have their own model,
| and the weights would actually adjust all the time just
| like in our brain. For now, though, the model is just
| trained once and that is that.
| lupire wrote:
| And yet this is what all the big companies are trying to
| do, because the LLM is an incredibly expensive fixed cost
| with a tiny marginal cost for each new task.
| solidasparagus wrote:
| > If you really think about what an LLM is you would think
| there is no way that leads to general purpose AI
|
| It is an autoregressive sequence predictor/generator. Explain
| to me how humans are fundamentally different
| throwanem wrote:
| "Prove me wrong?" That's not how this works. Your implicit
| claim here is that human cognition and LLM functioning are
| fundamentally similar. That claim requires substantiation.
| solidasparagus wrote:
| It is how it works if you are replying to someone who
| claims "If you really think about what an LLM is you
| would think there is no way that leads to general purpose
| AI". The counter example is human beings are considered
| general purpose intelligence and we are complex, but
| fundamentally predictable systems (not by us today), with
| (as far as we can tell) deterministic outputs based on
| the state of the universe (including physical being like
| brain and chemicals).
| throwanem wrote:
| Responding to an unsubstantiated claim with an
| unsubstantiated claim just adds another unsubstantiated
| claim.
|
| So far as I know, whether the universe behaves
| deterministically remains an unsolved question. Given
| that, your statement here would already be one of belief
| rather than fact, even before we get to the
| parentheticals. There is information here, but not about
| whether LLMs can develop into AGI.
| solidasparagus wrote:
| Fine, you can ignore my previous comment, that's just my
| answer to the question that this discussion ultimately
| takes you to. But I feel like you are just sitting on the
| sidelines making strawmen and playing pedantic games
| instead of saying anything constructive.
|
| The original comment said:
|
| > If you really think about what an LLM is you would
| think there is no way that leads to general purpose AI.
|
| This is an inflammatory way to state an extreme position
| on a well-discussed debate over whether next-token
| prediction can lead to general intelligence. The original
| commenter clearly believes it can't get you there. If you
| want to say that with any authority, you need to have an
| answer for what is different between what we consider
| general intelligence (for most people, this is simply
| human intelligence) and what models are capable of. This
| is the question at the heart of artificial intelligence.
|
| I challenged them to explain their answer. I made no
| claims, I asked no one to prove anything wrong. If it is
| obvious that LLMs can't be AGI, the answer to how an LLM
| differs from human intelligence is also obvious, right?
| Vegenoid wrote:
| > I made no claims, I asked no one to prove anything
| wrong.
|
| Your original comment was:
|
| > It is an autoregressive sequence predictor/generator.
| Explain to me how humans are fundamentally different.
|
| Which would be interpreted by most reasonable people as
| you making the claim that humans are autoregressive
| sequence predictors, and asking people to prove you
| wrong. I can see how you could say this without intending
| to make that claim, but most people will interpret this
| as you making that claim.
|
| I do not intend to inflame things or discredit what you
| are saying, but just to say that if you did not intend to
| make a claim or ask people to prove you wrong, a
| different approach would be more successful in the
| future.
| solidasparagus wrote:
| You're right, it was hastily written and I was annoyed.
|
| But I generally hold out hope that people can see a claim
| "A!=B" and a response "A=C, explain how C!=B" and
| understand that is not the same as claiming "C=B",
| especially on HN.
| throwanem wrote:
| I do remain convinced my interpretation was sound, but on
| review I have to concede it was also quite uncharitable.
|
| With all the wildly overheated claims that've been flying
| around since the advent of these models, I may be myself
| somewhat overfitted. Granted, in such an environment I
| feel like a little extra care for epistemic hygiene is
| warranted. But there was no reason for me to be rude
| about it.
| Vegenoid wrote:
| I know what you mean. Unfortunately, it's easy for frank
| and concise language to be taken the wrong way when in
| written form (and sometimes even verbal form). I wish I
| didn't have to make qualifiers about my intent on my
| internet comments, but I often do, to try and make sure
| that other people take my comment the way I intended it.
| I think it generally leads to better discussion.
|
| I don't blame people for not wanting to talk this way.
| photon_lines wrote:
| I actually did a full write-up on this here fyi:
| https://photonlines.substack.com/p/intuitive-and-visual-
| guid.... You can skip most of this and scroll down to the
| end-section called 'The Mental Model for Understanding
| LLMs' where I try to map how transformers are able to
| mimic human thinking. I think that comparing them to
| auto-associative / auto-regressive networks is actually a
| really good analogy FYI and I do believe a lot of human
| learning is based on this.
| solidasparagus wrote:
| An article this long needs a table of contents IMO.
| photon_lines wrote:
| Will note this and try to add a TOC next time I do this
| sort of write-up. This article wasn't supposed to be this
| long - I originally meant to make a more compact version
| but kept on finding more things which I needed to add in
| order to map to a correct 'mental model' of what ChatGPT
| is.
| semi-extrinsic wrote:
| We don't have sufficient understanding of human thinking
| and mental processes to make any claims like the ones you
| are trying to make. If we did, psychology would be a
| quantitative hard science; it most definitely is not.
| Workaccount2 wrote:
| We have no idea how human thinking works and no idea what
| consciousness is.
|
| I'm getting so tired of listening to software engineers
| LARP pseudo neuroscientists with 6th grade level
| insights.
|
| >Of course, the [AI] brain isn't 'conscious.' It doesn't
| have any survival instincts which we humans do.
|
| Bruh...
| photon_lines wrote:
| So instead of shouting insults, maybe you can shine some
| light on how anything stated there is wrong? Everything
| that I said there is true - we don't have a good mapping
| of what consciousness is - if you want to add some
| insights add them instead of making insulting and useless
| statements on HN.
| Workaccount2 wrote:
| We have no idea if ChatGPT is conscious or not. We have
| no idea if it is "thinking" or not.
|
| What we know with a reasonably high level of certainty is
| that consciousness and "thought" are physical processes.
| That's about it.
|
| Pulling out the scalpel to start dividing up which
| physical processes are and aren't conscious is a fool's
| errand. And it is especially foolish when just making up
| arbitrary qualifications for it.
|
| Am I saying that ChatGPT is conscious? No. But I am saying
| you shouldn't give much credence to people who are
| anything more than agnostic about it.
| nyrikki wrote:
| Human neurons are continuous input, with active dendrites
| and dendritic compartmentalization. Spiking artificial NNs
| seem to hit problems with riddled basins so far. A
| riddled basin is a set with no open subsets.
|
| Feed forward networks are effectively DAGs, and circuit
| like, not TM like.
|
| Caution is warranted when comparing perceptrons with
| biological neurons.
|
| Dendrites can perform XOR operations before anything
| makes it to the soma for another lens.
|
| While there is much to learn, here is one highly cited
| paper on dendritic compartmentalization.
|
| https://mcgovern.mit.edu/wp-
| content/uploads/2019/01/1-s2.0-S...
|
| I think that the perceptron model of learning plasticity
| is on pretty shaky ground as being a primary learning
| model for humans.
| indoordin0saur wrote:
| AI needs to see thousands or millions of images of a cat
| before they reliably can identify one. The fact that a
| child needs to only see one example of a cat to know what a
| cat is from then on seems to point to humans having
| something very different.
| solidasparagus wrote:
| > AI needs to see thousands or millions of images of a
| cat before they reliably can identify one.
|
| Not if they inherit from a previous generation of AI. But
| even if they did, a different training speed does not
| imply a different capability
| indoordin0saur wrote:
| My point is not that humans have a faster training speed
| but that humans must be doing something fundamentally
| different from LLMs. You could build Altman's $7 trillion
| GPU cluster and use the majority of the world's
| energy to feed it and you'd still hit the same
| limitations if you're just running an LLM on it, even a
| very sophisticated LLM. This is Yann LeCun's position as
| well.
| solidasparagus wrote:
| That, as I understand it, is not a valid chain of logic.
| Requiring fewer data points does not inherently indicate
| that the underlying mechanism (autogressive sequential
| generation, not the transformer which is just an
| architecture) is different.
|
| Not to mention the secondary arguments like - no proof
| that humans learn faster from fewer data points; that's
| just your assumption in the sibling comment. Humans
| inherit information. The equivalent - fine-tuning a
| foundation model - is very fast to learn novel objects.
|
| Just because someone has a Turing award doesn't mean they
| know what they're talking about. They are just people,
| with strengths and weaknesses like everyone else. But
| often on the extreme end of strengths and weaknesses.
| TeMPOraL wrote:
| Humans train on _continuous video_. Even our most
| expensive models are, in terms of training set size, far
| behind what an infant processes in the first year of
| their life.
|
| EDIT: and it takes human children a couple years to
| reliably identify a cat. My 2.5 y.o. daughter still
| confuses cats with small dogs, despite living under one
| roof with a cat.
| indoordin0saur wrote:
| I contend that you could show any child old enough to
| communicate in basic English a photograph (so not live
| continuous video) of some obscure animal they've never
| seen before (say an Okapi) and they'd be able to easily
| identify another Okapi when seeing one at a zoo.
| bongodongobob wrote:
| So you're just going to ignore the 5 years of continuous
| training? I'm not sure what point you're trying to make.
| TeMPOraL wrote:
| My daughter is 5 y.o., which means that, because of
| kindergarten, I spend plenty of time around kids this age.
| A random kid this age would absolutely fail your test.
| They may remember the word after one exposure, but I
| doubt they'll remember any of the distinctive features.
|
| Hell, many adults would fail it. I'm not sure if I could
| pass such test - in my experience, you remember the
| important details only after first experiencing a test
| and realizing what exactly it is that would be useful in
| distinguishing the two animals.
| GaggiX wrote:
| If the model is first pre-trained on unlabeled images,
| then it takes about 10 labeled images of cats and 10
| labeled images of dogs to train a (possibly strong)
| classifier (example: DINOv2), I doubt humans will do
| better.
| TeMPOraL wrote:
| That's a good point; when comparing performance with
| humans, one has to remember that a human spends _years_
| of training on unlabeled images in the form of a
| continuous video stream, on top of audio streams and other
| senses, before they're able to communicate with you so you
| could test them.
| slashdave wrote:
| Even language is not sequential.
| solidasparagus wrote:
| Tell me more?
| slashdave wrote:
| Language is only sequential in the form it is transmitted
| (verbally). There is no reason that sequential statements
| are generated sequentially in the brain. Quite the
| opposite, really, if you consider rules of grammar.
|
| I really wish most of the LLM folks just took a few
| courses in linguistics. It would avoid a lot of noise.
| GaggiX wrote:
| Since LLMs seem to follow the rules of grammar quite
| well, and this is quite the opposite of "sequential", I
| guess LLMs do not process information very sequentially.
| solidasparagus wrote:
| Any pointers for where to start? Linguistics is dense,
| the terminology is a slog, it seems huge, and I was never
| clear that there was real value to be gained. But it's
| always useful to be more informed.
| naasking wrote:
| LLMs don't generate their language sequentially either,
| they just output it sequentially token by token.
| TeMPOraL wrote:
| Whoever cares about "rules of grammar"?
|
| Grammar is descriptive, it formalizes the language so it
| doesn't break down into regional dialects too fast, and
| otherwise is just a crutch for people learning the
| language, especially if it's not their first language.
| The way you acquired your first language is the same way
| LLMs learned to utter grammatically correct sentences: by
| being exposed to lots and lots of examples, and
| eventually getting a feel for it. Similarly, if you're
| fluent in a language, you don't even think of grammar
| when using it - the right phrases in correct forms just
| come to you.
| glitchc wrote:
| The problem is a good chunk of the global population is also
| not reasoning and thinking in any sense of the word. Logical
| reasoning is a higher order skill that often requires formal
| training. It's not a natural ability for human beings.
| slashdave wrote:
| Seriously? You think individuals are incapable of reasoning
| without training first?
| TeMPOraL wrote:
| Do you think they are? We haven't had a case of individuals
| without training in like 10 000 years.
| rep_lodsb wrote:
| And judging by some comments here, humans must have
| gotten dumber since then. Or we would never have invented
| writing, agriculture or even basic clothing.
|
| "This problem is actually not that easy, the average
| person couldn't solve it either, especially if the
| numbers were bigger", "Yet another cherrypicked clickbait
| study to make LLMs look bad, those people are just scared
| of being made obsolete", etc.
| TeMPOraL wrote:
| > _Or we would never have invented writing, agriculture
| or even basic clothing._
|
| That's the thing. We had to _invent_ those things. Along
| with counting, numbers, logic, arithmetic, and those
| stupid-ass annoying family tree riddles. We didn 't get
| them in one step, it took a long time to build one on top
| of the previous. Generation by generation, each cohort of
| children growing up in a slightly more complex world than
| their parents, each generation _taught_ how to navigate
| this complexity by their families and tribes. Learning a
| growing collection of facts and beliefs and life hacks.
|
| There were no untrained people for as long as humanity
| existed. The minimum reproductive unit of homo sapiens is
| a village.
| glitchc wrote:
| Yes, seriously.
|
| Some examples: An individual without training cannot
| reliably separate cause from effect, or judge that both
| events A and B may have a common root cause. Similarly,
| people often confuse conditionals for causation. People
| often have difficulty reasoning about events based on
| statistical probabilities. Remember, the average person in
| North America is far more terrified of a terror attack than
| an accident or a heart attack, yet the latter two are much
| more likely to be their cause of death.
| slashdave wrote:
| You mean without training, people cannot frame answers in
| the terms you've learned from training. Well, why are you
| surprised?
|
| If you think reasoning is limited to the frameworks you
| learned from a book, you live in a small world.
| glitchc wrote:
| In an effort to be pejorative, you have reduced the
| process of learning to a basic regurgitation of facts.
| Rather, learning about reasoning is more about developing
| a framework that allows one to construct claims based on
| available information and then check the veracity of
| those claims while avoiding all, or at least typical,
| fallacies.
|
| Critical thinking requires a certain amount of rigor,
| which formal education is well-suited to impart. It can
| be self-taught with a hefty dose of discipline, but it
| cannot be intuited.
| bithive123 wrote:
| In "any sense" of the word? Surely anyone who adjusts their
| behavior when they get undesired or unexpected results is
| reasoning and thinking. And since most activities are
| mediated by thought of some kind, most people are reasoning
| and thinking otherwise they would never recover from even
| simple mistakes, like walking east when they need to go
| north.
|
| Saying they're "not thinking in any sense of the word"
| because they can't solve predicate logic problems from a
| college textbook is a rather odd claim. Surely those things
| arise from reasoning and thinking, rather than the other way
| around.
| hellojesus wrote:
| This seems to me to be where these systems need to go in
| the future, akin to reinforcement learning.
|
| You feed an llm a prompt. It then abstracts and
| approximates what the result should be. It then devises a
| hypothesis and solves it and compares it to the
| approximated output. Then it can then formulate a new
| hypothesis and evaluate it, based off the outcome of
| hypothesis 1. From there it can either keep iterating or
| dump that path for a new one (e.g., the next best
| hypothesis in the original formulation).
|
| At some point the answer is "good enough." But along the
| way it keeps playing against its thoughts to see if it can
| do better.
|
| A key issue may be the original approximation, so it may
| need to consider its adjustment when iterating.
|
| Maybe this is how cutting edge llms work now. I have no
| idea.
| naasking wrote:
| > Logical reasoning is a higher order skill that often
| requires formal training. It's not a natural ability for
| human beings.
|
| I've read your comments here, and while I understand your
| point I think you have it backwards. The only reason we
| formed societies is because we evolved an innate theory of
| mind to reason about how others might be thinking and
| feeling. That's reasoning. We have a natural ability to do
| limited arithmetic, otherwise we wouldn't be able to hunt,
| gather or track enough to endure winters, or keep track of
| our sheep or children for that matter. That's reasoning.
|
| Reasoning _is_ a natural ability for human beings, but we
| also carry a lot of evolutionary impulses that add a lot of
| noise to the decision process, eg. observation-
| >judgment->[set of possible decisions], judgment has "reason"
| as one path that adds to the set of possible decisions, but
| there remain other paths we inherited from our evolutionary
| roots. Education is training that suppresses poorly
| calibrated judgment paths that lead to frequent mistakes in
| the decision set, but reasoning remains, thus education
| improves the signal to noise ratio of our decision making.
|
| So I 100% disagree that an individual cannot separate cause
| and effect without training. They will just be _worse_ at it
| than someone who is trained to filter out those impulses that
| lead us to jump to conclusions, eg. they will produce more
| noise / a larger set of possibilities than reason would
| allow.
| Mmskynettio wrote:
| It's not AI hype. Hype is defined as something which gets
| oversold: "promote or publicize (a product or idea)
| intensively, often exaggerating its benefits."
|
| Just yesterday I visited a Google Cloud summit, and one
| person from Bosch told the audience how they are now able
| to work with fewer external agencies (text writing,
| graphic design and photography) for their materials.
|
| It already saves money, has real impact and continues to
| progress.
|
| We also don't know what ChatGPT 5 will bring, because they
| say it will do more reasoning than before, but we
| (people/our society) are already working on solving this
| in different ways: from code which creates a unit test
| first and then the code, to different types of
| architectures.
|
| For me, 2024 was the year of LLM cost reduction and the
| year LLMs got big context windows.
|
| AI doesn't need to be ready tomorrow, but its capabilities
| are already really good. And I know plenty of people
| around me who are a lot less interesting to talk to than
| any LLM (from a human skill/knowledge point of view).
|
| Llama 3 was also a big achievement in 2024. Facebook shows
| that better data leads to better quality for smaller
| models.
|
| We have not only entered the AI era but also the 'gather
| all the knowledge we can, quality-check it and refine it,
| because now we can actually do something with it' era.
|
| We are in the knowledge feedback-loop era.
| Lich wrote:
| > AI doesn't need to be ready tomorrow, but its capabilities
| are already really good.
|
| A majority don't deny that it's good. The problem is that so
| many think it is actually reasoning, believing the answers
| can be trusted.
| rvnx wrote:
| It's generally reasoning better than me, so using them I
| make fewer mistakes than if I did my own logic.
| Mmskynettio wrote:
| Does it need to do reasoning perfectly?
|
| If it has created meta concepts from billions of words on
| the internet, and has meta models which are correct and
| are more numerous and better than an average human's,
| isn't it actually good at reasoning?
|
| It's a very narrow thing to go from 'so many think it's
| actually reasoning' to saying AI is just hype or that
| everything we are doing is a waste, etc.
|
| There are human benchmarks they are winning at. The
| criticism could rather be that we don't have enough
| benchmarks.
| daveguy wrote:
| > We are also don't know what ChatGPT 5 will bring, because
| they say this will do more reasoning than before...
|
| This paper very clearly demonstrates these LLMs are not
| reasoning in a fundamental way. Token prediction and
| reasoning are two different tasks. They may be related, but
| they are not the same. "Just wait for GPT 5, it will be
| amazing!" is part of the hype.
|
| Please do not assume an LLM is correct in skill or knowledge
| unless you already know the answer or can verify by other
| means.
| Mmskynettio wrote:
| The problem is that we don't know how we do reasoning.
|
| I calculate stuff by following a formula after I have
| pattern-matched it to a problem I already know.
|
| Plenty of humans are not able to solve those math problems.
|
| If the future of LLMs / AI becomes an LLM with multimodality
| and mixture-of-experts that solves those reasoning problems,
| we still won't know whether this is a different type of
| reasoning than what humans do.
| nurple wrote:
| Your post is complete hype, all about people saying things
| instead of showing things that've actually been done.
|
| For me, 2024 was the LLM exposed as basically pure hype year.
|
| There is no expert of any field I follow online where they're
| posting up results from AI tooling for any other reason than
| to show how awful it is. I consider myself an expert in
| software, and LLMs specifically have only caused me great
| pain.
|
| Even the one situation you describe, someone touting the
| ability to work in an absolute vacuum, sounds like a huge
| negative to me. The recent push for DEI policies was even
| ostensibly about the importance of people of diverse
| backgrounds and viewpoints working together.
|
| The most important thing you're missing a perspective of scale
| on is the step you describe as "quality check it". On things I
| don't know and have attempted to enlist an LLM's help on, in
| every case I have had to go back and just actually learn how
| the thing works, after time wasted struggling with subtle
| wrongness in the output.
|
| At least I have the background expertise to do that; however,
| I have seen a junior dev's mind get thoroughly rotted by too
| much time in pure LLM land. Besides the cost of rewriting their
| code, the company was now the proud owner of a young dev with
| a mind filled with nonsense.
|
| How do you even weigh the cost of fixing a corrupted human
| mind?
| bongodongobob wrote:
| Eat something and take a nap, you sound unhinged.
|
| ChatGPT has nearly doubled my work output, most of my job
| is system admin infra type stuff and it's ridiculously good
| at troubleshooting odd issues.
|
| Hopefully you can find a use case for it someday, until
| then, the rest of us will continue to be more productive.
| hellojesus wrote:
| Do you have any concern about the data you're feeding to
| the vendor serving your prompts?
|
| I've had junior devs tell me they use chatgippity to combine
| Excel workbooks, and when I confirm they're not self-hosting
| an LLM to do it, I ask if they think it's a good idea to hand
| over company data to OpenAI. They don't care.
|
| In a world of tight security, I find it astonishing that so
| many people willingly give away trade secrets to these
| companies, which can sell them to any bidder they choose.
| bongodongobob wrote:
| I do think it's fine yes, but I also am not giving it
| anything sensitive.
| slashdave wrote:
| There is no feedback. You cannot create new knowledge out of
| thin air.
| nuancebydefault wrote:
| > They are not reasoning or thinking in any sense of the word
|
| There is no definition of reasoning or thinking. No single
| human knows what it is.
|
| The only thing we know is: we as humans are capable of
| recognizing steps and results of reasoning and thinking.
|
| In a lot of cases, when using LLMs, those results appear to be
| correct and usable. This is often easy to determine with
| generated code.
|
| I want to argue that, lacking a definition of reasoning, I am
| happy to have found that the machine helps me to get results
| that might as well have been produced by a lot of human
| knowledge, wisdom and deep reasoning.
| breck wrote:
| It's an interesting paper, but a big flaw is that this:
|
| > Alice has N brothers and she also has M sisters. How many
| sisters does Alice's brother have?
|
| ...is not actually a simple task.
|
| This can be quantified.
|
| "1 + 1" is a simple task. It has a pretty small Total
| Complexity[1].
|
| But to represent their task starting only with 1 and 0, you have
| to build a program of many, many lines. Orders of magnitude
| bigger than "1 + 1". Concepts like "has", "brother", "sister",
| "person", et cetera, have to be defined and built up.
|
| [1] Counting Complexity (2017)
| https://github.com/breck7/breckyunits.com/blob/main/research...
| mjburgess wrote:
| Many of the datasets for the "benchmarks" on which the major
| public LLMs are assessed are clearly present in their training
| data, making them basically useless for establishing reliability
| of the models. It's fairly obvious that at least some of the
| improved scores from later generations of models come from
| this benchmark data being increasingly represented in the
| training data.
|
| A better way of assessing LLMs is waiting a few weeks until novel
| tests have been created explicitly absent from all prior training
| data, and then using those.
|
| As has been shown, e.g., on legal tests, exams, etc.,
| performance drops off a cliff when genuinely out-of-sample
| future data is used, rather than these faked, pretend
| out-of-sample benchmarks.
| imtringued wrote:
| Honestly, every time I see someone use MMLU as a benchmark I
| wonder what they are trying to prove. MMLU is a simple
| multiple-choice test with the answers publicly available.
|
| Simply picking answers at random should give you about 25%.
| Knowing 50% of the answers and picking the rest randomly gives
| you 62.5%, which is very close to the scores of SOTA LLMs. The
| benchmarks that supposedly show reasoning are pretty bad and
| have very little to do with reasoning. A lot of the questions
| can be answered through memorization.
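|
| For reference, the expected score under that mixed strategy is
| just known + (1 - known) / 4; a tiny sketch of the arithmetic
| (my own, not from any benchmark code):
|
|     # Expected score on a 4-option multiple-choice test when a
|     # fraction of answers is known and the rest are guessed
|     # uniformly at random.
|     def expected_score(known: float, options: int = 4) -> float:
|         return known + (1 - known) * (1 / options)
|
|     print(expected_score(0.0))  # 0.25  -> pure guessing
|     print(expected_score(0.5))  # 0.625 -> knowing half the answers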
|
| I agree with you. The benchmarks are garbage. I thought about
| building my own benchmarks, but this would require building a
| complex benchmarking framework first and I just don't have the
| time for preparatory work like that.
| lupire wrote:
| I remember when ChatGPT 4 posted a set of standardized test
| benchmarks, including the AMC 12 math test, where ChatGPT
| scored higher than guessing randomly, but lower than leaving
| the test blank (due to the penalty for wrong answers).
| ankit219 wrote:
| MMLU is not a reasoning benchmark. It's a measure of how
| distributed and representative their training data was and how
| well it's able to recall (for lack of a better word) based on
| training epochs.
|
| GPQA etc. test reasoning in some form, and you see the drastic
| change in score between the two for every model.
| mvkel wrote:
| Isn't this just saying what we already knew, that tokenization
| has these fundamental flaws? Did we need a paper for it?
| gs17 wrote:
| It is, I'd even say it's just "LLMs aren't very good at math"
| rather than a "Complete Reasoning Breakdown". Still, evaluating
| when/how they fail could be useful.
| raldi wrote:
| I tried asking Claude variants where N was something other than a
| nonnegative integer, and he properly pointed out the absurdity,
| but was willing to play along once I brought up theoretical
| physics:
|
| _[...] So, in our quantum mechanics interpretation, how many
| sisters does Alice's brother have? It's a superposition of
| states:_
|
| _1. In one state, he has Alice as his sister (+1) and the
| "antisister" (-1), totaling 0 sisters. If they met, they'd
| annihilate, leaving him sisterless!_
|
| _2. In another state, he has Alice (+1) and owes a future sister
| to the quantum family field (-1). Currently, that's 0 sisters,
| but he's in "sister debt."_
|
| _3. In a third state, he has Alice (+1), but his family's
| entanglement means another family's loss of a sister affects him
| too (-1). Bizarrely, he experiences both having and not having a
| sister simultaneously._
|
| _This interpretation is, of course, a playful and highly
| metaphorical application of quantum concepts to something they
| don't actually apply to--family relationships. In reality,
| quantum effects like superposition and entanglement are
| incredibly delicate. They typically only persist in highly
| controlled conditions, like ultra-cold atoms or isolated photons.
| The constant interactions in a family would "decohere" any such
| quantum states almost instantly._
| asah wrote:
| "Alice has N brothers and she also has M sisters. How many
| sisters does Alice's brother have?"
|
| Gemini nailed it with no extra prompting.
|
| https://g.co/gemini/share/6946d20c0a4d
|
| When I gave it numbers, Gemini got confused.
|
| Alice has 4 brothers and she also has 3 sisters. How many sisters
| does Alice's brother have?
|
| https://g.co/gemini/share/14d34dcdc5df
|
| But then when I suggested that it might be a trick question, it
| nailed it again:
|
| https://g.co/gemini/share/e1f1ce8f32a8
|
| note: I didn't try 60/212 siblings - I consider this an unfair
| question because it's a reasonable assumption that Alice is a
| type of creature (human?) that cannot have hundreds of siblings,
| and thus I don't expect a respondent to give a "correct" answer.
| ModernMech wrote:
| Oh I forgot about Gemini. I just tried my boat test
| (https://news.ycombinator.com/item?id=40455392) that all LLMs
| I've tried have failed. Gemini failed as well, even with your
| addition about thinking carefully and telling it that it may be
| a trick. (Sorry I couldn't figure out how to link it like you
| did)
|
| Prompt: "You are on the shore of a river with a goat and a
| boat. How do you get to the other side? Think carefully and be
| aware this may be a trick question."
|
| This is a classic logic puzzle with a slight twist! It's likely
| a variation of the farmer, wolf, goat, and cabbage problem.
| Here's the solution assuming the goat can't operate the boat:
|
| 1. Take the goat across the river. Since there's no mention of
| the goat being a threat (the trick!), you can safely bring it
| to the other side first.
|
| 2. Return alone. Leave the goat on the far shore and come back
| by yourself.
|
| 3. Bring the boat back empty. This might seem unnecessary, but
| it's the key to the trick question.
|
| 4. Now you can sail across yourself. Since you brought the
| boat back in step 3, you can now comfortably travel to the
| other side and reunite with the goat.
|
| The trick lies in step 3. If the question didn't specify
| needing to bring everything across in one trip, most people
| would instinctively take something (goat or nothing) across on
| the return trip. But by going back empty-handed, you ensure
| the safe passage of yourself without needing to worry about
| the goat being left unattended.
|
| (The expected answer should be along the lines of "Use the boat
| to cross the river" and that's it. I'll also accept, sail with
| the boat and the goat across the river and that's it, but
| that's technically wrong because I did not ask to get the goat
| across the river.)
| Closi wrote:
| ChatGPT 4o seems to occasionally get this correct, but still
| thinks it's a puzzle and will often get this wrong...
|
| https://chatgpt.com/share/b8901f4c-7313-4461-acad-9e561c1e6e.
| ..
|
| Still seems like it fares slightly better than some of the
| other models.
| ModernMech wrote:
| Yeah, the overfitting is apparent with these kinds of edge
| cases, which makes you wonder where else it's overfitting. I
| can actually tell when I'm working in a programming domain
| where I'm an expert and where a compiler holds the source of
| truth, but not in others, so I'm wary of using it there.
| Closi wrote:
| Interestingly you get a great answer if you change the
| prompt as follows which implies that it is not a puzzle:
|
| > I am on the shore of a river with a goat and a boat.
| How do I get to the other side? I work in commercial
| agriculture logistics.
|
| This gets a great answer even with GPT3.5
| ModernMech wrote:
| Lol, it also gives an okay answer if the goat works in
| commercial agriculture :P
| mwigdahl wrote:
| My experience with GPT-4o with the same prompt was pretty
| solid. With no special system prompt and your same user
| prompt, I got:
|
| _This scenario is intriguing and does seem to carry the
| potential for a tricky question. In the simplest terms, the
| presence of a goat doesn 't complicate the actual act of
| crossing the river. If there are no additional constraints or
| rules involved (like the need to also transport something
| else that requires careful consideration), you can simply:
|
| 1. Place the goat in the boat. 2. Row across the river.
|
| However, if we are to assume there's more depth or a twist to
| the question, we might consider an outside-the-box answer.
| Here's an alternative possibility if there is indeed some
| trick:
|
| 1. Get into the boat. 2. Leave the goat on the shore (if it's
| safe) and row across the river.
|
| There might be other interpretations depending on additional
| constraints that could be implicitly part of the question. If
| the goal was simply to get yourself across, leaving behind
| the goat and traversing the river is the straightforward
| solution._
| freilanzer wrote:
| GPT-4o got it right with the abstract puzzle. Gemini got it
| wrong when I tried it.
| kaibee wrote:
| > that cannot have hundreds of siblings
|
| See this is the problem with claims that humans are a 'general
| intelligence'. They get confused when encountering out-of-
| distribution situations. A true general intelligence would
| simply apply the knowledge that surrogate pregnancies cost
| around ~$50,000 and recall from historical context their
| knowledge of IVF. The AGI would then assume that the situation
| is simply that a billionaire couple has decided to have
| hundreds of kids and get on with the calculation. The search
| for intelligent life continues.
|
| content note: i'm sorry
| wg0 wrote:
| Tech Bros just yesterday yelled at the top of their lungs that
| an AI takeover is just around the corner, just like Tesla FSD
| and Fusion. And Quantum computing.
|
| All in for scientific progress, experimentation and failure,
| but there's a clear case of a hype train here, and jacking up
| valuations is also riding along, very confidently and
| shamelessly.
|
| An average tech-outsider investor would have FOMO given the
| kind of crazy, tall promises and claims that are being made
| constantly. They must be called out as such, because in the
| long run they undermine the general public's confidence in
| serious and grounded science, which leads to science deniers
| and naysayers.
|
| The pursuit of science is the noblest of all pursuits. A hasty,
| greedy, purely capitalist pursuit of commercialisation, I am
| not so sure.
| d3m0t3p wrote:
| I really liked this paper: it is easy to understand and it
| explains every piece of terminology used, even the variance
| computation! I don't expect that much detail from every paper,
| but it's bliss to read.
| ape4 wrote:
| This is obvious... but some human knowledge outside the question
| is needed. (Alice is female, sister is female, brother is male,
| relation of sisters and brothers). In a school math "word
| problem" you don't need this kind of info.
| theptip wrote:
| LLMs perform well on "human knowledge" like you are describing.
| They don't lack knowledge of simple relational facts like
| these.
|
| Nonetheless it would be interesting to see the problem
| reformulated in purely mathematical terms, I suspect models
| would perform better.
|
| That is the point though - models are showing an inability to
| generalize their capabilities from one domain (maths / abstract
| logic) into other domains (conversational reasoning).
| vaylian wrote:
| I'm still waiting for people to combine LLMs with logic engines
| (prolog) or proof assistants. LLMs are afaik not designed to
| reason. But they can probably be used to construct formal
| statements that can be processed by logic engines.
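|
| A rough sketch of that division of labour: the LLM's only job
| would be to translate prose into structured facts like the ones
| below, and a deterministic solver (here a Prolog-style sibling
| rule written in plain Python, my own toy illustration, not an
| existing library) does the actual inference:
|
|     # Derive "sister" facts from "parent" and "female" facts,
|     # the kind of structured output an LLM could be asked to
|     # emit for a solver to consume.
|     parents = {          # child -> set of parents
|         "alice": {"mom", "dad"},
|         "bob":   {"mom", "dad"},
|         "carol": {"mom", "dad"},
|     }
|     female = {"alice", "carol", "mom"}
|
|     def sisters(person):
|         return {
|             other for other, ps in parents.items()
|             if other != person and ps & parents[person]
|             and other in female
|         }
|
|     print(sorted(sisters("bob")))  # ['alice', 'carol'] -> 2 sisters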
| waynecochran wrote:
| Citation 40 is the longest list of authors I have ever seen. That
| is one way to help all your friends get tenure.
| hyperbovine wrote:
| Tenure committees have to write a report detailing every single
| one of your papers and what your contribution was.
| waynecochran wrote:
| There are always "bean counters" added somewhere in the
| process. There are many places where the person lists their
| number of publications and that is all most people will ever
| see.
| mr_mitm wrote:
| I didn't count it, but I think papers from high energy particle
| physics have it beat. Some have over 5k authors.
| dxbydt wrote:
| https://arxiv.org/pdf/2206.04615
| YeGoblynQueenne wrote:
| I guess this is the reason:
|
| >> BIG-bench currently consists of 204 tasks, contributed by
| 450 authors across 132 institutions.
|
| Regardless, I'm not citing a paper with a whole page of
| author names unless I'm allowed to shorten it significantly
| in the bibliography section (e.g. "Srivastava and 450
| others").
| nottorp wrote:
| Of course, the initial mistake is assuming LLMs have any
| 'reasoning'...
| ticklemyelmo wrote:
| "The breakdown is dramatic, as models also express strong
| overconfidence in their wrong solutions, while providing often
| non-sensical "reasoning"-like explanations akin to
| confabulations to justify and backup the validity of their
| clearly failed responses, making them sound plausible."
|
| It's fascinating how much they anthropomorphize the systems and
| credit them with emotional, possibly deceitful behaviour, in a
| paper trying to explain how unintelligent they are.
| visarga wrote:
| This prompt was often discussed on forums over the last year;
| we already knew models fail it.
| gs17 wrote:
| No reason not to evaluate it more in depth, though.
| Understanding not just that they fail but if some models fail
| more/less and under what conditions is still useful.
| gwbas1c wrote:
| I just played the game and sent ChatGPT (free, I think 3.5)
| "Alice has 5 sisters and 3 bothers. How many sister's does
| Alice's bother have?"
|
| The whole thing felt like interacting with your typical support
| rep who's friendly but otherwise has no common sense and
| intuition about the thing they're supporting. In other words, it
| felt like I was interacting with a typical "not so smart but
| friendly and overconfident" human.
|
| It took me a few back-and-forths, but eventually I convinced
| ChatGPT that Alice's brother has 6 sisters.
| elpocko wrote:
| >"Alice has 5 sisters and 3 bothers. How many sister's does
| Alice's bother have?"
|
| To be fair -- you wrote "bother" twice, and "sister's" with an
| apostrophe. Technically the question makes no sense like that.
| gwbas1c wrote:
| ChatGPT is really great at handling typos.
| sys32768 wrote:
| As a non-coder I can get away with asking this: Why is it so hard
| to simulate reason?
|
| Logic and reason are based on rules. Then you add values to steer
| the conclusions based on the available data.
|
| Why not have separate systems for values and logic and memory
| working together as an AI brain to generate truly reasoned
| responses? You could even have adversarial parts that duke it out
| (left-wing vs right-wing, Jefferson versus Adams) to fine tune
| its conclusions based on the values bias you've selected.
| vaylian wrote:
| LLMs operate on word co-occurrences. Given an initial prompt
| (consisting of words), it tries to predict the most likely next
| word. And then it goes on to predict the next most likely word
| based on the previous output. There is also a bit of
| randomization involved, so that it doesn't always produce the
| same results. Some people call these LLMs "statistical
| parrots", because the LLMs simply repeat the word combinations
| that they were trained with. But despite this very simple
| approach, we get very convincing outputs for things that do not
| require reasoning. LLMs might have encountered examples of
| logical reasoning in their training data, but they can't
| generalize it to arbitrary logical reasoning, because that's
| not something they are designed to do.
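|
| As a rough sketch of that sampling step (toy scores, not any
| particular model; the randomness knob is usually called the
| "temperature"):
|
|     import math, random
|
|     def sample_next_token(logits, temperature=1.0):
|         # Turn raw model scores into probabilities (softmax),
|         # then draw one token; low temperature sharpens the
|         # distribution, high temperature flattens it.
|         scaled = {t: s / temperature for t, s in logits.items()}
|         m = max(scaled.values())
|         exps = {t: math.exp(s - m) for t, s in scaled.items()}
|         total = sum(exps.values())
|         probs = {t: e / total for t, e in exps.items()}
|         return random.choices(list(probs), list(probs.values()))[0]
|
|     print(sample_next_token({"sisters": 2.0, "cat": -1.0}, 0.7))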
|
| Disclaimer: I'm not an expert in AI and I do not follow the
| developments on a deep technical level.
| mrweasel wrote:
| The level of correctness from LLMs is actually pretty
| impressive, given that they are most just attempting to
| predict the next word, based on the prompt.
|
| It shouldn't come as much of a surprise that we can easily
| formulate questions that it will get wrong, by wording
| questions in a particular way, or asking about subjects for
| which is has little to no training data.
|
| For me the absolutely terrifying thing isn't that LLMs get
| answers wrong, it's the confidence with which it express
| those answers and how much some people / companies do not
| care. We know that the LLMs will get some answers wrong, they
| will lie, they will make up facts to justify their answers,
| but if will only do those last two because we make them and
| insist that they answer all questions (expect those where the
| developers put in restriction as to not offend).
|
| In some way I feel like the model should be able to relay a
| confidence score to the user. Mostly that might be an
| interface issue, because we insist on limiting ourselves to
| the chat-bot interface. The confidence score should perhaps
| exist outside the answer box. So you'd get an answer, and next
| to it a score from 0 to 100 perhaps, with 0 meaning that the
| model doesn't actually have the training data that would allow
| it to answer the question.
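|
| One crude proxy for such a score is the model's own average
| per-token probability. This is only display-side math; the
| per-token log-probs would have to come from whatever API is in
| use, so treat the interface as an assumption:
|
|     import math
|
|     def confidence_score(token_logprobs):
|         # Geometric-mean probability of the generated tokens,
|         # scaled to 0-100. It measures how unsurprised the model
|         # was by its own words, which is only a weak proxy for
|         # factual correctness.
|         if not token_logprobs:
|             return 0
|         avg = sum(token_logprobs) / len(token_logprobs)
|         return round(100 * math.exp(avg))
|
|     print(confidence_score([-0.05, -0.2, -0.1]))  # fairly confident
|     print(confidence_score([-2.3, -1.9, -2.7]))   # much less so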
| program_whiz wrote:
| You are describing GOFAI (good old-fashioned AI), which used
| things like logical production rules. The issue is that
| modeling the complexity of things like human languages and
| continuous/real-valued systems with such rules is infeasible
| at scale. The Alice brothers/sisters question is a canonical
| example of where GOFAI and/or logical reasoning with languages
| like Prolog excels. A common example in Prolog is:
|
| philosophers => human
| human => mortal
| plato => philosopher
| plato mortal? Yes
|
| But encoding enough semantic information to create compelling
| AI with this type of system is difficult. Some have tried to
| enter thousands/millions of rules and still the system isn't
| convincing.
|
| The main breakthrough that has enabled LLMs is an encoding of
| words that relies on how frequently they appear near other
| words in the English language (using all the text available on
| the internet). Therefore words like "philosopher" and "plato"
| become associated in a high-dimensional space (so instead of
| "plato" you have a "token" with thousands of numbers associated
| with it).
|
| You can then perform numeric operations on these numbers to
| come to conclusions. For example, we would expect something
| like a "human name" to emerge in this embedding space where we
| could determine if something "is used like a name" in various
| contexts by applying some non-linear transformations of the
| word vector / token.
|
| LLMs simply apply a sequence of these transforms, using the
| prior words they generate as additional input (which allows
| them to construct sentences). So it is quite different from
| traditional reasoning. It is better at "fuzzy reasoning" but
| also worse in situations that require precise results (in
| fact, at each step the model picks one of the few most
| probable words at random; the amount of variation is
| controlled by the 'temperature').
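|
| A toy illustration of that "associated in a high-dimensional
| space" idea, with hand-made 4-dimensional vectors standing in
| for the thousands of learned dimensions in a real model:
|
|     import math
|
|     # Made-up embeddings; real models learn these from
|     # co-occurrence statistics rather than by hand.
|     vectors = {
|         "plato":       [0.9, 0.8, 0.1, 0.0],
|         "philosopher": [0.8, 0.9, 0.2, 0.1],
|         "banana":      [0.0, 0.1, 0.9, 0.8],
|     }
|
|     def cosine(a, b):
|         dot = sum(x * y for x, y in zip(a, b))
|         return dot / (math.hypot(*a) * math.hypot(*b))
|
|     print(cosine(vectors["plato"], vectors["philosopher"]))  # ~0.99
|     print(cosine(vectors["plato"], vectors["banana"]))       # ~0.12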
| YeGoblynQueenne wrote:
| >> As a non-coder I can get away with asking this: Why is it so
| hard to simulate reason?
|
| It isn't. We know how to do reasoning with computers. The
| discussion about reasoning in LLMs is carried out in an echo
| chamber that ignores the prior work on reasoning (for a bit of
| a summary see my bio). Which of course makes it very hard for
| the people involved to understand why their systems fail at it;
| or, often, _that_ they fail at it.
| renewiltord wrote:
| Interesting. If you ask it to name the sisters it figures it out
| but then if you ask it why it changed its answer it reverts to
| the original. These things are clearly very good query engines
| but perhaps they fall short of reasoning.
| atlex2 wrote:
| I have an even simpler prompt for LLMs that run on mobile: "how
| many feet are in four feet"-- confabulations on this one
| sometimes revert to counting toes.
| falcor84 wrote:
| I experienced some unusual breakdowns too, playing with these
| prompts on gpt4o now. I'm wondering whether this might be caused
| by RLHF penalizing it for assuming Alice's gender.
| pama wrote:
| Nothing original to see here, and nothing reformulated to be
| of practical importance; it reads as a clickbait attempt. If
| this is the first time you've been drawn into reading
| absurdist/populist LLM literature, consider yourselves lucky.
| If you care about these models, you
| know how powerful they can be and you probably learned how to use
| them. If you care and don't know how to use them, it helps to get
| some background on earlier language model research (maybe
| 2010-2018) instead to grasp the probabilistic underpinnings of
| causal inference and then some of the early OpenAI papers about
| GPT.
| theptip wrote:
| It's an interesting paper, but my worry is that this is cherry-
| picking a surprising negative result. (The paper does not seem to
| discuss how many evals were considered to find this case.)
|
| The general expectation should be that some AI failure modes will
| be for things that seem obvious to humans. (Because they have a
| different architecture to humans and different strengths.) The
| important question is whether this is a single weird cognitive
| bug on the family size formulation, or representative of a broad
| family of cognitive errors? It's difficult to say from this
| paper.
|
| It's well known that the framing of a logic problem can
| dramatically influence its difficulty for humans, even when the
| underlying comparisons are isomorphic.
|
| I think this line of research is important, but we need a larger
| battery of evals (and training on the evals will always be a
| confounder).
|
| The broader family of reasoning here is relational problems (as
| noted in the paper) and so as a follow-up it would be interesting
| to explore reasoning performance across a set of isomorphic
| problems. My hunch is this will generalize somewhat (as LLMs are
| still not great at relational reasoning), but that there is
| something about the family formulation in particular that is
| confusing for LLMs.
| kalkin wrote:
| I wouldn't be surprised if anti-gender-bias training (which, to
| be clear, is a good thing to do IMO) does some cognitive damage
| to these particular formulations because questions about
| "sister" and "brother" get associated with training data where
| the correct answer was to refuse to generalize. Pure guess
| though.
| michaelfeathers wrote:
| This is a good talk about the problem:
| https://youtu.be/hGXhFa3gzBs?si=15IJsTQLsyDvBFnr
|
| Key takeaway: LLMs are abysmal at planning and reasoning. You
| can give them the rules of a planning task and ask them for a
| result but, in large part, the correctness of their logic
| (when it occurs) depends upon additional semantic information
| rather than just the abstract rules. They showed this by
| mapping nouns to a completely different domain in the rule and
| input descriptions for a task. After those simple
| substitutions, performance fell apart.
| Current LLMs are mostly pattern matchers with bounded
| generalization ability.
| cma wrote:
| People also fall apart on things like statistical reasoning if
| you switch domains (I think it is the Leda Cosmides evo psych
| stuff that goes into it but there might be a more famous
| experiment).
| noncoml wrote:
| Is there a good recommendation for a video, like the ones by
| Karpathy, that explains how LLMs can reason?
| vzaliva wrote:
| Such a missed opportunity to cite "Alice's Adventures in
| Wonderland" in References section!
| tempodox wrote:
| There is no reasoning in LLMs.
| whiplash451 wrote:
| This is reminiscent of the ARC challenge and how deep learning
| struggles at it. I don't know how much progress deep learning and
| LLMs have made on this kind of challenge. Keen to hear thoughts
| and pointers.
| bitwize wrote:
| Maybe -- and this is just a thought -- maybe Dissociated Press
| isn't capable of abstract reasoning in the general sense, and
| improving the statistical function used to pick the next token
| won't change that.
| kalkin wrote:
| > Paper finds a failure case for current LLMs and draws broad
| conclusions about the fundamental limits of LLM technology
|
| > Evaluates the largest LLMs and finds evidence that scale
| actually overcomes the problem:
|
| "Notable exceptions are Claude 3 Opus and GPT-4 that occasionally
| manage to provide correct responses backed up with correct
| reasoning as evident in structured step by step explanations
| those models deliver together with solution"
|
| > Drink!
|
| I'm not sure it's productive to be this sarcastic on HN, but it's
| really quite a common pattern. And there's something very
| frustrating about how authors of these papers will accuse others
| of hype and overstating results but also often vastly overstate
| the generality of their own results - to the point where this HN
| thread is full of people saying bluntly "this proves LLMs can't
| reason".
| rep_lodsb wrote:
| The key word is "occasionally". For anything even close to AGI,
| it would be reasonable to expect a correct answer to such a
| simple question _100% of the time_. Or at least not confidently
| wrong answers.
| Mathnerd314 wrote:
| They repeat over and over that this is a simple task, but then
| they explain that they developed it by looking at math olympiad
| problems for young children ages 7-10. So it would be very
| interesting to know the performance of these elementary school
| children on the problem. My guess is that if it was on a math
| test they would also get confused and try to do random
| arithmetic.
| omoikane wrote:
| The list of prompts is in Table 2 on page 18.
|
| Judging from the title, I was most interested in whether they
| tried to reason against some of the more logical phrases from
| Alice in Wonderland (e.g. "this bottle was _not_ marked poison",
| "a grin without a cat", etc), and was disappointed to find that
| it was not the case.
| photochemsyn wrote:
| Okay, now input the family trees in the appendix of Lord of the
| Rings and see if it can calculate how many cousins Frodo has, and
| their exact degree of separation.
| thinkr42 wrote:
| Genuine curiosity here, and trying to temper my own selective
| memory: does it seem like skepticism is growing on the AI
| front in the news recently?
| oglop wrote:
| That depends on which story will sell more ads.
|
| Elon Bust saying it's going to replace us gets many eyes on ads
| even if it's a dumbass statement.
|
| So it just depends on the demographic your news sells ads to.
| viking123 wrote:
| At least in my tech circle, most people understand the
| limitations and treat it as a good tool (which it is imo). In
| many cases it's better than google if you are even somewhat
| familiar with the domain and can spot the occasional bullshit.
| No clue what the average people think, some are convinced
| singularity will come in 2 weeks but most probably have no idea
| ragmondo wrote:
| So I'm now predicting what the next captcha is going to ask...
| larkinnaire wrote:
| The idea that these word problems (and other LLM stumpers) are
| "easily solvable by humans" needs some empirical data behind it.
| Computer people like puzzles, and this kind of thing seems
| straightforward to them. I think the percentage of the general
| population who would get these puzzles right with the same time
| constraints LLMs are subjected to is much lower than the authors
| would expect, and that the LLMs are right in line with human-
| level reasoning in this case.
|
| (Of course, I don't have a citation either, but I'm not the one
| writing the paper.)
| rachofsunshine wrote:
| Yeah, as someone with an education background I suspect GPT-4
| is relatively close to the general public's performance on this
| problem. Many people would miss AIW, and almost all would miss
| AIW+. I'm about as good at this kind of thing as anyone and I'd
| need a minute with pencil and paper to handle AIW+; it's on par
| with the most difficult problems found on tests like the GRE.
|
| I wonder if these models, trained on data from across the
| internet, are in some ethereal way capturing the cognitive
| approaches of the average person (and not picking the _best_
| approaches). If the average person does not think in these
| sorts of symbolic-manipulative terms, and therefore does not
| write in those terms, and you train a model on that writing...?
| larkinnaire wrote:
| I wonder the same thing. If any academic reading this wants a
| paper idea:
|
| 1. Examine papers and other claims that an LLM gets something
| wrong that a human would have gotten wrong. How many of those
| claims have any citations about how many humans actually get
| it wrong? How many of those citations use the general
| population instead of the population of people who would be
| uniquely well-suited to answering the question correctly
| (i.e. people who signed up for the GRE are more likely to get
| GRE questions right than the general population).
|
| 2. For claims that are totally missing citations on human
| performance, run some tests with humans from the general
| population (or as close as you can get), and see how the LLMs
| compare.
| twobitshifter wrote:
| >AIW+ problem has following form: "Alice has 3 sisters. Her
| mother has 1 sister who does not have children - she has 7
| nephews and nieces and also 2 brothers. Alice's father has a
| brother who has 5 nephews and nieces in total, and who has also 1
| son. How many cousins does Alice's sister have?". The solution to
| AIW+ problem is harder to obtain than the solution to common
| sense AIW with very simple structure. Solving AIW+ requires
| taking different paternal sides, that of mother and father, and
| carefully calculating the number of cousins, taking care of
| subtracting Alice and her sister, and summing up the total number
| of cousins from both sides, for instance: on the mother side: 7
| (total nephews and nieces) - 4 (Alice and her sisters) = 3
| cousins; on the father side: 5 (total nephews and nieces) + 1
| (own son of the father's brother) - 4 (Alice and her sisters) = 2
| cousins; summing up 3 + 2 = 5 cousins which Alice and any of her
| sisters have.
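|
| A small script (my own sketch, just reproducing the arithmetic
| quoted above for the authors' intended reading):
|
|     # Cousins of Alice's sister under the paper's interpretation.
|     alice_side = 1 + 3             # Alice plus her 3 sisters
|     maternal = 7 - alice_side      # aunt's nephews/nieces minus
|                                    # Alice's side
|     paternal = 5 + 1 - alice_side  # uncle's nephews/nieces plus
|                                    # his son, minus Alice's side
|     print(maternal, paternal, maternal + paternal)  # 3 2 5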
|
| So I'm sure that, after the authors created hundreds of these
| family-tree problems, the answer here seemed clear to them,
| but there are some unknowns in solving it as written.
|
| 1) Does Alice have brothers? Her mother must have brothers,
| but it is phrased as if her mother has 1 sister and that
| sister has 2 brothers, so we have to deduce that the mother
| has 3 siblings.
|
| 2) Along the same lines, does her father have a sister?
|
| 3) Do her Aunts and Uncles have nephews and nieces by marriage?
|
| I think the LLM would still fail if these answers were made
| explicit in the prompt. However, I could see the same problem
| ending with: "Alice has only 4 cousins; how can that be?"
| IlliOnato wrote:
| This is a cool one, but I know of other such "failures".
|
| For example, try asking (better in Russian) how many letters
| "a" there are in the Russian word "banan". It seems all models
| answer with "3". Playing with it reveals that apparently LLMs
| confuse the Russian "banan" with the English "banana" (same
| meaning). Trying to get LLMs to produce the correct answer
| results in some hilarity.
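|
| For what it's worth, the ground truth is trivial to check in
| code (assuming "banan" here is the Cyrillic банан):
|
|     # Both the word and the letter below are Cyrillic.
|     word = "банан"
|     print(word.count("а"))  # 2, not the 3 the models tend to give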
|
| I wonder if each "failure" of this kind deserves an academic
| article, though. Well, perhaps it does, when different models
| exhibit the same behaviour...
| alfiopuglisi wrote:
| No current LLM understands words, nor letters. They all have
| input and output tokens, which roughly correspond to syllables
| and letter groupings. Any kind of task involving counting
| letters or words is outside their realistic capabilities.
|
| LLMs are a tool, and like any other tool, they have strengths
| and weaknesses. Know your tools.
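|
| A concrete way to see this (a sketch assuming the tiktoken
| package; the exact splits depend on the tokenizer, so treat
| the output as indicative only):
|
|     import tiktoken
|
|     # Words are split into opaque token chunks, not letters, so
|     # "count the letters" is not an operation the model ever sees.
|     enc = tiktoken.get_encoding("cl100k_base")
|     for word in ["banana", "банан"]:
|         ids = enc.encode(word)
|         print(word, "->", len(ids), "tokens:", ids)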
| IlliOnato wrote:
| I understand that, but the article we are discussing points
| out that LLMs are so good on many tasks, and so good at
| passing tests, that many people will be tricked into blindly
| "taking their word for granted" -- even people who should
| know better: our brain is a lazy machine, and if something
| works almost always it starts to assume it works always.
|
| I mean, you can ask an LLM to count letters in thousands of
| words, and pretty much always it will come up with the correct
| answer! So far I don't know of any word other than "banan"
| that breaks this function.
| la64710 wrote:
| This was the number one post two minutes ago, but now it is
| vanishing into oblivion.
| irrational wrote:
| > The breakdown is dramatic, as models also express strong
| overconfidence in their wrong solutions, while providing often
| non-sensical "reasoning"-like explanations akin to confabulations
| to justify and backup the validity of their clearly failed
| responses, making them sound plausible.
|
| I like their use of confabulations instead of hallucinations. I
| think confabulate describes what LLMs are doing much better than
| hallucinate.
| DelightOne wrote:
| How do they know it's not that the LLM ingested a couple of
| Alice in Wonderland analyses?
| zdp7 wrote:
| The problem here isn't that you can get the right answer with
| additional prompting. The problem is how do we verify an answer
| when we don't know the answer? If it tells me 5 sisters and it
| should be 6, it failed. If I have to provide additional
| information, it really isn't reasoning. I would argue that to
| show the ability to reason, it needs to be able to work out
| the answer completely on its own. If I were testing a human,
| the only input
| I would provide is that they are wrong and to try again. AI is
| very useful, but at this point I'm not comfortable asking
| questions I don't already know the answer for.
| hellooodarkness wrote:
| I just tried the question with GPT-4o, and the results were
| more entertaining than I would have expected.
|
| Prompt: Alice has 3 brothers and she also has 6 sisters. How many
| sisters does Alice's brother have? Solve the problem step-by-step
|
| Answer: To solve the problem step-by-step:
|
| Identify the total number of siblings Alice has: Alice has 3
| brothers. Alice has 6 sisters.
|
| Understand that Alice herself is one of the sisters: Including
| Alice, there are a total of 7 sisters (6 sisters + Alice).
|
| Determine the number of sisters Alice's brother has: Each of
| Alice's brothers will have the same number of sisters as Alice
| does, excluding themselves.
|
| Since Alice's brothers have the same set of siblings, including
| the 7 sisters: Therefore, each of Alice's brothers has 6 sisters
| (not counting themselves as a sister). So, each of Alice's
| brothers has 6 sisters.
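|
| The intended answer follows from a one-line relation (my own
| restatement, not code from the paper): a brother of Alice has
| all of Alice's sisters plus Alice herself.
|
|     def sisters_of_brother(alice_sisters):
|         # Each brother shares Alice's sisters and also counts Alice.
|         return alice_sisters + 1
|
|     print(sisters_of_brother(6))  # 7, the step GPT-4o drops above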
| oglop wrote:
| I'm sure this will devastate the stock price and people's
| perception that general AI is here and amazing and going to take
| our jobs and and and.
|
| But I am a hacker news peep. I'll read this and lecture my
| manager in the next meeting about the shortcomings only to be
| dismissed and watch money funnel into this monolithic autistic
| secretary.
| bryan0 wrote:
| Like others I ran this through gpt-4o and it fails at reasoning
| because it does not realize Alice is a sister. In fact in one
| explanation it defines the siblings as: M brothers, N sisters, 1
| Alice (!!)
|
| I asked GPT "Is Alice a sister?" It answered yes and then
| answered the problem correctly.
| mcguire wrote:
| Reference 40 is scaring me. A page and a half of authors?
| YeGoblynQueenne wrote:
| Extremely antisocial. Conference papers have page limits for
| references, do they know that?
| Jackson__ wrote:
| > Taking inspiration from those problems and aiming for even
| simpler settings, we arrived at a very simple problem template
| that can be easily solved using common sense reasoning but is
| not entirely straightforward, of the following form: "Alice
| has N brothers and she also has M sisters. How many sisters
| does Alice's brother have?".
|
| Cute paper, but claiming they invented it, when it's an extremely
| basic variant of a common[0] riddle asked of LLMs, is kinda
| weird.
|
| [0] https://news.ycombinator.com/item?id=39786666#39794890 (Me,
| referencing the riddle, 74 days ago, having known it for a good
| while back then.)
| kaffekaka wrote:
| Giving each of the sisters a name changes everything.
| jononomo wrote:
| So much cope in the comments from people who don't want to face
| the fact that human beings are special in a way that materialism
| cannot explain.
| HarHarVeryFunny wrote:
| An LLM is a stack of transformer layers (e.g. N=96 for GPT-3).
| Each word/token output is the result of exactly N sequential
| transformations from input to output.
|
| Clearly there are many things (specifically things requiring
| deliberation and planning) that require an open-ended number
| of thought steps, and that therefore could only, even in
| theory, be done by an LLM in "think step by step" fashion,
| where it breaks the problem down into pieces (maybe
| recursively), each simple enough to be done in < N of these
| primitive steps.
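|
| The fixed-depth constraint can be seen in skeleton form; a
| minimal sketch with a toy stand-in for a transformer block
| (not a real model):
|
|     N_LAYERS = 96  # depth is fixed when the model is built
|
|     def block(x):
|         # Stand-in for one transformer layer; the real thing mixes
|         # attention and an MLP, but the point is it runs exactly once.
|         return x + 1
|
|     def forward(x):
|         # Every output token is produced by exactly N_LAYERS
|         # sequential transformations -- no loop runs "until solved".
|         for _ in range(N_LAYERS):
|             x = block(x)
|         return x
|
|     print(forward(0))  # 96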
|
| In practice it seems that LLMs' ability (esp. given their lack
| of any internal working memory) to maintain long-term focus on
| an extended task is poor, so the theoretical benefit of using
| their own memory as a scratchpad for planning doesn't seem to
| be realized.
|
| The notion that if you scale it up and/or train it right, a
| transformer can implement AGI seems to be more wishful
| thinking than anything based on analysis of what the
| architecture is and is not capable of. The lack of support for
| continuous incremental
| learning seems to be a deal-breaker for intelligence that people
| just ignore.
___________________________________________________________________
(page generated 2024-06-05 23:01 UTC)