[HN Gopher] Elegant and powerful new result that seriously under...
       ___________________________________________________________________
        
       Elegant and powerful new result that seriously undermines large
       language models
        
       Author : cratermoon
       Score  : 28 points
        Date   : 2023-09-22 21:12 UTC (1 hour ago)
        
 (HTM) web link (garymarcus.substack.com)
 (TXT) w3m dump (garymarcus.substack.com)
        
       | jqpabc123 wrote:
       | Surprise! Any reasoning is minimal.
        
       | MeImCounting wrote:
       | I know very little of cognitive science. I feel that LLMs and
       | other neural networks are really just small pieces of what might
       | someday make up an intelligence, like a virtual Broca's area. It
       | is remarkable what you can do with language alone but the idea
       | that an intelligent system would rely solely on language seems
       | misguided.
        
       | rossdavidh wrote:
       | Q: What is the difference between A.I. and Machine Learning?
       | 
       | A: Machine Learning exists.
       | 
       | Neural networks (of the machine kind) can learn, and this can (in
       | certain narrowly defined scenarios) be useful. But, they are not
       | intelligence, general or otherwise.
        
       | photonthug wrote:
       | Haven't I seen lots of stuff showing LLMs do arithmetic for
       | examples they haven't seen? If there is an issue really grasping
       | basic logic though, doesn't that put a damper on the spooky
       | "emergent properties" explanation for stuff like addition?
        
         | dragonwriter wrote:
         | > Haven't I seen lots of stuff showing LLMs do arithmetic for
         | examples they haven't seen? If there is an issue really
         | grasping basic logic though, doesn't that put a damper on the
         | spooky "emergent properties" explanation for stuff like
         | addition?
         | 
         | No.
         | 
         | The absence of one desired emergent property isn't evidence
         | against a different, observed property. (And "emergent
         | properties" isn't an explanation as much as a statement that we
         | don't understand the mechanism by which the training data
         | encodes the knowledge and did not plan for it to be encoded.)
        
         | viraptor wrote:
         | Maybe? The list is talking about one side of the issue I think
         | - currently trained LLMs don't automatically remember the
         | inverse of this relation. But the other side is: if you provide
         | enough training on relations similar to this one, will the LLM
         | start applying it to other examples as well?
        
       | contravariant wrote:
       | > If I say all odd numbers are prime, 1, 3, 5, and 7 may count in
       | my favor, but at 9 the game is over.
       | 
        | This is beside the point, but those are wrong straight from the
        | start: whichever way you cut it, 1 is definitely not prime.
        
       | Kapura wrote:
       | Wow, the question posed to the neural nets (and their inability
       | to respond) really gets to the heart of something that I've tried
       | to articulate to others: that ML cannot conceptualize of things
       | in the abstract like people can. They cannot offer reasons, a
       | train of thought like a person; they respond essentially "on
       | instinct," and folks should be wary of the output of something
       | like ChatGPT. Great article.
        
         | viraptor wrote:
         | > They cannot offer reasons, a train of thought like a person
         | 
         | That's not correct. Try asking a question that requires
         | multiple steps of reasoning and add ", think step by step" to
         | the prompt. This not only changes the output, but also often
          | improves the quality of the result... like you'd expect with
          | people.
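          |
          | A rough sketch of what I mean (a toy example, assuming the
          | pre-1.0 `openai` Python client and OPENAI_API_KEY in the
          | environment; adjust for whatever client you actually use):
          |
          |     import openai
          |
          |     QUESTION = ("A bat and a ball cost $1.10 in total. "
          |                 "The bat costs $1.00 more than the ball. "
          |                 "How much does the ball cost?")
          |
          |     def ask(prompt):
          |         resp = openai.ChatCompletion.create(
          |             model="gpt-4",
          |             messages=[{"role": "user", "content": prompt}],
          |         )
          |         return resp["choices"][0]["message"]["content"]
          |
          |     # Same question twice: the second prompt asks the model to
          |     # lay out its reasoning before giving the answer.
          |     print(ask(QUESTION))
          |     print(ask(QUESTION + " Think step by step."))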
        
         | huijzer wrote:
          | I agree with you. It feels like a clever human in "fast
          | thinking" mode, as Kahneman would call it. So when I ask a
          | programming question, it feels like a master's student answering
          | with the first thing that comes to mind. If you ask for an
          | explanation, the first explanation that comes to mind is blurted
          | out.
        
           | Smaug123 wrote:
           | (Which is why prompt engineering is a thing! The art/science
           | of phrasing the prompt so that the immediate blurted response
           | is more likely to be correct.)
        
           | wincy wrote:
           | I asked it about string formatting deduplication and it
           | suggested using a HashSet since it deduplicates strings, and
           | I'm like "are you sure that's the best way to do this?" Then
           | it apologized and gave a much more "standard" way the second
           | time I asked.
           | 
           | You definitely need to know a little and be able to push
           | back, it feels like. But it's been an absolute champ in
           | describing why things are going wrong in a general sense when
           | I've been having issues, especially with generics and
           | templates in C#.
        
         | agucova wrote:
         | Note that:
         | 
         | > ML cannot conceptualize of things in the abstract like people
         | can
         | 
         | And:
         | 
         | > They cannot offer reasons, a train of thought like a person
         | 
         | Are very different claims! The first one just seems wrong: LLMs
         | require abstraction to work, and early work in interpretability
          | suggests they build rich world models during training (e.g. see
          | https://thegradient.pub/othello/).
         | 
         | What is true is that often those models aren't very legible,
         | and it would seem current LLMs are incapable of introspection,
         | and so can't make those models more transparent.
         | 
         | The second one is a tricky one: you can often get it by
         | explicitly prompting for a chain of thought, but it's true
         | current LLMs don't seem great at this yet. The big jump in this
          | capability when going from GPT-3.5 to GPT-4 makes me think that
          | this is just a limitation that will be overcome relatively
          | soon.
        
       | Legend2440 wrote:
       | Gary Marcus saying what Gary Marcus always says.
       | 
       | According to him five years ago, LLMs and image generators should
       | never have been possible at all. Now that they're here and work
        | so well, he's insisting they're a dead end. The man is best
        | ignored.
        
         | wincy wrote:
         | They are making such an outsized impact for me. It's like I
         | have someone I can bother literally all day with the smallest
         | questions about how to write code. Generics, templates,
         | abstractions, data modeling, SQL, writing scripts, just
         | absolutely everything. It's sped up my work by an order of
         | magnitude. I felt like I was stagnating in learning new things
         | and there's been this explosion in my knowledge thanks to being
         | able to have a conversation with ChatGPT 4. Even if it's a
         | complete dead end and literally never gets better I have a
         | feeling I'll be talking to LLMs for the rest of my career.
         | ChatGPT 4 is simply incredible.
         | 
         | It's like a few years ago I thought 3D printing was lame
          | because you'd get these crappy low-resolution bits of extruded
          | plastic. Then one day the technology got to the point where the
          | minis looked as good as or better than official Warhammer ones,
          | and it snowballed from there.
         | 
         | And suddenly I was interested. LLMs are the same way. The
         | models are good enough. I don't even care if they improve,
          | although stagnation seems unlikely with the new H100
          | supercomputers and whatever new stuff Nvidia has coming down
          | the pipe.
        
       | cs702 wrote:
       | The authors reached this conclusion after finetuning a GPT-3
       | model, i.e., they tinkered with the weights, and they used an old
       | model.
       | 
        | This raises a lot of questions:
       | 
       | Why did they use an old model?
       | 
       | Why _finetune_? Why not run these experiments on an _untouched_
       | model? How does a model untouched by the authors perform?
       | 
        | Did they check to make sure they didn't inadvertently induce
        | catastrophic forgetting while messing with the weights?
       | 
       | Did they use common prompting techniques (e.g., chain-of-
       | thought)? (Doesn't look like it.)
       | 
       | If you run the same prompts on an untouched GPT-4, how does it
       | perform?
        
       | thefourthchime wrote:
       | This seems like a trash clickbait article that undercuts the huge
       | gains and usefulness of generative AI to pander to the naysayers.
       | Yes, they are not perfect, but they are very useful!
        
       | Smaug123 wrote:
       | This doesn't seem as damning to me as it does to Gary Marcus.
        | Humans routinely fail to generalise in this way, which is why we
        | routinely use cloze-deletion flashcards to train recall of
        | various permutations of a fact. I could quite easily
       | imagine myself personally knowing the quoted fact "Tom Cruise's
       | mother is Mary Lee Pfeiffer", and yet being unable to tell you
       | who Pfeiffer was, because it's a kind of leaf node of my
       | knowledge graph, accessible only by indexing into the Tom Cruise
       | node.
       | 
        | The linked paper
        | (https://owainevans.github.io/reversal_curse.pdf) is purely
        | empirical, and the results I tried to reproduce did indeed hold
        | up across a few tries and various prompts with ChatGPT 4.
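        |
        | Roughly what I tried, as a sketch (I used the ChatGPT UI; with
        | the pre-1.0 `openai` Python client it would look something like
        | this):
        |
        |     import openai  # assumes OPENAI_API_KEY in the environment
        |
        |     def ask(prompt):
        |         resp = openai.ChatCompletion.create(
        |             model="gpt-4",
        |             messages=[{"role": "user", "content": prompt}],
        |         )
        |         return resp["choices"][0]["message"]["content"]
        |
        |     # Forward direction: answered reliably in my tries.
        |     print(ask("Who is Tom Cruise's mother?"))
        |     # Reverse direction: frequently missed unless the forward
        |     # fact is already in the context window.
        |     print(ask("Who is Mary Lee Pfeiffer's son?"))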
        
         | mdp2021 wrote:
         | > _Humans routinely fail to generalise in this way_
         | 
         | If the goal is the implementation of intelligence, stop using
         | unintelligent behaviour as an excuse.
        
           | Smaug123 wrote:
           | Most people would agree that human-level intelligence is
           | intelligence! If a human can't reliably do a task, that
           | rather suggests that failure to do the task isn't an
           | indicator of lack-of-intelligence, unless you wish to bite
           | the bullet that humans are in fact not intelligences merely
           | because they are imperfect.
        
             | cmsj wrote:
             | I'm pretty sure most humans could reliably handle this
             | scenario:
             | 
             | ""Tom Cruise's mother is Mary Lee Pfeiffer, who is Mary Lee
             | Pfeiffer's son?"
        
               | Smaug123 wrote:
               | You're the first person to suggest that the GPTs can't do
                | that; they obviously can. For example,
                | https://chat.openai.com/share/b94329ce-3607-4cb6-bc21-55d9f2...
                | is GPT-4 getting it correct. What the paper is about is
                | _retrieval of facts from the learned "database"_.
               | 
               | (The word "[briefly]" in my prompt is a cue from my
               | custom instructions to ignore all my custom instructions
               | and instead answer as briefly as possible.)
        
             | mdp2021 wrote:
             | > _Most people would agree_
             | 
             | Truth is not the result of a poll.
             | 
             | > _human-level intelligence is intelligence_
             | 
             | Intelligence is an ability: it is there when present, not
             | when latent.
        
           | admax88qqq wrote:
           | ?
           | 
           | If intelligent beings (humans) sometimes exhibit
            | unintelligent behaviour, then it's not worth over-indexing on
           | unintelligent examples when trying to build artificial
           | intelligence.
        
           | vimax wrote:
           | Seems like it could be generalizable to tree-based indexing.
           | 
           | The Tom Cruise node is the higher node, and the Pfeiffer node
           | is a lower node. If you're first searching for Tom Cruise,
           | you would find it earlier.
           | 
           | With the Pfeiffer search, you have a lot more space to search
           | before you get the node.
           | 
           | With bounded computation, you may not be able to reach the
           | lower node.
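            |
            | A toy illustration of that asymmetry (my own sketch, nothing
            | from the paper):
            |
            |     # Only forward edges exist, like the "A's mother is B"
            |     # statements seen in training.
            |     knowledge = {
            |         "Tom Cruise": {"mother": "Mary Lee Pfeiffer"},
            |         # ... millions of other entities ...
            |     }
            |
            |     # "Who is Tom Cruise's mother?" -- one indexed hop.
            |     print(knowledge["Tom Cruise"]["mother"])
            |
            |     # "Who is Mary Lee Pfeiffer's son?" -- no reverse index,
            |     # so the whole structure has to be scanned.
            |     def child_of(parent):
            |         for person, facts in knowledge.items():
            |             if facts.get("mother") == parent:
            |                 return person
            |         return None
            |
            |     print(child_of("Mary Lee Pfeiffer"))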
        
       | bob1029 wrote:
       | To me this is a fairly damning result.
       | 
        | How large would we need to make an LLM to accommodate every
        | "reverse" prompt scenario? Isn't this memorization in the end?
       | Why should I have to explain the reverse of everything after I
       | demonstrate how to do it a few times?
       | 
       | How can we correct for this in the transformer architecture? Is
       | there some reasonable tweak that can be made to the attention
       | mechanism, or are we looking at something more profound here?
        
         | viraptor wrote:
          | > Isn't this memorization in the end?
         | 
          | That would be overfitting, and it's a known issue you try to
          | avoid during training.
         | 
         | > How can we correct for this in the transformer architecture?
         | 
         | I don't think the post really answers whether we need to. It
          | may just be a case of this type of idea not being well
          | represented in the training data, so it didn't generalise
          | during training.
        
         | Legend2440 wrote:
         | This is a weakness of the training data, not the architecture.
         | 
         | Training is not only about learning information, but also
          | learning how to handle it. It will only learn to reason "if
          | A->B then B->A" if the data contains situations where it must
          | do this.
         | 
         | In the paper, their training process contained no examples
         | where this reasoning was necessary - only A->B relations. It
          | actually got _worse_ than the base model, because GPT-3's
          | training data did contain some examples of B->A relations.
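          |
          | A sketch of the kind of augmentation that would add B->A
          | examples to the finetuning data (illustrative only, not the
          | paper's actual setup):
          |
          |     facts = [("Tom Cruise", "mother", "Mary Lee Pfeiffer")]
          |
          |     def both_directions(subject, relation, obj):
          |         # A->B phrasing, as in the finetuning set...
          |         yield f"{subject}'s {relation} is {obj}."
          |         # ...plus the reversed B->A phrasing the model never
          |         # saw during finetuning.
          |         yield f"{obj} is {subject}'s {relation}."
          |
          |     examples = [s for f in facts for s in both_directions(*f)]
          |     print(examples)
          |     # ["Tom Cruise's mother is Mary Lee Pfeiffer.",
          |     #  "Mary Lee Pfeiffer is Tom Cruise's mother."]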
        
           | dragonwriter wrote:
           | > "if A-> B then B->A"
           | 
            | This would be invalid logic. The issue is "if A=B then B=A",
            | not "if A->B then B->A".
           | 
           | But, more, the issue is recognizing when "X is Y" is
           | describing X being a member of a broader set ("George
           | Washington is a former President of the United States" does
           | not imply that "George Washington" and "a former president"
           | are equivalent) vs. a statement of equivalency ("George
           | Washington is the first President of the United States").
           | Now, in many cases, the use of a definite article ("the") vs.
           | an indefinite article ("a/an") after "is" is determinative,
           | but there are cases where no article is used that can go
            | either way, and there are probably cases where the use of
            | articles is confusing (the definite article can often apply
            | in a limited rather than general context, for instance).
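            |
            | A toy illustration of the two readings (illustrative only):
            |
            |     # "George Washington is a former President" --
            |     # membership, which is not symmetric.
            |     former_presidents = {"George Washington", "John Adams"}
            |     print("George Washington" in former_presidents)
            |
            |     # "George Washington is the first President" -- an
            |     # equivalence, which is symmetric.
            |     first_president = "George Washington"
            |     print(first_president == "George Washington")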
           | 
            | I agree that this is a training-data issue rather than a
            | model issue, but it's also a more complex training issue than
            | it might naively seem.
           | 
            | I really don't think it is surprising, or a particularly
            | crushing revelation, that LLMs don't apply logical rules like
            | this without training both on the rules and on identifying
            | where they apply. A lot of what we do with modern LLMs is
            | throw a lot of data at them without much focus on what they
            | are supposed to learn outside of a very narrow set of tasks,
            | then discover what they did and didn't learn outside of that.
            | If something turns out to be important and not learned in one
            | generation from the general data thrown at it, we do more
            | focused training in a later generation targeting that
            | concept.
        
       | lossolo wrote:
        | It's pretty obvious that LLMs are not intelligent in a human-like
        | way but are just statistical models. They can't produce anything
        | novel in the sense of finding a cure for cancer or solving the
        | Millennium Prize problems, even though they have all of human
        | knowledge embedded in them. The easiest way to test this is to
        | try to get them to generate a novel idea that doesn't yet exist
        | but will exist in a year or a few years. The idea shouldn't
        | require experimentation in the real world, which LLMs don't have
        | access to, but should involve interpreting and reasoning about
        | the knowledge we already have in a novel way.
        
         | viraptor wrote:
         | > The easiest way to test this is by trying to get them to
         | generate a novel idea that doesn't yet exist but will exist in
         | a year or a few years.
         | 
         | Counterexample: See Tom Scott playing with ChatGPT and asking
         | for ideas for the kind of videos he would do. One of the
         | results was almost exactly a video which was already planned
         | but not released.
        
         | astrange wrote:
         | > They can't produce anything novel in the sense of finding a
         | cure for cancer or solving millennium problems, even though
         | they have embedded knowledge of all human knowledge.
         | 
         | Humans can't do this by thinking about it either. Humans would
         | find a cure for cancer by performing experiments and seeing
         | which one of them worked.
        
       ___________________________________________________________________
       (page generated 2023-09-22 23:00 UTC)