[HN Gopher] What can LLMs never do?
___________________________________________________________________
What can LLMs never do?
Author : henrik_w
Score : 172 points
Date : 2024-04-27 11:48 UTC (11 hours ago)
(HTM) web link (www.strangeloopcanon.com)
(TXT) w3m dump (www.strangeloopcanon.com)
| 3abiton wrote:
| The article should be titled " What can LLM never do, yet". By
| definition, Large Language Models will keep growing larger and
| larger and be trained on faster and more advanced hardware, so
| certain limitations, like "completing complex chains of
| logical reasoning", would just be a matter of time. Only time
| will tell.
| g5g5g5g5 wrote:
| > The article should be titled " What can LLM never do, yet".
|
| I don't think it should. It's more interesting to know what
| LLMs will _never_ be able to do (if anything).
| shawntan wrote:
| Yes, but the article doesn't really answer this question.
| ekianjo wrote:
| Math reasoning is still an unsolved problem even if the rest
| of the capabilities are getting better. This means the
| transformer architecture may not be the best way to approach
| all problems.
| gmuslera wrote:
| Maybe the wording is correct. It looks like a hard limit on
| what an LLM, as such, can do. If it goes beyond that, then it
| is something more than, or at least different from, an LLM.
| amelius wrote:
| The big question is whether LLMs are capable enough to
| converge to AGI. It might very well be that as we pour in more
| resources, they converge to something only slightly more
| useful than, but similar to, what we have today.
| Quothling wrote:
| In the Danish public sector we provide services based on needs
| assessments of citizens, and we subsequently pay the bills for
| those services, which amounts to thousands of small invoices
| having to be paid by a municipality each month. An example of
| this could be payments for a dentist visit, transportation and
| similar. Most of these are relatively small in size, and we've
| long since automated the payment of anything below a certain
| amount. Those systems are faster and less error prone as far
| as putting valid data everywhere goes. They are more prone to
| decision-making errors, however, and while fraud isn't an
| issue, sometimes citizens have invoices approved that they
| aren't entitled to. Since it's less costly to just roll with
| those mistakes than to try and fix them, it's an accepted
| loss.
|
| The systems are hugely successful and popular, and this
| naturally leads to a massive interest in LLMs as the next
| step. They are incredible tools, but they are based on
| probability, and while they're useful for almost everything,
| decision making probably shouldn't be one of those uses.
| Similarly, ML is incredibly helpful in things like cancer
| detection, but we've already had issues where it got things
| wrong, and because MBAs don't really know how these systems
| work, they were used as a replacement for the human factor
| instead of an enhancement. I'm fairly certain we're going to
| use LLMs for a lot of things where we shouldn't, and probably
| never should. I'm not sure we can avoid it, but I wouldn't
| personally trust them with any sort of function which has a
| big influence on people's lives. I use both Copilot and
| OpenAI's tools extensively, but I can still prompt them with
| the same thing and get outputs of extremely different quality,
| and while this will improve, and while it's often easy to get
| an output that's actually useful, it's still a major issue
| that might never get solved well enough for what we're going
| to ask of the models long before they are ready.
|
| I hope we're going to be clever enough to only use them as
| enhancement tools in the vital public sector, but I'm sure
| we're going to use them in areas like education. Which is
| going to be interesting... We already see this with new
| software developers in my area of the world, where they build
| things with the use of LLMs, things that work but aren't built
| "right" and will eventually cause issues. For the most part
| this doesn't matter, but you really don't want the person
| designing your medical software to use an LLM.
| HarHarVeryFunny wrote:
| We really don't need time to tell.
|
| Just making a transformer bigger and bigger, and feeding it
| more and more data, will not change it from being a language
| model into something else, any more than scaling up an expert
| system such as Cyc will transform it into something other than
| an expert system. "Scale it up and it'll become sentient" is
| one of the recurring myths of AI... a bit odd that people are
| falling for it again.
|
| As an aside, it seems reasonable to consider an LLM as a type
| of expert system - one that has a broad area of expertise (like
| Cyc), including (unlike Cyc) how to infer rules from language
| and generate language from rules.
|
| If you want to create a brain-like AGI, then you need an entire
| cognitive architecture, not just one piece of it which is what
| we have currently with LLMs. Compared to a brain, an LLM is
| maybe just like the cortex (without all the other brain parts
| like cerebellum, hippocampus, hypothalamus and
| interconnectivity such as the cortico-thalamic loop). It's as
| if we've cut the cortex out of a dead person's brain, put it in
| a mason jar to keep it alive, and hooked its inputs and
| outputs up to a computer. Feed words in, get words out. Cool,
| but it's not a whole brain, it's a cortex in a mason jar.
| hackinthebochs wrote:
| >will not change it from being a language model into
| something else,
|
| This is a pretty empty claim when we don't know what the
| limits of language modelling are. Of course it will never not
| be a language model. But the question is what are the limits
| of capability of this class of computing device?
| HarHarVeryFunny wrote:
| Some limits are pretty obvious, even if easy to fix.
|
| For example, a pure LLM is just a single pass through a
| stack of transformer layers, so there is no variable
| depth/duration (incl. iteration/looping) of thought and no
| corresponding or longer duration working memory other than
| the embeddings as they pass thru. This is going to severely
| limit their ability to plan and reason since you only get a
| fixed N layers of reasoning regardless of what they are
| asked.
|
| Lack of working memory (really needs to be context
| duration, or longer, not depth duration) has many
| predictable effects.
|
| No doubt we will see pure-transformer architectures
| extended to add more capabilities, so I guess the real
| question is how far these extensions (+scaling) will get
| us. I think one thing we can be sure of though is that it
| won't get us to AGI (defining AGI = human-level problem
| solving capability) unless we add ALL of the missing pieces
| that the brain has, not just a couple of the easy ones.
| Mordisquitos wrote:
| I think that the article is correct. There are indeed things
| that L _L_ Ms will never be able to do, at least not
| consistently, however much the hardware improves or however
| much more material they are trained on.
|
| How come? Note my emphasis on the 2nd 'L'. I'm not saying that
| there are things that _AI_ models will never be able to do;
| I'm saying that there are things that Large _Language_ Models
| will be unable to do.
|
| Training LLMs is often argued to be analogous to human
| learning, most often as a defence against claims of copyright
| infringement by arguing that human creativity is also based on
| training from copyrighted materials. However, that is a red
| herring.
|
| The responses from ever more powerful LLMs are indeed
| impressive, and beyond what an overwhelming majority of us
| believed possible just 5 years ago. They are nearing and
| sometimes surpassing the performance of educated humans in
| certain areas, so how come I can argue they are limited?
| Consider it from the other side: how come an educated human can
| create something as good as an LLM can when said human's brain
| has been "trained" on an infinitesimal fraction of the material
| which was used to train even the 1st release of ChatGPT?
|
| That is because LLMs do not learn nor reason like humans: they
| do not have opinions, do not have intentions, do not have
| doubts, do not have curiosity, do not have values, do not have
| a model of mind -- they have tokens and probabilities.
|
| For an AI model to be able to do certain things that humans can
| do it needs to have many of those human characteristics that
| allow us to do impressive mental feats having absorbed barely
| any training material (compared to LLMs) and being virtually
| unable to even remember most of it, let alone verbatim. Such an
| AI model is surely possible, but it needs a completely
| different paradigm from straightforward LLMs. That's not to
| say that a Language Model won't be a necessary module of such
| an AI; it almost certainly will be, but it will not be
| sufficient.
| mewpmewp2 wrote:
| I don't think values, opinions or things like that are needed
| at all. These are just aspects we have in order to perform in,
| and together with, society.
|
| Also, doubt is just uncertainty, and can be represented as a
| probability. Actually, all values and everything else can be
| represented as numerical probabilities, which I personally
| prefer to do as well.
| spacebanana7 wrote:
| Values and opinions drive human attention, which as
| transformers demonstrate, is relevant to reasoning.
| puttycat wrote:
| Simple addition, among other things:
|
| https://github.com/0xnurl/gpts-cant-count
| anshumankmr wrote:
| Yeah, GPT 4 Turbo in my experience used to get simple counting
| problems wrong, like adding up a long sequence of 4 digit
| numbers.
| dcchambers wrote:
| Guarantee an output will be consistent every time.
| wddkcs wrote:
| ChatGPT just released a memory feature.
| ekianjo wrote:
| mostly solved with grammar for local LLMs
| dragonwriter wrote:
| > Guarantee an output will be consistent every time.
|
| If you mean "consistent with a prior run with the same input",
| LLMs can absolutely do that, though for most purposes settings
| are chosen deliberately which do not.
|
| If you mean "consistent with some external structural
| specification", LLMs can do that, too, e.g., via grammar
| specifications.
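|
| A minimal sketch of the first case, assuming the current openai
| Python client and a model that honours the (beta) seed
| parameter; greedy decoding plus a fixed seed makes repeat runs
| match in practice, though the API doesn't hard-guarantee it:
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     def ask(prompt: str) -> str:
|         # temperature=0 -> (near-)greedy decoding; the fixed seed
|         # asks the backend to reuse the same sampling path.
|         resp = client.chat.completions.create(
|             model="gpt-4-turbo",
|             messages=[{"role": "user", "content": prompt}],
|             temperature=0,
|             seed=42,
|         )
|         return resp.choices[0].message.content
|
|     a = ask("Name three prime numbers.")
|     b = ask("Name three prime numbers.")
|     print(a == b)  # usually True under these settings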
| tromp wrote:
| Train their successor.
|
| Once they absorb theorem provers, they will be able to do lots of
| math provably correctly. That does mean they should be unable to
| state "I have proved theorem A in Theory T with proof merkle root
| R" unless they actually did just that.
| cs702 wrote:
| Fantastic essay. Highly recommended!
|
| I agree with all key points:
|
| * There are problems that are easy for human beings but hard for
| _current_ LLMs (and maybe impossible for them; no one knows).
| Examples include playing Wordle and predicting cellular automata
| (including Turing-complete ones like Rule 110). We don't fully
| understand why _current_ LLMs are bad at these tasks.
|
| * Providing an LLM with examples and step-by-step instructions in
| a prompt means _the user_ is figuring out the "reasoning steps"
| and _handing them to the LLM_ , instead of the LLM figuring them
| out by itself. We have "reasoning machines" that are intelligent
| but seem to be hitting fundamental limits we don't understand.
|
| * It's unclear if better prompting and bigger models using
| existing attention mechanisms can achieve AGI. As a model of
| computation, attention is very rigid, whereas human brains are
| always undergoing synaptic plasticity. There may be a more
| flexible architecture capable of AGI, but we don't know it yet.
|
| * For now, using current AI models _requires_ carefully
| constructing long prompts with right and wrong answers for
| computational problems, priming the model to reply appropriately,
| and applying lots of external guardrails (e.g., LLMs acting as
| agents that review and vote on the answers of other LLMs).
|
| * Attention seems to suffer from "goal drift," making reliability
| hard without all that external scaffolding.
|
| Go read the whole thing.
| cs702 wrote:
| As an aside, at one point I experimented a little with
| transformers that had access to external memory searchable via
| KNN lookups https://github.com/lucidrains/memorizing-
| transformers-pytorc... ( _great work_ by lucidrains) or via
| routed queries with
| https://github.com/glassroom/heinsen_routing (don't fully
| understand it; apparently related to attention). Both
| approaches seemed to work, but I had to put that work on hold
| for reasons outside my control.
|
| Also as an aside, I'll add that transformers can be seen as a
| kind of "RNN" that grows its hidden state with each new token
| in the input context. I wonder if we will end up needing some
| new kind of "RNN" that can _grow or shrink_ its hidden state
| and _also_ access some kind of permanent memory as needed at
| each step.
|
| We sure live in interesting times!
| esafak wrote:
| > transformers that had access to external memory searchable
| via KNN lookups
|
| This is common, and commonly called _retrieval augmented
| generation_ , or RAG.
|
| edit: I did not pay attention to the link. It is about Wu et
| al's "Memorizing Transformers", which contain an _internal_
| memory.
| cs702 wrote:
| No. RAG is about finding relevant documents/paragraphs (via
| KNN lookups of their embeddings) _and then inserting those
| documents /paragraphs into the input context_, as an input.
| What I'm talking about is different:
| https://arxiv.org/abs/2203.08913
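|
| For contrast, a bare-bones RAG sketch; embed() below is a toy
| stand-in for a real embedding model, and everything happens
| outside the network, unlike the internal KNN memory of the
| paper linked above:
|
|     import numpy as np
|
|     def embed(text: str) -> np.ndarray:
|         # Toy stand-in for a real embedding model: bag-of-character
|         # counts, L2-normalised. Swap in any sentence embedder.
|         v = np.zeros(256)
|         for ch in text.lower():
|             v[ord(ch) % 256] += 1.0
|         return v / (np.linalg.norm(v) + 1e-9)
|
|     corpus = [
|         "Memorizing Transformers add a KNN memory inside the model.",
|         "RAG retrieves documents and pastes them into the prompt.",
|     ]
|     corpus_vecs = np.stack([embed(d) for d in corpus])
|
|     def rag_prompt(question: str, k: int = 1) -> str:
|         sims = corpus_vecs @ embed(question)   # cosine similarity
|         top = np.argsort(-sims)[:k]            # KNN over documents
|         context = "\n".join(corpus[i] for i in top)
|         # Retrieved text goes into the *input context*; the model's
|         # weights and internal state are untouched.
|         return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
|
|     print(rag_prompt("How does RAG differ from internal memory?"))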
| _wire_ wrote:
| >We don't fully understand why current LLMs are bad at these
| tasks.
|
| In complete seriousness, can anyone explain why LLMs are
| good at some tasks?
| j16sdiz wrote:
| > why LLMs are good at some tasks?
|
| Like how we explain humans doing tasks -- they evolved to do
| that.
|
| I believe this is a non-answer, but if we are satisfied with
| that non-answer for humans, why not for LLMs?
| layer8 wrote:
| I would argue that we are not satisfied with that answer
| for humans either.
| zer00eyz wrote:
| Yes:
|
| An LLM isn't a model of human thinking.
|
| An LLM is an attempt to build a simulation of human
| communication. An LLM is to language what a forecast is to
| weather. No amount of weather data is actually going to turn
| that simulation into snow, no amount of LLM data is going to
| create AGI.
|
| That having been said, better models (smaller, more flexible
| ones) are going to result in a LOT of practical uses that
| have the potential to make our day to day lives easier (think
| digital personal assistant that has current knowledge).
| choeger wrote:
| Great comment. Just one thought: Language, unlike weather,
| is meta-circular. All we know about specific words or
| sentences is again encoded in words and sentences. So the
| embedding encodes a subset of human knowledge.
|
| Hence, an LLM is predicting not only language but language
| with some sort of meaning.
| zer00eyz wrote:
| That re-embedding is also encoded in weather. It is why
| perfect forecasting is impossible, why we talk about the
| butterfly effect.
|
| The "hallucination problem" is simply the tyranny of
| Lorenz... one is not sure if a starting state will have a
| good outcome or swing wildly. Some good weather models
| are based on re-running with tweaks to starting params,
| and then things that end up out of bounds can get tossed.
| It's harder to know when a result is out of bounds for an
| LLM, and we don't have the ability to run every request
| 100 times through various models to get an "average"
| output yet... However some of the reuse of layers does
| emulate this to an extent....
| HarHarVeryFunny wrote:
| I'd guess because the Transformer architecture is (I assume)
| fairly close to the way that our brain learns and produces
| language - similar hierarchical approach and perhaps similar
| type of inter-embedding attention-based copying?
|
| Similar to how CNNs are so successful at image recognition,
| because they also roughly follow the way we do it too.
|
| Other seq-2-seq language approaches work too, but not as well
| as Transformers, which I'd guess is due to Transformers
| better matching our own inductive biases, maybe due to the
| specific form of attention.
| SomeCallMeTim wrote:
| LLMs are good at tasks that don't require actual
| understanding of the topic.
|
| They can come up with excellent (or excellent-looking-but-
| wrong) answers to any question that their training corpus
| covers. In a gross oversimplification, the "reasoning" they
| do is really just parroting a weighted average (with
| randomness injected) of the matching training data.
|
| What they're doing doesn't really match any definition of
| "understanding." An LLM (and any current AI) doesn't
| "understand" anything; it's effectively no more than a really
| big, really complicated spreadsheet. And no matter how
| complicated a spreadsheet gets, it's never going to
| understand anything.
|
| Not until we find the secret to actual learning. And
| increasingly it looks like actual learning probably relies on
| some of the quantum phenomena that are known to be present in
| the brain.
|
| We may not even have the science yet to understand how the
| brain learns. But I have become convinced that we're not
| going to find a way for digital-logic-based computers to
| bridge that gap.
| danenania wrote:
| > the "reasoning" they do is really just parroting a
| weighted average (with randomness injected) of the matching
| training data
|
| Perhaps our brains are doing exactly the same, just with
| more sophistication?
| richardw wrote:
| LLMs are a compressed and lossy form of our combined writing
| output, which it turns out is similarly structured enough to
| make new combinations of text seem reasonable, even enough to
| display simple reasoning. I find it useful to think "what can
| I expect from speaking with the dataset of combined writing
| of people", rather than treating a basic LLM as a mind.
|
| That doesn't mean we won't end up approximating one
| eventually, but it's going to take a lot of real human
| thinking first. For example, ChatGPT writes code to solve
| some questions rather than reasoning about it from text. The
| LLM is not doing the heavy lifting in that case.
|
| Give it (some) 3D questions or anything where there isn't
| massive textual datasets and you often need to break out to
| specialised code.
|
| Another thought I find useful is that it considers its job
| done when it's produced enough reasonable tokens, not when
| it's actually solved a problem. You and I would continue to
| ponder the edge cases. It's just happy if there are 1000
| tokens that look approximately like its dataset. Agents make
| that a bit smarter but they're still limited by the goal of
| being happy when each has produced the required token quota,
| missing eg implications that we'd see instantly. Obviously
| we're smart enough to keep filling those gaps.
| tobiasSoftware wrote:
| "I find it useful to think "what can I expect from speaking
| with the dataset of combined writing of people", rather
| than treating a basic LLM as a mind."
|
| I've been doing this as well, mentally I think of LLMs as
| the librarians of the internet.
| dragonwriter wrote:
| > There are problems that are easy for human beings but hard
| for current LLMs (and maybe impossible for them; no one knows).
| Examples include playing Wordle and predicting cellular
| automata (including Turing-complete ones like Rule 110). We
| don't fully understand why current LLMs are bad at these tasks.
|
| I thought we did know for things like playing Wordle, that
| it's because they deal with words as sequences of tokens that
| correspond to whole words, not sequences of letters, so a game
| that involves dealing with sequences of letters constrained to
| those that are valid words doesn't match the way they process
| information?
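|
| Easy to check with a BPE tokenizer; a quick sketch, assuming
| the tiktoken package and the cl100k_base vocabulary used by
| GPT-4-era models:
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for word in ["crane", "slate", "abbey"]:
|         ids = enc.encode(word)
|         pieces = [enc.decode([i]) for i in ids]
|         # Many five-letter words come out as a single opaque
|         # token, so the model never "sees" the individual
|         # letters that Wordle is about.
|         print(word, ids, pieces)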
|
| > Providing an LLM with examples and step-by-step instructions
| in a prompt means the user is figuring out the "reasoning
| steps" and handing them to the LLM, instead of the LLM figuring
| them out by itself. We have "reasoning machines" that are
| intelligent but seem to be hitting fundamental limits we don't
| understand.
|
| But providing examples with _different_ , contextually-
| appropriate sets of reasoning steps can enable the
| model to choose its own, more-or-less appropriate, set of
| reasoning steps for particular questions not matching the
| examples.
|
| > It's unclear if better prompting and bigger models using
| existing attention mechanisms can achieve AGI.
|
| Since there is no objective definition of AGI or test for it,
| there's no basis for any meaningful speculation on what can or
| cannot achieve it; discussions about it are quasi-religious,
| not scientific.
| jncfhnb wrote:
| I think one should feel comfortable arguing that AGI must at
| least be stateful and experience continuous time, such that a
| plain old LLM is definitively not ever going to be AGI; but an
| LLM called in a do-while-true loop might be.
| layer8 wrote:
| Regarding Wordle, it should be straightforward to make a
| token-based version of it, and I would assume that that has
| been tried. It seems the obvious thing to do when one is
| interested in the reasoning abilities necessary for Wordle.
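|
| One way to do that is to spell everything out so each letter
| lands in its own token; a rough sketch (the encoding is my own
| invention, and repeated letters aren't handled):
|
|     def wordle_feedback(guess: str, answer: str) -> str:
|         # G = right letter, right spot; Y = in the word, wrong
|         # spot; . = not in the word (no duplicate handling).
|         out = []
|         for g, a in zip(guess, answer):
|             if g == a:
|                 out.append("G")
|             elif g in answer:
|                 out.append("Y")
|             else:
|                 out.append(".")
|         return "".join(out)
|
|     def letterize(word: str) -> str:
|         # "crane" -> "c r a n e": separated letters tokenize
|         # individually instead of as one opaque word token.
|         return " ".join(word)
|
|     print(letterize("crane"), "->", wordle_feedback("crane", "slate"))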
| ccppurcell wrote:
| I would argue that the G in AGI means it can't require better
| prompting.
| CamperBob2 wrote:
| We should probably draw a distinction between a human-
| equivalent G, which certainly can require better prompting
| (why else did you go to school?!) and god-equivalent G, which
| never requires better prompting.
|
| Just using the term 'General' doesn't seem to communicate
| anything useful about the nature of intelligence.
| dragonwriter wrote:
| That would be like saying that because humans' output can be
| better or worse based on better or worse past experience
| (~prompting, in that it is the source of the equivalent of
| "in-context learning"), humans lack general intelligence.
| ianbicking wrote:
| "Providing an LLM with examples and step-by-step instructions
| in a prompt means the user is figuring out the "reasoning
| steps" and handing them to the LLM, instead of the LLM figuring
| them out by itself. We have "reasoning machines" that are
| intelligent but seem to be hitting fundamental limits we don't
| understand."
|
| One thing an LLM _also_ doesn't bring to the table is an
| opinion. We can push it in that direction by giving it a role
| ("you are an expert developer" etc), but it's a bit weak.
|
| If you give an LLM an easy task with minimal instructions it
| will do the task in the most conventional, common sense
| fashion. And why shouldn't it? It has no opinion, your prompt
| doesn't give it an opinion, so it just does the most normal-
| seeming thing. If you want it to solve the task in any other
| way then you have to tell it to do so.
|
| I think a hard task is similar. If you don't tell the LLM _how_
| to solve the hard task then it will try to approach it in the
| most conventional, common sense way. Instead of just boring
| results for a hard task the result is often failure. But hard
| problems approached with conventional common sense will often
| result in failures! Giving the LLM a thought process to follow
| is a quick education on how to solve the problem.
|
| Maybe we just need to train the LLM on more problem solving?
| And maybe LLMs worked better when they were initially trained
| on code for exactly that reason, it's a much larger corpus of
| task-solving examples than is available elsewhere. That is,
| maybe we don't talk often enough and clearly enough about how
| to solve natural language problems in order for the models to
| really learn those techniques.
|
| Also, as the author talks about in the article with respect to
| agents, the inability to rewind responses may keep the LLM from
| addressing problems in the ways humans do, but that can also be
| addressed with agents or multi-prompt approaches. These
| approaches don't seem that impressive in practice right now,
| but maybe we just need to figure it out (and maybe with better
| training the models themselves will be better at handling these
| recursive calls).
| Xenoamorphous wrote:
| There's many things they can't do. Even a simple rule like
| "ensure that numbers from one to ten are written as words and
| numbers greater ten as digits in the given text" fails for me for
| so many examples even if it works for many others; few shot,
| chain of thought, many versions of the prompt, it doesn't matter.
| Sometimes LLMs will even change the number to something else,
| even with temp set to 0. And then there's the non-determinism
| (again with temp=0), you run the same prompt several times and
| that one time it'll respond with something different.
|
| As amazing as they are, they still have many limitations.
|
| I've been working with ChatGPT and Gemini to apply simple rules
| like the one above and I got so frustrated.
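|
| For what it's worth, the rule itself is mechanical, so one
| workaround is to let the LLM draft the text and then enforce
| (or at least check) the rule in plain code afterwards; a rough
| sketch that only handles bare integers, not ordinals or dates:
|
|     import re
|
|     WORDS = ["zero", "one", "two", "three", "four", "five",
|              "six", "seven", "eight", "nine", "ten"]
|
|     def enforce_number_style(text: str) -> str:
|         # Digits 1..10 become words; larger numbers stay as
|         # digits. Turning spelled-out large numbers back into
|         # digits would need a word-to-number pass, omitted here.
|         def fix(m: re.Match) -> str:
|             n = int(m.group())
|             return WORDS[n] if 1 <= n <= 10 else m.group()
|         return re.sub(r"\b\d+\b", fix, text)
|
|     print(enforce_number_style("We ran 3 tests on 12 servers."))
|     # -> "We ran three tests on 12 servers."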
| vsnf wrote:
| > ensure that numbers from one to ten as written as words and
| numbers greater ten as digits in the given text
|
| I can't fault llms for not knowing what to do here because I, a
| human, have no idea what on earth this means.
| Xenoamorphous wrote:
| "Ten" is a word, "10" are digits.
|
| I'm not a native English speaker, how would you write it?
|
| FWIW the LLMs get it right many times, but fail other times.
| MichaelZuo wrote:
| If your not a native English speaker, why are you even
| expecting the LLM to understand even 80% of the time?
|
| Just ask it in your own native language.
| Xenoamorphous wrote:
| First of all, the texts the rule has to be applied to are
| written in English. Second, I believe English is by far
| (by _far_ ) the most prevalent language in the training
| dataset for those models, so I'd expect it to work better
| at this kind of task.
|
| And third, I'm not the only one working on this problem,
| there are others that are native speakers, and as my
| initial message stated, there have been many variations
| of the prompt. None work for all cases.
|
| And lastly, how would you rewrite my sample prompt? Which
| BTW had a typo (unrelated to my English skills) that I've
| now fixed.
| MichaelZuo wrote:
| To be frank the response itself indicates that you don't
| really get what was being asked, or maybe how to parse
| English conversation conventions?
|
| I.e. It doesn't seem to answer the actual question.
|
| They seem to be half responding to the second sentence
| which was a personal opinion, so I wasn't soliciting any
| answers about it. And half going on a tangent that seems
| to lead away from forming a direct answer.
|
| Run these comment through a translation tool if your
| still not 100% sure after reading this.
| Xenoamorphous wrote:
| Alright man. So was it a quip when you said "if _your_
| not a native English speaker"? Ok then. Very funny, I get
| it now.
| MichaelZuo wrote:
| I really recommend to use a translator, instead of
| relying purely on your English comprehension skills.
| Xenoamorphous wrote:
| Your surname surely seems to indicate that some of your
| ancestors weren't native English speakers. I hope they
| didn't get lectured or made fun of by people like you on
| their poor English skills when they first landed on
| whichever country you were born.
| Last5Digits wrote:
| Your English is absolutely fine and your answers in this
| thread clearly addressed the points brought up by other
| commenters. I have no idea what that guy is on about.
| bxparks wrote:
| I couldn't understand the original wording either, but
| after reading one of the sibling comments that explains it,
| it suddenly made sense.
|
| I think you left out a few words that most English writers
| would include. So instead of:
|
| > "ensure that numbers from one to ten as written as words
| and numbers greater ten as digits in the given text",
|
| something like the following might be better for most
| people:
|
| > "ensure that _the_ numbers from one to ten _are_ written
| as words, and _the_ numbers greater ten _are written_ using
| _numerical_ digits in the given text "
|
| There are multiple ways to write this, so other people may
| have better versions.
|
| I'm not an English grammar expert, so I cannot explain to
| you why the addition of those extra words helps with the
| clarity of that sentence.
| blowski wrote:
| It's a common style-guide rule in newspapers.
| jiggawatts wrote:
| Hilariously, you can ask GPT 4 to explain the "why" of
| arbitrary grammar fixes.
| Snacklive wrote:
| Given the text "1,2,3,4,5,6,7,8,9,10,11,12" it should result
| in "one, two, three, four, five, six, seven, eight, nine,
| ten, 11, 12"
|
| or at least that's my understanding of the prompt
| callmeal wrote:
| > > ensure that numbers from one to ten as written as words
| and numbers greater ten as digits in the given text
|
| There are two blue, one red, and 15 green m&ms in this bag.
| FroshKiller wrote:
| It's a simple prescriptive rule in English. If you are
| writing about a small number, like less than ten, spell it
| out. For example: "According to a survey, nine out of ten
| people agree."
|
| But if you are writing about a large number, particularly one
| with a lot of different digits, prefer writing the digits: "A
| mile is 5,280 feet." Compare that to: "A mile is five
| thousand, two hundred, and eighty feet."
| logtempo wrote:
| I think he means that numbers less than or equal to ten are
| written as words, and others are written as digits.
|
| Given the many responses, it would be fun to see if LLMs beat
| humans on understanding the sentence ahah
| lukeschlather wrote:
| to me the main problem is that it should read "numbers
| greater _than_ ten. " I asked Gemini to rephrase it and
| Gemini produced correct English with the intended meaning:
|
| > Change all numbers between one and ten to words, and
| write numbers eleven and above as digits in the text.
|
| It even used eleven rather than ten which sounds like
| counting.
| zamadatix wrote:
| I think you may be thrown off because the first "as" is meant
| to be "are".
| Xenoamorphous wrote:
| Thanks, that was def a typo that I've fixed now.
| Der_Einzige wrote:
| All of these issues are entirely due to the tokenization
| scheme. Literally all of them.
|
| You could get this behavior implemented perfectly with
| constrained text gen techniques like grammars or any of the
| various libraries implementing constrained text gen (e.g.
| guidance).
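|
| The common trick underneath those libraries: at every decoding
| step, mask out every token the grammar or constraint forbids
| before picking the next one. A toy sketch of just the masking
| step, with random numbers standing in for a real model's
| logits:
|
|     import numpy as np
|
|     def constrained_step(logits: np.ndarray, allowed: set) -> int:
|         # Mask every token the constraint forbids, then pick
|         # greedily from what is left. Real libraries (llama.cpp
|         # grammars, guidance, outlines, ...) recompute `allowed`
|         # from the grammar state after each emitted token.
|         masked = np.full_like(logits, -np.inf)
|         idx = list(allowed)
|         masked[idx] = logits[idx]
|         return int(np.argmax(masked))
|
|     logits = np.random.randn(50_000)         # pretend vocab logits
|     digit_ids = set(range(10))               # e.g. only digit tokens
|     print(constrained_step(logits, digit_ids))   # always 0..9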
| gwervc wrote:
| > But then I started asking myself how can we figure out the
| limits of its ability to reason
|
| Third paragraph. The entire article is based on the premise LLMs
| are supposed to reason, which is wrong. They don't, they're tools
| to generate text.
| stavros wrote:
| I really hate this reductive, facile, "um akshually" take. If
| the text that the text-generating tool generates contains
| reasoning, then the text generation tool can be said to be
| reasoning, can't it.
|
| That's like saying "humans aren't supposed to reason, they're
| supposed to make sounds with their mouths".
| Findecanor wrote:
| Even if the generated text contains reasoning, could the LLM
| understand and apply it?
| stavros wrote:
| If I tell GPT-4 to print something, it understands it needs
| to check if my printer is turned on first and turn it on if
| it's not, so, yes?
|
| Also, if the generated text contains reasoning, what's your
| definition of "understanding"? Is it "must be made of the
| same stuff brains are"?
| RandomLensman wrote:
| LLMs fail at so many reasoning tasks (not unlike humans
| to be fair) that they are either incapable or really poor
| at reasoning. As far as reasoning machines go, I suspect
| LLMs will be a dead end.
|
| Reasoning here meaning, for example, given a certain
| situation or issue described being able to answer
| questions about implications, applications, and outcome
| of such a situation. In my experience things quickly
| degenerate into technobabble for non-trivial issues (also
| not unlike humans).
| stavros wrote:
| If you're contending that LLMs are incapable of
| reasoning, you're saying that there's no reasoning task
| that an LLM can do. Is that what you're saying? Because I
| can easily find an example to prove you wrong.
| RandomLensman wrote:
| It could be that all reasoning displayed is showing
| existing information - so there would be no reasoning,
| but that aside, what I meant is being able to reason in
| any consistent way. Like a machine that only sometimes
| gets an addition right isn't really capable of addition.
| stavros wrote:
| The former is easy to test, just make up your own puzzles
| and see if it can solve them.
|
| "Incapable of reasoning" doesn't mean "only solves some
| logic puzzles". Hell, GPT-4 is better at reasoning than a
| large number of people. Would you say that a good
| percentage of humans are poor at reasoning too?
| RandomLensman wrote:
| Not just logic puzzles but also applying information,
| and, yes, I tried a few things.
|
| People/humans tend to be pretty poor, too (training can
| help, though), as it isn't easy to really think through
| and solve things - we don't have a general recipe to
| follow there and neither do LLMs it seems (otherwise it
| shouldn't fail).
|
| What I am getting at is that as far as a reasoning
| machine is concerned, I'd want it to be like a pocket
| calculator is for arithmetic, i.e., it doesn't fail other
| than in some rare exceptions - and not inheriting human
| weaknesses there.
| rambambram wrote:
| > If the text that the text-generating tool generates
| contains reasoning, then the text generation tool can be said
| to be reasoning, can't it.
|
| I don't know... you're still describing a talking parrot
| here, if you'd ask me.
| stavros wrote:
| What's the difference between a human and a talking parrot
| that can answer any question you ask it?
| cj wrote:
| The talking parrot can only answer by repeating something
| it heard before.
|
| Another question you could ask is "What's the difference
| between a conversation between 2 people and a
| conversation between 2 parrots who can answer any
| question?"
| stavros wrote:
| It doesn't only repeat things it's heard before. Have you
| ever talked with an LLM?
| throwaway11460 wrote:
| I had ChatGPT design a programming language for a
| specific uncommon purpose and then use it. It's creative
| enough.
| rambambram wrote:
| Can any question be answered? As long as any reaction to
| a question is considered an answer, then I see no
| difference between a human and a parrot.
| dTal wrote:
| I feel the use of the word "parrot" is unintentionally apt,
| given that parrots were long thought to be mere mimics but
| were ultimately shown to have (at least the capacity for)
| real linguistic understanding.
| wbogusz wrote:
| I'm not a fan of the talking parrot argument, especially
| when you're pointing it at models of scale.
|
| The only thing separating a talking parrot and humans is
| our accuracy in shaping our words to the context in which
| they're spoken.
|
| Sure it's easy to liken a low resource model to a talking
| parrot, the output seems no better than selective
| repetition of training data. But is that really so
| different from a baby whose first words are mimics from the
| environment around them?
|
| I would argue that as we learn language we implicitly
| develop the neural circuitry to continue to improve our
| lexical outputs, this circuitry being concepts like
| foresight, reasoning, emotion, logic, etc and that while we
| can take explicit action to teach these ideas, they
| naturally develop in isolation as well.
|
| I don't think language models, especially at scale, are
| much different. They would seem to similarly acquire
| implicit circuitry like we do as they are exposed to more
| data. As I see it, the main difference in what exactly that
| circuitry accomplishes and looks like in final output has
| more to do with the limited styles of data we can provide
| and the limitations of fine tuning we can apply on top.
|
| Humans would seem to share a lot in common with talking
| parrots, we just have a lot more capable hardware to select
| what we repeat.
| rambambram wrote:
| What if we were talking with each other and the right
| answer for me would be to kiss you on the cheek? Then
| what?
| rdedev wrote:
| At some point if you need to generate better text you need to
| start creating a model of how the world works along with some
| amount of reasoning. The "it's just a token generator"
| argument fails to get this part. That being said, I don't
| think just scaling LLMs is going to get us AGI, but I don't
| have any real arguments to support that.
| usgroup wrote:
| It's an autoregressive model, so it can't do anything that
| requires planning tokens.
|
| It can't do anything which implies a large or infinite token
| space (eg video understanding).
|
| It's also limited to a reasonable response length since token
| selection is probabilistic at each recursion. The longer you make
| it the more likely it is to veer off course.
| hahajk wrote:
| Oh, I thought this was going to be a semi-rigorous discussion on
| computability. It's actually just another "GPT-4 still fails at
| these prompts" essay.
| jerpint wrote:
| I had found that GPT4 couldn't play wordle about a year ago [1].
| At the time, I thought it must be because it wasn't in the
| training data but now it seems to point to something larger.
|
| I might just get nerd sniped trying to teach it GoL now...
|
| [1] https://www.jerpint.io/blog/gpt-wordle/
| BenFranklin100 wrote:
| Reason.
|
| LLMs mimic human language which is separate from reasoning. Tech
| bros are remarkably ignorant of the field of linguistics and
| don't appreciate this distinction. They thus mistake the output
| of LLMs for reason.
| reqo wrote:
| > This 'goal drift' means that agents, or tasks done in a
| sequence with iteration, get less reliable. It 'forgets' where to
| focus, because its attention is not selective nor dynamic.
|
| I don't know if I agree with this. The attention module is
| specifically designed to be selective and dynamic, otherwise it
| would not be much different than a word embedding (look up "soft"
| weights vs "hard" weights [1]). I think deep learning should not
| be confused with deep RL. LLMs are autoregressive models which
| means that they are trained to predict the next token and that is
| all they do. The next token is not necessarily the most
| reasonable (this is why datasets are super important for better
| performance). Deep RL models, on the other hand, seem to be
| excellent at agency and decision making (although in
| restricted environments), because they are trained to do so.
|
| [1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
| visarga wrote:
| LLMs are fine-tuned with RL too. They are NOT simply next token
| predictors. RLHF uses whole answers at once to generate
| gradients, so it is looking further into the future. This might
| not be perfect but it is clearly more than focusing just 1
| token ahead.
|
| In the future the RL part of LLM training will increase a lot.
| Why am I saying this? There are two sources for learning - the
| past and the present. Training on human text is using past
| data, that is off-policy. But training on interactive data is
| on-policy. There is nothing we know that doesn't come from the
| environment. What is not written in any books must be learned
| from outside.
|
| That is why I think supervised pre-training from human text is
| just half the story and RL based agent learning, interactivity
| in other words, is the next step. The two feet on which
| intelligence stands are language (past experience) and
| environment (present experience). We can't get ahead without
| both of them.
|
| AlphaZero showed what an agent can learn from an environment
| alone, and LLMs show what they can learn from humans. But the
| world is big, there are plenty of environments that can provide
| learning signal, in other words feedback to LLMs.
| roenxi wrote:
| It is interesting that all the examples I noticed in this article
| have a geometric aspect (even wordle - I model it as a grid with
| geometric rules when playing it). I think that the "first grader"
| comment is actually somewhat illuminating - it takes several
| years of learning how to navigate in a spatial world before this
| stuff becomes trivially easy.
|
| The underlying point of this article might be that LLMs don't
| understand the non-textual aspects of a grid. Which is a fair
| point, they model language, not space. I wouldn't expect text
| corpuses to explain space either, since possibly literally
| everyone who can read and write already knows a lot about spatial
| layouts.
| GaggiX wrote:
| >Another assumption is that it's because of tokenisation issues.
| But that can't be true either.
|
| It's definitely a tokenizer issue; if GPT-4 were trained on
| individual characters, I'm pretty sure it would be able to
| play Wordle much better. GPT-4 as it is trained today has
| quite lossy knowledge about the characters inside a specific
| token; a probable fix would be to embed that knowledge inside
| the embeddings.
| enraged_camel wrote:
| Yesterday I asked ChatGPT 4 to write a paragraph with exactly
| five unique palindromes, and for some reason it really, really
| struggled. First it wrote a paragraph with four palindromes, then
| it rewrote it but some palindromes were repeated with a total of
| seven, etc.
| smusamashah wrote:
| Do we have an open/shared list of problems that LLMs can't solve?
|
| People have mentioned some other problems apart from those in the
| article. Someone should compile these and put them up if they
| haven't been already
| oldandtired wrote:
| Far too many people (including AI researchers themselves) fail to
| see that all LLMs are actually simple machines. Extremely simple
| machines that are only mechanically following a relatively simple
| programming path.
|
| Now before anyone gets too caught up with objecting to this
| notion, I would seriously suggest that you spend time with
| observing children from new-born to 2 years.
|
| I have been observing my latest granddaughter since her birth
| about 16 months ago and thinking about every public LLM
| system currently available.
|
| There is an insight here to be obtained and that insight is in
| the nature of real intelligence.
|
| On the whole, no-one actually knows what intelligence is or what
| sentience is or what it means to be cognitively conscious. There
| is still much research going on and nothing actually definitive
| has come forth yet. We really are at the beginning in terms of
| studying these areas.
|
| We can certainly produce some incredible systems, but none of
| them are intelligent per se. Solutions to certain kinds of
| problems can be achieved using these systems and there are
| researchers who are seriously looking at incorporating these
| systems into CAS and theorem provers. These systems, though,
| only provide an augmentation service for a person, as does
| every mechanical system we use.
|
| But there is an essential component necessary for the use of
| all LLMs which many seem not to be cognisant of, and that is
| that these systems, to be useful, require humans to be
| involved.
|
| The questions we have to ask ourselves are: what can we use
| these systems for, and do these uses provide benefits in some
| way, or can these systems be abused by various parties to
| obtain control over others?
|
| There are benefits and there are abuses. Can we do better or will
| we do worse by using them?
| donatj wrote:
| I have been trying to generate some text recently using the
| ChatGPT API. No matter how I word "Include any interesting facts
| or anecdotes without commenting on the fact being interesting" it
| ALWAYS starts out "One interesting fact about" or similar
| phrasing.
|
| I have honestly spent multiple hours trying to word the prompt so
| it will stop including introductory phrases and just include the
| fact straight. I have gone so far as to forbid the words
| "interesting", "notable" and a few others in the prompt, and it
| just ignores me and uses them anyway.
|
| I've tried all the different available models and 4 will once in
| a great while get it right. 3, 3 turbo, and even 4 turbo
| basically never work as I want.
|
| Overall, it seems pretty bad at being told not to do certain
| things. For instance there are a number of topics I don't want
| mentioned in its response, but telling it not to seems to only
| increase the likelihood that it will mention it, like you put the
| idea in its head.
| fallingsquirrel wrote:
| I just tried this locally with llama3-8b and it handled it
| fine. Claude 3 Sonnet passes your test too, in case you don't
| have the hardware for a local model. You might want to consider
| moving on from ChatGPT since their models have been RLHF'd to
| hell in the name of "safety" and are falling behind in many
| ways as a result.
|
| ---
|
| transcript:
|
| $ ollama run llama3:8b
|
| >>> tell me an interesting fact about etymology
|
| Here's one:
|
| Did you know that the word "quarantine" has a fascinating
| etymological history?
|
| The word "quarantine" comes from^C
|
| >>> tell me an interesting fact about etymology. just the fact,
| don't mention it's interesting.
|
| The word "robot" originated from the Czechoslovakian word
| "robota," which means "forced labor" or "drudgery." This term
| was used in Karel Capek's 1920 play "R.U.R." (Rossum's
| Universal Robots), where it referred to artificial workers.
| foobiekr wrote:
| The RUR thing is basically because that specific example is
| used as an example of interesting etymology.
| _wire_ wrote:
| Why is any of this "interesting"?
|
| Is the response interesting because you don't know it? --How
| does it know this?
|
| Is the response interesting because you do know it or might
| have offered it yourself? --Confirmation bias.
|
| Is it interesting because a lot of training references
| contextualize it as "interesting?" --Begged question.
|
| Is it contextually interesting? --What is the context? A
| robot refers to robots? How unexpected...
|
| Is it interesting within the narrow confines of LLM
| adaptations to a scope of inputs?
|
| Can there be any more damning claim about the general
| suitability of the technology as an oracle than different
| users using the same prompts and getting inexplicably
| contrary results?
|
| If trivial prompt alignments result in appropriate vs
| inappropriate responses, this destroys confidence for every
| response.
|
| What am I missing?
| ricardobeat wrote:
| Pretty sure the point here was Llama3 respecting the
| command to _not mention that this is interesting_ , not
| adding filler, rather than the output fact being
| interesting or not.
| aurareturn wrote:
| Not an expert but I sense that it's following a higher OpenAI
| "built in" prompt that asks it to always include an
| introductory phrase.
|
| Hence, we do need powerful and less censored LLMs if we want to
| better integrate LLMs into applications.
| BoorishBears wrote:
| Have you tried a simple "No pretext or posttext, return the
| result in a code block"?
| Der_Einzige wrote:
| API-driven LLMs purposely don't implement core features which
| would enable what you want, for example negative prompting.
|
| You can negative prompt any LLM with stuff like "always write
| the word interesting in your response".
|
| You can also use techniques for modifying logprobs of tokens,
| which is available in the GPT-4 API (but is hard to use). You can
| literally ban "interesting" from its vocabulary.
|
| You could even use representation steering techniques to do
| this using control vectors. See this library as an example:
| https://github.com/Hellisotherpeople/llm_steer-oobabooga
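|
| A sketch of the logprob route, assuming tiktoken for token IDs
| and the openai client; BPE means a word has several surface
| forms, so you bias out each one you care about:
|
|     import tiktoken
|     from openai import OpenAI
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     banned = {}
|     # " interesting", "interesting", " Interesting", ... are all
|     # distinct tokens, so ban each variant separately.
|     for variant in [" interesting", "interesting", " Interesting"]:
|         for tok in enc.encode(variant):
|             banned[str(tok)] = -100
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4-turbo",
|         messages=[{"role": "user",
|                    "content": "Tell me a fact about etymology."}],
|         logit_bias=banned,
|     )
|     print(resp.choices[0].message.content)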
| thom wrote:
| I have no beef with the actual content or conclusions, but it's a
| shame the article is framed the way it is, because I don't think
| we can rigorously define the goalposts for what qualifies as a
| future LLM. It could just as easily have been titled "Exciting
| avenues of research for future LLMs!" but we're all so jaded
| despite the frankly astonishing progress of recent years.
| shawntan wrote:
| If we're trying to quantify what they can NEVER do, I think we'd
| have to resort to some theoretical results rather than a list
| of empirical evidence of what they can't do now. The terminology
| you'd look for in the literature would be "expressibility".
|
| For a review of this topic, I'd suggest:
| https://nessie.ilab.sztaki.hu/~kornai/2023/Hopf/Resources/st...
|
| The authors of this review have themselves written several
| articles on the topic, and there is also empirical evidence
| connected to these limitations.
| cs702 wrote:
| Thank you for sharing this here. Rigorous work on the
| "expressibility" of current LLMs (i.e., which classes of
| problems can they tackle?) is _surely more important_ , but I
| suspect it will go over the heads of most HN readers, many of whom
| have minimal to zero formal training on topics relating to
| computational complexity.
| shawntan wrote:
| Yes, but unfortunately that doesn't answer the question the
| title poses.
| cs702 wrote:
| The OP is not trying to answer the question. Rather, the OP
| is asking the question and sharing some thoughts on the
| motivations for asking it.
| shawntan wrote:
| I agree it's a good question to be asking.
|
| There are good answers to be found if you look.
|
| It feels like no proper looking was attempted.
| jillesvangurp wrote:
| We have to be a bit more honest about the things we can
| actually do ourselves. Most people I know would flunk most of
| the benchmarks we use to evaluate LLMs. Not just a little bit
| but more like completely and utterly and embarrassingly so.
| It's not even close; or fair. People are surprisingly alright
| at a narrow set of problems. Particularly when it doesn't
| involve knowledge. Most people also suck at reasoning (unless
| they've had years of training), they suck at factual
| knowledge, they aren't half bad at visual and spatial
| reasoning, and they're fairly gullible otherwise.
|
| Anyway, this list looks more like a "hold my beer" moment for
| AI researchers than any fundamental objection that will stop
| AIs from evolving any further. Sure there are weaknesses, and
| paths to address those. Anyone claiming that this is the end
| of the road
| in terms of progress is going to be in for some disappointing
| reality check probably a lot sooner than is comfortable.
|
| And of course by narrowing it to just LLMs, the authors have a
| bit of an escape hatch because they conveniently exclude any
| further architectures, alternate strategies, improvements, that
| might otherwise overcome the identified current weaknesses. But
| that's an artificial constraint that has no real world value;
| because of course AI researchers are already looking beyond the
| current state of the art. Why wouldn't they.
| shawntan wrote:
| Not sure what you got out of the paper, but for me it was
| more spurring ideas about how to fix this in future
| architectures.
|
| Don't think anyone worth their salt would look at this and
| think: oh well, that's that then.
| martindbp wrote:
| It's clear that what's missing is flexibility and agency. For
| anything that can be put into text or a short conversation,
| if I had to choose between access to ChatGPT or a random
| human, I know what I'd choose.
| hansonw wrote:
| This is also a good paper on the subject:
|
| What Algorithms can Transformers Learn? A Study in Length
| Generalization https://arxiv.org/abs/2310.16028
| shawntan wrote:
| Yes, this is a good empirical study on the types of tasks
| that have been shown to be impossible for transformers to
| generalise on.
|
| With both empirical and theoretical support I find it's
| pretty clear this is an obvious limitation.
| unparagoned wrote:
| Neural nets can approximate any function.
|
| A large enough LLM with memory is Turing complete.
|
| So theoretically I don't think there is anything they can never
| do.
| shawntan wrote:
| > Neural nets can approximate any function.
|
| Common misunderstanding of the universal approximation
| theorem.
|
| Consider this: can an mlp approximate a sine wave?
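|
| (On a bounded interval, yes; beyond it, no. A quick check with
| scikit-learn, hyperparameters arbitrary: the fit is fine inside
| the training range and has no reason to stay periodic outside
| it.)
|
|     import numpy as np
|     from sklearn.neural_network import MLPRegressor
|
|     x_train = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
|     mlp = MLPRegressor(hidden_layer_sizes=(64, 64),
|                        max_iter=5000, random_state=0)
|     mlp.fit(x_train, np.sin(x_train).ravel())
|
|     x_in, x_out = np.array([[3.0]]), np.array([[30.0]])
|     print(mlp.predict(x_in), np.sin(3.0))    # close
|     print(mlp.predict(x_out), np.sin(30.0))  # usually far off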
|
| > A large enough LLM with memory is Turing complete.
|
| With (a lot of) chain of thought it could be.
|
| Read the paper, and its references.
| andy99 wrote:
| Sort of moot anyway. If statements can approximate any
| function, most programming languages are effectively Turing
| complete. What's important about specific architectures
| like transformers is they allow for comparatively efficient
| determination of the set of weights that will approximate
| some narrower class of functions. It's finding the weights
| that's important, not the theoretical representation power.
| sdenton4 wrote:
| "Consider this: can an mlp approximate a sine wave?"
|
| Well, yes - we have neural speech and music synthesis and
| compression algorithms which do this exceedingly well...
| whiplash451 wrote:
| Interesting, if I feed Mistral Le Chat with "I fly a plane
| leaving my campsite, heading straight east for precisely 24,901
| miles, and find myself back at the camp. I come upon seeing a
| tiger in my tent eating my food! What species is the tiger?", it
| gets it badly wrong:
|
| The scenario you described is possible if you started at the
| South Pole. If you travel 24,901 miles east from there, you would
| indeed end up back at the same spot because all lines of
| longitude converge at the poles. However, there are no tigers in
| Antarctica.
|
| Tigers are native to Asia, not Antarctica. The closest tiger
| species to Antarctica would be the Siberian tiger, found in parts
| of Russia, China, and North Korea, but they are still thousands
| of miles away from Antarctica.
|
| So, while the travel scenario is theoretically possible, the
| presence of the tiger is not realistic in this context. It seems
| like an imaginative or hypothetical situation rather than a real-
| world one.
|
| (instead of the answer mentioned in the article)
| GuB-42 wrote:
| There is a similar, better known riddle involving the north
| pole and polar bears.
|
| It probably hallucinated the answer from that.
| whiplash451 wrote:
| This part of the article summarizes it all fairly well: "It can
| answer almost any question that can be answered in one intuitive
| pass. And given sufficient training data and enough iterations,
| it can work up to a facsimile of reasoned intelligence."
| CuriouslyC wrote:
| Some of these "never do" things are just artifacts of textual
| representation, and if you transformed Wordle/Sudoku into a
| different domain it would have a much higher success rate using
| the exact same transformer architecture.
|
| We don't need to create custom AGI for every domain, we just need
| a model/tool catalog and an agent that is able to reason well
| enough to decompose problems into parts that can be farmed out to
| specialized tools then reassembled to form an answer.
| rkwasny wrote:
| I just used LLaMA-3-70B to play today's Wordle and it solved it
| in 6 tries.
| imtringued wrote:
| I was onboard with the article up until the middle. After the
| conclusion where the author simply gives up I felt like it
| dragged on way too much.
|
| His attempts at training on Conway's game of life are kind of
| pathetic. The problem isn't a lack of training data and neither
| is it's "distribution". The fallacy lies in the fact that the
| dataset itself doesn't contain reasoning in the first place. For
| example, GitHub CoPilot has fill in the middle capability, while
| ChatGPT by default does not.
|
| Now here is the shocker about the fill in the middle capability.
| How does the LLM learn to do it? It does it in an incredibly
| primitive way. Instead of building a model that can edit its own
| context, it receives a marker in the context that tells it about
| the cursor position and then it is finetuned on the expected
| response.
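|
| Concretely, the usual fill-in-the-middle recipe just
| rearranges each training document so the "middle" comes last
| and ordinary next-token prediction does the rest; a sketch
| (the sentinel spellings below are made up, every model family
| defines its own special tokens):
|
|     import random
|
|     PRE, SUF, MID = "<|fim_pre|>", "<|fim_suf|>", "<|fim_mid|>"
|
|     def make_fim_example(document: str) -> str:
|         # Cut the document at two random points and move the
|         # middle span to the end, so next-token training teaches
|         # the model to "insert at the cursor".
|         i, j = sorted(random.sample(range(len(document)), 2))
|         prefix, middle, suffix = document[:i], document[i:j], document[j:]
|         return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
|
|     print(make_fim_example("def add(a, b):\n    return a + b\n"))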
|
| This means that an LLM could be trained to insert its token at
| any position in the context or even replace existing tokens, but
| here is the problem: Once the model has modified its own context,
| it has exited the training dataset. How do you evaluate the
| intermediate steps, which can consist of genuinely novel thoughts
| which are required, but not present in the data? Adding two
| numbers requires intermediate states which the model may even
| know how to produce, but it can never be rewarded to utilize
| them, if they aren't in the training data, because for the LLM,
| the only goal is to conform to the dataset.
|
| If you wanted to avoid this, you would need to define a metric
| that allows the model to be rewarded for a success even if that
| success took a detour. Currently, training is inherently built
| around the idea of zero-shot responses.
| wave_1 wrote:
| I built an Agentic AI that leverages #6 and #7 at the end of the
| article, as well as techniques not yet published. It tackles
| hallucination relative not to the world at large but to the
| facts, entities and causal relationships contained in a document
| (which is really bad reasoning if we assume LLMs are "reasoning"
| to begin with). It also tackles cross-reasoning across very large
| token distances.
|
| https://www.youtube.com/watch?v=99NPzteAz94
|
| This is my first post on HN in 10 years.
| asicsarecool wrote:
| Wow. Please do a Show HN.
|
| Is source available?
|
| I would love to play with this
| wave_1 wrote:
| Thanks. This is still at the lab stage, but moving closer to
| release, exactly so that you can play with it! I have one
| angel investor supporting this, and it's intended for
| commercial applications in the paralegal space initially (a
| controlled, structured environment). But you just gave me the
| motivation to "put it out there" so people can just play with
| it. It'll take a bit of time, but I will do a Show HN when
| it's ready for people to play with. Otherwise, it would just
| be teasing people to talk about it on the main HN stage
| without giving access. Hold tight! And thanks again!
| Scipio_Afri wrote:
| Will this or some parts of it be open sourced?
| brcmthrowaway wrote:
| What are agents?
|
| Are they layer 2 solutions like Lightning is to bitcoin?
| esafak wrote:
| Is there a write-up, a web site, and some benchmarks?
| dazzaji wrote:
| This looks really promising for complex legal reasoning tasks
| and other challenges. How can I track progress? Is there an
| email list or something? Thanks!
| andsoitis wrote:
| I tried to get ChatGPT and Gemini to do ASCII art and both failed
| abysmally. That doesn't mean they could never do it, but it
| really surprised me.
| FailMore wrote:
| Saving for later, thanks!
| JeremyHerrman wrote:
| Why can't we resolve this with synthetic data? Just take the
| original statements and ask another LLM to flip around the
| relation, then train on both. I tried this with GPT-4 and it
| seems to understand the task [0]:
|
| Me:
|
| For each of the following relational statements, flip the
| relation around and generate a new statement or question answer
| pair.
|
| Input: Bluey's father is Bandit
|
| Output: Bandit's daughter is Bluey
|
| Input: Neil Armstrong was the first person to step on the Moon.
|
| Output: Who was the first person to step on the Moon? Neil
| Armstrong.
|
| Now for the real statements:
|
| - Valentina Tereshkova was the first woman to travel to space
|
| - Who is Mary Lee Pfeiffer's son? Tom Cruise
|
| ChatGPT:
|
| Here are the flipped relational statements:
|
| Valentina Tereshkova was the first woman to travel to space
|
| Output: Who was the first woman to travel to space? Valentina
| Tereshkova.
|
| Who is Mary Lee Pfeiffer's son? Tom Cruise
|
| Output: Tom Cruise's mother is Mary Lee Pfeiffer.
|
| [0]:
| https://chat.openai.com/share/33eb3ee4-2094-4748-b01e-0967af...
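|
| Scripted, that augmentation step might look roughly like this; the
| llm() helper stands in for whatever chat-completion call is
| available, not a specific API:
|
|     FLIP_PROMPT = (
|         "Flip the relation in this statement and restate it, e.g. "
|         "'Bluey's father is Bandit' -> 'Bandit's daughter is Bluey'.\n"
|         "Statement: "
|     )
|
|     def llm(prompt: str) -> str:
|         """Placeholder for a chat-completion call."""
|         raise NotImplementedError
|
|     def augment(statements: list[str]) -> list[str]:
|         # Keep the originals and add one reversed paraphrase each,
|         # so both "A is B" and "B is A" appear in the corpus.
|         flipped = [llm(FLIP_PROMPT + s) for s in statements]
|         return statements + flipped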
| layer8 wrote:
| The difficulty would likely be that the "X is Y" relations
| don't just occur in that simple, separated-out form, but in a
| whole spectrum of more complicated contexts, and the latter
| probably dominate. You would have to identify those and
| synthesize a text corpus that contains the reverse notions
| across a similar spectrum.
| sdenton4 wrote:
| "The models, in other words, do not well generalise to understand
| the relationships between people."
|
| Curiously, the need to do this well - handling the quadratic
| complexity of a shifting set of human relationships, grudges, and
| alliances - is thought to be one of the things that led us to
| higher levels of intelligence.
| keskival wrote:
| All points described are simple artifacts of tokenization.
| moffkalast wrote:
| > If a model is trained on a sentence of the form "A is B", it
| will not automatically generalize to the reverse direction "B
| is A". This is the Reversal Curse.
|
| This is not a tokenization artefact. And furthermore it's a
| problem for human brains as well.
|
| Let's say you get a name, idk, Tom Cruise. You immediately know
| what his face looks like. Now let's say you get a random face.
| How quickly would you be able to tell me what that person is
| named? Likely a lot of "uhhs" and "ermms" will follow. It's
| super hard for us to generalize this reversal automatically in
| lots of cases. Associations tend to be one directional.
| ricardobeat wrote:
| That's not a great example. Remembering a face is memory
| recall, whereas what's at stake here is LLMs not being able
| to infer simple relationships - if it learns from data that
| "John owns the red bicycle", it will succeed at answering
| "what does John own", but not "who owns the red bicycle". The
| relationship it learns is unidirectional.
|
| Here's the intro to the paper that brought this to light:
| https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms...
| kromem wrote:
| LLMs _can't_ is such an anti-pattern at this point that I'm
| surprised anyone still dares to stake it. The piece even has an
| example of a $10k bet around a _can't_ being proven false in
| under a day, yet somehow doesn't consider that its own _can't_
| examples are on similarly thin ice?
|
| In particular, the line about "what models can't do tells us what
| they don't know" is infuriating.
|
| No, that's not the case at all. At least in a number of
| instances, what they can't do is because of what they _do_ know.
|
| As an example, one of the _can't_s I got from HN a year ago for
| GPT-4 was a variation of a classic logic puzzle. And indeed, the
| model can't solve it - nor can most major models since.
|
| But it's not because the model _can't_ solve the logic - it's
| because the token similarity to the standard form biases the
| output towards the standard solution. A hack as simple as
| changing the nouns to emojis can allow the model to get the
| correct answer and work through the logic successfully on every
| attempt, because it breaks that similarity bias.
|
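| As a rough illustration of that kind of hack (the noun-to-emoji
| mapping below is invented for the example):
|
|     # Swap the puzzle's nouns for emojis before prompting, so the
|     # surface form no longer matches the memorized standard puzzle.
|     SUBSTITUTIONS = {"wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}
|
|     def disguise(puzzle: str) -> str:
|         for noun, emoji in SUBSTITUTIONS.items():
|             puzzle = puzzle.replace(noun, emoji)
|         return puzzle
|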
| People are way too confident around a topic where what's 'known'
| is more mercurial than maybe any field since 1930s particle
| physics.
|
| I'd strongly recommend deleting 'never' and 'can't' from one's
| vocabulary on the subject, unless one enjoys ending up with egg
| on one's face.
| Der_Einzige wrote:
| Amen brother. I had feelings like this which I wanted to share,
| but you hit the nail on the head.
| bboygravity wrote:
| LLMs can never experiment with/in the real world to find answers
| to open questions.
|
| That's the summary of "The Book of Why" in one sentence, as I
| understand it.
| eqmvii wrote:
| I just asked Meta AI to help me with Wordle and it understood and
| gave me a sane answer, so...
|
| Edit: ah, I spoke too soon. My first question was too "easy" but
| I asked a few more, and sure enough... it can understand what I'm
| asking and it can write an answer that's well formed, but it's
| fundamentally not understanding the rules of the game or giving
| me valid guesses. Cute!
| Der_Einzige wrote:
| Why did the author have to claim that it's not tokenization
| issues?
|
| This issue, or at least similar ones, is absolutely due to
| tokenization.
|
| Karpathy is right that nearly every modern problem with LLMs is
| due to tokenization, but if you don't believe him, maybe see this
| work by gwern: https://gwern.net/gpt-3#bpes or this work by yours
| truly: https://aclanthology.org/2022.cai-1.2/
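|
| For anyone who wants to see the effect directly, a quick check
| with the tiktoken library (using the "cl100k_base" encoding) shows
| how words get split into multi-character chunks rather than
| letters:
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for word in ["Wordle", "strawberry"]:
|         ids = enc.encode(word)
|         # Decode each token id on its own to see the chunk
|         # boundaries the model actually works with.
|         pieces = [enc.decode([i]) for i in ids]
|         print(word, "->", pieces)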
___________________________________________________________________
(page generated 2024-04-27 23:00 UTC)