[HN Gopher] What can LLMs never do?
       ___________________________________________________________________
        
       What can LLMs never do?
        
       Author : henrik_w
       Score  : 172 points
       Date   : 2024-04-27 11:48 UTC (11 hours ago)
        
 (HTM) web link (www.strangeloopcanon.com)
 (TXT) w3m dump (www.strangeloopcanon.com)
        
       | 3abiton wrote:
       | The article should be titled " What can LLM never do, yet". By
       | definition, Large Language Models would keep growing larger and
       | larger, to be trained on faster and more advanced hardware, and
       | certain points like "completing complex chains of logical
       | reasoning" tasks, would be just a time hurdle. Only time will
       | tell.
        
         | g5g5g5g5 wrote:
         | > The article should be titled " What can LLM never do, yet".
         | 
         | I don't think it should. It's more interesting to know what
         | LLMs will _never_ be able to do (if anything).
        
           | shawntan wrote:
           | Yes, but the article doesn't really answer this question.
        
         | ekianjo wrote:
          | Math reasoning is still an unsolved problem even if the rest
          | of the capabilities are getting better. This means the
          | transformer architecture may not be the best way to approach
          | all problems.
        
         | gmuslera wrote:
          | Maybe the wording is correct. It looks like a hard limit on
          | what an LLM, as such, can do. If it goes beyond that, then it
          | is something more, or at least different, than an LLM.
        
         | amelius wrote:
          | The big question is whether LLMs are capable enough to
          | converge to AGI. It might very well be that as we pour in
          | more resources, they converge to something only slightly more
          | useful than, but similar to, what we have today.
        
         | Quothling wrote:
          | In the Danish public sector we provide services based on
          | needs assessments of citizens, and then we subsequently pay
          | the bills for those services, which amounts to thousands of
          | small invoices having to be paid by a municipality each
          | month. Examples include payments for a dentist visit,
          | transportation and similar. Most of these are relatively
          | small, and we've long since automated the payment of anything
          | below a certain amount. Those systems are faster and less
          | error prone as far as entering valid data goes. They are more
          | prone to decision-making errors, however, and while fraud
          | isn't an issue, sometimes citizens have invoices approved
          | that they aren't entitled to. Since it's less costly to just
          | roll with those mistakes than to try and fix them, it's an
          | accepted loss.
         | 
          | The systems are hugely successful and popular, and this
          | naturally leads to a massive interest in LLMs as the next
          | step. They are incredible tools, but they are based on
          | probability, and while they happen to be useful for almost
          | everything, decision making probably shouldn't be one of
          | those things. Similarly, ML is incredibly helpful in things
          | like cancer detection, but we've already had issues where it
          | got things wrong, and because MBAs don't really know how
          | these systems work, they were used as a replacement for,
          | rather than an enhancement of, the human factor. I'm fairly
          | certain we're going to use LLMs for a lot of things where we
          | shouldn't, and probably never should. I'm not sure we can
          | avoid it, but I wouldn't personally trust them to do any sort
          | of function which will have a big influence on people's
          | lives. I use both Copilot and OpenAI's tools extensively, but
          | I can still prompt them with the same thing and get outputs
          | of extremely different quality, and while this will improve,
          | and while it's very easy to get an output that's actually
          | useful, it's still a major issue that might never get solved
          | well enough for what we're going to ask of the models way
          | before they are ready.
         | 
          | I hope we're going to be clever enough to only use them as
          | enhancement tools in the vital public sector, but I'm sure
          | we're going to use them in areas like education, which is
          | going to be interesting... We already see this with new
          | software developers in my area of the world, who build things
          | with the help of LLMs, things that work but aren't built
          | "right" and will eventually cause issues. For the most part
          | this doesn't matter, but you really don't want the person
          | designing your medical software to use an LLM.
        
         | HarHarVeryFunny wrote:
         | We really don't need time to tell.
         | 
         | Just making a transformer bigger and bigger, and feeding it
         | more and more data, will not change it from being a language
          | model into something else, any more than scaling up an expert
         | system such as Cyc will transform it into something other than
         | an expert system. "Scale it up and it'll become sentient" is
         | one of the recurring myths of AI.. a bit odd that people are
         | falling for it again.
         | 
         | As an aside, it seems reasonable to consider an LLM as a type
         | of expert system - one that has a broad area of expertise (like
         | Cyc), including (unlike Cyc) how to infer rules from language
         | and generate language from rules.
         | 
         | If you want to create a brain-like AGI, then you need an entire
         | cognitive architecture, not just one piece of it which is what
         | we have currently with LLMs. Compared to a brain, an LLM is
         | maybe just like the cortex (without all the other brain parts
         | like cerebellum, hippocampus, hypothalamus and
         | interconnectivity such as the cortico-thalamic loop). It's as
         | if we've cut the cortex out of a dead person's brain, put it in
          | a mason jar to keep it alive, and hooked its inputs and
         | outputs up to a computer. Feed words in, get words out. Cool,
         | but it's not a whole brain, it's a cortex in a mason jar.
        
           | hackinthebochs wrote:
           | >will not change it from being a language model into
           | something else,
           | 
           | This is a pretty empty claim when we don't know what the
           | limits of language modelling are. Of course it will never not
           | be a language model. But the question is what are the limits
           | of capability of this class of computing device?
        
             | HarHarVeryFunny wrote:
              | Some limits are pretty obvious, even if easy to fix.
             | 
             | For example, a pure LLM is just a single pass through a
             | stack of transformer layers, so there is no variable
              | depth/duration (incl. iteration/looping) of thought, and
              | no working memory of corresponding or longer duration
              | other than the embeddings as they pass through. This is
              | going to severely
             | limit their ability to plan and reason since you only get a
             | fixed N layers of reasoning regardless of what they are
             | asked.
             | 
             | Lack of working memory (really needs to be context
             | duration, or longer, not depth duration) has many
             | predictable effects.
             | 
             | No doubt we will see pure-transformer architectures
             | extended to add more capabilities, so I guess the real
             | question is how far these extensions (+scaling) will get
             | us. I think one thing we can be sure of though is that it
             | won't get us to AGI (defining AGI = human-level problem
             | solving capability) unless we add ALL of the missing pieces
             | that the brain has, not just a couple of the easy ones.
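              | 
              | To make the fixed-depth point concrete, here's a toy
              | sketch (numpy, with random matrices standing in for real
              | attention/MLP blocks, and made-up sizes): no matter how
              | hard the prompt is, exactly N layers run and then it
              | stops.
              | 
              |     import numpy as np
              |     
              |     rng = np.random.default_rng(0)
              |     N_LAYERS, D = 12, 64   # made-up sizes
              |     
              |     # stand-ins for transformer blocks (attention + MLP)
              |     layers = [rng.standard_normal((D, D)) / np.sqrt(D)
              |               for _ in range(N_LAYERS)]
              |     
              |     def forward(x):
              |         # exactly N_LAYERS transformations per token,
              |         # however much "thinking" the prompt needs
              |         for w in layers:
              |             x = np.tanh(x @ w)
              |         return x
              |     
              |     print(forward(rng.standard_normal(D)).shape)  # (64,)
              | 
              | Chain-of-thought prompting partially works around this by
              | spending extra output tokens, i.e. extra forward passes,
              | on intermediate steps.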
        
         | Mordisquitos wrote:
         | I think that the article is correct. There are indeed things
          | that L _L_ Ms will never be able to do, at least not
          | consistently, however much the hardware improves or however
          | much more material they are trained on.
         | 
         | How come? Note my emphasis on the 2nd 'L'. I'm not saying that
          | there are things that _AI_ models will never be able to do,
          | I'm saying that there are things that Large _Language_ Models
         | will be unable to do.
         | 
         | Training LLMs is often argued to be analogous to human
         | learning, most often as a defence against claims of copyright
         | infringement by arguing that human creativity is also based on
         | training from copyrighted materials. However, that is a red
         | herring.
         | 
         | The responses from ever more powerful LLMs are indeed
         | impressive, and beyond what an overwhelming majority of us
         | believed possible just 5 years ago. They are nearing and
         | sometimes surpassing the performance of educated humans in
         | certain areas, so how come I can argue they are limited?
         | Consider it from the other side: how come an educated human can
         | create something as good as an LLM can when said human's brain
         | has been "trained" on an infinitesimal fraction of the material
         | which was used to train even the 1st release of ChatGPT?
         | 
         | That is because LLMs do not learn nor reason like humans: they
         | do not have opinions, do not have intentions, do not have
         | doubts, do not have curiosity, do not have values, do not have
         | a model of mind -- they have tokens and probabilities.
         | 
         | For an AI model to be able to do certain things that humans can
         | do it needs to have many of those human characteristics that
         | allow us to do impressive mental feats having absorbed barely
         | any training material (compared to LLMs) and being virtually
         | unable to even remember most of it, let alone verbatim. Such an
         | AI model is surely possible, but it needs a completely
          | different paradigm from straightforward LLMs. That's not to
          | say, however, that a Language Model won't be a necessary
          | module of such an AI; it almost certainly will be, but it
          | will not be sufficient.
        
           | mewpmewp2 wrote:
           | I don't think values, opinions or things like that are needed
            | at all. These are just aspects we have in order to perform
            | in, and together with, society.
           | 
            | Also, doubt is just uncertainty, and can be represented as a
            | probability. Actually, all values and everything else can be
            | represented as numerical probabilities, which is how I
            | personally prefer to think of them as well.
        
             | spacebanana7 wrote:
             | Values and opinions drive human attention, which as
             | transformers demonstrate, is relevant to reasoning.
        
       | puttycat wrote:
       | Simple addition, among other things:
       | 
       | https://github.com/0xnurl/gpts-cant-count
        
         | anshumankmr wrote:
         | Yeah, GPT 4 Turbo in my experience used to get simple counting
         | problems wrong, like adding up a long sequence of 4 digit
         | numbers.
        
       | dcchambers wrote:
       | Guarantee an output will be consistent every time.
        
         | wddkcs wrote:
         | ChatGPT just released a memory feature.
        
         | ekianjo wrote:
         | mostly solved with grammar for local LLMs
        
         | dragonwriter wrote:
         | > Guarantee an output will be consistent every time.
         | 
         | If you mean "consistent with a prior run with the same input",
         | LLMs can absolutely do that, though for most purposes settings
         | are chosen deliberately which do not.
         | 
         | If you mean "consistent with some external structural
         | specification", LLMs can do that, too, e.g., via grammar
         | specifications.
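          | 
          | For the first case, a minimal sketch with the openai Python
          | client (model name and prompt are just placeholders):
          | temperature=0 plus a fixed seed gives repeatable output most
          | of the time; OpenAI documents the seed as best-effort, so
          | it's not a hard guarantee.
          | 
          |     from openai import OpenAI
          |     
          |     client = OpenAI()  # assumes OPENAI_API_KEY is set
          |     
          |     def ask(prompt):
          |         r = client.chat.completions.create(
          |             model="gpt-4-turbo",
          |             messages=[{"role": "user", "content": prompt}],
          |             temperature=0,   # always pick the top token
          |             seed=42,         # best-effort determinism
          |         )
          |         return r.choices[0].message.content
          |     
          |     a = ask("Name three prime numbers.")
          |     b = ask("Name three prime numbers.")
          |     print(a == b)
          | 
          | With a local model, greedy decoding against fixed weights is
          | deterministic in practice, no seed needed.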
        
       | tromp wrote:
       | Train their successor.
       | 
       | Once they absorb theorem provers, they will be able to do lots of
       | math provably correctly. That does mean they should be unable to
       | state "I have proved theorem A in Theory T with proof merkle root
       | R" unless they actually did just that.
        
       | cs702 wrote:
       | Fantastic essay. Highly recommended!
       | 
       | I agree with all key points:
       | 
       | * There are problems that are easy for human beings but hard for
       | _current_ LLMs (and maybe impossible for them; no one knows).
       | Examples include playing Wordle and predicting cellular automata
        | (including Turing-complete ones like Rule 110). We don't fully
       | understand why _current_ LLMs are bad at these tasks.
       | 
       | * Providing an LLM with examples and step-by-step instructions in
       | a prompt means _the user_ is figuring out the  "reasoning steps"
       | and _handing them to the LLM_ , instead of the LLM figuring them
       | out by itself. We have "reasoning machines" that are intelligent
       | but seem to be hitting fundamental limits we don't understand.
       | 
       | * It's unclear if better prompting and bigger models using
       | existing attention mechanisms can achieve AGI. As a model of
       | computation, attention is very rigid, whereas human brains are
       | always undergoing synaptic plasticity. There may be a more
       | flexible architecture capable of AGI, but we don't know it yet.
       | 
       | * For now, using current AI models _requires_ carefully
       | constructing long prompts with right and wrong answers for
       | computational problems, priming the model to reply appropriately,
       | and applying lots of external guardrails (e.g., LLMs acting as
       | agents that review and vote on the answers of other LLMs).
       | 
       | * Attention seems to suffer from "goal drift," making reliability
       | hard without all that external scaffolding.
       | 
       | Go read the whole thing.
        
         | cs702 wrote:
         | As an aside, at one point I experimented a little with
         | transformers that had access to external memory searchable via
         | KNN lookups https://github.com/lucidrains/memorizing-
         | transformers-pytorc... ( _great work_ by lucidrains) or via
         | routed queries with
         | https://github.com/glassroom/heinsen_routing (don't fully
         | understand it; apparently related to attention). Both
         | approaches seemed to work, but I had to put that work on hold
         | for reasons outside my control.
         | 
         | Also as an aside, I'll add that transformers can be seen as a
         | kind of "RNN" that grows its hidden state with each new token
         | in the input context. I wonder if we will end up needing some
         | new kind of "RNN" that can _grow or shrink_ its hidden state
         | and _also_ access some kind of permanent memory as needed at
         | each step.
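          | 
          | For anyone curious, the retrieval step itself is tiny. A toy
          | numpy sketch of a kNN lookup over cached key/value pairs (not
          | the actual memorizing-transformers code; shapes and sizes are
          | made up):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     D, N_MEM, K = 64, 10_000, 4            # made-up sizes
          |     
          |     mem_keys = rng.standard_normal((N_MEM, D))
          |     mem_vals = rng.standard_normal((N_MEM, D))
          |     
          |     def knn_attend(query):                 # query: (D,)
          |         scores = mem_keys @ query          # similarities
          |         top = np.argpartition(scores, -K)[-K:]
          |         w = np.exp(scores[top] - scores[top].max())
          |         w /= w.sum()                       # softmax over k
          |         return w @ mem_vals[top]           # weighted values
          |     
          |     print(knn_attend(rng.standard_normal(D)).shape)  # (64,)
          | 
          | The real thing uses approximate kNN over keys cached from
          | earlier context, but the idea is the same.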
         | 
         | We sure live in interesting times!
        
           | esafak wrote:
           | > transformers that had access to external memory searchable
           | via KNN lookups
           | 
           | This is common, and commonly called _retrieval augmented
           | generation_ , or RAG.
           | 
           | edit: I did not pay attention to the link. It is about Wu et
           | al's "Memorizing Transformers", which contain an _internal_
           | memory.
        
             | cs702 wrote:
             | No. RAG is about finding relevant documents/paragraphs (via
             | KNN lookups of their embeddings) _and then inserting those
              | documents/paragraphs into the input context_, as an input.
             | What I'm talking about is different:
             | https://arxiv.org/abs/2203.08913
        
         | _wire_ wrote:
         | >We don't fully understand why current LLMs are bad at these
         | tasks.
         | 
          | In complete seriousness, can anyone explain why LLMs are
          | good at some tasks?
        
           | j16sdiz wrote:
           | > why LLMs are good at some tasks?
           | 
            | Like how we explain humans doing tasks -- they evolved to
            | do that.
            | 
            | I believe this is a non-answer, but if we are satisfied with
            | that non-answer for humans, why not for LLMs?
        
             | layer8 wrote:
             | I would argue that we are not satisfied with that answer
             | for humans either.
        
           | zer00eyz wrote:
           | Yes:
           | 
            | An LLM isn't a model of human thinking.
           | 
           | An LLM is an attempt to build a simulation of human
           | communication. An LLM is to language what a forecast is to
           | weather. No amount of weather data is actually going to turn
           | that simulation into snow, no amount of LLM data is going to
           | create AGI.
           | 
           | That having been said, better models (smaller, more flexible
           | ones) are going to result in a LOT of practical uses that
           | have the potential to make our day to day lives easier (think
           | digital personal assistant that has current knowledge).
        
             | choeger wrote:
             | Great comment. Just one thought: Language, unlike weather,
             | is meta-circular. All we know about specific words or
             | sentences is again encoded in words and sentences. So the
             | embedding encodes a subset of human knowledge.
             | 
              | Hence, an LLM is predicting not only language but language
             | with some sort of meaning.
        
               | zer00eyz wrote:
                | That re-embedding is also encoded in weather. It is why
               | perfect forecasting is impossible, why we talk about the
               | butterfly effect.
               | 
               | The "hallucination problem" is simply the tyranny of
               | Lorenz... one is not sure if a starting state will have a
               | good outcome or swing wildly. Some good weather models
                | are based on re-running with tweaks to starting params,
                | and then things that end up out of bounds can get tossed.
                | It's harder to know when a result is out of bounds for an
                | LLM, and we don't have the ability to run every request
               | 100 times through various models to get an "average"
               | output yet... However some of the reuse of layers does
               | emulate this to an extent....
        
           | HarHarVeryFunny wrote:
           | I'd guess because the Transformer architecture is (I assume)
           | fairly close to the way that our brain learns and produces
           | language - similar hierarchical approach and perhaps similar
           | type of inter-embedding attention-based copying?
           | 
           | Similar to how CNNs are so successful at image recognition,
           | because they also roughly follow the way we do it too.
           | 
            | Other seq-2-seq language approaches work too, but not as
            | well as Transformers, which I'd guess is due to transformers
           | better matching our own inductive biases, maybe due to the
           | specific form of attention.
        
           | SomeCallMeTim wrote:
           | LLMs are good at tasks that don't require actual
           | understanding of the topic.
           | 
           | They can come up with excellent (or excellent-looking-but-
           | wrong) answers to any question that their training corpus
           | covers. In a gross oversimplification, the "reasoning" they
           | do is really just parroting a weighted average (with
           | randomness injected) of the matching training data.
           | 
           | What they're doing doesn't really match any definition of
           | "understanding." An LLM (and any current AI) doesn't
           | "understand" anything; it's effectively no more than a really
           | big, really complicated spreadsheet. And no matter how
           | complicated a spreadsheet gets, it's never going to
           | understand anything.
           | 
           | Not until we find the secret to actual learning. And
           | increasingly it looks like actual learning probably relies on
           | some of the quantum phenomena that are known to be present in
           | the brain.
           | 
           | We may not even have the science yet to understand how the
           | brain learns. But I have become convinced that we're not
           | going to find a way for digital-logic-based computers to
           | bridge that gap.
        
             | danenania wrote:
             | > the "reasoning" they do is really just parroting a
             | weighted average (with randomness injected) of the matching
             | training data
             | 
             | Perhaps our brains are doing exactly the same, just with
             | more sophistication?
        
           | richardw wrote:
            | LLMs are a compressed and lossy form of our combined writing
           | output, which it turns out is similarly structured enough to
           | make new combinations of text seem reasonable, even enough to
           | display simple reasoning. I find it useful to think "what can
           | I expect from speaking with the dataset of combined writing
           | of people", rather than treating a basic LLM as a mind.
           | 
           | That doesn't mean we won't end up approximating one
           | eventually, but it's going to take a lot of real human
           | thinking first. For example, ChatGPT writes code to solve
           | some questions rather than reasoning about it from text. The
           | LLM is not doing the heavy lifting in that case.
           | 
           | Give it (some) 3D questions or anything where there isn't
           | massive textual datasets and you often need to break out to
           | specialised code.
           | 
           | Another thought I find useful is that it considers its job
           | done when it's produced enough reasonable tokens, not when
           | it's actually solved a problem. You and I would continue to
           | ponder the edge cases. It's just happy if there are 1000
           | tokens that look approximately like its dataset. Agents make
           | that a bit smarter but they're still limited by the goal of
           | being happy when each has produced the required token quota,
           | missing eg implications that we'd see instantly. Obviously
           | we're smart enough to keep filling those gaps.
        
             | tobiasSoftware wrote:
             | "I find it useful to think "what can I expect from speaking
             | with the dataset of combined writing of people", rather
             | than treating a basic LLM as a mind."
             | 
             | I've been doing this as well, mentally I think of LLMs as
             | the librarians of the internet.
        
         | dragonwriter wrote:
         | > There are problems that are easy for human beings but hard
         | for current LLMs (and maybe impossible for them; no one knows).
         | Examples include playing Wordle and predicting cellular
         | automata (including Turing-complete ones like Rule 110). We
         | don't fully understand why current LLMs are bad at these tasks.
         | 
          | I thought we did know for things like playing Wordle: it's
          | because they deal with words as sequences of tokens that
          | correspond to whole words, not sequences of letters, so a game
          | that involves dealing with sequences of letters, constrained
          | to those that are valid words, doesn't match the way they
          | process information?
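          | 
          | This is easy to see with tiktoken (assuming the cl100k_base
          | vocabulary GPT-4 uses); five-letter words are usually not
          | five tokens, so the model never directly "sees" the letters
          | Wordle is about:
          | 
          |     import tiktoken
          |     
          |     enc = tiktoken.get_encoding("cl100k_base")
          |     for word in ["crane", "slate", "wordle"]:
          |         ids = enc.encode(word)
          |         pieces = [enc.decode([i]) for i in ids]
          |         print(word, ids, pieces)
          |     # each word comes out as one or two multi-letter
          |     # chunks, not as c/r/a/n/e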
         | 
         | > Providing an LLM with examples and step-by-step instructions
         | in a prompt means the user is figuring out the "reasoning
         | steps" and handing them to the LLM, instead of the LLM figuring
         | them out by itself. We have "reasoning machines" that are
         | intelligent but seem to be hitting fundamental limits we don't
         | understand.
         | 
         | But providing examples with _different_ , contextually-
          | appropriate sets of reasoning steps can enable the
         | model to choose its own, more-or-less appropriate, set of
         | reasoning steps for particular questions not matching the
         | examples.
         | 
         | > It's unclear if better prompting and bigger models using
         | existing attention mechanisms can achieve AGI.
         | 
         | Since there is no objective definition of AGI or test for it,
         | there's no basis for any meaningful speculation on what can or
         | cannot achieve it; discussions about it are quasi-religious,
         | not scientific.
        
           | jncfhnb wrote:
           | I think one should feel comfortable arguing that AGI must be
           | stateful and experience continuous time at least. Such that a
           | plain old LLM is definitively not ever going to be AGI; but
            | an LLM called in a while-true loop might.
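            | 
            | The smallest possible version of that loop, as a sketch
            | (call_llm is a hypothetical stand-in for whatever model
            | API you like; the point is only that state persists
            | between calls):
            | 
            |     def call_llm(prompt: str) -> str:
            |         # hypothetical stand-in for a real model call
            |         return "(model output for: " + prompt[-40:] + ")"
            |     
            |     state = "goal: book a flight; notes: none yet"
            |     while True:
            |         observation = input("> ")   # input from the world
            |         state = call_llm(
            |             "Previous state:\n" + state + "\n"
            |             "New observation:\n" + observation + "\n"
            |             "Update the state and pick the next action:"
            |         )
            |         print(state)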
        
           | layer8 wrote:
           | Regarding Wordle, it should be straightforward to make a
           | token-based version of it, and I would assume that that has
           | been tried. It seems the obvious thing to do when one is
           | interested in the reasoning abilities necessary for Wordle.
        
         | ccppurcell wrote:
         | I would argue that the G in AGI means it can't require better
         | prompting.
        
           | CamperBob2 wrote:
           | We should probably draw a distinction between a human-
           | equivalent G, which certainly can require better prompting
           | (why else did you go to school?!) and god-equivalent G, which
           | never requires better prompting.
           | 
           | Just using the term 'General' doesn't seem to communicate
           | anything useful about the nature of intelligence.
        
           | dragonwriter wrote:
            | That would be like saying that because humans' output can be
           | better or worse based on better or worse past experience
           | (~prompting, in that it is the source of the equivalent of
           | "in-context learning"), humans lack general intelligence.
        
         | ianbicking wrote:
         | "Providing an LLM with examples and step-by-step instructions
         | in a prompt means the user is figuring out the "reasoning
         | steps" and handing them to the LLM, instead of the LLM figuring
         | them out by itself. We have "reasoning machines" that are
         | intelligent but seem to be hitting fundamental limits we don't
         | understand."
         | 
         | One thing an LLM _also_ doesn't bring to the table is an
         | opinion. We can push it in that direction by giving it a role
         | ("you are an expert developer" etc), but it's a bit weak.
         | 
         | If you give an LLM an easy task with minimal instructions it
         | will do the task in the most conventional, common sense
         | fashion. And why shouldn't it? It has no opinion, your prompt
         | doesn't give it an opinion, so it just does the most normal-
         | seeming thing. If you want it to solve the task in any other
         | way then you have to tell it to do so.
         | 
         | I think a hard task is similar. If you don't tell the LLM _how_
         | to solve the hard task then it will try to approach it in the
         | most conventional, common sense way. Instead of just boring
         | results for a hard task the result is often failure. But hard
         | problems approached with conventional common sense will often
         | result in failures! Giving the LLM a thought process to follow
         | is a quick education on how to solve the problem.
         | 
         | Maybe we just need to train the LLM on more problem solving?
         | And maybe LLMs worked better when they were initially trained
          | on code for exactly that reason: it's a much larger corpus of
         | task-solving examples than is available elsewhere. That is,
         | maybe we don't talk often enough and clearly enough about how
         | to solve natural language problems in order for the models to
         | really learn those techniques.
         | 
         | Also, as the author talks about in the article with respect to
         | agents, the inability to rewind responses may keep the LLM from
         | addressing problems in the ways humans do, but that can also be
         | addressed with agents or multi-prompt approaches. These
         | approaches don't seem that impressive in practice right now,
         | but maybe we just need to figure it out (and maybe with better
         | training the models themselves will be better at handling these
         | recursive calls).
        
       | Xenoamorphous wrote:
       | There's many things they can't do. Even a simple rule like
       | "ensure that numbers from one to ten are written as words and
       | numbers greater ten as digits in the given text" fails for me for
       | so many examples even if it works for many others; few shot,
       | chain of thought, many versions of the prompt, it doesn't matter.
       | Sometimes LLMs will even change the number to something else,
       | even with temp set to 0. And then there's the non-determinism
       | (again with temp=0), you run the same prompt several times and
       | that one time it'll respond with something different.
       | 
       | As amazing as they are, they still have many limitations.
       | 
       | I've been working with ChatGPT and Gemini to apply simple rules
       | like the one above and I got so frustrated.
        
         | vsnf wrote:
         | > ensure that numbers from one to ten as written as words and
         | numbers greater ten as digits in the given text
         | 
         | I can't fault llms for not knowing what to do here because I, a
         | human, have no idea what on earth this means.
        
           | Xenoamorphous wrote:
           | "Ten" is a word, "10" are digits.
           | 
           | I'm not a native English speaker, how would you write it?
           | 
           | FWIW the LLMs get it right many times, but fail other times.
        
             | MichaelZuo wrote:
             | If your not a native English speaker, why are you even
             | expecting the LLM to understand even 80% of the time?
             | 
             | Just ask it in your own native language.
        
               | Xenoamorphous wrote:
               | First of all, the texts the rule has to be applied to are
               | written in English. Second, I believe English is by far
               | (by _far_ ) the most prevalent language in the training
               | dataset for those models, so I'd expect it to work better
               | at this kind of task.
               | 
               | And third, I'm not the only one working on this problem,
               | there are others that are native speakers, and as my
               | initial message stated, there have been many variations
               | of the prompt. None work for all cases.
               | 
               | And lastly, how would you rewrite my sample prompt? Which
               | BTW bad a typo (unrelated to my English skills) that I've
               | now fixed.
        
               | MichaelZuo wrote:
               | To be frank the response itself indicates that you don't
               | really get what was being asked, or maybe how to parse
               | English conversation conventions?
               | 
               | I.e. It doesn't seem to answer the actual question.
               | 
               | They seem to be half responding to the second sentence
               | which was a personal opinion, so I wasn't soliciting any
               | answers about it. And half going on a tangent that seems
               | to lead away from forming a direct answer.
               | 
               | Run these comment through a translation tool if your
               | still not 100% sure after reading this.
        
               | Xenoamorphous wrote:
               | Alright man. So was it a quip when you said "if _your_
               | not a native English speaker"? Ok then. Very funny, I get
               | it now.
        
               | MichaelZuo wrote:
               | I really recommend to use a translator, instead of
               | relying purely on your English comprehension skills.
        
               | Xenoamorphous wrote:
               | Your surname surely seems to indicate that some of your
               | ancestors weren't native English speakers. I hope they
               | didn't get lectured or made fun of by people like you on
               | their poor English skills when they first landed on
               | whichever country you were born.
        
               | Last5Digits wrote:
               | Your English is absolutely fine and your answers in this
               | thread clearly addressed the points brought up by other
               | commenters. I have no idea what that guy is on about.
        
             | bxparks wrote:
             | I couldn't understand the original wording either, but
             | after reading one of the sibling comments that explains it,
             | it suddenly made sense.
             | 
             | I think you left out a few words that most English writers
             | would include. So instead of:
             | 
             | > "ensure that numbers from one to ten as written as words
             | and numbers greater ten as digits in the given text",
             | 
             | something like the following might be better for most
             | people:
             | 
             | > "ensure that _the_ numbers from one to ten _are_ written
             | as words, and _the_ numbers greater ten _are written_ using
             | _numerical_ digits in the given text "
             | 
             | There are multiple ways to write this, so other people may
             | have better versions.
             | 
             | I'm not an English grammar expert, so I cannot explain to
             | you why the addition of those extra words helps with the
             | clarity of that sentence.
        
               | blowski wrote:
                | It's a common style-guide rule in newspapers.
        
               | jiggawatts wrote:
               | Hilariously, you can ask GPT 4 to explain the "why" of
               | arbitrary grammar fixes.
        
           | Snacklive wrote:
           | Given the text "1,2,3,4,5,6,7,8,9,10,11,12" it should result
           | in "one, two, three, four, five, six, seven, eight, nine,
           | ten, 11, 12"
           | 
           | or at least that's my understanding of the prompt
        
           | callmeal wrote:
           | > > ensure that numbers from one to ten as written as words
           | and numbers greater ten as digits in the given text
           | 
           | There are two blue, one red, and 15 green m&ms in this bag.
        
           | FroshKiller wrote:
           | It's a simple prescriptive rule in English. If you are
           | writing about a small number, like less than ten, spell it
           | out. For example: "According to a survey, nine out of ten
           | people agree."
           | 
           | But if you are writing about a large number, particularly one
           | with a lot of different digits, prefer writing the digits: "A
           | mile is 5,280 feet." Compare that to: "A mile is five
           | thousand, two hundred, and eighty feet."
        
           | logtempo wrote:
            | I think he means that numbers less than or equal to ten are
            | written as words, and others are written as numbers.
            | 
            | Given the many responses, it would be fun to see if LLMs beat
            | humans at understanding the sentence, haha
        
             | lukeschlather wrote:
             | to me the main problem is that it should read "numbers
             | greater _than_ ten. " I asked Gemini to rephrase it and
             | Gemini produced correct English with the intended meaning:
             | 
             | > Change all numbers between one and ten to words, and
             | write numbers eleven and above as digits in the text.
             | 
             | It even used eleven rather than ten which sounds like
             | counting.
        
           | zamadatix wrote:
           | I think you may be thrown off because the first "as" is meant
           | to be "are".
        
             | Xenoamorphous wrote:
             | Thanks, that was def a typo that I've fixed now.
        
         | Der_Einzige wrote:
          | All of these issues are entirely due to the tokenization
          | scheme. Literally all of them.
          | 
          | You could get this behavior implemented perfectly with
          | constrained text generation techniques like grammars, or any
          | of the various libraries implementing constrained text gen
          | (e.g. guidance).
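          | 
          | For the curious, a sketch of what that looks like with
          | llama-cpp-python's GBNF grammars (assumes a local GGUF model
          | at ./model.gguf; the grammar here is deliberately trivial, a
          | real one for the number-formatting rule would be bigger):
          | 
          |     from llama_cpp import Llama, LlamaGrammar
          |     
          |     # only allow the words "one".."three" or a bare integer
          |     gbnf = r'''
          |     root   ::= word | number
          |     word   ::= "one" | "two" | "three"
          |     number ::= [1-9] [0-9]*
          |     '''
          |     
          |     llm = Llama(model_path="./model.gguf")
          |     grammar = LlamaGrammar.from_string(gbnf)
          |     
          |     out = llm("How many sides does a triangle have? ",
          |               grammar=grammar, max_tokens=8)
          |     print(out["choices"][0]["text"])  # must match grammar
          | 
          | The sampler masks out every token that would take the output
          | off the grammar, so the format is enforced at decode time
          | rather than asked for in the prompt.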
        
       | gwervc wrote:
       | > But then I started asking myself how can we figure out the
       | limits of its ability to reason
       | 
       | Third paragraph. The entire article is based on the premise LLMs
       | are supposed to reason, which is wrong. They don't, they're tools
       | to generate text.
        
         | stavros wrote:
         | I really hate this reductive, facile, "um akshually" take. If
         | the text that the text-generating tool generates contains
         | reasoning, then the text generation tool can be said to be
         | reasoning, can't it.
         | 
         | That's like saying "humans aren't supposed to reason, they're
         | supposed to make sounds with their mouths".
        
           | Findecanor wrote:
           | Even if the generated text contains reasoning, could the LLM
           | understand and apply it?
        
             | stavros wrote:
             | If I tell GPT-4 to print something, it understands it needs
             | to check if my printer is turned on first and turn it on if
             | it's not, so, yes?
             | 
             | Also, if the generated text contains reasoning, what's your
             | definition of "understanding"? Is it "must be made of the
             | same stuff brains are"?
        
               | RandomLensman wrote:
               | LLMs fail at so many reasoning tasks (not unlike humans
               | to be fair) that they are either incapable or really poor
               | at reasoning. As far as reasoning machines go, I suspect
               | LLMs will be a dead end.
               | 
               | Reasoning here meaning, for example, given a certain
               | situation or issue described being able to answer
               | questions about implications, applications, and outcome
               | of such a situation. In my experience things quickly
               | degenerate into technobabble for non-trivial issues (also
               | not unlike humans).
        
               | stavros wrote:
               | If you're contending that LLMs are incapable of
               | reasoning, you're saying that there's no reasoning task
               | that an LLM can do. Is that what you're saying? Because I
               | can easily find an example to prove you wrong.
        
               | RandomLensman wrote:
               | It could be that all reasoning displayed is showing
               | existing information - so there would be no reasoning,
               | but that aside, what I meant is being able to reason in
               | any consistent way. Like a machine that only sometimes
               | gets an addition right isn't really capable of addition.
        
               | stavros wrote:
               | The former is easy to test, just make up your own puzzles
               | and see if it can solve them.
               | 
               | "Incapable of reasoning" doesn't mean "only solves some
               | logic puzzles". Hell, GPT-4 is better at reasoning than a
               | large number of people. Would you say that a good
               | percentage of humans are poor at reasoning too?
        
               | RandomLensman wrote:
               | Not just logic puzzles but also applying information,
               | and, yes, I tried a few things.
               | 
               | People/humans tend to be pretty poor, too (training can
               | help, though), as it isn't easy to really think through
               | and solve things - we don't have a general recipe to
               | follow there and neither do LLMs it seems (otherwise it
               | shouldn't fail).
               | 
               | What I am getting at is that as far as a reasoning
               | machine is concerned, I'd want it to be like a pocket
               | calculator is for arithmetic, i.e., it doesn't fail other
               | than in some rare exceptions - and not inheriting human
               | weaknesses there.
        
           | rambambram wrote:
           | > If the text that the text-generating tool generates
           | contains reasoning, then the text generation tool can be said
           | to be reasoning, can't it.
           | 
           | I don't know... you're still describing a talking parrot
           | here, if you'd ask me.
        
             | stavros wrote:
             | What's the difference between a human and a talking parrot
             | that can answer any question you ask it?
        
               | cj wrote:
               | The talking parrot can only answer by repeating something
               | it heard before.
               | 
               | Another question you could ask is "What's the difference
               | between a conversation between 2 people and a
               | conversation between 2 parrots who can answer any
               | question?"
        
               | stavros wrote:
               | It doesn't only repeat things it's heard before. Have you
               | ever talked with an LLM?
        
               | throwaway11460 wrote:
               | I had ChatGPT design a programming language for a
               | specific uncommon purpose and then use it. It's creative
               | enough.
        
               | rambambram wrote:
                | Can any question be answered? As long as any reaction to
                | a question is considered an answer, then I see no
               | difference between a human and a parrot.
        
             | dTal wrote:
             | I feel the use of the word "parrot" is unintentionally apt,
             | given that parrots were long thought to be mere mimics but
             | were ultimately shown to have (at least the capacity for)
             | real linguistic understanding.
        
             | wbogusz wrote:
             | I'm not a fan of the talking parrot argument, especially
             | when you're pointing it at models of scale.
             | 
             | The only thing separating a talking parrot and humans is
             | our accuracy in shaping our words to the context in which
             | they're spoken.
             | 
             | Sure it's easy to liken a low resource model to a talking
             | parrot, the output seems no better than selective
             | repetition of training data. But is that really so
             | different from a baby whose first words are mimics from the
             | environment around them?
             | 
              | I would argue that as we learn language we implicitly
              | develop the neural circuitry to continue to improve our
              | lexical outputs, this circuitry being concepts like
              | foresight, reasoning, emotion, logic, etc., and that while
              | we can take explicit action to teach these ideas, they
              | naturally develop in isolation as well.
             | 
             | I don't think language models, especially at scale, are
             | much different. They would seem to similarly acquire
             | implicit circuitry like we do as they are exposed to more
             | data. As I see it, the main difference in what exactly that
             | circuitry accomplishes and looks like in final output has
             | more to do with the limited styles of data we can provide
             | and the limitations of fine tuning we can apply on top.
             | 
             | Humans would seem to share a lot in common with talking
             | parrots, we just have a lot more capable hardware to select
             | what we repeat.
        
               | rambambram wrote:
               | What if we were talking with each other and the right
               | answer for me would be to kiss you on the cheek? Then
               | what?
        
           | rdedev wrote:
           | At some point if you need to generate better text you need to
           | start creating a model of how the world works along with some
           | amount of reasoning. The "it's just a token generator"
            | argument fails to get this part. That being said, I don't
            | think just scaling LLMs is going to get us to AGI, but I
            | don't have any real arguments to support that.
        
       | usgroup wrote:
        | It's an auto-regressive model, so it can't do anything that
        | requires planning tokens in advance.
        | 
        | It can't do anything which implies a large or infinite token
        | space (e.g. video understanding).
        | 
        | It's also limited to a reasonable response length, since token
        | selection is probabilistic at each step. The longer you make
        | the output, the more likely it is to veer off course.
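        | 
        | That last point is just compounding probability. A
        | back-of-the-envelope illustration (the 99.9% per-token
        | "stays on course" rate is an invented number):
        | 
        |     p_ok = 0.999   # per-token chance of not derailing
        |     for n in (100, 1_000, 10_000):
        |         print(n, round(p_ok ** n, 3))
        |     # 100 0.905, 1000 0.368, 10000 0.0
        | 
        | Even a tiny per-token error rate compounds into near-certain
        | drift over long enough outputs.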
        
       | hahajk wrote:
       | Oh, I thought this was going to be a semi-rigorous discussion on
       | computability. It's actually just another "GPT-4 still fails at
       | these prompts" essay.
        
       | jerpint wrote:
        | I found that GPT-4 couldn't play Wordle about a year ago [1].
       | At the time, I thought it must be because it wasn't in the
       | training data but now it seems to point to something larger.
       | 
       | I might just get nerd sniped trying to teach it GoL now...
       | 
       | [1] https://www.jerpint.io/blog/gpt-wordle/
        
       | BenFranklin100 wrote:
       | Reason.
       | 
       | LLMs mimic human language which is separate from reasoning. Tech
       | bros are remarkably ignorant of the field of linguistics and
       | don't appreciate this distinction. They thus mistake the output
       | of LLMs for reason.
        
       | reqo wrote:
       | > This 'goal drift' means that agents, or tasks done in a
       | sequence with iteration, get less reliable. It 'forgets' where to
       | focus, because its attention is not selective nor dynamic.
       | 
       | I don't know if I agree with this. The attention module is
       | specifically designed to be selective and dynamic, otherwise it
       | would not be much different than a word embedding (look up "soft"
       | weights vs "hard" weights [1]). I think deep learning should not
       | be confused with deep RL. LLMs are autoregressive models which
       | means that they are trained to predict the next token and that is
       | all they do. The next token is not necessarily the most
       | reasonable (this is why datasets are super important for better
       | performance). Deep RL models on the other hand, seem to be
       | excellent at agency and decision making (although in restricted
        | environments), because they are trained to do so.
       | 
       | [1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
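        | 
        | To the point about attention being selective and dynamic: the
        | attention weights are recomputed from the input itself on every
        | forward pass. A toy numpy version of scaled dot-product
        | attention (shapes and values made up):
        | 
        |     import numpy as np
        |     
        |     def softmax(x, axis=-1):
        |         e = np.exp(x - x.max(axis=axis, keepdims=True))
        |         return e / e.sum(axis=axis, keepdims=True)
        |     
        |     rng = np.random.default_rng(0)
        |     T, D = 5, 16                     # 5 tokens, toy dim
        |     Q = rng.standard_normal((T, D))  # queries (input-dependent)
        |     K = rng.standard_normal((T, D))  # keys    (input-dependent)
        |     V = rng.standard_normal((T, D))  # values  (input-dependent)
        |     
        |     A = softmax(Q @ K.T / np.sqrt(D))  # "soft" weights
        |     out = A @ V
        |     print(A.round(2))   # different input -> different weights
        | 
        | A fixed word embedding, by contrast, would apply the same
        | weights no matter what the surrounding tokens are.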
        
         | visarga wrote:
         | LLMs are fine-tuned with RL too. They are NOT simply next token
         | predictors. RLHF uses whole answers at once to generate
         | gradients, so it is looking further into the future. This might
         | not be perfect but it is clearly more than focusing just 1
         | token ahead.
         | 
         | In the future the RL part of LLM training will increase a lot.
         | Why am I saying this? There are two sources for learning - the
         | past and the present. Training on human text is using past
         | data, that is off-policy. But training on interactive data is
         | on-policy. There is nothing we know that doesn't come from the
         | environment. What is not written in any books must be learned
         | from outside.
         | 
         | That is why I think supervised pre-training from human text is
         | just half the story and RL based agent learning, interactivity
          | in other words, is the next step. The two feet on which
         | intelligence stands are language (past experience) and
         | environment (present experience). We can't get ahead without
         | both of them.
         | 
         | AlphaZero showed what an agent can learn from an environment
         | alone, and LLMs show what they can learn from humans. But the
         | world is big, there are plenty of environments that can provide
         | learning signal, in other words feedback to LLMs.
        
       | roenxi wrote:
       | It is interesting that all the examples I noticed in this article
       | have a geometric aspect (even wordle - I model it as a grid with
       | geometric rules when playing it). I think that the "first grader"
       | comment is actually somewhat illuminating - it takes several
       | years of learning how to navigate in a spatial world before this
       | stuff becomes trivially easy.
       | 
        | The underlying point of this article might be that LLMs don't
       | understand the non-textual aspects of a grid. Which is a fair
       | point, they model language, not space. I wouldn't expect text
       | corpuses to explain space either, since possibly literally
       | everyone who can read and write already knows a lot about spatial
       | layouts.
        
       | GaggiX wrote:
       | >Another assumption is that it's because of tokenisation issues.
       | But that can't be true either.
       | 
        | It's definitely a tokenizer issue: if GPT-4 were trained on
        | individual characters, I'm pretty sure it would be able to play
        | Wordle much better. GPT-4 as trained today has quite lossy
        | knowledge of the characters inside a specific token; a probable
        | fix would be to embed that knowledge inside the embeddings.
        
       | enraged_camel wrote:
       | Yesterday I asked ChatGPT 4 to write a paragraph with exactly
       | five unique palindromes, and for some reason it really, really
       | struggled. First it wrote a paragraph with four palindromes, then
       | it rewrote it but some palindromes were repeated with a total of
       | seven, etc.
        
       | smusamashah wrote:
       | Do we have an open/shared list of problems that LLMs can't solve?
       | 
       | People have mentioned some other problems apart from those in the
       | article. Someone should compile these and put them up if they
       | haven't been already
        
       | oldandtired wrote:
       | Far too many people (including AI researchers themselves) fail to
       | see that all LLMs are actually simple machines. Extremely simple
       | machines that are only mechanically following a relatively simple
       | programming path.
       | 
       | Now before anyone gets too caught up with objecting to this
       | notion, I would seriously suggest that you spend time with
       | observing children from new-born to 2 years.
       | 
        | I have been observing my latest granddaughter since her birth
        | about 16 months ago and thinking about every public LLM system
        | currently available.
       | 
       | There is an insight here to be obtained and that insight is in
       | the nature of real intelligence.
       | 
       | On the whole, no-one actually knows what intelligence is or what
       | sentience is or what it means to be cognitively conscious. There
       | is still much research going on and nothing actually definitive
       | has come forth yet. We really are at the beginning in terms of
       | studying these areas.
       | 
       | We can certainly produce some incredible systems, but none of
       | them are intelligent per se. Solutions to certain kinds of
       | problems can be achieved using these systems and there are
       | researchers who are seriously looking at incorporating these
        | systems into CAS and theorem provers. These systems, though,
        | only provide an augmentation service for a person, as does every
        | mechanical system we use.
       | 
       | But there is an essential component necessary for the use of all
       | LLMs which many seem to not be cognisant of and that is these
       | systems, to be useful, require humans to be involved.
       | 
        | The questions we have to ask ourselves are: what can we use
        | these systems for, do these uses provide benefits in some way,
        | and can these systems be abused by various parties to obtain
        | control over others?
       | 
       | There are benefits and there are abuses. Can we do better or will
       | we do worse by using them?
        
       | donatj wrote:
       | I have been trying to generate some text recently using the
       | ChatGPT API. No matter how I word "Include any interesting facts
       | or anecdotes without commenting on the fact being interesting" it
       | ALWAYS starts out "One interesting fact about" or similar
       | phrasing.
       | 
       | I have honestly spent multiple hours trying to word the prompt so
       | it will stop including introductory phrases and just include the
        | fact straight. I have gone so far as to forbid the words
       | "interesting", "notable" and a few others in the prompt, and it
       | just ignores me and uses them anyway.
       | 
       | I've tried all the different available models and 4 will once in
       | a great while get it right. 3, 3 turbo, and even 4 turbo
       | basically never work as I want.
       | 
       | Overall, it seems pretty bad at being told not to do certain
       | things. For instance there are a number of topics I don't want
        | mentioned in its response, but telling it not to seems to only
        | increase the likelihood that it will mention them, as if you put
        | the idea in its head.
        
         | fallingsquirrel wrote:
         | I just tried this locally with llama3-8b and it handled it
         | fine. Claude 3 Sonnet passes your test too, in case you don't
         | have the hardware for a local model. You might want to consider
         | moving on from ChatGPT since their models have been RLHF'd to
         | hell in the name of "safety" and are falling behind in many
         | ways as a result.
         | 
         | ---
         | 
         | transcript:
         | 
         | $ ollama run llama3:8b
         | 
         | >>> tell me an interesting fact about etymology
         | 
         | Here's one:
         | 
         | Did you know that the word "quarantine" has a fascinating
         | etymological history?
         | 
         | The word "quarantine" comes from^C
         | 
         | >>> tell me an interesting fact about etymology. just the fact,
         | don't mention it's interesting.
         | 
         | The word "robot" originated from the Czechoslovakian word
         | "robota," which means "forced labor" or "drudgery." This term
         | was used in Karel Capek's 1920 play "R.U.R." (Rossum's
         | Universal Robots), where it referred to artificial workers.
        
           | foobiekr wrote:
           | The RUR thing is basically because that specific example is
           | used as an example of interesting etymology.
        
           | _wire_ wrote:
           | Why is any of this "interesting"?
           | 
           | Is the response interesting because you don't know it? --How
           | does it know this?
           | 
           | Is the response interesting because you do know it or might
           | have offered it yourself? --Confirmation bias.
           | 
           | Is it interesting because a lot of training references
           | contextualize it as "interesting?" --Begged question.
           | 
           | Is it contextually interesting? --What is the context? A
           | robot refers to robots? How unexpected...
           | 
           | Is it interesting within the narrow confines of LLM
           | adaptations to a scope of inputs?
           | 
           | Can there be any more damning claim about the general
           | suitability of the technology as an oracle than different
           | users using the same prompts and getting inexplicably
           | contrary results?
           | 
           | If trivial prompt alignments result in appropriate vs
           | inappropriate responses, this destroys confidence for every
           | response.
           | 
           | What am I missing?
        
             | ricardobeat wrote:
             | Pretty sure the point here was Llama3 respecting the
             | command to _not mention that this is interesting_, not
             | adding filler, rather than the output fact being
             | interesting or not.
        
         | aurareturn wrote:
         | Not an expert, but I sense that it's following a higher-level
         | OpenAI "built-in" prompt that asks it to always include an
         | introductory phrase.
         | 
         | Hence, we do need powerful and less censored LLMs if we want to
         | better integrate LLMs into applications.
        
         | BoorishBears wrote:
         | Have you tried a simple "No pretext or posttext, return the
         | result in a code block"?
        
         | Der_Einzige wrote:
         | API-driven LLMs purposely don't implement core features which
         | would enable what you want, for example, negative prompting.
         | 
         | You can negative prompt any LLM with stuff like "always write
         | the word interesting in your response".
         | 
         | You can also use techniques for modifying logprobs of tokens,
         | which is available in the GPT-4 API (but is hard to use). You can
         | literally ban "interesting" from its vocabulary.
         | 
         | You could even use representation steering techniques to do
         | this using control vectors. See this library as an example:
         | https://github.com/Hellisotherpeople/llm_steer-oobabooga
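         | 
         | As a rough sketch of the logprob approach (this assumes the
         | openai Python SDK and tiktoken are installed; the model name
         | and prompt below are only illustrative), banning a word via
         | the logit_bias parameter looks roughly like this:
         | 
         |     from openai import OpenAI
         |     import tiktoken
         | 
         |     client = OpenAI()  # assumes OPENAI_API_KEY is set
         | 
         |     # Collect token ids for the surface forms of "interesting";
         |     # capitalised and leading-space variants tokenize differently.
         |     enc = tiktoken.encoding_for_model("gpt-4")
         |     banned = {}
         |     for form in ("interesting", " interesting", "Interesting", " Interesting"):
         |         for tok in enc.encode(form):
         |             banned[str(tok)] = -100  # -100 effectively bans the token
         | 
         |     resp = client.chat.completions.create(
         |         model="gpt-4",  # illustrative model name
         |         messages=[{"role": "user",
         |                    "content": "Tell me a fact about etymology. "
         |                               "State it directly, with no preamble."}],
         |         logit_bias=banned,
         |     )
         |     print(resp.choices[0].message.content)
         | 
         | Note this only removes specific tokens rather than whole
         | concepts, unlike the steering approaches above.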
        
       | thom wrote:
       | I have no beef with the actual content or conclusions, but it's a
       | shame the article is framed the way it is, because I don't think
       | we can rigorously define the goalposts for what qualifies as a
       | future LLM. It could just as easily have been titled "Exciting
       | avenues of research for future LLMs!" but we're all so jaded
       | despite the frankly astonishing progress of recent years.
        
       | shawntan wrote:
       | If we're trying to quantify what they can NEVER do, I think we'd
       | have to resort to some theoretical results rather than a list of
       | empirical evidence of what they can't do now. The terminology
       | you'd look for in the literature would be "expressibility".
       | 
       | For a review of this topic, I'd suggest:
       | https://nessie.ilab.sztaki.hu/~kornai/2023/Hopf/Resources/st...
       | 
       | The authors of this review have themselves written several
       | articles on the topic, and there is also empirical evidence
       | connected to these limitations.
        
         | cs702 wrote:
         | Thank you for sharing this here. Rigorous work on the
         | "expressibility" of current LLMs (i.e., which classes of
         | problems can they tackle?) is _surely more important_, but I
         | suspect it will go over the heads of most HN readers, many of
         | whom have minimal to zero formal training on topics relating to
         | computational complexity.
        
           | shawntan wrote:
           | Yes, but unfortunately that doesn't answer the question the
           | title poses.
        
             | cs702 wrote:
             | The OP is not trying to answer the question. Rather, the OP
             | is asking the question and sharing some thoughts on the
             | motivations for asking it.
        
               | shawntan wrote:
               | I agree it's a good question to be asking.
               | 
               | There are good answers to be found if you look.
               | 
               | It feels like no proper looking was attempted.
        
         | jillesvangurp wrote:
         | We have to be a bit more honest about the things we can
         | actually do ourselves. Most people I know would flunk most of
         | the benchmarks we use to evaluate LLMs. Not just a little bit
         | but more like completely and utterly and embarrassingly so.
         | It's not even close; or fair. People are surprisingly alright
         | at a narrow set of problems. Particularly when it doesn't
         | involve knowledge. Most people also suck at reasoning (unless
         | they've had years of training), they suck at factual knowledge,
         | they aren't half bad at visual and spatial reasoning, and they
         | are fairly gullible otherwise.
         | 
         | Anyway, this list looks more like a "hold my beer" moment for
         | AI researchers than any fundamental objection that will stop
         | AI from evolving any further. Sure, there are weaknesses, and
         | paths to address those. Anyone claiming that this is the end
         | of the road in terms of progress is in for a disappointing
         | reality check, probably a lot sooner than is comfortable.
         | 
         | And of course by narrowing it to just LLMs, the authors have a
         | bit of an escape hatch because they conveniently exclude any
         | further architectures, alternate strategies, improvements, that
         | might otherwise overcome the identified current weaknesses. But
         | that's an artificial constraint that has no real-world value,
         | because of course AI researchers are already looking beyond the
         | current state of the art. Why wouldn't they?
        
           | shawntan wrote:
           | Not sure what you got out of the paper, but for me it was
           | more spurring ideas about how to fix this in future
           | architectures.
           | 
           | Don't think anyone worth their salt would look at this and
           | think : oh well that's that then.
        
           | martindbp wrote:
           | It's clear that what's missing is flexibility and agency. For
           | anything that can be put into text or a short conversation,
           | if I had to choose between access to ChatGPT or a random
           | human, I know what I'd choose.
        
         | hansonw wrote:
         | This is also a good paper on the subject:
         | 
         | What Algorithms can Transformers Learn? A Study in Length
         | Generalization https://arxiv.org/abs/2310.16028
        
           | shawntan wrote:
           | Yes, this is a good empirical study on the types of tasks
           | that have been shown to be impossible for transformers to
           | generalise on.
           | 
           | With both empirical and theoretical support, I find it pretty
           | clear that this is a genuine limitation.
        
         | unparagoned wrote:
         | Neural nets can approximate any function.
         | 
         | A large enough LLM with memory is Turing complete.
         | 
         | So theoretically I don't think there is anything they can never
         | do.
        
           | shawntan wrote:
           | > Neural nets can approximate any function.
           | 
           | Common misunderstanding of the universal approximation
           | theorem.
           | 
           | Consider this: can an MLP approximate a sine wave?
           | 
           | > A large enough LLM with memory is Turing complete.
           | 
           | With (a lot of) chain of thought it could be.
           | 
           | Read the paper, and its references.
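           | 
           | To make that concrete, here is a minimal sketch (assuming
           | scikit-learn; the layer sizes and sample counts are
           | arbitrary) of fitting an MLP to sin(x) on a bounded interval
           | and then evaluating it outside that interval, where the fit
           | typically falls apart:
           | 
           |     # An MLP fits sin(x) on a bounded interval but does not
           |     # extrapolate the periodic pattern beyond it.
           |     import numpy as np
           |     from sklearn.neural_network import MLPRegressor
           | 
           |     rng = np.random.default_rng(0)
           |     x_train = rng.uniform(0, 2 * np.pi, size=(2000, 1))
           |     y_train = np.sin(x_train).ravel()
           | 
           |     mlp = MLPRegressor(hidden_layer_sizes=(64, 64),
           |                        max_iter=5000, random_state=0)
           |     mlp.fit(x_train, y_train)
           | 
           |     x_in = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
           |     x_out = np.linspace(4 * np.pi, 6 * np.pi, 100).reshape(-1, 1)
           | 
           |     print("error inside :", np.abs(mlp.predict(x_in) - np.sin(x_in).ravel()).mean())
           |     print("error outside:", np.abs(mlp.predict(x_out) - np.sin(x_out).ravel()).mean())
           | 
           | Universal approximation only guarantees the first number can
           | be made small on a compact interval; it says nothing about
           | the second.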
        
             | andy99 wrote:
             | Sort of moot anyway. If-statements can approximate any
             | function, so most programming languages are effectively Turing
             | complete. What's important about specific architectures
             | like transformers is they allow for comparatively efficient
             | determination of the set of weights that will approximate
             | some narrower class of functions. It's finding the weights
             | that's important, not the theoretical representation power.
        
             | sdenton4 wrote:
             | "Consider this: can an mlp approximate a sine wave?"
             | 
             | Well, yes - we have neural speech and music synthesis and
             | compression algorithms which do this exceedingly well...
        
       | whiplash451 wrote:
       | Interesting, if I feed Mistral Le Chat with "I fly a plane
       | leaving my campsite, heading straight east for precisely 24,901
       | miles, and find myself back at the camp. I come upon seeing a
       | tiger in my tent eating my food! What species is the tiger?", it
       | gets it badly wrong:
       | 
       | The scenario you described is possible if you started at the
       | South Pole. If you travel 24,901 miles east from there, you would
       | indeed end up back at the same spot because all lines of
       | longitude converge at the poles. However, there are no tigers in
       | Antarctica.
       | 
       | Tigers are native to Asia, not Antarctica. The closest tiger
       | species to Antarctica would be the Siberian tiger, found in parts
       | of Russia, China, and North Korea, but they are still thousands
       | of miles away from Antarctica.
       | 
       | So, while the travel scenario is theoretically possible, the
       | presence of the tiger is not realistic in this context. It seems
       | like an imaginative or hypothetical situation rather than a real-
       | world one.
       | 
       | (instead of the answer mentioned in the article)
        
         | GuB-42 wrote:
         | There is a similar, better known riddle involving the north
         | pole and polar bears.
         | 
         | It probably hallucinated the answer from that.
        
       | whiplash451 wrote:
       | This part of the article summarizes it all fairly well: "It can
       | answer almost any question that can be answered in one intuitive
       | pass. And given sufficient training data and enough iterations,
       | it can work up to a facsimile of reasoned intelligence."
        
       | CuriouslyC wrote:
       | Some of these "never do" things are just artifacts of textual
       | representation, and if you transformed Wordle/Sudoku into a
       | different domain it would have a much higher success rate using
       | the exact same transformer architecture.
       | 
       | We don't need to create custom AGI for every domain, we just need
       | a model/tool catalog and an agent that is able to reason well
       | enough to decompose problems into parts that can be farmed out to
       | specialized tools then reassembled to form an answer.
        
       | rkwasny wrote:
       | I just used LLaMA-3-70B to play today's Wordle and it solved it
       | in 6 tries.
        
       | imtringued wrote:
       | I was onboard with the article up until the middle. After the
       | conclusion where the author simply gives up I felt like it
       | dragged on way too much.
       | 
       | His attempts at training on Conway's game of life are kind of
       | pathetic. The problem isn't a lack of training data and neither
       | is it's "distribution". The fallacy lies in the fact that the
       | dataset itself doesn't contain reasoning in the first place. For
       | example, GitHub CoPilot has fill in the middle capability, while
       | ChatGPT by default does not.
       | 
       | Now here is the shocker about the fill in the middle capability.
       | How does the LLM learn to do it? It does it in an incredibly
       | primitive way. Instead of building a model that can edit its own
       | context, it receives a marker in the context that tells it about
       | the cursor position and then it is finetuned on the expected
       | response.
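       | 
       | Concretely, the training data is usually rearranged something
       | like this (the <fim_*> sentinel names are just one common
       | convention; exact tokens vary by model, so treat this as an
       | illustrative sketch):
       | 
       |     import random
       | 
       |     def make_fim_example(document: str) -> str:
       |         # Split the document at two random cut points into
       |         # prefix / middle / suffix.
       |         i, j = sorted(random.sample(range(len(document)), 2))
       |         prefix, middle, suffix = document[:i], document[i:j], document[j:]
       |         # The model sees prefix and suffix up front and is trained
       |         # to emit the middle after the <fim_middle> marker, so it
       |         # still only ever predicts left to right.
       |         return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
       | 
       |     print(make_fim_example("def add(a, b):\n    return a + b\n"))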
       | 
       | This means that an LLM could be trained to insert its token at
       | any position in the context or even replace existing tokens, but
       | here is the problem: Once the model has modified its own context,
       | it has exited the training dataset. How do you evaluate the
       | intermediate steps, which can consist of genuinely novel thoughts
       | which are required, but not present in the data? Adding two
       | numbers requires intermediate states which the model may even
       | know how to produce, but it can never be rewarded to utilize
       | them, if they aren't in the training data, because for the LLM,
       | the only goal is to conform to the dataset.
       | 
       | If you wanted to avoid this, you would need to define a metric
       | which allows the model to be rewarded for a success even if that
       | success took a detour. Currently, training is inherently built
       | around the idea of zero shot responses.
        
       | wave_1 wrote:
       | I built an agentic AI that leverages #6 and #7 at the end of the
       | article as well as techniques not yet published. It tackles
       | hallucination relative not to the world at large but to the
       | facts, entities and causal relationships contained in a document
       | (which is really bad reasoning if we assume LLMs are "reasoning"
       | to begin with). It also tackles cross-reasoning with very large
       | token distances.
       | 
       | https://www.youtube.com/watch?v=99NPzteAz94
       | 
       | This is my first post on HN in 10 years.
        
         | asicsarecool wrote:
         | Wow. Please do a show HN.
         | 
         | Is source available?
         | 
         | I would love to play with this
        
           | wave_1 wrote:
           | Thanks. This is just in the labs stage, but moving closer to
           | releasing it, exactly so that you can play with it! I have
           | one angel investor involved in supporting this and it's
           | intended for commercial applications in the paralegal space,
           | initially (controlled, structured environment) But you just
           | gave me the motivation to "put it out there" so people can
           | just play with it. It'll take a bit of time, but I will do a
           | Show HN then when it's ready for people to play with.
           | Otherwise, it would be just teasing people to talk about it
           | on the main HN stage without giving access. Hold tight! And
           | thanks again!
        
             | Scipio_Afri wrote:
             | Will this or some parts of it be open sourced?
        
         | brcmthrowaway wrote:
         | What are agents?
         | 
         | Are they layer 2 solutions like Lightning is to bitcoin?
        
         | esafak wrote:
         | Is there a write-up, a web site, and some benchmarks?
        
         | dazzaji wrote:
         | This looks really promising for complex legal reasoning tasks
         | and other challenges. How can I track progress? Is there an
         | email list or something? Thanks!
        
       | andsoitis wrote:
       | I tried to get ChatGPT and Gemini to do ASCII art and both fail
       | abysmally. Doesn't mean they could never do it, but it really
       | surprised me.
        
       | FailMore wrote:
       | Saving for later, thanks!
        
       | JeremyHerrman wrote:
       | Why can't we resolve this with synthetic data? Just take the
       | original statements and ask another LLM to flip around the
       | relation, then train on both. I tried this with GPT-4 and it
       | seems to understand the task [0]:
       | 
       | Me:
       | 
       | For each of the following relational statements, flip the
       | relation around and generate a new statement or question answer
       | pair.
       | 
       | Input: Bluey's father is Bandit
       | 
       | Output: Bandit's daughter is Bluey
       | 
       | Input: Neil Armstrong was the first person to step on the Moon.
       | 
       | Output: Who was the first person to step on the Moon? Neil
       | Armstrong.
       | 
       | Now for the real statements:
       | 
       | - Valentina Tereshkova was the first woman to travel to space
       | 
       | - Who is Mary Lee Pfeiffer's son? Tom Cruise
       | 
       | ChatGPT:
       | 
       | Here are the flipped relational statements:
       | 
       | Valentina Tereshkova was the first woman to travel to space
       | 
       | Output: Who was the first woman to travel to space? Valentina
       | Tereshkova.
       | 
       | Who is Mary Lee Pfeiffer's son? Tom Cruise
       | 
       | Output: Tom Cruise's mother is Mary Lee Pfeiffer.
       | 
       | [0]:
       | https://chat.openai.com/share/33eb3ee4-2094-4748-b01e-0967af...
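       | 
       | Automating that over a corpus is straightforward in principle; a
       | minimal sketch (assuming the openai Python SDK; the prompt
       | wording and model name are illustrative, not a tested recipe):
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()  # assumes OPENAI_API_KEY is set
       | 
       |     FLIP_PROMPT = (
       |         "Flip the relation in the following statement and output "
       |         "only the reversed statement.\n"
       |         "Example: Bluey's father is Bandit -> Bandit's daughter is Bluey\n"
       |         "Statement: {statement}"
       |     )
       | 
       |     def reversed_statement(statement: str) -> str:
       |         resp = client.chat.completions.create(
       |             model="gpt-4",  # illustrative model name
       |             messages=[{"role": "user",
       |                        "content": FLIP_PROMPT.format(statement=statement)}],
       |         )
       |         return resp.choices[0].message.content.strip()
       | 
       |     corpus = ["Valentina Tereshkova was the first woman to travel to space."]
       |     # Train on both directions so the relation is seen both ways.
       |     augmented = corpus + [reversed_statement(s) for s in corpus]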
        
         | layer8 wrote:
         | The difficulty would likely be that the "X is Y" relations
         | don't just occur in that simple, separated-out form, but in a
         | whole spectrum of more complicated contexts, and the latter
         | probably dominate. You would have to identify those and
         | synthesize a text corpus that contains the reverse notions
         | across a similar spectrum.
        
       | sdenton4 wrote:
       | "The models, in other words, do not well generalise to understand
       | the relationships between people."
       | 
       | Curiously, the need to do this well - handling the quadratic
       | complexity of a shifting set of human relationships, grudges, and
       | alliances - is thought to be one of the things that led us to
       | higher levels of intelligence.
        
       | keskival wrote:
       | All points described are simple artifacts of tokenization.
        
         | moffkalast wrote:
         | > If a model is trained on a sentence of the form "A is B", it
         | will not automatically generalize to the reverse direction "B
         | is A". This is the Reversal Curse.
         | 
         | This is not a tokenization artefact. And furthermore it's a
         | problem for human brains as well.
         | 
         | Let's say you get a name, idk, Tom Cruise. You immediately know
         | what his face looks like. Now let's say you get a random face.
         | How quickly would you be able to tell me what that person is
         | named? Likely a lot of "uhhs" and "ermms" will follow. It's
         | super hard for us to generalize this reversal automatically in
         | lots of cases. Associations tend to be one directional.
        
           | ricardobeat wrote:
           | That's not a great example. Remembering a face is memory
           | recall, whereas what's at stake here is LLMs not being able
           | to infer simple relationships - if it learns from data that
           | "John owns the red bicycle", it will succeed at answering
           | "what does John own", but not "who owns the red bicycle". The
           | relationship it learns is unidirectional.
           | 
           | Here's the intro to the paper that brought this to light:
           | https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-
           | llms...
        
       | kromem wrote:
       | LLMs _can't_ is such an anti-pattern at this point I'm surprised
       | that anyone still dares to stake it. The piece even has an
       | example of a $10k bet around a _can't_ being proven false in
       | under a day, but somehow doesn't think maybe their own _can't_
       | examples are on similarly thin ice?
       | 
       | In particular, the line about "what models can't do tells us what
       | they don't know" is infuriating.
       | 
       | No, that's not the case at all. At least in a number of
       | instances, what they can't do is because of what they _do_ know.
       | 
       | As an example, one of the _can't_s I got from HN a year ago for
       | GPT-4 was a variation of a classic logic puzzle. And indeed, the
       | model can't solve it - nor can most major models since.
       | 
       | But it's not because the model _can't_ solve the logic - it's
       | because the token similarity to the standard form biases the
       | output towards the standard solution. A hack as simple as
       | changing the nouns to emojis can allow the model to get the
       | correct answer and work through the logic successfully every
       | attempt because it breaks that similarity bias.
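       | 
       | A toy illustration of that kind of substitution (the puzzle text
       | and emoji mapping here are made up; the point is only to break
       | surface similarity to the memorised version before prompting):
       | 
       |     # Swap the load-bearing nouns for emojis so the prompt no longer
       |     # pattern-matches the "standard" form of the puzzle.
       |     substitutions = {"farmer": "🧑", "wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}
       | 
       |     def de_bias(prompt: str) -> str:
       |         for noun, emoji in substitutions.items():
       |             prompt = prompt.replace(noun, emoji)
       |         return prompt
       | 
       |     puzzle = ("A farmer must ferry a wolf, a goat and a cabbage across a "
       |               "river, but this time the boat is big enough to carry all "
       |               "of them at once. What is the minimum number of crossings?")
       |     print(de_bias(puzzle))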
       | 
       | People are way too confident around a topic where what's 'known'
       | is more mercurial than maybe any field since 1930s particle
       | physics.
       | 
       | I'd strongly recommend deleting 'never' or 'can't' from one's
       | vocabulary on the subject unless one enjoys ending up with egg
       | on one's face.
        
         | Der_Einzige wrote:
         | Amen brother. I had feelings like this which I wanted to share,
         | but you hit the nail on the head.
        
       | bboygravity wrote:
       | LLMs can never experiment with/in the real world to find answers
       | to open questions.
       | 
       | That's the summary of "The Book of Why" in one sentence as I
       | understand it.
        
       | eqmvii wrote:
       | I just asked MetaAI to help me with Wordle and it understood and
       | gave me a sane answer, so...
       | 
       | Edit: ah, I spoke too soon. My first question was too "easy" but
       | I asked a few more, and sure enough... it can understand what I'm
       | asking and it can write an answer that's well formed, but it's
       | fundamentally not understanding the rules of the game or giving
       | me valid guesses. Cute!
        
       | Der_Einzige wrote:
       | Why did the author have to claim that it's not tokenization
       | issues?
       | 
       | This issue, or at least similar ones, is absolutely due to
       | tokenization.
       | 
       | Karpathy is right that nearly every modern problem with LLMs is
       | due to tokenization, but if you don't believe him, maybe see this
       | work by gwern: https://gwern.net/gpt-3#bpes or this work by yours
       | truly: https://aclanthology.org/2022.cai-1.2/
        
       ___________________________________________________________________
       (page generated 2024-04-27 23:00 UTC)