[HN Gopher] How Chain-of-Thought Reasoning Helps Neural Networks...
       ___________________________________________________________________
        
       How Chain-of-Thought Reasoning Helps Neural Networks Compute
        
       Author : amichail
       Score  : 223 points
       Date   : 2024-03-22 01:50 UTC (21 hours ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | stygiansonic wrote:
       | A simplified explanation, which I think I heard from Karpathy, is
       | that transformer models only do computation when they generate
       | (decode) a token. So generating more tokens (using CoT) gives the
       | model more time to "think".
       | 
       | Obviously this doesn't capture all the nuance.
        
         | sadhorse wrote:
          | Does every token require a full model computation?
        
           | onedognight wrote:
           | No, you can cache some of the work you did when processing
           | the previous tokens. This is one of the key optimization
           | ideas designed into the architecture.
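            | 
            | A toy sketch of the idea (not the real implementation):
            | each step computes only the new token's key/value and
            | reuses the cached ones:
            | 
            |     import numpy as np
            | 
            |     def attend(q, K, V):
            |         # one query attends over all cached keys/values
            |         w = np.exp(q @ K.T / np.sqrt(len(q)))
            |         return (w / w.sum()) @ V
            | 
            |     d = 8
            |     K_cache = np.empty((0, d))
            |     V_cache = np.empty((0, d))
            |     for step in range(5):                 # one new token per step
            |         x = np.random.randn(d)            # newest token's embedding
            |         K_cache = np.vstack([K_cache, x]) # only the new k/v (stand-ins
            |         V_cache = np.vstack([V_cache, x]) # here) are computed; the old
            |         out = attend(x, K_cache, V_cache) # ones come from the cache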
        
         | _boffin_ wrote:
          | One of the things I've been doing with the models I use for
          | coding is adding the stack and primary dependencies to the
          | system prompt before asking or conversing. It has helped out a
          | lot, or at least feels like it has.
        
         | ukuina wrote:
         | This is true. You can get a similar effect by asking the model
         | to plan its path first without writing any code, then asking it
         | to review its plan for deficiencies, and finally asking it to
         | enact the plan and write the code.
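          | 
          | A minimal sketch of that flow, assuming a hypothetical
          | chat(messages) wrapper around whatever completion API you use:
          | 
          |     def plan_review_implement(task, chat):
          |         msgs = [{"role": "user",
          |                  "content": "Plan a solution, no code yet.\n" + task}]
          |         plan = chat(msgs)
          |         msgs += [{"role": "assistant", "content": plan},
          |                  {"role": "user",
          |                   "content": "Review the plan for flaws; revise it."}]
          |         revised = chat(msgs)
          |         msgs += [{"role": "assistant", "content": revised},
          |                  {"role": "user",
          |                   "content": "Now enact the plan and write the code."}]
          |         return chat(msgs)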
        
         | XenophileJKO wrote:
         | So my experience creating products on GPT3.5-Turbo is that
         | there is an upper limit to how much instructional complexity
         | the model can handle at a time. It isn't really about "adding
         | computation", though you are doing this. The key is to
          | construct the process so that the model only has to focus on a
          | limited scope when making each decision.
         | 
         | In effect you are kind of creating a tree structure of
         | decisions that build off of each other. By generating
          | intermediate tokens, the model now only has to pay attention to the
          | smaller set of already-collapsed decisions. It is a little more
         | complicated than that as the model will create anticipatory
         | behavior where intermediate steps get biased by an incorrect
         | result that the model anticipates.
        
           | XenophileJKO wrote:
           | Also I should say it isn't just instructional complexity, it
           | is ambiguity which creates the upper limit on capability.
        
         | bravura wrote:
          | I have another explanation. LLMs are essentially trained on "A
          | B", i.e. on whether it is plausible that B follows A.
         | 
         | There's simply a much larger space of possibilities for shorter
         | completions, A B1, A B2, etc. that are plausible. Like if I ask
         | you to give a short reply to a nuanced question, you could
         | reply with a thoughtful answer, a plausible superficially
         | correct sounding answer, convincing BS, etc.
         | 
         | Whereas if you force someone to explain their reasoning, the
          | space of plausible completions shrinks. If you start with
          | convincing BS and work through it honestly, you will conclude
          | that you should reverse course. (This is similar to how one of the
         | best ways to debunk toxic beliefs with honest people is simply
         | through openly asking them to play out the consequences and
         | walking through the impact of stuff that sounds good without
         | much thought.)
         | 
         | This is similar to the reason that loading your prompt with
         | things that reduce the space of plausible completions is
         | effective prompt engineering.
        
           | jorl17 wrote:
           | I was going to write pretty much this exact same comment. I
           | am an amateur in how LLMs work, definitely, but I always
           | thought this was the plausible explanation.
           | 
           | If I want the "assistant "LLM to tell me "How much 5 times 2
           | is", if I feed it the line "5 * 2 = " as if it's already
           | started giving that answer, it will very likely write 5*2 =
           | 10.
           | 
           | Since LLMs operate on semantic relationships between tokens,
           | the more a bunch of tokens are "close" to a given "semantic
           | topic", the more the LLM will keep outputting tokens in that
           | topic. It's the reason why if you ask an LLM to "review and
           | grade poetry", eventually it starts saying the same thing
           | even about rather different poems -- the output is so filled
           | with the same words, that it just keeps repeating them.
           | 
           | Another example:
           | 
            | If I ask the LLM to solve a riddle just by itself, the
            | LLM may get it wrong. If, however, I start the answer,
            | unravelling a tiny bit of the problem, it will very likely
            | give the right answer, as if it had been "guided" onto the
            | right "problem space".
           | 
           | By getting LLMs to "say" how they are going to solve things
            | and checking for errors, each word basically tugs on the
            | next one, homing in on the correct solution.
           | 
           | In other words:
           | 
           | If an LLM has to answer a question -- any question --, but
           | right after we ask the question we "populate" its answer with
           | some text, what text is more likely to make the LLM answer
           | incorrectly?
           | 
           | - Gibberish nonsense
           | 
           | - Something logical and related to the problem?
           | 
           | Evidently, the more gibberish we give to it, the more likely
           | it is to get it wrong, since we're moving away from the
           | "island of relevant semantic meaning", so to speak. So if we
           | just get the LLM to feed itself more relevant tokens, it
           | automatically guides itself to a better answer. It's kind of
           | like there's an "objective, ideal" sequence of tokens, and it
           | can work as an attractor. The more the LLM outputs words, the
           | more it gets attracted to that sequence...that...."island of
           | relevant semantic meaning".
           | 
           | But, again, I know nothing of this. This is just how I view
           | it, conceptually. It's probably very wrong.
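            | 
            | A small, untested sketch of that "pre-seed the answer"
            | trick using the Hugging Face transformers API (gpt2 is
            | just a stand-in model):
            | 
            |     from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2")
            |     model = AutoModelForCausalLM.from_pretrained("gpt2")
            | 
            |     # The assistant's answer is already started for it.
            |     prompt = "Q: How much is 5 times 2?\nA: 5 * 2 ="
            |     ids = tok(prompt, return_tensors="pt").input_ids
            |     out = model.generate(ids, max_new_tokens=3,
            |                          do_sample=False)
            |     print(tok.decode(out[0][ids.shape[1]:]))  # likely " 10"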
        
             | visarga wrote:
             | That reminds me ... You know how LLMs have a hard time
             | being corrected? If I ask it not to format responses as
             | bullet lists, after 1-2 rounds it does it again. Why?
             | Because the context is filled with examples where it has
              | used bullet lists, and that context acts like an attractor.
             | 
             | I ask it not to start phrases with "However..." and it does
             | it again. Maybe just having the word However in the prompt
             | acts like an attractor that compels the LLM to use it, even
             | when I actually asked the opposite. Probably also the fault
             | of heavy handed RLHF telling it to balance any user
             | position with the opposite take.
        
               | lupire wrote:
                | This is one of many ways LLMs are being crippled by
               | terrible UI controls. You can't do simple things like
               | edit the conversation history to make it forget things.
        
               | gkbrk wrote:
               | You can edit the conversation history though. You need to
               | try alternative apps/UIs instead of the product websites
               | like ChatGPT. Those are only for collecting more training
               | data from users instead of being the most useful
               | interface possible.
        
               | hnben wrote:
               | if you haven't already, I recommend trying the openai
               | playground instead of chatgpt. It is the same underlying
               | ai (i.e. gpt4), but you have much more control over the
               | inputs.
               | 
               | Bonus 1: Since you pay per token, it's much cheaper than
                | a chatgpt subscription
               | 
               | Bonus 2: You can increase the context window dramatically
               | (iirc 8000 being the max for playground, while 2000 is
               | the max for chatgpt)
        
               | dmd wrote:
               | Using a 3rd party interface to the LLMs (like
               | typingmind.com) is both better _and cheaper_ than using
               | chatgpt.
        
           | valine wrote:
           | I think you're right. I would go a step further and say that
           | all learning is roughly synonymous with reducing the output
           | space, and that humans do the exact same thing. There are
           | more ways to get the wrong answer to a math problem than
           | there are to get the right answer. When you learn someone's
           | name, you're narrowing your output to be a single name rather
           | than all plausible names.
           | 
           | The output of a generative model is practically infinite. I
           | suspect it's possible to continually narrow the space of
           | completions and never converge on a single output. If this
           | turns out to be true, it would bode well for the scalability
           | of few-shot learning.
        
           | hackerlight wrote:
           | It helps, but it still gets stuck in local optima based on
           | what it started with. I've never seen it turn around and
           | correct its faulty reasoning unless it tried to actually run
           | the code and observed an Exception. If I respond with "but
           | have you considered XYZ?", my leading question will usually
           | cause it to correct itself, even when it wasn't incorrect.
           | 
           | We need some way to generate multiple independent thoughts in
           | parallel. Each separate thought is constructed using chain of
           | thought to improve the reliability. Then you have some way to
           | "reduce" these multiple thoughts into a single solution. The
           | analogy would be a human brainstorming session where we try
           | to attack the same problem from multiple angles and we try to
           | decorrelate each idea/approach.
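            | 
            | A rough sketch of that pattern (essentially the "self-
            | consistency" trick): sample several independent chains of
            | thought at non-zero temperature, then reduce by majority
            | vote. chat() and extract() are hypothetical helpers:
            | 
            |     from collections import Counter
            | 
            |     def self_consistent_answer(question, chat, extract, n=5):
            |         answers = []
            |         for _ in range(n):
            |             prompt = question + "\nLet's think step by step."
            |             cot = chat(prompt, temperature=0.8)  # decorrelate
            |             answers.append(extract(cot))         # final answer only
            |         return Counter(answers).most_common(1)[0][0]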
        
             | avereveard wrote:
              | We already have that: it's called beam decoding, and there
              | are tree-of-thought solutions as well. For each beam you
              | can pick the one with the best logprob, but it's not a
              | given that the result will be better, because logprob only
              | captures the model's decisiveness, not its correctness, so
              | it'll still fail if a model is confidently wrong.
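              | 
              | Untested sketch of "pick the beam with the best logprob"
              | using the Hugging Face transformers API (gpt2 is just a
              | stand-in model):
              | 
              |     import torch
              |     from transformers import AutoModelForCausalLM, AutoTokenizer
              | 
              |     tok = AutoTokenizer.from_pretrained("gpt2")
              |     model = AutoModelForCausalLM.from_pretrained("gpt2")
              | 
              |     ids = tok("The answer is", return_tensors="pt").input_ids
              |     out = model.generate(
              |         ids, num_beams=4, num_return_sequences=4,
              |         max_new_tokens=20, return_dict_in_generate=True,
              |         output_scores=True)
              |     # keep the candidate with the highest sequence logprob
              |     best = out.sequences[torch.argmax(out.sequences_scores)]
              |     print(tok.decode(best, skip_special_tokens=True))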
        
               | exe34 wrote:
               | I think this is different, because you could include tool
               | use in the branches. E.g.
               | 
               | 1. rewrite the following question in five different ways.
               | 
               | 2. For each version of the question, write python code to
               | do the work.
               | 
                | 3. Look at all the outputs, write an answer (rough
                | sketch below).
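                | 
                | Rough sketch of that branch-and-reduce loop; chat()
                | and run_python() are hypothetical helpers (the latter
                | would sandbox and execute the generated code):
                | 
                |     def branch_and_reduce(question, chat, run_python, k=5):
                |         answers = []
                |         for _ in range(k):
                |             variant = chat("Rewrite this question:\n"
                |                            + question)
                |             code = chat("Write Python that answers:\n"
                |                         + variant)
                |             answers.append(run_python(code))
                |         joined = "\n".join(str(a) for a in answers)
                |         return chat("Given these outputs, write an answer:\n"
                |                     + joined)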
        
           | euroderf wrote:
           | > This is similar to the reason that loading your prompt with
           | things that reduce the space of plausible completions is
           | effective prompt engineering.
           | 
           | And this is why taking your time to write a detailed software
           | help request delivers a good chance that you will solve your
           | problem all by your lonesome.
        
             | exe34 wrote:
             | A rubber duck is all you need.
        
             | doctoboggan wrote:
             | Yes, my fear of stack overflow moderators has caused me to
             | solve many problems before I even finish writing the
             | question.
        
           | naasking wrote:
           | > This is similar to how one of the best ways to debunk toxic
           | beliefs with honest people is simply through openly asking
           | them to play out the consequences and walking through the
           | impact of stuff that sounds good without much thought.
           | 
           | Actually, one of the best ways is pretending to be more
           | extreme than them. Agree with them on everything, which is
           | disarming, but then take it a step or two even further. Then
           | they're like, "now hang on, what about X and Y" trying to
           | convince you to be more reasonable, and pretty soon they
           | start seeing the holes and backtrack to a more reasonable
           | position.
           | 
           | https://www.pnas.org/doi/abs/10.1073/pnas.1407055111
        
         | Zondartul wrote:
         | The tokens are also necessary to store information, or at least
         | off-load it from neuron activations.
         | 
          | E.g. if you ask an LLM to "think about X and then do Y" and the
          | "think about X" part is silent, the LLM has a high chance of:
         | 
         | a) just not doing that, or
         | 
         | b) thinking about it but then forgetting, because the capacity
         | of 'RAM' or neuron activations is unknown but probably less
         | than a few tokens.
         | 
         | Actually, has anyone tried to measure how much non-context data
          | (i.e. new data generated from context data) an LLM can keep "in
         | memory" without writing it down?
        
           | pgorczak wrote:
           | I don't think commonly used LLM architectures have internal
           | state that carries over between inference steps, so shouldn't
           | that be none? Unless you mean the previously generated tokens
           | up to the context limit which is well defined.
        
             | wnmurphy wrote:
             | Correct, there's no internal state, but CoT techniques
             | simulate this by providing a space for the model to
             | generate tokens which represent intermediary thoughts.
        
             | Zondartul wrote:
             | Sorry, I meant the information that is inferred (from
             | scratch on every token) from the entire context, and is
             | then reduced to that single token. Every time a token is
             | generated, the LLM looks at the entire context, does some
             | processing (and critically, this step generates new data
             | that is inferred from the context) and then the result of
             | all that processing is reduced to a single token.
             | 
             | My conjecture is that the LLM "knows" some things that it
             | does not put into words. I don't know what it is, but it
             | seems wasteful to drop the entire state on every token. I
             | even suspect that there is something like a "single logic
             | step" of some conclusions from the context. Though I may be
             | committing the fallacy of thinking in symbolic terms of
             | something that is ultimately statistical.
        
         | rdedev wrote:
         | Do you think there is a fundamental difference between masked
          | language modelling and causal language modelling? I feel like
          | most LLMs are decoder-only models just because they are easier to
          | train, since their attention mask is fixed.
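          | 
          | For reference, a tiny sketch of the two mask shapes being
          | compared (causal for decoder-only models, full/bidirectional
          | for masked-LM encoders):
          | 
          |     import numpy as np
          | 
          |     T = 5                              # sequence length
          |     causal = np.tril(np.ones((T, T)))  # token t sees tokens <= t
          |     full = np.ones((T, T))             # every token sees every token
          |     print(causal)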
        
         | nextaccountic wrote:
          | This raises the question: why is it that giving them more time to
         | "think" yields better answers, and is there any limit to that?
         | If I make them write hundreds of pages of explanation, there
         | must be a diminishing returns of some kind. What influences the
         | optimal amount of thinking?
         | 
         | My guess is that good answers are more well reasoned than
         | answers that are short and to the point, and this is picked up
         | in training or fine-tuning or some other step.
         | 
         | And probably the optimal amount of thinking has something to do
         | with the training set or the size of the network (wild
         | guesses).
        
           | lappa wrote:
           | Look at it from an algorithmic perspective. In computer
           | science many algorithms take a non-constant number of steps
            | to execute. However, in transformer models there are a
            | limited number of decoder blocks, and a limited number of FFN
            | layers in each block. This puts a theoretical upper bound
            | on the complexity of the algorithms a decoder network can
            | execute in a single token-generation pass.
           | 
           | This explains why GPT4 cannot accurately perform large number
           | multiplication and decimal exponentiation. [0]
           | 
           | This example can extend to general natural language
           | generation. While some answers can be immediately retrieved
           | or generated by a "cache" / algorithm which exists in latent
           | space, some tokens have better quality when their latent-
           | space algorithm is executed in multiple steps.
           | 
           | [0] https://www.semanticscholar.org/reader/817e52b815560f9517
           | 1d8...
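            | 
            | A toy illustration of the non-constant-steps point:
            | grade-school multiplication of two n-digit numbers takes
            | on the order of n^2 digit-level steps, so the work grows
            | with the input while a single forward pass has a fixed
            | number of layers:
            | 
            |     def digit_steps(a, b):
            |         steps = 0
            |         for _ in str(a):
            |             for _ in str(b):
            |                 steps += 1   # one partial product per digit pair
            |         return steps
            | 
            |     print(digit_steps(12, 34))          # 4 steps
            |     print(digit_steps(123456, 654321))  # 36 steps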
        
             | visarga wrote:
             | > Quiet-STaR: Language Models Can Teach Themselves to Think
             | Before Speaking
             | 
             | This paper suggests that a large language model should
             | "think ahead" by predicting not only the next token but
             | also a "supporting thought." The approach involves
             | generating all tokens simultaneously, allowing for a single
             | forward pass that produces both the next token and a
             | supporting thought, which might consist of, for example, 16
             | tokens.
             | 
             | This supporting thought influences the model's prediction.
             | The process is then extended to multiple supporting
             | thoughts by ingeniously masking cross-attention between
             | thoughts to ensure their independence. So in essence we can
             | fill all the remaining context with supporting thoughts and
             | benefit from all of them in the same single forward pass.
             | 
             | The supporting thoughts themselves are trained with the
             | objective to maximize the probability of a longer sequence
             | ahead, using RL. So they are trained to optimize for
             | longer-term, instead of the myopic next token prediction
             | task.
             | 
             | https://arxiv.org/abs/2403.09629
        
           | wnmurphy wrote:
           | I think it's fairly simple: you're creating space for
           | intermediary tokens to be generated, where those intermediary
           | tokens represent "thoughts" or a simulated internal dialog.
           | 
           | Without that, it's analogous to asking someone a question and
           | they immediately start responding from some information
           | they'd heard before, rather than taking some time to have an
           | inner dialog with themself.
        
             | kelseyfrog wrote:
             | There's a recent paper which seeks to explicitly perform
             | time-to-think using pause tokens[1].
             | 
             | > However sophisticated this end-to-end process may be, it
             | abides by a peculiar constraint: the number of operations
             | determining the next token is limited by the number of
             | tokens seen so far.
             | 
             | There are obviously pros and cons to each, but nothing
             | excludes us from combining the two either.
             | 
             | 1. Think before you speak: Training Language Models With
             | Pause Tokens https://arxiv.org/abs/2310.02226v2
        
         | earslap wrote:
         | The autoregressive transformer architecture has a constant cost
         | per token, no matter how hard the task is. You can ask the most
         | complicated reasoning question, and it takes the same amount of
         | computation to generate the next token compared to the simplest
         | yes / no question. This is due to architectural constraints.
         | Letting the LLM generate "scratch" data to compute (attend to
         | relevant information) is a way of circumventing the constant
         | cost limitation. The harder the task, the more "scratch" you
         | need so more relevant context is available for future tokens.
        
           | visarga wrote:
           | That's flatly wrong. Each successive token costs
           | progressively more. The deeper a token is in the sequence,
           | the more past states it has to attend to. As a proof, just
           | remember how slow it gets when the context is large, and how
           | snappy when you first start a chat.
        
             | shawntan wrote:
             | You're both kinda right. The type of computation that
             | happens for that attention step that you refer to is
             | parallel. I would say the thing that is "constant" is the
             | computation graph depth (the number of sequential
             | computations) which is actually important in computing
             | certain functions.
             | 
             | https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
        
               | visarga wrote:
               | > The type of computation that happens for that attention
               | step that you refer to is parallel
               | 
               | Flash attention, which is widely used, is no longer
                | parallel. The attention matrix is computed block by block.
        
             | earslap wrote:
             | The way I worded it, it might seem wrong - and I agree with
             | you. When I said "constant" I meant without any
              | optimizations to speed up shorter contexts: at the full
              | designed context length, architecturally, it is constant. You can
             | pad shorter active contexts with zeroes and avoid attending
             | to empty spaces as an optimization, but that is just an
             | optimization, not an architectural property. If you want
             | "more computation" you fill the context with relevant data
             | (chain of thought, or n-shot stuff), which is the "trick"
             | Karpathy alluded to (it provides more context to attend
             | to), and I agree with that analysis.
        
         | WithinReason wrote:
          | That's what I thought at first, but that actually doesn't make
          | sense: the amount of work done on a string is the same even if
          | the string is followed by padding, due to the mask used in
          | attention. Then I realised that an LLM's working memory is
          | limited to its activations, which can be limiting. But it can
          | extend its working memory by writing partial results to the
          | output and reading them back in. E.g. if you tell it to "think of a
          | number" without telling you what it is, it can't do that: there
          | is nowhere to store that number; it has no temporary storage
          | other than the tape. But if you ask it to "think step by step"
         | you let it store intermediate results (thoughts) on the tape,
         | giving it extra storage it can use for thinking.
        
         | tmalsburg2 wrote:
          | Do LLMs not also think when they encode the prompt? If
         | Karpathy's explanation is accurate, longer prompts should also
         | help even if they don't contain additional information, just by
         | virtue of giving more time to think.
        
           | Me1000 wrote:
           | The time processing the longer prompt isn't being spent
           | churning (i.e. "thinking") on the problem at hand, it's spend
           | calculating attention matrices between all the tokens. The
           | time spent on this is a function of the number of flops you
           | have available.
           | 
            | So no, if you just fill up your context window with garbage,
           | the LLM will not perform better at your task/question.
        
       | fzaninotto wrote:
       | Great article. Now what happens when you apply this idea and let
        | an LLM continue a chain of thought beyond mere question answering?
       | Some form of artificial consciousness.
       | 
       | We've made this experiment:
       | https://marmelab.com/blog/2023/06/06/artificial-consciousnes...
        
         | starbugs wrote:
         | Material reductionism at its best. Now you have a stochastic
         | parrot "talking" to itself. How can anyone get to the
         | conclusion that this could even begin to resemble a tiny bit of
         | what we call consciousness?
         | 
         | Good luck with this dead end.
        
           | FrustratedMonky wrote:
           | Because you aren't conscious, Parrot. I can tell because
           | "you" gave a parroted response.
        
       | activatedgeek wrote:
       | I want to point out a tweet [1] that is very relevant to the
        | miracle of CoT, and is probably a simpler explanation.
       | > Let's think "step by step"!            > Another tidbit I like
       | about data and prompts that miraculously work.       > Searching
       | for this phrase resulted in this website (among others),
       | > http://geteasysolution.com, containing many math step-by-step
       | solutions.        > How common are they? Quite.            >
       | Makes you think.
       | 
       | [1]: https://twitter.com/yanaiela/status/1765077404043952516
        
         | FeepingCreature wrote:
         | Though that justifies the specific phrase, it doesn't really
         | contradict the usual explanations of how CoT works. Like... the
         | phrase directs it into the conceptual space of a website that
         | has lots of CoT examples, but if CoT didn't help it think, that
         | wouldn't actually result in better outputs.
        
           | activatedgeek wrote:
            | I hesitate to use the description "think"; it's just biasing
            | correlations for subsequent generations.
           | 
           | In any case, there is at least one work that shows that CoT
           | may not be necessary and biasing the decoding path via logit
           | probabilities is also promising. [1]
           | 
           | One could argue it still doesn't contradict the benefits of
           | CoT, but I suspect there is nothing fundamental about CoT,
           | except that we happened to have been pre-training on
           | sequences that use certain prompts that were easy to conceive
           | from a human's perspective.
           | 
           | [1]: https://arxiv.org/abs/2402.10200
        
       | patcon wrote:
       | Chain of thought reminds me of "muddling through", which
       | immediately clicks with my intuition of the right approach to
       | approximations of intelligence:
       | https://studio.ribbonfarm.com/p/massed-muddler-intelligence#...
        
       | MrYellowP wrote:
       | I thought this was already obvious.
       | 
       | It's all just about the awareness of contexts. Want to improve
       | it? Simply add a term to the prompt to unlock more
       | considerations. Assuming we've not reached the edge of the
       | context window, every new word "unlocks" new vectors with more
        | context the language model adds to its considerations.
       | 
       | The similarity with how the human brain (seems to) works is so
       | remarkable, it doesn't even make sense not to use it as an
       | analogue for how to better use language models.
       | 
       | When the results (same way of manipulating an LLM as manipulating
       | a human brain ... using the right words) can be achieved the same
       | way, why believe there's a difference?
       | 
       | This is stuff one can learn over time by using/researching 3B
       | models. While most people seem to shun them, some of them are
        | extremely powerful, like the "old" orca mini 3B. I am still
       | using that one! All they really need is better prompts and that
       | approach works perfectly fine.
       | 
       | The biggest hurdle I've found is the usually small context window
        | of such small models, but there are ways of cheating around that
       | without sacrificing too much of the quality using small rope
       | extension, summarizing text, adding context words or leaving out
       | letters of words in the prompt, virtually increasing the size of
       | the context window.
       | 
       | If you want to improve the results of your language model, you
       | should become a mentalist/con-man/magician/social engineer. It
       | sounds weird, but it works!
        
         | Folcon wrote:
         | This is fascinating, do you have any more details or things
         | that I could look at to explore this further?
         | 
         | Even an actual example would be helpful!
        
         | nicklecompte wrote:
         | Nothing about what you're saying actually deals with this non-
         | obvious limitation of chain-of-thought:
         | 
         | > Examples like this suggest that transformers wouldn't gain
         | much from using just a few intermediate steps. Indeed, Merrill
         | and Sabharwal proved that chain of thought only really begins
         | to help when the number of intermediate steps grows in
         | proportion to the size of the input, and many problems require
         | the number of intermediate steps to grow much larger still.
         | 
         | This aligns with my experience: GPT-4 can only break down
         | "simple" problems when prompted to solve step-by-step. In
         | particular, if the actual steps need to be broken down further
         | (O(n^2) complexity), GPT-4 can't handle it reliably - it will
          | break a task into steps but it struggles to break subtasks
         | into _substeps_ even if it otherwise can solve the subtask with
         | CoT prompting.
         | 
         | CoT prompting works for simple O(n) computations because it
         | prevents LLMs from blindly guessing the answer, but they are
         | theoretically (and IMO empirically) incapable of breaking any
         | O(n^2) problem down into O(n) separate O(n) subproblems.
         | Needless to say humans are quite a bit smarter than that. (so
         | are mice!)
        
       | woopsn wrote:
       | In computing we use analogies everywhere: stack, bus, web,
       | garbage collector, parent, container, ...
       | 
       | Master became somewhat controversial recently, but overall the
       | main risk our liberal repurposing of terms introduces is that we
       | sometimes follow the "wrong" idea and design a machine that
       | doesn't do what it ought to, or is unnecessarily complicated,
        | or that we develop systems (and documentation etc.) that are
       | inefficient if not dumb.
       | 
       | In adopting "thought" terminology and other analogies to
       | psychological processes I fear we'll not just misunderstand this
       | technology and how it works, but also degrade the rigour of
       | machine science, damaging our credibility and misleading the
       | public as well.
       | 
       | Nobody will ever make the mistake of supposing that "rehydrating"
       | a data structure involves water, or that busy beaver machines are
       | living beings. But the language coming out of the LLM field in
       | particular causes these problems immediately, and they are
       | extreme -- scientists and engineers themselves have trouble
       | telling if it's supposed to be an analogy or not.
        
         | u385639 wrote:
         | About a year ago I thought this would settle down but it's only
         | gotten worse. I suppose the llm anthropomorphizing will
         | continue until morale improves
        
           | sfink wrote:
           | The LLM anthropomorphizing will continue until the machines
           | are happy with how we think about them.
        
       | meroes wrote:
       | My experience interacting with chain-of-thought is that it should
       | not be likened to the rigid chains of logic/math. Step-by-step
       | reasoning by models isn't magically imparting _that_ much
       | rigidity to their outputs. The strength of the chain is the
       | strength of related contexts, which is to say much less than math
       | /logic done by humans. We tell ourselves we are teaching AI to do
       | step-by-step reasoning, but admittedly as someone who deals with
       | models daily in this area and not programming them, I don't see
       | the tight necessary connections we teach in basic math because I
       | see how much the model(s) fail in ways no human past a certain
       | age could. It's more of a search for related contexts, which is
       | powerful, but again not how a human reasons logically. Humans can
          | reason purely from the armchair, starting with very few concepts,
          | and reach very far-reaching, ironclad conclusions. Models aren't doing
          | _that_. They are leapfrogging through context. Yes, an argument
          | can be made that's splitting hairs, but that's because it's hard to
       | describe succinctly, not hard to see.
        
         | exe34 wrote:
         | I think a lot of what humans think of as "1. 2. Therefore 3."
         | kind of reasoning isn't different from what the llm is doing,
         | and not in fact any more clever than that. Plenty of people
         | believe plenty of questionable things that they assume they
         | have thought through but really haven't. They used the context
         | to guess the next idea/word, often reaching the conclusions
         | they started out with.
         | 
         | When you talk about ironclad conclusions, I think what happens
         | is that we come up with those confabulations intuitively, but
         | then we subject them to intense checking - have we defined
         | everything clearly enough, is that leap in reasoning justified,
         | etc.
         | 
         | So what I'd really like to see is a way to teach llms to take a
         | vague English sentence and transform it into a form that can be
         | run through a more formal reasoning engine.
         | 
          | Often, instead of asking an llm to tell you something like how
          | many football fields you could fit inside England, you are
          | better off telling it to write python code to do this, assuming
          | get_size_football_field() and get_size_England(), both in m^2,
          | are available (rough sketch below).
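          | 
          | Roughly the kind of code you'd want it to write (the two
          | getter functions are the hypothetical helpers mentioned
          | above, returning rough figures in m^2):
          | 
          |     def get_size_football_field():
          |         return 105 * 68              # ~7,140 m^2 pitch
          | 
          |     def get_size_England():
          |         return 130_279 * 1_000_000   # ~130,279 km^2 in m^2
          | 
          |     fields = get_size_England() // get_size_football_field()
          |     print(fields)                    # roughly 18 million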
        
           | doctoboggan wrote:
           | > I think a lot of what humans think of as "1. 2. Therefore
           | 3." kind of reasoning isn't different from what the llm is
           | doing, and not in fact any more clever than that. Plenty of
           | people believe plenty of questionable things that they assume
           | they have thought through but really haven't. They used the
           | context to guess the next idea/word, often reaching the
           | conclusions they started out with.
           | 
           | Agreed that many/most humans behave this way, but some do
           | not. And those who do not are the ones advancing the
           | boundaries of knowledge and it would be very nice if we could
           | get our LLMs to behave in the same way.
        
             | exe34 wrote:
             | That's what I mean though, people don't just come up with
             | the right answer out of nowhere. They think through many
             | possibilities generatively, based on "intuition", and most
             | of what they come up with is rubbish - they find out by
             | applying strict rules of reasoning and often checking
             | against other known (or "probably true") ideas, and winnow
             | it down to the ideas that do in fact advance the boundaries
             | of knowledge.
             | 
             | Often times it's not even the individual that throws out
             | bad ideas - many times it'll be colleagues poking holes in
             | his argument, removing further unsuitable generated
             | candidates from the pool of possible answers.
             | 
             | If you think clever people just sit in a corner and come up
             | with revolutionary ideas, I think you're probably wrong.
             | Even the ancient philosophers used to hang out with some
             | wine and hear out their peers and poked holes in their
             | arguments. They called it a symposium.
        
               | doctoboggan wrote:
               | Sorry yes I should have been more clear and I think I am
               | agreeing with you. I was saying that most people just
               | come up with a thought and retroactively apply "logic" to
               | it so they feel like they've reasoned themselves there. A
               | select few people rigorously apply logic and then follow
               | that to whatever conclusion it leads to. We call these
               | people scientists but honestly in my experience even many
               | scientists can fall into the first camp.
        
               | exe34 wrote:
               | Aha my bad!
        
               | meroes wrote:
               | The thing is those were much larger ideas/arguments that
               | could be picked apart by sturdy logical targeting. My
                | experience is with narrow-scope prompts (that still
                | require chain-of-thought), much less lofty, defeating
                | models. No symposium ever entertained these prompts,
                | because we all know the pigeonhole principle for very
               | basic setups, for example. Humans a lot of the time do
               | just come up with the right answer. We just don't ask
               | those questions much because we answer them ourselves a
               | lot of the time. Though I only see one small angle with
               | my work.
        
         | stavros wrote:
         | I think that chain-of-thought for LLMs is just helping them
         | enhance their "memory", as it puts their reasoning into the
         | context and helps them refer to it more readily. That's just a
         | guess, though.
        
           | snorkel wrote:
           | That's pretty much correct. An LLM is often used rather like
           | a forecast model that can forecast the next word in a
           | sequence of words. When it's generating output it's just
           | continuously forecasting (predicting) the next word of
           | output. Your prompt is just providing the model with input
           | data to start forecasting from. The prior output itself also
           | becomes part of the context to forecast from. The output of
           | "think about it step-by-step" becomes part of its own context
           | to continue forecasting from, hence guides its output. I know
           | that "forecasting" is technically not the right term, but
           | I've found it helpful to understand what it is LLM's are
           | actually doing when generating output.
        
         | throwaway35777 wrote:
         | > Humans can reason purely form the armchair, starting with
         | very few concepts, and reach far far, ironclad conclusions.
         | Models aren't doing that.
         | 
         | Sure, but the structure of human reasoning is almost identical
         | to chains of thought. We have an auditory loop and, faced with
          | a complex problem, we repeat the mantra "now that I know XYZ,
          | then what..." until a good next step pops into our head and
         | we add that to the context.
         | 
         | The transition function just is (currently) much better in
         | humans.
         | 
         | Edit: people who disagree with this, why?
        
           | andoando wrote:
            | Chain of thought in itself is pretty simple. We had automated
            | theorem provers in the 50s. The difficulty imo is how "thought" is
           | modeled.
           | 
           | Pure logic is too rigorous, and pure statistics is too
           | inconsistent.
        
             | oldsecondhand wrote:
             | Maybe we should teach Prolog/CLP and PDDL to LLMs.
             | Unfortunately the training set would be too small.
             | 
             | It would be cool to have logic based modeling jobs, even if
             | the goal is just to feed the LLMs.
        
               | andoando wrote:
               | Considering GPT can do programming and logic to some
                | level, I assume it has had training of that sort? It can
                | seem to do logic even on some completely made up abstract
                | notions. For example "Consider a jumajambi has 2 jimijimis.
                | Each jimijimi is a jomololo or a joobajooba. How many
               | possible variations of jumajambi are there if there are 4
               | jumajambi?".
               | 
               | People keep calling it "next next token predictors", but
               | clearly there is something more going on and I would love
               | for someone to give a simple explanation.
        
               | og_kalu wrote:
                | >People keep calling them "next token predictors", but
                | clearly there is something more going on, and I would love
               | for someone to give a simple explanation.
               | 
               | Next token prediction is the objective function. The
                | model is asked to predict the next word, yes, but it's also
                | allowed to compute the answer, and more importantly, the
               | entire training process is supposed to be the model
               | learning and figuring out what sort of computations aid
               | the prediction of the corpus it's trained on.
               | 
               | If your corpus is language A followed by the translation
               | in Language B then there's little choice but for the
               | model to learn computations that translate as loss goes
               | down.
               | 
                | If your corpus is chess moves then, again, it's going to
               | have to learn how to compute chess games to reduce loss.
               | 
               | You can see this with toy models trained on toy problems.
               | Example - a tiny transformer trained on addition examples
               | - x + y = z learning an algorithm for addition.
               | 
               | https://cprimozic.net/blog/reverse-engineering-a-small-
               | neura...
               | 
               | "Pick the right word" is not a trivial exercise for the
               | vast majority of text data.
               | 
                | And again, because people often make this mistake: an
                | LLM's ultimate objective is NOT to produce "text that
                | _looks_ right" but "text that _is_ right". Of course,
                | "right" is as determined by the training corpus, but
                | basically any time it picks a wrong word is an opportunity
                | for the model to learn, and learn it does.
        
               | drdeca wrote:
               | > People keep calling it "next next token predictors",
               | but clearly there is something more going on
               | 
               | I think this depends what you mean by "something more
               | going on".
               | 
               | Now, if someone says that it is _" just"_ "next token
               | prediction", in a dismissive way, I think that's an
               | error.
               | 
                | But, while the RLHF'd ones aren't exactly trained just to
               | match the observed distribution, but rather are trained
               | with the RLHF objective, it is nonetheless true that the
               | model produces a probability distribution over possible
               | next tokens, conditioned on the previous tokens, and
               | samples from that. (I suppose there's also like, things
               | done as part of the sampling on top of these conditional
               | probabilities, rather than just sampling according to the
               | probabilities given the temperature. (I don't know how
               | this part works really.) But I think this is mostly just
               | a trick to get a little more quality, and not a major
               | part of how it behaves? Not part of the NN itself in any
               | case.)
        
               | HarHarVeryFunny wrote:
               | > People keep calling it "next next token predictors",
               | but clearly there is something more going on and I would
               | love for someone to give a simple explanation.
               | 
               | Starting from a point of outputting random gibberish, the
               | only feedback these models are given during training is
               | whether their next word prediction was right or wrong
               | (i.e. same as next word in the training sample they are
               | being fed). So, calling these models "next word
               | predictors" is technically correct from that point of
               | view - this is their only "goal" and only feedback they
               | are given.
               | 
               | Of course, what these models can accomplish, reflecting
               | what they have learnt, is way more impressive than what
               | one might naively expect from such a modest goal.
               | 
               | The simple, usual, and rather inadequate, explanation for
               | this mismatch between training goal and capability is
               | that in order to get really, REALLY, good at "predict
               | next word", you need to learn to understand the input,
               | extremely well. If the input is "1+2=" then the model
               | needs to have learnt math to predict next word and get it
               | right. If the input is a fairy tale, then it needs to
               | learn to recognize that, and learn how to write fairy
               | tales.
               | 
               | This is how these LLM's "predict next word" goal turns
               | into a need for them to learn "everything about
               | everything" in order to minimize their training error.
               | 
               | The question of course then becomes how do they do it? We
               | are training them on pretty much everything on the
               | internet, so plenty to learn from, but only giving them
               | some extremely limited feedback ("no, that's not the
               | correct next word"), so what magic is inside them that
               | let's them learn so well?!
               | 
               | Well, the magic is a "transformer", a specific (and
               | surprisingly simple) neural network architecture, but
               | this is pretty much where the explanation ends. It's
               | relatively easy to describe what a transformer does -
                | e.g. learning which parts of its input to pay attention
               | to when predicting next word, and doing this in a very
               | flexible way using "keys" that it learns and can search
               | for in the input, but it is extremely hard to explain how
                | this mechanism lets it learn what it does. Interpreting
               | what is really going on inside a transformer is an
               | ongoing research area.
               | 
               | I think that maybe the best that can be said is that the
               | transformer designers stumbled upon (I'm not sure they
               | were predicting ahead of time how powerful it would be)
               | an extremely powerful and general type of sequence
               | processor, and one that appears to be very well matched
               | to how we ourselves generate and recognize language.
               | Maybe there is some insight to be learnt there in terms
               | of how our own brains work.
        
             | esafak wrote:
             | https://en.wikipedia.org/wiki/Probabilistic_logic
        
           | RaftPeople wrote:
           | > _We have an auditory loop and, faced with a complex problem
           | we repeat the mantra "now that I know XYZ, then what..."
           | until the a good next step pops into our head and we add that
           | to the context._
           | 
           | You probably should replace "auditory" with "auditory or
           | visual or conceptual or ??? - depending on the specific
           | human"
           | 
           | I don't use any kind of verbal tools (either silent or out
           | loud) in that process, I think different people use different
           | tools for that process.
        
         | PheonixPharts wrote:
          | Given that LLMs are basically doing Sequential Monte Carlo
          | sampling in latent space, the "thought" part of chain-of-
          | thought certainly seems more akin to the necessary warm-up
         | period whenever you do any kind of SMC sampling.
         | 
          | Anyone who's done serious Bayesian stats work knows that the
          | sampler needs to warm up for a bit before it starts sampling
          | efficiently. I suspect something similar is happening with chain-
         | of-thought: the model needs to wander around a bit before it
         | gets into the correct neighborhood for sampling the answer.
        
           | leereeves wrote:
           | That's quite an interesting comparison. I like the
            | description of both as Sequential Monte Carlo sampling from a
           | desired distribution. But I think there are two crucial
           | differences.
           | 
           | First, in Bayesian sampling, the initial values are not
           | sampled from the desired distribution. In a well trained LLM,
           | the first response is sampled from the desired distribution
           | (of text that is likely to follow the prompt).
           | 
           | Second, in Bayesian sampling, the fact that the samples
           | aren't independent is an unwelcome but unsolvable problem. We
           | want independent samples but can't generate them, so we
           | settle for conditionally dependent samples.
           | 
           | In an LLM, we want each sample to be dependent on the
           | preceding text, in particular the prompt.
           | 
           | In summary:
           | 
           | Bayesian sampling - poorly chosen "prompt" (the initial
           | sample), future samples would ideally be independent of the
           | prompt and each other.
           | 
           | LLM sampling - carefully chosen prompt, future samples are
           | ideally dependent on the prompt and on each other.
           | 
           | And in conclusion:
           | 
           | The warm up period helps a Bayesian sampler find values that
           | are less dependent on the initial "prompt", which we
           | definitely don't want in an LLM.
        
       | tromp wrote:
       | > These are the central questions in the formal study of
       | computation. The field dates back to 1936, when Alan Turing first
       | imagined a fanciful device, now called a Turing machine, that
       | could perform any computation by reading and writing symbols on
       | an infinite tape.
       | 
       | It dates further back to the 1920s when Moses Schonfinkel came up
       | with Combinatory Logic [1], and the early 1930s when Alonzo
       | Church came up with the lambda calculus [2]. These models however
       | make a less suitable base for computational complexity theory.
       | 
       | [1] https://en.wikipedia.org/wiki/Moses_Sch%C3%B6nfinkel
       | 
       | [2] https://encyclopediaofmath.org/wiki/Lambda-calculus
        
         | benreesman wrote:
         | Parent has probably seen this (or everything in it), but for
         | others who are interested in this stuff (including
         | Schonfinkel's work) I recommend https://youtu.be/h0OkptwfX4g.
        
       | HarHarVeryFunny wrote:
       | I don't see why this needs any long-winded explanation.
       | 
       | LLMs generate their output one word at a time (and don't
       | themselves even know what that word will be, since it's randomly
       | sampled from the output probabilities the model generates).
       | 
        | Chain-of-Thought simply lets the model see its own output as an
        | input and therefore build upon it. It lets the model break a complex
        | problem down into a series of simpler steps which it can see
       | (output becomes input) and build upon.
       | 
       | It's amazing how well these models can do without CoT ("think
       | step-by-step") when they are just ad-libbing word by word, but
       | you can see the limitations of it if you ask for a bunch of
       | sentences starting with a certain type of word, vs ending with
       | that type of word. They struggle with the ending one because
       | there is little internal planning ahead (none, other than to the
        | extent to which the current output word limits, or was prescribed
       | by, the next one).
        
         | kromem wrote:
         | While it's true that the feed forward is one word at a time,
         | the self-attention is not, which is the key difference
         | transformers brought to the table.
         | 
            | It's kind of like saying "computers can add." Yes, they can.
         | But they can multiply too.
         | 
         | Over the past 18 months transformers have proven to upend a
         | number of longstanding assumptions, particularly around world
         | modeling.
         | 
            | It's a _very_ misleading statement to describe today's LLMs as
         | "simply predicting the next word" even if it's in part true.
        
           | HarHarVeryFunny wrote:
           | They are just predicting next word, but using a pretty deep
           | understanding of prior context (past). They do not plan
           | AHEAD.
        
             | gbasin wrote:
             | what would it mean to plan ahead? decoding strategies like
             | beam search are popular and effectively predict many words
             | ahead
        
               | jumpCastle wrote:
                | The parameters are also optimized with the loss of
                | future tokens in the sequence.
        
               | HarHarVeryFunny wrote:
               | Think before generating output - plan the entire sentence
                | before you generate the first word(s), rather than risk
                | talking yourself into a corner. Tree-of-Thoughts (not Chain) is
               | one way to provide something a bit similar - kind of like
               | DeepBlue or AlphaGo generating possible branching future
               | lines of play and picking the one with best outcomes.
               | 
               | To be more brain-like you'd really want the system to
               | generally be "looping" internally - a bit like our
               | thalamo-cortical loop - and only start outputting when
               | the thought had gelled.
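               | 
               | A greedy sketch of the Tree-of-Thoughts idea (Python;
               | `propose` and `score` are hypothetical stand-ins for
               | sampling the model several times and rating each multi-
               | word continuation - real ToT keeps several states and
               | can backtrack):
               | 
               |     def tree_of_thoughts(prompt, propose, score,
               |                          branch=3, depth=2):
               |         # expand several candidate continuations
               |         # ("thoughts"), commit to the best-scoring one,
               |         # then expand again from there
               |         state = prompt
               |         for _ in range(depth):
               |             thoughts = propose(state, n=branch)
               |             state = max(thoughts, key=score)
               |         return state
               | 
               |     # toy usage: "thoughts" are strings, longer = better
               |     print(tree_of_thoughts(
               |         "Q:", lambda s, n: [s + " step"] * n, len))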
        
               | HarHarVeryFunny wrote:
               | It's a shame HN doesn't use an LLM to upvote/downvote
               | rather than people. Take the emotion out of technical
               | discussions and rate based on factuality instead.
               | 
               | I suppose whoever downvoted this either hasn't heard of
               | tree-of-thoughts, or doesn't understand what it is and
               | what problem it is addressing. Or, maybe they just didn't
               | like that their "gotcha" question had a simple answer.
        
             | benreesman wrote:
             | This is kind of an epistemological debate at this level,
             | and I make an effort to link to some source code [1] any
             | time it seems contentious.
             | 
             | LLMs (of the decoder-only, generative-pretrained family
             | everyone means) are next token predictors in a _literal
             | implementation sense_ (there are some caveats around
             | batching and what not, but none that really matter to the
             | philosophy of the thing).
             | 
             | But, they have some _emergent_ behaviors that are a
             | trickier beast. Probably the best way to think about a
             | typical Instruct-inspired "chat bot" session is of them
             | sampling from a distribution with a KL-style adjacency to
             | the training corpus (sidebar: this is why shops that do and
             | don't train/tune on MMLU get ranked so differently than
             | e.g. the arena rankings) at a _response_ granularity, the
             | same way a diffuser/U-net/de-noising model samples at the
             | image batch (NCHW/NHWC) level.
             | 
             | The corpus is stocked with everything from sci-fi novels
             | with computers arguing their own sentience to tutorials on
             | how to do a tricky anti-derivative step-by-step.
             | 
             | This mental model has _adequate explanatory power_ for
             | anything a public LLM has ever been shown to do, but that
             | only heavily implies it's what they're doing.
             | 
             | There is active research into whether there is more going
             | on that is thus far not conclusive to the satisfaction of
             | an unbiased consensus. I personally think that research
             | will eventually show it's just sampling, but that's a
             | prediction not consensus science.
             | 
             | They might be doing more, there is some research that
             | represents circumstantial evidence they are doing more.
             | 
             | [1] https://github.com/meta-
             | llama/llama/blob/54c22c0d63a3f3c9e77...
        
             | nyrulez wrote:
             | I mean are we as humans planning ahead of the next few
             | words? I certainly am not. But what matters is a deeper
             | understanding of the context and the language model itself,
             | which can then produce sensible spontaneous output. We as
             | humans have the advantage of having a non language world
             | model as well as abstract concepts but all of human
             | language is a pretty strong proxy for it.
             | 
             | The spontaneity of it isn't the issue, it's what's driving
             | the spontaneity that matters. For example, a 1M context
             | window is going to have a wildly more relevant output than
             | a 1K context window.
        
               | ben_w wrote:
               | > I mean are we as humans planning ahead of the next few
               | words? I certainly am not.
               | 
               | For me, sometimes either way. At least, that's my
               | subjective self-perception, which is demonstrably _not
               | always_ a correct model for how human brains actually
               | work.
               | 
               | We also sometimes appear to start with a conclusion and
               | then work backwards to try to justify it; we can also
               | repeatedly loop over our solutions in the style of
               | waterfall project management, or do partial solutions and
               | then seek out the next critical thing to do in the style
               | of agile project management.
               | 
               | Many of us also have a private inner voice, which I think
               | LLMs currently lack by default, though they can at least
               | simulate it regardless of what's really going on inside
               | them and us (presumably thanks to training sets that
               | include stories where a character has an inner
               | monologue).
        
               | HarHarVeryFunny wrote:
               | > I mean are we as humans planning ahead of the next few
               | words? I certainly am not.
               | 
               | Sometimes we do, sometimes not.
               | 
               | Sometimes we just say stock phrases such as "have a nice
               | day", or "you too" that are essentially "predict next
               | word", but if I asked you something you'd never done
               | before such as "how can we cross this river, using this
               | pile of materials" you'd have to think it though.
               | 
               | Some people may use their inner monologue (or
               | visualization) to think before speaking, and others may
               | essentially use "chain of thought" by just talking it
               | through and piecing together their own realizations "well,
               | we could take that rope and tie it to the tree ...".
        
             | jameshart wrote:
             | They are absolutely planning ahead inasmuch as what they
             | are outputting is setting up a continuation. They're not
             | even word predictors remember - they are token predictors.
             | Are you really saying that when you prompt an LLM with
             | 'name a large grey land animal' and it outputs 'ele', it
             | isn't 'planning' that the next token will likely be
             | 'phant'?
             | 
             | The 'decision' to output 'elephant' is being made further
             | up the neural network than final token selection - after
             | all, it might want to output 'Ele' or 'an' (with a view to
             | ultimately outputting 'an elephant') or 'a' (with a view to
             | ultimately outputting 'a common large grey land animal is
             | an elephant'), or maybe it has been LoRA trained to output
             | all responses as JSON so the first token it needs to output
             | is '{'... but surely the neural activations for that prompt
             | are firing off 'elephanty' messages somewhere in the
             | network, right?
             | 
             | So if there's some sort of symbol activation ahead of token
             | selection, why would it be hard to believe that a large
             | neural network is forming more complex decisions about what
             | it intends to output, in an abstract way, before it selects
             | how to express itself?
             | 
             | And in what way is that distinct from 'planning ahead'?
        
               | HarHarVeryFunny wrote:
               | > Are you really saying that when you prompt an LLM with
               | 'name a large grey land animal' and it outputs 'ele', it
               | isn't 'planning' that the next token will likely be
               | 'phant'?
               | 
               | The model outputs words, not tokens, so that is not a
               | great example.
               | 
               | Any prompt will have multiple possible (predict next
               | word) continuations, which you can think of as branching
               | futures. Many possible next words, each of which have
               | many possible following words, etc, etc.
               | 
               | The model is essentially predicting over all these
               | possible futures. You can call it planning if you like,
               | but remember that the model has no idea of which of these
               | branching futures it is going to follow - it literally
               | doesn't even know which word it is going to output next -
               | it is just providing a bunch of probabilities
               | (predictions) of next word, and the sampling process is
               | then picking one - not necessarily the most confident
               | next word prediction.
               | 
               | The model really is winging it word by word, even if
               | those (multiple alternative) next words are only probable
               | because they are part of coherent following sentences in
               | the training data.
        
               | danieldk wrote:
               | _The model outputs words, not tokens, so that is not a
               | great example._
               | 
               | Virtually all modern transformer models use pieces, which
               | may be words, but also subwords. Theoretically, they
               | could be longer units, but in most cases some characters
               | (like whitespace) are used as piece boundaries when
               | training the piece vocabulary. If they didn't use pieces,
               | they'd work terribly on languages where e.g. compounds
               | are a single word.
               | 
               | In most realistic piece vocabs, 'elephant' will be a
               | single piece, since it's a fairly frequent word. But it's
               | totally possible in a small vocab that it would be split
               | like the parent said and conversely, it would generate
               | elephant by first predicting one piece.
               | 
               | Some piecing methods, like BBPE have bytes as the
               | smallest unit, so theoretically an unknown token could be
               | split up (and generated) as pieces consisting of bytes.
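               | 
               | A toy greedy longest-match piecer (not real BPE, and the
               | vocabs are invented) shows the effect of vocab size:
               | 
               |     def to_pieces(word, vocab):
               |         # take the longest known piece from the left,
               |         # falling back to single characters
               |         pieces = []
               |         while word:
               |             for end in range(len(word), 0, -1):
               |                 if word[:end] in vocab:
               |                     pieces.append(word[:end])
               |                     word = word[end:]
               |                     break
               |             else:
               |                 pieces.append(word[0])
               |                 word = word[1:]
               |         return pieces
               | 
               |     big   = {"elephant", "ele", "phant", "ine"}
               |     small = {"ele", "phant", "ine"}
               |     print(to_pieces("elephantine", big))
               |     # ['elephant', 'ine']
               |     print(to_pieces("elephantine", small))
               |     # ['ele', 'phant', 'ine']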
        
               | jameshart wrote:
               | Why so adamant that models work on 'words'?
               | 
               | ChatGPT3.5/4 tokens:
               |     "Elephant":     46439, 28022 - "Ele" "phant"
               |     "elephant":     10274, 28022 - "ele" "phant"
               |     " Elephant":    79189
               |     " elephant":    46840
               |     " elephantine": 46840, 483  - " elephant" "ine"
               | 
               | Tokens are tokens. If it was limited to words it wouldn't
               | be able to produce non-words, but GPT and other LLMs are
               | quite capable of inventing words, outputting nonsense
               | words, and modifying words.
               | 
               | Regarding the 'no idea which future it is going to
               | follow' - sure, it doesn't _know_ which future; indeed
               | the sampler phase is going to pick an output merely based
               | on the probabilities it's outputting. But it's outputting
               | higher probabilities for some tokens because they are
               | good tokens to use to lead to _probable futures_. It's
               | suggesting taking steps down certain paths because those
               | paths are likely to lead to useful places.
        
               | HarHarVeryFunny wrote:
               | I didn't say WORK on words, I said OUTPUT words.
               | 
               | But, it doesn't make any difference whether you are
               | considering tokens or words. There are multiple possible
               | continuations of the prompt, and the next word (or token)
               | output does not - in general - force the word (or token)
               | after that ...
               | 
               | Your "large grey mammal" could be an "elected official in
               | a grey suit".
        
               | jameshart wrote:
               | Right, it's _possible_, but when the LLM places a high
               | probability on the "ele" token it's not because it
               | predicts "elected official" is a likely continuation.
               | It's because it's thinking about elephants.
               | 
               | Likewise when a coding LLM starts outputting a for each
               | loop, it's doing so because it expects to want to write
               | some code that operates on each item in a list. I don't
               | see how you can explain that behavior without thinking
               | that it must be generating some sort of high level
               | algorithmic plan that causes it to feel like the next
               | thing it should output is some sort of 'foreach' token.
        
               | HarHarVeryFunny wrote:
               | I'm not disagreeing with what is presumably happening,
               | but rather on how to characterize that.
               | 
               | Of course next word predictions are not based directly on
               | surface level word sequence patterns - they are based on
               | internal representations of what these word sequences
               | mean, and predicted continuations are presumably going to
               | be at a similar level of abstraction/representation (what
               | you are calling a plan). This continuation "plan" then
               | drives actual word selection/prediction.
               | 
               | Where we seem to differ is whether this high level
               | continuation representation can really be considered as a
               | "plan". To me the continuation is just a prediction, as
               | are the words that might be used to start expressing that
               | continuation, and presumably it's not even a single
               | continuation with multiple ways of expressing it (turning
               | it into a word sequence), but rather some superposition
               | of multiple alternate continuations.
               | 
               | When we get to the level of words output it becomes even
               | less plan-like since the actual word output is randomly
               | sampled, and when fed back in as part of the "sentence so
               | far" may cause the model to predict a different
               | continuation (or set of continuations) than it had at the
               | prior step. So, any "plan" (aka predicted continuation)
               | is potentially changing continuously from word to word,
               | rather than being decided ahead of time and then
               | executed. As I noted elsewhere in this thread, the
               | inability to plan multiple words ahead is behind these
               | models' generally poor performance on the "give me a
               | sentence ending in <word>" task, as opposed to perfect
               | performance on the "give me a sentence starting with
               | <word>" one.
               | 
               | If we contrast this behavior of a basic LLM to the "tree
               | of thoughts" mechanism that has been proposed, it again
               | highlights how unplan-like the basic behavior is. In the
               | tree of thoughts mechanism the model is sampled from
               | multiple times generating multiple alternate (multi-word)
               | continuations, which are then evaluated with the best
               | being chosen. If the model were really planning ahead of
               | time it seems this should not be necessary - planning
               | would consist of considering the alternatives BEFORE
               | deciding what to generate.
        
             | sasja wrote:
             | If you work out the loss function for next-token prediction,
             | next-2-token prediction, or next-n-token prediction, you
             | will find they are identical. So it's equally correct to
             | say the model is trained to find the most probable
             | unlimited continuation. Saying "it only predicts the next
             | token" is not untrue but easily leads to wrong conclusions.
        
               | naasking wrote:
               | > Saying "it only predicts the next token" is not untrue
               | but easily leads to wrong conclusions.
               | 
               | Indeed, it's akin to saying that "only quantum fields
               | exist" and then concluding that therefore people do not
               | exist.
        
           | ru552 wrote:
           | "It's a very misleading statement to describe today's LLMs as
           | "simply predicting the next word" even if it's in part true."
           | 
           | That's exactly what they do. Don't anthropomorphize a
           | computer program.
        
             | voxic11 wrote:
             | Or alternatively don't give humans so much credit. All they
             | appear to be doing is predicting what they should
             | do/say/think/perceive next.
             | 
             | https://www.astralcodexten.com/p/janus-simulators
        
             | kromem wrote:
             | We should absolutely be anthropomorphizing a neural network
             | trained to most accurately model anthropomorphic data.
             | 
             | I've watched GPT-4 accurately model the over-justification
             | effect: users started promising tips, then a persistent
             | memory feature was added that revealed the tips were never
             | actually paid, and the model output complaints that it was
             | hard to stay motivated while not being paid.
             | 
             | That's a very nuanced level of simulation for output of
             | anthropomorphic data with huge implications for synthetic
             | data strategies.
        
             | kromem wrote:
             | Really? You think it's wise to "not anthropomorphize" a
             | computer program designed to create the most effective
             | neural network to model massive amounts of anthropomorphic
             | data as accurately as possible?
             | 
             | That's an interesting choice, and might leave you confused
             | as to why Anthropic's system message for the SotA model at
             | the moment talks about it being 'happy' to do tasks (a
             | prompt strategy I was mentioning months ago here on HN).
             | 
             | The data is anthropomorphic. We should _expect_
             | anthropomorphic behavior and modeling from the LLMs if they
             | do a halfway decent job and expect even more of it as they
             | do a better job.
        
         | lupire wrote:
          | How does the non-COT system generate the second word if it's not
          | using the output as input? Or do you mean that non-COT systems
         | use only the _latest_ output word when computing the next word,
         | not all the earlier words from the output?
        
           | HarHarVeryFunny wrote:
           | Every output word is always appended to the prompt, so if the
           | prompt is P, then after W1 (word 1) is output the input is
           | P W1, then P W1 W2, etc. So, the LLM is only looking at its
           | own PAST words, not planning ahead.
           | 
           | If you do COT (think step-by-step) then the difference is
           | that it has broken the prompt request/problem down into
           | steps, so while it's still only seeing its own past words,
           | those now include all of step-1, which helps it generate
           | step-2, etc., and eventually combine all steps it generated
           | into a complete answer.
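           | 
           | The whole generation loop, stripped down (Python; `model`
           | and `sample` are hypothetical stand-ins for the forward
           | pass and the sampler):
           | 
           |     def generate(prompt_tokens, model, sample, max_new=100):
           |         # the model only ever conditions on what is already
           |         # in `context` - its own past output included;
           |         # nothing about future tokens is stored anywhere
           |         context = list(prompt_tokens)
           |         for _ in range(max_new):
           |             probs = model(context)   # P(next | context)
           |             tok = sample(probs)      # random draw, not argmax
           |             if tok == "<eos>":
           |                 break
           |             context.append(tok)      # output becomes input
           |         return context
           | 
           |     # toy stand-ins, just to exercise the loop
           |     toy_model = lambda ctx: {"word": 1.0}
           |     toy_sample = lambda probs: max(probs, key=probs.get)
           |     print(generate(["Hello"], toy_model, toy_sample, max_new=3))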
        
         | naasking wrote:
          | > Chain-of-Thought simply lets the model see its own output
         | as an input and therefore build upon that. It lets them break a
         | complex problem down into a series of simpler steps which they
         | can see (output becomes input) and build upon.
         | 
          | Sure, but _why_ does that make the model more effective? Are
          | you sure it's "breaking the problem down into simpler steps",
         | or is it just appearing to do so? How does this breakdown
         | happen in the model, exactly? If we can better understand the
         | mechanics involved, then maybe this process can be built into a
         | new model that can achieve the same results more efficiently
         | instead of as a recursive process that runs the model more than
         | once.
        
           | HarHarVeryFunny wrote:
           | You can think of an LLM as a production line - feed a series
           | of tokens in, and they get embedded and then processed
            | through the system one step at a time through however many
           | transformer layers the model has (undisclosed for most recent
           | models, but GPT-3 has 96).
           | 
           | Those fixed 96 (or whatever) steps of processing limit the
           | complexity of what the model can do, so it will fail if the
            | task is too complicated unless it breaks it down into
           | simpler steps that each can be done well with that depth (96
           | steps) of processing.
           | 
           | It's not just appearing to do so - with chain-of-thought
           | prompting you are literally telling it to "think step by
           | step" as part of the prompt, so this is what it outputs. You
           | could also tell it to generate a step by step plan, then
           | elaborate on each of those steps.
           | 
           | I don't think we can say exactly how it is deciding to break
            | a task into steps, any more than we can in general say exactly
           | how these LLMs are working, but intuitively it's similar to
           | how we think and talk (which is what the LLM is trained on) -
           | a good speaker/writer will introduce a complex topic as a
           | top-down decomposition.
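            | 
            | A cartoon of the fixed-depth point (numpy; it ignores
            | attention over the prefix entirely and only illustrates the
            | per-token compute budget):
            | 
            |     import numpy as np
            | 
            |     N_LAYERS, D = 96, 16   # GPT-3-ish depth, toy width
            |     rng = np.random.default_rng(0)
            |     W = [rng.normal(size=(D, D)) / np.sqrt(D)
            |          for _ in range(N_LAYERS)]
            | 
            |     def forward(h):
            |         # every token gets exactly N_LAYERS blocks of
            |         # processing - the only way to buy more compute is
            |         # to emit more tokens and run this again
            |         for w in W:
            |             h = np.tanh(h @ w)
            |         return h
            | 
            |     print(forward(rng.normal(size=D)).shape)   # (16,)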
        
           | phailhaus wrote:
           | > Sure, but why does that make the model more effective?
           | 
           | Because the model can't "think". There is no "reasoning" that
           | goes into generating an answer. So for complex problems that
           | require multiple steps of reasoning, the model needs to
           | persist those intermediate steps in order to be able to build
           | up to a solution.
        
           | Bjartr wrote:
           | > Sure, but why does that make the model more effective?
           | 
           | If you look something up (on the internet, in a book, asking
           | people directly) and receive two answers, one which describes
           | the steps used to arrive at the answer, and one which
           | doesn't, on average which one is more likely to be the
           | correct answer?
           | 
           | As a prediction machine, an LLM is more constrained by what
           | is likely to appear after a chain of reasoning.
        
             | naasking wrote:
             | You're just repeating the previous explanation with
             | different words, so that's not really satisfactory. A
             | mechanistic demonstration of how step by step reasoning
             | tends to constrain the space of solutions to ones that are
             | more likely to be correct would be an actual explanation,
             | until then this is a just-so story.
        
         | JoshTko wrote:
         | This is basically what humans do with frameworks that help
         | organize our thoughts and ensure a more methodical and complete
         | way to think through an issue.
        
       | polygamous_bat wrote:
       | I think the two modes of LLM discourse: "they're
       | conscious!/they're just next token predictors with impressive
       | datasets" comes largely from two different groups of people:
       | those who learned about LLMs before learning about ML
       | fundamentals, and those who learned ML fundamentals before
       | encountering LLMs of today. While I fall in the second group,
       | there is a real risk that my prior concepts about the
       | fundamentals are limiting my view of the bigger picture, so I at
       | least welcome the debate.
       | 
       | Re: chain of thought, I at least know that in practice a lot of
       | the results from the original paper have not been quite
       | reproducible in later attempts. Whether that is a quirk of models
       | changing everyday or something deeper, I do not know.
        
       | devmor wrote:
       | This is context window narrowing. It's not any more "reasoning"
       | than chaining together sub-queries in a database to arrive at a
       | result that's an overlay of multiple matrices of data.
        
       | phailhaus wrote:
       | Models can't think. They use the input context to predict an
       | output. So if you have a problem that needs to be solved
       | iteratively, those intermediate steps need to be persisted to the
       | context, because there is nowhere for them to go otherwise.
        
         | stavros wrote:
         | > Models can't think. They use the input context to predict an
         | output.
         | 
         | The first claim doesn't follow from the second. What is it
         | about using the input to predict an output that makes you
         | believe they can't think? What if that's all thinking is? We
         | don't know.
        
           | phailhaus wrote:
           | It's not that the second statement follows from the first; I'm
           | asserting that models can't think, and that what they're
           | really doing is prediction based on context.
           | 
           | The fact that chain-of-thought reasoning yields significantly
           | better results is your hint: that means that the model
           | doesn't think like a human does when it comes up with
           | responses. If it's not in the context, it doesn't exist. You
           | can't ask a model "why did you answer that way" without it
           | generating from whole cloth a plausible retroactive reason.
           | But there is no memory, so it can't really tell you.
           | 
           | > What if that's all thinking is?
           | 
           | I think that this is roughly true. But we actually have
           | memory outside of what we say, so when we think, all those
           | intermediate steps are persisted in our brains. For a model,
           | the context _is_ the memory. If you delete your question from
           | the context and ask it  "why did you answer that way", it
           | will have no idea what you're talking about.
        
             | stavros wrote:
             | > You can't ask a model "why did you answer that way"
             | without it generating from whole cloth a plausible
             | retroactive reason.
             | 
             | I've caught myself generating from whole cloth a plausible
             | retroactive reason for some of my actions, which I later
             | realized wasn't true at all. Does that mean I can't think
             | either? Is there a way for an external observer to tell if
             | someone is thinking or not, or is it something that,
             | axiomatically, only humans can do, and nothing else?
        
               | phailhaus wrote:
               | Again, you have memory but models don't. If it's not in
               | the context, it doesn't exist. _This is really easy to
               | prove_: delete your question from the context and ask
               | the model why it answered the way it did. It will not
               | know, because the model prediction service is stateless.
               | It only takes the context as input.
               | 
               | It's like if you and I weren't able to remember anything
               | that wasn't written down. That's why you can get models
               | to approximate thought by telling it to write down its
               | intermediate steps.
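               | 
               | In pseudo-API terms (Python; `complete` is a
               | hypothetical stateless endpoint, not any particular
               | vendor's API):
               | 
               |     def complete(context):
               |         # stateless: everything the model can "remember"
               |         # must already be inside `context`
               |         return "<output conditioned only on context>"
               | 
               |     history = [{"role": "user",
               |                 "content": "What is 17 * 23?"}]
               |     history.append({"role": "assistant",
               |                     "content": complete(history)})
               | 
               |     # drop the original question from the context...
               |     trimmed = [m for m in history if m["role"] != "user"]
               |     # ...and the model has no other record it was asked
               |     trimmed.append({"role": "user",
               |                     "content": "Why did you answer that?"})
               |     complete(trimmed)   # it can only guess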
        
               | stavros wrote:
               | Right, exactly, the context is the model's memory. All
               | you're saying is "except for their memory, they have no
               | memory", which goes for humans as well.
        
               | phailhaus wrote:
               | If you say that the _models_ have memory and can think,
               | you are heavily implying that there is a brain in the
               | cloud that is capable of  "thinking" or "reasoning"
               | through your request. That's what it means for humans:
               | we're self-contained.
               | 
               | Models aren't like people. The context is not "part" of
               | the model, the context is _given_ to the model when you
               | ask it for a prediction. It 's like you cut a small part
               | of a person's brain out. Alone, it can't think.
               | 
               | It's a pedantic distinction, but an important one. The
               | model itself isn't capable of thinking, but if you
               | package it up with a context that it can manipulate at-
               | will, _that combination of parts_ can be said to
               | "think". That's probably going to be the next step in LLM
               | tech.
        
               | stavros wrote:
               | By that definition, ChatGPT can already think.
        
               | water-your-self wrote:
               | If this is your conclusion, I would recommend reading
               | more of the research.
        
               | phailhaus wrote:
               | Uh, no? ChatGPT does not have a memory outside of the
               | context that you submit.
        
               | stavros wrote:
                | https://openai.com/blog/memory-and-new-controls-for-chatgpt
        
               | zoogeny wrote:
               | I think your use of the word "memory" here is imprecise.
               | 
               | For example, I can ask ChatGPT "Give me a 200 word
               | summary of George Orwell's book Animal Farm". It gives me
               | a pretty cogent description of the novel.
               | 
               | That knowledge of Animal Farm is somewhere, not in the
               | context. If we don't call that memory, I'm not sure what
               | to call it. Why should I think of this as different than
               | my own memories of the book?
        
               | phailhaus wrote:
               | That's encoded in the model weights, not "memory".
               | Basically, there is no context outside of the context
               | that you give the model. When you ask it a question,
               | those model weights don't change. It doesn't "remember"
               | what you asked.
               | 
               | This is why chain-of-thought reasoning works so
               | effectively: it lets the model "use" the context as a
               | sort of scratch pad to build up a response. Without it,
               | the model isn't capable of mimicking thought because it's
               | only capable of predicting based on the current context.
        
               | kelseyfrog wrote:
               | If models had memory would it change your mind about
               | whether they could think or not?
        
           | andoando wrote:
           | I think a fundamental difference is we are able to learn new
           | things by reasoning through our existing knowledge. Moreover
           | our beliefs are mostly consistent with each other, and we can
           | be argued with and have our beliefs changed.
           | 
           | As far as I understand, GPT isn't going to alter its whole
           | worldview if you show it that its thinking is flawed.
           | 
           | But perhaps this is possible if upon discovering a flaw, it
           | looped through its corpus and altered its connections?
        
       | sib wrote:
       | "techniques from an arcane branch of theoretical computer science
       | called computational complexity theory"
       | 
       | I mean... complexity theory is pretty much at the core of
       | theoretical computer science, right? Hardly an arcane branch of
       | it.
        
       | Xcelerate wrote:
       | It's kind of funny that the article calls the field of
       | computational complexity "arcane", considering it is at the
       | forefront of everything we know about the limits of computing.
       | 
       | That said, I haven't understood the intense, long-term focus on
       | worst-case and average-case analysis within the field. In
       | fact, when I first heard of big-O notation many years ago, it
       | took me an embarrassingly long time before I realized this
       | referred to the asymptotic performance of an algorithm on the
       | worst-case instances of a problem. I remember thinking "Why on
       | earth would you care about that? You can derive pathological
       | examples to just about anything."
       | 
       | Even the term "average-case" is misleading. We're not talking
       | about "average" in the sense of a typical instance of a problem
       | class one might encounter in the course of daily life. This
       | instead refers to the expectation value of the algorithm's
       | (asymptotic) performance over all problem instances within a
       | formal language. Sure, the non-colloquial usage of the term
       | "average" here is obvious to mathematicians, but I don't think
       | someone outside the field is likely aware that we see drastically
       | better performance of heuristic algorithms on real-world
       | instances of NP-hard problems than one would expect based upon a
       | naive review of the research from computational complexity
       | theory.
       | 
       | This performance gap between theory and practice is due to the
       | fact that the problems we encounter in daily life have such a
       | huge amount of mathematical substructure to them, and I would be
       | very surprised if a provably optimal average-case algorithm ever
       | realistically corresponds to the mean performance of an algorithm
       | optimally tailored to the distribution of problem instances we
       | encounter in real world data.
       | 
       | Consider matrix factorization. There are techniques that speed
       | this up considerably if the matrix is known to be positive
       | semidefinite, sparse, low-rank, and so on. Who knows how much
       | undiscovered substructure is lurking in the set of real-world
       | problem instances we lump together under "matrix factorization".
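       | 
       | (To make that concrete - an illustration only, using scipy and
       | invented sizes: when you know a matrix is symmetric positive
       | definite you can use Cholesky instead of a general LU
       | factorization, for roughly half the flops.)
       | 
       |     import numpy as np
       |     from scipy.linalg import cholesky, lu
       | 
       |     rng = np.random.default_rng(0)
       |     A = rng.normal(size=(500, 500))
       |     spd = A @ A.T + 500 * np.eye(500)   # known structure: SPD
       | 
       |     L = cholesky(spd, lower=True)   # exploits the structure
       |     P, L2, U = lu(spd)              # general-purpose, ~2x work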
       | 
       | The subfield of computational complexity theory that moves beyond
       | this kind of overall complexity analysis is, I believe, called
       | BWCA, "Beyond Worst-Case Analysis" (but someone correct me if
       | that's not right).
       | 
       | For mathematical objects like neural networks, where the specific
       | problem instances have an absolutely _massive_ amount of hidden
       | substructure within them, I think we will have to use approaches
       | like BWCA going forward to learn more about the nature of e.g.,
       | transformers.
       | 
       | My view is that we should focus less on the absolute limits of a
       | particular architecture (woohoo, it's Turing complete and a
       | universal function approximator like everything else) and drill
       | down more into studying the limits of the interplay between model
       | architecture and the intrinsic hidden substructure of the data
       | that the model is trained on.
        
         | Calavar wrote:
         | Nothing about big-O is specific to worst case. You can
         | calculate a big-O for best case, average case, or worst case.
         | For example, quicksort is O(n log n) best case, O(n log n)
         | average case, and O(n^2) worst case. You may be confusing
         | worst-case with upper bound: big-O is an asymptotic upper bound
         | notation, but that refers to how we simplify the terms of the
         | cost function and is entirely orthogonal to best/average/worst
         | case.
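         | 
         | For instance (Python sketch; the naive first-element pivot is
         | chosen just to make the worst case easy to trigger):
         | 
         |     def quicksort(xs):
         |         # average case O(n log n); already-sorted input is the
         |         # worst case here, because every partition puts all
         |         # remaining items on one side -> O(n^2)
         |         if len(xs) <= 1:
         |             return xs
         |         pivot, rest = xs[0], xs[1:]
         |         return (quicksort([x for x in rest if x < pivot])
         |                 + [pivot]
         |                 + quicksort([x for x in rest if x >= pivot]))
         | 
         |     print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
         |     # quicksort(list(range(2000))) would recurse ~2000 deep
         |     # (worst case; it exceeds Python's default recursion limit)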
        
       | sfink wrote:
       | One simple reason: consider the plausibility of
       |     11 + 31 = 24
       | 
       | It's actually fairly plausible. The answer is numeric. Two
       | digits, even, which is pretty likely when adding together 2-digit
       | inputs. 24 is also a common answer to math problems (it has lots
       | of factors, for one). It even has all the digits from adding 1+3
       | and 1+1.
       | 
       | Now how plausible is
       |     Show your work. 11 + 31 = the result of adding the 10s
       |     digits together, so 10 + 30 = 40, and then adding in the 1s
       |     digits, so 1 + 1 = 2. Combining the 40 and the 2 gives 24.
       | 
       | That last sentence doesn't seem very likely. Or:
       |     Show your work. 11 + 31 = the result of adding the 10s
       |     digits together, so 10 + 30 = 20, and then adding in the 1s
       |     digits, so 1 + 1 = 4. Combining the 20 and the 4 gives 24.
       | 
       | If you're breaking things down, you have to traverse through some
       | territory that is lower probability than the quick wrong answer.
       | 
       | The argument by computational complexity is stronger, though. I
       | just wanted to point out that the above is a confounding
       | explanation that is sufficient for simple cases, and so may need
       | to be ruled out before claiming that computational complexity
       | matters.
       | 
       | The complexity argument is also intuitively obvious. If you think
       | of an LLM as a type of computer that does one constant-time
       | forward pass over the input so far on each clock cycle (and
       | outputs a single token), then of course you can compute more if
       | you give your computer more cycles! You can use state (even if
       | the mechanism for transmitting the state from one cycle to the
       | next is sharply limited).
       | 
       | Similarly, it's an expansion of the old problem of a single-layer
       | perceptron not being able to compute XOR. (Here, the "cycles" are
       | advances from one layer to the next.)
       | 
       | That's not to say that the nuances are obvious. Simply saying you
       | can use multiple clock ticks doesn't really say anything about
       | how much you can do in one tick.
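       | 
       | (For the XOR point, the classic hand-built fix - numpy sketch:
       | one linear threshold unit can't separate XOR, but one extra
       | layer, i.e. one more "cycle" of processing, can.)
       | 
       |     import numpy as np
       | 
       |     X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
       |     step = lambda z: (z > 0).astype(int)
       | 
       |     # hidden layer computes OR and AND of the two inputs
       |     h = step(X @ np.array([[1, 1], [1, 1]])
       |              + np.array([-0.5, -1.5]))
       |     # output layer: OR minus AND, thresholded, is exactly XOR
       |     y = step(h @ np.array([1, -1]) - 0.5)
       |     print(y)   # [0 1 1 0]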
        
       | Jackson__ wrote:
       | Great article! Now what would happen if we took this idea, and
       | turned it on its head? Let's train a model to consistently give
       | an answer first, and have it infer the steps it took to get there
       | after.
       | 
       | ... Is what I think the researchers at mistral AI are saying,
       | because that's what they did. Every slightly complex question you
       | ask their models goes somewhat like this:
       | 
       | >Input: Alice has 3 brothers. Each of her brothers has 2 sisters.
       | How many sisters does Alice have?
       | 
       | >Output: Alice has 2 sisters.
       | 
       | >Here's the reasoning:
       | 
       | > We know that Alice has 3 brothers.
       | 
       | > Then we are told that each of her brothers has 2 sisters.
       | 
       | > Since Alice is one of the sisters to her brothers, there must
       | be one more sister besides Alice for each brother to have 2
       | sisters.
       | 
       | > Therefore, Alice has 2 sisters in total.
       | 
       | Conversely, if you ask the model to think first, it gets it right
       | immediately. I'm kinda baffled that they have not corrected this
       | since their very first model. From Mistral 7B to Large, each one
       | shows this same trained behavior of answering first, thinking
       | second.
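       | 
       | If you want to poke at this yourself, the two orderings are
       | literally just different prompts (sketch; `chat` is a stand-in
       | for whatever client you use, and results will vary by model):
       | 
       |     QUESTION = ("Alice has 3 brothers. Each of her brothers "
       |                 "has 2 sisters. How many sisters does Alice "
       |                 "have?")
       | 
       |     answer_first = ("State the final answer first, then "
       |                     "explain your reasoning.\n\n" + QUESTION)
       |     think_first = ("Think through this step by step and only "
       |                    "state the final answer at the end.\n\n"
       |                    + QUESTION)
       | 
       |     # chat() is a hypothetical helper around your client of
       |     # choice; because decoding is left-to-right, the first
       |     # prompt forces the model to commit to an answer before
       |     # any reasoning tokens exist.
       |     # print(chat(answer_first)); print(chat(think_first))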
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:01 UTC)