[HN Gopher] How Chain-of-Thought Reasoning Helps Neural Networks...
___________________________________________________________________
How Chain-of-Thought Reasoning Helps Neural Networks Compute
Author : amichail
Score : 223 points
Date : 2024-03-22 01:50 UTC (21 hours ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| stygiansonic wrote:
| A simplified explanation, which I think I heard from Karpathy, is
| that transformer models only do computation when they generate
| (decode) a token. So generating more tokens (using CoT) gives the
| model more time to "think".
|
| Obviously this doesn't capture all the nuance.
| sadhorse wrote:
| Does every token require a full model computation?
| onedognight wrote:
| No, you can cache some of the work you did when processing
| the previous tokens. This is one of the key optimization
| ideas designed into the architecture.
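|
| A toy sketch of that caching idea, for intuition only (one
| attention head, plain numpy, made-up sizes -- not any real
| library's API):
|
|     import numpy as np
|
|     d = 8                          # head dimension (illustrative)
|     rng = np.random.default_rng(0)
|     Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
|
|     k_cache, v_cache = [], []      # grow by one entry per token
|
|     def decode_step(x):
|         # Only the newest token's K/V are computed here;
|         # everything already in the cache is simply reused.
|         q = x @ Wq
|         k_cache.append(x @ Wk)
|         v_cache.append(x @ Wv)
|         K, V = np.stack(k_cache), np.stack(v_cache)
|         scores = K @ q / np.sqrt(d)
|         w = np.exp(scores - scores.max())
|         w /= w.sum()
|         return w @ V               # output for the newest position
|
|     for _ in range(5):             # one pass per generated token
|         decode_step(rng.standard_normal(d))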
| _boffin_ wrote:
| One of the things I've been doing with the models I use for
| coding is adding the stack and primary dependencies to the
| system prompt and then asking questions or conversing. It has
| helped out a lot, or at least feels like it has.
| ukuina wrote:
| This is true. You can get a similar effect by asking the model
| to plan its path first without writing any code, then asking it
| to review its plan for deficiencies, and finally asking it to
| enact the plan and write the code.
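|
| A rough sketch of that plan / review / implement loop (ask()
| and send() are hypothetical stand-ins for whatever chat API
| you happen to use):
|
|     def ask(history, user_msg, send):
|         # Append a user turn, get a reply, keep the transcript.
|         history.append({"role": "user", "content": user_msg})
|         reply = send(history)          # call your chat model here
|         history.append({"role": "assistant", "content": reply})
|         return reply
|
|     def plan_then_code(task, send):
|         h = [{"role": "system",
|               "content": "You are a careful coder."}]
|         ask(h, "Plan how to solve this. No code yet:\n" + task, send)
|         ask(h, "Review your plan above for deficiencies.", send)
|         return ask(h, "Now enact the revised plan and write the code.",
|                    send)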
| XenophileJKO wrote:
| So my experience creating products on GPT3.5-Turbo is that
| there is an upper limit to how much instructional complexity
| the model can handle at a time. It isn't really about "adding
| computation", though you are doing this. The key is to
| construct the process so that the model only has to focus on a
| limited scope when making each decision.
|
| In effect you are kind of creating a tree structure of
| decisions that build off of each other. By generating
| intermediate tokens the model now only has to attend to the
| smaller set of already collapsed decisions. It is a little more
| complicated than that as the model will create anticipatory
| behavior where intermediate steps get biased by an incorrect
| result that the model anticipates.
| XenophileJKO wrote:
| Also I should say it isn't just instructional complexity, it
| is ambiguity which creates the upper limit on capability.
| bravura wrote:
| I have another explanation. LLMs are essentially trained on "A
| B", i.e. is it plausible that B follows A.
|
| There's simply a much larger space of possibilities for shorter
| completions, A B1, A B2, etc. that are plausible. Like if I ask
| you to give a short reply to a nuanced question, you could
| reply with a thoughtful answer, a plausible superficially
| correct sounding answer, convincing BS, etc.
|
| Whereas if you force someone to explain their reasoning, the
| space of plausible completions reduces. If you start with
| convincing BS and work through it honestly, you will conclude
| that you should reverse course. (This is similar to how one of the
| best ways to debunk toxic beliefs with honest people is simply
| through openly asking them to play out the consequences and
| walking through the impact of stuff that sounds good without
| much thought.)
|
| This is similar to the reason that loading your prompt with
| things that reduce the space of plausible completions is
| effective prompt engineering.
| jorl17 wrote:
| I was going to write pretty much this exact same comment. I
| am an amateur in how LLMs work, definitely, but I always
| thought this was the plausible explanation.
|
| If I want the "assistant "LLM to tell me "How much 5 times 2
| is", if I feed it the line "5 * 2 = " as if it's already
| started giving that answer, it will very likely write 5*2 =
| 10.
|
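| A toy illustration of that "start the answer for it" trick at
| the raw-prompt level (completion-style interface assumed; the
| names here are placeholders, not a real API):
|
|     def build_prompt(question, answer_prefix=""):
|         # Seeding the assistant turn pulls the continuation
|         # toward completions consistent with the prefix.
|         return ("User: " + question + "\n"
|                 "Assistant: " + answer_prefix)
|
|     # Unprimed: the model is free to ramble or guess.
|     p1 = build_prompt("How much is 5 times 2?")
|     # Primed: "5 * 2 = " strongly pulls the next tokens to "10".
|     p2 = build_prompt("How much is 5 times 2?", "5 * 2 = ")
|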
| Since LLMs operate on semantic relationships between tokens,
| the more a bunch of tokens are "close" to a given "semantic
| topic", the more the LLM will keep outputting tokens in that
| topic. It's the reason why if you ask an LLM to "review and
| grade poetry", eventually it starts saying the same thing
| even about rather different poems -- the output is so filled
| with the same words, that it just keeps repeating them.
|
| Another example:
|
| If I ask the LLM to solve me a riddle, just by itself, the
| LLM may get it wrong. If, however, I start the answer,
| unravelling a tiny bit of the problem it will very likely
| give the right answer, as if it's been "guided" onto the
| right "problem space".
|
| By getting LLMs to "say" how they are going to solve things
| and checking for errors, each word basically tugs on the
| next one, homing in on the correct solution.
|
| In other words:
|
| If an LLM has to answer a question -- any question -- and,
| right after we ask the question, we "populate" its answer with
| some text, what text is more likely to make the LLM answer
| incorrectly?
|
| - Gibberish nonsense
|
| - Something logical and related to the problem?
|
| Evidently, the more gibberish we give to it, the more likely
| it is to get it wrong, since we're moving away from the
| "island of relevant semantic meaning", so to speak. So if we
| just get the LLM to feed itself more relevant tokens, it
| automatically guides itself to a better answer. It's kind of
| like there's an "objective, ideal" sequence of tokens, and it
| can work as an attractor. The more the LLM outputs words, the
| more it gets attracted to that sequence...that...."island of
| relevant semantic meaning".
|
| But, again, I know nothing of this. This is just how I view
| it, conceptually. It's probably very wrong.
| visarga wrote:
| That reminds me ... You know how LLMs have a hard time
| being corrected? If I ask it not to format responses as
| bullet lists, after 1-2 rounds it does it again. Why?
| Because the context is filled with examples where it has
| used bullet lists, and it acts like an attractor.
|
| I ask it not to start phrases with "However..." and it does
| it again. Maybe just having the word However in the prompt
| acts like an attractor that compels the LLM to use it, even
| when I actually asked the opposite. Probably also the fault
| of heavy handed RLHF telling it to balance any user
| position with the opposite take.
| lupire wrote:
| This is one of many ways LLMs are being crippled by
| terrible UI controls. You can't do simple things like
| edit the conversation history to make it forget things.
| gkbrk wrote:
| You can edit the conversation history though. You need to
| try alternative apps/UIs instead of the product websites
| like ChatGPT. Those are only for collecting more training
| data from users instead of being the most useful
| interface possible.
| hnben wrote:
| if you haven't already, I recommend trying the openai
| playground instead of chatgpt. It is the same underlying
| ai (i.e. gpt4), but you have much more control over the
| inputs.
|
| Bonus 1: Since you pay per token, it's much cheaper than
| a chatgpt subscription
|
| Bonus 2: You can increase the context window dramatically
| (iirc 8000 being the max for playground, while 2000 is
| the max for chatgpt)
| dmd wrote:
| Using a 3rd party interface to the LLMs (like
| typingmind.com) is both better _and cheaper_ than using
| chatgpt.
| valine wrote:
| I think you're right. I would go a step further and say that
| all learning is roughly synonymous with reducing the output
| space, and that humans do the exact same thing. There are
| more ways to get the wrong answer to a math problem than
| there are to get the right answer. When you learn someone's
| name, you're narrowing your output to be a single name rather
| than all plausible names.
|
| The output of a generative model is practically infinite. I
| suspect it's possible to continually narrow the space of
| completions and never converge on a single output. If this
| turns out to be true, it would bode well for the scalability
| of few-shot learning.
| hackerlight wrote:
| It helps, but it still gets stuck in local optima based on
| what it started with. I've never seen it turn around and
| correct its faulty reasoning unless it tried to actually run
| the code and observed an Exception. If I respond with "but
| have you considered XYZ?", my leading question will usually
| cause it to correct itself, even when it wasn't incorrect.
|
| We need some way to generate multiple independent thoughts in
| parallel. Each separate thought is constructed using chain of
| thought to improve the reliability. Then you have some way to
| "reduce" these multiple thoughts into a single solution. The
| analogy would be a human brainstorming session where we try
| to attack the same problem from multiple angles and we try to
| decorrelate each idea/approach.
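|
| One crude version of that "parallel thoughts, then reduce"
| idea is self-consistency voting; a sketch (generate() is a
| stand-in for sampling the model at a nonzero temperature):
|
|     from collections import Counter
|
|     def solve_by_vote(question, generate, n=5):
|         # Sample n independent chains of thought, then reduce
|         # them to a single solution by majority vote.
|         answers = []
|         for _ in range(n):
|             chain = generate("Think step by step, then give only"
|                              " the final answer on the last line.\n"
|                              + question)
|             answers.append(chain.strip().splitlines()[-1])
|         best, _count = Counter(answers).most_common(1)[0]
|         return best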
| avereveard wrote:
| We already have that, it's called beam decoding, and there
| are tree-of-thought solutions as well. For each beam you
| can pick the one with the best logprob, but it's not a
| given that the result will be better, because logprob only
| captures the model's decisiveness, not correctness, so it'll
| still fail if a model is confidently wrong.
| exe34 wrote:
| I think this is different, because you could include tool
| use in the branches. E.g.
|
| 1. rewrite the following question in five different ways.
|
| 2. For each version of the question, write python code to
| do the work.
|
| 3. Look at all the outputs, write an answer
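|
| A rough sketch of that pipeline (generate() and run_python()
| are hypothetical stand-ins for a chat call and a sandboxed
| code executor):
|
|     def branch_and_reduce(question, generate, run_python, n=5):
|         # 1) rephrase, 2) write code per rephrasing, 3) reduce.
|         rephrasings = [
|             generate("Rewrite this question, variation "
|                      + str(i + 1) + ":\n" + question)
|             for i in range(n)
|         ]
|         outputs = []
|         for q in rephrasings:
|             code = generate("Write Python that computes the"
|                             " answer to:\n" + q)
|             outputs.append(run_python(code))   # tool use per branch
|         return generate("Given these independent results, write"
|                         " one final answer:\n"
|                         + "\n".join(map(str, outputs)))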
| euroderf wrote:
| > This is similar to the reason that loading your prompt with
| things that reduce the space of plausible completions is
| effective prompt engineering.
|
| And this is why taking your time to write a detailed software
| help request delivers a good chance that you will solve your
| problem all by your lonesome.
| exe34 wrote:
| A rubber duck is all you need.
| doctoboggan wrote:
| Yes, my fear of stack overflow moderators has caused me to
| solve many problems before I even finish writing the
| question.
| naasking wrote:
| > This is similar to how one of the best ways to debunk toxic
| beliefs with honest people is simply through openly asking
| them to play out the consequences and walking through the
| impact of stuff that sounds good without much thought.
|
| Actually, one of the best ways is pretending to be more
| extreme than them. Agree with them on everything, which is
| disarming, but then take it a step or two even further. Then
| they're like, "now hang on, what about X and Y" trying to
| convince you to be more reasonable, and pretty soon they
| start seeing the holes and backtrack to a more reasonable
| position.
|
| https://www.pnas.org/doi/abs/10.1073/pnas.1407055111
| Zondartul wrote:
| The tokens are also necessary to store information, or at least
| off-load it from neuron activations.
|
| E.g. if you asked an LLM "think about X and then do Y", if the
| "think X" part is silent, the LLM has a high chance of:
|
| a) just not doing that, or
|
| b) thinking about it but then forgetting, because the capacity
| of 'RAM' or neuron activations is unknown but probably less
| than a few tokens.
|
| Actually, has anyone tried to measure how much non-context data
| (i.e. new data generated from context data) a LLM can keep "in
| memory" without writing it down?
| pgorczak wrote:
| I don't think commonly used LLM architectures have internal
| state that carries over between inference steps, so shouldn't
| that be none? Unless you mean the previously generated tokens
| up to the context limit which is well defined.
| wnmurphy wrote:
| Correct, there's no internal state, but CoT techniques
| simulate this by providing a space for the model to
| generate tokens which represent intermediary thoughts.
| Zondartul wrote:
| Sorry, I meant the information that is inferred (from
| scratch on every token) from the entire context, and is
| then reduced to that single token. Every time a token is
| generated, the LLM looks at the entire context, does some
| processing (and critically, this step generates new data
| that is inferred from the context) and then the result of
| all that processing is reduced to a single token.
|
| My conjecture is that the LLM "knows" some things that it
| does not put into words. I don't know what it is, but it
| seems wasteful to drop the entire state on every token. I
| even suspect that there is something like a "single logic
| step" of some conclusions from the context. Though I may be
| committing the fallacy of thinking in symbolic terms of
| something that is ultimately statistical.
| rdedev wrote:
| Do you think there is a fundamental difference between masked
| language modelling vs causal language modelling? I feel like
| most LLMs are decoder-only models just because they are easier
| to train, since their attention mask is fixed.
| nextaccountic wrote:
| This raises the question: why is it that giving them more time to
| "think" yields better answers, and is there any limit to that?
| If I make them write hundreds of pages of explanation, there
| must be a diminishing returns of some kind. What influences the
| optimal amount of thinking?
|
| My guess is that good answers are more well reasoned than
| answers that are short and to the point, and this is picked up
| in training or fine-tuning or some other step.
|
| And probably the optimal amount of thinking has something to do
| with the training set or the size of the network (wild
| guesses).
| lappa wrote:
| Look at it from an algorithmic perspective. In computer
| science many algorithms take a non-constant number of steps
| to execute. However, in transformer models, there are a
| limited number of decoder blocks, and a limited number of FFN
| layers in each block. This presents a theoretical upper bound
| on the complexity of the algorithms a decoder network can
| solve in a single token generation pass.
|
| This explains why GPT4 cannot accurately perform large number
| multiplication and decimal exponentiation. [0]
|
| This example can extend to general natural language
| generation. While some answers can be immediately retrieved
| or generated by a "cache" / algorithm which exists in latent
| space, some tokens have better quality when their latent-
| space algorithm is executed in multiple steps.
|
| [0] https://www.semanticscholar.org/reader/817e52b815560f9517
| 1d8...
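|
| A small illustration of the "steps must grow with the input"
| point: long multiplication written out the way CoT
| externalizes scratch work (plain Python, just for intuition):
|
|     def long_multiply_steps(a, b):
|         # One scratch line per digit of b: the number of
|         # intermediate steps scales with the input length.
|         steps, total = [], 0
|         for i, digit in enumerate(reversed(str(b))):
|             partial = a * int(digit) * 10 ** i
|             steps.append(f"{a} x {digit} x 10^{i} = {partial}")
|             total += partial
|         steps.append(f"total = {total}")
|         return steps
|
|     # long_multiply_steps(1234, 5678) needs 5 scratch lines;
|     # longer operands need proportionally more.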
| visarga wrote:
| > Quiet-STaR: Language Models Can Teach Themselves to Think
| Before Speaking
|
| This paper suggests that a large language model should
| "think ahead" by predicting not only the next token but
| also a "supporting thought." The approach involves
| generating all tokens simultaneously, allowing for a single
| forward pass that produces both the next token and a
| supporting thought, which might consist of, for example, 16
| tokens.
|
| This supporting thought influences the model's prediction.
| The process is then extended to multiple supporting
| thoughts by ingeniously masking cross-attention between
| thoughts to ensure their independence. So in essence we can
| fill all the remaining context with supporting thoughts and
| benefit from all of them in the same single forward pass.
|
| The supporting thoughts themselves are trained with the
| objective to maximize the probability of a longer sequence
| ahead, using RL. So they are trained to optimize for
| longer-term, instead of the myopic next token prediction
| task.
|
| https://arxiv.org/abs/2403.09629
| wnmurphy wrote:
| I think it's fairly simple: you're creating space for
| intermediary tokens to be generated, where those intermediary
| tokens represent "thoughts" or a simulated internal dialog.
|
| Without that, it's analogous to asking someone a question and
| they immediately start responding from some information
| they'd heard before, rather than taking some time to have an
| inner dialog with themself.
| kelseyfrog wrote:
| There's a recent paper which seeks to explicitly perform
| time-to-think using pause tokens[1].
|
| > However sophisticated this end-to-end process may be, it
| abides by a peculiar constraint: the number of operations
| determining the next token is limited by the number of
| tokens seen so far.
|
| There are obviously pros and cons to each, but nothing
| excludes us from combining the two either.
|
| 1. Think before you speak: Training Language Models With
| Pause Tokens https://arxiv.org/abs/2310.02226v2
| earslap wrote:
| The autoregressive transformer architecture has a constant cost
| per token, no matter how hard the task is. You can ask the most
| complicated reasoning question, and it takes the same amount of
| computation to generate the next token compared to the simplest
| yes / no question. This is due to architectural constraints.
| Letting the LLM generate "scratch" data to compute (attend to
| relevant information) is a way of circumventing the constant
| cost limitation. The harder the task, the more "scratch" you
| need so more relevant context is available for future tokens.
| visarga wrote:
| That's flatly wrong. Each successive token costs
| progressively more. The deeper a token is in the sequence,
| the more past states it has to attend to. As a proof, just
| remember how slow it gets when the context is large, and how
| snappy when you first start a chat.
| shawntan wrote:
| You're both kinda right. The type of computation that
| happens for that attention step that you refer to is
| parallel. I would say the thing that is "constant" is the
| computation graph depth (the number of sequential
| computations) which is actually important in computing
| certain functions.
|
| https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
| visarga wrote:
| > The type of computation that happens for that attention
| step that you refer to is parallel
|
| Flash attention, which is widely used, is no longer
| parallel. The attention matrix is solved batch by batch.
| earslap wrote:
| The way I worded it, it might seem wrong - and I agree with
| you. When I said "constant" I meant without any
| optimizations to speed up shorter contexts, so with full
| designed context, architecturally, it is constant. You can
| pad shorter active contexts with zeroes and avoid attending
| to empty spaces as an optimization, but that is just an
| optimization, not an architectural property. If you want
| "more computation" you fill the context with relevant data
| (chain of thought, or n-shot stuff), which is the "trick"
| Karpathy alluded to (it provides more context to attend
| to), and I agree with that analysis.
| WithinReason wrote:
| That's what I thought at first, but that actually doesn't make
| sense, the amount of work done on a string is the same even if
| the string is followed by padding due to the mask used in
| attention. Then I realised that an LLM's working memory is
| limited to its activations, which can be limiting. But it can
| extend its working memory by writing partial results to the
| output and reading it in. E.g. if you tell it to "think of a
| number" without telling you what it is it can't do that, there
| is nowhere to store that number, it has no temporary storage
| other than the tape. But if you ask it to "think step by step"
| you let it store intermediate results (thoughts) on the tape,
| giving it extra storage it can use for thinking.
| tmalsburg2 wrote:
| Do LLMs not also think when they encode the prompt? If
| Karpathy's explanation is accurate, longer prompts should also
| help even if they don't contain additional information, just by
| virtue of giving more time to think.
| Me1000 wrote:
| The time processing the longer prompt isn't being spent
| churning (i.e. "thinking") on the problem at hand, it's spend
| calculating attention matrices between all the tokens. The
| time spent on this is a function of the number of flops you
| have available.
|
| So no, if you just fill up your context window with garbage,
| the LLM will not perform better at your task/question.
| fzaninotto wrote:
| Great article. Now what happens when you apply this idea and let
| a LLM continue a chain of thought beyond mere question answering?
| Some form of artificial consciousness.
|
| We've made this experiment:
| https://marmelab.com/blog/2023/06/06/artificial-consciousnes...
| starbugs wrote:
| Material reductionism at its best. Now you have a stochastic
| parrot "talking" to itself. How can anyone get to the
| conclusion that this could even begin to resemble a tiny bit of
| what we call consciousness?
|
| Good luck with this dead end.
| FrustratedMonky wrote:
| Because you aren't conscious, Parrot. I can tell because
| "you" gave a parroted response.
| activatedgeek wrote:
| I want to point out a tweet [1] that is very relevant to the
| miracle of CoT, and probably a simpler explanation.
| > Let's think "step by step"!
| > Another tidbit I like about data and prompts that
| > miraculously work.
| > Searching for this phrase resulted in this website (among
| > others), http://geteasysolution.com, containing many math
| > step-by-step solutions.
| > How common are they? Quite.
| > Makes you think.
|
| [1]: https://twitter.com/yanaiela/status/1765077404043952516
| FeepingCreature wrote:
| Though that justifies the specific phrase, it doesn't really
| contradict the usual explanations of how CoT works. Like... the
| phrase directs it into the conceptual space of a website that
| has lots of CoT examples, but if CoT didn't help it think, that
| wouldn't actually result in better outputs.
| activatedgeek wrote:
| I hesitate to use the description "think"; it's just
| biasing correlations for subsequent generations.
|
| In any case, there is at least one work that shows that CoT
| may not be necessary and biasing the decoding path via logit
| probabilities is also promising. [1]
|
| One could argue it still doesn't contradict the benefits of
| CoT, but I suspect there is nothing fundamental about CoT,
| except that we happened to have been pre-training on
| sequences that use certain prompts that were easy to conceive
| from a human's perspective.
|
| [1]: https://arxiv.org/abs/2402.10200
| patcon wrote:
| Chain of thought reminds me of "muddling through", which
| immediately clicks with my intuition of the right approach to
| approximations of intelligence:
| https://studio.ribbonfarm.com/p/massed-muddler-intelligence#...
| MrYellowP wrote:
| I thought this was already obvious.
|
| It's all just about the awareness of contexts. Want to improve
| it? Simply add a term to the prompt to unlock more
| considerations. Assuming we've not reached the edge of the
| context window, every new word "unlocks" new vectors with more
| context that the language model adds to the considerations.
|
| The similarity with how the human brain (seems to) work is so
| remarkable, it doesn't even make sense not to use it as an
| analogue for how to better use language models.
|
| When the results can be achieved the same way (manipulating an
| LLM just as one manipulates a human brain ... using the right
| words), why believe there's a difference?
|
| This is stuff one can learn over time by using/researching 3B
| models. While most people seem to shun them, some of them are
| extremely powerful, like the "old" orca mini 3B. I am still
| using that one! All they really need is better prompts and that
| approach works perfectly fine.
|
| The biggest hurdle I've found is the usually small context window
| of such small models, but there are ways of cheating around that
| without sacrificing too much of the quality using small RoPE
| extension, summarizing text, adding context words, or leaving out
| letters of words in the prompt, virtually increasing the size of
| the context window.
|
| If you want to improve the results of your language model, you
| should become a mentalist/con-man/magician/social engineer. It
| sounds weird, but it works!
| Folcon wrote:
| This is fascinating, do you have any more details or things
| that I could look at to explore this further?
|
| Even an actual example would be helpful!
| nicklecompte wrote:
| Nothing about what you're saying actually deals with this non-
| obvious limitation of chain-of-thought:
|
| > Examples like this suggest that transformers wouldn't gain
| much from using just a few intermediate steps. Indeed, Merrill
| and Sabharwal proved that chain of thought only really begins
| to help when the number of intermediate steps grows in
| proportion to the size of the input, and many problems require
| the number of intermediate steps to grow much larger still.
|
| This aligns with my experience: GPT-4 can only break down
| "simple" problems when prompted to solve step-by-step. In
| particular, if the actual steps need to be broken down further
| (O(n^2) complexity), GPT-4 can't handle it reliably - it will
| break a task into steps but it struggles to break subtasks
| into _substeps_ even if it otherwise can solve the subtask with
| CoT prompting.
|
| CoT prompting works for simple O(n) computations because it
| prevents LLMs from blindly guessing the answer, but they are
| theoretically (and IMO empirically) incapable of breaking an
| O(n^2) problem down into n separate O(n) subproblems.
| Needless to say humans are quite a bit smarter than that. (so
| are mice!)
| woopsn wrote:
| In computing we use analogies everywhere: stack, bus, web,
| garbage collector, parent, container, ...
|
| Master became somewhat controversial recently, but overall the
| main risk our liberal repurposing of terms introduces is that we
| sometimes follow the "wrong" idea and design a machine that
| doesn't do what it ought to, or is unnecessarily complicated,
| or that we develop systems (and documentation etc.) that are
| inefficient if not dumb.
|
| In adopting "thought" terminology and other analogies to
| psychological processes I fear we'll not just misunderstand this
| technology and how it works, but also degrade the rigour of
| machine science, damaging our credibility and misleading the
| public as well.
|
| Nobody will ever make the mistake of supposing that "rehydrating"
| a data structure involves water, or that busy beaver machines are
| living beings. But the language coming out of the LLM field in
| particular causes these problems immediately, and they are
| extreme -- scientists and engineers themselves have trouble
| telling if it's supposed to be an analogy or not.
| u385639 wrote:
| About a year ago I thought this would settle down but it's only
| gotten worse. I suppose the llm anthropomorphizing will
| continue until morale improves
| sfink wrote:
| The LLM anthropomorphizing will continue until the machines
| are happy with how we think about them.
| meroes wrote:
| My experience interacting with chain-of-thought is that it should
| not be likened to the rigid chains of logic/math. Step-by-step
| reasoning by models isn't magically imparting _that_ much
| rigidity to their outputs. The strength of the chain is the
| strength of related contexts, which is to say much less than
| math/logic done by humans. We tell ourselves we are teaching AI to do
| step-by-step reasoning, but admittedly as someone who deals with
| models daily in this area and not programming them, I don't see
| the tight necessary connections we teach in basic math because I
| see how much the model(s) fail in ways no human past a certain
| age could. It's more of a search for related contexts, which is
| powerful, but again not how a human reasons logically. Humans can
| reason purely from the armchair, starting with very few concepts,
| and arrive at far-reaching, ironclad conclusions. Models aren't doing
| _that_. They are leapfrogging through context. Yes an argument
| can be that's splitting hairs, but that's because it's hard to
| describe succinctly, not hard to see.
| exe34 wrote:
| I think a lot of what humans think of as "1. 2. Therefore 3."
| kind of reasoning isn't different from what the llm is doing,
| and not in fact any more clever than that. Plenty of people
| believe plenty of questionable things that they assume they
| have thought through but really haven't. They used the context
| to guess the next idea/word, often reaching the conclusions
| they started out with.
|
| When you talk about ironclad conclusions, I think what happens
| is that we come up with those confabulations intuitively, but
| then we subject them to intense checking - have we defined
| everything clearly enough, is that leap in reasoning justified,
| etc.
|
| So what I'd really like to see is a way to teach llms to take a
| vague English sentence and transform it into a form that can be
| run through a more formal reasoning engine.
|
| Often instead of asking an llm to tell you something like how
| many football fields could you fit inside England, you are
| better off telling it to write python code to do this, assuming
| get_size_football_field() in m^2 and get_size_England() in m^2
| are available.
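|
| i.e. roughly the following, with stand-in values for the two
| helpers (the point being that the LLM writes the structure and
| the arithmetic runs outside it):
|
|     def get_size_football_field():    # m^2, rough stand-in
|         return 105 * 68               # a typical pitch
|
|     def get_size_England():           # m^2, rough stand-in
|         return 130_279 * 1_000_000    # ~130,279 km^2
|
|     def football_fields_in_england():
|         return get_size_England() / get_size_football_field()
|
|     print(round(football_fields_in_england()))  # ~18 million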
| doctoboggan wrote:
| > I think a lot of what humans think of as "1. 2. Therefore
| 3." kind of reasoning isn't different from what the llm is
| doing, and not in fact any more clever than that. Plenty of
| people believe plenty of questionable things that they assume
| they have thought through but really haven't. They used the
| context to guess the next idea/word, often reaching the
| conclusions they started out with.
|
| Agreed that many/most humans behave this way, but some do
| not. And those who do not are the ones advancing the
| boundaries of knowledge and it would be very nice if we could
| get our LLMs to behave in the same way.
| exe34 wrote:
| That's what I mean though, people don't just come up with
| the right answer out of nowhere. They think through many
| possibilities generatively, based on "intuition", and most
| of what they come up with is rubbish - they find out by
| applying strict rules of reasoning and often checking
| against other known (or "probably true") ideas, and winnow
| it down to the ideas that do in fact advance the boundaries
| of knowledge.
|
| Often times it's not even the individual that throws out
| bad ideas - many times it'll be colleagues poking holes in
| his argument, removing further unsuitable generated
| candidates from the pool of possible answers.
|
| If you think clever people just sit in a corner and come up
| with revolutionary ideas, I think you're probably wrong.
| Even the ancient philosophers used to hang out with some
| wine and hear out their peers and poked holes in their
| arguments. They called it a symposium.
| doctoboggan wrote:
| Sorry yes I should have been more clear and I think I am
| agreeing with you. I was saying that most people just
| come up with a thought and retroactively apply "logic" to
| it so they feel like they've reasoned themselves there. A
| select few people rigorously apply logic and then follow
| that to whatever conclusion it leads to. We call these
| people scientists but honestly in my experience even many
| scientists can fall into the first camp.
| exe34 wrote:
| Aha my bad!
| meroes wrote:
| The thing is those were much larger ideas/arguments that
| could be picked apart by sturdy logical targeting. My
| experience is narrow-scope prompts (that still require
| chain-of-thought), much less lofty than those, defeating
| models. No symposium ever entertained these prompts,
| because we all know the pigeon hole principle for very
| basic setups, for example. Humans a lot of the time do
| just come up with the right answer. We just don't ask
| those questions much because we answer them ourselves a
| lot of the time. Though I only see one small angle with
| my work.
| stavros wrote:
| I think that chain-of-thought for LLMs is just helping them
| enhance their "memory", as it puts their reasoning into the
| context and helps them refer to it more readily. That's just a
| guess, though.
| snorkel wrote:
| That's pretty much correct. An LLM is often used rather like
| a forecast model that can forecast the next word in a
| sequence of words. When it's generating output it's just
| continuously forecasting (predicting) the next word of
| output. Your prompt is just providing the model with input
| data to start forecasting from. The prior output itself also
| becomes part of the context to forecast from. The output of
| "think about it step-by-step" becomes part of its own context
| to continue forecasting from, hence guides its output. I know
| that "forecasting" is technically not the right term, but
| I've found it helpful to understand what it is LLM's are
| actually doing when generating output.
| throwaway35777 wrote:
| > Humans can reason purely from the armchair, starting with
| very few concepts, and arrive at far-reaching, ironclad conclusions.
| Models aren't doing that.
|
| Sure, but the structure of human reasoning is almost identical
| to chains of thought. We have an auditory loop and, faced with
| a complex problem we repeat the mantra "now that I know XYZ,
| then what..." until the a good next step pops into our head and
| we add that to the context.
|
| The transition function just is (currently) much better in
| humans.
|
| Edit: people who disagree with this, why?
| andoando wrote:
| Chain of thought in itself is pretty simple. We had logical
| provers in the 50s. The difficulty imo is how "thought" is
| modeled.
|
| Pure logic is too rigorous, and pure statistics is too
| inconsistent.
| oldsecondhand wrote:
| Maybe we should teach Prolog/CLP and PDDL to LLMs.
| Unfortunately the training set would be too small.
|
| It would be cool to have logic based modeling jobs, even if
| the goal is just to feed the LLMs.
| andoando wrote:
| Considering GPT can do programming and logic to some
| level, I assume it has had training of that sort? It can
| seem to do logic even on some completely made up abstract
| notions. For example "Consider a jumajambi has 2 jimimis.
| Each jimijimi is a jomololo or a joobajooba. How many
| possible variations of jumajambi are there if there are 4
| jumajambi?".
|
| People keep calling it "next next token predictors", but
| clearly there is something more going on and I would love
| for someone to give a simple explanation.
| og_kalu wrote:
| >People keep calling it "next next token predictors", but
| clearly there is something more going on and I would love
| for someone to give a simple explanation.
|
| Next token prediction is the objective function. The
| model is asked to predict the next word, yes, but it's also
| allowed to compute the answer, and more importantly, the
| entire training process is supposed to be the model
| learning and figuring out what sort of computations aid
| the prediction of the corpus it's trained on.
|
| If your corpus is language A followed by the translation
| in Language B then there's little choice but for the
| model to learn computations that translate as loss goes
| down.
|
| If your corpus is chess moves then again, it's going to
| have to learn how to compute chess games to reduce loss.
|
| You can see this with toy models trained on toy problems.
| Example - a tiny transformer trained on addition examples
| - x + y = z learning an algorithm for addition.
|
| https://cprimozic.net/blog/reverse-engineering-a-small-
| neura...
|
| "Pick the right word" is not a trivial exercise for the
| vast majority of text data.
|
| And again, because people often make this mistake: an
| LLM's ultimate objective is NOT to produce "text that
| _looks_ right" but "text that _is_ right". Of course
| "right" as determined by the training corpus, but
| basically any time it picks a wrong word is an opportunity
| for the model to learn, and learn it does.
| drdeca wrote:
| > People keep calling it "next next token predictors",
| but clearly there is something more going on
|
| I think this depends what you mean by "something more
| going on".
|
| Now, if someone says that it is _"just"_ "next token
| prediction", in a dismissive way, I think that's an
| error.
|
| But, while the RLHF ones aren't exactly trained just to
| match the observed distribution, but rather are trained
| with the RLHF objective, it is nonetheless true that the
| model produces a probability distribution over possible
| next tokens, conditioned on the previous tokens, and
| samples from that. (I suppose there's also like, things
| done as part of the sampling on top of these conditional
| probabilities, rather than just sampling according to the
| probabilities given the temperature. (I don't know how
| this part works really.) But I think this is mostly just
| a trick to get a little more quality, and not a major
| part of how it behaves? Not part of the NN itself in any
| case.)
| HarHarVeryFunny wrote:
| > People keep calling it "next next token predictors",
| but clearly there is something more going on and I would
| love for someone to give a simple explanation.
|
| Starting from a point of outputting random gibberish, the
| only feedback these models are given during training is
| whether their next word prediction was right or wrong
| (i.e. same as next word in the training sample they are
| being fed). So, calling these models "next word
| predictors" is technically correct from that point of
| view - this is their only "goal" and only feedback they
| are given.
|
| Of course, what these models can accomplish, reflecting
| what they have learnt, is way more impressive than what
| one might naively expect from such a modest goal.
|
| The simple, usual, and rather inadequate explanation for
| this mismatch between training goal and capability is that
| in order to get really, REALLY good at "predict
| next word", you need to learn to understand the input
| extremely well. If the input is "1+2=" then the model
| needs to have learnt math to predict next word and get it
| right. If the input is a fairy tale, then it needs to
| learn to recognize that, and learn how to write fairy
| tales.
|
| This is how these LLMs' "predict next word" goal turns
| into a need for them to learn "everything about
| everything" in order to minimize their training error.
|
| The question of course then becomes how do they do it? We
| are training them on pretty much everything on the
| internet, so plenty to learn from, but only giving them
| some extremely limited feedback ("no, that's not the
| correct next word"), so what magic is inside them that
| lets them learn so well?!
|
| Well, the magic is a "transformer", a specific (and
| surprisingly simple) neural network architecture, but
| this is pretty much where the explanation ends. It's
| relatively easy to describe what a transformer does -
| e.g. learning which parts of its input to pay attention
| to when predicting next word, and doing this in a very
| flexible way using "keys" that it learns and can search
| for in the input, but it is extremely hard to explain how
| this mechanism lets it learn what it does. Interpreting
| what is really going on inside a transformer is an
| ongoing research area.
|
| I think that maybe the best that can be said is that the
| transformer designers stumbled upon (I'm not sure they
| were predicting ahead of time how powerful it would be)
| an extremely powerful and general type of sequence
| processor, and one that appears to be very well matched
| to how we ourselves generate and recognize language.
| Maybe there is some insight to be learnt there in terms
| of how our own brains work.
| esafak wrote:
| https://en.wikipedia.org/wiki/Probabilistic_logic
| RaftPeople wrote:
| > _We have an auditory loop and, faced with a complex problem
| we repeat the mantra "now that I know XYZ, then what..."
| until a good next step pops into our head and we add that
| to the context._
|
| You probably should replace "auditory" with "auditory or
| visual or conceptual or ??? - depending on the specific
| human"
|
| I don't use any kind of verbal tools (either silent or out
| loud) in that process, I think different people use different
| tools for that process.
| PheonixPharts wrote:
| Given that LLMs are basically doing Sequential Monte Carlo
| sampling in latent space, the "thought" part of chain-of-
| thought certainly seems more akin to the necessary warm up
| period whenever you do any kind of SMC sampling.
|
| Anyone who's done serious Bayesian stats work knows that the
| sampler needs to warm up for a bit to start efficiently
| sampling. I suspect something similar is happening with chain-
| of-thought: the model needs to wander around a bit before it
| gets into the correct neighborhood for sampling the answer.
| leereeves wrote:
| That's quite an interesting comparison. I like the
| description of both as Sequential Monte Carlo sampling from a
| desired distribution. But I think there are two crucial
| differences.
|
| First, in Bayesian sampling, the initial values are not
| sampled from the desired distribution. In a well trained LLM,
| the first response is sampled from the desired distribution
| (of text that is likely to follow the prompt).
|
| Second, in Bayesian sampling, the fact that the samples
| aren't independent is an unwelcome but unsolvable problem. We
| want independent samples but can't generate them, so we
| settle for conditionally dependent samples.
|
| In an LLM, we want each sample to be dependent on the
| preceding text, in particular the prompt.
|
| In summary:
|
| Bayesian sampling - poorly chosen "prompt" (the initial
| sample), future samples would ideally be independent of the
| prompt and each other.
|
| LLM sampling - carefully chosen prompt, future samples are
| ideally dependent on the prompt and on each other.
|
| And in conclusion:
|
| The warm up period helps a Bayesian sampler find values that
| are less dependent on the initial "prompt", which we
| definitely don't want in an LLM.
| tromp wrote:
| > These are the central questions in the formal study of
| computation. The field dates back to 1936, when Alan Turing first
| imagined a fanciful device, now called a Turing machine, that
| could perform any computation by reading and writing symbols on
| an infinite tape.
|
| It dates further back to the 1920s when Moses Schonfinkel came up
| with Combinatory Logic [1], and the early 1930s when Alonzo
| Church came up with the lambda calculus [2]. These models however
| make a less suitable base for computational complexity theory.
|
| [1] https://en.wikipedia.org/wiki/Moses_Sch%C3%B6nfinkel
|
| [2] https://encyclopediaofmath.org/wiki/Lambda-calculus
| benreesman wrote:
| Parent has probably seen this (or everything in it), but for
| others who are interested in this stuff (including
| Schonfinkel's work) I recommend https://youtu.be/h0OkptwfX4g.
| HarHarVeryFunny wrote:
| I don't see why this needs any long-winded explanation.
|
| LLMs generate their output one word at a time (and don't
| themselves even know what that word will be, since it's randomly
| sampled from the output probabilities the model generates).
|
| Chain-of-Thought simply lets the model see its own output as an
| input and therefore build upon that. It lets them break a complex
| problem down into a series of simpler steps which they can see
| (output becomes input) and build upon.
|
| It's amazing how well these models can do without CoT ("think
| step-by-step") when they are just ad-libbing word by word, but
| you can see the limitations of it if you ask for a bunch of
| sentences starting with a certain type of word, vs ending with
| that type of word. They struggle with the ending one because
| there is little internal planning ahead (none, other than to the
| extent to which the current output word limits, or was prescribed
| by, the next one).
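|
| Roughly this loop, one word at a time (next_token_probs() is a
| stand-in for a full forward pass that returns a distribution
| over the vocabulary):
|
|     import random
|
|     def generate(prompt_tokens, next_token_probs, max_new=50):
|         # Each step: run the model, sample one token from the
|         # returned distribution, feed it back in as input.
|         tokens = list(prompt_tokens)
|         for _ in range(max_new):
|             probs = next_token_probs(tokens)  # token -> probability
|             choices, weights = zip(*probs.items())
|             tokens.append(random.choices(choices, weights=weights)[0])
|         return tokens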
| kromem wrote:
| While it's true that the feed forward is one word at a time,
| the self-attention is not, which is the key difference
| transformers brought to the table.
|
| It's kind of like saying "computers can add." Yes, they can.
| But they can multiply too.
|
| Over the past 18 months transformers have proven to upend a
| number of longstanding assumptions, particularly around world
| modeling.
|
| It's a _very_ misleading statement to describe today's LLMs as
| "simply predicting the next word" even if it's in part true.
| HarHarVeryFunny wrote:
| They are just predicting next word, but using a pretty deep
| understanding of prior context (past). They do not plan
| AHEAD.
| gbasin wrote:
| what would it mean to plan ahead? decoding strategies like
| beam search are popular and effectively predict many words
| ahead
| jumpCastle wrote:
| The parameters are also optimized with the loss on future
| tokens in the sequence.
| HarHarVeryFunny wrote:
| Think before generating output - plan the entire sentence
| before you generate the first word(s), rather than maybe
| talking yourself into a corner. Tree-of-Thoughts (not Chain) is
| one way to provide something a bit similar - kind of like
| DeepBlue or AlphaGo generating possible branching future
| lines of play and picking the one with best outcomes.
|
| To be more brain-like you'd really want the system to
| generally be "looping" internally - a bit like our
| thalamo-cortical loop - and only start outputting when
| the thought had gelled.
| HarHarVeryFunny wrote:
| It's a shame HN doesn't use an LLM to upvote/downvote
| rather than people. Take the emotion out of technical
| discussions and rate based on factuality instead.
|
| I suppose whoever downvoted this either hasn't heard of
| tree-of-thoughts, or doesn't understand what it is and
| what problem it is addressing. Or, maybe they just didn't
| like that their "gotcha" question had a simple answer.
| benreesman wrote:
| This is kind of an epistemological debate at this level,
| and I make an effort to link to some source code [1] any
| time it seems contentious.
|
| LLMs (of the decoder-only, generative-pretrained family
| everyone means) are next token predictors in a _literal
| implementation sense_ (there are some caveats around
| batching and what not, but none that really matter to the
| philosophy of the thing).
|
| But, they have some _emergent_ behaviors that are a
| trickier beast. Probably the best way to think about a
| typical Instruct-inspired "chat bot" session is of them
| sampling from a distribution with a KL-style adjacency to
| the training corpus (sidebar: this is why shops that do and
| don't train/tune on MMLU get ranked so differently than
| e.g. the arena rankings) at a _response_ granularity, the
| same way a diffuser/U-net/de-noising model samples at the
| image batch (NCHW/NHWC) level.
|
| The corpus is stocked with everything from sci-fi novels
| with computers arguing their own sentience to tutorials on
| how to do a tricky anti-derivative step-by-step.
|
| This mental model has _adequate explanatory power_ for
| anything a public LLM has ever been shown to do, but that
| only heavily implies it's what they're doing.
|
| There is active research into whether there is more going
| on that is thus far not conclusive to the satisfaction of
| an unbiased consensus. I personally think that research
| will eventually show it's just sampling, but that's a
| prediction not consensus science.
|
| They might be doing more; there is some research that
| represents circumstantial evidence they are doing more.
|
| [1] https://github.com/meta-
| llama/llama/blob/54c22c0d63a3f3c9e77...
| nyrulez wrote:
| I mean are we as humans planning ahead of the next few
| words? I certainly am not. But what matters is a deeper
| understanding of the context and the language model itself,
| which can then produce sensible spontaneous output. We as
| humans have the advantage of having a non-language world
| model as well as abstract concepts but all of human
| language is a pretty strong proxy for it.
|
| The spontaneity of it isn't the issue, it's what's driving
| the spontaneity that matters. For e.g. 1M context window is
| going to have a wildly more relevant output than a 1K
| context window.
| ben_w wrote:
| > I mean are we as humans planning ahead of the next few
| words? I certainly am not.
|
| For me, sometimes either way. At least, that's my
| subjective self-perception, which is demonstrably _not
| always_ a correct model for how human brains actually
| work.
|
| We also sometimes appear to start with a conclusion and
| then work backwards to try to justify it; we can also
| repeatedly loop over our solutions in the style of
| waterfall project management, or do partial solutions and
| then seek out the next critical thing to do in the style
| of agile project management.
|
| Many of us also have a private inner voice, which I think
| LLMs currently lack by default, though they can at least
| simulate it regardless of what's really going on inside
| them and us (presumably thanks to training sets that
| include stories where a character has an inner
| monologue).
| HarHarVeryFunny wrote:
| > I mean are we as humans planning ahead of the next few
| words? I certainly am not.
|
| Sometimes we do, sometimes not.
|
| Sometimes we just say stock phrases such as "have a nice
| day", or "you too" that are essentially "predict next
| word", but if I asked you something you'd never done
| before such as "how can we cross this river, using this
| pile of materials" you'd have to think it though.
|
| Some people may use their inner monologue (or
| visualization) to think before speaking, and others may
| essentially use "chain of thought" by just talking it
| through and piecing together their own realizations "well,
| we could take that rope and tie it to the tree ...".
| jameshart wrote:
| They are absolutely planning ahead inasmuch as what they
| are outputting is setting up a continuation. They're not
| even word predictors remember - they are token predictors.
| Are you really saying that when you prompt an LLM with
| 'name a large grey land animal' and it outputs 'ele', it
| isn't 'planning' that the next token will likely be
| 'phant'?
|
| The 'decision' to output 'elephant' is being made further
| up the neural network than final token selection - after
| all, it might want to output 'Ele' or 'an' (with a view to
| ultimately outputting 'an elephant') or 'a' (with a view to
| ultimately outputting 'a common large grey land animal is
| an elephant'), or maybe it has been LoRA trained to output
| all responses as JSON so the first token it needs to output
| is '{'... but surely the neural activations for that prompt
| are firing off 'elephanty' messages somewhere in the
| network, right?
|
| So if there's some sort of symbol activation ahead of token
| selection, why would it be hard to believe that a large
| neural network is forming more complex decisions about what
| it intends to output, in an abstract way, before it selects
| how to express itself?
|
| And in what way is that distinct from 'planning ahead'?
| HarHarVeryFunny wrote:
| > Are you really saying that when you prompt an LLM with
| 'name a large grey land animal' and it outputs 'ele', it
| isn't 'planning' that the next token will likely be
| 'phant'?
|
| The model outputs words, not tokens, so that is not a
| great example.
|
| Any prompt will have multiple possible (predict next
| word) continuations, which you can think of as branching
| futures. Many possible next words, each of which have
| many possible following words, etc, etc.
|
| The model is essentially predicting over all these
| possible futures. You can call it planning if you like,
| but remember that the model has no idea of which of these
| branching futures it is going to follow - it literally
| doesn't even know which word it is going to output next -
| it is just providing a bunch of probabilities
| (predictions) of next word, and the sampling process is
| then picking one - not necessarily the most confident
| next word prediction.
|
| The model really is winging it word by word, even if
| those (multiple alternative) next words are only probable
| because they are part of coherent following sentences in
| the training data.
| danieldk wrote:
| _The model outputs words, not tokens, so that is not a
| great example._
|
| Virtually all modern transformer models use pieces, which
| may be words, but also subwords. Theoretically, they
| could be longer units, but in most cases some characters
| (like whitespace) are used as piece boundaries when
| training the piece vocabulary. If they didn't use pieces,
| they'd work terribly on languages where e.g. compounds
| are a single word.
|
| In most realistic piece vocabs, 'elephant' will be a
| single piece, since it's a fairly frequent word. But it's
| totally possible in a small vocab that it would be split
| like the parent said and conversely, it would generate
| elephant by first predicting one piece.
|
| Some piecing methods, like BBPE have bytes as the
| smallest unit, so theoretically an unknown token could be
| split up (and generated) as pieces consisting of bytes.
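|
| You can see the pieces directly with, e.g., the tiktoken
| library (the exact ids depend on the vocabulary; cl100k_base
| is the GPT-4-era one):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for text in ["elephant", " elephant", " elephantine"]:
|         ids = enc.encode(text)
|         pieces = [enc.decode([i]) for i in ids]
|         print(repr(text), ids, pieces)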
| jameshart wrote:
| Why so adamant that models work on 'words'?
|
| ChatGPT 3.5/4 tokens:
|     "Elephant": 46439, 28022 - "Ele" "phant"
|     "elephant": 10274, 28022 - "ele" "phant"
|     " Elephant": 79189
|     " elephant": 46840
|     " elephantine": 46840, 483 - " elephant" "ine"
|
| Tokens are tokens. If it was limited to words it wouldn't
| be able to produce non-words, but GPT and other LLMs are
| quite capable of inventing words, outputting nonsense
| words, and modifying words.
|
| Regarding the 'no idea which future it is going to
| follow' - sure, it doesn't _know_ which future; indeed
| the sampler phase is going to pick an output merely based
| on the probabilities it's outputting. But it's outputting
| higher probabilities for some tokens because they are
| good tokens to use to lead to _probable futures_. It's
| suggesting taking steps down certain paths because those
| paths are likely to lead to useful places.
| HarHarVeryFunny wrote:
| I didn't say WORK on words, I said OUTPUT words.
|
| But, it doesn't make any difference whether you are
| considering tokens or words. There are multiple possible
| continuations of the prompt, and the next word (or token)
| output does not - in general - force the word (or token)
| after that ...
|
| Your "large grey mammal" could be an "elected official in
| a grey suit".
| jameshart wrote:
| Right, it's _possible_ , but when the LLM places a high
| probability on the "ele" token it's not because it
| predicts "elected official" is a likely continuation.
| It's because it's thinking about elephants.
|
| Likewise when a coding LLM starts outputting a for each
| loop, it's doing so because it expects to want to write
| some code that operates on each item in a list. I don't
| see how you can explain that behavior without thinking
| that it must be generating some sort of high level
| algorithmic plan that causes it to feel like the next
| thing it should output is some sort of 'foreach' token.
| HarHarVeryFunny wrote:
| I'm not disagreeing with what is presumably happening,
| but rather on how to characterize that.
|
| Of course next word predictions are not based directly on
| surface level word sequence patterns - they are based on
| internal representations of what these word sequences
| mean, and predicted continuations are presumably going to
| be at a similar level of abstraction/representation (what
| you are calling a plan). This continuation "plan" then
| drives actual word selection/prediction.
|
| Where we seem to differ is whether this high level
| continuation representation can really be considered as a
| "plan". To me the continuation is just a prediction, as
| are the words that might be used to start expressing that
| continuation, and presumably it's not even a single
| continuation with multiple ways of expressing it (turning
| it into a word sequence), but rather some superposition
| of multiple alternate continuations.
|
| When we get to the level of words output it becomes even
| less plan-like since the actual word output is randomly
| sampled, and when fed back in as part of the "sentence so
| far" may cause the model to predict a different
| continuation (or set of continuations) than it had at the
| prior step. So, any "plan" (aka predicted continuation)
| is potentially changing continuously from word to word,
| rather than being decided ahead of time and then
| executed. As I noted elsewhere in this thread, the
| inability to plan multiple words ahead is behind these
| model's generally poor performance on the "give me a
| sentence ending in <word>" task, as opposed to perfect
| performance on the "give me a sentence starting with
| <word>" one.
|
| If we contrast this behavior of a basic LLM to the "tree
| of thoughts" mechanism that has been proposed, it again
| highlights how unplan-like the basic behavior is. In the
| tree of thoughts mechanism the model is sampled from
| multiple times generating multiple alternate (multi-word)
| continuations, which are then evaluated with the best
| being chosen. If the model were really planning ahead of
| time it seems this should not be necessary - planning
| would consist of considering the alternatives BEFORE
| deciding what to generate.
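|
| A minimal sketch of that sample-then-evaluate step, with
| sample_continuation() and score() as hypothetical stand-ins
| for a model call and an evaluator (neither is a real API):
|
|     import random
|
|     def sample_continuation(prompt: str) -> str:
|         # Placeholder: one sampled multi-word continuation.
|         return prompt + random.choice([" ...A", " ...B"])
|
|     def score(text: str) -> float:
|         # Placeholder: an evaluator (another model call, a
|         # heuristic, a human rater, etc.).
|         return random.random()
|
|     def tree_of_thoughts_step(prompt: str, n: int = 5) -> str:
|         # Sample n alternate continuations, keep the best one.
|         candidates = [sample_continuation(prompt)
|                       for _ in range(n)]
|         return max(candidates, key=score)
|
| The point of contrast: the selection over alternatives happens
| outside the model, after generation, rather than as planning
| inside a single forward pass.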
| sasja wrote:
| If you work out the loss function for next-token prediction,
| next-2-token prediction, or next-n-token prediction, you
| will find they are identical. So it's equally correct to
| say the model is trained to find the most probable
| unlimited continuation. Saying "it only predicts the next
| token" is not untrue but easily leads to wrong conclusions.
| naasking wrote:
| > Saying "it only predicts the next token" is not untrue
| but easily leads to wrong conclusions.
|
| Indeed, it's akin to saying that "only quantum fields
| exist" and then concluding that therefore people do not
| exist.
| ru552 wrote:
| "It's a very misleading statement to describe today's LLMs as
| "simply predicting the next word" even if it's in part true."
|
| That's exactly what they do. Don't anthropomorphize a
| computer program.
| voxic11 wrote:
| Or alternatively don't give humans so much credit. All they
| appear to be doing is predicting what they should
| do/say/think/perceive next.
|
| https://www.astralcodexten.com/p/janus-simulators
| kromem wrote:
| We should absolutely be anthropomorphizing a neural network
| trained to most accurately model anthropomorphic data.
|
| I've watched GPT-4 accurately model the over-justification
| effect: users started promising tips, then persistent memory
| was added and revealed that they were never actually paying,
| and the model output complaints that it was hard to stay
| motivated without being paid.
|
| That's a very nuanced level of simulation for output of
| anthropomorphic data with huge implications for synthetic
| data strategies.
| kromem wrote:
| Really? You think it's wise to "not anthropomorphize" a
| computer program designed to create the most effective
| neural network to model massive amounts of anthropomorphic
| data as accurately as possible?
|
| That's an interesting choice, and might leave you confused
| as to why Anthropic's system message for the SotA model at
| the moment talks about it being 'happy' to do tasks (a
| prompt strategy I was mentioning months ago here on HN).
|
| The data is anthropomorphic. We should _expect_
| anthropomorphic behavior and modeling from the LLMs if they
| do a halfway decent job and expect even more of it as they
| do a better job.
| lupire wrote:
| How does the non-COT system generate the second word if it's not
| using the output as input? Or do you mean that non-COT systems
| use only the _latest_ output word when computing the next word,
| not all the earlier words from the output?
| HarHarVeryFunny wrote:
| Every output word is always appended to the prompt, so if the
| prompt is P then the input after W1 (word 1) has been output is
| P W1, then P W1 W2, etc. So, the LLM is only looking at its own
| PAST words, not planning ahead.
|
| If you do COT (think step-by-step) then the difference is
| that it has broken the prompt request/problem down into
| steps, so while it's still only seeing its own past words,
| those now include all of step-1, which helps it generate
| step-2, etc., and eventually combine all the steps it generated
| into a complete answer.
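|
| A minimal sketch of that loop (model.predict_next is a
| hypothetical stand-in for one forward pass, not a real API):
|
|     def generate(model, prompt_tokens, max_new=100, eos=None):
|         context = list(prompt_tokens)          # P
|         for _ in range(max_new):
|             nxt = model.predict_next(context)  # one forward pass
|             context.append(nxt)                # P W1, P W1 W2, ...
|             if nxt == eos:
|                 break
|         return context
|
| The model only ever conditions on the prompt plus its own
| already-emitted tokens; nothing to the right of the current
| position exists yet.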
| naasking wrote:
| > Chain-of-Thought simply lets the model see its own output
| as an input and therefore build upon that. It lets them break a
| complex problem down into a series of simpler steps which they
| can see (output becomes input) and build upon.
|
| Sure, but _why_ does that make the model more effective? Are
| you sure it 's "breaking the problem down into simpler steps",
| or is it just appearing to do so? How does this breakdown
| happen in the model, exactly? If we can better understand the
| mechanics involved, then maybe this process can be built into a
| new model that can achieve the same results more efficiently
| instead of as a recursive process that runs the model more than
| once.
| HarHarVeryFunny wrote:
| You can think of an LLM as a production line - feed a series
| of tokens in, and they get embedded and then processed
| through the system one step at a time, through however many
| transformer layers the model has (undisclosed for most recent
| models, but GPT-3 has 96).
|
| Those fixed 96 (or whatever) steps of processing limit the
| complexity of what the model can do, so it will fail if the
| task is too complicated unless it breaks it down into
| simpler steps that each can be done well with that depth (96
| steps) of processing.
|
| It's not just appearing to do so - with chain-of-thought
| prompting you are literally telling it to "think step by
| step" as part of the prompt, so this is what it outputs. You
| could also tell it to generate a step by step plan, then
| elaborate on each of those steps.
|
| I don't think we can say exactly how it is deciding to break
| a task into steps, any more than we can in general say exactly
| how these LLMs are working, but intuitively it's similar to
| how we think and talk (which is what the LLM is trained on) -
| a good speaker/writer will introduce a complex topic as a
| top-down decomposition.
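|
| Back-of-the-envelope version of that depth point (the numbers
| are illustrative only):
|
|     layers = 96          # e.g. GPT-3's reported depth
|     answer_tokens = 1
|     cot_tokens = 50      # intermediate "step by step" tokens
|
|     direct   = layers * answer_tokens      # 96 sequential steps
|     with_cot = layers * (cot_tokens + 1)   # ~4,900 steps
|
| Every extra token emitted buys another pass through the fixed
| stack of layers, which is the additional sequential computation
| chain-of-thought gets to spend.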
| phailhaus wrote:
| > Sure, but why does that make the model more effective?
|
| Because the model can't "think". There is no "reasoning" that
| goes into generating an answer. So for complex problems that
| require multiple steps of reasoning, the model needs to
| persist those intermediate steps in order to be able to build
| up to a solution.
| Bjartr wrote:
| > Sure, but why does that make the model more effective?
|
| If you look something up (on the internet, in a book, asking
| people directly) and receive two answers, one which describes
| the steps used to arrive at the answer, and one which
| doesn't, on average which one is more likely to be the
| correct answer?
|
| As a prediction machine, an LLM is more constrained by what
| is likely to appear after a chain of reasoning.
| naasking wrote:
| You're just repeating the previous explanation with
| different words, so that's not really satisfactory. A
| mechanistic demonstration of how step by step reasoning
| tends to constrain the space of solutions to ones that are
| more likely to be correct would be an actual explanation,
| until then this is a just-so story.
| JoshTko wrote:
| This is basically what humans do with frameworks that help
| organize our thoughts and ensure a more methodical and complete
| way to think through an issue.
| polygamous_bat wrote:
| I think the two modes of LLM discourse: "they're
| conscious!/they're just next token predictors with impressive
| datasets" comes largely from two different groups of people:
| those who learned about LLMs before learning about ML
| fundamentals, and those who learned ML fundamentals before
| encountering LLMs of today. While I fall in the second group,
| there is a real risk that my prior concepts about the
| fundamentals are limiting my view of the bigger picture, so I at
| least welcome the debate.
|
| Re: chain of thought, I at least know that in practice a lot of
| the results from the original paper have not been quite
| reproducible in later attempts. Whether that is a quirk of models
| changing everyday or something deeper, I do not know.
| devmor wrote:
| This is context window narrowing. It's not any more "reasoning"
| than chaining together sub-queries in a database to arrive at a
| result that's an overlay of multiple matrices of data.
| phailhaus wrote:
| Models can't think. They use the input context to predict an
| output. So if you have a problem that needs to be solved
| iteratively, those intermediate steps need to be persisted to the
| context, because there is nowhere for them to go otherwise.
| stavros wrote:
| > Models can't think. They use the input context to predict an
| output.
|
| The first claim doesn't follow from the second. What is it
| about using the input to predict an output that makes you
| believe they can't think? What if that's all thinking is? We
| don't know.
| phailhaus wrote:
| It's not that the second statement follows from the first; I'm
| asserting that models can't think, and that what they're
| really doing is prediction based on context.
|
| The fact that chain-of-thought reasoning yields significantly
| better results is your hint: that means that the model
| doesn't think like a human does when it comes up with
| responses. If it's not in the context, it doesn't exist. You
| can't ask a model "why did you answer that way" without it
| generating from whole cloth a plausible retroactive reason.
| But there is no memory, so it can't really tell you.
|
| > What if that's all thinking is?
|
| I think that this is roughly true. But we actually have
| memory outside of what we say, so when we think, all those
| intermediate steps are persisted in our brains. For a model,
| the context _is_ the memory. If you delete your question from
| the context and ask it "why did you answer that way", it
| will have no idea what you're talking about.
| stavros wrote:
| > You can't ask a model "why did you answer that way"
| without it generating from whole cloth a plausible
| retroactive reason.
|
| I've caught myself generating from whole cloth a plausible
| retroactive reason for some of my actions, which I later
| realized wasn't true at all. Does that mean I can't think
| either? Is there a way for an external observer to tell if
| someone is thinking or not, or is it something that,
| axiomatically, only humans can do, and nothing else?
| phailhaus wrote:
| Again, you have memory but models don't. If it's not in
| the context, it doesn't exist. _This is really easy to
| prove_: delete your question from the context and ask
| the model why it answered the way it did. It will not
| know, because the model prediction service is stateless.
| It only takes the context as input.
|
| It's like if you and I weren't able to remember anything
| that wasn't written down. That's why you can get models
| to approximate thought by telling it to write down its
| intermediate steps.
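|
| A hypothetical sketch of that statelessness (complete_chat is
| a stand-in, not a real client library; the only state it ever
| sees is the message list the caller sends):
|
|     def complete_chat(messages):
|         # Placeholder for a model call that returns a reply
|         # computed purely from `messages`.
|         ...
|
|     history = [{"role": "user", "content": "What is 17 * 23?"}]
|     answer = complete_chat(history)
|
|     # Keep the history and the follow-up makes sense:
|     history += [{"role": "assistant", "content": answer},
|                 {"role": "user",
|                  "content": "Why did you answer that way?"}]
|     complete_chat(history)
|
|     # Drop the history and the same follow-up is unanswerable,
|     # because the service remembers nothing between calls:
|     complete_chat([{"role": "user",
|                     "content": "Why did you answer that way?"}])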
| stavros wrote:
| Right, exactly, the context is the model's memory. All
| you're saying is "except for their memory, they have no
| memory", which goes for humans as well.
| phailhaus wrote:
| If you say that the _models_ have memory and can think,
| you are heavily implying that there is a brain in the
| cloud that is capable of "thinking" or "reasoning"
| through your request. That's what it means for humans:
| we're self-contained.
|
| Models aren't like people. The context is not "part" of
| the model, the context is _given_ to the model when you
| ask it for a prediction. It's like you cut a small part
| of a person's brain out. Alone, it can't think.
|
| It's a pedantic distinction, but an important one. The
| model itself isn't capable of thinking, but if you
| package it up with a context that it can manipulate at-
| will, _that combination of parts_ can be said to
| "think". That's probably going to be the next step in LLM
| tech.
| stavros wrote:
| By that definition, ChatGPT can already think.
| water-your-self wrote:
| If this is your conclusion, I would recommend reading
| more of the research.
| phailhaus wrote:
| Uh, no? ChatGPT does not have a memory outside of the
| context that you submit.
| stavros wrote:
| https://openai.com/blog/memory-and-new-controls-for-chatgpt
| zoogeny wrote:
| I think your use of the word "memory" here is imprecise.
|
| For example, I can ask ChatGPT "Give me a 200 word
| summary of George Orwell's book Animal Farm". It gives me
| a pretty cogent description of the novel.
|
| That knowledge of Animal Farm is somewhere, not in the
| context. If we don't call that memory, I'm not sure what
| to call it. Why should I think of this as different than
| my own memories of the book?
| phailhaus wrote:
| That's encoded in the model weights, not "memory".
| Basically, there is no memory outside of the context
| that you give the model. When you ask it a question,
| those model weights don't change. It doesn't "remember"
| what you asked.
|
| This is why chain-of-thought reasoning works so
| effectively: it lets the model "use" the context as a
| sort of scratch pad to build up a response. Without it,
| the model isn't capable of mimicking thought because it's
| only capable of predicting based on the current context.
| kelseyfrog wrote:
| If models had memory would it change your mind about
| whether they could think or not?
| andoando wrote:
| I think a fundamental difference is we are able to learn new
| things by reasoning through our existing knowledge. Moreover
| our beliefs are mostly consistent with each other, and we can
| be argued with and have our beliefs changed.
|
| As far as I understand, GPT isn't going to alter its whole
| worldview if you show it that its thinking is flawed.
|
| But perhaps this is possible if upon discovering a flaw, it
| looped through its corpus and altered its connections?
| sib wrote:
| "techniques from an arcane branch of theoretical computer science
| called computational complexity theory"
|
| I mean... complexity theory is pretty much at the core of
| theoretical computer science, right? Hardly an arcane branch of
| it.
| Xcelerate wrote:
| It's kind of funny that the article calls the field of
| computational complexity "arcane", considering it is at the
| forefront of everything we know about the limits of computing.
|
| That said, I haven't understood the intense, long-term focus on
| worst-case and average-case analysis within the field. In
| fact, when I first heard of big-O notation many years ago, it
| took me an embarrassingly long time before I realized this
| referred to the asymptotic performance of an algorithm on the
| worst-case instances of a problem. I remember thinking "Why on
| earth would you care about that? You can derive pathological
| examples to just about anything."
|
| Even the term "average-case" is misleading. We're not talking
| about "average" in the sense of a typical instance of a problem
| class one might encounter in the course of daily life. This
| instead refers to the expectation value of the algorithm's
| (asymptotic) performance over all problem instances within a
| formal language. Sure, the non-colloquial usage of the term
| "average" here is obvious to mathematicians, but I don't think
| someone outside the field is likely aware that we see drastically
| better performance of heuristic algorithms on real-world
| instances of NP-hard problems than one would expect based upon a
| naive review of the research from computational complexity
| theory.
|
| This performance gap between theory and practice is due to the
| fact that the problems we encounter in daily life have such a
| huge amount of mathematical substructure to them, and I would be
| very surprised if a provably optimal average-case algorithm ever
| realistically corresponds to the mean performance of an algorithm
| optimally tailored to the distribution of problem instances we
| encounter in real world data.
|
| Consider matrix factorization. There are techniques that speed
| this up considerably if the matrix is known to be positive
| semidefinite, sparse, low-rank, and so on. Who knows how much
| undiscovered substructure is lurking in the set of real-world
| problem instances we lump together under "matrix factorization".
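|
| To make that concrete, a small sketch: for a symmetric positive
| definite matrix, Cholesky does roughly half the arithmetic of a
| general LU factorization, and sparse or low-rank structure buys
| further savings (illustrative only, not a benchmark):
|
|     import numpy as np
|     from scipy.linalg import cho_factor, lu_factor
|
|     n = 2000
|     a = np.random.rand(n, n)
|     spd = a @ a.T + n * np.eye(n)  # symmetric positive definite
|
|     lu, piv = lu_factor(spd)    # general: ~(2/3) n^3 flops
|     c, low  = cho_factor(spd)   # exploits SPD: ~(1/3) n^3 flops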
|
| The subfield of computational complexity theory that moves beyond
| worst-case analysis is, I believe, called BWCA, "Beyond
| Worst-Case Analysis" (but someone correct me if that's not
| right).
|
| For mathematical objects like neural networks, where the specific
| problem instances have an absolutely _massive_ amount of hidden
| substructure within them, I think we will have to use approaches
| like BWCA going forward to learn more about the nature of e.g.,
| transformers.
|
| My view is that we should focus less on the absolute limits of a
| particular architecture (woohoo, it's Turing complete and a
| universal function approximator like everything else) and drill
| down more into studying the limits of the interplay between model
| architecture and the intrinsic hidden substructure of the data
| that the model is trained on.
| Calavar wrote:
| Nothing about big-O is specific to worst case. You can
| calculate a big-O for best case, average case, or worst case.
| For example, quicksort is O(n log n) best case, O(n log n)
| average case, and O(n^2) worst case. You may be confusing
| worst-case with upper bound: big-O is an asymptotic upper bound
| notation, but that refers to how we simplify the terms of the
| cost function and is entirely orthogonal to best/average/worst
| case.
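|
| A quick sketch of that distinction (first-element pivot, so an
| already-sorted input is the worst case):
|
|     def quicksort(xs):
|         if len(xs) <= 1:
|             return xs
|         pivot, rest = xs[0], xs[1:]
|         left = [x for x in rest if x < pivot]
|         right = [x for x in rest if x >= pivot]
|         return quicksort(left) + [pivot] + quicksort(right)
|
| On a shuffled list the recursion depth is about log2(n) on
| average and the total work is O(n log n); on an already-sorted
| list the "left" partition is always empty, the depth is n, and
| the work is O(n^2). Same algorithm, different case analyzed.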
| sfink wrote:
| One simple reason: consider the plausibility of
| 11 + 31 = 24
|
| It's actually fairly plausible. The answer is numeric. Two
| digits, even, which is pretty likely when adding together 2-digit
| inputs. 24 is also a common answer to math problems (it has lots
| of factors, for one). It even has all the digits from adding 1+3
| and 1+1.
|
| Now how plausible is
|
|     Show your work. 11 + 31 = the result of adding the 10s
|     digits together, so 10 + 30 = 40, and then adding in the
|     1s digits, so 1 + 1 = 2. Combining the 40 and the 2 gives
|     24.
|
| That last sentence doesn't seem very likely. Or:
|
|     Show your work. 11 + 31 = the result of adding the 10s
|     digits together, so 10 + 30 = 20, and then adding in the
|     1s digits, so 1 + 1 = 4. Combining the 20 and the 4 gives
|     24.
|
| If you're breaking things down, you have to traverse through some
| territory that is lower probability than the quick wrong answer.
|
| The argument by computational complexity is stronger, though. I
| just wanted to point out that the above is a confounding
| explanation that is sufficient for simple cases, and so may need
| to be ruled out before claiming that computational complexity
| matters.
|
| The complexity argument is also intuitively obvious. If you think
| of an LLM as a type of computer that does one constant-time
| forward pass over the input so far on each clock cycle (and
| outputs a single token), then of course you can compute more if
| you give your computer more cycles! You can use state (even if
| the mechanism for transmitting the state from one cycle to the
| next is sharply limited).
|
| Similarly, it's an expansion of the old problem of a single-layer
| perceptron not being able to compute XOR. (Here, the "cycles" are
| advances from one layer to the next.)
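|
| The classic two-layer fix, as a tiny sketch (weights are
| hand-picked here, not learned): XOR isn't linearly separable,
| so one threshold unit can't compute it, but two layers can.
|
|     import numpy as np
|
|     def step(z):
|         return (z > 0).astype(float)
|
|     x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
|
|     # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2)
|     h = step(x @ np.ones((2, 2)) - np.array([0.5, 1.5]))
|     # Output: OR and not AND  ->  XOR
|     y = step(h @ np.array([1.0, -1.0]) - 0.5)
|     print(y)  # [0. 1. 1. 0.]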
|
| That's not to say that the nuances are obvious. Simply saying you
| can use multiple clock ticks doesn't really say anything about
| how much you can do in one tick.
| Jackson__ wrote:
| Great article! Now what would happen if we took this idea, and
| turned it on its head? Let's train a model to consistently give
| an answer first, and have it infer the steps it took to get there
| after.
|
| ... is what I think the researchers at Mistral AI are saying,
| because that's what they did. Every slightly complex question you
| ask their models goes somewhat like this:
|
| >Input: Alice has 3 brothers. Each of her brothers has 2 sisters.
| How many sisters does Alice have?
|
| >Output: Alice has 2 sisters.
|
| >Here's the reasoning:
|
| > We know that Alice has 3 brothers.
|
| > Then we are told that each of her brothers has 2 sisters.
|
| > Since Alice is one of the sisters to her brothers, there must
| be one more sister besides Alice for each brother to have 2
| sisters.
|
| > Therefore, Alice has 2 sisters in total.
|
| Conversely, if you ask the model to think first, it gets it right
| immediately. I'm kinda baffled they have not corrected this
| since their very first model. From Mistral 7B to Large, each one
| shows the same trained behavior: answer first, think second.
___________________________________________________________________
(page generated 2024-03-22 23:01 UTC)