[HN Gopher] Procedural knowledge in pretraining drives reasoning...
       ___________________________________________________________________
        
       Procedural knowledge in pretraining drives reasoning in large
       language models
        
       Author : reqo
       Score  : 148 points
       Date   : 2024-12-01 16:54 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | largbae wrote:
       | Is this conclusion similar to my layman's understanding of
       | AlphaGo vs AlphaZero? That human procedural knowledge helps ML
       | training to a point, and from there on becomes a limitation?
        
         | dinfinity wrote:
         | No. They're saying that the model they analyzed used mainly
         | information on _how_ to solve math problems from its training
         | data, rather than documents that contained the answers to the
         | (identical) math problems:
         | 
         | > "We investigate which data influence the model's produced
         | reasoning traces and how those data relate to the specific
         | problems being addressed. Are models simply 'retrieving'
         | answers from previously seen pretraining data and reassembling
         | them, or are they employing a more robust strategy for
         | generalisation?"
         | 
         | > "When we characterise the top ranked documents for the
         | reasoning questions qualitatively, we confirm that the
         | influential documents often contain procedural knowledge, like
         | demonstrating how to obtain a solution using formulae or code.
         | Our findings indicate that the approach to reasoning the models
         | use is unlike retrieval, and more like a generalisable strategy
         | that synthesises procedural knowledge from documents doing a
         | similar form of reasoning."
         | 
          | Example reasoning question: > "Prompt: Calculate the answer:
          | (7 - 4) * 7. Think step-by-step."
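
          A minimal sketch (not from the paper) of the kind of procedural,
          step-by-step trace the quoted passage describes for this prompt;
          the function name and the wording of the steps are illustrative
          assumptions, not the paper's own code:

            def solve_step_by_step(a: int, b: int, c: int) -> list[str]:
                # Procedural solution: evaluate the parentheses first,
                # then multiply, recording each intermediate result.
                inner = a - b
                result = inner * c
                return [
                    f"Step 1: ({a} - {b}) = {inner}",
                    f"Step 2: {inner} * {c} = {result}",
                    f"Answer: {result}",
                ]

            for line in solve_step_by_step(7, 4, 7):
                print(line)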
        
           | spitfire wrote:
           | What I further got from this is the models are learning the
           | methods, but not evaluating themselves along the way. They
           | don't check for errors.
           | 
           | So once they go down a path they can't properly backtrack.
           | 
            | This matches the behaviour I've seen from LLMs to date.
        
             | spitfire wrote:
              | I'll add that when I say "learning" I mean memorization:
              | memorizing at a higher level than facts.
              | 
              | I would love to spend the time and see how altering the
              | query alters the reasoning path. How firm is the path
              | once it's chosen?
              | 
              | A high-level approach has the potential to be very
              | compute efficient.
        
             | NitpickLawyer wrote:
             | > So once they go down a path they can't properly
             | backtrack.
             | 
              | That's what the specific training in o1 / r1 / qwq is
             | addressing. The model outputs things like "i need to ... >
             | thought 1 > ... > wait that's wrong > i need to go back >
             | thought 2 > ... etc
        
       | sgt101 wrote:
       | drives retrieval of patterns of procedure?
       | 
       | I mean - like for arithmetic?
        
       | ijk wrote:
       | This would explain the unexpected benefits of training on code.
        
         | strken wrote:
         | That sounds interesting, but I'm a layman and don't know
         | anything about it. Can you provide a link?
         | 
         | I was able to find https://arxiv.org/abs/2408.10914, but I
         | don't have the context to know whether it's the paper you're
         | talking about.
        
           | MurizS wrote:
           | I think GP was probably referring to "Scaling Data-
           | Constrained Language Models" (2305.16264) from NeurIPS 2023,
           | which looked first at how to optimally scale LLMs when
           | training data is limited. There is a short section on mixing
           | code (Python) into the training data and the effect this has
           | on performance on e.g. natural language tasks. One of their
           | findings was that training data can be up to 50% code without
           | actually degrading performance, and in some cases (benchmarks
           | like bAbI and WebNLG) with improvements (probably because
           | these tasks have an emphasis on what they call "long-range
           | state tracking capabilities").
           | 
           | For reference: In the Llama 3 technical report (2407.21783),
           | they mention that they ended up using 17% code tokens in
           | their training data.
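
            For illustration only, a hedged sketch of what mixing a fixed
            code fraction into a pretraining stream could look like;
            mixed_stream and its arguments are hypothetical, and the 0.17
            default simply echoes the Llama 3 figure quoted above:

              import random

              def mixed_stream(text_docs, code_docs,
                               code_fraction=0.17, seed=0):
                  # Interleave documents, drawing a code document with
                  # probability code_fraction; stop when either source
                  # runs out.
                  rng = random.Random(seed)
                  text_iter, code_iter = iter(text_docs), iter(code_docs)
                  while True:
                      pick_code = rng.random() < code_fraction
                      source = code_iter if pick_code else text_iter
                      try:
                          yield next(source)
                      except StopIteration:
                          return

              sample = list(mixed_stream(["text"] * 8, ["code"] * 2))
              print(sample)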
        
       | jpcom wrote:
       | You mean you need humans to step-by-step solve a problem so a
       | neural net can mimic it? It sounds kinda obvious now that I write
       | it out.
        
         | mattdeboard wrote:
         | No. If I'm understanding correctly it means the software is
         | learning how to solve problems in general by ingesting examples
         | of procedural problem-solving.
        
           | jpcom wrote:
           | You're close, but there's an important nuance. The process
           | isn't about "learning how to solve problems in general" in
           | the broad sense. It's more specific: the neural network is
           | trained to mimic the step-by-step process demonstrated by
           | humans solving a specific problem.
           | 
           | The distinction is that the software doesn't autonomously
           | derive general problem-solving heuristics from scratch.
           | Instead, it observes examples of how humans solve problems
           | procedurally and uses that to replicate similar reasoning.
           | This is crucial because the step-by-step demonstrations give
           | the model structure and guidance, which is different from
           | learning a generalizable strategy for solving any kind of
           | problem without those examples.
           | 
           | In essence, it's like a neural net learning to follow a
           | recipe by watching a chef cook--rather than inventing its own
           | recipes entirely from first principles.
        
             | jebarker wrote:
             | > In essence, it's like a neural net learning to follow a
             | recipe by watching a chef cook--rather than inventing its
             | own recipes entirely from first principles.
             | 
             | Just like how a chef learns
        
               | Retric wrote:
               | A chef also learns through trial and error not just
                | reading how others have cooked in the past and then
                | copying their motions.
               | 
                | This is exemplified by how altitude has a meaningful
                | impact on cooking but usually isn't discussed in a given
                | recipe.
        
               | exe34 wrote:
               | a text LLM isn't going to learn by trial and error, it's
               | not been given that sort of freedom. RLHF would be the
               | llm version of trial and error - but it's like the chef
               | is only allowed to do that for a few days after years of
               | chef school and from then on, he has to stick to what he
               | has already learnt.
        
               | jebarker wrote:
               | Why isn't LLM pre-training based on next token prediction
               | considered "trial and error"? It seems to fit that
               | description pretty well to me.
        
               | exe34 wrote:
                | a chef doesn't get feedback on his meal after every
                | action, like picking up the spoon. he gets feedback when
                | he or somebody else tastes the meal part way through and
                | at the end.
        
               | Retric wrote:
                | Pre-training is based on a proxy for the desired output,
                | not the actual desired output. It's not in the form of
                | responses to a prompt, and 1:1 reproducing copyrighted
                | works in production would be bad.
               | 
                | It's the difference between a painter copying some work
                | and a painter making an original piece and then getting
                | feedback on it. We consider the second trial and error
                | because the full process is being tested, not just the
                | technique.
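
                A toy sketch of the two objectives being contrasted here,
                assuming PyTorch; the tiny embedding-plus-linear "model"
                and all names are illustrative assumptions, not how any
                production LLM is trained:

                  import torch
                  import torch.nn.functional as F

                  vocab, dim = 100, 16
                  embed = torch.nn.Embedding(vocab, dim)
                  head = torch.nn.Linear(dim, vocab)

                  def next_token_loss(tokens):
                      # Pretraining proxy objective: predict each
                      # next token of existing text.
                      logits = head(embed(tokens[:, :-1]))
                      return F.cross_entropy(
                          logits.reshape(-1, vocab),
                          tokens[:, 1:].reshape(-1))

                  def reinforce_loss(sampled_logprobs, reward):
                      # RLHF-flavoured trial and error: the model's own
                      # sampled response is scored after the fact
                      # (REINFORCE-style), not matched to reference text.
                      return -(reward * sampled_logprobs).mean()

                  tokens = torch.randint(0, vocab, (2, 8))
                  print(next_token_loss(tokens))
                  print(reinforce_loss(torch.randn(4), reward=1.0))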
        
             | scellus wrote:
             | Yes, except that I'm not so sure there is a clear
             | distinction between following general instructions and
              | generating new heuristics. It's just a difference in the
              | level of abstraction, and probably not even a discrete one;
              | more like a continuum.
             | 
             | (Current) models may of course lack sufficient training
             | data to act on a metalevel enough ("be creative problem
             | solvers"), or they may lack deep enough representations to
             | efficiently act in a more creative way. (And those two may
             | be more or less the same thing or not.)
        
               | exe34 wrote:
               | it's exactly how we learn. many examples and then general
               | principles. if you start with general principles,
               | everybody drops out.
        
               | bravura wrote:
               | Not "exactly" how we learn. Humans learn through a
               | combination of reinforcement learning (which is
               | costly/risky/painful) and through observation of existing
               | patterns and norms.
               | 
               | Better observation-based learning is a less expensive way
               | of improving existing corpus-based approaches than trial-
               | and-error and participating in an environment.
        
               | exe34 wrote:
               | except that the careful observation comes late in the
               | curriculum. children don't learn if you start out with
                | the Stern-Gerlach experiment. they sing ABCs.
        
               | pfisherman wrote:
               | The parent of any young child can tell you that they
               | learn through lots of exploration and reinforcement -
               | often to the worry and chagrin of caregivers. Indeed much
               | of our job is to guide exploration away from excessively
               | dangerous "research" activities (ex. locking away
               | cleaning products).
        
             | limit499karma wrote:
             | > it observes
             | 
             | Observe implies sentience that, without question, a neural
             | net simply does not possess. "It" certainly 'records', or
             | more specifically it 'maps', but there is no observer in
             | sight (npi).
             | 
             | > mimic
             | 
             | LLM's do not mimic. The magic is mathematical and happening
              | in the high-dimensional space. If there are intrinsic
              | underlying patterns and semantic affinities between process
              | X (used in training) and process Y (used in application),
              | it is very likely that both share proximity, and possibly
              | form, in some dimensions of the high-dimensional model.
        
             | ChadNauseam wrote:
              | spoken eerily similarly to how chatgpt would put it :) https:
             | //chatgpt.com/share/674cd11d-a30c-8005-90a3-023d0c9c18...
        
             | unit149 wrote:
             | Crucially, this is what MacIntyre's narrativity thesis is
             | talking about:
             | 
             | If a university professor is giving a lecture on
             | decentralized finance and forks into a recipe for chocolate
             | chip cookies: crack two eggs, add a cup of flour, and fold
             | in brown sugar prior to baking, it would break linearity.
             | 
              | A generalizable strategy for synthesizing LLMs
              | differentiated by their training parameters is a
              | tokenization that isolates data sets and then establishes a
              | lattice of uniformity within the field of technics.
        
       | semessier wrote:
        | that resonates - fewer facts and more reasoning training data.
        | The lowest-hanging fruit in terms of non-synthetic data is
        | probably mathematical proofs. With Prolog and the like, many
        | alternate reasoning paths could be generated. It's hard to say
        | whether these many-path traces would help in LLM training without
        | access to the gigantic machines (it's so unfair) to try it on.
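
        A hedged sketch of the "many alternate reasoning paths" idea,
        using plain Python enumeration instead of Prolog; the example
        (different orders of summing the same numbers) is purely
        illustrative:

          from itertools import permutations

          def alternate_paths(numbers):
              # Each ordering of the additions is a distinct derivation
              # of the same total.
              paths = set()
              for order in permutations(numbers):
                  total, steps = 0, []
                  for n in order:
                      steps.append(f"{total} + {n} = {total + n}")
                      total += n
                  paths.add(tuple(steps))
              return paths

          for path in sorted(alternate_paths([2, 3, 5])):
              print(" -> ".join(path))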
        
       | shermantanktop wrote:
       | Going meta a bit: comments so far on this post show diametrically
       | opposing understandings of the paper, which demonstrates just how
       | varied the interpretation of complex text can be.
       | 
       | We hold AI to a pretty high standard of correctness, as we
       | should, but humans are not that reliable on matters of fact, let
       | alone on rigor of reasoning.
        
         | sigmoid10 wrote:
         | This is extremely common in these discussions. Most humans are
         | not that good at reasoning themselves and fall for the same
         | kind of fallacies over and over because of the way they were
         | brought up (their training data so to speak). And yet they
         | somehow think they can argue why or why not LLMs should be able
          | to do the same. If anything, the current limits of these models
          | show the limits of human cognition, which is spread throughout
          | the internet - because this is literally what they learned
          | from. I believe once we achieve more independent learning
          | (like we've seen glimpses of in the MuZero paper) these models
         | will blow human intelligence out of the water.
        
           | pclmulqdq wrote:
           | It's because we can put responsibility on humans to be
           | correct but we can't on computers. Humans given the
           | appropriate incentives are _very_ good at their jobs and
           | there is a path for compensation if they screw up. Computers
           | have neither of these things.
        
             | sigmoid10 wrote:
             | Humans already put a lot of trust in computers not because
             | they can take responsibility but because traditional
             | software can be made very predictable or at least
             | compliant. There are whole industries built around software
             | standards to ensure that. The problem is we don't yet know
             | enough about identifying and patching problems in these
             | models. Once we get something equivalent to MISRA for LLMs
             | to achieve the same level of compliance, there is very
             | little that could still hold them back.
        
       | ninetyninenine wrote:
       | >On the one hand, LLMs demonstrate a general ability to solve
       | problems. On the other hand, they show surprising reasoning gaps
       | when compared to humans, casting doubt on the robustness of their
       | generalisation strategies
       | 
        | surprised this gets voted up given the surprising number of users
       | on HN who think LLMs can't reason at all and that the only way to
       | characterize an LLM is through the lens of a next token
       | predictor. Last time I was talking about LLM intelligence someone
       | rudely told me to read up on how LLMs work and that we already
       | know exactly how they work and they're just token predictors.
        
         | ben_w wrote:
         | The loudest people seem to be those with the most extreme
         | positions, and that includes on "is ${specific AI}
         | (useless|superhuman) for ${domain}?". Perhaps it's just
         | perception, but perhaps the arguments make them persist, as CGP
         | Grey pointed out: https://www.youtube.com/watch?v=rE3j_RHkqJc
         | 
         | As I'm in the middle, I get flack from people on both extremes,
         | as I'm outside their (equivalent of or just literally?) Overton
         | window on this subject. Seems like an odd zone to be in for the
         | opinion "this is a useful tool, but I see loads of ways it can
         | go wrong". Makes me wonder what the _real_ common discourse was
         | of looms during the industrial revolution, and not just the
         | modern summary of that era.
        
         | vundercind wrote:
         | The "surprising gaps" are precisely because they're not
          | reasoning--or, at least, not "reasoning" about the things a
          | human would be reasoning about to solve the problems, but about
          | some often-
         | correlated but different set of facts about relationships
         | between tokens in writing.
         | 
         | It's the failure modes that make the distinction clearest.
         | 
         | LLM output is only meaningful, in the way we usually mean that,
         | at the point we assigned external, human meaning to it, after
         | the fact. The LLM wouldn't stop operating or become "confused"
         | if fed gibberish, because the meaning it's extracting doesn't
         | depend on the meaning _humans_ assign things, except by
         | coincidence--which coincidence we foster by feeding them with
         | things we do _not_ regard as gibberish, but that's beside the
         | point so far as how they "really work" goes.
        
       | rors wrote:
       | It seems obvious to me that LLMs wouldn't be able to find
       | examples of every single problem posed to them in training data.
       | There wouldn't be enough examples for the factual look up needed
       | in an information retrieval style search. I can believe that
       | they're doing some form of extrapolation to create novel
       | solutions to posed problems.
       | 
       | It's interesting that this paper doesn't contradict the
       | conclusions of the Apple LLM paper[0], where prompts were
       | corrupted to force the LLM into making errors. I can also believe
       | that LLMs can only make small deviations from existing example
       | solutions in creation of these novel solutions.
       | 
        | I hate that we're using the term "reasoning" for this solution
        | generation process. It's a term coined by LLM companies to evoke
        | an almost emotional response in how we talk about this
        | technology. However, it does appear that we are capable of
        | instructing machines to follow a series of steps using natural
        | language, with some degree of ambiguity. That in and of itself is
        | a huge stride forward.
       | 
       | [0] https://machinelearning.apple.com/research/gsm-symbolic
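
        In the spirit of the perturbation idea mentioned above (not the
        Apple paper's actual benchmark), a small sketch that varies the
        surface details of a templated word problem while keeping the
        expected answer computable; the template and names are made up:

          import random

          TEMPLATE = ("{name} has {a} apples and buys {b} more. "
                      "How many apples does {name} have now?")

          def make_variant(rng):
              # Swap names and numbers; the underlying procedure
              # (one addition) stays the same.
              name = rng.choice(["Ada", "Bob", "Chen"])
              a, b = rng.randint(2, 9), rng.randint(2, 9)
              return TEMPLATE.format(name=name, a=a, b=b), a + b

          rng = random.Random(0)
          for _ in range(3):
              prompt, expected = make_variant(rng)
              print(prompt, "| expected:", expected)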
        
         | ucefkh wrote:
          | Totally, these companies are pushing towards showcasing their
          | AI models as self-thinking and reasoning AI, while they are just
          | trained on a large amount of data in dataset format, which they
          | extrapolate from to find the right answer.
          | 
          | They still can't think outside the box of their datasets
        
         | pfisherman wrote:
         | I very much agree with the perspective that LLMs are not suited
         | for "reasoning" in the sense of creative problem solving or
         | application of logic. I think that the real potential in this
         | domain is having them act as a sort of "compiler" layer that
         | bridges the gap between natural language - which is imprecise -
         | and formal languages (sql, prolog, python, lean, etc) that are
         | more suited for solving these types of problems. And then maybe
         | synthesizing the results / outputs of the formal language
         | layer. Basically "agents".
         | 
         | That being said, I do think that LLMs are capable of "verbal
         | reasoning" operations. I don't have a good sense of the
         | boundaries that distinguish the logics - verbal, qualitative,
         | quantitative reasoning. What comes to my mind is the verbal
         | sections of standardized tests.
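
          A hedged sketch of the "compiler layer" idea described above,
          assuming Python and SQLite; translate_to_sql is a stand-in stub
          for an LLM call, not a real API:

            import sqlite3

            def translate_to_sql(question: str) -> str:
                # Stand-in for an LLM turning natural language into SQL.
                return "SELECT AVG(price) FROM orders WHERE region = 'EU'"

            def answer(question: str, conn: sqlite3.Connection) -> str:
                # The formal-language layer (SQLite) does the actual
                # computation; a second LLM pass could turn the value
                # back into prose.
                sql = translate_to_sql(question)
                (value,) = conn.execute(sql).fetchone()
                return f"{question} -> {value}"

            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE orders (price REAL, region TEXT)")
            conn.executemany("INSERT INTO orders VALUES (?, ?)",
                             [(10.0, "EU"), (20.0, "EU"), (5.0, "US")])
            print(answer("What is the average order price in the EU?",
                         conn))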
        
       | ricardobeat wrote:
       | Does this mean LLMs might do better if trained on large amounts
       | of student notes, exams, book reviews and such? That would be
       | incredibly interesting.
        
         | GarnetFloride wrote:
         | I have wondered that from time to time, why not train an AI
         | system using educational curricula plus some games and play? It
         | might be fascinating to see what comes out using various
         | systems from around the world.
        
       | btilly wrote:
       | This is highly relevant to the recent discussion at
       | https://news.ycombinator.com/item?id=42285128.
       | 
       | Google claims that their use of pretraining is a key requirement
       | for being able to deliver a (slightly) better chip design. And
       | they claim that a responding paper that did not attempt to do
        | pretraining should have been expected to be well below the state
       | of the art in chip design.
       | 
       | Given how important reasoning is for chip design, and given how
       | important pretraining is for driving reasoning in large language
       | models, it is obvious that Google's reasoning is very reasonable.
       | If Google barely beats the state of the art while using
       | pretraining, an attempt that doesn't pretrain should be expected
       | to be well below the current state of the art. And therefore that
       | second attempt's poor performance says nothing about whether
       | Google's results are plausible.
        
         | pfisherman wrote:
         | I am not an expert in the particular application domain of that
          | article, but I can see why their argument about pre-training
          | might be valid. It is not especially controversial to say that
          | pre-training neural nets improves few-shot learning performance.
          | And I suspect there is an inflection point for every problem
          | where pre-trained neural nets yield better few-shot learning
          | performance than less data-hungry approaches, such as hand-
          | crafted features or strong priors.
         | 
         | That being said, it seems that the question here is whether
         | that inflection point has been reached in this case.
        
       ___________________________________________________________________
       (page generated 2024-12-01 23:00 UTC)