[HN Gopher] Procedural knowledge in pretraining drives reasoning...
___________________________________________________________________
Procedural knowledge in pretraining drives reasoning in large
language models
Author : reqo
Score : 148 points
Date : 2024-12-01 16:54 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| largbae wrote:
| Is this conclusion similar to my layman's understanding of
| AlphaGo vs AlphaZero? That human procedural knowledge helps ML
| training to a point, and from there on becomes a limitation?
| dinfinity wrote:
| No. They're saying that the model they analyzed used mainly
| information on _how_ to solve math problems from its training
| data, rather than documents that contained the answers to the
| (identical) math problems:
|
| > "We investigate which data influence the model's produced
| reasoning traces and how those data relate to the specific
| problems being addressed. Are models simply 'retrieving'
| answers from previously seen pretraining data and reassembling
| them, or are they employing a more robust strategy for
| generalisation?"
|
| > "When we characterise the top ranked documents for the
| reasoning questions qualitatively, we confirm that the
| influential documents often contain procedural knowledge, like
| demonstrating how to obtain a solution using formulae or code.
| Our findings indicate that the approach to reasoning the models
| use is unlike retrieval, and more like a generalisable strategy
| that synthesises procedural knowledge from documents doing a
| similar form of reasoning."
|
| Example reasoning question:
|
| > "Prompt: Calculate the answer: (7 - 4) * 7 Think step-by-step."
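|
| (For that prompt, the expected trace is simply 7 - 4 = 3, then
| 3 * 7 = 21.)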
| spitfire wrote:
| What I further got from this is the models are learning the
| methods, but not evaluating themselves along the way. They
| don't check for errors.
|
| So once they go down a path they can't properly backtrack.
|
| This feels like the ground truth I've experienced in LLMs to
| date.
| spitfire wrote:
| I'll add that when I say "learning" I mean memorization.
| Memorizing at a higher level than facts.
|
| I would love to spend the time and see how altering the
| query alters the reasoning path. How firm is the path once
| it's chosen?
|
| A high-level approach has the potential to be very compute
| efficient.
| NitpickLawyer wrote:
| > So once they go down a path they can't properly
| backtrack.
|
| That's what the specific training in o1 / r1 / QwQ
| addresses. The model outputs things like "I need to ... >
| thought 1 > ... > wait, that's wrong > I need to go back >
| thought 2 > ..." etc.
| sgt101 wrote:
| drives retrieval of patterns of procedure?
|
| I mean - like for arithmetic?
| ijk wrote:
| This would explain the unexpected benefits of training on code.
| strken wrote:
| That sounds interesting, but I'm a layman and don't know
| anything about it. Can you provide a link?
|
| I was able to find https://arxiv.org/abs/2408.10914, but I
| don't have the context to know whether it's the paper you're
| talking about.
| MurizS wrote:
| I think GP was probably referring to "Scaling Data-
| Constrained Language Models" (2305.16264) from NeurIPS 2023,
| which first looked at how to optimally scale LLMs when
| training data is limited. There is a short section on mixing
| code (Python) into the training data and the effect this has
| on performance on e.g. natural language tasks. One of their
| findings was that training data can be up to 50% code without
| actually degrading performance, and on some benchmarks (bAbI
| and WebNLG) it even helps, probably because those tasks
| emphasize what they call "long-range state tracking
| capabilities".
|
| For reference: In the Llama 3 technical report (2407.21783),
| they mention that they ended up using 17% code tokens in
| their training data.
| jpcom wrote:
| You mean you need humans to step-by-step solve a problem so a
| neural net can mimic it? It sounds kinda obvious now that I write
| it out.
| mattdeboard wrote:
| No. If I'm understanding correctly it means the software is
| learning how to solve problems in general by ingesting examples
| of procedural problem-solving.
| jpcom wrote:
| You're close, but there's an important nuance. The process
| isn't about "learning how to solve problems in general" in
| the broad sense. It's more specific: the neural network is
| trained to mimic the step-by-step process demonstrated by
| humans solving a specific problem.
|
| The distinction is that the software doesn't autonomously
| derive general problem-solving heuristics from scratch.
| Instead, it observes examples of how humans solve problems
| procedurally and uses that to replicate similar reasoning.
| This is crucial because the step-by-step demonstrations give
| the model structure and guidance, which is different from
| learning a generalizable strategy for solving any kind of
| problem without those examples.
|
| In essence, it's like a neural net learning to follow a
| recipe by watching a chef cook--rather than inventing its own
| recipes entirely from first principles.
| jebarker wrote:
| > In essence, it's like a neural net learning to follow a
| recipe by watching a chef cook--rather than inventing its
| own recipes entirely from first principles.
|
| Just like how a chef learns
| Retric wrote:
| A chef also learns through trial and error, not just by
| reading how others have cooked in the past and then copying
| their motions.
|
| This is exemplified by how altitude has a meaningful impact
| on cooking but isn't discussed in a given recipe.
| exe34 wrote:
| a text LLM isn't going to learn by trial and error; it's
| not been given that sort of freedom. RLHF would be the LLM
| version of trial and error - but it's as if the chef is
| only allowed to do that for a few days after years of chef
| school, and from then on he has to stick to what he has
| already learnt.
| jebarker wrote:
| Why isn't LLM pre-training based on next token prediction
| considered "trial and error"? It seems to fit that
| description pretty well to me.
| exe34 wrote:
| a chef doesn't get feedback on his meal after picking up
| the spoon. he gets feedback when he or somebody else
| tastes the meal part way through and at the end.
| Retric wrote:
| Pre-training is based on a proxy for the desired output,
| not the desired output itself. It's not in the form of
| responses to a prompt, and 1:1 reproducing copyrighted
| works in production would be bad.
|
| It's the difference between a painter copying some work
| and a painter making an original piece and then getting
| feedback on it. We consider the second trial and error
| because the full process is being tested, not just
| technique.
| scellus wrote:
| Yes, except that I'm not so sure there is a clear
| distinction between following general instructions and
| generating new heuristics. It's just a difference in the
| level of abstraction, and probably not even a discrete
| one; more like a continuum.
|
| (Current) models may of course lack sufficient training
| data to act at a meta level ("be creative problem
| solvers"), or they may lack deep enough representations to
| act in a more creative way efficiently. (And those two may
| be more or less the same thing, or not.)
| exe34 wrote:
| it's exactly how we learn. many examples and then general
| principles. if you start with general principles,
| everybody drops out.
| bravura wrote:
| Not "exactly" how we learn. Humans learn through a
| combination of reinforcement learning (which is
| costly/risky/painful) and through observation of existing
| patterns and norms.
|
| Better observation-based learning is a less expensive way
| of improving existing corpus-based approaches than trial-
| and-error and participating in an environment.
| exe34 wrote:
| except that the careful observation comes late in the
| curriculum. children don't learn if you start out with
| the Stern-Gerlach experiment; they sing ABCs.
| pfisherman wrote:
| The parent of any young child can tell you that they
| learn through lots of exploration and reinforcement -
| often to the worry and chagrin of caregivers. Indeed much
| of our job is to guide exploration away from excessively
| dangerous "research" activities (ex. locking away
| cleaning products).
| limit499karma wrote:
| > it observes
|
| Observe implies sentience that, without question, a neural
| net simply does not possess. "It" certainly 'records', or
| more specifically it 'maps', but there is no observer in
| sight (npi).
|
| > mimic
|
| LLMs do not mimic. The magic is mathematical and happens
| in high-dimensional space. If there are intrinsic
| underlying patterns and semantic affinities between process
| X (used in training) and process Y (used in application),
| it is very likely that both share proximity, possibly form,
| in some dimensions of the high-dimensional model.
| ChadNauseam wrote:
| spoken eerily similarly to how ChatGPT would put it :)
| https://chatgpt.com/share/674cd11d-a30c-8005-90a3-023d0c9c18...
| unit149 wrote:
| Crucially, this is what MacIntyre's narrativity thesis is
| talking about:
|
| If a university professor is giving a lecture on
| decentralized finance and forks into a recipe for chocolate
| chip cookies: crack two eggs, add a cup of flour, and fold
| in brown sugar prior to baking, it would break linearity.
|
| A generalizable strategy for synthesizing LLMs
| differentiated by their training parameters is
| tokenization: isolating data sets and then establishing a
| lattice of uniformity within the field of technics.
| semessier wrote:
| that resonates - fewer facts and more reasoning training
| data. The lowest-hanging fruit in terms of non-synthetic data
| is probably mathematical proofs. With Prolog and the like,
| many alternate reasoning paths could be generated. It's hard
| to say whether these many-path traces would help in LLM
| training without access to the gigantic machines (it's so
| unfair) to try it on.
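|
| As a rough illustration of what generating alternate reasoning
| paths could look like (a toy sketch, not from the paper: the
| rules, facts and function below are made up, and real Prolog
| would use unification rather than this propositional version):
|
|     # enumerate every derivation of a goal from Horn-style rules;
|     # each derivation is one step-by-step "reasoning path"
|     RULES = {
|         "path(a,c)": [["edge(a,c)"],
|                       ["edge(a,b)", "path(b,c)"]],
|         "path(b,c)": [["edge(b,c)"]],
|     }
|     FACTS = {"edge(a,b)", "edge(b,c)", "edge(a,c)"}
|
|     def prove(goal, trace):
|         if goal in FACTS:
|             yield trace + [goal + " is a known fact"]
|             return
|         for body in RULES.get(goal, []):
|             partials = [trace + ["to show " + goal + ", show "
|                                  + " and ".join(body)]]
|             for subgoal in body:
|                 partials = [t for p in partials
|                             for t in prove(subgoal, p)]
|             yield from partials
|
|     for i, path in enumerate(prove("path(a,c)", []), 1):
|         print(f"--- reasoning path {i} ---", *path, sep="\n")
|
| Each printed path is a different step-by-step derivation of
| the same conclusion, which is roughly the kind of procedural
| text the paper finds influential.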
| shermantanktop wrote:
| Going meta a bit: comments so far on this post show diametrically
| opposing understandings of the paper, which demonstrates just how
| varied the interpretation of complex text can be.
|
| We hold AI to a pretty high standard of correctness, as we
| should, but humans are not that reliable on matters of fact, let
| alone on rigor of reasoning.
| sigmoid10 wrote:
| This is extremely common in these discussions. Most humans are
| not that good at reasoning themselves and fall for the same
| kinds of fallacies over and over because of the way they were
| brought up (their training data, so to speak). And yet they
| somehow think they can argue why or why not LLMs should be able
| to do the same. If anything, the current limits of these models
| show the limits of the human cognition that is spread
| throughout the internet - because that is literally what they
| learned from. I believe once we achieve more independent
| learning (like we've seen glimpses of in the MuZero paper)
| these models will blow human intelligence out of the water.
| pclmulqdq wrote:
| It's because we can put responsibility on humans to be
| correct but we can't on computers. Humans given the
| appropriate incentives are _very_ good at their jobs and
| there is a path for compensation if they screw up. Computers
| have neither of these things.
| sigmoid10 wrote:
| Humans already put a lot of trust in computers not because
| they can take responsibility but because traditional
| software can be made very predictable or at least
| compliant. There are whole industries built around software
| standards to ensure that. The problem is we don't yet know
| enough about identifying and patching problems in these
| models. Once we get something equivalent to MISRA for LLMs
| to achieve the same level of compliance, there is very
| little that could still hold them back.
| ninetyninenine wrote:
| >On the one hand, LLMs demonstrate a general ability to solve
| problems. On the other hand, they show surprising reasoning gaps
| when compared to humans, casting doubt on the robustness of their
| generalisation strategies
|
| surprised this gets voted up given the surprising number of
| users on HN who think LLMs can't reason at all and that the
| only way to characterize an LLM is through the lens of a next-
| token predictor. Last time I was talking about LLM
| intelligence, someone rudely told me to read up on how LLMs
| work and that we already know exactly how they work and that
| they're just token predictors.
| ben_w wrote:
| The loudest people seem to be those with the most extreme
| positions, and that includes on "is ${specific AI}
| (useless|superhuman) for ${domain}?". Perhaps it's just
| perception, but perhaps the arguments make them persist, as CGP
| Grey pointed out: https://www.youtube.com/watch?v=rE3j_RHkqJc
|
| As I'm in the middle, I get flak from people on both extremes,
| as I'm outside their (equivalent of, or just literally?)
| Overton window on this subject. Seems like an odd zone to be
| in for the opinion "this is a useful tool, but I see loads of
| ways it can go wrong". Makes me wonder what the _real_ common
| discourse around looms was during the industrial revolution,
| and not just the modern summary of that era.
| vundercind wrote:
| The "surprising gaps" are precisely because they're not
| reasoning--or, at least, not "reasoning" about the things a
| human would be to solve the problems, but about some often-
| correlated but different set of facts about relationships
| between tokens in writing.
|
| It's the failure modes that make the distinction clearest.
|
| LLM output is only meaningful, in the way we usually mean that,
| at the point we assigned external, human meaning to it, after
| the fact. The LLM wouldn't stop operating or become "confused"
| if fed gibberish, because the meaning it's extracting doesn't
| depend on the meaning _humans_ assign things, except by
| coincidence--which coincidence we foster by feeding them with
| things we do _not_ regard as gibberish, but that's beside the
| point so far as how they "really work" goes.
| rors wrote:
| It seems obvious to me that LLMs wouldn't be able to find
| examples of every single problem posed to them in training data.
| There wouldn't be enough examples for the factual look up needed
| in an information retrieval style search. I can believe that
| they're doing some form of extrapolation to create novel
| solutions to posed problems.
|
| It's interesting that this paper doesn't contradict the
| conclusions of the Apple LLM paper[0], where prompts were
| corrupted to force the LLM into making errors. I can also believe
| that LLMs can only make small deviations from existing example
| solutions in creation of these novel solutions.
|
| I hate that we're using the term "reasoning" for this
| solution-generation process. It's a term coined by LLM
| companies to evoke an almost emotional response in how we talk
| about this technology. However, it does appear that we are
| capable of instructing machines to follow a series of steps
| using natural language, with some degree of ambiguity. That in
| and of itself is a huge stride forward.
|
| [0] https://machinelearning.apple.com/research/gsm-symbolic
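|
| For a sense of what that corruption looks like: GSM-Symbolic
| turns benchmark questions into templates and regenerates them
| with different names and numbers, roughly in the spirit of
| this toy generator (the template, names and ranges here are
| invented for illustration):
|
|     import random
|
|     TEMPLATE = ("{name} has {a} apples and buys {b} more. "
|                 "Each pie needs {c} apples. "
|                 "How many pies can {name} make?")
|
|     def variant(seed):
|         rng = random.Random(seed)
|         a, b = rng.randint(5, 30), rng.randint(5, 30)
|         c = rng.randint(2, 6)
|         name = rng.choice(["Ava", "Noor", "Kenji", "Maria"])
|         question = TEMPLATE.format(name=name, a=a, b=b, c=c)
|         return question, (a + b) // c   # ground-truth answer
|
|     for s in range(3):
|         q, gold = variant(s)
|         print(q, "->", gold)
|
| Same reasoning structure, different surface form; the paper
| reports that accuracy varies and drops under this kind of
| change.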
| ucefkh wrote:
| Totally. These companies are pushing towards showcasing their
| AI models as self-thinking, reasoning AI, while they are just
| trained on huge amounts of data in dataset form, which they
| extrapolate from to find the right answer.
|
| They still can't think outside the box of their datasets.
| pfisherman wrote:
| I very much agree with the perspective that LLMs are not suited
| for "reasoning" in the sense of creative problem solving or
| application of logic. I think that the real potential in this
| domain is having them act as a sort of "compiler" layer that
| bridges the gap between natural language - which is imprecise -
| and formal languages (sql, prolog, python, lean, etc) that are
| more suited for solving these types of problems. And then maybe
| synthesizing the results / outputs of the formal language
| layer. Basically "agents".
|
| That being said, I do think that LLMs are capable of "verbal
| reasoning" operations. I don't have a good sense of the
| boundaries that distinguish the logics - verbal, qualitative,
| quantitative reasoning. What comes to my mind is the verbal
| sections of standardized tests.
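|
| A minimal sketch of that "compiler layer" idea (complete() is
| a stand-in for whatever LLM API you use, and the sales table
| and prompts are invented for illustration):
|
|     import sqlite3
|
|     def complete(prompt: str) -> str:
|         # placeholder: call your LLM of choice here
|         raise NotImplementedError
|
|     def answer(question: str, db: sqlite3.Connection) -> str:
|         # 1. imprecise natural language -> precise formal language
|         sql = complete(
|             "Write one SQLite query over sales(region TEXT, "
|             "amount REAL) that answers: " + question)
|         # 2. the formal layer does the exact work
|         rows = db.execute(sql).fetchall()
|         # 3. synthesize the exact result back into prose
|         return complete(
|             "Question: " + question +
|             "\nQuery result: " + repr(rows) + "\nAnswer:")
|
| The LLM never has to be right about arithmetic or logic
| itself, only about translating in and out of the formal
| language.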
| ricardobeat wrote:
| Does this mean LLMs might do better if trained on large amounts
| of student notes, exams, book reviews and such? That would be
| incredibly interesting.
| GarnetFloride wrote:
| I have wondered that from time to time, why not train an AI
| system using educational curricula plus some games and play? It
| might be fascinating to see what comes out using various
| systems from around the world.
| btilly wrote:
| This is highly relevant to the recent discussion at
| https://news.ycombinator.com/item?id=42285128.
|
| Google claims that their use of pretraining is a key requirement
| for being able to deliver a (slightly) better chip design. And
| they claim that a responding paper that did not attempt
| pretraining should have been expected to be well below the state
| of the art in chip design.
|
| Given how important reasoning is for chip design, and given how
| important pretraining is for driving reasoning in large language
| models, it is obvious that Google's reasoning is very reasonable.
| If Google barely beats the state of the art while using
| pretraining, an attempt that doesn't pretrain should be expected
| to be well below the current state of the art. And therefore that
| second attempt's poor performance says nothing about whether
| Google's results are plausible.
| pfisherman wrote:
| I am not an expert in the particular application domain of that
| article, but I can see why their argument about pretraining
| might be valid. It is not especially controversial to say that
| pretraining neural nets improves few-shot learning performance.
| And I suspect there is an inflection point for every problem
| where pretrained neural nets yield better few-shot learning
| performance than less data-hungry approaches, such as hand-
| crafted features or strong priors.
|
| That being said, it seems that the question here is whether
| that inflection point has been reached in this case.
___________________________________________________________________
(page generated 2024-12-01 23:00 UTC)