[HN Gopher] Why can't transformers learn multiplication?
___________________________________________________________________
Why can't transformers learn multiplication?
Author : PaulHoule
Score : 111 points
Date : 2025-10-21 19:47 UTC (3 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| LouisSayers wrote:
| Given their names I'd say they're too busy optimising primes...
| IAmBroom wrote:
| Take your damned upvote, and go away.
| daxfohl wrote:
| The chains-of-thought here are artificially constructed, very
| information-dense partial sums formatted in a specific way that
| guides the fine tuning. A potential next step would be to look at
| real-world chains-of-thought and see whether some process could
| start with those and achieve the same result. Then you could
| really have a self-improving system!
|
| Also I wonder if the LLM "knows" that it has this capability
| after fine-tuning. If it encounters multiplication as part of
| some larger chain-of-thought, will it solve that internally, or
| will it continue to do it step-by-step in the chain-of-thought?
| hxu129 wrote:
| But it's very hard to define "real-world CoT" -- think about
| humans: we learn multiplication by vertical calculation and we
| learn division in a similar way. All of these learning processes
| require "information dense" tools (the calculation procedure)
| with intrinsic math rules built in. Isn't that an adapted form
| of CoT?
| jerf wrote:
| This is a gut impression, I admit, but LLMs are Large
| _Language_ Models, and in my own brain, my Language Model isn't
| doing large-scale multiplication. I have a language-based
| intuition for the single-digit multiplication table and a touch
| beyond (and based on my observations that's already above average
| for a human Language Model, at least in my age peer group), but
| it's not my Language Model doing 283 times 9284. That requires a
| symbolic manipulation model, and in fact I would observe that my
| personal neural net, for all the things it is amazingly good at,
| is in fact quite terrible at that sort of multiplication too. A
| Commodore PET is by all measures vastly, vastly simpler than my
| brain, but it blows away my multiplication capabilities. And then
| the symbolic systems tacked on another, what, 15 orders of
| magnitude from that "blows away my multiplication capabilities"?
| Depends on how you count, but something like that.
|
| You can sit here and force me to recite ("train me on") multi-
| digit multiplication problems and their result until the day I
| die, and my language model is only going to get marginally
| better. It is in practicing my symbolic manipulation that I'm
| going to get better and faster.
|
| It seems to me that expecting a Language Model to be very good at
| multiplication is asking for a substantially superhuman level of
| performance from them, and one that we have little reason to
| believe will scale anyhow. What we need is symbolic manipulation,
| better than the approximation they achieve when "reasoning".
|
| I find it rather ironic to sit here and use the aforementioned 15
| orders of magnitude of improvement over the Commodore PET -- all
| that symbolic manipulation firepower -- to laboriously
| recreate a software system that is as bad as we are at
| multiplication for what may well be the same fundamental
| reasons... and then have the audacity to _complain_ about it. My
| metaphorical dude, you did a couple trillion multiplications just
| to get to this single bad multiplication output... maybe another
| approach is called for.
| suddenlybananas wrote:
| Language _is_ the symbolic manipulation system par excellence
| though.
| jerf wrote:
| There's equivocation in that statement, though, whether you
| meant there to be or not. There is clearly a difference in
| how we manipulate English words for normal human activities
| and the symbolic manipulation with very strict rules we today
| associate with mathematics and computer science. Human
| language goes back thousands of years, into an indefinite
| past we can't trace. Symbolic manipulation is a much,
| much more recent development, starting only ~2300 years ago
| around Euclid and not really coming into full development
| until much later... you can argue about exactly when it is
| but I'd personally put it as late as the 19th century for it
| to be recognized in the modern sense. It must be something
| different if separated by that many centuries.
|
| To disprove my point, please generate a list of 5 random
| 5-digit numbers and demonstrate multiplying them in your head
| as quickly as you can read them. Since you can't, clearly
| there is something about that that is hard for you, despite
| the fact that the act of reading this text, maintaining
| physical homeostasis while you do it, and all the other
| things your brain is doing as you do this represent a
| staggering amount of raw computation that is vastly, vastly
| in excess of what is nominally needed to achieve that
| computation.
| suddenlybananas wrote:
| Doing multiplication in your head isn't the point, though: you
| can externalise language and use it to do things you can't do
| in your head by writing it down.
|
| Mathematics was born out of very careful reasoning that we do
| _through language_; we only use formalisms because they allow
| us to avoid the massive ambiguities that exist in natural
| language. Formal symbolic manipulation came out of
| our already existing abilities of symbolic manipulation
| through language.
| lacy_tinpot wrote:
| A lot of savants who are able to do really cool calculations,
| or even people with synesthesia who see numbers as colors,
| don't actually do "real" calculations.
|
| I think most humans that do math aren't actually literally
| computing things as some kind of logic machine.
|
| We can produce logic, and follow the steps of using that logic,
| but it doesn't seem to me that our cognition is some kind of
| logic machine itself.
| daxfohl wrote:
| True. Generally it seems like you're visualizing things,
| moving stuff around, seeing vague patterns and trying to make
| them more clear. IDK how a transformer architecture would fit
| all of that in its context, or use it productively once it's
| there. You can't just keep appending forever, but you also
| can't delete stuff either, because unlike humans, a deletion
| is a hard delete; there's no fuzzy remembrance left to rely
| on, so even deleting bad ideas is dangerous because it'll
| forget that it was a bad idea and loop forever. Symbol
| manipulation doesn't come until the end, after you have a
| good idea what that part will look like.
| daxfohl wrote:
| Hmm, I wonder what happens if you let them manipulate their own
| context symbolically, maybe something like a stack machine.
| Perhaps all you need is a "delete" token, or a "replace" flag.
| That way you don't have context full of irrelevant information.
|
| I guess the challenge is, where would the training data come
| from? Data on the internet is in its final form so "next token"
| is never a delete.
|
| Edit: I guess in essence, that's what reasoning LLMs already
| do. IIUC the thought blocks are ephemeral, and only the
| response is maintained for the chat. Maybe there'd be some
| benefit of doing this recursively? But that's also kind of what
| subagents are for. So, perhaps nothing new here.
| r0x0r007 wrote:
| I agree with you; it seems like we are trying to make the shoe
| fit. Not only are we missing an understanding of what is
| happening inside transformers, but now we are trying to teach
| them, see how they respond, and then interpret it. That seems
| fine with viruses and animals, but we are talking about a piece
| of software here. Shouldn't we know what's happening inside?
| Maybe these kinds of papers can shine more light and give us
| better understanding, though; still, it feels backwards to
| me... Regarding the multiplication itself, shouldn't a pure
| understanding of the meaning of multiplication (it's basically
| a summation) be enough for 'AI' to call it a day? If an AI or a
| human understands that, then the rest is just computation.
| We've already got that covered, so instead of having 'AI' learn
| it on its own from a crazy amount of data and get it right 99%
| of the time, shouldn't we just give it a calculator? Somebody
| PLEEAASE give this AI a calculator :-)
| hodgehog11 wrote:
| I think you might be missing some appropriate context. I agree
| that it is ridiculous to expect a language model to be good at
| symbolic manipulation; that is best done with tool use.
| However, there is a significant line of work dedicated to
| algorithm discovery for mathematical problems using neural
| networks. Transformers are used here due to their popularity,
| but also because of some theoretical analysis suggesting that
| they are among the most efficient architectures for learning
| automata.
| It's still unclear whether this is truly sound though, which is
| where this kind of research matters.
| mikkupikku wrote:
| They're not any better at addition, are they? If they are, I
| wonder how good they are at adding numbers in log space.
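| A quick sketch of the log-space idea in plain Python (which says
| nothing about how a transformer represents numbers): the logs do
| add, but float precision ruins exactness once the operands get
| large.
|
|   import math
|
|   def mul_via_logs(a, b):
|       # exp(log a + log b) equals a * b only up to float error
|       return round(math.exp(math.log(a) + math.log(b)))
|
|   print(mul_via_logs(283, 9284) == 283 * 9284)       # True
|   print(mul_via_logs(12987318927, 12098102983)
|         == 12987318927 * 12098102983)                # False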
| yorwba wrote:
| The paper uses a number representation that is designed to make
| attention easy to learn: each digit is a separate token and the
| least significant digit is put first, so that the first digit
| of the output depends only on the first digits of the inputs,
| and the second digit only on the second digits plus an optional
| carry from the first digits, and so on.
|
| If the numbers are represented with the most significant digit
| first as usual, you need a bunch of intermediate steps before
| outputting even the first digit just to determine whether it is
| affected by a carry or not.
|
| The paper looks at multiplication of numbers represented with
| the least significant digit first as a toy task requiring
| several additions as intermediate steps to study why a model
| large enough to perform those additions in principle fails to
| learn to do so in practice.
|
| They compare with a model that is first trained to produce the
| intermediate additions explicitly (as a "chain of thought" with
| a specific format) and then has this CoT progressively
| shortened during training until there's nothing left of it. But
| that second model successfully multiplies.
|
| The difference appears to be that the presence of the
| intermediate results induces a better number representation in
| latent space, whereas the model without CoT gets stuck in a
| less efficient local minimum.
|
| So the answer to the question "Why can't transformers learn
| multiplication?" is that the training process is insufficient
| for the model to discover the best intermediate steps on its
| own.
|
| You could do a similar experiment where the CoT involves first
| taking the logarithm, adding, and then exponentiating to get
| the final result, but I think logarithms are probably another
| computation that's too difficult to learn without additional
| hints for intermediate steps.
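| A minimal sketch of that reversed-digit scheme in plain Python
| (not the paper's tokenizer or exact CoT format): each output
| digit of an addition needs only the two input digits at that
| position plus one carry, and multiplication is then a chain of
| such additions over shifted partial products.
|
|   def add_reversed(a, b):
|       # a, b are lists of digits, least significant digit first
|       out, carry = [], 0
|       for i in range(max(len(a), len(b))):
|           da = a[i] if i < len(a) else 0
|           db = b[i] if i < len(b) else 0
|           s = da + db + carry
|           out.append(s % 10)
|           carry = s // 10
|       if carry:
|           out.append(carry)
|       return out
|
|   def mul_reversed(a, b):
|       # schoolbook: one shifted partial product per digit of b,
|       # accumulated with add_reversed -- roughly the intermediate
|       # steps the paper spells out as an explicit chain of thought
|       acc = [0]
|       for shift, db in enumerate(b):
|           partial, carry = [0] * shift, 0
|           for da in a:
|               p = da * db + carry
|               partial.append(p % 10)
|               carry = p // 10
|           if carry:
|               partial.append(carry)
|           acc = add_reversed(acc, partial)
|       return acc
|
|   print(add_reversed([3, 2, 1], [9, 8, 9]))  # 123 + 989 -> [2, 1, 1, 1]
|   print(mul_reversed([2, 1], [4, 3]))        # 12 * 34   -> [8, 0, 4]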
| mikkupikku wrote:
| > _but I think logarithms are probably another computation
| that's too difficult to learn without additional hints for
| intermediate steps._
|
| I suppose you're probably right, but LLMs probably have a lot
| of log tables in their training data so I'm not so sure.
| yorwba wrote:
| The paper is about the ability of transformers to learn a
| task based on training data for that task only, not about
| LLMs pretrained on much of the internet. And training on
| log tables doesn't necessarily allow the model to always
| output the correct logarithm, just as training on
| multiplication tables doesn't necessarily confer the
| ability to multiply.
| nico wrote:
| Would love to see an architecture that learned more like humans.
| Start with just imitating one letter, then a few more, then some
| syllables, then full words, then sentences, etc. Progressively
| adding on top of previous knowledge
|
| Also, it's interesting that one of the big goals/measures of
| models is their capacity to "generalize", but the training
| methods optimize for loss/accuracy and only test for
| generalization after training, as validation
|
| Are there training methods/curriculums that explicitly maximize
| generalization?
| serced wrote:
| Yes, I also wonder about this! Progressing from children's
| books to scientific papers, etc. Could it learn, e.g., language
| structure faster in a pre-training stage? Also, one would
| somehow need to define a proxy for generalization to compute a
| loss and do backpropagation.
| arbot360 wrote:
| This field of study is known as "Curriculum Learning" for
| your Googling pleasure (or I guess ChatGPT Deep Research
| now).
| exit wrote:
| "an architecture that learned more like humans"
|
| i.e. enduring countless generations of evolutionary selection
| and cross breeding, then fine-tuning a bit?
|
| although it could be interesting, i don't think training on
| progressively more complex strings entirely recapitulates this.
| nico wrote:
| That's a very interesting take. I hadn't really considered
| evolution
|
| I guess if you really wanted to start from scratch, you could
| figure out how to evolve the whole system from a single cell
| or something like that. In some ways neural networks have
| kind of evolved in that way, assisted by humans. They started
| with a single perceptron, and have gone all the way to deep
| learning and convolutional networks
|
| I also remember a long time ago studying genetic and
| evolutionary algorithms, but they were pretty basic in terms
| of what they could learn and do, compared to modern LLMs
|
| Although recently I saw some research in which they were
| applying essentially genetic algorithms to merge model
| weights and produce models with new/evolved capabilities
| ares623 wrote:
| Isn't that what all the hundreds of billions are banking on?
| "General" intelligence.
| onlyrealcuzzo wrote:
| You don't need general intelligence to make good memes to
| keep people scrolling through Instagram.
|
| You don't need general intelligence to make a decent coding
| tool like Cursor.
|
| You don't need general intelligence to improve SERPs.
|
| You don't need general intelligence to sell a subscription
| for a decent AI assistant.
|
| There's tons of value already added without anything general.
| zer00eyz wrote:
| "Would love to see an architecture that learned"
|
| Would be a far more accurate statement. Training != Learning.
| rokobobo wrote:
| Do you have an example of an algorithm that learns, rather
| than is trained/trains itself? I don't really see the
| boundary between the two concepts.
| carodgers wrote:
| Because they produce output probabilistically, whereas
| multiplication is deterministic. Why is this so hard for
| everyone?
| trollied wrote:
| Not true though. Internally they can "shell out" to sub-tasks
| that know how to do specific things. The specific things don't
| have to be models.
|
| (I'm specifically talking about commercial hosted ones that
| have the capability I describe - obviously your run-of-the-mill
| one downloaded off the internet cannot do this).
| rrix2 wrote:
| yes, what you're describing is not a transformer but a high-
| level LLM-based product with tool-calling wired up to it
| KalMann wrote:
| That doesn't appear to be the kind of thing this article is
| describing.
| skinner_ wrote:
| If being probabilistic prevented learning deterministic
| functions, transformers couldn't learn addition either. But
| they can, so that can't be the reason.
| wat10000 wrote:
| People are probabilistic, and I've been informed that people
| are able to perform multiplication.
| ddingus wrote:
| Yes, and unlike the LLM they can iterate on a problem.
|
| When I multiply, I take it in chunks.
|
| Put the LLM into a loop, instruct it to keep track of where
| it is and have it solve a digit at a time.
|
| I bet it does just fine. See my other comment as to why I
| think that is.
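| For what it's worth, a hypothetical harness for that "one digit
| per pass" idea might look like the sketch below. ask_llm is a
| made-up stand-in for a real model call, stubbed here with the
| actual column arithmetic so the loop runs; the point is only the
| shape of the state handed back in on every iteration.
|
|   def ask_llm(prompt, state):
|       # Stub: do the single column of work the prompt would ask
|       # a model to do, then hand back the updated running state.
|       a, b = state["a"], state["b"]
|       i, carry = state["i"], state["carry"]
|       col = carry + sum(int(a[-1 - k]) * int(b[-1 - (i - k)])
|                         for k in range(i + 1)
|                         if k < len(a) and (i - k) < len(b))
|       return {**state, "i": i + 1, "carry": col // 10,
|               "digits": state["digits"] + [col % 10]}
|
|   def multiply_in_a_loop(a, b):
|       state = {"a": a, "b": b, "i": 0, "carry": 0, "digits": []}
|       for _ in range(len(a) + len(b)):   # one output digit per pass
|           state = ask_llm("give me the next output digit", state)
|       return int("".join(map(str, reversed(state["digits"]))))
|
|   assert multiply_in_a_loop("283", "9284") == 283 * 9284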
| kovek wrote:
| I tried to ask a model to tell me what is the "long
| multiplication algorithm". It gave it to me. I asked it to follow
| that algorithm to solve e.g. 12987318927 * 12098102983, and it
| followed the algorithm and got the right answer. It DOES fail
| more when the numbers are longer (because that results in more
| text in the context), but that can be improved by having the
| model focus on the right subset of the text, right?
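| As a rough illustration of the "more text in the context"
| effect: written out, long multiplication of two n-digit numbers
| produces about n partial products of roughly n digits each, so
| the scratch text the model has to keep around grows roughly
| quadratically with operand length. Ordinary Python below, not
| what the model does internally:
|
|   def long_multiplication_trace(a, b):
|       # one line of scratch text per digit of b, plus the total
|       lines, total = [], 0
|       for i, d in enumerate(reversed(str(b))):
|           partial = int(d) * a * 10**i
|           total += partial
|           lines.append(f"{a} x {d} (shift {i}) = {partial}")
|       lines.append(f"sum of partials = {total}")
|       return lines
|
|   trace = long_multiplication_trace(12987318927, 12098102983)
|   print("\n".join(trace))
|   print(len("".join(trace)), "characters of scratch work")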
| photonthug wrote:
| > It DOES fail more when the numbers are longer (because that
| results in more text in the context),
|
| I tried to raise this question yesterday.
| https://news.ycombinator.com/item?id=45683113#45687769
|
| Declaring victory on "reasoning" based on cherry-picking a
| correct result about arithmetic is, of course, very narrow and
| absurdly optimistic. Even if it correctly works for _all_ NxM
| calculations. Moving on from arithmetic to any kind of problem
| that fundamentally reduces to model-checking behind the
| scenes, we would be talking about exploring a state-space with
| potentially many thousands of state-transitions for simple
| stuff. If each one even has a _small_ chance of crapping out
| due to hallucination, the chance of encountering errors at the
| macro-scale is going to be practically guaranteed.
|
| Everyone will say, "but you want tool-use or code-gen for this
| anyway". Sure! But carry-digits or similar is just one version
| of "correct matters" and putting some non-local kinds of
| demands on attention, plus it's easier to check than code. So
| tool-use or code-gen is just pushing the same problem somewhere
| else to hide it.. there's still a lot of steps involved, and
| each one really has to be correct if the macro-layer is going
| to be correct and the whole thing is going to be hands-off /
| actually automated. Maybe that's why local models can still
| barely handle nontrivial tool-calling.
| kovek wrote:
| Well, if the model can reliably keep the CPU cache, CPU
| registers, and CPU instructions in context and is able to do
| operations based on those, then we've pretty much solved
| computation using LLMs, right? It could use RAG to operate on
| RAM and SSD.
|
| Here we can see the amount of data a high-end traditional
| non-SoC CPU holds:
|
| > For a recent high-end non-SoC desktop CPU:
| > Cache: ~40-100 MB total (L1 + L2 + shared L3)
| > Register files: tens to a few hundreds of KB total across
| > cores (e.g., ~200-300 KB or so)
| > Combined: so you're looking at ~40-100 MB + ~0.2 MB, roughly
| > ~40-100 MB of total on-chip caches + registers.
|
| I'm sure we can reduce these caches to fit in the context
| windows of today's LLMs (~500,000 tokens).
|
| Then, with temperature 0 we get more "discrete" operations.
| Now, we still have the occasional problem of hallucinations,
| but it should be rare at temperature 0.
| lossolo wrote:
| It doesn't work like mapping CPU caches/registers into an
| LLM context. Transformers have no mutable registers, they
| attend over past tokens and can't update prior state. RAG
| isn't RAM. Even with a huge context, you still can't step
| through CPU-style instructions without external read/write
| memory or tooling.
|
| And temperature 0 makes outputs deterministic, not
| magically correct.
| photonthug wrote:
| > And temperature 0 makes outputs deterministic, not
| magically correct.
|
| For reasons I don't claim to really understand, I don't
| think it even makes them deterministic. Floating point
| something something? I'm not sure temperature even has a
| static technical definition or implementation everywhere
| at this point. I've been ignoring temperature and using
| nucleus sampling anywhere that's exposed and it seems to
| work better.
|
| Random but typical example: pydantic-ai has a caveat
| that doesn't reference any particular model: "Note that
| even with temperature of 0.0, the results will not be
| fully deterministic". And of course this is just the very
| bottom layer of model-config and in a system of diverse
| agents using different frameworks and models, it's even
| worse.
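| One commonly cited reason is that floating-point addition isn't
| associative, so a kernel that changes its reduction order
| between runs can produce slightly different logits from
| identical inputs even at temperature 0. A two-line illustration
| of the underlying effect, not tied to any particular model:
|
|   a, b, c = 1e16, -1e16, 1.0
|   print((a + b) + c)   # 1.0
|   print(a + (b + c))   # 0.0 -- same numbers, different grouping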
| kovek wrote:
| Well, the LLM may re-infer the whole state fully on every
| instruction. Temperature 0 is deterministic and that's
| what we are looking for. If the model is trained properly
| on how the CPU state + instructions should be handled,
| then it should be able to produce the next state.
| lossolo wrote:
| With temp = 0 if the model is off by one bit at step k,
| all subsequent steps are deterministically wrong.
|
| Your previous example shows the best case, which is that a
| model can sometimes follow a textual recipe for long
| multiplication on short inputs. That's not the same as learning
| a length-generalizing, bit-exact algorithm.
|
| Basically, what you've shown is that the model can describe the
| algorithm. It doesn't show it can execute it at scale. Without
| writable state and bit-exact ops, errors grow with length, and
| "focus more" only slows that failure, it doesn't eliminate it.
| alyxya wrote:
| I think it should be able to learn multiplication with chain of
| thought. Without it, it's probably really difficult to generalize
| the multiplication of two n-digit integers when you have to
| accumulate up to n products of digits and handle carrying for
| each output digit.
| amelius wrote:
| What probably works: ask it to write a Python program, but tell
| it not to use any built-in multiplication functions.
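| A guess at the kind of program you'd hope it writes (not a claim
| about what any particular model actually emits): binary
| shift-and-add, with no '*' anywhere.
|
|   def multiply(a, b):
|       # decompose b into binary digits and add the
|       # correspondingly doubled copies of a
|       if b < 0:
|           return -multiply(a, -b)
|       result = 0
|       while b:
|           if b & 1:
|               result += a
|           a <<= 1
|           b >>= 1
|       return result
|
|   assert multiply(283, 9284) == 2627372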
| janalsncm wrote:
| Then your transformer would need to know Python.
| ddingus wrote:
| A while back I saw a post where people ran a model over and over
| to accomplish a code base port from one language to another.
|
| In their prompt, they told it to leave itself a note and to
| accomplish something each time.
|
| Then they put the model in a loop and it worked. In one instance,
| a model removed itself from the loop by editing a file or some
| other basic means.
|
| To me, iterative tasks like multiplication and long division
| look an awful lot like the code port experiment.
|
| Putting models into loops so they get more than one bite at the
| task seems to be a logical progression to improve capability.
| CaveTech wrote:
| The number of paths in the wrong direction is infinitely
| greater than the number in the right direction. You'll quickly
| realize this doesn't actually scale.
| hodgehog11 wrote:
| I'm a bit confused by this; are you referring to
| vanishing/exploding gradients during training or iteration at
| inference? If the former, this is only true if you take too
| many steps. If the latter, we already know this works and
| scales well.
| CaveTech wrote:
| The latter, and I would disagree that "this works and
| scales well" in the general sense. It clearly has very
| finite bounds, given that we haven't achieved AGI by
| running an LLM in a loop.
|
| The approach of "try a few more things before stopping" is
| a great strategy, akin to taking a few more stabs at RNG.
| It's not the same as saying keep trying until you get there
| - you won't.
| hodgehog11 wrote:
| > It clearly has very finite bounds, given that we haven't
| achieved AGI by running an LLM in a loop.
|
| That's one hell of a criterion. Test-time inference
| undergoes a similar scaling law to pretraining, and has
| resulted in dramatically improved performance on many
| complex tasks. Law of diminishing returns kicks in of
| course, but this doesn't mean it's ineffective.
|
| > akin to taking a few more stabs at RNG
|
| No it isn't. Scaling laws cannot appear with glassy
| optimisation procedures (essentially iid trials until you
| succeed, the mental model you seem to be implying here).
| They only appear if the underlying optimisation is
| globally connected and roughly convex. It's no different
| than gradient descent in this regard.
| smartmic wrote:
| Yesterday, I learned the opposite. Simon Willison demonstrated in
| another thread how this works out ... see
| https://news.ycombinator.com/item?id=45686295
| skinner_ wrote:
| That's very cool, but it's not an apples to apples comparison.
| The reasoning model learned how to do long multiplication.
| (Either from the internet, or from generated examples of long
| multiplication that were used to sharpen its reasoning skills.
| In principle, it might have invented it on its own during RL,
| but no, I don't think so.)
|
| In this paper, the task is to learn how to multiply, strictly
| from AxB=C examples, with 4-digit numbers. Their vanilla
| transformer can't learn it, but the one with (their variant of)
| chain-of-thought can. These are transformers that have never
| encountered written text, and are too small to understand any
| of it anyway.
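| For anyone skimming, the training data being described looks
| roughly like the sketch below: bare A x B = C examples over
| 4-digit numbers, one token per digit, least significant digit
| first. The separators and exact tokenization here are my guess,
| not the paper's actual format.
|
|   import random
|
|   def make_example(rng):
|       a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
|       rev = lambda n: list(str(n))[::-1]  # digit tokens, LSB first
|       return rev(a) + ["*"] + rev(b) + ["="] + rev(a * b)
|
|   rng = random.Random(0)
|   for _ in range(3):
|       print(" ".join(make_example(rng)))
|   # digits in, digits out -- no text, no reasoning traces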
| faragon wrote:
| Maybe AGI will come with the equivalent of a "Turing machine",
| enabling some kind of computability.
| Nifty3929 wrote:
| Numbers aren't language, or even sequences of tokens, or vectors.
|
| There is an inherent numeric-ness and logic to math that I don't
| think we can represent well using LLMs and transformers.
|
| 3 isn't about the word "three" - it is a quantity or a
| measurement. And 3x4 is a specific numerical operation that is
| not really contained in that sequence of symbols.
| akomtu wrote:
| IMO, the mystery has a simple explanation: addition is mostly
| local in nature, in that the 5th digit of the input typically
| affects only the 5th or 4th digit of the output, while
| multiplication is not local at all. That being said, LLMs don't
| understand addition either: the illusion will break down on
| very large inputs.
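| A quick way to see that locality claim (a toy check, not from
| the paper): bump a single input digit and count how many output
| digits change.
|
|   def changed_digits(x, y, width=12):
|       xs, ys = f"{x:0{width}d}", f"{y:0{width}d}"
|       return [i for i in range(width) if xs[i] != ys[i]]
|
|   a, b, bump = 734512, 98341, 1000   # bump one digit of a
|   print(changed_digits(a + b, a + bump + b))    # one position changes
|   print(changed_digits(a * b, (a + bump) * b))  # changes spread widely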
___________________________________________________________________
(page generated 2025-10-24 23:00 UTC)