[HN Gopher] Why can't transformers learn multiplication?
___________________________________________________________________
Why can't transformers learn multiplication?
Author : PaulHoule
Score : 111 points
Date : 2025-10-21 19:47 UTC (3 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| LouisSayers wrote:
| Given their names I'd say they're too busy optimising primes...
| IAmBroom wrote:
| Take your damned upvote, and go away.
| daxfohl wrote:
| The chains-of-thought here are artificially constructed, very
| information-dense partial sums formatted in a specific way that
| guides the fine tuning. A potential next step would be to look at
| real-world chains-of-thought and see whether some process could
| start with those and achieve the same result. Then you could
| really have a self-improving system!
|
| Also I wonder if the LLM "knows" that it has this capability
| after fine-tuning. If it encounters multiplication as part of
| some larger chain-of-thought, will it solve that internally, or
| will it continue to do it step-by-step in the chain-of-thought?
| hxu129 wrote:
| But it's very hard to define "real-world CoT" -- think about
| humans: we learn multiplication by vertical calculation and we
| learn division in a similar way. All of these learning processes
| require "information dense" tools (the calculation procedure)
| with intrinsic math rules built in. Isn't that an adapted form
| of CoT?
| jerf wrote:
| This is a gut impression, I admit, but LLMs are Large
| _Language_ Models, and in my own brain, my Language Model isn't
| doing large-scale multiplication. I have a language-based
| intuition for the single-digit multiplication table and a touch
| beyond (and based on my observations that's already above average
| for a human Language Model, at least in my age peer group), but
| it's not my Language Model doing 283 times 9284. That requires a
| symbolic manipulation model, and in fact I would observe that my
| personal neural net, for all the things it is amazingly good at,
| is in fact quite terrible at that sort of multiplication too. A
| Commodore PET is by all measures vastly, vastly simpler than my
| brain, but it blows away my multiplication capabilities. And then
| the symbolic systems tacked on another, what, 15 orders of
| magnitude from that "blows away my multiplication capabilities"?
| Depends on how you count, but something like that.
|
| You can sit here and force me to recite ("train me on") multi-
| digit multiplication problems and their result until the day I
| die, and my language model is only going to get marginally
| better. It is in practicing my symbolic manipulation that I'm
| going to get better and faster.
|
| It seems to me that expecting a Language Model to be very good at
| multiplication is asking for a substantially superhuman level of
| performance from them, and one that we have little reason to
| believe will scale anyhow. What we need is symbolic manipulation,
| better than the approximation they achieve when "reasoning".
|
| I find it rather ironic to sit here and use the aforementioned 15
| orders of magnitude of improvement over the Commodore PET -- all
| that symbolic manipulation firepower -- to laboriously
| recreate a software system that is as bad as we are at
| multiplication for what may well be the same fundamental
| reasons... and then have the audacity to _complain_ about it. My
| metaphorical dude, you did a couple trillion multiplications just
| to get to this single bad multiplication output... maybe another
| approach is called for.
| suddenlybananas wrote:
| Language _is_ the symbolic manipulation system par excellence
| though.
| jerf wrote:
| There's equivocation in that statement, though, whether you
| meant there to be or not. There is clearly a difference in
| how we manipulate English words for normal human activities
| and the symbolic manipulation with very strict rules we today
| associate with mathematics and computer science. Human
| language goes back thousands of years, into an indefinite
| past we can't trace. Symbolic manipulation is a much,
| much more recent development, starting only ~2300 years ago
| around Euclid and not really coming into full development
| until much later... you can argue about exactly when it is
| but I'd personally put it as late as the 19th century for it
| to be recognized in the modern sense. It must be something
| different if separated by that many centuries.
|
| To disprove my point, please generate a list of 5 random
| 5-digit numbers and demonstrate multiplying them in your head
| as quickly as you can read them. Since you can't, clearly
| there is something about that that is hard for you, despite
| the fact that the act of reading this text, maintaining
| physical homeostasis while you do it, and all the other
| things your brain is doing as you do this represent a
| staggering amount of raw computation that is vastly, vastly
| in excess of what is nominally needed to achieve that
| computation.
| suddenlybananas wrote:
| Doing multiplication in your head isn't the point, though: you
| can externalise language and use it to do things you can't do
| in your head by writing it down.
|
| Mathematics was born out of very careful reasoning that we do
| _through language_; we only use formalisms because they allow
| us to avoid the massive ambiguities that exist in natural
| language. Formal symbolic manipulation came out of
| our already existing abilities of symbolic manipulation
| through language.
| lacy_tinpot wrote:
| A lot of savants who are able to do really cool calculations,
| or even people with synesthesia who see numbers as colors,
| don't actually do "real" calculations.
|
| I think most humans that do math aren't actually literally
| computing things as some kind of logic machine.
|
| We can produce logic, and follow the steps of using that logic,
| but it doesn't seem to me that our cognition is some kind of
| logic machine itself.
| daxfohl wrote:
| True. Generally it seems like you're visualizing things,
| moving stuff around, seeing vague patterns and trying to make
| them more clear. IDK how a transformer architecture would fit
| all of that in its context, or use it productively once it's
| there. You can't just keep appending forever, but you also
| can't delete stuff either, because unlike humans, a deletion
| is a hard delete; there's no fuzzy remembrance left to rely
| on, so even deleting bad ideas is dangerous because it'll
| forget that it was a bad idea and loop forever. Symbol
| manipulation doesn't come until the end, after you have a
| good idea what that part will look like.
| daxfohl wrote:
| Hmm, I wonder what happens if you let them manipulate their own
| context symbolically, maybe something like a stack machine.
| Perhaps all you need is a "delete" token, or a "replace" flag.
| That way you don't have context full of irrelevant information.
|
| I guess the challenge is, where would the training data come
| from? Data on the internet is in its final form so "next token"
| is never a delete.
|
| Edit: I guess in essence, that's what reasoning LLMs already
| do. IIUC the thought blocks are ephemeral, and only the
| response is maintained for the chat. Maybe there'd be some
| benefit of doing this recursively? But that's also kind of what
| subagents are for. So, perhaps nothing new here.
| r0x0r007 wrote:
| I agree with you; it seems like we are trying to make the shoe
| fit. Not only are we missing an understanding of what is
| happening inside transformers, but now we are trying to teach
| them, see how they respond, and then interpret it. That seems
| fine with viruses and animals, but we are talking about a piece
| of software here. Shouldn't we know what's happening inside?
| Maybe these kinds of papers can shine more light and give us
| better understanding, though; still, it feels backwards to
| me... Regarding the multiplication itself, shouldn't a pure
| understanding of the meaning of multiplication (it's basically
| a summation) be enough for 'AI' to call it a day? If an AI or a
| human understands that, then the rest is just computation.
| We've already got that covered, so instead of having 'AI' learn
| it on its own from a crazy amount of data and get it right 99%
| of the time, shouldn't we just give it a calculator? Somebody
| PLEEAASE give this AI a calculator :-)
| hodgehog11 wrote:
| I think you might be missing some appropriate context. I agree
| that it is ridiculous to expect a language model to be good at
| symbolic manipulation; that is best done with tool use.
| However, there is a significant line of work dedicated to
| algorithm discovery for mathematical problems using neural
| networks. Transformers are used here due to their popularity,
| but also because of some theoretical analysis suggesting that
| they are among the most efficient architectures for learning
| automata.
| It's still unclear whether this is truly sound though, which is
| where this kind of research matters.
| mikkupikku wrote:
| They're not any better at addition, are they? If they are, I
| wonder how good they are at adding numbers in log space.
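| A quick sketch of the log-space idea in plain Python (which says
| nothing about how a transformer represents numbers): the logs do
| add, but float precision ruins exactness once the operands get
| large.
|
|   import math
|
|   def mul_via_logs(a, b):
|       # exp(log a + log b) equals a * b only up to float error
|       return round(math.exp(math.log(a) + math.log(b)))
|
|   print(mul_via_logs(283, 9284) == 283 * 9284)       # True
|   print(mul_via_logs(12987318927, 12098102983)
|         == 12987318927 * 12098102983)                # False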
| yorwba wrote:
| The paper uses a number representation that is designed to make
| attention easy to learn: each digit is a separate token and the
| least significant digit is put first, so that the first digit
| of the output depends only on the first digits of the inputs,
| and the second digit only on the second digits plus an optional
| carry from the first digits, and so on.
|
| If the numbers are represented with the most significant digit
| first as usual, you need a bunch of intermediate steps before
| outputting even the first digit just to determine whether it is
| affected by a carry or not.
|
| The paper looks at multiplication of numbers represented with
| the least significant digit first as a toy task requiring
| several additions as intermediate steps to study why a model
| large enough to perform those additions in principle fails to
| learn to do so in practice.
|
| They compare with a model that is first trained to produce the
| intermediate additions explicitly (as a "chain of thought" with
| a specific format) and then has this CoT progressively
| shortened during training until there's nothing left of it. But
| that second model successfully multiplies.
|
| The difference appears to be that the presence of the
| intermediate results induces a better number representation in
| latent space, whereas the model without CoT gets stuck in a
| less efficient local minimum.
|
| So the answer to the question "Why can't transformers learn
| multiplication?" is that the training process is insufficient
| for the model to discover the best intermediate steps on its
| own.
|
| You could do a similar experiment where the CoT involves first
| taking the logarithm, adding, and then exponentiating to get
| the final result, but I think logarithms are probably another
| computation that's too difficult to learn without additional
| hints for intermediate steps.
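| A minimal sketch of that reversed-digit scheme in plain Python
| (not the paper's tokenizer or exact CoT format): each output
| digit of an addition needs only the two input digits at that
| position plus one carry, and multiplication is then a chain of
| such additions over shifted partial products.
|
|   def add_reversed(a, b):
|       # a, b are lists of digits, least significant digit first
|       out, carry = [], 0
|       for i in range(max(len(a), len(b))):
|           da = a[i] if i < len(a) else 0
|           db = b[i] if i < len(b) else 0
|           s = da + db + carry
|           out.append(s % 10)
|           carry = s // 10
|       if carry:
|           out.append(carry)
|       return out
|
|   def mul_reversed(a, b):
|       # schoolbook: one shifted partial product per digit of b,
|       # accumulated with add_reversed -- roughly the intermediate
|       # steps the paper spells out as an explicit chain of thought
|       acc = [0]
|       for shift, db in enumerate(b):
|           partial, carry = [0] * shift, 0
|           for da in a:
|               p = da * db + carry
|               partial.append(p % 10)
|               carry = p // 10
|           if carry:
|               partial.append(carry)
|           acc = add_reversed(acc, partial)
|       return acc
|
|   print(add_reversed([3, 2, 1], [9, 8, 9]))  # 123 + 989 -> [2, 1, 1, 1]
|   print(mul_reversed([2, 1], [4, 3]))        # 12 * 34   -> [8, 0, 4]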
| mikkupikku wrote:
| > _but I think logarithms are probably another computation
| that's too difficult to learn without additional hints for
| intermediate steps._
|
| I suppose you're probably right, but LLMs probably have a lot
| of log tables in their training data so I'm not so sure.
| yorwba wrote:
| The paper is about the ability of transformers to learn a
| task based on training data for that task only, not about
| LLMs pretrained on much of the internet. And training on
| log tables doesn't necessarily allow the model to always
| output the correct logarithm, just as training on
| multiplication tables doesn't necessarily confer the
| ability to multiply.
| nico wrote:
| Would love to see an architecture that learned more like humans.
| Start with just imitating one letter, then a few more, then some
| syllables, then full words, then sentences, etc. Progressively
| adding on top of previous knowledge
|
| Also, it's interesting that one of the big goals/measures of
| models is their capacity to "generalize", but the training
| methods optimize for loss/accuracy and only test for
| generalization after training, as validation
|
| Are there training methods/curriculums that explicitly maximize
| generalization?
| serced wrote:
| Yes, I also wonder about this! Progressing from children's
| books to scientific papers, etc. Could it learn, e.g., language
| structure faster in a pre-training stage? Also, one would
| somehow need to define a proxy for generalization to compute a
| loss and do backpropagation.
| arbot360 wrote:
| This field of study is known as "Curriculum Learning" for
| your Googling pleasure (or I guess ChatGPT Deep Research
| now).
| exit wrote:
| "an architecture that learned more like humans"
|
| i.e. enduring countless generations of evolutionary selection
| and cross breeding, then fine-tuning a bit?
|
| although it could be interesting, i don't think training on
| progressively more complex strings entirely recapitulates this.
| nico wrote:
| That's a very interesting take. I hadn't really considered
| evolution
|
| I guess if you really wanted to start from scratch, you could
| figure out how to evolve the whole system from a single cell
| or something like that. In some ways neural networks have
| kind of evolved in that way, assisted by humans. They started
| with a single perceptron, and have gone all the way to deep
| learning and convolutional networks
|
| I also remember a long time ago studying genetic and
| evolutionary algorithms, but they were pretty basic in terms
| of what they could learn and do, compared to modern LLMs
|
| Although recently I saw some research in which they were
| applying essentially genetic algorithms to merge model
| weights and produce models with new/evolved capabilities
| ares623 wrote:
| Isn't that what all the hundreds of billions are banking on?
| "General" intelligence.
| onlyrealcuzzo wrote:
| You don't need general intelligence to make good memes to
| keep people scrolling through Instagram.
|
| You don't need general intelligence to make a decent coding
| tool like Cursor.
|
| You don't need general intelligence to improve SERPs.
|
| You don't need general intelligence to sell a subscription
| for a decent AI assistant.
|
| There's tons of value already added without anything general.
| zer00eyz wrote:
| "Would love to see an architecture that learned"
|
| Would be a far more accurate statement. Training != Learning.
| rokobobo wrote:
| Do you have an example of an algorithm that learns, rather
| than is trained/trains itself? I don't really see the
| boundary between the two concepts.
| carodgers wrote:
| Because they produce output probabilistically, whereas
| multiplication is deterministic. Why is this so hard for
| everyone?
| trollied wrote:
| Not true though. Internally they can "shell out" to sub-tasks
| that know how to do specific things. The specific things don't
| have to be models.
|
| (I'm specifically talking about commercial hosted ones that
| have the capability I describe - obviously your run-of-the-mill
| one downloaded off the internet cannot do this).
| rrix2 wrote:
| yes, what you're describing is not a transformer but a high-
| level LLM-based product with tool-calling wired up to it
| KalMann wrote:
| That doesn't appear to be the kind of thing this article is
| describing.
| skinner_ wrote:
| If being probabilistic prevented learning deterministic
| functions, transformers couldn't learn addition either. But
| they can, so that can't be the reason.
| wat10000 wrote:
| People are probabilistic, and I've been informed that people
| are able to perform multiplication.
| ddingus wrote:
| Yes, and unlike the LLM they can iterate on a problem.
|
| When I multiply, I take it in chunks.
|
| Put the LLM into a loop, instruct it to keep track of where
| it is and have it solve a digit at a time.
|
| I bet it does just fine. See my other comment as to why I
| think that is.
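| For what it's worth, a hypothetical harness for that "one digit
| per pass" idea might look like the sketch below. ask_llm is a
| made-up stand-in for a real model call, stubbed here with the
| actual column arithmetic so the loop runs; the point is only the
| shape of the state handed back in on every iteration.
|
|   def ask_llm(prompt, state):
|       # Stub: do the single column of work the prompt would ask
|       # a model to do, then hand back the updated running state.
|       a, b = state["a"], state["b"]
|       i, carry = state["i"], state["carry"]
|       col = carry + sum(int(a[-1 - k]) * int(b[-1 - (i - k)])
|                         for k in range(i + 1)
|                         if k < len(a) and (i - k) < len(b))
|       return {**state, "i": i + 1, "carry": col // 10,
|               "digits": state["digits"] + [col % 10]}
|
|   def multiply_in_a_loop(a, b):
|       state = {"a": a, "b": b, "i": 0, "carry": 0, "digits": []}
|       for _ in range(len(a) + len(b)):   # one output digit per pass
|           state = ask_llm("give me the next output digit", state)
|       return int("".join(map(str, reversed(state["digits"]))))
|
|   assert multiply_in_a_loop("283", "9284") == 283 * 9284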
| kovek wrote:
| I tried to ask a model to tell me what is the "long
| multiplication algorithm". It gave it to me. I asked it to follow
| that algorithm to solve e.g. 12987318927 * 12098102983, and it
| followed the algorithm and got the right answer. It DOES fail
| more when the numbers are longer (because that results in more
| text in the context), but that can be improved by having the
| model focus on the right subset of the text, right?
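| As a rough illustration of the "more text in the context"
| effect: written out, long multiplication of two n-digit numbers
| produces about n partial products of roughly n digits each, so
| the scratch text the model has to keep around grows roughly
| quadratically with operand length. Ordinary Python below, not
| what the model does internally:
|
|   def long_multiplication_trace(a, b):
|       # one line of scratch text per digit of b, plus the total
|       lines, total = [], 0
|       for i, d in enumerate(reversed(str(b))):
|           partial = int(d) * a * 10**i
|           total += partial
|           lines.append(f"{a} x {d} (shift {i}) = {partial}")
|       lines.append(f"sum of partials = {total}")
|       return lines
|
|   trace = long_multiplication_trace(12987318927, 12098102983)
|   print("\n".join(trace))
|   print(len("".join(trace)), "characters of scratch work")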
| photonthug wrote:
| > It DOES fail more when the numbers are longer (because that
| results in more text in the context),
|
| I tried to raise this question yesterday.
| https://news.ycombinator.com/item?id=45683113#45687769
|
| Declaring victory on "reasoning" based on cherry-picking a
| correct result about arithmetic is, of course, very narrow and
| absurdly optimistic. Even if it correctly works for _all_ NxM
| calculations. Moving on from arithmetic to any kind of problem
| that fundamentally reduces to model-checking behind the
| scenes, we would be talking about exploring a state-space with
| potentially many thousands of state-transitions for simple
| stuff. If each one even has a _small_ chance of crapping out
| due to hallucination, the chance of encountering errors at the
| macro-scale is going to be practically guaranteed.
|
| Everyone will say, "but you want tool-use or code-gen for this
| anyway". Sure! But carry-digits or similar is just one version
| of "correct matters" and putting some non-local kinds of
| demands on attention, plus it's easier to check than code. So
| tool-use or code-gen is just pushing the same problem somewhere
| else to hide it.. there's still a lot of steps involved, and
| each one really has to be correct if the macro-layer is going
| to be correct and the whole thing is going to be hands-off /
| actually automated. Maybe that's why local models can still
| barely handle nontrivial tool-calling.
| kovek wrote:
| Well, if the model can reliably keep the CPU cache, CPU
| registers, and CPU instructions in context and is able to do
| operations based on those, then we've pretty much solved
| computation using LLMs, right? It could use RAG to operate on
| RAM and SSD.
|
| Here we can see the amount of data a high-end traditional
| non-SoC CPU holds:
|
| > For a recent high-end non-SoC desktop CPU:
| > Cache: ~40-100 MB total (L1 + L2 + shared L3)
| > Register files: tens to a few hundreds of KB total across
| > cores (e.g., ~200-300 KB or so)
| > Combined: so you're looking at ~40-100 MB + ~0.2 MB, roughly
| > ~40-100 MB of total on-chip caches + registers.
|
| I'm sure we can reduce these caches to fit in the context
| windows of today's LLMs (~500,000 tokens).
|
| Then, with temperature 0 we get more "discrete" operations.
| Now, we still have the occasional problem of hallucinations,
| but it should be rare at temperature 0.
| lossolo wrote:
| It doesn't work like mapping CPU caches/registers into an
| LLM context. Transformers have no mutable registers, they
| attend over past tokens and can't update prior state. RAG
| isn't RAM. Even with a huge context, you still can't step
| through CPU-style instructions without external read/write
| memory or tooling.
|
| And temperature 0 makes outputs deterministic, not
| magically correct.
| photonthug wrote:
| > And temperature 0 makes outputs deterministic, not
| magically correct.
|
| For reasons I don't claim to really understand, I don't
| think it even makes them deterministic. Floating point
| something something? I'm not sure temperature even has a
| static technical definition or implementation everywhere
| at this point. I've been ignoring temperature and using
| nucleus sampling anywhere that's exposed and it seems to
| work better.
|
| Random but typical example: pydantic-ai has a caveat
| that doesn't reference any particular model: "Note that
| even with temperature of 0.0, the results will not be
| fully deterministic". And of course this is just the very
| bottom layer of model-config and in a system of diverse
| agents using different frameworks and models, it's even
| worse.
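| One commonly cited reason is that floating-point addition isn't
| associative, so a kernel that changes its reduction order
| between runs can produce slightly different logits from
| identical inputs even at temperature 0. A two-line illustration
| of the underlying effect, not tied to any particular model:
|
|   a, b, c = 1e16, -1e16, 1.0
|   print((a + b) + c)   # 1.0
|   print(a + (b + c))   # 0.0 -- same numbers, different grouping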
| kovek wrote:
| Well, the LLM may re-infer the whole state fully on every
| instruction. Temperature 0 is deterministic and that's
| what we are looking for. If the model is trained properly
| on how the CPU state + instructions should be handled,
| then it should be able to produce the next state.
| lossolo wrote:
| With temp = 0 if the model is off by one bit at step k,
| all subsequent steps are deterministically wrong.
|
| Your previous example shows the best case, which is that a
| model can sometimes follow a textual recipe for long
| multiplication on short inputs. That's not the same as learning
| a length-generalizing, bit-exact algorithm.
|
| Basically, what you've shown is that the model can describe the
| algorithm. It doesn't show it can execute it at scale. Without
| writable state and bit-exact ops, errors grow with length, and
| "focus more" only slows that failure, it doesn't eliminate it.
| alyxya wrote:
| I think it should be able to learn multiplication with chain of
| thought. Without it, it's probably really difficult to generalize
| the multiplication of two n-digit integers when you have to
| accumulate up to n products of digits and handle carrying for
| each output digit.
| amelius wrote:
| What probably works: ask it to write a Python program, but tell
| it not to use any built-in multiplication functions.
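| A guess at the kind of program you'd hope it writes (not a claim
| about what any particular model actually emits): binary
| shift-and-add, with no '*' anywhere.
|
|   def multiply(a, b):
|       # decompose b into binary digits and add the
|       # correspondingly doubled copies of a
|       if b < 0:
|           return -multiply(a, -b)
|       result = 0
|       while b:
|           if b & 1:
|               result += a
|           a <<= 1
|           b >>= 1
|       return result
|
|   assert multiply(283, 9284) == 2627372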
| janalsncm wrote:
| Then your transformer would need to know Python.
| ddingus wrote:
| A while back I saw a post where people ran a model over and over
| to accomplish a code base port from one language to another.
|
| In their prompt, they told it to leave itself a note and to
| accomplish something each time.
|
| Then they put the model in a loop and it worked. In one instance,
| a model removed itself from the loop by editing a file or some
| other basic means.
|
| To me, iterative tasks like multiplication and long division
| look an awful lot like the code port experiment.
|
| Putting models into loops so they get more than one bite at the
| task seems to be a logical progression to improve capability.
| CaveTech wrote:
| The number of paths in the wrong direction is infinitely
| greater than the number in the right direction. You'll quickly
| realize this doesn't actually scale.
| hodgehog11 wrote:
| I'm a bit confused by this; are you referring to
| vanishing/exploding gradients during training or iteration at
| inference? If the former, this is only true if you take too
| many steps. If the latter, we already know this works and
| scales well.
| CaveTech wrote:
| The latter, and I would disagree that "this works and
| scales well" in the general sense. It clearly has very
| finite bounds, given that we haven't achieved AGI by
| running an LLM in a loop.
|
| The approach of "try a few more things before stopping" is
| a great strategy, akin to taking a few more stabs at RNG.
| It's not the same as saying keep trying until you get there
| - you won't.
| hodgehog11 wrote:
| > It clearly has very finite bounds, given that we haven't
| achieved AGI by running an LLM in a loop.
|
| That's one hell of a criterion. Test-time inference
| undergoes a similar scaling law to pretraining, and has
| resulted in dramatically improved performance on many
| complex tasks. Law of diminishing returns kicks in of
| course, but this doesn't mean it's ineffective.
|
| > akin to taking a few more stabs at RNG
|
| No it isn't. Scaling laws cannot appear with glassy
| optimisation procedures (essentially iid trials until you
| succeed, the mental model you seem to be implying here).
| They only appear if the underlying optimisation is
| globally connected and roughly convex. It's no different
| than gradient descent in this regard.
| smartmic wrote:
| Yesterday, I learned the opposite. Simon Willison demonstrated in
| another thread how this works out ... see
| https://news.ycombinator.com/item?id=45686295
| skinner_ wrote:
| That's very cool, but it's not an apples to apples comparison.
| The reasoning model learned how to do long multiplication.
| (Either from the internet, or from generated examples of long
| multiplication that were used to sharpen its reasoning skills.
| In principle, it might have invented it on its own during RL,
| but no, I don't think so.)
|
| In this paper, the task is to learn how to multiply, strictly
| from AxB=C examples, with 4-digit numbers. Their vanilla
| transformer can't learn it, but the one with (their variant of)
| chain-of-thought can. These are transformers that have never
| encountered written text, and are too small to understand any
| of it anyway.
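| For anyone skimming, the training data being described looks
| roughly like the sketch below: bare A x B = C examples over
| 4-digit numbers, one token per digit, least significant digit
| first. The separators and exact tokenization here are my guess,
| not the paper's actual format.
|
|   import random
|
|   def make_example(rng):
|       a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
|       rev = lambda n: list(str(n))[::-1]  # digit tokens, LSB first
|       return rev(a) + ["*"] + rev(b) + ["="] + rev(a * b)
|
|   rng = random.Random(0)
|   for _ in range(3):
|       print(" ".join(make_example(rng)))
|   # digits in, digits out -- no text, no reasoning traces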
| faragon wrote:
| Maybe AGI will come with the equivalent of a "Turing machine",
| enabling some kind of computability.
| Nifty3929 wrote:
| Numbers aren't language, or even sequences of tokens, or vectors.
|
| There is an inherent numeric-ness and logic to math that I don't
| think we can represent well using LLMs and transformers.
|
| 3 isn't about the word "three" - it is a quantity or a
| measurement. And 3x4 is a specific numerical operation that is
| not really contained in that sequence of symbols.
| akomtu wrote:
| IMO, the mystery has a simple explanation: addition is mostly
| local in nature, in that the 5th digit of the input typically
| affects only the 5th or 4th digit of the output, while
| multiplication is not local at all. That being said, LLMs don't
| understand addition either: the illusion will break down on
| very large inputs.
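| A quick way to see that locality claim (a toy check, not from
| the paper): bump a single input digit and count how many output
| digits change.
|
|   def changed_digits(x, y, width=12):
|       xs, ys = f"{x:0{width}d}", f"{y:0{width}d}"
|       return [i for i in range(width) if xs[i] != ys[i]]
|
|   a, b, bump = 734512, 98341, 1000   # bump one digit of a
|   print(changed_digits(a + b, a + bump + b))    # one position changes
|   print(changed_digits(a * b, (a + bump) * b))  # changes spread widely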
___________________________________________________________________
(page generated 2025-10-24 23:00 UTC)