[HN Gopher] Theoretical limitations of multi-layer Transformer
       ___________________________________________________________________
        
       Theoretical limitations of multi-layer Transformer
        
       Author : fovc
       Score  : 95 points
       Date   : 2025-01-31 17:48 UTC (14 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | cs702 wrote:
       | Huh. I just skimmed this and quickly concluded that it's
       | definitely _not_ light reading.
       | 
       | It sure looks and smells like good work, so I've added it to my
       | reading list.
       | 
       | Nowadays I feel like my reading list is growing faster than I can
       | go through it.
        
         | Matthyze wrote:
         | I'd humbly like to ask people who've read the paper whether
         | it's worth trying to understand it without a great math
          | background. The paper looks interesting but daunting, and I'd
         | hate to sink a lot of time into it and leave defeated.
         | 
         | It sometimes sucks being in ML with 'only' a CS background.
         | Feels like all the math and physics grads are running around
         | having fun with their fancy mathematics, while I stand here,
         | feeling dimwitted.
        
           | aketchum wrote:
           | Most of it is linear algebra and convex optimization. You can
           | learn a lot of it with free resources from MIT, Stanford,
           | Georgia Tech, or YouTube. If you want more of a school style
            | learning environment, you can enroll in the Georgia Tech
            | OMSCS program and just take the classes covering the math
            | you're interested in. There's no reason you have to
            | graduate, and it's maybe $800 a course.
        
             | Matthyze wrote:
             | Thanks! Now might actually be a great time for me to pick
             | up the material. If anyone has suggestions about
             | particularly good free/online mathematics courses for ML,
             | I'd really love to hear it. Or books!
        
               | pcooperchi wrote:
               | https://mml-book.github.io/
        
           | szundi wrote:
           | Would be fun to ask o3
        
             | drodgers wrote:
              | It looks like o3-mini doesn't yet support PDF input, but
              | here's what Claude Opus had to say:
              | 
              | This paper proves the first unconditional limitations of
              | multi-layer decoder-only transformers, which are the
              | dominant architecture used in large language models
              | (LLMs) today. The key takeaways are:
              | 
              | 1. Any constant-depth L-layer decoder-only transformer
              | requires a polynomial model dimension of at least
              | n^(2^(-4L)) to perform sequential composition of L
              | functions over an input of n tokens. This is an
              | exponentially larger model size than what an (L+1)-layer
              | transformer needs.
              | 
              | 2. There are certain tasks, like L-step sequential
              | function composition, that are exponentially harder for
              | L-layer transformers compared to (L+1)-layer
              | transformers. This proves a depth-width tradeoff.
              | 
              | 3. The same sequential composition tasks used above can
              | be solved by an exponentially shallower and smaller
              | encoder-only transformer compared to decoder-only
              | transformers. This unconditionally separates the
              | computational power of encoders vs. decoders.
              | 
              | 4. Sequential composition tasks that are hard for
              | L-layer transformers become exponentially easier when
              | allowing L steps of "chain of thought" reasoning. This
              | provides the first provable benefit of chain-of-thought
              | prompting.
              | 
              | The results rely on a new theoretical model called the
              | "autoregressive communication model" that abstracts the
              | key aspects of decoder-only attention. The lower bounds
              | are proved by analyzing the ability of this model to
              | perform multi-step function composition.
              | 
              | The proofs appear technically sound, relying on
              | information-theoretic arguments without unproven
              | assumptions. The separation between encoders and
              | decoders was previously only known assuming unproven
              | complexity conjectures.
              | 
              | The key limitation is that the theoretical model makes
              | some simplifications compared to practical transformers.
              | But overall, this looks like an important foundational
              | result that rigorously demonstrates limitations of the
              | currently dominant LLM architecture and provides
              | theoretical motivation for alternative approaches like
              | encoder-decoder models, chain of thought, and deep but
              | narrow models. The unconditional nature of the results
              | makes them more definitive than prior work relying on
              | complexity assumptions.
              | 
              | The sequential function composition problem that the
              | paper shows to be hard for constant-depth decoder-only
              | transformers essentially requires iteratively applying
              | a sequence of functions, where each function takes as
              | input the output of the previous function application.
              | In terms of practical everyday reasoning tasks, this
              | could potentially capture things like:
              | 
              | 1. Multi-step arithmetic: A task like "What is 5 plus 3,
              | minus 2, times 4, divided by 8?" requires sequentially
              | applying the arithmetic operations. Each operation takes
              | as input the result of the previous one.
              | 
              | 2. Chained logical inference: Deriving a conclusion from
              | a sequence of logical premises, where each inference
              | step builds on the previous conclusions. For example:
              | "If A then B. If B then C. If C then D. A is true.
              | Therefore D is true." Reaching the final conclusion
              | requires iteratively applying the logical rules.
              | 
              | 3. Complex query decomposition: Answering a complex
              | query that requires multiple steps of information
              | retrieval and computation, like "How many countries have
              | a population larger than Russia's GDP per capita?" This
              | might involve retrieving Russia's GDP, dividing by
              | population to get the per-capita value, then comparing
              | each country's population to this value and counting
              | matches.
              | 
              | 4. Multi-hop reasoning over knowledge: Drawing
              | inferences that require combining multiple facts from a
              | knowledge base. E.g. "The tallest building in the city
              | with the largest population in John's home state" would
              | require finding John's state, its largest city, and the
              | tallest building there, using separate knowledge
              | lookups.
              | 
              | 5. Sequential decision making: Planning tasks that
              | require a sequence of decisions, where each decision
              | depends on the state resulting from previous decisions.
              | 
              | The paper suggests that constant-depth transformers, the
              | current architecture of choice for LLMs, may struggle
              | with tasks like these that require longer chains of
              | computation, despite their strong performance on many
              | benchmarks. Exploring alternative architectures like
              | deeper models or encoder-decoder structures may help
              | address these limitations.
              | 
              | Of course, the actual difficulty of any given task for
              | an LLM depends on many factors beyond just the abstract
              | computational structure. But this result provides a
              | valuable lens for understanding the types of everyday
              | reasoning that may be challenging for the current
              | paradigm and motivates further research into more
              | compositional and reasoning-oriented architectures.
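
The L-step sequential composition task described in the summary above can be sketched in a few lines of plain code. This is a hypothetical illustration only: `sequential_compose` and the arithmetic steps are stand-ins, not the paper's formal construction.

```python
# Sketch of L-step sequential function composition: compute
# f_L(f_{L-1}(... f_1(x) ...)), where each function consumes the
# previous function's output. The paper's lower bound says a
# constant-depth decoder-only transformer needs a large model
# dimension to solve tasks with this iterated structure.

def sequential_compose(functions, x):
    """Apply each function, in order, to the previous output."""
    for f in functions:
        x = f(x)
    return x

# The multi-step arithmetic example from the summary:
# "What is 5 plus 3, minus 2, times 4, divided by 8?"
steps = [
    lambda v: v + 3,  # 5 -> 8
    lambda v: v - 2,  # 8 -> 6
    lambda v: v * 4,  # 6 -> 24
    lambda v: v / 8,  # 24 -> 3.0
]
print(sequential_compose(steps, 5))  # 3.0
```

The point of the toy: step k cannot start until step k-1 has finished, which is exactly the serial dependency the lower bound exploits.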
        
           | svachalek wrote:
           | I'm not a math person either, but I'm familiar with some of
           | the terminology. The two things I got out of this paper were:
           | 
            | 1. Depth is more important than breadth for making
            | transformers smarter. That is, for a given model size,
            | it will be more powerful with more, smaller layers than
            | with fewer, bigger ones. Interestingly, Mistral just
            | updated their small model yesterday with a big reduction
            | in layers in order to improve performance. Among the many
            | more technical ways they put it, they do say directly
            | that "depth plays a more critical role than width in
            | reasoning and composition tasks".
           | 
           | 2. As I understand it, they are claiming to be able to prove
           | mathematically that Chain of Thought such as seen in the new
           | DeepSeek R1 and GPT o1/o3 models creates results that
           | wouldn't be possible without it. The thought chains
           | effectively act as additional layers, and per the previous
           | point, the more layers the better. "From a theoretical view,
           | CoT provides Transformer with extra computation space, and
           | previous work ... proved that log-precision Transformer with
           | CoT could simulate any polynomial-time algorithm. Therefore,
           | by further assuming certain complexity conjecture ... their
           | results imply that constant depth Transformer with CoT could
            | simulate poly-time algorithm, while constant depth
            | Transformer ... itself can not solve P-complete task."
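
The chain-of-thought point above can be illustrated with a toy sketch. To hedge: `answer_with_cot` is a hypothetical stand-in for the idea that each emitted "thought" token performs one composition step, so a shallow model only ever has to apply a single function per forward pass; it is not the paper's formal model.

```python
# Toy illustration of why chain of thought acts like extra layers:
# computing f_L(... f_1(x) ...) in ONE forward pass needs depth that
# grows with L, but if the model writes each intermediate result out
# as a token, every pass only applies one function to the previously
# emitted value.

def answer_with_cot(functions, x):
    """Emit one 'thought' (intermediate result) per composition step."""
    transcript = [x]
    for f in functions:
        transcript.append(f(transcript[-1]))  # one shallow step each
    return transcript  # the last entry is the final answer

steps = [lambda v: v + 3, lambda v: v - 2, lambda v: v * 4]
print(answer_with_cot(steps, 5))  # [5, 8, 6, 24]
```

The transcript of intermediate values is the "extra computation space" the quote refers to: L thought steps substitute for L extra layers.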
        
           | nighthawk454 wrote:
           | A nice trick for most papers is to skip the middle (at
           | first). Just don't read most of the lines of math. Focus on
           | the inputs and outputs.
           | 
           | If the premise and conclusion don't make sense on
            | fundamentals, the math isn't likely to fix it. Most lines
            | are literally equals signs, just walking you through some
            | equivalences as proof: one large statement saying "If ABC,
            | then ... (and then ... and then ... and then ...) and
            | finally XYZ".
           | 
           | The middle 'and then's aren't really that important if the
           | conclusion XYZ isn't interesting. Or much more commonly, the
           | ABC premise is false anyway so who cares.
           | 
           | Most readers I'd wager are not sitting here deciphering
           | opaque gradient derivations every single paper. Just skip it
           | unless it proves worthy
        
             | robotresearcher wrote:
             | Good advice. I ran an academic lab for a long time, read
             | and reviewed a lot of papers, and trained students to read
             | them. My process is as follows.
             | 
             | In order, and quickly, I look at the abstract, the picture
             | at the top of page 2 (unless it was on page 1 like vision
             | and graphics papers tend to do), the references, the
              | conclusion, and the lit review. If I'm still interested,
              | I try to scan the whole thing to decide what the main
              | point of the paper is. Then, if I care to, I start again
              | and read it linearly, start to finish.
             | 
             | If I'd agreed to review a paper, I don't get to abort the
             | process. Otherwise I can bail at any step.
        
               | throw_pm23 wrote:
               | It's sad that papers in this area are so standardized in
               | format though. Some of the classic papers break every
               | convention and are written in very idiosyncratic ways. I
               | want to read papers that can be enjoyed slowly and that
               | change my life, not papers where I get trained to read
               | them quickly in a certain way and that have prescribed
               | sections and predictable figures. Life is too short for
               | this rote work.
        
               | peppery wrote:
               | Agreed, idiosyncratic voice is so life- and mind-
               | affirming in papers. (Do you mind sharing examples of
                | three papers that you _did_ enjoy slowly and that
                | changed your conceptual life?)
        
               | idiotsecant wrote:
               | The paper is a tool for conveying understanding.
               | Standardized tools are a good thing.
        
               | nighthawk454 wrote:
               | Nice! Yeah that workflow feels pretty relatable. My other
               | rule of thumb is if it's an architecture paper (system or
               | neural net) without a diagram, I immediately bail haha
        
           | DoctorOetker wrote:
           | It's never too late to study physics or mathematics. I would
           | however recommend studying physics since it will force you to
            | study the slices of mathematics that were motivated by
           | necessity. A lot of mathematics is genuinely interesting in
           | and of itself, or results in mysteriously rich structures,
           | but has not necessarily found many applications. A lot of
           | mathematics was in fact motivated and (often sloppily)
           | invented by physicists, before being put on more rigorous
           | footing by mathematicians...
           | 
           | Studying physics without learning math is impossible, but
           | physics results in a curricular selection of math which is
           | highly relevant to machine learning.
        
           | gessha wrote:
           | Most of the math in ML papers is smoke and mirrors. Some good
           | theoretical papers exist but too many others cover up
           | accidental discovery with vague mathy explanations.
           | 
           | If you really want to go deep on a topic in ML, take a
           | subtopic paper and read related works until you can connect
           | the basics to the paper. This consists of not only writing
           | down the math but also coding the concepts. ML students go
           | through this process when they choose a topic and start
           | working on it. Focus on one topic and learn it. Present it to
           | yourself or others. Find holes in the literature and try
           | exploring them. Science!
        
       | cubefox wrote:
       | Loosely related thought: A year ago, there was a lot of talk
       | about the Mamba SSM architecture replacing transformers.
        | Apparently that hasn't happened so far.
        
       ___________________________________________________________________
       (page generated 2025-02-01 08:01 UTC)