[HN Gopher] Theoretical limitations of multi-layer Transformer
___________________________________________________________________
Theoretical limitations of multi-layer Transformer
Author : fovc
Score : 95 points
Date : 2025-01-31 17:48 UTC (14 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| cs702 wrote:
| Huh. I just skimmed this and quickly concluded that it's
| definitely _not_ light reading.
|
| It sure looks and smells like good work, so I've added it to my
| reading list.
|
| Nowadays I feel like my reading list is growing faster than I can
| go through it.
| Matthyze wrote:
| I'd humbly like to ask people who've read the paper whether
| it's worth trying to understand it without a great math
| background. The paper looks interesting but daunting, and I'd
| hate to sink a lot of time into it and leave defeated.
|
| It sometimes sucks being in ML with 'only' a CS background.
| Feels like all the math and physics grads are running around
| having fun with their fancy mathematics, while I stand here,
| feeling dimwitted.
| aketchum wrote:
| Most of it is linear algebra and convex optimization. You can
| learn a lot of it with free resources from MIT, Stanford,
| Georgia Tech, or YouTube. If you want more of a school style
| learning environment you can enroll in the Georgia Tech OMSCS
| program and just take the classes covering the math you're
| interested in. There's no requirement to graduate, and each
| course is maybe $800.
| Matthyze wrote:
| Thanks! Now might actually be a great time for me to pick
| up the material. If anyone has suggestions for particularly
| good free/online mathematics courses for ML, I'd really love
| to hear them. Or books!
| pcooperchi wrote:
| https://mml-book.github.io/
| szundi wrote:
| Would be fun to ask o3
| drodgers wrote:
| It looks like o3-mini doesn't yet support PDF input, but
| here's what Claude Opus had to say:
|
| This paper proves the first unconditional limitations of
| multi-layer decoder-only transformers, which are the dominant
| architecture used in large language models (LLMs) today. The
| key takeaways are:
|
| 1. Any constant-depth L-layer decoder-only transformer
| requires a polynomial model dimension of at least n^(2^(-4L))
| to perform sequential composition of L functions over an
| input of n tokens. This is an exponentially larger model size
| than what an (L+1)-layer transformer needs.
|
| 2. There are certain tasks, like L-step sequential function
| composition, that are exponentially harder for L-layer
| transformers compared to (L+1)-layer transformers. This
| proves a depth-width tradeoff.
|
| 3. The same sequential composition tasks used above can be
| solved by an exponentially shallower and smaller encoder-only
| transformer compared to decoder-only transformers. This
| unconditionally separates the computational power of encoders
| vs decoders.
|
| 4. Sequential composition tasks that are hard for L-layer
| transformers become exponentially easier when allowing L
| steps of "chain of thought" reasoning. This provides the
| first provable benefit of chain of thought prompting.
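|
| A quick way to get a feel for the bound in point 1 is to
| evaluate it numerically. A minimal sketch, assuming the
| exponent exactly as quoted above; it shows how fast the
| required width bound weakens as the depth L grows:
|
|   # Evaluate the quoted lower bound d >= n^(2^(-4L)) for a
|   # few depths L and input lengths n. The formula is taken
|   # verbatim from the summary above, not from the paper.
|   for L in range(1, 5):
|       for n in (10**4, 10**6, 10**9):
|           d_min = n ** (2.0 ** (-4 * L))
|           print(f"L={L}  n={n:>10}  d >= {d_min:.2f}")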
| The results rely on a new theoretical model called the
| "autoregressive communication model" that abstracts the key
| aspects of decoder-only attention. The lower bounds are
| proved by analyzing the ability of this model to perform
| multi-step function composition.
|
| The proofs appear technically sound, relying on information-
| theoretic arguments without unproven assumptions. The
| separation between encoders and decoders was previously only
| known assuming unproven complexity conjectures.
|
| The key limitation is that the theoretical model makes some
| simplifications compared to practical transformers. But
| overall, this looks like an important foundational result
| that rigorously demonstrates limitations of the currently
| dominant LLM architecture and provides theoretical motivation
| for alternative approaches like encoder-decoder models, chain
| of thought, and deep but narrow models. The unconditional
| nature of the results makes them more definitive than prior
| work relying on complexity assumptions.
|
| The sequential function composition problem that the paper
| shows to be hard for constant-depth decoder-only transformers
| essentially requires iteratively applying a sequence of
| functions, where each function takes as input the output of
| the previous function application.
|
| In terms of practical everyday reasoning tasks, this could
| potentially capture things like:
| 1. Multi-step arithmetic: A task like "What is 5 plus 3,
| minus 2, times 4, divided by 8" requires sequentially
| applying the arithmetic operations. Each operation takes as
| input the result of the previous one.
|
| 2. Chained logical inference: Deriving a conclusion from a
| sequence of logical premises, where each inference step
| builds on the previous conclusions. For example: "If A then
| B. If B then C. If C then D. A is true. Therefore D is true."
| Reaching the final conclusion requires iteratively applying
| the logical rules.
|
| 3. Complex query decomposition: Answering a complex query
| that requires multiple steps of information retrieval and
| computation. Like "How many countries have a population
| larger than Russia's GDP per capita?" This might involve
| retrieving Russia's GDP, dividing by population to get per
| capita, then comparing each country's population to this
| value and counting matches.
|
| 4. Multi-hop reasoning over knowledge: Drawing inferences
| that require combining multiple facts from a knowledge base.
| E.g. "The tallest building in the city with the largest
| population in John's home state" would require finding John's
| state, its largest city, and the tallest building there,
| using separate knowledge lookups.
|
| 5. Sequential decision making: Planning tasks that require a
| sequence of decisions, where each decision depends on the
| state resulting from previous decisions.
|
| The paper suggests that constant-depth transformers, the
| current architecture of choice for LLMs, may struggle with
| tasks like these that require longer chains of computation,
| despite their strong performance on many benchmarks.
| Exploring alternative architectures like deeper models or
| encoder-decoder structures may help address these
| limitations.
|
| Of course, the actual difficulty of any given task for an LLM
| depends on many factors beyond just the abstract
| computational structure. But this result provides a valuable
| lens for understanding the types of everyday reasoning that
| may be challenging for the current paradigm and motivates
| further research into more compositional and reasoning-
| oriented architectures.
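|
| To make the sequential composition structure concrete, here
| is a minimal Python sketch; the function names and the
| arithmetic instance are illustrative assumptions, not the
| paper's construction:
|
|   from functools import reduce
|
|   def compose_sequentially(x, functions):
|       # Apply functions[0] to x, then functions[1] to that
|       # result, and so on: f_L(...f_2(f_1(x))...).
|       return reduce(lambda acc, f: f(acc), functions, x)
|
|   # The multi-step arithmetic example from point 1:
|   # "What is 5 plus 3, minus 2, times 4, divided by 8"
|   steps = [
|       lambda v: v + 3,
|       lambda v: v - 2,
|       lambda v: v * 4,
|       lambda v: v / 8,
|   ]
|   print(compose_sequentially(5, steps))  # ((5 + 3 - 2) * 4) / 8 = 3.0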
| svachalek wrote:
| I'm not a math person either, but I'm familiar with some of
| the terminology. The two things I got out of this paper were:
|
| 1. Depth is more important than breadth for making
| transformers smarter. That is, for a given model size it will
| be more powerful with more, smaller layers than with fewer,
| bigger ones. Interestingly, Mistral just updated their small
| model yesterday with a big reduction in layers in order to
| improve performance. Among the many more technical ways they
| put it, they do say directly: "depth plays a more critical
| role than width in reasoning and composition tasks".
|
| 2. As I understand it, they are claiming to be able to prove
| mathematically that Chain of Thought, as seen in the new
| DeepSeek R1 and OpenAI o1/o3 models, creates results that
| wouldn't be possible without it. The thought chains
| effectively act as additional layers, and per the previous
| point, the more layers the better. "From a theoretical view,
| CoT provides Transformer with extra computation space, and
| previous work ... proved that log-precision Transformer with
| CoT could simulate any polynomial-time algorithm. Therefore,
| by further assuming certain complexity conjecture ... their
| results imply that constant depth Transformer with CoT could
| simulate poly-time algorithm, while constant depth
| Transformer ... itself can not solve P-complete task."
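|
| A toy way to see the "thought chains act as extra layers"
| point in code (an illustrative sketch, not the paper's
| autoregressive communication model): a model that can apply
| only one function per forward pass still solves an L-step
| composition if each intermediate result is written back into
| the context before the next pass:
|
|   def one_step_model(context, functions):
|       # One "forward pass": apply the next unapplied function
|       # to the most recent value in the context.
|       step = len(context) - 1
|       return functions[step](context[-1])
|
|   functions = [lambda v: v + 3, lambda v: v - 2, lambda v: v * 4]
|   context = [5]            # the prompt: the initial input
|   for _ in functions:      # each CoT step acts like an extra layer
|       context.append(one_step_model(context, functions))
|   print(context)           # [5, 8, 6, 24]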
| nighthawk454 wrote:
| A nice trick for most papers is to skip the middle (at
| first). Just don't read most of the lines of math. Focus on
| the inputs and outputs.
|
| If the premise and conclusion don't make sense on
| fundamentals, the math isn't likely to fix it. Most lines are
| literally equals signs - just walking you through some
| equivalences as proof. It's one large statement saying "If
| ABC, then ... (and then ... and then ... and then ...) and
| finally XYZ".
|
| The middle 'and then's aren't really that important if the
| conclusion XYZ isn't interesting. Or much more commonly, the
| ABC premise is false anyway so who cares.
|
| Most readers, I'd wager, are not sitting here deciphering
| opaque gradient derivations for every single paper. Just skip
| it unless the paper proves worthy.
| robotresearcher wrote:
| Good advice. I ran an academic lab for a long time, read
| and reviewed a lot of papers, and trained students to read
| them. My process is as follows.
|
| In order, and quickly, I look at the abstract, the picture
| at the top of page 2 (unless it's on page 1, as vision and
| graphics papers tend to do), the references, the conclusion,
| and the lit review. If I'm still interested, I scan the whole
| thing to decide what the main point of the paper is. Then, if
| I care to, I start again and read it linearly, start to
| finish.
|
| If I've agreed to review a paper, I don't get to abort the
| process. Otherwise I can bail at any step.
| throw_pm23 wrote:
| It's sad that papers in this area are so standardized in
| format though. Some of the classic papers break every
| convention and are written in very idiosyncratic ways. I
| want to read papers that can be enjoyed slowly and that
| change my life, not papers where I get trained to read
| them quickly in a certain way and that have prescribed
| sections and predictable figures. Life is too short for
| this rote work.
| peppery wrote:
| Agreed, idiosyncratic voice is so life- and mind-
| affirming in papers. (Do you mind sharing examples of three
| papers that you _did_ enjoy slowly and that changed your
| conceptual life?)
| idiotsecant wrote:
| The paper is a tool for conveying understanding.
| Standardized tools are a good thing.
| nighthawk454 wrote:
| Nice! Yeah that workflow feels pretty relatable. My other
| rule of thumb is if it's an architecture paper (system or
| neural net) without a diagram, I immediately bail haha
| DoctorOetker wrote:
| It's never too late to study physics or mathematics. I would
| however recommend studying physics since it will force you to
| study the slices of mathematics that were motivated by
| necessity. A lot of mathematics is genuinely interesting in
| and of itself, or results in mysteriously rich structures,
| but has not necessarily found many applications. A lot of
| mathematics was in fact motivated and (often sloppily)
| invented by physicists, before being put on more rigorous
| footing by mathematicians...
|
| Studying physics without learning math is impossible, but
| physics results in a curricular selection of math which is
| highly relevant to machine learning.
| gessha wrote:
| Most of the math in ML papers is smoke and mirrors. Some good
| theoretical papers exist but too many others cover up
| accidental discovery with vague mathy explanations.
|
| If you really want to go deep on a topic in ML, take a
| subtopic paper and read related works until you can connect
| the basics to the paper. This consists of not only writing
| down the math but also coding the concepts. ML students go
| through this process when they choose a topic and start
| working on it. Focus on one topic and learn it. Present it to
| yourself or others. Find holes in the literature and try
| exploring them. Science!
| cubefox wrote:
| Loosely related thought: A year ago, there was a lot of talk
| about the Mamba SSM architecture replacing transformers.
| Apparently that hasn't happened so far.
___________________________________________________________________
(page generated 2025-02-01 08:01 UTC)