[HN Gopher] New attention mechanisms that outperform standard mu...
___________________________________________________________________
New attention mechanisms that outperform standard multi-head
attention
Author : snats
Score : 97 points
Date : 2024-05-29 19:33 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| 317070 wrote:
| > we evaluate the presented attention mechanisms on MNIST,
| CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets.
|
| It sounds amazing, but I'm not holding my breath this one will
| scale.
| r2_pilot wrote:
| You could provide the quote in full ("In addition to providing
| rigorous mathematical comparisons,") so that the author's work
| in proving their point is not hidden by your effortless snark.
| 317070 wrote:
| I am not sure how much experience you have in this area of
| research, but maybe I can shed some light on the background
| here. The "Attention is all you need" paper is now almost 7
| years old. Those 7 years have seen a flood of proposals on
| improving transformers, only very few have been retained.
|
| There is very little theory behind transformer-style
| architectures. Fundamentally, the proof is in the pudding, not
| in "mathematical comparisons". A proposed change needs to scale
| better; that is all that matters. And the datasets mentioned
| are simply unsuitable for showing any scaling. I think the
| biggest dataset in this list is 160MB compressed.
|
| I am not sure why this article was posted here on Hacker News.
| I would estimate that even just today, about 3 papers have been
| posted on arXiv with proposed transformer architecture changes,
| tested on larger datasets than the ones mentioned here.
| 317070 wrote:
| I checked, and on the 28th of May, arXiv saw 14 submissions
| with "transformer" in the title, and I found 3 of them with
| proposals tested on larger datasets (I did not check all of
| them; there might have been more than these three).
|
| https://arxiv.org/pdf/2405.18240
| https://arxiv.org/abs/2405.17951
| https://arxiv.org/pdf/2405.17821
| janalsncm wrote:
| Sometimes it doesn't need to. You might have a problem that
| isn't web scale and where transfer learning is hard. We also
| need techniques for small datasets even if they are slower to
| train or are outperformed after 5 billion tokens.
| marcinzm wrote:
| These seem to be very tiny models, and as I understand it, LLMs
| behave fairly differently at different scales.
|
| The speed gain seems to show up only on an M2 chip, and I wonder
| if there are already much better non-GPU-optimized attention
| approaches out there for those use cases.
| toxik wrote:
| I feel like FlashAttention is the relevant baseline here.
| lalaland1125 wrote:
| FlashAttention is completely orthogonal to this. This work is
| about speeding up the computation of Q, K and V vectors while
| FlashAttention is about speeding up the attention algorithm
| itself.
|
| You could combine the two.
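|
| To make that split concrete, here is a minimal sketch in plain
| PyTorch (shapes and weight names are illustrative, not taken
| from the paper):
|
|     import torch
|
|     d_model, n_tokens = 32, 32
|     x = torch.randn(n_tokens, d_model)
|
|     # Input projections that produce Q, K and V -- the part this
|     # paper proposes to restructure.
|     W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
|     q, k, v = x @ W_q, x @ W_k, x @ W_v
|
|     # The attention computation itself -- the part FlashAttention
|     # accelerates, by fusing/tiling these two lines into one
|     # kernel without changing the math.
|     scores = (q @ k.T) / d_model ** 0.5
|     out = torch.softmax(scores, dim=-1) @ v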
| lalaland1125 wrote:
| > The Transformer models, used in this experiment, all have a
| single attention layer with model dimension and context length
| 32.
|
| I think we are going to need to see more experiments,
| especially because the theoretical motivations here are weak.
| behnamoh wrote:
| Pressing [X] to doubt.
|
| There are many alternatives to the good old transformers: RWKV,
| Mamba, etc.
|
| Yet here we are, still using transformers (actually, just the
| decoder part). Is it because the industry has too much inertia to
| pick up new methods? I doubt it because there's $BILLIONS in this
| market and everyone wants a piece of the AI cake, so it doesn't
| make sense to ignore promising methods.
|
| Why, then, do we barely see any production-ready non-transformer
| LLMs these days?
| solidasparagus wrote:
| It's going to take time. I can't speak to the actual quality of
| mamba other than to say the authors are extraordinary and
| should be taken seriously.
|
| But training a large model requires a huge amount of capital, so
| the biggest runs are designed around risk minimization. And
| remember, many of the people making decisions about these runs
| got to their positions by doing transformer-centric work. The
| true value of mamba is still unclear to me, given that very-long-
| context techniques are proving effective for transformers.
| Buttons840 wrote:
| I believe the attention mechanism we use now was introduced in
| 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in
| their paper titled "Neural Machine Translation by Jointly
| Learning to Align and Translate."
|
| 2014. It took almost a decade for the potential of this
| technique to be realized and come to the attention (heh) of
| most developers. I don't know what researchers are doing with
| Mamba and RWKV, but we should let them cook.
| hansvm wrote:
| > I doubt it because there's $BILLIONS in this market and
| everyone wants a piece of the AI cake, so it doesn't make sense
| to ignore promising methods.
|
| I also doubt this result. The "why haven't the $BILLIONS already
| been invested" question is interesting in its own right, though.
| Generally, the literature on the theoretical bounds of swarm
| optimization is pertinent. Those $BILLIONS aren't being
| invested by a single omniscient entity, so they're subject to
| interesting constraints.
|
| As one of many explanations, fragmentation is common. If
| $BILLIONS are split between greedy, mostly non-interacting
| entities (e.g., competing companies each trying to replace the
| transformer in a bounded number of hours and dollars while
| securing their market dominance), you expect, probabilistically,
| each of them to converge on the same strategy (or strategies),
| especially if the "best" alternatives are
| obvious or globally known for some reason (e.g., some solutions
| intuitively feel "natural" or your researchers publish early
| results or you have employee movement between companies or
| whatever). Riskier strategies won't be touched, and you'll have
| $BILLIONS spent duplicating the same most likely alternatives
| when $MILLIONS would have sufficed.
|
| The normal counterpoint is that a few big players dominate the
| spending, and they would have higher internal coordination.
| Interestingly though, they don't usually, except when that
| coordination would tend to enforce the same strategies the smaller
| competition is pursuing. How often do you hear about stories
| like the misaligned Google+ integrations resulting in employee
| bonuses for poor customer experiences vs a forward-thinking
| executive actively devoting funds to a meaningful number of
| competing solutions? Approximately never. It's career suicide if
| you fail and you depend on other people for your position; you
| _are_ actually more likely to outdo the competition with your
| increased resources if you just lean into the "best" alternatives;
| and for a whole host of reasons, very few
| executives (except for people with real power) will coordinate
| a more comprehensive strategy, certainly not one orthogonal to
| the competition's just for the sake of allocating the global
| $BILLIONS more efficiently.
|
| Separately (going back to the merits of the preprint), I'll
| probably read the full thing later, but a few points stuck out
| as suspicious on an initial skim. Notably, they seem to mix
| linear transformations in different domains. E.g., `xa` is
| linear in both `x` and `a`, and `vx` is linear in both `v` and
| `x`, but `xax` is _not_ linear in `x`, even if you try to
| "prove" that idea with `v = xa`. Linearity in `v` isn't enough
| to make the composition linear in `x`. A lot of their results
| seem to rely on eliminating those "redundant" computations,
| even though the things they're replacing with linear
| computations are actually higher order polynomials. On an
| initial skim, the other "novel" ideas also don't seem well
| grounded.
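|
| A quick numerical check of that point (toy shapes; `A` stands in
| for their `a`, and this only tests additivity, it reproduces
| nothing from the paper):
|
|     import torch
|
|     d = 4
|     A = torch.randn(d, d)
|     x, y = torch.randn(1, d), torch.randn(1, d)
|
|     lin = lambda z: z @ A           # "xa": linear in x
|     quad = lambda z: z @ A @ z.T    # "xax": quadratic in x
|
|     # Linearity requires f(x + y) == f(x) + f(y).
|     print(torch.allclose(lin(x + y), lin(x) + lin(y)))     # True
|     print(torch.allclose(quad(x + y), quad(x) + quad(y)))  # False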
|
| Their experimental results are decent. That could mean a lot of
| things (most often that the authors made more errors in their
| competitors' implementations than in their own work), but it's
| probably worth
| looking into for a few hours despite my other complaints.
| GaggiX wrote:
| The models tested are extremely small (a few thousand parameters)
| and the performance is, of course, not great; I don't think we can
| extrapolate much from this. I don't understand why they chose such
| small models when you can train much larger ones for free on Colab
| or Kaggle if you really need to.
___________________________________________________________________
(page generated 2024-05-29 23:00 UTC)