[HN Gopher] New attention mechanisms that outperform standard mu...
       ___________________________________________________________________
        
       New attention mechanisms that outperform standard multi-head
       attention
        
       Author : snats
       Score  : 97 points
       Date   : 2024-05-29 19:33 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | 317070 wrote:
       | > we evaluate the presented attention mechanisms on MNIST,
       | CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets.
       | 
       | It sounds amazing, but I'm not holding my breath this one will
       | scale.
        
         | r2_pilot wrote:
          | You could provide the quote in full ("In addition to
          | providing rigorous mathematical comparisons,") so that the
          | author's work in proving their point is not hidden by your
          | effortless snark.
        
           | 317070 wrote:
           | I am not sure how much experience you have in this area of
           | research, but maybe I can shed some light on the background
            | here. The "Attention Is All You Need" paper is now almost 7
            | years old. Those 7 years have seen a flood of proposals for
            | improving transformers, but only very few have been retained.
           | 
            | There is very little theory behind transformer-style
            | architectures. Fundamentally, the proof is in the pudding,
            | not in "mathematical comparisons". A proposed change needs
            | to scale better; that is all that matters. And the datasets
            | mentioned are simply unsuitable for showing any scaling. I
            | think the biggest dataset in this list is 160 MB compressed.
           | 
            | I am not sure why this article was posted here on Hacker
            | News. I would estimate that even just today, there have
            | probably been about three papers posted on arXiv proposing
            | transformer architecture changes, tested on larger datasets
            | than the ones mentioned here.
        
             | 317070 wrote:
              | I checked, and on the 28th of May, arXiv saw 14
              | submissions with "transformer" in the title, and I found
              | 3 of them with proposals tested on larger datasets (I did
              | not check all of them, so there may have been more than
              | these three).
             | 
             | https://arxiv.org/pdf/2405.18240
             | https://arxiv.org/abs/2405.17951
             | https://arxiv.org/pdf/2405.17821
        
         | janalsncm wrote:
         | Sometimes it doesn't need to. You might have a problem that
         | isn't web scale and where transfer learning is hard. We also
         | need techniques for small datasets even if they are slower to
         | train or are outperformed after 5 billion tokens.
        
       | marcinzm wrote:
       | These seems very tiny models and as I understand it LLMs behave
       | fairly differently at different scales.
       | 
       | The speed performance gain seems to only be on an M2 chip and I
       | wonder if there's already much better non-GPU optimized attention
       | approaches out there for those use cases.
        
       | toxik wrote:
       | I feel like FlashAttention is the relevant baseline here.
        
         | lalaland1125 wrote:
          | FlashAttention is completely orthogonal to this. This work
          | is about speeding up the computation of the Q, K, and V
          | vectors, while FlashAttention is about speeding up the
          | attention algorithm itself.
         | 
         | You could combine the two.
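          | 
          | As a minimal sketch (single-head for brevity, assuming a
          | standard PyTorch-style attention block rather than the
          | paper's specific variant), the two pieces are separate steps:
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     batch, seq_len, d_model = 2, 128, 64
          |     x = torch.randn(batch, seq_len, d_model)
          | 
          |     # Step 1: compute Q, K, V -- the projection step this
          |     # paper proposes to change.
          |     w_qkv = torch.nn.Linear(d_model, 3 * d_model, bias=False)
          |     q, k, v = w_qkv(x).chunk(3, dim=-1)
          | 
          |     # Step 2: the attention computation itself -- the part
          |     # FlashAttention fuses (scaled_dot_product_attention can
          |     # dispatch to a FlashAttention kernel on supported GPUs).
          |     out = F.scaled_dot_product_attention(q, k, v)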
        
       | lalaland1125 wrote:
       | > The Transformer models, used in this experiment, all have a
       | single attention layer with model dimension and context length
       | 32.
       | 
        | I think we are going to need to see more experiments,
        | especially because the theoretical motivations here are weak.
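        | 
        | For a sense of scale, a rough back-of-the-envelope count
        | (assuming a standard single-head attention layer with no
        | biases, which is a guess at the setup):
        | 
        |     # Hypothetical count for one attention layer at d_model = 32:
        |     # four d_model x d_model projections: W_Q, W_K, W_V, W_O.
        |     d_model = 32
        |     attn_params = 4 * d_model * d_model
        |     print(attn_params)  # 4096 -- a few thousand parameters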
        
       | behnamoh wrote:
       | Pressing [X] to doubt.
       | 
       | There are many alternatives to the good old transformers: RWKV,
       | Mamba, etc.
       | 
        | Yet here we are, still using transformers (actually, just the
        | decoder part). Is it because the industry has too much inertia
        | to pick up new methods? I doubt it, because there are
        | $BILLIONS in this market and everyone wants a piece of the AI
        | cake, so it doesn't make sense to ignore promising methods.
        | 
        | Why, then, do we barely see any non-transformer
        | production-ready LLMs these days?
        
         | solidasparagus wrote:
          | It's going to take time. I can't speak to the actual quality
          | of Mamba other than to say the authors are extraordinary and
          | should be taken seriously.
          | 
          | But training a large model requires a huge amount of
          | capital, so the biggest runs are designed around risk
          | minimization. And remember, many of the people making
          | decisions about these runs got into their positions by doing
          | transformer-centric work. The true value of Mamba is still
          | unclear to me, given that very-long-context techniques are
          | effective for transformers.
        
         | Buttons840 wrote:
         | I believe the attention mechanism we use now was introduced in
         | 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in
         | their paper titled "Neural Machine Translation by Jointly
         | Learning to Align and Translate."
         | 
         | 2014. It took almost a decade for the potential of this
         | technique to be realized and come to the attention (heh) of
         | most developers. I don't know what researchers are doing with
         | Mamba and RWKV, but we should let them cook.
        
         | hansvm wrote:
         | > I doubt it because there's $BILLIONS in this market and
         | everyone wants a piece of the AI cake, so it doesn't make sense
         | to ignore promising methods.
         | 
          | I also doubt this result. The "why have $BILLIONS not
          | already been invested" question is interesting in its own
          | right, though. Generally, the literature on the theoretical
          | bounds of swarm optimization is pertinent. Those $BILLIONS
          | aren't being invested by a single omniscient entity, so
          | they're subject to interesting constraints.
         | 
         | As one of many explanations, fragmentation is common. If
         | $BILLIONS are split between greedy, mostly non-interacting
         | entities (e.g., competing companies each trying to replace the
         | transformer in a bounded number of hours and dollars while
         | securing their market dominance), you expect,
          | probabilistically, each of them to converge on the same
         | strateg(y/ies), especially if the "best" alternatives are
         | obvious or globally known for some reason (e.g., some solutions
         | intuitively feel "natural" or your researchers publish early
         | results or you have employee movement between companies or
         | whatever). Riskier strategies won't be touched, and you'll have
         | $BILLIONS spent duplicating the same most likely alternatives
         | when $MILLIONS would have sufficed.
         | 
         | The normal counterpoint is that a few big players dominate the
         | spending, and they would have higher internal coordination.
          | Interestingly though, they usually don't, except when that
          | coordination would tend to enforce the same strategies the
          | smaller competition is pursuing. How often do you hear
          | stories like the misaligned Google+ integrations resulting
          | in employee bonuses for poor customer experiences, versus a
          | forward-thinking executive actively devoting funds to a
          | meaningful number of competing solutions? Approximately
          | never. It's career suicide if you fail and you depend on
          | other people for your position; you _are_ actually more
          | likely to outdo the competition with your increased
          | resources if you just lean into the "best" alternatives;
          | and for a whole host of reasons very few executives (except
          | for people with real power) will coordinate a more
          | comprehensive strategy, certainly not one orthogonal to the
          | competition's just for the sake of allocating the global
          | $BILLIONS more efficiently.
         | 
         | Separately (going back to the merits of the preprint), I'll
         | probably read the full thing later, but a few points stuck out
         | as suspicious on an initial skim. Notably, they seem to mix
         | linear transformations in different domains. E.g., `xa` is
         | linear in both `x` and `a`, and `vx` is linear in both `v` and
         | `x`, but `xax` is _not_ linear in `x`, even if you try to
         | "prove" that idea with `v = xa`. Linearity in `v` isn't enough
         | to make the composition linear in `x`. A lot of their results
         | seem to rely on eliminating those "redundant" computations,
         | even though the things they're replacing with linear
         | computations are actually higher order polynomials. On an
         | initial skim, the other "novel" ideas also don't seem well
         | grounded.
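          | 
          | A quick numerical check of that point, with an arbitrary
          | `a` and `x` just for illustration: doubling `x` quadruples
          | `xax`, so it cannot be linear in `x`, while `xa` on its own
          | is.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     a = rng.standard_normal((3, 3))  # stand-in weight matrix
          |     x = rng.standard_normal((1, 3))  # stand-in input row
          | 
          |     def f(x):
          |         return x @ a @ x.T           # the `xax`-style term
          | 
          |     print((f(2 * x) / f(x)).item())  # ~4.0, not 2.0
          |     # `xa` by itself is linear in x:
          |     print(np.allclose((2 * x) @ a, 2 * (x @ a)))  # True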
         | 
          | Their experimental results are decent. That could mean a
          | lot of things (most often that the authors made more errors
          | implementing their competitors' methods than their own), but
          | it's probably worth looking into for a few hours despite my
          | other complaints.
        
       | GaggiX wrote:
        | The models tested are extremely small (a few thousand
        | parameters) and the performance is of course not great; I
        | don't think we can extrapolate much from this. I don't
        | understand why they chose such small models when you can
        | train much larger ones for free on Colab or Kaggle if you
        | really need to.
        
       ___________________________________________________________________
       (page generated 2024-05-29 23:00 UTC)