[HN Gopher] XLSTM: Extended Long Short-Term Memory
       ___________________________________________________________________
        
       XLSTM: Extended Long Short-Term Memory
        
       Author : mauricesvp
       Score  : 175 points
       Date   : 2024-05-08 05:28 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | KhoomeiK wrote:
       | For those who don't know, the senior author on this paper (Sepp
       | Hochreiter) was the first author on the original paper with
       | Schmidhuber introducing LSTMs in 1997.
        
         | ramraj07 wrote:
          | At least in biology, the first author of a paper is more often
          | than not just a pair of gifted hands who did the experiments
          | and plotted the graphs. That doesn't always mean they become
          | good PIs later (though they do get their chances from these
          | papers).
        
           | cdavid wrote:
            | In ML, authorship is generally ordered from largest
            | contribution to smallest, w/ the heads of the lab last.
        
           | querez wrote:
           | In this specific case, it's fairly well known that Hochreiter
           | was the major brain behind the original LSTM.
        
       | albertzeyer wrote:
        | It seems Sepp Hochreiter has been talking about this model
        | since Oct 2023:
       | https://github.com/huggingface/transformers/issues/27011
       | 
        | In the scaling law comparison, I wonder if it is reasonable to
        | compare the number of parameters between Llama, Mamba, RWKV and
        | xLSTM? Isn't compute time more relevant? E.g. in the figure
        | about scaling laws, replace the number of params by compute
        | time.
        | 
        | Specifically, the sLSTM still has recurrence (memory mixing) in
        | it, i.e. you cannot fully parallelize the computation. So a
        | scaled-up Transformer could still look better when you look at
        | compute time.
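        | 
        | To illustrate (a toy scalar recurrence for the sake of argument,
        | not the paper's actual sLSTM equations): the hidden state feeds
        | back into the next step, so the time loop has to run
        | sequentially.
        | 
        |     import numpy as np
        | 
        |     def toy_recurrent_scan(x, w_in=0.5, w_rec=0.9):
        |         # h[t] depends on h[t-1] ("memory mixing"), so the
        |         # time dimension must be processed step by step.
        |         h = 0.0
        |         out = []
        |         for x_t in x:
        |             h = np.tanh(w_in * x_t + w_rec * h)
        |             out.append(h)
        |         return np.array(out)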
       | 
       | It seems neither the code nor the model params are released. I
       | wonder if that will follow.
        
         | YetAnotherNick wrote:
          | Recurrence is less of an issue with really large model
          | training than it is with medium sized models. Medium sized
          | transformer models are generally not trained with sequence
          | parallelism, but sequence parallelism is getting more common
          | with transformer training. And sequence parallelism is the
          | same for a transformer or a recurrent model.
          | 
          | For really large models, it is in fact easier to achieve peak
          | flops because computation required scales faster than memory
          | bandwidth required (square vs cube).
        
           | albertzeyer wrote:
            | With sequence parallelism, do you mean increasing the batch
            | size, i.e. the number of sequences in a batch?
           | 
           | > Medium sized transformer models are generally not trained
           | with sequence parallelism, but sequence parallelism is
           | getting more common with transformer training
           | 
           | Is there some word missing? You mean it's more common for
           | large-sized Transformers?
           | 
           | > computation required scales faster than memory bandwidth
           | required (square vs cube)
           | 
           | That is an interesting thought. I'm trying to understand what
           | exactly you mean. You mean, computation time is in O(N^2)
           | where N is the sequence length, while required memory
           | bandwidth is in O(N^3)? Why is that?
        
             | YetAnotherNick wrote:
              | No, it means dividing the sequence into multiple chunks and
              | processing them one by one, very similar to recurrence. See
              | [1]. Sequence parallelism is needed when the sequence can't
              | fit in a single GPU. Sequence parallelism is the hardest
              | parallelism, but it is required for longer sequences. Many
              | models just train with a smaller sequence length for the
              | majority of training and switch to sequence parallelism for
              | the last few percent of training.
             | 
             | [1]: https://arxiv.org/pdf/2105.13120
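              | 
              | Roughly this pattern (a toy sketch; step_fn and the carried
              | state are placeholders, not the actual sequence-parallel
              | implementation from [1]):
              | 
              |     import numpy as np
              | 
              |     def process_in_chunks(x, chunk_len, step_fn, state):
              |         # Split the sequence into chunks and process them
              |         # one by one, carrying state across chunk
              |         # boundaries -- very similar to recurrence.
              |         outputs = []
              |         for start in range(0, len(x), chunk_len):
              |             chunk = x[start:start + chunk_len]
              |             out, state = step_fn(chunk, state)
              |             outputs.append(out)
              |         return np.concatenate(outputs), state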
        
               | logicchains wrote:
                | >Sequence parallelism is the hardest parallelism, but it
                | is required for longer sequences
               | 
                | In terms of difficulty of implementation it's arguably
                | much easier than pipeline parallelism, which I'd argue is
                | the hardest kind (at least to implement efficiently
                | without bubbles) and the one that takes the most lines of
                | code to implement. In Jax, by contrast, sequence
                | parallelism is almost trivial.
        
         | korbip wrote:
          | Disclaimer: I'm a shared first author of this paper.
         | 
          | As a clarification: The speed for training will be on par with
          | FlashAttention-2 when fully optimized and only including the
          | mLSTM. For decoding/inference, both are very close to Mamba,
          | as xLSTM is a recurrent architecture. The sLSTM has memory
          | mixing, that is, state tracking capabilities, for problems
          | that Transformers and State Space Models (and any other
          | sequence-parallelizable architecture) fundamentally cannot
          | solve.
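          | 
          | In recurrent form, the mLSTM update is roughly the following
          | (a simplified single-head sketch; the gates are taken as
          | given, and the stabilizer state for the exponential gating is
          | left out):
          | 
          |     import numpy as np
          | 
          |     def mlstm_step(C, n, q, k, v, i_gate, f_gate, o_gate):
          |         # Matrix memory: accumulate key-value outer products.
          |         C = f_gate * C + i_gate * np.outer(v, k)
          |         # Normalizer state used for the retrieval below.
          |         n = f_gate * n + i_gate * k
          |         # Query the memory, normalized to keep outputs bounded.
          |         h_tilde = (C @ q) / max(abs(n @ q), 1.0)
          |         return C, n, o_gate * h_tilde
          | 
          | There is no mixing between hidden cells here, which is why the
          | mLSTM also admits a parallel formulation, unlike the sLSTM.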
        
           | albertzeyer wrote:
           | Congratulations on the paper. That's some very interesting
           | work!
           | 
           | But you would want to include sLSTM as well to get the best
            | performance, right? How does the speed compare in that case?
           | Specifically when scaling up.
        
             | korbip wrote:
              | Thank you! I can say that it is not really a limiting
              | factor at the scales reported in the paper. So, xLSTM[7:1]
              | (seven mLSTM blocks per sLSTM block) is pretty much on par
              | with xLSTM[1:0] (mLSTM only) in speed. We show that the
              | sLSTM is helpful on toy tasks, and it gives even better
              | sequence extrapolation performance, so yes.
        
           | WithinReason wrote:
           | Can you expand on the "cannot solve fundamentally" part?
        
             | lucidrains wrote:
             | https://arxiv.org/abs/2404.08819
        
           | logicchains wrote:
           | To clarify, is the sLSTM strictly necessary (to achieve
           | better accuracy than those other architectures), or is the
            | mLSTM good enough? The xLSTM[1:0] model in the paper seemed
            | to do quite well.
        
             | korbip wrote:
              | For language in general, the mLSTM alone seems fine. But
              | there might indeed be specific tasks where the sLSTM is
              | necessary.
        
           | deepnet wrote:
           | Fascinating work, very promising.
           | 
            | Can you summarise how the model in your paper differs from
            | this implementation of xLSTM?
           | 
           | https://github.com/huggingface/transformers/issues/27011
        
           | brookst wrote:
           | Congrats on the paper, very interesting.
           | 
           | Can you opine on how the model will fare on hardware that is
           | optimized for transformers? There is so much investment in
           | accelerating the transformer arch[1][2], will xLSTM / sLSTM
           | benefit as well, or will the hardware optimizations give
           | transformers enough of an advantage that it's hard to compete
           | on general purpose hardware?
           | 
           | 1. https://www.etched.com/
           | 
           | 2. https://www.embedded.com/ai-chip-features-hardware-
           | support-f...
        
           | SpaceManNabs wrote:
           | > For decoding/inference both are very close to Mamba as
           | xLSTM is a recurrent architecture
           | 
            | Can you explain this statement more if you have time? Are you
            | saying the recurrent architecture of xLSTM enables fast
            | inference on par with Mamba? Or does the xLSTM architecture
            | slow it down so that its inference is as slow as Mamba's?
        
           | goldemerald wrote:
           | Great work! I'd love to start using the language model
           | variant of your work. Do you know when/if it will be open
            | sourced? I'd start using it today if it were available.
        
         | zozbot234 wrote:
          | > Specifically, the sLSTM still has recurrence (memory mixing)
          | in it, i.e. you cannot fully parallelize the computation.
         | 
         | If you mean that you cannot fully parallelize inference, this
         | might be true but also not quite relevant since the
         | computational demands of inference are low. And you can always
         | "parallelize" training to some extent, just by training larger
         | batches.
        
           | korbip wrote:
            | This was formulated a bit unclearly. For training, it is not
            | possible to parallelize over the sequence dimension the way
            | it is for Transformers. Over the batch dimension you can
            | always do it.
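            | 
            | As a toy illustration (not our actual training code, just
            | the pattern): the time loop below is sequential, but every
            | step is vectorized over the whole batch.
            | 
            |     import numpy as np
            | 
            |     def batched_recurrent_scan(x, w_in=0.5, w_rec=0.9):
            |         # x has shape (T, B): T time steps, batch size B.
            |         # The loop over T is inherently sequential, but
            |         # each step processes all B sequences at once.
            |         h = np.zeros(x.shape[1])
            |         out = []
            |         for x_t in x:
            |             h = np.tanh(w_in * x_t + w_rec * h)
            |             out.append(h)
            |         return np.stack(out)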
        
       | sigmoid10 wrote:
        | Another week, another paper that thinks it can revive recurrent
        | networks. Although this time the father of LSTM is a co-author,
        | so this paper should not come as a surprise. Sadly, the results
        | seem to indicate that even by employing literally all tricks of
        | the trade, their architecture can't beat the throughput of
        | flash-attention (not by a long shot, though that is not
        | surprising for recurrent designs) and, on top of that, it is
        | even slower than Mamba, which offers similar accuracy at lower
        | cost. So my money is on this being another DOA architecture,
        | like all the others we've seen this year already.
        
         | l33tman wrote:
          | To put another perspective on this, lots of modern advances in
          | both ML/AI and especially computer graphics have come from
          | ideas from the 70s-80s that were published, forgotten, and
          | revived, because the underlying dependencies change, like the
          | hardware profile of the day. So just let the ideas flow; not
          | every paper has to have an immediate payoff.
        
           | KeplerBoy wrote:
           | To be fair, Hochreiter seems pretty confident that this will
           | be a success.
           | 
            | He stated in interviews "Wir werden das blöde GPT einfach
            | wegkicken" (roughly: We will simply kick silly GPT off the
            | pitch), and he just founded a company to secure funding.
            | Interesting times.
           | 
           | Someone gathered most of the available information here:
           | https://github.com/AI-Guru/xlstm-resources
        
             | imjonse wrote:
              | With all due respect for his academic accomplishments,
              | confidence in this domain in the current climate is usually
              | a signal aimed at potential investors; it can be backed by
              | anything from solid work (as I hope this turns out to be)
              | to a flashy slide deck combined with a questionable
              | character.
        
               | KeplerBoy wrote:
               | Which is a legitimate stance.
               | 
                | Being a researcher at a public university in a country
                | that doesn't exactly splurge on this kind of research, he
                | has to get creative to get any meaningful amount of
                | funding.
        
               | l33tman wrote:
                | To say the least. It's a bit unfortunate that there is
                | about zero culture in the EU regarding moonshot projects
                | compared to Silicon Valley. I've tried a couple of times
                | to get money from government grants for (yet another..)
                | foundational AI model, neuroscience inspired, but the
                | grants instead seem to go almost exclusively to well
                | developed industrial companies that now want some free
                | money to "leverage" ChatGPT in their existing internal
                | processes.. and with the project still in the research
                | phase, the more risk-averse VCs here are not touching
                | stuff like this either.
                | 
                | So I guess what's left is making these grand
                | proclamations that you are going to "knock the crown off
                | OpenAI" etc. Though some sort of vision is good to have
                | for sure :)
        
         | karalala wrote:
         | Already seeing major flaws in the paper.
         | 
          | The benchmarking done in Table 1 is extremely questionable.
          | Their table basically contradicts the results from multiple
          | peer-reviewed papers, especially for RNNs, which report
          | results much closer to baseline transformers (and conducted
          | much larger experiments btw).
          | 
          | On page 40 they mention that all models are trained with the
          | same lr for comparability.
          | 
          | > This contradicts their own scaling laws table, which uses a
          | different lr for different models.
          | 
          | > And no, it is not a fair comparison to use the same lr to
          | test all these different models. The benchmarking results just
          | look like they are using hyperparameters tuned for their own
          | model which happen to not work for other models.
        
           | bingbingbing777 wrote:
           | You should publish a response paper and get them to retract
           | their paper if it has major flaws.
        
             | karalala wrote:
              | It's xLSTM contradicting existing peer-reviewed papers
              | lmao. Either xLSTM should fix their benchmarks or the
              | existing peer-reviewed papers should retract.
              | 
              | RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round
              | obviously. HGRN 8 ppl (perplexity points) worse than
              | baseline transformers? A NeurIPS 2023 spotlight paper, btw.
        
               | logicchains wrote:
               | I thought it was common knowledge that architecture
               | comparisons in papers aren't worth the paper they're
               | printed on; there are so many ways to deliberately or
               | accidentally structure things to favour one architecture
                | over the others. Ultimately the lmsys chatbot arena will
                | be the final judge.
        
               | karalala wrote:
                | True, but they normally aren't this far off. HGRN claims
                | to outperform the transformer for a 1B parameter model
                | trained on the Pile. HGRN performing 8 ppl worse suggests
                | that it's useless.
        
           | rrr_oh_man wrote:
           | Could you explain for a dum-dum?
        
             | karalala wrote:
              | The results of xLSTM are promising but will need larger
              | scale experiments.
              | 
              | However, they completely messed up the benchmarking
              | experiments for various RNN models, whose own papers claim
              | comparable or even better performance than a base
              | transformer.
        
               | AIsore wrote:
               | These experiments seem pretty large already though, no?
               | How are you so sure they messed up benchmarking? Is the
               | code out already?
        
       | WithinReason wrote:
        | I like the color-coded equations; I wish they would become a
        | thing. We have syntax highlighting for programming languages,
        | it's time we had it for math too.
        
         | imjonse wrote:
          | Math notation has different fonts, with goals similar to those
          | of syntax highlighting. It also works well in black and white
          | :)
        
       | elygre wrote:
       | I have no idea about what this is, so going off topic:
       | 
        | The name XLSTM reminds me of the time in the late eighties when
        | my university professor got accepted to give a presentation on
        | WOM: write-only memory.
        
         | woadwarrior01 wrote:
         | I think it's a fine name. The prefix ensures that people don't
         | confuse it with vanilla LSTMs. Also, I'm fairly certain that
         | they must've considered LSTM++ and LSTM-XL.
        
         | pquki4 wrote:
          | I mean, if you look at it another way, XSLT is a real thing
          | that gets used a lot, so I don't mind appending an M there.
        
       | beAbU wrote:
       | I thought this was some extension or enhancement to XSLT.
        
         | cylemons wrote:
         | Same
        
       | GistNoesis wrote:
        | Can someone explain the economics behind this?
        | 
        | The claim is something that will replace the transformer, a
        | technology powering a good chunk of AI companies.
        | 
        | The paper's authors seem to be either from a public university
        | or from Sepp Hochreiter's private company/lab nx-ai.com:
        | https://www.nx-ai.com/en/xlstm
        | 
        | Where is the code? What is the license? How are they earning
        | money? Why publish their secret recipe? Will they not be
        | replicated? How will the rewards be commensurate with the value
        | their algorithm brings? Who will get money from this new
        | technology?
        
         | imjonse wrote:
         | Should all arxiv papers be backed by economic considerations or
         | business plans?
        
           | jampekka wrote:
           | Or any?
        
           | AIsore wrote:
            | Nope, they should not. It is academia after all. How would
            | you even do that in, say, pure mathematics? Concretely, I
            | would love to know what the business plan/economic
            | considerations of Gowers' 1998 proof of Szemerédi's theorem
            | using higher-order Fourier analysis would even look like.
        
             | queuebert wrote:
             | Coming soon to an HFT firm near you ...
        
         | refulgentis wrote:
         | Are you asking how academics make money from giving away
         | knowledge in papers? It's complicated
        
           | GistNoesis wrote:
            | I don't understand at all what the monetary value of this
            | algorithm should be.
            | 
            | The authors are positioning themselves as a company and not
            | merely as academics:
            | 
            | A video of Sepp Hochreiter from 6 months ago hyping xLSTM:
            | 
            | https://youtu.be/hwIt7ezy6t8?feature=shared&t=561 in which he
            | states his intent to raise EUR 300M to make a European
            | alternative to OpenAI's GPT for niche domains, thanks to
            | this new method that will allow training cheaper and better.
            | 
            | He recently received (2023) EUR 35,000 in prize money at the
            | 5th annual German AI Award.
            | 
            | https://www.jku.at/en/festival-
            | university/media/detail/news/...
            | 
            | Or is it just an academic tactic to get more funding? To
            | extract more work from PhD students by making them think
            | they are going to strike it big?
            | 
            | How are they intending to build a moat if they publish
            | their papers? Will this technology be encumbered by
            | patents/licenses?
        
         | brookst wrote:
         | I think you're making a satirical point about how commercial
         | R&D has far outstripped academia, but it's not 100% clear.
        
       | smusamashah wrote:
        | Can someone ELI5 this? Reading the comments, it sounds like it's
        | going to replace the transformers that LLMs are based on? Is it
        | something exponentially better than the current tech at scale?
        
         | probably_wrong wrote:
         | LSTMs are a recurrent architecture for neural networks, meaning
         | that your output depends both on your current input and your
         | previous output. This is similar to how language works, as the
         | next word in your sentence must fit both the idea you're trying
         | to convey (your input) and the words you've said up until now
         | (your previous output).
         | 
          | LSTMs were very popular for a while (I think the first good
          | version of Google Translate used them) but they had two
          | critical downsides: their performance went down with longer
          | outputs, and they were a bit annoying to parallelize because
          | computing the output for the 10th word required first computing
          | the output of the previous 9 words - no way to use 10 parallel
          | computers. The first problem was solved with Attention, a
          | scaffolding method that prevented degradation over longer
          | sequences. Eventually someone realized that Attention was doing
          | most of the heavy lifting, built an attention-only network that
          | could be easily parallelized (the Transformer), and LSTMs lost
          | the top spot.
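          | 
          | To make the contrast concrete, here is a toy single-head
          | causal attention (illustrative only, ignoring multi-head,
          | batching etc.): every position is computed in one shot, with a
          | mask standing in for the step-by-step dependency.
          | 
          |     import numpy as np
          | 
          |     def toy_causal_attention(Q, K, V):
          |         # All T positions are processed at once; the causal
          |         # mask hides future tokens instead of a time loop.
          |         T, d = Q.shape
          |         scores = Q @ K.T / np.sqrt(d)
          |         scores[np.triu_indices(T, k=1)] = -np.inf
          |         w = np.exp(scores - scores.max(axis=1, keepdims=True))
          |         w /= w.sum(axis=1, keepdims=True)
          |         return w @ V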
         | 
         | Are xLSTMs better? On paper I'd say they could be - they seem
         | to have a solid theory and good results. Will they dethrone
         | Transformers? My guess is no, as it wouldn't be the first time
         | that the "better" technology ends up losing against whatever is
         | popular. Having said that, it is entirely possible that some
         | inherently recurrent tasks like stock price prediction could
         | get a boost from this technology and they may find their place.
        
       | jasonjmcghee wrote:
        | They reference "a GPT-3 model with 356M parameters".
        | 
        | So GPT-3 Medium (from the GPT-3 paper) - it feels pretty
        | disingenuous to list it that way, as no one means that model
        | when they say "GPT-3"; they mean the 175B model.
        | 
        | I wasn't aware a model of that size (356M) was ever released -
        | what am I missing here?
        | 
        | I also think it's relatively well understood that (with our
        | current methods) transformers have a tipping point with
        | parameter count, and I don't know of any models smaller than
        | ~3B that are useful - arguably 7B.
        | 
        | Compare these benchmarks to, say, the RWKV 5/6 paper:
        | https://arxiv.org/abs/2404.05892
        
         | CuriouslyC wrote:
          | Phi-3 mini is surprisingly capable given its size. You can
          | teach small transformers to do stuff well; you just can't
          | have good general-purpose small models.
        
           | jasonjmcghee wrote:
            | Totally. But they aren't fine-tuning these afaict - they're
            | comparing general-purpose capabilities.
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:01 UTC)