[HN Gopher] XLSTM: Extended Long Short-Term Memory
___________________________________________________________________
XLSTM: Extended Long Short-Term Memory
Author : mauricesvp
Score : 175 points
Date : 2024-05-08 05:28 UTC (17 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| KhoomeiK wrote:
| For those who don't know, the senior author on this paper (Sepp
| Hochreiter) was the first author on the original paper with
| Schmidhuber introducing LSTMs in 1997.
| ramraj07 wrote:
| At least in biology, the first author of a paper is more often
| than not just a pair of gifted hands who did the experiments
| and plotted the graphs. That doesn't always translate into them
| becoming good PIs later (though they do get their chances from
| these papers).
| cdavid wrote:
| In ML, authorship is generally ordered from largest to
| smallest contribution, w/ the heads of the lab last.
| querez wrote:
| In this specific case, it's fairly well known that Hochreiter
| was the major brain behind the original LSTM.
| albertzeyer wrote:
| It seems Sepp Hochreiter has been talking about this model
| since Oct 2023:
| https://github.com/huggingface/transformers/issues/27011
|
| In the scaling law comparison, I wonder if it is reasonable to
| compare the number of parameters between Llama, Mamba, RWKV and
| xLSTM? Isn't compute time more relevant? E.g. in the figure
| about scaling laws, replace the number of params with compute
| time.
|
| Specifically, the sLSTM still has recurrence (memory mixing) in
| it, i.e. you cannot fully parallelize the computation. So a
| scaled-up Transformer could still look better when you look at
| compute time.
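|
| A crude way to see why equal parameter counts don't imply equal
| compute (my own back-of-the-envelope sketch, using the common
| ~2*N FLOPs-per-token rule of thumb plus a context-dependent
| attention term; the constants are rough assumptions, not numbers
| from the paper):
|
|     def transformer_flops_per_token(n_params, n_layers, d_model, ctx):
|         # ~2 FLOPs per weight, plus attention over the context
|         return 2 * n_params + 4 * n_layers * ctx * d_model
|
|     def recurrent_flops_per_token(n_params):
|         # roughly just the matmuls, independent of context length
|         return 2 * n_params
|
|     # same param count, different compute once the context gets long
|     print(transformer_flops_per_token(1.3e9, 24, 2048, 16384))  # ~5.8e9
|     print(recurrent_flops_per_token(1.3e9))                     # ~2.6e9
|
| And this still says nothing about wall-clock time, where the
| serialization from recurrence matters on top of the raw FLOPs.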
|
| It seems neither the code nor the model params have been
| released. I wonder if they will follow.
| YetAnotherNick wrote:
| Recurrence is less of an issue with really large model training
| than it is with medium sized models. Medium sized transformer
| models are generally not trained with sequence parallelism, but
| sequence parallelism is getting more common with transformer
| training. And sequence parallelism is the same for a transformer
| or a recurrent model.
|
| For really large models, it is in fact easier to achieve peak
| FLOPs because computation required scales faster than memory
| bandwidth required (square vs cube).
| albertzeyer wrote:
| With sequence parallelism, do you mean increasing the batch
| size, i.e. the number of sequences in a batch?
|
| > Medium sized transformer models are generally not trained
| with sequence parallelism, but sequence parallelism is
| getting more common with transformer training
|
| Is there some word missing? You mean it's more common for
| large-sized Transformers?
|
| > computation required scales faster than memory bandwidth
| required (square vs cube)
|
| That is an interesting thought. I'm trying to understand what
| exactly you mean. Do you mean that computation time is in
| O(N^2), where N is the sequence length, while the required
| memory bandwidth is in O(N^3)? Why is that?
| YetAnotherNick wrote:
| No, it means dividing the sequence into multiple chunks and
| processing them one by one, very similar to recurrence. See
| [1]. Sequence parallelism is needed when the sequence can't
| fit on a single GPU. Sequence parallelism is the hardest
| parallelism, but it is required for longer sequences. Many
| models just train at a smaller sequence length for the
| majority of training and switch to sequence parallelism for
| the last few percent of training.
|
| [1]: https://arxiv.org/pdf/2105.13120
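|
| (Loosely, the chunking idea looks something like the sketch
| below: split the long sequence into blocks and carry cached
| context forward from block to block. This is just my own
| illustration with a hypothetical model(chunk, cache) API, not
| the actual scheme from [1].)
|
|     import torch
|
|     def blockwise_forward(model, tokens, chunk_len):
|         # Process a long (batch, seq) tensor one chunk at a time,
|         # carrying the model's cached state forward between chunks.
|         cache, outputs = None, []
|         for start in range(0, tokens.size(1), chunk_len):
|             chunk = tokens[:, start:start + chunk_len]
|             out, cache = model(chunk, cache)  # hypothetical signature
|             outputs.append(out)
|         return torch.cat(outputs, dim=1)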
| logicchains wrote:
| > Sequence parallelism is the hardest parallelism, but it
| is required for longer sequences
|
| In terms of difficulty of implementation it's arguably
| much easier than pipeline parallelism, which I'd argue is
| the hardest kind (at least to implement efficiently
| without bubbles) and takes the most lines of code to
| implement (especially in Jax, where sequence parallelism
| is almost trivial).
| korbip wrote:
| Disclaimer: I'm a shared first author of this paper.
|
| As a clarification: the training speed will be on par with
| FlashAttention-2 when fully optimized and only including the
| mLSTM. For decoding/inference both are very close to Mamba as
| xLSTM is a recurrent architecture. The sLSTM has memory mixing,
| that is, state-tracking capabilities, for problems that
| Transformers and State Space Models (and any other
| sequence-parallelizable architecture) fundamentally cannot
| solve.
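|
| (A toy illustration of "state tracking" in that sense, not taken
| from the paper: composing a stream of permutations. A recurrent
| model only has to carry the current permutation as its state and
| update it once per step, which is exactly the kind of sequential
| computation that sequence-parallel architectures are argued to
| struggle with; see the paper linked further down in the thread.)
|
|     import itertools, random
|
|     perms = list(itertools.permutations(range(5)))
|
|     def compose(p, q):
|         # apply p first, then q (p[i] is the image of i under p)
|         return tuple(q[p[i]] for i in range(5))
|
|     state = tuple(range(5))        # identity permutation as initial state
|     for p in (random.choice(perms) for _ in range(1000)):
|         state = compose(state, p)  # one sequential state update per token
|     print(state)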
| albertzeyer wrote:
| Congratulations on the paper. That's some very interesting
| work!
|
| But you would want to include the sLSTM as well to get the best
| performance, right? How does the speed compare in that case?
| Specifically when scaling up.
| korbip wrote:
| Thank you! I can say that it is not really a diminishing
| factor at the scales reported in the paper. So, xLSTM[7:1]
| is pretty much on par with xLSTM[1:0] in speed. We show
| that it is helpful on toy tasks, and it shows even better
| sequence extrapolation performance, so yes.
| WithinReason wrote:
| Can you expand on the "cannot solve fundamentally" part?
| lucidrains wrote:
| https://arxiv.org/abs/2404.08819
| logicchains wrote:
| To clarify, is the sLSTM strictly necessary (to achieve
| better accuracy than those other architectures), or is the
| mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do
| quite well.
| korbip wrote:
| For language in general it seems fine. But there might be
| specific tasks where it is necessary indeed.
| deepnet wrote:
| Fascinating work, very promising.
|
| Can you summarise how the model in your paper differs from
| this implementation of xLSTM?
|
| https://github.com/huggingface/transformers/issues/27011
| brookst wrote:
| Congrats on the paper, very interesting.
|
| Can you opine on how the model will fare on hardware that is
| optimized for transformers? There is so much investment in
| accelerating the transformer arch [1][2]; will xLSTM / sLSTM
| benefit as well, or will the hardware optimizations give
| transformers enough of an advantage that it's hard to compete
| on general-purpose hardware?
|
| 1. https://www.etched.com/
|
| 2. https://www.embedded.com/ai-chip-features-hardware-
| support-f...
| SpaceManNabs wrote:
| > For decoding/inference both are very close to Mamba as
| xLSTM is a recurrent architecture
|
| Can you explain this statement more if you have time? Are you
| saying the recurrent architecture of xLSTM enables fast
| inference on par with Mamba? Or does the xLSTM architecture
| slow it down so that its inference is as slow as Mamba?
| goldemerald wrote:
| Great work! I'd love to start using the language model
| variant of your work. Do you know when/if it will be open
| sourced? I'd start using it today if it were available.
| zozbot234 wrote:
| > Specifically, the sLSTM still has recurrence (memory mixing)
| in it, i.e. you cannot fully parallelize the computation.
|
| If you mean that you cannot fully parallelize inference, this
| might be true but also not quite relevant since the
| computational demands of inference are low. And you can always
| "parallelize" training to some extent, just by training larger
| batches.
| korbip wrote:
| This was formulated a bit unclearly. For training, it is not
| possible to parallelize over the sequence dimension the way it
| is for Transformers. In the batch dimension you can always do
| it.
| sigmoid10 wrote:
| Another week, another paper that thinks it can revive recurrent
| networks. Though this time the father of the LSTM is a
| co-author, so the paper should not come as a surprise. Sadly,
| the results seem to indicate that even when employing literally
| all the tricks of the trade, their architecture can't beat the
| throughput of flash-attention (not by a long shot, but that is
| not surprising for recurrent designs) and, on top of that, it is
| even slower than Mamba, which offers similar accuracy at lower
| cost. So my money is on this being another DOA architecture,
| like all the others we've seen this year already.
| l33tman wrote:
| To put another perspective on this, lots of modern advancements
| in both ML/AI and especially computer graphics have come from
| ideas from the 70s-80s that were published, forgotten, and
| revived, because the underlying dependencies change, like the
| profile of the HW of the day. So just let the ideas flow; not
| every paper has to have an immediate payoff.
| KeplerBoy wrote:
| To be fair, Hochreiter seems pretty confident that this will
| be a success.
|
| He stated in interviews "Wir werden das blöde GPT einfach
| wegkicken" (roughly: We will simply kick silly GPT off the
| pitch), and he just founded a company to secure funding.
| Interesting times.
|
| Someone gathered most of the available information here:
| https://github.com/AI-Guru/xlstm-resources
| imjonse wrote:
| With all due respect for his academic accomplishments,
| confidence in this domain in the current climate is usually
| a signal towards potential investors; it can be backed by
| anything between solid work (as I hope this turns out to
| be) and a flashy slide deck combined with a questionable
| character.
| KeplerBoy wrote:
| Which is a legitimate stance.
|
| Being a researcher at a public university in a country
| that doesn't exactly splurge on this kind of research, he
| has to get creative to get any meaningful amount of
| funding.
| l33tman wrote:
| To say the least. It's a bit unfortunate that there is
| about zero culture in the EU regarding moonshot projects
| compared to Silicon Valley. I've tried a couple of times
| to get money from government grants for (yet another..)
| neuroscience-inspired foundational AI model, but the
| grants instead seem to go almost exclusively to
| well-developed industrial companies that now want some
| free money to "leverage" ChatGPT in their existing
| internal processes.. and since this is still in the
| research phase, the more risk-averse VCs here are not
| touching stuff like this either.
|
| So I guess what's left is making these grand proclamations
| that you are going to "knock the crown off OpenAI" etc.
| Though some sort of vision is good to have, for sure :)
| karalala wrote:
| Already seeing major flaws in the paper.
|
| The benchmarking done in Table 1 is extremely questionable.
| Their table basically contradicts the results from multiple
| peer-reviewed papers, especially for RNNs, which report results
| much closer to baseline transformers (and were based on much
| larger experiments, btw).
|
| On page 40 they mention that all models are trained with the
| same lr for comparability.
|
| > This contradicts their own scaling laws table, which uses
| different lrs for different models.
|
| > And no, it is not a fair comparison to use the same lr to
| test all these different models. The benchmarking results just
| look like they are using hyperparameters tuned for their model
| which happen not to work for the other models.
| bingbingbing777 wrote:
| You should publish a response paper and get them to retract
| their paper if it has major flaws.
| karalala wrote:
| It's xLSTM contradicting existing peer-reviewed papers lmao.
| Either xLSTM should fix their benchmarks or the existing
| peer-reviewed papers should retract.
|
| RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round,
| obviously. HGRN 8 ppl worse than baseline transformers? A
| NeurIPS 2023 spotlight paper, btw.
| logicchains wrote:
| I thought it was common knowledge that architecture
| comparisons in papers aren't worth the paper they're
| printed on; there are so many ways to deliberately or
| accidentally structure things to favour one architecture
| over the others. Ultimately the lmsys chatbot arena will
| be the final judge.
| karalala wrote:
| True, but they normally aren't this far off. HGRN claims
| to outperform the transformer for a 1B parameter model
| trained on the Pile. HGRN performing 8 ppl worse suggests
| that it's useless.
| rrr_oh_man wrote:
| Could you explain for a dum-dum?
| karalala wrote:
| The results of xLSTM are promising but will need larger
| scale experiments.
|
| However, they completely messed up the benchmarking
| experiments for various RNN models, which in their own
| papers claim comparable and even better performance than
| the base transformer.
| AIsore wrote:
| These experiments seem pretty large already though, no?
| How are you so sure they messed up benchmarking? Is the
| code out already?
| WithinReason wrote:
| I like the color-coded equations; I wish they would become a
| thing. We have syntax highlighting for programming languages;
| it's time we had it for math too.
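|
| (You can already fake it in LaTeX with xcolor; a minimal sketch,
| coloring the gates in an LSTM-style cell update:)
|
|     \documentclass{article}
|     \usepackage{xcolor}
|     \begin{document}
|     \[
|       c_t = \textcolor{red}{f_t} \odot c_{t-1}
|             + \textcolor{blue}{i_t} \odot \tilde{c}_t
|     \]
|     \end{document}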
| imjonse wrote:
| Math notation has different fonts, with goals similar to those
| of syntax highlighting. It also works well in black and white :)
| elygre wrote:
| I have no idea what this is, so going off topic:
|
| The name XLSTM reminds me of the time in the late eighties when
| my university professor got accepted to give a presentation on
| WOM: write-only memory.
| woadwarrior01 wrote:
| I think it's a fine name. The prefix ensures that people don't
| confuse it with vanilla LSTMs. Also, I'm fairly certain that
| they must've considered LSTM++ and LSTM-XL.
| pquki4 wrote:
| I mean, if you look at it another way, XSLT is a real thing
| that gets used a lot, so I don't mind appending an M there.
| beAbU wrote:
| I thought this was some extension or enhancement to XSLT.
| cylemons wrote:
| Same
| GistNoesis wrote:
| Can someone explain the economics behind this?
|
| The claim is something that will replace the transformer, a
| technology powering a good chunk of AI companies.
|
| The paper's authors seem to be either from a public university
| or from Sepp Hochreiter's private company or lab, nx-ai.com
| https://www.nx-ai.com/en/xlstm
|
| Where is the code? What is the license? How are they earning
| money? Why publish their secret recipe? Will they not be
| replicated? How will the rewards be commensurate with the value
| their algorithm brings? Who will get money from this new
| technology?
| imjonse wrote:
| Should all arxiv papers be backed by economic considerations or
| business plans?
| jampekka wrote:
| Or any?
| AIsore wrote:
| Nope, they should not. It is academia after all. How would
| you even do that in, say, pure mathematics? Concretely, I
| would love to know what the business plan/economic
| consideration of Gowers's 1998 proof of Szemerédi's theorem
| using higher-order Fourier analysis would even look like.
| queuebert wrote:
| Coming soon to an HFT firm near you ...
| refulgentis wrote:
| Are you asking how academics make money from giving away
| knowledge in papers? It's complicated
| GistNoesis wrote:
| I don't understand at all what the monetary value of this
| algorithm should be.
|
| The authors are positioning themselves as a company and not
| merely academics:
|
| A video of Sepp Hochreiter from 6 months ago hyping xLSTM:
|
| https://youtu.be/hwIt7ezy6t8?feature=shared&t=561 in which he
| states his intent to raise EUR 300M to build a European
| alternative to OpenAI's GPT for niche domains, thanks to this
| new method that will allow cheaper and better training.
|
| In 2023 he received EUR 35,000 in prize money at the 5th
| annual German AI Award.
|
| https://www.jku.at/en/festival-
| university/media/detail/news/...
|
| Or is it just an academic tactic to get more funding? To
| extract more work from PhD students by making them think they
| are going to strike it big?
|
| How do they intend to build a moat if they publish their
| papers? Will this technology be encumbered by
| patents/licenses?
| brookst wrote:
| I think you're making a satirical point about how commercial
| R&D has far outstripped academia, but it's not 100% clear.
| smusamashah wrote:
| Can someone ELI5 this? Reading the comments, it sounds like it's
| going to replace transformers, which LLMs are based on? Is it
| something exponentially better than the current tech at scale?
| probably_wrong wrote:
| LSTMs are a recurrent architecture for neural networks, meaning
| that your output depends both on your current input and your
| previous output. This is similar to how language works, as the
| next word in your sentence must fit both the idea you're trying
| to convey (your input) and the words you've said up until now
| (your previous output).
|
| LSTMs were very popular for a while (I think the first good
| version of Google Translate used them) but they had two
| critical downsides: their performance went down with longer
| outputs, and they were a bit annoying to parallelize because
| computing the output for the 10th word required first computing
| the output of the previous 9 words - no way to use 10 parallel
| computers. The first problem was solved with Attention, a
| scaffolding method that prevented degradation over longer
| sequences. Eventually someone realized that Attention was doing
| most of the heavy lifting, built an attention-only network that
| could be easily parallelized (the Transformer), and LSTMs lost
| the top spot.
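|
| (A minimal sketch of that recurrence, not an actual LSTM cell:
| each step needs the previous step's state, so step 10 can't
| start before steps 1-9 have finished.)
|
|     def run_recurrent(xs, step, h0):
|         h, outputs = h0, []
|         for x in xs:            # inherently sequential over the sequence
|             h = step(x, h)      # new state from current input + previous state
|             outputs.append(h)
|         return outputs
|
|     # toy "cell": each output is a blend of the current input and the
|     # previous output
|     print(run_recurrent([1.0, 2.0, 3.0], lambda x, h: 0.5 * (x + h), 0.0))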
|
| Are xLSTMs better? On paper I'd say they could be - they seem
| to have a solid theory and good results. Will they dethrone
| Transformers? My guess is no, as it wouldn't be the first time
| that the "better" technology ends up losing against whatever is
| popular. Having said that, it is entirely possible that some
| inherently recurrent tasks like stock price prediction could
| get a boost from this technology and they may find their place.
| jasonjmcghee wrote:
| They reference "a GPT-3 model with 356M parameters"
|
| So GPT-3 Medium (from the GPT-3 paper) - it feels pretty
| disingenuous to list that, as no one is referring to that model
| when they say "GPT-3", but to the 175B model.
|
| I wasn't aware that size of the model (356M) was released - what
| am I missing here?
|
| I also think it's relatively well understood that (with our
| current methods) transformers have a tipping point with
| parameter count, and I don't know of any models with less than
| ~3B parameters that are useful - arguably 7B.
|
| Compare these benchmarks to, say, the RWKV 5/6 paper
| https://arxiv.org/abs/2404.05892
| CuriouslyC wrote:
| Phi-3 mini is surprisingly capable given its size. You can teach
| small transformers to do stuff well; you just can't have good
| general-purpose small models.
| jasonjmcghee wrote:
| Totally. But they aren't fine-tuning these afaict, just
| comparing general-purpose capabilities.
___________________________________________________________________
(page generated 2024-05-08 23:01 UTC)