[HN Gopher] xLSTM code release by NX-AI
___________________________________________________________________
xLSTM code release by NX-AI
Author : badlogic
Score : 95 points
Date   : 2024-06-04 09:09 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| pietz wrote:
| Could someone provide a quick summary where they stand compared
| to transformer architectures? Do they have real world scale
| results that are competitive?
| barrell wrote:
 | - They outperform transformers at lower parameter counts. Time
 | will tell whether that holds up at larger scales
|
 | - Their compute scales linearly with sequence length rather than
 | quadratically, which means that with a longer context window they
 | should be faster and cheaper than transformers (rough sketch at
 | the end of this comment)
|
| - It's been mostly academic as far as I know, only just
| recently being published. I don't think there's been an
| opportunity to use them at 'real world scale' yet, although tbh
| I'm a little uncertain what you mean by it.
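 |
 | A back-of-the-envelope sketch of the scaling point (mine, with
 | purely illustrative numbers, not from the paper):
 |
 |   # per-sequence cost: full self-attention vs. a fixed-size
 |   # recurrent state update
 |   def attention_flops(T, d):
 |       # each token attends to all T tokens: O(T^2 * d) total
 |       return T * T * d
 |
 |   def recurrent_flops(T, d):
 |       # constant-size update per token: O(T * d^2) total
 |       return T * d * d
 |
 |   d = 1024  # hidden size (illustrative)
 |   for T in (1_000, 10_000, 100_000):
 |       print(T, attention_flops(T, d) / recurrent_flops(T, d))
 |   # ratio is T / d, so the gap grows linearly with context length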
| oersted wrote:
| It does seem to match transformers, but I wouldn't say it
| meaningfully outperforms them in terms of quality vs
| parameters.
|
 | Model        #Params (M)  SlimPajama (15B) ppl
 | GPT-3            356             14.26
 | Llama            407             14.25
 | H3               420             18.23
 | Mamba            423             13.70
 | Hyena            435             17.59
 | RWKV-4           430             15.62
 | RWKV-5           456             16.53
 | RWKV-6           442             17.40
 | RetNet           431             16.23
 | HGRN             411             21.83
 | GLA              412             19.56
 | HGRN2            411             16.77
 | xLSTM[1:0]       409             13.43
 | xLSTM[7:1]       408             13.48
|
| There are more detailed perplexity and task benchmarks in the
| paper. Overall, all the architectures perform very similarly
| on every benchmark, sometimes xLSTM is slightly ahead but not
| always, and the difference is not really meaningful.
|
| This is great news though, it means we are not losing
| anything by switching to xLSTM and we get important
| advantages like the scalable context window.
|
| I'm quite excited about this because we can potentially have
| the LLM remember what you say and do few-shot persistent
| learning from user interaction (updating "itself", the state
| vector). It would be very interesting if LLMs were no longer
| static. Although I'm sure it will be a challenge to train the
| model to keep such learnings in its memory long-term.
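 |
 | A toy version of what I mean (my own sketch, using torch.nn.LSTM
 | as a stand-in, since the point is only that recurrent models
 | expose an explicit, fixed-size state you can persist between
 | sessions):
 |
 |   import torch
 |   import torch.nn as nn
 |
 |   lstm = nn.LSTM(input_size=64, hidden_size=128,
 |                  num_layers=2, batch_first=True)
 |   state = None  # the persistent "memory"
 |
 |   for _ in range(3):  # three separate user interactions
 |       x = torch.randn(1, 10, 64)  # (batch, seq_len, input_size)
 |       out, state = lstm(x, state)
 |       # `state` could be serialized here and restored next session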
|
| The paper: https://arxiv.org/abs/2405.04517
| piecerough wrote:
| > It would be very interesting if LLMs were no longer
| static.
|
 | Little bit of a nightmare too. Instructions keep piling up for
 | you that you can no longer openly access and remove.
| 3abiton wrote:
 | Linear scaling with context length is also a big deal. Flash
 | attention only partially addressed this for transformers (memory
 | becomes linear, but compute is still quadratic), but xLSTM seems
 | promising!
| wantsanagent wrote:
| Deeper dive by Yannic if you want it:
| https://www.youtube.com/watch?v=0OaEv1a5jUM
| htrp wrote:
 | This is exciting because LSTMs are an architecture that had so
 | much promise, but whose gradient/parallelization problems we
 | could never solve as well as transformers did.
|
 | This code will allow people to experiment and see if it is a
 | viable architecture at foundation/frontier model scale.
| ein0p wrote:
 | Note: GNU AGPLv3. Industry labs won't touch this with a
 | hundred-foot pole. Given that they're the only ones with access
 | to serious resources, it could be a while before we see a large
 | model of this architecture.
| striking wrote:
| Reimplementation from paper is pretty common, though, no?
| ein0p wrote:
| Yes, that's why it'll take time. There's so much stuff
| competing for researchers' attention, and experimentation
| with this takes so much time and $$$, that if it wasn't for
| Sepp Hochreiter on the list of authors this could get ignored
| entirely. IOW it's not the seller's market for novel
| architectures right now.
| htrp wrote:
| You can't outspend the industry labs given the compute
| inflation in transformer architectures (unless you are
| ridiculously well connected in the venture/sovereign
| funding communities).
|
| And realistically, do we need another GPT4 evaluation
| paper?
| ein0p wrote:
| That is by far not the only thing industry labs are
| working on currently. I work in one. My group might be
| unusual, but I can't name a single currently active
| project here that is not a departure from Transformers
 | one way or another. I expect a ton of such efficiency-oriented
 | work in the next 4-5 years. We can't keep burning money as
 | inefficiently as we do right now.
| nurple wrote:
| How does AGPLv3 impact a lab's ability to do research on an
| implementation?
| ein0p wrote:
 | IANAL and this is not legal advice, but I don't think it really
 | impacts anything for academic research. That said, Legal usually
 | has a major fit when the AGPL is even peripherally mentioned.
| mindcrime wrote:
| Interestingly, the AGPL has been something of a "boogey-
| man" to some commercial entities, going back 20+ years now.
| The GPL too, albeit to a lesser extent. Anyway, this may
| well be a great opportunity for a firm who bothers to look
| a bit deeper and say "OK, maybe the AGPL isn't something to
| be scared of after all". Just comply with the terms and "no
| harm, no foul".
| dheera wrote:
| There's a really easy way around this, and that is to offer
| high salaries to the authors to join the company and
| reimplement their work.
|
 | For researchers, in all honesty that is a very, very good
 | reason to go GPL. If someone wants to profit off of it, it's
 | not that they can't use the code commercially; they're just
 | forced to hire you or pay you to dual-license it.
|
| There's no reason why a company whose stock goes up $10B due to
| your model can't cut you a few million of that.
| trextrex wrote:
| I'm not clear on what advantage this architecture has over
| mamba/Griffin. They also have the linear scaling, better sequence
| parallelism and are competitive in performance with transformers.
| wave_1 wrote:
| state tracking...
| lalaland1125 wrote:
| The whole field seems to be having issues with comparisons
| right now.
|
| We really don't even know how Mamba vs Griffin compare.
| ganzuul wrote:
| Are there any studies on predicting neural architecture scaling?
| E.g. a small training dataset which indicates performance on a
| large training dataset?
| dang wrote:
| Recent and related:
|
| _xLSTM: Extended Long Short-Term Memory_ -
| https://news.ycombinator.com/item?id=40294650 - May 2024 (73
| comments)
| brcmthrowaway wrote:
| Congrats to the x.AI team!
___________________________________________________________________
(page generated 2024-06-05 23:01 UTC)