[HN Gopher] xLSTM code release by NX-AI
       ___________________________________________________________________
        
       xLSTM code release by NX-AI
        
       Author : badlogic
       Score  : 95 points
       Date   : 2024-06-04 09:09 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pietz wrote:
       | Could someone provide a quick summary where they stand compared
       | to transformer architectures? Do they have real world scale
       | results that are competitive?
        
         | barrell wrote:
         | - They outperform transformers at lower parameter counts. Time
         | will tell if that holds up at higher parameter counts.
         | 
         | - They scale linearly in sequence length, which means that with
         | a longer context window they should be faster and cheaper than
         | transformers, whose attention cost grows quadratically (rough
         | sketch at the end of this comment).
         | 
         | - It's been mostly academic as far as I know, having only just
         | recently been published. I don't think there's been an
         | opportunity to use them at 'real world scale' yet, although tbh
         | I'm a little uncertain what you mean by that.
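         | 
         | A rough back-of-the-envelope sketch of that scaling difference
         | (purely illustrative; the dimensions and constant factors are
         | made up, not taken from the paper):
         | 
         |     # Rough per-forward-pass operation counts, ignoring constant
         |     # factors, attention heads, and the MLP blocks.
         |     def attention_ops(seq_len, d_model):
         |         # Self-attention forms a seq_len x seq_len score matrix,
         |         # so cost grows with the square of the context length.
         |         return seq_len ** 2 * d_model
         | 
         |     def recurrent_ops(seq_len, d_model):
         |         # A recurrent cell (LSTM/xLSTM-style) does a fixed amount
         |         # of work per token, so cost grows linearly with context.
         |         return seq_len * d_model ** 2
         | 
         |     for n in (1_024, 8_192, 65_536):
         |         a, r = attention_ops(n, 1024), recurrent_ops(n, 1024)
         |         print(f"context {n:>6}: attention/recurrent ~{a / r:.0f}x")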
        
           | oersted wrote:
           | It does seem to match transformers, but I wouldn't say it
           | meaningfully outperforms them in terms of quality vs
           | parameters.
           | 
           | Model: #Params (M), SlimPajama (15B tokens) perplexity
           | 
           | - GPT-3: 356M, 14.26
           | 
           | - Llama: 407M, 14.25
           | 
           | - H3: 420M, 18.23
           | 
           | - Mamba: 423M, 13.70
           | 
           | - Hyena: 435M, 17.59
           | 
           | - RWKV-4: 430M, 15.62
           | 
           | - RWKV-5: 456M, 16.53
           | 
           | - RWKV-6: 442M, 17.40
           | 
           | - RetNet: 431M, 16.23
           | 
           | - HGRN: 411M, 21.83
           | 
           | - GLA: 412M, 19.56
           | 
           | - HGRN2: 411M, 16.77
           | 
           | - xLSTM[1:0]: 409M, 13.43
           | 
           | - xLSTM[7:1]: 408M, 13.48
           | 
           | There are more detailed perplexity and task benchmarks in the
           | paper. Overall, all the architectures perform very similarly
           | on every benchmark, sometimes xLSTM is slightly ahead but not
           | always, and the difference is not really meaningful.
           | 
           | This is great news though, it means we are not losing
           | anything by switching to xLSTM and we get important
           | advantages like the scalable context window.
           | 
           | I'm quite excited about this because we could potentially have
           | the LLM remember what you say and do few-shot persistent
           | learning from user interaction (updating "itself", i.e. the
           | state vector). It would be very interesting if LLMs were no
           | longer static, although I'm sure it will be a challenge to
           | train the model to keep such learnings in its memory long-term.
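           | 
           | To make that concrete, here is a heavily simplified sketch of
           | the kind of fixed-size recurrent state such a model carries
           | from token to token. It is loosely modeled on the paper's
           | mLSTM memory update, with the learned gates replaced by
           | constants and all stabilization/projection details omitted;
           | the names are mine, not from the released code:
           | 
           |     import numpy as np
           | 
           |     d = 64                # head dimension (arbitrary here)
           |     rng = np.random.default_rng(0)
           |     Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d)
           |                   for _ in range(3))
           | 
           |     C = np.zeros((d, d)) # matrix memory, carried across tokens
           |     n = np.zeros(d)      # normalizer state
           | 
           |     def step(x, C, n, f=0.95, i=1.0):
           |         # Decay the old memory, write the new key/value outer
           |         # product, then read out with the query.
           |         q, k, v = Wq @ x, (Wk @ x) / np.sqrt(d), Wv @ x
           |         C = f * C + i * np.outer(v, k)      # memory update
           |         n = f * n + i * k                   # normalizer update
           |         h = (C @ q) / max(abs(n @ q), 1.0)  # normalized read-out
           |         return h, C, n
           | 
           |     # (C, n) is a fixed-size summary of everything seen so far,
           |     # so "remembering" a conversation means persisting it.
           |     for t in range(5):
           |         h, C, n = step(rng.normal(size=d), C, n)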
           | 
           | The paper: https://arxiv.org/abs/2405.04517
        
             | piecerough wrote:
             | > It would be very interesting if LLMs were no longer
             | static.
             | 
             | A little bit of a nightmare too. Instructions keep piling up
             | for you that you can no longer openly access and remove.
        
             | 3abiton wrote:
             | Linear scaling with context length is also a big deal.
             | FlashAttention partially solved this for transformers, but
             | xLSTM seems promising!
        
         | wantsanagent wrote:
         | Deeper dive by Yannic if you want it:
         | https://www.youtube.com/watch?v=0OaEv1a5jUM
        
       | htrp wrote:
       | This is exciting because the LSTM is an architecture that had so
       | much promise, but its gradient/parallelization problems were never
       | solved well enough to compete with transformers.
       | 
       | This code will allow people to experiment and see whether it is a
       | viable architecture at foundation/frontier model scale.
        
       | ein0p wrote:
       | Note: GNU AGPLv3. Industry labs won't touch this with a hundred-
       | foot pole. Given that they're the only ones with access to serious
       | resources, it could be a while before we see a large model with
       | this architecture.
        
         | striking wrote:
         | Reimplementation from the paper is pretty common, though, no?
        
           | ein0p wrote:
           | Yes, that's why it'll take time. There's so much stuff
           | competing for researchers' attention, and experimentation with
           | this takes so much time and $$$, that if it weren't for Sepp
           | Hochreiter being on the list of authors, this could get
           | ignored entirely. IOW it's not a seller's market for novel
           | architectures right now.
        
             | htrp wrote:
             | You can't outspend the industry labs given the compute
             | inflation in transformer architectures (unless you are
             | ridiculously well connected in the venture/sovereign
             | funding communities).
             | 
             | And realistically, do we need another GPT-4 evaluation
             | paper?
        
               | ein0p wrote:
               | That is far from the only thing industry labs are working
               | on currently. I work in one. My group might be unusual,
               | but I can't name a single currently active project here
               | that is not a departure from Transformers in one way or
               | another. I expect a ton of such efficiency-oriented work
               | in the next 4-5 years. We can't keep burning money as
               | inefficiently as we do right now.
        
         | nurple wrote:
         | How does AGPLv3 impact a lab's ability to do research on an
         | implementation?
        
           | ein0p wrote:
           | IANAL and this is not legal advice, but I don't think it
           | really impacts anything for academic research. However, Legal
           | usually has a major fit when the AGPL is even peripherally
           | mentioned.
        
             | mindcrime wrote:
             | Interestingly, the AGPL has been something of a "boogey-
             | man" to some commercial entities, going back 20+ years now.
             | The GPL too, albeit to a lesser extent. Anyway, this may
             | well be a great opportunity for a firm that bothers to look
             | a bit deeper and say "OK, maybe the AGPL isn't something to
             | be scared of after all". Just comply with the terms and "no
             | harm, no foul".
        
         | dheera wrote:
         | There's a really easy way around this, and that is to offer
         | high salaries to the authors to join the company and
         | reimplement their work.
         | 
         | For researchers, in all honesty, that is a very, very good
         | reason to go GPL. If someone wants to profit off of it, it's
         | not that they can't use the code commercially; they are just
         | forced to hire you or pay you to dual-license it.
         | 
         | There's no reason why a company whose stock goes up $10B due to
         | your model can't cut you a few million of that.
        
       | trextrex wrote:
       | I'm not clear on what advantage this architecture has over
       | Mamba/Griffin. They also have linear scaling and better sequence
       | parallelism, and are competitive in performance with transformers.
        
         | wave_1 wrote:
         | state tracking...
        
         | lalaland1125 wrote:
         | The whole field seems to be having issues with comparisons
         | right now.
         | 
         | We really don't even know how Mamba vs Griffin compare.
        
       | ganzuul wrote:
       | Are there any studies on predicting neural architecture scaling?
       | E.g. a small training dataset which indicates performance on a
       | large training dataset?
        
       | dang wrote:
       | Recent and related:
       | 
       |  _xLSTM: Extended Long Short-Term Memory_ -
       | https://news.ycombinator.com/item?id=40294650 - May 2024 (73
       | comments)
        
       | brcmthrowaway wrote:
       | Congrats to the x.AI team!
        
       ___________________________________________________________________
       (page generated 2024-06-05 23:01 UTC)