[HN Gopher] Mamba-2 - State Space Duality
       ___________________________________________________________________
        
       Mamba-2 - State Space Duality
        
       Author : bratao
       Score  : 107 points
       Date   : 2024-06-03 16:07 UTC (6 hours ago)
        
 (HTM) web link (tridao.me)
 (TXT) w3m dump (tridao.me)
        
       | eranation wrote:
        | I'll bite: can anyone please ELI5 this to the non-PhDs among us?
        
         | cs702 wrote:
         | TL;DR: The authors show that if you simplify Mamba so its
         | state-space layer uses a diagonal matrix A that is a scalar
         | times the identity matrix, the state-space transformation can
         | be expressed as a form of causal linear attention.[a] That's
         | the duality the authors refer to in the title. The key
         | practical benefit is that it enables more efficient (faster)
         | training on GPUs.
         | 
         | ---
         | 
         | [a] https://arxiv.org/abs/2006.16236
        
         | brrrrrm wrote:
          | It restricts freedom in one of the parameters (A) to make
          | training substantially more efficient (easier for a GPU to
          | churn through).
          | 
          | The actual FLOPs involved are similar to the original
          | SSM-based version's, but the original is harder to formulate
          | as strictly matrix multiplications.
        
         | xcodevn wrote:
          | tldr: Mamba is not as good as a Transformer.
        
       | cs702 wrote:
        | Dao and Gu show that if you simplify Mamba so its state-space
        | layer uses a diagonal matrix A_t that is a scalar times the
        | identity matrix, i.e., A_t = a_t I, the state-space
        | transformation can be expressed as a form of causal linear
        | attention[a] by compounding the coefficients a_1 ... a_t at
        | each time step t (see the sketch after this comment). The
        | equivalence of the simplified state-space layer and causal
        | linear attention constitutes the duality the authors refer to
        | in the title. By taking advantage of this duality, Mamba-2 can
        | be trained more efficiently, i.e., faster than the original
        | Mamba, on GPUs.
       | 
        | Theoretical stuff aside, Mamba-2's performance seems to scale
        | slightly better than the original Mamba's:
       | https://tridao.me/assets/img/2024-05-31-mamba-2/pile_8k_mamb...
       | 
       | Here's the code implementing Mamba-2: https://github.com/state-
       | spaces/mamba/blob/main/mamba_ssm/mo...
       | 
       | Great work by Tri Dao (of FlashAttention fame) and Albert Gu, as
       | usual.
       | 
       | The key question, for me and many others, is whether Mamba,
       | Mamba-2, RWKV, and other linear RNN / linear attention models
       | will ever match the performance of standard Softmax attention. My
       | understanding and experience is that all the linear attention
       | models out there [b] still underperform Softmax attention on
       | things like recall tasks.[c]
       | 
       | ---
       | 
       | [a] https://arxiv.org/abs/2006.16236
       | 
       | [b] https://github.com/topics/linear-attention /
       | https://github.com/topics/linear-attention-model -- this list is
       | by no means complete!
       | 
       | [c] https://arxiv.org/abs/2402.01032
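
A minimal NumPy sketch of the equivalence described in the comment
above, using simplified notation of my own (ssm_scan,
ssd_attention_form, and the single-channel setup are illustrative, not
taken from the Mamba-2 code): with A_t = a_t I, a sequential scan and a
masked, attention-like matrix product compute the same outputs.

```python
import numpy as np

def ssm_scan(a, B, C, x):
    """Recurrent form: O(T) sequential steps over a fixed-size state."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]   # A_t = a_t * I (scalar times identity)
        y[t] = C[t] @ h
    return y

def ssd_attention_form(a, B, C, x):
    """Equivalent quadratic form: (L * (C @ B.T)) @ x with a decay mask L."""
    # L[t, s] = a_{s+1} * ... * a_t for s <= t (so L[t, t] = 1), else 0.
    log_cum = np.cumsum(np.log(a))          # assumes a > 0
    L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    scores = L * (C @ B.T)                  # masked "attention" matrix
    return scores @ x

rng = np.random.default_rng(0)
T, N = 16, 4
a = rng.uniform(0.5, 1.0, size=T)   # per-step scalar decays a_t
B = rng.standard_normal((T, N))     # plays the role of keys
C = rng.standard_normal((T, N))     # plays the role of queries
x = rng.standard_normal(T)          # plays the role of values (one channel)
assert np.allclose(ssm_scan(a, B, C, x), ssd_attention_form(a, B, C, x))
```

The quadratic form is the one that maps onto plain matrix
multiplications, which is what makes GPU training efficient; as I
understand it, the actual Mamba-2 algorithm applies it blockwise over
chunks of the sequence rather than to the full T x T matrix.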
        
         | ein0p wrote:
         | Notably, humans also underperform Transformers on recall tasks.
         | And yet we do ok on many others, even with our imperfect
          | recall. So I hope we can identify a set of high-value
          | tasks on which these new architectures outperform
          | Transformers, and start benchmarking Transformers on those
          | tasks, too. Recall isn't really "all you need" in this
          | space, although it certainly impresses and helps to plug
          | the capability gaps.
        
         | logicchains wrote:
          | Quadratic transformers outperform weaker forms of
          | attention on recall tasks, but recurrent models are
          | strictly more powerful at state-tracking problems than
          | transformers (at least when the transformers don't use
          | chain of thought). Mamba-2 likely has the same limitations
          | with state tracking as a full transformer, because it is
          | parallelizable: https://arxiv.org/abs/2404.08819
        
           | cs702 wrote:
           | Thank you for sharing this. I've added it to my reading list!
        
       | sroussey wrote:
       | TLDR for non-NLP people: Mamba-2 is _much_ faster to train than
       | Mamba-1.
        
       | adt wrote:
       | https://lifearchitect.ai/models-table/
        
         | webappguy wrote:
          | Love this, thanks!! Wish we could classify and rank the
          | table's rows (e.g., all the OpenAI models).
        
       | imjonse wrote:
       | "From one perspective, Mamba-2 isn't strictly better than
       | Mamba-1: while it's a dramatic improvement from a training
       | perspective, Mamba-1 might be better from a pure inference
       | perspective. Since inference speed of SSMs is entirely governed
       | by the state dimension, if one wants to maximize performance for
       | a target inference efficiency (i.e. for a particular state size
       | N), then the increased expressivity of Mamba-1 might be better."
        
         | radarsat1 wrote:
          | Since Mamba-2 is a constrained version of Mamba-1, it seems
          | like Mamba-1 could perhaps be used in a fine-tuning stage.
        
       | evnc wrote:
       | I'm a bit of a noob here, but if
       | 
       | a) a linear SSM (a form of RNN?) is equivalent to Attention
       | without the scaling and softmax; and
       | 
       | b) Attention is "all you need" and the thing that made
       | Transformers radically outperform all the previous architectures
       | like LSTMs that used to dominate NLP;
       | 
        | does that imply c) that the scaling and softmax parts of the
        | attention equation, in particular, are the magic touch that
        | makes Transformers work so well?
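
A rough NumPy sketch of the contrast the question is pointing at (the
function names are mine, not from any paper): dropping the softmax is
exactly what lets causal attention be rewritten as a recurrence over a
fixed-size state S = sum over s of k_s v_s^T, whereas the
softmax-normalized version has no such exact rewrite.

```python
import numpy as np

def causal_softmax_attention(Q, K, V):
    """Standard quadratic causal attention with softmax-normalized rows."""
    T = Q.shape[0]
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), Q @ K.T, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def causal_linear_attention_recurrent(Q, K, V):
    """tril(Q K^T) V, computed as an RNN with an O(d^2) running state."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # fixed-size state, independent of T
    out = np.empty_like(V)
    for t in range(T):
        S += np.outer(K[t], V[t])   # fold token t into the state
        out[t] = Q[t] @ S           # read out with the current query
    return out

rng = np.random.default_rng(1)
T, d = 8, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# The unnormalized (no-softmax) form matches its quadratic counterpart:
assert np.allclose(causal_linear_attention_recurrent(Q, K, V),
                   np.tril(Q @ K.T) @ V)
# The softmax version has no such fixed-state rewrite; at each step it
# must look back at all previous tokens (or their KV cache).
softmax_out = causal_softmax_attention(Q, K, V)
```

Whether that softmax (and the scaling) is "the magic touch" is what the
rest of the thread debates: it buys dynamic re-weighting over the entire
history, at the cost of a state (the KV cache) that grows with the
sequence.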
        
         | visarga wrote:
          | The major difference is that a transformer's state grows
          | as the sequence gets longer, while recurrent models use a
          | fixed-size state. So presumably, once the sequence length
          | (T) exceeds the size of the state space (N), the
          | transformer will be better on some very specific tasks:
          | not all of them, but especially those that require the
          | model to select information from the beginning of the
          | sequence conditional on something at the end of the
          | sequence. Transformers can refocus at any time, while SSMs
          | need to guess right from the start what to keep and what
          | to drop. SSMs could use the old trick of repeating the
          | input twice to allow the end to condition on the beginning
          | as well.
          | 
          | An important role is played by the softmax function, which
          | normalizes the attention scores and allows the model to
          | weigh different parts of the input sequence dynamically.
          | This means that, unlike RNNs, which sequentially process
          | inputs and update states, Transformers can directly access
          | and prioritize information from any part of the sequence,
          | and they are not slower for T < N.
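
To put rough numbers on the fixed-state-vs-growing-state point, here is
a back-of-the-envelope sketch; every dimension below is an illustrative
assumption, not a figure from the post.

```python
# Inference-time memory: a Transformer's KV cache grows linearly with the
# sequence length T, while an SSM keeps a fixed-size state per layer.

def transformer_kv_cache_bytes(T, n_layers, n_heads, head_dim, bytes_per=2):
    # keys + values for every past token, in every layer and head (fp16)
    return 2 * T * n_layers * n_heads * head_dim * bytes_per

def ssm_state_bytes(N, n_layers, d_model, bytes_per=2):
    # one (d_model x N) recurrent state per layer, independent of T (fp16)
    return n_layers * d_model * N * bytes_per

T = 32_768                                   # sequence length
n_layers, n_heads, head_dim = 48, 32, 128    # made-up model shape
d_model, N = n_heads * head_dim, 128         # N = state-space size

kv_gib = transformer_kv_cache_bytes(T, n_layers, n_heads, head_dim) / 2**30
ssm_mib = ssm_state_bytes(N, n_layers, d_model) / 2**20
print(f"KV cache : {kv_gib:.1f} GiB")   # ~24 GiB with these made-up numbers
print(f"SSM state: {ssm_mib:.1f} MiB")  # ~48 MiB with these made-up numbers
```

The fixed budget is why the SSM has to decide up front what to keep, as
visarga says, and also why, per the blog passage imjonse quotes earlier
in the thread, inference speed is governed by the state dimension N.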
        
       | pama wrote:
        | Has anyone tried training it yet, and are there any obvious
        | pitfalls for multi-GPU training like there were in Mamba-1?
        
       | tomrod wrote:
       | This appears to be huge. Win-win-win for fast LU factorization!
        
       ___________________________________________________________________
       (page generated 2024-06-03 23:00 UTC)