[HN Gopher] Mamba-2 - State Space Duality
___________________________________________________________________
Mamba-2 - State Space Duality
Author : bratao
Score : 107 points
Date : 2024-06-03 16:07 UTC (6 hours ago)
(HTM) web link (tridao.me)
(TXT) w3m dump (tridao.me)
| eranation wrote:
| I'll bite: can anyone please eli5 to the non PhDs among us?
| cs702 wrote:
| TL;DR: The authors show that if you simplify Mamba so its
| state-space layer uses a diagonal matrix A that is a scalar
| times the identity matrix, the state-space transformation can
| be expressed as a form of causal linear attention.[a] That's
| the duality the authors refer to in the title. The key
| practical benefit is that it enables more efficient (faster)
| training on GPUs.
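|
| A toy sketch of the restricted layer (my shapes and names, not
| the paper's code), assuming a single input channel and real-
| valued parameters; with A_t = a_t * I, each step just decays
| the whole state by one scalar:
|
|     import numpy as np
|
|     # h_t = a_t * h_{t-1} + B_t * x_t,   y_t = C_t . h_t
|     # a: (T,) scalar decays, B, C: (T, N), x: (T,)
|     def ssm_scan(a, B, C, x):
|         T, N = B.shape
|         h = np.zeros(N)
|         y = np.empty(T)
|         for t in range(T):
|             h = a[t] * h + B[t] * x[t]  # one scalar decays h
|             y[t] = C[t] @ h
|         return y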
|
| ---
|
| [a] https://arxiv.org/abs/2006.16236
| brrrrrm wrote:
| restricts freedom in one of the parameters (A) to make training
| substantially more efficient (easier for a GPU to churn
| through).
|
| the actual flops involved are similar to the original SSM-based
| version, but the original is harder to formulate strictly as
| matrix multiplications
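|
| Rough sketch (my framing and variable names, not the paper's)
| of why the scalar restriction turns the computation into plain
| matrix multiplications: unrolling h_t = a_t h_{t-1} + B_t x_t
| gives y = M @ x for a lower-triangular M built from cumulative
| decay products and C B^T.
|
|     import numpy as np
|
|     # a: (T,) positive decays, B, C: (T, N), x: (T,)
|     def ssm_as_matmul(a, B, C, x):
|         cum = np.cumprod(a)                 # a_1 * ... * a_t
|         decay = cum[:, None] / cum[None, :] # prod a_r, s < r <= t
|         L = np.tril(decay)                  # causal mask
|         M = L * (C @ B.T)                   # decay-weighted C_t.B_s
|         return M @ x                        # one big matmul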
| xcodevn wrote:
| tldr: mamba is not as good as transformer.
| cs702 wrote:
| Dao and Gu show that if you simplify Mamba so its state-space
| layer uses a diagonal matrix A_t that is a scalar times the
| identity matrix, i.e., A_t = a_t I, the state-space
| transformation can be expressed as a form of causal linear
| attention[a] by compounding coefficients a_1 ... a_t at each time
| step t. The equivalence of the simplified state-space layer and
| causal linear attention constitutes the duality the authors
| refer to in the title. By taking advantage of this duality,
| Mamba-2 can be trained more efficiently, i.e., faster than the
| original Mamba on GPUs.
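|
| To make the attention side concrete, here's a toy version of
| causal linear attention [a] written as a recurrence over a
| running state S_t = sum_{s<=t} k_s v_s^T; the only change
| needed to recover the simplified Mamba-2 layer is the per-step
| scalar decay a_t, whose products compound into a_1 ... a_t
| (shapes and names are mine, not the paper's):
|
|     import numpy as np
|
|     # q, k: (T, d), v: (T, d_v), a: (T,) decays or None
|     def causal_linear_attention(q, k, v, a=None):
|         S = np.zeros((q.shape[1], v.shape[1]))  # running k v^T sum
|         y = np.empty_like(v)
|         for t in range(len(q)):
|             decay = 1.0 if a is None else a[t]
|             S = decay * S + np.outer(k[t], v[t])
|             y[t] = q[t] @ S                     # y_t = q_t^T S_t
|         return y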
|
| Theoretical stuff aside, Mamba-2's performance seems to scale
| slightly better than the original Mamba's:
| https://tridao.me/assets/img/2024-05-31-mamba-2/pile_8k_mamb...
|
| Here's the code implementing Mamba-2: https://github.com/state-
| spaces/mamba/blob/main/mamba_ssm/mo...
|
| Great work by Tri Dao (of FlashAttention fame) and Albert Gu, as
| usual.
|
| The key question, for me and many others, is whether Mamba,
| Mamba-2, RWKV, and other linear RNN / linear attention models
| will ever match the performance of standard Softmax attention. My
| understanding and experience is that all the linear attention
| models out there [b] still underperform Softmax attention on
| things like recall tasks.[c]
|
| ---
|
| [a] https://arxiv.org/abs/2006.16236
|
| [b] https://github.com/topics/linear-attention /
| https://github.com/topics/linear-attention-model -- this list is
| by no means complete!
|
| [c] https://arxiv.org/abs/2402.01032
| ein0p wrote:
| Notably, humans also underperform Transformers on recall tasks.
| And yet we do ok on many others, even with our imperfect
| recall. So I hope we can identify a set of high-value tasks on
| which these new architectures outperform Transformers and start
| benchmarking Transformers on them, too. Recall isn't really
| "all you need" in this space, although it certainly impresses
| and helps to plug the capability gaps.
| logicchains wrote:
| Quadratic transformers outperform weaker forms of attention on
| recall tasks, but recurrent models are strictly more powerful
| than transformers (when they don't use chain of thought) at
| state tracking problems. Mamba 2 likely has the same
| limitations with state tracking as a full transformer due to
| being parallelizable: https://arxiv.org/abs/2404.08819 .
| cs702 wrote:
| Thank you for sharing this. I've added it to my reading list!
| sroussey wrote:
| TLDR for non-NLP people: Mamba-2 is _much_ faster to train than
| Mamba-1.
| adt wrote:
| https://lifearchitect.ai/models-table/
| webappguy wrote:
| Love this, thanks!! Wish we could filter and sort the table's
| rows (e.g., all the OpenAI models).
| imjonse wrote:
| "From one perspective, Mamba-2 isn't strictly better than
| Mamba-1: while it's a dramatic improvement from a training
| perspective, Mamba-1 might be better from a pure inference
| perspective. Since inference speed of SSMs is entirely governed
| by the state dimension, if one wants to maximize performance for
| a target inference efficiency (i.e. for a particular state size
| N), then the increased expressivity of Mamba-1 might be better."
| radarsat1 wrote:
| Seems like since Mamba 2 is a constrained version of Mamba 1,
| maybe Mamba 1 could be used in a fine-tuning stage.
| evnc wrote:
| I'm a bit of a noob here, but if
|
| a) a linear SSM (a form of RNN?) is equivalent to Attention
| without the scaling and softmax; and
|
| b) Attention is "all you need" and the thing that made
| Transformers radically outperform all the previous architectures
| like LSTMs that used to dominate NLP;
|
| does that imply c) that the scaling and softmax parts of the
| attention equation, in particular, are the magic touch that
| makes Transformers work so well?
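|
| Concretely, what I understand (a) to mean, as a toy sketch (the
| shapes and names are generic, not from the paper): dropping the
| scaling and softmax lets you reassociate (Q K^T) V into
| Q (K^T V), so you never materialize the T x T score matrix.
|
|     import numpy as np
|
|     def softmax_attention(Q, K, V):      # non-causal, for brevity
|         S = Q @ K.T / np.sqrt(Q.shape[-1])   # T x T scores
|         P = np.exp(S - S.max(-1, keepdims=True))
|         P /= P.sum(-1, keepdims=True)        # softmax forces T x T
|         return P @ V
|
|     def linear_attention(Q, K, V):       # no softmax, no scaling
|         return Q @ (K.T @ V)             # d x d_v state, linear in T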
| visarga wrote:
| The major difference is that the transformer's state grows as
| the sequence gets longer, while recurrent models use a fixed-
| size state. So presumably when sequence length (T) > size of
| the state space (N), the transformer will be better on some
| very specific tasks (not all of them), especially those that
| require the model to select information from the beginning of
| the sequence conditional on something at the end of the
| sequence. Transformers can refocus at any time, while SSMs
| need to guess right from the start what to keep and what to
| drop. SSMs could use the old trick of repeating the input
| twice to allow the end to condition on the beginning as well.
|
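| Back-of-envelope numbers for the growing-vs-fixed state
| difference (the sizes below are made-up assumptions, fp16,
| per sequence, ignoring implementation details):
|
|     d_model, n_layers, T, N = 4096, 32, 8192, 128
|     kv_cache = 2 * n_layers * T * d_model * 2   # K and V, grows with T
|     ssm_state = n_layers * N * d_model * 2      # fixed, independent of T
|     print(kv_cache // 2**20, "MiB vs", ssm_state // 2**20, "MiB")
|     # -> 4096 MiB vs 32 MiB
|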
| An important role is played by the softmax function, which
| normalizes the attention scores, allowing the model to weigh
| different parts of the input sequence dynamically. This means
| that, unlike RNNs, which process inputs sequentially and
| update a state, Transformers can directly access and
| prioritize information from any part of the sequence, and
| they are not slower for T < N.
| pama wrote:
| Has anyone tried training it yet, and are there any obvious
| pitfalls for multi-GPU training like there were in Mamba-1?
| tomrod wrote:
| This appears to be huge. Win-win-win for fast LU factorization!
___________________________________________________________________
(page generated 2024-06-03 23:00 UTC)