[HN Gopher] Mamba: The Easy Way
___________________________________________________________________
Mamba: The Easy Way
Author : jackcook
Score : 188 points
Date : 2024-02-23 16:11 UTC (6 hours ago)
(HTM) web link (jackcook.com)
(TXT) w3m dump (jackcook.com)
| paxys wrote:
| From what I can tell, all the large players in the space are
| continuing to develop on transformers, right? Is it just that
| Mamba is too new, or is the architecture fundamentally not usable
| for some reason?
| thatguysaguy wrote:
| Too new is definitely one thing. Someone is going to have to
| take a gamble and actually pay for a serious pretraining run
| with this architecture before we know how it really stacks up
| against transformers.
|
| There are some papers suggesting that transformers are better
| than SSMs in fundamental ways (e.g., SSMs cannot do arbitrary
| key-based recall from their context:
| https://arxiv.org/abs/2402.01032). This means switching over is
| not just a no-brainer.
| gaogao wrote:
| It's a reasonably easy bet that Together is doing, or will do,
| a serious pretraining run with Mamba, and if that's a success,
| other players might start considering it more.
| whimsicalism wrote:
| > There are some papers suggesting that transformers are
| better than SSMs in fundamental ways
|
| I mean, vanilla transformers are also shown failing at the
| tasks they present.
| espadrine wrote:
| Another element is that Mamba required a very custom
| implementation, down to custom fused kernels, which I expect
| would need to be implemented in DeepSpeed or an equivalent
| library for a larger training run spanning thousands of GPUs.
| cs702 wrote:
| Not necessarily:
|
| https://www.reddit.com/r/MachineLearning/comments/1amb3xu/d
| _...
| whimsicalism wrote:
| we have no idea what the large players in the space are doing
| danielmarkbruce wrote:
| Exactly this. Except there is zero chance they just looked
| at Mamba and went "meh, too new for us". People are
| definitely trying stuff. It takes a lot of fiddling around
| with a brand new model architecture to get something working
| well. OpenAI aren't going to give a running commentary on the
| state of all the things they are looking into.
| magnio wrote:
| Fantastic blog post, thank you for this. I am not even familiar
| with transformers, yet the explanation is crystal clear to me,
| and the included references and context are a treasure trove. The
| explanation of FlashAttention is the best I have seen, and that
| is not even the focus of the article.
|
| One question I have on selectivity: footnote 4 says "the
| continuous A is constant, while our discretization parameter Δ
| is input-dependent." What is the effect of varying the
| discretization instead of the (main, as I understand it) state A?
| My gut says it simplifies training and provides stability, but I
| feel A carries most of the behavior of the model, so it should
| have more wiggle room throughout training.
| jackcook wrote:
| Thank you for the kind words! I think it's mostly to reduce
| complexity during training. Here's an excerpt from page 9 of
| the Mamba paper:
|
| "We remark that while the A parameter could also be selective,
| it ultimately affects the model only through its interaction
| with Δ via Ā = exp(ΔA) (the discretization (4)). Thus
| selectivity in Δ is enough to ensure selectivity in (Ā, B̄),
| and is the main source of improvement. We hypothesize that
| making A selective in addition to (or instead of) Δ would
| have similar performance, and leave it out for simplicity."
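|
| To make that concrete, here's a rough sketch in Python/NumPy (not
| the official implementation; it assumes a diagonal A and the
| simplified Euler rule for B that the paper mentions) of how an
| input-dependent Δ makes the discretized parameters selective even
| though the continuous A is fixed:
|
|     import numpy as np
|
|     def discretize(A_diag, B, delta):
|         # A_diag: (d_state,) fixed continuous-time A (diagonal)
|         # B:      (d_state,) continuous-time input matrix
|         # delta:  step size for this timestep, computed from the input
|         A_bar = np.exp(delta * A_diag)  # zero-order hold for A
|         B_bar = delta * B               # simplified Euler rule for B
|         return A_bar, B_bar
|
| Because delta is a function of the current input, A_bar and B_bar
| change from token to token, which is what "selectivity" refers to.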
| whimsicalism wrote:
| How are you not familiar with transformers yet have seen
| multiple explanations of FlashAttention?
| avarun wrote:
| Literally the exact question I had reading that comment haha
| samus wrote:
| The issue with attention is essentially that it is used to
| relate all tokens of the input sequence with each other. The
| need to do that makes intuitive sense no matter how much one
| understands about the internals of a transformer. The naive
| way to do it boils down to matrix multiplications, and far
| more people understand the performance issues those imply.
| whimsicalism wrote:
| your comment makes no sense to me, sorry. if you understand
| attention you understand transformers, period.
| samus wrote:
| That's good to know :)
| lxe wrote:
| I'm very confident I could actually understand the terminology
| used in discussing machine learning models if it were presented in
| a way that described the first principles a little bit better,
| instead of diving directly into high-level abstract equations and
| symbols.
|
| I'd like a way to learn this stuff as a computer engineer, in the
| same spirit as "big scary math symbols are just for-loops".
| esafak wrote:
| Ask an LLM to translate it into terms you understand. This is
| something they excel at.
| paulluuk wrote:
| Ironically, you can probably just ask a Transformer model to
| explain it to you.
|
| I'm the same as you: I have no problem grasping complex
| concepts, I just always struggled with the mathematical
| notation. I did pass linear algebra in university, but was glad
| I could go back to programming after that. Even then, I mostly
| passed linear algebra because I wrote functions that solve
| linear algebra equations until I fully grasped the concept.
|
| I've found that GPT-4 is very good at taking a math-notation-
| rich document and just describing it in terms a math-notation-
| averse engineer would understand.
|
| I was a data engineer for about 6-7 years at various companies,
| always working together with data scientists who insist that
| `x_` or `_phi` are proper variable names. Man am I glad to be
| working with engineers now.
| danielmarkbruce wrote:
| This is very effective.
|
| Also, just try really hard. Repeat. It's a new language for
| explaining concepts you likely already know. You don't learn
| Spanish by looking at the translations once.
| yorwba wrote:
| It is unclear to me whether you're praising the article as
| particularly easy to understand or complaining that it contains
| equations like
|
|     h_t = A h_{t-1} + B x_t
|     y_t = C h_t
|
| (which the author attempts to illustrate in the "My name is
| Jack" figure below)
| whimsicalism wrote:
| If you want to learn this stuff as a computer engineer, you can
| read the code here [0]. I find the math quite helpful.
|
| [0]: https://github.com/state-spaces/mamba
| jxmorris12 wrote:
| In case people are wondering why Mamba is exciting:
|
| There's this idea in AI right now that "scaling" models to be
| bigger and train on more data always makes them better. This has
| led to a science of "scaling laws" which study just how much
| bigger models need to be and how much data we need to train them
| on to make them a certain amount better. The relationship between
| model size, training data size, and performance turns out to be
| quite predictable.
|
| Transformers are great because they can continue scaling and
| giving us better performance - unlike, we think, RNNs. Probably
| the most exciting thing about Mamba is the claim that it can be a
| bit smaller, and train on a bit less data, and still provide
| better performance than the equivalent Transformer, especially at
| longer sequence lengths.
|
| For more info, see the scaling laws plot in Figure 4 of the Mamba
| paper: https://arxiv.org/abs/2312.00752
| 5kg wrote:
| I'd love to see someone who has the resources train a model
| bigger than 2.8b and show the scaling law still holds.
| nickpsecurity wrote:
| Some prior comments said these architectures lack the memory,
| or something like it, of a transformer, and that this weakness is
| keeping people on transformers. If true, I'd also like to see
| tests across various domains with equivalent transformer and
| Mamba designs to see whether that difference affects anything.
| From there, we'd have a better idea of whether a Mamba-176B is
| worth the money.
| hansonw wrote:
| "RNN-mode inference" is also extremely exciting because you can
| precompute the hidden state of any prompt prefix (e.g. a long
| system prompt, or statically retrieved context) and continued
| generations pay the same cost irrespective of the prefix
| length.
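|
| A hypothetical sketch of what that looks like (step() stands in
| for one recurrent update; names and shapes are made up for
| illustration):
|
|     def run_prefix(step, h0, prefix_tokens):
|         # Compute the fixed-size hidden state once for a long prefix.
|         h = h0
|         for x in prefix_tokens:
|             h = step(h, x)
|         return h  # size is constant regardless of prefix length
|
|     # cached = run_prefix(step, h0, system_prompt_tokens)
|     # Every later generation can start from `cached`, so the prefix is
|     # never reprocessed, unlike a transformer, whose per-token attention
|     # cost grows with the prefix length.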
| KuriousCat wrote:
| People have shown that even CNNs can match the performance of
| transformers.
|
| https://openreview.net/forum?id=TKIFuQHHECj#
|
| I believe there is a lot of herding going on, driven by the
| influence of people who had the compute to play around with,
| rather than deeply insightful or principled exploration of
| networks.
| moffkalast wrote:
| If I'm not mistaken, the largest Mamba model right now is 2.8B
| and undertrained on low-quality data (the Pile only). The main
| problem is that it's new and unproven.
|
| Should become very interesting once someone with both data and
| significant financial backing takes the plunge and trains
| something of notable size. Perhaps Llama-3 might already end up
| being that attempt, as we seem to be heavily into diminishing
| returns for transformers.
| SekstiNi wrote:
| There is one trained on 600B tokens from SlimPajama [1], but
| that's fairly tiny compared to other recent releases (ex.
| stablelm-3b [2] trained on 4T tokens).
|
| > low quality data (the Pile only)
|
| The Pile is pretty good quality-wise. It's mostly the size
| (300B tokens) that's limiting.
|
| [1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj [2]:
| https://huggingface.co/stabilityai/stablelm-3b-4e1t
| moffkalast wrote:
| Eh, quality is subjective. There are good parts, like Books3
| and arXiv, but a large part of it is Common Crawl, which has
| just about anything people put up on the internet: random IRC
| chat logs, HN and Reddit shitposts, YouTube subtitles which
| are in broken English half the time, and of course the Enron
| corporate email dump to make every model sound like an HR
| middle manager.
| mistrial9 wrote:
| namespace collision detected
|
| https://anaconda.org/conda-forge/mamba
| jsenn wrote:
| This was really helpful, but only discusses linear operations,
| which obviously can't be the whole story. From the paper it seems
| like the discretization is the only nonlinear step--in particular
| the selection mechanism is just a linear transformation. Is that
| right? How important is the particular form of the nonlinearity?
|
| EDIT: from looking at the paper, it seems like even though the
| core state space model/selection mechanism is linear (except for
| discretization?), they incorporate a nonlinearity in the full
| "mamba block", which is stacked up with residual connections and
| layer norm just like in a transformer. They describe this as
| combining a linear attention and an MLP into a single step,
| rather than alternating attention and MLP as in a transformer.
| jackcook wrote:
| Yes, you're spot on: the nonlinearities come from the full Mamba
| blocks, which I left out of this post for simplicity and to focus
| on the bigger ideas the paper introduced. You can see it marked
| by the "X" on the right-most part of Figure 3 in the Mamba
| paper: https://arxiv.org/abs/2312.00752
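|
| For anyone curious what that figure looks like in code, here's a
| rough, simplified sketch of the block's data flow (the layer
| objects passed in are hypothetical placeholders; the conv details,
| normalization, and dimensions are glossed over):
|
|     # Rough sketch of a Mamba block's data flow, after Figure 3 of
|     # the paper; layer arguments are hypothetical placeholders.
|     def mamba_block(x, in_proj_a, in_proj_b, conv1d, silu, ssm, out_proj):
|         a = silu(conv1d(in_proj_a(x)))  # main branch: project, local conv, nonlinearity
|         a = ssm(a)                      # selective SSM (the linear recurrence)
|         g = silu(in_proj_b(x))          # gating branch: another nonlinearity
|         return out_proj(a * g)          # gated multiply (the "X"), then project out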
| denial wrote:
| Something minor I always wonder about when I read Mamba is the
| discretization.
|
| All of the sources I see referred to as derivations of it have a
| discretization of the form
|
| h_t = A h_{t-1} + B x_{t-1} for the first line instead of the
| given one of the form h_t = A h_{t-1} + B x_t.
|
| Does anyone know why this is?
| pama wrote:
| Not sure how much detail you need, but generally there exist
| implicit and explicit integrators for numerically solving
| (integrating) ODEs. The implicit ones, like the one used here,
| tend to be more stable. The ideas behind SSMs come from control
| theory, which used integrators with stability guarantees so
| that the rest of the neural network can focus on other aspects
| of the problem.
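|
| As a rough sketch of that distinction for the continuous system
| h'(t) = A h(t) + B x(t) (glossing over the exact zero-order-hold
| rule Mamba uses):
|
|     forward (explicit) Euler:   h_t = (I + ΔA) h_{t-1} + ΔB x_{t-1}
|     backward (implicit) Euler:  h_t = (I - ΔA)^{-1} (h_{t-1} + ΔB x_t)
|
| The explicit rule evaluates the input at the previous step (x_{t-1}),
| while the implicit rule evaluates it at the current step (x_t), which
| matches the form used in the post and the paper.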
| denial wrote:
| That's a helpful pointer. Thank you.
| intalentive wrote:
| Nice post. A couple things to add:
|
| 1. The Mamba co-author was also the FlashAttention lead author.
|
| 2. The secret ingredient that makes SSMs viable for deep learning
| is HiPPO theory. If you start with random initialization you're
| not going to get results. What you need is "optimal online
| function approximation" using Legendre polynomials, a Fourier
| basis, etc., in matrix form. The Mamba story starts with Legendre
| Memory Units.
|
| Invariably someone comments, "How do we know that it scales?" We
| don't. But the lead author has backing and a new startup at
| cartesia.ai. Could be the next Mistral.
| sigmoid10 wrote:
| The architecture is completely public. I would be surprised if
| certain other players (including but not limited to Mistral AI)
| are not training models yet. We'll hear soon enough if this is
| viable. Maybe not for official release candidates, but at least
| for internal testing.
| israrkhan wrote:
| MoE (Mixture of Experts) is an effective way to scale
| transformers. Gemini 1.5 is already doing up to 1 million tokens.
| I have not seen any large-scale Mamba model, so I'm not aware of
| its shortcomings, but I am sure there are tradeoffs.
|
| It should be possible to combine Mamba with MoE. I wonder what
| that would look like... a billion-token context?
| intalentive wrote:
| https://arxiv.org/abs/2401.04081
|
| https://github.com/jzhang38/LongMamba
| israrkhan wrote:
| Interesting. This is exactly what I was thinking about.
| Thanks for sharing.
| whimsicalism wrote:
| nope :) MoE does not scale transformers along sequence length
___________________________________________________________________
(page generated 2024-02-23 23:00 UTC)