[HN Gopher] Mamba: The Easy Way
       ___________________________________________________________________
        
       Mamba: The Easy Way
        
       Author : jackcook
       Score  : 188 points
       Date   : 2024-02-23 16:11 UTC (6 hours ago)
        
 (HTM) web link (jackcook.com)
 (TXT) w3m dump (jackcook.com)
        
       | paxys wrote:
        | From what I can tell, all the large players in the space are
        | continuing to develop on transformers, right? Is it just that
        | Mamba is too new, or is the architecture fundamentally not usable
        | for some reason?
        
         | thatguysaguy wrote:
          | Too new is definitely one thing. Someone is going to have to
          | gamble on actually paying for a serious pretraining run with
          | this architecture before we know how it really stacks up
          | against transformers.
          | 
          | There are some papers suggesting that transformers are better
          | than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary
          | key-based recall from their context:
          | https://arxiv.org/abs/2402.01032). This means it's not a
          | no-brainer to switch over.
        
           | gaogao wrote:
            | It's a reasonably easy bet that Together is doing, or will
            | do, a serious pretraining run with Mamba, and if that's a
            | success, other players might start considering it more.
        
           | whimsicalism wrote:
           | > There are some papers suggesting that transformers are
           | better than SSMs in fundamental ways
           | 
            | I mean, vanilla transformers are also shown failing at the
            | tasks they present.
        
           | espadrine wrote:
            | Another element is that Mamba required a very custom
            | implementation, down to custom fused kernels, which I expect
            | would need to be implemented in DeepSpeed or an equivalent
            | library for a larger training run spanning thousands of GPUs.
        
             | cs702 wrote:
             | Not necessarily:
             | 
             | https://www.reddit.com/r/MachineLearning/comments/1amb3xu/d
             | _...
        
         | whimsicalism wrote:
         | we have no idea what the large players in the space are doing
        
           | danielmarkbruce wrote:
           | Exactly this. Except, there is zero chance they just looked
           | at mamba and went "meh, too new for us". People are
           | definitely trying stuff. It takes a lot of fiddling around
           | with a brand new model architecture to get something working
           | well. OpenAI aren't going to give a running commentary on the
           | state of all the things they are looking into.
        
       | magnio wrote:
        | Fantastic blog post, thank you for this. I am not even familiar
        | with transformers, yet the explanation is crystal clear to me,
        | and the included references and context are a treasure trove.
        | The explanation of FlashAttention is the best I have seen, and
        | that is not even the focus of the article.
       | 
        | One question I have on selectivity: footnote 4 says "the
        | continuous A is constant, while our discretization parameter Δ
        | is input-dependent." What is the effect of varying the
        | discretization instead of the (main, as I understand it) state A?
        | My gut says it simplifies training and provides stability, but I
        | feel A carries most of the behavior of the model, so it should
        | have more wiggle room throughout training.
        
         | jackcook wrote:
         | Thank you for the kind words! I think it's mostly to reduce
         | complexity during training. Here's an excerpt from page 9 of
         | the Mamba paper:
         | 
         | "We remark that while the A parameter could also be selective,
         | it ultimately affects the model only through its interaction
         | with [?] via A = exp([?]A) (the discretization (4)). Thus
         | selectivity in [?] is enough to ensure selectivity in (A, B),
         | and is the main source of improvement. We hypothesize that
         | making A selective in addition to (or instead of) [?] would
         | have similar performance, and leave it out for simplicity."
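          | 
          | For reference, a rough numpy sketch of that discretization
          | (equation (4) in the paper is the zero-order hold; here A is
          | diagonal, as in Mamba, and the names are illustrative, not the
          | paper's code):
          | 
          |     import numpy as np
          |     
          |     def discretize_zoh(A_diag, B, delta):
          |         """A_diag: (N,) diagonal of the continuous A.
          |         B: (N,) input matrix for one channel.
          |         delta: scalar step size (input-dependent in Mamba)."""
          |         dA = delta * A_diag
          |         A_bar = np.exp(dA)           # exp(delta * A)
          |         # = (dA)^-1 (exp(dA) - I) * delta * B, elementwise
          |         B_bar = (A_bar - 1.0) / dA * delta * B
          |         return A_bar, B_bar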
        
         | whimsicalism wrote:
         | How are you not familiar with transformers yet have seen
         | multiple explanations of FlashAttention?
        
           | avarun wrote:
           | Literally the exact question I had reading that comment haha
        
           | samus wrote:
            | The issue with attention is essentially that it is used to
            | relate all tokens of the input sequence with each other. The
            | need to do that makes intuitive sense no matter how much one
            | understands about the internals of a transformer. The naive
            | way to do it boils down to matrix multiplications, and a lot
            | more people understand the performance issues those imply.
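            | 
            | A minimal numpy sketch of that naive computation (names are
            | illustrative); the (n, n) score matrix is where the quadratic
            | cost in sequence length comes from:
            | 
            |     import numpy as np
            |     
            |     def naive_attention(Q, K, V):
            |         """Q, K, V: (n, d) arrays for n tokens."""
            |         scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n)
            |         w = np.exp(scores - scores.max(-1, keepdims=True))
            |         w /= w.sum(-1, keepdims=True)             # softmax
            |         return w @ V                              # (n, d)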
        
             | whimsicalism wrote:
             | your comment makes no sense to me, sorry. if you understand
             | attention you understand transformers, period.
        
               | samus wrote:
               | That's good to know :)
        
       | lxe wrote:
        | I'm quite positive I could actually understand the terminology
        | used in discussing machine learning models if it were presented
        | in a way that describes the first principles a little bit better,
        | instead of diving directly into high-level abstract equations
        | and symbols.
       | 
       | I'd like a way to learn this stuff as a computer engineer, in the
       | same spirit as "big scary math symbols are just for-loops"
        
         | esafak wrote:
         | Ask an LLM to translate it into terms you understand. This is
         | something they excel at.
        
         | paulluuk wrote:
         | Ironically, you can probably just ask a Transformer model to
         | explain it to you.
         | 
         | I'm the same as you: I have no problem grasping complex
         | concepts, I just always struggled with the mathematical
         | notation. I did pass linear algebra in university, but was glad
         | I could go back to programming after that. Even then, I mostly
         | passed linear algebra because I wrote functions that solve
         | linear algebra equations until I fully grasped the concept.
         | 
         | I've found that GPT-4 is very good at taking a math-notation-
         | rich document and just describing it in terms a math-notation-
         | averse engineer would understand.
         | 
         | I was a data engineer for about 6-7 years at various companies,
         | always working together with data scientists who insist that
         | `x_` or `_phi` are proper variable names. Man am I glad to be
         | working with engineers now.
        
           | danielmarkbruce wrote:
           | This is very effective.
           | 
            | Also, just try really hard. Repeat. It's a new language for
            | explaining concepts you likely already know. You don't
            | remember Spanish by looking at the translations once.
        
         | yorwba wrote:
         | It is unclear to me whether you're praising the article as
         | particularly easy to understand or complaining that it contains
          | equations like
          | 
          |     h_t = A h_{t-1} + B x_t
          |     y_t = C h_t
         | 
         | (which the author attempts to illustrate in the "My name is
         | Jack" figure below)
        
         | whimsicalism wrote:
         | If you want to learn this stuff as a computer engineer, you can
         | read the code here [0]. I find the math quite helpful.
         | 
         | [0]: https://github.com/state-spaces/mamba
        
       | jxmorris12 wrote:
       | In case people are wondering why Mamba is exciting:
       | 
       | There's this idea in AI right now that "scaling" models to be
       | bigger and train on more data always makes them better. This has
       | led to a science of "scaling laws" which study just how much
       | bigger models need to be and how much data we need to train them
       | on to make them a certain amount better. The relationship between
       | model size, training data size, and performance turns out to be
       | quite predictable.
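        | (Concretely, the standard Chinchilla-style fits model the loss
        | as roughly L(N, D) = E + a/N^alpha + b/D^beta, where N is the
        | parameter count and D is the number of training tokens; that
        | form is from the scaling-laws literature, not the Mamba paper.)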
       | 
       | Transformers are great because they can continue scaling and
       | giving us better performance - unlike, we think, RNNs. Probably
       | the most exciting thing about Mamba is the claim that it can be a
       | bit smaller, and train on a bit less data, and still provide
       | better performance than the equivalent Transformer, especially at
       | longer sequence lengths.
       | 
       | For more info, see the scaling laws plot in Figure 4 of the Mamba
       | paper: https://arxiv.org/abs/2312.00752
        
         | 5kg wrote:
         | I'd love to see someone who has the resources train a model
         | bigger than 2.8b and show the scaling law still holds.
        
           | nickpsecurity wrote:
            | Some prior comments said those architectures lack the memory
            | (or something) of a transformer, i.e. that there's a weakness
            | keeping people on transformers. If true, I'd also like to see
            | tests of various domains with equivalent transformer and
            | Mamba designs to see whether that difference impacts
            | anything. From there, we'd have a better idea about whether
            | Mamba-176B is worth the money.
        
         | hansonw wrote:
         | "RNN-mode inference" is also extremely exciting because you can
         | precompute the hidden state of any prompt prefix (i.e. a long
         | system prompt, or statically retrieved context) and continued
         | generations pay the same cost irrespective of the prefix
         | length.
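          | 
          | A minimal sketch of that idea using the linear recurrence from
          | the post (illustrative names, not a real API):
          | 
          |     import numpy as np
          |     
          |     def prefill(A, B, prefix):
          |         """Scan the prefix once; the fixed-size state h is
          |         all you need to cache, however long the prefix was."""
          |         h = np.zeros((A.shape[0], 1))
          |         for x in prefix:
          |             h = A @ h + B * x
          |         return h
          |     
          |     def step(A, B, C, h, x):
          |         """Each new token costs the same regardless of prefix."""
          |         h = A @ h + B * x
          |         return (C @ h).item(), h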
        
         | KuriousCat wrote:
          | People have shown that even CNNs can match the performance of
          | transformers.
         | 
         | https://openreview.net/forum?id=TKIFuQHHECj#
         | 
          | I believe there is a lot of herding going on, driven by the
          | influence of people who had compute to play around with,
          | rather than by deeply insightful or principled exploration of
          | networks.
        
       | moffkalast wrote:
        | If I'm not mistaken, the largest Mamba model right now is 2.8B
        | and undertrained with low-quality data (the Pile only). The main
        | problem is that it's new and unproven.
       | 
       | Should become very interesting once someone with both data and
       | significant financial backing takes the plunge and trains
       | something of notable size. Perhaps Llama-3 might already end up
       | being that attempt, as we seem to be heavily into diminishing
       | returns for transformers.
        
         | SekstiNi wrote:
         | There is one trained on 600B tokens from SlimPajama [1], but
          | that's fairly tiny compared to other recent releases (e.g.
          | stablelm-3b [2], trained on 4T tokens).
         | 
         | > low quality data (the Pile only)
         | 
         | The Pile is pretty good quality wise. It's mostly the size
         | (300B tokens) that's limiting.
         | 
         | [1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj [2]:
         | https://huggingface.co/stabilityai/stablelm-3b-4e1t
        
           | moffkalast wrote:
            | Eh, quality is subjective. There are good parts, like Books3
            | and arXiv, but a large part of it is Common Crawl, which has
            | just about anything people put up on the internet, random IRC
            | chat logs, HN and Reddit shitposts, YouTube subtitles that
            | are in broken English half the time, and of course the Enron
            | corporate email dump to make every model sound like an HR
            | middle manager.
        
       | mistrial9 wrote:
       | namespace collision detected
       | 
       | https://anaconda.org/conda-forge/mamba
        
       | jsenn wrote:
       | This was really helpful, but only discusses linear operations,
       | which obviously can't be the whole story. From the paper it seems
       | like the discretization is the only nonlinear step--in particular
       | the selection mechanism is just a linear transformation. Is that
       | right? How important is the particular form of the nonlinearity?
       | 
       | EDIT: from looking at the paper, it seems like even though the
       | core state space model/selection mechanism is linear (except for
       | discretization?), they incorporate a nonlinearity in the full
       | "mamba block", which is stacked up with residual connections and
       | layer norm just like in a transformer. They describe this as
       | combining a linear attention and an MLP into a single step,
       | rather than alternating attention and MLP as in a transformer.
        
         | jackcook wrote:
          | Yes, you're spot on: the nonlinearities come from the full
          | Mamba blocks, which I left out of this post for simplicity and
          | to focus on the bigger ideas the paper introduced. You can see
          | it marked
         | by the "X" on the right-most part of Figure 3 in the Mamba
         | paper: https://arxiv.org/abs/2312.00752
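          | 
          | A rough block-level sketch of where those nonlinearities sit
          | (heavily simplified: the conv1d is omitted, the SSM scan is
          | passed in as a function, and the names are illustrative):
          | 
          |     import numpy as np
          |     
          |     def silu(x):
          |         return x / (1.0 + np.exp(-x))
          |     
          |     def mamba_block(u, W_x, W_z, W_out, ssm):
          |         """u: (seq_len, d_model); ssm: the linear scan."""
          |         x = u @ W_x              # branch fed to the SSM
          |         z = u @ W_z              # gating branch
          |         y = ssm(silu(x))         # nonlinearity before the scan
          |         y = y * silu(z)          # the "X" gate in Figure 3
          |         return u + y @ W_out     # residual around the block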
        
       | denial wrote:
       | Something minor I always wonder about when I read Mamba is the
       | discretization.
       | 
       | All of the sources I see referred to as derivations of it have a
       | discretization of the form
       | 
        | h_t = A h_{t-1} + B x_{t-1} for the first line, instead of the
        | given one of the form h_t = A h_{t-1} + B x_t.
       | 
       | Does anyone know why this is?
        
         | pama wrote:
          | Not sure how much detail you need, but generally there exist
          | implicit and explicit integrators for numerically solving
          | (integrating) ODEs. The implicit ones, like the one used here,
          | tend to be more stable. The ideas behind SSMs come from control
          | theory, which used integrators with stability guarantees so
          | that the rest of the neural network can focus on other aspects
          | of the problem.
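          | 
          | To make that concrete (a rough sketch, not the paper's exact
          | derivation): for the continuous system h'(t) = A h(t) + B x(t),
          | the explicit (forward Euler) update is
          | 
          |     h_t = (I + Δ A) h_{t-1} + Δ B x_{t-1}
          | 
          | which uses the input from the previous step, while the implicit
          | (backward Euler) update solves for h_t and gives
          | 
          |     h_t = (I - Δ A)^{-1} (h_{t-1} + Δ B x_t)
          | 
          | which uses the current input and tends to be more stable.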
        
           | denial wrote:
           | That's a helpful pointer. Thank you.
        
       | intalentive wrote:
       | Nice post. A couple things to add:
       | 
       | 1. The Mamba co-author was also the FlashAttention lead author.
       | 
       | 2. The secret ingredient that makes SSMs viable for deep learning
       | is HiPPO theory. If you start with random initialization you're
       | not going to get results. What you need is "optimal online
       | function approximation" using Legendre polynomials, a Fourier
       | basis, etc., in matrix form. The Mamba story starts with Legendre
       | Memory Units.
       | 
       | Invariably someone comments, "How do we know that it scales?" We
       | don't. But the lead author has backing and a new startup at
       | cartesia.ai. Could be the next Mistral.
        
         | sigmoid10 wrote:
         | The architecture is completely public. I would be surprised if
         | certain other players (including but not limited to Mistral AI)
         | are not training models yet. We'll hear soon enough if this is
         | viable. Maybe not for official release candidates, but at least
         | for internal testing.
        
       | israrkhan wrote:
        | MoE (Mixture of Experts) is an effective way to scale
        | transformers. Gemini 1.5 is already doing up to 1 million tokens.
        | I have not seen any large-scale Mamba model, so I am not aware of
        | its shortcomings, but I am sure there are tradeoffs.
        | 
        | It should be possible to combine Mamba with MoE. I wonder what
        | that would look like... a billion-token context?
        
         | intalentive wrote:
         | https://arxiv.org/abs/2401.04081
         | 
         | https://github.com/jzhang38/LongMamba
        
           | israrkhan wrote:
           | interesting. This is exactly what I was thinking about.
           | Thanks for sharing
        
         | whimsicalism wrote:
         | nope :) MoE does not scale transformers along sequence length
        
       ___________________________________________________________________
       (page generated 2024-02-23 23:00 UTC)