[HN Gopher] Mamba Explained: The State Space Model Taking On Tra...
       ___________________________________________________________________
        
       Mamba Explained: The State Space Model Taking On Transformers
        
       Author : koayon
       Score  : 150 points
       Date   : 2024-02-25 16:16 UTC (6 hours ago)
        
 (HTM) web link (www.kolaayonrinde.com)
 (TXT) w3m dump (www.kolaayonrinde.com)
        
       | fancyfredbot wrote:
       | See also: https://jackcook.com/2024/02/23/mamba.html
        
       | Der_Einzige wrote:
       | First it was longformer, and linear attention models. Then it was
       | RWKV and now it's Mamba. So many bombastic claims of improved
       | architectural performance - and no open source models that beat
       | the thing they purport to beat. The proof is always in the
       | pudding, and these models will remain a curiosity for most until
       | their weights are being benchmarked favorably on LLM
       | leaderboards.
        
         | digdugdirk wrote:
         | Yes, that's technically accurate. But I prefer to think of the
         | entire LLM space as a new scientific field that started when
         | OpenAI released ChatGPT.
         | 
         | In that context, all new research directions are valuable
         | simply for the fact that they're expanding the foundation of
         | the field. 5 years from now, who knows what the most effective
         | models will use under the hood, but the more we can learn about
         | them in general, the better.
        
           | CityOfThrowaway wrote:
           | The field of research here is far older than ChatGPT's
           | release. Neural network research has been going on for at
           | least 50 years.
           | 
           | Most of the research that enabled ChatGPT was also already
           | known. "Attention is all you need" was a 2017 paper.
           | 
           | It still is a fast evolving field, but not one that just
           | kicked off.
        
           | lettergram wrote:
           | lol I think in general, LLM research traces its origins back
           | to all the standard deep learning techniques: NNs, CNNs,
           | LSTMs, RNNs, etc.
           | 
            | In 2018, the release of transformers (via Google) enabled
            | much more rapid training of models and more generalization
            | with less data. 100% of the LLMs (as you'd probably think
            | of them) trace their origins to BERT.
            | 
            | That said, my team was working with hundred-million to
            | low-billion parameter LSTMs & CNNs back in 2016-2017 that
            | were comparable to some lighter-weight LLMs today.
            | 
            | In my opinion, the greatest strides in the space have less
            | to do with the underlying architecture, and more to do
            | with improved data formatting, accessibility and compute
            | improvements.
        
         | sigmoid10 wrote:
         | True, but bear in mind the Mamba preprint is less than three
         | months old. A lot of people are probably experimenting with
         | these ideas right now and training a completely new, large
         | foundation model with a different architecture will take a
         | significant amount of time.
        
         | imjonse wrote:
          | Most (all?) open-ish 7B+ models today are finetunes of
          | proprietary/semi-closed/big-budget LLMs. There is no such
          | foundation model for Mamba yet.
        
       | thecolorgreen wrote:
       | Why doesn't Equation 1b use the h' defined in Equation 1a?
        
         | atlacatl_sv wrote:
          | I believe h' is for the next state. y(t) is used to predict
          | the next word, so it uses the current hidden state h(t).
        
         | koayon wrote:
          | Hey! OP here. Great question - h' in Equation 1a refers to
         | derivative of h with respect to time (t). This is a
         | differential equation which we can solve mathematically when we
         | have x in order to get a closed-form solution for h. We would
         | then plug in that h (the hidden state) into equation 1b.
         | 
         | In our case, we don't actually wait for a closed-form solution
         | but instead compute the discrete representation (Equation 2)
         | 
         | Hope that helps!
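          | 
          | If it helps, here's a minimal numpy sketch of that
          | discretisation and recurrence (zero-order hold, as in the
          | paper; the names and the scalar input channel are
          | illustrative, and A is assumed invertible):
          | 
          |     import numpy as np
          |     from scipy.linalg import expm
          |     
          |     def ssm_scan(A, B, C, xs, dt=0.1):
          |         # Discretise h'(t) = A h(t) + B x(t) (Equation 1a):
          |         #   Abar = exp(dt*A), Bbar = A^-1 (Abar - I) B
          |         Abar = expm(dt * A)
          |         Bbar = np.linalg.solve(A, (Abar - np.eye(len(A))) @ B)
          |         h, ys = np.zeros(len(A)), []
          |         for x in xs:                 # Equation 2: recurrence
          |             h = Abar @ h + Bbar * x  # update hidden state
          |             ys.append(C @ h)         # Equation 1b: readout
          |         return np.array(ys)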
        
       | CrypticShift wrote:
       | > In other words, you can drag and drop downloaded states into
       | your model, like literal plug-in cartridges
       | 
        | The same could be said of "control vectors" [1]. Both ideas
        | are still experimental, but it seems to me (IINM) that they
        | could replace "system prompts" and "RAG" respectively.
       | 
       | [1] https://news.ycombinator.com/item?id=39414532
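        | 
        | For anyone unfamiliar, a control vector is (roughly) just a
        | direction added to a layer's hidden activations at inference
        | time. A toy sketch, with hypothetical shapes and names:
        | 
        |     import numpy as np
        |     
        |     def steer(hidden, control_vec, strength=1.0):
        |         # add the steering direction to the residual stream
        |         return hidden + strength * control_vec
        |     
        |     hidden = np.random.randn(4096)    # one token's activations
        |     # direction found offline (e.g. from contrastive prompt
        |     # pairs); here it's random, purely for illustration
        |     direction = np.random.randn(4096)
        |     direction /= np.linalg.norm(direction)
        |     steered = steer(hidden, direction, strength=5.0)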
        
         | refulgentis wrote:
         | Can control vectors replace RAG?
         | 
         | i.e. if I want the model to give me a summary of the news
         | today, and the model was trained before today, can control
         | vectors help?
        
           | p1esk wrote:
           | No technique can get you the news other than actually
           | searching for and then parsing the published news.
        
             | refulgentis wrote:
             | Can a control vector replace system prompts?
             | 
             | i.e. can it do in-context learning without the context?
        
               | jncfhnb wrote:
               | It more or less is the same as a system prompt
        
               | refulgentis wrote:
               | So, no
        
         | Der_Einzige wrote:
         | Whoever is downvoting this post needs to stop.
         | 
          | The concepts behind control vectors, i.e. "representation
          | engineering", are not especially new and have been highly
          | effective in the diffusion space. I always find it
          | entertaining when LLM folks act like they're discovering
          | stuff that waifu stable diffusion folks have known about for
          | 6+ months - like "concept slider loras".
        
           | refulgentis wrote:
           | I don't know what you mean, can you help me?
           | 
           | I'm familiar with our intrepid stable diffusion sailors.
           | 
           | I don't know why you think the post is being downvoted.
           | 
           | I don't know why it would be verboten to downvote it, or
           | indicative of the downvoter being an LLM fanatic who thinks
           | they discovered everything.
           | 
           | I am puzzled by the post because it claims RAG can be
           | replaced by control vectors.
           | 
           | I'm also puzzled because it claims prompts can be replaced by
           | control vectors.
           | 
           | I get that if system prompts were only to shift output tone,
           | control vectors could replace that case, but that seems
           | narrow compared to the full set of things prompt input
           | enables (inter alia, the in-context learning)
        
           | CuriouslyC wrote:
           | You are right that playing with AI image generation models is
           | really good for building intuition about AI models in
           | general, even if they seem superficially different. It's kind
           | of like surveying a battlefield from the air.
        
           | jncfhnb wrote:
           | Most of these things aren't much better than a single
           | weighted token though
        
       | behnamoh wrote:
       | Can the low adoption of Mamba be attributed to what is being
       | discussed today on HN
       | (https://news.ycombinator.com/item?id=39491863)?
       | 
        | Basically, Nvidia et al. don't want AI research to move in a
        | direction that requires less GPU compute, less training data,
        | and less inference compute.
       | 
       | Someone on HN (I don't remember the name) mentioned that the idea
       | of deep learning is backed by big tech because it benefits them
       | the most as they are the only players in town with huge amounts
       | of data. If the AI community would find entirely different
       | approaches to AGI (maybe not even learning), who do you think
       | would suffer the most from the implications?
        
         | p1esk wrote:
         | This doesn't make sense - there are literally thousands of
         | academic AI research labs who are severely limited by compute
         | resources. If anything could work better than transformers and
         | require less compute they would be all over that.
        
           | behnamoh wrote:
           | I guess the argument is that most AI research is supported by
           | the big tech, and they have heavily invested in the deep
           | learning approach.
           | 
            | If the funding were funneled to research groups working
            | on alternative approaches, maybe we'd see the same amount
            | of progress in AI, just via another approach.
        
             | kettleballroll wrote:
             | As a member of the research community: that's nonsense.
              | As already pointed out, academic groups (who are by no
              | means dependent on big tech) would jump all over that.
              | Mamba has been out long enough that you'd already see
              | tons of papers on arXiv showing Mamba dominating
              | transformers in all sorts of applications. But that's
              | not happening, despite the ton of hype. That doesn't
              | mean that Mamba is nonsense - just that it isn't the
              | immediate transformer killer. It remains to be seen if
              | something comes of it, eventually.
        
               | godelski wrote:
               | As a member of the research community: that's nonsense.
                | Publishing is an extremely noisy process in ML, and
                | it is getting increasingly difficult for smaller labs
                | not collaborating with big tech. Reviewers' go-to
                | objections are: more datasets, more scale, not novel.
                | The easiest way to address this is to work off of
                | pretrained models. This is probably most obvious in
                | the NLP world.
               | 
               | I agree that Mamba doesn't solve everything and it still
               | needs work. But I disagree with the logic that there
               | isn't an issue of railroading.
        
               | p1esk wrote:
               | What's the main difference between an ape's brain and a
               | human brain? Scale. So that's the train we're riding at
               | the moment. No roadblocks yet, aside from cost.
        
               | godelski wrote:
               | > What's the main difference between an ape's brain and a
               | human brain? Scale.
               | 
                | This is incredibly naive, with absolutely no
                | scientific basis. There is no evidence that the
                | difference is scale of data or scale of architecture.
               | 
               | There are a number of animals with larger brains in terms
               | of both mass and total number of neurons. An African
               | Elephant has roughly 3x the number of neurons humans
               | have. Dolphins beat humans in total surface area.
               | Neanderthals are estimated to have had larger brains too!
                | It isn't mass, neuron count, neuron density, or
                | surface area. We aren't just scaled-up chimps.
        
               | p1esk wrote:
               | Other animals with larger brains might have other
               | bottlenecks preventing them from reaching full potential
               | of their intelligence. Neanderthals might have been
               | smarter than us, but went extinct for reasons not related
               | to intelligence.
               | 
                | But my point stands - our brains evolved directly
                | from ape brains, and the main difference between them
                | and us is brain size.
        
               | godelski wrote:
               | > Other animals with larger brains might have other
               | bottlenecks
               | 
               | >>> What's the main difference between an ape's brain and
               | a human brain? Scale.
               | 
               | Your argument is inconsistent. Very clearly everything
               | isn't scale or we'd use other things besides
               | transformers. Different architectures scale in different
               | ways and everything has different inductive biases. No
               | one doubts scale is important, but there's a lot more.
        
               | p1esk wrote:
               | Scale is all we need for transformers (so far). It might
               | also be all we need for ape brains. It's not all we need
               | for whatever elephant or dolphin brains evolved from.
               | 
               | When this stops being the case for transformers, we will
               | need something else. I'm just pointing out it's not the
               | case yet.
        
               | godelski wrote:
               | I see no evidence of this in biology nor in ML. I've read
               | those scale papers. I've worked on scale myself. I'll bet
               | the farm that scale isn't all you need. But I won't be
               | surprised if people say that it is all scale.
               | 
                | If you really think it is all scale, train a 7T
                | ResNet/MLP-based model for NLP. If scale is all you
                | need, make an LLM without DPO or RLHF. If scale is
                | all you need, make SD3 with a GAN. Or what about a
                | VAE, a Normalizing Flow, an HMM? Do it with different
                | optimizers. Do it with gradient-free methods. Do it
                | with different loss functions.
               | 
               | The bitter lesson wasn't "scale is all you need." That's
               | just a misinterpretation.
               | 
               | Edit: It's fine to disagree. We can compete on the ideas
               | and methods. That's good for our community. So continue
               | down yours, and I'll continue down mine. All I ask is
               | that since your camp is more popular, you don't block
               | those on my side. If you're right, we'll get to AGI soon.
               | If we're right, we still might. But if we're right and
               | you all block us, we'll get another AI winter in between.
               | If we're right and you all don't block us, we can
               | progress without skipping a beat. Just don't put all your
               | eggs in one basket. It isn't good for anyone.
        
               | p1esk wrote:
               | I said "scale is all you need _for transformers_ ". That
               | has been true since GPT1. The best way to improve our
               | best model today still seems to be "make it larger and
               | train it on more data".
               | 
               | If you disagree please suggest a better way, or at least
               | provide evidence that scaling up no longer works for
               | transformers.
        
               | algo_trader wrote:
               | > Just that it isn't the immediate transformer killer.
               | 
                | What is the best/stable-ish linear alternative to
                | transformers right now? Especially for text
                | generation and summarization.
                | 
                | We have domain-specific ways of oversampling and
                | search, so we much prefer less expensive models.
        
             | landryraccoon wrote:
             | Why would Meta, Microsoft, Amazon and Google want Nvidia to
             | remain dominant in hardware? Are you treating "big tech"
             | like they all have one hive mind?
        
               | behnamoh wrote:
               | For MSFT, AMZN, GOOG, the competitive advantage comes
               | from having huge datasets (that Nvidia doesn't have).
               | It's a symbiosis that benefits the data-rich and GPU-rich
               | players.
        
               | pixl97 wrote:
                | This still makes little sense, as scale will always
                | matter. If you can drop the compute cost of a model
                | by 10x, it means you can increase model
                | integrity/intelligence/speed etc. beyond what your
                | compute-bound competitors have.
                | 
                | Simply put, for the time being huge datasets are
                | going to be needed, and those with bigger (cleaner?)
                | datasets will have a better-behaving model.
        
               | fauigerzigerk wrote:
               | Where is the symbiosis? If data is the differentiator,
               | how do the data owners benefit from Nvidia eating into
               | their margins?
        
         | jahewson wrote:
         | This early in the game? No. If LLMs become vastly cheaper and
         | faster then adoption (and model size) will increase in line
         | with that.
        
         | lamson wrote:
          | I don't really think big tech has that much control. They
          | are aiming for the optimal-profit solution; things like an
          | AI monopoly built on huge self-supervised learning only
          | popped up recently, when ChatGPT performed really well. A
          | couple of years ago, people still believed modular,
          | supervised learning was the key to AI applications. Simply
          | put, currently scaling deep learning/LLMs is the most
          | promising approach, and it works while traditional methods
          | don't. If there were something that worked as well as the
          | current solution and required fewer resources, they would
          | go for it very fast - see the implementation of Flash
          | Attention as an example.
        
         | imjonse wrote:
          | Low adoption is primarily caused by it being relatively
          | recent: there are no 7B-or-larger public Mamba-based models
          | with which to start an earnest comparison against widely
          | used transformer-based LLMs.
        
         | fbdab103 wrote:
         | It is a really recent development. Even if this architecture is
         | technically superior, it could take time before a model using
         | it becomes competitive.
         | 
         | Or maybe it does not pan out at all. We are still at the stage
         | where people are throwing everything at the wall to see what
         | sticks. Some promising ideas which work at small scale do not
          | work at larger scale.
        
           | CuriouslyC wrote:
           | This. Hyperparameter tuning and training include a lot of
           | model specific black magic. Transformers have had time to
           | mature, it'll take a while for other stuff to catch up even
           | if they have a higher potential ceiling.
        
             | koayon wrote:
             | Definitely agree that a lot of work going into
             | hyperparameter tuning and maturing the ecosystem will be
             | key here!
             | 
              | I'm seeing the Mamba paper as the `Attention Is All You
              | Need` of SSMs - it might take a little while before we
              | get everything optimised to the point of a GPT-4 (it
              | took 6 years for transformers, but it should be faster
              | than that now, with all the attention on ML).
        
             | koayon wrote:
              | Another interesting point is that the hardware isn't
              | really optimised for Mamba yet either - ideally we'd
              | want more of the fast SRAM so that we can store larger
              | hidden states efficiently.
        
         | nbardy wrote:
          | No, these things just take time.
          | 
          | There is no conspiracy against efficient training.
          | Companies aren't going to lower compute budgets with more
          | efficiency.
          | 
          | All the top labs are increasing efficiency, but they are
          | using that to get more out of their large runs, not to
          | spend less. Most companies have a relatively fixed training
          | budget for their large runs and are trying to get the most
          | out of it, not save money.
          | 
          | Mamba is actually being scaled up and tested across other
          | fields (bio) at a rapid pace compared to other
          | architectures.
        
           | godelski wrote:
           | > There is no conspiracy
           | 
           | Fwiw, the OP isn't suggesting conspiracy. The notion is more
           | about convergent thinking.
        
         | godelski wrote:
         | Yes and no.
         | 
         | The thing is that Mamba is not perfect. There's no neural
         | architecture to rule them all, if you will. I think the bigger
         | issue is that we more act like there is and get on bandwagons.
         | Let me give a clearer example from the past so we can see. The
         | predecessor to DDPM (the work that kicked off the diffusion
         | model era) was published in 2015[0], only a year after GANs[1].
          | Diffusion then showed promise, but why did DDPM only come
          | out in 2020[2]? Because everyone was working on GANs. GANs
          | produced far better images, and diffusion was (and still
          | is) a lot more computationally intensive. Plus, all the
          | people working on these diffusion models were in the same
          | camp as those working on Normalizing Flows and other
          | density-based models, and fewer people were interested in
          | understanding density estimation.
         | 
          | So the entire problem is that the community hopped on a
          | singular railroad of research direction. There was still
          | work going on in that direction, but it wasn't getting
          | nearly the attention that GANs got. It's hard to know if
          | things were blocked from publication because they weren't
          | as good as GANs. I can say from personal experience that I
          | had a Flow publication blocked because reviewers were
          | concerned with its quality compared to GANs (this was
          | 2019/2020; that paper will never be published because now
          | it is hard to publish even GAN work).
         | 
          | So yes and no, because there is certainly railroading
          | happening, but there are also real critiques of Mamba. What
          | people often forget is that it is incredibly naive to
          | compare new methods to existing methods in a direct one-to-
          | one comparison. You're comparing something that has had
          | hundreds to thousands of hours from a handful to a few
          | dozen eyes against works with millions of hours and
          | millions of eyes. Evaluation is just a really fucking hard
          | thing to do, but it is easy to just look at some
          | benchmarks, even if they don't mean much. This is a fairly
          | generalizable notion, so take the lesson to heart. But
          | Mamba seems a bit different from our diffusion/GAN story,
          | in that it is getting more attention than diffusion did in
          | 2016-2019.
         | 
         | [0] https://arxiv.org/abs/1503.03585
         | 
         | [1] https://arxiv.org/abs/1406.2661
         | 
         | [2] https://arxiv.org/abs/2006.11239
        
         | nyrikki wrote:
          | The fact that 'removing the "quadratic bottleneck"' involves
          | either reduced expressivity compared to self-attention or
          | disproving SETH is another reason.
         | 
         | The quadratic bottleneck is due to the lower bounds of
         | exhaustive search.
         | 
         | The papers on this only ever seem to reference perplexity.
         | 
         | The fact it can append a word to "I'm going to the beach" that
         | sounds good doesn't mean it is useful.
         | 
         | There is no free lunch, and this project hasn't shown that the
         | costs are acceptable.
         | 
         | "I'm going to the beach" + house
         | 
         | Doesn't help if what you needed was
         | 
         | "I'm going to the beach" + tomorrow
         | 
          | I do hope that there is more information on the costs soon,
          | or that they really have disproven SETH.
        
           | soVeryTired wrote:
           | What's SETH in this context? I googled to no avail.
        
             | nyrikki wrote:
             | Strong Exponential Time Hypothesis
             | 
             | Here is how it relates to attention.
             | 
             | https://arxiv.org/abs/2209.04881
        
         | ssivark wrote:
         | It's less compute _for the same model sizes_. Rest assured that
         | there will still be a race to scale model sizes (and data) to
         | achieve better performance.
        
       | imjonse wrote:
       | Explaining Mamba is a rite of passage, like the monad tutorials
       | of yore.
        
         | SkyMarshal wrote:
         | Mamba is like a burrito...
        
           | kekebo wrote:
           | It gets soggy and disintegrates when not consumed swiftly?
        
         | hyperbovine wrote:
         | Similar market share too.
        
         | sja wrote:
         | Or Balks[0]:
         | 
         | BALK RULES! IMPORTANT! 1. You can't just be up there and just
         | doin' a balk like that.
         | 
         | 1a. A balk is when you
         | 
         | 1b. Okay well listen. A balk is when you balk the
         | 
         | 1c. Let me start over
         | 
         | 1c-a. The pitcher is not allowed to do a motion to the, uh,
         | batter, that prohibits the batter from doing, you know, just
         | trying to hit the ball. You can't do that.
         | 
         | 1c-b. Once the pitcher is in the stretch, he can't be over here
         | and say to the runner, like, "I'm gonna get ya! I'm gonna tag
         | you out! You better watch your butt!" and then just be like he
         | didn't even do that.
         | 
         | 1c-b(1). Like, if you're about to pitch and then don't pitch,
         | you have to still pitch. You cannot not pitch. Does that make
         | any sense?
         | 
         | 1c-b(2). You gotta be, throwing motion of the ball, and then,
         | until you just throw it.
         | 
         | 1c-b(2)-a. Okay, well, you can have the ball up here, like
         | this, but then there's the balk you gotta think about.
         | 
         | 1c-b(2)-b. Fairuza Balk hasn't been in any movies in forever. I
         | hope she wasn't typecast as that racist lady in American
         | History X.
         | 
         | 1c-b(2)-b(i). Oh wait, she was in The Waterboy too! That would
         | be even worse.
         | 
         | 1c-b(2)-b(ii). "get in mah bellah" - Adam Water, "The
         | Waterboy." Haha, classic...
         | 
         | 1c-b(3). Okay seriously though. A balk is when the pitcher
         | makes a movement that, as determined by, when you do a move
         | involving the baseball and field of
         | 
         | 2. Do not do a balk please.
         | 
         | [0]: https://justinbee.tumblr.com/post/15309101943/best-
         | explanati...
        
       | AndrewKemendo wrote:
       | Someone is going to re-invent Bellman's equations and call it
       | Learnformer
        
       | Straw wrote:
        | The SSM papers and blogs always have unnecessarily complicated
        | explanations. At this point I almost wonder if it's to hide
        | how simple the underlying algorithms are, or to make them
        | seem fancy.
       | 
        | SSMs are doing exponentially weighted moving averages (EMAs).
        | That's it: to summarize the past, you take an EMA of a
        | variable output at each time step. Mamba changes one key
        | thing: instead of decaying the past by a fixed amount each
        | step, as in a fixed-decay EMA, we have another output which
        | decides how much to forget - or equivalently, how much 'time'
        | has passed since the last observation in our EMA.
       | 
        | All of the matrix equations, continuous time, discretization,
        | etc. end up as the dynamic-forgetting EMA I describe above.
        | This also makes the benefits and limitations clear: finite
        | state size, and it has to decide at a given layer what to
        | forget before it sees the past at that layer.
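        | 
        | As a toy sketch of what I mean (not the actual Mamba kernel,
        | which is a hardware-aware parallel scan, just the recurrence):
        | 
        |     import numpy as np
        |     
        |     def dynamic_ema(xs, forget):
        |         # Fixed EMA: h = a*h + (1-a)*x with constant a.
        |         # Mamba-style: a is computed from the current input,
        |         # so the model chooses per step how much to forget.
        |         h, hs = 0.0, []
        |         for x in xs:
        |             a = forget(x)  # input-dependent decay in (0, 1)
        |             h = a * h + (1 - a) * x
        |             hs.append(h)
        |         return np.array(hs)
        |     
        |     # e.g. a sigmoid gate on the input (illustrative only):
        |     ys = dynamic_ema(np.random.randn(100),
        |                      lambda x: 1 / (1 + np.exp(-x)))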
        
         | logicchains wrote:
         | Are there any fundamental differences between Mamba, Retnet and
         | RWKV, or are they all variants of this same architecture?
        
         | binarymax wrote:
          | I hadn't heard of Mamba before reading this article, and I
          | was wondering if anyone has tried setting the importance of
          | a token via a TF-IDF or BM25 lookup. It requires a first
          | pass to construct the token index, but otherwise it seems
          | like it would address the big issue that all these
          | architectures have - they don't know how "important" a
          | token is. Interestingly, this seems to be the crux of Mamba
          | - deciding which tokens to forget! An EMA otherwise treats
          | all tokens equally at sequence time. What if the tokens
          | were weighted beforehand and the weights were passed as an
          | attention mechanism? I wonder if anyone has tried something
          | like this.
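          | 
          | Something like the following, say - a purely hypothetical,
          | untested sketch of the first pass (IDF here standing in for
          | the full BM25 score):
          | 
          |     import math
          |     from collections import Counter
          |     
          |     # first pass: score each token's importance with IDF,
          |     # to be exposed to the model as a forgetting bias
          |     def idf_weights(docs):
          |         n = len(docs)
          |         df = Counter(t for d in docs for t in set(d))
          |         return {t: math.log(n / df[t]) for t in df}
          |     
          |     docs = [["i", "went", "to", "the", "beach"],
          |             ["the", "beach", "house", "is", "closed"]]
          |     w = idf_weights(docs)
          |     # rare tokens ("house") outweigh common ones ("the")
          |     weights = [w[t] for t in docs[0]]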
        
           | halflings wrote:
           | The importance (e.g. attention) needs to be dynamic, e.g. one
           | token will be important to some other tokens but not others.
           | 
           | tf-idf and similar heuristics are what we were using before
           | attention came along, e.g. tf-idf weighted bag-of-words
           | representation of word2vec embeddings. That approaches fails
           | in so many cases.
        
             | binarymax wrote:
             | Attention in transformers works because over time the model
             | learns token importance based on frequency and context.
             | 
              | If you don't have attention and need a fast substitute
              | for "forgetting" unimportant tokens, then BM25 is an
              | intuitive hypothesis.
        
         | ogogmad wrote:
            | That might explain why the D variable is used and varied,
            | but not the "Selectivity", which the article says is
            | expressed by how the matrices B and C vary while consuming
            | input.
            | 
            | Something I've noticed is that B, C and D depend only on
            | the current token. See this:
            | https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith...
            | Another thing I've noticed is that the definition of "SSM"
            | in the image I've linked to is apparently recursive. This
            | is also in the arXiv paper. Strange.
         | 
         | +1 though for making me go back to the article and read it more
         | carefully! +1 also to the article.
        
           | ogogmad wrote:
           | OK, I've noticed that the pseudo-code above is vectorised,
           | and so there's no recursion. The SSM function is actually
           | described at the start of the paper, and an efficient
           | hardware-aware implementation is suggested in section 3:
           | https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
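            | 
            | For the curious: the reason no step-by-step recursion is
            | needed is that a linear recurrence h_t = a_t*h_{t-1} + b_t
            | forms an associative operator, so it can be computed with
            | a (parallelisable) prefix scan. A sequential sketch of
            | that operator (the paper's version runs it as a parallel,
            | hardware-aware scan; this is just to show the algebra):
            | 
            |     import numpy as np
            |     
            |     def combine(l, r):
            |         # compose h -> al*h + bl, then h -> ar*h + br
            |         (al, bl), (ar, br) = l, r
            |         return (al * ar, bl * ar + br)
            |     
            |     def scan(pairs):
            |         out, acc = [], (1.0, 0.0)  # identity element
            |         for p in pairs:
            |             acc = combine(acc, p)
            |             out.append(acc[1])     # acc[1] is h_t
            |         return np.array(out)
            |     
            |     a, b = np.random.rand(8), np.random.randn(8)
            |     hs = scan(list(zip(a, b)))  # hs[t] = a[t]*hs[t-1] + b[t]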
        
       | givemeethekeys wrote:
       | Transformers, Rise of the Mambas, coming to a theater near you!
        
       | kken wrote:
       | There is also this: https://jackcook.com/2024/02/23/mamba.html
        
       ___________________________________________________________________
       (page generated 2024-02-25 23:00 UTC)