[HN Gopher] Mamba Explained: The State Space Model Taking On Tra...
___________________________________________________________________
Mamba Explained: The State Space Model Taking On Transformers
Author : koayon
Score : 150 points
Date : 2024-02-25 16:16 UTC (6 hours ago)
(HTM) web link (www.kolaayonrinde.com)
(TXT) w3m dump (www.kolaayonrinde.com)
| fancyfredbot wrote:
| See also: https://jackcook.com/2024/02/23/mamba.html
| Der_Einzige wrote:
| First it was longformer, and linear attention models. Then it was
| RWKV and now it's Mamba. So many bombastic claims of improved
| architectural performance - and no open source models that beat
| the thing they purport to beat. The proof is always in the
| pudding, and these models will remain a curiosity for most until
| their weights are being benchmarked favorably on LLM
| leaderboards.
| digdugdirk wrote:
| Yes, that's technically accurate. But I prefer to think of the
| entire LLM space as a new scientific field that started when
| OpenAI released ChatGPT.
|
| In that context, all new research directions are valuable
| simply for the fact that they're expanding the foundation of
| the field. 5 years from now, who knows what the most effective
| models will use under the hood, but the more we can learn about
| them in general, the better.
| CityOfThrowaway wrote:
| The field of research here is far older than ChatGPT's
| release. Neural network research has been going on for at
| least 50 years.
|
| Most of the research that enabled ChatGPT was also already
| known. "Attention is all you need" was a 2017 paper.
|
| It is still a fast-evolving field, but not one that just
| kicked off.
| lettergram wrote:
| lol I think in general, LLM research traces its origins back
| to all the standard deep learning techniques: NNs, CNNs,
| LSTMs, RNNs, etc.
|
| In 2018, the release of transformers (via Google) enabled
| much more rapid training of models and more generalization
| with less data. 100% of the LLMs (as you'd probably think of
| them) trace their origins to BERT.
|
| That said, my team was working with hundred-million to low-
| billion parameter LSTMs & CNNs back in 2016-2017 that were
| comparable to some lighter-weight LLMs today.
|
| In my opinion, the greatest strides in the space have less to
| do with the underlying architecture, and more to do with
| improved data formatting, accessibility and compute
| improvements.
| sigmoid10 wrote:
| True, but bear in mind the Mamba preprint is less than three
| months old. A lot of people are probably experimenting with
| these ideas right now, and training a completely new, large
| foundation model with a different architecture will take a
| significant amount of time.
| imjonse wrote:
| Most (all?) open-ish 7B+ models today are finetunes of
| proprietary/semi-closed/big-budget LLMs. There is no such
| foundation model for Mamba yet.
| thecolorgreen wrote:
| Why doesn't Equation 1b use the h' defined in Equation 1a?
| atlacatl_sv wrote:
| I believe h' is for the next state. y(t) is to predict the next
| word, so it uses the current hidden state h(t).
| koayon wrote:
| Hey! OP here. Great question - h' in Equation 1a refers to the
| derivative of h with respect to time (t). This is a
| differential equation which we can solve mathematically when we
| have x in order to get a closed-form solution for h. We would
| then plug that h (the hidden state) into Equation 1b.
|
| In our case, we don't actually wait for a closed-form solution
| but instead compute the discrete representation (Equation 2).
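|
| If it helps to see it concretely, here's a minimal sketch of
| that discrete recurrence (a simplified, diagonal-A version in
| the spirit of Equation 2, not the paper's exact code):
|
|     # Zero-order-hold discretisation of
|     #   h'(t) = A h(t) + B x(t);  y(t) = C h(t)
|     # assuming a diagonal (per-channel) A and a fixed step dt.
|     import numpy as np
|
|     def ssm_scan(A, B, C, dt, xs):
|         A_bar = np.exp(dt * A)         # discretised state matrix
|         B_bar = (A_bar - 1.0) / A * B  # ZOH integral of the input term
|         h = np.zeros_like(A)           # hidden state, one value per channel
|         ys = []
|         for x in xs:                   # recurrent form; training uses a scan
|             h = A_bar * h + B_bar * x  # Equation 1a, discretised
|             ys.append(np.sum(C * h))   # Equation 1b: read y out of h
|         return np.array(ys)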
|
| Hope that helps!
| CrypticShift wrote:
| > In other words, you can drag and drop downloaded states into
| your model, like literal plug-in cartridges
|
| The same could be said of "control vectors" [1]. Both ideas are
| still experimental, but it seems to me (if I'm not mistaken)
| that they could replace "system prompts" and "RAG" respectively.
|
| [1] https://news.ycombinator.com/item?id=39414532
| refulgentis wrote:
| Can control vectors replace RAG?
|
| i.e. if I want the model to give me a summary of the news
| today, and the model was trained before today, can control
| vectors help?
| p1esk wrote:
| No technique can get you the news other than actually
| searching for and then parsing the published news.
| refulgentis wrote:
| Can a control vector replace system prompts?
|
| i.e. can it do in-context learning without the context?
| jncfhnb wrote:
| It more or less is the same as a system prompt
| refulgentis wrote:
| So, no
| Der_Einzige wrote:
| Whoever is downvoting this post needs to stop.
|
| The concepts behind control vectors, i.e. "representation
| engineering", are not especially new and have been highly
| effective in the diffusion space. I always find it entertaining
| when LLM folks act like they're discovering stuff that waifu
| stable diffusion folks knew about 6+ months ago - like
| "concept slider loras".
| refulgentis wrote:
| I don't know what you mean, can you help me?
|
| I'm familiar with our intrepid stable diffusion sailors.
|
| I don't know why you think the post is being downvoted.
|
| I don't know why it would be verboten to downvote it, or
| indicative of the downvoter being an LLM fanatic who thinks
| they discovered everything.
|
| I am puzzled by the post because it claims RAG can be
| replaced by control vectors.
|
| I'm also puzzled because it claims prompts can be replaced by
| control vectors.
|
| I get that if system prompts were only there to shift output
| tone, control vectors could replace that case, but that seems
| narrow compared to the full set of things prompt input
| enables (inter alia, in-context learning).
| CuriouslyC wrote:
| You are right that playing with AI image generation models is
| really good for building intuition about AI models in
| general, even if they seem superficially different. It's kind
| of like surveying a battlefield from the air.
| jncfhnb wrote:
| Most of these things aren't much better than a single
| weighted token though
| behnamoh wrote:
| Can the low adoption of Mamba be attributed to what is being
| discussed today on HN
| (https://news.ycombinator.com/item?id=39491863)?
|
| Basically, Nvidia et al. don't want AI research to move in a
| direction that requires less GPU compute, less training data, and
| less inference compute.
|
| Someone on HN (I don't remember the name) mentioned that the idea
| of deep learning is backed by big tech because it benefits them
| the most as they are the only players in town with huge amounts
| of data. If the AI community were to find entirely different
| approaches to AGI (maybe not even based on learning), who do you
| think would suffer the most from the implications?
| p1esk wrote:
| This doesn't make sense - there are literally thousands of
| academic AI research labs that are severely limited by compute
| resources. If anything could work better than transformers and
| require less compute, they would be all over it.
| behnamoh wrote:
| I guess the argument is that most AI research is supported by
| big tech, which has heavily invested in the deep
| learning approach.
|
| If the funding were funneled to research groups working on
| alternative approaches, maybe we'd see the same amount of
| progress in AI using an entirely different approach.
| kettleballroll wrote:
| As a member of the research community: that's nonsense.
| As already pointed out, academic groups (who are by no means
| dependent on big tech) would jump all over that. Mamba
| has been out long enough that you'd already see tons of
| papers on arXiv showing Mamba dominating transformers in
| all sorts of applications. But that's not happening,
| despite the ton of hype. That doesn't mean that Mamba is
| nonsense. Just that it isn't the immediate transformer
| killer. It remains to be seen if something comes of it,
| eventually.
| godelski wrote:
| As a member of the research community: that's nonsense.
| Publishing is an extremely noisy process in ML and is
| getting increasingly difficult for smaller labs not
| collaborating with big tech. Reviewers' go-to objections are:
| more datasets, more scale, not novel. The easiest way to get
| past this is to work off of pretrained models. This is probably
| more obvious in the NLP world.
|
| I agree that Mamba doesn't solve everything and it still
| needs work. But I disagree with the logic that there
| isn't an issue of railroading.
| p1esk wrote:
| What's the main difference between an ape's brain and a
| human brain? Scale. So that's the train we're riding at
| the moment. No roadblocks yet, aside from cost.
| godelski wrote:
| > What's the main difference between an ape's brain and a
| human brain? Scale.
|
| This is incredibly naive, with absolutely no scientific
| basis. There is no evidence that the difference is a matter
| of scale, whether of data or of architecture.
|
| There are a number of animals with larger brains in terms
| of both mass and total number of neurons. An African
| elephant has roughly 3x the number of neurons humans
| have. Dolphins beat humans in total surface area.
| Neanderthals are estimated to have had larger brains too!
| It isn't mass, neuron count, neuron density, or surface area.
| We aren't just scaled-up chimps.
| p1esk wrote:
| Other animals with larger brains might have other
| bottlenecks preventing them from reaching the full
| potential of their intelligence. Neanderthals might have
| been smarter than us, but went extinct for reasons not
| related to intelligence.
|
| But my point stands - our brains evolved directly from
| ape brains, and the main difference between them and us is
| brain size.
| godelski wrote:
| > Other animals with larger brains might have other
| bottlenecks
|
| >>> What's the main difference between an ape's brain and
| a human brain? Scale.
|
| Your argument is inconsistent. Very clearly everything
| isn't scale or we'd use other things besides
| transformers. Different architectures scale in different
| ways and everything has different inductive biases. No
| one doubts scale is important, but there's a lot more.
| p1esk wrote:
| Scale is all we need for transformers (so far). It might
| also be all we need for ape brains. It's not all we need
| for whatever elephant or dolphin brains evolved from.
|
| When this stops being the case for transformers, we will
| need something else. I'm just pointing out it's not the
| case yet.
| godelski wrote:
| I see no evidence of this in biology or in ML. I've read
| those scale papers. I've worked on scale myself. I'll bet
| the farm that scale isn't all you need. But I won't be
| surprised if people say that it is all scale.
|
| If you really think it is all scale, train a 7T ResNet-
| MLP-based model for NLP. If scale is all you need, make an
| LLM without DPO or RLHF. If scale is all you need, make
| SD3 with a GAN. Or what about a VAE, a Normalizing Flow, an
| HMM? Do it with different optimizers. Do it with gradient-
| free methods. Do it with different loss functions.
|
| The bitter lesson wasn't "scale is all you need." That's
| just a misinterpretation.
|
| Edit: It's fine to disagree. We can compete on the ideas
| and methods. That's good for our community. So continue
| down yours, and I'll continue down mine. All I ask is
| that since your camp is more popular, you don't block
| those on my side. If you're right, we'll get to AGI soon.
| If we're right, we still might. But if we're right and
| you all block us, we'll get another AI winter in between.
| If we're right and you all don't block us, we can
| progress without skipping a beat. Just don't put all your
| eggs in one basket. It isn't good for anyone.
| p1esk wrote:
| I said "scale is all you need _for transformers_ ". That
| has been true since GPT1. The best way to improve our
| best model today still seems to be "make it larger and
| train it on more data".
|
| If you disagree, please suggest a better way, or at least
| provide evidence that scaling up no longer works for
| transformers.
| algo_trader wrote:
| > Just that it isn't the immediate transformer killer.
|
| What is the best/stable-ish linear alternative to
| transformers right now? Especially for text generation and
| summarization.
|
| We have domain-specific ways of oversampling and search,
| so we much prefer less expensive models.
| landryraccoon wrote:
| Why would Meta, Microsoft, Amazon and Google want Nvidia to
| remain dominant in hardware? Are you treating "big tech"
| like they all have one hive mind?
| behnamoh wrote:
| For MSFT, AMZN, GOOG, the competitive advantage comes
| from having huge datasets (that Nvidia doesn't have).
| It's a symbiosis that benefits the data-rich and GPU-rich
| players.
| pixl97 wrote:
| This still makes little sense, as scale will always
| matter. If you can drop the compute cost of a model by
| 10x, it means you can increase model
| integrity/intelligence/speed etc. beyond what your compute-
| bound competitors have.
|
| Simply put, for the time being huge datasets are going to
| be needed, and those with bigger (cleaner?) datasets will
| have a better-behaving model.
| fauigerzigerk wrote:
| Where is the symbiosis? If data is the differentiator,
| how do the data owners benefit from Nvidia eating into
| their margins?
| jahewson wrote:
| This early in the game? No. If LLMs become vastly cheaper and
| faster, then adoption (and model size) will increase in line
| with that.
| lamson wrote:
| I don't really think big tech has that much control. They are
| aiming for the most profitable solution. Things like an AI
| monopoly built on huge self-supervised models only popped up
| recently, once ChatGPT performed really well; a couple of years
| ago, people still believed modular, supervised learning was the
| key to AI applications. Right now, simply scaling deep
| learning/LLMs is the most promising approach, and it works
| where traditional methods don't. If there were something that
| worked as well as the current solutions and required fewer
| resources, they would go for it very fast - see the
| implementation of FlashAttention as an example.
| imjonse wrote:
| Low adoption is primarily caused by it being relatively recent,
| and by there being no 7B-or-larger public Mamba-based models
| with which to start an earnest comparison against widely used
| transformer-based LLMs.
| fbdab103 wrote:
| It is a really recent development. Even if this architecture is
| technically superior, it could take time before a model using
| it becomes competitive.
|
| Or maybe it does not pan out at all. We are still at the stage
| where people are throwing everything at the wall to see what
| sticks. Some promising ideas which work at small scale do not
| work at larger ones.
| CuriouslyC wrote:
| This. Hyperparameter tuning and training involve a lot of
| model-specific black magic. Transformers have had time to
| mature; it'll take a while for other stuff to catch up, even
| if it has a higher potential ceiling.
| koayon wrote:
| Definitely agree that a lot of work going into
| hyperparameter tuning and maturing the ecosystem will be
| key here!
|
| I'm seeing the Mamba paper as the `Attention Is All You
| Need` of SSMs - it might take a little while before we get
| everything optimised to the point of a GPT-4 (it took 6
| years for transformers, but it should be faster than that now
| with all the attention on ML).
| koayon wrote:
| Another interesting one is that the hardware isn't really
| optimised for Mamba yet either - ideally we'd want more of
| the fast SRAM so that we can store larger hidden
| states efficiently.
| nbardy wrote:
| No, these things just take time.
|
| There is no conspiracy against efficient training. Companies
| aren't going to lower compute budgets with more efficiency.
|
| All the top labs are increasing efficiency, but they are using
| that to get more out of their large runs, not to spend less.
| Most companies have a relatively fixed training budget for
| their large runs and are trying to get the most out of it, not
| save money.
|
| Mamba is actually being scaled up and tested across other
| fields (bio) at a rapid pace compared to other architectures.
| godelski wrote:
| > There is no conspiracy
|
| Fwiw, the OP isn't suggesting a conspiracy. The notion is more
| about convergent thinking.
| godelski wrote:
| Yes and no.
|
| The thing is that Mamba is not perfect. There's no neural
| architecture to rule them all, if you will. I think the bigger
| issue is that we act as if there is and get on bandwagons.
| Let me give a clearer example from the past. The
| predecessor to DDPM (the work that kicked off the diffusion
| model era) was published in 2015[0], only a year after GANs[1].
| Diffusion showed promise then, so why did DDPM only come out in
| 2020[2]? Because everyone was working on GANs. GANs produced
| far better images, and diffusion was (and still is) a lot more
| computationally intensive. Plus, all the people working on
| these diffusion models were in the same camp as those working
| on Normalizing Flows and other density-based models, and fewer
| people are interested in understanding density estimation.
|
| So the entire problem is that the community hopped onto a
| single railroad of a research direction. There was still
| work going on in that direction, but it wasn't getting nearly
| the attention that GANs got. It's hard to know if things were
| blocked from publication because they weren't as good as GANs.
| I can say from personal experience that I had a Flow publication
| blocked because reviewers were concerned with its quality
| compared to GANs (this was 2019/2020; the paper will never be
| published because now it is hard to publish even a GAN work).
|
| So yes and no, because there is certainly railroading happening,
| but there are also real critiques of Mamba. What people
| often forget is that it is incredibly naive to compare new
| methods to existing methods in a direct one-to-one comparison.
| You're comparing something that has had hundreds to
| thousands of hours from a handful to a few dozen pairs of eyes
| against works with millions of hours and millions of eyes.
| Evaluation is just a really fucking hard thing to do, but it is
| easy to just look at some benchmarks, even if they don't mean
| much. This is a fairly general notion though, so take the
| lesson to heart. But Mamba seems a bit different from our
| diffusion/GAN story, in that it is getting more attention than
| diffusion did in 2016-2019.
|
| [0] https://arxiv.org/abs/1503.03585
|
| [1] https://arxiv.org/abs/1406.2661
|
| [2] https://arxiv.org/abs/2006.11239
| nyrikki wrote:
| The fact that removing the "quadratic bottleneck" involves
| either reduced expressivity compared to self-attention or
| disproving SETH is another reason.
|
| The quadratic bottleneck is due to the lower bounds of
| exhaustive search.
|
| The papers on this only ever seem to reference perplexity.
|
| The fact that it can append a word to "I'm going to the beach"
| that sounds good doesn't mean it is useful.
|
| There is no free lunch, and this project hasn't shown that the
| costs are acceptable.
|
| "I'm going to the beach" + house
|
| Doesn't help if what you needed was
|
| "I'm going to the beach" + tomorrow
|
| I do hope that there is more information on the costs soon, or
| that they have indeed disproven SETH.
| soVeryTired wrote:
| What's SETH in this context? I googled to no avail.
| nyrikki wrote:
| Strong Exponential Time Hypothesis
|
| Here is how it relates to attention.
|
| https://arxiv.org/abs/2209.04881
| ssivark wrote:
| It's less compute _for the same model sizes_. Rest assured that
| there will still be a race to scale model sizes (and data) to
| achieve better performance.
| imjonse wrote:
| Explaining Mamba is a rite of passage, like the monad tutorials
| of yore.
| SkyMarshal wrote:
| Mamba is like a burrito...
| kekebo wrote:
| It gets soggy and disintegrates when not consumed swiftly?
| hyperbovine wrote:
| Similar market share too.
| sja wrote:
| Or Balks[0]:
|
| BALK RULES! IMPORTANT! 1. You can't just be up there and just
| doin' a balk like that.
|
| 1a. A balk is when you
|
| 1b. Okay well listen. A balk is when you balk the
|
| 1c. Let me start over
|
| 1c-a. The pitcher is not allowed to do a motion to the, uh,
| batter, that prohibits the batter from doing, you know, just
| trying to hit the ball. You can't do that.
|
| 1c-b. Once the pitcher is in the stretch, he can't be over here
| and say to the runner, like, "I'm gonna get ya! I'm gonna tag
| you out! You better watch your butt!" and then just be like he
| didn't even do that.
|
| 1c-b(1). Like, if you're about to pitch and then don't pitch,
| you have to still pitch. You cannot not pitch. Does that make
| any sense?
|
| 1c-b(2). You gotta be, throwing motion of the ball, and then,
| until you just throw it.
|
| 1c-b(2)-a. Okay, well, you can have the ball up here, like
| this, but then there's the balk you gotta think about.
|
| 1c-b(2)-b. Fairuza Balk hasn't been in any movies in forever. I
| hope she wasn't typecast as that racist lady in American
| History X.
|
| 1c-b(2)-b(i). Oh wait, she was in The Waterboy too! That would
| be even worse.
|
| 1c-b(2)-b(ii). "get in mah bellah" - Adam Water, "The
| Waterboy." Haha, classic...
|
| 1c-b(3). Okay seriously though. A balk is when the pitcher
| makes a movement that, as determined by, when you do a move
| involving the baseball and field of
|
| 2. Do not do a balk please.
|
| [0]: https://justinbee.tumblr.com/post/15309101943/best-
| explanati...
| AndrewKemendo wrote:
| Someone is going to re-invent Bellman's equations and call it
| Learnformer
| Straw wrote:
| The SSM papers and blogs always have unnecessarily complicated
| explanations. At this point I almost wonder if it's to hide how
| simple the underlying algorithms are, or to make them seem fancy.
|
| SSMs are doing exponentially weighted moving averages (EMAs).
| That's it - to summarize the past, you take an EMA of a variable
| output at each time step. Mamba changes one key thing: instead of
| decaying the past by a fixed amount each step as in a constant-
| time EMA, we have another output which decides how much to
| forget - or, equivalently, how much 'time' has passed since the
| last observation in our EMA.
|
| All of the matrix equations, continuous time, discretization,
| etc., end up as the dynamic-forgetting EMA I describe
| above. This also makes the benefits and limitations clear: a
| finite state size, and having to decide at a given layer what to
| forget before seeing the rest of the sequence at that layer.
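|
| In code, the whole idea is a few lines (an illustrative sketch
| of the dynamic-forgetting EMA, not the paper's exact
| parameterisation):
|
|     import numpy as np
|
|     def dynamic_ema(xs, decays):
|         # xs: the values to summarize at each step.
|         # decays: per-step forget factors in (0, 1), predicted
|         # from the input; a fixed decay recovers a plain EMA.
|         h = 0.0
|         states = []
|         for x, a in zip(xs, decays):
|             h = a * h + (1.0 - a) * x  # small a => forget the past faster
|             states.append(h)
|         return np.array(states)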
| logicchains wrote:
| Are there any fundamental differences between Mamba, RetNet and
| RWKV, or are they all variants of this same architecture?
| binarymax wrote:
| I hadn't heard of Mamba before reading this article, and I was
| wondering if anyone has tried setting the importance of a token
| via a TF-IDF or BM25 lookup. It requires a first pass to
| construct the token index, but otherwise it seems like it would
| address the big issue that all these architectures have - they
| don't know how "important" a token is. Interestingly, this seems
| to be the crux of Mamba - deciding what tokens to forget! EMA
| otherwise treats all tokens equally at each step. What if the
| tokens were weighted beforehand and the weights were passed as
| an attention mechanism? I wonder if anyone has tried something
| like this.
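|
| As a purely hypothetical sketch of that first pass (no existing
| architecture does exactly this - the IDF table just becomes a
| static importance prior per token):
|
|     import math
|     from collections import Counter
|
|     def idf_table(docs):
|         # docs: list of token lists from the corpus.
|         n = len(docs)
|         df = Counter(tok for doc in docs for tok in set(doc))
|         return {tok: math.log((n + 1) / (c + 1)) + 1.0
|                 for tok, c in df.items()}
|
|     def token_weights(tokens, idf, default=1.0):
|         # Static importances that could be fed in alongside the
|         # embeddings as a crude stand-in for learned attention.
|         return [idf.get(t, default) for t in tokens]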
| halflings wrote:
| The importance (e.g. attention) needs to be dynamic: one
| token will be important to some other tokens but not others.
|
| tf-idf and similar heuristics are what we were using before
| attention came along, e.g. tf-idf-weighted bag-of-words
| representations of word2vec embeddings. That approach fails
| in so many cases.
| binarymax wrote:
| Attention in transformers works because, over time, the model
| learns token importance based on frequency and context.
|
| If you don't have attention and need a fast substitute for
| "forgetting" non-important tokens, then BM25 is an
| intuitive hypothesis.
| ogogmad wrote:
| That might explain the motivation for why the D variable is
| used and varied; but not the "Selectivity", which the article
| says is expressed by how the matrices B and C vary while
| consuming input.
|
| Something I've noticed is that B, C and D depend only on the
| current token. See this:
| https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith...
| Another thing I've noticed is that the definition of
| "SSM" in the image I've linked to is apparently recursive. This
| is also in the arXiv paper. Strange.
|
| +1 though for making me go back to the article and read it more
| carefully! +1 also to the article.
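|
| To make the "depends only on the current token" point concrete,
| here's a simplified sketch of the selection step (in the spirit
| of Algorithm 2 in the paper; shapes and names are simplified):
|
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class SelectiveParams(nn.Module):
|         def __init__(self, d_model, d_state):
|             super().__init__()
|             self.to_B = nn.Linear(d_model, d_state)
|             self.to_C = nn.Linear(d_model, d_state)
|             self.to_dt = nn.Linear(d_model, 1)
|
|         def forward(self, x):
|             # x: (batch, seq_len, d_model). B, C and the step
|             # size dt are functions of the current token only,
|             # which is what lets them vary while consuming input.
|             B = self.to_B(x)
|             C = self.to_C(x)
|             dt = F.softplus(self.to_dt(x))  # keep step size positive
|             return B, C, dt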
| ogogmad wrote:
| OK, I've noticed that the pseudo-code above is vectorised,
| and so there's no recursion. The SSM function is actually
| described at the start of the paper, and an efficient
| hardware-aware implementation is suggested in section 3:
| https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
| givemeethekeys wrote:
| Transformers, Rise of the Mambas, coming to a theater near you!
| kken wrote:
| There is also this: https://jackcook.com/2024/02/23/mamba.html
___________________________________________________________________
(page generated 2024-02-25 23:00 UTC)