[HN Gopher] Mamba Explained
       ___________________________________________________________________
        
       Mamba Explained
        
       Author : andreyk
       Score  : 174 points
        Date   : 2024-03-30 16:04 UTC (1 day ago)
        
 (HTM) web link (thegradient.pub)
 (TXT) w3m dump (thegradient.pub)
        
       | xz18r wrote:
       | I just have to say it: that image shows gunpla, i.e. Mobile Suit
       | Gundam, not Transformers!
        
         | throwup238 wrote:
         | An official request has been made to ICANN to rescind the OP's
         | nerd card.
        
       | andy_xor_andrew wrote:
       | > But Transformers have one core problem. In a transformer, every
       | token can look back at every previous token when making
       | predictions.
       | 
       | Lately I've been wondering... is this a problem, or a strength?
       | 
       | It might be a fallacy to compare how LLMs "think" with how humans
       | think. But humor me for a second. When you are speaking, each
       | time you emit a word, you are not attending to every previous
       | word in your sentence (like transformers), rather you have a
       | state in your mind that represents the grammar and concepts,
       | which is continuously updated as you speak (more similar to
       | SSMs).
       | 
       | Similarly, when you read a book, every time you read a word, you
       | are not attending to every previous word in the book. Your model
       | of "the book" is rather a fuzzy/approximate state that is updated
       | with new information every time a new word appears. Right? (I'm
        | sorry I know this is very handwavy and pseudoscientific but bear
       | with me).
       | 
       | Ok, so if (big if) you feel like the above is true, then to match
       | human-type language modelling, SSMs seem more human-like than
       | transformers.
       | 
        | BUT... then aren't transformers _strictly better_ in terms of
        | accuracy? A transformer never "forgets" information, as long as
        | it is within the context window, because it revisits that
        | information every time it emits a new token.
       | 
       | So let's say we can remove the "quadratic attention" problem of
       | transformers with SSMs. That's a nice training/inference
       | performance boost. But... look at where we got with "naive"
       | attention. GPT 4, Claude 3. It's not like we're hitting a wall
       | with quadratic attention. It's absurdly more expensive than SSMs,
       | but GPUs certainly aren't getting slower. If all AI work stops
       | now, and only hardware improves, it wouldn't be long until GPT4
        | could run on local hardware, right, provided Moore's law holds?
       | 
       | /end rant, not really sure what my point was, I'm not against
       | SSMs (they're cool) but rather I'm wondering if the SOTA will
       | ever be SSM when attention is so damn good
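        | 
        | (For intuition, a minimal toy sketch of the two access patterns
        | in plain numpy; untrained, made-up matrices, nothing like a real
        | implementation:)
        | 
        |     import numpy as np
        | 
        |     d, n = 16, 100
        |     tokens = np.random.randn(n, d)
        | 
        |     # Transformer-ish: re-read the whole, growing
        |     # history at every step.
        |     cache = []
        |     for x in tokens:
        |         cache.append(x)        # memory grows with t
        |         hist = np.stack(cache)
        |         s = hist @ x           # score vs every old token
        |         w = np.exp(s - s.max())
        |         w = w / w.sum()
        |         out = w @ hist         # exact recall of the past
        | 
        |     # SSM/RNN-ish: only a fixed-size state survives.
        |     A = 0.9 * np.eye(d)        # toy, untrained dynamics
        |     B = 0.1 * np.random.randn(d, d)
        |     h = np.zeros(d)
        |     for x in tokens:
        |         h = A @ h + B @ x      # constant memory, lossy
        |         out = h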
        
         | maccam912 wrote:
          | It depends on the task, I imagine. Writing a novel was
          | mentioned, for example: keeping important story lines in your
          | memory for a long time will be necessary, or at least
          | certainly more important than remembering what the characters
          | were eating for lunch on page 10. But if you need to find that
          | one loophole in a contract, you will probably benefit from
          | perfect recall.
        
         | spxneo wrote:
          | Very good point, and the sooner we can accept this difference
          | (we access hyperdimensional entities we discover through
          | language and math via fast and slow access, and vocalize them
          | through the alphabets we learned to read), the more
          | "intelligence" we can unlock from AI.
        
         | aCoreyJ wrote:
          | We're running out of the ability to make transistors smaller
          | and closer together, so barring some major breakthrough I
          | wouldn't expect Moore's law to continue nearly long enough to
          | get to the point of running GPT4 on consumer hardware in the
          | short term.
        
           | timschmidt wrote:
           | Ah, but we've just begun stacking transistors in the third
           | dimension.
        
             | ctrw wrote:
              | That doesn't solve the problem, it just pushes it down the
              | road a bit. The exponential growth is merely offset by a
              | constant factor once. Unless we figure out how to push
              | transistors into the 5th, 6th, etc. dimensions with every
              | new generation.
        
             | jazzyjackson wrote:
             | It was never a solution, Moore's law has more than one
             | dimension as well, not just density but heat dissipation.
             | Can't cool down a transistor that's surrounded by
             | transistors on all sides.
        
           | moffkalast wrote:
            | Well, consumer hardware can run something on the order of
            | ~50B parameters quantized at a "reasonable" price today;
            | we'd need about 5 or 6 doublings to run something GPT 4 tier
            | at 1T+. So it would need to continue for roughly a decade at
            | least?
           | 
           | Current models are horrendously inefficient though, so with
           | architectural improvements we'll have something of that
           | capability far sooner on weaker hardware.
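            | 
            | Back-of-the-envelope (the sizes are the rough guesses above,
            | and a 2-year doubling time is an assumption):
            | 
            |     import math
            | 
            |     runnable = 50e9       # ~what runs locally today
            |     target = 1e12         # assuming GPT 4 is 1T+
            |     yrs_per_doubling = 2  # optimistic Moore cadence
            | 
            |     d = math.log2(target / runnable)   # ~4.3
            |     print(round(d, 1), "doublings,",
            |           round(d * yrs_per_doubling), "years")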
        
         | y42 wrote:
         | >> Is this a problem or a strength?
         | 
          | I was wondering the same thing. I understand why the initial
          | developers of this method declared it a strength. Still, I
          | think it's a problem, too:
          | 
          | If the Transformer reads this sentence:
          | 
          | A equals B
          | 
          | it understands that B comes after A and therefore that A
          | equals B. But how does it learn the reverse, that B equals A?
          | 
          | I am referring to the logical problems that most (all?) modern
          | language models suffer from.
        
           | sigmoid10 wrote:
           | I see many people get confused by this due to the widely
           | spread (and false) "stochastic parrot" theme. But these
            | models are much more than mere sentence-repeaters. In a way,
           | the model is not learning that after A comes B. I mean, it
           | could. With a lack of additional training data it probably
           | would, too. But with enough data, this kind of sentence
           | completion based purely on existing sentences no longer works
           | because it would saturate parameters. So to retain and
           | improve accuracy during training, it will have to come up
           | with a compression that essentially forms a model of the real
           | world. Or at least the world that the training corpus
           | describes [1]. In that sense, it no longer "knows" that B
           | comes after A (except for the input context), but it would
           | have learned that there is a special relation between A and
            | B. It can then also apply this kind of learned logic to new
           | concepts that appear first in the context during inference.
           | With all that happening internally, it only has to morph this
           | state back into a natural language output. But with billions
           | of parameters and countless layers, there is more than enough
            | computational room for this to happen. In fact, recent work
            | has shown that even small models can get pretty good at
            | logic if you get the training data right.
           | 
           | [1] https://arxiv.org/abs/2210.13382
        
         | incrudible wrote:
         | > It's not like we're hitting a wall with quadratic attention.
         | It's absurdly more expensive than SSMs, but GPUs certainly
         | aren't getting slower.
         | 
         | We are not hitting a wall, but a slope. Hardware improvements
         | will not make up for it indefinitely. Software will have to
         | make up for it, but the problem is that it costs millions of
         | dollars to hit compile.
        
         | tippytippytango wrote:
         | It's a tradeoff to be managed depending on the application
         | rather than a problem.
        
         | thomasahle wrote:
         | >> But Transformers have one core problem. In a transformer,
         | every token can look back at every previous token when making
         | predictions.
         | 
         | > Lately I've been wondering... is this a problem, or a
         | strength?
         | 
          | Exactly. There are a lot of use cases where perfect recall is
          | important. And earlier data may be more or less incompressible,
          | such as when an LLM is working on a large table of data.
         | 
         | Maybe we'll end up with different architectures being used for
         | different applications. E.g. simple chat may be OK with an RNN
         | type architecture.
         | 
         | I've also seen people combine Mamba and Transformer layers.
         | Maybe that's a good tradeoff for some other applications.
        
         | koayon wrote:
         | This is a very fair point! If we had infinite compute then it's
         | undeniable that transformers (i.e. full attention) would be
         | better (exactly as you characterise it)
         | 
         | But that's the efficiency-effectiveness tradeoff that we have
         | to make: given that compute is limited, would we prefer
         | attention over shorter sequences or SSMs over longer sequences?
         | The answer is probably "well, it depends on your use case" - I
         | can definitely see reasons for both!
         | 
         | A fairly compelling thought for me is hybrid architectures
         | (Jamba is a recent one). Here you can imagine having perfect
         | recall over recent tokens and lossy recall over distant tokens.
         | E.g. if the AI is generating a feature-length film, you "could
         | imagine having Attention look at the most recent frames for
         | short-term fluidity and an SSM for long-term narrative
         | consistency" (quote from the OP)
        
           | koayon wrote:
            | And given that the compute is O(n^2) in the context length,
           | it's a very real tradeoff, at least in the short term
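            | 
            | Concretely, just counting pairs vs steps (no constants,
            | purely illustrative):
            | 
            |     for n in (1_000, 16_000, 128_000, 1_000_000):
            |         print(f"{n:>9} tokens: {n*n:.1e} attention"
            |               f" pairs vs {n:.0e} SSM updates")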
        
           | rdedev wrote:
            | If I remember it right, the LLM BigBird had something like
            | this. For a particular word it would attend strongly to its
            | closer neighbours but weakly to words far from it. Look for
            | sparse attention; I think that's the relevant terminology.
            | Not sure if it matches exactly what you described.
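            | 
            | Roughly the mask shape, if it helps (BigBird also adds some
            | random connections, which I'm skipping; toy sizes):
            | 
            |     import numpy as np
            | 
            |     n, window, n_glob = 12, 2, 2
            |     mask = np.zeros((n, n), dtype=bool)
            |     for i in range(n):
            |         lo = max(0, i - window)
            |         mask[i, lo:i + window + 1] = True  # local band
            |     mask[:, :n_glob] = True  # everyone sees globals
            |     mask[:n_glob, :] = True  # globals see everyone
            |     print(mask.astype(int))  # 1 = may attend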
        
         | nlrk wrote:
         | >> When you are speaking, each time you emit a word, you are
         | not attending to every previous word in your sentence
         | 
          | I was doing exactly this until late in my youth, when I learnt
          | that people do it sequentially. But it is doable to create
          | connections and pick the sensible case. Not the most relaxing
          | thing.
        
         | anon291 wrote:
         | Yes transformers are obviously more capable than humans in my
         | opinion. Claude can ingest dozens of pages in seconds and -- in
         | a single shot -- write a summary bringing in relevant passages.
         | 
         | The innovation is not the speed, but the lack of recursion or
         | iteration. Humans, even accomplished ones, have to reread
         | sections and really 'internalize' ideas before being able to
         | summarize and very few humans can -- in a single attempt --
         | generate perfect speech. Most of us speak and unknowingly
          | revise our own speech as we go along. Unlike transformers,
          | which speak confidently, we start making a sentence and then
          | decide halfway through that it's not going where we like. Then
          | we start it
         | over again, and by the powers of human attention, no one seems
         | to really notice.
         | 
          | Transformers are just insanely complicated and expensive to
         | train.
        
           | jazzyjackson wrote:
           | > we start making a sentence and then decide halfway through
           | its not going where we like
           | 
           | I'll just add the observation that when we do this it's
            | largely based on feedback received from the recipient (well,
           | so long as you're talking-with as opposed to talking-at) -
           | we're paying attention to how the audience is paying
           | attention or not, any small facial tics that might betray
           | skepticism or agreement and so on. I'm looking forward to
           | interacting with an LLM that pairs an emotion-vector along
           | with each token it has previously produced.
           | 
           | hume.ai goes a long way analyzing audio, just a matter of
           | time before they're ingesting realtime facial cues to also
           | incorporate their audience's reaction in their choice of what
           | to say next
        
           | rdedev wrote:
           | I view transformers as like the language center of the brain.
           | When we write or speak, especially when it's critical to get
           | things right, we have this ability to think "that doesn't
           | make sense" and start over. I view this recursion as more of
           | a strength than weakness. You can get an LLM to generate an
           | answer and when asked about the validity of the answer it
            | would acknowledge that it got it wrong. This raises the
            | question: if it had perfect recall and understanding, why
            | did it give the wrong answer in the first place?
           | 
           | I don't know how the reasoning part comes to us but if we
            | could implant that capability in a transformer model then it
           | would end up pretty good.
        
             | mannykannot wrote:
             | I agree, and also, when I'm writing, I am working towards a
             | hierarchy of goals at the level of sentence, paragraph and
             | beyond, and I'm also wondering if what I have written and
             | plan to write could be confusing or misunderstood.
             | 
             | I think it's fair to ask whether these are essential
             | techniques for improving precision and clarity, or just a
             | way to compensate for not being able to see the whole
             | picture all at once - but if the latter is the case,
             | there's still room for improvement in LLMs (and me, for
             | that matter.) I notice that experts on a topic are often
             | able to pick out what matters most without any apparent
             | hesitation.
        
         | logicchains wrote:
         | >Lately I've been wondering... is this a problem, or a
         | strength?
         | 
         | It's a strength; fundamentally it's impossible to achieve the
         | same degree of accuracy with a sub-quadratic attention
          | mechanism: https://arxiv.org/abs/2209.04881 (unless the Strong
          | Exponential Time Hypothesis is false, which is considered very
          | unlikely, much like P=NP being true).
        
         | metadat wrote:
         | What's an SSM?
         | 
         | For the uninitiated (like me), apparently it stands for State
         | Space Models.
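          | 
          | In case it helps: at their core they're a linear recurrence
          | over a fixed-size hidden state, roughly
          | h_t = A h_{t-1} + B x_t, y_t = C h_t (toy numpy below with
          | untrained matrices; Mamba's "selective" trick is making these
          | parameters depend on the input):
          | 
          |     import numpy as np
          | 
          |     d_state, d_in = 8, 1
          |     A = 0.9 * np.eye(d_state)
          |     B = np.random.randn(d_state, d_in)
          |     C = np.random.randn(d_in, d_state)
          | 
          |     h = np.zeros(d_state)
          |     for x in np.random.randn(20, d_in):
          |         h = A @ h + B @ x  # lossy summary of history
          |         y = C @ h          # output at this step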
        
         | Semionilo wrote:
          | I don't think it's weird or broken to think about and compare
          | what LLMs do vs what our brains do.
          | 
          | More often than not, it shows that we are also parrots.
        
         | scarmig wrote:
         | > Lately I've been wondering... is this a problem, or a
         | strength?
         | 
         | It probably depends. But an idea I've been playing with:
         | because transformers have such a strong ability for recall
         | during inference, they might be introducing a strong inductive
         | bias for memorization as opposed to generalization. Why bother
         | to build a complete world model when you can just attend to the
         | answer? The global minimum in loss (at least for the training
         | dataset) would use those memorizing and interpolating circuits
         | over those that generalize well. This seems consistent with
         | LLMs as they exist today: superhuman at recall, very mediocre
         | at reasoning. Though, for what it's worth, existing SSSMs
         | haven't yet shown they can outperform (or even match)
         | transformers when it comes to reasoning.
         | 
         | If this hypothesis were true, you might expect to see grokking
         | in state space models more quickly than in transformer models.
         | 
         | (Even if it's hard to train transformers to generalize,
         | superhuman recall is still incredibly valuable, and likely a
         | hybrid system would offer the best of both worlds.)
        
       | password4321 wrote:
       | Links to more about Mamba (selective state space models) on HN
       | yesterday:
       | 
       | https://news.ycombinator.com/item?id=39853958#39855430
        
         | fisian wrote:
         | This submission has the same content as the link here
         | (submitted to HN about a month ago):
         | 
         | https://news.ycombinator.com/item?id=39501982
         | https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html
        
           | jimmySixDOF wrote:
            | Yes, and it's the same author, this time published on The
            | Gradient (the link before was to the personal blog). The
            | Gradient, by the way, are amazing curators of AI news in
            | general and have one of the better podcasts I am aware of
            | interviewing developers in the trenches.
            | 
            | Adding: this resurgence in Mamba in general is also due to
            | some actual SOTA progress with SSMs, like the new AI21 Labs
            | model released this week [1], and we're likely to see others
            | merging different architecture layers (this is a 52B MoE
            | with 12B params active during inference, blending both Mamba
            | and transformers).
           | 
           | >As the first production-grade model based on Mamba
           | architecture, Jamba achieves an unprecedented 3X throughput
           | and fits 140K context on a single GPU.
           | 
           | [1] https://www.ai21.com/jamba
        
       | programjames wrote:
       | This is the best explanation I have seen for Mamba.
        
         | spxneo wrote:
         | TLDR: Friendship ended with transformers. Now Mamba is my best
         | friend.
        
       | etbebl wrote:
       | Anyone else keep seeing articles about Mamba and thinking it's
       | about Python/Conda? It's annoying when the new cool thing picks
       | the same name as something else you like that deserves attention.
        
         | ragebol wrote:
         | > attention
         | 
         | I see what you did there
        
         | tempaccount420 wrote:
         | Sounds like you need a language model to help you categorize
         | Mamba articles into Python and non-Python articles?
        
       | sp332 wrote:
       | So in an effective Mamba query the question goes at the end,
       | after input data? I thought that the question should go at the
       | beginning, so it can decide which information in the data is
       | relevant.
        
         | eropple wrote:
         | I could be wrong, as I haven't used Mamba, but it seems to
         | remain similar to transformers in that it doesn't "decide"
         | anything and streams tokens to follow the existing ones;
         | attention isn't a thing in the same way, but recency does still
         | have impact. To that end, putting context after the question
         | makes it more likely to follow the context, not the question.
        
       | jongjong wrote:
       | I find it difficult to understand certain math and science
       | papers/articles due to ambiguous use of language.
       | 
       | For example "all previous tokens can be passed to the current
       | token." That seems like a poorly constructed sentence. A token is
       | not a function and it's not an algorithm either... How can you
       | pass tokens to a token? This type of ambiguous language in
       | academic papers makes it hard to read... Maybe the phrase 'every
       | token has an association with every other previously encountered
       | token' would be better? Or every token is used to compute the
       | token vector for each token... I don't know, all I can do is
       | guess the meaning of the word 'passed'. They want us to infer and
       | fill in the gaps with our own assumptions. It assumes that we are
       | primed to think in a certain highly constrained way...
       | 
       | For some reason a lot of academia around AI is littered with such
       | imprecise language. They choose to use niche concepts and
        | repurposed wording that their own small community invented,
        | rather than using words and ideas that are more widely understood
        | but which would convey the same information.
       | 
       | Rational people who aren't directly involved in those fields who
       | generally resist jumping to conclusions will struggle to
       | understand what is meant because a lot of those words and ideas
       | have different interpretations in their own fields.
       | 
       | I studied machine learning at university and wrote ANNs from
       | scratch and trained them and even I find the language and
       | concepts around LLMs too ambiguous. I'd rather just ask ChatGPT.
       | 
       | One thing that bothers me is that the community has moved away
       | from relating concepts to neurons, interconnections, input
       | layers, hidden layers and output layers. Instead, they jump
       | straight into vectors and matrices... Pretending as though there
       | is only one way to map those calculations to neurons and weights.
       | But in fact, this abstraction has many possible interpretations.
       | You could have fully connected layers or partially connected
       | layers... Maybe you need a transformer only in front of the input
       | layer or between every layer... So many possibilities.
       | 
       | The entire article means little if considered in isolation
       | outside of the context of current configurations of various
       | popular frameworks and tools.
        
         | king_magic wrote:
         | that's not what it says in the article. it actually says
         | "information from all previous tokens can be passed to the
         | current token".
         | 
          | that statement is meaningfully different from "all previous
          | tokens can be passed to the current token", and both really
          | make sense if you understand attention mechanisms.
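          | 
          | fwiw, "passed" cashes out as a weighted sum: the current
          | token's new representation is literally built from
          | (projections of) the earlier tokens' representations. toy
          | single-head sketch, with random matrices standing in for the
          | learned projections:
          | 
          |     import numpy as np
          | 
          |     T, d = 5, 16
          |     # random stand-ins for learned Q/K/V projections
          |     Q = np.random.randn(T, d)
          |     K = np.random.randn(T, d)
          |     V = np.random.randn(T, d)
          | 
          |     s = Q[-1] @ K.T / np.sqrt(d)  # current vs each
          |     w = np.exp(s - s.max())
          |     w = w / w.sum()
          |     new_rep = w @ V  # blend of *their* values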
        
           | jongjong wrote:
           | Sorry for the misquote but it's a distraction from my issue
           | which was with the usage of the word 'passed'.
           | 
           | Do you pass information from other tokens to a token in the
           | sense that each token processes information from other
           | tokens? A token isn't a processing unit AFAIK, it's just a
           | word part. The processing is not the responsibility of the
           | token itself. My understanding is that tokens may be
           | associated with each other via an external structure but not
           | passed to each other. Or maybe they meant a token vector? And
           | the token vector contains information from related tokens?
           | It's unclear.
           | 
           | To me, 'passed' means data passed to a function or algorithm
           | for processing. It's confusing unless a token is a function
           | or algorithm.
           | 
           | My point is that this language only makes sense if you are
           | already up to date in that field.
        
             | anon291 wrote:
             | Well they gave the equations so follow closely where the
             | token representations end up and how they're acted upon.
        
         | derbOac wrote:
         | I agree although I've always interpreted it as a combination of
         | difficulty explaining complex architecture, and also not really
         | understanding why things work the way they do. A lot of modern
         | AI sits in this kind of quasi-empirical realm just above (in an
         | emergent properties sense) analytic math and statistics, and it
         | seems like there's not a very good integrative account or
         | understanding of what's going on, or a way of deriving what
         | direction to go in. So you end up with poor explanations in
         | part because the authors of the structures themselves don't
         | quite understand why things are working as they are.
        
       ___________________________________________________________________
       (page generated 2024-03-31 23:02 UTC)