[HN Gopher] Mamba Explained
       ___________________________________________________________________
        
       Mamba Explained
        
       Author : andreyk
       Score  : 79 points
       Date   : 2024-03-30 16:04 UTC (6 hours ago)
        
 (HTM) web link (thegradient.pub)
 (TXT) w3m dump (thegradient.pub)
        
       | xz18r wrote:
       | I just have to say it: that image shows gunpla, i.e. Mobile Suit
       | Gundam, not Transformers!
        
         | throwup238 wrote:
         | An official request has been made to ICANN to rescind the OP's
         | nerd card.
        
       | andy_xor_andrew wrote:
       | > But Transformers have one core problem. In a transformer, every
       | token can look back at every previous token when making
       | predictions.
       | 
       | Lately I've been wondering... is this a problem, or a strength?
       | 
       | It might be a fallacy to compare how LLMs "think" with how humans
       | think. But humor me for a second. When you are speaking, each
       | time you emit a word, you are not attending to every previous
        | word in your sentence (like transformers); rather, you have a
       | state in your mind that represents the grammar and concepts,
       | which is continuously updated as you speak (more similar to
       | SSMs).
       | 
       | Similarly, when you read a book, every time you read a word, you
       | are not attending to every previous word in the book. Your model
       | of "the book" is rather a fuzzy/approximate state that is updated
       | with new information every time a new word appears. Right? (I'm
        | sorry I know this is very handwavy and pseudoscientific but bear
       | with me).
       | 
       | Ok, so if (big if) you feel like the above is true, then to match
       | human-type language modelling, SSMs seem more human-like than
       | transformers.
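        | 
        | To make that concrete, here is a toy numpy sketch of the two
        | update rules as I picture them (my own naming, a scalar A/B
        | decay, no learned projections, and not the actual Mamba
        | selection mechanism):
        | 
        |     import numpy as np
        | 
        |     def attention_step(new_token, prev_tokens):
        |         # transformer-style: every step looks back at the
        |         # full history, so per-step work grows with length
        |         # (quadratic over the whole sequence)
        |         hist = np.stack(prev_tokens + [new_token])  # (t, d)
        |         scores = hist @ new_token                   # (t,)
        |         w = np.exp(scores - scores.max())
        |         w = w / w.sum()
        |         return w @ hist                 # context vector
        | 
        |     def ssm_step(new_token, state, A=0.9, B=0.1):
        |         # SSM-style: the whole history is squeezed into one
        |         # fixed-size state, so per-step work is constant,
        |         # but the compression is lossy
        |         return A * state + B * new_token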
       | 
        | BUT... then aren't transformers _strictly better_ in terms of
        | accuracy? A transformer never "forgets" information within its
        | context window, because it revisits that information every time
        | it emits a new token.
       | 
       | So let's say we can remove the "quadratic attention" problem of
       | transformers with SSMs. That's a nice training/inference
       | performance boost. But... look at where we got with "naive"
        | attention. GPT-4, Claude 3. It's not like we're hitting a wall
        | with quadratic attention. It's absurdly more expensive than SSMs,
        | but GPUs certainly aren't getting slower. If all AI work stops
        | now and only hardware improves, it wouldn't be long until GPT-4
        | could run on local hardware, assuming Moore's law holds, right?
       | 
       | /end rant, not really sure what my point was, I'm not against
       | SSMs (they're cool) but rather I'm wondering if the SOTA will
       | ever be SSM when attention is so damn good
        
         | maccam912 wrote:
          | It depends on the task, I imagine. Writing a novel was
          | mentioned: keeping the important story lines in memory for a
          | long time will be necessary, or at least far more important
          | than remembering what the characters were eating for lunch on
          | page 10. But if you need to find that one loophole in a
          | contract, you will probably benefit from perfect recall.
        
         | spxneo wrote:
          | Very good point. The sooner we can accept this difference (we
          | access hyperdimensional entities, discovered through language
          | and math, via fast and slow access, and vocalize them through
          | the alphabets we learned to read), the more "intelligence" we
          | can unlock from AI.
        
         | aCoreyJ wrote:
          | We're running out of the ability to make transistors smaller
          | and closer together, so barring some major breakthrough I
          | wouldn't expect Moore's law to continue nearly long enough to
          | get GPT-4 running on consumer hardware in the short term.
        
           | timschmidt wrote:
           | Ah, but we've just begun stacking transistors in the third
           | dimension.
        
         | y42 wrote:
         | >> Is this a problem or a strength?
         | 
          | I was wondering the same thing. I understand why the initial
          | developers of this method declared it a strength. Still, I
          | think it's a problem too:
          | 
          | If the Transformer reads the sentence
          | 
          | A equals B
          | 
          | it learns that B comes after A, and therefore that A equals B.
          | But how does it learn the reverse, that B equals A?
          | 
          | I am referring to the logical problems that most (all?) modern
          | language models suffer from.
        
           | sigmoid10 wrote:
            | I see many people get confused by this due to the widely
            | spread (and false) "stochastic parrot" theme. But these
            | models are much more than mere sentence-repeaters. In a
            | way, the model is not learning that after A comes B. I
            | mean, it could. With a lack of additional training data, it
            | probably would, too. But with enough data, this kind of
            | sentence completion based purely on existing sentences no
            | longer works, because it would saturate the parameters. So
            | to retain and improve accuracy during training, it will
            | have to come up with a compression that essentially forms a
            | model of the real world, or at least of the world that the
            | training corpus describes [1]. In that sense, it no longer
            | "knows" that B comes after A (except within the input
            | context), but it would have learned that there is a special
            | relation between A and B. It can then also apply this kind
            | of learned logic to new concepts that appear for the first
            | time in the context during inference. With all that
            | happening internally, it only has to morph this state back
            | into a natural-language output. And with billions of
            | parameters and countless layers, there is more than enough
            | computational room for this to happen. In fact, recent work
            | has shown that even small models can get pretty good at
            | logic if you only get the training data right.
           | 
           | [1] https://arxiv.org/abs/2210.13382
        
         | incrudible wrote:
         | > It's not like we're hitting a wall with quadratic attention.
         | It's absurdly more expensive than SSMs, but GPUs certainly
         | aren't getting slower.
         | 
         | We are not hitting a wall, but a slope. Hardware improvements
         | will not make up for it indefinitely. Software will have to
         | make up for it, but the problem is that it costs millions of
         | dollars to hit compile.
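          | 
          | Back-of-the-envelope, purely illustrative (ignoring
          | constants, the MLP blocks and memory bandwidth):
          | 
          |     for n in (8_000, 32_000, 128_000):
          |         attn_cost = n * n   # all-pairs interactions
          |         ssm_cost = n        # constant work per token
          |         print(n, attn_cost // ssm_cost)  # ratio is just n
          | 
          | The gap grows linearly with context length, so each hardware
          | doubling buys attention far fewer extra tokens of context
          | than it buys an SSM.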
        
         | tippytippytango wrote:
         | It's a tradeoff to be managed depending on the application
         | rather than a problem.
        
       | password4321 wrote:
       | Links to more about Mamba (selective state space models) on HN
       | yesterday:
       | 
       | https://news.ycombinator.com/item?id=39853958#39855430
        
       | programjames wrote:
       | This is the best explanation I have seen for Mamba.
        
         | spxneo wrote:
         | TLDR: Friendship ended with transformers. Now Mamba is my best
         | friend.
        
       | etbebl wrote:
       | Anyone else keep seeing articles about Mamba and thinking it's
       | about Python/Conda? It's annoying when the new cool thing picks
       | the same name as something else you like that deserves attention.
        
         | ragebol wrote:
         | > attention
         | 
         | I see what you did there
        
       | sp332 wrote:
        | So in an effective Mamba query, the question goes at the end,
        | after the input data? I thought the question should go at the
        | beginning, so the model can decide which information in the
        | data is relevant.
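        | 
        | (Toy sketch of why I would have expected that to matter,
        | treating the layer as a plain causal gated recurrence, which
        | is my own simplification and not the real Mamba math:)
        | 
        |     import numpy as np
        | 
        |     def causal_gated_scan(tokens):        # tokens: (n, d)
        |         # the gate at step t only sees the state built so
        |         # far, so what gets kept in the fixed-size state
        |         # cannot depend on tokens that come later
        |         state = np.zeros(tokens.shape[1])
        |         for x in tokens:
        |             gate = 1 / (1 + np.exp(-(state @ x)))  # (0, 1)
        |             state = (1 - gate) * state + gate * x
        |         return state
        | 
        | If the question is folded into the state first, the later
        | gates can at least in principle condition on it; if it comes
        | last, everything relevant has to be kept speculatively.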
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:00 UTC)