[HN Gopher] Mamba Explained
___________________________________________________________________
Mamba Explained
Author : andreyk
Score : 79 points
Date : 2024-03-30 16:04 UTC (6 hours ago)
(HTM) web link (thegradient.pub)
(TXT) w3m dump (thegradient.pub)
| xz18r wrote:
| I just have to say it: that image shows gunpla, i.e. Mobile Suit
| Gundam, not Transformers!
| throwup238 wrote:
| An official request has been made to ICANN to rescind the OP's
| nerd card.
| andy_xor_andrew wrote:
| > But Transformers have one core problem. In a transformer, every
| token can look back at every previous token when making
| predictions.
|
| Lately I've been wondering... is this a problem, or a strength?
|
| It might be a fallacy to compare how LLMs "think" with how humans
| think. But humor me for a second. When you are speaking, each
| time you emit a word, you are not attending to every previous
| word in your sentence (like transformers); rather, you have a
| state in your mind that represents the grammar and concepts,
| which is continuously updated as you speak (more similar to
| SSMs).
|
| Similarly, when you read a book, every time you read a word, you
| are not attending to every previous word in the book. Your model
| of "the book" is rather a fuzzy/approximate state that is updated
| with new information every time a new word appears. Right? (I'm
| sorry, I know this is very handwavy and pseudoscientific, but
| bear with me).
|
| Ok, so if (big if) you feel like the above is true, then to match
| human-type language modelling, SSMs seem more human-like than
| transformers.
|
| BUT... then aren't transformers _strictly better_ in terms of
| accuracy? A transformer never "forgets" information, as long as
| it is within the context window, because it revisits that
| information every time it emits a new token.
|
| So let's say we can remove the "quadratic attention" problem of
| transformers with SSMs. That's a nice training/inference
| performance boost. But... look at where we got with "naive"
| attention. GPT-4, Claude 3. It's not like we're hitting a wall
| with quadratic attention. It's absurdly more expensive than SSMs,
| but GPUs certainly aren't getting slower. If all AI work stopped
| now and only hardware improved, it wouldn't be long until GPT-4
| could run on local hardware, assuming Moore's law holds, right?
|
| /end rant, not really sure what my point was, I'm not against
| SSMs (they're cool) but rather I'm wondering if the SOTA will
| ever be SSM when attention is so damn good
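|
| A minimal sketch of the contrast described above (illustrative
| Python with made-up dimensions and a toy state-update rule, not
| taken from the article): the transformer-style step touches the
| whole growing context, while the SSM-style step only updates a
| fixed-size state.
|
|     import numpy as np
|
|     def attention_step(new_token, context, d=16):
|         # Transformer-style: the new token attends to every
|         # previous token, so the full context must be kept.
|         context = np.vstack([context, new_token])   # grows with n
|         scores = context @ new_token / np.sqrt(d)   # n dot products
|         weights = np.exp(scores - scores.max())
|         weights /= weights.sum()
|         return weights @ context, context           # mix of all tokens
|
|     def ssm_step(new_token, state, A, B):
|         # SSM-style: a fixed-size state is updated in place; older
|         # tokens survive only through what the state retained.
|         return A @ state + B @ new_token            # constant work
|
|     d, d_state = 16, 32
|     rng = np.random.default_rng(0)
|     A = 0.9 * np.eye(d_state)                  # toy decaying memory
|     B = 0.1 * rng.normal(size=(d_state, d))
|     context, state = np.zeros((0, d)), np.zeros(d_state)
|     for tok in rng.normal(size=(100, d)):      # a 100-"token" stream
|         _, context = attention_step(tok, context)   # cost grows
|         state = ssm_step(tok, state, A, B)          # cost stays flat
|     print(context.shape, state.shape)          # (100, 16) vs (32,)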
| maccam912 wrote:
| It depends on the task, I imagine. Take the novel-writing
| example that was mentioned: keeping important story lines in
| memory for a long time will be necessary, or at least more
| important than remembering what the characters were eating for
| lunch on page 10. But if you need to find that one loophole in
| a contract, you will probably benefit from perfect recall.
| spxneo wrote:
| Very good point, and the sooner we can accept this difference
| (we access hyperdimensional entities that we discover through
| language and math via fast and slow access, and vocalize them
| through the alphabets we learned to read), the more
| "intelligence" we can unlock from AI.
| aCoreyJ wrote:
| We're running out of the ability to make transistors smaller
| and closer together, so barring some major breakthrough I
| wouldn't expect Moore's law to continue nearly long enough to
| get to the point of running GPT-4 on consumer hardware in the
| short term.
| timschmidt wrote:
| Ah, but we've just begun stacking transistors in the third
| dimension.
| y42 wrote:
| >> Is this a problem or a strength?
|
| I was wondering the same thing. I understand why the initial
| developers of this method framed it as a strength. Still, I
| think it's a problem too:
|
| If the Transformer reads the sentence
|
| A equals B
|
| it learns that B follows A and therefore that A equals B. But
| how does it learn the reverse, that B equals A?
|
| I am referring to the logical problems that most (all?) modern
| language models suffer from.
| sigmoid10 wrote:
| I see many people get confused by this due to the widely spread
| (and false) "stochastic parrot" theme. But these models are much
| more than mere sentence-repeaters. In a way, the model is not
| learning that after A comes B. I mean, it could. With a lack of
| additional training data it probably would, too. But with enough
| data, this kind of sentence completion based purely on existing
| sentences no longer works, because it would saturate the
| parameters. So to retain and improve accuracy during training,
| the model has to come up with a compression that essentially
| forms a model of the real world, or at least of the world that
| the training corpus describes [1]. In that sense, it no longer
| "knows" that B comes after A (except within the input context),
| but it will have learned that there is a special relation
| between A and B. It can then also apply this kind of learned
| logic to new concepts that appear for the first time in the
| context during inference. With all that happening internally, it
| only has to morph this state back into natural-language output.
| With billions of parameters and countless layers, there is more
| than enough computational room for this to happen. In fact,
| recent work has shown that even small models can get pretty good
| at logic if you get the training data right.
|
| [1] https://arxiv.org/abs/2210.13382
| incrudible wrote:
| > It's not like we're hitting a wall with quadratic attention.
| It's absurdly more expensive than SSMs, but GPUs certainly
| aren't getting slower.
|
| We are not hitting a wall, but a slope. Hardware improvements
| will not make up for it indefinitely. Software will have to
| make up for it, but the problem is that it costs millions of
| dollars to hit compile.
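|
| To put rough numbers on that slope (a back-of-envelope sketch
| only; constants, layer counts, and hardware details are ignored,
| and the state size is chosen arbitrarily):
|
|     d_state = 16
|     for n in (4_096, 32_768, 262_144):
|         attn = n * n         # ~n^2 query-key pairs per layer
|         scan = n * d_state   # ~n fixed-size state updates per layer
|         print(f"n={n:>7}: attention ~{attn:.1e}, scan ~{scan:.1e}")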
| tippytippytango wrote:
| It's a tradeoff to be managed depending on the application
| rather than a problem.
| password4321 wrote:
| Links to more about Mamba (selective state space models) on HN
| yesterday:
|
| https://news.ycombinator.com/item?id=39853958#39855430
| programjames wrote:
| This is the best explanation I have seen for Mamba.
| spxneo wrote:
| TLDR: Friendship ended with transformers. Now Mamba is my best
| friend.
| etbebl wrote:
| Anyone else keep seeing articles about Mamba and thinking it's
| about Python/Conda? It's annoying when the new cool thing picks
| the same name as something else you like that deserves attention.
| ragebol wrote:
| > attention
|
| I see what you did there
| sp332 wrote:
| So in an effective Mamba query the question goes at the end,
| after input data? I thought that the question should go at the
| beginning, so it can decide which information in the data is
| relevant.
___________________________________________________________________
(page generated 2024-03-30 23:00 UTC)