[HN Gopher] Mamba Explained
___________________________________________________________________
Mamba Explained
Author : andreyk
Score : 174 points
Date : 2024-03-30 16:04 UTC (1 day ago)
(HTM) web link (thegradient.pub)
(TXT) w3m dump (thegradient.pub)
| xz18r wrote:
| I just have to say it: that image shows gunpla, i.e. Mobile Suit
| Gundam, not Transformers!
| throwup238 wrote:
| An official request has been made to ICANN to rescind the OP's
| nerd card.
| andy_xor_andrew wrote:
| > But Transformers have one core problem. In a transformer, every
| token can look back at every previous token when making
| predictions.
|
| Lately I've been wondering... is this a problem, or a strength?
|
| It might be a fallacy to compare how LLMs "think" with how humans
| think. But humor me for a second. When you are speaking, each
| time you emit a word, you are not attending to every previous
| word in your sentence (like transformers), rather you have a
| state in your mind that represents the grammar and concepts,
| which is continuously updated as you speak (more similar to
| SSMs).
|
| Similarly, when you read a book, every time you read a word, you
| are not attending to every previous word in the book. Your model
| of "the book" is rather a fuzzy/approximate state that is updated
| with new information every time a new word appears. Right? (I'm
| sorry I know this is very handwavy and pseudoscientific but bear
| with me).
|
| Ok, so if (big if) you feel like the above is true, then to match
| human-type language modelling, SSMs seem more human-like than
| transformers.
|
| BUT... then aren't transformers _strictly better_ in terms of
| accuracy? Because a transformer never "forgets" information, as
| long as it is within the context window, since it revisits that
| information every time it emits a new token.
|
| So let's say we can remove the "quadratic attention" problem of
| transformers with SSMs. That's a nice training/inference
| performance boost. But... look at where we got with "naive"
| attention. GPT 4, Claude 3. It's not like we're hitting a wall
| with quadratic attention. It's absurdly more expensive than SSMs,
| but GPUs certainly aren't getting slower. If all AI work stopped
| now and only hardware improved, it wouldn't be long until GPT-4
| could run on local hardware, right, assuming Moore's law holds?
|
| /end rant, not really sure what my point was, I'm not against
| SSMs (they're cool) but rather I'm wondering if the SOTA will
| ever be SSM when attention is so damn good
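|
| To make the contrast concrete, here's a toy numpy sketch of how I
| picture it (made-up sizes, not the real math of either
| architecture):
|
|     import numpy as np
|
|     d, n = 16, 100                    # toy embedding size, sequence length
|     tokens = np.random.randn(n, d)    # pretend token embeddings
|
|     # Transformer-style: the current token looks back at every previous token.
|     def attend(history, query):
|         scores = history @ query                    # one score per past token
|         weights = np.exp(scores - scores.max())
|         weights /= weights.sum()
|         return weights @ history                    # weighted sum over everything so far
|
|     # SSM/RNN-style: a fixed-size state is updated as each token arrives.
|     def update_state(state, token, decay=0.9):
|         return decay * state + (1 - decay) * token  # old info fades, is never revisited
|
|     state = np.zeros(d)
|     for t in range(1, n):
|         attended = attend(tokens[:t], tokens[t])    # cost grows with t: quadratic overall
|         state = update_state(state, tokens[t])      # constant cost per token: linear overall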
| maccam912 wrote:
| It depends on the task, I imagine. Take the novel-writing example
| that was mentioned: keeping important story lines in memory for a
| long time will be necessary, or at least more important than
| remembering what the characters were eating for lunch on page 10.
| But if you need to find that one loophole in a contract, you will
| probably benefit from the perfect recall.
| spxneo wrote:
| Very good point, and the sooner we can accept this difference
| (we access hyperdimensional entities we discover through
| language and math via fast and slow access, and vocalize them
| through the alphabets we learned to read), the more
| "intelligence" we can unlock from AI.
| aCoreyJ wrote:
| We're running out of the ability to make transistors smaller
| and closer together, so barring some major breakthrough I wouldn't
| expect Moore's law to continue nearly long enough to get to the
| point of running GPT-4 on consumer hardware in the short term.
| timschmidt wrote:
| Ah, but we've just begun stacking transistors in the third
| dimension.
| ctrw wrote:
| That doesn't solve the problem, it just pushes it down the
| road a bit. The exponential growth is merely offset by a
| constant factor once. Unless we figure out how to push
| transistors into the 5th, 6th, etc. dimension with every new
| generation.
| jazzyjackson wrote:
| It was never a solution; Moore's law has more than one
| dimension as well, not just density but heat dissipation. You
| can't cool down a transistor that's surrounded by
| transistors on all sides.
| moffkalast wrote:
| Well, consumer hardware can run something on the order of ~50B
| quantized at a "reasonable" price today; we'd need about 5 or
| 6 doublings to run something that would be GPT-4 tier at 1T+.
| So it would need to continue for roughly a decade at least?
|
| Current models are horrendously inefficient though, so with
| architectural improvements we'll have something of that
| capability far sooner on weaker hardware.
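|
| Back-of-the-envelope (treating "GPT 4 tier" as ~1T+ params, which
| is only a guess, and assuming ~2 years per doubling):
|
|     import math
|
|     current, target = 50e9, 1e12              # ~50B runnable now vs ~1T+ assumed
|     doublings = math.log2(target / current)   # ~4.3, so call it 5-6 with headroom
|     years = doublings * 2                     # ~2 years per doubling assumed
|     print(round(doublings, 1), round(years))  # -> 4.3, 9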
| y42 wrote:
| >> Is this a problem or a strength?
|
| I was wondering the same thing. I understand why the initial
| developers of this method declared it a strength. Still, I
| think it's a problem too:
|
| If the Transformer reads this sentence:
|
| A equals B
|
| It understands that B comes after A, and therefore that A equals
| B. But how does it learn the reverse direction, that B also
| equals A?
|
| I am referring to the logical problems that most (all?) modern
| language models suffer from.
| sigmoid10 wrote:
| I see many people get confused by this due to the widespread
| (and false) "stochastic parrot" theme. But these models are
| much more than mere sentence-repeaters. In a way,
| the model is not learning that after A comes B. I mean, it
| could. With a lack of additional training data it probably
| would, too. But with enough data, this kind of sentence
| completion based purely on existing sentences no longer works
| because it would saturate parameters. So to retain and
| improve accuracy during training, it will have to come up
| with a compression that essentially forms a model of the real
| world. Or at least the world that the training corpus
| describes [1]. In that sense, it no longer "knows" that B
| comes after A (except for the input context), but it would
| have learned that there is a special relation between A and
| B. It can then also apply this kind of learned logic to new
| concepts that appear first in the context during inference.
| With all that happening internally, it only has to morph this
| state back into a natural language output. But with billions
| of parameters and countless layers, there is more than enough
| computational room for this to happen. In fact, recent work
| has shown that even small models can get pretty good at
| logic if you only get the training data right.
|
| [1] https://arxiv.org/abs/2210.13382
| incrudible wrote:
| > It's not like we're hitting a wall with quadratic attention.
| It's absurdly more expensive than SSMs, but GPUs certainly
| aren't getting slower.
|
| We are not hitting a wall, but a slope. Hardware improvements
| will not make up for it indefinitely. Software will have to
| make up for it, but the problem is that it costs millions of
| dollars to hit compile.
| tippytippytango wrote:
| It's a tradeoff to be managed depending on the application
| rather than a problem.
| thomasahle wrote:
| >> But Transformers have one core problem. In a transformer,
| every token can look back at every previous token when making
| predictions.
|
| > Lately I've been wondering... is this a problem, or a
| strength?
|
| Exactly. There are a lot of use cases where perfect recall is
| important. And earlier data may be more or less incompressible,
| such as if an LLM is working on a large table of data.
|
| Maybe we'll end up with different architectures being used for
| different applications. E.g. simple chat may be OK with an RNN
| type architecture.
|
| I've also seen people combine Mamba and Transformer layers.
| Maybe that's a good tradeoff for some other applications.
| koayon wrote:
| This is a very fair point! If we had infinite compute then it's
| undeniable that transformers (i.e. full attention) would be
| better (exactly as you characterise it)
|
| But that's the efficiency-effectiveness tradeoff that we have
| to make: given that compute is limited, would we prefer
| attention over shorter sequences or SSMs over longer sequences?
| The answer is probably "well, it depends on your use case" - I
| can definitely see reasons for both!
|
| A fairly compelling thought for me is hybrid architectures
| (Jamba is a recent one). Here you can imagine having perfect
| recall over recent tokens and lossy recall over distant tokens.
| E.g. if the AI is generating a feature-length film, you "could
| imagine having Attention look at the most recent frames for
| short-term fluidity and an SSM for long-term narrative
| consistency" (quote from the OP)
| koayon wrote:
| And given that attention compute is O(n^2) in the context
| window, it's a very real tradeoff, at least in the short term.
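|
| The gap is easy to eyeball with some made-up but plausible sizes
| (crude FLOP proxies, not real benchmarks):
|
|     d_model, d_state = 4096, 16           # illustrative sizes only
|
|     for n in (4_000, 32_000, 256_000):    # context lengths
|         attn = n * n * d_model            # attention: every token pairs with every other
|         ssm = n * d_state * d_model       # SSM: fixed-size state update per token
|         print(f"n={n:>7,}  attention/SSM cost ratio ~ {attn / ssm:,.0f}x")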
| rdedev wrote:
| If I remember it right, the LLM Big Bird had something like
| this. For a particular word it would attend strongly to its
| closer neighbours but weakly to words far from it. Look up
| sparse attention; I think that's the relevant terminology.
| Not sure if it matches exactly what you described.
| nlrk wrote:
| >> When you are speaking, each time you emit a word, you are
| not attending to every previous word in your sentence
|
| I was doing exactly this until late in my youth, until I learnt
| that people do it sequentially. But it is doable to create
| connections and pick the sensible case. Not the most relaxing
| thing.
| anon291 wrote:
| Yes transformers are obviously more capable than humans in my
| opinion. Claude can ingest dozens of pages in seconds and -- in
| a single shot -- write a summary bringing in relevant passages.
|
| The innovation is not the speed, but the lack of recursion or
| iteration. Humans, even accomplished ones, have to reread
| sections and really 'internalize' ideas before being able to
| summarize and very few humans can -- in a single attempt --
| generate perfect speech. Most of us speak and unknowingly
| revise our own speech as we go along. Unlike transformers, which
| speak confidently, we start making a sentence and then decide
| halfway through that it's not going where we like. Then we start it
| over again, and by the powers of human attention, no one seems
| to really notice.
|
| Transformers are just insanely complicated and expensive to
| train.
| jazzyjackson wrote:
| > we start making a sentence and then decide halfway through
| its not going where we like
|
| I'll just add the observation that when we do this it's
| largely based on feedback received from the recipient (well,
| so long as you're talking-with as opposed to talking-at) -
| we're paying attention to how the audience is paying
| attention or not, any small facial tics that might betray
| skepticism or agreement and so on. I'm looking forward to
| interacting with an LLM that pairs an emotion-vector along
| with each token it has previously produced.
|
| hume.ai goes a long way analyzing audio, just a matter of
| time before they're ingesting realtime facial cues to also
| incorporate their audience's reaction in their choice of what
| to say next
| rdedev wrote:
| I view transformers as like the language center of the brain.
| When we write or speak, especially when it's critical to get
| things right, we have this ability to think "that doesn't
| make sense" and start over. I view this recursion as more of
| a strength than a weakness. You can get an LLM to generate an
| answer, and when asked about the validity of the answer it
| would acknowledge that it got it wrong. This raises the
| question: if it had perfect recall and understanding, why
| did it give the wrong answer in the first place?
|
| I don't know how the reasoning part comes to us but if we
| could implant that capability into a transformer model then it
| would end up pretty good.
| mannykannot wrote:
| I agree, and also, when I'm writing, I am working towards a
| hierarchy of goals at the level of sentence, paragraph and
| beyond, and I'm also wondering if what I have written and
| plan to write could be confusing or misunderstood.
|
| I think it's fair to ask whether these are essential
| techniques for improving precision and clarity, or just a
| way to compensate for not being able to see the whole
| picture all at once - but if the latter is the case,
| there's still room for improvement in LLMs (and me, for
| that matter.) I notice that experts on a topic are often
| able to pick out what matters most without any apparent
| hesitation.
| logicchains wrote:
| >Lately I've been wondering... is this a problem, or a
| strength?
|
| It's a strength; fundamentally it's impossible to achieve the
| same degree of accuracy with a sub-quadratic attention
| mechanism: https://arxiv.org/abs/2209.04881 (unless the Strong
| Exponential Time Hypothesis is false, which is very unlikely,
| like P=NP).
| metadat wrote:
| What's an SSM?
|
| For the uninitiated (like me), apparently it stands for State
| Space Models.
| Semionilo wrote:
| I don't think it's weird or broken to think about and compare
| what LLMs do vs what our brains do.
|
| More often than not, it shows that we are also parrots.
| scarmig wrote:
| > Lately I've been wondering... is this a problem, or a
| strength?
|
| It probably depends. But an idea I've been playing with:
| because transformers have such a strong ability for recall
| during inference, they might be introducing a strong inductive
| bias for memorization as opposed to generalization. Why bother
| to build a complete world model when you can just attend to the
| answer? The global minimum in loss (at least for the training
| dataset) would use those memorizing and interpolating circuits
| over those that generalize well. This seems consistent with
| LLMs as they exist today: superhuman at recall, very mediocre
| at reasoning. Though, for what it's worth, existing SSMs
| haven't yet shown they can outperform (or even match)
| transformers when it comes to reasoning.
|
| If this hypothesis were true, you might expect to see grokking
| in state space models more quickly than in transformer models.
|
| (Even if it's hard to train transformers to generalize,
| superhuman recall is still incredibly valuable, and likely a
| hybrid system would offer the best of both worlds.)
| password4321 wrote:
| Links to more about Mamba (selective state space models) on HN
| yesterday:
|
| https://news.ycombinator.com/item?id=39853958#39855430
| fisian wrote:
| This submission has the same content as the link here
| (submitted to HN about a month ago):
|
| https://news.ycombinator.com/item?id=39501982
| https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html
| jimmySixDOF wrote:
| Yes, and it's the same author, this time published on The
| Gradient (the link before was to the personal blog). The
| Gradient, by the way, are amazing curators of AI news in
| general and have one of the better podcasts I am aware of for
| interviewing developers in the trenches.
|
| Adding: this resurgence of interest in Mamba is also due to
| some actual SOTA progress with SSMs, like the new AI21 Labs
| release this week [1], and we're likely to see others merging
| different architecture layers (this is a 52B MoE with 12B
| params active during inference, blending both Mamba and
| transformer layers).
|
| >As the first production-grade model based on Mamba
| architecture, Jamba achieves an unprecedented 3X throughput
| and fits 140K context on a single GPU.
|
| [1] https://www.ai21.com/jamba
| programjames wrote:
| This is the best explanation I have seen for Mamba.
| spxneo wrote:
| TLDR: Friendship ended with transformers. Now Mamba is my best
| friend.
| etbebl wrote:
| Anyone else keep seeing articles about Mamba and thinking it's
| about Python/Conda? It's annoying when the new cool thing picks
| the same name as something else you like that deserves attention.
| ragebol wrote:
| > attention
|
| I see what you did there
| tempaccount420 wrote:
| Sounds like you need a language model to help you categorize
| Mamba articles into Python and non-Python articles?
| sp332 wrote:
| So in an effective Mamba query the question goes at the end,
| after input data? I thought that the question should go at the
| beginning, so it can decide which information in the data is
| relevant.
| eropple wrote:
| I could be wrong, as I haven't used Mamba, but it seems to
| remain similar to transformers in that it doesn't "decide"
| anything and streams tokens to follow the existing ones;
| attention isn't a thing in the same way, but recency does still
| have impact. To that end, putting context after the question
| makes it more likely to follow the context, not the question.
| jongjong wrote:
| I find it difficult to understand certain math and science
| papers/articles due to ambiguous use of language.
|
| For example "all previous tokens can be passed to the current
| token." That seems like a poorly constructed sentence. A token is
| not a function and it's not an algorithm either... How can you
| pass tokens to a token? This type of ambiguous language in
| academic papers makes it hard to read... Maybe the phrase 'every
| token has an association with every other previously encountered
| token' would be better? Or every token is used to compute the
| token vector for each token... I don't know, all I can do is
| guess the meaning of the word 'passed'. They want us to infer and
| fill in the gaps with our own assumptions. It assumes that we are
| primed to think in a certain highly constrained way...
|
| For some reason a lot of academia around AI is littered with such
| imprecise language. They choose to use niche concepts and
| repurposed wording that their own small community invented rather
| using words and ideas that are more widely understood but which
| would convey the same information.
|
| Rational people who aren't directly involved in those fields,
| and who generally resist jumping to conclusions, will struggle
| to understand what is meant, because a lot of those words and
| ideas have different interpretations in their own fields.
|
| I studied machine learning at university and wrote ANNs from
| scratch and trained them and even I find the language and
| concepts around LLMs too ambiguous. I'd rather just ask ChatGPT.
|
| One thing that bothers me is that the community has moved away
| from relating concepts to neurons, interconnections, input
| layers, hidden layers and output layers. Instead, they jump
| straight into vectors and matrices... Pretending as though there
| is only one way to map those calculations to neurons and weights.
| But in fact, this abstraction has many possible interpretations.
| You could have fully connected layers or partially connected
| layers... Maybe you need a transformer only in front of the input
| layer or between every layer... So many possibilities.
|
| The entire article means little if considered in isolation
| outside of the context of current configurations of various
| popular frameworks and tools.
| king_magic wrote:
| that's not what it says in the article. it actually says
| "information from all previous tokens can be passed to the
| current token".
|
| that statement is meaningfully different from "all previous
| tokens can be passed to the current token". and both really
| make sense if you understand attention mechanisms.
| jongjong wrote:
| Sorry for the misquote but it's a distraction from my issue
| which was with the usage of the word 'passed'.
|
| Do you pass information from other tokens to a token in the
| sense that each token processes information from other
| tokens? A token isn't a processing unit AFAIK, it's just a
| word part. The processing is not the responsibility of the
| token itself. My understanding is that tokens may be
| associated with each other via an external structure but not
| passed to each other. Or maybe they meant a token vector? And
| the token vector contains information from related tokens?
| It's unclear.
|
| To me, 'passed' means data passed to a function or algorithm
| for processing. It's confusing unless a token is a function
| or algorithm.
|
| My point is that this language only makes sense if you are
| already up to date in that field.
| anon291 wrote:
| Well, they gave the equations, so follow closely where the
| token representations end up and how they're acted upon.
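|
| For instance, the attention step is roughly the following (a toy
| numpy sketch with made-up shapes, not the article's exact
| notation): each output row is a weighted mix of the value vectors
| of that token and everything before it, which is the sense in
| which information is "passed" to the current token.
|
|     import numpy as np
|
|     def single_head_attention(x, Wq, Wk, Wv):
|         """x: (seq_len, d) token representations -> new representations."""
|         Q, K, V = x @ Wq, x @ Wk, x @ Wv
|         scores = Q @ K.T / np.sqrt(K.shape[-1])
|         # causal mask: token i may only look at tokens 0..i
|         scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
|         weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
|         weights /= weights.sum(axis=-1, keepdims=True)
|         return weights @ V                # row i mixes V[0..i] only
|
|     d = 8
|     x = np.random.randn(5, d)             # 5 toy tokens
|     out = single_head_attention(x, *(np.random.randn(d, d) for _ in range(3)))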
| derbOac wrote:
| I agree although I've always interpreted it as a combination of
| difficulty explaining complex architecture, and also not really
| understanding why things work the way they do. A lot of modern
| AI sits in this kind of quasi-empirical realm just above (in an
| emergent properties sense) analytic math and statistics, and it
| seems like there's not a very good integrative account or
| understanding of what's going on, or a way of deriving what
| direction to go in. So you end up with poor explanations in
| part because the authors of the structures themselves don't
| quite understand why things are working as they are.
___________________________________________________________________
(page generated 2024-03-31 23:02 UTC)