[HN Gopher] Jamba: Production-grade Mamba-based AI model
       ___________________________________________________________________
        
       Jamba: Production-grade Mamba-based AI model
        
       Author : bubblehack3r
       Score  : 328 points
       Date   : 2024-03-28 16:36 UTC (1 day ago)
        
 (HTM) web link (www.maginative.com)
 (TXT) w3m dump (www.maginative.com)
        
       | kelseyfrog wrote:
       | I'm glad we're seeing exploration into scaling post-transformer
       | LLM architectures, but I'm disappointed that it _has_ a context
        | window. That was kind of the selling point of Mamba (and SSM
        | models in general), right? Linear scaling because
        | state + input = next_state + output?
        
         | refulgentis wrote:
          | I'm not sure I follow fully; it is also the case for
          | (handwaves) "traditional" LLMs that state + input = next state
          | + output. It's just that the output grows, so as output
          | becomes input, eventually state + input / next state + output
          | is greater than the context size.
         | 
          | Re: linear scaling, that means the runtime cost is O(n) in
          | context size, rather than a traditional transformer's O(n^2).
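          | 
          | To make the scaling point concrete, a toy cost count (purely
          | illustrative; the state size is made up):
          | 
          |   # attention: token t looks back at all t-1 earlier tokens
          |   def attn_total_ops(n_tokens):
          |       return sum(t for t in range(1, n_tokens + 1))  # ~n^2/2
          | 
          |   # SSM: each token does one fixed-size state update
          |   def ssm_total_ops(n_tokens, state_dim=16):
          |       return n_tokens * state_dim  # ~n
          | 
          |   for n in (1_000, 10_000, 100_000):
          |       print(n, attn_total_ops(n), ssm_total_ops(n))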
        
           | maccam912 wrote:
           | I think kelseyfrog meant that the state for a mamba model is
           | supposed to "remember" stuff even if it doesn't have the
           | actual tokens to reference any more. It might not be
           | guaranteed to hang on to some information about tokens from a
           | long time ago, but at least in theory it's possible, whereas
            | tokens from before the context window in a traditional LLM
            | may as well never have existed.
        
             | kelseyfrog wrote:
             | Yes, you said it better than I did :)
        
           | visarga wrote:
            | That is valid for pure Mamba; this model (Jamba) is a mix of
            | transformer and Mamba layers, so it still has a quadratic
            | memory cost, just divided by 8.
        
         | a_wild_dandan wrote:
         | state = context
         | 
          | The difference between SSMs and GPTs here is _how_ that
          | state/context scales. Per usual in engineering, there are big
          | trade-offs!
        
           | kelseyfrog wrote:
           | I'm not following. State is a multi-dimensional vector and
           | context is a list of tokens. State is perturbed by A and
           | Bx(t), while context is appended to by sampling the predicted
           | token distribution.
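            | 
            | Roughly what I have in mind, as a toy sketch (dimensions
            | made up):
            | 
            |   import numpy as np
            | 
            |   d_state, d_in = 16, 4
            |   A = 0.9 * np.eye(d_state)           # state transition
            |   B = np.random.randn(d_state, d_in)  # input projection
            | 
            |   h = np.zeros(d_state)  # SSM state: fixed size forever
            |   context = []           # token list: grows every step
            | 
            |   for token_id in range(1000):
            |       x_t = np.random.randn(d_in)  # stand-in embedding
            |       h = A @ h + B @ x_t          # perturb state, O(1) mem
            |       context.append(token_id)     # append token, O(n) mem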
        
         | spxneo wrote:
          | 256k is huge, dude. That is like half of the average
          | non-fiction novel.
          | 
          | I think at least 200~300 pages of PDF.
          | 
          | I'm not complaining here, and it also fits on a GPU.
        
       | htrp wrote:
       | compute still has cost?
        
         | samus wrote:
          | I'm not sure I understood your question.
         | 
         | This model should have much lower computational cost since only
         | one out of eight layers is a traditional transformer layer with
         | masked self-attention. Additionally, half of the Mamba layers
         | are MoEs.
        
       | krasin wrote:
       | The license is a proper open-source one: Apache 2.0. Thanks, AI21
       | Labs.
        
         | popalchemist wrote:
         | In addition to the architectural and performance benefits, this
         | is the big deal here, IMO.
        
         | spxneo wrote:
          | I'm so used to seeing AGPLv3.
          | 
          | Apache 2 is a more generous license.
        
           | krasin wrote:
           | AGPLv3 is a fine license too. But most of the models nowadays
           | come with bullshit licenses, like Llama 2 with its
           | "acceptable use policy" enforced by the license:
           | https://ai.meta.com/llama/use-policy/
        
       | Reubend wrote:
        | It's great to see a full production-level model using Mamba. But
        | when it comes to long context window benchmarks, I'd love to see
        | performance as well as throughput. I was under the impression
        | that Mamba has huge increases in throughput at the cost of
        | modest losses in accuracy when using long contexts.
        
         | refulgentis wrote:
         | I would too -- long context has been such a red herring across
         | providers, Claude 3 is the first I've seen that seems to
         | genuinely have some sort of qualitative leap in noticing
         | things.
         | 
          | It is worth noting I'm fairly sure there's no inherent
          | theoretical decrease in accuracy in long contexts; the claimed
          | theoretical change is an _increase_ in long-term accuracy in
          | long contexts.
        
           | Arthur_ODC wrote:
            | Long context is great and all, but it sucks that all of these
            | LLMs have really poor output length. If I feed something an
           | entire book and ask for a comprehensive summary then I'm
           | expecting at least a full 3-page summary. I get that they try
           | to force these things to be "concise" to save on compute, but
           | good lord it's so annoying.
        
             | CuriouslyC wrote:
              | That's a ChatGPT problem; if you hit the API it's not
              | nearly so hard to get good output.
        
               | refulgentis wrote:
               | I wouldn't say that, my latest big user story for making
                | sure I'm handling huge inputs was "translate Moby Dick
                | to zoomer". Can't give any service chunks larger than
                | ~5K tokens, over API, without it failing.
               | 
               | (Miserably, like, I'd be fine if it gave a paragraph
               | back. But at least on this "map" task, there's a critical
               | point where there's so much input that the reward
               | function ends up imitating the input more instead of
               | chatting)
        
             | pedrovhb wrote:
             | Have you tried asking it for a specific concrete length,
             | like a number of words? I was also frustrated with concise
             | answers when asking for long ones, but I found that the
             | outputs improved significantly if I asked for e.g. 4000
              | words specifically. Beyond that, have it break the task
              | into sections and write X words per section.
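              | 
              | Something like this pattern (call_llm is just a
              | placeholder for whichever chat API you use):
              | 
              |   def call_llm(prompt):
              |       raise NotImplementedError  # wire up your API here
              | 
              |   def long_summary(text, parts=6, words=700):
              |       outline = call_llm(
              |           f"Outline this in {parts} sections:\n{text}")
              |       out = []
              |       for sec in outline.splitlines()[:parts]:
              |           out.append(call_llm(
              |               f"Write ~{words} words on: {sec}\n"
              |               f"Source:\n{text}"))
              |       return "\n\n".join(out)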
        
               | Arthur_ODC wrote:
                | Yes, all the possible length-extending custom
                | instructions you can think of, plus multi-shot example
                | prompts using multiple USER and GPT exchanges to define
                | the format. I can get some reasonable-length responses
                | out of it, but I've never seen them go over one page's
                | worth. Seems like GPT-4 has a hard limit on how much it
                | will output when you click "continue", and Claude Opus
                | never goes over a page either. Another user pointed out
                | using the API, which I have done in the past, but it's
                | been a long while, and I can't really justify the cost
                | of using the advanced models via API for my general use.
        
               | refulgentis wrote:
                | Everyone's coalescing on a max of 4096 output tokens /
                | ~12 "pages" via API (a page being 250 words, i.e. one
                | 8.5"x11" double-spaced page).
                | 
                | To your point, it doesn't matter anyway; it's nigh
                | impossible to get over 2K tokens of output with every
                | trick and bit of guidance you can think of (I got
                | desperate when 16K/48 pages came out, trying to "make it
                | work"; even completely deforming tricks like making it
                | number each line and write a reminder on each line that
                | it should write 1000 lines don't work).
        
           | binalpatel wrote:
           | Gemini 1.5 Pro is really good at long context in my
           | experience.
        
           | tempusalaria wrote:
            | Every long-context model sucks right now. All the model
            | providers benchmark on fact recall, which is very limited.
            | Actual ability to do anything complicated beyond 16k tokens
            | is not present in any current model I have seen.
        
             | ukuina wrote:
             | This is not current. GPT-4-Turbo (128k) has lossless recall
             | to the first 64k input tokens and produces output
             | indistinguishable from GPT-4 (32k), though both are limited
             | to 4k output tokens.
             | 
             | Several downsides: Recall accuracy past the first 64k
             | tokens suffers badly; Cost is astronomical; Response
             | latency is too high for most interactive use-cases.
             | 
             | I would point out the astounding leap in input context in
             | just one year. Should we assume effectively-infinite (RAG-
             | free) context in the near-future?
        
               | anoncareer0212 wrote:
                | This is grossly untrue in a way that denotes surface-
                | level familiarity on several fronts.
               | 
               | You're referring to the needle-in-a-haystack retrieval
               | problem.
               | 
               | Which the person you're replying to explicitly mentioned
               | is the only benchmark providers are using, for good
               | reason.
               | 
               | Consider the "translate Moby Dick to comedic zoomer"
               | problem. This does not even come remotely close to
               | working unless I do it in maximum chunks of 5,000 tokens.
               | 
               | Consider the API output limit of 4096 tokens, across all
               | providers.
               | 
               | And no, you shouldn't assume effectively infinite (RAG
               | free) context in the near future. This time last year,
               | Anthropic was demonstrating 120,000 token context. It
               | released 200K a few weeks ago. And runtime cost scales
               | with N^2.
        
         | samus wrote:
          | This one should have you covered :-) One out of every eight
         | layers is a traditional Transformer layer, which should ensure
         | precision, at least over short distances.
        
           | swyx wrote:
           | > which should ensure precision, at least over short
           | distances.
           | 
            | why? i don't follow. transformers should provide some
            | attention over -all- distances, no? why does layering
            | truncate this to "short distances"?
        
             | samus wrote:
             | I mean "short" in comparison to the unlimited, but lossy
             | recall that the Mamba blocks provide. Transformers are
             | limited to the context length, while Mamba can carry along
             | state. While it can remember things from a lot farther
             | back, it is limited and must thus eventually drop things
             | and/or lose precision.
        
       | gautamcgoel wrote:
       | Why include self-attention layers at all? In other words, why not
       | just alternate SSM and MLP layers?
        
         | NLPaep wrote:
         | Mamba is bad with long context. It doesn't remember phone
         | numbers
         | 
         | https://www.harvard.edu/kempner-institute/2024/02/05/repeat-...
        
           | a_wild_dandan wrote:
            | Good! DNNs unlock _semantics_ (parsing, transforming,
            | producing). That's the basis of general intelligence, not
            | encyclopedic random string recall. Models shouldn't burn
            | ungodly quantities of compute emulating DDR5 with their
            | working memory. We need machines that _think better_, not
            | _memorize_ well. We already have plenty of those.
           | 
           | Massive context windows, and their needle tests, are
           | misguided. We won't reach human-level AGI by basically
           | inventing a natural language RDBMS. Our resources should
           | primarily target better reasoning systems for our models,
           | reinforcement learning, etc.
           | 
           | If we can build a GPT4-level problem solving system that
           | coincidentally also can't remember telephone numbers, I'll
           | consider it major progress.
        
             | 6gvONxR4sf7o wrote:
             | Memorization usually refers to training data. It's often
             | useful to have something that can utilize instructions
             | losslessly, which is the distinction between these models.
        
           | Rodeoclash wrote:
           | I can't remember phone numbers either but I can use a device
           | suited to remembering them to look them up
        
             | orra wrote:
             | Hell, it looks like you forgot you already said that (-:
        
               | Rodeoclash wrote:
               | Haha, I blame the Harmonic app :/
        
             | imtringued wrote:
              | What if your field of vision were infinite and you were
              | looking at an unrolled telephone book?
             | 
             | Would you need a device to remember the phone number? You
             | wouldn't. You would need a method or algorithm to find the
             | number, but there is no reason why that algorithm couldn't
             | be part of the attention mechanism. The attention mechanism
             | is akin to reading the entire phone book for every word you
              | are about to say. It would be unreasonable to expect you
              | not to find the right phone number eventually.
        
           | Rodeoclash wrote:
           | I can't remember phone numbers either but I can use a device
           | suited to remembering them to look them up.
        
       | skybrian wrote:
       | > Jamba boasts an extensive context window of 256K tokens,
       | equivalent to around 210 pages of text, while fitting up to 140K
       | tokens on a single 80GB GPU.
       | 
        | I realize this is a big improvement, but it's striking how
        | inefficient LLMs are, that you need 80GB of GPU memory to
       | analyze less than 1 megabyte of data. That's a lot of bloat!
       | Hopefully there's a lot of room for algorithmic improvements.
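        | 
        | Back-of-envelope for where the memory goes at long context (all
        | the numbers below are assumptions, not from the article):
        | 
        |   def kv_cache_gb(seq_len, attn_layers, kv_heads=8,
        |                   head_dim=128, fp_bytes=2):
        |       # keys + values, fp16, per attention layer
        |       b = (2 * attn_layers * kv_heads * head_dim
        |            * seq_len * fp_bytes)
        |       return b / 1e9
        | 
        |   print(kv_cache_gb(140_000, attn_layers=32))  # ~18 GB
        |   print(kv_cache_gb(140_000, attn_layers=4))   # ~2.3 GB
        | 
        | That cache sits on top of the weights, so cutting the number of
        | attention layers (and leaning on Mamba state for the rest) is
        | presumably what lets 140K tokens fit alongside the weights on
        | one card.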
        
         | electric_mayhem wrote:
         | It's literally simulating a neural network.
         | 
         | How much of your 5-sense experiential memories and decades of
         | academic book learning are you bringing to understand my reply
         | to your post?
         | 
         | How many gigabytes do you think that's equivalent to?
        
           | _false wrote:
            | I love both parent posts' perspectives on this.
        
           | skybrian wrote:
           | Jamba seems to be distributed as 21 5-gigabyte files [1] so I
           | guess that's another way of looking at it.
           | 
           | [1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main
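            | 
            | For anyone wanting to poke at it, the usual transformers
            | incantation should apply (untested sketch; needs a recent
            | transformers build plus accelerate for device_map):
            | 
            |   from transformers import (AutoModelForCausalLM,
            |                             AutoTokenizer)
            | 
            |   repo = "ai21labs/Jamba-v0.1"
            |   tok = AutoTokenizer.from_pretrained(repo)
            |   model = AutoModelForCausalLM.from_pretrained(
            |       repo, device_map="auto", trust_remote_code=True)
            |   ids = tok("Mamba is",
            |             return_tensors="pt").to(model.device)
            |   out = model.generate(**ids, max_new_tokens=32)
            |   print(tok.decode(out[0]))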
        
             | imtringued wrote:
             | So what? I have seen models distributed as 26x 10GB files.
        
           | richardw wrote:
           | It's kinda simulating our brains but not really. When I
            | attempted to dig more into how neurons work I realised that
            | there's a massive chasm of difference. Very much worth doing
            | if you haven't (you might know far better than me; this is
            | for people who don't yet).
           | 
            | In terms of results: our brains work with 20W of power and
            | can be trained to compete with LLMs using a tiny
           | fraction of the world's data. They also have to keep you
           | breathing and your blood pumping and manage all the dangers
           | of catching a ball near traffic. Or skiing, or poetry, or
           | sunsets. And they remember stuff five minutes later and don't
           | need a training run that takes months.
           | 
           | We have SO many opportunities to improve the AI architecture
           | it's ridiculous. This is a good thing.
        
             | reissbaker wrote:
             | To be fair most of the brain is more like a pretrained
             | model -- it isn't being trained at any point after
             | conception to keep your blood pumping or your lungs
             | working, it does that out of the box roughly as soon as you
             | sprout those organs (or the minute you're born, in the case
             | of lungs). The training process was billions of years of
             | evolution. And, well, given fairly persistent cross-
             | cultural cognitive biases, I expect the conscious thought
             | parts are starting from a pretrained model, too, and all
             | we're doing in school is finetuning ;)
        
             | imtringued wrote:
             | People don't understand that to simulate a single neuron,
             | you need an entire neural network. So 70 billion parameters
             | might at best be equivalent to a million neurons but that
             | is assuming that your neural network architecture is akin
             | to the connections between neurons. Considering the
             | physical sparsity, you might need even more parameters to
             | model the connections of a biological neural network. So
             | less than a million neurons in practice.
        
         | nostrowski wrote:
          | Two things I'm curious to know:
          | 
          | 1. How many tokens can 'traditional' models (e.g. Mistral's
          | 8x7B) fit on a single 80GB GPU?
          | 
          | 2. How does quantization affect the single transformer layer
          | in the stack? What are the performance/accuracy trade-offs
          | that happen when so little of the stack depends on this
          | bottleneck?
        
           | patrakov wrote:
           | Mixtral 8x7b runs well (i.e., produces the correct output
           | faster than I can read it) on a modern AMD or Intel laptop
           | without any use of a GPU - provided that you have enough RAM
           | and CPU cores. 32 GB of RAM and 16 hyperthreads are enough
           | with 4-bit quantization if you don't ask too much in terms of
           | context.
           | 
           | P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
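            | 
            | Quick sizing sketch of why that fits (parameter count is
            | approximate):
            | 
            |   params = 46.7e9   # Mixtral 8x7B, total (not active)
            |   q4_bytes = 0.5    # ~4-bit quantization
            |   print(params * q4_bytes / 1e9, "GB")  # ~23 GB of weights
            |   # plus KV cache and buffers, so 32 GB of RAM works as
            |   # long as the context stays modest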
        
         | riku_iki wrote:
         | > that you need 80GB of GPU memory to analyze less than 1
         | megabyte of data
         | 
          | 80GB is all human knowledge, compressed, applied to that 1MB.
        
         | pama wrote:
         | The big (huge?) memory requirement is during training. These
         | LLMs work with high dimensional vectors and they calculate
         | gradients with respect to high dimensional vectors and they do
          | updates that require optimizer state. If you have 3 particles
          | in 3 dimensions and you need their forces, that creates 3 new
          | 3D vectors, and once you update their positions along the
          | forces, they also carry momenta. Now generalize
         | these simple 3-body physics to the typical 60-layer creatures
         | inside the LLM with vectors of several thousand dimensions,
         | interactions/weights that are scaling like the squares of these
         | vectors, to a total parameter count that adds up to the 10s to
         | 100s of billions of parameters, and then take derivatives and
         | start to keep track of momenta. It is a feat of modern
         | engineering that some groups can train such models efficiently.
         | I hope we will see more of the training stories becoming public
         | in the near future.
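          | 
          | A crude per-parameter count for just the optimizer bookkeeping
          | (mixed-precision Adam; activations and parallelism ignored,
          | and the parameter count is an assumption):
          | 
          |   bytes_per_param = (2      # fp16 weights
          |                      + 2    # fp16 gradients
          |                      + 4    # fp32 master weights
          |                      + 4    # fp32 Adam first moment
          |                      + 4)   # fp32 Adam second moment
          |   params = 52e9             # assume a ~52B-parameter model
          |   print(params * bytes_per_param / 1e9, "GB")  # ~832 GB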
        
           | nl wrote:
           | This is wrong. You need big memory during inference too.
           | 
           | The difference there is you can use tricks like quantisation
           | and offloading to CPU to reduce it somewhat at the cost of
           | accuracy and/or speed.
        
             | brrrrrm wrote:
              | Training takes roughly 3x the memory used by inference,
              | and is usually run at a much larger batch size.
        
         | imtringued wrote:
         | Compared to the human brain they are shockingly efficient. It's
         | the hardware that isn't, but that is just a matter of time.
        
         | nl wrote:
         | That's all the world's knowledge compressed into 80GB. It's not
          | analysing 1MB of data; it's analysing all of that knowledge
          | plus an additional 1MB.
        
       | smusamashah wrote:
       | There was a recent thread on explaining Mamba
       | https://news.ycombinator.com/item?id=39501982
       | (https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html)
       | 
       | There was another one on the same thing, probably better
       | https://news.ycombinator.com/item?id=39482428
       | (https://jackcook.com/2024/02/23/mamba.html)
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Mamba Explained: The State Space Model Taking On
         | Transformers_ - https://news.ycombinator.com/item?id=39501982 -
         | Feb 2024 (93 comments)
         | 
         |  _Mamba: The Easy Way_ -
         | https://news.ycombinator.com/item?id=39482428 - Feb 2024 (60
         | comments)
         | 
         |  _Is Mamba Capable of In-Context Learning?_ -
         | https://news.ycombinator.com/item?id=39286410 - Feb 2024 (1
         | comment)
         | 
         |  _Vision Mamba: Efficient Visual Representation Learning with
         | Bidirectional SSM_ -
         | https://news.ycombinator.com/item?id=39214939 - Feb 2024 (16
         | comments)
         | 
         |  _MoE-Mamba: Efficient Selective State Space Models with
         | Mixture of Experts_ -
         | https://news.ycombinator.com/item?id=38932350 - Jan 2024 (39
         | comments)
         | 
         |  _Implementation of Mamba in one file of PyTorch_ -
         | https://news.ycombinator.com/item?id=38708730 - Dec 2023 (109
         | comments)
         | 
         |  _Show HN: Fortran inference code for the Mamba state space
         | language model_ - https://news.ycombinator.com/item?id=38687342
         | - Dec 2023 (1 comment)
         | 
         |  _Guide to the Mamba architecture that claims to be a
         | replacement for Transformers_ -
         | https://news.ycombinator.com/item?id=38659238 - Dec 2023 (2
         | comments)
         | 
         |  _Mamba outperforms transformers "everywhere we tried"_ -
         | https://news.ycombinator.com/item?id=38606590 - Dec 2023 (25
         | comments)
         | 
         |  _Mamba: Linear-Time Sequence Modeling with Selective State
         | Spaces_ - https://news.ycombinator.com/item?id=38522428 - Dec
         | 2023 (37 comments)
         | 
         |  _Mamba: New SSM arch with linear-time scaling that outperforms
         | Transformers_ - https://news.ycombinator.com/item?id=38520992 -
         | Dec 2023 (2 comments)
        
           | garyiskidding wrote:
           | thank you, these are very helpful.
        
       | a_wild_dandan wrote:
       | To those curious about the tradeoffs between transformer and
       | state space model layers, I highly recommend Sasha Rush's video
       | on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
        
         | az226 wrote:
          | They use less memory for inference but remember details less
          | well. For instance, if you're implementing code and want
          | edits, it will forget that various functions are part of the
          | script. Even transformers aren't perfect at this, and SSMs are
          | worse. For many use cases that ability isn't needed as much,
          | so the memory savings are a bigger lever.
        
       | haddr wrote:
        | Will it be possible to run this model family in ollama?
        
         | andy99 wrote:
          | Mamba is supported in llama.cpp, so it should be (edit:
          | apparently it's not strictly the Mamba architecture but a mix
          | of Mamba and transformer layers, so it looks like it would
          | have to be ported to llama.cpp).
        
       | google234123 wrote:
        | I'm pretty sure computational chemists have been combining NNs
        | with Kalman filters for a while now... I recall the issue was
        | that it was slow due to the N^2 size of the covariance matrix.
        
         | uoaei wrote:
         | Surprised they hadn't found ways to advance their techniques
         | with e.g. low-rank approximations, etc.
        
           | theGnuMe wrote:
           | That's one strategy. Also flash attention.
        
       | ipsum2 wrote:
       | @dang this is blogspam for the official post:
       | https://www.ai21.com/blog/announcing-jamba
        
       | ninjahatori wrote:
        | On a side note: working over longer contexts also reminds me of
        | MemGPT (https://github.com/cpacker/MemGPT). I think a similar
        | concept can be applied to Mamba-architecture models too.
        
       | eigenvalue wrote:
        | Has anyone gotten this to work on Linux using 1 or 2 4090s? I
        | get stuck on "Loading checkpoint shards: 71%" and then it bails.
        | But weirdly nvidia-smi shows plenty of VRAM available. My
        | machine has 256GB of RAM so I don't think that's the problem
        | either. Really excited to try this one.
        
       | cs702 wrote:
       | Please link to the original post:
       | 
       | https://www.ai21.com/blog/announcing-jamba
       | 
       | Jamba looks _fabulous_. Good performance for its size _and_ much
       | more efficient than the available open alternatives.
       | 
        | The key idea: one out of every eight transformer blocks in
        | Jamba applies dot-product attention with quadratic cost, but
        | the other seven out of eight apply a Mamba layer with linear
        | cost. And the entire model is a mixture-of-experts (MoE), so
        | only ~12B parameters are active at once for inference.
       | 
       | Thank you to the folks at AI21 for making Jamba available!
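        | 
        | A toy rendering of that layout (total depth and exact MoE
        | placement are assumptions; only the 1-in-8 attention ratio and
        | the MoE-on-about-half-the-layers detail come from the posts):
        | 
        |   layers = []
        |   for i in range(32):
        |       kind = "attention" if i % 8 == 0 else "mamba"
        |       if i % 2 == 1:      # MoE on roughly every other layer
        |           kind += "+moe"
        |       layers.append(kind)
        |   print(layers[:8])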
        
         | swyx wrote:
          | i haven't seen anyone mention this yet so i'll be the first -
          | what is the comparison vs StripedHyena?
         | https://www.together.ai/blog/stripedhyena-7b
        
           | cs702 wrote:
           | Mamba came out of the same research group, Hazy Research, led
           | by Chris Re. This new "Jamba" model incorporating Mamba and
           | dot-product attention layers has ~8x more parameters than the
           | largest open Striped Hyena, and appears to work much better.
        
       | sleepingreset wrote:
       | god damn
        
       | unraveller wrote:
        | Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost
        | to a Mixtral-8x7B-tier LLM. If it truly has 256k context, 3x
        | longer, faster & cheaper than anything else, it should mean an
        | end to the One Model To Rule Them All mindset for now. The big
        | boys will have to offer some version of it as a separate but
        | close sidekick integration to their hero offering.
        
       | moneycantbuy wrote:
        | Would a 192GB RAM Mac Studio, or even a 7950X with 192GB RAM, be
        | practical for running this model for inference and possibly
        | fine-tuning? Especially if I don't need very low latency, e.g. 1
        | token per second is fine for inference. I also have two 3090s.
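        | 
        | Back of the envelope (assuming ~52B total parameters, which may
        | be off):
        | 
        |   params = 52e9                     # assumed total size
        |   print(params * 2 / 1e9, "GB")     # fp16:   ~104 GB
        |   print(params * 0.5 / 1e9, "GB")   # ~4-bit: ~26 GB
        | 
        | So fp16 would need the 192GB boxes, while a 4-bit quant could
        | fit across the two 3090s. Fine-tuning is another story, since
        | optimizer state multiplies that several times over.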
        
       | zelphirkalt wrote:
       | Is there a Sparabo too?
       | 
       | It is always funny to see old names associated with totally
       | different new things!
        
       | toddmorey wrote:
       | Released with open weights!
        
       | CGamesPlay wrote:
       | Does this mean that I can continue a chat without needing to send
       | a full transcript? This feels like it could make inference a lot
       | cheaper for multi-step dialogs.
        
       | zzzzzzzzzz10 wrote:
       | Where can I download and use it?
        
       | kjkjadksj wrote:
        | People need to pick better names. Mamba is already a popular
        | Python package, and internet search tools are on their knees as
        | it is.
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:02 UTC)