[HN Gopher] Jamba: Production-grade Mamba-based AI model
       ___________________________________________________________________
        
       Jamba: Production-grade Mamba-based AI model
        
       Author : bubblehack3r
       Score  : 219 points
       Date   : 2024-03-28 16:36 UTC (6 hours ago)
        
 (HTM) web link (www.maginative.com)
 (TXT) w3m dump (www.maginative.com)
        
       | kelseyfrog wrote:
        | I'm glad we're seeing exploration into scaling post-transformer
        | LLM architectures, but I'm disappointed that it _has_ a context
        | window. That was kind of the selling point of Mamba (and SSM
        | models in general), right? Linear scaling because
        | state+input=next_state+output?
        
         | refulgentis wrote:
          | I'm not sure I follow fully; it is also the case for
          | (handwaves) "traditional" LLMs that state + input = next state
          | + output. It's just that the output grows, so as output
          | becomes input, eventually state + input / next state + output
          | exceeds the context size.
          | 
          | Re: linear scaling, that means the runtime cost is O(n) in
          | context size, rather than the traditional transformer's O(n^2).
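          | 
          | A back-of-the-envelope sketch of that scaling difference (pure
          | Python, numbers purely illustrative; d_state is a made-up
          | state size):
          | 
          |   # Attention: each new token attends to every previous token,
          |   # so total work over a sequence of length n grows like n^2.
          |   # An SSM carries a fixed-size state, so total work grows
          |   # like n.
          |   def attention_cost(n):
          |       return n * n
          | 
          |   def ssm_cost(n, d_state=16):
          |       return n * d_state
          | 
          |   for n in (1_000, 10_000, 100_000):
          |       print(n, attention_cost(n), ssm_cost(n))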
        
           | maccam912 wrote:
            | I think kelseyfrog meant that the state for a Mamba model is
            | supposed to "remember" stuff even if it doesn't have the
            | actual tokens to reference any more. It might not be
            | guaranteed to hang on to some information about tokens from a
            | long time ago, but at least in theory it's possible, whereas
            | tokens from before the context window in a traditional LLM
            | may as well never have existed.
        
             | kelseyfrog wrote:
             | Yes, you said it better than I did :)
        
           | visarga wrote:
            | That is valid for Mamba, but this model (Jamba) is a mix of
            | transformer and Mamba layers, so it still has a quadratic
            | memory cost, just divided by 8.
        
         | a_wild_dandan wrote:
         | state = context
         | 
          | The difference between SSMs and GPTs here is _how_ that
          | state/context scales. Per usual in engineering, there are big
          | trade-offs!
        
           | kelseyfrog wrote:
           | I'm not following. State is a multi-dimensional vector and
           | context is a list of tokens. State is perturbed by A and
           | Bx(t), while context is appended to by sampling the predicted
           | token distribution.
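            | 
            | Concretely, a minimal sketch of the two update rules (shapes,
            | the discretization, and A/B/C here are toy stand-ins, not the
            | actual Mamba parameterization):
            | 
            |   import numpy as np
            | 
            |   d_state, d_in = 16, 1
            |   A = np.random.randn(d_state, d_state) * 0.01  # state transition (toy)
            |   B = np.random.randn(d_state, d_in)            # input projection (toy)
            |   C = np.random.randn(d_in, d_state)            # output projection (toy)
            | 
            |   # SSM: a fixed-size state h is updated in place per token.
            |   h = np.zeros((d_state, 1))
            |   def ssm_step(x_t):
            |       global h
            |       h = A @ h + B @ x_t   # state is perturbed, never grows
            |       return C @ h
            | 
            |   # Transformer: the "state" is the context itself, which
            |   # grows by one token per step.
            |   context = []
            |   def transformer_step(token_id):
            |       context.append(token_id)
            |       return context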
        
         | spxneo wrote:
          | 256k is huge, dude. That is like half of the average
          | non-fiction novel.
          | 
          | I think at least 200~300 pages of PDF.
          | 
          | I'm not complaining here, and it also fits on a GPU.
        
       | htrp wrote:
       | compute still has cost?
        
         | samus wrote:
          | I'm not sure I understood your question.
         | 
         | This model should have much lower computational cost since only
         | one out of eight layers is a traditional transformer layer with
         | masked self-attention. Additionally, half of the Mamba layers
         | are MoEs.
        
       | krasin wrote:
       | The license is a proper open-source one: Apache 2.0. Thanks, AI21
       | Labs.
        
         | popalchemist wrote:
         | In addition to the architectural and performance benefits, this
         | is the big deal here, IMO.
        
         | spxneo wrote:
          | I'm so used to seeing AGPLv3.
          | 
          | Apache 2 is a more generous license.
        
           | krasin wrote:
           | AGPLv3 is a fine license too. But most of the models nowadays
           | come with bullshit licenses, like Llama 2 with its
           | "acceptable use policy" enforced by the license:
           | https://ai.meta.com/llama/use-policy/
        
       | Reubend wrote:
        | It's great to see a full production-level model using Mamba. But
        | when it comes to long-context-window benchmarks, I'd love to see
        | performance as well as throughput. I was under the impression
        | that Mamba has huge increases in throughput at the cost of modest
        | losses in accuracy when using long contexts.
        
         | refulgentis wrote:
         | I would too -- long context has been such a red herring across
         | providers, Claude 3 is the first I've seen that seems to
         | genuinely have some sort of qualitative leap in noticing
         | things.
         | 
          | It is worth noting I'm fairly sure there's no inherent
          | theoretical decrease in accuracy at long contexts; the claimed
          | theoretical change is an _increase_ in long-term accuracy at
          | long contexts.
        
           | Arthur_ODC wrote:
            | Long context is great and all, but it sucks that all of these
            | LLMs have really poor output length. If I feed something an
            | entire book and ask for a comprehensive summary, then I'm
            | expecting at least a full 3-page summary. I get that they try
            | to force these things to be "concise" to save on compute, but
            | good lord it's so annoying.
        
             | CuriouslyC wrote:
              | That's a ChatGPT problem; if you hit the API it's not
              | nearly so hard to get good output.
        
               | refulgentis wrote:
                | I wouldn't say that; my latest big user story for making
                | sure I'm handling huge inputs was "translate Moby Dick to
                | zoomer". Can't give any service chunks larger than ~5K
                | tokens, over the API, without it failing.
                | 
                | (Miserably, like, I'd be fine if it gave a paragraph
                | back. But at least on this "map" task, there's a critical
                | point where there's so much input that the reward
                | function ends up imitating the input instead of
                | chatting.)
        
             | pedrovhb wrote:
             | Have you tried asking it for a specific concrete length,
             | like a number of words? I was also frustrated with concise
             | answers when asking for long ones, but I found that the
             | outputs improved significantly if I asked for e.g. 4000
             | words specifically. Further than that, have it break it
             | down into sections and write X words per section.
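              | 
              | For example, something along these lines (prompt wording
              | and numbers are just illustrative):
              | 
              |   sections = ["Plot overview", "Main characters", "Themes",
              |               "Critical reception"]
              |   words_per_section = 4000 // len(sections)
              |   prompt = "Summarize the attached book in ~4000 words total.\n"
              |   for i, title in enumerate(sections, 1):
              |       prompt += f"Section {i}: {title} (~{words_per_section} words)\n"
              |   print(prompt)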
        
               | Arthur_ODC wrote:
               | Yes, all the possible length extending custom
               | instructions you can think of. I can get some reasonable
               | length responses out of it, but I've never seen them go
               | over 1 page worth, and multi-shot example prompts using
               | multiple USER and GPT exchanges to define the format.
               | Seems like GPT4 has a hard limit as to how much it will
               | output when you click "continue", and Claude Opus never
               | goes over a page either. Another user pointed out using
               | the API, which I have done in the past, but it's been a
               | long while, and I can't really justify the cost of using
               | the advanced models via API for my general use.
        
           | binalpatel wrote:
           | Gemini 1.5 Pro is really good at long context in my
           | experience.
        
           | tempusalaria wrote:
            | Every long-context model sucks right now. All the model
            | providers benchmark on fact recall, which is very limited.
            | The actual ability to do anything complicated beyond 16k
            | tokens is not present in any current model I have seen.
        
         | samus wrote:
          | This one should have you covered :-) One out of every eight
          | layers is a traditional Transformer layer, which should ensure
          | precision, at least over short distances.
        
       | gautamcgoel wrote:
       | Why include self-attention layers at all? In other words, why not
       | just alternate SSM and MLP layers?
        
         | NLPaep wrote:
         | Mamba is bad with long context. It doesn't remember phone
          | numbers.
         | 
         | https://www.harvard.edu/kempner-institute/2024/02/05/repeat-...
        
           | a_wild_dandan wrote:
            | Good! DNNs unlock _semantics_ (parsing, transforming,
            | producing). That's the basis of general intelligence, not
            | encyclopedic random-string recall. Models shouldn't burn
            | ungodly quantities of compute emulating DDR5 with their
            | working memory. We need machines that _think better_, not
            | _memorize_ well. We already have plenty of those.
           | 
           | Massive context windows, and their needle tests, are
           | misguided. We won't reach human-level AGI by basically
           | inventing a natural language RDBMS. Our resources should
           | primarily target better reasoning systems for our models,
           | reinforcement learning, etc.
           | 
           | If we can build a GPT4-level problem solving system that
           | coincidentally also can't remember telephone numbers, I'll
           | consider it major progress.
        
           | Rodeoclash wrote:
           | I can't remember phone numbers either but I can use a device
           | suited to remembering them to look them up
        
             | orra wrote:
             | Hell, it looks like you forgot you already said that (-:
        
           | Rodeoclash wrote:
           | I can't remember phone numbers either but I can use a device
           | suited to remembering them to look them up.
        
       | skybrian wrote:
       | > Jamba boasts an extensive context window of 256K tokens,
       | equivalent to around 210 pages of text, while fitting up to 140K
       | tokens on a single 80GB GPU.
       | 
        | I realize this is a big improvement, but it's striking how
        | inefficient LLMs are, in that you need 80GB of GPU memory to
        | analyze less than 1 megabyte of data. That's a lot of bloat!
        | Hopefully there's a lot of room for algorithmic improvements.
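        | 
        | Rough arithmetic behind the "less than 1 megabyte" figure
        | (assuming roughly 4 characters per token, the usual English-text
        | ballpark):
        | 
        |   tokens_on_one_gpu = 140_000      # from the announcement
        |   chars_per_token = 4              # rough assumption
        |   approx_bytes = tokens_on_one_gpu * chars_per_token
        |   print(approx_bytes / 1e6, "MB")  # ~0.56 MB of raw text
        |                                    # vs. 80 GB of GPU memory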
        
         | electric_mayhem wrote:
         | It's literally simulating a neural network.
         | 
         | How much of your 5-sense experiential memories and decades of
         | academic book learning are you bringing to understand my reply
         | to your post?
         | 
         | How many gigabytes do you think that's equivalent to?
        
           | _false wrote:
           | I love both parent post perspectives on this.
        
           | skybrian wrote:
           | Jamba seems to be distributed as 21 5-gigabyte files [1] so I
           | guess that's another way of looking at it.
           | 
           | [1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main
        
         | nostrowski wrote:
         | Two things I'm curious to know:
         | 
          | 1. How many tokens can 'traditional' models (e.g. Mistral's
          | 8x7B) fit on a single 80GB GPU?
          | 
          | 2. How does quantization affect the single transformer layer in
          | the stack? What are the performance/accuracy trade-offs that
          | happen when so little of the stack depends on this bottleneck?
        
           | patrakov wrote:
           | Mixtral 8x7b runs well (i.e., produces the correct output
           | faster than I can read it) on a modern AMD or Intel laptop
           | without any use of a GPU - provided that you have enough RAM
           | and CPU cores. 32 GB of RAM and 16 hyperthreads are enough
           | with 4-bit quantization if you don't ask too much in terms of
           | context.
           | 
           | P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
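            | 
            | For reference, a minimal llama-cpp-python sketch of that kind
            | of CPU-only setup (the model filename is hypothetical; any
            | 4-bit GGUF build of Mixtral 8x7B would do):
            | 
            |   from llama_cpp import Llama  # pip install llama-cpp-python
            | 
            |   llm = Llama(
            |       model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical file
            |       n_ctx=4096,      # keep context modest to fit in 32 GB RAM
            |       n_threads=16,    # match the number of hyperthreads
            |       n_gpu_layers=0,  # CPU only
            |   )
            |   out = llm("Explain state space models briefly.", max_tokens=128)
            |   print(out["choices"][0]["text"])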
        
         | riku_iki wrote:
         | > that you need 80GB of GPU memory to analyze less than 1
         | megabyte of data
         | 
          | 80GB is all of human knowledge, compressed, applied to that
          | 1 MB.
        
       | smusamashah wrote:
       | There was a recent thread on explaining Mamba
       | https://news.ycombinator.com/item?id=39501982
       | (https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html)
       | 
       | There was another one on the same thing, probably better
       | https://news.ycombinator.com/item?id=39482428
       | (https://jackcook.com/2024/02/23/mamba.html)
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Mamba Explained: The State Space Model Taking On
         | Transformers_ - https://news.ycombinator.com/item?id=39501982 -
         | Feb 2024 (93 comments)
         | 
         |  _Mamba: The Easy Way_ -
         | https://news.ycombinator.com/item?id=39482428 - Feb 2024 (60
         | comments)
         | 
         |  _Is Mamba Capable of In-Context Learning?_ -
         | https://news.ycombinator.com/item?id=39286410 - Feb 2024 (1
         | comment)
         | 
         |  _Vision Mamba: Efficient Visual Representation Learning with
         | Bidirectional SSM_ -
         | https://news.ycombinator.com/item?id=39214939 - Feb 2024 (16
         | comments)
         | 
         |  _MoE-Mamba: Efficient Selective State Space Models with
         | Mixture of Experts_ -
         | https://news.ycombinator.com/item?id=38932350 - Jan 2024 (39
         | comments)
         | 
         |  _Implementation of Mamba in one file of PyTorch_ -
         | https://news.ycombinator.com/item?id=38708730 - Dec 2023 (109
         | comments)
         | 
         |  _Show HN: Fortran inference code for the Mamba state space
         | language model_ - https://news.ycombinator.com/item?id=38687342
         | - Dec 2023 (1 comment)
         | 
         |  _Guide to the Mamba architecture that claims to be a
         | replacement for Transformers_ -
         | https://news.ycombinator.com/item?id=38659238 - Dec 2023 (2
         | comments)
         | 
         |  _Mamba outperforms transformers "everywhere we tried"_ -
         | https://news.ycombinator.com/item?id=38606590 - Dec 2023 (25
         | comments)
         | 
         |  _Mamba: Linear-Time Sequence Modeling with Selective State
         | Spaces_ - https://news.ycombinator.com/item?id=38522428 - Dec
         | 2023 (37 comments)
         | 
         |  _Mamba: New SSM arch with linear-time scaling that outperforms
         | Transformers_ - https://news.ycombinator.com/item?id=38520992 -
         | Dec 2023 (2 comments)
        
       | a_wild_dandan wrote:
       | To those curious about the tradeoffs between transformer and
       | state space model layers, I highly recommend Sasha Rush's video
       | on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
        
       | haddr wrote:
        | Will it be possible to run this model family in Ollama?
        
         | andy99 wrote:
          | Mamba is supported in llama.cpp, so it should be (edit:
          | apparently it's not strictly the Mamba architecture, it's a mix
          | of Mamba and transformer layers, so it looks like it would have
          | to be ported to llama.cpp).
        
       | google234123 wrote:
        | I'm pretty sure computational chemists have been combining NNs
        | with Kalman filters for a while now... I recall the issue was
        | that it was slow due to the N^2 size of the covariance matrix.
        
         | uoaei wrote:
         | Surprised they hadn't found ways to advance their techniques
         | with e.g. low-rank approximations, etc.
        
       | ipsum2 wrote:
       | @dang this is blogspam for the official post:
       | https://www.ai21.com/blog/announcing-jamba
        
       | ninjahatori wrote:
        | On a side note: working over longer contexts also reminds me of
        | MemGPT (https://github.com/cpacker/MemGPT). I think a similar
        | concept can be applied to Mamba-architecture models too.
        
       | eigenvalue wrote:
        | Has anyone gotten this to work on Linux using 1 or 2 4090s? I get
        | stuck on "Loading checkpoint shards: 71%" and then it bails. But
        | weirdly, nvidia-smi shows plenty of VRAM available. My machine
        | has 256GB of RAM, so I don't think that's the problem either.
        | Really excited to try this one.
        
       | cs702 wrote:
       | Please link to the original post:
       | 
       | https://www.ai21.com/blog/announcing-jamba
       | 
       | Jamba looks _fabulous_. Good performance for its size _and_ much
       | more efficient than the available open alternatives.
       | 
        | The key idea: One out of every eight transformer blocks in
        | Jamba applies dot-product attention with quadratic cost, but the
        | other seven out of eight apply a Mamba layer with linear cost.
        | And the entire model is a mixture of experts (MoE), so only ~12B
        | parameters are used at once for inference.
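        | 
        | A sketch of that layer schedule (purely illustrative, not the
        | actual implementation; the attention-block offset and the
        | every-other-block MoE pattern are assumptions pieced together
        | from this thread):
        | 
        |   def jamba_block_schedule(n_blocks=32):
        |       # 1 in 8 blocks mixes tokens with attention, the rest with
        |       # Mamba; every other block swaps its MLP for an MoE layer.
        |       schedule = []
        |       for i in range(n_blocks):
        |           mixer = "attention" if i % 8 == 0 else "mamba"
        |           ffn = "moe" if i % 2 == 1 else "mlp"
        |           schedule.append((mixer, ffn))
        |       return schedule
        | 
        |   for i, (mixer, ffn) in enumerate(jamba_block_schedule()):
        |       print(i, mixer, ffn)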
       | 
       | Thank you to the folks at AI21 for making Jamba available!
        
         | swyx wrote:
          | I haven't seen anyone mention this yet, so I'll be the first:
          | what is the comparison vs. StripedHyena?
          | https://www.together.ai/blog/stripedhyena-7b
        
       | sleepingreset wrote:
       | god damn
        
       | unraveller wrote:
        | Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to
        | a Mixtral 8x7B-tier LLM. If it really delivers 256k context, 3x
        | longer, faster & cheaper than anything else, it should mean an
        | end to the One Model To Rule Them All mindset for now. The big
        | boys will have to offer some version of it as a separate but
        | close sidekick integration to their hero offering.
        
       ___________________________________________________________________
       (page generated 2024-03-28 23:00 UTC)