[HN Gopher] Jamba: Production-grade Mamba-based AI model
___________________________________________________________________
Jamba: Production-grade Mamba-based AI model
Author : bubblehack3r
Score : 219 points
Date : 2024-03-28 16:36 UTC (6 hours ago)
(HTM) web link (www.maginative.com)
(TXT) w3m dump (www.maginative.com)
| kelseyfrog wrote:
| I'm glad we're seeing exploration into scaling post-transformer
| LLM architectures, but I'm disappointed that it _has_ a context
| window. That was kind of the selling point of Mamba (and SSM
| models in general), right? Linear scaling, because
| state + input = next_state + output?
| refulgentis wrote:
| I'm not sure I follow fully; it is also the case for
| (handwaves) "traditional" LLMs that state + input = next state
| + output. It's just that the output grows, so as output becomes
| input, eventually state + input / next state + output exceeds
| the context size.
|
| Re: linear scaling, that means the runtime cost is O(n) in
| context size, rather than the traditional transformer's O(n^2).
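|
| To make the scaling difference concrete, here's a rough
| back-of-the-envelope sketch in Python (all dimensions are made
| up for illustration, not taken from Jamba):
|
|   for n in (4_096, 32_768, 262_144):
|       attn_scores = n * n       # O(n^2) scores per head
|       ssm_state = 16 * 4_096    # O(1) state, hypothetical size
|       print(f"n={n}: attn ~{attn_scores:.1e}, ssm {ssm_state}")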
| maccam912 wrote:
| I think kelseyfrog meant that the state of a Mamba model is
| supposed to "remember" stuff even when it no longer has the
| actual tokens to reference. It might not be guaranteed to hang
| on to information about tokens from a long time ago, but at
| least in theory it's possible, whereas tokens from before the
| context window in a traditional LLM may as well never have
| existed.
| kelseyfrog wrote:
| Yes, you said it better than I did :)
| visarga wrote:
| That is valid for pure Mamba; this model (Jamba) is a mix of
| transformer and Mamba layers, so it still has a quadratic
| memory cost, just divided by roughly 8 (only one in eight
| layers uses attention).
| a_wild_dandan wrote:
| state = context
|
| The difference between SSMs and GPTs here is _how_ that
| state/context scales. As usual in engineering, there are big
| trade-offs!
| kelseyfrog wrote:
| I'm not following. State is a multi-dimensional vector and
| context is a list of tokens. State is perturbed by A and
| Bx(t), while context is appended to by sampling the predicted
| token distribution.
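|
| For concreteness, the (non-selective) SSM update I mean looks
| roughly like this toy NumPy sketch; the shapes are made up, and
| real Mamba makes the parameters input-dependent:
|
|   import numpy as np
|
|   d_state, d_in = 16, 4
|   A = 0.9 * np.eye(d_state)           # state transition
|   B = np.random.randn(d_state, d_in)  # input projection
|   C = np.random.randn(d_in, d_state)  # output projection
|
|   h = np.zeros(d_state)               # fixed-size state
|   for x_t in np.random.randn(100, d_in):
|       h = A @ h + B @ x_t             # state perturbed, not grown
|       y_t = C @ h                     # output from state alone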
| spxneo wrote:
| 256k is huge, dude. That's roughly half of an average
| non-fiction novel.
|
| I think at least 200~300 pages of PDF.
|
| I'm not complaining here, and it also fits on a GPU.
| htrp wrote:
| compute still has cost?
| samus wrote:
| I'm not sure I understood your question.
|
| This model should have much lower computational cost since only
| one out of eight layers is a traditional transformer layer with
| masked self-attention. Additionally, half of the Mamba layers
| are MoEs.
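|
| A hypothetical sketch of the layer mix being described here
| (ordering and counts are my guesses, not from the post):
|
|   def jamba_block():
|       layers = ["attention"] + ["mamba"] * 7  # 1-in-8 attention
|       # MoE swapped in on every other layer
|       return [l + "+moe" if i % 2 else l
|               for i, l in enumerate(layers)]
|
|   stack = [l for _ in range(4) for l in jamba_block()]
|   print(len(stack), "layers:",
|         sum("attention" in l for l in stack), "attention,",
|         sum("moe" in l for l in stack), "MoE")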
| krasin wrote:
| The license is a proper open-source one: Apache 2.0. Thanks, AI21
| Labs.
| popalchemist wrote:
| In addition to the architectural and performance benefits, this
| is the big deal here, IMO.
| spxneo wrote:
| I'm so used to seeing AGPLv3.
|
| Apache 2 is a more generous license.
| krasin wrote:
| AGPLv3 is a fine license too. But most of the models nowadays
| come with bullshit licenses, like Llama 2 with its
| "acceptable use policy" enforced by the license:
| https://ai.meta.com/llama/use-policy/
| Reubend wrote:
| It's great to see a full production level model using Mamba. But
| when it comes to long context window benchmarks, I'd love to see
| performance as well as throughput. I was under the impression
| that Mamba has huge increases in throughput at the cost of modest
| losses in accuracy when using long contexts.
| refulgentis wrote:
| I would too -- long context has been such a red herring across
| providers, Claude 3 is the first I've seen that seems to
| genuinely have some sort of qualitative leap in noticing
| things.
|
| It is worth noting I'm fairly sure there's no inherent
| theoretical decrease in accuracy at long contexts; the claimed
| theoretical change is an _increase_ in long-term accuracy at
| long contexts.
| Arthur_ODC wrote:
| Long context is great and all, but it sucks that all of these
| LLMs have really poor output length. If I feed something an
| entire book and ask for a comprehensive summary then I'm
| expecting at least a full 3-page summary. I get that they try
| to force these things to be "concise" to save on compute, but
| good lord it's so annoying.
| CuriouslyC wrote:
| That's a ChatGPT problem; if you hit the API, it's not
| nearly so hard to get good output.
| refulgentis wrote:
| I wouldn't say that; my latest big user story for making
| sure I'm handling huge inputs was "translate Moby Dick to
| zoomer". Can't give any service chunks larger than ~5K
| tokens over the API without it failing.
|
| (Miserably, like, I'd be fine if it gave a paragraph
| back. But at least on this "map" task, there's a critical
| point where there's so much input that the reward
| function ends up imitating the input instead of
| chatting.)
| pedrovhb wrote:
| Have you tried asking it for a specific concrete length,
| like a number of words? I was also frustrated with concise
| answers when asking for long ones, but I found that the
| outputs improved significantly if I asked for e.g. 4000
| words specifically. Beyond that, have it break the output
| into sections and write X words per section.
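|
| E.g., something along these lines (purely illustrative, the
| numbers are arbitrary):
|
|   sections = ["Plot", "Themes", "Characters", "Style"]
|   prompt = "Summarize the attached book in about 4000 words.\n"
|   for s in sections:
|       prompt += f"Section '{s}': roughly 1000 words.\n"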
| Arthur_ODC wrote:
| Yes, all the possible length-extending custom
| instructions you can think of, plus multi-shot example
| prompts using multiple USER and GPT exchanges to define
| the format. I can get some reasonable-length responses
| out of it, but I've never seen them go over about a page.
| Seems like GPT-4 has a hard limit on how much it will
| output when you click "continue", and Claude Opus never
| goes over a page either. Another user pointed out using
| the API, which I have done in the past, but it's been a
| long while, and I can't really justify the cost of using
| the advanced models via API for my general use.
| binalpatel wrote:
| Gemini 1.5 Pro is really good at long context in my
| experience.
| tempusalaria wrote:
| Long context sucks in every model right now. All the model
| providers benchmark on fact recall, which is very limited.
| Actual ability to do anything complicated beyond 16k tokens
| is not present in any current model I have seen.
| samus wrote:
| This one should have you covered :-) One out of every eight
| layers is a traditional Transformer layer, which should ensure
| precision, at least over short distances.
| gautamcgoel wrote:
| Why include self-attention layers at all? In other words, why not
| just alternate SSM and MLP layers?
| NLPaep wrote:
| Mamba is bad with long context. It doesn't remember phone
| numbers
|
| https://www.harvard.edu/kempner-institute/2024/02/05/repeat-...
| a_wild_dandan wrote:
| Good! DNNs unlock _semantics_ (parsing, transforming,
| producing). That's the basis of general intelligence, not
| encyclopedic random string recall. Models shouldn't burn
| ungodly quantities of compute emulating DDR5 with their
| working memory. We need machines that _think better_, not
| _memorize_ well. We already have plenty of those.
|
| Massive context windows, and their needle tests, are
| misguided. We won't reach human-level AGI by basically
| inventing a natural language RDBMS. Our resources should
| primarily target better reasoning systems for our models,
| reinforcement learning, etc.
|
| If we can build a GPT4-level problem solving system that
| coincidentally also can't remember telephone numbers, I'll
| consider it major progress.
| Rodeoclash wrote:
| I can't remember phone numbers either but I can use a device
| suited to remembering them to look them up
| orra wrote:
| Hell, it looks like you forgot you already said that (-:
| Rodeoclash wrote:
| I can't remember phone numbers either but I can use a device
| suited to remembering them to look them up.
| skybrian wrote:
| > Jamba boasts an extensive context window of 256K tokens,
| equivalent to around 210 pages of text, while fitting up to 140K
| tokens on a single 80GB GPU.
|
| I realize this is a big improvement, but it's striking how
| inefficient LLMs are, that you need 80GB of GPU memory to
| analyze less than 1 megabyte of data. That's a lot of bloat!
| Hopefully there's a lot of room for algorithmic improvements.
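|
| Rough arithmetic behind the "less than 1 megabyte", assuming
| the common rule of thumb of ~4 characters per token:
|
|   tokens = 140_000
|   text_bytes = tokens * 4              # ~0.56 MB of raw text
|   print(text_bytes / 1e6, "MB")
|   print(80e9 / text_bytes, "x more VRAM than input text")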
| electric_mayhem wrote:
| It's literally simulating a neural network.
|
| How much of your 5-sense experiential memories and decades of
| academic book learning are you bringing to understand my reply
| to your post?
|
| How many gigabytes do you think that's equivalent to?
| _false wrote:
| I love both parent posts' perspectives on this.
| skybrian wrote:
| Jamba seems to be distributed as 21 5-gigabyte files [1] so I
| guess that's another way of looking at it.
|
| [1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main
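|
| Back-of-the-envelope from the shard sizes, assuming the weights
| are stored in bf16 (2 bytes per parameter):
|
|   total_bytes = 21 * 5e9               # ~105 GB on disk
|   params = total_bytes / 2             # ~52B parameters
|   print(f"~{params / 1e9:.0f}B total parameters")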
| nostrowski wrote:
| Two things I'm curious to know:
|
| 1. How many tokens can 'traditional' models (e.g. Mistral's
| 8x7B) fit on a single 80GB GPU?
|
| 2. How does quantization affect the single transformer layer
| in the stack? What are the performance/accuracy trade-offs
| that happen when so little of the stack depends on this
| bottleneck?
| patrakov wrote:
| Mixtral 8x7b runs well (i.e., produces the correct output
| faster than I can read it) on a modern AMD or Intel laptop
| without any use of a GPU - provided that you have enough RAM
| and CPU cores. 32 GB of RAM and 16 hyperthreads are enough
| with 4-bit quantization if you don't ask too much in terms of
| context.
|
| P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
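|
| If anyone wants to reproduce this, a minimal sketch with
| llama-cpp-python (the model filename and settings here are just
| examples):
|
|   from llama_cpp import Llama
|
|   llm = Llama(
|       model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # 4-bit
|       n_ctx=4096,      # keep context modest to fit in RAM
|       n_threads=16,    # match your core count
|   )
|   out = llm("Explain state space models briefly.",
|             max_tokens=256)
|   print(out["choices"][0]["text"])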
| riku_iki wrote:
| > that you need 80GB of GPU memory to analyze less than 1
| megabyte of data
|
| That 80GB is all of compressed human knowledge being applied
| to that 1 MB.
| smusamashah wrote:
| There was a recent thread on explaining Mamba
| https://news.ycombinator.com/item?id=39501982
| (https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html)
|
| There was another one on the same thing, probably better
| https://news.ycombinator.com/item?id=39482428
| (https://jackcook.com/2024/02/23/mamba.html)
| dang wrote:
| Thanks! Macroexpanded:
|
| _Mamba Explained: The State Space Model Taking On
| Transformers_ - https://news.ycombinator.com/item?id=39501982 -
| Feb 2024 (93 comments)
|
| _Mamba: The Easy Way_ -
| https://news.ycombinator.com/item?id=39482428 - Feb 2024 (60
| comments)
|
| _Is Mamba Capable of In-Context Learning?_ -
| https://news.ycombinator.com/item?id=39286410 - Feb 2024 (1
| comment)
|
| _Vision Mamba: Efficient Visual Representation Learning with
| Bidirectional SSM_ -
| https://news.ycombinator.com/item?id=39214939 - Feb 2024 (16
| comments)
|
| _MoE-Mamba: Efficient Selective State Space Models with
| Mixture of Experts_ -
| https://news.ycombinator.com/item?id=38932350 - Jan 2024 (39
| comments)
|
| _Implementation of Mamba in one file of PyTorch_ -
| https://news.ycombinator.com/item?id=38708730 - Dec 2023 (109
| comments)
|
| _Show HN: Fortran inference code for the Mamba state space
| language model_ - https://news.ycombinator.com/item?id=38687342
| - Dec 2023 (1 comment)
|
| _Guide to the Mamba architecture that claims to be a
| replacement for Transformers_ -
| https://news.ycombinator.com/item?id=38659238 - Dec 2023 (2
| comments)
|
| _Mamba outperforms transformers "everywhere we tried"_ -
| https://news.ycombinator.com/item?id=38606590 - Dec 2023 (25
| comments)
|
| _Mamba: Linear-Time Sequence Modeling with Selective State
| Spaces_ - https://news.ycombinator.com/item?id=38522428 - Dec
| 2023 (37 comments)
|
| _Mamba: New SSM arch with linear-time scaling that outperforms
| Transformers_ - https://news.ycombinator.com/item?id=38520992 -
| Dec 2023 (2 comments)
| a_wild_dandan wrote:
| To those curious about the tradeoffs between transformer and
| state space model layers, I highly recommend Sasha Rush's video
| on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
| haddr wrote:
| Will it be possible to run this model family in Ollama?
| andy99 wrote:
| Mamba is supported in llama.cpp, so it should be (edit:
| apparently it's not strictly the Mamba architecture; it's a mix
| of Mamba and transformers, so it looks like it would have to be
| ported to llama.cpp).
| google234123 wrote:
| I'm pretty sure computational chemists have been combining NNs
| with Kalman filters for a while now... I recall the issue was
| that it was slow due to the N^2 size of the covariance matrix.
| uoaei wrote:
| Surprised they haven't found ways to advance their techniques
| with e.g. low-rank approximations.
| ipsum2 wrote:
| @dang this is blogspam for the official post:
| https://www.ai21.com/blog/announcing-jamba
| ninjahatori wrote:
| On a side note: working over longer contexts also reminds me of
| MemGPT (https://github.com/cpacker/MemGPT). I think a similar
| concept could be applied to Mamba-architecture models too.
| eigenvalue wrote:
| Has anyone gotten this to work on Linux using one or two 4090s?
| I get stuck on "Loading checkpoint shards: 71%" and then it
| bails. But weirdly, nvidia-smi shows plenty of VRAM available.
| My machine has 256GB of RAM, so I don't think that's the
| problem either. Really excited to try this one.
| cs702 wrote:
| Please link to the original post:
|
| https://www.ai21.com/blog/announcing-jamba
|
| Jamba looks _fabulous_. Good performance for its size _and_
| much more efficient than the available open alternatives.
|
| The key idea: one out of every eight blocks in Jamba applies
| dot-product attention with quadratic cost, while the other
| seven apply a Mamba layer with linear cost. And the entire
| model is a mixture of experts (MoE), so only ~12B parameters
| are used at once for inference.
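|
| A rough sketch of why only ~12B parameters are active per token
| in an MoE (all numbers here are illustrative guesses, not the
| actual Jamba configuration):
|
|   n_experts, top_k = 16, 2     # route each token to 2 experts
|   expert_params = 2.5e9        # per-expert FFN params (made up)
|   shared_params = 5e9          # attention/Mamba/etc. (made up)
|   total = shared_params + n_experts * expert_params
|   active = shared_params + top_k * expert_params
|   print(f"total ~{total/1e9:.0f}B, active ~{active/1e9:.0f}B")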
|
| Thank you to the folks at AI21 for making Jamba available!
| swyx wrote:
| I haven't seen anyone mention this yet, so I'll be the first:
| how does it compare vs StripedHyena?
| https://www.together.ai/blog/stripedhyena-7b
| sleepingreset wrote:
| god damn
| unraveller wrote:
| Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost
| to a Mixtral 8x7B-tier LLM. If it truly has 256k context, 3x
| longer, faster, and cheaper than anything else, it should mean
| an end to the One Model To Rule Them All mindset for now. The
| big boys will have to offer some version of it as a separate
| but close sidekick integration to their hero offering.
___________________________________________________________________
(page generated 2024-03-28 23:00 UTC)