[HN Gopher] Jamba: Production-grade Mamba-based AI model
___________________________________________________________________
Jamba: Production-grade Mamba-based AI model
Author : bubblehack3r
Score : 328 points
Date : 2024-03-28 16:36 UTC (1 day ago)
(HTM) web link (www.maginative.com)
(TXT) w3m dump (www.maginative.com)
| kelseyfrog wrote:
| I'm glad we're seeing exploration into scaling post-transformer
| LLM architectures, but I'm disappointed that it _has_ a context
| window. That was kind of the selling point of Mamba (and SSM
| models in general), right? Linear scaling because
| state + input = next_state + output?
| refulgentis wrote:
| I'm not sure I follow fully; it is also the case for
| (handwaves) "traditional" LLMs that state + input = next state
| + output. It's just that output grows, so as output becomes
| input, eventually state + input / next state + output exceeds
| the context size.
|
| Re: linear scaling, that means the runtime cost is O(n) in
| context size, rather than the traditional transformer's O(n^2).
| maccam912 wrote:
| I think kelseyfrog meant that the state for a Mamba model is
| supposed to "remember" stuff even if it doesn't have the
| actual tokens to reference any more. It might not be
| guaranteed to hang on to some information about tokens from a
| long time ago, but at least in theory it's possible, whereas
| tokens from before the context window in a traditional LLM may
| as well never have existed.
| kelseyfrog wrote:
| Yes, you said it better than I did :)
| visarga wrote:
| That is valid for pure Mamba; this model (Jamba) is a mix of
| transformer and Mamba layers, so it still has a quadratic
| memory cost, just divided by 8.
| a_wild_dandan wrote:
| state = context
|
| The difference between SSMs and GPTs here is _how_ that state
| /context scales. Per usual in engineering, there are big
| trade-offs!
| kelseyfrog wrote:
| I'm not following. State is a multi-dimensional vector and
| context is a list of tokens. State is perturbed by A and
| Bx(t), while context is appended to by sampling the predicted
| token distribution.
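|
| Roughly what I have in mind, as a toy sketch (not Mamba's exact
| selective parameterization, just the generic SSM recurrence):
|
|     import numpy as np
|
|     d_state, d_in = 16, 4
|     A = np.random.randn(d_state, d_state) * 0.01  # transition
|     B = np.random.randn(d_state, d_in)   # input projection
|     C = np.random.randn(d_in, d_state)   # output projection
|
|     h = np.zeros(d_state)  # fixed-size state, independent of t
|     for x_t in np.random.randn(100, d_in):  # 100 "tokens"
|         h = A @ h + B @ x_t   # state is perturbed, not appended
|         y_t = C @ h           # output depends only on the state
|     # per-token work is constant, so total cost is O(n); memory
|     # stays O(d_state), whereas a context window keeps all n
|     # tokens around.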
| spxneo wrote:
| 256k is huge, dude. That is like half of the average
| nonfiction novel.
|
| I think at least 200~300 pages of PDF.
|
| I'm not complaining here, and it also fits on a GPU.
| htrp wrote:
| compute still has cost?
| samus wrote:
| I'm not sure I understood your question.
|
| This model should have much lower computational cost since only
| one out of eight layers is a traditional transformer layer with
| masked self-attention. Additionally, half of the Mamba layers
| are MoEs.
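|
| As a rough sketch of that interleaving (my reading of the
| announcement; the exact ordering and ratios live in AI21's
| config, so treat the layout below as illustrative):
|
|     # Hypothetical Jamba-style block: 1 attention layer per 8,
|     # MoE on roughly every other Mamba layer. Only the single
|     # attention layer pays a quadratic-in-context cost.
|     def jamba_block(n_layers=8):
|         layers = []
|         for i in range(n_layers):
|             if i == 0:
|                 layers.append("attention")   # quadratic cost
|             elif i % 2 == 1:
|                 layers.append("mamba+moe")   # linear, sparse FFN
|             else:
|                 layers.append("mamba")       # linear cost
|         return layers
|
|     print(jamba_block())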
| krasin wrote:
| The license is a proper open-source one: Apache 2.0. Thanks, AI21
| Labs.
| popalchemist wrote:
| In addition to the architectural and performance benefits, this
| is the big deal here, IMO.
| spxneo wrote:
| I'm so used to seeing AGPLv3.
|
| Apache 2 is a more generous license.
| krasin wrote:
| AGPLv3 is a fine license too. But most of the models nowadays
| come with bullshit licenses, like Llama 2 with its
| "acceptable use policy" enforced by the license:
| https://ai.meta.com/llama/use-policy/
| Reubend wrote:
| It's great to see a full production-level model using Mamba. But
| when it comes to long-context-window benchmarks, I'd love to see
| accuracy as well as throughput. I was under the impression that
| Mamba has huge increases in throughput at the cost of modest
| losses in accuracy when using long contexts.
| refulgentis wrote:
| I would too -- long context has been such a red herring across
| providers; Claude 3 is the first I've seen that seems to
| genuinely have some sort of qualitative leap in noticing
| things.
|
| It is worth noting that I'm fairly sure there's no inherent
| theoretical decrease in accuracy at long contexts; the claimed
| theoretical change is an _increase_ in long-term accuracy at
| long contexts.
| Arthur_ODC wrote:
| Long context is great and all, but it sucks that all of these
| LLMs have really poor output length. If I feed something an
| entire book and ask for a comprehensive summary then I'm
| expecting at least a full 3-page summary. I get that they try
| to force these things to be "concise" to save on compute, but
| good lord it's so annoying.
| CuriouslyC wrote:
| That's a ChatGPT problem; if you hit the API it's not
| nearly so hard to get good output.
| refulgentis wrote:
| I wouldn't say that; my latest big user story for making
| sure I'm handling huge inputs was "translate Moby Dick to
| zoomer". Can't give any service chunks larger than ~5K
| tokens, over API, without it failing.
|
| (Miserably. Like, I'd be fine if it gave a paragraph
| back. But at least on this "map" task, there's a critical
| point where there's so much input that the reward
| function ends up imitating the input more instead of
| chatting.)
| pedrovhb wrote:
| Have you tried asking it for a specific concrete length,
| like a number of words? I was also frustrated with concise
| answers when asking for long ones, but I found that the
| outputs improved significantly if I asked for e.g. 4000
| words specifically. Further than that, have it break it
| down into sections and write X words per section.
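|
| Something like this is what I mean (illustrative only; the
| outline and word counts are made up):
|
|     sections = {
|         "Plot overview": 800,
|         "Major characters": 700,
|         "Themes and symbolism": 900,
|         "Critical reception": 600,
|     }
|     prompt = "Write a comprehensive summary of the book.\n"
|     for title, words in sections.items():
|         prompt += f"- Section '{title}': about {words} words.\n"
|     prompt += "Write every section in full; do not shorten."
|     print(prompt)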
| Arthur_ODC wrote:
| Yes, all the possible length-extending custom
| instructions you can think of, plus multi-shot example
| prompts using multiple USER and GPT exchanges to define
| the format. I can get some reasonable-length responses
| out of it, but I've never seen them go over a page's
| worth. Seems like GPT-4 has a hard limit on how much it
| will output when you click "continue", and Claude Opus
| never goes over a page either. Another user pointed out
| using the API, which I have done in the past, but it's
| been a long while, and I can't really justify the cost of
| using the advanced models via API for my general use.
| refulgentis wrote:
| Everyone's coalescing at a max of 4096 tokens / 12 "pages"
| via API (a page being 250 words, i.e. one 8.5"x11" sheet,
| double-spaced).
|
| To your point, it doesn't matter anyway; it's nigh
| impossible to get over 2K tokens of output with every
| trick and bit of guidance you can think of. (I got
| desperate when 16K / 48 pages came out to "make it work";
| even completely deforming tricks like making it number
| each line and write a reminder on each line that it
| should write 1000 lines don't work.)
| binalpatel wrote:
| Gemini 1.5 Pro is really good at long context in my
| experience.
| tempusalaria wrote:
| Every long-context model sucks right now. All the model
| providers benchmark on fact recall, which is very limited.
| Actual ability to do anything complicated beyond 16k tokens is
| not present in any current model I have seen.
| ukuina wrote:
| This is not current. GPT-4-Turbo (128k) has lossless recall
| to the first 64k input tokens and produces output
| indistinguishable from GPT-4 (32k), though both are limited
| to 4k output tokens.
|
| Several downsides: Recall accuracy past the first 64k
| tokens suffers badly; Cost is astronomical; Response
| latency is too high for most interactive use-cases.
|
| I would point out the astounding leap in input context in
| just one year. Should we assume effectively-infinite (RAG-
| free) context in the near-future?
| anoncareer0212 wrote:
| This is grossly untrue, in a way that denotes surface-
| level familiarity on several fronts.
|
| You're referring to the needle-in-a-haystack retrieval
| problem.
|
| Which the person you're replying to explicitly mentioned
| is the only benchmark providers are using, for good
| reason.
|
| Consider the "translate Moby Dick to comedic zoomer"
| problem. This does not even come remotely close to
| working unless I do it in maximum chunks of 5,000 tokens.
|
| Consider the API output limit of 4096 tokens, across all
| providers.
|
| And no, you shouldn't assume effectively infinite (RAG
| free) context in the near future. This time last year,
| Anthropic was demonstrating 120,000 token context. It
| released 200K a few weeks ago. And runtime cost scales
| with N^2.
| samus wrote:
| This one should have you covered :-) One out of every eight
| layers is a traditional Transformer layer, which should ensure
| precision, at least over short distances.
| swyx wrote:
| > which should ensure precision, at least over short
| distances.
|
| why? i don't follow. transformers should provide some
| attention over -all- distances, no? why does layering
| truncate this to "short distances"?
| samus wrote:
| I mean "short" in comparison to the unlimited, but lossy
| recall that the Mamba blocks provide. Transformers are
| limited to the context length, while Mamba can carry along
| state. While it can remember things from a lot farther
| back, it is limited and must thus eventually drop things
| and/or lose precision.
| gautamcgoel wrote:
| Why include self-attention layers at all? In other words, why not
| just alternate SSM and MLP layers?
| NLPaep wrote:
| Mamba is bad with long context. It doesn't remember phone
| numbers:
|
| https://www.harvard.edu/kempner-institute/2024/02/05/repeat-...
| a_wild_dandan wrote:
| Good! DNNs unlock _semantics_ (parsing, transforming,
| producing). That's the basis of general intelligence, not
| encyclopedic random string recall. Models shouldn't burn
| ungodly quantities of compute emulating DDR5 with their
| working memory. We need machines that _think better_, not
| _memorize_ well. We already have plenty of those.
|
| Massive context windows, and their needle tests, are
| misguided. We won't reach human-level AGI by basically
| inventing a natural language RDBMS. Our resources should
| primarily target better reasoning systems for our models,
| reinforcement learning, etc.
|
| If we can build a GPT4-level problem solving system that
| coincidentally also can't remember telephone numbers, I'll
| consider it major progress.
| 6gvONxR4sf7o wrote:
| Memorization usually refers to training data. It's often
| useful to have something that can utilize instructions
| losslessly, which is the distinction between these models.
| Rodeoclash wrote:
| I can't remember phone numbers either but I can use a device
| suited to remembering them to look them up
| orra wrote:
| Hell, it looks like you forgot you already said that (-:
| Rodeoclash wrote:
| Haha, I blame the Harmonic app :/
| imtringued wrote:
| What if your field of vision were infinite and you were
| looking at an unrolled telephone book?
|
| Would you need a device to remember the phone number? You
| wouldn't. You would need a method or algorithm to find the
| number, but there is no reason why that algorithm couldn't
| be part of the attention mechanism. The attention mechanism
| is akin to reading the entire phone book for every word you
| are about to say. It would be unreasonable to expect you
| not to find the right phone number eventually.
| Rodeoclash wrote:
| I can't remember phone numbers either but I can use a device
| suited to remembering them to look them up.
| skybrian wrote:
| > Jamba boasts an extensive context window of 256K tokens,
| equivalent to around 210 pages of text, while fitting up to 140K
| tokens on a single 80GB GPU.
|
| I realize this is a big improvement, but it's striking how
| inefficient LLMs are, that you need 80GB of GPU memory to
| analyze less than 1 megabyte of data. That's a lot of bloat!
| Hopefully there's a lot of room for algorithmic improvements.
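|
| Back-of-the-envelope for where the memory goes (illustrative
| numbers, not Jamba's actual config): most of it is weights, and
| the part that grows with the 1MB of input is the attention KV
| cache.
|
|     params = 50e9                  # say, a ~50B-parameter model
|     weights_gb = params * 1 / 1e9  # ~50 GB at 8-bit weights
|
|     # KV cache for ONE full-attention layer at 140K context:
|     seq_len, n_kv_heads, head_dim = 140_000, 8, 128
|     kv_gb = 2 * seq_len * n_kv_heads * head_dim * 2 / 1e9
|     print(weights_gb, kv_gb)       # ~50 GB, ~0.57 GB per layer
|     # A vanilla transformer pays that KV cost in every layer
|     # (dozens of them); a 1-in-8 attention mix pays it far less
|     # often, which is how 140K tokens fit next to the weights.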
| electric_mayhem wrote:
| It's literally simulating a neural network.
|
| How much of your 5-sense experiential memories and decades of
| academic book learning are you bringing to understand my reply
| to your post?
|
| How many gigabytes do you think that's equivalent to?
| _false wrote:
| I love both parent posts' perspectives on this.
| skybrian wrote:
| Jamba seems to be distributed as 21 5-gigabyte files [1] so I
| guess that's another way of looking at it.
|
| [1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main
| imtringued wrote:
| So what? I have seen models distributed as 26x 10GB files.
| richardw wrote:
| It's kinda simulating our brains, but not really. When I
| attempted to dig more into how neurons work I realised that
| there's a massive chasm of difference. Very much worth doing
| if you haven't (you might know far better than me; this is
| for people who don't yet).
|
| In terms of results: our brains are working with 20W of power
| and can be trained to compete with LLMs using a tiny
| fraction of the world's data. They also have to keep you
| breathing and your blood pumping and manage all the dangers
| of catching a ball near traffic. Or skiing, or poetry, or
| sunsets. And they remember stuff five minutes later and don't
| need a training run that takes months.
|
| We have SO many opportunities to improve the AI architecture
| it's ridiculous. This is a good thing.
| reissbaker wrote:
| To be fair most of the brain is more like a pretrained
| model -- it isn't being trained at any point after
| conception to keep your blood pumping or your lungs
| working, it does that out of the box roughly as soon as you
| sprout those organs (or the minute you're born, in the case
| of lungs). The training process was billions of years of
| evolution. And, well, given fairly persistent cross-
| cultural cognitive biases, I expect the conscious thought
| parts are starting from a pretrained model, too, and all
| we're doing in school is finetuning ;)
| imtringued wrote:
| People don't understand that to simulate a single biological
| neuron, you need an entire neural network. So 70 billion
| parameters might at best be equivalent to a million neurons,
| and that is assuming your neural network architecture is akin
| to the connections between neurons. Considering the physical
| sparsity, you might need even more parameters to model the
| connections of a biological neural network, so fewer than a
| million neurons in practice.
| nostrowski wrote:
| Two things I'm curious to know:
|
| 1. How many tokens can 'traditional' models (e.g. Mistral's
| 8x7B) fit on a single 80GB GPU?
|
| 2. How does quantization affect the single transformer layer
| in the stack? What are the performance/accuracy trade-offs
| that happen when so little of the stack depends on this
| bottleneck?
| patrakov wrote:
| Mixtral 8x7b runs well (i.e., produces the correct output
| faster than I can read it) on a modern AMD or Intel laptop
| without any use of a GPU - provided that you have enough RAM
| and CPU cores. 32 GB of RAM and 16 hyperthreads are enough
| with 4-bit quantization if you don't ask too much in terms of
| context.
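|
| For anyone wanting to reproduce this, something along these
| lines should work (assuming the llama-cpp-python bindings and a
| 4-bit GGUF; the plain llama.cpp CLI works the same way, and the
| file name below is just an example):
|
|     from llama_cpp import Llama
|
|     llm = Llama(
|         model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",
|         n_ctx=4096,    # keep context modest to stay within RAM
|         n_threads=16,  # one per hardware thread
|     )
|     out = llm("Summarize the Mamba architecture briefly.",
|               max_tokens=256)
|     print(out["choices"][0]["text"])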
|
| P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
| riku_iki wrote:
| > that you need 80GB of GPU memory to analyze less than 1
| megabyte of data
|
| The 80GB is all of human knowledge, compressed, applied to
| that 1MB.
| pama wrote:
| The big (huge?) memory requirement is during training. These
| LLMs work with high-dimensional vectors, they calculate
| gradients with respect to those vectors, and they do updates
| that require the state of the optimizer. If you have 3
| particles in 3 dimensions and you need their forces, that
| creates 3 new 3D vectors; once you update their positions
| along the forces, they also carry momenta. Now generalize
| this simple 3-body physics to the typical 60-layer creatures
| inside an LLM, with vectors of several thousand dimensions,
| interactions/weights that scale like the squares of these
| vectors, and a total parameter count that adds up to tens to
| hundreds of billions, and then take derivatives and start to
| keep track of momenta. It is a feat of modern engineering
| that some groups can train such models efficiently. I hope we
| will see more of the training stories becoming public in the
| near future.
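|
| To make the bookkeeping concrete, here is the standard mixed-
| precision Adam accounting (generic, not any particular lab's
| recipe):
|
|     P = 52e9               # parameters
|     bf16, fp32 = 2, 4      # bytes per value
|     weights = P * bf16     # working copy of the weights
|     grads   = P * bf16     # gradients
|     master  = P * fp32     # fp32 master weights
|     adam_m  = P * fp32     # first moments ("momenta")
|     adam_v  = P * fp32     # second moments
|     total = weights + grads + master + adam_m + adam_v
|     print(round(total / 1e12, 1), "TB before activations")  # ~0.8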
| nl wrote:
| This is wrong. You need big memory during inference too.
|
| The difference there is you can use tricks like quantisation
| and offloading to CPU to reduce it somewhat at the cost of
| accuracy and/or speed.
| brrrrrm wrote:
| Training takes roughly 3x the memory used by inference, and
| is usually run at a much larger batch size.
| imtringued wrote:
| Compared to the human brain they are shockingly efficient. It's
| the hardware that isn't, but that is just a matter of time.
| nl wrote:
| That's all the world's knowledge compressed into 80GB. It's not
| analysing 1MB of data, it's analysing all of that knowledge
| plus an additional 1MB.
| smusamashah wrote:
| There was a recent thread on explaining Mamba
| https://news.ycombinator.com/item?id=39501982
| (https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html)
|
| There was another one on the same thing, probably better
| https://news.ycombinator.com/item?id=39482428
| (https://jackcook.com/2024/02/23/mamba.html)
| dang wrote:
| Thanks! Macroexpanded:
|
| _Mamba Explained: The State Space Model Taking On
| Transformers_ - https://news.ycombinator.com/item?id=39501982 -
| Feb 2024 (93 comments)
|
| _Mamba: The Easy Way_ -
| https://news.ycombinator.com/item?id=39482428 - Feb 2024 (60
| comments)
|
| _Is Mamba Capable of In-Context Learning?_ -
| https://news.ycombinator.com/item?id=39286410 - Feb 2024 (1
| comment)
|
| _Vision Mamba: Efficient Visual Representation Learning with
| Bidirectional SSM_ -
| https://news.ycombinator.com/item?id=39214939 - Feb 2024 (16
| comments)
|
| _MoE-Mamba: Efficient Selective State Space Models with
| Mixture of Experts_ -
| https://news.ycombinator.com/item?id=38932350 - Jan 2024 (39
| comments)
|
| _Implementation of Mamba in one file of PyTorch_ -
| https://news.ycombinator.com/item?id=38708730 - Dec 2023 (109
| comments)
|
| _Show HN: Fortran inference code for the Mamba state space
| language model_ - https://news.ycombinator.com/item?id=38687342
| - Dec 2023 (1 comment)
|
| _Guide to the Mamba architecture that claims to be a
| replacement for Transformers_ -
| https://news.ycombinator.com/item?id=38659238 - Dec 2023 (2
| comments)
|
| _Mamba outperforms transformers "everywhere we tried"_ -
| https://news.ycombinator.com/item?id=38606590 - Dec 2023 (25
| comments)
|
| _Mamba: Linear-Time Sequence Modeling with Selective State
| Spaces_ - https://news.ycombinator.com/item?id=38522428 - Dec
| 2023 (37 comments)
|
| _Mamba: New SSM arch with linear-time scaling that outperforms
| Transformers_ - https://news.ycombinator.com/item?id=38520992 -
| Dec 2023 (2 comments)
| garyiskidding wrote:
| thank you, these are very helpful.
| a_wild_dandan wrote:
| To those curious about the tradeoffs between transformer and
| state space model layers, I highly recommend Sasha Rush's video
| on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
| az226 wrote:
| They use less memory for inference but remember details
| less well. For instance, if you're implementing code and want
| edits, it will forget that various functions are part of the
| script. Even transformers aren't perfect at this, and SSMs are
| even worse. For many use cases that ability isn't needed as
| much, so the memory savings are a bigger lever.
| haddr wrote:
| Will it be possible to run this model family in Ollama?
| andy99 wrote:
| Mamba is supported in llama.cpp, so it should be. (Edit:
| apparently it's not strictly the Mamba architecture, it's a mix
| of Mamba and transformers, so it looks like it would have to be
| ported to llama.cpp.)
| google234123 wrote:
| I'm pretty sure computational chemists have been combining NNs
| with Kalman filters for a while now... I recall the issue was
| that it was slow due to the N^2 size of the covariance matrix.
| uoaei wrote:
| Surprised they hadn't found ways to advance their techniques
| with e.g. low-rank approximations, etc.
| theGnuMe wrote:
| That's one strategy. Also flash attention.
| ipsum2 wrote:
| @dang this is blogspam for the official post:
| https://www.ai21.com/blog/announcing-jamba
| ninjahatori wrote:
| On a side note: working over longer contexts also reminds me of
| MemGPT (https://github.com/cpacker/MemGPT). I think a similar
| concept can be applied to Mamba-architecture models too.
| eigenvalue wrote:
| Has anyone gotten this to work on Linux using 1 or 2 4090s? I
| get stuck on "Loading checkpoint shards: 71%" and then it
| bails. But weirdly nvidia-smi shows plenty of VRAM available.
| My machine has 256GB of RAM, so I don't think that's the
| problem either. Really excited to try this one.
| cs702 wrote:
| Please link to the original post:
|
| https://www.ai21.com/blog/announcing-jamba
|
| Jamba looks _fabulous_. Good performance for its size _and_ much
| more efficient than the available open alternatives.
|
| The key idea: one out of every eight blocks in Jamba applies
| dot-product attention with quadratic cost, while the other
| seven apply a Mamba layer with linear cost. And the entire
| model is a mixture of experts (MoE), so only ~12B parameters
| are used at once for inference.
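|
| A toy cost model makes the payoff visible (illustrative
| numbers, not AI21's): per generated token, an attention layer
| does work proportional to the current context length n, while
| a Mamba layer's work is constant in n.
|
|     def hybrid_cost(n_ctx, n_layers=64, attn_every=8):
|         attn = n_layers // attn_every
|         return attn * n_ctx + (n_layers - attn) * 1
|
|     for n in (4_000, 64_000, 256_000):
|         full_attn = 64 * n  # all-attention baseline
|         print(n, round(full_attn / hybrid_cost(n), 1))
|     # roughly 8x less context-dependent work per token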
|
| Thank you to the folks at AI21 for making Jamba available!
| swyx wrote:
| i haven't seen anyone mention this yet so i'll be the first -
| what is the comparison vs StripedHyena?
| https://www.together.ai/blog/stripedhyena-7b
| cs702 wrote:
| Mamba came out of the same research group, Hazy Research, led
| by Chris Re. This new "Jamba" model incorporating Mamba and
| dot-product attention layers has ~8x more parameters than the
| largest open Striped Hyena, and appears to work much better.
| sleepingreset wrote:
| god damn
| unraveller wrote:
| Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost
| to a Mixtral-8x7B-tier LLM. If it truly has 256k context, 3x
| longer, faster & cheaper than anything else, it should mean an
| end to the One Model To Rule Them All mindset for now. The big
| boys will have to offer some version of it as a separate but
| close sidekick integration to their hero offering.
| moneycantbuy wrote:
| Would a 192GB RAM Mac Studio, or even a 7950X with 192GB RAM,
| be practical for running this model for inference and possibly
| fine-tuning? Especially if I don't need very low latency, e.g.
| 1 token per second is fine for inference. I also have two
| 3090s.
| zelphirkalt wrote:
| Is there a Sparabo too?
|
| It is always funny to see old names associated with totally
| different new things!
| toddmorey wrote:
| Released with open weights!
| CGamesPlay wrote:
| Does this mean that I can continue a chat without needing to send
| a full transcript? This feels like it could make inference a lot
| cheaper for multi-step dialogs.
| zzzzzzzzzz10 wrote:
| Where can I download and use it?
| kjkjadksj wrote:
| People need to pick better names. Mamba is already a popular
| Python package, and internet search tools are on their knees
| already.
___________________________________________________________________
(page generated 2024-03-29 23:02 UTC)