[HN Gopher] Implementation of Mamba in one file of PyTorch
___________________________________________________________________
Implementation of Mamba in one file of PyTorch
Author : johnma2006
Score : 311 points
Date : 2023-12-20 13:58 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| andy99 wrote:
| The original Mamba code has a lot of speed optimizations and
| other machinery that make it difficult to grasp immediately,
| so this will help with learning.
|
| I can't help but also plug my own Mamba inference implementation.
| https://github.com/rbitr/llm.f90/tree/master/ssm
|
| For inference one token at a time everything simplifies
| considerably.
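|
| (A minimal PyTorch sketch of what that single-token step looks
| like -- illustrative names and shapes, not the repo's code: the
| scan collapses to one recurrent update per token.)
|
|     import torch
|
|     d_in, n = 8, 16                     # illustrative sizes
|     h = torch.zeros(d_in, n)            # SSM state carried between tokens
|
|     def step(h, deltaA_t, deltaB_u_t, C_t):
|         # h_t = A_bar_t * h_{t-1} + B_bar_t * u_t ;  y_t = h_t @ C_t
|         h = deltaA_t * h + deltaB_u_t   # (d_in, n)
|         y = h @ C_t                     # (d_in,)
|         return h, y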
| cs702 wrote:
| Fortran! If you don't mind me asking, why Fortran?
|
| I know it underpins a _lot_ of time-tested scientific code,
| often wrapped by libraries like PyTorch and Numpy, but Fortran
| isn't exactly a popular language nowadays. What's your
| rationale for using it?
| andy99 wrote:
| Tl;dr: Fortran is low-level-ish and compiled, but otherwise
| almost identical to numpy syntax-wise.
|
| It supports all the common array and matrix operations and it
| doesn't need memory and pointer management the way C does.
| But it still compiles down to something very fast, you can
| link in BLAS and GPU libraries, supports easy parallelism...
|
| When I compare with e.g. Karpathy's llama2.c, I think Fortran
| is easy to work with when implementing basic transformer
| inference because of how it handles arrays.
|
| The downside is that while there are efforts to modernize it,
| I find it more cumbersome for non-numerical stuff,
| particularly strings. But I think for the actual linear
| algebra implementation, it can't be beat.
|
| I should add, I know it's a bit of an uphill battle, I expect
| fewer people will use code that I write in Fortran vs
| basically anything else. But I'm hoping to pull some people
| in and get a critical mass of interest because I think it has
| a lot of promise. That's actually one of the reasons I wanted
| to get a Mamba implementation quickly (though now that
| there's a basic python one I think I'll have lost some
| potential users to it :)
| cs702 wrote:
| Thanks for the thoughtful response.
|
| Unfortunately, I too think it will be a bit of an uphill
| battle for you.
|
| If you haven't already, take a look at Mojo and Julia. Both
| offer many of the benefits of Fortran, but unlike it, they
| are seeing growing adoption.
| andy99 wrote:
| An uphill battle is fine
| cs702 wrote:
| This looks _really nice_. Thank you for sharing it on HN!
|
| In case you didn't know, you can parallelize the slow Python loop
| in _selective_scan_ that computes all the x's:
|
|     x = torch.zeros((b, d_in, n))
|     for i in range(l):
|         x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
|         ...
|
| with only two calls to the PyTorch API. See the examples here:
| https://github.com/glassroom/heinsen_sequence/blob/main/READ...
| [a]
|
| You can then compute all the y's with one einsum, instead of l
| sequential einsums.
|
| ---
|
| [a] Previous discussion on HN:
| https://news.ycombinator.com/item?id=38556669
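|
| A minimal sketch of the log-space trick from the linked README,
| assuming the per-step coefficients and inputs are positive
| (otherwise complex-valued logs are needed); names here are
| illustrative:
|
|     import torch
|
|     def parallel_linear_recurrence(a, b):
|         # x_t = a_t * x_{t-1} + b_t (x_0 = 0) for all t at once,
|         # via two cumulative ops in log space; a, b: (batch, length)
|         a_star = a.log().cumsum(dim=-1)
|         x_log = a_star + (b.log() - a_star).logcumsumexp(dim=-1)
|         return x_log.exp()
|
|     # quick check against the sequential loop
|     a, b = torch.rand(2, 5) + 0.1, torch.rand(2, 5) + 0.1
|     x, xs = torch.zeros(2), []
|     for t in range(5):
|         x = a[:, t] * x + b[:, t]
|         xs.append(x)
|     print(torch.allclose(parallel_linear_recurrence(a, b),
|                          torch.stack(xs, dim=-1), atol=1e-5))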
| make3 wrote:
| OP's code is much easier to understand, though, which is the
| main (only) purpose of their code
| cs702 wrote:
| Can't argue with that! :-)
|
| For what it's worth, you can keep both, and make parallel vs
| sequential execution an option, with a boolean flag.
|
| You can also leave the sequential code as a comment
| explaining what the parallel code does.
|
| Or, if slow execution doesn't bother you, leave it as is.
| bradfitz wrote:
| You're replying to somebody who was arguing for readability
| being its virtue and you're proposing ... adding options
| and alternate code paths? :)
| anytime5704 wrote:
| Via a boolean parameter, no less.
| cs702 wrote:
| _Touche._ I just updated my comment :-)
| boredumb wrote:
| "Mamba is the world's longest venomous snake with an estimated
| length of over 150 m"
|
| Had a laugh at that. Really great stuff though; it was nice to
| have references to the arXiv paper, so someone like me who
| generally consumes these things instead of translating them from
| papers could sort of peek behind the curtains.
| visarga wrote:
| Mamba has a great name ... [S]elective [S]tructured [S]tate
| [S]pace [S]equence models.. makes sSSSS, like a snake
| behnamoh wrote:
| If only the "mamba" name were not ugly.
| rdedev wrote:
| Wait, I thought that was the king cobra? The longest venomous
| snake? At least that's what a simple Google search showed
| me.
|
| Would be funny if they had to issue a correction for that
| sentence later on
| fwip wrote:
| It's also not 150 meters long (nearly 500 feet), which I
| think is also part of why it was funny to include the
| sentence in the README.
| y42 wrote:
| slightly OT:
|
| I really struggle with the dozens and dozens of terms being
| used in the field of machine learning and especially AI.
| I'm not a beginner at all, but I wonder if there is a
| comprehensive guide to all those terms that doesn't necessarily
| explain the technology behind them in detail, but shows their
| position and relation to each other, like some kind of landscape.
|
| "everyone" seems to know Mamba. I never heard of Mamba. There are
| constantly new kind of llm popping up, talking about stuff that
| seems to be obvious.
|
| So, is there some kind of resource like that, not aiming at
| beginners, but experienced users, coming from other fields of IT?
| sevagh wrote:
| >"everyone" seems to know Mamba. I never heard of Mamba
|
| Only the "everybody who knows what mamba is" are the ones
| upvoting and commenting. Think of all the people who ignore it.
| For me, Mamba is the faster version of Conda [1], and that's
| why I clicked on the article.
|
| https://github.com/mamba-org/mamba
| 3-cheese-sundae wrote:
| Ah yes, Conda, definitely something else I've heard of.
| NavinF wrote:
| Conda has been around for a decade and it used to be the
| primary package manager for everything related to
| numpy/scipy. Most ML and data science people have heard of
| it even if they haven't used it.
| sevagh wrote:
| Conda is the latest LLM cli frontend that's a MOE of
| Mistral 7B, LLama 17B, Falcon 32C, and the Yamaha YZ50 quad
| bike.
| gpderetta wrote:
| > and the Yamaha YZ50 quad bike.
|
| Well played.
| supermatt wrote:
| It's extremely common to manage Python environments with
| conda (although it can do much more). If you are unaware of
| conda, it is unlikely you work with Python, and therefore
| unlikely to be doing much with ML (and LLMs) anyway - it's
| even part of the "getting started" documentation for
| PyTorch.
| IshKebab wrote:
| That is not "a new LLVM architecture"... It's talking about a
| different Mamba.
| CaptainOfCoit wrote:
| I'm not aware of such a glossary.
|
| But I did notice the "References" section in the bottom of the
| README, which does explain what Mamba is by linking to the
| original paper: "Mamba: Linear-Time Sequence Modeling with
| Selective State Spaces" https://arxiv.org/abs/2312.00752
| bananaflag wrote:
| I knew about Mamba from r/singularity and following AI
| researchers on Twitter.
|
| I don't work in AI at all (and don't plan to), but it's fun to
| know about stuff a little before they become mainstream.
| swyx wrote:
| the field just moves fast. I have curated a list of non-hypey
| writers and youtubers who explain these things for a typical
| SWE audience if you are interested.
| https://github.com/swyxio/ai-notes/blob/main/Resources/Good%...
| y42 wrote:
| Will check it, thank you!
| orbifold wrote:
| It is a very fad-driven field. Everyone brands everything. It
| isn't enough to give things boring titles like "stacked open
| linear dynamical system with selective observations and learned
| timestep".
| Ar-Curunir wrote:
| I mean, Mamba is much easier to remember than what you said.
| It's good to have short names for techniques.
| cyanydeez wrote:
| That's half of it; the other half is pure social linguistics.
|
| Try talking about "stacked open linear dynamical systems" more
| than three times and you're bound to come up with a token that
| conveys the same thing but is quicker to produce.
|
| It's turtles all the way down with LLMs and your comment:
| people are just trying to pack as much as possible into their
| tokens.
| yuppiepuppie wrote:
| Heavily agree. Ive been following this space quite closely,
| like most people, only for the past year. But it seems to be
| still in its experimental phase which in turn brings academics
| and researchers who tend toward this type of language.
| sva_ wrote:
| I didn't know Mamba but the bottom of the page lists
| comprehensive references.
|
| If you mean the "branding" that is common in ML, which is often
| criticized, I much prefer it over the jargon used in other
| fields, e.g. Mathematics. It is nice to have distinguished
| words to talk about different concepts.
| visarga wrote:
| > I never heard of Mamba.
|
| Just came out a few days ago. It's new for everyone.
| amelius wrote:
| Mamba is also the name of a package management system,
| similar to Conda.
|
| Just to make it a little extra confusing :)
|
| https://github.com/mamba-org/mamba
| Tao3300 wrote:
| Should have picked a different snake, like... I dunno, Asp?
| Wait, no, not that one...
| sevagh wrote:
| Python!
| wenc wrote:
| In fast evolving fields it's always all about sociology, not
| canon or pedagogy. Meaning in new fields is created in
| community (constructionism).
|
| You need to plug into the community and overhear what people
| are talking about (HN is such a community). You'll also get a
| sense of the linguistic subculture (acronyms, lingo etc) much
| like you learn to talk hip hop if you're into the hip hop
| subculture. Much of it will be noise but overall you'll get a
| sense of what the community cares about, which helps you narrow
| what you need to focus on. The subreddit r/localllama is the
| watering hole for hobbyists right now.
|
| If you need a primer, this is a good guide.
|
| https://flyte.org/blog/getting-started-with-large-language-m...
|
| In this particular case, I find it helpful to do syntopical
| reading (per Mortimer Adler) around LLMs not AI in general.
| Mamba is interesting to me because I have a background in
| optimal control and state space models are my bread and butter
| and it's fascinating to see them applied in this way.
|
| Side: I'm in my 40s and this isn't my first rodeo. There will
| always be new fields and trends emerging -- I've been through
| several waves of this (cloud, big data, ML, data science etc)
| where posts like yours are commonplace. But there is no need to
| be frustrated. Overhearing conversations is one way to make
| sense of them instead of feeling lost and waiting for someone
| to summarize and explain everything to you.
|
| The same applies to academic fields.
|
| Ps also consider you might not need to be on the cutting edge.
| If you're not trying to build leading edge stuff, it's good to
| wait for the dust to settle -- you'll waste less time following
| dead ends while the community is figuring out what's good.
| pshc wrote:
| Perhaps the community at r/localllama could train an LLM that
| knows about the latest developments and explains jargon and
| papers, updated weekly. Free idea for karma.
| wenc wrote:
| Not a bad idea.
|
| I actually read papers with the help of ChatGPT-4 and
| Claude. It helps me quickly understand papers that I don't
| have a background in.
|
| For instance when I see something I don't understand I ask
| it "can you break that down for me?" Or "is this similar to
| (concept I know)?"
|
| It's the new way of doing syntopical reading -- but faster
| and more efficient.
|
| (For the uninitiated, it's a technique from Mortimer
| Adler's How to read a book)
| ttul wrote:
| This is a great way to consume papers. If there's one
| thing LLMs know, it's machine learning literature!
| jsight wrote:
| How do you feed a recent arxiv paper directly to ChatGPT?
| Min0taur wrote:
| If you have the + subscription you can upload pdfs
| directly/ask it to ingest.
| WhitneyLand wrote:
| A few options are:
|
| 1. Select abstract or select all text then copy/paste.
|
| 2. Save the PDF and upload with ChatGPT's document
| feature.
|
| 3. Ask for it, "what's that well known LLM paper about
| context and getting lost in the middle?". It will web
| search as needed.
|
| You can also do more than summarize. Ask about equations,
| ask it to make analogies, challenge the key findings as
| devil's advocate to learn from different angles. Propose
| your own ideas.
|
| Use voice to digest topics during your commute and ask
| tons of questions until you understand.
| y42 wrote:
| Good point, thanks for the link! (one of the links there
| leads to this wonderful post: Highly recommended:
| http://jalammar.github.io/illustrated-transformer/)
| countWSS wrote:
| It's a new LLM type: instead of transformers it uses state-space
| models, which are orders of magnitude faster. It's currently
| very new and less coherent than GPT-2.
| senseiV wrote:
| ? It's better than GPT-2 for sure...
| falcor84 wrote:
| > in the field of machine learning and especially AI
|
| Sorry for getting semantical here, but isn't ML a subfield of
| AI? In other words, I would have expected "... in the field of
| machine learning and AI in general"
| dragonwriter wrote:
| AI is often being used recently for specifically _generative_
| AI, which is a subfield of machine learning, which is a
| subfield of AI in the broader sense.
| quickthrower2 wrote:
| You are now in the loop! Your colleagues will think the same:
| "how does this person keep up with all the LLM stuff?"
| TrackerFF wrote:
| The people that are constantly up to date on this stuff tend to
| be AI/ML researchers and engineers. In academia, industry
| research groups, or startups.
|
| They literally get paid to read papers, and implement models on
| a day-to-day basis.
|
| I wouldn't worry too much not being up to date or things
| sounding a bit foreign. The names themselves are just that,
| names, the models themselves tend to be incremental versions of
| some previous model.
| toasted-subs wrote:
| Most of the startups I've chatted with seem to prioritize
| finding people who build products. The complaint/regret I've
| heard from 3-5 organizations was hiring researchers.
|
| Researchers are more for highly funded organizations. Startups
| can get by with off-the-shelf models.
| jiggawatts wrote:
| Don't feel bad, Mamba is _very new_ technology. I only just
| heard about it for the first time last week!
| esafak wrote:
| Not everybody knows Mamba. You can't stay on top of
| everything in ML so stop trying. Since you asked, Mamba is a
| neural architecture based on structured state space models
| (SSMs) that aims to replace Transformers. For me, right now, just
| knowing that much counts as staying on top of things. If I need to
| know more than that, I can have the computer summarize it for me.
| swyx wrote:
| things I'd like a non-ML-researcher explanation of about Mamba:
|
| 1. what is the overall insight of state space models beyond
| transformers? (i know this is somewhat covered in the paper but
| still a bit inaccessible)
|
| 2. what was the incremental innovation/result that is making
| Mamba more successful/interesting than its predecessors? (S4, H3,
| Monarch etc)
|
| 3. what are the implications beyond subquadratic scaling of
| context? say if i don't really care about context length > 100k
| tokens. what other benefits are there - for example, is Mamba
| potentially more compute-efficient to train for a similar size of
| model/dataset?
|
| just offering 3 prompts for knowledgeable people to drop some
| alpha
| logicchains wrote:
| For 2, Mamba makes some A B C weights that in S4 are time
| invariant become functions of the input, which makes it more
| powerful.
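|
| A rough sketch of what "functions of the input" means in
| practice: per-timestep Delta, B and C come from linear
| projections of the input (hypothetical layer names; per the
| paper, A itself stays input-independent):
|
|     import torch, torch.nn as nn
|
|     d_inner, d_state, dt_rank, L = 64, 16, 4, 10
|     x = torch.randn(2, L, d_inner)          # (batch, length, channels)
|
|     to_dbc = nn.Linear(d_inner, dt_rank + 2 * d_state, bias=False)
|     dt_up  = nn.Linear(dt_rank, d_inner)
|
|     dt, B, C = to_dbc(x).split([dt_rank, d_state, d_state], dim=-1)
|     delta = torch.nn.functional.softplus(dt_up(dt))  # one Delta per position
|     # In S4 these would be fixed parameters; here they vary with the
|     # input at every timestep.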
| pk-protect-ai wrote:
| > is Mamba potentially more compute-efficient to train for a
| similar size of model/dataset?
|
| I would like to understand that as well ...
|
| Here is the citation from original paper:
|
| "Computation. After the parameters have been transformed from
| (Δ, A, B, C) -> (Ā, B̄, C), the model can be computed in two
| ways, either as a linear recurrence (2) or a global convolution
| (3). Commonly, the model uses the convolutional mode (3) for
| efficient parallelizable training (where the whole input
| sequence is seen ahead of time), and switched into recurrent
| mode (2) for efficient autoregressive inference (where the
| inputs are seen one timestep at a time)."
|
| So the training is parallelizable, like in RetNet with parallel
| forward mode. By default inference is done in the recurrent
| mode, to have the longest possible context. No chunking is
| available, so it is difficult for me to say how much RAM and
| VRAM it will consume during the inference ...
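|
| (A toy check of the two modes the quote refers to, for a
| time-invariant diagonal SSM with made-up sizes; Mamba's selective
| version gives up the convolutional mode because its parameters
| change at every timestep.)
|
|     import torch
|
|     L, n = 16, 4
|     A = torch.rand(n) * 0.9        # discrete diagonal transition A_bar
|     B = torch.randn(n)             # discrete input projection B_bar
|     C = torch.randn(n)             # readout
|     u = torch.randn(L)             # input sequence
|
|     # (2) recurrent mode: one step at a time
|     x, y_rec = torch.zeros(n), []
|     for t in range(L):
|         x = A * x + B * u[t]
|         y_rec.append((C * x).sum())
|     y_rec = torch.stack(y_rec)
|
|     # (3) convolutional mode: kernel K_k = C . (A_bar^k * B_bar),
|     # then a causal convolution with the input
|     k = torch.arange(L).unsqueeze(-1)
|     K = (C * (A ** k) * B).sum(-1)
|     y_conv = torch.stack([(K[:t + 1].flip(0) * u[:t + 1]).sum()
|                           for t in range(L)])
|
|     print(torch.allclose(y_rec, y_conv, atol=1e-4))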
| ttul wrote:
| My IQ is orders of magnitude lower than the authors of the
| paper, but I did my best to work through it anyway. I studied
| CE and have the basic control theory background and undergrad
| level discrete time systems intuition. It would take much
| additional studying to understand state space models enough to
| really parse this paper. But I tried anyway. Take my comment
| here with a big grain of salt.
|
| The overall insight of Mamba is to solve a longstanding problem
| with state space models. They are good at compressing the input
| context, but the compression of input into a hidden state
| erases information needed to make use of the context as
| effectively as Transformers do.
|
| Their solution to this problem is to create what they call a
| selection mechanism. The mechanism is input-dependent, allowing
| the model to adjust its output at each step as the input
| changes. How they do this is by making a few of the state space
| variables input-dependent instead of input-invariant. They
| choose a few of the state space variables and attach linear
| layers and such to project the input onto the state space
| variable at each time step. The linear layers (etc) are
| obviously trained so that they know how to transform the input
| appropriately so that the model spits out useful output.
|
| But making the state space variables input dependent creates a
| problem in terms of computation overhead. They fix the
| computation problem by designing a machine architecture-aware
| algorithm that makes the most of modern GPU memory
| architecture, avoiding moving things in and out of HBM as much
| as possible.
|
| Tri Dao came up with Flash Attention, which is basically a way
| to use hardware more efficiently in a Transformer. So this is
| his jam 100%.
|
| I know this doesn't add much to understanding the paper, but
| hopefully it's better than nothing.
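|
| For what it's worth, a simplified sketch of where that
| input-dependence ends up in the computation (shapes and names
| are illustrative, not the repo's exact code): the projected,
| per-timestep delta/B/C feed a scan whose coefficients change
| with the input.
|
|     import torch
|
|     bsz, L, d, n = 2, 10, 8, 16
|     u = torch.randn(bsz, L, d)          # input after the in-projection
|     A = -torch.rand(d, n)               # learned, input-independent
|     delta = torch.rand(bsz, L, d)       # input-dependent step sizes
|     Bt = torch.randn(bsz, L, n)         # input-dependent B_t
|     Ct = torch.randn(bsz, L, n)         # input-dependent C_t
|
|     # discretize with the per-timestep delta, then scan
|     deltaA = torch.exp(delta.unsqueeze(-1) * A)
|     deltaB_u = (delta.unsqueeze(-1) * Bt.unsqueeze(2)
|                 * u.unsqueeze(-1))
|
|     x, ys = torch.zeros(bsz, d, n), []
|     for t in range(L):
|         x = deltaA[:, t] * x + deltaB_u[:, t]   # update varies with input
|         ys.append(torch.einsum('bdn,bn->bd', x, Ct[:, t]))
|     y = torch.stack(ys, dim=1)                  # (bsz, L, d)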
| SpaceManNabs wrote:
| Is this similar to subset selection with the concrete
| distribution?
| sjkoelle wrote:
| my loose understanding
|
| 1) transformers create an input x input size attention matrix
| that is unnecessarily large. state space models somehow
| compress this.
|
| 2) "The main difference is simply making several parameters [in
| the state space model] functions of the input"
|
| 3) i think it might be more sample efficient (requires less
| data)
| WhitneyLand wrote:
| I think this video is exactly what you're looking for.
|
| He explains the paper but also gives a lot of context, how it
| fits into the big picture, etc.
|
| It's actually kind of exciting hearing the plot unfold.
|
| https://youtu.be/ouF-H35atOY?si=y2Ckp9MCFd7ulLL3
| mcemilg wrote:
| Looks wonderful. But I would like to add this: I hate einops; it
| doesn't make things simple to read, unfortunately.
| andy99 wrote:
| I re-implemented Mamba myself and this was the first time I had
| ever worked with einops/einsum. I'm 50/50 on them after this. I
| found them relatively easy to look at and understand the intent
| (possibly more so than other representations), but they take
| extra time to translate into other primitives (loops,
| multiplication, etc.). I believe torch.einsum is generally
| well optimized compared to naive looping, too. All said, I
| don't know if I'd use it myself working from scratch, but it's
| interesting to know, and if I were working in Python I might try
| comparing the speed of einops/einsum vs other ways.
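|
| For what it's worth, a tiny example of the kind of translation
| described above -- the same contraction as explicit loops and
| as one torch.einsum (toy shapes):
|
|     import torch
|
|     b, d, n = 2, 4, 3
|     x = torch.randn(b, d, n)
|     C = torch.randn(b, n)
|
|     # explicit loops: y[i, j] = sum_k x[i, j, k] * C[i, k]
|     y_loop = torch.zeros(b, d)
|     for i in range(b):
|         for j in range(d):
|             y_loop[i, j] = (x[i, j] * C[i]).sum()
|
|     # the same thing as a single einsum
|     y_ein = torch.einsum('bdn,bn->bd', x, C)
|     print(torch.allclose(y_loop, y_ein, atol=1e-6))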
| sjkoelle wrote:
| disagree
| danieldk wrote:
| Nice! For what it is worth, a colleague and I made a library a
| while ago that factors out most shared model code, with which
| many models can be implemented in about 100 lines (excluding
| Python import ceremony and comments). E.g.:
|
| BERT:
|
| https://github.com/explosion/curated-transformers/blob/main/...
|
| Llama 1/2:
|
| https://github.com/explosion/curated-transformers/blob/main/...
|
| MPT:
|
| https://github.com/explosion/curated-transformers/blob/main/...
|
| With various stuff enabled, including support for TorchScript
| JIT, PyTorch flash attention, etc.
| rdedev wrote:
| Nice. I will definitely be taking a look at this. Have you
| looked at the xformers library? They are looking at the same
| problem as you but their focus is more on providing performant
| transformer modules using triton. Using specific components
| from the library though is not as simple. I kept running into
| runtime errors so I've kept it aside for now. I am building
| something based on the Bert architecture so I will give this a
| look. Thanks for all the work!
| danieldk wrote:
| I would've loved to look at xFormers, but I avoided looking
| at other implementations to make sure that ours is a clean
| room implementation.
|
| Curated Transformers started as a very small library just for
| spaCy (spaCy 3.7 transformer pipelines use Curated
| Transformers) with just the older encoder models (BERT,
| RoBERTa, etc.). spaCy used Hugging Face Transformers prior
| for the provided transformer models, but we wanted something
| where we could easily hook into different parts of the model
| (e.g. for distillation).
|
| After the functionality needed for spaCy was done, Matt @
| Explosion encouraged us to extend it into a more general
| PyTorch library that would also support decoder
| architectures, generation, etc.
| pk-protect-ai wrote:
| Is there an original paper discussion? I seem to have missed it.
| It's quite interesting. I didn't catch on to this part:
|
| "We note that full results on context length 8k are missing for
| the RWKV and RetNet baselines, prior strong recurrent models that
| can also be interpreted as SSMs, due to a lack of efficient
| implementation leading to out-of-memory or unrealistic
| computation requirements."
|
| RetNet doesn't really consume much memory, and with the chunkwise
| forward implementation, it restricts the VRAM usage to the chunk
| size. This is the part to test the context length.
|
| Has anyone done some tests on the original Mamba model? How fast
| is the training on this one in comparison with RetNet in parallel
| forward mode?
| error9348 wrote:
| https://news.ycombinator.com/item?id=38522428
|
| https://openreview.net/forum?id=AL1fq05o7H
| allanrbo wrote:
| Love it when complex things are distilled down to just the
| essentials!
| dheera wrote:
| I love one file implementations. I hate all these implementations
| with preprocess_utils.py that imports stuff from model.py that
| imports stuff again from preprocess_utils.py that imports stuff
| from ...
| fassssst wrote:
| Feels like a useful preprocessor script: turn this repo into a
| single file
| squigz wrote:
| Is number of files in a project a meaningful metric...?
| DarmokJalad1701 wrote:
| Thanks for this. I took a stab at unraveling the official CUDA
| version and never really got around to it after my initial
| attempt failed. This seems a lot nicer.
| tysam_and wrote:
| Oh my gosh, another one-file PyTorch implementation. This is
| fantastic. I'd like to hope that some of my previous work (hlb-
| CIFAR10 and related projects, along with other influences before
| it like minGPT, DawnBench, etc.) has been able to help push the
| 'simple, single-file, reduced-complexity' format forward a bit. I
| personally think that this kind of work is critical to efficient
| ML research, and that is possibly one of the most important
| things that we can do for the field today.
|
| Research progresses at the speed of innovation, which progresses
| with the inverse of experiment runtime, which is definitely and
| absolutely related to the underlying Kolmogorov Complexity of the
| code w.r.t. a research/simple-hackery-focused objective.
|
| I really cannot stress enough how important tools like this are
| to research and how much they've sped up the knowledge
| discovery process for me personally. Being able to quickly sketch
| out ideas, often in minutes, and get immediate, high-snr results
| back has become an indispensable part of my research progress.
| While we seem to be really good at some of the specifics and
| details of research, and somehow have extremely information-
| efficient training processes, we seemingly have not applied the
| same logic to the research field as a whole!
|
| Knowledge distillation and/or the MDL
| (https://en.wikipedia.org/wiki/Minimum_description_length) are
| excessively important I think to reversing a lot of the constant
| fluff, cruft, and overly dense thrash-and-hope-you-don't-get-
| scooped-by-other-researchers-on-marginal-value-topics trend that
| I think has largely been encouraged by the current paper
| submission/review/etc process.
|
| I've been wanting to try to get around this and move a bit more
| towards a slightly better scaling solution recently. One of these
| things is that I've started distributing my code in 1-file, self-
| contained, short rough gists as 'code sketches', which shortens
| dev time and gets rough, unpolished, working code for a concept
| in people's hands. It seems to work pretty well so far, I hope to
| continue doing it! <3 :'))))
|
| In any case, this is extremely exciting stuff, and everyone --
| please! More code like this! We're researchers on learning data
| in a largely-scaled way, let's be data-efficient in how we
| disseminate information as well! It's a dream come true to see a
| lot more of this stuff coming down the pipeline, fantastic work
| and keep it coming! <3 :')))) Woop woop woop!!!!
|
| Excellent stuff. <3 :'))))
| tysam_and wrote:
| Minor potential performance benefit -- it looks like you might
| be able to fuse the x_proj and dt_proj weights here, as x_proj
| has no bias. This is possibly doable simply at runtime if there
| are any weight-fiddling requirements; I'm guessing the single
| kernel + bias will still run faster in the end (not sure
| though! <3 :')))) )
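|
| One way to read that suggestion, as a sketch (layer shapes only
| loosely modeled on the repo; since the first projection has no
| bias, the two weights fold into one matrix by associativity --
| though this trades away the low-rank factorization):
|
|     import torch, torch.nn as nn
|
|     d_inner, dt_rank = 64, 4
|     x_proj_dt = nn.Linear(d_inner, dt_rank, bias=False)  # delta slice of x_proj
|     dt_proj   = nn.Linear(dt_rank, d_inner, bias=True)
|
|     x = torch.randn(5, d_inner)
|     y_two = dt_proj(x_proj_dt(x))
|
|     # fused at runtime: W_fused = W_dt @ W_x, then one matmul + bias
|     W_fused = dt_proj.weight @ x_proj_dt.weight          # (d_inner, d_inner)
|     y_one = x @ W_fused.T + dt_proj.bias
|     print(torch.allclose(y_two, y_one, atol=1e-5))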
| jiggawatts wrote:
| It's been an exciting 2023 year in no small part because of
| watching AI research unfold at these crazy speeds. Like you've
| said, these enablers like ArXiV, PyTorch, GitHub, Huggingface,
| and terse Python code that's open source are dramatically
| accelerating the development of this new field.
|
| It's probably the fastest the human race has ever developed
| anything of substantial complexity!
|
| The only other place I see this kind of velocity is SpaceX,
| which also launched two cutting edge rockets this year.
|
| I wonder what 2024 will bring...
| jdeaton wrote:
| Very cool. I've read this line of papers originating from HiPPO,
| S4, Hyena, Mamba, etc., but can someone please explain how this
| isn't just an RNN/LSTM variant?
| rhaps0dy wrote:
| Its latent space transition is linear, instead of nonlinear, so
| there's a more parallelizable algorithm for advancing time in
| it. This makes it much more efficient to train and do inference
| with on GPUs.
|
| The way it keeps all the representation power of LSTMs is by
| having the transition vary with the input (but still be
| linear).
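|
| A toy contrast of the two recurrence shapes (made-up sizes; the
| point is only that the second form is linear in the carried
| state, which is what admits an associative/parallel scan):
|
|     import torch
|
|     d = 8
|     xs = torch.randn(5, d)
|     W, U = torch.randn(d, d) * 0.1, torch.randn(d, d) * 0.1
|
|     # RNN/LSTM-style: the nonlinearity wraps the carried state,
|     # so steps must be applied strictly one after another
|     h = torch.zeros(d)
|     for x_t in xs:
|         h = torch.tanh(W @ h + U @ x_t)
|
|     # Mamba-style: the transition is linear in the state; only its
|     # coefficients depend on the input
|     h = torch.zeros(d)
|     for x_t in xs:
|         a_t = torch.sigmoid(W @ x_t)   # stand-in for an input-dependent transition
|         b_t = U @ x_t                  # stand-in for the input-dependent drive
|         h = a_t * h + b_t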
| jdeaton wrote:
| Thanks, that's helpful. One place where the parallelizability
| of this method falls short of the transformer is not being
| able to pack multiple varying-length examples into the same
| array during training with a block-diagonal attention pattern.
| If I understand correctly, that's not possible with this
| architecture, and it's an important practical concern in large-
| scale transformer training.
| epaulson wrote:
| This is a dumb question but how hard is it to train the mamba
| models that are on huggingface? It looks like the largest one is
| 2.8b - how many GPUs for how long do you need to train that up
| using a dataset like The Pile?
| uejfiweun wrote:
| How long does it generally take between model architectures like
| Mamba being proposed and the use of these architectures in SotA
| mega models like GPT or Gemini? IIUC Mamba basically eliminates
| restrictions on context length which would be awesome to see in
| the super-mega high performance models.
| brcmthrowaway wrote:
| GPT-5 would have this enhancement
| marmaduke wrote:
| Hm I'd take a stab at a Jax version based on this. Thanks
| iskander wrote:
| I expected the core of the algorithm to be a parallel prefix scan,
| though (isn't that the point of Mamba?):
|
|     for i in range(l):
|         x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
|         y = einsum(x, C[:, i, :], 'b d_in n, b n -> b d_in')
|         ys.append(y)
| ekiauhce wrote:
| If a variable contains batch size, then name it accordingly --
| batch_size.
|
| And no glossary needed, KISS
|
| https://github.com/johnma2006/mamba-minimal/blob/82efa90919c...
___________________________________________________________________
(page generated 2023-12-20 23:00 UTC)