[HN Gopher] Implementation of Mamba in one file of PyTorch
       ___________________________________________________________________
        
       Implementation of Mamba in one file of PyTorch
        
       Author : johnma2006
       Score  : 311 points
       Date   : 2023-12-20 13:58 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | andy99 wrote:
        | The original Mamba code has a lot of speed optimizations and
        | other stuff that make it difficult to grasp immediately, so this
        | will help with learning.
       | 
       | I can't help but also plug my own Mamba inference implementation.
       | https://github.com/rbitr/llm.f90/tree/master/ssm
       | 
       | For inference one token at a time everything simplifies
       | considerably.
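        | 
        | For the curious, the per-token update really is tiny. A rough
        | PyTorch sketch with illustrative shapes and names (not the OP's
        | exact code):
        | 
        |     import torch
        | 
        |     b, d_in, n = 1, 4, 16                     # batch, channels, state size
        |     h = torch.zeros(b, d_in, n)               # hidden state carried across tokens
        |     deltaA_t  = torch.rand(b, d_in, n)        # discretized A for this token
        |     deltaBu_t = torch.randn(b, d_in, n)       # discretized B * u for this token
        |     C_t       = torch.randn(b, n)             # readout for this token
        | 
        |     h = deltaA_t * h + deltaBu_t              # state update, no sequence dimension
        |     y_t = torch.einsum('bdn,bn->bd', h, C_t)  # output for this token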
        
         | cs702 wrote:
         | Fortran! If you don't mind me asking, why Fortran?
         | 
         | I know it underpins a _lot_ of time-tested scientific code,
         | often wrapped by libraries like PyTorch and Numpy, but Fortran
          | isn't exactly a popular language nowadays. What's your
         | rationale for using it?
        
           | andy99 wrote:
            | Tl;dr: Fortran is low-level-ish and compiled, but otherwise
            | almost identical to numpy syntax-wise.
           | 
           | It supports all the common array and matrix operations and it
           | doesn't need memory and pointer management the way C does.
           | But it still compiles down to something very fast, you can
           | link in BLAS and GPU libraries, supports easy parallelism...
           | 
            | When I compare with e.g. Karpathy's llama2.c, I think Fortran
            | is easier to work with for implementing basic transformer
            | inference because of how it handles arrays.
           | 
           | The downside is that while there are efforts to modernize it,
           | I find it more cumbersome for non-numerical stuff,
           | particularly strings. But I think for the actual linear
           | algebra implementation, it can't be beat.
           | 
            | I should add, I know it's a bit of an uphill battle: I expect
            | fewer people will use code that I write in Fortran vs
            | basically anything else. But I'm hoping to pull some people
            | in and get a critical mass of interest because I think it has
            | a lot of promise. That's actually one of the reasons I wanted
            | to get a Mamba implementation out quickly (though now that
            | there's a basic Python one I think I'll have lost some
            | potential users to it :)
        
             | cs702 wrote:
             | Thanks for the thoughtful response.
             | 
             | Unfortunately, I too think it will be a bit of an uphill
             | battle for you.
             | 
             | If you haven't already, take a look at Mojo and Julia. Both
             | offer many of the benefits of Fortran, but unlike it, they
             | are seeing growing adoption.
        
               | andy99 wrote:
               | An uphill battle is fine
        
       | cs702 wrote:
       | This looks _really nice_. Thank you for sharing it on HN!
       | 
        | In case you didn't know, you can parallelize the slow Python loop
        | in _selective_scan_ that computes all the x's:
        | 
        |     x = torch.zeros((b, d_in, n))
        |     for i in range(l):
        |         x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
        |         ...
        | 
        | with only two calls to the PyTorch API. See the examples here:
        | https://github.com/glassroom/heinsen_sequence/blob/main/READ... [a]
       | 
       | You can then compute all the y's with one einsum, instead of l
       | sequential einsums.
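        | 
        | A minimal sketch of the trick, assuming strictly positive
        | coefficients (the README above shows how to handle negative
        | values with complex logs); the scan runs over the last dim here,
        | whereas selective_scan loops over l:
        | 
        |     import torch
        | 
        |     def parallel_linear_recurrence(coeffs, values):
        |         # Computes x_t = coeffs_t * x_{t-1} + values_t (with x_0 = 0)
        |         # for all t at once, replacing the sequential loop with
        |         # log-space prefix sums: cumsum + logcumsumexp are the two
        |         # scan-type calls.
        |         a_star = torch.cumsum(torch.log(coeffs), dim=-1)
        |         log_x = a_star + torch.logcumsumexp(
        |             torch.log(values) - a_star, dim=-1)
        |         return torch.exp(log_x)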
       | 
       | ---
       | 
       | [a] Previous discussion on HN:
       | https://news.ycombinator.com/item?id=38556669
        
         | make3 wrote:
         | OP's code is much easier to understand, though, which is the
         | main (only) purpose of their code
        
           | cs702 wrote:
           | Can't argue with that! :-)
           | 
           | For what it's worth, you can keep both, and make parallel vs
           | sequential execution an option, with a boolean flag.
           | 
           | You can also leave the sequential code as a comment
           | explaining what the parallel code does.
           | 
           | Or, if slow execution doesn't bother you, leave it as is.
        
             | bradfitz wrote:
             | You're replying to somebody who was arguing for readability
             | being its virtue and you're proposing ... adding options
             | and alternate code paths? :)
        
               | anytime5704 wrote:
               | Via a boolean parameter, no less.
        
               | cs702 wrote:
               | _Touche._ I just updated my comment :-)
        
       | boredumb wrote:
       | "Mamba is the world's longest venomous snake with an estimated
       | length of over 150 m"
       | 
        | Had a laugh at that. Really great stuff though; it was nice to
        | have references to the arxiv paper so someone like me, who
        | generally consumes these things instead of translating them from
        | papers, could sort of peek behind the curtains.
        
         | visarga wrote:
         | Mamba has a great name ... [S]elective [S]tructured [S]tate
         | [S]pace [S]equence models.. makes sSSSS, like a snake
        
           | behnamoh wrote:
           | If only the "mamba" name were not ugly.
        
         | rdedev wrote:
          | Wait, I thought that was the king cobra? The longest venomous
          | snake? At least that's what a simple Google search showed me.
         | 
         | Would be funny if they had to issue a correction for that
         | sentence later on
        
           | fwip wrote:
           | It's also not 150 meters long (nearly 500 feet), which I
           | think is also part of why it was funny to include the
           | sentence in the README.
        
       | y42 wrote:
       | slightly OT:
       | 
        | I really struggle with the dozens and dozens of terms being used
        | in the field of machine learning and especially AI. I'm not a
        | beginner at all, but I wonder if there is a comprehensive guide
        | to all those terms that doesn't necessarily explain the
        | technology behind them in detail, but shows their position and
        | relation to each other, like some kind of landscape.
        | 
        | "Everyone" seems to know Mamba. I had never heard of Mamba. There
        | are constantly new kinds of LLMs popping up, talking about stuff
        | that seems to be obvious.
        | 
        | So, is there some kind of resource like that, not aiming at
        | beginners, but at experienced users coming from other fields of
        | IT?
        
         | sevagh wrote:
         | >"everyone" seems to know Mamba. I never heard of Mamba
         | 
         | Only the "everybody who knows what mamba is" are the ones
         | upvoting and commenting. Think of all the people who ignore it.
         | For me, Mamba is the faster version of Conda [1], and that's
         | why I clicked on the article.
         | 
         | https://github.com/mamba-org/mamba
        
           | 3-cheese-sundae wrote:
           | Ah yes, Conda, definitely something else I've heard of.
        
             | NavinF wrote:
             | Conda has been around for a decade and it used to be the
             | primary package manager for everything related to
             | numpy/scipy. Most ML and data science people have heard of
             | it even if they haven't used it.
        
             | sevagh wrote:
             | Conda is the latest LLM cli frontend that's a MOE of
             | Mistral 7B, LLama 17B, Falcon 32C, and the Yamaha YZ50 quad
             | bike.
        
               | gpderetta wrote:
               | > and the Yamaha YZ50 quad bike.
               | 
               | Well played.
        
             | supermatt wrote:
              | It's extremely common to manage Python environments with
              | conda (although it can do much more). If you are unaware of
              | conda, it is unlikely you work with Python, and therefore
              | unlikely to be doing much with ML (and LLMs) anyway - it's
              | even part of the "getting started" documentation for
              | PyTorch.
        
           | IshKebab wrote:
            | That is not "a new LLM architecture"... It's talking about a
            | different Mamba.
        
         | CaptainOfCoit wrote:
         | I'm not aware of such a glossary.
         | 
          | But I did notice the "References" section at the bottom of the
         | README, which does explain what Mamba is by linking to the
         | original paper: "Mamba: Linear-Time Sequence Modeling with
         | Selective State Spaces" https://arxiv.org/abs/2312.00752
        
         | bananaflag wrote:
         | I knew about Mamba from r/singularity and following AI
         | researchers on Twitter.
         | 
          | I don't work in AI at all (and don't plan to), but it's fun to
          | know about stuff a little before it becomes mainstream.
        
         | swyx wrote:
         | the field just moves fast. I have curated a list of non-hypey
         | writers and youtubers who explain these things for a typical
         | SWE audience if you are interested.
         | https://github.com/swyxio/ai-notes/blob/main/Resources/Good%...
        
           | y42 wrote:
           | Will check it, thank you!
        
         | orbifold wrote:
          | It is a very fad-driven field. Everyone brands everything. It
          | isn't enough to give things boring titles like "stacked open
          | linear dynamical system with selective observations and learned
          | timestep".
        
           | Ar-Curunir wrote:
           | I mean, Mamba is much easier to remember than what you said.
           | It's good to have short names for techniques.
        
           | cyanydeez wrote:
            | that's half of it; the other half is pure social linguistics.
            | 
            | try talking about "stacked open linear dynamical systems"
            | more than three times and you're bound to figure out a token
            | that conveys the same thing but is quicker to produce
            | 
            | it's turtles all the way down with LLMs and your comment.
            | people are just trying to maximize what their tokens convey
            | in conversations
        
         | yuppiepuppie wrote:
          | Heavily agree. I've been following this space quite closely,
          | like most people, only for the past year. But it seems to still
          | be in its experimental phase, which in turn brings in academics
          | and researchers who tend toward this type of language.
        
         | sva_ wrote:
         | I didn't know Mamba but the bottom of the page lists
         | comprehensive references.
         | 
         | If you mean the "branding" that is common in ML, which is often
         | criticized, I much prefer it over the jargon used in other
         | fields, e.g. Mathematics. It is nice to have distinguished
         | words to talk about different concepts.
        
         | visarga wrote:
         | > I never heard of Mamba.
         | 
         | Just came out a few days ago. It's new for everyone.
        
           | amelius wrote:
           | Mamba is also the name of a package management system,
           | similar to Conda.
           | 
           | Just to make it a little extra confusing :)
           | 
           | https://github.com/mamba-org/mamba
        
             | Tao3300 wrote:
             | Should have picked a different snake, like... I dunno, Asp?
             | Wait, no, not that one...
        
               | sevagh wrote:
               | Python!
        
         | wenc wrote:
         | In fast evolving fields it's always all about sociology, not
         | canon or pedagogy. Meaning in new fields is created in
         | community (constructionism).
         | 
         | You need to plug into the community and overhear what people
         | are talking about (HN is such a community). You'll also get a
         | sense of the linguistic subculture (acronyms, lingo etc) much
         | like you learn to talk hip hop if you're into the hip hop
         | subculture. Much of it will be noise but overall you'll get a
         | sense of what the community cares about, which helps you narrow
         | what you need to focus on. The subreddit r/localllama is the
         | watering hole for hobbyists right now.
         | 
         | If you need a primer, this is a good guide.
         | 
         | https://flyte.org/blog/getting-started-with-large-language-m...
         | 
          | In this particular case, I find it helpful to do syntopical
          | reading (per Mortimer Adler) around LLMs, not AI in general.
          | Mamba is interesting to me because I have a background in
          | optimal control and state space models are my bread and butter,
          | so it's fascinating to see them applied in this way.
          | 
          | Side note: I'm in my 40s and this isn't my first rodeo. There will
         | always be new fields and trends emerging -- I've been through
         | several waves of this (cloud, big data, ML, data science etc)
         | where posts like yours are commonplace. But there is no need to
         | be frustrated. Overhearing conversations is one way to make
         | sense of them instead of feeling lost and waiting for someone
         | to summarize and explain everything to you.
         | 
         | The same applies to academic fields.
         | 
          | PS: also consider you might not need to be on the cutting edge.
         | If you're not trying to build leading edge stuff, it's good to
         | wait for the dust to settle -- you'll waste less time following
         | dead ends while the community is figuring out what's good.
        
           | pshc wrote:
           | Perhaps the community at r/localllama could train an LLM that
           | knows about the latest developments and explains jargon and
           | papers, updated weekly. Free idea for karma.
        
             | wenc wrote:
             | Not a bad idea.
             | 
             | I actually read papers with the help of ChatGPT-4 and
             | Claude. It helps me quickly understand papers that I don't
             | have a background in.
             | 
             | For instance when I see something I don't understand I ask
             | it "can you break that down for me?" Or "is this similar to
             | (concept I know)?"
             | 
             | It's the new way of doing syntopical reading -- but faster
             | and more efficient.
             | 
             | (For the uninitiated, it's a technique from Mortimer
             | Adler's How to read a book)
        
               | ttul wrote:
               | This is a great way to consume papers. If there's one
               | thing LLMs know, it's machine learning literature!
        
               | jsight wrote:
               | How do you feed a recent arxiv paper directly to ChatGPT?
        
               | Min0taur wrote:
               | If you have the + subscription you can upload pdfs
               | directly/ask it to ingest.
        
               | WhitneyLand wrote:
               | A few options are:
               | 
               | 1. Select abstract or select all text then copy/paste.
               | 
               | 2. Save the PDF and upload with ChatGPT's document
               | feature.
               | 
               | 3. Ask for it, "what's that well known LLM paper about
               | context and getting lost in the middle?". It will web
               | search as needed.
               | 
               | You can also do more than summarize. Ask about equations,
               | ask it to make analogies, challenge the key findings as
               | devil's advocate to learn from different angles. Propose
               | your own ideas.
               | 
               | Use voice to digest topics during your commute and ask
               | tons of questions until you understand.
        
           | y42 wrote:
           | Good point, thanks for the link! (one of the links there
           | leads to this wonderful post: Highly recommended:
           | http://jalammar.github.io/illustrated-transformer/)
        
         | countWSS wrote:
          | It's a new LLM type: instead of transformers it uses state-
          | space models, which are orders of magnitude faster. It's
          | currently very new and less coherent than GPT-2.
        
           | senseiV wrote:
            | ? It's better than GPT-2 for sure...
        
         | falcor84 wrote:
         | > in the field of machine learning and especially AI
         | 
         | Sorry for getting semantical here, but isn't ML a subfield of
         | AI? In other words, I would have expected "... in the field of
         | machine learning and AI in general"
        
           | dragonwriter wrote:
           | AI is often being used recently for specifically _generative_
           | AI, which is a subfield of machine learning, which is a
           | subfield of AI in the broader sense.
        
         | quickthrower2 wrote:
          | You are now in the loop! Your colleagues will think the same
          | thing: "how does this person keep up with all the LLM stuff?"
        
         | TrackerFF wrote:
         | The people that are constantly up to date on this stuff tend to
         | be AI/ML researchers and engineers. In academia, industry
         | research groups, or startups.
         | 
         | They literally get paid to read papers, and implement models on
         | a day-to-day basis.
         | 
          | I wouldn't worry too much about not being up to date or things
          | sounding a bit foreign. The names themselves are just that,
          | names; the models themselves tend to be incremental versions of
          | some previous model.
        
           | toasted-subs wrote:
           | Most of the startups I've chatted with seem to prioritize
           | finding people who build products. The complaint/regret I've
           | heard from 3-5 organizations was hiring researchers.
           | 
            | Research is more for highly funded organizations. Startups
            | can get by with off-the-shelf models.
        
         | jiggawatts wrote:
         | Don't feel bad, Mamba is _very new_ technology. I only just
         | heard about it for the first time last week!
        
         | esafak wrote:
         | Everybody doesn't know Mamba. You can't stay on top of
         | everything in ML so stop trying. Since you asked, Mamba is a
         | neural architecture based on structured state space models
          | (SSMs) that aims to replace Transformers. For me, right now,
          | knowing just that much counts as staying on top of things. If I
          | need to know more than that I can have the computer summarize
          | it for me.
        
       | swyx wrote:
       | things I'd like a non-ML-researcher explanation of about Mamba:
       | 
       | 1. what is the overall insight of state space models beyond
       | transformers? (i know this is somewhat covered in the paper but
       | still a bit inaccessible)
       | 
       | 2. what was the incremental innovation/result that is making
       | Mamba more successful/interesting than its predecessors? (S4, H3,
       | Monarch etc)
       | 
       | 3. what are the implications beyond subquadratic scaling of
       | context? say if i don't really care about context length > 100k
       | tokens. what other benefits are there - for example, is Mamba
       | potentially more compute-efficient to train for a similar size of
       | model/dataset?
       | 
       | just offering 3 prompts for knowledgeable people to drop some
       | alpha
        
         | logicchains wrote:
          | For 2: Mamba takes several parameters (delta, B, C) that in S4
          | are time-invariant and makes them functions of the input, which
          | makes the model more powerful.
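          | 
          | Roughly: where S4 keeps delta, B, C as fixed per-layer
          | parameters, Mamba projects them from the input at every
          | position (A itself stays a learned parameter; the
          | input-dependent delta is what discretizes it per step). A
          | hand-wavy sketch, with illustrative layer names rather than
          | the official ones:
          | 
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     class SelectiveProjections(nn.Module):
          |         def __init__(self, d_in, d_state, dt_rank):
          |             super().__init__()
          |             self.dt_rank, self.d_state = dt_rank, d_state
          |             self.in_proj = nn.Linear(d_in, dt_rank + 2 * d_state, bias=False)
          |             self.dt_up = nn.Linear(dt_rank, d_in)  # low-rank delta, projected back up
          | 
          |         def forward(self, x):                      # x: (batch, seq_len, d_in)
          |             dt, B, C = self.in_proj(x).split(
          |                 [self.dt_rank, self.d_state, self.d_state], dim=-1)
          |             delta = F.softplus(self.dt_up(dt))     # positive step sizes
          |             return delta, B, C                     # B, C: (batch, seq_len, d_state)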
        
         | pk-protect-ai wrote:
         | > is Mamba potentially more compute-efficient to train for a
         | similar size of model/dataset?
         | 
          | I would like to understand that as well ...
         | 
          | Here is the citation from the original paper:
         | 
         | "Computation. After the parameters have been transformed from
         | ([?], A, B, C) - (A, B, C), the model can be computed in two
         | ways, either as a linear recurrence (2) or a global convolution
         | (3). Commonly, the model uses the convolutional mode (3) for
         | efficient parallelizable training (where the whole input
         | sequence is seen ahead of time), and switched into recurrent
         | mode (2) for efficient autoregressive inference (where the
         | inputs are seen one timestep at a time)."
         | 
          | So the training is parallelizable, like in RetNet with parallel
          | forward mode. By default inference is done in the recurrent
          | mode, to allow the longest possible context. No chunking is
          | available, so it is difficult for me to say how much RAM and
          | VRAM it will consume during inference ...
        
         | ttul wrote:
         | My IQ is orders of magnitude lower than the authors of the
         | paper, but I did my best to work through it anyway. I studied
         | CE and have the basic control theory background and undergrad
         | level discrete time systems intuition. It would take much
         | additional studying to understand state space models enough to
         | really parse this paper. But I tried anyway. Take my comment
         | here with a big grain of salt.
         | 
         | The overall insight of Mamba is to solve a longstanding problem
         | with state space models. They are good at compressing the input
         | context, but the compression of input into a hidden state
          | erases information needed to make use of the context as
          | effectively as Transformers do.
         | 
         | Their solution to this problem is to create what they call a
         | selection mechanism. The mechanism is input-dependent, allowing
         | the model to adjust its output at each step as the input
         | changes. How they do this is by making a few of the state space
         | variables input-dependent instead of input-invariant. They
         | choose a few of the state space variables and attach linear
         | layers and such to project the input onto the state space
         | variable at each time step. The linear layers (etc) are
         | obviously trained so that they know how to transform the input
         | appropriately so that the model spits out useful output.
         | 
         | But making the state space variables input dependent creates a
         | problem in terms of computation overhead. They fix the
         | computation problem by designing a machine architecture-aware
         | algorithm that makes the most of modern GPU memory
         | architecture, avoiding moving things in and out of HBM as much
         | as possible.
         | 
         | Tri Dao came up with Flash Attention, which is basically a way
         | to use hardware more efficiently in a Transformer. So this is
         | his jam 100%.
         | 
         | I know this doesn't add much to understanding the paper, but
         | hopefully it's better than nothing.
        
           | SpaceManNabs wrote:
           | Is this similar to subset selection with the concrete
           | distribution?
        
         | sjkoelle wrote:
         | my loose understanding
         | 
          | 1) transformers create an (input length x input length)
          | attention matrix that is unnecessarily large. state space
          | models somehow compress this.
         | 
         | 2) "The main difference is simply making several parameters [in
         | the state space model] functions of the input"
         | 
         | 3) i think it might be more sample efficient (requires less
         | data)
        
         | WhitneyLand wrote:
         | I think this video is exactly what you're looking for.
         | 
         | He explains the paper but also gives a lot of context, how it
         | fits into the big picture, etc.
         | 
          | It's actually kind of exciting hearing the plot unfold.
         | 
         | https://youtu.be/ouF-H35atOY?si=y2Ckp9MCFd7ulLL3
        
       | mcemilg wrote:
        | Looks wonderful. But I would like to add this: I hate einops; it
        | doesn't make the code simpler to read, unfortunately.
        
         | andy99 wrote:
          | I re-implemented Mamba myself and this was the first time I had
          | ever worked with einops/einsum. I'm 50/50 on them after this. I
          | found them relatively easy to look at and understand the intent
          | (possibly more so than other representations), but they take
          | extra time to translate into other primitives (loops,
          | multiplication, etc.). I believe torch.einsum is generally well
          | optimized as well, compared to naively looping. All said, I
          | don't know if I'd use it myself working from scratch, but it's
          | interesting to know, and if I was working in Python I might try
          | comparing the speed of einops/einsum vs other ways.
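          | 
          | To illustrate the kind of translation I mean, here is one of
          | the readout-style einsums next to its plain-loop equivalent
          | (toy shapes, just to show the intent):
          | 
          |     import torch
          | 
          |     b, d_in, n = 2, 4, 16
          |     x = torch.randn(b, d_in, n)
          |     C = torch.randn(b, n)
          | 
          |     # einsum form: contract over the state dimension n
          |     y_fast = torch.einsum('bdn,bn->bd', x, C)
          | 
          |     # the same thing spelled out with loops
          |     y_slow = torch.zeros(b, d_in)
          |     for bi in range(b):
          |         for di in range(d_in):
          |             y_slow[bi, di] = (x[bi, di] * C[bi]).sum()
          | 
          |     assert torch.allclose(y_fast, y_slow, atol=1e-6)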
        
         | sjkoelle wrote:
         | disagree
        
       | danieldk wrote:
       | Nice! For what it is worth, a colleague and I made a library a
       | while ago that factors out most shared model code, with which
       | many models can be implemented in about 100 lines (excluding
       | Python import ceremony and comments). E.g.:
       | 
       | BERT:
       | 
       | https://github.com/explosion/curated-transformers/blob/main/...
       | 
       | Llama 1/2:
       | 
       | https://github.com/explosion/curated-transformers/blob/main/...
       | 
       | MPT:
       | 
       | https://github.com/explosion/curated-transformers/blob/main/...
       | 
       | With various stuff enabled, including support for TorchScript
       | JIT, PyTorch flash attention, etc.
        
         | rdedev wrote:
         | Nice. I will definitely be taking a look at this. Have you
          | looked at the xformers library? They are looking at the same
         | problem as you but their focus is more on providing performant
         | transformer modules using triton. Using specific components
         | from the library though is not as simple. I kept running into
         | runtime errors so I've kept it aside for now. I am building
         | something based on the Bert architecture so I will give this a
         | look. Thanks for all the work!
        
           | danieldk wrote:
           | I would've loved to look at xFormers, but I avoided looking
           | at other implementations to make sure that ours is a clean
           | room implementation.
           | 
           | Curated Transformers started as a very small library just for
           | spaCy (spaCy 3.7 transformer pipelines use Curated
           | Transformers) with just the older encoder models (BERT,
            | RoBERTa, etc.). spaCy previously used Hugging Face
            | Transformers for the provided transformer models, but we
            | wanted something
           | where we could easily hook into different parts of the model
           | (e.g. for distillation).
           | 
           | After the functionality needed for spaCy was done, Matt @
           | Explosion encouraged us to extend it into a more general
           | PyTorch library that would also support decoder
           | architectures, generation, etc.
        
       | pk-protect-ai wrote:
       | Is there an original paper discussion? I seem to have missed it.
       | It's quite interesting. I didn't catch on to this part:
       | 
       | "We note that full results on context length 8k are missing for
       | the RWKV and RetNet baselines, prior strong recurrent models that
       | can also be interpreted as SSMs, due to a lack of efficient
       | implementation leading to out-of-memory or unrealistic
       | computation requirements."
       | 
       | RetNet doesn't really consume much memory, and with the chunkwise
       | forward implementation, it restricts the VRAM usage to the chunk
       | size. This is the part to test the context length.
       | 
       | Has anyone done some tests on the original Mamba model? How fast
       | is the training on this one in comparison with RetNet in parallel
       | forward mode?
        
         | error9348 wrote:
         | https://news.ycombinator.com/item?id=38522428
         | 
         | https://openreview.net/forum?id=AL1fq05o7H
        
       | allanrbo wrote:
       | Love it when complex things are distilled down to just the
       | essentials!
        
       | dheera wrote:
       | I love one file implementations. I hate all these implementations
       | with preprocess_utils.py that imports stuff from model.py that
       | imports stuff again from preprocess_utils.py that imports stuff
       | from ...
        
         | fassssst wrote:
         | Feels like a useful preprocessor script: turn this repo into a
         | single file
        
       | squigz wrote:
       | Is number of files in a project a meaningful metric...?
        
       | DarmokJalad1701 wrote:
       | Thanks for this. I took a stab at unraveling the official CUDA
       | version and never really got around to it after my initial
       | attempt failed. This seems a lot nicer.
        
       | tysam_and wrote:
       | Oh my gosh, another one-file PyTorch implementation. This is
       | fantastic. I'd like to hope that some of my previous work (hlb-
       | CIFAR10 and related projects, along with other influences before
       | it like minGPT, DawnBench, etc.) has been able to help push the
       | 'simple, single-file, reduced-complexity' format forward a bit. I
       | personally think that this kind of work is critical to efficient
       | ML research, and that is possibly one of the most important
       | things that we can do for the field today.
       | 
       | Research progresses at the speed of innovation, which progresses
       | with the inverse of experiment runtime, which is definitely and
       | absolutely related to the underlying Kolmogorov Complexity of the
       | code w.r.t. a research/simple-hackery-focused objective.
       | 
       | I really cannot stress enough how important to research tools
       | like this are and how much they've sped up the knowledge
       | discovery process for me personally. Being able to quickly sketch
       | out ideas, often in minutes, and get immediate, high-snr results
       | back has become an indispensable part of my research progress.
        | While we seem to be really good at some of the specifics and
        | details of research, and somehow have extremely information-
        | efficient training processes, we have seemingly not applied the
        | same logic on the whole to the entire research field!
       | 
       | Knowledge distillation and/or the MDL
       | (https://en.wikipedia.org/wiki/Minimum_description_length) are
       | excessively important I think to reversing a lot of the constant
       | fluff, cruft, and overly dense thrash-and-hope-you-don't-get-
       | scooped-by-other-researchers-on-marginal-value-topics trend that
       | I think has largely been encouraged by the current paper
       | submission/review/etc process.
       | 
       | I've been wanting to try to get around this and move a bit more
       | towards a slightly better scaling solution recently. One of these
       | things is that I've started distributing my code in 1-file, self-
       | contained, short rough gists as 'code sketches', which shortens
       | dev time and gets rough, unpolished, working code for a concept
       | in people's hands. It seems to work pretty well so far, I hope to
       | continue doing it! <3 :'))))
       | 
       | In any case, this is extremely exciting stuff, and everyone --
       | please! More code like this! We're researchers on learning data
       | in a largely-scaled way, let's be data-efficient in how we
       | disseminate information as well! It's a dream come true to see a
       | lot more of this stuff coming down the pipeline, fantastic work
       | and keep it coming! <3 :')))) Woop woop woop!!!!
       | 
       | Excellent stuff. <3 :'))))
        
         | tysam_and wrote:
         | Minor potential performance benefit -- it looks like you might
         | be able to fuse the x_proj and dt_proj weights here as x_proj
         | has no bias. This is a thing that's possibly doable simply at
         | runtime if there's any weight-fiddling reqs, I'm guessing the
         | single kernel + bias will still run faster in the end (not sure
         | though! <3 :')))) )
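          | 
          | Concretely, since there's no nonlinearity between them, the
          | delta path is just two stacked linear maps, so (if I've read
          | the shapes right) something like this sketch should work (toy
          | sizes; note it undoes the low-rank dt_rank factorization,
          | trading parameters for one fewer matmul):
          | 
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     d_inner, dt_rank, d_state = 32, 2, 16
          |     x_proj  = nn.Linear(d_inner, dt_rank + 2 * d_state, bias=False)
          |     dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
          | 
          |     with torch.no_grad():
          |         W_x_delta = x_proj.weight[:dt_rank, :]   # rows producing the delta slice
          |         W_fused   = dt_proj.weight @ W_x_delta   # (d_inner, d_inner) composed map
          |         b_fused   = dt_proj.bias.clone()
          | 
          |     x = torch.randn(4, d_inner)
          |     a = F.softplus(dt_proj(x_proj(x)[:, :dt_rank]))
          |     b = F.softplus(x @ W_fused.T + b_fused)
          |     assert torch.allclose(a, b, atol=1e-5)       # the two paths agree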
        
         | jiggawatts wrote:
         | It's been an exciting 2023 year in no small part because of
         | watching AI research unfold at these crazy speeds. Like you've
         | said, these enablers like ArXiV, PyTorch, GitHub, Huggingface,
         | and terse Python code that's open source are dramatically
         | accelerating the development of this new field.
         | 
         | It's probably the fastest the human race has ever developed
         | anything of substantial complexity!
         | 
          | The only other place I see this kind of velocity is SpaceX,
          | which also launched two cutting-edge rockets this year.
         | 
         | I wonder what 2024 will bring...
        
       | jdeaton wrote:
        | Very cool. I've read this line of papers originating from HiPPO,
        | S4, Hyena, Mamba, etc., but can someone please explain how this
        | isn't just an RNN/LSTM variant?
        
         | rhaps0dy wrote:
          | Its latent space transition is linear, instead of nonlinear, so
          | there's a more parallelizable algorithm for advancing time in
          | it. This makes it much more efficient to train and do inference
          | with on GPUs.
         | 
         | The way it keeps all the representation power of LSTMs is by
         | having the transition vary with the input (but still be
         | linear).
        
           | jdeaton wrote:
            | Thanks, that's helpful. One place where the parallelizability
            | of this method falls short of the transformer is not being
            | able to pack multiple varying-length examples into the same
            | array during training with a block-diagonal attention
            | pattern. If I understand correctly that's not possible with
            | this architecture, and it's an important practical concern in
            | large-scale transformer training.
        
       | epaulson wrote:
       | This is a dumb question but how hard is it to train the mamba
       | models that are on huggingface? It looks like the largest one is
       | 2.8b - how many GPUs for how long do you need to train that up
       | using a dataset like The Pile?
        
       | uejfiweun wrote:
       | How long does it generally take between model architectures like
       | Mamba being proposed and the use of these architectures in SotA
       | mega models like GPT or Gemini? IIUC Mamba basically eliminates
       | restrictions on context length which would be awesome to see in
       | the super-mega high performance models.
        
         | brcmthrowaway wrote:
         | GPT-5 would have this enhancement
        
       | marmaduke wrote:
       | Hm I'd take a stab at a Jax version based on this. Thanks
        
       | iskander wrote:
        | I expected the core of the algorithm to be a parallel prefix scan
        | though (isn't that the point of Mamba?):
        | 
        |     for i in range(l):
        |         x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
        |         y = einsum(x, C[:, i, :], 'b d_in n, b n -> b d_in')
        |         ys.append(y)
        
       | ekiauhce wrote:
       | If a variable contains batch size, then name it accordingly --
       | batch_size.
       | 
       | And no glossary needed, KISS
       | 
       | https://github.com/johnma2006/mamba-minimal/blob/82efa90919c...
        
       ___________________________________________________________________
       (page generated 2023-12-20 23:00 UTC)