[HN Gopher] Llama 3 implemented in pure NumPy
       ___________________________________________________________________
        
       Llama 3 implemented in pure NumPy
        
       Author : orixilus
       Score  : 311 points
       Date   : 2024-05-16 13:53 UTC (9 hours ago)
        
 (HTM) web link (docs.likejazz.com)
 (TXT) w3m dump (docs.likejazz.com)
        
       | ulam2 wrote:
       | I'll consider superintelligence achieved if AI can do such work
       | faithfully.
        
         | sebzim4500 wrote:
         | What? Lots of people could produce this repo, it hardly counts
         | as superintelligence.
        
       | Scene_Cast2 wrote:
       | The rotary embeddings bit is neat. I wonder if a complex
       | representation would simplify vs complexify things (readability,
       | performance, expressive power).
        
         | johndough wrote:
         | Some implementations use a complex rotary encoding, but it
         | makes it a bit harder to port to platforms or frameworks which
         | do not support complex numbers natively.
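          | 
          | For illustration, here is a minimal NumPy sketch of the same
          | rotation written both ways (toy helpers, not taken from the
          | repo; pairs are adjacent features):
          | 
          |     import numpy as np
          |     
          |     def rope_complex(x, pos, base=10000.0):
          |         # Treat adjacent feature pairs as complex numbers and
          |         # rotate them by position-dependent angles.
          |         d = x.shape[-1]
          |         theta = base ** (-np.arange(0, d, 2) / d)
          |         angles = pos[:, None] * theta[None, :]
          |         z = x[..., 0::2] + 1j * x[..., 1::2]
          |         z_rot = z * np.exp(1j * angles)
          |         out = np.empty_like(x)
          |         out[..., 0::2] = z_rot.real
          |         out[..., 1::2] = z_rot.imag
          |         return out
          |     
          |     def rope_real(x, pos, base=10000.0):
          |         # The same rotation with cos/sin only, no complex dtype.
          |         d = x.shape[-1]
          |         theta = base ** (-np.arange(0, d, 2) / d)
          |         angles = pos[:, None] * theta[None, :]
          |         cos, sin = np.cos(angles), np.sin(angles)
          |         xe, xo = x[..., 0::2], x[..., 1::2]
          |         out = np.empty_like(x)
          |         out[..., 0::2] = xe * cos - xo * sin
          |         out[..., 1::2] = xe * sin + xo * cos
          |         return out
          |     
          |     x = np.random.randn(4, 8)
          |     pos = np.arange(4)
          |     assert np.allclose(rope_complex(x, pos), rope_real(x, pos))
          | 
          | The real-valued form is the one you would port to a framework
          | without native complex support.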
        
         | 6gvONxR4sf7o wrote:
          | The tensor cores that do the bulk of the FLOPs on most of the
          | GPUs people use only handle various sizes of floats, I think,
          | not complex numbers natively. We're in a funny position where
          | progress in models and progress in hardware are kind of linked.
         | 
         | As far as expressive power goes, it shouldn't make a difference
         | for the models in common use, but I could totally imagine
         | models where it improves readability.
        
       | johndough wrote:
       | What is the difference to the llama.np repository credited in the
       | README? https://github.com/hscspring/llama.np
        
         | aeyes wrote:
         | Well, it supports Llama3.
         | 
          | But the other question I have is about the license. The
          | tokenizer.py file is identical, and the rest is very similar,
          | with just minor adjustments here and there.
         | 
         | Can they just take this Apache 2 licensed code, change it a bit
         | and offer it as MIT? They are clearly not the original author.
        
           | Scaevolus wrote:
           | Unfortunately, licenses are only worth as much as your
           | lawyers.
        
             | yjftsjthsd-h wrote:
             | DMCA takedowns are free.
        
               | not2b wrote:
               | A less aggressive approach would be to file an issue and
               | let the maintainer correct the license issue.
        
       | kolinko wrote:
        | Obligatory mention of Recmo's Llama1 implementation in NumPy :)
       | 
       | https://github.com/recmo/cria
        
       | joennlae wrote:
       | Trainable Llama-like transformer (with backpropagation) in numpy
       | only (~600 lines)
       | 
       | https://github.com/joennlae/tensorli
        
       | lnyan wrote:
        | `import jax.numpy as np`, and we also get a JAX implementation
        | after certain modifications: e.g. removing in-place index
        | assignment, replacing unsupported functions, etc.
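        | 
        | The in-place updates are the main mechanical change. A minimal
        | sketch of the idiom (toy arrays, not code from the repo):
        | 
        |     import numpy as np
        |     import jax.numpy as jnp
        |     
        |     # NumPy: in-place index assignment works directly.
        |     a = np.zeros(8)
        |     a[2:5] = 1.0
        |     
        |     # JAX arrays are immutable, so the same update is functional:
        |     b = jnp.zeros(8)
        |     b = b.at[2:5].set(1.0)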
        
         | cl3misch wrote:
          | ...which should be much faster even on CPU, I assume.
        
         | ffriend wrote:
          | JAX requires a bit more work to maintain the fixed-size buffers
          | required by XLA, especially in the case of caching and rotary
         | embeddings. But yeah, overall the code can be pretty similar
         | [1].
         | 
         | [1]:
         | https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
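          | 
          | Roughly, the fixed-size constraint means pre-allocating the KV
          | cache at max_seq_len and writing into it per token instead of
          | growing it. A sketch of the idea with made-up shapes, not the
          | code from [1]:
          | 
          |     import jax.numpy as jnp
          |     
          |     max_seq_len, n_heads, head_dim = 16, 4, 8
          |     # Static shape, so XLA traces and compiles the update once.
          |     k_cache = jnp.zeros((max_seq_len, n_heads, head_dim))
          |     
          |     def write_kv(cache, pos, k_new):
          |         # k_new: (n_heads, head_dim) for the current token.
          |         return cache.at[pos].set(k_new)
          |     
          |     k_cache = write_kv(k_cache, 0, jnp.ones((n_heads, head_dim)))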
        
       | xchip wrote:
        | Nice, but the tricky part is the training data.
        
         | whereismyacc wrote:
          | There are a lot of tricky parts.
        
         | swader999 wrote:
         | The tricky part is getting big enough that no one can
         | successfully sue you for using "your" training data.
        
       | buildbot wrote:
        | Cool, instant CUDA acceleration via CuPy! `import cupy as np`
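        | 
        | (With the caveat that the arrays then live on the GPU, so data
        | has to be copied across the host/device boundary; a rough
        | sketch:)
        | 
        |     import numpy
        |     import cupy as cp
        |     
        |     w = cp.asarray(numpy.random.randn(64, 64))  # host -> device
        |     x = cp.ones((4, 64))
        |     y = x @ w                                   # matmul on the GPU
        |     y_host = cp.asnumpy(y)                      # device -> host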
        
       | AI_hacker wrote:
       | How does the performance of llama3.np compare to other
       | implementations, especially considering it's a pure NumPy
       | implementation?
        
       | rhdunn wrote:
        | From the TinyStories dataset card [1], the dataset is generated by
       | GPT-3.5 and GPT-4. Reading the discussions in the community tab
       | [2] it looks like there are a lot of incomplete or misspelled
       | words, incorrect grammar, and even Chinese characters in the
       | dataset.
       | 
        | As such, I'd be wary of using that dataset to train or evaluate
       | models.
       | 
       | [1] https://huggingface.co/datasets/roneneldan/TinyStories
       | 
       | [2]
       | https://huggingface.co/datasets/roneneldan/TinyStories/discu...
        
         | nwoli wrote:
          | It's just used for checking that the implementation is correct.
          | It's only a toy dataset, so it doesn't matter if it has
          | misspelled words.
        
       | ffriend wrote:
       | It's also worth mentioning that the original implementation by
       | Meta is only 300 lines of very readable code [1].
       | 
       | [1]: https://github.com/meta-
       | llama/llama3/blob/main/llama/model.p...
        
         | blt wrote:
          | The simplicity of the transformer is quite refreshing,
          | especially in vision, where the Vision Transformer with linear
          | patch encodings replaces complex, intertwined decisions about
          | filter size, striding, pooling, #filters, depth, etc., with the
          | simpler decision of how to allocate your FLOPs between
          | dimensionality, #heads, and #layers.
        
         | blharr wrote:
          | So is it the case that the information is in the dataset? Or is
          | the code just very well designed to be so small? As an outsider,
          | it's surprising that such a capable model can be so "simple".
        
           | jacobn wrote:
           | The training code is presumably quite a bit more complex than
           | what they've open sourced, but part of the beauty of the GPT-
           | based LLMs is their structural simplicity.
           | 
            | Now, that simplicity can be deceiving - there is a lot of
            | conceptual interconnectedness within these models. They've
            | been put together "just so", if you will.
           | 
           | If you look at the source code to nanoGPT and compare it to
           | Llama3, the most remarkable thing (when you look past the
           | superficial name changes) is just how similar they are.
           | 
            | If I recall correctly, the primary differences are:
            | 
            |   - The MLP: Llama3 uses SwiGLU vs the more "traditional"
            |     x = x + proj(gelu(expand(x))) in GPT2
            |   - The token encoders, which are arguably external to the
            |     model
            |   - Attention: Llama3 uses Grouped Query Attention, vs full
            |     Multi-Head Attention in GPT2
            |   - Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
           | 
           | They were published more than five years apart. On the one
           | hand progress has been breathtaking, truly astounding. On the
           | other hand, it's almost exactly the same model.
           | 
           | Goes to show just how much is in the training data.
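            | 
            | To make the MLP and normalization differences concrete, here
            | is a rough NumPy sketch (weight names, the GELU approximation,
            | and the omitted biases/residuals are illustrative, not from
            | either codebase):
            | 
            |     import numpy as np
            |     
            |     def gelu(x):
            |         # tanh approximation of GELU (as used in GPT-2)
            |         return 0.5 * x * (1 + np.tanh(
            |             np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
            |     
            |     def silu(x):
            |         return x / (1 + np.exp(-x))
            |     
            |     def gpt2_mlp(x, w_fc, w_proj):
            |         # expand -> GELU -> project back (residual add omitted)
            |         return gelu(x @ w_fc) @ w_proj
            |     
            |     def llama_mlp(x, w_gate, w_up, w_down):
            |         # SwiGLU: a gated expansion instead of a plain GELU
            |         return (silu(x @ w_gate) * (x @ w_up)) @ w_down
            |     
            |     def layernorm(x, gamma, beta, eps=1e-5):
            |         mu = x.mean(-1, keepdims=True)
            |         var = x.var(-1, keepdims=True)
            |         return (x - mu) / np.sqrt(var + eps) * gamma + beta
            |     
            |     def rmsnorm(x, gamma, eps=1e-5):
            |         # no mean subtraction, no bias
            |         ms = (x**2).mean(-1, keepdims=True)
            |         return x / np.sqrt(ms + eps) * gamma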
        
           | moritzwarhier wrote:
           | I think with LLMs in general, the algorithms are very refined
           | and require lots of research, despite being "simple" in terms
            | of entropy, or an imagined Kolmogorov complexity for defining
           | algorithms.
           | 
           | So "simple" is a fuzzy term here, but yes, the entropic
           | complexity is in the data, not the algorithms.
           | 
           | Related to the so-called "Bitter lesson".
           | 
           | Edit: the sister comment pointed out what I failed to
            | express: RLHF and training are also algorithms, and their
           | applications and implementations are probably much more
           | complex than the code that evaluates a given prompt.
           | 
            | So basically, "models" (trained NNs) are also an example of
           | the equivalence of code and data.
           | 
           | Fixed data used by code (the trained model) is code in
           | itself, even when it is not directly written by humans or in
           | a human-readable language.
           | 
           | Edit edit: don't forget to count the imported maths code :)
           | but I assume this is not relevant to the "it's just matrix
           | multiplications" overall argument
        
           | SpaceManNabs wrote:
            | 300 lines of this code is a bit different from 300 lines of
            | typical code where you read files, set up a backend/frontend,
            | or parse data. In the latter case, there are a lot of tedious
            | operations. Sure, the former also has some of that, with
            | reshaping and asserts or whatever.
           | 
           | But in a sense, the 300 lines of Llama code are essentially
           | just lines of math. And reading through any math proof will
           | show you that any particular line can hide large amounts of
           | complexity.
           | 
           | This can be true with code with more tedious operations, but
           | those lines are a smaller fraction of the overall code base
           | by definition.
           | 
           | Even the "tedious" parts of the llama code can hide large
           | complexity. Setting a learning rate with a schedule might
           | require reading a paper or two for your particular
           | architecture.
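            | 
            | For instance, a warmup-plus-cosine-decay schedule is only a
            | few lines once you know what it should look like (the values
            | here are made up):
            | 
            |     import math
            |     
            |     def lr_at(step, max_lr=3e-4, min_lr=3e-5,
            |               warmup=2000, total=100_000):
            |         if step < warmup:                       # linear warmup
            |             return max_lr * (step + 1) / warmup
            |         t = (step - warmup) / (total - warmup)  # 0..1 progress
            |         return min_lr + 0.5 * (max_lr - min_lr) * (
            |             1 + math.cos(math.pi * t))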
           | 
           | But yes, once you parse all the math and the theory, the
           | lines are kinda simple matmul and forward lol.
        
         | kureikain wrote:
         | Do you know why these are so short? What is the algorithm/magic
         | in all of these?
         | 
          | I tried to make sense of it but cannot.
        
           | DavidSJ wrote:
           | The magic is in the billions of learned weights (~synapses).
           | This is just the scaffolding that runs them.
        
           | Hugsun wrote:
           | Architecturally, LLMs are very simple compared to many
           | software projects.
           | 
            | The crux of their behavior comes from their learned weights,
            | which are gigabytes in size and can cost millions to obtain
            | via training.
        
         | ebb_earl_co wrote:
         | On line 59, there is a less-than-or-equals comparison between 0
         | and 1. Curious https://github.com/meta-
         | llama/llama3/blob/main/llama/model.p...
        
       | dang wrote:
       | We changed the URL from https://github.com/likejazz/llama3.np to
       | the article it points to, which gives more background.
        
       ___________________________________________________________________
       (page generated 2024-05-16 23:00 UTC)