[HN Gopher] Llama 3 implemented in pure NumPy
___________________________________________________________________
Llama 3 implemented in pure NumPy
Author : orixilus
Score : 311 points
Date : 2024-05-16 13:53 UTC (9 hours ago)
(HTM) web link (docs.likejazz.com)
(TXT) w3m dump (docs.likejazz.com)
| ulam2 wrote:
| I'll consider superintelligence achieved if AI can do such work
| faithfully.
| sebzim4500 wrote:
| What? Lots of people could produce this repo, it hardly counts
| as superintelligence.
| Scene_Cast2 wrote:
| The rotary embeddings bit is neat. I wonder if a complex
| representation would simplify vs complexify things (readability,
| performance, expressive power).
| johndough wrote:
| Some implementations use a complex rotary encoding, but it
| makes it a bit harder to port to platforms or frameworks which
| do not support complex numbers natively.
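|
| For the curious, a minimal NumPy sketch of the two equivalent
| formulations (the pair layout and names here are illustrative
| and don't match Llama's exact convention):
|
|     import numpy as np
|
|     def rope_real(x, pos, theta=10000.0):
|         # x: (seq_len, head_dim); rotate adjacent pairs (x0, x1), (x2, x3), ...
|         d = x.shape[-1]
|         freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)   # (d/2,)
|         ang = pos[:, None] * freqs[None, :]               # (seq_len, d/2)
|         cos, sin = np.cos(ang), np.sin(ang)
|         x1, x2 = x[..., 0::2], x[..., 1::2]
|         out = np.empty_like(x)
|         out[..., 0::2] = x1 * cos - x2 * sin
|         out[..., 1::2] = x1 * sin + x2 * cos
|         return out
|
|     def rope_complex(x, pos, theta=10000.0):
|         # The same rotation, written as multiplication by unit complex numbers.
|         d = x.shape[-1]
|         freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)
|         ang = pos[:, None] * freqs[None, :]
|         xc = (x[..., 0::2] + 1j * x[..., 1::2]) * np.exp(1j * ang)
|         out = np.empty_like(x)
|         out[..., 0::2], out[..., 1::2] = xc.real, xc.imag
|         return out
|
|     x = np.random.randn(16, 64)
|     pos = np.arange(16)
|     assert np.allclose(rope_real(x, pos), rope_complex(x, pos))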
| 6gvONxR4sf7o wrote:
| The tensor cores that do the bulk of the FLOPs on the bulk of
| the GPUs people use only handle various sizes of floats, I think.
| We're in a funny position where progress in models and progress
| in hardware are kind of linked.
|
| As far as expressive power goes, it shouldn't make a difference
| for the models in common use, but I could totally imagine
| models where it improves readability.
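|
| As a tiny illustration of why complex numbers shouldn't change
| expressive power: a complex matmul is just four real matmuls
| rearranged, so float-only tensor cores can express the same
| computation (pure NumPy, made-up sizes):
|
|     import numpy as np
|
|     A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
|     B = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
|
|     # real/imaginary parts of A @ B via real matmuls only
|     C = (A.real @ B.real - A.imag @ B.imag) \
|         + 1j * (A.real @ B.imag + A.imag @ B.real)
|
|     assert np.allclose(C, A @ B)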
| johndough wrote:
| What is the difference from the llama.np repository credited in the
| README? https://github.com/hscspring/llama.np
| aeyes wrote:
| Well, it supports Llama3.
|
| But the other question I have is about the license. The
| tokenizer.py file is identical, and the rest is very similar -
| just making minor adjustments here and there.
|
| Can they just take this Apache 2 licensed code, change it a bit
| and offer it as MIT? They are clearly not the original author.
| Scaevolus wrote:
| Unfortunately, licenses are only worth as much as your
| lawyers.
| yjftsjthsd-h wrote:
| DMCA takedowns are free.
| not2b wrote:
| A less aggressive approach would be to file an issue and
| let the maintainer correct the license issue.
| kolinko wrote:
| Obligatory Recmo's Llama1 implementation in numpy :)
|
| https://github.com/recmo/cria
| joennlae wrote:
| Trainable Llama-like transformer (with backpropagation) in numpy
| only (~600 lines)
|
| https://github.com/joennlae/tensorli
| lnyan wrote:
| `import jax.numpy as np`, then we also get a JAX implementation
| after a few modifications: e.g. removing in-place index
| assignments, replacing unsupported functions, etc.
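|
| For instance, NumPy's in-place index assignments have to become
| functional `.at[...].set(...)` updates (a tiny illustrative
| sketch, not code from the repo):
|
|     import jax.numpy as jnp
|
|     cache = jnp.zeros((2, 8, 4))   # a small KV-cache-like buffer
|     new_kv = jnp.ones((2, 4))
|     pos = 3
|
|     # NumPy style (fails on immutable JAX arrays): cache[:, pos] = new_kv
|     # JAX style: a functional update that returns a new array
|     cache = cache.at[:, pos].set(new_kv)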
| cl3misch wrote:
| ...which should also be much faster on CPU, I assume.
| ffriend wrote:
| JAX requires a bit more work to maintain fixed-size buffers as
| required by XLA, especially in case of caching and rotary
| embeddings. But yeah, overall the code can be pretty similar
| [1].
|
| [1]:
| https://github.com/dfdx/fabrique/blob/main/fabrique/llama/mo...
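|
| A rough sketch of the fixed-shape caching pattern (sizes and
| names made up, not taken from fabrique):
|
|     import jax
|     import jax.numpy as jnp
|
|     MAX_SEQ, N_KV_HEADS, HEAD_DIM = 2048, 8, 64
|     k_cache = jnp.zeros((MAX_SEQ, N_KV_HEADS, HEAD_DIM))  # static shape
|
|     @jax.jit
|     def append_k(k_cache, k_new, pos):
|         # Write one new key at position `pos` without changing the
|         # buffer's shape, so the compiled program doesn't depend on
|         # how long the sequence actually is.
|         return jax.lax.dynamic_update_slice(k_cache, k_new, (pos, 0, 0))
|
|     k_cache = append_k(k_cache, jnp.ones((1, N_KV_HEADS, HEAD_DIM)), 5)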
| xchip wrote:
| Nice but the tricky part is the training data.
| whereismyacc wrote:
| There are a lot of tricky parts.
| swader999 wrote:
| The tricky part is getting big enough that no one can
| successfully sue you for using "your" training data.
| buildbot wrote:
| Cool, instant CUDA acceleration via CuPy! `import cupy as np`
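|
| The swap is only a sketch of the idea; it holds up as long as
| the code sticks to array APIs CuPy implements and a CUDA device
| is available:
|
|     import cupy as np    # drop-in replacement for the numpy namespace
|
|     x = np.random.randn(4096, 4096).astype(np.float32)
|     y = x @ x.T               # runs on the GPU
|     y_host = np.asnumpy(y)    # explicit copy back to host memory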
| AI_hacker wrote:
| How does the performance of llama3.np compare to other
| implementations, especially considering it's a pure NumPy
| implementation?
| rhdunn wrote:
| From the TinyStories dataset card [1] the dataset is generated by
| GPT-3.5 and GPT-4. Reading the discussions in the community tab
| [2] it looks like there are a lot of incomplete or misspelled
| words, incorrect grammar, and even Chinese characters in the
| dataset.
|
| As such, I'd be wary of using that dataset to train or evaluate
| models.
|
| [1] https://huggingface.co/datasets/roneneldan/TinyStories
|
| [2]
| https://huggingface.co/datasets/roneneldan/TinyStories/discu...
| nwoli wrote:
| It's just used for checking that the implementation is correct.
| It's a toy dataset; it doesn't matter if it has misspelled
| words.
| ffriend wrote:
| It's also worth mentioning that the original implementation by
| Meta is only 300 lines of very readable code [1].
|
| [1]: https://github.com/meta-
| llama/llama3/blob/main/llama/model.p...
| blt wrote:
| The simplicity of the transformer is quite refreshing,
| especially in vision, where the Vision Transformer with linear
| patch encodings replaces complex, intertwined decisions about
| filter size, striding, pooling, #filters, depth, etc. with the
| simpler decision of how to allocate your FLOPs between
| dimensionality, #heads, and #layers.
| blharr wrote:
| So is it the case that the information is in the dataset? Or is
| the code just very well designed to be this small? As an
| outsider, it's surprising that such a capable model can be so
| "simple".
| jacobn wrote:
| The training code is presumably quite a bit more complex than
| what they've open sourced, but part of the beauty of the GPT-
| based LLMs is their structural simplicity.
|
| Now, that simplicity can be deceiving - there is a lot of
| conceptual interconnectedness within these models. They've
| been put together "just so", if you will.
|
| If you look at the source code to nanoGPT and compare it to
| Llama3, the most remarkable thing (when you look past the
| superficial name changes) is just how similar they are.
|
| If I recall correctly, the primary differences are (rough NumPy
| sketch at the end of this comment):
|
| - The MLP: Llama3 uses SwiGLU vs the more "traditional"
|   x = x + proj(gelu(expand(x))) in GPT2
| - The token encoders, which are arguably external to the model
| - Attention: Llama3 uses Grouped Query Attention, vs full
|   Multi-Head Attention in GPT2
| - Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
|
| They were published more than five years apart. On the one
| hand progress has been breathtaking, truly astounding. On the
| other hand, it's almost exactly the same model.
|
| Goes to show just how much is in the training data.
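|
| A rough NumPy sketch of the MLP and normalization differences
| (weight names invented; biases and residual connections
| omitted):
|
|     import numpy as np
|
|     def gelu(x):  # GPT2-style activation (tanh approximation)
|         c = np.sqrt(2 / np.pi)
|         return 0.5 * x * (1 + np.tanh(c * (x + 0.044715 * x**3)))
|
|     def mlp_gpt2(x, w_fc, w_proj):
|         # expand -> GELU -> project back
|         return gelu(x @ w_fc) @ w_proj
|
|     def silu(x):
|         return x / (1 + np.exp(-x))
|
|     def mlp_llama(x, w_gate, w_up, w_down):
|         # SwiGLU: a SiLU-gated branch times a linear branch
|         return (silu(x @ w_gate) * (x @ w_up)) @ w_down
|
|     def layernorm(x, g, b, eps=1e-5):
|         mu = x.mean(-1, keepdims=True)
|         var = x.var(-1, keepdims=True)
|         return (x - mu) / np.sqrt(var + eps) * g + b
|
|     def rmsnorm(x, g, eps=1e-5):
|         # no mean subtraction and no bias, just rescaling by the RMS
|         return x / np.sqrt((x**2).mean(-1, keepdims=True) + eps) * g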
| moritzwarhier wrote:
| I think with LLMs in general, the algorithms are very refined
| and require lots of research, despite being "simple" in terms
| of entropy, or an imagined Kolmogorov complexity for defining
| algorithms.
|
| So "simple" is a fuzzy term here, but yes, the entropic
| complexity is in the data, not the algorithms.
|
| Related to the so-called "Bitter lesson".
|
| Edit: the sister comment pointed out what I failed to
| express: RLHF and training are also algorithms, and their
| applications and implementations are probably much more
| complex than the code that evaluates a given prompt.
|
| So basically, "models" (trained NNs) are also an example for
| the equivalence of code and data.
|
| Fixed data used by code (the trained model) is code in
| itself, even when it is not directly written by humans or in
| a human-readable language.
|
| Edit edit: don't forget to count the imported maths code :)
| but I assume this is not relevant to the "it's just matrix
| multiplications" overall argument
| SpaceManNabs wrote:
| 300 lines of this code is a bit different from 300 lines of
| typical code where you read files, set up a backend/frontend,
| or parse data. In the latter case, there are a lot of tedious
| operations. Sure, the former also has some of that, with
| reshaping and asserts or whatever.
|
| But in a sense, the 300 lines of Llama code are essentially
| just lines of math. And reading through any math proof will
| show you that any particular line can hide large amounts of
| complexity.
|
| This can be true with code with more tedious operations, but
| those lines are a smaller fraction of the overall code base
| by definition.
|
| Even the "tedious" parts of the llama code can hide large
| complexity. Setting a learning rate with a schedule might
| require reading a paper or two for your particular
| architecture.
|
| But yes, once you parse all the math and the theory, the
| lines are kinda simple matmul and forward lol.
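|
| As an illustration of the learning-rate point above: a common
| recipe is linear warmup followed by cosine decay (the constants
| here are made up, not Llama's actual hyperparameters):
|
|     import numpy as np
|
|     def lr(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
|         if step < warmup:                      # linear warmup
|             return max_lr * (step + 1) / warmup
|         t = (step - warmup) / max(1, total - warmup)
|         return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * t))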
| kureikain wrote:
| Do you know why these are so short? What is the algorithm/magic
| in all of these?
|
| I tried to make sense of it but cannot
| DavidSJ wrote:
| The magic is in the billions of learned weights (~synapses).
| This is just the scaffolding that runs them.
| Hugsun wrote:
| Architecturally, LLMs are very simple compared to many
| software projects.
|
| The crux of their behavior comes from their learned weights
| which are gigabytes and can cost millions to obtain via
| training.
| ebb_earl_co wrote:
| On line 59, there is a less-than-or-equals comparison between 0
| and 1. Curious https://github.com/meta-
| llama/llama3/blob/main/llama/model.p...
| dang wrote:
| We changed the URL from https://github.com/likejazz/llama3.np to
| the article it points to, which gives more background.
___________________________________________________________________
(page generated 2024-05-16 23:00 UTC)