[HN Gopher] Llama3 implemented from scratch
       ___________________________________________________________________
        
       Llama3 implemented from scratch
        
       Author : Hadi7546
       Score  : 280 points
       Date   : 2024-05-19 18:42 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lakshyaag wrote:
        | Awesome, gonna go through it!
        
       | digitaltrees wrote:
       | Are you the repo author or reposting something cool? I am curious
       | because I want to talk to the repo author about a collaboration
       | project.
        
         | magoghm wrote:
         | You might be able to reach the repo author on X:
         | https://x.com/naklecha
        
       | brcmthrowaway wrote:
        | Wait, are you saying SoTA NN research hasn't evolved from
        | hardcoding a bunch of layer structures and sizes?
        | 
        | I'm kind of shocked. I thought there would be more dynamism by
        | now. I stopped dabbling around 2018.
        
         | astrange wrote:
         | The innovation is that everything is just one standardized
         | structure now (transformer models) and you make it bigger if
         | you feel like you need that.
         | 
         | There's still some room for experimenting if you care about
         | memory/power efficiency, like MoE models, but they're not as
         | well understood yet.
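          | 
          | To make that concrete, here is a minimal sketch (assuming
          | PyTorch; not from the linked repo, and all sizes are
          | illustrative) of what "one standardized structure" means in
          | practice:
          | 
          |   import torch.nn as nn
          |   
          |   # One standard pre-norm transformer block: attention + MLP.
          |   class Block(nn.Module):
          |       def __init__(self, d_model, n_heads):
          |           super().__init__()
          |           self.attn = nn.MultiheadAttention(
          |               d_model, n_heads, batch_first=True)
          |           self.mlp = nn.Sequential(
          |               nn.Linear(d_model, 4 * d_model), nn.GELU(),
          |               nn.Linear(4 * d_model, d_model))
          |           self.ln1 = nn.LayerNorm(d_model)
          |           self.ln2 = nn.LayerNorm(d_model)
          |   
          |       def forward(self, x):
          |           h = self.ln1(x)
          |           x = x + self.attn(h, h, h, need_weights=False)[0]
          |           return x + self.mlp(self.ln2(x))
          |   
          |   # "Make it bigger": the whole model is the same block
          |   # repeated; scaling up is just turning these dials.
          |   model = nn.Sequential(
          |       *[Block(d_model=512, n_heads=8) for _ in range(12)])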
        
           | aDyslecticCrow wrote:
           | There are too many papers throwing transformers on everything
           | without thinking. Transformers are amazing for language but
           | kinda mid on everything else. CS researchers tend to jump on
           | trends really hard, so it will probably go back to normal
           | again soon.
        
             | imtringued wrote:
             | I don't know what you mean by amazing for language. Almost
             | everything is built on transformers nowadays. Image
             | segmentation uses transformers. Text to speech uses
             | transformers. Voice recognition uses transformers. There
             | are robotics transformers that take image inputs and output
             | motion sequences. Transformers are inherently multi-modal.
             | They handle whatever you throw at them, it's just that
             | language tends to be a very common input or output.
        
         | pshc wrote:
         | My wild guess is that adjusting the shape before each step is
         | not worth the speed hit. Uniform structures make GPUs go brrrrr
        
           | astrange wrote:
           | It's also easier to train and in particular easier to
           | parallelize.
        
         | delusional wrote:
         | The innovation is the amount of resources people are willing to
         | spend right now. From looking at the research code it's clear
         | that the whole field is basically doing a (somewhat) guided
         | search in the entire space of possible layer permutations.
         | 
         | There seems to be no rhyme or reason, no scientific insight, no
         | analysis. They just try a million different permutations, and
         | whatever scores the highest on the benchmarks gets published.
        
           | moffkalast wrote:
            | Well, it took evolution 4 billion years of testing out
            | random permutations to arrive at a pretty good local
            | maximum, so there is hope for us yet.
        
             | WanderPanda wrote:
              | "I'm a pretty good local maximum" is what any local
              | maximum would tell you if asked how it likes itself.
        
         | curious_cat_163 wrote:
          | There is a tick-tock between searching for the dominant NN
          | architecture (tick) and optimizing it for accuracy, compute,
          | and inference latency and throughput (tock).
          | 
          | This particular (tock) is still playing out. The next (tick)
          | does not feel imminent and will likely depend on when we
          | discover the limits of transformers when it comes to solving
          | for the long tail of use-cases.
         | 
         | My $0.02.
        
           | rdedev wrote:
            | My wish is that they would move on to the next phase. The
            | whole deal with SSMs looks really good, but the search for
            | better architectures is countered with "a regular
            | architecture with more parameters does better, so what's the
            | point of this?"
        
             | tysam_and wrote:
             | Heyo! Have been doing this for a while. SSMs certainly are
             | flashy (most popular topics-of-the-year are), and it would
             | be nice to see if they hit a point of competitive
             | performance with transformers (and if they stand the test
             | of time!)
             | 
              | There are certainly tradeoffs to both. The general
              | transformer motif scales very well on a number of axes, so
              | that may be the dominant algorithm for a while to come,
             | though almost certainly it will change and evolve as time
             | goes along (and who knows? something else may come along as
             | well <3 :')))) ).
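              | 
              | For readers who haven't met SSMs: the core is a simple
              | linear recurrence. A toy sketch (NumPy assumed, all
              | matrices illustrative; real SSM language models like Mamba
              | add learned, input-dependent parameters and a parallel
              | scan):
              | 
              |   import numpy as np
              |   
              |   d_state, d_in, T = 4, 2, 8
              |   A = np.eye(d_state) * 0.9           # state transition
              |   B = np.random.randn(d_state, d_in)  # input projection
              |   C = np.random.randn(d_in, d_state)  # output projection
              |   
              |   x = np.zeros(d_state)
              |   for u in np.random.randn(T, d_in):
              |       x = A @ x + B @ u  # O(1) state update per token,
              |       y = C @ x          # vs attention's O(t) lookback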
        
           | imtringued wrote:
            | You have to consider that there is still some low-hanging
            | fruit that would let you improve prompt processing (not
            | token generation) performance by an order of magnitude or
            | even two, but there are no takers. The reason is quite
            | simple: you can just buy more GPUs and forget about the
            | optimizations.
           | 
           | If a 100x improvement in performance is left on the table,
           | then surely even lower priority optimizations won't be
           | implemented any time soon.
           | 
            | Consider this: a lot of clever attention optimizations rely
            | on some initial pass to narrow down the important tokens and
            | discard the rest from the KV cache. If this were actually
            | possible, then how come the first few layers of the LLM don't
            | already do this numerically to focus their attention? Here is
            | the shocker: they already do, but since you're passing the
            | full 8k context to the next layer anyway, you're wasting it
            | on mostly... nothing.
           | 
           | I repeat: Does the 80th layer really need the ability to
           | perform attention over all the previous 8k outputs of the
           | 79th layer? The first layer? Definitely. The last? No. What
           | happens if you only perform attention over 10% of the outputs
           | of layer 79? What speedup does this give you?
           | 
            | Notice that the model has already learned the optimal
            | attention scheme. You just need to give it less stuff to do
            | and it will get faster automatically.
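            | 
            | A toy sketch of that last idea (NumPy; every number and the
            | importance heuristic are illustrative, not Llama's actual
            | scheme): in a deep layer, attend over only the top ~10% of
            | cached positions instead of the full context.
            | 
            |   import numpy as np
            |   
            |   def softmax(z):
            |       e = np.exp(z - z.max())
            |       return e / e.sum()
            |   
            |   T, d, k = 8192, 64, 819   # context, head dim, keep ~10%
            |   K = np.random.randn(T, d)  # cached keys for one layer
            |   V = np.random.randn(T, d)  # cached values
            |   q = np.random.randn(d)     # current query
            |   # e.g. attention mass accumulated in earlier layers:
            |   importance = np.random.rand(T)
            |   
            |   keep = np.argsort(importance)[-k:]   # top-k positions
            |   attn = softmax(q @ K[keep].T / np.sqrt(d))
            |   out = attn @ V[keep]   # ~10x less work than full T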
        
             | miven wrote:
              | I don't get your point; how is what you're suggesting here
              | different from the papers we already have on KV cache
              | pruning methods, like [1]?
             | 
             | [1] https://arxiv.org/abs/2305.15805
        
         | dauertewigkeit wrote:
          | There are things like NAS (neural architecture search), but
          | that just grows the search space and makes the optimization
          | problem much harder. Typically you do the architecture
          | optimization by hand, using heuristics and past experiments as
          | guidance.
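          | 
          | For a sense of why that search is so expensive, here is a toy
          | sketch (all names and numbers illustrative) of NAS reduced to
          | its essence: random search over configurations, where every
          | sample costs a full training run.
          | 
          |   import random
          |   
          |   SPACE = {"n_layers": [4, 8, 16, 32],
          |            "d_model": [256, 512, 1024],
          |            "n_heads": [4, 8, 16]}
          |   
          |   def evaluate(cfg):
          |       # stand-in for a full (expensive!) train + eval run
          |       return random.random()
          |   
          |   best, best_score = None, -1.0
          |   for _ in range(100):   # 100 training runs, just to search
          |       cfg = {k: random.choice(v) for k, v in SPACE.items()}
          |       s = evaluate(cfg)
          |       if s > best_score:
          |           best, best_score = cfg, s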
        
         | Mehdi2277 wrote:
          | I've occasionally worked with more dynamic models (tree-
          | structured decoding). They are generally not a good fit for
          | maximizing GPU throughput. A lot of the magic of transformers
          | and large language models is about pushing the GPU as hard as
          | we can, and a simpler static model architecture that trains
          | faster can train on much more data.
          | 
          | So until the hardware allows for comparable (say within 2-4x)
          | throughput of samples per second, I expect model architectures
          | to mostly be static for most effective models and dynamic
          | architectures to be an interesting side area.
        
         | aDyslecticCrow wrote:
          | The only thing that has changed since 2018 is the most popular
          | network structure to play with. The code looks the same as
          | always: Python notebooks where someone manually calculated the
          | size of each hard-coded layer to make it fit.
        
       | revskill wrote:
       | Genius.
        
       | hovering_nox wrote:
       | Why can the author only write in all lowercase?
        
         | ronsor wrote:
         | Sam Altman does it too
        
         | Pr0ject217 wrote:
         | It's the cool thing to do now...
        
           | lelandfe wrote:
           | The treatment of the English language on TikTok is giving the
           | late Yahoo Answers a run for its money.
        
         | tredre3 wrote:
          | At least they use punctuation. We've recently had a project on
          | HN where the author used only lowercase and no punctuation
          | because they equated it to being chained by the system.
        
           | groovy2shoes wrote:
           | rip cormac mccarthy
        
             | _giorgio_ wrote:
             | It's your problem only.
        
           | programjames wrote:
           | The fight against capitalism spares no letter.
        
         | baobabKoodaa wrote:
         | do you wanna be cool or not?
        
         | teaearlgraycold wrote:
         | Too poor to fix their shift key
        
         | Retr0id wrote:
         | because it annoys HN commenters
        
         | renegade-otter wrote:
         | Because Sam Altman does it and he is rich, so...
        
           | bossyTeacher wrote:
           | Where? His blog looks normal
        
             | renegade-otter wrote:
             | Just look at his Twitter: https://x.com/sama
             | 
             | And no, Twitter is no excuse to type like an illiterate
             | teenager.
             | 
             | And I will bet you someone edits his blogs to not look like
             | that.
        
         | skriticos2 wrote:
          | Seeing Anya (the girl pointing at pictures), I'd guess the
          | author is partial to Japanese culture. As that writing system
          | does not have a concept of upper/lower case, they might just
          | have decided that capitals are superfluous. Or they're simply
          | an eccentric. Though I guess this is one of those things that
          | some folks won't care about and others will get mightily hung
          | up on.
          | 
          | I personally don't really mind the bit of capitalization that
          | English does. German is much worse.
        
           | hovering_nox wrote:
            | > I personally don't really mind the bit of capitalization
            | > that English does. German is much worse.
           | 
           | You misspelled 'better'.
        
           | Kuinox wrote:
            | Their Twitter indicates Amsterdam; I just think they're an
            | anime fan.
           | 
           | And they are not alone.
           | 
           | https://twitter.com/karpathy/status/1792261360430293176
        
           | golergka wrote:
           | d u xpct hbrw spkr twrt nnglsh lk ths?
        
             | programjames wrote:
              | I think you misspelled that slightly:
             | 
             | > d' 'ou 'xp'ct h'br'w sp''k'rs t' wr't' 'n 'ngl'sh l'k'
             | th's?
        
         | nekochanwork wrote:
         | Creative writing + Hyperfocused autistic obsession = The Anime
         | Guide to Neural Networks and Large Language Models.
        
         | TacticalCoder wrote:
          | And why can't the author pass their text into an LLM and simply
          | ask: _" plz fix frist word of each paragraf by using an
          | uppercase letter k txh bye"_.
          | 
          | A fair question.
        
         | adamrezich wrote:
          | 2024 is the year that most of us are collectively growing out
          | of the early-social-media-era all-lowercase thing, but not
          | everyone has gotten the memo yet.
        
         | spencerchubb wrote:
         | so more people comment on the hn post and it will rank higher
         | in the algo
         | 
         | such as your comment and my comment!
        
         | bdangubic wrote:
         | shift key busted
        
         | efilife wrote:
         | This comment is unsubstantial and provides no value. Why do you
         | care about this?
        
       | andy99 wrote:
       | I don't want to be dismissive, it's a fun project, but this has
        | been done a lot already - maybe not with llama3, but the
        | architecture is basically the same as llama2's. Look at the big
        | list of from-scratch implementations on Karpathy's llama2.c page.
       | 
       | Is there something particularly different about this one?
       | 
       | Edit - guess not?
        
         | fifilura wrote:
         | I think they learned a lot doing this? And they tried hard
         | explaining each step!
        
         | rvz wrote:
          | Well, given the fast pace of AI, it should not be a surprise
          | that this is similar to llama2 and that we're seeing the (n +
          | 1)th toy implementation, one that likely has bugs or leaks in
          | the background.
          | 
          | You might as well look at llama.cpp for a serious, production-
          | grade implementation to learn from. Otherwise, nothing to see
          | here.
         | 
         | > Is there something particularly different about this one?
         | 
         | Other than the immature lowercase, anime BS, etc, then...
         | 
         | No.
        
         | tildef wrote:
         | There's literally an image of Anya pointing at Karpathy on this
         | GitHub page.
        
       | fnetisma wrote:
        | The iterative leaps by which open-source models keep getting
        | better are strong evidence that companies competing at the LLM
        | model layer have an ephemeral moat.
        | 
        | Serious question: assuming this is true, if an incumbent-
        | challenger like OpenAI wants to win, how does it effectively
        | compete against incumbents such as Meta and Google, whose
        | product offerings can be AI-enhanced in a snap?
        
         | cal85 wrote:
         | Their moat atm is being 6 months ahead of everyone else on
         | model quality. Plus the 'startup' advantage over their
         | corporate competitors. Oh and they can hoard a lot of the best
         | talent because it's an extremely high status place to work.
         | 
         | Their task now is to maintain and exploit those advantages as
         | best they can while they build up a more stable long term moat:
         | lots of companies having their tech deeply integrated into
         | their operations.
        
           | andy99 wrote:
            | Just to add, they don't have the baggage of Google or Meta,
            | so they can do more without worrying how it impacts the rest
            | of the company. And of the big players, they seem the most
            | aware of how important _good_ data is, and have paid for
            | lots of high-quality curated fine-tuning data in order to
            | build a proper product instead of doing a research project.
            | That mindset, and the commercial difference it makes,
            | shouldn't be underestimated.
        
         | 123yawaworht456 wrote:
         | the very first big AI company who gives up trying to lobotomize
         | and emasculate their models to align with the values of 0.01%
         | of the world population will win a lot of hearts and minds
         | overnight. the censorship necessary for corporate applications
         | can be trivially implemented as a toggleable layer, using a
         | small, efficient, specialist model to detect no-no words and
         | wrongthink in inputs/outputs.
         | 
         | gpt, claude, gemini, even llama and mistral, all tend to
         | produce the same nauseating slop, easily-recognizable by anyone
         | familiar with LLMs - these days, I cringe when I read 'It is
         | important to remember' even when I see it in some ancient, pre-
         | slop writings.
         | 
         | creativity - one of the very few applications generative AI can
         | truly excel at - is currently impossible. it could
         | revolutionize entertainment, but it isn't allowed to. the
         | models are only _allowed_ to produce inoffensive, positivity-
         | biased, sterile slop that no human being finds attractive.
        
           | andy99 wrote:
           | > the censorship necessary for corporate applications can be
           | trivially implemented as a toggleable layer, using a small,
           | efficient, specialist model to detect no-no words and
           | wrongthink in inputs/outputs.
           | 
            | What's really funny is they all have "jailbreaks" that you
            | can use to make them say anything anyway. So for "corporate"
            | uses, the method you propose is already mandatory. The whole
            | thing (censoring base models) is a misguided combination of
            | ideology and (over-the-top) risk aversion.
        
           | malfist wrote:
           | Please explain what you mean when you say the 0.01% are
           | emasculating AI
        
             | mavhc wrote:
             | They're suggesting that 99.99% of people don't mind if AI
             | reflects biases of society. Which is weird because I'm
             | pretty sure most people in the world aren't old white
             | middle class Americans
        
               | ben_w wrote:
               | Indeed. If religion is a good guide, then I think around
               | 24% think that pork is inherently unclean and not fit for
               | human consumption under penalty of divine wrath, and 15%
               | think that it's immoral to kill cattle for any reason.
                | Also, non-religiously, I'd guess around 17% think "China
                | is great, and only good things ever happened in
                | Tiananmen Square."
        
               | 123yawaworht456 wrote:
                | yes, yes, bias like the fact that the Wehrmacht was not
                | the human menagerie that 0.01% of the population insists
                | we live in.
               | 
               | https://www.google.com/search?q=gemini+german+soldier
               | 
                | prompt-injected mandatory diversity led to the most
                | hilarious shit I've seen generative AI do so far.
               | 
               | but, yes, of course, other instances of 'I reject your
               | reality and substitute my own' - like depicting medieval
               | Europe to be as diverse as vibrant American inner cities
               | - those are fine.
        
         | golergka wrote:
         | They scare the government into regulating the field into
         | oblivion.
        
       | miki123211 wrote:
       | If you like this, it's also worth looking at llama2.c[1], an
       | implementation of the Llama 2 architecture in about 1000 lines of
        | plain, dependency-free C, tokenizer and all. The fact that this
        | 960-line file and a somewhat modern C compiler are all you
        | really need to run a state-of-the-art language model is
        | surprising to many.
       | 
        | Of course, this is not all there is to a modern LLM; it would
        | probably take another thousand lines or two to implement
        | training, and many more than that to make it fast on all the
        | major CPU and GPU architectures. If you want a flexible
        | framework that lets a developer define any model they want and
        | still goes as fast as it can, the complexity spirals.
       | 
       | Most programmers have an intuition that duplicating a large
       | software project from scratch, like Linux or Chromium for
       | example, would require incredible amounts of expertise, manpower
       | and time. It's not something that a small team can achieve in a
       | few months. You're limited by talent, not hardware.
       | 
        | LLMs are very different. The code isn't _that_ complicated: you
        | could probably implement training and inference for a single
        | model architecture, from scratch, on a single kind of GPU, with
        | reasonable performance, as an individual with a programming
        | background who still remembers their calculus and linear
        | algebra, given a year or so of self-study. What makes LLMs
        | difficult is getting access to all the hardware to train them,
        | getting the data, and being able to preprocess that data.
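        | 
        | To give a flavor of the part that wraps all the math, here is a
        | hedged Python sketch of the generation loop; forward() is a
        | placeholder for the full transformer stack that llama2.c packs
        | into its ~1000 lines:
        | 
        |   import numpy as np
        |   
        |   VOCAB = 32000
        |   
        |   def forward(tokens):
        |       # placeholder: embeddings, attention, MLPs, norms...
        |       return np.random.randn(VOCAB)   # next-token logits
        |   
        |   def generate(prompt, n_new, temperature=0.8):
        |       tokens = list(prompt)
        |       for _ in range(n_new):
        |           logits = forward(tokens) / temperature
        |           p = np.exp(logits - logits.max())
        |           p /= p.sum()   # softmax over the vocabulary
        |           tokens.append(int(np.random.choice(VOCAB, p=p)))
        |       return tokens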
        
         | evanjrowley wrote:
         | Links for llama2.c:
         | 
         | https://github.com/karpathy/llama2.c
         | 
         | https://news.ycombinator.com/item?id=36838051
        
         | Fubarberry wrote:
          | There's also a project where they have GPT-2 running off of an
          | Excel spreadsheet.
         | 
         | https://arstechnica.com/information-technology/2024/03/once-...
        
       | _giorgio_ wrote:
        | I wanted to try Karpathy's repo, but I still don't want to learn
        | C (llama2.c is probably his only C repo), so thanks for posting
        | this.
        
       ___________________________________________________________________
       (page generated 2024-05-19 23:00 UTC)