[HN Gopher] Llama3 implemented from scratch
___________________________________________________________________
Llama3 implemented from scratch
Author : Hadi7546
Score : 280 points
Date : 2024-05-19 18:42 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lakshyaag wrote:
| Awesome, gonna go through!
| digitaltrees wrote:
| Are you the repo author or reposting something cool? I am curious
| because I want to talk to the repo author about a collaboration
| project.
| magoghm wrote:
| You might be able to reach the repo author on X:
| https://x.com/naklecha
| brcmthrowaway wrote:
| Wait, are you saying SoTA NN research hasn't evolved from
| hardcoding a bunch of layer structures and sizes?
|
| I'm kind of shocked. I thought there would be more dynamism by
| now and I stopped dabbling in like 2018.
| astrange wrote:
| The innovation is that everything is just one standardized
| structure now (transformer models) and you make it bigger if
| you feel like you need that.
|
| There's still some room for experimenting if you care about
| memory/power efficiency, like MoE models, but they're not as
| well understood yet.
| aDyslecticCrow wrote:
| There are too many papers throwing transformers on everything
| without thinking. Transformers are amazing for language but
| kinda mid on everything else. CS researchers tend to jump on
| trends really hard, so it will probably go back to normal
| again soon.
| imtringued wrote:
| I don't know what you mean by amazing for language. Almost
| everything is built on transformers nowadays. Image
| segmentation uses transformers. Text to speech uses
| transformers. Voice recognition uses transformers. There
| are robotics transformers that take image inputs and output
| motion sequences. Transformers are inherently multi-modal.
| They handle whatever you throw at them, it's just that
| language tends to be a very common input or output.
| pshc wrote:
| My wild guess is that adjusting the shape before each step is
| not worth the speed hit. Uniform structures make GPUs go brrrrr
| astrange wrote:
| It's also easier to train and in particular easier to
| parallelize.
| delusional wrote:
| The innovation is the amount of resources people are willing to
| spend right now. From looking at the research code it's clear
| that the whole field is basically doing a (somewhat) guided
| search in the entire space of possible layer permutations.
|
| There seems to be no rhyme or reason, no scientific insight, no
| analysis. They just try a million different permutations, and
| whatever scores the highest on the benchmarks gets published.
| moffkalast wrote:
| Well, it took evolution 4 billion years of testing out random
| permutations to arrive at a pretty good local maximum, so
| there is hope for us yet.
| WanderPanda wrote:
| "I'm a pretty good local maximum" - that's what any local
| maximum would tell you if asked how it likes itself
| curious_cat_163 wrote:
| There is a tick-tock between searching for the dominant NN
| architecture (tick) and optimizing it for accuracy, compute,
| and inference latency and throughput (tock).
|
| This particular (tock) is still playing out. The next (tick)
| does not feel imminent and will likely depend on when we
| discover the limits of transformers when it comes to
| solving for the long tail of use-cases.
|
| My $0.02.
| rdedev wrote:
| My wish is that they would move on to the next phase. The
| whole deal with SSMs looks really good. But the search for
| better architectures is countered with "a regular architecture
| with more parameters does better, so what's the point of this?"
| tysam_and wrote:
| Heyo! Have been doing this for a while. SSMs certainly are
| flashy (most popular topics-of-the-year are), and it would
| be nice to see if they hit a point of competitive
| performance with transformers (and if they stand the test
| of time!)
|
| There are certainly tradeoffs to both; the general
| transformer motif scales very well on a number of axes, so
| that may be the dominant algorithm for a while to come,
| though almost certainly it will change and evolve as time
| goes along (and who knows? something else may come along as
| well <3 :')))) ).
| imtringued wrote:
| You have to consider that there are still some low-hanging
| fruit that would let you improve prompt processing (not token
| generation) performance by an order of magnitude or even two,
| but there are no takers. The reason is quite simple: you can
| just buy more GPUs and forget about the optimizations.
|
| If a 100x improvement in performance is left on the table,
| then surely even lower priority optimizations won't be
| implemented any time soon.
|
| Consider this: a lot of clever attention optimizations rely
| on some initial pass to narrow down the important tokens and
| discard the rest from the KV cache. If this were actually
| possible, then how come the first few layers of the LLM don't
| already do this numerically to focus their attention? Here is
| the shocker: they already do, but since you're passing the
| full 8k context to the next layer anyway, you're wasting it
| on mostly... nothing.
|
| I repeat: Does the 80th layer really need the ability to
| perform attention over all the previous 8k outputs of the
| 79th layer? The first layer? Definitely. The last? No. What
| happens if you only perform attention over 10% of the outputs
| of layer 79? What speedup does this give you?
|
| Notice how the model has already learned the optimal
| attention scheme. You just need to give it less stuff to do,
| and it will get faster automatically.
| miven wrote:
| I don't get your point, how is what you're suggesting here
| different from a few papers we already have on KV cache
| pruning methods like [1]?
|
| [1] https://arxiv.org/abs/2305.15805
| dauertewigkeit wrote:
| There are things like NAS (neural architecture search), but
| all you are doing is growing the search space and making the
| optimization problem much harder. Typically you do the
| architectural optimization by hand, using heuristics and past
| experiments as guidance.
| Mehdi2277 wrote:
| I've occasionally worked with more dynamic models (tree-
| structured decoding). They are generally not a good fit for
| maximizing GPU throughput. A lot of the magic of transformers
| and large language models is about pushing the GPU as hard as
| we can, and a simpler static model architecture that trains
| faster can train on much more data.
|
| So until the hardware allows for comparable (say within 2-4x)
| throughput in samples per second, I expect model architectures
| to mostly stay static for the most effective models, and
| dynamic architectures to remain an interesting side area.
| aDyslecticCrow wrote:
| The only thing that has changed since 2018 is the most popular
| network structure to play with. The code looks the same as
| always: Python notebooks where someone manually calculated the
| size of each hard-coded layer to make it fit.
| revskill wrote:
| Genius.
| hovering_nox wrote:
| Why can the author only write in all lowercase?
| ronsor wrote:
| Sam Altman does it too
| Pr0ject217 wrote:
| It's the cool thing to do now...
| lelandfe wrote:
| The treatment of the English language on TikTok is giving the
| late Yahoo Answers a run for its money.
| tredre3 wrote:
| At least they use punctuation. We've recently had a project on
| HN where the author used only lowercase and no punctuation
| because they equated it with being chained by the system.
| groovy2shoes wrote:
| rip cormac mccarthy
| _giorgio_ wrote:
| It's your problem only.
| programjames wrote:
| The fight against capitalism spares no letter.
| baobabKoodaa wrote:
| do you wanna be cool or not?
| teaearlgraycold wrote:
| Too poor to fix their shift key
| Retr0id wrote:
| because it annoys HN commenters
| renegade-otter wrote:
| Because Sam Altman does it and he is rich, so...
| bossyTeacher wrote:
| Where? His blog looks normal
| renegade-otter wrote:
| Just look at his Twitter: https://x.com/sama
|
| And no, Twitter is no excuse to type like an illiterate
| teenager.
|
| And I will bet you someone edits his blogs to not look like
| that.
| skriticos2 wrote:
| Seeing Anya (the girl pointing at pictures), I'd guess the
| author is partial to Japanese culture. As its writing system
| has no concept of upper/lower case, they might just have
| decided that case is superfluous. Or they are simply
| eccentric. Though I guess this is one of those things that
| some folks won't care about and others will get mightily hung
| up on.
|
| I personally don't really mind that bit of capitalization that
| English does. German is much worse.
| hovering_nox wrote:
| >I personally don't really mind that bit of capitalization
| that English does. German is much worse.
|
| You misspelled 'better'.
| Kuinox wrote:
| Their Twitter indicates Amsterdam; I just think they are an
| anime fan.
|
| And they are not alone.
|
| https://twitter.com/karpathy/status/1792261360430293176
| golergka wrote:
| d u xpct hbrw spkr twrt nnglsh lk ths?
| programjames wrote:
| I think you misspelled that slightly:
|
| > d' 'ou 'xp'ct h'br'w sp''k'rs t' wr't' 'n 'ngl'sh l'k'
| th's?
| nekochanwork wrote:
| Creative writing + Hyperfocused autistic obsession = The Anime
| Guide to Neural Networks and Large Language Models.
| TacticalCoder wrote:
| And why can't the author pass their text into an LLM and simply
| ask: _" plz fix frist word of each paragraf by using an
| uppercase letter k txh bye"_.
|
| A just question.
| adamrezich wrote:
| 2024 is the year that most of us are collectively growing out
| of the early social media era all-lowercase thing, but not
| everyone has gotten the memo yet.
| spencerchubb wrote:
| so more people comment on the hn post and it will rank higher
| in the algo
|
| such as your comment and my comment!
| bdangubic wrote:
| shift key busted
| efilife wrote:
| This comment is unsubstantial and provides no value. Why do you
| care about this?
| andy99 wrote:
| I don't want to be dismissive, it's a fun project, but this has
| been done a lot already - maybe not with llama3 but the
| architecture is basically the same as llama2's. Look at the big
| list of from-scratch implementations on Karpathy's llama2.c
| page.
|
| Is there something particularly different about this one?
|
| Edit - guess not?
| fifilura wrote:
| I think they learned a lot doing this? And they tried hard
| explaining each step!
| rvz wrote:
| Well, given the fast pace of AI, it should not be a surprise
| that this is similar to llama2, that we're seeing the (n + 1)th
| toy implementation, and that it likely has bugs or leaks
| lurking in the background.
|
| You might as well look at llama.cpp for a serious and
| production grade implementation to learn from. Otherwise,
| nothing to see here.
|
| > Is there something particularly different about this one?
|
| Other than the immature lowercase, anime BS, etc, then...
|
| No.
| tildef wrote:
| There's literally an image of Anya pointing at Karpathy on this
| GitHub page.
| fnetisma wrote:
| The iterative leaps by which open-source models keep getting
| better are strong evidence that companies competing at the
| LLM model layer have an ephemeral moat.
|
| Serious question: assuming this is true, if an incumbent-
| challenger like OpenAI wants to win, how do they effectively
| compete against current services such as Meta's and Google's
| product offerings, which can be AI-enhanced in a snap?
| cal85 wrote:
| Their moat atm is being 6 months ahead of everyone else on
| model quality. Plus the 'startup' advantage over their
| corporate competitors. Oh and they can hoard a lot of the best
| talent because it's an extremely high status place to work.
|
| Their task now is to maintain and exploit those advantages as
| best they can while they build up a more stable long term moat:
| lots of companies having their tech deeply integrated into
| their operations.
| andy99 wrote:
| Just to add, they don't have the baggage of google or Meta so
| they can do more without worrying how it impacts the rest of
| the company. And of the big players they seem the most aware
| of how important _good_ data is and have paid for lots of
| high quality curated fine tuning data in order to build a
| proper product instead of doing a research project. That
| mindset, and the commercial difference it makes, shouldn't be
| underestimated.
| 123yawaworht456 wrote:
| the very first big AI company who gives up trying to lobotomize
| and emasculate their models to align with the values of 0.01%
| of the world population will win a lot of hearts and minds
| overnight. the censorship necessary for corporate applications
| can be trivially implemented as a toggleable layer, using a
| small, efficient, specialist model to detect no-no words and
| wrongthink in inputs/outputs.
|
| gpt, claude, gemini, even llama and mistral, all tend to
| produce the same nauseating slop, easily-recognizable by anyone
| familiar with LLMs - these days, I cringe when I read 'It is
| important to remember' even when I see it in some ancient, pre-
| slop writings.
|
| creativity - one of the very few applications generative AI can
| truly excel at - is currently impossible. it could
| revolutionize entertainment, but it isn't allowed to. the
| models are only _allowed_ to produce inoffensive, positivity-
| biased, sterile slop that no human being finds attractive.
| andy99 wrote:
| > the censorship necessary for corporate applications can be
| trivially implemented as a toggleable layer, using a small,
| efficient, specialist model to detect no-no words and
| wrongthink in inputs/outputs.
|
| What's really funny is they all have "jailbreaks" that you
| can use to make them say anything anyway. So for "corporate"
| uses, the method you propose is already mandatory. The whole
| thing (censoring base models) is a misguided combination of
| ideology and (over the top) risk aversion.
| malfist wrote:
| Please explain what you mean when you say the 0.01% are
| emasculating AI
| mavhc wrote:
| They're suggesting that 99.99% of people don't mind if AI
| reflects biases of society. Which is weird because I'm
| pretty sure most people in the world aren't old white
| middle class Americans
| ben_w wrote:
| Indeed. If religion is a good guide, then I think around
| 24% think that pork is inherently unclean and not fit for
| human consumption under penalty of divine wrath, and 15%
| think that it's immoral to kill cattle for any reason.
| Also, non-religiously, I'd guess around 17% think "Zhongguo
| hen bang, zhiyou Tian'anmen Guangchang fasheng le haoshi"
| (roughly: "China is great; only good things ever happened in
| Tiananmen Square").
| 123yawaworht456 wrote:
| yes, yes, bias like the fact that the Wehrmacht was not the
| human menagerie that 0.01% of the population insist we
| live in.
|
| https://www.google.com/search?q=gemini+german+soldier
|
| prompt-injected mandatory diversity led to the most
| hilarious shit I've seen generative AI do so far.
|
| but, yes, of course, other instances of 'I reject your
| reality and substitute my own' - like depicting medieval
| Europe to be as diverse as vibrant American inner cities
| - those are fine.
| golergka wrote:
| They scare the government into regulating the field into
| oblivion.
| miki123211 wrote:
| If you like this, it's also worth looking at llama2.c[1], an
| implementation of the Llama 2 architecture in about 1000 lines of
| plain, dependency-free C, tokenizer and all. The fact that this
| 960-line file and a somewhat modern C compiler are all you
| really need to run a state-of-the-art language model is really
| surprising to many.
|
| Of course, this is not all there is to a modern LLM; it would
| probably take another thousand lines or two to implement
| training, and many more than that to make it fast on all the
| major CPU and GPU architectures. If you want a flexible
| framework that lets a developer define any model they want and
| still goes as fast as it can, the complexity spirals.
|
| Most programmers have an intuition that duplicating a large
| software project from scratch, like Linux or Chromium for
| example, would require incredible amounts of expertise, manpower
| and time. It's not something that a small team can achieve in a
| few months. You're limited by talent, not hardware.
|
| LLMs are very different. The code isn't _that_ complicated: you
| could probably implement training and inference for a single
| model architecture, from scratch, on a single kind of GPU, with
| reasonable performance, as an individual with a background in
| programming who still remembers their calculus and linear
| algebra, with a year or so of self-study. What makes LLMs
| difficult is getting access to all the hardware to train them,
| getting the data, and being able to preprocess that data.
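As a toy illustration of "the code isn't that complicated": a Llama-style pre-norm attention sub-block (RMSNorm, single-head causal attention, residual connection) fits in a few dozen lines of NumPy. All shapes and weights below are made up; a real implementation adds rotary embeddings, grouped multi-head attention, the feed-forward block, and a KV cache.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-5):
    # Llama normalizes each position by its RMS, then applies a learned gain
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps) * gain

def causal_attention(x, Wq, Wk, Wv, Wo):
    # single-head causal self-attention over a [T, d] sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)  # mask future tokens
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ Wo

rng = np.random.default_rng(0)
T, d = 4, 16  # toy sequence length and model width
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
y = x + causal_attention(rms_norm(x, np.ones(d)), Wq, Wk, Wv, Wo)  # residual
```

The causal mask is what makes position 0's output independent of every later token, which is also what makes autoregressive KV caching possible.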
| evanjrowley wrote:
| Links for llama2.c:
|
| https://github.com/karpathy/llama2.c
|
| https://news.ycombinator.com/item?id=36838051
| Fubarberry wrote:
| There's also a project where they have GPT-2 running off of an
| excel spreadsheet.
|
| https://arstechnica.com/information-technology/2024/03/once-...
| _giorgio_ wrote:
| I wanted to try the repo by Karpathy, but I still don't want to
| learn C (llama2.c is probably his only C repo), so thanks for
| posting this.
___________________________________________________________________
(page generated 2024-05-19 23:00 UTC)