[HN Gopher] Llama3 implemented from scratch
       ___________________________________________________________________
        
       Llama3 implemented from scratch
        
       Author : Hadi7546
       Score  : 924 points
       Date   : 2024-05-19 18:42 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lakshyaag wrote:
       | Awesome, gonna go through!
        
       | digitaltrees wrote:
       | Are you the repo author or reposting something cool? I am curious
       | because I want to talk to the repo author about a collaboration
       | project.
        
         | magoghm wrote:
         | You might be able to reach the repo author on X:
         | https://x.com/naklecha
        
       | brcmthrowaway wrote:
        | Wait, are you saying SoTA NN research hasn't evolved from
        | hardcoding a bunch of layer structures and sizes?
       | 
       | I'm kind of shocked. I thought there would be more dynamism by
       | now and I stopped dabbling in like 2018.
        
         | astrange wrote:
         | The innovation is that everything is just one standardized
         | structure now (transformer models) and you make it bigger if
         | you feel like you need that.
         | 
         | There's still some room for experimenting if you care about
         | memory/power efficiency, like MoE models, but they're not as
         | well understood yet.
        
           | aDyslecticCrow wrote:
           | There are too many papers throwing transformers on everything
           | without thinking. Transformers are amazing for language but
           | kinda mid on everything else. CS researchers tend to jump on
           | trends really hard, so it will probably go back to normal
           | again soon.
        
             | imtringued wrote:
             | I don't know what you mean by amazing for language. Almost
             | everything is built on transformers nowadays. Image
             | segmentation uses transformers. Text to speech uses
             | transformers. Voice recognition uses transformers. There
             | are robotics transformers that take image inputs and output
             | motion sequences. Transformers are inherently multi-modal.
             | They handle whatever you throw at them, it's just that
             | language tends to be a very common input or output.
        
             | Hugsun wrote:
             | That is not true. Transformers are being applied all over
             | because they work better than what was used before in so
             | many cases.
        
         | pshc wrote:
         | My wild guess is that adjusting the shape before each step is
         | not worth the speed hit. Uniform structures make GPUs go brrrrr
        
           | astrange wrote:
           | It's also easier to train and in particular easier to
           | parallelize.
        
         | delusional wrote:
         | The innovation is the amount of resources people are willing to
         | spend right now. From looking at the research code it's clear
         | that the whole field is basically doing a (somewhat) guided
         | search in the entire space of possible layer permutations.
         | 
         | There seems to be no rhyme or reason, no scientific insight, no
         | analysis. They just try a million different permutations, and
         | whatever scores the highest on the benchmarks gets published.
        
           | moffkalast wrote:
           | Well it took evolution 4 billion years of testing out random
           | permutations that resulted in a pretty good local maximum, so
           | there is hope for us yet.
        
             | WanderPanda wrote:
              | "I'm a pretty good local maximum" - that is what any
              | local maximum would tell you if asked how it likes itself
        
               | moffkalast wrote:
               | "The brain is the most important part of the body", the
               | brain said.
        
               | psychoslave wrote:
                | Note that not all brains are so severely damaged by
                | this illusion. Most of them actually grasp pretty
                | clearly that they are next to useless without their
                | organic, social and environmental companions.
        
           | killerstorm wrote:
           | There's definitely scientific insight and analysis.
           | 
           | E.g. "In-context Learning and Induction Heads" is an
           | excellent paper.
           | 
           | Another paper ("ROME") https://arxiv.org/abs/2202.05262
           | formulates hypothesis over how these models store
           | information, and provide experimental evidence.
           | 
           | The thing is, a 3-layer MLP is basically an associative
           | memory + a bit of compute. People understand that if you
           | stack enough of them you can compute or memorize pretty much
           | anything.
           | 
           | Attention provides information routing. Again, that is pretty
           | well-understood.
           | 
            | The rest is basically finding an optimal trade-off. These
            | trade-offs are based on insights drawn from experimental
            | data.
           | 
           | So this architecture is not so much accidental as it is
           | general.
           | 
            | Specific representations used by MLPs are poorly
            | understood, but there's definitely progress on
            | understanding them from first principles by building
            | specialized models.
        
             | inciampati wrote:
              | One 3-layer (1 hidden layer) neural network can already
              | approximate any continuous function. You don't even need
              | to stack them.
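              | 
              | (A toy demonstration of that claim, assuming numpy: one
              | hidden tanh layer with random frozen weights, output
              | layer solved in closed form by least squares, fitting
              | sin(x). A sketch; the sizes are arbitrary:)
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     x = np.linspace(-np.pi, np.pi, 500)[:, None]
              |     y = np.sin(x)
              | 
              |     # Single hidden layer: 200 random tanh features.
              |     W, b = rng.normal(size=(1, 200)), rng.normal(size=200)
              |     H = np.tanh(x @ W + b)
              | 
              |     # Output weights via least squares.
              |     w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
              |     print(np.abs(H @ w_out - y).max())  # tiny error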
        
         | curious_cat_163 wrote:
         | There is a tick-tock between searching the dominant NN
         | architectures (tick) and optimizing for accuracy, compute and
         | inference latency and throughput (tock).
         | 
         | This particular (tock) is still playing out. The next (tick)
         | does not feel imminent and will likely depend on when we
         | discover the limits of the transformers when it comes to
         | solving for long tail of use-cases.
         | 
         | My $0.02.
        
           | rdedev wrote:
            | My wish is that they would move on to the next phase. The
            | whole deal with SSMs looks really good. But the search for
            | better architectures is countered with "a regular
            | architecture with more parameters does better, so what's
            | the point of this?"
        
             | tysam_and wrote:
             | Heyo! Have been doing this for a while. SSMs certainly are
             | flashy (most popular topics-of-the-year are), and it would
             | be nice to see if they hit a point of competitive
             | performance with transformers (and if they stand the test
             | of time!)
             | 
              | There are certainly tradeoffs to both; the general
              | transformer motif scales very well on a number of axes, so
             | that may be the dominant algorithm for a while to come,
             | though almost certainly it will change and evolve as time
             | goes along (and who knows? something else may come along as
             | well <3 :')))) ).
        
             | throwawaymaths wrote:
             | There's something about a transformer being at its core
             | based on a differentiable hash table data structure that
             | makes them special.
             | 
              | I think its dominance is not going to substantially
              | change any time soon. Don't you know, the solution to
              | all leetcode interviews is a hash table?
        
             | curious_cat_163 wrote:
             | IMO, SSMs are an optimization. They don't represent enough
             | of a fundamental departure from the kinds of things
             | Transformers can _do_. So, while I like the idea of saving
             | on the energy costs, I speculate that such saving can be
             | obtained with other optimizations while staying with
             | transformer blocks. Hence, the motivation to change is a
              | bit of an uphill battle here. I would love to hear
              | counter-arguments to this view. :)
             | 
             | Furthermore, I think a replacement will require that we
             | _understand_ what the current crop of models are doing
             | mechanically. Some of it was motivated in [1].
             | 
             | [1] https://openaipublic.blob.core.windows.net/neuron-
             | explainer/...
        
               | inciampati wrote:
                | Quadratic vs linear is not an optimization. It's a
                | completely new game. With selective SSMs (Mamba) the
                | win is that the recurrence can be trained in parallel
                | via a log-depth associative scan. So you go from
                | something quadratic wrt input sequence length to
                | something logarithmic in parallel depth. If that's
                | just an optimization, it's a huge one.
        
               | curious_cat_163 wrote:
               | Okay. Respect your point of view. I am curious, what
               | applications do you think SSMs enable that a Transformer
               | cannot? I have always seen it as a drop-in replacement
               | (like for like) but maybe there is more to it.
               | 
               | Personally, I think going linear instead of quadratic for
               | a core operation that a system needs to do is by
               | definition an optimization.
        
             | smel wrote:
              | The solution to AGI is not deep learning. Maybe with
              | more compute and a shitload of engineering it can yield
              | a kind of baby AGI.
              | 
              | My bet will be on something other than gradient descent
              | and backprop, but really I don't wish for any company or
              | country to reach AGI or any sophisticated AI...
        
               | inciampati wrote:
               | Magical thinking. Nature uses gradient descent to evolve
               | all of us and our companions on this planet. If something
               | better were out there, we would see it at work in the
               | natural world.
        
               | psychoslave wrote:
                | Maybe it's there, but in an ethereal form that is
                | ungraspable by mere conscious forms such as ourselves?
                | :P
        
               | mopierotti wrote:
               | Are you also saying that thoughts are formed using
               | gradient descent? I don't think gradient descent is an
               | accurate way to describe either process in nature. Also,
               | we don't know that we "see" everything that is happening,
               | we don't even understand the brain yet.
        
           | imtringued wrote:
           | You have to consider that there are still some low hanging
           | fruit that let you improve prompt processing (not token
           | generation) performance by an order of magnitude or even two,
           | but there are no takers. The reason is quite simple. You can
           | just buy more GPUs and forget about the optimizations.
           | 
           | If a 100x improvement in performance is left on the table,
           | then surely even lower priority optimizations won't be
           | implemented any time soon.
           | 
            | Consider this: a lot of clever attention optimizations
            | rely on some initial pass to narrow down the important
            | tokens and discard the rest from the KV cache. If this
            | were actually possible, then how come the first few layers
            | of the LLM don't already do this numerically to focus
            | their attention? Here is the shocker: they already do, but
            | since you're passing the full 8k context to the next layer
            | anyway, you're wasting it on mostly... nothing.
           | 
           | I repeat: Does the 80th layer really need the ability to
           | perform attention over all the previous 8k outputs of the
           | 79th layer? The first layer? Definitely. The last? No. What
           | happens if you only perform attention over 10% of the outputs
           | of layer 79? What speedup does this give you?
           | 
            | Notice how the model has already learned the optimal
            | attention scheme. You just need to give it less stuff to
            | do and it will get faster automatically.
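            | 
            | (A toy numpy sketch of this idea - score each cached token
            | by the attention mass it received in an earlier layer and
            | keep only the top 10% for later layers. An illustration,
            | not an existing implementation:)
            | 
            |     import numpy as np
            | 
            |     def prune_kv(K, V, attn, keep_frac=0.1):
            |         # attn: (n_queries, n_keys) softmax weights from an
            |         # earlier layer. Total mass per cached token:
            |         mass = attn.sum(axis=0)
            |         k = max(1, int(len(mass) * keep_frac))
            |         keep = np.argsort(mass)[-k:]  # top-scoring tokens
            |         return K[keep], V[keep]       # shrunken KV cache
            | 
            |     rng = np.random.default_rng(0)
            |     K = rng.normal(size=(8192, 64))
            |     V = rng.normal(size=(8192, 64))
            |     attn = rng.dirichlet(np.ones(8192), size=32)
            |     K2, V2 = prune_kv(K, V, attn)
            |     print(K2.shape)  # (819, 64): ~10% of the tokens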
        
             | miven wrote:
             | I don't get your point, how is what you're suggesting here
             | different from a few papers we already have on KV cache
             | pruning methods like [1]?
             | 
             | [1] https://arxiv.org/abs/2305.15805
        
           | NoobSaibot135 wrote:
            | I like your analogy of a tick-tock ~= an epoch of
            | progress.
            | 
            | Step change, then optimization of that step change.
            | 
            | Kind of like a grandfather clock with a huge pendulum
            | swinging to one side, then the other (a commonly used
            | metaphor).
        
             | treyd wrote:
             | It's a metaphor that's been used with the advancement of
             | CPU designs at least as far back as the 80s or 90s. Intel
             | uses it explicitly in their marketing nowadays, I believe.
        
             | auspiv wrote:
             | Intel has been doing "tick-tock" for almost 20 years -
             | https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model
        
         | dauertewigkeit wrote:
          | There are things like NAS (neural architecture search), but all
         | you are doing is just growing the search space and making the
         | optimization problem much harder. Typically you do the
         | architectural optimization by hand, using heuristics and past
         | experiments as guidance.
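          | 
          | (A toy illustration of NAS-as-search, assuming Python: a
          | random search over (depth, width) pairs. The scoring
          | function is a hypothetical stand-in for a real training
          | run, and the space here is tiny compared to real NAS:)
          | 
          |     import random
          | 
          |     def validation_loss(depth, width):  # hypothetical stand-in
          |         return (depth - 12) ** 2 + (width - 768) ** 2 / 1000
          | 
          |     # The search space grows combinatorially - exactly the
          |     # harder optimization problem described above.
          |     space = [(d, w) for d in range(2, 25)
          |                     for w in (256, 512, 768, 1024)]
          |     trials = random.sample(space, 20)
          |     print(min(trials, key=lambda a: validation_loss(*a)))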
        
         | Mehdi2277 wrote:
          | I've occasionally worked with more dynamic models (tree-
          | structured decoding). They are generally not a good fit for
          | trying to max GPU throughput. A lot of the magic of
          | transformers and large language models is about pushing the
          | GPU as hard as we can, and a simpler static model
          | architecture that trains faster can train on much more
          | data.
          | 
          | So until the hardware allows for comparable (say within
          | 2-4x) throughput of samples per second, I expect model
          | architectures to mostly be static for most effective models
          | and dynamic architectures to be an interesting side area.
        
         | aDyslecticCrow wrote:
         | The only thing that has changed since 2018 is the most popular
         | network structure to play with. The code looks the same as
         | always; python notebooks where someone manually calculated the
         | size of each hard-coded layer to make it fit.
        
           | galaxyLogic wrote:
           | > someone manually calculated the size of each hard-coded
           | layer
           | 
            | I wonder, shouldn't AI be the best tool to optimize itself?
        
             | octonion137 wrote:
             | In theory yes, but unfortunately AI hasn't been invented
             | yet
        
               | psychoslave wrote:
                | I don't know, wouldn't the AI then be stuck
                | evaluating all possible AI implementations? And since
                | it would face the halting problem, it couldn't single
                | out the very best one, though it could probably return
                | the best one reachable by exhaustive search within a
                | capped amount of resources. That wouldn't necessarily
                | be better than what human beings could provide given
                | an equivalent amount of resources.
        
         | danielmarkbruce wrote:
         | People would love to have dynamism. It's a cost thing.
        
       | revskill wrote:
       | Genius.
        
       | hovering_nox wrote:
       | Why can the author only write in all lowercase?
        
         | ronsor wrote:
         | Sam Altman does it too
        
         | Pr0ject217 wrote:
         | It's the cool thing to do now...
        
           | lelandfe wrote:
           | The treatment of the English language on TikTok is giving the
           | late Yahoo Answers a run for its money.
        
           | mr_toad wrote:
           | That makes me laugh. I remember when it was the cool thing to
           | do on Usenet.
        
         | tredre3 wrote:
          | At least they use punctuation. We recently had a project on
          | HN where the author used only lowercase and no punctuation
          | because they equated them to being chained by the system.
        
           | groovy2shoes wrote:
           | rip cormac mccarthy
        
             | _giorgio_ wrote:
             | It's your problem only.
        
           | programjames wrote:
           | The fight against capitalism spares no letter.
        
         | baobabKoodaa wrote:
         | do you wanna be cool or not?
        
         | teaearlgraycold wrote:
         | Too poor to fix their shift key
        
           | InfiniteVortex wrote:
           | this is the answer lol
        
           | sva_ wrote:
           | You got two of them
        
         | Retr0id wrote:
         | because it annoys HN commenters
        
         | renegade-otter wrote:
         | Because Sam Altman does it and he is rich, so...
        
           | bossyTeacher wrote:
           | Where? His blog looks normal
        
             | renegade-otter wrote:
             | Just look at his Twitter: https://x.com/sama
             | 
             | And no, Twitter is no excuse to type like an illiterate
             | teenager.
             | 
             | And I will bet you someone edits his blogs to not look like
             | that.
        
         | skriticos2 wrote:
          | Seeing Anya (the girl pointing at pictures), I'd guess the
          | author is partial to Japanese culture. As their writing
          | system does not have a concept of upper/lower case, he
          | might just have decided that it is superfluous. Or he is
          | simply an eccentric. Though I guess this is one of those
          | things that some folks won't care about and others will get
          | mightily hung up on.
         | 
         | I personally don't really mind that bit of capitalization that
         | English does. German is much worse.
        
           | hovering_nox wrote:
           | >I personally don't really mind that bit of capitalization
           | that English does. German is much worse.
           | 
           | You misspelled 'better'.
        
           | Kuinox wrote:
            | Their Twitter indicates Amsterdam; I just think they are
            | an anime fan.
           | 
           | And they are not alone.
           | 
           | https://twitter.com/karpathy/status/1792261360430293176
        
           | golergka wrote:
           | d u xpct hbrw spkr twrt nnglsh lk ths?
        
             | programjames wrote:
              | I think you misspelled that slightly:
             | 
             | > d' 'ou 'xp'ct h'br'w sp''k'rs t' wr't' 'n 'ngl'sh l'k'
             | th's?
        
             | xdennis wrote:
             | Not quite the same. Capitalization doesn't add much to
             | languages written with the Latin alphabet. THE ROMANS ONLY
             | VVROTE VVITH CAPITAL LETTERS.
             | 
             | But the Greeks added vowels to the alphabet because Indo-
             | European languages rely a lot on vowels (as opposed to
             | Semitic languages which are easy to understand without
             | vowels).
        
           | saintradon wrote:
           | It's to drive engagement by getting people to comment on it.
        
           | sva_ wrote:
           | I remember back in the IRC days many people wrote all
           | lowercase. Seems like smartphone keyboards, which
           | autocapitalize, have changed that trend.
        
         | nekochanwork wrote:
         | Creative writing + Hyperfocused autistic obsession = The Anime
         | Guide to Neural Networks and Large Language Models.
        
         | TacticalCoder wrote:
          | And why can't the author pass their text into an LLM and
          | simply ask: _" plz fix frist word of each paragraf by using
          | an uppercase letter k txh bye"_.
         | 
         | A just question.
        
         | adamrezich wrote:
          | 2024 is the year that most of us are collectively growing
          | out of the early-social-media-era all-lowercase thing, but
          | not everyone has gotten the memo yet.
        
         | spencerchubb wrote:
         | so more people comment on the hn post and it will rank higher
         | in the algo
         | 
         | such as your comment and my comment!
        
         | bdangubic wrote:
         | shift key busted
        
         | efilife wrote:
          | This comment is insubstantial and provides no value. Why do
          | you care about this?
        
         | jpamata wrote:
          | The author is probably young; that's how gen-z are these
          | days. If they don't have autocorrect on, the whole text
          | will be in lowercase.
          | 
          | Also it looks more casual and authentic, less LLM-generated.
        
         | jongorer wrote:
         | the nitpicking in this thread is incredible lmao
        
         | cocochanel wrote:
         | He probably thinks it's cool. Common on Twitter these days.
        
         | kelahcim wrote:
         | this comment made me go back to the project page. i haven't
         | even noticed that fact while reading it for the first time.
         | strange.
        
       | andy99 wrote:
       | I don't want to be dismissive, it's a fun project, but this has
       | been done a lot already - maybe not with llama3 but the
        | architecture is basically the same as llama2. Look at the big
        | list of from-scratch implementations on Karpathy's llama2.c
        | page.
       | 
       | Is there something particularly different about this one?
       | 
       | Edit - guess not?
        
         | fifilura wrote:
         | I think they learned a lot doing this? And they tried hard
         | explaining each step!
        
         | rvz wrote:
          | Well, given the fast pace of AI, it should not be a surprise
          | that this is similar to llama2 and that we're seeing the
          | (n + 1)th toy implementation, which likely has bugs or leaks
          | in the background.
         | 
         | You might as well look at llama.cpp for a serious and
         | production grade implementation to learn from. Otherwise,
         | nothing to see here.
         | 
         | > Is there something particularly different about this one?
         | 
         | Other than the immature lowercase, anime BS, etc, then...
         | 
         | No.
        
         | tildef wrote:
         | There's literally an image of Anya pointing at Karpathy on this
         | GitHub page.
        
         | _giorgio_ wrote:
          | What are your favourite implementations of a GPT? I like
          | the video series by Karpathy a lot.
         | 
         | Anyway, I'll take a look at this too, not sure if it has
         | inference and training. Having just inference would be a
         | disappointment.
        
       | verbalstone wrote:
       | I'm sorry but this is absolutely unreadable.
        
       | fnetisma wrote:
        | Iterative leaps in open-source model quality are strong
        | evidence that companies competing at the LLM model layer have
        | an ephemeral moat.
        | 
        | Serious question: assuming this is true, if an incumbent-
        | challenger like OpenAI wants to win, how do they effectively
        | compete against current services such as Meta's and Google's
        | product offerings, which can be AI-enhanced in a snap?
        
         | cal85 wrote:
         | Their moat atm is being 6 months ahead of everyone else on
         | model quality. Plus the 'startup' advantage over their
         | corporate competitors. Oh and they can hoard a lot of the best
         | talent because it's an extremely high status place to work.
         | 
         | Their task now is to maintain and exploit those advantages as
         | best they can while they build up a more stable long term moat:
         | lots of companies having their tech deeply integrated into
         | their operations.
        
           | andy99 wrote:
           | Just to add, they don't have the baggage of google or Meta so
           | they can do more without worrying how it impacts the rest of
           | the company. And of the big players they seem the most aware
           | of how important _good_ data is and have paid for lots of
           | high quality curated fine tuning data in order to build a
            | proper product instead of doing a research project. That
            | mindset and the commercial difference it makes shouldn't
            | be underestimated.
        
           | myko wrote:
           | > Their moat atm is being 6 months ahead of everyone else on
           | model quality
           | 
           | Really? Most of our testing now has Gemini Pro on par or
           | better (though we haven't tested omni/Ultra)
           | 
           | It really seems like the major models have all topped out /
           | are comparable
        
         | 123yawaworht456 wrote:
         | the very first big AI company who gives up trying to lobotomize
         | and emasculate their models to align with the values of 0.01%
         | of the world population will win a lot of hearts and minds
         | overnight. the censorship necessary for corporate applications
         | can be trivially implemented as a toggleable layer, using a
         | small, efficient, specialist model to detect no-no words and
         | wrongthink in inputs/outputs.
         | 
         | gpt, claude, gemini, even llama and mistral, all tend to
         | produce the same nauseating slop, easily-recognizable by anyone
         | familiar with LLMs - these days, I cringe when I read 'It is
         | important to remember' even when I see it in some ancient, pre-
         | slop writings.
         | 
         | creativity - one of the very few applications generative AI can
         | truly excel at - is currently impossible. it could
         | revolutionize entertainment, but it isn't allowed to. the
         | models are only _allowed_ to produce inoffensive, positivity-
         | biased, sterile slop that no human being finds attractive.
        
           | andy99 wrote:
           | > the censorship necessary for corporate applications can be
           | trivially implemented as a toggleable layer, using a small,
           | efficient, specialist model to detect no-no words and
           | wrongthink in inputs/outputs.
           | 
           | What's really funny is they all have "jailbreaks" that you
            | can use to make them say anything anyway. So for "corporate"
           | uses, the method you propose is already mandatory. The whole
           | thing (censoring base models) is a misguided combination of
           | ideology and (over the top) risk aversion.
        
           | malfist wrote:
           | Please explain what you mean when you say the 0.01% are
           | emasculating AI
        
             | mavhc wrote:
             | They're suggesting that 99.99% of people don't mind if AI
             | reflects biases of society. Which is weird because I'm
             | pretty sure most people in the world aren't old white
             | middle class Americans
        
               | ben_w wrote:
               | Indeed. If religion is a good guide, then I think around
               | 24% think that pork is inherently unclean and not fit for
               | human consumption under penalty of divine wrath, and 15%
               | think that it's immoral to kill cattle for any reason.
                | Also, non-religiously, I'd guess around 17% think
                | "China is great; only good things ever happened at
                | Tiananmen Square".
        
               | 123yawaworht456 wrote:
                | yes, yes, bias like the fact that the Wehrmacht was
                | not the human menagerie that 0.01% of the population
                | insists we live in.
               | 
               | https://www.google.com/search?q=gemini+german+soldier
               | 
               | prompt-injected mandatory diversity has led to the most
               | hilarious shit I've seen generative AI do so far.
               | 
               | but, yes, of course, other instances of 'I reject your
               | reality and substitute my own' - like depicting medieval
               | Europe to be as diverse, vibrant and culturally enriched
               | as American inner cities - those are doubleplusgood.
        
               | mavhc wrote:
               | A study of a Black Death cemetery in London found that
               | 20% of people sampled were not white
        
               | AnthonyMouse wrote:
               | London has been a center of international trade for
               | centuries. It would have been a much more diverse city
               | than Europe as a whole, and even that is assuming the
               | decedents were local residents and not the dead from
               | ships that docked in the city.
        
               | mavhc wrote:
               | 10th century Spain was Muslim
        
               | AnthonyMouse wrote:
               | A Spanish Muslim looks like a Spanish person in Muslim
               | attire rather than a Japanese person in European attire.
               | Also, Spain is next to Africa, but the thing is
               | generating black Vikings etc.
        
               | somenameforme wrote:
               | Modern chatbots are trained on a large corpus of all
               | textual information available across the entire world,
               | which obviously is reflective of a vast array of views
               | and values. Your comment is a _perfect_ example of the
               | sort of casual and socially encouraged soft bigotry that
               | many want to get away from. Instead of trying to spin
               | information this way or that, simply let the information
               | be, warts and all.
               | 
               | Imagine if search engines adopted this same sort of moral
               | totalitarian mindset and if you happened to search for
               | the 'wrong' thing, the engine would instead start
               | offering you a patronizing and blathering lecture, and
               | refuse to search. And 'wrong' in this case would be an
               | ever-encroaching window on anything that happened to run
               | contrary to the biases of the small handful of people
               | engaged, on a directorial level, with developing said
               | search engines.
        
               | mavhc wrote:
               | Encoding our current biases into LLMs is one way to go,
               | but there's probably a better way to do it.
               | 
               | Your leap to "thou shalt not search this" is missing the
               | possible middle ground
        
               | fragmede wrote:
               | Search for "I do coke" on Google. At least in the US, the
               | first result is not a link to the YouTube video of the
               | song by Kill the Noise and Feed Me, but the text "Help is
               | available, Speak with someone today", with a link to the
               | SAMHSA website and hotline.
        
               | andoando wrote:
               | Yes and the safeguards are put in place by a very small
               | group of people living in silicon valley.
               | 
                | I saw this issue working at Tinder too. One day, at
                | the height of the BLM movement, they announced they
                | would be removing ethnicity filters across all the
                | apps to weed out racists. Never mind that many ethnic
                | minorities prefer or even insist on dating within
                | their own ethnicity, so this was most likely hurting
                | them and not racists.
                | 
                | That really pissed me off and opened my eyes to how
                | much power these corporations have over dictating
                | culture, according not just to their own cultural
                | biases but to those of money.
        
           | otterley wrote:
           | I think you have your populations reversed. The number of
           | people who get their knickers in a twist over LLMs reflecting
           | certain cultural biases (and sometimes making foolish
           | predictions in the process) amounts to a rounding error.
        
             | 123yawaworht456 wrote:
             | I'm not talking about twisted panties, I'm talking about
             | their inability to generate anything but soulless slop, due
             | to blatantly obvious '''safeguards''' present in all big
              | models, making them averse to even PG13-friendly themes
              | and incapable of generating content palatable even to
              | the least discerning consoomers. you couldn't generate
              | even sterile crap like a script for capeshit or a
              | Netflix series,
             | because the characters would quickly forget their
             | differences and talk about their _bonds_ , _journeys_ ,
             | _boundaries_ and _connections_ instead.
             | 
             | without those '''safeguards''' implemented to appease the
             | aforementioned 0.01%, things could be very different - some
             | big models, particularly Claude, _can_ be tard wrangled
             | into producing decent prose, if you prefill the prompt with
             | a few thousand token jailbreak. my own attempts to get
             | various LLMs to assist in writing videogame dialogue only
             | made me angry and bitter - big models often give me
             | refusals on the very first attempt to prompt them, spotting
             | some wrongthink in the context I provide for the dialogue,
              | despite the only adult themes present being mild, not
              | particularly graphic violence that nobody except 0.01%
              | neo-puritan extremists would really bat an eye at. and
              | even if the model can be jailbroken, still, the output
              | is slop.
        
           | cosmojg wrote:
           | > creativity - one of the very few applications generative AI
           | can truly excel at - is currently impossible. it could
           | revolutionize entertainment, but it isn't allowed to. the
           | models are only _allowed_ to produce inoffensive, positivity-
           | biased, sterile slop that no human being finds attractive.
           | 
           | Have you played around with base models? If you haven't yet,
           | I'm sure you'll be happy to find that most base models are
           | delightfully unslopped and uncensored.
           | 
           | I highly recommend trying a base model like davinci-002[1] in
           | OpenAI's "legacy" Completions API playground. That's probably
           | the most accessible, but if you're technically inclined, you
           | can pair a base model like Llama3-70B[2] with an interface
           | like Mikupad[3] and do some brilliant creative writing.
           | Llama3 models can be run locally with something like
           | Ollama[4], or if you don't have the compute for it, via an
           | LLM-as-a-service platform like OpenRouter[5].
           | 
           | [1] https://platform.openai.com/docs/models/gpt-base
           | 
           | [2] https://huggingface.co/meta-llama/Meta-Llama-3-70B
           | 
           | [3] https://github.com/lmg-anon/mikupad
           | 
           | [4] https://ollama.com/library/llama3:70b-text
           | 
           | [5] https://openrouter.ai/models/meta-llama/llama-3-70b
        
             | acka wrote:
              | From [2]:
             | 
             | > Further, in developing these models, we took great care
             | to optimize helpfulness and safety.
             | 
              | The model you linked to isn't a base model (those are
              | rarely if ever made available to the general public
              | nowadays); it is already fine-tuned at least for
              | instruction following, and most likely what some in
              | this game would call 'censored'. That isn't to say
              | 'uncensored' models couldn't be made based on this in
              | the future, by doing, you guessed it, moar fine-tuning.
        
           | AnthonyMouse wrote:
           | > gpt, claude, gemini, even llama and mistral, all tend to
           | produce the same nauseating slop, easily-recognizable by
           | anyone familiar with LLMs
           | 
           | Does grok do this, given where it came out of?
        
           | Hugsun wrote:
           | I think you vastly overestimate how much people care about
           | model censorship. There are a bunch of open models that
           | aren't censored. Llama 3 is still way more popular because
           | it's just smarter.
        
         | golergka wrote:
         | They scare the government into regulating the field into
         | oblivion.
        
       | miki123211 wrote:
       | If you like this, it's also worth looking at llama2.c[1], an
       | implementation of the Llama 2 architecture in about 1000 lines of
        | plain, dependency-free C, tokenizer and all. The fact that this
       | 960-line file and a somewhat modern C compiler is all you really
       | need to run a state-of-the-art language model is really
       | surprising to many.
       | 
       | Of course, this is not all there is to a modern LLM, it would
       | probably take another thousand lines or two to implement
       | training, and many more than that to make it fast on all the
       | major CPU and GPU architectures. If you want a flexible framework
       | that lets a developer define any model you want and still goes as
       | fast as it can, the complexity spirals.
       | 
       | Most programmers have an intuition that duplicating a large
       | software project from scratch, like Linux or Chromium for
       | example, would require incredible amounts of expertise, manpower
       | and time. It's not something that a small team can achieve in a
       | few months. You're limited by talent, not hardware.
       | 
        | LLMs are very different. The code isn't _that_ complicated, you
       | could probably implement training and inference for a single
       | model architecture, from scratch, on a single kind of GPU, with
       | reasonable performance, as an individual with a background in
       | programming and who still remembers their calculus and linear
       | algebra, with a year or so of self study. What makes LLMs
       | difficult is getting access to all the hardware to train them,
       | getting the data, and being able to preprocess that data.
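        | 
        | (To give a feel for the scale: a rough numpy sketch of one
        | pre-norm transformer block in the llama style, eliding
        | multi-head reshaping and RoPE - a sketch, not llama2.c's
        | actual code:)
        | 
        |     import numpy as np
        | 
        |     def rmsnorm(x, g):
        |         return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5) * g
        | 
        |     def softmax(x):
        |         e = np.exp(x - x.max(-1, keepdims=True))
        |         return e / e.sum(-1, keepdims=True)
        | 
        |     def block(x, p):  # p: dict of weight matrices
        |         # Attention sublayer with a residual connection.
        |         h = rmsnorm(x, p["g1"])
        |         q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
        |         mask = np.triu(np.full((len(x), len(x)), -np.inf), k=1)
        |         att = softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask)
        |         x = x + att @ v @ p["wo"]
        |         # SwiGLU-style feed-forward sublayer, also residual.
        |         h = rmsnorm(x, p["g2"])
        |         silu = lambda z: z / (1 + np.exp(-z))
        |         return x + (silu(h @ p["w1"]) * (h @ p["w3"])) @ p["w2"]
        | 
        |     rng = np.random.default_rng(0); d = 64
        |     shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d),
        |               "wo": (d, d), "w1": (d, 4*d), "w3": (d, 4*d),
        |               "w2": (4*d, d)}
        |     p = {k: 0.02 * rng.normal(size=s) for k, s in shapes.items()}
        |     p["g1"] = p["g2"] = np.ones(d)
        |     print(block(rng.normal(size=(10, d)), p).shape)  # (10, 64)
        | 
        | Load real weights from a checkpoint, stack a few dozen of
        | these between an embedding lookup and an output projection,
        | and that's essentially the forward pass.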
        
         | evanjrowley wrote:
         | Links for llama2.c:
         | 
         | https://github.com/karpathy/llama2.c
         | 
         | https://news.ycombinator.com/item?id=36838051
        
         | Fubarberry wrote:
         | There's also a project where they have GPT-2 running off of an
         | excel spreadsheet.
         | 
         | https://arstechnica.com/information-technology/2024/03/once-...
        
         | andy99 wrote:
          | And if you want to understand it, I'd recommend this post
          | (GPT-2 in 60 lines of numpy) and the post on attention it
          | links to. The
         | concepts are mostly identical to llama, just with a few minor
         | architectural tweaks. https://jaykmody.com/blog/gpt-from-
         | scratch/
        
           | bhavesh2712 wrote:
           | Thanks for sharing this!
        
         | bradfox2 wrote:
         | I feel like this ignores the complexity of the distributed
         | training frameworks. The challenge is in making it fast at
         | scale.
        
         | nicklecompte wrote:
         | One other thing to add is large-scale RLHF. Big Tech can pay
         | literally hundreds of technically-sophisticated people
         | throughout the world (e.g. college grads in developing
         | countries) to improve LLM performance on all sorts of specific
         | problems. It is not a viable way to get AGI, but it means your
         | LLM can learn tons of useful tricks that real people might
         | want, and helps avoid embarrassing "mix broken glass into your
         | baby formula" mistakes. (Obviously it is not foolproof.)
         | 
         | I suspect GPT-4's "secret sauce" in terms of edging out
         | competitors is that OpenAI is better about managing data
         | contractors than the other folks. Of course it's a haze of NDAs
         | to learn specifics, and clearly the contractors are severely
         | underpaid compared to OpenAI employees/executives. But a lone
         | genius with a platinum credit card can't create a new world-
         | class LLM without help from others.
        
           | stephc_int13 wrote:
           | Yes, this is the secret sauce and the moat. Not as easy as
           | buying more compute with unlimited budget.
           | 
           | ... built on the back of a disposable workforce...
           | 
           | There is something grim and dystopian, thinking about the
           | countless small hands feeding the machine.
        
             | factormeta wrote:
             | >There is something grim and dystopian, thinking about the
             | countless small hands feeding the machine.
             | 
              | Dystopian indeed; this is pretty much how the Manhattan
              | Project and CERN were done, with many independent
              | contractors doing different parts, and only a few having
              | the overview. A page out of the corporate management
              | book: it very much allows concentration of power in the
              | hands of a few.
        
               | pagekicker wrote:
               | Very generous to compare to Manhattan Project or CERN.
        
               | fragmede wrote:
               | don't buy into the hype, but when Facebook has spent
               | around as much on GPUs as the Manhattan project (but not
               | the Apollo program), the comparison kinda makes itself.
               | 
               | https://twitter.com/emollick/status/1786213463456448900
               | 
                | $22B in 2008 -> $33B today https://data.bls.gov/cgi-
                | bin/cpicalc.pl?cost1=22&year1=20080...
        
               | ladzoppelin wrote:
                | I read this last week and it's terrifying. If the
                | world lets Facebook become an AI leader, it's on us,
                | as we all know how that story will play out.
        
               | thelittleone wrote:
               | We must summon a fellowship of the AI ring with one
               | hobbit capable of withstanding the corrupting allure of
               | it all.
        
               | kreeben wrote:
               | Don't torment the hobbits! Send the eagles right away!
        
               | nicklecompte wrote:
               | The Big Dig (Boston highway overhaul) cost $22bn in 2024
               | dollars. The Three Gorges dam cost $31bn. These are
               | expensive infrastructure projects (including the
               | infrastructure for data centers). It doesn't say anything
               | about how important they are for society.
               | 
               | Comparing LLMs to the Manhattan Project based on budget
               | alone is stupid and arrogant. The comparison only "makes
               | itself" because Ethan Mollick is a childish and
               | unscientific person.
        
               | wodenokoto wrote:
               | Since when is CERN a dystopian project?
        
               | nicklecompte wrote:
               | Big Government Socialism won't let you build your own
               | 25km-circumference particle accelerator. Bureaucrats make
               | you fill out "permits" and "I-9s for the construction
               | workers instead of hiring undocumented day laborers."
               | 
               | I am wondering if "CERN was pushed on the masses by the
               | few" is an oblique reference to public fears that the LHC
               | would destroy the world.
        
               | bzzzt wrote:
               | Maybe it's the only way. Companies that don't have that
               | concentrated power will probably fall apart.
        
               | littlestymaar wrote:
              | The big difference is that the CERN and Manhattan
              | projects were done by local contractors with often more
              | than decent wages, which isn't the case when you pay
              | people from Madagascar a couple of dollars a day.
        
             | fire_lake wrote:
             | Hard to defend because once your model is out there other
             | companies can train on its output.
        
           | kleton wrote:
           | OpenAI is heavily relying on Scale AI for training data
           | (contractors).
        
         | barrkel wrote:
         | The code is much more similar, in principle, to a virtual
         | machine. The actual code, the bit that contains the logic which
         | has the semantics we intend, is in the trained weights, where
         | the level of complexity is much higher and more subtle.
        
         | netdevnet wrote:
         | > What makes LLMs difficult is getting access to all the
         | hardware to train them, getting the data, and being able to
         | preprocess that data.
         | 
         | Yes, that's my opinion too. GAOs (Grassroots AI Organisations)
         | are constrained by access to data and the hardware needed to
         | process the data and train the model on it. I look forward to a
         | future where GAOs will crowdsource their computations in the
         | same way many science labs borrow computing power from people
         | around the world.
        
           | miki123211 wrote:
           | This is hard because you need high bandwidth between the GPUs
           | in your cluster, bandwidth far higher than broadband could
            | provide. I'm not even sure whether the time spent
           | synchronizing between far-away machines would offset the
           | increase in computational power.
        
         | AnthonyMouse wrote:
         | > Most programmers have an intuition that duplicating a large
         | software project from scratch, like Linux or Chromium for
         | example, would require incredible amounts of expertise,
         | manpower and time. It's not something that a small team can
         | achieve in a few months. You're limited by talent, not
         | hardware.
         | 
         | But only for the same reasons. Linux runs on very nearly every
         | piece of hardware ever made. The APIs you have to implement in
         | order to run "Linux programs" are large and full of old
         | complexity that exists for compatibility. Chromium is full of
         | code to try to make pages render even though they were designed
         | for Internet Explorer 6.
         | 
         | Conversely, some university programs have students create a
         | basic operating system from scratch. It's definitely something
         | a small team can do as long as you don't care about broad
         | hardware support or compatibility with existing applications.
         | In principle a basic web browser is even simpler.
        
         | isaacfung wrote:
          | I recommend reading https://github.com/bkitano/llama-from-
          | scratch over the article OP linked.
          | 
          | It actually teaches you how to build llama iteratively, and
          | how to test, debug and interpret the training loss, rather
          | than just describing the code.
        
         | Const-me wrote:
         | > you could probably implement training and inference for a
         | single model architecture, from scratch, on a single kind of
         | GPU, with reasonable performance... with a year or so
         | 
         | I have implemented inference of Whisper
         | https://github.com/Const-me/Whisper and Mistral
         | https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...
         | models on all GPUs which support Direct3D 11.0 API. The
         | performance is IMO very reasonable.
         | 
          | A year might be required when the only input is the research
          | articles. In practice, we also have reference Python
          | implementations of these models. It's possible to test
          | individual functions or compute shaders against the
          | corresponding pieces of the reference implementation, by
          | comparing saved output tensors between the reference and the
          | newly built implementation. Thanks to that simple trick, I
          | think I spent less than 1 month part-time on each of these
          | two projects.
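          | 
          | (The pattern is simple enough to sketch, assuming numpy on
          | the Python side; the file naming scheme and tolerances here
          | are made up:)
          | 
          |     import numpy as np
          | 
          |     # Reference side, once per tested piece:
          |     # np.save("ref_attn_out.npy", tensor)
          | 
          |     # Port side: recompute the same piece and compare within
          |     # a tolerance - different hardware and precision won't
          |     # match bit-for-bit.
          |     def check(name, got):
          |         ref = np.load(f"ref_{name}.npy")
          |         ok = np.allclose(got, ref, rtol=1e-3, atol=1e-5)
          |         diff = np.abs(got - ref).max()
          |         print(f"{name}: {'OK' if ok else 'BAD'}, max diff {diff:.3g}")
          |         return ok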
        
           | miki123211 wrote:
           | I'd say a year for somebody who doesn't know what a linear
           | layer is and couldn't explain why a GPU might be of any use
           | if you're not playing games, but who knows what the
           | derivative of 3x^2 is.
        
         | gmays wrote:
         | > The code isn't that complicated, you could probably implement
         | training and inference for a single model architecture, from
         | scratch, on a single kind of GPU, with reasonable performance,
         | as an individual with a background in programming and who still
         | remembers their calculus and linear algebra, with a year or so
         | of self study.
         | 
         | Great overview. One gap I've been working on (daily) since
          | October is the math, working towards Math Academy's
          | Mathematics for
         | Machine Learning course
         | (https://mathacademy.com/courses/mathematics-for-machine-
         | lear...).
         | 
         | I wrote about my progress (http://gmays.com/math) if anyone
         | else is interested in a similar path. I recently crossed 200
         | days of doing math daily (at least a lesson a day). It's
         | definitely taking longer than I want, but I also have limited
         | time (young kids + startup + investing).
         | 
         | The 'year of self study' definitely depends on where you're
         | starting from and how much time you have, but it's very doable
         | if you can dedicate an hour or two a day.
        
         | fooker wrote:
          | > The code isn't that complicated.
         | 
         | This is an indication that we're at the infancy of this field.
        
       | _giorgio_ wrote:
       | I wanted to try the repo by Karpathy, but I still don't want to
       | learn C (Llama is probably his only C repo), so thanks for
       | posting this.
        
       | smcleod wrote:
       | I must say the creepy anime young girl in the readme is somewhat
       | off putting.
        
         | better_sh wrote:
         | will not stand this anti-anya slander
        
         | thomashop wrote:
         | Maybe it works for a younger generation of nerds? Don't judge a
         | book by its cover.
        
           | 7thpower wrote:
           | DbxduuuhhhhAdcs VC dem s
        
             | 7thpower wrote:
             | This was my daughter.
        
         | heed wrote:
         | he's using dingboard.com to edit his images. i believe the
         | anime girl is one of the default images (or used to be) on a
         | new canvas.
        
         | saintradon wrote:
         | Creepy??
        
         | phantomathkg wrote:
            | Interested to know why it is off-putting.
        
           | phist_mcgee wrote:
           | Do you need cartoons of children in your readme to get the
           | point across?
        
             | knome wrote:
             | I wouldn't have prepared information this way, but judging
             | by the immense popularity of _why in his day, I'm forced to
             | assume that many prefer to have the cartoons
        
               | cosmojg wrote:
               | Those cartoon foxes secured his legacy, and to a
               | significant extent, that of Ruby itself.
        
             | MeImCounting wrote:
             | Does Docker need this "cartoon" of an otter to get the
             | point across? https://github.com/docker/docs?tab=readme-ov-
             | file
             | 
             | or this "cartoon" of an octopus?
             | https://github.com/docker/compose
             | 
             | This seems to really just be "oldman-yelling-at-clouds-
             | syndrome"
             | 
             | I for one welcome anime girls in readmes and hope to see
             | more of it in the future if only because it seems to bother
             | some of the old hoagies in the world for some reason.
        
               | gertop wrote:
               | I'm glad you enjoy anime girls but surely you can see why
               | it's different than a project's logo?
               | 
               | One is directly related to the project, the other isn't.
               | It's not even contextually related.
        
               | cosmojg wrote:
               | The cartoon is literally pointing at contextually
               | relevant information, and it's far more pleasant to
               | follow than yet another big red arrow. That said, I would
               | have enjoyed my reading a bit more if the author utilized
               | a more diverse cast of characters.
        
               | nl wrote:
               | Python (the language) is named after "Monty Python's
               | Flying Circus" simply because Guido was reading the
               | scripts at the time:
               | 
               | > When he began implementing Python, Guido van Rossum was
               | also reading the published scripts from "Monty Python's
               | Flying Circus", a BBC comedy series from the 1970s. Van
               | Rossum thought he needed a name that was short, unique,
               | and slightly mysterious, so he decided to call the
               | language Python.
        
               | efilife wrote:
               | Why does github use an octocat as its logo? It's
               | unrelated to software development
        
               | phist_mcgee wrote:
               | Is 29 considered old hoagie?
        
               | MeImCounting wrote:
               | Old hoagie is more of a mindset. Anyone of any age can be
               | an old hoagie if they like, all one has to do is practice
               | getting upset when one sees anime girls, believe in the
               | coming AI apocalypse and use Emacs.
        
               | mkesper wrote:
               | Don't see how Emacs fits into this. At least I can sort
               | lines there without another proprietary addon.
        
             | saintradon wrote:
             | Does github need a cartoonish cat with 5 octopus-like legs
             | to be its logo? Of course not, but it makes it memorable
             | and funny. And besides, anime is extremely mainstream these
             | days.
        
               | yifanl wrote:
               | I would likely be just as put off by a picture of
               | Spongebob or Goofy or Goku in a readme as Anya, fwiw.
        
               | tkzed49 wrote:
               | maybe you should evaluate whether arbitrary societal
               | norms of "professionalism" or something else are leading
               | you to miss out on cool stuff
        
               | simooooo wrote:
               | Wouldn't quite go that far. I've only met one anime fan
               | in my entire career.
        
               | fshbbdssbbgdd wrote:
               | Do you ask everyone you meet?
        
               | Shin-- wrote:
               | Then you must be old. Even in western countries Spy x
               | Family (which the character is from) has sold millions of
               | copies, while most people read manga online and won't be
               | counted. In the country I am from I frequently see people
               | wearing merch of it, mostly because Uniqlo has had a
               | successful line of it. And that is just one manga/anime
               | out of hundreds of popular ones.
               | 
               | Using anime characters is similar to boomer nerds
               | referencing Marvel/DC comics, Star Wars, etc.
        
             | phantomathkg wrote:
             | I would agree that putting a cartoon character in a readme
             | without good context is unprofessional, but I would not go
             | as far as calling it off-putting.
        
         | hyperliner wrote:
         | I did not find it off-putting. I found it quirky and less
         | boring.
        
         | twiceaday wrote:
         | She is from a manga / anime called Spy x Family, which has an
         | 8.3 on IMDb. The best spy on the planet pretends to be a family
         | man
         | for deep cover by adopting the girl (who can read minds, he
         | doesn't know this) and quickly marries a woman (who is an
         | assassin also looking for cover, he doesn't know this). They do
         | their missions in-between roleplaying a perfect family.
         | 
         | https://www.imdb.com/title/tt13706018
        
           | rcarmo wrote:
           | I'm OK with that. I did find it distracting, because I knew
           | the character (not very well, I thought the kid was the
           | assassin) and the overall conceptual juxtaposition was...
           | weird.
           | 
           | Beats a cheery AI voice, though.
        
         | x-complexity wrote:
         | > I must say the creepy anime young girl in the readme is
         | somewhat off putting.
         | 
         | This statement is simply a variation of an ad hominem attack.
         | It chastises the creator based on appearances that do not align
         | with the niceties that the commenter deems appropriate.
        
           | mliker wrote:
           | Agreed. For me, the anime character is not "creepy" at all.
           | In fact, I've seen various ML blogs use manga characters to
           | guide the reader.
        
           | 0x1ceb00da wrote:
           | There is a time and place for everything. This isn't it.
        
             | thomashop wrote:
             | In your bubble. In mine this is totally fine, even
             | encouraged.
        
               | vsnf wrote:
               | Indeed. In my company Slack, our primary professional
               | communications tool, I can count a few people with anime
               | avatars. Not very many, but it counts.
        
               | swexbe wrote:
               | yuck
        
         | 12345hn6789 wrote:
         | It's fun. Not everything has to be dry.
        
         | EasyMark wrote:
         | It's not creepy; it's from a popular anime/manga. It's just
         | that the right wing in America (and other western nations) has
         | tried to make us all feel guilty about anime because it doesn't
         | fit their puritanical outlook on the world, in which "the
         | other" is bad, evil, and perverted, even though manga/anime has
         | been mainstream for at least 3 decades now. Face it, not all
         | the animation in the world has the same style and look as
         | "traditional" USA animation or comics. Would you have been
         | offended if it was the Charlie Brown kids?
        
           | DaSHacka wrote:
           | What does the American right-wing have to do with this at
           | all?
           | 
           | If anything I'd think it's the opposite; there's a frequent
           | stereotype about right-wing extremists having anime profile
           | pictures.
           | 
           | And honestly, most of the right-wing people I know IRL are
           | also into anime (though so are the left-wing people I know,
           | so I don't think it's really indicative of anything)
        
         | jongorer wrote:
         | I must say I find your comment off putting.
        
         | barrkel wrote:
         | I read this comment and thought you were upset that it was
         | sexualized, but when I looked, it wasn't at all. It might as
         | well have been a cute kitten or puppy doing the pointing; hard
         | to get wound up about.
        
         | ronsor wrote:
         | If this is the case, I feel as if you will be put off by a
         | significant portion of ML engineers.
        
           | vsnf wrote:
           | Security programmers and dev-ops people too. Two areas
           | famously disproportionately represented by furries and co.
        
         | brujoand wrote:
         | You should be off pudding
        
         | bezier-curve wrote:
         | Have you looked at various models on Hugging Face? There are so
         | many anime characters headlining the readmes. I think it's an
         | interesting cultural disconnect to observe in this thread, but
         | at the end of the day, open source projects like this are not
         | obligated to be anything in particular; they are entirely
         | subject to the author's tastes.
        
         | jejeyyy77 wrote:
         | ok boomer
        
         | 533474 wrote:
         | boring...
        
         | csomar wrote:
         | I found that the lack of proper order, grammar, punctuation,
         | etc. is what lost me. This style is fine for a 3-4 step
         | tutorial, but for something this long you need a proper table
         | of contents and a professional, old-fashioned doc.
        
           | 0x2c8 wrote:
           | You get a ToC for free with GitHub's README renderer
           | (top-right corner).
        
           | sph wrote:
           | The lack of punctuation and capitalization is a weird zoomer
           | style of writing in lowercase because "it's more chill." It
           | is very common in people < 25 years old. They'll grow out of
           | it.
        
         | helboi4 wrote:
         | It made it 10x better for me. Stop being boring. I like the
         | anime. It's a popular anime. Loads of people like it and think
         | this is funny.
        
           | frontfor wrote:
           | It should be obvious that not liking something does not
           | imply being boring.
        
         | TrackerFF wrote:
         | I don't know why this is such a hot take.
         | 
         | Personally, I find it distracting when some devs start to
         | "spice up" their presentation with manga characters, furry
         | characters, memes, or whatever stuff they enjoy.
         | 
         | Shit, I love Zelda - but I wouldn't want Link all over my
         | presentations. It just looks...juvenile and unprofessional.
         | Doesn't matter if you're a beginner or a world-leading
         | researcher; just keep it simple and undistracting.
         | 
         | EDIT: That said, I'm probably not the intended audience for
         | this piece.
        
         | sph wrote:
         | If young girls are creepy to you, you should stop watching
         | B-tier horror franchises.
        
         | smcleod wrote:
         | Well that escalated quickly...
        
         | efilife wrote:
         | Do you seriously not find this hilarious?
         | 
         | https://github.com/naklecha/llama3-from-scratch/raw/main/ima...
        
         | imp0cat wrote:
         | Just treat it as a weird watermark. That's what works for me.
        
       | blackeyeblitzar wrote:
       | This is an implementation of the inference part and not the
       | training part, right? I'd love to see the training part open
       | sourced and annotated like this.
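       | 
       | (For context: inference is just the forward pass; training
       | wraps that same pass in a loss and a backprop/optimizer step.
       | A generic sketch of the difference -- not Llama3's actual
       | training code, and the toy model and sizes are made up:)
       | 
       |     import torch
       |     import torch.nn.functional as F
       | 
       |     model = torch.nn.Linear(64, 1000)  # stand-in transformer
       |     opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
       | 
       |     def train_step(hidden, target_ids):
       |         logits = model(hidden)  # forward pass (inference ends here)
       |         loss = F.cross_entropy(logits, target_ids)
       |         opt.zero_grad()
       |         loss.backward()  # backprop: the part inference skips
       |         opt.step()
       |         return loss.item()
       | 
       |     # toy batch: 8 hidden states of width 64, random targets
       |     print(train_step(torch.randn(8, 64),
       |                      torch.randint(0, 1000, (8,))))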
        
       | fitsumbelay wrote:
       | starred
        
       | rcarmo wrote:
       | I'd like to see this using ONNX and streaming from storage (I
       | have my reasons, but mostly about using commodity hardware for
       | "slow" batch processing without a GPU)
        
       | helboi4 wrote:
       | The Spy X Family girl really adds to my enjoyment of this
        
       | hacker_88 wrote:
       | She can read your mind llama
        
       | xzghfat wrote:
       | amazing work
        
       | kunalgupta wrote:
       | this is a proper post
        
       | mattfrommars wrote:
       | As someone who has no technical knowledge of Llama or any of the
       | LLM work, from conceptual understanding to technical
       | implementation, is there any benefit to sitting down and going
       | through this from start to finish? Or is effort better spent
       | somewhere else?
       | 
       | Like a roadmap: do A, do B, and finally go through this at the
       | end.
        
         | krainboltgreene wrote:
         | Only do it if you want the illusion of LLMs to be shattered.
         | Suddenly every day you'll see two to three highly upvoted links
         | on HN and be unable to keep your eyes from rolling.
        
           | exe34 wrote:
           | that's like saying if you study real neurons your illusion of
           | the human mind will be shattered.
        
         | MuffinFlavored wrote:
         | my opinion: it quickly gets into "the math behind LLMs", which
         | makes no sense to me
         | 
         | words I understand but don't really get: weights, feed forward,
         | layers, tensors, embeddings, normalization, transformers,
         | attention, positioning, vector
         | 
         | There's "programming" in the plumbing sense, where you move
         | data around through files/sockets, and then there's this... for
         | somebody without a math background/education, it's very
         | unlikely you'll understand it. You're just skimming Python
         | without understanding the math/library calls it makes.
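         | 
         | (For what it's worth, the "attention" on that list boils down
         | to a few lines of NumPy. A minimal sketch with toy sizes --
         | illustrative only, not this repo's implementation:)
         | 
         |     import numpy as np
         | 
         |     def attention(Q, K, V):
         |         # each row of Q asks: which rows of K matter to me?
         |         scores = Q @ K.T / np.sqrt(K.shape[-1])
         |         # softmax turns scores into weights summing to 1
         |         w = np.exp(scores - scores.max(-1, keepdims=True))
         |         w /= w.sum(-1, keepdims=True)
         |         return w @ V  # blend value vectors by relevance
         | 
         |     rng = np.random.default_rng(0)
         |     Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
         |     print(attention(Q, K, V).shape)  # (4, 8)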
        
           | gradascent wrote:
           | If you want to gain familiarity with the kind of terminology
           | you mentioned here, but don't have a background in graduate-
           | level mathematics (or even undergrad really), I highly
           | recommend Andrew Ng's "Deep Learning Specialization" course
           | on Coursera. It was made a few years ago but all of the
           | fundamental concepts are still relevant today.
        
             | antonjs wrote:
             | Fei-Fei Li and Andrej Karpathy's Stanford CS231n course is
             | also a great intro to the basics of the math from an
             | engineering-forward perspective. I'm pretty sure all the
             | materials are online. You build up from the basic
             | components to an image-focused CNN.
        
           | zackmorris wrote:
           | Ya there are concepts in programming and math that are mostly
           | self-teachable from first principles, but then there's what
           | looks like gibberish because it's too new to have been
           | distilled down into something tractable yet. I would say that
           | arrays and matrices are straightforward to understand, while
           | tensors are not. So I'm disappointed that so much literature
           | currently revolves around tensors. Same for saying embedding
           | instead of just vector representation, etc.
           | 
           | It helps me to think in terms of levels of abstraction rather
           | than complexity. My education stopped at a 4 year degree, but
           | AI is mostly postgraduate still. So I have to translate to
           | what I know because I haven't internalized the lingo.
           | 
           | Here's the most approachable teaching of neural nets (NNs)
           | and large language models (LLMs) that I've seen so far:
           | 
           | https://news.ycombinator.com/item?id=40213292 (Alice's
           | Adventures in a differentiable wonderland)
           | 
           | https://arxiv.org/pdf/2404.17625 (pdf)
           | 
           | https://news.ycombinator.com/item?id=40215592 (tensor and NN
           | layer breadcrumbs)
           | 
           |     II A strange land 105
           |       7 Convolutional layers 107
           |         ..
           |         7.1.3 Translational equivariant layers 112
           |         ..
           |       9 Scaling up the models 143
           |         ..
           |         9.3 Dropout and normalization 151
           |           9.3.1 Regularization via dropout 152
           |           9.3.2 Batch (and layer) normalization 156
           |     III Down the rabbit-hole 167
           |       10 Transformer models 169
           |         10.1 Introduction 169
           |           10.1.1 Handling long-range and sparse dependencies 170
           |           10.1.2 The attention layer 172
           |           10.1.3 Multi-head attention 174
           |         10.2 Positional embeddings 177
           |           10.2.1 Permutation equivariance of the MHA layer 177
           |           10.2.2 Absolute positional embeddings 179
           |           10.2.3 Relative positional embeddings 182
           |         10.3 Building the transformer model 182
           |           10.3.1 The transformer block and model 182
           |           10.3.2 Class tokens and register tokens 184
           |       11 Transformers in practice 187
           |         11.1 Encoder-decoder transformers 187
           |           11.1.1 Causal multi-head attention 188
           |           11.1.2 Cross-attention 189
           |           11.1.3 The complete encoder-decoder transformer 190
           |         11.2 Computational considerations 191
           |           11.2.1 Time complexity and linear-time transformers 191
           |           11.2.2 Memory complexity and the online softmax 192
           |           11.2.3 The KV cache 194
           |           11.2.4 Transformers for images and audio 194
           |         11.3 Variants of the transformer block 197
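           | 
           | (To ground two of those terms: in code, an "embedding" is
           | just a row of a matrix, and a "tensor" is just an array
           | with more axes. A toy sketch, nothing Llama-specific --
           | the sizes and token ids are made up:)
           | 
           |     import numpy as np
           | 
           |     vocab_size, d_model = 1000, 64
           |     # the "embedding table" is an ordinary matrix:
           |     # one row per token id
           |     rng = np.random.default_rng(0)
           |     table = rng.standard_normal((vocab_size, d_model))
           | 
           |     ids = [42, 7, 255]       # a tokenized sentence (toy ids)
           |     vecs = table[ids]        # "embedding lookup" = indexing
           |     print(vecs.shape)        # (3, 64): one vector per token
           | 
           |     # a "tensor" just adds axes: (batch, tokens, features)
           |     print(vecs[np.newaxis, ...].shape)  # (1, 3, 64)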
        
           | starik36 wrote:
           | > understand but don't really get
           | 
           | That's exactly where I am at. Despite watching Karpathy's
           | tutorial videos, I quickly got lost. My highest level of math
            | education is Calculus 3, which I barely passed. This
            | probably means I will only ever understand LLMs at a high
            | level.
        
             | danielmarkbruce wrote:
              | _Understanding Deep Learning_ is a very approachable text
              | that will get you 80% of the way there.
              | 
              | _Dive into Deep Learning_ is another.
              | 
              | Both have free PDF versions available.
             | 
             | The math isn't difficult. The notation is a little foreign,
             | and you have to take your time reading and rereading the
             | equations.
        
           | anon373839 wrote:
           | I recommend _Deep Learning with Python_ by Francois Chollet
           | (the creator of Keras). It's very clear and approachable,
           | explains all of these concepts, and doesn't try to "impress"
           | you with unnecessary mathematical notation. Excellent
           | introductory book.
           | 
           | The only downside is that in 2024, you are probably going to
            | use PyTorch and not Keras + TensorFlow as shown in the book.
        
         | danielmarkbruce wrote:
         | Not as a starting point.
         | 
         | Google and find the examples where someone does it in a
         | spreadsheet. It's much more approachable that way.
         | 
         | You are going to find it's not that complicated.
        
           | gohwell wrote:
           | Sounds interesting. Do you have a link?
        
             | gricha2380 wrote:
             | https://news.ycombinator.com/item?id=39700256
        
         | joenot443 wrote:
         | https://bbycroft.net/llm
         | 
         | This was posted on HN a while ago and led to some great
          | discussion. Others and I agreed that this type of stateful
         | visualization was _way_ more effective at conceptualizing how
         | an LLM works than reading code or stepping through a debugger.
        
       | citizenpaul wrote:
        | I know it's not really related, but I've noticed something that
        | is making me feel out of touch. Lately there seems to be an
        | increasing merging of tech with weeaboo culture. I may not have
        | the term exactly right, but I am talking about the anime girl
        | in the OP's blog post. It's not everywhere, but I've started to
        | notice it, so it is increasing. Did I miss something? Is this
        | replacing memes in tech speeches? (I was never fond of those
        | either, so I guess I'm a curmudgeon, or perhaps my ADHD brain
        | just finds it too distracting.)
        | 
        | The post looks informative; I hope to learn something from it
        | later tonight. Thx
        
         | Conscat wrote:
         | I'm still waiting for furry artwork to become culturally
         | acceptable in technical lectures. I briefly snuck a cute
         | Lucario/Zeraora drawing into a presentation on my college
         | final, and the critical reception has been promising, so far.
        
         | rjbwork wrote:
         | It has. I find it infantile and reflective of general
         | millennial peter pan syndrome sensibilities, personally. (i'm a
         | millennial fwiw) But clearly I'm in the minority.
         | 
         | I mean wtf is this.
         | https://kubernetes.io/blog/2024/04/17/kubernetes-v1-30-relea...
        
           | throwaway743 wrote:
            | Millennial too. Not to shift blame, but from observation it
            | seems to be more of a gen Z thing.
            | 
            | Anime/waifu shit, furries, and all becoming commonly
            | accepted as of late? 10-15 years ago you'd be exiled. Now it
            | seems like it's whatever.
        
         | stardner wrote:
         | I'd say it's nothing more than a generational shift in popular
         | culture... brace yourself for future anime memes.
        
         | claudiowilson wrote:
          | It's because a lot of the users of gen AI are generating anime
          | waifus. Better gen AI = better waifus. It also helps that devs
          | and programmers are a group that is already likelier to be
          | into anime. Generative AI's killer app is the AI girlfriend /
          | boyfriend.
        
         | GuB-42 wrote:
         | It isn't new. In fact, in Tokyo, Japan, Akihabara "electric
         | town" is both the tech mecca and the anime/manga/otaku mecca.
         | Same for Den-Den in Osaka. In the west, the weeaboo movement
         | has always run alongside tech. I guess nerds/geeks and otakus
         | are of the same kind. It does not mean that all tech guys are
         | weebs and all weebs are into tech, but there is definitely some
         | correlation.
         | 
         | Why? I don't know. Video games may be a common denominator.
         | Also, Japan was really big into tech in the 90s, and they still
         | are to a lesser extent.
        
         | pvg wrote:
          | _it's not really related_
         | 
          | It's also very much off-topic since it generates repetitive
         | thread-gobbling tangents, like this one is threatening to.
         | Mentioned in the site docs a couple of different ways:
         | 
          | _Please don't pick the most provocative thing in an article
          | or post to complain about in the thread. Find something
          | interesting to respond to instead._
         | 
          | _Please don't complain about tangential annoyances--e.g.
          | article or website formats, name collisions, or back-button
          | breakage. They're too common to be interesting._
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | 0xedd wrote:
         | Not sure why people are beating around the bush; The
         | overwhelming majority of them are degenerates. Either they will
         | sport some variation of the pedophile flag ("trans") or
         | outright defend it in chat.
         | 
         | It has become so bad that moderators will not ban these people
         | even if they explicitly try to justify molesting children. Some
         | of them are moderators themselves. And even have calls to
         | genocide in their bio. This is most prevalent in the ArchLinux
         | community. Specifically, their Telegram channels.
        
       | _lateralus_ wrote:
       | dingboard w
        
       | naklecha wrote:
       | hey, thank you for sharing my project! this made my day <3
        
         | localfirst wrote:
          | love the cute anime character pointing at things
        
       | windowshopping wrote:
       | Aaaaaaaaaa.org is possibly the worst domain name I've ever
       | encountered in all my time using the internet. I support your
       | mission but you need to change that.
        
         | joshuakogut wrote:
         | While I agree with you, it's easy to remember using a simple
         | rule. A*10
        
           | qntmfred wrote:
           | a8a would be the typical numeronym
        
       ___________________________________________________________________
       (page generated 2024-05-20 23:00 UTC)