[HN Gopher] The Mathematics of Training LLMs
       ___________________________________________________________________
        
       The Mathematics of Training LLMs
        
       Author : FanaHOVA
       Score  : 104 points
       Date   : 2023-08-16 16:59 UTC (6 hours ago)
        
 (HTM) web link (www.latent.space)
 (TXT) w3m dump (www.latent.space)
        
       | mchiang wrote:
       | Thanks for sharing this. How do you think the local LLM movement
        | will evolve? Especially since, in the post, you mentioned
        | startups and VCs both hoarding GPUs to attract talent.
       | 
       | There seems to be a good demand behind tools like llama.cpp or
       | ollama (https://github.com/jmorganca/ollama) to run models
       | locally.
       | 
       | Maybe as the local runners become more efficient, we'll start
        | seeing more training of smaller models or fine-tuning done
       | locally? I too am still trying to wrap my head around this.
        
         | CuriouslyC wrote:
         | The current generation of models (Llama/Llama2) seem to pass a
         | threshold of "good enough" for the majority of use cases at
          | 60B+ parameters. Quantized 30B models that can run in 24GB of
         | GPU VRAM are good enough for many applications but definitely
         | show their limitations frequently.
         | 
         | It is likely that we will eventually see good fine tunes for
          | Llama 30B that produce usable output for code/other challenging
          | domains, but until we get GPUs with 48GB+ VRAM we're going to
         | have to make do with general models that aren't great at
         | anything, and fine tunes that only do one very narrow thing
         | well.
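          | 
          | Back-of-envelope for the VRAM side, as a rough Python sketch
          | (assuming weights dominate; the ~20% overhead for KV cache and
          | activations is an assumption, not a measurement):
          | 
          |     def inference_vram_gb(params_b, bits_per_param,
          |                           overhead=1.2):
          |         # weights: params (in billions) * bytes per param,
          |         # padded ~20% for KV cache / activations (assumed)
          |         return params_b * (bits_per_param / 8) * overhead
          | 
          |     print(inference_vram_gb(30, 4))   # ~18 GB: fits in 24GB
          |     print(inference_vram_gb(70, 4))   # ~42 GB: wants 48GB+
          |     print(inference_vram_gb(70, 16))  # ~168 GB: multi-GPU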
        
         | swyx wrote:
         | apart from the obvious GGML, we've done podcasts with both
         | MLC/TQ Chen (https://www.latent.space/p/llms-everywhere) and
         | Tiny/George Hotz (https://www.latent.space/p/geohot) who are
         | building out more tooling for the Local LLM space!
         | 
          | there's actually already a ton of interest, and arguably if
          | you go by the huggingface model hub it's a very well-developed
          | ecosystem... just that a lot of the use cases tend to be
          | NSFW-oriented. still, i'm looking to do more podcasts in
         | this space, please let me know if any good guests come to mind.
        
         | FanaHOVA wrote:
          | Training smaller models can be really compute intensive (a 7B
          | model should be trained on 1.4T tokens to follow the "LLaMA
          | laws"). So that would be C = 6 * 7e9 * 1.4e12 ~= 5.9e22 FLOPs.
          | That's about 1/5th the compute of GPT-3 for example, but it's
          | still a lot. We asked Quentin to do a similar post but for
          | fine-tuning math; that's still a very underexplored space.
         | 
          | (Not to self-plug too much, but this is exactly what last
          | episode with Tianqi Chen was about if you're interested :)
         | https://www.latent.space/p/llms-everywhere#details
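          | 
          | To put a scale on that number, a rough sketch of the
          | C = 6 * P * D rule of thumb (the A100 peak is the spec-sheet
          | figure; the 40% utilization is an assumption for
          | illustration):
          | 
          |     def train_flops(params, tokens):
          |         # rule of thumb: ~6 FLOPs per parameter per token
          |         # (2 forward + 4 backward)
          |         return 6 * params * tokens
          | 
          |     c = train_flops(7e9, 1.4e12)  # 7B params, 1.4T tokens
          |     a100_bf16 = 312e12            # A100 peak FLOPs/sec
          |     mfu = 0.4                     # assumed utilization
          |     days = c / (a100_bf16 * mfu) / 86400
          |     print(f"{c:.1e} FLOPs, ~{days:,.0f} A100-days")
          |     # -> 5.9e+22 FLOPs, ~5,453 A100-days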
        
           | arugulum wrote:
           | I want to jump in and correct your usage of "LLaMA Laws"
           | (even you are using it informally, but I just want to
           | clarify).
           | 
           | There is no "LLaMA scaling law". There are a set of LLaMA
           | training configurations.
           | 
           | Scaling laws describe the relationship between training
           | compute, data, and expected loss (performance). Kaplan et
            | al. estimated one set of laws, and the Chinchilla folks
           | refined that estimate (mainly improving it by adjusting the
           | learning rate schedule).
           | 
           | The LLaMA papers do not posit any new law nor contradict any
           | prior one. They chose a specific training configuration that
            | still abides by the scaling laws but with a different goal in
           | mind.
           | 
           | (Put another way: a scaling law doesn't tell you what
           | configuration to train on. It tells you what to expect given
           | a configuration, but you're free to decide on whatever
           | configuration you want.)
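            | 
            | For concreteness, the kind of law in question is the
            | Chinchilla parametric fit L(N, D) = E + A/N^alpha +
            | B/D^beta. A small sketch using the approximate constants
            | reported by Hoffmann et al. (2022); treat the numbers as
            | illustrative:
            | 
            |     def chinchilla_loss(n_params, n_tokens):
            |         # approximate fitted constants from the paper
            |         E, A, B = 1.69, 406.4, 410.7
            |         alpha, beta = 0.34, 0.28
            |         return E + A / n_params**alpha + B / n_tokens**beta
            | 
            |     # same 7B model, two data budgets: the law predicts
            |     # the loss, it doesn't tell you which config to pick
            |     print(chinchilla_loss(7e9, 140e9))   # ~2.18
            |     print(chinchilla_loss(7e9, 1.4e12))  # ~2.04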
        
             | FanaHOVA wrote:
             | Yep, +1. That's why I used the quotes. :) Thanks for
             | expanding!
        
               | arugulum wrote:
               | Yep I understood that you were using it informally, just
               | trying to keep things informative for other folks reading
               | too.
        
               | swyx wrote:
               | there frankly needs to be a paper calling this out tho,
               | because at this point there are a bunch of industry
               | models following "llama laws" and nobody's really done
                | the research, it's all monkey see monkey do
        
               | arugulum wrote:
               | But what would they be calling out?
               | 
               | If industry groups want to run a training run based on
               | the configurations of a well-performing model, I don't
               | see anything wrong with that. Now, if they were to claim
               | that what they are doing is somehow "optimal", then there
               | would be something to criticize.
        
               | swyx wrote:
                | poor choice of words, i probably meant sketching out the
               | curves/doing ablation studies in a comprehensive way like
               | the chinchilla paper did.
        
           | swyx wrote:
           | that said there's more data efficiency to be gained in the
           | smol models space - Phi-1 achieved 51% on HumanEval with only
           | a 5.3x token-param ratio: https://www.latent.space/p/cogrev-
           | tinystories#details
        
       | babelfish wrote:
       | Why is the constant in the formula 6? In the transcript I see a
       | mention of "6 tokens per parameter", but I'm not clear on why
       | that is.
        
         | swyx wrote:
          | if you click through to the source doc (and we talk about it
          | on the pod), it's basically 2 for the forward pass, 4 for the
          | backward pass
        
         | FanaHOVA wrote:
          | It's 2 for the forward pass and 4 for the backward pass (2PD +
          | 4PD = 6PD), but we didn't cover why it's 2 and 4 since it was
          | a very deep rabbit hole. Maybe he'll do a separate post for
          | that in the future!
        
           | arugulum wrote:
           | If you want a speedrun explanation for how we get to "2": In
           | the limit of model scaling, context size doesn't matter (yes,
           | forget about the quadratic attention), most of the compute is
           | in the linear layers, which boil down to matrix multiplies.
            | Consider a single matrix of size [T,d] multiplied by a
            | weight of size [d,d]: the compute needed for that matrix
            | multiplication is approximately 2Td^2 (the 2 coming from one
            | multiply + one add). Swap T out for D, your whole dataset in
            | tokens; d^2 is the number of parameters in a single linear
            | layer, so scale your model up to P, and you've got 2PD.
           | 
           | Even shorter: The 2 comes from the multiply-add
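            | 
            | A sketch making the bookkeeping explicit (the backward
            | factor of 4 is because backprop does two more matmuls of
            | the same shape: one for the weight gradient, one for the
            | input gradient):
            | 
            |     # one linear layer, y = x @ W, with x: [T,d], W: [d,d]
            |     fwd = lambda T, d: 2 * T * d * d   # multiply + add
            |     # dL/dW = x.T @ dL/dy and dL/dx = dL/dy @ W.T
            |     bwd = lambda T, d: 2 * (2 * T * d * d)
            | 
            |     T, d = 2048, 4096
            |     params = d * d         # this layer's parameter count
            |     print(fwd(T, d) / (params * T))   # -> 2.0, the "2"
            |     print((fwd(T, d) + bwd(T, d)) / (params * T))  # -> 6.0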
        
           | float-trip wrote:
           | There's a breakdown here for anyone interested (ctrl+f
           | "weight flops for")
           | 
           | https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-
           | la...
        
       | xchip wrote:
        | Ambiguous title; it should be "memory requirements for training
       | LLMs"
        
         | d4rkp4ttern wrote:
         | Thanks. This is why I read HN comments before the article :)
        
       | swyx wrote:
        | this was the deepest dive we ever did into Transformer Math 101
        | (https://news.ycombinator.com/item?id=35631546). we mostly cover
        | AI Engineer/inference-time things on the pod, but training
        | knowledge is hard-won and extremely valuable, and I love it when
        | experts are willing to distill the rules of thumb from what they
        | have learned.
       | 
       | questions welcome!
        
         | FanaHOVA wrote:
          | Yes! I have a 5-page doc of prep notes I made, happy to share
         | :)
        
           | rayval wrote:
           | Please do! Thanks
        
         | swyx wrote:
         | related online discussion when this article came out:
         | 
         | from Eleuther:
         | https://twitter.com/AiEleuther/status/1648782486736969728?s=...
         | 
         | from AI anon:
         | https://twitter.com/nearcyan/status/1662937711156625408?s=20
         | 
         | Stella Biderman (coauthor and now lead of the Eleuther
         | Foundation, Stella if you're reading this pls come on our pod
         | sometime!!!!)
         | https://twitter.com/BlancheMinerva/status/164878587828883456...
        
         | photon_lines wrote:
         | Sorry for the criticism here, but your linked paper / article
         | does nothing to explain the math behind transformers. You
          | should rename it to something like 'Scaling Transformer
         | Mathematics' or 'The Math Behind Scaling GPUs / Transformers'
         | or whatever more descriptive name there is to describe what
         | you're outlining in your article. 'Transformer Math 101' should
         | at least explain 1) input embeddings 2) (keys, queries, values)
         | and the idea behind the matrix multiplication of these linear /
         | numeric sets of values 3) the Softmax computation in the
         | original paper as well as the relevant matrix transformations
         | which take place 4) dot products ( and how they relate to
          | similarity scoring) 5) feed-forward neural networks and
          | gradient descent / backpropagation and how they work. There
          | are many, many concepts you didn't even touch upon. This is not
         | 'Transformer Math 101' by any means.
        
           | swyx wrote:
           | fwiw i'm not the author of that doc, we just interviewed him,
           | and the hn submission i linked to was also renamed presumably
           | for similar concerns. we do have an "Algorithms 101" episode
            | (in the theme of our Datasets 101 and Benchmarks 101
            | episodes) where we have at least some of your topics lined up
        
       | tysam_and wrote:
       | I think this is a lot of the mathematics of scaling LLM training.
       | Which is quite important!
       | 
       | One fundamental requirement though for any machine learning
       | engineer working on these kinds of systems is
       | https://people.math.harvard.edu/~ctm/home/text/others/shanno....
       | I do not want to be entirely hypocritical as I am still ingesting
       | this theory myself (started several years ago!), but I've found
       | it _absolutely crucial_ in working in ML, as it implicitly
       | informs every single decision you make when designing, deploying,
       | and scaling neural networks.
       | 
       | Without it, I feel the field turns into an empirical "stabby
       | stab-around in the dark" kind of game, which very much has its
        | dopamine highs and lows, but, a la Sutton, does not scale very
        | well in the long run. ;P
        
         | davnn wrote:
         | Do you mean that information theory in general is essential in
         | working with ML systems, or a specific point raised by Shannon?
        
           | tysam_and wrote:
           | I'm sure there's a specific point raised by Shannon, and I've
           | been (very recently!) learning how Shannon himself is not the
           | sole info theory dude (even though lots of people credit him
           | for it!).
           | 
           | But basically, the communication of information over a noisy
           | channel _is_ the foundation for deep learning.
        
             | arketyp wrote:
             | As they say, intelligence is prediction is compression is
             | encoding...
        
               | tysam_and wrote:
               | Well, I agree that encoding is compression, at least, but
               | the rest of that statement I do disagree with! It seems
               | to be one of the more common mantras going around right
               | now, though. Which is partially why I advocate for
               | Shannon's theory! It's very clean and bare metal in terms
               | of being related to intelligence, though I think one
               | could build a decent argument for intelligence from info
               | theory.
        
               | PeterisP wrote:
               | > Well, I agree that encoding is compression, at least,
               | but the rest of that statement I do disagree with!
               | 
                | IMHO the Shannon paper you linked is effectively the
                | initial work linking prediction with compression,
                | showing how the information transmitted (and the amount
                | of information that needs to be transmitted) decreases
                | as you can more accurately predict the likelihood of
                | the next symbol.
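                | 
                | A toy sketch of that link (the ideal code length
                | is -log2 of the probability the model assigns;
                | the alternating source and the 0.9 probability
                | are made up for illustration):
                | 
                |     import math
                | 
                |     def bits(text, predict):
                |         # ideal (arithmetic-coding) length
                |         return sum(-math.log2(predict(text[:i], c))
                |                    for i, c in enumerate(text))
                | 
                |     coin = lambda past, c: 0.5  # no skill
                | 
                |     def learned(past, c):
                |         # knows the source tends to alternate
                |         if not past:
                |             return 0.5
                |         return 0.9 if c != past[-1] else 0.1
                | 
                |     text = "abababababababab"
                |     print(bits(text, coin))     # 16.0 bits
                |     print(bits(text, learned))  # ~3.3 bits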
        
               | tysam_and wrote:
               | Yes, but prediction != compression, and intelligence !=
               | prediction! However, prediction can inform compression,
               | but there's necessarily some framework needed to
               | translate the information from prediction<->compression.
               | Perhaps that is too nitpicky, but to me it's like
               | thinking about raw materials (like raw food or whatnot)
               | and how it ends up as a finished product (like cornbread.
               | Corn farming is not cornbread, but there is a semi-clear
               | family of paths from one to the other with its own set of
                | optimizations, rules, etc.).
               | 
               | Again, that could be a bit nitpicky on my end depending
               | on how one is viewing it.
        
             | davnn wrote:
             | It probably is from the perspective of an information
             | theorist. Did you read any interesting articles on the
             | connections between deep learning and information theory to
             | come to this conclusion? I'm highly interested in this
             | space, but the influence of information theory on deep
             | learning developments appears to be negligible.
        
         | jackblemming wrote:
         | Please list a single decision you've made that was directly
         | influenced by Shannon's paper, and no I do not mean something
         | you post hoc rationalized.
        
           | jdkoeck wrote:
           | Good challenge, I hope we will get a response.
        
             | tysam_and wrote:
             | Your wish hath been granted! <3 :))))
        
           | tysam_and wrote:
           | Sure. Basically everything in https://github.com/tysam-
           | code/hlb-CIFAR10 was directly founded on concepts shared in
           | the above paper, down to the coding, commenting, and layout
           | styles (hence why I advocate so strongly for it as a
           | requirement for ML. The empirical benefits are clear to me).
           | 
           | Before I sat down and wrote my first line, I spent a very
           | long time thinking about how to optimize the repo. Not just
           | in terms of information flow during training, but how the
           | code was laid out (minimize the expected value of deltas for
           | changes from a superset of possible code changes), to even
           | the explanatory comments (ratio of space vs mental effort to
           | decode the repo for experienced vs inexperienced developers).
           | I really want it to be a good exemplary model of a different,
           | more scalable, and more efficient way of conducting small-
           | scale (and potentially resource-constrained) research. To do
           | that, you have to maximize information efficiency at every
           | stage of the pipeline, including temporally (!!!!).
           | 
           | It's not perfect, but I've used info theory as a strong
           | guiding light for that repo. There's more to say here, but
           | it's a long conversation about the expected utility of doing
           | research a few different kinds of ways.
        
         | godelski wrote:
         | There's something I like to tell people: "You don't need math
         | to make a good model, but you need math to know why your model
          | is wrong." All models are wrong, and one of the crucial
          | issues right now is that we don't focus enough on how
          | wrong our models are (or how wrong our evaluations are, which
         | is also a type of model). I suggest also investing time in
         | statistics, understanding higher dimensional spaces
         | (topologically and statistically), and metric theory.
         | 
         | It is always a "stab-around in the dark" unfortunately, but the
         | deeper your mathematical understanding is the brighter your
         | candle burns to help you poke around. I think a lack of
         | mathematical understanding has also made people misread Sutton
         | as endorsing "scale is all you need" rather than "flexibility
         | combined with scale has historically resulted in the largest
         | gains." These things are very different.
        
           | tysam_and wrote:
           | Ha! I think we may be on the same page with Sutton, that and
           | the misuse of the NFL theorem are the two disclaimers I put
           | out the most. My most recent Sutton one was ~3 hours ago!
           | (https://news.ycombinator.com/item?id=37129921#37151192).
           | 
           | That's a really good point, and I stand corrected. I guess it
           | is still very much stabbing around in the dark, just with
           | some more information about the probability density function
           | of good answers. Heck, even Terry Tao's post about how to
           | solve problems shows very much a (refined) guess-and-check
           | method, so I have quite little ground to stand on there.
           | 
           | Metric theory is fun, and one I'd like to learn a lot more
           | about. I certainly have a lot to learn there.
        
             | godelski wrote:
             | Yeah I think the nature of research is de facto searching
             | in the dark. I mean if it weren't, would it really be
             | research? I think of the analogy as the knowledge space is
             | a dark void and our collective knowledge is permanent
             | torches lighting up the area. Researchers are exploring
             | into the darkness and finding suitable places to place the
             | next torch. Maybe we can say that the torch's brightness is
             | defined by how well known your work is, since people tend
             | to the flames more (not to be confused with how important
             | the work is).
             | 
              | The big question to me is about the geometry of that
              | knowledge space. Is it bounded? If bounded (most likely
              | imo), how do we approach the bound? Do we plateau once we
              | hit it (unlikely imo)? Do we approach it
              | {,sub,super}-linearly? Do we approach it
              | {,sub,super}-linearly and then transition into an
              | asymptotic convergence (e.g. an S-curve)? This question
              | actually has a profound impact on the question of the
              | risk of a super-intelligence. IMO knowledge is most
              | likely bounded (and most likely looks like an S-curve).
              | If we're more than halfway through that S-curve, then a
              | super-intelligence that is 100x smarter than us may only
              | know <100x as much as us, and we have good reason to
              | believe this is the case, since knowledge appears to
              | compound and accelerate (linear and sublinear models
              | don't fit our historic trend, but that doesn't tell us
              | the shape of the curve ahead). That reduces the risk of
              | a super-intelligence absolutely outsmarting us, since it
              | would be easier to close the gap, especially since
              | learning through observation is easier than learning
              | through exploration (we can use its torches, even if
              | they are more difficult to see). I'm not sure I see this
              | really discussed much in the super-intelligence
              | conversations; we're often working with different
              | priors, and that creates fundamental disagreements that
              | become impassable. Maybe we need to settle these priors
              | first before we even discuss the next priors about the
              | motivation of a super-intelligence.
             | 
              | At least for math, I don't think I have the capacity to
              | learn it all in a short human life, even if we're just
              | discussing the math that would be helpful to ML. But it
              | sure is fun to learn, so I guess the inability to learn it
              | all is not really a downside :) Now if only I could
              | convince others that math is fun __and__ useful to ML.
        
           | eointierney wrote:
           | This is well put.
           | 
           | Mathematics is our candle in the darkness of our
           | understanding, and the more math we know the brighter our
           | candle flames.
           | 
           | But Mathematics is not a candle, it is the purest analogy we
           | have ever encountered, and it enlightens our minds.
           | 
           | It gives us an appreciation of scale, and measurement
           | thereof, that is unmatched, yielding deep insights into
           | itself and everything else.
           | 
           | I just wish everyone learned number theory and geometry and
           | algebra and physics and and and...
           | 
           | I just wish everyone would read the Road to Reality by
           | Penrose.
           | 
           | Models are little stepping stones that are slippy and unsure,
           | but we must use them to take further steps into the darkness
           | of our ignorance.
        
       ___________________________________________________________________
       (page generated 2023-08-16 23:00 UTC)