[HN Gopher] The Mathematics of Training LLMs
___________________________________________________________________
The Mathematics of Training LLMs
Author : FanaHOVA
Score : 104 points
Date : 2023-08-16 16:59 UTC (6 hours ago)
(HTM) web link (www.latent.space)
(TXT) w3m dump (www.latent.space)
| mchiang wrote:
| Thanks for sharing this. How do you think the local LLM movement
| will evolve? Especially since, in the post, you mentioned both
| startups and VCs hoarding GPUs to attract talent.
|
| There seems to be good demand for tools like llama.cpp or
| ollama (https://github.com/jmorganca/ollama) to run models
| locally.
|
| Maybe as the local runners become more efficient, we'll start
| seeing more training of smaller models, or fine-tuning, done
| locally? I too am still trying to wrap my head around this.
| CuriouslyC wrote:
| The current generation of models (Llama/Llama2) seems to pass a
| threshold of "good enough" for the majority of use cases at
| 60B+ parameters. Quantized 30B models that can run in 24GB of
| GPU VRAM are good enough for many applications but definitely
| show their limitations frequently.
|
| It is likely that we will eventually see good fine-tunes for
| Llama 30B that produce usable output for code and other
| challenging domains, but until we get GPUs with 48GB+ VRAM
| we're going to have to make do with general models that aren't
| great at anything, and fine-tunes that only do one very narrow
| thing well.
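|
| As a rough back-of-the-envelope on the VRAM side (a sketch that
| counts quantized weights only, ignoring the KV cache and
| activation overhead):
|
|     # illustrative quantized-weight memory estimate
|     def weight_gb(params_billion, bits):
|         return params_billion * 1e9 * bits / 8 / 1e9
|
|     weight_gb(30, 4)  # ~15 GB -> fits a 24GB card with headroom
|     weight_gb(70, 4)  # ~35 GB -> wants a 48GB+ card (or several)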
| swyx wrote:
| apart from the obvious GGML, we've done podcasts with both
| MLC/TQ Chen (https://www.latent.space/p/llms-everywhere) and
| Tiny/George Hotz (https://www.latent.space/p/geohot) who are
| building out more tooling for the Local LLM space!
|
| there's actually already a ton of interest, and arguably if you
| go by the huggingface model hub it's actually a very well
| developed ecosystem... just that a lot of the use cases tend to
| be NSFW oriented. still, i'm looking to do more podcasts in
| this space, please let me know if any good guests come to mind.
| FanaHOVA wrote:
| Training smaller models can be really compute intensive (a 7B
| model should get trained on ~1.4T tokens to follow the "LLaMA
| laws"). So that would be C = 6 * 7B * 1.4T = ~5.9e22 FLOPs.
| That's roughly 1/5th the compute of GPT-3 for example, but it's
| still a lot. We asked Quentin to do a similar post but for fine-
| tuning math; that's still a very underexplored space.
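|
| In code, that back-of-the-envelope is just the following (rough
| rule-of-thumb numbers, not an exact accounting):
|
|     # C ~= 6 * P * D training FLOPs (the rule of thumb above)
|     P = 7e9      # parameters
|     D = 1.4e12   # training tokens
|     C = 6 * P * D
|     print(f"{C:.2e} FLOPs")  # ~5.88e+22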
|
| (Not to self-plug too much, but this is exactly what last
| episode with Tianqi Chen was about, if you're interested :)
| https://www.latent.space/p/llms-everywhere#details
| arugulum wrote:
| I want to jump in and correct your usage of "LLaMA Laws"
| (even though you're using it informally, I just want to
| clarify).
|
| There is no "LLaMA scaling law". There are a set of LLaMA
| training configurations.
|
| Scaling laws describe the relationship between training
| compute, data, and expected loss (performance). Kaplan et
| al., estimated one set of laws, and the Chinchilla folks
| refined that estimate (mainly improving it by adjusting the
| learning rate schedule).
|
| The LLaMA papers do not posit any new law nor contradict any
| prior one. They chose a specific training configuration that
| still abides by the scaling laws, but with a different goal in
| mind.
|
| (Put another way: a scaling law doesn't tell you what
| configuration to train on. It tells you what to expect given
| a configuration, but you're free to decide on whatever
| configuration you want.)
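|
| Concretely, a scaling law of the Chinchilla type is a parametric
| loss estimate, roughly L(N, D) = E + A/N^alpha + B/D^beta. A
| small sketch, using approximately the constants reported in the
| Chinchilla paper (treat them as illustrative, not gospel):
|
|     # Chinchilla-style parametric scaling law (Hoffmann et al.);
|     # the constants are approximate values from their fit
|     def expected_loss(N, D, E=1.69, A=406.4, B=410.7,
|                       alpha=0.34, beta=0.28):
|         return E + A / N**alpha + B / D**beta
|
|     # expected loss for a LLaMA-style 7B-params / 1.4T-tokens run
|     print(expected_loss(7e9, 1.4e12))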
| FanaHOVA wrote:
| Yep, +1. That's why I used the quotes. :) Thanks for
| expanding!
| arugulum wrote:
| Yep I understood that you were using it informally, just
| trying to keep things informative for other folks reading
| too.
| swyx wrote:
| there frankly needs to be a paper calling this out tho,
| because at this point there are a bunch of industry
| models following "llama laws" and nobody's really done
| the research, it's all monkey see, monkey do
| arugulum wrote:
| But what would they be calling out?
|
| If industry groups want to run a training run based on
| the configurations of a well-performing model, I don't
| see anything wrong with that. Now, if they were to claim
| that what they are doing is somehow "optimal", then there
| would be something to criticize.
| swyx wrote:
| poor choice of words, i probably mean sketching out the
| curves/doing ablation studies in a comprehensive way like
| the chinchilla paper did.
| swyx wrote:
| that said there's more data efficiency to be gained in the
| smol models space - Phi-1 achieved 51% on HumanEval with only
| a 5.3x token-param ratio: https://www.latent.space/p/cogrev-
| tinystories#details
| babelfish wrote:
| Why is the constant in the formula 6? In the transcript I see a
| mention of "6 tokens per parameter", but I'm not clear on why
| that is.
| swyx wrote:
| if you click through to the source doc (and we talk about it on
| the pod), it's basically 2 for the forward pass, 4 for the
| backward pass
| FanaHOVA wrote:
| It's 2 for forward pass, 4 for backward pass (2PD + 4PD = 6PD),
| but we didn't cover why it's 2 and 4 since it was a very deep
| rabbit hole. Maybe he'll do a separate post for that in the
| future!
| arugulum wrote:
| If you want a speedrun explanation for how we get to "2": In
| the limit of model scaling, context size doesn't matter (yes,
| forget about the quadratic attention), most of the compute is
| in the linear layers, which boil down to matrix multiplies.
| Consider a single matrix of size [T,d] multiplied by a weight
| of size [d,d]: the compute needed for that matrix
| multiplication is approximately 2Td^2 (the 2 coming from one
| multiply + one add). Swap T out for D, your whole dataset in
| tokens; d^2 is the number of parameters in a single linear
| layer, so scale that up to the full model size P, and you've
| got 2PD.
|
| Even shorter: the 2 comes from the multiply-add.
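|
| As a toy sketch of that counting (a rule-of-thumb estimate
| only, ignoring attention FLOPs and other smaller terms):
|
|     # forward: ~2 FLOPs per parameter per token (multiply + add)
|     # backward: ~4 FLOPs per parameter per token (gradients
|     #           w.r.t. activations and weights, ~2x the forward)
|     def train_flops(P, D):
|         forward = 2 * P * D
|         backward = 4 * P * D
|         return forward + backward  # = 6 * P * D
|
|     train_flops(7e9, 1.4e12)  # ~5.9e22, matching the estimate above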
| float-trip wrote:
| There's a breakdown here for anyone interested (ctrl+f
| "weight flops for")
|
| https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-
| la...
| xchip wrote:
| Ambiguous title, it should be "memory requirements for training
| LLMs"
| d4rkp4ttern wrote:
| Thanks. This is why I read HN comments before the article :)
| swyx wrote:
| this was the deepest dive we ever did into Transformers Math 101
| (https://news.ycombinator.com/item?id=35631546). we mostly cover
| AI Engineer/inference time things on the pod, but training
| knowledge is hard won and extremely valuable, and I love it when
| experts are willing to distill the rules of thumb for what they
| have learned.
|
| questions welcome!
| FanaHOVA wrote:
| Yes! I have a 5 pages doc of prep notes I made, happy to share
| :)
| rayval wrote:
| Please do! Thanks
| swyx wrote:
| related online discussion when this article came out:
|
| from Eleuther:
| https://twitter.com/AiEleuther/status/1648782486736969728?s=...
|
| from AI anon:
| https://twitter.com/nearcyan/status/1662937711156625408?s=20
|
| Stella Biderman (coauthor and now lead of the Eleuther
| Foundation, Stella if you're reading this pls come on our pod
| sometime!!!!)
| https://twitter.com/BlancheMinerva/status/164878587828883456...
| photon_lines wrote:
| Sorry for the criticism here, but your linked paper / article
| does nothing to explain the math behind transformers. You
| should re-name it something like 'Scaling Transformer
| Mathematics' or 'The Math Behind Scaling GPUs / Transformers'
| or whatever more descriptive name there is to describe what
| you're outlining in your article. 'Transformer Math 101' should
| at least explain 1) input embeddings 2) (keys, queries, values)
| and the idea behind the matrix multiplication of these linear /
| numeric sets of values 3) the Softmax computation in the
| original paper as well as the relevant matrix transformations
| which take place 4) dot products ( and how they relate to
| similarity scoring) 5) feed-forward neural networks and the
| gradient descent / backpropagation and how they work. There are
| many many concepts you didn't even touch upon. This is not
| 'Transformer Math 101' by any means.
| swyx wrote:
| fwiw i'm not the author of that doc, we just interviewed him,
| and the hn submission i linked to was also renamed presumably
| for similar concerns. we do have an "Algorithms 101" episode
| (in the theme of our Datasets 101 and Benchmarks 101 episodes)
| where we have at least some of your topics lined up
| tysam_and wrote:
| I think this is a lot of the mathematics of scaling LLM training.
| Which is quite important!
|
| One fundamental requirement though for any machine learning
| engineer working on these kinds of systems is
| https://people.math.harvard.edu/~ctm/home/text/others/shanno....
| I do not want to be entirely hypocritical as I am still ingesting
| this theory myself (started several years ago!), but I've found
| it _absolutely crucial_ in working in ML, as it implicitly
| informs every single decision you make when designing, deploying,
| and scaling neural networks.
|
| Without it, I feel the field turns into an empirical "stabby
| stab-around in the dark" kind of game, which very much has its
| dopamine highs and lows, but ala Sutton, does not scale very well
| in the long-run. ;P
| davnn wrote:
| Do you mean that information theory in general is essential in
| working with ML systems, or a specific point raised by Shannon?
| tysam_and wrote:
| I'm sure there's a specific point raised by Shannon, and I've
| been (very recently!) learning how Shannon himself is not the
| sole info theory dude (even though lots of people credit him
| for it!).
|
| But basically, the communication of information over a noisy
| channel _is_ the foundation for deep learning.
| arketyp wrote:
| As they say, intelligence is prediction is compression is
| encoding...
| tysam_and wrote:
| Well, I agree that encoding is compression, at least, but
| the rest of that statement I do disagree with! It seems
| to be one of the more common mantras going around right
| now, though. Which is partially why I advocate for
| Shannon's theory! It's very clean and bare metal in terms
| of being related to intelligence, though I think one
| could build a decent argument for intelligence from info
| theory.
| PeterisP wrote:
| > Well, I agree that encoding is compression, at least,
| but the rest of that statement I do disagree with!
|
| IMHO that Shannon's paper you linked is effectively the
| initial work for linking prediction with compression,
| showing how the information transmitted (and the
| necessity for transmitted information) decreases as you
| can more accurately predict the likelihoods of the next
| symbol.
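|
| One toy way to make that link concrete (illustrative numbers,
| not from the paper): an ideal code built from a predictive
| model q spends about -log2 q(symbol) bits per symbol, so the
| average code length is the cross-entropy of the predictor, and
| better prediction directly means fewer bits.
|
|     import math
|
|     # average bits/symbol for data drawn from p, coded with a
|     # code matched to the predictive model q (cross-entropy)
|     def avg_bits(p, q):
|         return -sum(p[s] * math.log2(q[s]) for s in p)
|
|     p = {"a": 0.7, "b": 0.2, "c": 0.1}   # true distribution
|     uniform = {s: 1 / 3 for s in p}      # weak predictor
|     better = {"a": 0.65, "b": 0.25, "c": 0.10}
|
|     avg_bits(p, uniform)  # ~1.58 bits/symbol
|     avg_bits(p, better)   # ~1.17 bits/symbol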
| tysam_and wrote:
| Yes, but prediction != compression, and intelligence !=
| prediction! However, prediction can inform compression,
| but there's necessarily some framework needed to
| translate the information from prediction<->compression.
| Perhaps that is too nitpicky, but to me it's like thinking
| about raw materials (like raw food or whatnot) and how they
| end up as a finished product (like cornbread): corn farming is
| not cornbread, but there is a semi-clear family of paths from
| one to the other, with its own set of optimizations, rules,
| etc.
|
| Again, that could be a bit nitpicky on my end depending
| on how one is viewing it.
| [deleted]
| davnn wrote:
| It probably is from the perspective of an information
| theorist. Did you read any interesting articles on the
| connections between deep learning and information theory to
| come to this conclusion? I'm highly interested in this
| space, but the influence of information theory on deep
| learning developments appears to be negligible.
| jackblemming wrote:
| Please list a single decision you've made that was directly
| influenced by Shannon's paper, and no I do not mean something
| you post hoc rationalized.
| jdkoeck wrote:
| Good challenge, I hope we will get a response.
| tysam_and wrote:
| Your wish hath been granted! <3 :))))
| tysam_and wrote:
| Sure. Basically everything in https://github.com/tysam-
| code/hlb-CIFAR10 was directly founded on concepts shared in
| the above paper, down to the coding, commenting, and layout
| styles (hence why I advocate so strongly for it as a
| requirement for ML. The empirical benefits are clear to me).
|
| Before I sat down and wrote my first line, I spent a very
| long time thinking about how to optimize the repo. Not just
| in terms of information flow during training, but how the
| code was laid out (minimize the expected value of deltas for
| changes from a superset of possible code changes), to even
| the explanatory comments (ratio of space vs mental effort to
| decode the repo for experienced vs inexperienced developers).
| I really want it to be a good exemplary model of a different,
| more scalable, and more efficient way of conducting small-
| scale (and potentially resource-constrained) research. To do
| that, you have to maximize information efficiency at every
| stage of the pipeline, including temporally (!!!!).
|
| It's not perfect, but I've used info theory as a strong
| guiding light for that repo. There's more to say here, but
| it's a long conversation about the expected utility of doing
| research a few different kinds of ways.
| [deleted]
| godelski wrote:
| There's something I like to tell people: "You don't need math
| to make a good model, but you need math to know why your model
| is wrong." All models are wrong, and actually one of the
| crucial points we're at is not enough concentration on how
| wrong our models are (or how wrong our evaluations are, which
| is also a type of model). I suggest also investing time in
| statistics, understanding higher dimensional spaces
| (topologically and statistically), and metric theory.
|
| It is always a "stab-around in the dark" unfortunately, but the
| deeper your mathematical understanding is the brighter your
| candle burns to help you poke around. I think a lack of
| mathematical understanding has also made people misread Sutton
| as endorsing "scale is all you need" rather than "flexibility
| combined with scale has historically resulted in the largest
| gains." These things are very different.
| tysam_and wrote:
| Ha! I think we may be on the same page with Sutton, that and
| the misuse of the NFL theorem are the two disclaimers I put
| out the most. My most recent Sutton one was ~3 hours ago!
| (https://news.ycombinator.com/item?id=37129921#37151192).
|
| That's a really good point, and I stand corrected. I guess it
| is still very much stabbing around in the dark, just with
| some more information about the probability density function
| of good answers. Heck, even Terry Tao's post about how to
| solve problems shows very much a (refined) guess-and-check
| method, so I have quite little ground to stand on there.
|
| Metric theory is fun, and one I'd like to learn a lot more
| about. I certainly have a lot to learn there.
| godelski wrote:
| Yeah I think the nature of research is de facto searching
| in the dark. I mean if it weren't, would it really be
| research? The analogy I use is that the knowledge space is
| a dark void and our collective knowledge is a set of permanent
| torches lighting up the area. Researchers are exploring
| into the darkness and finding suitable places to place the
| next torch. Maybe we can say that the torch's brightness is
| defined by how well known your work is, since people tend
| to the flames more (not to be confused with how important
| the work is).
|
| The big question to me is about the geometry of that
| knowledge space. Is it bounded? If bounded (most likely
| imo), how do we approach the bound? Do we plateau once we
| hit it (unlikely imo)? Do we approach
| {,sub,super}-linearly? Do we approach {,sub,super}-linearly
| and then transition into an asymptotic convergence? (e.g.
| S-curve) This question actually has a profound impact on
| the question of the risk of a super-intelligence. IMO
| knowledge is most likely bounded (and most likely looks like
| an S-curve). If we're more than halfway through that S-curve,
| then a super intelligence that is 100x smarter than us may
| only know <100x as much as us. We have good reason to believe
| this is the case, since knowledge appears to compound and
| accelerate (linear and sublinear models don't fit our historic
| trend, but that doesn't tell us the shape of the curve ahead).
| That reduces the risk of
| a super intelligence absolutely outsmarting us since it
| would be easier to close the gap, especially since learning
| through observation is easier than learning through
| exploration (can use its torches, even if they are more
| difficult to see). I'm not sure I see this really discussed
| much in the super intelligence conversations and that we're
| often working with different priors so that creates
| fundamental disagreements that become impassable. Maybe we
| need to set these priors first before we even discuss the
| next priors about motivation of a super intelligence.
|
| At least for math, I don't think I have the capacity to
| learn it all in the short human life, even if we're just
| discussing what math would be helpful to ML. But it sure is
| fun to learn it, so I guess the inability to learn it all
| is not really a downside :) Now if only I can convince
| others that math is fun __and__ useful to ML.
| eointierney wrote:
| This is well put.
|
| Mathematics is our candle in the darkness of our
| understanding, and the more math we know the brighter our
| candle flames.
|
| But Mathematics is not a candle, it is the purest analogy we
| have ever encountered, and it enlightens our minds.
|
| It gives us an appreciation of scale, and measurement
| thereof, that is unmatched, yielding deep insights into
| itself and everything else.
|
| I just wish everyone learned number theory and geometry and
| algebra and physics and and and...
|
| I just wish everyone would read the Road to Reality by
| Penrose.
|
| Models are little stepping stones that are slippy and unsure,
| but we must use them to take further steps into the darkness
| of our ignorance.
| kitanata wrote:
| [dead]
___________________________________________________________________
(page generated 2023-08-16 23:00 UTC)