[HN Gopher] Do large language models need all those layers?
___________________________________________________________________
Do large language models need all those layers?
Author : belter
Score : 121 points
Date : 2023-12-15 17:00 UTC (6 hours ago)
(HTM) web link (www.amazon.science)
(TXT) w3m dump (www.amazon.science)
| blamestross wrote:
| The answer is always "No it doesn't need all those nodes", but we
| do it that way because it makes doing the math easier. I bet they
| just removed a bunch of edge weights to nodes, but used the same
| matrices for calculating output.
|
| I wish arbitrary topology networks were scalable (I love NEAT),
| but bipartite graphs crunch well on GPUs.
| zaptrem wrote:
| Can you elaborate what you mean by "edge weights to nodes, but
| used the same matrices"?
| feanaro wrote:
| The network retains the same topology, and therefore the same
| matrix dimensions, but some elements are set to zero,
| removing their contribution.
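|
| A minimal numpy sketch of that idea (toy shapes and a made-up 70%
| threshold, not the paper's setup):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   W = rng.normal(size=(512, 512))    # dense weights, shape kept
|   x = rng.normal(size=512)
|
|   # "Prune" by zeroing the smallest-magnitude 70% of entries
|   mask = np.abs(W) >= np.quantile(np.abs(W), 0.7)
|   W_pruned = W * mask                # still 512x512, mostly zeros
|
|   y_dense = W @ x
|   y_pruned = W_pruned @ x            # same matmul kernel, same shapes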
| AlexErrant wrote:
| Relevant excerpt from
| https://news.ycombinator.com/item?id=38656172
|
| > ...if we meet the shape 512x512 (512 is a power of 2,
| therefore a very "round" number in computer science), then
| maybe some kernel will be very fast; but then on a
| 512x511 matrix, the same kernel may need to add some padding
| first to transform it into a round 512x512 matrix with
| zeros at the end of each row. Adding those zeros means
| shifting all rows, which is a very costly operation.
| eurekin wrote:
| I'm quite surprised so much was retained; many presentations on
| convolutional networks suggest adding a lot of features and then
| pruning most of them in a post-learning step. Supposedly lots of
| features, which are initialized randomly, just raise the
| probability of being close to the real answer (the concept
| being learned). Once it's found, most of the features typically
| go unused or are redundant.
| visarga wrote:
| Warning! Dated 18 Dec 2022 - Rethinking the Role of Scale for In-
| Context Learning: An Interpretability-based Case Study at 66
| Billion Scale
|
| https://arxiv.org/abs/2212.09095
|
| Since it's at its one-year anniversary, we can check citations:
| it has 16 so far
|
| https://scholar.google.com/scholar?start=0&hl=ro&as_sdt=2005...
| bjornsing wrote:
| > Finding that 70% of attention heads and 20% of feed-forward
| networks can be excised with minimal effect on in-context
| learning suggests that large language models are undertrained.
|
| So why do the larger models perform so much better...?
| sodality2 wrote:
| If they're all equally un-pruned, it sounds like they still
| maintain the same relative scaling of performance.
|
| Just like quantization!
| cubefox wrote:
| How does this answer the question?
| danielmarkbruce wrote:
| The question sort of implies you couldn't prune the smaller
| models and see the same thing. So, the answer given is to
| consider that in both cases, you sort of only use 30% of
| the model. Bigger is still bigger. The basic intuition of
| more parameters = better holds.
| bjornsing wrote:
| I don't follow...
| sodality2 wrote:
| The article mentions that models have a lot of extra
| information that is unnecessary. You asked why the large
| ones still outperform small ones. Presumably they all have
| that inefficiency. But the large ones are still better. 30%
| of a big number is still bigger than 30% of a small number.
| danielmarkbruce wrote:
| lottery ticket hypothesis might be real
| youngNed wrote:
| Because 70% of a big number is a lot more than 70% of a smaller
| number?
|
| Not being facetious, I don't know the answer, but that's my
| best guess
| sdenton4 wrote:
| Here's an explanation I hit on some time ago:
|
| More parameters make it easier to find low-energy
| solutions.
|
| Suppose we have a product of two variables z = x * y. And now
| suppose that the 'correct' product is z=2, and we're learning x
| and y. A very good analytical solution is x=1, y=2 (or vice
| versa), allowing us to eliminate either x or y from our learning
| problem. The total energy of (x, y) in this case is 1^2 + 2^2 =
| 5.
|
| However, another solution is x = y = sqrt(2), which has energy
| 2 + 2 = 4: this solution is closer to the origin. The extra
| variable means that we have a /surface/ of solutions instead of
| a unique solution, so we can home in on ones that are easier to
| get to using our optimizer.
|
| As you add more variables, you can find lower and lower energy
| solutions.
|
| Consider that we initialize neural networks 'near' zero, and
| then walk with gradient descent in some direction towards a
| solution. Then adding lots of extra variables - wiggle room -
| makes it much easier to find a solution within walking distance
| of the (noisy) origin.
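|
| A toy check of this (my own sketch, not from the article): run
| gradient descent on (x*y - 2)^2 starting near the origin and see
| where it lands.
|
|   # Minimize (x*y - 2)^2 with plain gradient descent
|   x, y = 0.01, 0.02        # initialize near zero, like NN weights
|   lr = 0.05
|
|   for _ in range(5000):
|       err = x * y - 2.0    # residual of the target product x*y = 2
|       gx, gy = 2 * err * y, 2 * err * x
|       x, y = x - lr * gx, y - lr * gy
|
|   # Both end up near sqrt(2) ~ 1.414 (energy ~4), not at (1, 2)
|   # (energy 5): the optimizer finds the solution nearest the init
|   print(x, y, x * y)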
| bjornsing wrote:
| Fits with my first intuitive guess: it's implicit
| regularization (that works as you describe).
|
| Would be interesting to try some explicit regularization. But
| unfortunately you need a million bucks to run an experiment on
| LLMs. :/
| gessha wrote:
| Do you know of any literature that looks into this? This is a
| pretty interesting hypothesis.
| jncfhnb wrote:
| Undertrained does not mean bad. It means it could be better.
|
| But I also disagree with the takeaway.
| og_kalu wrote:
| All LLMs are undertrained to some degree.
|
| Assuming the models are identical except that one is bigger, the
| bigger model is better because 70% of a bigger number is larger
| than 70% of a smaller number.
|
| Now if you train a smaller model much longer than the bigger
| model (more tokens), then you are reducing the level of "under-
| trainedness" to some degree. At some point, you _may_ have a
| smaller model that is better than that larger model.
|
| 70% of a bigger number may be larger than 70% of a smaller
| number, but there's no guarantee that 70% of a bigger number is
| larger than, say, 90% of a smaller number, and so on.
| earthboundkid wrote:
| You only use 10% of your LLM neural network.
| usgroup wrote:
| Take my upvote!
| dist-epoch wrote:
| Imagine if you could use 100% of your LLM. I think I saw a
| movie about that.
| lgxz wrote:
| Lucy: https://en.wikipedia.org/wiki/Lucy_(2014_film)
| GaggiX wrote:
| They should do the same test on Mistral 7B or Phi-2 instead of
| OPT-66B.
| changoplatanero wrote:
| I can believe that you can get good performance on 14 NLP tasks
| after pruning 70% of the model. But for things like chatgpt the
| use cases are way more diverse and you can't keep high
| performance for everything when you prune that many of the
| weights.
| lolinder wrote:
| You don't need to get good performance on everything with a
| single model--Mixture of Experts models can outperform
| equivalently sized monolithic models. My understanding is that
| GPT-4 is structured this way.
|
| If we can trim a model in _different_ ways to get different
| specializations, that could be really effective.
| novaRom wrote:
| I just tried the Mistral MoE 8x7B model and it works a bit faster
| than Llama-2-70B, but it looks like it has almost the same skills.
| In fact, all the latest models in the 13B-70B size range are quite
| similar. Could it be that a large part of their training data is
| the same?
| sigmoid10 wrote:
| >But for things like chatgpt the use cases are way more diverse
| and you can't keep high performance for everything when you
| prune that many of the weights.
|
| This also has become increasingly obvious from recent
| developments in the field. Today, we regularly see new models
| that come at a fraction of the size of GPT-3 and yet easily
| outperform it, especially on certain downstream tasks when
| fine-tuned correctly. These small models also retain some
| generality, but not as much as the really high-end, really large
| models like GPT-4. I'd say a sub-10B-parameter model equal to or
| better than GPT-3 overall is achievable, but not one on par with
| GPT-4. At least not with current technology. However, that would
| still
| imply that it's possible to reduce parameter counts in common
| approaches by 95%. I'm pretty sure in a few years people will
| look back and smirk at the crude methods we used to train LLMs
| today.
| k__ wrote:
| How big are GPT3 and GPT4 models?
| pizza wrote:
| 175B and rumored 1.76T params
| k__ wrote:
| How much on disk?
| pizza wrote:
| If fp32, 4 bytes per param, but the weights may have been
| quantized to lower precision (e.g. fp16, so 2 bytes/param).
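|
| Back-of-the-envelope, taking those (partly rumored) counts at face
| value: 175B params at 2 bytes each is roughly 350 GB on disk (700
| GB at fp32), and 1.76T params at fp16 would be roughly 3.5 TB.
| Quantized to 4 bits, the 175B model would come down to around 90
| GB.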
| Der_Einzige wrote:
| Correct, people in the model optimization world will claim all
| day that it only marginally impacts performance, but I've
| extensively played with quantization and pruning methods, and
| can report that they _do_ cripple models in ways that the
| authors either didn't notice or purposely omitted from their
| SOTA benchmark chasing.
|
| Most claims of models being even equal to GPT-3.5 are also
| significantly overblown. I haven't seen one yet below 70
| billion parameters which even comes close.
|
| Nothing is free in this world.
| zoogeny wrote:
| Forgive me if this is not a good question, but is there a
| difference here between training and inference?
| quadrature wrote:
| Yes. Think of it in terms of fitting a line to some points
| y=mx+b. Training is finding the right slope m and intercept b
| of the line to get a good fit to the points. Inference is when
| you take an x coordinate and find the y value using the
| "trained" m and b in the line equation
| zoogeny wrote:
| I'm not sure if that gives me an intuition on the title of
| the article: "Do large language models need all those layers"
|
| Am I interpreting you correctly if I say: "Finding the slope
| (training) may require those extra layers but finding a
| particular y value given a known x coordinate (inference)
| may not require those extra layers".
|
| What I mean is, does the answer to the article's question
| change if one is considering training vs. inference?
| quadrature wrote:
| Apologies, I thought you were asking a general question
| about ML. Will let someone else comment on the specifics
| here.
| xanderlewis wrote:
| I think I've misinterpreted it in the same way. I guess
| you're asking something like: if we can excise parts of
| a model without affecting the quality of inferences (in some
| particular domain), can we do the same with the training
| step? That is, is it necessary to train a model on a wide
| variety of topics in order to get high-quality
| 'understanding' for a particular application?
|
| If we don't need those weights at inference time, why do
| the computation to train them in the first place?
| ska wrote:
| The real answer is we don't know yet but it's
| interesting.
|
| To go back to your ax+b example, imagine instead you are
| fitting a much higher dimensional model, but you don't know
| how high: ax^n + bx^(n-1) + ... where n might be in the
| millions, or hundreds of millions, or?? We know that if we
| make the model high enough order (e.g. a degree n-1 fit
| through n training points will be "perfect"), it will
| overfit, so we throw in some regularization and a bit of
| handwavy tuning and end up with a model of, say,
| n=7213472123 and a set of a, b, ... which behaves pretty
| well, but from its behavior we suspect most of them don't
| matter, and that n maybe should be <= 2 million, or whatever.
|
| So, a few obvious questions. One is: can we find a way to
| throw out most of the a, b, c, ... to get just the core,
| i.e. if we throw away all coefficients with |k| <= 0.00001,
| does it change anything (for inference)? A very different
| question is: could we decide that ahead of time (during
| training)? A different class of question looks more like
| "could we have figured this out from the data".
|
| It's a _lot_ harder to reason about the latter questions,
| because the former one is empirical: after training, this
| one doesn't seem to do anything. Ahead of time, how do
| you know? This has interesting offshoots, like how stable
| the distribution of the parts that matter is, etc.
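|
| A tiny concrete version of the "throw away all |k| <= epsilon"
| question (my own sketch, with a low-degree stand-in for the huge-n
| case and a made-up 0.05 threshold):
|
|   import numpy as np
|   from numpy.polynomial import chebyshev as C
|
|   rng = np.random.default_rng(1)
|   x = np.linspace(-1, 1, 200)
|   y = 2 * x**3 - x + 0.5 + rng.normal(scale=0.05, size=x.size)
|
|   # Deliberately fit a much higher-order model than the data needs
|   coeffs = C.chebfit(x, y, deg=15)
|
|   # The empirical question: zero every small coefficient afterwards
|   pruned = np.where(np.abs(coeffs) > 0.05, coeffs, 0.0)
|
|   print(np.count_nonzero(pruned), "of", coeffs.size, "kept")
|   # The predictions change very little despite most coefficients
|   # being gone
|   print(np.max(np.abs(C.chebval(x, coeffs) - C.chebval(x, pruned))))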
| ImprobableTruth wrote:
| (I assume this question is about whether models need all those
| layers during training, even if they don't need them during
| inference)
|
| Yes. There's the so-called "lottery ticket hypothesis".
| Essentially the idea is that large models start with many
| randomly initialized subnetworks ("lottery tickets") and that
| training finds which ones work best. Then it's only natural
| that during inference we can prune all the "losing tickets"
| away, even though we need them during training.
|
| It's kind of an open question how large this effect is though.
| As the article mentions, if you can prune a lot away, this
| could also just mean that the network isn't optimally trained.
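|
| A toy sketch of that procedure (my own minimal PyTorch setup, not
| anything from the paper, with a made-up 70% pruning fraction):
| train dense, keep the largest 30% of weights, rewind the survivors
| to their initial values, and retrain only that subnetwork.
|
|   import copy
|   import torch
|   import torch.nn as nn
|
|   torch.manual_seed(0)
|
|   # Tiny regression task: learn y = sin(x)
|   x = torch.linspace(-3, 3, 256).unsqueeze(1)
|   y = torch.sin(x)
|
|   model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
|   # Remember the initial weights (the "ticket" init)
|   init_state = copy.deepcopy(model.state_dict())
|
|   def train(model, masks=None, steps=2000):
|       opt = torch.optim.Adam(model.parameters(), lr=1e-2)
|       for _ in range(steps):
|           opt.zero_grad()
|           loss = nn.functional.mse_loss(model(x), y)
|           loss.backward()
|           opt.step()
|           if masks is not None:          # keep pruned weights at zero
|               with torch.no_grad():
|                   for p, m in zip(model.parameters(), masks):
|                       p.mul_(m)
|       return loss.item()
|
|   dense_loss = train(model)
|
|   # Keep only the largest-magnitude 30% of each tensor
|   masks = [(p.abs() >= torch.quantile(p.abs(), 0.7)).float()
|            for p in model.parameters()]
|
|   # Rewind survivors to the original init, retrain the subnetwork
|   model.load_state_dict(init_state)
|   with torch.no_grad():
|       for p, m in zip(model.parameters(), masks):
|           p.mul_(m)
|   ticket_loss = train(model, masks=masks)
|
|   print("dense loss:", dense_loss, "30% subnetwork:", ticket_loss)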
| zoogeny wrote:
| I figured this must be a well-known property of neural
| networks. I'll do some reading on the lottery ticket
| hypothesis. That is almost exactly what I was thinking when
| reading the article: sure after you have trained it you can
| prune the layers that aren't used. But I wasn't sure you
| could know/guess which layers will be unused before you
| train.
|
| It strikes me as an interesting open question: if it is
| the case that you need big networks for training but can use
| significantly smaller "pruned" networks for inference, there
| are many, many reasons why that might be true. Determining
| which of the possible reasons is the actual reason may be a
| key in understanding how LLMs work.
| jncfhnb wrote:
| Training is making the model (or rather going from something
| random and useless to something well calibrated and useful).
| Inference is using it to make a prediction.
|
| This is saying that you don't need the entire model to make
| good predictions for specific subsets of tasks. You can
| literally remove a large part of the model and it will do fine.
| Which is not very controversial. The model, after being
| trained, is a large collection of interacting nodes. When this
| is talking about dropping chunks of the model it means dropping
| nodes after training to make predictions. The advantage
| primarily being that smaller models are cheaper and faster to
| run or modify with further training.
|
| You know that meme about how you only use 10% of your brain at
| a time? Well, yeah, but the idiot movies that suggest using
| 100% of your brain would make you impossibly smarter are not
| correct. 90% of your brain just isn't relevant. More brain /
| model is not better than the relevant subset alone.
|
| The important question to be asking is whether you can remove
| large chunks of the model without hurting its ability to do
| well generally on whatever you ask it.
|
| As a very crude example, imagine you trained a simple model to
| predict rainfall using a weather monitor and the number of
| farts you did last week. The model will probably learn that the
| monitor is useful and the farts are irrelevant. If this were
| as simple as a linear regression, you could just remove the
| farts coefficient from the equation and the model would produce
| the same outcomes. Neural nets are not so easily
| observed but it's still just dropping the irrelevant bits to
| whatever you're trying to do.
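|
| A throwaway numpy version of that toy (entirely made-up data):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   n = 500
|   humidity = rng.uniform(0, 100, n)      # the useful weather signal
|   farts = rng.poisson(10, n)             # the irrelevant feature
|   rain = 0.3 * humidity + rng.normal(scale=2, size=n)
|
|   # Fit with both features: the fart coefficient comes out near zero
|   X = np.column_stack([humidity, farts, np.ones(n)])
|   coef, *_ = np.linalg.lstsq(X, rain, rcond=None)
|   print(coef)
|
|   # Refit without it: predictions move far less than the noise level
|   X_small = np.column_stack([humidity, np.ones(n)])
|   coef_small, *_ = np.linalg.lstsq(X_small, rain, rcond=None)
|   print(np.max(np.abs(X @ coef - X_small @ coef_small)))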
| xanderlewis wrote:
| The use of the word 'inference' in this context can seem a bit
| weird, but I think it's borrowed from statistics and it's quite
| standard.
|
| Training = optimising model parameters to 'learn' from data.
|
| Inference = asking the model to make a prediction, usually
| assuming the model is already trained.
|
| Instead of _inference_, you could say _running/querying the
| model_.
| reqo wrote:
| >Finding that 70% of attention heads and 20% of feed-forward
| networks can be excised with minimal effect on in-context
| learning suggests that large language models are undertrained.
|
| I thought that it was common knowledge that LLMs are
| undertrained, none of the publicly available loss graphs show any
| sign of convergence!
| novaRom wrote:
| It's a disadvantage of current SOTA models: they are easy to
| train, but they must be large, wasting lots of weights, in order
| to generalize well. Maybe another architecture, the transformer's
| successor, will be more economical - having fewer weights with
| more skills and knowledge.
| k__ wrote:
| Can we use unoptimized models to train optimized ones?
| versteegen wrote:
| Maybe that's why the Phi models do so well for their size. I'm
| guessing they may have been trained close to convergence, but
| the loss graphs aren't published. Phi-1.5 (1.3B parameters) was
| trained on 150B tokens (5 epochs), yet phi-1.5-web was trained
| on 300B so they didn't stop for lack of compute. Phi-2 (2.7B
| params) was trained on 1.4T tokens, epochs unknown.
| Legend2440 wrote:
| The OPT models they study here, especially, were known to be
| undertrained. It would be interesting to compare to a more
| modern model like Llama 2.
| praveen9920 wrote:
| If it is true, does that mean that we can "compress" the models
| for efficient inference on smaller devices? That would be
| wonderful
| CuriouslyC wrote:
| That is literally what quantization and distillation are.
| WhitneyLand wrote:
| There are lots of ways to make LLMs more efficient:
|
| - Pruning
|
| - Distillation
|
| - Sparse transformers
|
| - Mixture of experts
|
| - Quantization
|
| My understanding is that none of these are free; they all come
| with various trade-offs.
|
| For example, MoE lets Mistral beat similarly sized models, and the
| inference performance stays close (only an incremental increase).
| But the training time is way more than for a typical 7B model.
|
| But which of these approaches gives the most bang for the buck?
|
| Also consider it's not either/or many of these techniques can be
| combined.
|
| And maybe worst of all, some of the testing that can be done to
| find out doesn't give the same answer with smaller/toy models.
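|
| For a feel of what the mechanically simplest of these looks like,
| here's a minimal post-training quantization sketch (toy tensor,
| per-tensor symmetric int8, nothing vendor-specific):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
|
|   # Symmetric int8 quantization, one scale for the whole tensor
|   scale = np.abs(w).max() / 127.0
|   w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
|
|   # Dequantize for use (or keep int8 and fold scale into the matmul)
|   w_hat = w_q.astype(np.float32) * scale
|
|   print("bytes:", w.nbytes, "->", w_q.nbytes)       # 4x smaller
|   print("max abs error:", np.abs(w - w_hat).max())  # <= scale/2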
| rolisz wrote:
| The Mistral MoE model is 8x7B parameters, so you should compare
| its training time to similarly sized models, not to 7B ones.
| versteegen wrote:
| Mixtral 8x7B actually has 46.7B total parameters, not 8*7B =
| 56B. The reason being that not all parameters are multiplied
| 8x.
|
| Also it uses 12.9B parameters per token, not quite comparable
| to 7B models.
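|
| Rough back-of-the-envelope consistent with those numbers: if S is
| the shared (attention, embeddings, etc.) parameter count and E is
| one expert's feed-forward parameters, then S + 8E ~ 46.7B total
| and, with top-2 routing, S + 2E ~ 12.9B active per token. Solving
| gives roughly E ~ 5.6B per expert and S ~ 1.6B shared, which is
| why "8x7B" doesn't mean 56B.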
| tbruckner wrote:
| Accidentally read this as "Do large companies need all those
| lawyers?"
| jmward01 wrote:
| We quantize models and get nearly the same performance and we
| keep seeing new 'small' models performing on par with older
| larger models. Additionally, people don't need nearly the amount
| of data that models do in order to learn the same tasks. This
| implies a lot about how fast it may be possible to learn and how
| far models currently are from that optimum. Because of this I
| think the authors' conclusions about scaling data up really only
| apply to current models and training techniques. Based on how
| fast people can learn, I believe we are likely missing a few
| things that will allow models to learn vastly faster and deeper
| with fewer total parameters. My own work, as an example (github
| jmward01/lmplay), may show how simple changes to embedding
| training can make drastic improvements to model training.
| fluidcruft wrote:
| I was at RSNA this year (a major radiology conference and trade
| show) and one of the presenters made the claim that their model
| was generalized because it works on different body parts. My
| intuition was that they were claiming they trained it on one body
| part and it subsequently worked on different body parts (which
| could be convincing). So at first this seemed fine. But in
| reality they had trained the same model on all of those body
| parts. That really got me thinking about the old myth that we
| only use 10% of our brains. Anyway, I think capacity would be when
| the model can no longer learn.
|
| But anyway it made me wonder if there's a way to measure "what x%
| of a model is actually used" similar to the myths about human
| brains.
___________________________________________________________________
(page generated 2023-12-15 23:01 UTC)