[HN Gopher] Do large language models need all those layers?
       ___________________________________________________________________
        
       Do large language models need all those layers?
        
       Author : belter
       Score  : 121 points
       Date   : 2023-12-15 17:00 UTC (6 hours ago)
        
 (HTM) web link (www.amazon.science)
 (TXT) w3m dump (www.amazon.science)
        
       | blamestross wrote:
       | The answer is always "No it doesn't need all those nodes", but we
       | do it that way because it makes doing the math easier. I bet they
       | just removed a bunch of edge weights to nodes, but used the same
       | matrices for calculating output.
       | 
       | I wish arbitrary topology networks were scalable (I love NEAT)
        | but bipartite graphs crunch well on GPUs.
        
         | zaptrem wrote:
          | Can you elaborate on what you mean by "edge weights to nodes, but
         | used the same matrices"?
        
           | feanaro wrote:
           | The network retains the same topology, and therefore the same
           | matrix dimensions, but some elements are set to zero,
           | removing their contribution.
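            | 
            | A minimal sketch of that (my own, assuming numpy): the
            | matrix keeps its shape, so the matmul kernel is unchanged,
            | but the zeroed entries contribute nothing.
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     W = rng.normal(size=(512, 512)).astype(np.float32)
            |     x = rng.normal(size=512).astype(np.float32)
            | 
            |     # Zero the smallest-magnitude 70% of the weights; the
            |     # shape, and therefore the matrix multiply, is unchanged.
            |     threshold = np.quantile(np.abs(W), 0.70)
            |     W_pruned = np.where(np.abs(W) < threshold, 0.0, W)
            | 
            |     print(W_pruned.shape)                  # still (512, 512)
            |     print((W_pruned == 0).mean())          # ~70% zeros
            |     print(np.abs(W @ x - W_pruned @ x).max())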
        
             | AlexErrant wrote:
             | Relevant excerpt from
             | https://news.ycombinator.com/item?id=38656172
             | 
             | > ...if we meet the shape 512x512 (512 is a power of 2,
             | therefore a very "round" number in computer science), then
             | maybe some kernel will be very fast; but then on a
              | 512x511 matrix, the same kernel may need to add some padding
             | first to transform it into a round 512x512 matrix with
             | zeros at the end of each row. Adding those zeros means
             | shifting all rows, which is a very costly operation.
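              | 
              | A quick illustration of that padding step (my own sketch,
              | assuming numpy; real kernels do this at a much lower
              | level):
              | 
              |     import numpy as np
              | 
              |     a = np.ones((512, 511), dtype=np.float32)
              | 
              |     # Append one zero to every row to reach a "round"
              |     # 512x512; in row-major storage every row must move.
              |     padded = np.pad(a, ((0, 0), (0, 1)))
              |     print(padded.shape)   # (512, 512)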
        
         | eurekin wrote:
          | I'm quite surprised so much was retained; many presentations on
          | convolutional networks suggest adding a lot of features and then
          | pruning most of them in a post-learning step. Supposedly lots of
          | features, initialized randomly, just raise the probability of
          | being close to the real answer (the concept being learned). Once
          | it's found, most of the features go unused or are redundant.
        
       | visarga wrote:
       | Warning! dated 18 Dec 2022 - Rethinking the Role of Scale for In
       | Context Learning: An Interpretability based Case Study at 66
       | Billion Scale
       | 
       | https://arxiv.org/abs/2212.09095
       | 
        | Since the paper is at its one-year anniversary, we can check
        | citations: it has 16 so far.
       | 
       | https://scholar.google.com/scholar?start=0&hl=ro&as_sdt=2005...
        
       | bjornsing wrote:
       | > Finding that 70% of attention heads and 20% of feed-forward
       | networks can be excised with minimal effect on in-context
       | learning suggests that large language models are undertrained.
       | 
       | So why do the larger models perform so much better...?
        
         | sodality2 wrote:
         | If they're all equally un-pruned, sounds like they still
         | maintain their linear scale of performance.
         | 
         | Just like quantization!
        
           | cubefox wrote:
           | How does this answer the question?
        
             | danielmarkbruce wrote:
             | The question sort of implies you couldn't prune the smaller
             | models and see the same thing. So, the answer given is to
             | consider that in both cases, you sort of only use 30% of
             | the model. Bigger is still bigger. The basic intuition of
             | more parameters = better holds.
        
           | bjornsing wrote:
           | I don't follow...
        
             | sodality2 wrote:
             | The article mentions that models have a lot of extra
             | information that is unnecessary. You asked why the large
              | ones still outperform small ones. Presumably they all have
              | that inefficiency, but the large ones are still better: 30%
             | of a big number is still bigger than 30% of a small number.
        
         | danielmarkbruce wrote:
         | lottery ticket hypothesis might be real
        
         | youngNed wrote:
         | Because 70% of a big number is a lot more than 70% of a smaller
         | number?
         | 
         | Not being facetious, I don't know the answer, but that's my
         | best guess
        
         | sdenton4 wrote:
         | Here's an explanation I hit on some time ago:
         | 
          | More parameters make it easier to find low-energy solutions.
          | 
          | Suppose we have a product of two variables, z = x * y, the
          | 'correct' product is z = 2, and we're learning x and y. A very
          | good analytical solution is x=1, y=2 (or vice versa), allowing
          | us to eliminate either x or y from our learning problem. The
          | total energy of (x, y) in this case is 1^2 + 2^2 = 5.
          | 
          | However, another solution is x = y = sqrt(2), which has energy
          | 2 + 2 = 4: this solution is closer to the origin. The extra
          | variable means that we have a /surface/ of solutions instead
          | of a unique solution, so we can home in on ones that are
          | easier to reach with our optimizer.
         | 
         | As you add more variables, you can find lower and lower energy
         | solutions.
         | 
         | Consider that we initialize neural networks 'near' zero, and
         | then walk with gradient descent in some direction towards a
         | solution. Then adding lots of extra variables - wiggle room -
         | makes it much easier to find a solution within walking distance
         | of the (noisy) origin.
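          | 
          | A minimal numpy sketch of this (my own toy, not from the
          | article; the exact values depend on the random seed):
          | 
          |     import numpy as np
          | 
          |     # Fit x*y = 2 by gradient descent on L = (x*y - 2)^2,
          |     # starting near the origin, and see where we land.
          |     rng = np.random.default_rng(0)
          |     x, y = rng.normal(scale=0.1, size=2)
          |     lr = 0.01
          |     for _ in range(20000):
          |         err = x * y - 2.0
          |         # dL/dx ~ err*y, dL/dy ~ err*x (factor folded into lr)
          |         x, y = x - lr * err * y, y - lr * err * x
          | 
          |     print(x, y, x * y, x**2 + y**2)
          |     # Tends to land near x = y = +/- sqrt(2), the low-energy
          |     # solution (x^2 + y^2 ~ 4), rather than at (1, 2) (energy 5).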
        
           | bjornsing wrote:
           | Fits with my first intuitive guess: it's implicit
           | regularization (that works as you describe).
           | 
           | Would be interesting to try some explicit regularization. But
            | unfortunately you need a million bucks to run an experiment
            | on LLMs. :/
        
           | gessha wrote:
           | Do you know of any literature that looks into this? This is a
           | pretty interesting hypothesis.
        
         | jncfhnb wrote:
         | Undertrained does not mean bad. It means it could be better.
         | 
          | But I also disagree with the takeaway.
        
         | og_kalu wrote:
          | All LLMs are undertrained to some degree.
          | 
          | Assuming the models are identical except that one is bigger,
          | the bigger model is better because 70% of a bigger number is
          | larger than 70% of a smaller number.
          | 
          | Now if you train a smaller model much longer than the bigger
          | model (more tokens), you reduce the level of "under-
          | trainedness" to some degree. At some point, you _may_ end up
          | with a smaller model that is better than that larger model.
          | 
          | 70% of a bigger number may be larger than 70% of a smaller
          | number, but there is no guarantee that 70% of a bigger number
          | is larger than, say, 90% of a smaller number, and so on.
        
       | earthboundkid wrote:
       | You only use 10% of your LLM neural network.
        
         | usgroup wrote:
         | Take my upvote!
        
         | dist-epoch wrote:
         | Imagine if you could use 100% of your LLM. I think I saw a
         | movie about that.
        
           | lgxz wrote:
           | Lucy: https://en.wikipedia.org/wiki/Lucy_(2014_film)
        
       | GaggiX wrote:
       | They should do the same test on Mistral 7B or Phi-2 instead of
       | OPT-66B.
        
       | changoplatanero wrote:
       | I can believe that you can get good performance on 14 NLP tasks
        | after pruning 70% of the model. But for something like ChatGPT
        | the use cases are far more diverse, and you can't keep high
        | performance on everything when you prune that many weights.
        
         | lolinder wrote:
         | You don't need to get good performance on everything with a
         | single model--Mixture of Experts models can outperform
         | equivalently sized monolithic models. My understanding is that
         | GPT-4 is structured this way.
         | 
         | If we can trim a model in _different_ ways to get different
         | specializations, that could be really effective.
        
           | novaRom wrote:
            | I just tried the Mistral MoE 8x7B model, and it runs a bit
            | faster than Llama-2-70B, but it looks like it has almost the
            | same skills. In fact, all the latest models in the 13B-70B
            | range are quite similar. Could it be that a large part of
            | their training data is the same?
        
         | sigmoid10 wrote:
         | >But for things like chatgpt the use cases are way more diverse
         | and you can't keep high performance for everything when you
         | prune that many of the weights.
         | 
          | This has also become increasingly obvious from recent
          | developments in the field. Today we regularly see new models
          | that come in at a fraction of the size of GPT-3 and yet easily
          | outperform it, especially on certain downstream tasks when
          | fine-tuned correctly. These small models also retain some
          | generality, but not as much as the really high-end, really
          | large models like GPT-4. I'd say a sub-10B-parameter model
          | equal to or better than GPT-3 overall is achievable, but not
          | one matching GPT-4. At least not with current technology.
          | However, even that would imply it's possible to reduce
          | parameter counts in common approaches by 95%. I'm pretty sure
          | in a few years people will look back and smirk at the crude
          | methods we used to train LLMs today.
        
           | k__ wrote:
           | How big are GPT3 and GPT4 models?
        
             | pizza wrote:
             | 175B and rumored 1.76T params
        
               | k__ wrote:
               | How much on disk?
        
               | pizza wrote:
                | If fp32, 4 bytes per param, but the weights may have
                | been quantized to lower precision (e.g. fp16, so 2 bytes
                | per param).
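                | 
                | Rough back-of-the-envelope for disk size (my arithmetic,
                | using the rumored counts above):
                | 
                |     # size = params * bytes_per_param
                |     for params in (175e9, 1.76e12):   # GPT-3 / rumored GPT-4
                |         for bpp in (4, 2):            # fp32 / fp16
                |             print(params, bpp, params * bpp / 1e9, "GB")
                |     # 175B:  ~700 GB (fp32) or ~350 GB (fp16)
                |     # 1.76T: ~7,040 GB (fp32) or ~3,520 GB (fp16)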
        
         | Der_Einzige wrote:
          | Correct. People in the model-optimization world will claim all
          | day that it only marginally impacts performance, but I've
          | played extensively with quantization and pruning methods and
          | can report that they _do_ cripple models in ways the authors
          | either didn't notice or purposely omitted from their SOTA
          | benchmark chasing.
         | 
         | Most claims of models being even equal to GPT-3.5 are also
         | significantly overblown. I haven't seen one yet below 70
         | billion parameters which even comes close.
         | 
         | Nothing is free in this world.
        
       | zoogeny wrote:
       | Forgive me if this is not a good question, but is there a
       | difference here between training and inference?
        
         | quadrature wrote:
         | Yes. Think of it in terms of fitting a line to some points
         | y=mx+b. Training is finding the right slope m and intercept b
         | of the line to get a good fit to the points. Inference is when
          | you take an x coordinate and compute the y value using the
          | "trained" m and b in the line equation.
        
           | zoogeny wrote:
           | I'm not sure if that gives me an intuition on the title of
           | the article: "Do large language models need all those layers"
           | 
           | Am I interpreting you correctly if I say: "Finding the slope
           | (training) may require those extra layers but finding a
           | particular y value given an known x coordinate (inference)
           | may not require those extra layers".
           | 
           | What I mean is, does the answer to the article's question
           | change if one is considering training vs. inference?
        
             | quadrature wrote:
             | Apologies, I thought you were asking a general question
             | about ML. Will let someone else comment on the specifics
             | here.
        
               | xanderlewis wrote:
               | I think I've misinterpreted it in the same way. I guess
               | you're asking something like: if we can exorcise parts of
               | a model without affecting quality of inferences (in some
               | particular domain), can we do the same with the training
               | step? That is, is it necessary to train a model on a wide
               | variety of topics in order to get high-quality
               | 'understanding' for a particular application?
               | 
               | If we don't need those weights at inference time, why do
               | the computation to train them in the first place?
        
               | ska wrote:
                | The real answer is we don't know yet, but it's
                | interesting.
                | 
                | To go back to the mx+b example, imagine instead you are
                | fitting a much higher-dimensional model, but you don't
                | know how high: ax^n + bx^(n-1) + ..., where n might be
                | in the millions, or hundreds of millions, or?? We know
                | that if we make the model high enough order (e.g. one
                | less than the number of training points gives a
                | "perfect" fit) it will overfit, so we throw in some
                | regularization and a bit of handwavy tuning and end up
                | with a model of, say, n=7213472123 and a set of
                | coefficients a, b, ... which behaves pretty well, but
                | from its behavior we suspect most of them don't matter,
                | and that n maybe should be <= 2 million, or whatever.
                | 
                | So, a few obvious questions. One is: can we find a way
                | to throw out most of the a, b, c, ... to get just the
                | core? I.e., if we throw away all coefficients with |k|
                | <= 0.00001, does it change anything (for inference)? A
                | very different question is: could we have decided that
                | ahead of time (during training)? A different class of
                | question looks more like "could we have figured this out
                | from the data".
                | 
                | It's a _lot_ harder to reason about the latter questions,
                | because the former one is empirical: after training,
                | this one doesn't seem to do anything. Ahead of time, how
                | do you know? This has interesting offshoots, like how
                | stable the distribution of the parts that matter is, etc.
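                | 
                | A toy version of the first, empirical question (my own
                | sketch, assuming numpy; the threshold is arbitrary):
                | 
                |     import numpy as np
                |     from numpy.polynomial import legendre as leg
                | 
                |     rng = np.random.default_rng(1)
                |     x = np.linspace(-1, 1, 200)
                |     y = 2 * x**3 - x + 0.5
                |     y += rng.normal(scale=0.01, size=x.size)
                | 
                |     # Deliberately over-parameterized fit of a cubic
                |     coefs = leg.legfit(x, y, deg=15)
                | 
                |     # Throw away all coefficients with |k| <= 0.05
                |     pruned = np.where(np.abs(coefs) <= 0.05, 0.0, coefs)
                | 
                |     print(np.count_nonzero(pruned), "of", coefs.size)
                |     diff = leg.legval(x, coefs) - leg.legval(x, pruned)
                |     print("max prediction change:", np.abs(diff).max())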
        
         | ImprobableTruth wrote:
         | (I assume this question is about whether models need all those
         | layers during training, even if they don't need them during
         | inference)
         | 
         | Yes. There's the so called "lottery ticket hypothesis".
          | Essentially the idea is that large models start with many
          | randomly initialized subnetworks ("lottery tickets") and that
          | training finds which ones work best. Then it's only natural
         | that during inference we can prune all the "losing tickets"
         | away, even though we need them during training.
         | 
         | It's kind of an open question how large this effect is though.
         | As the article mentions, if you can prune a lot away, this
         | could also just mean that the network isn't optimally trained.
        
           | zoogeny wrote:
           | I figured this must be a well-known property of neural
           | networks. I'll do some reading on the lottery ticket
           | hypothesis. That is almost exactly what I was thinking when
           | reading the article: sure after you have trained it you can
           | prune the layers that aren't used. But I wasn't sure you
           | could know/guess which layers will be unused before you
           | train.
           | 
           | It strikes me as an interesting open question since if it is
           | the case that you need big networks for training but can use
           | significantly smaller "pruned" networks for inference there
           | are many, many reasons why that might be true. Determining
           | which of the possible reasons is the actual reason may be a
           | key in understanding how LLMs work.
        
         | jncfhnb wrote:
         | Training is making the model (or rather going from something
         | random and useless to something well calibrated and useful).
         | Inference is using it to make a prediction.
         | 
         | This is saying that you don't need the entire model to make
         | good predictions for specific subsets of tasks. You can
         | literally remove a large part of the model and it will do fine.
         | Which is not very controversial. The model, after being
         | trained, is a large collection of interacting nodes. When this
         | is talking about dropping chunks of the model it means dropping
         | nodes after training to make predictions. The advantage
         | primarily being that smaller models are cheaper and faster to
         | run or modify with further training.
         | 
         | You know that meme about how you only use 10% of your brain at
         | a time? Well, yeah, but the idiot movies that suggest using
         | 100% of your brain would make you impossibly smarter are not
         | correct. 90% of your brain just isn't relevant. More brain /
         | model is not better than the relevant subset alone.
         | 
          | The important question to ask is whether you can remove large
          | chunks of the model without hurting its ability to do well
          | generally on whatever you ask it.
          | 
          | As a very crude example, imagine you trained a simple model to
          | predict rainfall using a weather monitor and the number of
          | farts you did last week. The model will probably learn that
          | the monitor is useful and the farts are irrelevant. If this
          | were as simple as a linear regression, you could just remove
          | the farts coefficient from the equation and the model would
          | produce the same outcomes. Neural nets are not so easily
          | inspected, but it's still just dropping the bits that are
          | irrelevant to whatever you're trying to do.
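          | 
          | A minimal numpy sketch of that crude example (mine, with
          | made-up data):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     n = 500
          |     humidity = rng.uniform(0, 1, n)     # informative feature
          |     farts = rng.integers(0, 30, n)      # irrelevant feature
          |     rain = 5 * humidity + rng.normal(scale=0.1, size=n)
          | 
          |     X = np.stack([humidity, farts, np.ones(n)], axis=1)
          |     w, *_ = np.linalg.lstsq(X, rain, rcond=None)
          | 
          |     # "Prune" the irrelevant coefficient and compare outputs
          |     w_pruned = w.copy()
          |     w_pruned[1] = 0.0
          |     print(w)                             # farts weight ~ 0
          |     print(np.abs(X @ w - X @ w_pruned).max())  # small change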
        
         | xanderlewis wrote:
         | The use of the word 'inference' in this context can seem a bit
         | weird, but I think it's borrowed from statistics and it's quite
         | standard.
         | 
         | Training = optimising model parameters to 'learn' from data.
         | 
         | Inference = asking the model to make a prediction, usually
         | assuming the model is already trained.
         | 
          | Instead of _inference_, you could say _running/querying the
          | model_.
        
       | reqo wrote:
       | >Finding that 70% of attention heads and 20% of feed-forward
       | networks can be excised with minimal effect on in-context
       | learning suggests that large language models are undertrained.
       | 
        | I thought it was common knowledge that LLMs are undertrained;
        | none of the publicly available loss graphs show any sign of
        | convergence!
        
         | novaRom wrote:
          | It's a disadvantage of current SOTA models: they are easy to
          | train, but they must be large, wasting lots of weights, in
          | order to generalize well. Maybe another architecture, the
          | transformer's successor, will be more economical: fewer
          | weights with more skills and knowledge.
        
         | k__ wrote:
          | Can we use unoptimized models to train optimized ones?
        
         | versteegen wrote:
         | Maybe that's why the Phi models do so well for their size. I'm
         | guessing they may have been trained close to convergence, but
         | the loss graphs aren't published. Phi-1.5 (1.3B parameters) was
         | trained on 150B tokens (5 epochs), yet phi-1.5-web was trained
         | on 300B so they didn't stop for lack of compute. Phi-2 (2.7B
         | params) was trained on 1.4T tokens, epochs unknown.
        
         | Legend2440 wrote:
          | The OPT models they study here, in particular, were known to
          | be undertrained. It would be interesting to compare with a
          | more modern model like Llama 2.
        
       | praveen9920 wrote:
       | If it is true, does that mean that we can "compress" the models
       | for efficient inference on smaller devices? That would be
       | wonderful
        
         | CuriouslyC wrote:
         | That is literally what quantization and distillation are
        
       | WhitneyLand wrote:
        | There are lots of ways to make LLMs more efficient:
       | 
       | - Pruning
       | 
       | - Distillation
       | 
       | - Sparse transformers
       | 
        | - Mixture of experts
       | 
        | - Quantization
       | 
        | My understanding is that none of these are free; they all come
        | with various trade-offs.
        | 
        | For example, MoE lets Mistral beat similarly sized models, and
        | the inference performance stays close (only an incremental
        | increase), but the training time is way more than for a typical
        | 7B model.
       | 
       | But which of these approaches gives the most bang for the buck?
       | 
        | Also consider that it's not either/or: many of these techniques
        | can be combined.
       | 
       | And maybe worst of all, some of the testing that can be done to
       | find out doesn't give the same answer with smaller/toy models.
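        | 
        | As one concrete illustration of "not free", a minimal int8
        | weight-quantization sketch (my own, assuming numpy): the round-
        | trip error is the price paid for a 4x smaller tensor.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     w = rng.normal(size=(4096, 4096)).astype(np.float32)
        | 
        |     # Symmetric per-tensor int8 quantization
        |     scale = np.abs(w).max() / 127.0
        |     w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        | 
        |     # Dequantize and measure the error introduced
        |     w_hat = w_int8.astype(np.float32) * scale
        |     print("mean abs error:", np.abs(w - w_hat).mean())
        |     print("bytes:", w.nbytes, "->", w_int8.nbytes)  # 4x smaller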
        
         | rolisz wrote:
          | The Mistral MoE model is 8x7B parameters, so you should
          | compare training time to similarly sized models, not to 7B
          | ones.
        
           | versteegen wrote:
           | Mixtral 8x7B actually has 46.7B total parameters, not 8*7B =
            | 56B. The reason is that not all of the parameters are
            | replicated 8x.
           | 
           | Also it uses 12.9B parameters per token, not quite comparable
           | to 7B models.
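            | 
            | A rough back-of-the-envelope of where those numbers come
            | from (my own sketch; config values are approximate, norms
            | and routers are ignored, and only the feed-forward experts
            | are replicated while attention and embeddings are shared):
            | 
            |     d, layers, ffn = 4096, 32, 14336
            |     experts, top_k = 8, 2
            |     kv_dim, vocab = 1024, 32000       # grouped-query attn
            | 
            |     attn = 2 * d * d + 2 * d * kv_dim   # q, o + k, v
            |     expert = 3 * d * ffn                # gate, up, down
            |     embed = 2 * vocab * d               # embedding + lm head
            | 
            |     total = layers * (attn + experts * expert) + embed
            |     active = layers * (attn + top_k * expert) + embed
            |     print(total / 1e9, active / 1e9)    # ~46.7 and ~12.9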
        
       | tbruckner wrote:
       | Accidentally read this as "Do large companies need all those
       | lawyers?"
        
       | jmward01 wrote:
       | We quantize models and get nearly the same performance and we
       | keep seeing new 'small' models performing on par with older
       | larger models. Additionally, people don't need nearly the amount
       | of data that models do in order to learn the same tasks. This
       | implies a lot about how fast it may be possible to learn and how
       | far models currently are from that optimum. Because of this I
        | think the authors' conclusions about scaling data up really
        | only apply to current models and training techniques. Based on
        | how fast people can learn, I believe we are likely missing a few
        | things that will allow models to learn vastly faster and deeper
        | with fewer total parameters. My own work (github:
        | jmward01/lmplay), as an example, may show how simple changes to
        | embedding training can make drastic improvements to model
        | training.
        
       | fluidcruft wrote:
        | I was at RSNA this year (a major radiology conference and trade
        | show), and one of the presenters claimed that their model was
        | generalized because it works on different body parts. My
        | intuition was that they were claiming they trained it on one
        | body part and it subsequently worked on different body parts
        | (which could be convincing). So at first this seemed fine. But
        | in reality they had trained the same model on all of those body
        | parts. That really got me thinking about the old myth that we
        | only use 10% of our brains. Anyway, I think capacity would be
        | reached when the model can no longer learn.
        | 
        | But anyway, it made me wonder if there's a way to measure "what
        | x% of a model is actually used", similar to the myths about
        | human brains.
        
       ___________________________________________________________________
       (page generated 2023-12-15 23:01 UTC)