[HN Gopher] SparseGPT: Language Models Can Be Accurately Pruned ...
       ___________________________________________________________________
        
       SparseGPT: Language Models Can Be Accurately Pruned in One-Shot
        
       Author : tosh
       Score  : 182 points
       Date   : 2023-05-03 16:44 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | dougmwne wrote:
       | Once you prune your model, can you get even better performance by
       | re-training it? I've heard theories that this is the function of
       | sleep in brains.
        
         | valine wrote:
         | Retraining a large GPT is very expensive. The goal of this
         | paper is to help limit the need for retraining after pruning.
        
           | dougmwne wrote:
           | Extremely expensive, though as I understand it, the goal is
           | to get maximum performance out of a given model size so that
           | you can actually inference at product scale. A few extra
           | million for training is expensive, but then consider what it
           | costs to run inference for something like Bing for 100
           | million daily active users.
           | 
            | If they can develop new methods to "overtrain" these models,
            | they will get more bang for the smaller-parameter-model buck.
        
         | napo wrote:
          | It sounds nice (maybe too nice?). I've always wanted it to turn
          | out that AI needs a "sleeping" phase too.
          | 
          | It has always felt weird that we have to sleep; it doesn't seem
          | to give any obvious evolutionary advantage.
        
           | visarga wrote:
            | Continual learning. When models do that, they will have to
            | sleep as well to avoid catastrophic forgetting.
        
           | roomey wrote:
           | It must give an evolutionary advantage, or we wouldn't sleep.
           | 
            | It may be hard to pinpoint exactly what that advantage is,
            | but since we all do it, it must have given us one!
        
             | dougmwne wrote:
             | Especially considering that it is so widespread in nearly
             | every creature with a brain. And it's not simply a period
             | of motionless energy conservation but has very specific
             | neural patterns. The science is definitely zeroing in on a
             | connection to learning.
        
           | richardw wrote:
           | I have an unbaked theory, but the very short version is:
           | 
           | - Animals that have peaks of energy use outcompete animals
           | that have a steady-state energy use. Catch the animal, then
            | rest and recover. For a given energy budget, this means we
            | can recruit more of it in a shorter window than an animal
            | that plods along with no recuperative phase.
           | 
           | - Many things happen when you're sleeping. Rather than having
           | everything running 24/7, having different phases means we can
           | specialise action and recovery. Since the time is already
           | driven by energy demands, many parts of our body and mind
           | leverage it for different purposes.
        
             | LesZedCB wrote:
             | 1 day of in-context learning and 1 night of fine-tuning on
             | context. that's my pet theory, just shooting from the hip
             | as a total layperson.
        
       | vessenes wrote:
       | This is interesting. OPT and BLOOM are significantly Chinchilla
       | under-trained, and I can't help but wonder if this is related to
       | their compressibility here. I would like to see the results for
       | something Chinchilla over-trained, like Llama - my gut is that
       | the 'free lunch' they see will get slightly more expensive.
       | 
       | Implementation q - can torch or other inference runtimes take
       | advantage of the memory savings delivered by a sparsification
       | like this? Or do you need a special implementation to not malloc
       | out all the memory implied by each tensor layer?
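        | 
        | For concreteness, here's roughly what I mean: a toy check
        | (hypothetical 4096x4096 layer, nothing to do with the paper's
        | code) of whether zeroing ~50% of a weight matrix and switching to
        | torch's CSR format actually saves memory.
        | 
        |   import torch
        | 
        |   # hypothetical fp32 weight matrix with ~50% of entries zeroed
        |   dense = torch.randn(4096, 4096)
        |   dense[torch.rand(dense.shape) < 0.5] = 0
        | 
        |   sparse = dense.to_sparse_csr()  # compressed sparse row copy
        | 
        |   dense_bytes = dense.numel() * dense.element_size()
        |   csr_bytes = sum(t.numel() * t.element_size() for t in
        |                   (sparse.values(), sparse.col_indices(),
        |                    sparse.crow_indices()))
        |   print(dense_bytes / 2**20, csr_bytes / 2**20)  # MiB
        | 
        |   # With the default int64 indices, 50% unstructured sparsity can
        |   # come out *larger* than the dense tensor, which is why
        |   # structured sparsity (e.g. 2:4) or dedicated sparse kernels
        |   # seem to matter for real savings.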
        
         | MacsHeadroom wrote:
          | SparseGPT-for-LLaMA[0] exists. Pruning more than 30% of the 30B
          | model's weights starts to show significant perplexity losses,
          | and 50% is a total disaster, whereas it was not for OPT or
          | BLOOM. So your intuition seems to be good here.
         | 
         | [0] https://github.com/AlpinDale/sparsegpt-for-LLaMA
        
           | generalizations wrote:
           | Has anyone figured out what the optimal pruning is for 65b? I
           | don't really know what that matrix in your link is saying,
           | but it didn't seem to show optimal pruning.
        
       | m3kw9 wrote:
       | Who's gonna apply this to LLama weights?
        
         | MacsHeadroom wrote:
         | It was done over two months ago:
         | https://github.com/AlpinDale/sparsegpt-for-LLaMA
        
       | valine wrote:
       | If the abstract is accurate, then I'm very, very excited to try
       | this on LLaMA 65B. We are tantalizingly close to ChatGPT
       | performance parity on consumer hardware.
       | 
        | Hopefully this lowers the cost of doing instruct fine-tuning on
        | the larger models, and we see a Vicuna-like model based on LLaMA
        | 65B soon. This is exciting, folks.
        
       | icyfox wrote:
        | Neat paper. Planning on reading it more in-depth over the
        | weekend, but more fundamental than just the application to GPT,
        | their insights are:
        | 
        | - Existing pruners were written for models that are orders of
        | magnitude smaller than anything in the modern GPT family. Their
        | runtime grows linearly with the number of parameters, so they're
        | unequipped to work on current architectures: the best existing
        | pruner takes 4.3h for a 1.3B model
       | 
        | - The core scaling bottleneck is the time to calculate the
        | Hessian during pruning analysis (effectively a matrix of second-
        | order derivatives, famously expensive to compute)
       | 
        | - They follow the existing literature and prune each layer
        | locally. By doing this (and doing it well), they preserve the
        | input/output contract for surrounding layers, which makes the
        | whole thing parallelizable across machines
       | 
        | - Their solution approximates the layer-wise reconstruction loss
        | with a quadratic and then runs an OBS (Optimal Brain Surgeon)
        | update, with a few other optimizations on ordering and iteration
        | on the side; a rough sketch of the idea is at the end of this
        | comment
       | 
        | I'm particularly excited about these smaller models, mostly for
        | inference efficiency gains in realtime applications. The general
        | con of weight pruning is that you still need an incredibly large
        | training cluster / upfront investment in training resources to
        | get the original parameter weights. But if the lottery ticket
        | hypothesis holds true, this might be the best way we have at the
        | moment to get models with the same performance and lower long-
        | term operational costs.
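        | 
        | To make the layer-wise OBS idea concrete, here's my own rough,
        | unoptimized sketch (toy numpy, not the paper's actual algorithm,
        | which adds Hessian reuse, blocking, and Cholesky tricks precisely
        | so that something like this becomes tractable at GPT scale):
        | 
        |   import numpy as np
        | 
        |   # Prune one linear layer W (out x in) to a target sparsity
        |   # using calibration inputs X (in x n_samples), trying to keep
        |   # ||W X - W_pruned X|| small.
        |   def prune_layer_obs(W, X, sparsity=0.5, damp=1e-2):
        |       W = W.astype(np.float64).copy()
        |       d = W.shape[1]
        |       H = X @ X.T                                  # proxy Hessian
        |       H += damp * np.mean(np.diag(H)) * np.eye(d)  # dampening
        |       n_prune = int(sparsity * d)
        |       for i in range(W.shape[0]):          # rows are independent
        |           w, alive = W[i].copy(), np.ones(d, dtype=bool)
        |           for _ in range(n_prune):
        |               idx = np.where(alive)[0]
        |               Hinv = np.linalg.inv(H[np.ix_(idx, idx)])
        |               j = np.argmin(w[idx] ** 2 / np.diag(Hinv))  # OBS saliency
        |               # OBS update: compensate survivors, zero the pruned weight
        |               w[idx] -= (w[idx][j] / Hinv[j, j]) * Hinv[:, j]
        |               w[idx[j]] = 0.0
        |               alive[idx[j]] = False
        |           W[i] = w
        |       return W
        | 
        | The repeated matrix inversions in that inner loop are exactly
        | what's too slow at GPT scale; as I read it, the paper's trick is
        | sharing one inverse-Hessian sequence across all rows and pruning
        | columns in a fixed order so the whole thing stays cheap.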
        
         | teej wrote:
         | This is a great breakdown. I don't know much about LLM
         | internals but I could follow this easily.
        
         | codethief wrote:
         | > lottery ticket hypothesis
         | 
         | For those that, like me, didn't know the reference:
         | https://arxiv.org/abs/1803.03635
        
         | gfodor wrote:
         | Reduction in working memory for sparse models seems pretty
         | huge.
        
         | icyfox wrote:
          | Another random thought: Most of these general-purpose pruning
          | approaches rely on randomly generated X inputs against which
          | they measure each layer's output loss. In theory it's possible
          | to feed actual datasets through these models as well, which
          | could be another way to get a sparse model that's more acutely
          | optimized towards one task: the original model produces the X
          | activations at each layer, and these are used as the
          | optimization targets for the pruned version.
         | 
          | It might provide performance similar to fine-tuning but without
          | the skew in parameter values that fine-tuning necessarily
          | introduces.
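          | 
          | Something like this would do it, I think (toy torch sketch;
          | `model` and `calib_loader` are placeholders, and a real LLM
          | forward pass takes a dict rather than a single tensor):
          | 
          |   import torch
          | 
          |   # Capture the inputs that actually reach each Linear layer
          |   # when running real task data through the original model;
          |   # they become the calibration X for pruning that layer.
          |   def collect_layer_inputs(model, calib_loader, max_batches=16):
          |       linears = [(n, m) for n, m in model.named_modules()
          |                  if isinstance(m, torch.nn.Linear)]
          |       captured = {n: [] for n, _ in linears}
          |       hooks = [m.register_forward_hook(
          |                    lambda mod, inp, out, n=n:
          |                        captured[n].append(inp[0].detach().cpu()))
          |                for n, m in linears]
          |       with torch.no_grad():
          |           for i, batch in enumerate(calib_loader):
          |               if i >= max_batches:
          |                   break
          |               model(batch)
          |       for h in hooks:
          |           h.remove()
          |       return {n: torch.cat(xs) for n, xs in captured.items()}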
        
           | CuriouslyC wrote:
            | Statisticians have been using L1 regularization to estimate
            | sparse models for a while; it seems reasonable to assume that
            | you could fine-tune the model on a dataset while also pushing
            | weak parameters to zero in a natural way in this domain as
            | well.
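            | 
            | i.e. something like adding a lasso-style penalty during fine-
            | tuning and thresholding afterwards (generic torch sketch;
            | `model`, `loader`, and `task_loss` are stand-ins):
            | 
            |   import torch
            | 
            |   l1_lambda = 1e-5
            |   opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
            |   for batch, target in loader:
            |       # task loss plus an L1 penalty that pushes weak
            |       # parameters toward zero
            |       l1 = sum(p.abs().sum() for p in model.parameters())
            |       loss = task_loss(model(batch), target) + l1_lambda * l1
            |       opt.zero_grad()
            |       loss.backward()
            |       opt.step()
            | 
            |   # afterwards, hard-threshold tiny weights to exact zeros
            |   with torch.no_grad():
            |       for p in model.parameters():
            |           p[p.abs() < 1e-3] = 0.0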
        
           | ianbutler wrote:
            | I believe you are correct. I worked on a summer research
            | project at NYU in 2018 based on
            | https://arxiv.org/abs/1805.12185
           | 
           | As part of that project I constructed an API that took a
           | small dataset and a model, launched a K8s pod and ran
           | something like this from the paper:
           | 
           | > The pruning defense works as follows: the defender
           | exercises the DNN received from the attacker with clean
           | inputs from the validation dataset, D_valid, and records the
           | average activation of each neuron. The defender then
           | iteratively prunes neurons from the DNN in increasing order
           | of average activations and records the accuracy of the pruned
           | network in each iteration. The defense terminates when the
           | accuracy on the validation dataset drops below a pre-
            | determined threshold. We note that pruning has been proposed
            | in prior work for n[...]
           | 
           | Obviously this wasn't on transformers but the idea is
           | similar.
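            | 
            | In heavily simplified form (from memory, not the project's
            | real code; `evaluate` and `valid_loader` are stand-ins and
            | the layer is assumed to be a plain Linear), the defense
            | looked roughly like:
            | 
            |   import torch
            | 
            |   def prune_defense(model, layer, valid_loader, evaluate,
            |                     acc_floor=0.95):
            |       # 1. record each neuron's average activation on clean
            |       #    validation data
            |       acts = []
            |       hook = layer.register_forward_hook(
            |           lambda m, i, o: acts.append(o.detach().abs().mean(dim=0)))
            |       with torch.no_grad():
            |           for x, _ in valid_loader:
            |               model(x)
            |       hook.remove()
            |       avg_act = torch.stack(acts).mean(dim=0)
            | 
            |       # 2. prune neurons from least to most active until
            |       #    accuracy drops below the threshold
            |       for idx in torch.argsort(avg_act):
            |           layer.weight.data[idx].zero_()
            |           if layer.bias is not None:
            |               layer.bias.data[idx] = 0.0
            |           if evaluate(model, valid_loader) < acc_floor:
            |               break
            |       return model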
        
       | Jimmc414 wrote:
       | Eternal Sunshine of the Spotless Mind for LLMs.
        
         | ben_w wrote:
         | I don't think so, from the abstract it's more like JPEG for
         | LLMs.
        
           | igravious wrote:
           | One hopes it's more like PNG for LLMs ?
        
             | ben_w wrote:
             | It's lossy rather than lossless, and the impact can be
             | dialled up or down depending on space requirements.
        
           | barking_biscuit wrote:
           | Can we call it LLMPeg?
        
       | laughy wrote:
       | This suggests that the effective number of parameters is far
       | lower than the nominal number. My head canon for neural networks
       | as overparametrized models still holds.
        
       | smrtinsert wrote:
       | To copy a reddit meme: text-generation-webui plugin when? But
       | seriously, this seems like an incredible upgrade.
        
         | chime wrote:
         | Better yet, text-generation-webui-docker when?
        
       | gorkish wrote:
       | Very good result, and awesome to see such great progress happen
       | so fast.
       | 
       | Quantizing/pruning/deduplicating/compressing models and
       | embeddings is still a vast orchard of low hanging fruit.
       | 
       | I personally think there are still quite a few multiple-orders-
       | of-magnitude scale opportunities to accelerate inference, and we
       | are fortunate to have strong economic incentives aligned with the
       | problem.
        
         | Der_Einzige wrote:
         | So much low hanging fruit for those willing to pick it. It's a
         | great time to be an LLM researcher.
        
           | bitL wrote:
           | Yeah, writing a thesis on it right now; this + adapters give
           | so many options to play with and Meta was nice to give me
           | access to their research models.
        
       | NM_Ricky wrote:
       | For anyone interested in SparseGPT, on May 25th, the author of
       | the SparseGPT paper will show you how you can download an
       | optimized and open-sourced LLM and run it on CPUs at GPU speeds
       | using DeepSparse.
       | 
       | Confirm your spot: https://neuralmagic.com/unlock-faster-and-
       | more-efficient-lan...
        
       | Reubend wrote:
       | Wow, these results are amazing! This might be extremely helpful
       | in reducing the memory consumption of large models in the future.
        
       | seydor wrote:
        | The robustness with which these models can be quantized, and now
        | trimmed, makes one wonder whether they could be easily
        | implemented in some form of analog (or optical) hardware.
        
         | meepmorp wrote:
         | Isn't the inspiration behind these models our own analog
         | hardware?
        
           | seydor wrote:
            | Not sure I would call neurons analog; they are very nonlinear
            | and capricious beasts.
        
         | mitthrowaway2 wrote:
         | Perhaps even by biological cells!
        
           | barking_biscuit wrote:
           | Xenobots, even.
        
       | avereveard wrote:
        | Wow, this could make the fabled 65-billion-parameter LLaMA,
        | sparsified and pruned, runnable on a 3060.
        
         | bick_nyers wrote:
         | How would this be any different from running one of the lower
         | parameter models?
        
           | avereveard wrote:
            | Larger models seem to handle introspection much better, which
            | makes them a better backend for sourced knowledge extraction.
        
           | sva_ wrote:
           | It says in the abstract
           | 
           | > at minimal loss of accuracy
           | 
           | Suggesting that there is a lot of redundancy in the weights.
        
             | amelius wrote:
             | This makes me wonder what would happen if you took the
             | sparse model, reset its weights, then trained it with the
             | original training data.
        
               | sva_ wrote:
               | Very interesting idea. I'd hypothesize that it won't
               | achieve the same(ish) accuracy, and that pruning might be
               | required (similar to how humans go through a heavy
               | pruning phase at an early age[0]). Would be worth setting
               | up an experiment on a smaller scale.
               | 
                | As another commenter stated, there's currently a lot of
                | low-hanging fruit in optimizing NNs.
               | 
               | 0. https://en.m.wikipedia.org/wiki/Synaptic_pruning
        
             | theLiminator wrote:
             | I wonder how far we can take this. Is 1B parameters
             | theoretically "expressive" enough for GPT-4 like
             | performance? I wonder how far off "theoretically optimal"
             | we are in terms of performance/parameters ratio.
        
               | dontwearitout wrote:
               | Good open questions. I suspect we'll see models distilled
               | and compressed down to retain most of their "common
               | sense", "core knowledge", and reasoning ability, while
               | stripping out the gigabytes of random trivia most people
               | will never need.
        
       | babyshake wrote:
       | It looks like by pruning by a factor of 0.5, you reduce the size
       | of the model by 50%? In practice, what is the expected observed
       | change in the output before and after pruning?
        
         | MacsHeadroom wrote:
         | >what is the expected observed change in the output before and
         | after pruning?
         | 
         | The expected and observed change is virtually none. That's the
         | whole point!
         | 
          | Notably, quantizing weights from 16-bit to 4-bit (reducing the
          | size by 75%) also causes almost no change in output quality
          | when using modern algorithms like GPTQ.
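          | 
          | Back-of-the-envelope for a 65B model, counting weights only and
          | ignoring activation memory and any index overhead a sparse
          | format would add:
          | 
          |   params = 65e9
          |   print(params * 2 / 1e9)          # fp16:  ~130 GB
          |   print(params * 0.5 / 1e9)        # 4-bit: ~32.5 GB
          |   print(params * 0.5 * 0.5 / 1e9)  # 4-bit + 50% sparse: ~16 GB
          |                                    # (if the zeros aren't stored)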
        
       ___________________________________________________________________
       (page generated 2023-05-03 23:01 UTC)