[HN Gopher] SparseGPT: Language Models Can Be Accurately Pruned ...
___________________________________________________________________
SparseGPT: Language Models Can Be Accurately Pruned in One-Shot
Author : tosh
Score : 182 points
Date : 2023-05-03 16:44 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| dougmwne wrote:
| Once you prune your model, can you get even better performance by
| re-training it? I've heard theories that this is the function of
| sleep in brains.
| valine wrote:
| Retraining a large GPT is very expensive. The goal of this
| paper is to help limit the need for retraining after pruning.
| dougmwne wrote:
| Extremely expensive, though as I understand it, the goal is
| to get maximum performance out of a given model size so that
| you can actually inference at product scale. A few extra
| million for training is expensive, but then consider what it
| costs to run inference for something like Bing for 100
| million daily active users.
|
| If they can develop new methods to "overtrain" these models,
| they will get more bang for their smaller-parameter-model
| buck.
| napo wrote:
| It sounds nice (maybe too nice?). I've always wanted it to turn
| out that AI needs a "sleeping" phase.
|
| It has always felt weird that we have to sleep; it doesn't seem
| to give any evolutionary advantage.
| visarga wrote:
| Continual learning. Once models do that, they will have to
| sleep as well to avoid catastrophic forgetting.
| feuerwehrnrw wrote:
| [dead]
| klobuerste wrote:
| [dead]
| roomey wrote:
| It must give an evolutionary advantage, or we wouldn't sleep.
|
| It may be hard to pinpoint exactly what that advantage is, but
| since we do it, it must have conferred one!
| dougmwne wrote:
| Especially considering that it is so widespread in nearly
| every creature with a brain. And it's not simply a period
| of motionless energy conservation but has very specific
| neural patterns. The science is definitely zeroing in on a
| connection to learning.
| richardw wrote:
| I have an unbaked theory, but the very short version is:
|
| - Animals that have peaks of energy use outcompete animals
| that have a steady-state energy use. Catch the animal, then
| rest and recover. For any given amount of energy, this means
| we can recruit more of it in a smaller window than an animal
| that plods along with no recuperative phase.
|
| - Many things happen when you're sleeping. Rather than having
| everything running 24/7, having different phases means we can
| specialise action and recovery. Since the time is already
| driven by energy demands, many parts of our body and mind
| leverage it for different purposes.
| LesZedCB wrote:
| 1 day of in-context learning and 1 night of fine-tuning on
| context. that's my pet theory, just shooting from the hip
| as a total layperson.
| vessenes wrote:
| This is interesting. OPT and BLOOM are significantly Chinchilla
| under-trained, and I can't help but wonder if this is related to
| their compressibility here. I would like to see the results for
| something Chinchilla over-trained, like Llama - my gut is that
| the 'free lunch' they see will get slightly more expensive.
|
| Implementation q - can torch or other inference runtimes take
| advantage of the memory savings delivered by a sparsification
| like this? Or do you need a special implementation to not malloc
| out all the memory implied by each tensor layer?
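
A rough illustration of the storage question (a generic PyTorch sketch, not from the paper or any particular runtime): a pruned weight tensor stays fully dense in memory unless it is explicitly converted to a sparse layout, and at moderate unstructured sparsity the index overhead of generic sparse formats can eat much of the savings.

```python
# Minimal PyTorch sketch (illustrative only): pruning alone does not shrink
# memory; the tensor must be converted to an explicit sparse layout, and the
# index arrays of generic formats carry their own overhead.
import torch

dense = torch.randn(1024, 1024)
mask = torch.rand_like(dense) > 0.5      # pretend 50% of weights were pruned
pruned = dense * mask                    # still allocates every element

sparse = pruned.to_sparse()              # COO: stores only nonzeros + indices

x = torch.randn(1024, 8)
y_dense = pruned @ x
y_sparse = torch.sparse.mm(sparse, x)    # sparse kernel, same product
print((y_dense - y_sparse).abs().max())  # tiny; float accumulation order only

# Note: at 50% unstructured sparsity, COO indices (two int64s per nonzero)
# cost more than the fp32 values they index, so real savings need higher
# sparsity, compact/structured formats (CSR, 2:4), or lower-precision weights.
```
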
| MacsHeadroom wrote:
| SparseGPT-for-LLaMA[0] exists. Pruning more than 30% of the
| weights of the 30B model starts to show significant perplexity
| losses, and 50% is a total disaster, whereas it was not for OPT
| or BLOOM. So your intuition seems to be good here.
|
| [0] https://github.com/AlpinDale/sparsegpt-for-LLaMA
| generalizations wrote:
| Has anyone figured out what the optimal pruning is for 65b? I
| don't really know what that matrix in your link is saying,
| but it didn't seem to show optimal pruning.
| [deleted]
| m3kw9 wrote:
| Who's gonna apply this to LLama weights?
| MacsHeadroom wrote:
| It was done over two months ago:
| https://github.com/AlpinDale/sparsegpt-for-LLaMA
| valine wrote:
| If the abstract is accurate, then I'm very, very excited to try
| this on LLaMA 65B. We are tantalizingly close to ChatGPT
| performance parity on consumer hardware.
|
| Hopefully this lowers the cost of doing instruct fine tuning on
| the larger models, and we see a Vicuna like model based on LLaMA
| 65B soon. This is exciting folks.
| icyfox wrote:
| Neat paper. Planning on reading it more in-depth over the
| weekend, but beyond just the applications to GPT, their more
| fundamental insights are:
|
| - Existing pruners were written for models that are orders of
| magnitude smaller than anything in the modern GPT family. Their
| runtime grows linearly with the number of parameters, so they're
| unequipped to handle current architectures. The best existing
| pruner takes 4.3h for a 1.3B model
|
| - The core scaling bottleneck is the time to compute the Hessian
| during pruning analysis (effectively a matrix of second-order
| derivatives, famously expensive to compute)
|
| - They follow the existing literature and prune each layer
| locally. By doing this (and doing it well), they preserve the
| input/output contract for the surrounding layers, which makes the
| whole thing parallelizable across machines
|
| - Their solution approximates the reconstruction loss with a
| quadratic and then runs an OBS (Optimal Brain Surgeon) update,
| with a few other optimizations on ordering and iteration on the
| side (a toy per-layer sketch follows below)
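
A toy sketch of the per-layer idea referenced above (not the paper's actual OBS solver; the scoring rule and the omission of compensating weight updates are simplifications): score each weight with a cheap Hessian-diagonal proxy built from calibration inputs and zero the lowest-scoring half of each row.

```python
# Toy layer-wise pruning sketch (NOT SparseGPT's algorithm): rank weights by
# |w| * sqrt(diag(X^T X)) from a few calibration inputs and zero the weakest
# half of each row. Real OBS-style methods also adjust surviving weights.
import torch

def prune_layer(weight: torch.Tensor, calib_x: torch.Tensor,
                sparsity: float = 0.5) -> torch.Tensor:
    # weight: (out_features, in_features), calib_x: (n_samples, in_features)
    hessian_diag = (calib_x ** 2).mean(dim=0)       # diagonal of X^T X / n
    saliency = weight.abs() * hessian_diag.sqrt()   # cheap importance score
    k = int(weight.shape[1] * sparsity)
    _, drop = torch.topk(saliency, k, dim=1, largest=False)  # weakest k per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, drop, False)
    return weight * mask

w = torch.randn(8, 16)
x = torch.randn(32, 16)
w_pruned = prune_layer(w, x)
print((w_pruned == 0).float().mean())   # ~0.5, each row pruned independently
```
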
|
| I'm particularly excited for these smaller models, mostly for
| inference efficiency gains in realtime applications. The general
| con of weight pruning is that you still need incredibly large
| training clusters / upfront investment in training resources to
| get the original weights. But if the lottery ticket hypothesis
| holds true, this might be the best way we have at the moment to
| get models with the same performance and lower long-term
| operational costs.
| teej wrote:
| This is a great breakdown. I don't know much about LLM
| internals but I could follow this easily.
| codethief wrote:
| > lottery ticket hypothesis
|
| For those who, like me, didn't know the reference:
| https://arxiv.org/abs/1803.03635
| gfodor wrote:
| Reduction in working memory for sparse models seems pretty
| huge.
| icyfox wrote:
| Another random thought: most of these general-purpose pruning
| approaches rely on randomly generated X inputs against which
| they measure each layer's output loss. In theory it's possible
| to feed actual datasets into these models as well, which could
| be another way to get a sparse model that's more acutely
| optimized towards one task: the original model produces the X
| activations at each layer, and those are used as the
| optimization target for the pruned version.
|
| It might provide performance similar to fine-tuning, but
| without the skew in parameter values that fine-tuning
| necessarily introduces.
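
A sketch of what that could look like (generic PyTorch, not tied to any particular pruning codebase; the toy model and random batches stand in for a real network and task-specific data): register forward hooks on the original model, record each linear layer's inputs over a few batches, and use the captured activations as the X a layer-wise pruner fits against.

```python
# Sketch: harvest real per-layer activations as calibration data using
# forward hooks on the original model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured.setdefault(name, []).append(inputs[0].detach())
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for _ in range(4):                       # a few calibration batches
        model(torch.randn(8, 64))

for h in handles:
    h.remove()

# One activation matrix per linear layer, ready to drive a layer-wise pruner.
calib = {name: torch.cat(batches) for name, batches in captured.items()}
print({name: tuple(x.shape) for name, x in calib.items()})
```
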
| CuriouslyC wrote:
| Statisticians have been using L1 regularization to estimate
| sparse models for a while; it seems reasonable to assume that
| in this domain too you could fine-tune the model on a dataset
| while naturally pushing weak parameters to zero.
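
A generic illustration of that idea (not something the SparseGPT paper does; the penalty strength and threshold below are arbitrary): fine-tune with a lasso-style L1 penalty so weak weights drift toward zero, then hard-threshold them into exact zeros.

```python
# Generic L1-regularized fine-tuning sketch: the penalty pushes unimportant
# weights toward zero; a final threshold turns them into exact zeros.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
l1_lambda = 1e-3                              # arbitrary penalty strength

for _ in range(200):
    x = torch.randn(64, 32)
    y = x[:, :4]                              # toy target: copy first 4 inputs
    loss = nn.functional.mse_loss(model(x), y)
    loss = loss + l1_lambda * model.weight.abs().sum()   # lasso-style penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    model.weight[model.weight.abs() < 1e-2] = 0.0   # hard-threshold stragglers
print((model.weight == 0).float().mean())           # fraction pruned
```
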
| ianbutler wrote:
| I believe you are correct. I worked on a summer research
| project at NYU in 2018 based on
| https://arxiv.org/abs/1805.12185
|
| As part of that project I constructed an API that took a
| small dataset and a model, launched a K8s pod and ran
| something like this from the paper:
|
| > The pruning defense works as follows: the defender
| exercises the DNN received from the attacker with clean
| inputs from the validation dataset, D_valid, and records the
| average activation of each neuron. The defender then
| iteratively prunes neurons from the DNN in increasing order
| of average activations and records the accuracy of the pruned
| network in each iteration. The defense terminates when the
| accuracy on the validation dataset drops below a pre-
| determined threshold. We note that pruning has been proposed
| in prior work for n
|
| Obviously this wasn't on transformers but the idea is
| similar.
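
A rough, self-contained sketch of that loop (a toy untrained network and synthetic validation data just to show the control flow; the real defense uses the received DNN and clean held-out data):

```python
# Sketch of the quoted pruning defense: sever hidden neurons in increasing
# order of their average activation on clean validation data, stopping once
# validation accuracy falls below a preset threshold.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 64
net = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))
x_val = torch.randn(256, 20)
y_val = (x_val[:, 0] > 0).long()          # synthetic labels

def accuracy() -> float:
    with torch.no_grad():
        return (net(x_val).argmax(dim=1) == y_val).float().mean().item()

# Average activation of each hidden neuron on the validation data.
with torch.no_grad():
    avg_act = torch.relu(net[0](x_val)).mean(dim=0)

floor = accuracy() - 0.05                 # tolerate a 5-point accuracy drop
pruned = 0
for idx in avg_act.argsort():             # least-active neurons first
    with torch.no_grad():
        net[2].weight[:, idx] = 0.0       # cut the neuron's outgoing weights
    pruned += 1
    if accuracy() < floor:
        break
print(f"pruned {pruned}/{hidden} neurons, accuracy {accuracy():.2f}")
```
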
| Jimmc414 wrote:
| Eternal Sunshine of the Spotless Mind for LLMs.
| ben_w wrote:
| I don't think so, from the abstract it's more like JPEG for
| LLMs.
| igravious wrote:
| One hopes it's more like PNG for LLMs ?
| ben_w wrote:
| It's lossy rather than lossless, and the impact can be
| dialled up or down depending on space requirements.
| barking_biscuit wrote:
| Can we call it LLMPeg?
| laughy wrote:
| This suggests that the effective number of parameters is far
| lower than the nominal number. My head canon for neural networks
| as overparametrized models still holds.
| smrtinsert wrote:
| To copy a reddit meme: text-generation-webui plugin when? But
| seriously, this seems like an incredible upgrade.
| chime wrote:
| Better yet, text-generation-webui-docker when?
| gorkish wrote:
| Very good result, and awesome to see such great progress happen
| so fast.
|
| Quantizing/pruning/deduplicating/compressing models and
| embeddings is still a vast orchard of low hanging fruit.
|
| I personally think there are still quite a few multiple-orders-
| of-magnitude scale opportunities to accelerate inference, and we
| are fortunate to have strong economic incentives aligned with the
| problem.
| Der_Einzige wrote:
| So much low hanging fruit for those willing to pick it. It's a
| great time to be an LLM researcher.
| bitL wrote:
| Yeah, writing a thesis on it right now; this + adapters give
| so many options to play with and Meta was nice to give me
| access to their research models.
| NM_Ricky wrote:
| For anyone interested in SparseGPT, on May 25th, the author of
| the SparseGPT paper will show you how you can download an
| optimized and open-sourced LLM and run it on CPUs at GPU speeds
| using DeepSparse.
|
| Confirm your spot: https://neuralmagic.com/unlock-faster-and-
| more-efficient-lan...
| Reubend wrote:
| Wow, these results are amazing! This might be extremely helpful
| in reducing the memory consumption of large models in the future.
| seydor wrote:
| The robustness with which these models can be quantized, and now
| trimmed, makes one wonder whether they could be easily implemented
| in some form of analog (or optical) hardware.
| meepmorp wrote:
| Isn't the inspiration behind these models our own analog
| hardware?
| seydor wrote:
| not sure i would call neurons analog, they are very nonlinear
| and capricious beasts.
| mitthrowaway2 wrote:
| Perhaps even by biological cells!
| barking_biscuit wrote:
| Xenobots, even.
| avereveard wrote:
| wow, this could make the fabled 65 billion parameter LLaMA,
| sparsified and pruned, runnable on a 3060
| bick_nyers wrote:
| How would this be any different from running one of the lower
| parameter models?
| avereveard wrote:
| Larger models seem to handle introspection much better, which
| makes for a better backend for sourced knowledge extraction
| sva_ wrote:
| It says in the abstract
|
| > at minimal loss of accuracy
|
| Suggesting that there is a lot of redundancy in the weights.
| amelius wrote:
| This makes me wonder what would happen if you took the
| sparse model, reset its weights, then trained it with the
| original training data.
| sva_ wrote:
| Very interesting idea. I'd hypothesize that it won't
| achieve the same(ish) accuracy, and that pruning might be
| required (similar to how humans go through a heavy
| pruning phase at an early age[0]). Would be worth setting
| up an experiment on a smaller scale.
|
| As another commenter stated, there's currently a lot of
| low-hanging fruit in optimizing NNs.
|
| 0. https://en.m.wikipedia.org/wiki/Synaptic_pruning
| theLiminator wrote:
| I wonder how far we can take this. Is 1B parameters
| theoretically "expressive" enough for GPT-4 like
| performance? I wonder how far off "theoretically optimal"
| we are in terms of performance/parameters ratio.
| dontwearitout wrote:
| Good open questions. I suspect we'll see models distilled
| and compressed down to retain most of their "common
| sense", "core knowledge", and reasoning ability, while
| stripping out the gigabytes of random trivia most people
| will never need.
| babyshake wrote:
| It looks like by pruning by a factor of 0.5, you reduce the size
| of the model by 50%? In practice, what is the expected observed
| change in the output before and after pruning?
| MacsHeadroom wrote:
| >what is the expected observed change in the output before and
| after pruning?
|
| The expected and observed change is virtually none. That's the
| whole point!
|
| Notably, quantizing weights from 16bit weights to 4bit weights
| (reducing the size by 75%) also has almost no change in output
| quality when using modern algorithms like GPTQ.
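
For a concrete sense of scale, some back-of-the-envelope numbers for a 65B-parameter model (illustrative only; real formats add group metadata for quantization and index overhead for sparsity):

```python
# Rough memory math for 65B parameters under different weight formats.
params = 65e9

fp16_gb = params * 2 / 1e9          # 2 bytes per weight  -> ~130 GB
int4_gb = params * 0.5 / 1e9        # 4 bits per weight   -> ~33 GB (~75% smaller)
int4_sparse_gb = int4_gb * 0.5      # plus 50% sparsity, ignoring index overhead

print(f"fp16:               {fp16_gb:.0f} GB")
print(f"4-bit:              {int4_gb:.0f} GB")
print(f"4-bit + 50% sparse: {int4_sparse_gb:.0f} GB")
```
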
___________________________________________________________________
(page generated 2023-05-03 23:01 UTC)