[HN Gopher] QMoE: Practical Sub-1-Bit Compression of Trillion-Pa...
___________________________________________________________________
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Author : titaniumtown
Score : 32 points
Date : 2023-12-13 18:57 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| kosolam wrote:
| Nice!
| iTokio wrote:
| > affordable commodity hardware, like a single server with 4x
| NVIDIA A6000 or 8x NVIDIA 3090 GPUs
|
| I need to seriously revise my definition of affordable commodity
| hardware
| ronsor wrote:
| NVIDIA's price gouging has distorted people's idea of
| "affordable"
| jetrink wrote:
| You can rent such a system for less than $4/hr. That sounds
| pretty affordable to me!
| withinboredom wrote:
| That's nearly minimum wage in most Western countries, or a
| really nice living in others.
| nine_k wrote:
| Speaking of the US, it's roughly the price of a hamburger.
|
| If you can't afford a hamburger, your problems are likely
| not in compressing trillion-parameter models.
| samus wrote:
| Running humongous models for the price of a small car? Yes,
| it's absolutely affordable. It's peanuts for all except the
| smallest, self-bootstrapped startups. Amortized, it's way less
| than the expense of the data scientists and developers who can
| actually make full use of the cards.
| karmakaze wrote:
| > Concretely, QMoE can compress the 1.6 trillion parameter
| SwitchTransformer-c2048 model to less than 160GB (20x
| compression, 0.8 bits per parameter) at only minor accuracy loss,
| in less than a day on a single GPU.
|
| I'm not in the field. Can someone explain how the sub-1-bit part
| works--are they also reducing the number of parameters as part of
| the compression?
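
The figures quoted above are at least arithmetically self-consistent; here
is a quick back-of-envelope check (my own arithmetic, assuming the
uncompressed checkpoint stores 16 bits per parameter, which the excerpt
does not state):

    # Sanity check of the numbers quoted from the abstract; not the paper's code.
    params = 1.6e12            # SwitchTransformer-c2048 parameter count
    compressed_bits = 0.8      # claimed bits per parameter after compression

    compressed_gb = params * compressed_bits / 8 / 1e9
    print(compressed_gb)                    # ~160 GB, matching "less than 160GB"

    # Assuming a 16-bit (e.g. bf16) uncompressed checkpoint:
    original_gb = params * 16 / 8 / 1e9     # ~3200 GB
    print(original_gb / compressed_gb)      # ~20x, matching the claimed ratio
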
| chessgecko wrote:
| It takes a 2-bit/1.5-bit quantized model, groups parameters
| together, then exploits the lack of entropy in those parameters
| to compress it, a bit like text compression. It only got below
| 1 bit for the ultra-large model; I guess the smaller ones'
| parameters weren't quite as redundant.
|
| It'll be interesting to see if it works on the new Mistral MoE
| model, which is less sparse and probably trained on more data
| per parameter than these.
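
A minimal sketch of the "lack of entropy" intuition (a toy with made-up
numbers, not QMoE's actual encoder): if the quantized weights are heavily
skewed toward zero, their Shannon entropy is well under 1 bit per
parameter, so an entropy or dictionary coder can beat naive 2-bit packing:

    # Toy illustration only; the 90%-zeros distribution is hypothetical
    # and this is not the paper's compression scheme.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    # Ternary-quantized weights, mostly zero.
    weights = rng.choice([-1, 0, 1], size=n, p=[0.05, 0.90, 0.05])

    _, counts = np.unique(weights, return_counts=True)
    p = counts / n
    entropy = -(p * np.log2(p)).sum()
    print(f"{entropy:.2f} bits/parameter")  # ~0.57, vs 2.0 for naive 2-bit packing
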
| cyanydeez wrote:
| Sparse means there are a lot of nulls (zeros).
|
| Think of it like a bog-standard compression algorithm.
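
A quick demo of that point (a toy, unrelated to the paper's code): a
mostly-zero buffer compresses dramatically with an off-the-shelf
compressor, while dense random data barely compresses at all:

    # zlib as a stand-in for any "bog standard" compressor.
    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    # ~95% zero bytes ("sparse") vs uniformly random bytes ("dense").
    sparse = np.where(rng.random(n) < 0.05,
                      rng.integers(1, 256, n), 0).astype(np.uint8)
    dense = rng.integers(0, 256, n, dtype=np.uint8)

    for name, arr in [("sparse", sparse), ("dense", dense)]:
        ratio = len(arr) / len(zlib.compress(arr.tobytes(), 9))
        print(f"{name}: {ratio:.1f}x")  # sparse compresses far more than dense
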
___________________________________________________________________
(page generated 2023-12-13 23:01 UTC)