[HN Gopher] QMoE: Practical Sub-1-Bit Compression of Trillion-Pa...
       ___________________________________________________________________
        
       QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
        
       Author : titaniumtown
       Score  : 32 points
       Date   : 2023-12-13 18:57 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | kosolam wrote:
       | Nice!
        
       | iTokio wrote:
       | > affordable commodity hardware, like a single server with 4x
       | NVIDIA A6000 or 8x NVIDIA 3090 GPUs
       | 
       | I need to seriously revise my definition of affordable commodity
       | hardware
        
         | ronsor wrote:
         | NVIDIA's price gouging has distorted people's idea of
         | "affordable"
        
         | jetrink wrote:
         | You can rent such a system for less than $4/hr. That sounds
         | pretty affordable to me!
        
           | withinboredom wrote:
            | That's nearly minimum wage in most Western countries, or a
            | really nice living in others.
        
             | nine_k wrote:
              | Speaking of the US, it's roughly the price of a hamburger.
             | 
             | If you can't afford a hamburger, your problems are likely
             | not in compressing trillion-parameter models.
        
         | samus wrote:
         | Running humongous models for the price of a small car? Yes,
         | it's absolutely affordable. It's peanuts for all except the
          | smallest, self-bootstrapped startups. Amortized, it's way less
          | than the cost of the data scientists and developers who can
          | actually make full use of the cards.
        
       | karmakaze wrote:
       | > Concretely, QMoE can compress the 1.6 trillion parameter
       | SwitchTransformer-c2048 model to less than 160GB (20x
       | compression, 0.8 bits per parameter) at only minor accuracy loss,
       | in less than a day on a single GPU.
       | 
       | I'm not in the field. Can someone explain how the sub-1-bit part
       | works--are they also reducing the number of parameters as part of
       | the compression?
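        
        For reference, a quick check of how the quoted numbers fit
        together (a sketch; the 16-bit uncompressed baseline is my
        assumption, not stated in the quote):
        
            params = 1.6e12           # SwitchTransformer-c2048
            bits_per_param = 0.8      # compressed size from the abstract
            compressed_gb = params * bits_per_param / 8 / 1e9
            baseline_gb = params * 16 / 8 / 1e9   # assumed 16-bit weights
            print(compressed_gb)                  # 160.0 GB
            print(baseline_gb / compressed_gb)    # 20.0x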
        
         | chessgecko wrote:
          | It takes a model quantized to 2-bit or ternary (~1.6-bit)
          | precision, groups parameters together, then exploits the low
          | entropy of the quantized values to compress them, a bit like
          | text compression. It only went below 1 bit for the ultra-large
          | model; I guess the smaller ones' weights were a bit more
          | random.
          | 
          | It'll be interesting to see if it works on the new Mistral MoE
          | model, which is less sparse and probably trained more per
          | parameter than these.
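          
          A minimal sketch of that entropy argument (the 90%-zero ternary
          distribution and the zlib stand-in coder are illustrative
          assumptions, not the paper's actual dictionary codec):
          
              import zlib
              import numpy as np
              
              # Illustrative ternary-quantized weights with a made-up,
              # heavily zero-skewed distribution.
              rng = np.random.default_rng(0)
              n = 1_000_000
              w = rng.choice([-1, 0, 1], size=n, p=[0.05, 0.90, 0.05])
              w = w.astype(np.int8)
              
              # Shannon entropy of the value distribution: the lower bound
              # an entropy coder can approach, well under 1 bit/param here.
              _, counts = np.unique(w, return_counts=True)
              p = counts / n
              print(f"entropy: {-(p * np.log2(p)).sum():.2f} bits/param")
              
              # zlib as a stand-in general-purpose coder; QMoE instead uses
              # a custom dictionary code built for fast GPU decoding.
              packed = zlib.compress(w.tobytes(), 9)
              print(f"zlib:    {8 * len(packed) / n:.2f} bits/param")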
        
         | cyanydeez wrote:
          | Sparse means there are a lot of nulls (zero weights).
          | 
          | Think of it like a bog-standard compression algorithm.
        
       ___________________________________________________________________
       (page generated 2023-12-13 23:01 UTC)