[HN Gopher] Switch Transformers C - 2048 experts (1.6T params fo...
       ___________________________________________________________________
        
       Switch Transformers C - 2048 experts (1.6T params for 3.1 TB)
        
       Author : tosh
       Score  : 65 points
       Date   : 2023-11-20 14:25 UTC (8 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | karencarits wrote:
       | The top comment says a lot
       | 
       | > It's pretty much the rumored size of GPT-4. However, even when
       | quantized to 4bits, one would need ~800GB of VRAM to run it.
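         
          (A rough check of that figure, assuming all 1.6T parameters stay
          resident: at 4 bits that is 1.6e12 params x 0.5 bytes/param, or
          about 800 GB, while the checkpoint at ~16 bits/param comes to
          ~3.2 TB, which lines up with the 3.1 TB in the title.)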
        
         | robotnikman wrote:
         | Wow... I won't be running anything like that locally anytime
         | soon unfortunately...
        
         | sp332 wrote:
         | It's a sparse model, which means you only need to load the
         | "experts" that you're going to use for a given input.
        
           | ntonozzi wrote:
           | But you still need to page in the weights from disk to the
           | GPU at each layer, right?
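             
              As a toy sketch of the routing described above -- top-1
              "switch" routing, where each token is dispatched to a single
              expert, so only the experts that are actually hit do any
              work -- assuming plain PyTorch; the names are illustrative,
              not the real Switch Transformers implementation:
             
                import torch
                import torch.nn as nn
                import torch.nn.functional as F
             
                class Top1MoELayer(nn.Module):
                    """Toy switch-style MoE layer: one expert per token."""
                    def __init__(self, d_model, d_ff, num_experts):
                        super().__init__()
                        self.router = nn.Linear(d_model, num_experts)
                        self.experts = nn.ModuleList([
                            nn.Sequential(nn.Linear(d_model, d_ff),
                                          nn.ReLU(),
                                          nn.Linear(d_ff, d_model))
                            for _ in range(num_experts)])
             
                    def forward(self, x):  # x: (tokens, d_model)
                        probs = F.softmax(self.router(x), dim=-1)
                        gate, idx = probs.max(dim=-1)    # top-1 per token
                        out = torch.zeros_like(x)
                        for e in idx.unique().tolist():  # only experts hit
                            mask = idx == e
                            out[mask] = (gate[mask].unsqueeze(-1)
                                         * self.experts[e](x[mask]))
                        return out
             
              The paging concern still applies, though: routing only skips
              the compute of unused experts, it doesn't remove the need to
              have the chosen experts' weights on the GPU at each layer.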
        
         | elorant wrote:
          | You can get a second-hand RTX 3090 for $800. So in theory you
          | could build a cluster for less than $40k in total hardware.
          | Not that this is pocket change, but given the capabilities of
          | such a model the expense doesn't sound like that much.
        
           | rthnbgrredf wrote:
           | Or one can try to rent a cluster in the clouds.
        
           | karmasimida wrote:
            | You don't have NVLink though, so how would this work?
        
             | elorant wrote:
             | You don't need the entire VRAM to be accessed as one.
             | Everything is parallelized by the framework. You lose some
             | speed, but it's the only way to do it. There's no freaking
             | way to have 800 GB of unified VRAM.
        
               | _zoltan_ wrote:
               | GH200 would like a word ;-)
        
               | razodactyl wrote:
               | Hahaha I love this.
        
           | htrp wrote:
            | 3090s probably aren't gonna cut it because of the
            | generational improvements in architecture (even if you could
            | hack the drivers together to make it work)
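             
              For the "parallelized by the framework" point above, a
              hedged sketch of how that usually looks with Hugging Face
              transformers + accelerate: device_map="auto" shards the
              layers across every visible GPU and spills the rest to
              CPU/disk, so no single card needs 800 GB. The repo id and
              prompt below are assumptions, not something tested against
              this checkpoint.
             
                from transformers import (AutoModelForSeq2SeqLM,
                                          AutoTokenizer)
             
                # Switch-C is a T5-style seq2seq model; the Hub repo id
                # "google/switch-c-2048" is assumed here.
                model = AutoModelForSeq2SeqLM.from_pretrained(
                    "google/switch-c-2048",
                    device_map="auto",         # shard across visible GPUs
                    offload_folder="offload",  # spill the rest to disk
                    torch_dtype="auto",
                )
                tok = AutoTokenizer.from_pretrained("google/switch-c-2048")
             
                ids = tok("The capital of France is <extra_id_0>.",
                          return_tensors="pt")
                out = model.generate(**ids, max_new_tokens=8)
                print(tok.decode(out[0]))
             
              Cross-GPU traffic then goes over PCIe rather than NVLink,
              which costs speed but works, as described above.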
        
         | nullc wrote:
          | It's no big deal to run an EPYC host with 2TB of RAM. It'll
          | run slow. But the Falcon 180B model runs at several tokens per
          | second that way, and it's nowhere near as sparse as this
          | should be.
          | 
          | I also have 960GB of VRAM in my garage (40x P40), though I
          | suspect getting this model distributed in a useful way might
          | be annoying and not worth the effort, particularly since it
          | would probably be close to turn-key (if a bit slow) on a
          | high-RAM EPYC host.
        
           | arthurcolle wrote:
           | be careful, the Compute Enforcement Administration might be
           | reading
        
       | layer8 wrote:
       | MoE = Mixture of Experts
       | 
       | Not the anime kind. ;)
        
       | MallocVoidstar wrote:
       | ... in November of 2022.
        
       | rkagerer wrote:
       | Assuming you have the hardware, what could you use this for?
       | 
       | Would it be difficult to wire up for conversations like ChatGPT?
       | Could you run it against a local photo store to let you search by
       | names of objects/people? Or is it basically an intermediate model
       | that needs further training to fine tune to your application?
        
         | borg16 wrote:
          | This is what I have been struggling to understand too. The
          | accolades speak to the size and other input-related
          | capabilities of the model, but not so much to the output.
        
       | alsodumb wrote:
       | It's an experimental model that is trained on a small set of
       | data. It's been out there for months. It wasn't popular because
       | it's pretty much useless.
        
       | skzv wrote:
       | Interesting, all 3 authors left Google. 2 went to OpenAI, the 3rd
       | cofounded Character.AI.
        
       | gchokov wrote:
        | Google is dead in the water. Their products are hard to use, all
        | of them. Remember when you could set up most of the GA & Ads
        | stuff for your business yourself? Those days are gone. The UI is
        | cluttered, broken, stupid. You need advisors and professional
        | services for simple things. It's only Google.com that keeps
        | things afloat. The CEO is reactive and doesn't know what the
        | right and left hands are doing. Google will be gone in 10 years.
        
       | sainez wrote:
        | I'm much more interested in lower-parameter models that are
        | optimized to punch above their weight. There is already
        | interesting work being done in this space with Mistral and Phi.
        | I see research coming out virtually every week trying to address
        | the low-hanging fruit.
        
       | kristianp wrote:
        | I reckon it would be useful if there were a set of open-source
        | MoE models with different sizes, like llama, for example 1.5B,
        | 7B, 34B and 72B (or perhaps those numbers x the number of
        | experts). It would enable much experimentation in the community.
       | Not many are GPU rich enough to play with a 1.6T model!
        
       | wg0 wrote:
        | Noob question - why is it a must to load the whole model into
        | VRAM all at once?
        | 
        | Why can't it be streamed from disk layer by layer: load a
        | sliding window of it, compute, hold the temporary results,
        | offload it back, and load the next window, repeating until the
        | whole inference is done?
        | 
        | Also, if these calculations are so repetitive in nature that you
        | need CUDA cores to compute them, then why can't the inference be
        | streamed and spread across a cluster of machines, each having
        | multiple commodity GPUs, where one central
        | "conductor/orchestrator" machine collects all the results from
        | all the cluster participants?
        
         | SCHiM wrote:
          | That is actually possible. For example, someone wrote Python
          | code to do this for the massive open-source model BLOOM.
          | 
          | However, it's still slow as tar. When I was running the BLOOM
          | model I think my inference time was about 1 token per minute.
         | 
         | See: https://towardsdatascience.com/run-bloom-the-largest-open-
         | ac...
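           
            That approach boils down to the sliding-window idea from the
            parent question: keep one block of weights on the GPU at a
            time and stream the rest from disk. A conceptual sketch,
            assuming the layers were saved to disk one file per block; the
            names and file layout are made up, not the article's code:
             
              import torch
             
              # Conceptual "sliding window" inference: page the model's
              # blocks through a single GPU instead of holding them all.
              def offloaded_forward(hidden, layer_files, build_layer,
                                    device="cuda"):
                  # hidden is assumed to already live on `device`;
                  # build_layer() constructs an empty block of the right
                  # shape, e.g. one transformer layer.
                  for path in layer_files:
                      block = build_layer()
                      block.load_state_dict(
                          torch.load(path, map_location="cpu"))
                      block.to(device)            # page the weights in
                      with torch.no_grad():
                          hidden = block(hidden)  # keep only activations
                      del block                   # drop the weights again
                      torch.cuda.empty_cache()
                  return hidden
             
            Every forward pass re-reads every block from disk, which is
            why throughput ends up in the "one token per minute" range
            rather than tokens per second.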
        
         | foxes wrote:
          | No, it's not necessary to load it all in at once; in theory
          | you could stream it in chunks of layers.
        
         | p1esk wrote:
          | Yes, it can be done and is done when needed, but your life
          | will be simpler if you can fit your model in GPU memory. Even
          | better if you can fit it in a single GPU's memory.
        
         | swatcoder wrote:
         | It's not a formal must, it's a practical "must". Paging and
         | streaming and distribution (and collection) all bring their own
         | overhead, and given the architecture of most of these systems,
         | the overhead of those techniques is very large.
         | 
         | So for researchers who can access a system with enough memory,
         | which is most of the people _professionally_ exploring these
         | models, nobody needs to invest much preparatory effort into
         | those other techniques or reducing their overhead. For them,
         | there 's basically no ROI for it.
         | 
         | But a lot of that work to optimize for those other techniques
         | is gradually being accomplished in the open source community,
         | where people don't have access to expensive clouds and lab
         | systems. It takes time though, and can still only achieve so
         | much.
        
       | Tiberium wrote:
       | The title should be changed to reflect that the checkpoints were
       | released 1.5 years ago [1], and the paper was published almost 3
       | years ago [2]. IMO "releases" means that it was released
       | recently, but that's not the case.
       | 
       | [1] https://github.com/google-
       | research/t5x/commit/199f226eeff5f8...
       | 
       | [2] https://arxiv.org/abs/2101.03961
        
         | dang wrote:
         | Changed now. Thanks!
        
       | greatpostman wrote:
       | Google is so irrelevant
        
       | a_wild_dandan wrote:
       | It's wild that you could run a* quant (160GB) of this on a $6k
       | Mac Studio. Times are a changin'.
        
       ___________________________________________________________________
       (page generated 2023-11-20 23:02 UTC)