[HN Gopher] Switch Transformers C - 2048 experts (1.6T params fo...
___________________________________________________________________
Switch Transformers C - 2048 experts (1.6T params for 3.1 TB)
Author : tosh
Score : 65 points
Date : 2023-11-20 14:25 UTC (8 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| karencarits wrote:
| The top comment says a lot
|
| > It's pretty much the rumored size of GPT-4. However, even when
| quantized to 4bits, one would need ~800GB of VRAM to run it.
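|
| (Back-of-the-envelope check, assuming all 1.6T parameters have to
| be resident: 1.6e12 params x 4 bits/param = 0.8e12 bytes, i.e.
| roughly 800 GB before activations and KV cache.)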
| robotnikman wrote:
| Wow... I won't be running anything like that locally anytime
| soon unfortunately...
| sp332 wrote:
| It's a sparse model, which means you only need to load the
| "experts" that you're going to use for a given input.
| ntonozzi wrote:
| But you still need to page in the weights from disk to the
| GPU at each layer, right?
| elorant wrote:
| You can get a second-hand RTX 3090 for $800. So in theory you
| could build a cluster for less than $40k in total hardware. Not
| that this is pocket change, but given the capabilities of such
| a model the expense doesn't sound like that much.
| rthnbgrredf wrote:
| Or one can try to rent a cluster in the clouds.
| karmasimida wrote:
| You don't have NVLink, so how would this work?
| elorant wrote:
| You don't need the entire VRAM to be accessed as one.
| Everything is parallelized by the framework. You lose some
| speed, but it's the only way to do it. There's no freaking
| way to have 800 GB of unified VRAM.
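|
| A toy sketch of the idea - a naive layer-wise split across
| whatever GPUs are visible, passing activations between cards
| over PCIe (real frameworks do this far more efficiently, and
| the layer sizes here are placeholders):
|
|     import torch, torch.nn as nn
|
|     layers = [nn.Linear(4096, 4096) for _ in range(8)]  # stand-ins for blocks
|     devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
|     for i, layer in enumerate(layers):
|         layer.to(devices[i % len(devices)])              # shard layers round-robin
|
|     def forward(x):
|         for i, layer in enumerate(layers):
|             x = x.to(devices[i % len(devices)])          # hop to the owning device
|             x = layer(x)
|         return x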
| _zoltan_ wrote:
| GH200 would like a word ;-)
| razodactyl wrote:
| Hahaha I love this.
| htrp wrote:
| 3090s probably aren't gonna cut it because of the
| generational improvements in architecture (even if you could
| hack the drivers together to make it work)
| nullc wrote:
| It's no big deal to run an EPYC host with 2 TB of RAM. It'll run
| slow, but the Falcon 180B model runs at several tokens per
| second that way, and it's nowhere near as sparse as this should
| be.
|
| I also have 960 GB of VRAM in my garage (40x P40), though I
| suspect getting this model distributed in a useful way might be
| annoying and not worth the effort, particularly since it would
| probably be close to turn-key (if a bit slow) on a high-RAM
| EPYC host.
| arthurcolle wrote:
| be careful, the Compute Enforcement Administration might be
| reading
| layer8 wrote:
| MoE = Mixture of Experts
|
| Not the anime kind. ;)
| MallocVoidstar wrote:
| ... in November of 2022.
| rkagerer wrote:
| Assuming you have the hardware, what could you use this for?
|
| Would it be difficult to wire up for conversations like ChatGPT?
| Could you run it against a local photo store to let you search by
| names of objects/people? Or is it basically an intermediate model
| that needs further training to fine tune to your application?
| borg16 wrote:
| This is what I have been struggling to understand too. The
| accolades speak about the size and other input-related
| capabilities of the model, but not so much about the output.
| alsodumb wrote:
| It's an experimental model that is trained on a small set of
| data. It's been out there for months. It wasn't popular because
| it's pretty much useless.
| skzv wrote:
| Interesting, all 3 authors left Google. 2 went to OpenAI, the 3rd
| cofounded Character.AI.
| gchokov wrote:
| Google is dead in the water. Their products are hard to use, all
| of them. Remember when you could set up most of the GA & Ads
| stuff for your business yourself? Gone are the days. The UI is
| cluttered, broken, stupid. You need advisors and professional
| services for simple things. It's only Google.com that keeps
| things afloat. The CEO is reactive and doesn't know what the
| right and left hands are doing. Google will be gone in 10 years.
| sainez wrote:
| I'm much more interested in lower-parameter models that are
| optimized to punch above their weight. There is already
| interesting work being done in this space with Mistral and Phi.
| I see research coming out virtually every week picking off the
| low-hanging fruit.
| kristianp wrote:
| I reckon it would be useful if there were a set of open-source
| MoE models of different sizes, like Llama has, for example 1.5B,
| 7B, 34B and 72B (or perhaps those numbers times the number of
| experts). It would enable much experimentation in the community.
| Not many are GPU-rich enough to play with a 1.6T model!
| wg0 wrote:
| Noob question - why is it a must to load the whole model into
| VRAM all at once?
|
| Why can't it be streamed from disk layer by layer, as a sliding
| window: load a window of weights, compute, hold the temporary
| results, offload it back, and load the next window, repeating
| until the whole inference is done?
|
| Also, if these are such repetitive calculations in nature that
| you need CUDA cores to compute them, then why can't the
| inference be streamed and spread across a cluster of machines,
| each having multiple commodity GPUs, where one central
| "conductor/orchestrator" machine collects all the results from
| all the cluster participants?
| SCHiM wrote:
| That is actually possible. For example, someone wrote python
| code to do this for the massive open source model BLOOM.
|
| However, it's still slow as tar. When I was running the BLOOM
| model, I think my inference time was about 1 token per minute.
|
| See: https://towardsdatascience.com/run-bloom-the-largest-open-
| ac...
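|
| For reference, a minimal sketch of that kind of disk offload,
| assuming the Hugging Face transformers + accelerate stack (the
| checkpoint name is just an example, and expect it to be very
| slow):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     model_id = "bigscience/bloom"
|     tok = AutoTokenizer.from_pretrained(model_id)
|     model = AutoModelForCausalLM.from_pretrained(
|         model_id,
|         device_map="auto",         # GPU first, then CPU RAM, then disk
|         offload_folder="offload",  # weights that fit nowhere else go here
|         torch_dtype=torch.float16,
|     )
|     inputs = tok("Hello", return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=20)
|     print(tok.decode(out[0], skip_special_tokens=True))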
| foxes wrote:
| No, it's not necessary to load it all in at once; you could in
| theory make it stream in chunks of layers.
| p1esk wrote:
| Yes, it can be done and is done when needed, but your life will
| be simpler if you can fit your model in GPU memory - even
| better if you can fit it in a single GPU's memory.
| swatcoder wrote:
| It's not a formal must, it's a practical "must". Paging and
| streaming and distribution (and collection) all bring their own
| overhead, and given the architecture of most of these systems,
| the overhead of those techniques is very large.
|
| So for researchers who can access a system with enough memory,
| which is most of the people _professionally_ exploring these
| models, nobody needs to invest much preparatory effort into
| those other techniques or reducing their overhead. For them,
| there's basically no ROI for it.
|
| But a lot of that work to optimize for those other techniques
| is gradually being accomplished in the open source community,
| where people don't have access to expensive clouds and lab
| systems. It takes time though, and can still only achieve so
| much.
| Tiberium wrote:
| The title should be changed to reflect that the checkpoints were
| released 1.5 years ago [1], and the paper was published almost 3
| years ago [2]. IMO "releases" means that it was released
| recently, but that's not the case.
|
| [1] https://github.com/google-
| research/t5x/commit/199f226eeff5f8...
|
| [2] https://arxiv.org/abs/2101.03961
| dang wrote:
| Changed now. Thanks!
| greatpostman wrote:
| Google is so irrelevant
| a_wild_dandan wrote:
| It's wild that you could run a* quant (160GB) of this on a $6k
| Mac Studio. Times are a changin'.
___________________________________________________________________
(page generated 2023-11-20 23:02 UTC)