[HN Gopher] Implementation of mixture of experts language model ...
___________________________________________________________________
Implementation of mixture of experts language model in a single
file of PyTorch
Author : avisoori1x
Score : 85 points
Date : 2024-03-18 11:57 UTC (4 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| avisoori1x wrote:
| A from-scratch implementation of a sparse mixture of experts
| language model in a single file of PyTorch. This is inspired by
| and largely based on Andrej Karpathy's project 'makemore' and
| borrows a number of reusable components from that
| implementation. Just like makemore, makeMoE is an
| autoregressive character-level language model, but it uses the
| aforementioned sparse mixture of experts architecture. I added
| expert capacity to this implementation to make it more complete.
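| Roughly, the routing step at the heart of it looks like the
| sketch below (simplified, with illustrative names; see the
| repo/blog for the actual code):
|
| import torch
| import torch.nn as nn
| import torch.nn.functional as F
|
| class NoisyTopkRouter(nn.Module):
|     # Routes each token to its top_k experts, adding learned,
|     # scaled unit-Gaussian noise to the gating logits.
|     def __init__(self, n_embed, num_experts, top_k):
|         super().__init__()
|         self.top_k = top_k
|         self.route_linear = nn.Linear(n_embed, num_experts)
|         self.noise_linear = nn.Linear(n_embed, num_experts)
|
|     def forward(self, x):
|         logits = self.route_linear(x)
|         noise_logits = self.noise_linear(x)
|         noise = torch.randn_like(logits) * F.softplus(noise_logits)
|         noisy_logits = logits + noise
|         top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
|         # softmax only over the selected experts; unselected
|         # experts get -inf and hence zero weight
|         sparse = torch.full_like(noisy_logits, float('-inf'))
|         sparse = sparse.scatter(-1, indices, top_k_logits)
|         return F.softmax(sparse, dim=-1), indices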
| radarsat1 wrote:
| Adding scaled unit Gaussian noise to the logits:
| noise = torch.randn_like(logits)*F.softplus(noise_logits)
| noisy_logits = logits + noise
|
| Question: if you swapped this Gaussian noise for Gumbel noise,
| you would get something like Gumbel-softmax, right? I'm curious
| why it isn't used here. Isn't it the usual way to implement
| differentiable discrete selection? I ask because I've had some
| trouble getting Gumbel-softmax to work in practice, so I'm
| curious whether there are downsides to it compared to other
| methods. Honestly, just adding normal noise like this seems
| simpler anyway.
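| To be concrete, the swap I'm imagining is something like this
| (just a sketch; `logits` and `tau` here are stand-ins):
|
| import torch
| import torch.nn.functional as F
|
| logits = torch.randn(4, 8)  # (tokens, num_experts), dummy data
| tau = 1.0                   # temperature
|
| # Option 1: add Gumbel(0, 1) noise by inverse-transform
| # sampling and keep the rest of the top-k routing unchanged.
| u = torch.rand_like(logits)
| gumbel = -torch.log(-torch.log(u + 1e-9) + 1e-9)
| noisy_logits = (logits + gumbel) / tau
|
| # Option 2: PyTorch's built-in straight-through version;
| # hard=True gives a one-hot selection in the forward pass
| # while gradients flow through the soft probabilities.
| gates = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)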
| avisoori1x wrote:
| This is a good point. I haven't tried it yet; I let this
| project sit for a couple of months and am only now getting
| back to it. I went with this approach because it's simpler,
| but I'm not sure simpler is necessarily better in this case.
| zingelshuher wrote:
| Question: have you seen an improvement after adding the noise,
| in practice? Asking because intuition sometimes doesn't hold.
| gradascent wrote:
| Very cool. I'm curious - did you find the results from your
| mixture of experts model to be (qualitatively) better than with
| the standard approach?
| avisoori1x wrote:
| Thanks! This is something I tried, and qualitatively I didn't
| see a huge difference. I'd like to swap out my hand-rolled
| modules for standard PyTorch modules for self-attention etc.
| and train it on the English split of Wikipedia. That's on my
| to-do list for sure.
| zingelshuher wrote:
| I ran some tests. A single model of the same total size is
| better than the MoE. A single expert out of N is better than a
| model of the same size as that expert. Two experts are better
| than one. That was on a small LLM; not sure if it scales.
| zingelshuher wrote:
| A similar MoE implementation has been on GitHub for a while,
| since Jan 2024:
|
| https://github.com/zxaall/moegpt
| avisoori1x wrote:
| Oh nice. What's new here would be noisy top-k routing and
| expert capacity. It also seems to use the nanoGPT base from
| Andrej Karpathy. Mine is from January as well. Here's the
| original blog:
| https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
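| Expert capacity just caps how many tokens an expert will take
| per batch; a very rough sketch of the idea (illustrative names,
| not the exact repo code):
|
| import torch
|
| def apply_expert_capacity(indices, capacity):
|     # indices: (num_tokens, top_k) expert ids from the router.
|     # Assignments beyond an expert's capacity are marked -1
|     # (dropped), so overloaded experts skip the extra tokens.
|     kept = indices.clone()
|     num_experts = int(indices.max().item()) + 1
|     for e in range(num_experts):
|         slots = (indices == e).nonzero(as_tuple=False)
|         overflow = slots[capacity:]
|         if len(overflow) > 0:
|             kept[overflow[:, 0], overflow[:, 1]] = -1
|     return kept
|
| Real implementations usually pick which assignments to drop by
| router score; this sketch just drops whatever comes last.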
| zingelshuher wrote:
| It was inspired by Mixtral 8x7B, of course. I think the same
| approach, soft-to-hard MoE, can be used in other domains, like
| video/image processing. It would be interesting to take it to
| the extreme, e.g. 4 experts out of 100.
___________________________________________________________________
(page generated 2024-03-22 23:02 UTC)