[HN Gopher] Implementation of mixture of experts language model ...
       ___________________________________________________________________
        
       Implementation of mixture of experts language model in a single
       file of PyTorch
        
       Author : avisoori1x
       Score  : 85 points
       Date   : 2024-03-18 11:57 UTC (4 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | avisoori1x wrote:
        | A from-scratch implementation of a sparse mixture of experts
        | language model in a single file of PyTorch. This is inspired by
        | and largely based on Andrej Karpathy's project 'makemore' and
        | borrows a number of reusable components from that
        | implementation. Like makemore, makeMoE is an autoregressive
        | character-level language model, but it uses the aforementioned
        | sparse mixture of experts architecture. I added Expert Capacity
        | to this implementation to make it more complete.
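        | 
        | For anyone curious, here's a simplified sketch of the noisy
        | top-k routing (condensed for this comment, so details may
        | differ a bit from the repo):
        | 
        | import torch
        | import torch.nn as nn
        | import torch.nn.functional as F
        | 
        | class NoisyTopkRouter(nn.Module):
        |     def __init__(self, n_embd, num_experts, top_k):
        |         super().__init__()
        |         self.top_k = top_k
        |         # per-token expert scores and a learned noise scale
        |         self.topkroute_linear = nn.Linear(n_embd, num_experts)
        |         self.noise_linear = nn.Linear(n_embd, num_experts)
        | 
        |     def forward(self, x):
        |         logits = self.topkroute_linear(x)
        |         noise_logits = self.noise_linear(x)
        |         # scaled unit Gaussian noise added to the routing logits
        |         noise = torch.randn_like(logits) * F.softplus(noise_logits)
        |         noisy_logits = logits + noise
        |         # keep only the top-k experts per token; softmax over
        |         # those, zero weight everywhere else
        |         top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        |         sparse_logits = torch.full_like(noisy_logits, float('-inf'))
        |         sparse_logits = sparse_logits.scatter(-1, indices, top_k_logits)
        |         router_output = F.softmax(sparse_logits, dim=-1)
        |         return router_output, indices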
        
         | radarsat1 wrote:
          | Adding scaled unit Gaussian noise to the logits:
          | noise = torch.randn_like(logits)*F.softplus(noise_logits)
          | noisy_logits = logits + noise
         | 
          | Question: if you swapped this Gaussian noise for Gumbel
          | noise, you would get something like Gumbel softmax, right?
          | I'm curious why it isn't used here. Isn't it the usual way to
          | implement differentiable discrete selection? I ask because
          | I've had some trouble with Gumbel softmax in practice, so I'm
          | curious whether it has downsides compared to other methods.
          | Honestly, just adding normal noise like this seems simpler
          | anyway.
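          | 
          | For concreteness, the Gumbel variant I have in mind would
          | look roughly like this (just a sketch; PyTorch also ships
          | F.gumbel_softmax, which does the perturbation plus the
          | softmax in one call):
          | 
          | import torch
          | import torch.nn.functional as F
          | 
          | def gumbel_perturbed_logits(logits, tau=1.0, eps=1e-9):
          |     # standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)
          |     u = torch.rand_like(logits).clamp_min(eps)
          |     gumbel = -torch.log(-torch.log(u) + eps)
          |     return (logits + gumbel) / tau
          | 
          | # built-in equivalent (hard=True gives straight-through one-hot):
          | # F.gumbel_softmax(logits, tau=1.0, hard=False, dim=-1)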
        
           | avisoori1x wrote:
            | This is a good point. I've yet to try it, as I've let this
            | project sit for a couple of months and am only now getting
            | back to it. I went with this approach because it's simpler,
            | but I'm not sure simpler is necessarily better in this
            | case.
        
             | zingelshuher wrote:
              | Question: have you seen an improvement after adding the
              | noise, in practice? Asking because intuition sometimes
              | doesn't hold up.
        
       | gradascent wrote:
       | Very cool. I'm curious - did you find the results from your
       | mixture of experts model to be (qualitatively) better than with
       | the standard approach?
        
         | avisoori1x wrote:
          | Thanks! This is something I tried, and qualitatively I didn't
          | see a huge difference. I'd like to swap out my hand-rolled
          | modules for standard PyTorch modules for self-attention etc.
          | and train it on the English split of Wikipedia. That's on my
          | to-do list for sure.
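          | 
          | Roughly what I have in mind for that swap (just a sketch,
          | sizes are placeholders):
          | 
          | import torch
          | import torch.nn as nn
          | 
          | n_embd, n_head, block_size = 128, 8, 256
          | # batch_first=True keeps inputs as (batch, seq, n_embd)
          | attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
          | # boolean causal mask: True above the diagonal blocks
          | # attention to future positions
          | causal_mask = torch.triu(
          |     torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
          | 
          | x = torch.randn(4, block_size, n_embd)
          | out, _ = attn(x, x, x, attn_mask=causal_mask, need_weights=False)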
        
         | zingelshuher wrote:
          | I ran some tests. A single model of the same total size is
          | better than the MoE. A single expert out of N is better than
          | a model of the same size (i.e. the size of one expert). Two
          | experts are better than one. That was on a small LLM; not
          | sure if it scales.
        
       | zingelshuher wrote:
        | A similar MoE implementation has been on GitHub for a while,
        | since Jan 2024:
        | 
        | https://github.com/zxaall/moegpt
        
         | avisoori1x wrote:
          | Oh nice. What's new here is noisy top-k routing and expert
          | capacity. It also seems to use the nanoGPT base from Andrej
          | Karpathy. Mine is from January as well. Here's the original
          | blog post:
          | https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
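          | 
          | To clarify what expert capacity means here: each expert only
          | processes up to a fixed number of tokens per batch, and the
          | overflow is dropped. A rough sketch (the capacity formula and
          | names are simplified for this comment, not verbatim from the
          | repo):
          | 
          | import torch
          | 
          | def capacity_mask(indices, num_experts, top_k, capacity_factor=1.25):
          |     # indices: (num_tokens, top_k) expert ids chosen by the router
          |     num_tokens = indices.shape[0]
          |     capacity = int(capacity_factor * num_tokens * top_k / num_experts)
          |     keep = torch.zeros_like(indices, dtype=torch.bool)
          |     flat_keep = keep.view(-1)
          |     flat_idx = indices.reshape(-1)
          |     for e in range(num_experts):
          |         # token slots routed to expert e, in order of appearance
          |         positions = (flat_idx == e).nonzero().squeeze(-1)
          |         # keep at most `capacity` of them; the rest are dropped
          |         flat_keep[positions[:capacity]] = True
          |     return keep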
        
           | zingelshuher wrote:
            | It was inspired by Mixtral 8x7B, of course. I think the
            | same approach, soft-to-hard MoE, can be used in other
            | domains, like video/image processing. It would be
            | interesting to take it to an extreme, like 4 experts out of
            | 100.
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:02 UTC)