[HN Gopher] Scaling Vision with Sparse Mixture of Experts
       ___________________________________________________________________
        
       Scaling Vision with Sparse Mixture of Experts
        
       Author : panarky
       Score  : 28 points
       Date   : 2022-01-13 18:54 UTC (4 hours ago)
        
 (HTM) web link (ai.googleblog.com)
 (TXT) w3m dump (ai.googleblog.com)
        
       | fundamental wrote:
        | Perhaps the described patch-based routing to experts isn't
        | a problem in practice, but at first glance it seems to
        | discard more spatial information than you'd like, as well
        | as introduce more image boundaries than would be ideal.
        | You could argue that the former is a known issue with many
        | DNN architectures, though if the intent is to enable
        | larger-scale generalization, this paper may be trading
        | away more information in the source material for speed
        | than you'd want. AFAIK the shuffling would be less of an
        | issue in textual models than in image-processing tasks. As
        | for the boundaries, I guess padding could be in play,
        | though I suspect the resulting network will be more
        | sensitive to shifts of a few pixels up/down or left/right.
       | 
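        | To make the shift concern concrete, here's a rough numpy
        | sketch (mine, not the paper's code) of ViT-style
        | non-overlapping patchification; rolling the image sideways
        | by even two pixels re-tiles it, so nearly every patch's
        | contents change:
        | 
        |     import numpy as np
        | 
        |     def patchify(img, p):
        |         # Split an HxWxC image into non-overlapping pxp
        |         # patches, flattened to tokens (ViT-style).
        |         h, w, c = img.shape
        |         img = img[:h - h % p, :w - w % p]  # drop remainder
        |         gh, gw = img.shape[0] // p, img.shape[1] // p
        |         patches = img.reshape(gh, p, gw, p, c)
        |         patches = patches.transpose(0, 2, 1, 3, 4)
        |         return patches.reshape(gh * gw, p * p * c)
        | 
        |     img = np.random.rand(224, 224, 3)
        |     tok = patchify(img, 16)                   # (196, 768)
        |     tok_shifted = patchify(np.roll(img, 2, axis=1), 16)
        |     # Fraction of tokens whose contents changed (~1.0):
        |     print((~np.isclose(tok, tok_shifted).all(1)).mean())
        | 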
        | Even with those issues I'd imagine there could be some
        | nice benefits, and the authors are right (IMO) to lean on
        | conditional execution and routing, since it lets the
        | network specialize on a given subdomain while staying
        | computationally efficient. We'll have to see where
        | subsequent work takes this approach.
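        | 
        | For reference, the conditional part boils down to a
        | per-token top-k gate. A toy numpy version of the general
        | MoE pattern, not the paper's actual implementation:
        | 
        |     import numpy as np
        | 
        |     def moe_layer(tokens, gate_w, experts, k=2):
        |         # tokens: (n, d); gate_w: (d, n_experts);
        |         # experts: list of callables (d,) -> (d,)
        |         logits = tokens @ gate_w
        |         probs = np.exp(logits - logits.max(1, keepdims=True))
        |         probs /= probs.sum(1, keepdims=True)
        |         top = np.argsort(-probs, axis=1)[:, :k]  # top-k ids
        |         out = np.zeros_like(tokens)
        |         for i, t in enumerate(tokens):
        |             # Only the k selected experts run per token, so
        |             # compute stays flat as the expert count grows.
        |             for e in top[i]:
        |                 out[i] += probs[i, e] * experts[e](t)
        |         return out
        | 
        |     experts = [lambda t, W=np.random.randn(64, 64) * 0.02:
        |                np.maximum(t @ W, 0) for _ in range(8)]
        |     toks = np.random.randn(196, 64)
        |     y = moe_layer(toks, np.random.randn(64, 8), experts)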
        
       ___________________________________________________________________
       (page generated 2022-01-13 23:01 UTC)