[HN Gopher] Mixture-of-Depths: Dynamically allocating compute in...
       ___________________________________________________________________
        
       Mixture-of-Depths: Dynamically allocating compute in transformers
        
       Author : milliondreams
       Score  : 174 points
       Date   : 2024-04-07 13:42 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | rughouse wrote:
       | It's very similar to Mixture of Experts, but instead of
       | routing tokens to multiple experts, you "deploy to a single
       | expert which can be dynamically skipped".
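       | 
       | A rough sketch of that kind of per-token routing (illustrative
       | only; the module and parameter names here are assumptions, not
       | the paper's code):
       | 
       |     import torch
       |     import torch.nn as nn
       | 
       |     class MoDBlock(nn.Module):
       |         """Toy Mixture-of-Depths layer: only the top-k tokens
       |         (by router score) go through the block; the rest ride
       |         the residual path untouched."""
       |         def __init__(self, d_model, block, capacity=0.5):
       |             super().__init__()
       |             self.router = nn.Linear(d_model, 1)  # score per token
       |             self.block = block          # e.g. attention + MLP
       |             self.capacity = capacity    # fraction of tokens kept
       | 
       |         def forward(self, x):           # x: (batch, seq, d_model)
       |             scores = self.router(x).squeeze(-1)   # (batch, seq)
       |             k = max(1, int(self.capacity * x.size(1)))
       |             top = scores.topk(k, dim=1).indices   # kept tokens
       |             idx = top.unsqueeze(-1).expand(-1, -1, x.size(-1))
       |             picked = torch.gather(x, 1, idx)
       |             # weight by router score so routing stays trainable
       |             w = torch.sigmoid(torch.gather(scores, 1, top))
       |             update = picked + w.unsqueeze(-1) * self.block(picked)
       |             out = x.clone()
       |             out.scatter_(1, idx, update)
       |             return out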
        
         | erikaww wrote:
         | Mixing these would be pretty cool. Further reduced compute for
         | MoE while keeping the performance.
        
           | GaggiX wrote:
           | In the paper they already show a combination of the two as
           | Mixture-of-Depths-and-Experts (MoDE).
        
       | whimsicalism wrote:
       | I think more complicated routing is absolutely going to become
       | more common.
       | 
       | Specifically, I think at some point we are going to move to
       | recursive routing, i.e. passing back through a set of experts
       | again. In the future, 'chain-of-thought' will happen internally
       | to the model, recursively.
        
         | optimalsolver wrote:
         | We can name these hypothetical objects Recursive Neural
         | Networks.
        
           | whimsicalism wrote:
           | I know you're jesting, but RNNs are recursive along the
           | sequence length, whereas I am describing recursion along
           | the depth.
        
             | refulgentis wrote:
             | Like decode the next token, then adjust what you're paying
             | attention to, then decode it again?
        
               | nine_k wrote:
               | Isn't it the only way to, say, understand a pun?
        
           | conradev wrote:
           | Yep: https://arxiv.org/abs/2305.13048
        
         | digdugdirk wrote:
         | See, this is where my understanding of LLMs breaks down. I can
         | understand one token going through the model, but I can't
         | understand a model that has different "experts" internally.
         | 
         | Do you have any resources or links to help explain that
         | concept?
        
           | whimsicalism wrote:
           | It is still just one token going through the model.
           | 
           | I actually think "mixture of experts" is a bit of a
           | misnomer: the 'experts' don't necessarily have very
           | distinct expertise. Think of it more as how neurons
           | activate in the brain - your entire brain doesn't light up
           | for every query, and in these neural networks the same
           | thing happens (the network doesn't fully light up for every
           | query).
           | 
           | I don't really know a resource besides the seminal Noam
           | Shazeer paper, sorry - I'm sure others have higher-level
           | resources.
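           | 
           | As a rough illustration of the "not everything lights up"
           | point, a toy top-k gated MoE layer could look like this (a
           | minimal sketch, not the Shazeer paper's implementation):
           | 
           |     import torch
           |     import torch.nn as nn
           | 
           |     class ToyMoE(nn.Module):
           |         """Each token is sent to only k of the experts, so
           |         most expert parameters stay idle for any token."""
           |         def __init__(self, d_model, n_experts=8, k=2):
           |             super().__init__()
           |             self.gate = nn.Linear(d_model, n_experts)
           |             self.experts = nn.ModuleList([
           |                 nn.Sequential(nn.Linear(d_model, 4 * d_model),
           |                               nn.GELU(),
           |                               nn.Linear(4 * d_model, d_model))
           |                 for _ in range(n_experts)])
           |             self.k = k
           | 
           |         def forward(self, x):        # x: (tokens, d_model)
           |             w, idx = self.gate(x).topk(self.k, dim=-1)
           |             w = w.softmax(dim=-1)    # weights of chosen experts
           |             out = torch.zeros_like(x)
           |             for slot in range(self.k):      # clarity > speed
           |                 for e, expert in enumerate(self.experts):
           |                     m = idx[:, slot] == e   # tokens for expert e
           |                     if m.any():
           |                         out[m] += w[m, slot].unsqueeze(-1) * expert(x[m])
           |             return out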
        
         | miven wrote:
         | What you describe here sounds a little like the line of work
         | centered around Universal Transformers, which basically process
         | the input embeddings through a single transformer block
         | multiple times with a separate module deciding when the
         | embeddings have been cooked enough and can be pulled out of the
         | oven so to speak.
         | 
         | Even more in line with the idea of "experts" there's a paper
         | from last year on Sparse Universal Transformers in which they
         | combine a universal transformer with a sparse mixture of
         | experts, so it's up to the gating mechanism to decide which
         | transformer blocks are used, and in which order, to shape
         | the embeddings.
         | 
         | This really isn't my specialty, but from what I gathered
         | these are tricky to train properly and require more overall
         | compute during inference to reach results comparable to
         | their vanilla transformer counterparts. It's an interesting
         | direction nonetheless; having a fixed upper bound on the
         | number of computation steps per token is, in my opinion, one
         | of the major downsides of the classical transformer
         | architecture.
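         | 
         | For flavor, a minimal sketch of that "keep cooking until
         | done" loop, loosely in the spirit of Universal Transformers /
         | adaptive computation time (the halting rule here is heavily
         | simplified and not the papers' exact formulation):
         | 
         |     import torch
         |     import torch.nn as nn
         | 
         |     class AdaptiveDepth(nn.Module):
         |         """Applies one shared block repeatedly; a small
         |         halting head decides when the embeddings are done."""
         |         def __init__(self, block, d_model,
         |                      max_steps=8, threshold=0.99):
         |             super().__init__()
         |             self.block = block   # one shared transformer block
         |             self.halt = nn.Linear(d_model, 1)
         |             self.max_steps = max_steps
         |             self.threshold = threshold
         | 
         |         def forward(self, x):    # x: (batch, seq, d_model)
         |             # cumulative halting probability per position
         |             halted = torch.zeros(x.shape[:2], device=x.device)
         |             for _ in range(self.max_steps):
         |                 x = self.block(x)
         |                 p = torch.sigmoid(self.halt(x)).squeeze(-1)
         |                 halted = halted + (1 - halted) * p
         |                 if (halted > self.threshold).all():
         |                     break        # everything is "cooked"
         |             return x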
        
         | imranq wrote:
         | Attention is basically routing; these other routing schemes
         | give the model a less fine-grained choice, which potentially
         | makes it easier to train.
        
           | whimsicalism wrote:
           | How is attention basically routing?
        
             | visarga wrote:
             | It routes values based on linear combinations taken from
             | the attention map.
        
               | whimsicalism wrote:
               | But all of those values are created using an MLP with the
               | same parameters, so there is no routing to different
               | parameters.
        
             | pizza wrote:
             | Think of it like an edge flow matrix
        
         | londons_explore wrote:
         | I think the reason this hasn't been done is that you have no
         | way to decide how many recursions are necessary at train
         | time.
         | 
         | And if you pick a random number / try many different levels
         | of recursion, you 'blur' the output, i.e. the output of a
         | layer doesn't know whether it should be producing info
         | important for the final result or the best possible input to
         | another round of recursion.
        
           | whimsicalism wrote:
           | Yes, I think training this model would be hard. Perhaps
           | something akin to how MoEs are trained, where you impose an
           | auxiliary loss on the routing distribution to encourage
           | equitable routing, but for recursion depth.
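           | 
           | Something like the usual Switch/MoE load-balancing
           | auxiliary loss, repurposed for recursion depth, might be a
           | starting point (purely a speculative sketch):
           | 
           |     import torch
           | 
           |     def load_balance_loss(router_probs, choice, n_options):
           |         """Encourages the fraction of tokens taking each
           |         option (expert, or recursion depth) to match the
           |         mean router probability for that option."""
           |         # router_probs: (tokens, n_options), choice: (tokens,)
           |         frac = torch.bincount(choice, minlength=n_options)
           |         frac = frac.float() / choice.numel()
           |         mean_p = router_probs.mean(dim=0)
           |         return n_options * torch.sum(frac * mean_p)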
        
           | pizza wrote:
           | You could just learn the right estimated number of
           | recursions, also passing 'backtracking'/'state' information
           | to the next nested level. Kind of like how state space
           | models encode extractable information via a basis function
           | representation, you could encode extractable recursion
           | state information into the embedding. See also transformers
           | that can learn to recognize n-deep balanced parentheses
           | (Dyck-n languages).
        
           | sdenton4 wrote:
           | This is actually how EfficientNet trains, using random
           | truncation of the network during training. It does just
           | fine... The game is that each layer needs to get as close
           | as it can to good output, improving on the quality of the
           | previous activation.
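           | 
           | Roughly the stochastic-depth trick used in EfficientNet's
           | training recipe (simplified; the real schedule varies the
           | drop rate by layer):
           | 
           |     import torch
           |     import torch.nn as nn
           | 
           |     class StochasticDepth(nn.Module):
           |         """Randomly skips the wrapped residual block during
           |         training, so earlier layers learn to produce
           |         activations that are already usable on their own."""
           |         def __init__(self, block, survival_prob=0.8):
           |             super().__init__()
           |             self.block = block
           |             self.p = survival_prob
           | 
           |         def forward(self, x):
           |             if self.training:
           |                 if torch.rand(1).item() > self.p:
           |                     return x   # drop this layer entirely
           |                 # inverted scaling keeps expectations matched
           |                 return x + self.block(x) / self.p
           |             return x + self.block(x)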
        
       | barrenko wrote:
       | Are we going to hit bullseye?
        
         | ein0p wrote:
         | This only cuts compute by "up to" 50% and only during
         | inference. Quadratic dependence on context size remains, as do
         | the enormous memory requirements. For something to be
         | considered a bullseye in this space, it has to offer
         | nonlinear improvements on both of these axes, and/or be much
         | faster to train. Until that happens, people, including
         | Google, will continue to train bog-standard MoE and dense
         | transformers.
         | Radical experimentation at scale is too expensive even for
         | megacorps at this point.
        
           | visarga wrote:
           | Yeah, all attempts at reducing complexity from quadratic to
           | linear have failed so far; only Mamba still has a chance,
           | but it hasn't been tested on large models and only provides
           | a speedup for 2000+ tokens. That was to be expected: small
           | sequences have very small memory requirements for
           | transformers, but recurrent architectures use the same
           | hidden state size regardless of sequence length. So when
           | the recurrent hidden size > sequence length, the old
           | transformer is faster.
        
             | ein0p wrote:
             | It's more subtle than that IMO. They haven't necessarily
             | "failed" - they just don't have the "superpowers" that the
             | metrics used to evaluate such systems are aimed at. E.g. no
             | such linear method devised so far (that I know of, at
             | least) is able to do very high recall point retrieval in
             | long context _and_ effective in-context learning
             | simultaneously. You get one or the other, but not both. But
             | as far as the metrics go, high-recall retrieval in long
             | context is easier for the researcher to demonstrate and
             | for the observer to comprehend - a typical needle/haystack
             | setting is trivial to put together. It is also something
             | that (unlike in-context learning) humans are usually very
             | bad at, so it's perceived as a "superpower" or "magic". In
             | this case e.g. Mamba being more human-like due to its
             | selective forgetfulness is currently playing against it.
             | But whether it's "better" per se will depend on the task.
             | It's just that we do not know how to evaluate most of the
             | tasks yet, so people keep trying to find the proverbial
             | keys under the lamp post, and measure what they can to make
             | progress, and thereby keep their efforts lavishly funded.
        
           | mdale wrote:
           | This makes opportunities for smaller companies to innovate
           | and experiment, offering solutions (or becoming acquisition
           | targets) where tight inference compute requirements make or
           | break the experience but the larger training cost is less
           | of a concern (such as embedded or local runtime use cases).
        
             | ein0p wrote:
             | Before those opportunities are available to you, someone
             | would need to spend a few million dollars and train a
             | competitive model with this, and then release it under a
             | license that allows commercial use. This is out of reach
             | for the vast majority of smaller companies. These models
             | only excel at large parameter counts, even for narrow
             | problems. This is especially true in the case of MoE, which
             | is a way to push the overall parameter count even larger
             | without lighting up the whole thing for every token.
        
       | mattmcdonagh wrote:
       | I wrote up a bit about it here, from what I could piece together:
       | 
       | https://lifeinthesingularity.com/p/googles-breakthroughs-in-...
        
       ___________________________________________________________________
       (page generated 2024-04-07 23:00 UTC)