[HN Gopher] Mixture-of-Depths: Dynamically allocating compute in...
___________________________________________________________________
Mixture-of-Depths: Dynamically allocating compute in transformers
Author : milliondreams
Score : 174 points
Date : 2024-04-07 13:42 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| rughouse wrote:
| It's very similar to Mixture of Experts. But instead of routing
| tokens to multiple experts, you "deploy to a single expert which
| can be dynamically skipped"
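|
| Roughly, the routing looks something like this (my own sketch
| from a skim of the paper, not their code: a learned scalar
| router scores tokens, only the top-k per sequence go through
| the block, and the rest just ride the residual stream; all
| names here are made up):
|
|   import torch
|   import torch.nn as nn
|
|   class MoDBlock(nn.Module):
|       """Transformer block with Mixture-of-Depths-style routing:
|       the top-k tokens (by router score) get the full block, the
|       rest pass through the residual stream unchanged."""
|
|       def __init__(self, d_model, capacity, nhead=8):
|           super().__init__()
|           self.router = nn.Linear(d_model, 1)   # scalar score per token
|           self.block = nn.TransformerEncoderLayer(
|               d_model, nhead=nhead, batch_first=True)
|           self.capacity = capacity              # k tokens per sequence
|
|       def forward(self, x):                     # x: (batch, seq, d_model)
|           scores = self.router(x).squeeze(-1)   # (batch, seq)
|           k = min(self.capacity, x.shape[1])
|           topk = scores.topk(k, dim=-1).indices # tokens that get compute
|           out = x.clone()
|           for b in range(x.shape[0]):
|               idx = topk[b]
|               routed = self.block(x[b:b+1, idx])[0]   # run the block
|               # scale by the router score so the routing decision
|               # still gets a gradient
|               gate = torch.sigmoid(scores[b, idx]).unsqueeze(-1)
|               out[b, idx] = x[b, idx] + gate * (routed - x[b, idx])
|           return out                            # skipped tokens unchanged
|
|   block = MoDBlock(d_model=64, capacity=16)
|   y = block(torch.randn(2, 128, 64))   # 16 of 128 tokens per
|                                        # sequence get the block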
| erikaww wrote:
| Mixing these would be pretty cool: further reduced compute for
| MoE while keeping the performance.
| GaggiX wrote:
| In the paper they already show a mixing of these two with
| Mixture-of-Depths-and-Experts (MoDE).
| whimsicalism wrote:
| I think more complicated routing is absolutely going to become
| more common.
|
| Specifically, I think at some point we are going to move to
| recursive routing, i.e. passing back through a set of experts
| again. In the future, 'chain-of-thought' will happen internally
| to the model, recursively.
| optimalsolver wrote:
| We can name these hypothetical objects Recursive Neural
| Networks.
| whimsicalism wrote:
| I know you're jesting, but RNNs are recursive along the
| sequence length, whereas I am describing recursion along the
| depth.
| refulgentis wrote:
| Like decode the next token, then adjust what you're paying
| attention to, then decode it again?
| nine_k wrote:
| Isn't it the only way to, say, understand a pun?
| conradev wrote:
| Yep: https://arxiv.org/abs/2305.13048
| digdugdirk wrote:
| See, this is where my understanding of LLMs breaks down. I can
| understand one token going through the model, but I can't
| understand a model that has different "experts" internally.
|
| Do you have any resources or links to help explain that
| concept?
| whimsicalism wrote:
| It is still just one token going through the model.
|
| I actually think 'mixture of experts' is a bit of a misnomer;
| the 'experts' do not necessarily have very distinct expertise.
| Think of it more as how neurons activate in the brain: your
| entire brain doesn't light up for every query, and in these
| networks the same thing happens (the model doesn't fully light
| up for every token).
|
| I don't really know a resource besides the seminal Noam Shazeer
| paper, sorry - I'm sure others have higher-level ones.
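|
| If a sketch helps, this is roughly what a sparse MoE layer
| does, stripped down: top-2 gating over a handful of identical
| MLPs, and only the chosen experts run for each token
| (everything here is illustrative, not from any particular
| codebase):
|
|   import torch
|   import torch.nn as nn
|   import torch.nn.functional as F
|
|   class SparseMoE(nn.Module):
|       """Each token is sent to only top_k of the experts; the
|       rest of the layer stays dark for that token."""
|
|       def __init__(self, d_model, n_experts=8, top_k=2):
|           super().__init__()
|           self.gate = nn.Linear(d_model, n_experts)
|           self.experts = nn.ModuleList(
|               nn.Sequential(nn.Linear(d_model, 4 * d_model),
|                             nn.GELU(),
|                             nn.Linear(4 * d_model, d_model))
|               for _ in range(n_experts))
|           self.top_k = top_k
|
|       def forward(self, x):                  # x: (n_tokens, d_model)
|           logits = self.gate(x)              # (n_tokens, n_experts)
|           weights, idx = logits.topk(self.top_k, dim=-1)
|           weights = F.softmax(weights, dim=-1)
|           out = torch.zeros_like(x)
|           for e, expert in enumerate(self.experts):
|               tokens, slot = (idx == e).nonzero(as_tuple=True)
|               if tokens.numel() == 0:
|                   continue                   # expert unused this batch
|               out[tokens] += (weights[tokens, slot].unsqueeze(-1)
|                               * expert(x[tokens]))
|           return out
|
|   moe = SparseMoE(d_model=64)
|   y = moe(torch.randn(10, 64))   # 10 tokens, each hits 2 of 8 experts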
| miven wrote:
| What you describe here sounds a little like the line of work
| centered around Universal Transformers, which basically process
| the input embeddings through a single transformer block
| multiple times with a separate module deciding when the
| embeddings have been cooked enough and can be pulled out of the
| oven so to speak.
|
| Even more in line with the idea of "experts" there's a paper
| from last year on Sparse Universal Transformers in which they
| combine a universal transformer with sparse mixture of experts,
| so it's up to the gating mechanism to decide which transformer
| blocks are used, and in which order, to shape the embeddings.
|
| This really isn't my specialty, but from what I gathered these
| are tricky to train properly and require more overall compute
| during inference to reach results comparable to their vanilla
| transformer counterparts. It's an interesting direction
| nonetheless; having a fixed upper bound on the number of
| computation steps per token is, in my opinion, one of the major
| downsides of the classical transformer architecture.
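|
| The core loop is small, though; roughly something like this (a
| bare-bones sketch of the idea, skipping the ACT-style remainder
| bookkeeping and per-step timestep embeddings that the actual
| papers use; names are mine):
|
|   import torch
|   import torch.nn as nn
|
|   class UniversalTransformerSketch(nn.Module):
|       """One shared block applied up to max_steps times; a tiny
|       halting head freezes each token once its cumulative halt
|       probability passes 1."""
|
|       def __init__(self, d_model, max_steps=8, nhead=8):
|           super().__init__()
|           self.block = nn.TransformerEncoderLayer(
|               d_model, nhead=nhead, batch_first=True)
|           self.halt = nn.Linear(d_model, 1)
|           self.max_steps = max_steps
|
|       def forward(self, x):                   # (batch, seq, d_model)
|           halted = torch.zeros(x.shape[:2], dtype=torch.bool,
|                                device=x.device)
|           cum_p = torch.zeros(x.shape[:2], device=x.device)
|           for _ in range(self.max_steps):
|               new_x = self.block(x)           # same weights every pass
|               p = torch.sigmoid(self.halt(new_x)).squeeze(-1)
|               cum_p = cum_p + p * (~halted)
|               # tokens that already halted keep their embedding
|               x = torch.where(halted.unsqueeze(-1), x, new_x)
|               halted = halted | (cum_p >= 1.0)
|               if halted.all():
|                   break
|           return x
|
|   ut = UniversalTransformerSketch(d_model=64)
|   y = ut(torch.randn(2, 10, 64))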
| imranq wrote:
| Attention is basically routing; these other routing schemes
| give the model a less fine-grained choice, which potentially
| makes it easier to train.
| whimsicalism wrote:
| How is attention basically routing?
| visarga wrote:
| It routes values based on linear combinations taken from
| the attention map.
| whimsicalism wrote:
| But all of those values are created using an MLP with the
| same parameters, so there is no routing to different
| parameters.
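|
| Concretely (a toy sketch, random weights just to show the
| shapes): attention picks data-dependent mixing weights over
| values that all come out of one shared projection, while
| MoE-style routing picks which parameters run at all:
|
|   import torch
|
|   torch.manual_seed(0)
|   seq, d = 5, 16
|   x = torch.randn(seq, d)
|
|   # attention: data-dependent *weights*, but one shared set of
|   # projections produces every Q, K and V
|   Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
|   Q, K, V = x @ Wq, x @ Wk, x @ Wv
|   attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # (seq, seq)
|   out_attn = attn @ V     # each row is a soft mix of the same V rows
|
|   # MoE-style routing: data-dependent choice of *which
|   # parameters* run for each token
|   W_experts = [torch.randn(d, d) for _ in range(4)]
|   choice = torch.randint(0, 4, (seq,))     # stand-in for a router
|   out_moe = torch.stack([x[i] @ W_experts[int(e)]
|                          for i, e in enumerate(choice)])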
| pizza wrote:
| Think of it like an edge flow matrix
| londons_explore wrote:
| I think the reason this hasn't been done is you have no way to
| decide how many recursions are necessary at train time.
|
| And if you pick a random number / try many different levels of
| recursion, you 'blur' the output. I.e. a layer doesn't know
| whether it should be outputting info important for the final
| result, or the output that is the best possible input to
| another round of recursion.
| whimsicalism wrote:
| Yes, I think training this model would be hard. Perhaps
| something akin to how MoEs are trained, where you impose some
| sort of auxiliary loss to encourage equitable routing, but for
| recursion.
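|
| The MoE trick I'm thinking of is the auxiliary load-balancing
| loss, roughly along the lines of the Switch Transformer's;
| something like:
|
|   import torch
|   import torch.nn.functional as F
|
|   def load_balancing_loss(router_logits, expert_index, n_experts):
|       """Pushes the router towards spreading tokens evenly
|       across experts (Switch-Transformer-style sketch)."""
|       # fraction of tokens actually dispatched to each expert
|       dispatch = F.one_hot(expert_index, n_experts).float().mean(dim=0)
|       # mean router probability assigned to each expert
|       probs = F.softmax(router_logits, dim=-1).mean(dim=0)
|       # minimized when both are uniform, i.e. routing is balanced
|       return n_experts * torch.dot(dispatch, probs)
|
|   logits = torch.randn(32, 4)        # 32 tokens, 4 experts
|   aux = load_balancing_loss(logits, logits.argmax(-1), n_experts=4)
|
| Something analogous over recursion depth could plausibly keep a
| depth-router from collapsing to always-shallow or always-deep.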
| pizza wrote:
| You could just learn an estimate of the right number of
| recursions, also passing 'backtracking'/'state' information
| to the next nested level. Kind of like how state space models
| encode extractable information via a basis function
| representation, you could encode extractable recursion-state
| information into the embedding. See also transformers that
| can learn to recognize n-deep balanced parentheses (Dyck-n
| languages).
| sdenton4 wrote:
| This is actually how EfficientNet trains, using random
| truncation of the network during training. It does just
| fine... The game is that each layer needs to get as close as
| it can to a good output, improving on the previous
| activation's quality.
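|
| Presumably what's at work there is stochastic depth, i.e.
| randomly dropping residual blocks while training; a minimal
| sketch (leaving out the usual rescaling by the survival
| probability):
|
|   import torch
|   import torch.nn as nn
|
|   class StochasticDepthBlock(nn.Module):
|       """Residual block that is randomly skipped during training,
|       so each batch effectively sees a randomly thinned network."""
|
|       def __init__(self, block, survival_prob=0.8):
|           super().__init__()
|           self.block = block                 # any fn: (B, d) -> (B, d)
|           self.survival_prob = survival_prob
|
|       def forward(self, x):
|           if self.training and torch.rand(()) > self.survival_prob:
|               return x                       # block skipped this step
|           return x + self.block(x)           # normal residual path
|
|   net = nn.Sequential(*[StochasticDepthBlock(nn.Linear(32, 32))
|                         for _ in range(6)])
|   y = net(torch.randn(4, 32))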
| barrenko wrote:
| Are we going to hit bullseye?
| ein0p wrote:
| This only cuts compute by "up to" 50%, and only during
| inference. Quadratic dependence on context size remains, as do
| the enormous memory requirements. For something to be
| considered a bullseye in this space, it has to offer nonlinear
| improvements on both of those axes, and/or be much faster to
| train. Until that happens, people, including Google, will
| continue to train bog-standard MoE and dense transformers.
| Radical experimentation at scale is too expensive even for
| megacorps at this point.
| visarga wrote:
| Yeah, all attempts at reducing complexity from quadratic to
| linear have failed so far; only Mamba still has a chance, but
| it's not tested on large models and only provides a speedup
| for 2000+ tokens. That was to be expected: small sequences
| have very small memory requirements for transformers, while
| recurrent architectures use the same hidden state size
| regardless. So when the recurrent hidden size > sequence
| length, the old transformer is faster.
| ein0p wrote:
| It's more subtle than that IMO. They haven't necessarily
| "failed" - they just don't have the "superpowers" that the
| metrics used to evaluate such systems are aimed at. E.g. no
| such linear method devised so far (that I know of, at
| least) is able to do very high recall point retrieval in
| long context _and_ effective in-context learning
| simultaneously. You get one or the other, but not both. But
| as far as the metrics go, high recall retrieval in long
| context is easier for the researcher to demonstrate and
| for the observer to comprehend - a typical needle/haystack
| setting is trivial to put together. It is also something
| that (unlike in-context learning) humans are usually very
| bad at, so it's perceived as a "superpower" or "magic". In
| this case, e.g. Mamba being more human-like due to its
| selective forgetfulness is currently playing against it.
| But whether it's "better" per se will depend on the task.
| It's just that we do not know how to evaluate most of the
| tasks yet, so people keep trying to find the proverbial
| keys under the lamp post, and measure what they can to make
| progress, and thereby keep their efforts lavishly funded.
| mdale wrote:
| This makes opportunities for smaller companies to
| innovate/experiment and offer solutions (or become acquisition
| targets) where tighter inference compute requirements make or
| break the experience but larger training cost is less of a
| concern (such as embedded or local runtime use cases).
| ein0p wrote:
| Before those opportunities are available to you, someone
| would need to spend a few million dollars and train a
| competitive model with this, and then release it under a
| license that allows commercial use. This is out of reach
| for the vast majority of smaller companies. These models
| only excel at large parameter counts, even for narrow
| problems. This is especially true in the case of MoE, which
| is a way to push the overall parameter count even larger
| without lighting up the whole thing for every token.
| mattmcdonagh wrote:
| I wrote up a bit about it here, from what I could piece together:
|
| https://lifeinthesingularity.com/p/googles-breakthroughs-in-...
___________________________________________________________________
(page generated 2024-04-07 23:00 UTC)