[HN Gopher] Transformer Layers as Painters
___________________________________________________________________
Transformer Layers as Painters
Author : fzliu
Score : 28 points
Date : 2024-07-15 18:14 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| hiddencost wrote:
| This is honestly one of the coolest things I've seen in a while.
| bigyikes wrote:
| Very informative paper.
|
| They find:
|
| * Inner layers of transformers share a representation space
|
| * Some middle layers can be dropped without total failure (though
| it results in reduced performance)
|
| * Middle layers are not interchangeable; they perform different
| functions
|
| * Order of layers only matters somewhat
|
| * Layers can somewhat be executed in parallel
|
| Each layer performs a different function but speaks the same
| language as other layers. A stack of transformers isn't
| performing a sequence of fundamental transformations so much as
| it is performing a sequence of additions, each layer adding new
| paint to a shared canvas.
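|
| (A minimal sketch of that "shared canvas" -- the residual stream
| in a pre-norm decoder block, written in PyTorch; the class and
| module names here are illustrative, not taken from the paper:)
|
|     import torch.nn as nn
|
|     class PrenormBlock(nn.Module):
|         # Each sublayer only *adds* its output to the residual
|         # stream (the shared canvas) and passes it along.
|         def __init__(self, d_model=64, n_heads=4):
|             super().__init__()
|             self.ln1 = nn.LayerNorm(d_model)
|             self.attn = nn.MultiheadAttention(
|                 d_model, n_heads, batch_first=True)
|             self.ln2 = nn.LayerNorm(d_model)
|             self.mlp = nn.Sequential(
|                 nn.Linear(d_model, 4 * d_model),
|                 nn.GELU(),
|                 nn.Linear(4 * d_model, d_model))
|
|         def forward(self, x):
|             h = self.ln1(x)
|             x = x + self.attn(h, h, h, need_weights=False)[0]
|             x = x + self.mlp(self.ln2(x))
|             return x   # canvas handed to the next painter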
|
| Since the layers speak the same language, it makes me wonder how
| we could modify and extend a transformer. Can you train other
| models to share the same representational space and have them
| "plug in" to the transformer? Does this shared representational
| space make it easier to perform RL and unlock agentic behavior?
| derefr wrote:
| > A stack of transformers isn't performing a sequence of
| fundamental transformations so much as it is performing a
| sequence of additions, each layer adding new paint to a shared
| canvas.
|
| I don't know if "additions" is a good mental model.
|
| If you have layers 1..N that you're training via backprop, then
| layer N has no reason to "push" some of its computation "back"
| to layer N-1 if that computation could be done fully
| independently using the information already available at layer
| N-1. Instead, you'd just get a wider parameter space + post-
| pruning embedding vector at layer N, to do more parallel work
| and produce more parallel outputs at layer N.
|
| The only reason you end up with more than a single hidden layer
| doing anything other than pure passthrough is that a given
| layer K is constrained in the operations it can perform. If a
| layer K requires inputs that are more than just a single linear
| AddMM+softmax of the input layer away, then layer K can't do
| those operations "on its own", and needs some other layer to
| supply a pre-transformed input for layer K to do "the rest" of
| the work on. In practice, layer K thus acts as a loss function
| to train layer K-1 to compute those nonlinear inputs that layer
| K needs; and so on, pushing back "dependent responsibilities"
| for computing the outputs, all the way back to the input layer.
|
| (You might have the intuition that a layer might just "run out
| of parameter space" and so need to "slide" some of the
| computation backward in time to a previous layer -- but no,
| there'd be no reason to do this, as the abstract NN passthrough
| nodes required to propagate that already-complete computation
| from a previous layer, take up just as much space in a
| Transformer's Q/K embedding vectors as abstract NN nodes that
| are actually doing nontrivial computation do.)
|
| So fundamentally, each layer is doing _something_ to the
| previous layer that's _not_ just "addition" (an operation
| which _is_ one AddMM+softmax away.)
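|
| (To make that shorthand concrete -- one illustrative reading,
| not code from the paper: a single "step" here is one affine map
| followed by one softmax, and a composition of two such
| nonlinear steps can't be folded back into a single step:)
|
|     import torch
|
|     def one_step(x, W, b):
|         # one "AddMM + softmax": a single affine map, then
|         # a single nonlinearity
|         return torch.softmax(torch.addmm(b, x, W), dim=-1)
|
|     x = torch.randn(2, 8)
|     W1, b1 = torch.randn(8, 8), torch.randn(8)
|     W2, b2 = torch.randn(8, 8), torch.randn(8)
|
|     # f(g(x)) needs two layers: the inner softmax can't be
|     # absorbed into the outer affine map.
|     y = one_step(one_step(x, W1, b1), W2, b2)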
|
| ...but that being said, it _can_ be something conceptually
| equally-trivial to addition. For example, in an image-
| generation model, "alpha-blending by weighted averaging in an
| opponent-color-space projection" is _conceptually_ trivial,
| but isn't a simple AddMM+softmax, and so requires another NN
| layer each time any object must be composited on top of any
| other existing alpha-blended object.
|
| ---
|
| The interesting intuition that this "dependency graph" mental
| model of a Transformer gives you is that, despite the analogy
| used in the paper:
|
| > The canvas (input) is passed along a series of painters. Some
| painters specialize in birds, while others are better at
| painting wheels. Each painter receives the canvas from the
| painter below her, then she decides whether to add a few
| strokes to the painting or just pass it along to the painter
| above her
|
| ...it's actually still _possible_ for several layers to all
| know how to paint a bird _and_ a wheel _and_ other things too.
|
| The constraint is that those things all need to be able to be
| done _in parallel and independently_ of each other, for them to
| be "scheduled" together as parallel operations within a single
| layer. A given layer can't do two things at the same time -- or
| at least can't _finish_ two things at the same time -- if those
| two things are interdependent in a way that requires a
| nontrivial (not just AddMM+softmax) amount of math to merge.
|
| Whereas, if any object depends on another object, then the work
| for one or the other object has to be "pushed backward in time"
| at training time to some previous layer, so that the outputs of
| that operation can be used as inputs. (Thus "painting" in a
| very literal sense -- paint has to already be on the canvas by
| the time you want to blend on top of it!)
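|
| (A toy version of that scheduling constraint, with made-up
| "bird"/"wheel" feature names: independent features can be
| packed side by side into one wider layer, but a feature that
| consumes another feature's output has to wait for an earlier
| layer to supply it:)
|
|     import torch
|     import torch.nn as nn
|
|     d = 16
|     x = torch.randn(4, d)
|
|     bird  = nn.Linear(d, 8)   # detects birds
|     wheel = nn.Linear(d, 8)   # detects wheels
|
|     # Independent: both fit in one layer, just made wider
|     # (parallel work, same depth).
|     same_layer = torch.cat([bird(x), wheel(x)], dim=-1)
|
|     # Dependent: blending *on top of* the bird feature needs
|     # that feature to already exist, so it must come from an
|     # earlier layer -- extra depth, not extra width.
|     earlier = torch.relu(bird(x))      # supplied by layer K-1
|     later = nn.Linear(8, 8)(earlier)   # consumed by layer K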
|
| When that "pushing computation backward in time" happens often
| during training, in an biased way (i.e. with a causal
| correlation in the computational-dependency order that the
| occlusion/lighting/reflection/etc effect requires the parts of
| the scene be composed in), then due to the "scheduling
| constraint", some particular layers might end up trained more
| often to do a particular thing; and so end up better at doing
| that thing; and so end up being "the place" where that thing
| happens.
|
| But just the same, if the "pushing computation backward in
| time" happens often during training in an _unbiased_ way --
| and/or if you paint birds-on-birds-on-birds in your image, such
| that there's no way to get one layer to be "the" bird expert --
| then _many_ layers will end up being trained to do the task,
| slowly, as at one time or another the "responsibility" for
| learning that sub-task falls on one or more different layers
| for each training example.
| bluecoconut wrote:
| Nice~ Glad to see this published / confirmed by others. Next I
| hope to see some of this symmetry used to improve MoE / dynamic
| compute / adaptive style models!
|
| Context: I found the same structure -- early, middle, and end
| layers serving different purposes, including the permutability
| of the middle layers -- a year or so ago, but never got around
| to testing more models rigorously or publishing it.
|
| We talked about it a bit in a hackernews thread a few months ago.
| (https://news.ycombinator.com/item?id=39504780#39505523)
|
| > One interesting finding though (now that I'm rambling and just
| typing a lot) is that in a static model, you can "shuffle" the
| layers (e.g. swap layer 4's weights with layer 7's weights) and
| the resulting tokens roughly seem similar (likely caused by the
| ResNet style backbone). Only the first ~3 layers and last ~3
| layers seem "important to not permute". It kinda makes me
| interpret models as using the first few layers to get into some
| "universal" embedding space, operating in that space "without
| ordering in layer-order", and then "projecting back" to token
| space at the end. (rather than staying in token space the whole
| way through).
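|
| (For anyone who wants to try the shuffle themselves: a minimal
| sketch, assuming a Llama-style checkpoint loaded via Hugging
| Face transformers, where the decoder blocks sit in
| model.model.layers; the exact attribute path varies by
| architecture:)
|
|     from transformers import AutoModelForCausalLM
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "meta-llama/Llama-2-7b-hf")   # any Llama-style model
|     layers = model.model.layers       # nn.ModuleList of blocks
|
|     # swap two middle layers' weights, leaving the first and
|     # last few layers alone
|     layers[4], layers[7] = layers[7], layers[4]
|
|     # ...then re-run perplexity / downstream evals on the
|     # permuted model and compare against the original.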
| bigyikes wrote:
| Do you have any indication whether this "universal" space might
| be shared between models, or is it unique to the architecture
| and training set?
|
| Maybe it's crazy, but is there any possibility that, say, Llama
| and Mistral use the same representation space?
| dr_dshiv wrote:
| > the same representation space
|
| Platonic world of forms, perchance?
___________________________________________________________________
(page generated 2024-07-15 23:00 UTC)