[HN Gopher] Transformer Layers as Painters
       ___________________________________________________________________
        
       Transformer Layers as Painters
        
       Author : fzliu
       Score  : 28 points
       Date   : 2024-07-15 18:14 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hiddencost wrote:
       | This is honestly one of the coolest things I've seen in a while.
        
       | bigyikes wrote:
       | Very informative paper.
       | 
       | They find:
       | 
       | * Inner layers of transformers share a representation space
       | 
       | * Some middle layers can be dropped without total failure (though
       | it results in reduced performance)
       | 
        | * Middle layers are not interchangeable; they perform different
        | functions
       | 
       | * Order of layers only matters somewhat
       | 
        | * Layers can, to some extent, be executed in parallel
       | 
        | Each layer performs a different function but speaks the same
        | language as the other layers. A stack of transformer layers
        | isn't performing a sequence of fundamental transformations so
        | much as it is performing a sequence of additions, each layer
        | adding new paint to a shared canvas.
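        | 
        | Roughly the kind of intervention being tested, sketched on a toy
        | pre-norm block stack -- the Block class, sizes, and layer
        | indices below are made up for illustration, not the paper's
        | code, and with random weights this only shows the mechanics of
        | the edits, not the result:
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class Block(nn.Module):
        |         # generic pre-norm transformer block: attention + MLP
        |         def __init__(self, d=64, heads=4):
        |             super().__init__()
        |             self.ln1 = nn.LayerNorm(d)
        |             self.ln2 = nn.LayerNorm(d)
        |             self.attn = nn.MultiheadAttention(
        |                 d, heads, batch_first=True)
        |             self.mlp = nn.Sequential(nn.Linear(d, 4 * d),
        |                                      nn.GELU(),
        |                                      nn.Linear(4 * d, d))
        | 
        |         def forward(self, x):
        |             h = self.ln1(x)
        |             a, _ = self.attn(h, h, h)
        |             x = x + a
        |             return x + self.mlp(self.ln2(x))
        | 
        |     blocks = nn.ModuleList([Block() for _ in range(12)])
        |     x = torch.randn(1, 16, 64)  # (batch, seq, dim)
        | 
        |     def run(order):
        |         h = x
        |         for i in order:
        |             h = blocks[i](h)
        |         return h
        | 
        |     full = run(range(12))
        |     skipped = run([0, 1, 2, 9, 10, 11])  # drop the middle layers
        |     shuffled = run([0, 1, 2, 7, 4, 6, 3, 8, 5, 9, 10, 11])
        |     print((full - skipped).norm(), (full - shuffled).norm())
        | 
        | The paper's skip / shuffle / run-the-middle-in-parallel variants
        | are all interventions of this shape, applied to trained models
        | and then evaluated.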
       | 
       | Since the layers speak the same language, it makes me wonder how
       | we could modify and extend a transformer. Can you train other
       | models to share the same representational space and have them
       | "plug in" to the transformer? Does this shared representational
       | space make it easier to perform RL and unlock agentic behavior?
        
         | derefr wrote:
          | > A stack of transformer layers isn't performing a sequence of
          | fundamental transformations so much as it is performing a
          | sequence of additions, each layer adding new paint to a shared
          | canvas.
         | 
         | I don't know if "additions" is a good mental model.
         | 
         | If you have layers 1..N that you're training via backprop, then
         | layer N has no reason to "push" some of its computation "back"
         | to layer N-1 if that computation could be done fully
         | independently using the information already available at layer
         | N-1. Instead, you'd just get a wider parameter space + post-
         | pruning embedding vector at layer N, to do more parallel work
         | and produce more parallel outputs at layer N.
         | 
         | The only reason you end up with more than a single hidden layer
          | doing anything other than pure passthrough is that a given
         | layer K is constrained in the operations it can perform. If a
         | layer K requires inputs that are more than just a single linear
         | AddMM+softmax of the input layer away, then layer K can't do
         | those operations "on its own", and needs some other layer to
         | supply a pre-transformed input for layer K to do "the rest" of
         | the work on. In practice, layer K thus acts as a loss function
         | to train layer K-1 to compute those nonlinear inputs that layer
         | K needs; and so on, pushing back "dependent responsibilities"
         | for computing the outputs, all the way back to the input layer.
         | 
         | (You might have the intuition that a layer might just "run out
         | of parameter space" and so need to "slide" some of the
         | computation backward in time to a previous layer -- but no,
         | there'd be no reason to do this, as the abstract NN passthrough
         | nodes required to propagate that already-complete computation
          | from a previous layer take up just as much space in a
         | Transformer's Q/K embedding vectors as abstract NN nodes that
         | are actually doing nontrivial computation do.)
         | 
         | So fundamentally, each layer is doing _something_ to the
          | previous layer that's _not_ just "addition" (an operation
          | which _is_ one AddMM+softmax away).
         | 
          | ...but that being said, it _can_ be something conceptually
          | equally trivial to addition. For example, in an image-
          | generation model, "alpha-blending by weighted averaging in an
          | opponent-color-space projection" is _conceptually_ trivial,
          | but isn't a simple AddMM+softmax, and so requires another NN
          | layer each time any object must be composited on top of any
          | other existing alpha-blended object.
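          | 
          | To make both halves of that concrete: in a pre-norm block the
          | update really is written into the stream by a literal
          | addition, but the update itself is a nonlinear function of
          | everything already on the canvas, so it isn't a fixed additive
          | term. A toy sketch (random weights, sizes invented for
          | illustration):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     d = 8
          |     ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
          |     attn = nn.MultiheadAttention(d, 2, batch_first=True)
          |     mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
          |                         nn.Linear(4 * d, d))
          | 
          |     def layer(h):
          |         # read the whole current canvas, add a nonlinear update
          |         a, _ = attn(ln1(h), ln1(h), ln1(h))
          |         h = h + a
          |         return h + mlp(ln2(h))
          | 
          |     h1 = torch.randn(1, 5, d)
          |     h2 = torch.randn(1, 5, d)
          |     # the update depends on what is already on the canvas,
          |     # so this prints False:
          |     print(torch.allclose(layer(h1) - h1, layer(h2) - h2))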
         | 
         | ---
         | 
         | The interesting intuition that this "dependency graph" mental
         | model of a Transformer gives you is that, despite the analogy
         | used in the paper:
         | 
         | > The canvas (input) is passed along a series of painters. Some
         | painters specialize in birds, while others are better at
         | painting wheels. Each painter receives the canvas from the
         | painter below her, then she decides whether to add a few
         | strokes to the painting or just pass it along to the painter
         | above her
         | 
         | ...it's actually still _possible_ for several layers to all
         | know how to paint a bird _and_ a wheel _and_ other things too.
         | 
          | The constraint is that those things all need to be able to be
          | done _in parallel and independently_ of each other, for them
          | to be "scheduled" together as parallel operations within a
          | single layer. A given layer can't do two things at the same
          | time -- or at least can't _finish_ two things at the same
          | time -- if those two things are interdependent in a way that
          | requires a nontrivial (not just AddMM+softmax) amount of math
          | to merge.
         | 
         | Whereas, if any object depends on another object, then the work
         | for one or the other object has to be "pushed backward in time"
         | at training time to some previous layer, so that the outputs of
         | that operation can be used as inputs. (Thus "painting" in a
         | very literal sense -- paint has to already be on the canvas by
         | the time you want to blend on top of it!)
         | 
         | When that "pushing computation backward in time" happens often
         | during training, in an biased way (i.e. with a causal
         | correlation in the computational-dependency order that the
         | occlusion/lighting/reflection/etc effect requires the parts of
         | the scene be composed in), then due to the "scheduling
         | constraint", some particular layers might end up trained more
         | often to do a particular thing; and so end up better at doing
         | that thing; and so end up being "the place" where that thing
         | happens.
         | 
          | But just the same, if the "pushing computation backward in
          | time" happens often during training in an _unbiased_ way --
          | and/or if you paint birds-on-birds-on-birds in your image,
          | such that there's no way to get one layer to be "the" bird
          | expert -- then _many_ layers will end up being trained to do
          | the task, slowly, as at one time or another the
          | "responsibility" for learning that sub-task falls on one or
          | more different layers for each training example.
        
       | bluecoconut wrote:
       | Nice~ Glad to see this published / confirmed by others. Next I
       | hope to see some of this symmetry used to improve MoE / dynamic
       | compute / adaptive style models!
       | 
        | Context: I found the same structure (early, middle, and end
        | layers serving different purposes, including the permutability
        | of the middle layers) a year or so ago, but never got around to
        | testing more models rigorously or publishing it.
       | 
       | We talked about it a bit in a hackernews thread a few months ago.
       | (https://news.ycombinator.com/item?id=39504780#39505523)
       | 
       | > One interesting finding though (now that I'm rambling and just
       | typing a lot) is that in a static model, you can "shuffle" the
       | layers (eg. swap layer 4's weights with layer 7's weights) and
       | the resulting tokens roughly seem similar (likely caused by the
       | ResNet style backbone). Only the first ~3 layers and last ~3
       | layers seem "important to not permute". It kinda makes me
       | interpret models as using the first few layers to get into some
       | "universal" embedding space, operating in that space "without
       | ordering in layer-order", and then "projecting back" to token
       | space at the end. (rather than staying in token space the whole
       | way through).
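        | 
        | For anyone who wants to try that swap, a rough sketch of the
        | experiment, assuming a HuggingFace Llama-style model whose
        | decoder blocks live in model.model.layers -- the checkpoint name
        | is a placeholder and the attribute path varies by architecture:
        | 
        |     import torch
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
        |     tok = AutoTokenizer.from_pretrained(name)
        |     model = AutoModelForCausalLM.from_pretrained(
        |         name, torch_dtype=torch.float16, device_map="auto")
        | 
        |     def gen(prompt):
        |         ids = tok(prompt, return_tensors="pt").input_ids
        |         ids = ids.to(model.device)
        |         out = model.generate(ids, max_new_tokens=30,
        |                              do_sample=False)
        |         return tok.decode(out[0], skip_special_tokens=True)
        | 
        |     before = gen("The capital of France is")
        | 
        |     layers = model.model.layers  # ModuleList of decoder blocks
        |     # swap two middle layers, as described above
        |     layers[4], layers[7] = layers[7], layers[4]
        | 
        |     print(before)
        |     print(gen("The capital of France is"))
        | 
        | Per the observation above, swaps inside the middle tend to leave
        | the output roughly coherent, while touching the first or last
        | few blocks does much more damage.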
        
         | bigyikes wrote:
         | Do you have any indication whether this "universal" space might
         | be shared between models, or is it unique to the architecture
         | and training set?
         | 
         | Maybe it's crazy, but is there any possibility that, say, Llama
         | and Mistral use the same representation space?
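          | 
          | One way to poke at that would be to take mid-layer features
          | from both models over the same sentences and compare them with
          | linear CKA (Kornblith et al. 2019). A high score would only
          | mean the spaces are related up to a rotation/scaling, not that
          | activations are literally interchangeable. A sketch --
          | checkpoint names, the layer index, and mean-pooling are
          | arbitrary choices here:
          | 
          |     import torch
          |     from transformers import AutoModel, AutoTokenizer
          | 
          |     def feats(name, sentences, layer=16):
          |         tok = AutoTokenizer.from_pretrained(name)
          |         model = AutoModel.from_pretrained(
          |             name, torch_dtype=torch.float16, device_map="auto")
          |         rows = []
          |         for s in sentences:
          |             ids = tok(s, return_tensors="pt").to(model.device)
          |             with torch.no_grad():
          |                 out = model(**ids, output_hidden_states=True)
          |             hs = out.hidden_states[layer][0]  # (seq, dim)
          |             rows.append(hs.mean(dim=0))  # pool tokens
          |         return torch.stack(rows).float()
          | 
          |     def linear_cka(x, y):
          |         x = x - x.mean(0)
          |         y = y - y.mean(0)
          |         return (x.T @ y).norm() ** 2 / ((x.T @ x).norm() *
          |                                         (y.T @ y).norm())
          | 
          |     # toy list; use many more sentences in practice
          |     sents = ["The cat sat on the mat.",
          |              "Gradient descent minimizes a loss.",
          |              "Paris is the capital of France."]
          |     # placeholder checkpoints; any two open models would do
          |     a = feats("meta-llama/Llama-2-7b-hf", sents)
          |     b = feats("mistralai/Mistral-7B-v0.1", sents)
          |     print(linear_cka(a, b))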
        
           | dr_dshiv wrote:
           | > the same representation space
           | 
           | Platonic world of forms, perchance?
        
       ___________________________________________________________________
       (page generated 2024-07-15 23:00 UTC)