[HN Gopher] Transformer learning explained: Coinductive guide to...
       ___________________________________________________________________
        
       Transformer learning explained: Coinductive guide to inductive
       transformer heads
        
       Author : adamnemecek
       Score  : 78 points
       Date   : 2023-02-28 16:10 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | adamnemecek wrote:
        | I'm the author. I recently realized that Hopf algebras are a
        | surprisingly good abstraction for machine learning, and this
        | paper is an invitation for others to look into them, as they
        | can provide a unified foundation for convnets, transformers
        | and diffusion models.
       | 
        | Hopf algebras are tensorial bialgebras, meaning they are both
        | a tensor and a cotensor at once. What's the co- prefix? You
        | can think of it as "a thing with recurrence relationships".
        | This matters because recurrence relationships can be thought
        | of as a generalization of autodiff. The two ML building
        | blocks, tensors and autodiff, are then unified under one
        | umbrella.
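        | 
        | If the product/coproduct language is unfamiliar, here is a
        | toy illustration (not from the paper; it's just the textbook
        | binomial Hopf algebra on polynomials in one variable): a
        | minimal Python sketch of one object carrying both a product
        | and a coproduct, with an antipode tying the two together.
        | 
        |   # Toy binomial Hopf algebra on k[x].
        |   # Elements are dicts {degree: coefficient}.
        |   from math import comb
        |   from collections import defaultdict
        | 
        |   def product(p, q):
        |       # m: ordinary polynomial multiplication
        |       r = defaultdict(float)
        |       for i, a in p.items():
        |           for j, b in q.items():
        |               r[i + j] += a * b
        |       return dict(r)
        | 
        |   def coproduct(p):
        |       # Delta(x^n) = sum_k C(n, k) x^k (x) x^(n-k)
        |       r = defaultdict(float)
        |       for n, a in p.items():
        |           for k in range(n + 1):
        |               r[(k, n - k)] += a * comb(n, k)
        |       return dict(r)
        | 
        |   def counit(p):
        |       # eps: pick out the constant term
        |       return p.get(0, 0.0)
        | 
        |   def antipode(p):
        |       # S(x^n) = (-1)^n x^n
        |       return {n: ((-1) ** n) * a for n, a in p.items()}
        | 
        |   # Antipode axiom m(S (x) id)Delta = unit . counit, checked
        |   # on p = 2x^3 + x: every coefficient cancels to 0, which
        |   # matches counit(p) = 0.
        |   p = {3: 2.0, 1: 1.0}
        |   lhs = defaultdict(float)
        |   for (i, j), c in coproduct(p).items():
        |       for n, a in product(antipode({i: 1.0}), {j: 1.0}).items():
        |           lhs[n] += c * a
        |   print(dict(lhs), counit(p))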
       | 
        | Looking at transformers through the lens of Hopf algebras
        | allowed me to identify their learning mechanism. Namely, the
        | residual stream corresponds to what my paper calls the unit
        | impulse path, and the attention corresponds to the
        | convolution path of a Hopf algebra. The transformer learns by
        | enforcing an invariant called Hopf coherence between the two
        | paths.
       | 
       | Hopf convolution is a generalization of the standard convolution
       | and transformer attention emerges as a part of this convolution.
       | Transformer models can then be understood as linear time-
       | invariant systems.
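        | 
        | To make the "generalization of the standard convolution"
        | claim concrete, here is a toy check (mine, not code from the
        | paper): for linear functionals on a basis e_0, e_1, ... with
        | coproduct Delta(e_n) = sum_{i+j=n} e_i (x) e_j, the
        | convolution f*g = m(f (x) g)Delta is exactly ordinary
        | discrete convolution.
        | 
        |   import numpy as np
        | 
        |   def hopf_convolve(f, g, length):
        |       # (f*g)(e_n) = sum_{i+j=n} f(e_i) * g(e_j)
        |       return np.array([sum(f(i) * g(n - i) for i in range(n + 1))
        |                        for n in range(length)])
        | 
        |   a = np.array([1.0, 2.0, 3.0])
        |   b = np.array([0.5, -1.0])
        |   f = lambda i: a[i] if i < len(a) else 0.0
        |   g = lambda j: b[j] if j < len(b) else 0.0
        | 
        |   print(hopf_convolve(f, g, len(a) + len(b) - 1))
        |   print(np.convolve(a, b))   # same: [ 0.5  0.  -0.5 -3. ]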
       | 
        | Furthermore, I believe that Hopf coherence provides a better
        | way of doing backprop, one that happens within individual
        | layers as opposed to across the whole graph, but more
        | research is needed.
       | 
       | This approach also opens the door for verified machine learning,
       | which I will discuss in future work.
       | 
        | I'm currently in the process of implementing a simple machine
        | learning framework based on Hopf algebras, but it might take
        | some time.
       | 
        | I have a Discord channel if you want to discuss this further
        | or keep track of our progress: https://discord.cofunctional.ai
        
         | rnosov wrote:
          | Is there any chance you can answer in simple terms whether
          | the "Hopf coherence" method would be any faster at doing
          | backpropagation than current methods (gradient descent)? My
          | math skills are a bit rusty now, so I can't tell from
          | looking at your paper alone.
        
           | adamnemecek wrote:
            | Yes, that's the reason I'm talking about it. It should be
            | faster since it works locally (within individual layers)
            | as opposed to across the whole graph.
        
             | rnosov wrote:
              | Could you quantify it (twice, three times, etc.)?
              | Another question: does it apply only to the attention
              | heads' learning, or to the whole shebang?
        
               | adamnemecek wrote:
                | It's going to be more than that, like significantly
                | more. Hopf coherence theoretically does away with
                | backprop altogether.
               | 
                | I'm in the process of implementing it but it might
                | take some time.
               | 
               | I'm aware of Hinton's forward-forward, but I need more
               | time to provide a comparison.
        
       | harveywi wrote:
       | One forward-forward step for man, one giant Hopf for mankind.
        
       | uoaei wrote:
       | You comment elsewhere that the Hopf coherence-based optimization
       | procedure works locally. How does it compare to local predictive
       | coding techniques?
       | 
       | E.g.: https://openreview.net/pdf?id=PdauS7wZBfC
        
         | adamnemecek wrote:
         | I have to read the paper in more detail but I imagine there's a
         | good amount of overlap.
        
       | frbd wrote:
        | The Geometric Deep Learning theory developed by Bronstein et
        | al. is also a very nice unifying theory for deep learning,
        | using Graph Neural Networks. Did you study the interactions
        | with these?
        
         | niemandhier wrote:
          | The book you mention is quite easy to read and didactically
          | structured; the paper was hard to follow.
          | 
          | Just to encourage people to read the book, even if they did
          | not grok the paper.
        
         | adamnemecek wrote:
          | I did, yes, and there is a good amount of overlap. In fact,
          | I will be speaking with one of the authors later this week.
         | 
          | However, I find that my approach relies on fewer concepts.
          | Our approaches share some concepts, but what I like about
          | mine is that all the concepts are just parts of the Hopf
          | algebra, whereas in GDL they are somewhat ad hoc.
         | 
          | My approach also ties in nicely with other things like
          | linear logic and PL theory, and therefore provides concrete
          | steps on how to build the next-gen machine learning
          | framework, which might very well provide verifiable machine
          | learning.
         | 
          | My approach fundamentally says that transformer models,
          | diffusion models and convnets can be expressed with the
          | Hopf convolution, which is a generalization of the standard
          | convolution.
         | 
          | Also, my approach provides a new learning procedure which
          | does away with the backward pass and replaces it with
          | something resembling the feedback found in electronic
          | circuits.
        
       | algeber wrote:
       | hey adam, by the way I implemented the Hopf algebraic shuffle
       | product in Julia, but ironically you blocked me before I got the
       | chance to tell you -- chakravala
        
       | Afrotechmods wrote:
       | As an EE, trying to understand what this was about fried my
       | brain.
        
         | adamnemecek wrote:
          | EE is actually a good place to be. Hopf algebras are like
          | LTI systems. There's a Laplace transform/z-transform
          | relationship between the unit impulse path and the
          | convolution path.
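          | 
          | The flavor of the analogy (a standard LTI/DSP fact, not
          | something specific to the paper): convolving two finite
          | sequences in the time domain is the same as multiplying
          | their z-transforms, which for finite sequences are just
          | polynomials. A quick numpy sanity check:
          | 
          |   import numpy as np
          | 
          |   h = np.array([1.0, 0.5, 0.25])   # toy impulse response
          |   x = np.array([2.0, -1.0])        # toy input
          | 
          |   time_domain = np.convolve(h, x)  # convolution in time
          |   z_domain    = np.polymul(h, x)   # product of z-polynomials
          | 
          |   print(np.allclose(time_domain, z_domain))   # True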
        
       | p1esk wrote:
          | Hey Adam, if you're interested in smart people looking at
          | your ideas and providing criticism, you should submit your
          | papers to an ML conference for peer review.
        
         | adamnemecek wrote:
            | I will; this is the first approximation. I'm working on
            | an implementation that will provide further guidance for
            | the paper.
        
           | algeber wrote:
            | by the way, I have already had shuffle products
            | implemented for years, but ironically you blocked me the
            | other day before I could tell you more about it; as you
            | know, science always works faster when you stop the
            | conversation -- chakravala
        
       | ftxbro wrote:
        | I don't know anything about algebraic geometry or category
        | theory or Haskell, but I know some stuff about the normal
        | boring kind of
       | linear algebra, and in some places the author relies on
       | statements like "since the eigenvalues are real and positive we
       | can conclude that it is a symmetric positive definite matrix"
       | which I know are just wrong (it wouldn't have to be symmetric).
       | This is particularly bad because the rest of the argument (for
       | example in 6.2.2) explains why it's so cool that we've shown it's
       | symmetric (we haven't).
        
         | [deleted]
        
         | adamnemecek wrote:
         | You are right, I will rectify that. The statement still holds
         | but I need to provide a justification.
        
         | matheist wrote:
         | For example, [[1, 1], [0, 1]] is a non-symmetric matrix with
         | positive real eigenvalues.
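          | 
          | A quick numpy sanity check of that counterexample, if
          | anyone wants to see it:
          | 
          |   import numpy as np
          | 
          |   A = np.array([[1.0, 1.0],
          |                 [0.0, 1.0]])
          |   print(np.linalg.eigvals(A))   # [1. 1.] -> real, positive
          |   print(np.allclose(A, A.T))    # False   -> not symmetric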
        
       | abeppu wrote:
        | I don't understand the claims well enough to evaluate them.
        | But the way this document is written, and where it chooses to
        | put its attention (no pun intended), strike me as odd. The
        | sections on the transformer and its components seem like they
        | should be backed
       | with a lot more specifics tying back to the (Combinatorial) Hopf
       | algebra described in the preceding sections. The Hopf algebra
       | section defines a bunch of notation and structure, and introduces
       | the label "Hopf Coherence" for some invariant, which is expressed
       | in terms of that pile of notation. The Combinatorial Hopf Algebra
       | section introduces a "shuffle product".
       | 
       | But we never actually get a clear explanation of how this stuff
       | lines up with parts of the transformer / attention mechanism.
       | There's an _assertion_ of Hopf coherence between two paths, and
        | in 6.2 there's a description of the QK and OV circuits as
        | having a coproduct-product relationship, which isn't
        | especially clear but is at least a direction. What is the
        | antipode in the transformer context? If Hopf coherence is
        | m(id \otimes S)\Delta = u \circ \epsilon, and OV is (or is
        | like?) the product m, and QK is the coproduct \Delta ... what
        | are the unit u, counit \epsilon, and antipode S?
       | 
       | It looks like the author never spells it out, and certainly never
       | proves the assertion.
       | 
       | Instead, there are these hard-to-comprehend claims in prose,
       | interspersed with math defining the most trivial things. In
       | 6.2.1, it reminds us of what it means for a row to sum to 1. Then
       | it quotes a definition of co-injection. In 6.2.2 they remind us
       | what a symmetric matrix is.
       | 
       | The shuffle product defined in 5.1 doesn't get used, except to
       | say that it will be part of future work.
       | 
        | I'm not saying there isn't any insight here, but since the
        | author shows in sections 3-5 that they expect many readers
        | will not have this background, I think they should do a
        | clearer job of drawing the relevant connections explicitly
        | and backing up their claims.
        
         | adamnemecek wrote:
          | You are right, the paper can and will be improved. I think
          | I assume too much that people have read the Pang2014 paper.
         | 
          | Hopf coherence is not new; it's a name for an old concept
          | that did not have a name.
         | 
          | The unit and counit together make up the residual stream.
         | 
         | The antipode is a part of the OV circuit (see section 6.2.2).
         | 
          | You are right re: the basic definitions. I wanted the
          | definitions in there and briefly considered a separate
          | overview section (as is customary), but there were not
          | enough of them to warrant that. They do seem somewhat out
          | of place sometimes.
        
       ___________________________________________________________________
       (page generated 2023-02-28 23:01 UTC)