[HN Gopher] Transformer learning explained: Coinductive guide to...
___________________________________________________________________
Transformer learning explained: Coinductive guide to inductive
transformer heads
Author : adamnemecek
Score : 78 points
Date : 2023-02-28 16:10 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| adamnemecek wrote:
| I'm the author. I recently realized that Hopf algebras are a
| surprisingly good abstraction for machine learning, and this paper
| is an invitation for others to look into them: they can provide a
| unified foundation for convnets, transformers, and diffusion
| models.
|
| Hopf algebras are tensorial bialgebras, meaning they are both a
| tensor and a cotensor at once. What's the co- prefix? You can
| think of it as "a thing with recurrence relationships". This
| matters since recurrence relationships can be thought of as a
| generalization of autodiff. The two ML building blocks, tensors
| and autodiff, are then unified under one umbrella.
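|
| If it helps to see the product/coproduct split concretely, here is
| a textbook toy in Python (the standard polynomial Hopf algebra,
| not the construction the paper builds for transformers): the
| product multiplies polynomials, while the coproduct splits them
| via binomial coefficients, which obey Pascal's recurrence.
|
|   from math import comb
|
|   # Polynomials as {degree: coefficient} dicts.
|   def product(p, q):
|       # Algebra side: ordinary polynomial multiplication.
|       out = {}
|       for i, a in p.items():
|           for j, b in q.items():
|               out[i + j] = out.get(i + j, 0) + a * b
|       return out
|
|   def coproduct(p):
|       # Coalgebra side: Delta(x^n) = sum_k C(n,k) x^k (x) x^(n-k).
|       return {(k, n - k): c * comb(n, k)
|               for n, c in p.items() for k in range(n + 1)}
|
|   print(product({1: 1}, {1: 1}))  # x * x -> {2: 1}
|   print(coproduct({2: 1}))        # Delta(x^2) -> {(0,2): 1, (1,1): 2, (2,0): 1}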
|
| Looking at transformers through the lens of Hopf algebras allowed
| me to identify their learning mechanism. Namely, the residual
| stream corresponds to what my paper calls the unit impulse path,
| and attention corresponds to the convolution path of a Hopf
| algebra. The transformer learns by enforcing an invariant called
| Hopf coherence between the two paths.
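|
| In rough code terms, these are the two branches of a standard
| single-head attention block (a plain numpy sketch, just to point
| at which piece I mean by which path; the paper's construction has
| more to it):
|
|   import numpy as np
|
|   def attention(x, Wq, Wk, Wv, Wo):
|       q, k, v = x @ Wq, x @ Wk, x @ Wv
|       scores = q @ k.T / np.sqrt(q.shape[-1])
|       w = np.exp(scores - scores.max(-1, keepdims=True))
|       w = w / w.sum(-1, keepdims=True)   # row-wise softmax
|       return w @ v @ Wo
|
|   def block(x, Wq, Wk, Wv, Wo):
|       # x passed through unchanged -> the unit impulse path
|       # the attention branch       -> the convolution path
|       return x + attention(x, Wq, Wk, Wv, Wo)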
|
| Hopf convolution is a generalization of the standard convolution,
| and transformer attention emerges as part of this convolution.
| Transformer models can then be understood as linear time-
| invariant systems.
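|
| One loose way to see the convolution angle, independent of the
| Hopf machinery: a standard 1-D convolution is a fixed, banded
| (Toeplitz) mixing matrix applied to the values, while attention
| applies a data-dependent mixing matrix (the softmax of QK^T) to
| the values. A small numpy sketch:
|
|   import numpy as np
|
|   v = np.array([1.0, 2.0, 3.0, 4.0])
|   kernel = np.array([0.5, 0.25])
|
|   # The same convolution written as a fixed Toeplitz mixing matrix.
|   W_conv = np.array([[0.5,  0.0,  0.0,  0.0],
|                      [0.25, 0.5,  0.0,  0.0],
|                      [0.0,  0.25, 0.5,  0.0],
|                      [0.0,  0.0,  0.25, 0.5]])
|   print(W_conv @ v)                  # [0.5, 1.25, 2.0, 2.75]
|   print(np.convolve(v, kernel)[:4])  # same values
|
|   # Attention also computes "mixing matrix @ values", except the
|   # matrix is computed from the data instead of being fixed.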
|
| Furthermore, I believe that Hopf coherence provides a better way
| of doing backprop, one that operates within individual layers as
| opposed to across the whole graph, but more research is needed.
|
| This approach also opens the door for verified machine learning,
| which I will discuss in future work.
|
| I'm currently in the process of implementing a simple machine
| learning framework based on Hopf algebras, but it might take some
| time.
|
| I have a Discord channel if you want to discuss this further or
| keep track of our progress: https://discord.cofunctional.ai
| rnosov wrote:
| Is there any chance you can answer in simple terms whether the
| "Hopf coherence" method would be any faster at doing
| backpropagation than current methods (gradient descent)? My math
| skills are a bit rusty now, so I can't tell from looking at your
| paper alone.
| adamnemecek wrote:
| Yes, that's the reason I'm talking about it. It should be
| faster since it works locally (within individual layers) as
| opposed to across the whole graph.
| rnosov wrote:
| Could you quantify it (twice, three times, etc.)? Another
| question: does it apply only to the attention heads' learning
| or to the whole shebang?
| adamnemecek wrote:
| It's going to be more than that, like significantly. Hopf
| coherence theoretically does away with backprop
| altogether.
|
| I'm in the process of implementing it, but it might take
| some time.
|
| I'm aware of Hinton's forward-forward, but I need more
| time to provide a comparison.
| harveywi wrote:
| One forward-forward step for man, one giant Hopf for mankind.
| uoaei wrote:
| You comment elsewhere that the Hopf coherence-based optimization
| procedure works locally. How does it compare to local predictive
| coding techniques?
|
| E.g.: https://openreview.net/pdf?id=PdauS7wZBfC
| adamnemecek wrote:
| I have to read the paper in more detail but I imagine there's a
| good amount of overlap.
| frbd wrote:
| The Geometric Deep Learning theory developed by Bronstein et al.
| is also a very nice unifying theory for deep learning, using
| Graph Neural Networks. Did you study the interactions with it?
| niemandhier wrote:
| The book you mention is quite easy to read and didactically well
| structured; the paper was hard to follow.
|
| Just to encourage people to read the book, even if they didn't
| grok the paper.
| adamnemecek wrote:
| I did, yes, and there is a good amount of overlap. In fact, I
| will be speaking with one of the authors later this week.
|
| However, I find that my approach relies on fewer concepts. Our
| approaches share some concepts, but what I like about mine is
| that all the concepts are just parts of the Hopf algebra, whereas
| in GDL they are somewhat ad hoc.
|
| My approach also ties in nicely with other things like linear
| logic and PL theory, and therefore provides concrete steps toward
| building the next-gen machine learning framework, which might
| very well provide verifiable machine learning.
|
| My approach fundamentally says that transformer models,
| diffusion models, and convnets can be expressed with Hopf
| convolution, which is a generalization of the standard
| convolution.
|
| My approach also provides a new learning scheme that does away
| with the backward pass and replaces it with something resembling
| the feedback found in electronic circuits.
| algeber wrote:
| Hey Adam, by the way, I implemented the Hopf-algebraic shuffle
| product in Julia, but ironically you blocked me before I got the
| chance to tell you -- chakravala
| Afrotechmods wrote:
| As an EE, trying to understand what this was about fried my
| brain.
| adamnemecek wrote:
| EE is actually a good place to be. Hopf algebras are like LTI
| systems: there's a Laplace transform/z-transform relationship
| between the unit impulse path and the convolution path.
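|
| The standard LTI picture, just for reference (plain numpy,
| nothing Hopf-specific): feed a unit impulse in and you read off
| the impulse response, and every other output is a convolution
| with that response; in the z-domain the convolution becomes a
| multiplication.
|
|   import numpy as np
|
|   h = np.array([1.0, 0.5, 0.25])       # impulse response of an LTI system
|   impulse = np.array([1.0, 0.0, 0.0])  # unit impulse
|
|   print(np.convolve(impulse, h))       # recovers h (padded with zeros)
|
|   x = np.array([1.0, 2.0, 3.0])        # arbitrary input
|   print(np.convolve(x, h))             # output = x convolved with h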
| p1esk wrote:
| Hey Adam, if you're interested in smart people looking at your
| ideas and providing criticism, you should submit your papers to
| an ML conference for peer review.
| adamnemecek wrote:
| I will; this is a first approximation. I'm working on an
| implementation that will provide further guidance for the
| paper.
| algeber wrote:
| By the way, I have already had shuffle products implemented
| for years, but you ironically blocked me the other day before
| I could tell you more about it; as you know, science always
| works faster when you stop the conversation -- chakravala
| ftxbro wrote:
| I don't know anything about algebraic geometry or category theory
| or Haskell, but I know some stuff about the normal, boring kind
| of linear algebra, and in some places the author relies on
| statements like "since the eigenvalues are real and positive we
| can conclude that it is a symmetric positive definite matrix",
| which I know are just wrong (it wouldn't have to be symmetric).
| This is particularly bad because the rest of the argument (for
| example in 6.2.2) explains why it's so cool that we've shown it's
| symmetric (we haven't).
| [deleted]
| adamnemecek wrote:
| You are right; I will rectify that. The statement still holds,
| but I need to provide a justification.
| matheist wrote:
| For example, [[1, 1], [0, 1]] is a non-symmetric matrix with
| positive real eigenvalues.
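|
| Quick sanity check in numpy, if anyone wants to see it:
|
|   import numpy as np
|
|   A = np.array([[1.0, 1.0],
|                 [0.0, 1.0]])
|   print(np.linalg.eigvals(A))   # [1. 1.] -- real and positive
|   print(np.allclose(A, A.T))    # False   -- but A is not symmetric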
| abeppu wrote:
| I don't understand the claims well enough to evaluate them. But
| the way this document is written, and where it chooses to put its
| attention (no pun intended), strikes me as odd. The sections on
| the transformer and its components seem like they should be
| backed with a lot more specifics tying back to the
| (Combinatorial) Hopf algebra described in the preceding sections.
| The Hopf algebra section defines a bunch of notation and
| structure, and introduces the label "Hopf Coherence" for some
| invariant, which is expressed in terms of that pile of notation.
| The Combinatorial Hopf Algebra section introduces a "shuffle
| product".
|
| But we never actually get a clear explanation of how this stuff
| lines up with parts of the transformer / attention mechanism.
| There's an _assertion_ of Hopf coherence between two paths, and
| in 6.2 there's a description of the QK and OV circuits as having
| a coproduct-product relationship, which isn't especially clear
| but is at least a direction. What is the antipode in the
| transformer context? If Hopf coherence is
| m(id \otimes S)\Delta = u \circ \epsilon, and OV is (or is
| like?) the product m, and QK is the coproduct \Delta ... what are
| the unit u, counit \epsilon, and antipode S?
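|
| For reference, here is what that identity looks like on the
| simplest Hopf algebra I know, the group algebra of Z/3 (textbook
| material, nothing transformer-specific; it's exactly this kind of
| concrete instantiation I'd want to see spelled out for the
| transformer):
|
|   # Elements of the group algebra are {group element: coefficient}.
|   n = 3
|
|   def m(pair):            # product: multiply (add mod n) the tensor factors
|       out = {}
|       for (g, h), c in pair.items():
|           out[(g + h) % n] = out.get((g + h) % n, 0) + c
|       return out
|
|   def Delta(p):           # coproduct: Delta(g) = g (x) g on basis elements
|       return {(g, g): c for g, c in p.items()}
|
|   def id_tensor_S(pair):  # id (x) S, the antipode S being the group inverse
|       return {(g, (-h) % n): c for (g, h), c in pair.items()}
|
|   def eps(p):             # counit: sum of coefficients
|       return sum(p.values())
|
|   def u(c):               # unit: c times the identity element
|       return {0: c}
|
|   p = {1: 2.0, 2: 3.0}              # the element 2*g1 + 3*g2
|   print(m(id_tensor_S(Delta(p))))   # m(id (x) S)Delta -> {0: 5.0}
|   print(u(eps(p)))                  # u . eps          -> {0: 5.0}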
|
| It looks like the author never spells it out, and certainly never
| proves the assertion.
|
| Instead, there are these hard-to-comprehend claims in prose,
| interspersed with math defining the most trivial things. In
| 6.2.1, it reminds us of what it means for a row to sum to 1. Then
| it quotes a definition of co-injection. In 6.2.2 it reminds us
| what a symmetric matrix is.
|
| The shuffle product defined in 5.1 doesn't get used, except to
| say that it will be part of future work.
|
| I'm not saying there isn't any insight here, but since the author
| shows in sections 3-5 that they expect many readers not to have
| this background, I think they should do a clearer job of drawing
| the relevant connections explicitly and backing up the claims.
| adamnemecek wrote:
| You are right; the paper can and will be improved. I think I
| assume too much familiarity with the Pang2014 paper.
|
| Hopf coherence is not new; it's a name for an old concept that
| didn't have one.
|
| The unit and counit together make up the residual stream.
|
| The antipode is a part of the OV circuit (see section 6.2.2).
|
| You are right re: the basic definitions. I wanted them in there
| and briefly considered a separate overview section (as is
| customary), but there were not enough of them to warrant that.
| They do seem somewhat out of place at times.
___________________________________________________________________
(page generated 2023-02-28 23:01 UTC)