[HN Gopher] Formal Algorithms for Transformers
       ___________________________________________________________________
        
       Formal Algorithms for Transformers
        
       Author : hexhowells
       Score  : 93 points
       Date   : 2022-07-20 09:28 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | ThrowawayTestr wrote:
       | I was assuming electrical transformers.
        
         | zdw wrote:
         | Or the cartoon/toy franchise
        
       | geysersam wrote:
       | This is a fantastic resource. It's the missing piece of many
       | machine learning articles.
        
       | sva_ wrote:
       | I like how this seems to actually be self-contained. They even
       | have a list of notations at the end.
        
       | godelski wrote:
       | I can't tell who this paper is aimed at. It isn't formal. It
       | isn't mathematical. It isn't a good description and doesn't have
       | good coverage. I can only assume it is for citations.
        
       | lynguist wrote:
       | I find the distinction this paper draws between encoder-decoder,
       | encoder-only, and decoder-only Transformers very useful for my
       | informal understanding of the different architectures. Thank you
       | for this clarification.
        
         | godelski wrote:
         | I'm curious about this comment being the top comment. Is this
         | something people don't know? Everything in this paper was
         | introduced in Attention Is All You Need [0]. That paper
         | introduced Dot Product Attention, which is what everyone now
         | just calls Attention, and it describes the encoder-decoder
         | framework. The encoder uses self attention
         | `softmax(<q(x),k(x)>)v(x)`, while the decoder adds cross
         | attention `softmax(<q(x),k(y)>)v(y)`.
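         | 
         | A rough numpy sketch of that distinction (my own toy code with
         | assumed shapes, not the paper's pseudocode): self attention
         | builds queries, keys, and values from the same sequence, while
         | cross attention takes queries from one sequence and keys/values
         | from the other.
         | 
         |     import numpy as np
         |     
         |     def attention(q, k, v):
         |         # scaled dot-product: softmax(q k^T / sqrt(d)) v
         |         d = q.shape[-1]
         |         s = q @ k.T / np.sqrt(d)
         |         w = np.exp(s - s.max(-1, keepdims=True))
         |         w /= w.sum(-1, keepdims=True)
         |         return w @ v
         |     
         |     x = np.random.randn(4, 8)   # 4 tokens, width 8 (toy sizes)
         |     y = np.random.randn(6, 8)   # 6 tokens of a second sequence
         |     Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
         |     
         |     self_attn  = attention(x @ Wq, x @ Wk, x @ Wv)   # encoder
         |     cross_attn = attention(x @ Wq, y @ Wk, y @ Wv)   # decoder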
         | 
         | I have a lot of complaints about this paper because it only
         | covers topics addressed in the main attention paper (Vaswani),
         | and I can't see how it accomplishes anything but pulling
         | citations away from grad students who wrote survey papers on
         | Attention that are more precise and cover more of the field.
         | From a quick search, here's a survey paper from last year with
         | more in-depth discussion and more mathematical precision [1].
         | 
         | Maybe it's because this is an area of research adjacent to my
         | active area, but I'm a little confused by the attention (pun
         | intended) it is getting. It seems like DeepMind's name is the
         | major motivation.
         | 
         | [0] https://arxiv.org/abs/1706.03762
         | 
         | [1] https://arxiv.org/abs/2106.04554
        
       | tartakovsky wrote:
       | Zero diagrams, but maybe they wouldn't be helpful to clarify the
       | concept? Guess it depends on the types of learners, I'm not sure.
        
         | nh23423fefe wrote:
         | paper explicitly rejects diagrams as unhelpful
         | 
         | > Some 100+ page papers contain only a few lines of prose
         | informally describing the model [RBC+21]. At best there are
         | some high-level diagrams
        
           | godelski wrote:
           | > only a few lines of prose informally describing the model
           | 
           | This is ironic considering they use more words to describe
           | chunking (splitting along a dimension: x, y = a[0,:,:,:],
           | a[1,:,:,:]) than they do to describe multi-head attention.
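           | 
           | For concreteness, a toy numpy version of that chunking (my
           | own snippet, not the paper's):
           | 
           |     import numpy as np
           |     a = np.random.randn(2, 3, 4, 5)   # toy 4-d array
           |     x, y = a[0], a[1]    # split along the first axis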
        
       | mrhether wrote:
       | "familiar with basic ML terminology" might be an understatement.
        
         | uoaei wrote:
         | This field moves very quickly. I don't think anyone can be
         | expected to keep up unless they make it a weekly study subject
         | or are actively employed in it.
        
         | godelski wrote:
         | If you're starting completely from scratch, these might be of
         | more use to you. The second focuses on vision transformers, but
         | all the concepts still apply.
         | 
         | https://jalammar.github.io/illustrated-transformer/
         | 
         | https://medium.com/pytorch/training-compact-transformers-fro...
        
       ___________________________________________________________________
       (page generated 2022-07-21 23:02 UTC)