[HN Gopher] Formal Algorithms for Transformers
___________________________________________________________________
Formal Algorithms for Transformers
Author : hexhowells
Score : 93 points
Date   : 2022-07-20 09:28 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ThrowawayTestr wrote:
| I was assuming electrical transformers.
| zdw wrote:
| Or the cartoon/toy franchise
| geysersam wrote:
| This is a fantastic resource. It's the missing piece of many
| machine learning articles.
| sva_ wrote:
| I like how this seems to actually be self-contained. They even
| have a list of notations in the end.
| godelski wrote:
| I can't tell who this paper is aimed at. It isn't formal. It
| isn't mathematical. It isn't a good description and doesn't have
| good coverage. I can only assume it is for citations.
| lynguist wrote:
| I find the distinction this paper introduces between encoder-
| decoder, encoder-only, and decoder-only Transformers very useful
| for my informal understanding of the different architectures.
| Thank you for the clarification.
| godelski wrote:
| I'm curious about this comment being the top comment. Is this
| something people don't know? Everything in this paper was
| introduced in Attention Is All You Need[0]. That paper introduced
| scaled dot-product attention, which is what everyone now just
| calls attention, and it describes the encoder-decoder framework.
| The encoder uses self attention, `softmax(<q(x),k(x)>)v(x)`, and
| the decoder adds cross attention over the encoder output y,
| `softmax(<q(x),k(y)>)v(y)`
|
| I have a lot of complaints about this paper because it only
| covers topics already addressed in the original attention paper
| (Vaswani), and I can't see how it accomplishes anything beyond
| pulling citations away from grad students who wrote survey
| papers on attention, which are more precise and cover more of
| the field. From a quick search, here's a survey paper from last
| year with a more in-depth discussion and more mathematical
| precision[1].
|
| Maybe it's because this is an area of research adjacent to my
| own, but I'm a little confused by the attention (pun intended)
| this paper is getting. It seems like DeepMind's name is the main
| draw.
|
| [0] https://arxiv.org/abs/1706.03762
|
| [1] https://arxiv.org/abs/2106.04554
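The self- vs. cross-attention distinction in the comment above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's pseudocode: the projection matrices Wq, Wk, Wv and the toy shapes are made up for the example, and masking, scaling conventions, and multi-head logic are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_q, x_kv, Wq, Wk, Wv):
    # queries come from x_q; keys and values come from x_kv.
    # self attention: x_q is x_kv.  cross attention: they differ.
    q, k, v = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 4  # toy model width (assumption for the example)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(3, d))  # 3 "decoder-side" tokens
y = rng.normal(size=(5, d))  # 5 "encoder-side" tokens

self_attn = attention(x, x, Wq, Wk, Wv)   # softmax(<q(x),k(x)>) v(x)
cross_attn = attention(x, y, Wq, Wk, Wv)  # softmax(<q(x),k(y)>) v(y)
```

Both calls reuse the same function; only the source of the keys and values changes, which is the whole encoder/decoder distinction being discussed.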
| tartakovsky wrote:
| Zero diagrams, but maybe they wouldn't help clarify the concepts?
| I guess it depends on the type of learner; I'm not sure.
| nh23423fefe wrote:
| paper explicitly rejects diagrams as unhelpful
|
| > Some 100+ page papers contain only a few lines of prose
| informally describing the model [RBC+21]. At best there are
| some high-level diagrams
| godelski wrote:
| > only a few lines of prose informally describing the model
|
| This is ironic considering they use more words to describe
| chunking (splitting along a dimension, e.g. `x, y = a[0,:,:,:],
| a[1,:,:,:]`) than multi-head attention.
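For readers unfamiliar with the slicing idiom in the comment above, here is a short NumPy sketch of both operations: the chunking the paper describes, and the analogous multi-head split it is being contrasted with. The array shapes and head count are invented for illustration.

```python
import numpy as np

# chunking: splitting an array along its first dimension,
# as in the comment's  x, y = a[0,:,:,:], a[1,:,:,:]
a = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
x, y = a[0, :, :, :], a[1, :, :, :]

# the multi-head split is the same idea applied to the feature
# dimension: (tokens, d_model) -> (heads, tokens, d_head)
d_model, n_heads = 8, 2  # toy sizes (assumption for the example)
tokens = np.arange(6 * d_model, dtype=float).reshape(6, d_model)
heads = (tokens.reshape(6, n_heads, d_model // n_heads)
               .transpose(1, 0, 2))
```

Each head then attends over its own d_head-wide slice of the features, and the results are concatenated back afterwards.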
| mrhether wrote:
| "familiar with basic ML terminology" might be an understatement
| uoaei wrote:
| This field moves very quickly. I don't think anyone who doesn't
| make it a weekly study subject, or isn't actively employed in
| it, can be expected to keep up.
| godelski wrote:
| If you're starting from scratch scratch, these might be of more
| use to you. The second focuses on vision transformers, but all
| the concepts still apply.
|
| https://jalammar.github.io/illustrated-transformer/
|
| https://medium.com/pytorch/training-compact-transformers-fro...
___________________________________________________________________
(page generated 2022-07-21 23:02 UTC)