[HN Gopher] Transformers from Scratch
       ___________________________________________________________________
        
       Transformers from Scratch
        
       Author : gradientgarden
       Score  : 160 points
       Date   : 2021-11-23 06:27 UTC (2 days ago)
        
 (HTM) web link (e2eml.school)
 (TXT) w3m dump (e2eml.school)
        
       | arketyp wrote:
       | Transformers are exciting as they seem to work on all types of
       | modalities, including vision [1]. It makes me wonder if the
       | transformer module captures some essence of the minicolumn
       | structure found all over the neocortex that Jeff Hawkins raves
        | about, citing Vernon Mountcastle. Hawkins et al. talk about
        | grid cells and location these days; maybe attention and context
        | are the generalization of such notions.
       | 
       | [1] https://arxiv.org/abs/2105.15203
        
         | mountainriver wrote:
          | They would say no; they believe it is more like a
          | probabilistic graph structure.
        
       | ypcx wrote:
        | It's interesting to see how human understanding differs when it
        | comes to complex yet clearly defined topics like machine
        | learning/Transformers.
        | 
        | For comparison, after going through Peter Bloem's "Transformers
        | from scratch" [1], implementing the code, and following the
        | actual flow of the mathematical quantities, my understanding is
        | that:
       | 
       | - Transformers consist of 3 main parts: 1. Encoders/Decoders (I/O
       | conversion), 2. Self-attention (Indexing), 3. Feed-forward
       | trainable network (Memory).
       | 
        | - The Feed-forward is the simplest kind of neural net (in the
        | original architecture, two linear layers with an activation in
        | between), often implemented with Conv1d layers of kernel size
        | 1; each such layer is just a matrix multiply plus a bias,
        | applied at each position independently (see the sketch after
        | this list).
       | 
        | - The most interesting part is the Multi-head self-attention,
        | which I understand as [2] a randomly initialized multi-
        | dimensional indexing system in which different heads focus on
        | different aspects of the indexed token instance (a token being
        | initially e.g. a word or a part of a word) with respect to its
        | containing sequence/context. The encoded token instance then
        | contains information about all the other tokens of the input
        | sequence (hence "self-attention"), and what each head picks up
        | depends on how that "attention head" was (randomly)
        | initialized.
       | 
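        | To make those two parts concrete, here is a minimal PyTorch
        | sketch of a multi-head self-attention block plus the Conv1d
        | feed-forward (dimensions and names are my own, purely
        | illustrative; no masking, dropout, or residuals):
        | 
        |     import torch
        |     import torch.nn as nn
        |     import torch.nn.functional as F
        | 
        |     class MultiHeadSelfAttention(nn.Module):
        |         def __init__(self, d_model=64, n_heads=4):
        |             super().__init__()
        |             assert d_model % n_heads == 0
        |             self.h, self.d = n_heads, d_model // n_heads
        |             # Randomly initialized "indexing" projections.
        |             self.q = nn.Linear(d_model, d_model)
        |             self.k = nn.Linear(d_model, d_model)
        |             self.v = nn.Linear(d_model, d_model)
        |             self.out = nn.Linear(d_model, d_model)
        | 
        |         def forward(self, x):  # x: (batch, seq, d_model)
        |             b, t, _ = x.shape
        |             # Split each projection into heads:
        |             # (batch, heads, seq, d_head)
        |             def heads(z):
        |                 return z.view(b, t, self.h, self.d).transpose(1, 2)
        |             q, k, v = (heads(f(x)) for f in (self.q, self.k, self.v))
        |             # Every token attends to every token in the sequence.
        |             att = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5,
        |                             dim=-1)
        |             y = (att @ v).transpose(1, 2).reshape(b, t, -1)
        |             return self.out(y)
        | 
        |     # The feed-forward "memory": Conv1d with kernel size 1 is
        |     # exactly a per-position matmul + bias.
        |     ff = nn.Sequential(
        |         nn.Conv1d(64, 256, kernel_size=1), nn.ReLU(),
        |         nn.Conv1d(256, 64, kernel_size=1),
        |     )
        | 
        |     x = torch.randn(2, 10, 64)       # (batch, seq, d_model)
        |     y = MultiHeadSelfAttention()(x)
        |     y = ff(y.transpose(1, 2)).transpose(1, 2)  # Conv1d: (b, d, t)
        |     print(y.shape)                   # torch.Size([2, 10, 64])
        | 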
        | The part that really hits you is when you understand that, for
        | a Transformer, a token is not unique due to its content/identity
        | alone (and the other tokens in the given context/sentence), but
        | also due to its position in the context: to the Transformer,
        | the word "the" at the first position is a _completely different
        | word_ from the word "the" at e.g. the second position, even if
        | the rest of the context were the same. This is obviously a
        | massive waste of space if you think about it, but at the same
        | time it is, at the moment, the only/best way of doing it,
        | because it moves a massive amount of processing from inference
        | time to training time, which is what our current von Neumann
        | hardware architectures require.
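        | 
        | A toy sketch of the standard sinusoidal positional encoding
        | (vocabulary size and the token id below are made up), showing
        | that the same token id becomes a different vector at each
        | position:
        | 
        |     import torch
        | 
        |     def sinusoidal_positions(seq_len, d_model):
        |         pos = torch.arange(seq_len).unsqueeze(1).float()
        |         i = torch.arange(0, d_model, 2).float()
        |         angles = pos / 10000 ** (i / d_model)
        |         pe = torch.zeros(seq_len, d_model)
        |         pe[:, 0::2] = torch.sin(angles)  # even dims: sine
        |         pe[:, 1::2] = torch.cos(angles)  # odd dims: cosine
        |         return pe
        | 
        |     emb = torch.nn.Embedding(100, 8)   # toy vocab of 100 ids
        |     the = 7                            # pretend id of "the"
        |     x = emb(torch.tensor([the, the]))  # same word twice...
        |     x = x + sinusoidal_positions(2, 8) # ...plus its position
        |     print(torch.allclose(x[0], x[1]))  # False: now distinct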
       | 
       | [1] http://peterbloem.nl/blog/transformers
       | 
       | [2] https://datascience.stackexchange.com/a/103151/101197
        
         | igorkraw wrote:
          | Your last point is true only with positional encodings,
          | though; attention itself is a permutation-equivariant
          | function.
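          | 
          | A quick toy check of that claim (a sketch assuming PyTorch,
          | with made-up dimensions):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     torch.manual_seed(0)
          |     x = torch.randn(5, 16)   # 5 tokens, no positions added
          |     Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
          | 
          |     def attn(x):
          |         scores = (x @ Wq) @ (x @ Wk).T / 16 ** 0.5
          |         return F.softmax(scores, dim=-1) @ (x @ Wv)
          | 
          |     p = torch.randperm(5)
          |     # Permuting the input permutes the output the same way.
          |     same = torch.allclose(attn(x[p]), attn(x)[p], atol=1e-5)
          |     print(same)  # True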
        
       | Chris2048 wrote:
        | A little off-topic discussion: I sometimes think the use of
        | English is a bit too siloed.
        | 
        | This is talking about transformers of a mathematical nature. The
        | same term can describe an electrical component. The word itself,
        | "trans-form-er", seems to mean "a thing that changes (the shape
        | of?) something" - this is a very vague semantic, maybe
        | appropriate for an abstract mathematical notion, but silly for
        | any specific thing.
       | 
       | It reminds me of all the vague nouns used to name OOP objects:
       | FooManager, FooHandler, FooController, BarContext, BarInfo,
       | BazAdapter, BazConvertor, BazTransformer; etc etc etc.
        
         | krisrm wrote:
         | I remember designing plenty of "BazAdapterFactoryManager"
         | software in my early days of software engineering.
         | 
         | Sometimes I still have to maintain that software, and I
         | remember quickly what regret feels like.
        
           | Chris2048 wrote:
            | Yep. And sometimes different fiefdoms reuse the same words
            | with slightly different meanings/conventions - consider even
            | what an object is in Java vs. Smalltalk vs. JavaScript.
           | 
           | Maybe we need stronger conventions of meaning?
        
         | stult wrote:
         | As a Decepticon, I couldn't agree more
        
         | m12k wrote:
         | Yeah, based on the headline I actually thought the article
         | would be about electric power conversion for dummies. I guess
         | both topics are on-topic for HN.
        
           | seanmcdirmid wrote:
           | Well, at least you didn't think it was about robots who could
           | turn into cars. I spend way too much time watching videos
           | with my son these days.
        
           | ucosty wrote:
           | Same, especially since the domain gave me electrical
           | engineering vibes.
        
       | d4rkp4ttern wrote:
       | This is a great comprehensive deep dive, thank you for sharing!
       | 
        | Interestingly, there was a blog post and video in 2019 by Peter
        | Bloem [1] with the same title, which I consider one of the very
        | best quick intros to transformers. Rather than diving into
       | all of the details right away, Peter focuses on the intuition and
       | "peels the onion" gradually.
       | 
       | [1] http://peterbloem.nl/blog/transformers
       | 
        | Other notable intros to transformers are:
       | 
       | The Illustrated Transformer by Jay Alammar
       | 
       | https://jalammar.github.io/illustrated-transformer/
       | 
       | The Annotated Transformer by Harvard NLP
       | 
       | https://nlp.seas.harvard.edu/2018/04/03/attention.html
        
       | [deleted]
        
       | analognoise wrote:
       | I was like "oh cool, winding our own transformers!"
        
       ___________________________________________________________________
       (page generated 2021-11-25 23:01 UTC)