[HN Gopher] Transformers from Scratch
___________________________________________________________________
Transformers from Scratch
Author : gradientgarden
Score : 160 points
Date : 2021-11-23 06:27 UTC (2 days ago)
(HTM) web link (e2eml.school)
(TXT) w3m dump (e2eml.school)
| arketyp wrote:
| Transformers are exciting as they seem to work on all types of
| modalities, including vision [1]. It makes me wonder if the
| transformer module captures some essence of the minicolumn
| structure found all over the neocortex that Jeff Hawkins raves
| about, citing Vernon Mountcastle. Hawkins et al talk about grid
| cells nowadays, location; maybe attention and context is the
| generalization of such a notion.
|
| [1] https://arxiv.org/abs/2105.15203
| mountainriver wrote:
| They would say no, they believe it is more like a graph
| probabilistic structure
| ypcx wrote:
| It's interesting to see how human understanding differs when it
| comes to complex yet clearly defined topics like machine
| learning/Transformers.
|
| For comparison, after going through Peter Bloem's "Transformers
| from scratch" [1] and implementing the code to follow the actual
| flow of the mathematical quantities, my understanding is that:
|
| - Transformers consist of 3 main parts: 1. Encoders/Decoders (I/O
| conversion), 2. Self-attention (Indexing), 3. Feed-forward
| trainable network (Memory).
|
| - The feed-forward part is the simplest kind of neural net (a
| single layer applied to each token position independently), often
| implemented as a Conv1d layer, i.e. a matrix multiply plus a bias
| and an activation.
|
| - The most interesting part is the multi-head self-attention,
| which I understand as [2] a randomly initialized multi-
| dimensional indexing system, where different heads focus on
| different variations of the indexed token instance (a token being
| initially e.g. a word or part of a word) with respect to its
| containing sequence/context. Each encoded token instance contains
| information about all other tokens of the input sequence (hence
| "self-attention"), and these variations depend on how the given
| "attention head" was (randomly) initialized.
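A minimal NumPy sketch of the two pieces described in the bullets above (dimensions and names are illustrative, not taken from the article): one scaled dot-product attention head with randomly initialized projections, repeated per head, plus the position-wise feed-forward layer applied to every token independently.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product self-attention head.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (seq_len, seq_len)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ V                                   # each row mixes all tokens

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: the same matmul + bias + activation
    is applied to each token position independently."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2    # ReLU in between

d_model, d_head, n_heads, seq_len = 8, 4, 2, 5
X = rng.normal(size=(seq_len, d_model))

# Each head gets its own random initialization, so each head
# "indexes" the sequence differently, as described above.
heads = [attention_head(X,
                        rng.normal(size=(d_model, d_head)),
                        rng.normal(size=(d_model, d_head)),
                        rng.normal(size=(d_model, d_head)))
         for _ in range(n_heads)]
attended = np.concatenate(heads, axis=-1)          # (seq_len, n_heads * d_head)

W1, b1 = rng.normal(size=(d_model, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, d_model)), np.zeros(d_model)
out = feed_forward(attended, W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```

A real transformer block adds learned output projections, residual connections, and layer normalization around these two pieces; this sketch only shows the data flow the comment describes.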
|
| The part that really hits you is when you understand that for a
| Transformer, a token is not unique only due to its
| content/identity (and due to all other tokens in the given
| context/sentence), but also due to its position in the context --
| e.g. to the Transformer, the word "the" at the first position is
| a _completely different word_ from the word "the" at e.g. the
| second position (even if the rest of the context is the same).
| (Which is obviously a massive waste of space if you think
| about it, but at the same time, at the moment, the only/best way
| of doing it, because it moves a massive amount of processing from
| inference time to the training time - which is what our current
| von-Neumann hardware architectures require.)
|
| [1] http://peterbloem.nl/blog/transformers
|
| [2] https://datascience.stackexchange.com/a/103151/101197
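To make the position point above concrete, here is a small sketch (illustrative code, not from the article) using the sinusoidal positional encodings from the original transformer paper: once the encoding is added, the same token embedding becomes a different vector at every position.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 16
emb_the = np.full(d_model, 0.5)   # stand-in embedding for the word "the"
pe = positional_encoding(10, d_model)

the_at_0 = emb_the + pe[0]
the_at_1 = emb_the + pe[1]

# Same word, different positions -> different input vectors.
print(np.allclose(the_at_0, the_at_1))  # False
```

(Learned positional embeddings, as in BERT or GPT, have the same effect: position is baked into the vector the rest of the network sees.)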
| igorkraw wrote:
| Your last point is true only with positional encodings though,
| attention itself is a permutation equivariant function
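A quick check of that property (illustrative code, not from the thread): without positional encodings, permuting the input tokens just permutes the attention output in the same way, i.e. attention is permutation equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention, no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V

d = 6
X = rng.normal(size=(4, d))                      # 4 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 0, 3, 1])                    # shuffle the tokens
out = attention(X, Wq, Wk, Wv)
out_perm = attention(X[perm], Wq, Wk, Wv)

# Permuting the input permutes the output identically: equivariance.
print(np.allclose(out[perm], out_perm))  # True
```

Adding a positional encoding to X before the projections breaks this symmetry, which is exactly why the "same word, different position" distinction above holds.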
| Chris2048 wrote:
| A little off-topic discussion: I sometimes think use of English
| is a bit too siloed.
|
| This is talking about transformers of a mathematical nature. The
| same term can describe an electrical component. The word itself
| "trans-form-er" seems to mean "A thing that changes (the shape
| of?) something" - this is a very vague semantic, maybe
| appropriate for an abstract mathematical notion, but silly for
| any specific thing.
|
| It reminds me of all the vague nouns used to name OOP objects:
| FooManager, FooHandler, FooController, BarContext, BarInfo,
| BazAdapter, BazConvertor, BazTransformer; etc etc etc.
| krisrm wrote:
| I remember designing plenty of "BazAdapterFactoryManager"
| software in my early days of software engineering.
|
| Sometimes I still have to maintain that software, and I
| remember quickly what regret feels like.
| Chris2048 wrote:
| Yep. And sometimes different fiefdoms reuse the same words
| with slightly different meaning/conventions - consider even
| what an object is in Java vs smalltalk vs Javascript.
|
| Maybe we need stronger conventions of meaning?
| stult wrote:
| As a Decepticon, I couldn't agree more
| m12k wrote:
| Yeah, based on the headline I actually thought the article
| would be about electric power conversion for dummies. I guess
| both topics are on-topic for HN.
| seanmcdirmid wrote:
| Well, at least you didn't think it was about robots who could
| turn into cars. I spend way too much time watching videos
| with my son these days.
| ucosty wrote:
| Same, especially since the domain gave me electrical
| engineering vibes.
| d4rkp4ttern wrote:
| This is a great comprehensive deep dive, thank you for sharing!
|
| Interestingly, there was a blog post and video in 2019 by Peter
| Bloem [1] with the same title which I consider to be one of the
| very best quick intros to transformers. Rather than diving into
| all of the details right away, Peter focuses on the intuition and
| "peels the onion" gradually.
|
| [1] http://peterbloem.nl/blog/transformers
|
| Other notable good intros to transformers are:
|
| The Illustrated Transformer by Jay Alammar
|
| https://jalammar.github.io/illustrated-transformer/
|
| The Annotated Transformer by Harvard NLP
|
| https://nlp.seas.harvard.edu/2018/04/03/attention.html
| [deleted]
| analognoise wrote:
| I was like "oh cool, winding our own transformers!"
___________________________________________________________________
(page generated 2021-11-25 23:01 UTC)