[HN Gopher] The Transformer Family
       ___________________________________________________________________
        
       The Transformer Family
        
       Author : alexmolas
       Score  : 105 points
       Date   : 2023-01-29 09:07 UTC (13 hours ago)
        
 (HTM) web link (lilianweng.github.io)
 (TXT) w3m dump (lilianweng.github.io)
        
       | simonw wrote:
       | On the one hand, this looks really useful.
       | 
       | On the other hand:
       | 
        | > There are various forms of attention / self-attention,
        | Transformer (Vaswani et al., 2017) relies on the scaled dot-
        | product attention: given a query matrix Q, a key matrix K and
        | a value matrix V, the output is a weighted sum of the value
        | vectors, where the weight assigned to each value slot is
        | determined by the dot-product of the query with the
        | corresponding key
       | 
       | There HAS to be a better way of communicating this stuff. I'm
       | honestly not even sure where to start decoding and explaining
       | that paragraph.
       | 
       | We really need someone with the explanatory skills of
       | https://jvns.ca/ to start helping people understand this space.
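        | 
        | For what it's worth, here is my rough attempt at turning that
        | paragraph into a few lines of NumPy. A sketch only: the names
        | Q, K, V and the shapes are my guesses at what the post means,
        | not its actual code.
        | 
        |     import numpy as np
        | 
        |     def attention(Q, K, V):
        |         # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
        |         d_k = Q.shape[-1]
        |         # dot-product of each query with each key, scaled
        |         scores = Q @ K.T / np.sqrt(d_k)
        |         # softmax over keys -> one weight per value slot
        |         w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        |         w = w / w.sum(axis=-1, keepdims=True)
        |         # weighted sum of the value vectors
        |         return w @ V
        | 
        |     rng = np.random.default_rng(0)
        |     Q = rng.normal(size=(4, 8))   # 4 queries
        |     K = rng.normal(size=(6, 8))   # 6 keys
        |     V = rng.normal(size=(6, 16))  # 6 value vectors
        |     out = attention(Q, K, V)      # shape (4, 16)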
        
         | mirekrusin wrote:
          | Andrej Karpathy, i.e. [0]?
         | 
         | [0] https://www.youtube.com/watch?v=kCc8FmEb1nY
        
         | benreesman wrote:
          | The Illustrated Transformer is great. I was pretty hazy
          | after reading the paper back in 2017, and this resource
          | helped a lot.
         | 
         | https://jalammar.github.io/illustrated-transformer/
        
           | simonw wrote:
           | Thanks, that one is really useful.
        
         | nerdponx wrote:
          | Complicated from whose perspective? I don't go around
          | commenting on systems programming articles about how low-
          | level memory management and concurrency algorithms are too
          | complicated, or commenting on category theory articles that
          | the terminology is too obscure and monads are too hard.
         | 
         | I agree that there probably could be a better "on ramp" into
         | this material than "take an undergraduate linear algebra
         | course", but ultimately it is a mathematical model and you're
         | going to have to deal with the math at some point if you want
         | to actually understand what's going on. Linear algebra and
         | calculus are entry-level table stakes for understanding how
         | machine learning works, and there's really no way around that.
        
         | peterfirefly wrote:
         | Vector, matrix, weighted sum, and dot product are good places
         | to start. In fact, these concepts are so useful that they are
         | good places to start pretty much no matter where you want to
         | go. 3D graphics, statistics, physics, ... and neural networks.
        
       | lostdog wrote:
        | A strong neural net probably only needs sparse connections to
        | learn well. However, we simple humans cannot predict which
        | sparse connections are important. Therefore, the net needs to
        | learn which connections matter, but learning them means
        | computing all of them during training, so training is slow.
        | It's very challenging to break this cycle!
        
         | polygamous_bat wrote:
         | What you are describing is known as the "lottery ticket
         | hypothesis" [0] in the ML world -- it is a well-studied
         | phenomenon!
         | 
         | [0] https://arxiv.org/abs/1803.03635
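          | 
          | The gist, as I understand it (a toy sketch of magnitude
          | pruning, not the paper's exact procedure, with arbitrary
          | layer sizes):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     torch.manual_seed(0)
          |     net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
          |                         nn.Linear(32, 2))
          |     # remember the init so the "winning ticket" can be
          |     # rewound to it later
          |     init = {k: v.clone() for k, v in net.state_dict().items()}
          | 
          |     def magnitude_masks(model, keep=0.2):
          |         # keep the largest-magnitude `keep` fraction of
          |         # each weight matrix
          |         masks = {}
          |         for name, p in model.named_parameters():
          |             if p.dim() > 1:
          |                 flat = p.abs().flatten()
          |                 cut = flat.sort().values[int((1 - keep) * flat.numel())]
          |                 masks[name] = (p.abs() >= cut).float()
          |         return masks
          | 
          |     # recipe: 1) train `net` densely, 2) build masks,
          |     # 3) reload `init`, 4) retrain with the masks applied
          |     # to the weights
          |     masks = magnitude_masks(net)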
        
       | jerpint wrote:
        | This amount of diversity in transformers is very impressive,
        | but what's more impressive is that for models like GPT,
        | scaling the model seems much more effective than engineering
        | it.
        
         | ad404b8a372f2b9 wrote:
          | I see no evidence of that; transformers seem to follow the
          | same trend as other architectures, with improved models
          | coming out every month that demonstrate similar performance
          | with orders of magnitude fewer parameters.
        
           | nerdponx wrote:
           | I think by "effective" they meant better overall performance,
           | not necessarily better performance per parameter or hour of
           | training.
        
         | alexmolas wrote:
          | I don't remember the exact paper (I think it's the Vision
          | Transformer paper), but they say something like "scaling the
          | model and having more data completely beats inductive bias".
          | It's impressive how we went from feature engineering in
          | classical ML, to inductive bias in early deep learning, to
          | just having more data in modern deep learning.
        
           | 0xBABAD00C wrote:
           | > scaling the model and having more data completely beats
           | inductive bias
           | 
           | The analogy in my mind is this: "burning natural oil/gas
           | completely beats figuring out cleaner & more sustainable
           | energy sources"
           | 
            | My point is that "more data" here simply represents the
            | mental effort that has already been exerted in the pre-
            | AI/DL era, which we're now capitalizing on while we can,
            | similar to how fossil fuels represent the energy stored by
            | earlier lifeforms, which we're also capitalizing on while
            | we can. It's a system way out of equilibrium, progressing
            | while it can on resources borrowed from prior generations.
           | 
            | In the long run, AI agents will be less wasteful as they
            | reach the limits of what data or energy is available on
            | the margins to compete among themselves and to reach their
            | goals. It's just that we haven't reached that limit yet,
            | and the competition at this stage is about processing more
            | data and scaling the models at any cost.
        
       | hn_throwaway_99 wrote:
        | Somewhat off topic. As someone who did some neural network
        | programming in Matlab a couple of decades ago, I always feel a
        | bit dismayed that I'm able to understand so little about
        | modern AI, given the explosion of advances in the field
        | starting around the late 00s: convolutional neural networks
        | and deep learning, transformers, large language models, etc.
       | 
       | Can anyone recommend some great courses or other online resources
       | for getting up to speed on the state-of-the-art with respect to
       | AI? Not really so much looking for an "ELI5" but more of a "you
       | have a strong programming and very-old-school AI background, here
       | are the steps/processes you need to know to understand modern
       | tools".
       | 
       | Edit: thanks for all the great replies, super helpful!
        
         | sorenjan wrote:
         | A course by Andrej Karpathy on building neural networks, from
         | scratch, in code. We start with the basics of backpropagation
         | and build up to modern deep neural networks, like GPT.
         | 
         | https://karpathy.ai/zero-to-hero.html
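          | 
          | If you want a taste before committing: the first lecture
          | builds backpropagation (the chain rule) up by hand. A tiny
          | example of the same idea (mine, not the course's), with
          | PyTorch autograd doing the work:
          | 
          |     import torch
          | 
          |     # backprop through a one-neuron "network"
          |     x = torch.tensor(2.0, requires_grad=True)
          |     w = torch.tensor(-3.0, requires_grad=True)
          |     b = torch.tensor(1.0, requires_grad=True)
          |     y = torch.tanh(w * x + b)
          |     y.backward()
          |     # dy/dx, dy/dw, dy/db via the chain rule
          |     print(x.grad, w.grad, b.grad)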
        
         | ad404b8a372f2b9 wrote:
         | http://cs231n.stanford.edu/ is good for convolutional networks.
        
         | rjtavares wrote:
         | The obvious answer is Fast AI's Practical Deep Learning for
         | Coders - https://course.fast.ai/
         | 
         | They'll also be releasing the "From Deep Learning Foundations
         | to Stable Diffusion" course soon, which is basically Part 2 of
         | the course.
        
         | whoateallthepy wrote:
          | I put together a repository at the end of last year to walk
          | through a basic use of a single-layer Transformer: detecting
          | whether "a" and "b" are in a sequence of characters.
          | Everything is reproducible, so hopefully it's also helpful
          | for getting used to some of the tooling!
         | 
         | https://github.com/rstebbing/workshop/tree/main/experiments/...
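          | 
          | For anyone who just wants the flavor of the task without
          | cloning anything: a model like that can be tiny. This is my
          | own sketch (not the repo's code), using PyTorch's built-in
          | encoder layer, with arbitrary hyperparameters:
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     VOCAB = "abcdefgh"
          |     stoi = {c: i for i, c in enumerate(VOCAB)}
          | 
          |     class TinyTransformer(nn.Module):
          |         def __init__(self, d=32, heads=4, max_len=16):
          |             super().__init__()
          |             self.tok = nn.Embedding(len(VOCAB), d)
          |             self.pos = nn.Embedding(max_len, d)
          |             self.enc = nn.TransformerEncoderLayer(
          |                 d, heads, dim_feedforward=64, batch_first=True)
          |             self.head = nn.Linear(d, 2)  # both present or not
          | 
          |         def forward(self, idx):  # idx: (batch, seq)
          |             pos = torch.arange(idx.shape[1])
          |             h = self.enc(self.tok(idx) + self.pos(pos))
          |             return self.head(h.mean(dim=1))  # pool, classify
          | 
          |     model = TinyTransformer()
          |     x = torch.tensor([[stoi[c] for c in "cabbage"]])
          |     logits = model(x)  # train with cross-entropy on labels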
        
         | heyitsguay wrote:
         | For a while now, an answer I've seen is to start with
         | "Attention Is All You Need", the original Transformers paper.
         | It's still pretty good, but over the past year I've led a few
         | working sessions on grokking transformer computational
         | fundamentals and they've turned up some helpful later additions
         | that simplify and clarify what's going on.
         | 
         | You can quickly get overwhelmed by the million good resources
         | out there so I'll keep it to these three. If you have a strong
         | CS background, they'll take you a long way:
         | 
         | (1) Transformers from Scratch:
         | https://peterbloem.nl/blog/transformers
         | 
         | (2) Attention Is All You Need:
         | https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...
         | 
         | (3) Formal Algorithms for Transformers:
         | https://arxiv.org/abs/2207.09238
        
           | nerdponx wrote:
            | Part of the problem with self-studying this stuff is that
            | it's hard to know which resources are good without already
            | being at least conversant with the material.
        
             | peterfirefly wrote:
             | That problem doesn't really disappear with teachers and
             | classes ;)
        
           | bfung wrote:
           | YouTube channel that explains the paper in detail:
           | https://youtu.be/iDulhoQ2pro
           | 
           | And subsequent follow-ups (ROME, editing transformer arch):
           | https://youtu.be/_NMQyOu2HTo
           | 
            | I find the channel amazing at explaining super complex
            | topics in simple enough terms for people who have some
            | background in AI.
        
             | peterfirefly wrote:
             | Yannic Kilcher is great but this video worked better for
             | me:
             | 
             | "LSTM is dead. Long live transformers!" (Leo Dirac):
             | https://www.youtube.com/watch?v=S27pHKBEp30
        
       | yonz wrote:
        | Great compilation; it would be nice to see the Vision
        | Transformer (ViT) included as well.
        | 
        | Andrej Karpathy's GPT video (https://youtu.be/kCc8FmEb1nY) is
        | a must-have companion for this. I was going nuts trying to
        | grok Key, Query, Position, and Value until Andrej broke it
        | down for me.
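        | 
        | The mental model that finally clicked for me, as a toy sketch
        | with made-up sizes (not Andrej's code): token and position
        | embeddings say what a token is and where it sits, and Query,
        | Key, and Value are just three learned linear views of the same
        | vectors.
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     d = 16
        |     tok_emb = nn.Embedding(100, d)   # what the token is
        |     pos_emb = nn.Embedding(32, d)    # where it sits
        |     Wq, Wk, Wv = (nn.Linear(d, d) for _ in range(3))
        | 
        |     idx = torch.randint(0, 100, (1, 10))          # (batch, seq)
        |     x = tok_emb(idx) + pos_emb(torch.arange(10))
        |     Q, K, V = Wq(x), Wk(x), Wv(x)                 # views of x
        |     att = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5,
        |                         dim=-1)
        |     out = att @ V                                 # (1, 10, 16)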
        
       ___________________________________________________________________
       (page generated 2023-01-29 23:00 UTC)