[HN Gopher] The Transformer Family
___________________________________________________________________
The Transformer Family
Author : alexmolas
Score : 105 points
Date : 2023-01-29 09:07 UTC (13 hours ago)
(HTM) web link (lilianweng.github.io)
(TXT) w3m dump (lilianweng.github.io)
| simonw wrote:
| On the one hand, this looks really useful.
|
| On the other hand:
|
| > There are various forms of attention / self-attention,
| Transformer (Vaswani et al., 2017) relies on the scaled dot-
| product attention: given a query matrix Q, a key matrix K and a
| value matrix V, the output is a weighted sum of the value
| vectors, where the weight assigned to each value slot is
| determined by the dot-product of the query with the
| corresponding key
|
| There HAS to be a better way of communicating this stuff. I'm
| honestly not even sure where to start decoding and explaining
| that paragraph.
|
| We really need someone with the explanatory skills of
| https://jvns.ca/ to start helping people understand this space.
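|
| For what it's worth, the quoted paragraph boils down to a few lines
| of NumPy. This is only a minimal sketch of the math it describes,
| not the article's own code:
|
|     import numpy as np
|
|     def scaled_dot_product_attention(Q, K, V):
|         # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
|         d_k = Q.shape[-1]
|         scores = Q @ K.T / np.sqrt(d_k)      # dot-product of each query with each key
|         weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
|         weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
|         return weights @ V                   # weighted sum of the value vectors
|
|     # 2 queries attending over 3 key/value slots
|     Q, K, V = np.random.randn(2, 4), np.random.randn(3, 4), np.random.randn(3, 8)
|     out = scaled_dot_product_attention(Q, K, V)   # shape (2, 8)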
| mirekrusin wrote:
| Andrej Karpathy ie [0]?
|
| [0] https://www.youtube.com/watch?v=kCc8FmEb1nY
| benreesman wrote:
| The Illustrated Transformer is pretty great. I was pretty hazy
| after reading the paper back in 2017 and this resource helped a
| lot.
|
| https://jalammar.github.io/illustrated-transformer/
| simonw wrote:
| Thanks, that one is really useful.
| nerdponx wrote:
| Complicated from whose perspective? I don't go around
| commenting on systems programming articles about how low-level
| memory management and concurrency algorithms are too complicated,
| or commenting on category theory articles that the terminology
| is too obtuse and monads are too hard.
|
| I agree that there probably could be a better "on ramp" into
| this material than "take an undergraduate linear algebra
| course", but ultimately it is a mathematical model and you're
| going to have to deal with the math at some point if you want
| to actually understand what's going on. Linear algebra and
| calculus are entry-level table stakes for understanding how
| machine learning works, and there's really no way around that.
| peterfirefly wrote:
| Vector, matrix, weighted sum, and dot product are good places
| to start. In fact, these concepts are so useful that they are
| good places to start pretty much no matter where you want to
| go. 3D graphics, statistics, physics, ... and neural networks.
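|
| To make those four terms concrete (plain NumPy, nothing
| transformer-specific, just an illustrative snippet):
|
|     import numpy as np
|
|     values  = np.array([10.0, 20.0, 30.0])   # a vector
|     weights = np.array([0.7, 0.2, 0.1])      # weights that sum to 1
|
|     # a weighted sum is just a dot product
|     assert np.isclose(weights @ values, 0.7*10 + 0.2*20 + 0.1*30)
|
|     # a matrix-vector product is a stack of dot products, one per row
|     M = np.array([[1.0, 0.0, 0.0],
|                   [0.0, 1.0, 1.0]])
|     print(M @ values)   # [10. 50.]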
| lostdog wrote:
| Probably a strong neural net only needs sparse connections to
| learn well. However, we simple humans cannot predict which sparse
| connections are important. Therefore, the net needs to learn
| which connections are important, but learning the connections
| means it needs to compute all of them during the training
| process, so the training process is slow. It's very challenging
| to break this cycle!
| polygamous_bat wrote:
| What you are describing is known as the "lottery ticket
| hypothesis" [0] in the ML world -- it is a well-studied
| phenomenon!
|
| [0] https://arxiv.org/abs/1803.03635
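|
| The recipe from that paper, in outline. This is a hedged sketch:
| `train` stands in for whatever training loop/framework you use, and
| the hyperparameters are arbitrary:
|
|     import numpy as np
|
|     def find_winning_ticket(init_weights, train, prune_frac=0.2, rounds=5):
|         """Iterative magnitude pruning: train dense, drop the smallest
|         surviving weights, rewind the rest to their init, repeat."""
|         mask = {k: np.ones_like(w) for k, w in init_weights.items()}
|         for _ in range(rounds):
|             trained = train(init_weights, mask)   # always restart from the same init
|             for k, w in trained.items():
|                 alive = np.abs(w)[mask[k] == 1]
|                 cutoff = np.quantile(alive, prune_frac)
|                 mask[k] = np.where((np.abs(w) < cutoff) & (mask[k] == 1), 0.0, mask[k])
|         return mask   # the sparse "winning ticket" to retrain from init_weights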
| jerpint wrote:
| The amount of diversity in transformers is very impressive, but
| what's more impressive is that for models like GPT, scaling the
| models seems much more effective than engineering them.
| ad404b8a372f2b9 wrote:
| I see no evidence of that; transformers seem to follow the same
| trend as other architectures, with improved models coming out
| every month that demonstrate similar performance with orders of
| magnitude fewer parameters.
| nerdponx wrote:
| I think by "effective" they meant better overall performance,
| not necessarily better performance per parameter or hour of
| training.
| alexmolas wrote:
| I don't remember exactly which paper (I think it's the vision
| transformers paper), but they say something like "scaling the
| model and having more data completely beats inductive bias".
| It's impressive how we went from feature engineering in
| classical ML, to inductive bias in early deep learning, to just
| having more data in modern deep learning.
| 0xBABAD00C wrote:
| > scaling the model and having more data completely beats
| inductive bias
|
| The analogy in my mind is this: "burning natural oil/gas
| completely beats figuring out cleaner & more sustainable
| energy sources"
|
| My point is that "more data" here simply represents the mental
| effort that has already been exerted in the pre-AI/DL era,
| which we're now capitalizing on while we can. Similar to how
| fossil fuels represent the energy storage efforts by earlier
| lifeforms that we're now capitalizing on, again while we can.
| It's a system way out of equilibrium, progressing while it
| can on borrowed resources from the prior generations.
|
| In the long run, the AI agents will be less wasteful as they
| reach the limits of what data or energy is available on the
| margins to compete within themselves and to reach their
| goals. It's just we haven't reached that limit yet, and the
| competition at this stage is on processing more data and
| scaling the models at any cost.
| hn_throwaway_99 wrote:
| Somewhat off topic. As someone who did some neural network
| programming in Matlab a couple decades ago, I always feel a bit
| dismayed that I'm able to understand so little about modern AI
| given the explosion in advances in the field starting in about
| the late 00s or so with things like convolutional neural networks
| and deep learning, transformers, large language models, etc.
|
| Can anyone recommend some great courses or other online resources
| for getting up to speed on the state-of-the-art with respect to
| AI? Not really so much looking for an "ELI5" but more of a "you
| have a strong programming and very-old-school AI background, here
| are the steps/processes you need to know to understand modern
| tools".
|
| Edit: thanks for all the great replies, super helpful!
| sorenjan wrote:
| A course by Andrej Karpathy on building neural networks, from
| scratch, in code. We start with the basics of backpropagation
| and build up to modern deep neural networks, like GPT.
|
| https://karpathy.ai/zero-to-hero.html
| ad404b8a372f2b9 wrote:
| http://cs231n.stanford.edu/ is good for convolutional networks.
| rjtavares wrote:
| The obvious answer is Fast AI's Practical Deep Learning for
| Coders - https://course.fast.ai/
|
| They'll also be releasing the "From Deep Learning Foundations
| to Stable Diffusion" course soon, which is basically Part 2 of
| the course.
| whoateallthepy wrote:
| I put together a repository at the end of last year to walk
| through a basic use of a single layer Transformer: detect
| whether "a" and "b" are in a sequence of characters. Everything
| is reproducible, so hopefully helpful at getting used to some
| of the tooling too!
|
| https://github.com/rstebbing/workshop/tree/main/experiments/...
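|
| The task itself is tiny. A hypothetical data generator for it (not
| the repo's actual code) looks like this:
|
|     import random
|
|     def make_example(length=10, alphabet="abcde"):
|         seq = "".join(random.choice(alphabet) for _ in range(length))
|         label = int("a" in seq and "b" in seq)   # 1 iff both "a" and "b" appear
|         return seq, label
|
|     print(make_example())   # e.g. ('cbdaecceda', 1)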
| heyitsguay wrote:
| For a while now, an answer I've seen is to start with
| "Attention Is All You Need", the original Transformers paper.
| It's still pretty good, but over the past year I've led a few
| working sessions on grokking transformer computational
| fundamentals and they've turned up some helpful later additions
| that simplify and clarify what's going on.
|
| You can quickly get overwhelmed by the million good resources
| out there so I'll keep it to these three. If you have a strong
| CS background, they'll take you a long way:
|
| (1) Transformers from Scratch:
| https://peterbloem.nl/blog/transformers
|
| (2) Attention Is All You Need:
| https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...
|
| (3) Formal Algorithms for Transformers:
| https://arxiv.org/abs/2207.09238
| nerdponx wrote:
| Part of the problem with self-studying this stuff is that
| it's hard to know which resources are good without already
| being at least conversant with the material.
| peterfirefly wrote:
| That problem doesn't really disappear with teachers and
| classes ;)
| bfung wrote:
| YouTube channel that explains the paper in detail:
| https://youtu.be/iDulhoQ2pro
|
| And subsequent follow-ups (ROME, editing transformer arch):
| https://youtu.be/_NMQyOu2HTo
|
| I find the channel amazing at explaining super complex topics
| in simple enough terms for people who have some background in
| AI.
| peterfirefly wrote:
| Yannic Kilcher is great but this video worked better for
| me:
|
| "LSTM is dead. Long live transformers!" (Leo Dirac):
| https://www.youtube.com/watch?v=S27pHKBEp30
| yonz wrote:
| Great compilation; it would be nice to see the Vision Transformer
| (ViT) included too.
|
| Andrej Karpathy's GPT video (https://youtu.be/kCc8FmEb1nY) is a
| must-have companion for this. I was going nuts trying to grok Key,
| Query, Position and Value until Andrej broke it down for me.
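|
| For anyone stuck on the same point: a bare-bones sketch of where
| query, key, value and position come from in a GPT-style block. The
| random projection matrices Wq/Wk/Wv and the position table here are
| placeholders for learned parameters, not anything from the video:
|
|     import numpy as np
|
|     T, d = 5, 16                       # sequence length, embedding size
|     tok = np.random.randn(T, d)        # token embeddings (looked up by token id)
|     pos = np.random.randn(T, d)        # position embeddings (one per position)
|     x = tok + pos                      # "Position" is just added to the token vector
|
|     Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
|     Q, K, V = x @ Wq, x @ Wk, x @ Wv   # each token emits a Query, a Key and a Value
|
|     scores = Q @ K.T / np.sqrt(d)
|     scores[np.triu_indices(T, k=1)] = -np.inf    # causal mask: no peeking ahead
|     w = np.exp(scores - scores.max(-1, keepdims=True))
|     w /= w.sum(-1, keepdims=True)                # softmax over earlier positions
|     out = w @ V                        # each position: weighted sum of earlier Values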
___________________________________________________________________
(page generated 2023-01-29 23:00 UTC)