[HN Gopher] What is a transformer model? (2022)
___________________________________________________________________
What is a transformer model? (2022)
Author : Anon84
Score : 222 points
Date : 2023-06-23 17:24 UTC (5 hours ago)
(HTM) web link (blogs.nvidia.com)
(TXT) w3m dump (blogs.nvidia.com)
| smrtinsert wrote:
| Great summary. I'm still working through a class that's just
| getting to RNNs and CNNs. Crazy that they already seem like
| obsolete tech.
| venky180 wrote:
| While they are largely obsolete for practical purposes, learning
| about them is still valuable, as they illustrate the natural
| evolution in the thought process behind the development of
| transformers.
| techbruv wrote:
| Some other good resources:
|
| [0]: The original paper: https://arxiv.org/abs/1706.03762
|
| [1]: Full walkthrough for building a GPT from Scratch:
| https://www.youtube.com/watch?v=kCc8FmEb1nY
|
| [2]: A simple inference only implementation in just NumPy, that's
| only 60 lines: https://jaykmody.com/blog/gpt-from-scratch/
|
| [3]: Some great visualizations and high-level explanations:
| http://jalammar.github.io/illustrated-transformer/
|
| [4]: An implementation that is presented side-by-side with the
| original paper:
| https://nlp.seas.harvard.edu/2018/04/03/attention.html
| itissid wrote:
| [1] is thoroughly recommended.
| revskill wrote:
| Recommended for whom?
| quickthrower2 wrote:
| Masochists! In a good way! I recommend you do the full
| course rather than jumping straight into that video. I did the
| full course, paused around lecture 2 to do part of a university
| course to really understand some things, then came back and
| finished it off.
|
| By the end of it you will have done things like working out
| back-propagation by hand through sums, broadcasting, batchnorm,
| etc. Fairly intense for a regular programmer!
| basedbertram wrote:
| From looking at the video, probably someone who has a good
| working knowledge of PyTorch, familiarity with NLP
| fundamentals and transformers, and some working
| understanding of how GPT works.
| [deleted]
| quickthrower2 wrote:
| Done [1]. It is a jaw-dropper! Especially if you have done the
| rest of the series and seen the results of older architectures.
| I was like "where is the rest of it, you ain't finished!" ...
| and then ... ah, I see why they named the paper "Attention Is
| All You Need".
|
| But even the crappy (small, 500k params IIRC) Transformer model
| trained on a free Colab in a couple of minutes was relatively
| impressive. Looking at only 8 chars back and trained on an HN
| thread, it got the structure / layout of the page pretty much
| right, interspersed with drunken-looking HN comments.
| samvher wrote:
| I found this lecture and the one following it very helpful as
| well:
| https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...
| basedbertram wrote:
| There's a newer version of [4]:
| http://nlp.seas.harvard.edu/annotated-transformer/
| [deleted]
| sylware wrote:
| Weird, still no diagram of a transformer-based connectome. Or did
| I miss it?
| simonw wrote:
| This is from March 25, 2022 - over a year old now.
| cubefox wrote:
| (2022)
|
| It feels like a joke, but the fact that they mention GPT-3 and
| Megatron-Turing as the hottest new things makes this piece seem
| so outdated.
| hyperbovine wrote:
| The relevant lifespan of a paper in this field appears to
| average about three months. Can't really hold that against
| them.
| amelius wrote:
| Any self-respecting AI researcher writes self-updating
| papers.
| ronsor wrote:
| Papers that update themselves with GPT-4?
| Anon84 wrote:
| GPT-42
| jprd wrote:
| If you say "Transformer" and "Megatron" in the same sentence,
| you have my full-attention for at least enough time to start
| making a serious observation about ML, or frankly, anything.
| moffkalast wrote:
| Once they realize they can distil and quantize Megatron, I
| suppose they'll name that one "Starscream"?
| sorenjan wrote:
| There's one thing I haven't quite grasped about transformers. Are
| the query, key, and value vectors trained per token? Does each
| token have specific Q/K/V vectors in each head, or does each
| attention head have one set that is trained using a lot of
| tokens?
| danielmarkbruce wrote:
| The q,k,v _projections_ are trained, usually per head. But each
| token goes through the same projection (per head). In some
| architectures (see the Falcon model) the k, v projections are
| shared across the heads.
|
| During the forward pass a "query" is "created" for each token
| using the query projection (again, one projection per head; all
| the tokens run through the same projection). The keys are
| created using the key projection and the values using the value
| projection. The attention weights are then the (scaled,
| softmaxed) dot products of each query with the keys, and each
| token's output is the corresponding weighted sum of the values.
|
| But again, different models do different things. Some models
| bring the positional encoding into the attention calculation
| rather than adding it in earlier. Practically every combination
| of things has been tried since that paper was published.
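|
| A rough sketch of that flow in NumPy (single head, no masking or
| batching; the shapes and names here are just for illustration):
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         e = np.exp(x - x.max(axis=axis, keepdims=True))
|         return e / e.sum(axis=axis, keepdims=True)
|
|     d_model, d_head, seq_len = 64, 16, 8
|     rng = np.random.default_rng(0)
|
|     # One projection each for q, k, v (per head); every token
|     # runs through the same projection matrices.
|     W_q = rng.normal(size=(d_model, d_head))
|     W_k = rng.normal(size=(d_model, d_head))
|     W_v = rng.normal(size=(d_model, d_head))
|
|     x = rng.normal(size=(seq_len, d_model))  # token embeddings
|     Q, K, V = x @ W_q, x @ W_k, x @ W_v
|     scores = Q @ K.T / np.sqrt(d_head)       # query-key dot products
|     out = softmax(scores) @ V                # weighted sum of values
|
| In a real model those projections would be learned parameters
| rather than random matrices, with one such set per head (or, as
| mentioned above for Falcon, k/v projections shared across heads).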
| karmasimida wrote:
| The best teacher of the transformer model is ChatGPT itself.
| gnicholas wrote:
| One thing I wonder about is where it gets its data about itself
| from. Did they feed it a bunch of accurate information about
| itself?
| quickthrower2 wrote:
| I reckon they did
| gnicholas wrote:
| Quite possible! But given how ChatGPT hallucinates, and my
| lack of knowledge about LLMs in general and ChatGPT
| in particular, I would be hesitant to take what it says at
| face value. I'm especially hesitant to trust anything it
| says about itself in particular, since much of its
| specifics are not publicly documented and are essentially
| unverifiable.
|
| I wish there were some way for it to communicate that
| certain responses about itself were more or less hardcoded.
| einpoklum wrote:
| I think this is a transformer model: https://t.ly/se-M
| czbond wrote:
| It's more than meets the eye
| enriquto wrote:
| If you'd prefer something readable and explicit, instead of
| empty hand-waving and UML-like diagrams, read "The Transformer
| model in equations" [0] by John Thickstun [1].
|
| [0] https://johnthickstun.com/docs/transformers.pdf
|
| [1] https://johnthickstun.com/docs/
| robotresearcher wrote:
| See also the original paper:
|
| https://arxiv.org/abs/1706.03762
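|
| For reference, the scaled dot-product attention at the core of
| that paper is
|
|     Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
|
| where Q, K and V stack the query, key and value vectors and d_k
| is the key dimension.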
| epistasis wrote:
| The original paper is quite deceptive and hard to understand,
| IMHO. It relies on jumping between several different figures
| and mapping between shapes, in addition to guessing at what
| the unlabeled inputs are.
|
| Just a few more labels, making the implicit explicit, would
| make it far more intelligible. Plus, the last time I went through
| it I'm pretty sure that there's either a swap in the order of
| the three inputs between different figures, or that it's
| incorrectly diagrammed.
| semiquaver wrote:
| I feel like I have a schematic understanding of how transformer
| models process text. But I struggle to understand how the same
| concept could be used with images or other less linear data
| types. I understand that such data can be vectorized so it's
| represented as a point or series of points in a higher
| dimensional space, but how does such a process capture the
| essential high level perceptual aspects well enough to be able to
| replicate large scale features? Or equivalently, how are the
| dimensions chosen? I have to imagine that simply looking at an
| image as an RGB pixel sequence would largely miss the point of
| what it "is" and represents.
| og_kalu wrote:
| Just to add, the first user who replied to you is quite wrong.
| You can use CNNs to get features first... but it doesn't happen
| anymore. It's unnecessary and adds nothing.
|
| Raw pixels don't get fed into transformers, but that's more a
| matter of expense than anything. Transformers need to understand
| how each piece relates to every other piece, and that gets very
| costly very fast when the "pieces" are individual pixels. Images
| are split into patches instead, with positional embeddings.
|
| As for how it learns the representations anyway: well, it's not
| like there's any specific intuition to it; after all, the
| original authors didn't anticipate the use case in vision.
|
| And the fact that you didn't need CNNs to extract features
| first didn't really come to light until this paper:
| https://arxiv.org/abs/2010.11929
|
| It basically just comes down to lots of layers and training.
| QuesnayJr wrote:
| They split the image into small patches (I think 16x16 is
| standard), and then treat each patch as a token. The image
| becomes a sequence of tokens, which gets analyzed the same as
| if it was text.
|
| Doing it this way is obviously throwing away a lot of
| information, but I assume the advantages of using a transformer
| outweigh the disadvantages.
| mliker wrote:
| If I understand what you're asking, the Transformer isn't
| initially treating the image as a sequence of pixels like p1,
| p2, ..., pN. Instead, you can use a convolutional neural
| network to respect the structure of the image to extract
| features. Then you use the attention mechanism to pay attention
| to parts of the image that aren't necessarily close together
| but that when viewed together, contribute to the classification
| of an object within the image.
| og_kalu wrote:
| Vision Transformers don't use CNNs to extract anything first
| (https://arxiv.org/abs/2010.11929). You could, but it's not
| necessary and it doesn't add anything, so it doesn't happen
| anymore.
|
| Vision transformers also don't treat the image as a sequence of
| pixels, but that's mostly because doing that gets very
| expensive very fast. The image is split into patches, and the
| patches get positional embeddings.
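|
| A minimal sketch of that patchify-and-embed step in NumPy (the
| 16x16 patch size and the dimensions are just illustrative, and
| the "learned" matrices here are random placeholders):
|
|     import numpy as np
|
|     patch, d_model = 16, 64
|     img = np.random.rand(224, 224, 3)            # H x W x C image
|
|     # Cut the image into non-overlapping 16x16 patches and
|     # flatten each patch into a single vector.
|     h, w = img.shape[0] // patch, img.shape[1] // patch
|     patches = (img.reshape(h, patch, w, patch, 3)
|                   .swapaxes(1, 2)
|                   .reshape(h * w, -1))           # (196, 768)
|
|     # A learned linear projection turns each flat patch into a
|     # "token"; positional embeddings are added so the model still
|     # knows where each patch came from.
|     W_embed = np.random.randn(patch * patch * 3, d_model) * 0.02
|     pos = np.random.randn(h * w, d_model) * 0.02
|     tokens = patches @ W_embed + pos             # (196, 64)
|
| From there the sequence of patch tokens goes through the same
| attention blocks as a sequence of text tokens would.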
| sangnoir wrote:
| > how does such a process capture the essential high level
| perceptual aspects well enough to be able to replicate large
| scale features
|
| Layers and lots of training. Upper layers can capture/recognize
| large-scale features.
| asylteltine wrote:
| [dead]
| edc117 wrote:
| There was an excellent thread/discussion about it a while ago:
|
| https://news.ycombinator.com/item?id=35977891
___________________________________________________________________
(page generated 2023-06-23 23:00 UTC)