[HN Gopher] What is a transformer model? (2022)
       ___________________________________________________________________
        
       What is a transformer model? (2022)
        
       Author : Anon84
       Score  : 222 points
       Date   : 2023-06-23 17:24 UTC (5 hours ago)
        
 (HTM) web link (blogs.nvidia.com)
 (TXT) w3m dump (blogs.nvidia.com)
        
       | smrtinsert wrote:
       | Great summary. I'm still working through a class that's just
       | getting to RNN and CNN. Crazy it seems like an obsolete tech
       | already.
        
         | venky180 wrote:
         | While it is largely obsolete for practical purposes, learning
         | about them is still valuable as they illustrate the natural
         | evolution in the thought process behind the development of
         | transformers.
        
       | techbruv wrote:
       | Some other good resources:
       | 
       | [0]: The original paper: https://arxiv.org/abs/1706.03762
       | 
       | [1]: Full walkthrough for building a GPT from Scratch:
       | https://www.youtube.com/watch?v=kCc8FmEb1nY
       | 
       | [2]: A simple inference only implementation in just NumPy, that's
       | only 60 lines: https://jaykmody.com/blog/gpt-from-scratch/
       | 
       | [3]: Some great visualizations and high-level explanations:
       | http://jalammar.github.io/illustrated-transformer/
       | 
       | [4]: An implementation that is presented side-by-side with the
       | original paper:
       | https://nlp.seas.harvard.edu/2018/04/03/attention.html
        
         | itissid wrote:
         | [1] is thoroughly recommended.
        
           | revskill wrote:
           | Recommended for who ?
        
             | quickthrower2 wrote:
              | Masochists! In a good way! I recommend you do the full
              | course rather than jumping straight into that video. I did
              | the full course, paused around lecture 2 to do some of a
              | university course to really understand some stuff, then
              | came back and finished it off.
             | 
              | By the end of it you would have done stuff like working
              | out back-propagation through sums, broadcasting, batchnorm
              | etc. by hand. Fairly intense for a regular programmer!
        
             | basedbertram wrote:
             | From looking at the video probably someone who has good
             | working knowledge of PyTorch, familiarity with NLP
             | fundamentals and transformers, and somewhat of a working
             | understanding of how GPT works.
        
               | [deleted]
        
         | quickthrower2 wrote:
          | Done [1]. It is a jaw-dropper! Especially if you have done the
          | rest of the series and seen the results of older architectures.
          | I was like "where is the rest of it, you ain't finished!" ...
          | and then ... ah, I see why they named the paper "Attention Is
          | All You Need".
          | 
          | But even the crappy (small, 500k params IIRC) Transformer
          | model, trained on a free Colab in a couple of minutes, was
          | relatively impressive. Looking only 8 chars back and trained on
          | an HN thread, it got the structure / layout of the page pretty
          | well, interspersed with drunken-looking HN comments.
        
         | samvher wrote:
         | I found this lecture and the one following it very helpful as
         | well:
         | https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...
        
         | basedbertram wrote:
         | There's a newer version of [4]:
         | http://nlp.seas.harvard.edu/annotated-transformer/
        
       | [deleted]
        
       | sylware wrote:
        | Weird, still no diagram of a transformer-based connectome. Or did
        | I miss it?
        
       | simonw wrote:
       | This is from March 25, 2022 - over a year old now.
        
       | cubefox wrote:
       | (2022)
       | 
       | It feels like a joke, but the fact they are mentioning GPT-3 and
       | Megatron-Turing as the hottest new things makes this piece seem
       | so outdated.
        
         | hyperbovine wrote:
         | The relevant lifespan of a paper in this field appears to
         | average about three months. Can't really hold that against
         | them.
        
           | amelius wrote:
           | Any self-respecting AI researcher writes self-updating
           | papers.
        
             | ronsor wrote:
             | Papers that update themselves with GPT-4?
        
               | Anon84 wrote:
               | GPT-42
        
           | jprd wrote:
           | If you say "Transformer" and "Megatron" in the same sentence,
           | you have my full-attention for at least enough time to start
           | making a serious observation about ML, or frankly, anything.
        
             | moffkalast wrote:
             | Once they realize they can distil and quantize Megatron, I
             | suppose they'll name that one "Starscream"?
        
       | sorenjan wrote:
       | There's one thing I haven't quite grasped about transformers. Are
       | the query, key, and value vectors trained per token? Does each
       | token have specific QVK vectors in each head, or does each
       | attention head have one set that are trained using a lot of
       | tokens?
        
         | danielmarkbruce wrote:
          | The q,k,v _projections_ are trained, usually per head. But each
          | token goes through the same projection (per head). In some
          | architectures (see the Falcon model) the k, v projections are
          | shared across the heads.
          | 
          | During the forward pass a "query" is "created" for each token
          | using the query projection (again, one projection per head; all
          | the tokens run through the same projection). The keys are
          | created using the key projection, and the values using the
          | value projection. The attention weights come from the dot
          | product of each query with the keys, and the output for each
          | token is the weighted sum of the values.
         | 
         | But again, different models do different things. Some models
         | bring in the positional encoding into the attention calculation
         | rather than adding it in earlier. Practically every combination
         | of things has been tried since that paper was published.
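The description above can be condensed into a short sketch. This is a minimal, illustrative single-head self-attention in NumPy (not any particular model's implementation); the dimensions and names are made up, and the key point is that every token runs through the same learned projection matrices.

```python
import numpy as np

def single_head_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) trained projections.
    Every token is pushed through the SAME projections."""
    Q = x @ Wq  # one query per token
    K = x @ Wk  # one key per token
    V = x @ Wv  # one value per token
    # query-key dot products, scaled, then softmax over the keys
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each token's output is a weighted sum of the values
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = single_head_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

A multi-head layer just repeats this with separate Wq/Wk/Wv per head and concatenates the outputs; shared-k,v schemes like Falcon's reuse one K and V across heads.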
        
       | karmasimida wrote:
       | Best teacher of transformer model is ChatGPT itself
        
         | gnicholas wrote:
         | One thing I wonder about is where it gets its data about itself
         | from. Did they feed it a bunch of accurate information about
         | itself?
        
           | quickthrower2 wrote:
           | I reckon they did
        
             | gnicholas wrote:
             | Quite possible! But given how ChatGPT hallucinates, and my
             | general lack of knowledge about LLMs in general and ChatGPT
             | in particular, I would be hesitant to take what it says at
             | face value. I'm especially hesitant to trust anything it
             | says about itself in particular, since much of its
             | specifics are not publicly documented and are essentially
             | unverifiable.
             | 
             | I wish there were some way for it to communicate that
             | certain responses about itself were more or less hardcoded.
        
       | einpoklum wrote:
       | I think this is a transformer model: https://t.ly/se-M
        
       | czbond wrote:
       | It's more than meets the eye
        
       | enriquto wrote:
        | If you'd prefer something readable and explicit, instead of
        | empty handwaving and UML-like diagrams, read "The Transformer
        | model in equations" [0] by John Thickstun [1].
       | 
       | [0] https://johnthickstun.com/docs/transformers.pdf
       | 
       | [1] https://johnthickstun.com/docs/
        
         | robotresearcher wrote:
         | See also the original paper:
         | 
         | https://arxiv.org/abs/1706.03762
        
           | epistasis wrote:
           | The original paper is quite deceptive and hard to understand,
           | IMHO. It relies on jumping between several different figures
           | and mapping between shapes, in addition to guessing at what
           | the unlabeled inputs are.
           | 
            | Just a few more labels, making the implicit explicit, would
            | make it far more intelligible. Plus, the last time I went
            | through it I'm pretty sure that there's either a swap in the
            | order of the three inputs between different figures, or that
            | it's incorrectly diagrammed.
        
       | semiquaver wrote:
       | I feel like I have a schematic understanding of how transformer
       | models process text. But I struggle to understand how the same
       | concept could be used with images or other less linear data
       | types. I understand that such data can be vectorized so it's
       | represented as a point or series of points in a higher
       | dimensional space, but how does such a process capture the
       | essential high level perceptual aspects well enough to be able to
       | replicate large scale features? Or equivalently, how are the
       | dimensions chosen? I have to imagine that simply looking at an
       | image as an RGB pixel sequence would largely miss the point of
       | what it "is" and represents.
        
         | og_kalu wrote:
         | Just to add, the first user who replied to you is quite wrong.
         | You can use CNNs to get features first...but it doesn't happen
         | anymore. It's unnecessary and adds nothing.
         | 
         | Pixels don't get fed into transformers but that's more expense
         | than anything. Transformers need to understand how each piece
         | relates to every other piece. That gets very costly very fast
         | when the "pieces" are pixels. Images are split into patches
         | instead with positional embeddings.
         | 
          | As for how it learns the representations anyway: well, it's not
          | like there's any specific intuition to it. After all, the
          | original authors didn't anticipate the use case in vision.
          | 
          | And the fact that you didn't need CNNs to extract features
          | first didn't really come to light until this paper -
          | https://arxiv.org/abs/2010.11929
          | 
          | It basically just comes down to lots of layers and training.
        
         | QuesnayJr wrote:
         | They split the image into small patches (I think 16x16 is
         | standard), and then treat each patch as a token. The image
         | becomes a sequence of tokens, which gets analyzed the same as
         | if it was text.
         | 
          | Doing it this way is obviously throwing away a lot of
          | information, but I assume the advantages of using a transformer
          | outweigh the disadvantages.
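The patching step described above is just a reshape. Here's a rough sketch in NumPy, assuming a 224x224 RGB image and 16x16 patches (the sizes are the common ViT defaults, not anything mandated); each flattened patch then gets linearly projected to the model dimension and treated as one token.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """img: (H, W, C) array -> (num_patches, patch*patch*C) flat patch vectors."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # carve the image into a (H/p, W/p) grid of (p, p, C) tiles
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    # flatten each tile into one vector: this is the "token" sequence
    return tiles.reshape(-1, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = image_to_patches(img)
print(tokens.shape)  # (196, 768): 14x14 patches, 16*16*3 values each
```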
        
         | mliker wrote:
         | If I understand what you're asking, the Transformer isn't
         | initially treating the image as a sequence of pixels like p1,
         | p2, ..., pN. Instead, you can use a convolutional neural
         | network to respect the structure of the image to extract
         | features. Then you use the attention mechanism to pay attention
         | to parts of the image that aren't necessarily close together
         | but that when viewed together, contribute to the classification
         | of an object within the image.
        
           | og_kalu wrote:
           | Vision Transformers don't use CNNs to extract anything first.
           | (https://arxiv.org/abs/2010.11929). You could but it's not
           | necessary and it doesn't add anything so it doesn't happen
           | anymore.
           | 
           | Vision transformers won't treat the image as a sequence of
           | pixels but that's mostly because doing that gets very
           | expensive very fast. The image is split into patches and the
           | patches have positional embeddings.
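Since attention itself is order-agnostic, the positional embeddings mentioned above are what tell the model where each patch sits. A minimal sketch, assuming learned per-position embeddings added to already-projected patch tokens (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model = 196, 64

patch_tokens = rng.normal(size=(num_patches, d_model))  # projected patches
pos_embed = rng.normal(size=(num_patches, d_model))     # learned, one per position
x = patch_tokens + pos_embed  # transformer input now carries location info
print(x.shape)  # (196, 64)
```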
        
         | sangnoir wrote:
         | > how does such a process capture the essential high level
         | perceptual aspects well enough to be able to replicate large
         | scale features
         | 
         | Layers and lots of training. Upper layers can capture/recognize
         | large scale features
        
       | asylteltine wrote:
       | [dead]
        
       | edc117 wrote:
       | There was an excellent thread/discussion about it a while ago:
       | 
       | https://news.ycombinator.com/item?id=35977891
        
       ___________________________________________________________________
       (page generated 2023-06-23 23:00 UTC)