[HN Gopher] The GPT Architecture, on a Napkin
       ___________________________________________________________________
        
       The GPT Architecture, on a Napkin
        
       Author : decremental
       Score  : 344 points
       Date   : 2022-12-11 12:29 UTC (10 hours ago)
        
 (HTM) web link (dugas.ch)
 (TXT) w3m dump (dugas.ch)
        
       | sAbakumoff wrote:
       | Personalized chatGPT + Boston Dynamics collaboration in creating
       | best friends who can talk about anything
        
       | macleginn wrote:
       | A small, very clearly written and well commented implementation:
       | https://github.com/karpathy/minGPT
        
       | oars wrote:
       | Great article about ChatGPT architecture, thank you.
        
       | devxpy wrote:
        | Would also love this to include the InstructGPT architecture
        | with its RL reward model!
        
       | bogomipz wrote:
       | This was a great read. Does anyone know what software the author
       | may have used for the hand-written or napkin-sketch looking
       | graphics?
        
       | [deleted]
        
       | seydor wrote:
        | It's not intuitive why the QK^T matrix is important; similarity
        | makes more sense.
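
        For what it's worth, the QK^T scores are exactly a similarity
        measure: each entry is the dot product of one token's query
        vector with another token's key vector. A minimal NumPy sketch
        (the shapes and variable names here are illustrative, not taken
        from the article):

          import numpy as np

          # Toy sizes: 4 tokens, 8-dimensional queries/keys (illustrative).
          seq_len, d_k = 4, 8
          rng = np.random.default_rng(0)
          Q = rng.normal(size=(seq_len, d_k))  # one query vector per token
          K = rng.normal(size=(seq_len, d_k))  # one key vector per token

          # scores[i, j] = dot(Q[i], K[j]): how strongly token i attends to j.
          scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product similarity
          weights = np.exp(scores)
          weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
          print(weights.shape)                 # (4, 4) attention weights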
        
       | [deleted]
        
       | ec109685 wrote:
       | This was helpful. What are the basics that allow ChatGPT to
       | output numbers and solve math problems?
        
         | sva_ wrote:
         | Did you read the article? Under "Encoding"/"Decoding".
        
           | ec109685 wrote:
           | I guess each digit is a word?
           | 
           | For example, how does the explanation in the article produce
           | a module that can solve this:
           | 
           | " To calculate the hypotenuse of a triangle with one side
            | that is 12 inches long and another side that is 36
           | centimeters long, a 6th grader might say something like this:
           | "First, we need to convert the 36 centimeters into inches so
           | that both sides of the triangle are in the same units. We can
           | do this by dividing 36 by 2.54, which is the number of
           | centimeters in one inch. This gives us 14.173228 inches.
           | Then, we can use the Pythagorean theorem to find the length
           | of the hypotenuse. The Pythagorean theorem says that in a
           | right triangle, the square of the length of the hypotenuse is
           | equal to the sum of the squares of the lengths of the other
           | two sides. So we can use this formula to find the length of
           | the hypotenuse: a^2 + b^2 = c^2. In our triangle, the length
           | of one side is 12 inches, and the length of the other side is
           | 14.173228 inches. So we can plug those numbers into the
           | formula like this: 12^2 + 14.173228^2 = c^2. Then we just
           | need to do the math to find the value of c. 12^2 is 144, and
           | 14.173228^2 is 201.837296. So if we add those two numbers
           | together, we get 346.837296. And if we take the square root
           | of that number, we get the length of the hypotenuse, which is
           | 18.816199 inches."
        
             | heisenzombie wrote:
              | Well, it's wrong, for one: it correctly gets the division
              | (I assume it has that fact memorized), but 14.173228^2 is
              | 200.88, not 201.83. It then also does the addition wrong,
              | and the square root is also wrong.
             | 
             | You gotta be REAL careful with ChatGPT output that sounds
             | convincing and technical. It's very good at convincingly
             | making stuff up, even math-y science-y sounding stuff.
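
              Checking the quoted arithmetic directly in plain Python
              (the values in the comments are approximate):

                inches = 36 / 2.54        # approx. 14.1732, as quoted
                square = 14.173228 ** 2   # approx. 200.88, not 201.837296
                total = 12 ** 2 + square  # approx. 344.88, not 346.837296
                hyp = total ** 0.5        # approx. 18.57, not 18.816199
                print(inches, square, total, hyp)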
        
         | bagels wrote:
         | Numbers are just more words to the model.
        
           | ec109685 wrote:
           | But it supports arbitrary precision or at least a whole bunch
           | of precision:
           | 
           | https://news.ycombinator.com/threads?id=ec109685#33944516
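
            To see how numbers become "just more words", you can run
            GPT-2's BPE tokenizer (the same vocabulary family GPT-3
            uses) on a numeric string. The exact split is not guaranteed
            to match what ChatGPT sees, but multi-digit numbers are
            generally chopped into several sub-word tokens rather than
            handled as exact quantities:

              from transformers import GPT2TokenizerFast

              tok = GPT2TokenizerFast.from_pretrained("gpt2")
              # A long decimal becomes several byte-pair tokens,
              # not one dedicated "number" token.
              print(tok.tokenize("14.173228"))
              print(tok.encode("14.173228"))  # the integer IDs the model sees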
        
       | badrabbit wrote:
        | If I were, let's say, China, what stops me from replicating
        | GPT?
        
         | randyrand wrote:
         | very little! just competent engineers and lots of gpus
        
       | mysterydip wrote:
       | If one were to make a markov chain with the same amount of input
       | data, would the result be the same? Markov chain chatbots have
       | been a thing for years, just on a much more limited set of data.
        
         | igorkraw wrote:
         | It _is_ a Markov chain. Your input context is your state.
        
         | n2d4 wrote:
          | No. The number of possible states is 50257^2048 (vocabulary
          | size to the power of the context length). The vast majority
          | of states have never been seen and will never be seen in all
          | of humanity.
         | 
          | For example, if your training set consists of the words rain
         | and thunder used interchangeably a lot, but the word "today" is
         | only used once in the sentence "there is no rain today", then a
         | Markov chain based on the data would never output "there is no
         | thunder today", but a transformer might.
         | 
          | In other words, information compression (e.g. equating rain
          | with thunder) isn't just for practicality, it's a necessary
          | requirement for (the current generation of) good language
          | models.
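
          A tiny word-level Markov chain over a made-up corpus makes the
          point concrete: if "today" has only ever been observed after
          "rain", the chain can never emit "no thunder today", whereas a
          transformer that has learned rain and thunder are
          interchangeable might (the corpus below is invented purely for
          illustration):

            from collections import defaultdict

            corpus = [
                "the rain is heavy", "the thunder is heavy",
                "rain and thunder all night", "there is no rain today",
            ]

            # First-order Markov chain: each word maps to the set of words
            # that have been observed immediately after it.
            successors = defaultdict(set)
            for sentence in corpus:
                words = sentence.split()
                for prev, nxt in zip(words, words[1:]):
                    successors[prev].add(nxt)

            print(successors["rain"])     # {'is', 'and', 'today'}
            print(successors["thunder"])  # {'is', 'all'}; "today" never follows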
        
           | mysterydip wrote:
           | Ah, that's what I was missing. Thanks!
        
       | wklm wrote:
        | Very cool write-up, would love it even more if the author
        | included the references.
        
       | mudrockbestgirl wrote:
       | That's a great summary, but it's important to understand that
       | much more goes into training these models. The architecture is
       | not any kind of secret sauce, or special in any way. It's just a
       | typical Transformer. I call this "architecture porn" - people
       | love looking at neural net architectures and think that's the key
        | to success. If only you knew the algorithm! It's so simple!
       | 
       | But reality is usually much messier. The real training code will
       | be littered with hundreds of ugly little tricks to make it work.
       | A large part of it will be input preprocessing and data
       | engineering, tricks to deal with exploding/vanishing gradients,
       | monitoring, learning rate schedules and optimizer cycling,
       | complexity for distributed training, regularization tricks,
       | changing parts of the architecture for performance reasons (like
       | attention), and so on.
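
        As a flavor of what such "ugly little tricks" look like in code,
        here is a hedged PyTorch sketch of two common ones, gradient
        clipping and a warmup-then-decay learning rate schedule (the
        model, data, and hyperparameters are placeholders, not anything
        OpenAI has published):

          import torch

          model = torch.nn.Linear(512, 512)   # stand-in for the real Transformer
          opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

          def lr_lambda(step, warmup=2000):
              # Linear warmup, then inverse-square-root decay (one common recipe).
              return min((step + 1) / warmup, (warmup / (step + 1)) ** 0.5)

          sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

          for step in range(10):              # toy loop with random "data"
              x = torch.randn(8, 512)
              loss = model(x).pow(2).mean()
              opt.zero_grad()
              loss.backward()
              # Clip the gradient norm to tame exploding gradients.
              torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
              opt.step()
              sched.step()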
        
         | quonn wrote:
          | Don't know. Karpathy has a very compact implementation of GPT
          | [0] using standard technology (it could be even more compact,
          | but it reimplements, for example, the attention layer for
          | teaching purposes). While he presumably has no access to how
          | the real model was trained exactly, if there were more to it
          | I think he would be the kind of person to point it out.
         | 
         | [0] https://github.com/karpathy/minGPT/tree/master/mingpt
        
         | aqme28 wrote:
         | I used to work in data engineering for ML and yes, I'd say 90%
         | of our technical expertise on both the science and engineering
         | side went into designing the datasets.
        
           | sebzim4500 wrote:
           | It feels like this is less true for GPT though, especially as
           | OpenAI seems to be adopting a 'kitchen sink' approach.
        
             | mike_hearn wrote:
              | Just getting plain text out of the web without getting
              | flooded with boilerplate, noise, SEO spam, duplication,
              | infinite pages like calendars, etc. is already a hard
              | data engineering problem.
        
         | sebzim4500 wrote:
          | I'm far from an expert in this field, but based on my
          | conversations with people who are, I think this is getting
          | less true. Normally these models are trained with
          | straightforward optimizers (basically naive SGD), since
          | advances like batch normalization and residual connections
          | make the fancier stuff unnecessary. I think the learning rate
          | schedules used for these big networks tend to be simple as
          | well, just two or three steps.
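
          For reference, a schedule with "just two or three steps" is
          essentially a one-liner in PyTorch (the optimizer and
          milestones below are placeholders):

            import torch

            opt = torch.optim.SGD(torch.nn.Linear(4, 4).parameters(), lr=0.1)
            # Multiply the learning rate by 0.1 at two fixed training steps.
            sched = torch.optim.lr_scheduler.MultiStepLR(
                opt, milestones=[100_000, 200_000], gamma=0.1)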
        
           | [deleted]
        
           | andreyk wrote:
           | I work in this field (PhD candidate), and what you say is
           | true for smaller models, but not GPT-3 scale models. Training
            | large-scale models involves a lot more, as the OP said. It's
           | not just learning rate schedulers, it's a whole bunch of
           | stuff.
           | 
            | See this logbook from training the GPT-3 sized OPT model:
            | https://github.com/facebookresearch/metaseq/blob/main/projec...
        
             | lucidrains wrote:
              | It is neither as simple as the person you are responding
              | to suggests, nor as complicated as you make it seem. It
              | will only get simpler with time.
        
               | [deleted]
        
             | [deleted]
        
             | marstall wrote:
              | So creating each new rev of GPT-3 would involve going
              | through something like all those messy steps in that
              | logbook?
        
             | lossolo wrote:
              | Seems like the majority of problems in this log are devops
              | problems, which seem to be a combination of ML people
              | doing devops work without having experience with devops
              | work, and a really bad cloud vendor. I've been running
              | multiple bare metal nodes with 8 GPUs each, running 24/7
              | for months at almost 100% utilization, and had 100x fewer
              | problems than they had.
        
         | [deleted]
        
         | [deleted]
        
         | WanderPanda wrote:
         | I've recently come to the conclusion that the magic of fully
         | connected neural networks is that there are almost no tricks to
         | reach close to sota. Dense layers + relu + adam = it just works
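
          The claim above, spelled out: a plain fully connected network
          with ReLU and Adam trains on simple tabular-style data with
          essentially no extra tricks (a minimal sketch; the sizes and
          data are made up):

            import torch

            # Dense layers + ReLU + Adam, nothing else.
            net = torch.nn.Sequential(
                torch.nn.Linear(20, 64), torch.nn.ReLU(),
                torch.nn.Linear(64, 64), torch.nn.ReLU(),
                torch.nn.Linear(64, 1),
            )
            opt = torch.optim.Adam(net.parameters(), lr=1e-3)

            x = torch.randn(256, 20)                        # toy regression data
            y = x[:, :1] * 2.0 + 0.1 * torch.randn(256, 1)
            for _ in range(200):
                loss = torch.nn.functional.mse_loss(net(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()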
        
           | blackbear_ wrote:
            | Sorry, but this is just wrong: using only fully connected
            | layers would result in pretty bad performance on images,
            | text, audio, etc., or at the very least require much more
            | data to perform well. If you at least use the right type
            | of architecture for each data modality, then I agree that
            | the basic version won't perform much worse than sota in
            | the real world.
        
             | WanderPanda wrote:
              | Maybe I wasn't clear enough, but of course I'm not
              | implying that you can reach sota on image classification
              | with FCNNs. There are many problems where the input space
              | is not as noisy, redundant, and structure-bearing as with
              | images.
        
             | jerrygenser wrote:
              | I think part of the parent comment is wrong, but part is
              | correct.
              | 
              | There are many rules of thumb that took the last 5+ years
              | to discover but are now quite standard. You are nitpicking
              | on fully connected, but if we add dropout, weight
              | initialization, and adaptive learning rates to what they
              | said, then we are fairly close to being able to at least
              | get a deep architecture to overfit a toy dataset and be
              | off to the races for then applying it to a larger dataset.
        
               | eternalban wrote:
               | The smart money should be on research on current
               | shortcomings that will become deal breakers when AI is
               | fully pervasive in society. For example, addressing
               | catastrophic forgetting seems to me to be a very
               | profitable research aim.
        
           | [deleted]
        
       | cuuupid wrote:
        | A note on the sinusoidal encoding: the reason it's used is,
        | generally speaking, twofold:
        | 
        | 1 - To encode position somehow (which the author details)
        | 
        | 2 - Because sine is easy "noise" for the network to learn.
        | 
        | There are also a bunch of cool tricks here, even down to the
        | PyTorch implementation, to optimize this encoding by exploiting
        | the nature of sine/cosine, which is an added reason for its
        | popularity in Transformer architectures. If you like math I
        | recommend diving into it, as it's quick but fun!
        | 
        | (Side note: it's also falling out of fashion in favor of other
        | encoding methods, e.g. rotary positional encoding is vastly
        | popular in the RoFormer branch of transformers.)
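
        For reference, this is the standard sinusoidal position encoding
        from "Attention Is All You Need", written out in NumPy (the
        sizes passed in at the end are illustrative; the article's own
        diagrams, not this snippet, define the author's exact setup):

          import numpy as np

          def sinusoidal_encoding(seq_len, d_model):
              # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
              # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
              pos = np.arange(seq_len)[:, None]
              i = np.arange(d_model // 2)[None, :]
              angles = pos / np.power(10000, 2 * i / d_model)
              enc = np.zeros((seq_len, d_model))
              enc[:, 0::2] = np.sin(angles)
              enc[:, 1::2] = np.cos(angles)
              return enc

          # One encoding vector per position, added to the token embeddings.
          print(sinusoidal_encoding(2048, 12288).shape)  # (2048, 12288)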
        
         | WanderPanda wrote:
         | Is treating the position embeddings as trainable weights
         | already out of fashion again?
        
           | alexmolas wrote:
            | According to the "Attention Is All You Need" paper, there
            | weren't huge differences between using sin/cos and
            | trainable weights. But the paper is a little bit old, so I
            | don't know what the current sota is regarding positional
            | embeddings.
        
             | [deleted]
        
             | yunwal wrote:
             | I'm new to this stuff, but as I understand it, the
             | "Attention is all you need" paper stated that training the
             | positional encoding weights didn't improve results for
             | language models specifically, but other papers found that
             | vision transformers performed better with trainable
             | weights.
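
              For contrast with the sinusoidal version above, the
              "trainable weights" variant is just an extra embedding
              table learned along with everything else (a minimal
              PyTorch sketch; the sizes are illustrative, with 50257
              being the GPT-2/GPT-3 vocabulary size):

                import torch

                vocab_size, max_len, d_model = 50257, 1024, 768
                tok_emb = torch.nn.Embedding(vocab_size, d_model)  # tokens
                pos_emb = torch.nn.Embedding(max_len, d_model)     # positions

                ids = torch.randint(0, vocab_size, (1, 16))  # fake 16-token input
                positions = torch.arange(ids.shape[1]).unsqueeze(0)
                x = tok_emb(ids) + pos_emb(positions)        # input to block 1
                print(x.shape)                               # (1, 16, 768)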
        
       ___________________________________________________________________
       (page generated 2022-12-11 23:01 UTC)