[HN Gopher] The GPT Architecture, on a Napkin
___________________________________________________________________
The GPT Architecture, on a Napkin
Author : decremental
Score : 344 points
Date : 2022-12-11 12:29 UTC (10 hours ago)
(HTM) web link (dugas.ch)
(TXT) w3m dump (dugas.ch)
| sAbakumoff wrote:
| Personalized chatGPT + Boston Dynamics collaboration in creating
| best friends who can talk about anything
| macleginn wrote:
| A small, very clearly written and well commented implementation:
| https://github.com/karpathy/minGPT
| oars wrote:
| Great article about ChatGPT architecture, thank you.
| devxpy wrote:
 | Would also love this to include the InstructGPT architecture with
 | its RL reward model!
| bogomipz wrote:
 | This was a great read. Does anyone know what software the author
 | may have used for the hand-drawn, napkin-sketch-looking
 | graphics?
| [deleted]
| seydor wrote:
 | It's not intuitive why the QK^T matrix is important; similarity
 | makes more sense.
| [deleted]
| ec109685 wrote:
| This was helpful. What are the basics that allow ChatGPT to
| output numbers and solve math problems?
| sva_ wrote:
| Did you read the article? Under "Encoding"/"Decoding".
| ec109685 wrote:
| I guess each digit is a word?
|
| For example, how does the explanation in the article produce
 | a model that can solve this:
|
| " To calculate the hypotenuse of a triangle with one side
 | that is 12 inches long and another side that is 36
| centimeters long, a 6th grader might say something like this:
| "First, we need to convert the 36 centimeters into inches so
| that both sides of the triangle are in the same units. We can
| do this by dividing 36 by 2.54, which is the number of
| centimeters in one inch. This gives us 14.173228 inches.
| Then, we can use the Pythagorean theorem to find the length
| of the hypotenuse. The Pythagorean theorem says that in a
| right triangle, the square of the length of the hypotenuse is
| equal to the sum of the squares of the lengths of the other
| two sides. So we can use this formula to find the length of
| the hypotenuse: a^2 + b^2 = c^2. In our triangle, the length
| of one side is 12 inches, and the length of the other side is
| 14.173228 inches. So we can plug those numbers into the
| formula like this: 12^2 + 14.173228^2 = c^2. Then we just
| need to do the math to find the value of c. 12^2 is 144, and
| 14.173228^2 is 201.837296. So if we add those two numbers
| together, we get 346.837296. And if we take the square root
| of that number, we get the length of the hypotenuse, which is
| 18.816199 inches."
| heisenzombie wrote:
 | Well, it's wrong, for one: it correctly gets the division (I
 | assume it has that fact memorized), but 14.173228^2 is
 | 200.88, not 201.83. It then also does the addition wrong,
 | and the square root is also wrong.
|
| You gotta be REAL careful with ChatGPT output that sounds
| convincing and technical. It's very good at convincingly
| making stuff up, even math-y science-y sounding stuff.
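 |
 | A quick sanity check of the quoted numbers in plain Python (my
 | own check, not from the article or the model):
 |
 |     import math
 |
 |     b = 36 / 2.54                    # cm -> inches; ~14.173228, this step is right
 |     print(b ** 2)                    # ~200.88, not 201.837296
 |     print(144 + b ** 2)              # ~344.88, not 346.837296
 |     print(math.sqrt(144 + b ** 2))   # ~18.57 inches, not 18.816199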
| bagels wrote:
| Numbers are just more words to the model.
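 |
 | Concretely, GPT models use a byte pair encoding, so a number is
 | split into whatever multi-character chunks happen to be in the
 | vocabulary, not necessarily one token per digit. A rough way to
 | see this, using Hugging Face's GPT-2 tokenizer as a stand-in
 | (the exact splits depend on the vocabulary):
 |
 |     from transformers import GPT2TokenizerFast
 |
 |     tok = GPT2TokenizerFast.from_pretrained("gpt2")
 |     # Prints the BPE pieces the model actually sees for this string.
 |     print(tok.tokenize("12^2 + 14.173228^2 = c^2"))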
| ec109685 wrote:
| But it supports arbitrary precision or at least a whole bunch
| of precision:
|
| https://news.ycombinator.com/threads?id=ec109685#33944516
| badrabbit wrote:
 | If I were, let's say, China, what would stop me from replicating
 | GPT?
| randyrand wrote:
 | Very little! Just competent engineers and lots of GPUs.
| mysterydip wrote:
 | If one were to make a Markov chain with the same amount of input
| data, would the result be the same? Markov chain chatbots have
| been a thing for years, just on a much more limited set of data.
| igorkraw wrote:
| It _is_ a Markov chain. Your input context is your state.
| n2d4 wrote:
 | No. The number of possible states is 50257^2048 (vocabulary size
 | to the power of context length). The vast majority of those
 | states have never been seen and never will be.
|
 | For example, if your training set consists of the words rain
| and thunder used interchangeably a lot, but the word "today" is
| only used once in the sentence "there is no rain today", then a
| Markov chain based on the data would never output "there is no
| thunder today", but a transformer might.
|
 | In other words, information compression (e.g. equating rain with
 | thunder) isn't just for practicality; it's a necessary
 | requirement for (the current generation of) good language
 | models.
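 |
 | To make the contrast concrete, here is a toy word-level bigram
 | Markov chain (my own sketch, not from the article). It can only
 | ever emit transitions that literally appeared in its training
 | text:
 |
 |     import random
 |     from collections import defaultdict
 |
 |     corpus = ("there is no rain today . "
 |               "thunder and rain . rain and thunder").split()
 |     table = defaultdict(list)
 |     for prev, nxt in zip(corpus, corpus[1:]):
 |         table[prev].append(nxt)   # record observed transitions only
 |
 |     word, out = "there", ["there"]
 |     for _ in range(6):
 |         word = random.choice(table[word])
 |         out.append(word)
 |     print(" ".join(out))  # never says "no thunder": that bigram was never seen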
| mysterydip wrote:
| Ah, that's what I was missing. Thanks!
| wklm wrote:
 | Very cool write-up; I would love it even more if the author
 | included references.
| mudrockbestgirl wrote:
| That's a great summary, but it's important to understand that
| much more goes into training these models. The architecture is
| not any kind of secret sauce, or special in any way. It's just a
| typical Transformer. I call this "architecture porn" - people
| love looking at neural net architectures and think that's the key
 | to success. If only you knew the algorithm! It's so simple!
|
| But reality is usually much messier. The real training code will
| be littered with hundreds of ugly little tricks to make it work.
| A large part of it will be input preprocessing and data
| engineering, tricks to deal with exploding/vanishing gradients,
| monitoring, learning rate schedules and optimizer cycling,
| complexity for distributed training, regularization tricks,
| changing parts of the architecture for performance reasons (like
| attention), and so on.
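 |
 | To give a flavour of just one of those pieces: GPT-style training
 | runs typically use a linear learning-rate warmup followed by a
 | cosine decay (sketch below; the specific numbers are made up for
 | illustration):
 |
 |     import math
 |
 |     def lr_at(step, warmup=2000, total=300_000, peak=6e-4, floor=6e-5):
 |         # Linear warmup to the peak LR, then cosine decay to a floor.
 |         if step < warmup:
 |             return peak * step / warmup
 |         progress = (step - warmup) / (total - warmup)
 |         return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))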
| quonn wrote:
| Don't know. Karpathy has a very compact implementation of GPT
| [0] using standard technology (could be even more compact but
| is reimplementing for example the attention layer for teaching
| purposes) and while he presumably has no access to how the real
| model was trained exactly, if there would be more to it I think
| he would be the kind of person to point it out.
|
| [0] https://github.com/karpathy/minGPT/tree/master/mingpt
| aqme28 wrote:
| I used to work in data engineering for ML and yes, I'd say 90%
| of our technical expertise on both the science and engineering
| side went into designing the datasets.
| sebzim4500 wrote:
| It feels like this is less true for GPT though, especially as
| OpenAI seems to be adopting a 'kitchen sink' approach.
| mike_hearn wrote:
| Just getting plain text out of the web without getting
| flooded with boilerplate, noise, SEO spam, duplication,
| infinity pages like calendars etc is already a hard data
| engineering problem.
| sebzim4500 wrote:
| I'm far from an expert in this field, but based on my
| conversations with people who are I think this is getting less
| true. Normally these models are trained with straightforward
| optimizers (basically naive SGD) since advances like batch
| normalization and residual connections make the more fancy
| stuff unnecessary. I think the learning rate schedules used for
| these big networks tend to be simple as well, just two or three
| steps.
| [deleted]
| andreyk wrote:
| I work in this field (PhD candidate), and what you say is
 | true for smaller models, but not GPT-3-scale models. Training
 | large-scale models involves a lot more, as the OP said. It's
| not just learning rate schedulers, it's a whole bunch of
| stuff.
|
 | See this logbook from training the GPT-3-sized OPT model:
 | https://github.com/facebookresearch/metaseq/blob/main/projec...
| lucidrains wrote:
 | It is neither as simple as the person you are responding to
 | makes it out to be, nor as complicated as you make it seem.
 | It will only get simpler with time.
| [deleted]
| [deleted]
| marstall wrote:
 | So creating each new rev of GPT-3 would involve going
| through something like all those messy steps in that
| logbook?
| lossolo wrote:
 | Seems like the majority of problems in this log are devops
 | problems, which seem to be a combination of ML people doing
 | devops work without having devops experience, and a really
 | bad cloud vendor. I've been running multiple bare-metal
 | nodes with 8 GPUs each, running 24/7 for months at almost
 | 100% utilization, and had 100x fewer problems than they
 | had.
| [deleted]
| [deleted]
| WanderPanda wrote:
 | I've recently come to the conclusion that the magic of fully
 | connected neural networks is that there are almost no tricks
 | needed to get close to SOTA. Dense layers + ReLU + Adam = it
 | just works.
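 |
 | A minimal sketch of the kind of no-tricks baseline I mean
 | (PyTorch; toy data, sizes picked arbitrarily):
 |
 |     import torch
 |     import torch.nn as nn
 |
 |     # Dense layers + ReLU, trained with Adam, nothing else.
 |     model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
 |                           nn.Linear(64, 64), nn.ReLU(),
 |                           nn.Linear(64, 1))
 |     opt = torch.optim.Adam(model.parameters(), lr=1e-3)
 |     x, y = torch.randn(256, 10), torch.randn(256, 1)
 |     for _ in range(200):
 |         opt.zero_grad()
 |         loss = nn.functional.mse_loss(model(x), y)
 |         loss.backward()
 |         opt.step()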
| blackbear_ wrote:
 | Sorry, but this is just wrong: using only fully connected
 | layers would result in pretty bad performance on images,
 | text, audio, etc., or at the very least require much more
 | data to perform well. If you at least use the right type of
 | architecture for each data modality, then I agree that the
 | basic version won't perform much worse than SOTA in the real
 | world.
| WanderPanda wrote:
| Maybe I wasn't clear enough but of course I'm not implying
| that you can reach sota on image classification with fcnns.
| There are many problems where the input space is not as
| noisy, redundant and structure bearing as with images.
| jerrygenser wrote:
 | I think part of the parent comment is wrong but part is correct.
|
| There are many rules of thumb that took the last 5+ years
 | to discover but are now quite standard. You are nitpicking
 | on "fully connected", but if we add dropout, weight
 | initialization, and adaptive learning rate to what they
 | said, then we are fairly close to being able to at least get
 | a deep architecture to overfit a toy dataset and be off to
 | the races when applying it to a larger dataset.
| eternalban wrote:
| The smart money should be on research on current
| shortcomings that will become deal breakers when AI is
| fully pervasive in society. For example, addressing
| catastrophic forgetting seems to me to be a very
| profitable research aim.
| [deleted]
| cuuupid wrote:
 | A note on the sinusoidal encoding: the reason it's used is,
 | generally speaking, twofold:
|
| 1 - To encode position somehow (which author details)
|
| 2 - Because sine is easy "noise" for the network to learn.
|
 | There are also a bunch of cool tricks here, even down to the
 | PyTorch implementation, to optimize this encoding by exploiting
 | the nature of sine/cosine, which is an added reason for its
 | popularity in Transformer architectures. If you like math, I
 | recommend diving into it; it's quick but fun!
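 |
 | For reference, the standard sinusoidal encoding from "Attention
 | Is All You Need" is only a few lines (NumPy sketch, my own
 | illustration; assumes an even model width):
 |
 |     import numpy as np
 |
 |     def positional_encoding(seq_len, d_model):
 |         pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
 |         i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
 |         angles = pos / np.power(10000.0, i / d_model)
 |         pe = np.zeros((seq_len, d_model))
 |         pe[:, 0::2] = np.sin(angles)             # even dimensions
 |         pe[:, 1::2] = np.cos(angles)             # odd dimensions
 |         return pe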
|
 | (Side note: it's also falling out of fashion in favour of other
 | encoding methods; e.g. rotary positional encoding is vastly
 | popular in the RoFormer branch of transformers.)
| WanderPanda wrote:
| Is treating the position embeddings as trainable weights
| already out of fashion again?
| alexmolas wrote:
 | According to the "Attention Is All You Need" paper, there weren't
 | huge differences between using sin/cos and trainable weights.
 | But the paper is a bit old, so I don't know what the current
 | SOTA is regarding positional embeddings.
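 |
 | For what it's worth, GPT-2 itself uses learned position
 | embeddings. Concretely, the trainable variant is just an extra
 | embedding table indexed by position (PyTorch sketch; GPT-2-ish
 | sizes, purely illustrative):
 |
 |     import torch
 |     import torch.nn as nn
 |
 |     max_len, d_model, vocab = 1024, 768, 50257
 |     wte = nn.Embedding(vocab, d_model)     # token embeddings
 |     wpe = nn.Embedding(max_len, d_model)   # learned position embeddings
 |
 |     tokens = torch.randint(0, vocab, (1, 16))    # (batch, seq)
 |     positions = torch.arange(16).unsqueeze(0)    # (1, seq)
 |     x = wte(tokens) + wpe(positions)   # input to the transformer blocks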
| [deleted]
| yunwal wrote:
| I'm new to this stuff, but as I understand it, the
| "Attention is all you need" paper stated that training the
| positional encoding weights didn't improve results for
| language models specifically, but other papers found that
| vision transformers performed better with trainable
| weights.
___________________________________________________________________
(page generated 2022-12-11 23:01 UTC)