[HN Gopher] MinGPT: Minimal PyTorch re-implementation of GPT
___________________________________________________________________
MinGPT: Minimal PyTorch re-implementation of GPT
Author : memorable
Score : 195 points
Date : 2022-09-06 12:14 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| karpathy wrote:
| Hah, funny to see this on HN; it is a relatively old project, but
| one that I continue to love and still work on. I was trying to
| train a GPT one day and discovered that the available
| implementations were quite complex, spread across many files, and
| took way too many kwarg switches for esoteric/rare options that
| just bloated and complicated the code. But in my head a GPT was a
| super simple, neat, isotropic model, so I got all worked up and
| wrote minGPT.
|
| The project went on to have more impact than I originally
| imagined and made its way into a number of projects and papers.
| One of those I found only a few days ago here:
| https://twitter.com/karpathy/status/1566100736076697600 . What I
| love about these projects is that the authors often "hack up"
| minGPT in code directly. They don't configure a comprehensive
| kwarg monster. I think there's a beauty in that. Very often I
| wish we had more gists and fewer frameworks - to look at code
| chunks, understand them completely, tune them to our needs, and
| re-use them in projects, similar to how bacteria trade little DNA
| plasmids. minGPT is written for those who want that for their GPT
| projects. There are plenty of cons to this approach too;
| ultimately I think there's value in both approaches.
|
| Coming up, the theme of future minGPT development: more examples
| and more teeth - it should be possible to demonstrate the
| training of relatively serious (~few B parameter) models with
| minGPT on a single multi-GPU node and reproduce some benchmarks
| around that scale, but never sacrifice its readability.
| ghub-mmulet wrote:
| Thanks for making it! There is immense value in something you
| can just dive into and hack on. I've been hacking on Stable
| Diffusion/latent diffusion these past couple of weeks, and you
| don't know how much time it would have saved me if it had
| something similar!
| darawk wrote:
| For anyone else who is new to the phrase "isotropic model":
|
| https://github.com/christianversloot/machine-learning-articl...
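
  [A rough illustration of what "isotropic" means here: the body of
  the network is the same block, at the same width, repeated end to
  end. A minimal PyTorch sketch; the names are illustrative rather
  than minGPT's actual classes, and the causal mask is omitted for
  brevity.]

    import torch.nn as nn

    class Block(nn.Module):
        # one transformer block: self-attention followed by an MLP
        def __init__(self, d_model: int, n_head: int):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_head,
                                              batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, need_weights=False)
            x = x + a
            x = x + self.mlp(self.ln2(x))
            return x

    # "isotropic": the trunk is just the same block repeated, with
    # the same width everywhere - no per-layer shape changes, no
    # encoder/decoder asymmetry
    trunk = nn.Sequential(*[Block(d_model=768, n_head=12)
                            for _ in range(12)])
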
| albertzeyer wrote:
| This works for an architecture which has been well tuned and
| studied before, like LSTM or Transformer.
|
| Once you do research on the model and start testing things out,
| it often tends to become such a kwarg monster in many frameworks.
|
| Having everything (relevant) in one file (even in the config file
| itself, with the hyperparameters) allows you to copy the file for
| every experiment and modify it in place. This avoids the kwargs
| mess. But then the config files are very complex and can become
| messy in other ways (especially for research projects).
| Example: https://github.com/rwth-i6/returnn-
| experiments/blob/master/2...
|
| Such an approach makes it much more flexible and does not mess
| with the baseline code. As you say, it's more like an
| evolutionary, DNA-like approach, where you then tend to do
| crossovers with other evolved, well-performing configs, etc.
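
  [A hypothetical sketch of the single-file, copy-per-experiment
  pattern described above - not taken from returnn-experiments, and
  all names and numbers are made up. Hyperparameters and model
  construction live together in one file; a new experiment is a copy
  of that file, edited in place.]

    # exp_baseline.py - a self-contained experiment file; to try
    # something new, copy it (e.g. to exp_wider_model.py) and edit
    # the copy in place instead of threading another kwarg through
    # shared framework code
    import torch.nn as nn

    n_layer = 4
    d_model = 256
    n_head = 8
    learning_rate = 3e-4
    max_iters = 10_000

    def build_model() -> nn.Module:
        # a copied file can change this construction freely without
        # touching the baseline
        layer = nn.TransformerEncoderLayer(d_model, n_head,
                                           batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=n_layer)

    if __name__ == "__main__":
        model = build_model()
        print(sum(p.numel() for p in model.parameters()), "parameters")

  [The "crossovers" mentioned above then amount to copying lines
  between two such experiment files.]
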
| jphoward wrote:
| I completely agree! I personally find that these powerful new
| network releases border on the depressing, in that they aren't
| really network releases but huge training systems of dispersed
| YAMLs. YOLOv4 was a case in point: I was too overwhelmed to try
| to integrate it into a project I was working on.
|
| PS: you are a hero of mine - I'm an academic medical doctor for
| whom CS231n was my first foray into AI, and since then I've gone
| on to win gold medals in a couple of Kaggle competitions and
| secure 5 years of higher research funding to pursue clinical AI.
| I am immensely grateful to you and Fei-Fei Li.
| Siira wrote:
| Are there any similarly structured projects around?
| HeckFeck wrote:
| Here I was, thinking someone had recreated the GUID Partition
| Table in some form of MicroPython. Perhaps someday.
| rexreed wrote:
| With enough training data and enough GPUs to do the model
| training, you'll be there! Goes to show that for AI, the code
| really isn't the important part. AI is and always has been about
| data and compute.
| s_Hogg wrote:
| Karpathy really seems to have discovered there are a lot of
| hours in the day now that he doesn't work for Tesla.
| liuliu wrote:
| Not only him. The tech boom of the past decade made a lot of
| great programmers rich, and it is a good thing. Look also at how
| Aras Pranckevičius (of Unity fame) is now contributing to
| Blender. (Also, to some extent, Rui (of mold fame) and Raph
| Levien (of xi editor fame), although I'm not certain about their
| financial standing.)
| ShamelessC wrote:
| This implementation is quite old now actually - although I
| agree, it certainly seems that way otherwise :)
| jstx1 wrote:
| He was doing this kind of stuff while he was at Tesla too -
| https://github.com/karpathy/cryptos
| horseRad wrote:
| Pretty sure he wrote this while working at Tesla also
| mark_l_watson wrote:
| Nice! I remember studying Karpathy's character RNN code way
| back; it was a great study resource. Looking forward to
| understanding this example as well!
| karpathy wrote:
| I am working on a video lecture series that will step through
| it and "spell it out". Without it even this code can be a bit
| opaque for someone who is new to the field and e.g.
| uncomfortable with n-dimensional array manipulations or the
| surrounding language modeling concepts.
| derac wrote:
| I love your approach and philosophy around programming. If anyone
| is unaware, Karpathy has a relatively small YouTube channel he
| started a few weeks ago: https://youtu.be/VMj-3S1tku0
| polygamous_bat wrote:
| This is actually a pretty neat, self-contained implementation
| that can be super easily extended beyond stereotypical natural
| language models, for example to create world models for video
| games [1] or to create robot models that can learn to imitate
| from large, chaotic human demonstration data [2] (disclaimer:
| I'm an author on the second one). Basically, GPT (or minGPT)
| models are EXCELLENT sequence modelers, almost to the point
| where you can throw any sensible sequence data at them and hope
| to get interesting results, as long as you don't overfit.
|
| Even though I have only been working on machine learning for
| around six years, it's crazy to see how the landscape has changed
| so fast so recently, including diffusion models and transformers.
| It's not a stretch to say that we might expect more major
| breakthroughs by the end of this decade, and end up in a place
| we can't even imagine right now!
|
| [1] https://github.com/eloialonso/iris [2]
| https://github.com/notmahi/bet
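
  [A concrete illustration of "throw any sensible sequence data at
  them": for a discrete event stream (game states, robot actions,
  characters) this mostly amounts to mapping events to integer ids
  and serving (input, next-event) pairs. The sketch below follows
  the shape of minGPT's README-style usage; the
  GPT.get_default_config() / Trainer calls are assumptions taken
  from the repo and may differ between versions, and the dataset is
  a toy placeholder.]

    import torch
    from torch.utils.data import Dataset
    from mingpt.model import GPT      # assumes the minGPT repo is importable
    from mingpt.trainer import Trainer

    class EventDataset(Dataset):
        # any discrete sequence mapped to integer ids works as "language"
        def __init__(self, ids, block_size):
            self.ids = torch.as_tensor(ids, dtype=torch.long)
            self.block_size = block_size

        def __len__(self):
            return len(self.ids) - self.block_size

        def __getitem__(self, i):
            chunk = self.ids[i : i + self.block_size + 1]
            # predict the next event at every position
            return chunk[:-1], chunk[1:]

    # illustrative values: vocab_size is the number of distinct
    # event ids, block_size is the context length
    model_config = GPT.get_default_config()
    model_config.model_type = 'gpt-nano'
    model_config.vocab_size = 512
    model_config.block_size = 128
    model = GPT(model_config)

    train_config = Trainer.get_default_config()
    train_config.max_iters = 2000
    train_dataset = EventDataset(list(range(512)) * 100, block_size=128)
    Trainer(train_config, model, train_dataset).run()
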
| a-dub wrote:
| > Even though I have only been working on machine learning for
| around six years, it's crazy to see how the landscape has
| changed so fast so recently, including diffusion models and
| transformers.
|
| It's pretty wild considering that hidden Markov models were
| considered state of the art not all that long ago.
| visarga wrote:
| Some people demean GPT-3 by saying it's just a Markov model.
| dang wrote:
| Related:
|
| _Karpathy's MinGPT_ -
| https://news.ycombinator.com/item?id=24189497 - Aug 2020 (102
| comments)
___________________________________________________________________
(page generated 2022-09-06 23:00 UTC)