[HN Gopher] Replicating GPT-2 at Home
___________________________________________________________________
Replicating GPT-2 at Home
Author : bkkaggle
Score : 174 points
Date : 2021-01-23 16:52 UTC (6 hours ago)
(HTM) web link (bkkaggle.github.io)
(TXT) w3m dump (bkkaggle.github.io)
| kyberias wrote:
| How many off-the-shelf GPUs are needed to replicate GPT-2 in a
| year?
| minimaxir wrote:
| With current improvements to training performance and
| parallelism (e.g. DeepSpeed: https://www.deepspeed.ai ) it
| wouldn't surprise me if creating GPT-2 small from scratch
| becomes possible with a couple 3080s in _days_ , with GPT-2 XL
| not taking 10x longer.
| moyix wrote:
| I agree. I've been training on 2x3090s connected via NVLink
| and they're _really_ fast for training language models. I am
| actually tempted to try to replicate the OP's GPT-2
| replication using Huggingface, DeepSpeed, and OpenWebText,
| but the GPUs are occupied right now training a GPT2-774M C
| language model...
| Jack000 wrote:
| Does NVLink actually help? It's mostly useful for
| transferring data between GPUs, so I assume you're using
| pipeline parallelism or something similar?
| natch wrote:
| What software stack are you using to get your 3090s
| working? Any hitches along the way?
| moyix wrote:
| Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use
| PyTorch; Tensorflow has some nice optimizations (like
| XLA, which uses LLVM to JIT optimized code for the GPU),
| but I found it very painful to get working reliably, and
| most of the language modeling stuff I've seen uses
| PyTorch.
|
| For the language model training itself I've been
| experimenting with a few different things. I started off
| with Huggingface because it's very easy to get up and
| running, and I still use its tokenizers library to do BPE
| training on the C source dataset (though there are still
| some hitches there - other libraries expect slightly
| different formats for the tokenizer model, like using
| different ways to represent the <|endoftext|> marker).
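|
| For reference, that BPE training step is only a few lines
| with the tokenizers library; this is an illustrative sketch
| rather than my exact code, and the file paths are made up:
|
|     from tokenizers import ByteLevelBPETokenizer
|
|     tokenizer = ByteLevelBPETokenizer()
|     # train a byte-level BPE vocab on the C source corpus,
|     # keeping GPT-2's vocab size and end-of-text marker
|     tokenizer.train(files=["c_corpus.txt"],
|                     vocab_size=50257,
|                     special_tokens=["<|endoftext|>"])
|     # writes vocab.json and merges.txt to the directory
|     tokenizer.save_model("tokenizer_out")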
|
| After prototyping the C language model training at home,
| I tried moving the training up to NYU's HPC cluster,
| which has a bunch of 4xV100 and 4xRTX8000 nodes (mainly
| because the sound of two powerful GPU fans running at
| 100% gets a bit old after a while). Unfortunately I
| discovered that with larger models the GPU-GPU
| communication overhead can be prohibitive (most of the
| cluster nodes only support P2P GPU communication over
| PCIe, which is a _lot_ slower than NVLink), and
| Huggingface's implementation actually performed _worse_
| on multiple GPUs than on two 3090s with NVLink (I opened
| an issue to track it here
| https://github.com/huggingface/transformers/issues/9371
| ).
|
| Currently I'm working on getting DeepSpeed running so
| that I can hopefully get better scaling even in the
| absence of a fast GPU-GPU interconnect. This is again a
| little bit annoying, because it seems like every
| framework wants a slightly different way of representing
| the tokenizer and training data - I've had to preprocess
| the dataset in about 4 different ways (plain text, loose
| JSON, npy (for DeepSpeed), and a custom indexed binary
| format for Megatron-LM). I'm also hoping to try out
| Huggingface's recently-released DeepSpeed integration,
| which (if it works) would be a really nice combination of
| usability and performance:
| https://huggingface.co/blog/zero-deepspeed-fairscale
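|
| If it works as advertised, the integration itself should be
| pleasantly small; roughly along these lines, where the
| dataset setup is simplified and the file names (train.txt,
| ds_config.json) are placeholders rather than my actual setup:
|
|     from transformers import (GPT2Config, GPT2LMHeadModel,
|         GPT2TokenizerFast, TextDataset,
|         DataCollatorForLanguageModeling, Trainer,
|         TrainingArguments)
|
|     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
|     # a fresh, randomly initialized GPT-2 small
|     model = GPT2LMHeadModel(GPT2Config())
|     dataset = TextDataset(tokenizer=tokenizer,
|                           file_path="train.txt",
|                           block_size=512)
|     collator = DataCollatorForLanguageModeling(
|         tokenizer=tokenizer, mlm=False)
|     args = TrainingArguments(
|         output_dir="out",
|         per_device_train_batch_size=4,
|         fp16=True,
|         deepspeed="ds_config.json",  # ZeRO settings live here
|     )
|     Trainer(model=model, args=args, data_collator=collator,
|             train_dataset=dataset).train()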
|
| As for other software stack hitches: so, so many. The
| main one is just managing the different versions of CUDA.
| The 3090 is only supported starting with CUDA 11.1, but
| many packages and frameworks only support 11.0 at best.
| And some of the newer things like DeepSpeed use PyTorch
| extensions, which require you to have the exact version
| of CUDA around that was used to build PyTorch. So I've
| had to do a fair bit of compiling packages from source
| rather than relying on prebuilt packages.
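|
| A sanity check that helps before compiling anything: print
| the CUDA version PyTorch was actually built against and
| compare it with the system toolkit (nvcc --version). A
| trivial sketch:
|
|     import torch
|
|     print(torch.__version__)   # e.g. something like 1.7.1+cu110
|     print(torch.version.cuda)  # CUDA version torch was built with
|     print(torch.cuda.is_available(),
|           torch.cuda.get_device_name(0))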
|
| The path of least resistance here is probably to use the
| NVIDIA NGC containers, but it took NVIDIA more than a
| month to get them updated after the 3090 was released,
| and I find working inside containers for everything
| inconvenient anyway (I hate losing my bash history, and I
| always accidentally end up losing data or local changes
| when I exit a container).
|
| Anyway, this ended up being a bit more rambling than I
| intended, but it was helpful to write it all down and
| maybe it'll help someone else avoid some stumbling blocks
| :)
| minimaxir wrote:
| As someone who maintains a package to both make it easy to fine-
| tune GPT-2 or create your own from scratch
| (https://github.com/minimaxir/aitextgen), this submission is a
| good run-through of the technical considerations toward building
| a GPT-2 model.
|
| It's both substantially easier and faster than it was when OpenAI
| released their paper in 2019, thanks both to Huggingface
| Transformers and Tokenizers making the architectures more
| efficient and to other companies streamlining every part of the
| training pipeline.
|
| You don't need a TPU cluster to train a working GPT-2 model,
| although it helps (unfortunately, TPU support for PyTorch-based
| training like aitextgen is more fussy). A free GPU on Colab gets
| you most of the way, especially since you can now get a T4 or a
| V100, which lets you use FP16.
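|
| For the curious, getting started is meant to be only a couple
| of lines, roughly like this (an illustrative sketch; the README
| has the canonical examples):
|
|     from aitextgen import aitextgen
|
|     ai = aitextgen()  # downloads the 124M GPT-2 by default
|     ai.generate(prompt="GPT-2 at home means")
|     # fine-tuning on your own text file is a single call,
|     # e.g. ai.train("corpus.txt")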
| bkkaggle wrote:
| Yep, I started off trying to get it to work with PyTorch
| (https://github.com/bkkaggle/lm-training-research-
| project/blo...), then with pytorch-lightning, but the whole
| one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months
| ago made me switch over to TF.
| punnerud wrote:
| Just as Google wants you to do. Within 3-5 years you will
| probably see a steep price increase and nowhere else to go.
| bkkaggle wrote:
| Heh. I've been using JAX for a couple of months and it's
| been a pretty nice replacement for both PT and TF. It feels
| like what an ML framework would look like if it were built
| around easy scaling and dev friendliness.
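|
| The scaling story is mostly jax.pmap: you write the
| per-device step once and it runs across all local TPU cores
| or GPUs. A toy sketch (not from my training code):
|
|     import jax
|     import jax.numpy as jnp
|
|     @jax.pmap  # replicate across all local devices
|     def step(x):
|         return x * 2.0
|
|     n = jax.local_device_count()
|     # the leading axis of the input is the device axis
|     out = step(jnp.arange(n * 4.0).reshape(n, 4))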
| bravura wrote:
| What do you think would be necessary to generate rhyming text
| with a particular phrasing / rhythm?
|
| e.g. in the style of a particular rapper?
|
| If you just fine-tune on a corpus of their lyrics, you might
| miss the underlying poetic constraints.
|
| If there were an additional prior (a "poetry / assonance /
| rhyme" model), what is the easiest way to constrain generation
| to respect this prior?
|
| Thanks!
| drusepth wrote:
| I wrote "Stylistic Rhyme-bound Poetry Generation or: How You
| Too Can Generate Sonnets in the Style of Kanye West" [1] back
| in 2017 for an easy DIY introduction to this topic. You
| specify the rhyming scheme (ABAB CDCD etc) and it forces end-
| line rhymes around it.
|
| It uses Markov chains instead of GPT-2, but the approach
| should work with prompt-based things like GPT-2 also: for
| lines that are "free" (e.g. no specific word you need to
| rhyme with), you can generate the line normally -- but for
| lines you need to rhyme with a specific word, you can just
| generate last-word-first and generate backwards. For a
| strictly LTR prompt like GPT-2, you could probably just
| reverse your corpus word order, generate "reverse" lines with
| GPT-2 given the previous line + word you need to rhyme with
| as the prompt, and then reverse it back to "normal" in
| postprocessing.
|
| [1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-
| genera...
|
| Some examples of the output of this approach:
|
| [2] https://medium.com/words-of-mimicry/kanye-west-
| ballade-1-a6f...
|
| [3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-
| slow...
|
| I'd expect the output to be better with something like
| GPT-2/3, since Markov chains are so twentieth-century, but I
| was pretty happy with the output quality even though it often
| rhymed the same word repeatedly; you could improve it by
| down-weighting previously-used words, removing them from the
| pool of rhyming words, and/or backtracking to previous lines
| when you find yourself without other words to rhyme.
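|
| The reverse-generation trick is simple enough to sketch; here
| "generate" stands in for whatever model call you're using
| (it's a hypothetical helper, not a real API):
|
|     def reverse_words(line):
|         # "cat in the hat" -> "hat the in cat"
|         return " ".join(line.split()[::-1])
|
|     # 1. train/prompt the model on a word-reversed corpus
|     reversed_corpus = [reverse_words(l)
|                        for l in open("lyrics.txt")]
|
|     # 2. generate a line that *starts* with the rhyme word,
|     #    then flip it back to normal reading order
|     def finish_rhyming_line(generate, rhyme_word):
|         reversed_line = generate(prompt=rhyme_word)
|         return reverse_words(reversed_line)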
| minimaxir wrote:
| A paper was recently released for that particular use case
| (https://github.com/markriedl/weirdai), in which it describes
| a number of technical caveats (and it's technically not using
| GPT-2).
|
| I do think it's possible to train a GPT-2-esque network to do
| something similar, albeit with some text encoding
| shenanigans.
| FL33TW00D wrote:
| As far as I know to get a V100 you need Colab Pro? Did this
| change recently?
| minimaxir wrote:
| It's unclear. I've heard of people getting a V100 without
| Colab Pro, although I do use Colab Pro and get a V100 almost
| every time.
|
| As an aside, if you do get a V100, Colab Pro is by far the
| cheapest way to train an AI model ($10/mo is much, much
| cheaper than the normal $2.48+/hr on GCP!), although you need
| to sync checkpoints to external storage in case the notebook
| dies.
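|
| The lazy way to do the checkpoint syncing is to mount Google
| Drive in the notebook and copy checkpoints out as you go (a
| sketch; the checkpoint path is a placeholder):
|
|     import shutil
|     from google.colab import drive
|
|     drive.mount("/content/drive")
|     # copy the latest checkpoint somewhere that survives the VM
|     shutil.copytree("checkpoint-5000",
|                     "/content/drive/MyDrive/checkpoint-5000")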
| fpgaminer wrote:
| > As an aside, if you do get a V100, Colab Pro is by far
| the cheapest way to train an AI model.
|
| But others should be aware that you get what you pay for.
| Google still rate limited me when I used Colab Pro, and I
| ran into a myriad of other small problems. If that's all
| one is willing to spend to play with AI, 100% go for it.
| It's a great place to start. But if you're at all serious
| and can afford it, I think a local machine with a modest
| GPU is worth every penny.
| nsomaru wrote:
| Curious: is it better to train locally on something like
| a 2080 Ti 11GB, or go for Colab and offload checkpoints to
| S3?
|
| Asking because it seems the V100's performance (or that of
| the other paid Colab GPU) is worth the occasional
| instability if you've set up checkpointing.
| byefruit wrote:
| Alas, only if you live in the US.
|
| Colab Pro isn't available outside the US (without breaking
| Google's terms).
| polytronic wrote:
| The author, at 17 years of age, can understand academic papers
| and research, and has the skills and dedication to go through
| the exercise of reconstructing the state of the art.
|
| I can't help but feel pride and hope for the future, both the
| author's and the world's.
| fpgaminer wrote:
| I was watching an ICML presentation and was surprised by the
| presenter's (not the OP, a different AI prodigy) apparent age.
| It turns out he was 17 and a second-year PhD student. I think
| he graduated from UC Davis when he was 14 or something.
|
| Some people roll wicked real life DnD character sheets, that's
| for sure.
| deeviant wrote:
| At home, in the cloud, for tens of thousands of $$$.
| dane-pgp wrote:
| "Mom, can I have a GPT-2?"
|
| "No, we have GPT-2 at home."
|
| GPT-2 at home: [Outputs this comment]
| zirkonit wrote:
| First off -- the author has written an amazing tutorial, and
| it's very enjoyable, so I am by no means throwing shade.
|
| But a week of TPUv3-128 is anywhere between $10k and $20k in TPU
| costs alone; saying that this is an "at home" kind of experiment
| is cheeky at best, clickbait at worst.
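|
| Back-of-the-envelope, assuming roughly on-demand v3-8 rates
| ($8/hr) scaled linearly to the 16x larger slice, which actual
| pod pricing and preemptible quotas won't match exactly:
|
|     # 16 v3-8-equivalents * $8/hr * 24 hr * 7 days
|     print(16 * 8 * 24 * 7)  # => 21504, i.e. ~$20k for a week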
| nabla9 wrote:
| Many hobbies cost $10k-$20k. If you work in engineering, that's
| not far from what other "at home" hobbies cost.
|
| The time that went into this project was almost certainly worth
| more than $10k.
| 6gvONxR4sf7o wrote:
| I imagine you're speaking about the cost of, e.g., setting up
| a wood shop in your garage, rather than the cost of making
| something in said wood shop. Training this seems more like
| the latter, but at the price of the former.
| nabla9 wrote:
| If you train this model and then use it to do other
| interesting things, training big models is like setting
| up a wood shop.
| fpgaminer wrote:
| You can download a pretrained, full-size GPT-2 for $0.
| Training it from scratch would be merely for fun. If you
| have a specific application, you can fine-tune the model
| for far, far less ($0-$10).
|
| It's not comparable to a hobby. It's comparable to paying
| $10k to make a sandwich.
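|
| The $0 route is literally a couple of lines with transformers
| (a sketch; "gpt2-xl" is the full 1.5B-parameter release):
|
|     from transformers import (GPT2LMHeadModel,
|                               GPT2TokenizerFast)
|
|     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
|     model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
|
|     inputs = tokenizer("We have GPT-2 at home:",
|                        return_tensors="pt")
|     out = model.generate(**inputs, max_length=40)
|     print(tokenizer.decode(out[0]))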
| Closi wrote:
| If your hobby is building wood furniture, a wood shop
| helps you pursue that hobby into the future. It will
| improve your projects and your enjoyment of the hobby,
| and the tools also hold some residual value.
|
| If your hobby is building AI/ML models, a one-shot
| trained model isn't going to really help you on an
| ongoing basis. It's an amazing single shot project, but
| if your hobby is actually ML then you probably aren't
| going to be happy just looking at your completed trained
| model - you are going to want to train a bigger, better
| model.
|
| And if your hobby is building software, you can just
| download a pre-trained model for free.
|
| I don't think the analogy holds the other way.
| bkkaggle wrote:
| Hi, I love that you enjoyed it!
|
| Yeah, I totally get your point about the title -- the TPU quota
| that I got was roughly equivalent to $20k -- but in my defense,
| I don't have access to any compute beyond what I get through
| the TFRC or through Google Colab.
| superasn wrote:
| Yes it's an amazing tutorial. Thank you.
|
| Speaking as a hobbyist: it used to be that, with enough
| determination, you could create just about any software if you
| kept hacking at it long enough. CPU power or cost was generally
| not the issue; your time and tenacity were.
|
| This has unfortunately changed, and innovation in software
| (especially ML) is now largely about how deep your pockets
| are.
| phreeza wrote:
| I think this is quite a rose-colored view of the past.
| Rendering with many graphics techniques was out of reach
| for hobbyists for a long time, for example.
| imaginenore wrote:
| The point is, an average IT professional in the US can easily
| afford it.
___________________________________________________________________
(page generated 2021-01-23 23:00 UTC)