[HN Gopher] Reproducing GPT-2 in llm.c
       ___________________________________________________________________
        
       Reproducing GPT-2 in llm.c
        
       Author : tosh
       Score  : 587 points
       Date   : 2024-05-28 15:58 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | indigodaddy wrote:
        | Looks like this is re: training, but I wonder how inference
        | on this model would go on some garbage older machine with no
        | GPU?
        
         | ryankrage77 wrote:
          | Last time I tried GPT-2 on CPU (which I think was shortly
          | before ChatGPT was launched), I was getting about 0.2
          | tokens/sec. CPU utilization was low though, so running
          | inference in parallel gave better results. I was using 2 x
          | E5-2660s.
        
           | int_19h wrote:
           | DDR5 helps a lot. You can actually run stuff like LLaMA at >1
           | tok/s on the CPU with high-end gaming hardware these days.
        
             | doubloon wrote:
              | I have a 24-core Intel CPU, and llama.cpp runs Llama 3
              | surprisingly fast in surprisingly little RAM. Yes, it
              | becomes a space heater, but there's light at the end of
              | the CUDA-free tunnel.
        
       | benterix wrote:
        | I just hope that in a couple of years we'll see a submission here
       | titled "Reproduce GPT-4 on legacy RTX 4090."
       | 
       | Because currently even with open source (?) models we are still
       | consumers, and the training is still the domain of the rich.
        
         | ravetcofx wrote:
          | Accessing the dataset to train from scratch will be the
          | biggest hurdle, now that a lot of the Pile has had the
          | ladder pulled up since GPT-4.
        
           | CamperBob2 wrote:
           | Someone will come along and say "Why don't you just mirror
           | Anna's Archive?" in 3...2...1...
        
             | sebzim4500 wrote:
             | I think between Anna's Archive, fineweb and as many github
             | repos as you can scrape you can get a pretty decent
             | dataset.
             | 
             | I doubt Anna's Archive would produce a good model on its
             | own though.
        
           | exe34 wrote:
            | I suppose you wouldn't be able to use it for external
           | services, but internally, I'm sure you can find some books
           | that fell off the back of a truck...
        
             | HeatrayEnjoyer wrote:
             | No reason you can't go external. GPT was trained using
             | ebook torrent sites
        
               | artninja1988 wrote:
               | OpenAI has enough money to hire lawyers to defend it
               | until the end of time though
        
           | meiraleal wrote:
           | I'm okay with paying for datasets
        
             | CamperBob2 wrote:
             | Depends on how the courts rule. If the copyright
             | maximalists prevail, only the wealthiest entities will be
             | able to afford to license a useful data set.
             | 
             | Paradoxically enough, this is the outcome that most "Hacker
             | News" denizens seem to be rooting for.
        
               | meiraleal wrote:
               | I'd still get most of my dataset from torrent but I could
               | pay for specific things like high quality source code.
        
               | groby_b wrote:
               | It's almost as if people believe in fairness and
               | compensating people for their work.
               | 
               | Also, it's worth noting that this is only true as long as
               | we're stuck in the "must train on the entire sum total of
               | human output ever created" local minimum for machine
               | learning. Given that most biological entities learn with
                | much less data, this might well be the thing that
                | prods ML research toward an approach that isn't "IDK,
                | buy a few containers of GPUs and half a DC of storage,
                | and see if that makes things better".
        
               | nwsm wrote:
               | > It's almost as if people believe in fairness and
               | compensating people for their work.
               | 
               | Yet in this case we are talking about compensating the
               | compilers/massagers/owners of the datasets, not the
               | original authors from wherever the data was originally
               | scraped.
        
               | wizzwizz4 wrote:
               | Copyright is hideously broken, but in theory: the owners
               | only own it because they compensate the authors, which
               | they only do out of an expectation of future profit (on
               | average).
               | 
               | That theory's a fantasy, because extractive systems
               | involving gatekeepers get established, but in _this
                | specific case_, enforcing copyright would make things
               | fairer for authors. There's no extractive copyright-
               | taking gatekeeper for websites: scrapers don't get
               | copyright, so can't re-license the material they've
               | scraped (unless it's permissively-licensed or something).
        
           | GaggiX wrote:
            | https://huggingface.co/datasets/HuggingFaceFW/fineweb has
            | 15T tokens of cleaned and deduplicated English web data.
        
             | ravetcofx wrote:
              | Holy crap. Does Hugging Face charge for bandwidth if
              | you're downloading 45 terabytes?
        
               | drexlspivey wrote:
                | I believe they are hosting it on Cloudflare, which
                | doesn't charge for egress.
        
               | fragmede wrote:
               | More specifically, Cloudflare R2 doesn't charge for
                | egress, and Cloudflare doesn't charge for egress to
                | members of the Bandwidth Alliance, which includes
                | Azure, Google Cloud, Oracle, Alibaba Cloud, and
                | others, though critically not AWS.
               | 
               | They very much do charge egress fees elsewhere.
        
               | andersa wrote:
               | Fun trivia: downloading 45TB costs about $60, according
               | to Cloudflare.
        
               | verticalscaler wrote:
               | That's what Cloudflare charges. It costs them around 6
               | cents.
        
               | kazanz wrote:
               | Wish I could say I'm surprised you're getting downvotes.
               | Carrier costs are some of the lowest costs for hosting
               | providers. Yet that fact seems to elude a majority of the
               | community here.
        
               | andersa wrote:
               | That's what they said it costs on their blog, not that
                | they charge that.
                | https://blog.cloudflare.com/aws-egregious-egress
               | 
               | Where are you getting 6 cents from?
        
         | vineyardmike wrote:
          | We won't ever get there, or need to, because GPT-4 wasn't
          | trained on one GPU; it was trained on thousands. The (most
          | likely) biggest meaningful differences between -2 and -4 are
          | the number of parameters and the training data/duration. I
          | don't think you'd really learn much more.
        
           | elicksaur wrote:
           | It's not about learning. It's about owning. Exactly the
           | reason OpenAI stopped being open. Having GPT-4-quality LLMs
           | created by anyone with a gaming PC would be pretty radical.
        
             | vineyardmike wrote:
             | And you won't get there. Those models are far too large for
             | a 2024 GPU. Llama-3 70b is arguably close to GPT-4 but is
             | still too large for gaming GPUs (and probably for many
             | years of GPU updates)
        
               | elicksaur wrote:
               | "You won't get there" is a pretty vast statement for all
               | of the future. Two fairly reasonable predictions: 1) the
               | compute needed to get GPT4 performance will decrease. 2)
               | the compute on consumer GPUs will increase.
               | 
               | At some point they cross, and you will be able to run a
               | GPT4-quality LLM on a consumer GPU. At some point after
               | that, you'll be able to run a GPT4-quality LLM on a 2024
               | consumer GPU if you can find one.
               | 
               | Important to emphasize, I'm not saying "GPT-4". Llama-3
               | was trained on 24k GPU clusters. "Able to do the exact
               | same processing at 1/24k the compute" is different from
               | "Able to get equivalent performance at 1/24k compute".
               | Even then, given a long enough time scale, the former is
               | possible.
        
               | vineyardmike wrote:
               | > 1) the compute needed to get GPT4 performance will
               | decrease. 2) the compute on consumer GPUs will increase.
               | 
               | I'm assuming we're just talking inference here...
               | 
                | Sure, compute abilities for consumers will increase,
                | but the original comment had a fixed GPU: the 4090. I
                | can already eke out Llama 3 8B on my MacBook Air, and
                | Apple will sell you a laptop capable of running the
                | full-sized Llama.
                | 
                | There is a direct correlation between parameters and
                | "knowledge" for an LLM. There are some open questions
                | as to density (Llama 3 specifically challenged
                | previous assumptions), but it seems implausible to fit
                | a model equivalent to GPT-4 into 24 GB of VRAM. Just
                | like compression, you can't shrink forever.
                | 
                | GPT-4 and GPT-2 are pretty similar architecturally (I
                | assume). So if capabilities don't matter, we can
                | already run GPT-2, so we're basically there for 4.
        
         | Invictus0 wrote:
         | I'm not saying this to be rude, but I think you have a deep
         | misunderstanding of how AI training works. You cannot just skip
         | the matrix multiplications necessary to train the model, or get
         | current hardware to do it faster.
        
           | xdavidliu wrote:
            | Was the first sentence really necessary? The second sentence
           | seems fine by itself.
        
           | nickpsecurity wrote:
            | There's work on replacing multiplication. Here are four
            | examples:
           | 
            | https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_...
           | 
           | https://arxiv.org/abs/2012.03458
           | 
            | https://openaccess.thecvf.com/content/CVPR2021W/MAI/papers/E...
           | 
           | https://arxiv.org/pdf/2106.10860
        
           | benterix wrote:
            | No offence taken! As far as my (shallow!) understanding
            | goes, the main challenge is the need for many GPUs with
            | huge amounts of memory, and it still takes ages to train
            | the model. So regarding the use of consumer GPUs, some
            | work has been done already, and I've seen some setups
            | where people combine several of these and are successful.
            | As for the other aspects, maybe at some point we'll
            | distill what is really needed into a smaller but excellent
            | dataset that would give similar results in the final
            | models.
        
         | auspiv wrote:
         | Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I
         | think it'll take far more than a single 4090.
        
           | bufo wrote:
            | The RTX 4090 has about the same BF16 Tensor Core TOPS as
            | the A100. Assuming 50% MFU (like the A100 40 GB PCIe), it
            | would take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so
            | 12 hours. Datasheet here for the TOPS:
            | https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
            | 50% MFU should be achievable on the 4090.
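            | 
            | Back-of-envelope, assuming the ~90 min on 8x A100 figure
            | from the post (a rough sketch, not a benchmark):
            | 
            |     # per-GPU BF16 throughput roughly equal at 50% MFU
            |     a100_box_hours = 1.5   # 124M / 10B tokens, 8x A100
            |     num_gpus = 8
            |     est_4090_hours = a100_box_hours * num_gpus
            |     print(est_4090_hours)  # ~12 hours on one RTX 4090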
        
           | anthonix1 wrote:
            | Nah, I reproduced it on a 4x 7900 XTX machine in 8.75
            | hours, so a single 7900 XTX (costs less than $1k) could do
            | it in under 24 hours. Was hitting 55.4% MFU.
        
         | anthonix1 wrote:
         | FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900
         | XTX machine (less than $4k worth of GPU), using the same
         | settings as in the post (0.5M batch size etc).
        
           | pama wrote:
           | Did you reproduce the evaluation as well?
        
             | anthonix1 wrote:
             | It converges similarly on smaller datasets.
             | 
             | About to kick off a training from scratch run on the same
             | fineweb-10B, which at 324k toks/sec should take about 8.6
             | hours. And with my kWh cost, that is about $2.50 cost to
             | train.
             | 
             | Will report back tomorrow when the training has finished..
        
             | anthonix1 wrote:
             | So... successfully reproduced in ~8.75 hours, taking about
             | 18 kWh / $2.70
             | 
             | The first run actually failed at step 3000 or so, and I
             | realized I had a bug in my attention / matmul kernels, but
             | after fixing that and restarting it worked great
             | 
             | [1] https://github.com/anthonix/llm.c
        
           | Manabu-eo wrote:
            | What percentage of the theoretical FLOPS are you getting
            | with those 7900 XTXs on training?
        
         | sabareesh wrote:
          | Well, here is a comment on the 4090:
         | https://github.com/karpathy/llm.c/discussions/481#discussion...
        
           | huac wrote:
           | 25% MFU :( maybe because of the P2P nerf?
        
             | anthonix1 wrote:
             | Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than
             | $1k and I'm getting 55.4% MFU
        
               | sabareesh wrote:
                | This is not an apples-to-apples comparison, as this is
                | running across GPUs and a much bigger model.
        
             | sabareesh wrote:
              | This is a much bigger model (500M), and P2P is enabled
              | via Mailbox. It is expected because of the
              | memory-to-compute ratio.
        
               | huac wrote:
               | can you elaborate?
        
         | doubloon wrote:
          | Dude, I'm hoping we get rid of Nvidia completely. I can run
          | llama.cpp inference on a 7B model on my 24-core Intel
          | machine using just the CPU, and it only uses about 4 GB of
          | RAM and is not that slow. If we could have massively
          | parallel ARM or even RISC-V machines without the CUDA
          | proprietary-driver hell, it would be much more open source.
          | And much less wonkage for the normie user.
        
       | karpathy wrote:
       | Hi HN the main (more detailed) article is here
       | https://github.com/karpathy/llm.c/discussions/481
       | 
       | Happy to answer questions!
        
         | 1024core wrote:
         | Thank you, from an appreciative reader!
        
         | ngiyabonga wrote:
         | Hi Andrej!
         | 
          | First, thank you for your teaching; it has helped me a lot.
          | I didn't think I'd ever have the chance to say thank you,
          | but here you are, and I hope this gets to you!
         | 
          | Question - what's a relevant (05-2024) baseline to compare
          | the performance of the C code to? Back when you made nanoGPT
          | you were seeing "the file train.py reproduces GPT-2 (124M)
          | on OpenWebText, running on a single 8XA100 40GB node in
          | about 4 days of training". So twice the memory on the C
          | node, but I'm unsure of the data size/epochs and any other
          | details I may be missing. I.e. what's the net uplift of
          | running C vs "legacy" torch code?
         | 
         | Thanks again for everything.
        
           | karpathy wrote:
           | The baseline is definitely PyTorch (or JAX), and indeed
           | something like nanoGPT. I just never got nanoGPT "past the
           | finish line" of really crossing the t's and dotting the i's
           | and reproducing the models with as much care as I did now and
           | here in llm.c, and getting to the point where it's a single
           | launch command that just does the thing.
           | 
           | I think I'll try to develop the `train_gpt2.py` inside llm.c
           | to be that, so that we have the two implementations exactly
           | side by side, and it's all nice and comparable.
           | 
           | The C/CUDA code is currently a little bit faster than PyTorch
           | (last time I measured ~2 weeks ago it was about 6% faster),
           | and I think we can push this further. This is done by
           | manually hard-coding a bunch of fusions/optimizations that
           | are non-trivial for torch.compile to find (e.g. our
           | FusedClassifier). But PyTorch has some pending work/PRs that
           | will also speed up their side a lot.
           | 
           | Ultimately my interest in llm.c is to have a nice, clean,
           | minimal, super dependency-light repo in direct C/CUDA
           | implementation, which I find aesthetically pleasing. And on
           | top of that, educational, i.e. using all of the above as an
           | endpoint of an intro LLM course.
        
             | ilaksh wrote:
             | Just out of curiosity, how do you feel about Tinygrad? They
             | just released 0.9 and are also on the HN home page today.
        
             | raymond_goo wrote:
             | Maybe talk to MasterClass...
        
         | sturza wrote:
         | Do you think grokking leads to proper generalized reasoning?
         | https://arxiv.org/abs/2405.15071
        
           | bilsbie wrote:
           | Any tips on understanding grokking? I'm not following that
           | paper.
        
             | sturza wrote:
              | Grokking: Generalization Beyond Overfitting on Small
              | Algorithmic Datasets. Basically: keep overfitting, be
              | cool about it, and some new generalizing behavior might
              | emerge.
        
         | espadrine wrote:
         | How big of a perf improvement would result from using the
         | architectural tweaks that Llama3 and others have put in place
         | since GPT-2?
        
           | karpathy wrote:
            | My understanding and suspicion is that it's mostly less
            | than you think. The Llama 3 architecture has the following
            | changes relative to GPT-2:
           | 
           | 1. delete the absolute positional encoding and replace with
           | RoPE
           | 
           | 2. delete all biases in all layers (in LayerNorms, they turn
           | into RMSNorm)
           | 
           | 3. GeLU -> SwiGLU non-linearity in the MLP
           | 
           | 4. longer context length
           | 
           | 5. architecture hyperparameter changes, e.g. slightly
           | different aspect ratios
           | 
            | And there was a paper that I can't find the reference to
            | anymore that claimed that if you train long enough, the
            | gap becomes even smaller. Possibly because the absolute
            | positional encoding has enough time to train more fully,
            | whereas the RoPE layer benefits from the "inductive bias"
            | it adds in the earlier stages of training.
            | 
            | But I don't have full confidence in the above claim; maybe
            | someone has tried it or has a better/concrete reference.
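            | 
            | For reference, point 2 amounts to roughly the following (a
            | minimal PyTorch-style sketch, not the exact Llama or llm.c
            | code):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class RMSNorm(nn.Module):
            |         def __init__(self, dim, eps=1e-5):
            |             super().__init__()
            |             self.eps = eps
            |             # scale only, no bias
            |             self.weight = nn.Parameter(torch.ones(dim))
            | 
            |         def forward(self, x):
            |             # normalize by root-mean-square: no mean
            |             # subtraction and no bias term
            |             ms = x.pow(2).mean(-1, keepdim=True)
            |             x_hat = x * torch.rsqrt(ms + self.eps)
            |             return x_hat * self.weight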
        
             | jorlow wrote:
             | Note llama's feed forward is a bit different too:
             | self.w2(F.silu(self.w1(x)) * self.w3(x))
             | 
             | I.e. the nonlinearity is a gate.
             | 
              | https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
        
               | soraki_soladead wrote:
               | Fwiw, that's SwiGLU in #3 above. Swi = Swish = silu. GLU
               | is gated linear unit; the gate construction you describe.
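                | 
                | Spelled out as a rough functional sketch of the same
                | expression (w1/w2/w3 here are plain weight matrices,
                | not the Linear modules from the snippet above):
                | 
                |     import torch
                |     import torch.nn.functional as F
                | 
                |     def swiglu_mlp(x, w1, w2, w3):
                |         # SiLU(x @ w1) is the "Swish" gate,
                |         # elementwise-multiplied with x @ w3 (the
                |         # GLU part); w2 then projects back down
                |         return (F.silu(x @ w1) * (x @ w3)) @ w2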
        
         | lagrange77 wrote:
         | Thank you for the effort you put in your educational work, it
          | helped me and others a lot! In fact, I'm training my nanoGPT
          | version right now. :)
         | 
         | > Ultimately my interest in llm.c is to have a nice, clean,
         | minimal, super dependency-light repo in direct C/CUDA
         | implementation, which I find aesthetically pleasing.
         | 
         | Also, it's awesome that you spend your time on your passion.
         | 
         | Any plans on making a video series on llm.c? :D
        
           | karpathy wrote:
           | Yes definitely. Related tweet of mine:
           | 
           | https://x.com/karpathy/status/1760388761349927356?lang=en
           | 
           | 1. Build the thing
           | 
           | 2. Build the ramp
           | 
            | Currently on step 1 :). It helps to build it first so you
            | know where you are going, and then you can more easily
            | re-build it when your vector is pointed at the end result.
        
             | lagrange77 wrote:
             | That's fantastic. My gradient field is pointing towards it.
             | 
             | Thank you again!
        
             | htrp wrote:
              | Every time you take gardening leave, you build something new
             | and interesting!
        
             | LorenzoGood wrote:
             | I love when you leave your job.
        
         | 363849473754 wrote:
         | You might have covered this topic before, but I'm curious about
         | the main performance differences between nanoGPT and llm.c. I'm
         | planning to take your "Zero to Hero" course, and I'd like to
         | know how capable the nanoGPT chatbot you'll build is. Is its
         | quality comparable to GPT-2 when used as a chatbot?
        
           | karpathy wrote:
            | Zero To Hero doesn't make it all the way to a chatbot; it
            | stops at pretraining, and even that only at a fairly small
            | scale, with a character-level transformer on
            | TinyShakespeare. I think it's a good conceptual intro, but
            | you don't get too far toward a competent chatbot. I think
            | I should be able to improve on this soon.
        
             | 363849473754 wrote:
             | Thanks! So, you are considering expanding the Zero to Hero
             | series to include building a basic GPT-2 toy chatbot? I
             | believe you mentioned in one of the early lectures that you
              | planned to include building a toy version of DALL-E. Do
              | you still have plans for that as well?
        
             | maskil wrote:
             | Please do! It's a fantastic series!
        
         | dang wrote:
         | Ok, we've changed the URL to that from
         | https://twitter.com/karpathy/status/1795484547267834137 above.
         | Thanks!
        
           | karpathy wrote:
            | sounds good. both work, though I think HN has a bit of an
            | anti-twitter bias.
        
             | pests wrote:
             | First, love the videos and other work you've been doing.
             | The micrograd videos are a great way to show people this is
             | all math in the end, and I've linked to specific timestamps
             | in that video and others more times than I can count.
             | 
              | For why I think we have an anti-twitter bias...
             | 
             | Twitter doesn't show replies or any further context without
             | being logged in. Most people will have accounts but I know
             | a lot here deleted theirs or refuse to use it for one
             | reason or another.
             | 
             | Also IMO most here are going to want to read the full
             | source so it just cuts out the middleman. This would
             | usually fall under the "Please submit the original source.
             | If a post reports on something found on another site,
             | submit the latter." guideline which is a little different
             | since the source is yourself, but still the Twitter post
             | doesn't add anything new or novel.
        
               | karpathy wrote:
                | fwiw I totally understand the sentiment! it's actually
                | a bit sad to me that so much of our content is moving
                | from the shared, open web to platforms like twitter;
                | unfortunately there seems to be too much value add
                | around built-in discoverability, comments, ease of
                | authoring, and, for many people, revenue sharing, etc.
        
               | pests wrote:
                | Yes, definitely. I had to double check your age
                | (apologies! feels rude somehow) and yep, we're
                | basically the same age. The web was different back
                | then. Maybe not better; maybe that's nostalgia. But
                | never before have creators had as many tools and
                | avenues to promote and monetize their work as they do
                | now.
        
             | dang wrote:
             | I agree - Twitter is still the primary source for a lot of
             | original work and original thoughts. Unfortunately it's
             | gotten more complicated because (1) the threads there have
             | gotten less accessible and (2) some people have assigned
             | the entire site to one side of the culture war.
        
           | wrboyce wrote:
            | Could you mention what the link has been changed from,
            | too? Sometimes it helps with context when reading the
            | comments. Thanks!
        
             | dang wrote:
             | I agree that it helps! but I did mention it, no? Admittedly
             | "to that from" is a bit of an awkward construction
        
               | wrboyce wrote:
               | _facepalm_ I'd had a few whiskies and misread your
               | comment. Sorry about that!
        
         | m11a wrote:
         | Why write in CUDA and not just use PyTorch etc?
         | 
          | If it's for performance, how much faster is it, out of
          | curiosity?
        
           | kgwgk wrote:
           | > Why write in CUDA and not just use PyTorch etc?
           | 
           | "LLM training in simple, pure C/CUDA. There is no need for
           | 245MB of PyTorch or 107MB of cPython. [...] A few more words
           | on what I want this repo to be: First, I want llm.c to be a
           | place for education."
        
         | simonw wrote:
         | > Keep in mind that here we trained for 10B tokens, while GPT-3
         | models were all trained for 300B tokens. [...] GPT-3 actually
         | didn't change too much at all about the model (context size
         | 1024 -> 2048, I think that's it?).
         | 
         | Andrej, based on that do you have a rough cost estimate for
         | what it would take to train a GPT-3 Ada (350M)? Do you plan to
         | get there with llm.c ?
        
           | karpathy wrote:
           | The 350M model I trained last night was 30B tokens, 14 hours,
           | ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K
           | would be the estimate. You'd have to wait 140 hours on one
           | box though. Getting an H100 box instead of A100 will already
            | cut the time down, probably by a factor of 2-3X, for free,
            | even without going to fp8 (which we do plan to support).
            | 
            | So TLDR: at this model scale, llm.c is already there
            | functionally, I think; it's a matter of the compute
            | resources and patience. I currently have this one box from
            | Lambda, and I have to look around for a few more boxes and
            | merge the pending PR for multi-node training support.
            | Getting all of this into a nice, stable state is probably
            | a good chunk of the pending work right now.
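            | 
            | (That estimate is just scaling the measured run linearly
            | in tokens:
            | 
            |     hours_30b, cost_30b = 14, 200  # measured, 8x A100
            |     scale = 300 / 30               # GPT-3 token budget
            |     print(hours_30b * scale)       # 140 hours
            |     print(cost_30b * scale)        # ~$2,000
            | 
            | so there's nothing more to it than linear extrapolation.)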
        
         | localhost wrote:
         | How large is the set of binaries needed to do this training
         | job? The current pytorch + CUDA ecosystem is so incredibly
         | gigantic and manipulating those container images is painful
         | because they are so large. I was hopeful that this would be the
         | beginnings of a much smaller training/fine-tuning stack?
        
           | karpathy wrote:
           | That is 100% my intention and hope and I think we are very
           | close to deleting all of that. Right now on master, I am
           | already only using Python for the tokenization preprocessing.
            | In principle the requirements for llm.c should be
            | extremely minimal. I think this is a few days of work, and
            | it's high on my mind.
           | 
           | Biggest problem right now is finding a place that can host
           | the 135GB of tokens for FineWeb100B. Will probably use S3 or
           | something.
           | 
           | Related see: https://github.com/karpathy/llm.c/issues/482
        
             | metadat wrote:
             | Could this be a good case for a torrent?
        
         | dekhn wrote:
         | Would you consider switching your interest to protein structure
         | prediction? In particular, the current most advanced model is a
          | closed-source, closed-weights system that was trained on
          | proprietary hardware. It is intentionally kept that way for
          | now to enable DeepMind to commercialize their product.
         | 
         | The goal here isn't to make the best performing model: it's
         | ablation. How much can we remove from protein structure
         | prediction (such as multiple sequence alignments and molecular
         | dynamics, which were two improvements in AF3), while still
         | having a generalized model that can predict novel folds.
         | 
         | Then focus on teaching the minimal necessary math and code to
         | reproduce the results to the larger biological community. All I
         | can say about AF3 is that it literally taught me that
         | everything I learned about protein structure prediction in the
         | last 30 years was misguided, or outright wrong.
         | 
         | Don't worry about drug discovery or any of the hard stuff. Just
         | continue to show that all that's required to predict novel
         | structures is the existing PDB.
        
           | treme wrote:
           | lol I appreciate your effort to guide his genius towards 'max
           | human good'
        
           | wizzwizz4 wrote:
           | > _switching your interest_
           | 
           | That's not usually how it works.
           | 
           | > _Just continue to show that all that 's required to predict
           | novel structures is the existing PDB._
           | 
           | Sounds like you know a lot about this topic. You should do
           | it!
        
             | dekhn wrote:
             | Yes I already published several papers in the area, but I
             | don't work on it any more.
        
         | jonesn11 wrote:
          | I like the FAQ; you correctly anticipated my questions.
        
         | 0x1ceb00da wrote:
          | Hi. Is it possible to somehow run llm.c on an AMD GPU?
        
           | anthonix1 wrote:
           | Yeah, I just reproduced the GPT2 from scratch results in 8.75
           | hours on 4x 7900 XTX. The fork is here:
           | https://github.com/anthonix/llm.c
        
         | sytelus wrote:
          | So, nanoGPT took 1.8 days on 8x A100 for 124M model training
          | on 30.7B tokens using flash attention. This would translate
          | to 14.4 hr for 10B tokens. With llm.c it is ~1.5 hr, which
          | is almost a 10X speedup!
          | 
          | Does this look ballpark correct? Is there any summary of
          | where the majority of this improvement comes from?
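          | 
          | Spelling out that arithmetic (rounding the token count to
          | 30B):
          | 
          |     nanogpt_hours = 1.8 * 24            # 8x A100, ~30B tok
          |     per_10b = nanogpt_hours * 10 / 30   # ~14.4 h for 10B
          |     print(per_10b / 1.5)                # ~9.6x vs llm.c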
        
       | notg963 wrote:
        | Do you have plans to create videos for llm.c?
        
       | anoy8888 wrote:
        | Can it be done in Rust?
        
       | celltalk wrote:
       | Time for llm videos!
        
       | natsucks wrote:
       | In your opinion is it important for ML engineers to know C?
        
         | brcmthrowaway wrote:
         | 0% chance
        
         | esafak wrote:
          | You'd have to be deep into ML infrastructure to use C,
          | probably via CUDA. No one who develops or uses ML models
          | touches C or even C++; tinygrad and llama.cpp are
          | exceptions.
        
         | adeptima wrote:
          | Spend one year studying multiple languages - bash, C, C++,
          | Go, Python ... and even Mojo or Rust - 10-20 hours a week.
          | Being able to read the top programming languages is the best
          | investment I ever made. You will become fearless and can see
          | the matrix ;)
        
           | mode80 wrote:
           | I did this and wrote about my experience:
           | 
           | https://mode80.github.io/7-langs-in-12-months.html
           | 
           | I don't regret it. But if ML is your main goal, Python is
           | where you will end up because it's where the libraries are.
        
       | aliljet wrote:
        | Is there a reason you're not trying to port this into an even
        | more stack-agnostic world without CUDA?
        
       | zimabluerain wrote:
        | The code works well on H100:
       | https://x.com/Yuchenj_UW/status/1795554739633221804
        
       | ls612 wrote:
        | Is this the sort of thing that a person with curiosity and a
        | 4090 could do? It says he used 8x A100s in the cloud to do
        | this, but is it just a matter of the 4090 going 8x slower, or
        | will memory constraints kill the whole endeavour?
        
         | smaddox wrote:
          | A 4090 should have enough VRAM for 124M-param training. Even
          | at float32 precision, with the AdamW optimizer, parameters
          | plus optimizer state should only be ~2GB (124M params x 4
          | bytes per param x ~4 for optimizer overhead). So there
          | should be plenty of remaining space for activations.
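          | 
          | Rough arithmetic behind that estimate (assuming fp32 and
          | AdamW keeping two extra moment buffers on top of params and
          | grads):
          | 
          |     params = 124e6
          |     bytes_per_param = 4   # float32
          |     copies = 4            # params + grads + 2 Adam moments
          |     gb = params * bytes_per_param * copies / 1e9
          |     print(gb)             # ~1.98 GB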
        
       | adeptima wrote:
        | Andrej Karpathy is a magician!
        | 
        | But being the coolest kid on the block with a pure C/CUDA
        | implementation is not enough: https://github.com/karpathy/llm.c
       | 
       | Studying a baby Llama 2 model source code in pure Mojo is the
       | next level https://github.com/tairov/llama2.mojo
       | https://github.com/tairov/llama2.mojo/blob/master/llama2.moj...
       | 
       | Mojo Lang - Tomorrow's High Performance Python? (with Chris
       | Lattner) https://www.youtube.com/watch?v=JRcXUuQYR90
       | 
       | Andrej Karpathy and Chris Lattner collab is on my wishlist ;)
        
       | metalloid wrote:
        | This is awesome!
        | 
        | We need a series on how to build llm.c from scratch. Any
        | volunteers?
        | 
        | :-)
        
       | akkishore wrote:
       | Hi Andrej,
       | 
        | Huge fan of all the work you do. I wanted to understand
        | something fundamental, and who better to ask than you: What's
        | so special about the transformer architecture that it's able
        | to predict the next token so beautifully, understanding all
        | the intricate previous-token relationships? I understand
        | attention, but what's so special about this architecture that
        | no other architectures are able to "attend" appropriately to
        | previous tokens? Being a CS guy, it's really hard for me to
        | fathom that we have not yet created another architecture which
        | can perform similarly.
        
         | smaddox wrote:
         | Transformers have quadratic computational complexity in
         | sequence length, i.e. O(N^2) where N is the sequence length.
         | RNNs, Linformer, Mamba, etc. have linear or quasi-linear
         | computational complexity in sequence length, which often
         | bottlenecks information movement across tokens.
         | 
          | In theory, if you grew the RNN's state quadratically with
          | sequence length, you could likely achieve comparable
          | performance to transformers, but it would be less efficient
          | than transformers.
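          | 
          | The N^2 comes from materializing the attention score matrix
          | itself; a toy sketch:
          | 
          |     import torch
          | 
          |     N, d = 1024, 64
          |     q = torch.randn(N, d)
          |     k = torch.randn(N, d)
          |     scores = (q @ k.T) / d ** 0.5   # shape (N, N)
          |     print(scores.shape)             # (1024, 1024)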
        
       | unknown2342 wrote:
       | Thank you for your work Andrej! <3
        
       ___________________________________________________________________
       (page generated 2024-05-29 23:03 UTC)