[HN Gopher] Reproducing GPT-2 in llm.c
___________________________________________________________________
Reproducing GPT-2 in llm.c
Author : tosh
Score : 587 points
Date : 2024-05-28 15:58 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| indigodaddy wrote:
| Looks like this is re: training, but I wonder how inference on
| this model would go on some garbage older machine with no GPU?
| ryankrage77 wrote:
| Last time I tried GPT-2 on CPU (which I think was shortly
| before ChatGPT was launched), I was getting about 0.2
| tokens/sec. CPU utilization was low though, so running
| inference in parallel gave better results. I was using 2 x
| E5-2660's.
| int_19h wrote:
| DDR5 helps a lot. You can actually run stuff like LLaMA at >1
| tok/s on the CPU with high-end gaming hardware these days.
| doubloon wrote:
| I have a 24-core Intel CPU and llama.cpp runs Llama 3
| surprisingly fast in surprisingly little RAM. Yes, it
| becomes a space heater, but there's light at the end of
| the CUDA-free tunnel.
| benterix wrote:
| I just hope that in a couple of years we'll see a submission here
| titled "Reproduce GPT-4 on legacy RTX 4090."
|
| Because currently even with open source (?) models we are still
| consumers, and the training is still the domain of the rich.
| ravetcofx wrote:
| Accessing the dataset to train from scratch will be the biggest
| hurdle, now that a lot of the Pile has had the ladder pulled up
| since GPT-4.
| CamperBob2 wrote:
| Someone will come along and say "Why don't you just mirror
| Anna's Archive?" in 3...2...1...
| sebzim4500 wrote:
| I think between Anna's Archive, fineweb and as many github
| repos as you can scrape you can get a pretty decent
| dataset.
|
| I doubt Anna's Archive would produce a good model on its
| own though.
| exe34 wrote:
| I suppose you wouldn't be able to use it for external
| services, but internally, I'm sure you can find some books
| that fell off the back of a truck...
| HeatrayEnjoyer wrote:
| No reason you can't go external. GPT was trained using
| ebook torrent sites
| artninja1988 wrote:
| OpenAI has enough money to hire lawyers to defend it
| until the end of time though
| meiraleal wrote:
| I'm okay with paying for datasets
| CamperBob2 wrote:
| Depends on how the courts rule. If the copyright
| maximalists prevail, only the wealthiest entities will be
| able to afford to license a useful data set.
|
| Paradoxically enough, this is the outcome that most "Hacker
| News" denizens seem to be rooting for.
| meiraleal wrote:
| I'd still get most of my dataset from torrent but I could
| pay for specific things like high quality source code.
| groby_b wrote:
| It's almost as if people believe in fairness and
| compensating people for their work.
|
| Also, it's worth noting that this is only true as long as
| we're stuck in the "must train on the entire sum total of
| human output ever created" local minimum for machine
| learning. Given that most biological entities learn with
| much less data, this might well be the thing that prods
| ML research to using an approach that isn't "IDK, buy a
| few containers of GPUs, and half a DC of storage, see if
| that makes things better".
| nwsm wrote:
| > It's almost as if people believe in fairness and
| compensating people for their work.
|
| Yet in this case we are talking about compensating the
| compilers/massagers/owners of the datasets, not the
| original authors from wherever the data was originally
| scraped.
| wizzwizz4 wrote:
| Copyright is hideously broken, but in theory: the owners
| only own it because they compensate the authors, which
| they only do out of an expectation of future profit (on
| average).
|
| That theory's a fantasy, because extractive systems
| involving gatekeepers get established, but in _this
| specific case_, enforcing copyright would make things
| fairer for authors. There's no extractive copyright-
| taking gatekeeper for websites: scrapers don't get
| copyright, so can't re-license the material they've
| scraped (unless it's permissively-licensed or something).
| GaggiX wrote:
| https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T
| tokens of cleaned and deduplicated English web data.
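|
| A minimal sketch of streaming a sample of it with the
| `datasets` library instead of downloading the whole thing
| (the "sample-10BT" subset name and the "text" field are
| assumptions taken from the dataset card):
|
|     from datasets import load_dataset
|
|     # Stream the ~10B-token sample rather than pulling the
|     # full dataset to disk.
|     ds = load_dataset("HuggingFaceFW/fineweb",
|                       name="sample-10BT",
|                       split="train",
|                       streaming=True)
|     for row in ds:
|         print(row["text"][:200])
|         break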
| ravetcofx wrote:
| Holy crap, does Hugging Face charge for bandwidth if you're
| downloading 45 terabytes??
| drexlspivey wrote:
| I believe they are hosting it on Cloudflare, which doesn't
| charge for egress
| fragmede wrote:
| More specifically, Cloudflare R2 doesn't charge for
| egress, and Cloudflare doesn't charge for egress to
| members in the Bandwidth Alliance which include Azure,
| Google Cloud, Oracle, Alibaba Cloud, and others, though
| critically not AWS.
|
| They very much do charge egress fees elsewhere.
| andersa wrote:
| Fun trivia: downloading 45TB costs about $60, according
| to Cloudflare.
| verticalscaler wrote:
| That's what Cloudflare charges. It costs them around 6
| cents.
| kazanz wrote:
| Wish I could say I'm surprised you're getting downvotes.
| Carrier costs are some of the lowest costs for hosting
| providers. Yet that fact seems to elude a majority of the
| community here.
| andersa wrote:
| That's what they said it costs on their blog, not that
| they charge that. https://blog.cloudflare.com/aws-
| egregious-egress
|
| Where are you getting 6 cents from?
| vineyardmike wrote:
| We won't ever get there, or need to, because GPT-4 wasn't
| trained on one GPU; it was trained on thousands. The (most likely)
| biggest meaningful difference between -2 and -4 is the number
| of parameters and the training data/duration. I don't think
| you'd really learn much more.
| elicksaur wrote:
| It's not about learning. It's about owning. Exactly the
| reason OpenAI stopped being open. Having GPT-4-quality LLMs
| created by anyone with a gaming PC would be pretty radical.
| vineyardmike wrote:
| And you won't get there. Those models are far too large for
| a 2024 GPU. Llama-3 70b is arguably close to GPT-4 but is
| still too large for gaming GPUs (and probably for many
| years of GPU updates)
| elicksaur wrote:
| "You won't get there" is a pretty vast statement for all
| of the future. Two fairly reasonable predictions: 1) the
| compute needed to get GPT4 performance will decrease. 2)
| the compute on consumer GPUs will increase.
|
| At some point they cross, and you will be able to run a
| GPT4-quality LLM on a consumer GPU. At some point after
| that, you'll be able to run a GPT4-quality LLM on a 2024
| consumer GPU if you can find one.
|
| Important to emphasize, I'm not saying "GPT-4". Llama-3
| was trained on 24k GPU clusters. "Able to do the exact
| same processing at 1/24k the compute" is different from
| "Able to get equivalent performance at 1/24k compute".
| Even then, given a long enough time scale, the former is
| possible.
| vineyardmike wrote:
| > 1) the compute needed to get GPT4 performance will
| decrease. 2) the compute on consumer GPUs will increase.
|
| I'm assuming we're just talking inference here...
|
| Sure compute abilities for consumers will increase but
| the original comment had a fixed GPU - the 4090. I can
| already eke out Llama 3 8B on my MacBook Air, and Apple
| will sell you a laptop capable of running the full-sized
| Llama.
|
| There is a direct correlation between parameters and
| "knowledge" for an LM. There are some open questions as to
| density (Llama 3 specifically challenged previous
| assumptions), but it seems implausible to fit a model
| equivalent to GPT-4 into 24 GB of VRAM. Just like
| compression, you can't shrink forever.
|
| GPT-4 and GPT-2 are pretty similar architecturally (I
| assume). So if abilities don't matter, we can already run
| GPT-2 so we're basically there for 4.
| Invictus0 wrote:
| I'm not saying this to be rude, but I think you have a deep
| misunderstanding of how AI training works. You cannot just skip
| the matrix multiplications necessary to train the model, or get
| current hardware to do it faster.
| xdavidliu wrote:
| Was the first sentence really necessary? The second sentence
| seems fine by itself.
| nickpsecurity wrote:
| There's work on replacing multiplication. Here are four
| examples:
|
| https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_.
| ..
|
| https://arxiv.org/abs/2012.03458
|
| https://openaccess.thecvf.com/content/CVPR2021W/MAI/papers/E.
| ..
|
| https://arxiv.org/pdf/2106.10860
| benterix wrote:
| No offence taken! As far as my (shallow!) understanding goes,
| the main challenge is the need for many GPUs with huge
| amounts of memory, and it still takes ages to train the
| model. So regarding the use of consumer GPUs, some work has
| been done already, and I've seen some setups where people
| combine a few of these and are successful. As for the other
| aspects, maybe at some point we'll distill what is really
| needed into a smaller but excellent dataset that would give
| similar results in the final models.
| auspiv wrote:
| Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I
| think it'll take far more than a single 4090.
| bufo wrote:
| The RTX 4090 has about the same BF16 Tensor Core TOPS as
| the A100, so assuming 50% MFU (like the A100 40 GB PCIe), it
| would take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so 12
| hours. Datasheet here for the TOPS:
| https://images.nvidia.com/aem-
| dam/Solutions/geforce/ada/nvid... 50% MFU should be
| achievable on the 4090.
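|
| A back-of-the-envelope sketch of that estimate, assuming
| roughly equal per-GPU throughput and ignoring multi-GPU
| communication overhead:
|
|     # ~1.5 hours on 8x A100 for the 124M run (from the post)
|     gpu_hours = 8 * 1.5            # ~12 GPU-hours total
|
|     # If one 4090 sustains a similar effective throughput
|     # (similar BF16 TOPS at similar MFU), a single-GPU run
|     # needs roughly the same number of GPU-hours.
|     hours_on_one_4090 = gpu_hours  # ~12 hours
|     print(hours_on_one_4090)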
| anthonix1 wrote:
| Nah, I reproduced it on a 4x 7900 XTX machine in 8.75 hours, so
| a single 7900 XTX (costs less than $1k) could do it in under 24
| hours. Was hitting 55.4% MFU.
| anthonix1 wrote:
| FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900
| XTX machine (less than $4k worth of GPU), using the same
| settings as in the post (0.5M batch size etc).
| pama wrote:
| Did you reproduce the evaluation as well?
| anthonix1 wrote:
| It converges similarly on smaller datasets.
|
| About to kick off a training from scratch run on the same
| fineweb-10B, which at 324k toks/sec should take about 8.6
| hours. And with my kWh cost, that is about $2.50 cost to
| train.
|
| Will report back tomorrow when the training has finished..
| anthonix1 wrote:
| So... successfully reproduced in ~8.75 hours, taking about
| 18 kWh / $2.70
|
| The first run actually failed at step 3000 or so, and I
| realized I had a bug in my attention / matmul kernels, but
| after fixing that and restarting it worked great
|
| [1] https://github.com/anthonix/llm.c
| Manabu-eo wrote:
| What % of the theoretical FLOPS are you getting with those
| 7900 XTXs on training?
| sabareesh wrote:
| Well, here is a comment on the 4090:
| https://github.com/karpathy/llm.c/discussions/481#discussion...
| huac wrote:
| 25% MFU :( maybe because of the P2P nerf?
| anthonix1 wrote:
| Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than
| $1k and I'm getting 55.4% MFU
| sabareesh wrote:
| This is not an apples-to-apples comparison, as it is running
| across GPUs with a much bigger model.
| sabareesh wrote:
| This is a much bigger model (500M), and P2P is enabled via
| Mailbox. It is expected because of the memory-to-compute ratio.
| huac wrote:
| can you elaborate?
| doubloon wrote:
| Dude, I'm hoping we get rid of Nvidia completely. I can run
| llama.cpp inference on a 7B model on my 24-core Intel machine
| using just the CPU, and it only uses about 4 GB of RAM and is
| not that slow. If we could have massively parallel ARM-core or
| even RISC-V machines without the CUDA issues of proprietary
| driver hell, it would be much more open source. And much less
| wonkage for the normie user.
| karpathy wrote:
| Hi HN the main (more detailed) article is here
| https://github.com/karpathy/llm.c/discussions/481
|
| Happy to answer questions!
| 1024core wrote:
| Thank you, from an appreciative reader!
| ngiyabonga wrote:
| Hi Andrej!
|
| First, thank you for your teaching, it has helped me a lot,
| didn't think I'd ever have the chance to say thank you, but
| here you are and I hope this gets to you!
|
| Question - what's a relevant (05-2024) baseline to compare the
| performance of the C code to? Back when you made nanoGPT you
| were seeing "the file train.py reproduces GPT-2 (124M) on
| OpenWebText, running on a single 8XA100 40GB node in about 4
| days of training". So twice the memory on the C node, but I'm
| unsure of the data size/epochs and any other details I may be
| missing. I.e., what's the net uplift of running C vs "legacy"
| torch code?
|
| Thanks again for everything.
| karpathy wrote:
| The baseline is definitely PyTorch (or JAX), and indeed
| something like nanoGPT. I just never got nanoGPT "past the
| finish line" of really crossing the t's and dotting the i's
| and reproducing the models with as much care as I did now and
| here in llm.c, and getting to the point where it's a single
| launch command that just does the thing.
|
| I think I'll try to develop the `train_gpt2.py` inside llm.c
| to be that, so that we have the two implementations exactly
| side by side, and it's all nice and comparable.
|
| The C/CUDA code is currently a little bit faster than PyTorch
| (last time I measured ~2 weeks ago it was about 6% faster),
| and I think we can push this further. This is done by
| manually hard-coding a bunch of fusions/optimizations that
| are non-trivial for torch.compile to find (e.g. our
| FusedClassifier). But PyTorch has some pending work/PRs that
| will also speed up their side a lot.
|
| Ultimately my interest in llm.c is to have a nice, clean,
| minimal, super dependency-light repo in direct C/CUDA
| implementation, which I find aesthetically pleasing. And on
| top of that, educational, i.e. using all of the above as an
| endpoint of an intro LLM course.
| ilaksh wrote:
| Just out of curiosity, how do you feel about Tinygrad? They
| just released 0.9 and are also on the HN home page today.
| raymond_goo wrote:
| Maybe talk to MasterClass...
| sturza wrote:
| Do you think grokking leads to proper generalized reasoning?
| https://arxiv.org/abs/2405.15071
| bilsbie wrote:
| Any tips on understanding grokking? I'm not following that
| paper.
| sturza wrote:
| Grokking: Generalization Beyond Overfitting on Small
| Algorithmic Datasets. Overfitting and being cool about it
| and some new behavior might emerge.
| espadrine wrote:
| How big of a perf improvement would result from using the
| architectural tweaks that Llama3 and others have put in place
| since GPT-2?
| karpathy wrote:
| My understanding and suspicion is that it's mostly less than
| you think. The Llama 3 architecture has the following changes
| relative to GPT-2:
|
| 1. delete the absolute positional encoding and replace with
| RoPE
|
| 2. delete all biases in all layers (in LayerNorms, they turn
| into RMSNorm)
|
| 3. GeLU -> SwiGLU non-linearity in the MLP
|
| 4. longer context length
|
| 5. architecture hyperparameter changes, e.g. slightly
| different aspect ratios
|
| And there was a paper that I can't find the reference to
| anymore that claimed that if you train long enough, the gap
| becomes even lower. Possibly because the absolute
| positional encoding has enough time to train more fully,
| whereas the RoPE layer benefits from the "inductive bias" it
| adds in the earlier stages of training.
|
| But I don't have full confidence in the above claim; maybe
| someone has tried it or has a better/concrete reference.
| jorlow wrote:
| Note Llama's feed-forward is a bit different too:
| self.w2(F.silu(self.w1(x)) * self.w3(x))
|
| I.e. the nonlinearity is a gate.
|
| https://github.com/meta-
| llama/llama3/blob/14aab0428d3ec3a959...
| soraki_soladead wrote:
| Fwiw, that's SwiGLU in #3 above. Swi = Swish = silu. GLU
| is the gated linear unit, i.e. the gate construction you describe.
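|
| A minimal PyTorch-style sketch of that gated MLP next to the
| GPT-2 one (hidden sizes here are illustrative, not the actual
| Llama 3 values):
|
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class GPT2MLP(nn.Module):
|         def __init__(self, d):
|             super().__init__()
|             self.fc = nn.Linear(d, 4 * d)
|             self.proj = nn.Linear(4 * d, d)
|         def forward(self, x):
|             return self.proj(F.gelu(self.fc(x)))
|
|     class SwiGLUMLP(nn.Module):
|         def __init__(self, d, hidden):
|             super().__init__()
|             # no biases, matching change #2 above
|             self.w1 = nn.Linear(d, hidden, bias=False)  # gate
|             self.w3 = nn.Linear(d, hidden, bias=False)  # up
|             self.w2 = nn.Linear(hidden, d, bias=False)  # down
|         def forward(self, x):
|             # silu(w1(x)) gates w3(x), then project back down
|             return self.w2(F.silu(self.w1(x)) * self.w3(x))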
| lagrange77 wrote:
| Thank you for the effort you put into your educational work;
| it has helped me and others a lot! In fact, I'm training my
| nanoGPT version right now. :)
|
| > Ultimately my interest in llm.c is to have a nice, clean,
| minimal, super dependency-light repo in direct C/CUDA
| implementation, which I find aesthetically pleasing.
|
| Also, it's awesome that you spend your time on your passion.
|
| Any plans on making a video series on llm.c? :D
| karpathy wrote:
| Yes definitely. Related tweet of mine:
|
| https://x.com/karpathy/status/1760388761349927356?lang=en
|
| 1. Build the thing
|
| 2. Build the ramp
|
| Currently on step 1 :). It helps to build it first so you
| know where you are going, and then you can more easily
| re-build it when your vector is pointed at the end result.
| lagrange77 wrote:
| That's fantastic. My gradient field is pointing towards it.
|
| Thank you again!
| htrp wrote:
| Every time you take gardening leave, you build something new
| and interesting!
| LorenzoGood wrote:
| I love when you leave your job.
| 363849473754 wrote:
| You might have covered this topic before, but I'm curious about
| the main performance differences between nanoGPT and llm.c. I'm
| planning to take your "Zero to Hero" course, and I'd like to
| know how capable the nanoGPT chatbot you'll build is. Is its
| quality comparable to GPT-2 when used as a chatbot?
| karpathy wrote:
| Zero To Hero doesn't make it all the way to a chatbot; it
| stops at pretraining, and even that at a fairly small scale
| with a character-level transformer on TinyShakespeare. I think
| it's a good conceptual intro, but you don't get too far toward
| a competent chatbot. I think I should be able to improve on
| this soon.
| 363849473754 wrote:
| Thanks! So, you are considering expanding the Zero to Hero
| series to include building a basic GPT-2 toy chatbot? I
| believe you mentioned in one of the early lectures that you
| planned to include building a toy version of Dalle. Do you
| still have plans for that as well?
| maskil wrote:
| Please do! It's a fantastic series!
| dang wrote:
| Ok, we've changed the URL to that from
| https://twitter.com/karpathy/status/1795484547267834137 above.
| Thanks!
| karpathy wrote:
| Sounds good. Both work, though I think HN has a bit of an
| anti-Twitter bias.
| pests wrote:
| First, love the videos and other work you've been doing.
| The micrograd videos are a great way to show people this is
| all math in the end, and I've linked to specific timestamps
| in that video and others more times than I can count.
|
| As for why I think we have an anti-Twitter bias...
|
| Twitter doesn't show replies or any further context without
| being logged in. Most people will have accounts but I know
| a lot here deleted theirs or refuse to use it for one
| reason or another.
|
| Also IMO most here are going to want to read the full
| source so it just cuts out the middleman. This would
| usually fall under the "Please submit the original source.
| If a post reports on something found on another site,
| submit the latter." guideline which is a little different
| since the source is yourself, but still the Twitter post
| doesn't add anything new or novel.
| karpathy wrote:
| fwiw I totally understand the sentiment! it's actually a
| bit sad to me that so much of our content is moving from
| the shared, open web to platforms like twitter,
| unfortunately there seems to be too much value add around
| built-in discoverability, comments, ease of authoring,
| for many people revenue sharing, etc.
| pests wrote:
| Yes, definitely. I had to double check your age
| (apologies! feels rude somehow) and yep, we're basically
| the same age. The web was different back then. Maybe not
| better; maybe that's nostalgia. But never before have
| creators had as many tools and avenues to promote and
| monetize their work as they do now.
| dang wrote:
| I agree - Twitter is still the primary source for a lot of
| original work and original thoughts. Unfortunately it's
| gotten more complicated because (1) the threads there have
| gotten less accessible and (2) some people have assigned
| the entire site to one side of the culture war.
| wrboyce wrote:
| Could you mention what the link has been changed from, too?
| Sometimes it helps with context when reading the comments.
| Thanks!
| dang wrote:
| I agree that it helps! but I did mention it, no? Admittedly
| "to that from" is a bit of an awkward construction
| wrboyce wrote:
| _facepalm_ I'd had a few whiskies and misread your
| comment. Sorry about that!
| m11a wrote:
| Why write in CUDA and not just use PyTorch etc?
|
| If performance, how much faster is it, out of curiosity?
| kgwgk wrote:
| > Why write in CUDA and not just use PyTorch etc?
|
| "LLM training in simple, pure C/CUDA. There is no need for
| 245MB of PyTorch or 107MB of cPython. [...] A few more words
| on what I want this repo to be: First, I want llm.c to be a
| place for education."
| simonw wrote:
| > Keep in mind that here we trained for 10B tokens, while GPT-3
| models were all trained for 300B tokens. [...] GPT-3 actually
| didn't change too much at all about the model (context size
| 1024 -> 2048, I think that's it?).
|
| Andrej, based on that do you have a rough cost estimate for
| what it would take to train a GPT-3 Ada (350M)? Do you plan to
| get there with llm.c ?
| karpathy wrote:
| The 350M model I trained last night was 30B tokens, 14 hours,
| ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K
| would be the estimate. You'd have to wait 140 hours on one
| box though. Getting an H100 box instead of A100 will already
| cut the time latency down probably by a factor of 2-3X, for
| free, even without going to fp8 (which we do plan to
| support).
|
| So TL;DR at this model scale, llm.c is already there
| functionally, I think; it's a matter of the compute resources
| and patience. I currently have this one box from Lambda and I
| have to look around for a few more boxes and merge the
| pending PR for multi-node training support. Getting all of
| this into a nice, stable state is probably a good chunk of
| the pending work right now.
| localhost wrote:
| How large is the set of binaries needed to do this training
| job? The current pytorch + CUDA ecosystem is so incredibly
| gigantic and manipulating those container images is painful
| because they are so large. I was hopeful that this would be
| the beginning of a much smaller training/fine-tuning stack?
| karpathy wrote:
| That is 100% my intention and hope and I think we are very
| close to deleting all of that. Right now on master, I am
| already only using Python for the tokenization preprocessing.
| In principle the requirements for llm.c should be extremely
| minimal. I think this is a few days of work that is high on
| my mind.
|
| Biggest problem right now is finding a place that can host
| the 135GB of tokens for FineWeb100B. Will probably use S3 or
| something.
|
| Related see: https://github.com/karpathy/llm.c/issues/482
| metadat wrote:
| Could this be a good case for a torrent?
| dekhn wrote:
| Would you consider switching your interest to protein structure
| prediction? In particular, the current most advanced model is a
| closed-source, closed-weights system that was trained on a
| proprietary hardware. It is intentionally kept that way for now
| to enable deepmind to commercialize their product.
|
| The goal here isn't to make the best performing model: it's
| ablation. How much can we remove from protein structure
| prediction (such as multiple sequence alignments and molecular
| dynamics, which were two improvements in AF3), while still
| having a generalized model that can predict novel folds?
|
| Then focus on teaching the minimal necessary math and code to
| reproduce the results to the larger biological community. All I
| can say about AF3 is that it literally taught me that
| everything I learned about protein structure prediction in the
| last 30 years was misguided, or outright wrong.
|
| Don't worry about drug discovery or any of the hard stuff. Just
| continue to show that all that's required to predict novel
| structures is the existing PDB.
| treme wrote:
| lol I appreciate your effort to guide his genius towards 'max
| human good'
| wizzwizz4 wrote:
| > _switching your interest_
|
| That's not usually how it works.
|
| > _Just continue to show that all that 's required to predict
| novel structures is the existing PDB._
|
| Sounds like you know a lot about this topic. You should do
| it!
| dekhn wrote:
| Yes I already published several papers in the area, but I
| don't work on it any more.
| jonesn11 wrote:
| Like the FAQ, you correctly anticipated my questions.
| 0x1ceb00da wrote:
| Hi. Is it possible to somehow run llm.c on an AMD GPU?
| anthonix1 wrote:
| Yeah, I just reproduced the GPT2 from scratch results in 8.75
| hours on 4x 7900 XTX. The fork is here:
| https://github.com/anthonix/llm.c
| sytelus wrote:
| So, NanoGPT took 1.8 days on 8xA100 for 124M model training on
| 30.7B tokens using flash attention. This would translate to
| 14.4 hr for 10B tokens. With llm.c it is ~1.5 hr, which is
| almost a 10X speedup!
|
| Does this look ballpark correct? Is there any summary of where
| the majority of this improvement comes from?
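|
| A quick sanity check of that arithmetic, just scaling the
| numbers above:
|
|     nanogpt_hours = 1.8 * 24                # 43.2 h / 30.7B tok
|     per_10b = nanogpt_hours * 10 / 30.7     # ~14 h per 10B tok
|     llmc_hours = 1.5
|     print(per_10b / llmc_hours)             # roughly 9-10x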
| notg963 wrote:
| Do you have plans to create videos for llm.c?
| anoy8888 wrote:
| Can it be done in Rust?
| celltalk wrote:
| Time for llm videos!
| natsucks wrote:
| In your opinion is it important for ML engineers to know C?
| brcmthrowaway wrote:
| 0% chance
| esafak wrote:
| You'd have to be deep into ML Infrastructure to use C, probably
| via CUDA. No one who develops or uses ML models touches C or
| even C++. tinygrad and llama.cpp are exceptions.
| adeptima wrote:
| Spend one year studying multiple languages - bash, C, C++, Go,
| Python ... and even Mojo or Rust. 10-20 hours a week. Being
| able to read top programming languages is the best investment I
| ever made. You will become fearless and can see the matrix ;)
| mode80 wrote:
| I did this and wrote about my experience:
|
| https://mode80.github.io/7-langs-in-12-months.html
|
| I don't regret it. But if ML is your main goal, Python is
| where you will end up because it's where the libraries are.
| aliljet wrote:
| Is there a reason you're not trying to port this into an even
| more stack-agnostic world without CUDA?
| zimabluerain wrote:
| The code works well on an H100:
| https://x.com/Yuchenj_UW/status/1795554739633221804
| ls612 wrote:
| Is this the sort of thing that a person with curiosity and a 4090
| could do? It says he used 8xA100s in the cloud to do this but is
| it just a matter of the 4090 going 8x slower or will memory
| constraints kill the whole endeavour?
| smaddox wrote:
| A 4090 should have enough VRAM for 124M-param training. Even at
| float32 precision with the AdamW optimizer, parameters plus
| optimizer state should only be ~2GB (124M params x 4 bytes per
| param x ~4 for optimizer state overhead), so there should be
| plenty of remaining space
| for activations.
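|
| As a rough sketch of that estimate (the ~4x factor covering
| the weights plus gradients and the two AdamW moments;
| activations are extra and depend on batch size):
|
|     params = 124e6
|     bytes_per_param = 4               # float32
|     overhead = 4                      # weights + grads + m + v
|     print(params * bytes_per_param * overhead / 1e9)  # ~2.0 GB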
| adeptima wrote:
| Andrej Karpathy is a magician!
|
| But being the coolest kid on the block with pure C/CUDA
| implementation is not enough https://github.com/karpathy/llm.c
|
| Studying a baby Llama 2 model source code in pure Mojo is the
| next level https://github.com/tairov/llama2.mojo
| https://github.com/tairov/llama2.mojo/blob/master/llama2.moj...
|
| Mojo Lang - Tomorrow's High Performance Python? (with Chris
| Lattner) https://www.youtube.com/watch?v=JRcXUuQYR90
|
| Andrej Karpathy and Chris Lattner collab is on my wishlist ;)
| metalloid wrote:
| This is awesome!
|
| We need a series on how to build llm.c from scratch. Any
| volunteers?
|
| :-)
| akkishore wrote:
| Hi Andrej,
|
| Huge fan of all the work you do. Wanted to understand something
| fundamental, and who better to ask than you: What's so special
| about the transformer architecture that it's able to predict the
| next token so beautifully, understanding all the intricate
| previous-token relationships? I understand attention, but what's
| so special about this architecture that no other architectures
| are able to "attend" appropriately to previous tokens? Being a
| CS guy, it's really hard for me to fathom that we have not yet
| created another architecture which can perform similarly.
| smaddox wrote:
| Transformers have quadratic computational complexity in
| sequence length, i.e. O(N^2) where N is the sequence length.
| RNNs, Linformer, Mamba, etc. have linear or quasi-linear
| computational complexity in sequence length, which often
| bottlenecks information movement across tokens.
|
| In theory, if you grew the RNN's state quadratically with
| sequence length, you could likely achieve comparable
| performance to transformers, but it would likely be less
| efficient than transformers.
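|
| Concretely, the O(N^2) term comes from the attention score
| matrix; a minimal single-head sketch, no masking, purely
| illustrative:
|
|     import torch
|
|     N, d = 1024, 64              # sequence length, head dim
|     q, k, v = (torch.randn(N, d) for _ in range(3))
|
|     # scores is N x N: every token attends to every token,
|     # hence quadratic time and memory in sequence length.
|     scores = (q @ k.T) / d**0.5
|     attn = torch.softmax(scores, dim=-1)
|     out = attn @ v               # (N, d)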
| unknown2342 wrote:
| Thank you for your work Andrej! <3
___________________________________________________________________
(page generated 2024-05-29 23:03 UTC)