[HN Gopher] llm.c - LLM training in simple, pure C/CUDA
___________________________________________________________________
llm.c - LLM training in simple, pure C/CUDA
Author : tosh
Score : 335 points
Date : 2024-04-08 20:38 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| tosh wrote:
| > LLM training in simple, pure C/CUDA. There is no need for 245MB
| of PyTorch or 107MB of cPython
| api wrote:
| Python has been popular for this because it's convenient to
| quickly hack on and experiment with, not because it's the most
| efficient thing.
| im3w1l wrote:
| The overhead really isn't that bad, is it? Since the Python
| code is mostly about saying multiply matrix A with matrix B,
| and the actual computation is done by optimized low-level
| code.
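|
| A minimal sketch of that point, assuming NumPy as a stand-in
| for the optimized library: the Python-level call is just a
| thin dispatch, and nearly all the time goes to the compiled
| BLAS kernel underneath.
|
|     import time
|     import numpy as np
|
|     # two largish matrices; the interpreter only sets these up
|     A = np.random.rand(4096, 4096).astype(np.float32)
|     B = np.random.rand(4096, 4096).astype(np.float32)
|
|     t0 = time.perf_counter()
|     C = A @ B  # dispatches to an optimized BLAS kernel
|     t1 = time.perf_counter()
|     # the Python call overhead itself is on the order of microseconds
|     print(f"matmul took {t1 - t0:.3f}s")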
| jiggawatts wrote:
| I suspect that this has a high chance of running afoul of
| Amdahl's Law. Even if you can parallelise the bulk of the
| computation, the serial parts remain single-threaded and
| start to dominate the total runtime.
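|
| A tiny illustration of why that ceiling bites, sketched in
| Python (the 0.99 parallel fraction is just an assumed number):
|
|     # Amdahl's Law: speedup with parallel fraction p on n workers
|     def amdahl_speedup(p: float, n: int) -> float:
|         return 1.0 / ((1.0 - p) + p / n)
|
|     # even with a million workers, p = 0.99 caps out near 100x
|     print(amdahl_speedup(0.99, 1_000_000))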
| bongodongobob wrote:
| For that stuff, yeah you're correct.
|
| What I've seen is issues with the implementation of those
| libraries in a project.
|
| I don't remember exactly, but I was playing with someone's
| wrapper for some kind of machine learning snake game and it
| was taking way longer than it should have based on back-of-
| the-napkin math.
|
| The issue was using either a dict or a list in a hot loop
| and changing it to the other sped it up like 1000x.
|
| So it's easy to think "yeah, this library is optimized," but
| then you build something on top of it that slows it down in
| ways that aren't obvious.
|
| But, that's the Python tradeoff.
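|
| A hedged guess at the kind of bug described above (the
| original code isn't shown, so this is just the classic
| pattern): membership tests against a list are O(n), against
| a dict they are O(1), which in a hot loop easily turns into
| an enormous swing.
|
|     import time
|
|     seen_list = list(range(10_000))
|     seen_dict = dict.fromkeys(seen_list)
|     queries = list(range(0, 20_000, 2))
|
|     def count_hits(container, queries):
|         hits = 0
|         for q in queries:          # the "hot loop"
|             if q in container:     # O(n) for a list, O(1) for a dict
|                 hits += 1
|         return hits
|
|     t0 = time.perf_counter()
|     count_hits(seen_list, queries)
|     t1 = time.perf_counter()
|     count_hits(seen_dict, queries)
|     t2 = time.perf_counter()
|     print(f"list: {t1 - t0:.4f}s  dict: {t2 - t1:.6f}s")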
| 0cf8612b2e1e wrote:
| That sounds irrelevant to Python and just a matter of
| slow code cropping up in libraries until someone runs a
| profiler.
| littlestymaar wrote:
| But then again, if your program has places where choosing
| the right Python primitive is important for performance,
| then using Python is already hurting performance there,
| since even the best algorithm in Python would be slower than
| the equivalent C.
|
| Most of the time it doesn't matter because there's nothing
| hot on the Python side, but if there is, then Python is
| going to be slowing your stuff down.
| hcarvalhoalves wrote:
| > The issue was using either a dict or a list in a hot
| loop and changing it to the other sped it up like 1000x.
|
| The programmer using the wrong data structure is not a
| problem with the language.
| CamperBob2 wrote:
| It really is with Python. There are simply too many
| containers and container-like concepts. Lists, arrays,
| sets, dicts...
| 0cf8612b2e1e wrote:
| What modern language doesn't have those?
|
| Go kind of cheats and has maps play double duty as sets.
| llm_nerd wrote:
| It depends on how you define overhead. Runtime overhead and
| memory usage are absolutely marginal, and the tightest, most
| perfect implementation will have trouble beating it.
|
| Instead people are trying to optimize the install size of
| dependencies, which, while maybe a fun hacking project...
| who really cares?
| QuadmasterXLII wrote:
| 107MB of cPython defeated.
|
| Go to try it for myself.
|
| Step 1: download 2.4GB of CUDA.
| simonw wrote:
| The size of CUDA really is astonishing. Any chance someone
| might figure out how to slim that down?
| dartos wrote:
| Nvidia is the only one who could, since they own it.
| xiphias2 wrote:
| Talking directly to the kernel / driver / firmware.
|
| As others have said, George Hotz is doing his best in
| reverse-engineering and skipping layers.
| jsheard wrote:
| Taking a peek inside the package it seems to mostly be the
| libraries - CuFFT alone is about 350MB for example, twice
| over for the debug and release versions. I'm guessing those
| are probably fat binaries pre-compiled for every generation
| of Nvidia hardware rather than just the PTX bytecode, which
| would help to speed up fresh builds, at the expense of
| being huge.
| maille wrote:
| Raise your voice on their forum:
| https://forums.developer.nvidia.com/t/how-to-overcome-the-
| hu... Tried my luck 2 years ago but it keeps increasing.
| phdelightful wrote:
| Here's a blog that breaks down how large different pieces
| of CUDA are:
|
| https://carlpearson.net/post/20231023-cuda-releases/
| gcr wrote:
| I mean, being fair, the 2.4GB CUDA SDK is absolutely required
| for the cPython implementation as well
| dwroberts wrote:
| A bunch of install methods for torch via pip include ~1.5GB
| of lib/ because of CUDA. libtorch_cuda.so is like 800MB on
| its own
| fsloth wrote:
| I don't think it's about the byte size, but the inherent
| complexity of the implementation. 1000 lines of C code is
| extremely simple by any standard. Whereas a sundry collection
| of Python and PyTorch libraries is anything but.
| brcmthrowaway wrote:
| Very sad, should've used an agnostic framework instead of CUDA
| jsheard wrote:
| It's only ~1000 LoC, seems like a pretty good case study to
| port over to other runtimes and show they can stand up to CUDA.
| geph2021 wrote:
| As far as I can tell, its optional dependency is OpenMP, not
| CUDA. It doesn't seem directly dependent on CUDA.
| gpderetta wrote:
| Yes, a quick skim of the code only shows an OpenMP
| dependency. The C/CUDA in the title might have been meant as
| C/OpenMP.
|
| Although I wonder if it would work well with GCC's PTX
| OpenMP offloading.
| dlazaro wrote:
| The plan is to eventually implement with CUDA:
|
| "Currently, I am working on [...] direct CUDA implementation,
| which will be significantly faster and probably come close to
| PyTorch."
| robrenaud wrote:
| Are there any strong LLMs trained without CUDA?
| blackeyeblitzar wrote:
| Yes, there are several. See this blog post from Databricks
| describing the landscape of LLMs trained on AMD hardware for
| example: https://www.databricks.com/blog/training-llms-scale-
| amd-mi25...
|
| The most interesting one IMO is OLMo from AI2, which is truly
| open. You can read their blog post about it
| (https://blog.allenai.org/hello-olmo-a-truly-open-
| llm-43f7e73...) but basically it is open everything - they
| released everything you need to reproduce their weights
| (training data, training code, evaluation code, and weights)
| with a friendly (Apache) license.
| ZoomerCretin wrote:
| Gemini and Gemma were trained on Google's TPUs.
| exe34 wrote:
| Looking forward to your patches!
| andrewstuart wrote:
| OT, but a question from someone curious... is CUDA still
| entrenched as the only option for doing AI, or is there
| growing support for AMD/Intel/other ways of doing AI?
| BraverHeart wrote:
| George Hotz is attempting to solve this:
| https://github.com/tinygrad/tinygrad
| ZoomerCretin wrote:
| He loudly gave up on AMD after they spent 5+ months not
| fixing a blocker of his and gave him the runaround the
| entire time when he asked for the code so he could fix it
| himself. He is still shipping the AMD tinybox with huge
| warning labels.
| WithinReason wrote:
| There are some stirrings but don't hold your breath
| blackeyeblitzar wrote:
| See my comment on this here:
| https://news.ycombinator.com/item?id=39973816
| sigmoid10 wrote:
| There are a few attempts here and there in various stages of
| progression. But right now, nothing matches Nvidia+CUDA in
| speed and usability.
| adam_arthur wrote:
| You can run inference today on pretty much any card.
|
| Download Ollama on a modern MacBook and you can run 13B and
| even larger models (if your RAM allows) at fast speeds.
| People run smaller models locally on their phones.
|
| Google has trained their latest models on their own TPUs... not
| using Nvidia to my knowledge.
|
| So, no, there are alternatives. CUDA has the largest mindshare
| on the training side though.
| taminka wrote:
| There are obviously alternatives from both Intel and AMD,
| with performant BLAS/DNN packages, but small teams don't use
| them because CUDA is easier to use and has more support, and
| larger teams don't use them because they have deals with
| Nvidia, or not enough GPUs are available, or they're after
| the absolute best performance (which is still Nvidia), or
| because of other things like unstable drivers.
| triyambakam wrote:
| When Lex recently talked to Andrej, Andrej said that he gets
| positively obsessed with a problem and says "this must
| exist". I imagine this must be one of those outputs.
| yinser wrote:
| I've seen his nanoGPT implemented using JAX, and now we have
| C/CUDA. I'd love to see if nanoGPT could be doable in Mojo.
| I took a stab at a Mojo conversion of his WaveNet project
| (from Andrej's Zero to Hero course) and I gotta say...
| Python has so many nice features lol. Stating the obvious, I
| know, but what you see done in 6 lines of Python takes so
| much more work in other languages.
| cb321 wrote:
| For a prior generation of karpathy-splaining there is this
| Nim port: https://github.com/Vindaar/llama2nim - maybe of
| interest if you are interested in Mojo.
| blackeyeblitzar wrote:
| It would be great if someone created a tutorial around this
| explaining exactly how it works and how to do a test training
| run. I'm aware it's not feasible to train a "real" model on
| personal hardware but it would be nice to have a practical
| learning experience. I'm not sure if there are good alternatives
| for that.
| qwertox wrote:
| > direct CUDA implementation, which will be significantly faster
| and probably come close to PyTorch.
|
| It almost hurts to read that PyTorch is faster.
|
| But then again, with these GPU RAM prices, let's see how it
| speeds things up on the CPU.
|
| We really need SO-DIMM slots on the RTX series (or AMD/Intel
| equivalent) so that we can expand the RAM as we need it to. Is
| there a technical problem to it?
| LatticeAnimal wrote:
| > We really need SO-DIMM slots on the RTX series (or AMD/Intel
| equivalent) so that we can expand the RAM as we need it to. Is
| there a technical problem to it?
|
| I imagine it would incur a non-trivial latency and cost
| penalty. The memory modules are placed pretty close to the
| compute die right now. Cooling would also have to change
| (the memory modules produce a lot of heat).
|
| But there is also no reason for any of the GPU manufacturers
| to do this. A SKU with twice as much memory can go for a lot
| more than the difference in memory cost alone.
| SunlitCat wrote:
| And especially for doing "interesting" combinations of GPU
| and memory.
|
| Like a lower-end GPU with 16 GB of VRAM, while offering just
| 8 or 12 GB of VRAM in the mid-range and then again 16 GB at
| the upper end of the GPU lineup.
| jsheard wrote:
| Memory speed is more or less directly proportional to how close
| the memory is to the processor, with the fastest memory being
| literally inside the processor (SRAM cache), followed by memory
| on the same package as the processor (HBM GPUs, Apple
| M-series), followed by soldered down discrete memory chips
| (regular GPUs, games consoles), followed by socketed DIMMs in
| distant last place. There's not really any getting around it,
| the bandwidth that GPUs crave just isn't compatible with
| modularity.
|
| Even CPUs are starting to move their memory closer to the core
| in the name of performance, as mentioned Apple is already doing
| it, Intel is making Xeons with on-chip memory now, and they
| have a version aimed at consumers on their roadmap.
| wtallis wrote:
| FYI, most discrete GPUs with discrete memory packages
| soldered to the board near the GPU are running at
| substantially higher memory frequencies than the on-package
| DRAM in Apple's chips. But running GDDR at those speeds costs
| a lot of power.
| tverbeure wrote:
| For data rates, as in bandwidth per IO pin, distance is
| really only a secondary factor. HBM memory, for example, runs
| at substantially lower data rates than GDDR, yet it sits
| right next to the GPU die compared to centimeters for the
| GDDR. And high-speed serial links run at speeds that are an
| order of magnitude higher than even the internal register
| files of a CPU.
| tverbeure wrote:
| Check out PCB back drilling. It's a process where you remove a
| few hundred microns from the vias that are used to connect GDDR
| RAMs to the GPUs, to avoid reflections due to the impedance
| mismatch that's caused by the stub.
|
| When you have a pulse-coded signal traveling at close to
| 10GHz, everything becomes an antenna. The technical problem
| is that you can't do this with a flimsy connector like the
| ones used for DIMMs. The reason GDDR can have a bandwidth
| per pin that is 4 times higher than regular DDR is that the
| chips are soldered down on the PCB.
| patrick-fitz wrote:
| https://twitter.com/karpathy/status/1777427944971083809
|
| > And once this is in a bit more stable state: videos on
| building this in more detail and from scratch.
|
| Looking forward to watching the videos.
| 0cf8612b2e1e wrote:
| I love his videos. They are dense, but I get a lot out of them.
| sghiassy wrote:
| +100 thank you karpathy!
| fori1to10 wrote:
| It should be rewritten in Rust. (Just joking)
| idkwhatimdoin wrote:
| If I was starting from scratch, what resources should I start
| with to build up an understanding of what this code does and how
| to read it? It's quite dense and my knowledge of LLMs is quite
| minimal. Are these terse variable names standard in LLM-land?
| tayo42 wrote:
| Check out his Zero to Hero series, which builds this up with
| Python and later PyTorch, and then probably his other mini
| C-based projects.
| flockonus wrote:
| Question, apologies if slightly off-topic, but it's
| something I'd like to use this project for: is there an
| example of how to train GPT-2 on time series, in particular
| with covariates?
|
| As far as my basic understanding of LLMs goes, it's
| predicting the next token from previous tokens, which sounds
| directionally similar to time series (perhaps leaving aside
| periodicity).
| teruakohatu wrote:
| Yes general LLM models can be used for time series forecasting:
|
| https://github.com/KimMeen/Time-LLM
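|
| Not what Time-LLM does specifically, just a minimal sketch
| of the idea in the question above: quantize the continuous
| series into a small vocabulary of bins so a next-token model
| like GPT-2 can train on it (covariates could be interleaved
| as extra tokens; that part is omitted here).
|
|     import numpy as np
|
|     def series_to_tokens(series, n_bins=256):
|         # map each value to an integer "token" id in [0, n_bins)
|         edges = np.linspace(series.min(), series.max(), n_bins - 1)
|         return np.digitize(series, edges)
|
|     t = np.linspace(0, 20, 1000)
|     series = np.sin(t) + 0.1 * np.random.randn(t.size)
|     tokens = series_to_tokens(series)
|     # `tokens` can now stand in for text tokens in a
|     # next-token training loop.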
| andy99 wrote:
| I'd like to think he took the name from my llm.f90 project
| https://github.com/rbitr/llm.f90
|
| It was originally based off of Karpathy's llama2.c but I renamed
| it when I added support for other architectures.
|
| Probably a coincidence :)
___________________________________________________________________
(page generated 2024-04-08 23:00 UTC)