[HN Gopher] llm.c - LLM training in simple, pure C/CUDA
       ___________________________________________________________________
        
       llm.c - LLM training in simple, pure C/CUDA
        
       Author : tosh
       Score  : 335 points
       Date   : 2024-04-08 20:38 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | tosh wrote:
       | > LLM training in simple, pure C/CUDA. There is no need for 245MB
       | of PyTorch or 107MB of cPython
        
         | api wrote:
         | Python has been popular for this because it's convenient to
         | quickly hack on and experiment with, not because it's the most
         | efficient thing.
        
           | im3w1l wrote:
            | The overhead really isn't that bad, is it? The Python code
            | is mostly just saying "multiply matrix A with matrix B",
            | and the actual computation is done by optimized low-level
            | code.
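            | 
            | A rough sketch of that split (assuming NumPy linked
            | against an optimized BLAS; the matrix size is arbitrary,
            | just for illustration):
            | 
            |     import time
            |     import numpy as np
            | 
            |     # Python only dispatches the call; the optimized BLAS
            |     # kernel does the O(n^3) work in low-level code.
            |     a = np.random.rand(2048, 2048).astype(np.float32)
            |     b = np.random.rand(2048, 2048).astype(np.float32)
            | 
            |     t0 = time.perf_counter()
            |     c = a @ b   # one Python call, ~17 GFLOP of work
            |     t1 = time.perf_counter()
            |     print(f"matmul took {t1 - t0:.3f} s")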
        
             | jiggawatts wrote:
              | I suspect that this has a high chance of running afoul
              | of Amdahl's law. Even if you can parallelise the bulk of
              | the computation, the serial parts remain single-threaded
              | and start to dominate the total runtime.
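              | 
              | The usual back-of-the-envelope form (the numbers below
              | are made up, just for illustration): if a fraction p of
              | the work parallelises across N workers, the speedup is
              | 1 / ((1 - p) + p / N), so the serial remainder sets a
              | hard ceiling no matter how large N gets.
              | 
              |     # Amdahl's law: the serial fraction bounds the
              |     # overall speedup regardless of worker count.
              |     def amdahl_speedup(p, n):
              |         # p = parallelisable fraction, n = workers
              |         return 1.0 / ((1.0 - p) + p / n)
              | 
              |     for n in (8, 64, 1024):
              |         print(n, round(amdahl_speedup(0.95, n), 1))
              |     # 95% parallel tops out near 1/0.05 = 20x even with
              |     # 1024 workers (prints ~5.9, 15.4, 19.6)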
        
             | bongodongobob wrote:
              | For that stuff, yeah, you're correct.
              | 
              | What I've seen are issues with how those libraries get
              | used in a project.
              | 
              | I don't remember exactly, but I was playing with
              | someone's wrapper for some kind of machine-learning
              | snake game, and it was taking way longer than it should
              | have based on back-of-the-napkin math.
              | 
              | The issue was using either a dict or a list in a hot
              | loop, and changing it to the other sped it up by
              | something like 1000x.
              | 
              | So it's easy to think "yeah, this library is optimized",
              | but then you build something on top of it that slows it
              | down in ways that aren't obvious.
              | 
              | But that's the Python tradeoff.
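              | 
              | For illustration (not the actual snake-game code, just
              | the shape of the bug): membership tests in the hot loop
              | are O(n) on a list but roughly O(1) on a set or dict.
              | 
              |     import time
              | 
              |     items = list(range(200_000))
              |     as_list = items
              |     as_set = set(items)
              |     needles = list(range(0, 200_000, 40))
              | 
              |     def count_hits(haystack, needles):
              |         # hot loop: one membership test per iteration
              |         return sum(1 for n in needles if n in haystack)
              | 
              |     t0 = time.perf_counter()
              |     count_hits(as_list, needles)   # linear scans
              |     t1 = time.perf_counter()
              |     count_hits(as_set, needles)    # hash lookups
              |     t2 = time.perf_counter()
              |     print(f"list: {t1-t0:.3f}s  set: {t2-t1:.6f}s")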
        
               | 0cf8612b2e1e wrote:
               | That sounds irrelevant to Python and just a matter of
               | slow code cropping up in libraries until someone runs a
               | profiler.
        
               | littlestymaar wrote:
                | But then again, if your program has places where
                | choosing the right Python primitive is important for
                | performance, then using Python is hurting performance
                | there, since even the best algorithm in Python will be
                | slower than the equivalent C.
                | 
                | Most of the time it doesn't matter because there's
                | nothing hot on the Python side, but if there is, then
                | Python is going to be slowing your stuff down.
        
               | hcarvalhoalves wrote:
               | > The issue was using either a dict or a list in a hot
               | loop and changing it to the other sped it up like 1000x.
               | 
               | The programmer using the wrong data structure is not a
               | problem with the language.
        
               | CamperBob2 wrote:
               | It really is with Python. There are simply too many
               | containers and container-like concepts. Lists, arrays,
               | sets, dicts...
        
               | 0cf8612b2e1e wrote:
               | What modern language doesn't have those?
               | 
               | Go kind of cheats and has maps play double duty as sets.
        
             | llm_nerd wrote:
              | It depends on how you define overhead. Runtime overhead
              | and memory usage are absolutely marginal, and the
              | tightest, most perfect implementation will have trouble
              | beating it.
              | 
              | Instead, people are trying to optimize the install size
              | of the dependencies, which, while maybe a fun hacking
              | project... who really cares?
        
         | QuadmasterXLII wrote:
         | 107MB of cPython defeated
         | 
         | Go to try for self
         | 
          | Step 1: download 2.4GB of CUDA
        
           | simonw wrote:
           | The size of CUDA really is astonishing. Any chance someone
           | might figure out how to slim that down?
        
             | dartos wrote:
             | Nvidia is the only one who could, since they own it.
        
             | xiphias2 wrote:
             | Talking directly to the kernel / driver / firmware.
             | 
             | As others have said, George Hotz is doing his best in
             | reverse-engineering and skipping layers.
        
             | jsheard wrote:
              | Taking a peek inside the package, it seems to mostly be
              | the libraries - cuFFT alone is about 350MB, for example,
              | twice over for the debug and release versions. I'm
              | guessing those are probably fat binaries pre-compiled
              | for every generation of Nvidia hardware rather than just
              | the PTX bytecode, which would help to speed up fresh
              | builds at the expense of being huge.
        
             | maille wrote:
             | Raise your voice on their forum:
             | https://forums.developer.nvidia.com/t/how-to-overcome-the-
             | hu... Tried my luck 2 years ago but it keeps increasing.
        
             | phdelightful wrote:
             | Here's a blog that breaks down how large different pieces
             | of CUDA are:
             | 
             | https://carlpearson.net/post/20231023-cuda-releases/
        
           | gcr wrote:
            | I mean, to be fair, the 2.4GB CUDA SDK is absolutely
            | required for the cPython implementation as well.
        
           | dwroberts wrote:
           | A bunch of install methods for torch via pip include ~1.5GB
           | of lib/ because of CUDA. libtorch_cuda.so is like 800MB on
           | its own
        
           | fsloth wrote:
           | I don't think it's about the byte size, but the inherent
           | complexity of the implementation. 1000 lines of C code is
           | extremely simple by any standard. Whereas a sundry collection
           | of Python and PyTorch libraries is anything but.
        
       | brcmthrowaway wrote:
        | Very sad, should've used an agnostic framework instead of CUDA
        
         | jsheard wrote:
         | It's only ~1000 LoC, seems like a pretty good case study to
         | port over to other runtimes and show they can stand up to CUDA.
        
         | geph2021 wrote:
          | As far as I can tell, its optional dependency is OpenMP, not
          | CUDA. It doesn't seem directly dependent on CUDA.
        
           | gpderetta wrote:
            | Yes, a quick skim of the code only shows an OpenMP
            | dependency. The C/CUDA reference might have been meant as
            | C/OpenMP.
            | 
            | Although I wonder whether it would work well with GCC's
            | OpenMP offloading to PTX.
        
           | dlazaro wrote:
            | The plan is to eventually implement it in CUDA:
           | 
           | "Currently, I am working on [...] direct CUDA implementation,
           | which will be significantly faster and probably come close to
           | PyTorch."
        
         | robrenaud wrote:
         | Are there any strong LLMs trained without CUDA?
        
           | blackeyeblitzar wrote:
           | Yes, there are several. See this blog post from Databricks
           | describing the landscape of LLMs trained on AMD hardware for
           | example: https://www.databricks.com/blog/training-llms-scale-
           | amd-mi25...
           | 
           | The most interesting one IMO is OLMo from AI2, which is truly
           | open. You can read their blog post about it
           | (https://blog.allenai.org/hello-olmo-a-truly-open-
           | llm-43f7e73...) but basically it is open everything - they
           | released everything you need to reproduce their weights
           | (training data, training code, evaluation code, and weights)
           | with a friendly (Apache) license.
        
           | ZoomerCretin wrote:
           | Gemini and Gemma were trained on Google's TPUs.
        
         | exe34 wrote:
         | Looking forward to your patches!
        
       | andrewstuart wrote:
        | OT, but a question from someone curious... is CUDA still
        | entrenched as the only option for doing AI, or is there
        | growing support for AMD/Intel/other ways of doing AI?
        
         | BraverHeart wrote:
         | George Hotz is attempting to solve this:
         | https://github.com/tinygrad/tinygrad
        
           | ZoomerCretin wrote:
            | He loudly gave up on AMD after they left a blocker of his
            | unfixed for 5+ months and gave him the runaround the
            | entire time when he asked for the code so he could fix it
            | himself. He is still shipping the AMD tinybox with huge
            | warning labels.
        
         | WithinReason wrote:
         | There are some stirrings but don't hold your breath
        
         | blackeyeblitzar wrote:
         | See my comment on this here:
         | https://news.ycombinator.com/item?id=39973816
        
         | sigmoid10 wrote:
          | There are a few attempts here and there, in various stages
          | of progress. But right now, nothing matches Nvidia+CUDA in
          | speed and usability.
        
         | adam_arthur wrote:
          | You can run inference today on pretty much any card.
          | 
          | Download Ollama on a modern MacBook and you can run 13B and
          | even larger models (if your RAM allows) at fast speeds.
          | People run smaller models locally on their phones.
          | 
          | Google has trained their latest models on their own TPUs...
          | not using Nvidia, to my knowledge.
          | 
          | So, no, there are alternatives. CUDA has the largest
          | mindshare on the training side though.
        
         | taminka wrote:
          | There are obviously alternatives from both Intel and AMD,
          | with performant BLAS/DNN packages, but small teams don't use
          | them because CUDA is easier to use and has more support, and
          | larger teams don't use them because they have deals with
          | Nvidia, or not enough GPUs are available, or they're after
          | the absolute best performance (which is still Nvidia), or
          | because of other things like unstable drivers.
        
       | triyambakam wrote:
        | When Lex recently talked to Andrej, Andrej said that he gets
        | positively obsessed with a problem and says "this must exist".
        | I imagine this must be one of those outputs.
        
       | yinser wrote:
        | I've seen his nanoGPT implemented using JAX, and now we have
        | C/CUDA. I'd love to see whether nanoGPT could be done in Mojo.
        | I took a stab at a Mojo conversion of his WaveNet project
        | (from Andrej's Zero to Hero course) and I gotta say... Python
        | has so many nice features lol. Stating the obvious, I know,
        | but what you see done in 6 lines of Python takes so much more
        | work in other languages.
        
         | cb321 wrote:
          | For a prior generation of Karpathy-splaining there is this
          | Nim port: https://github.com/Vindaar/llama2nim - maybe of
          | interest if you are interested in Mojo.
        
       | blackeyeblitzar wrote:
       | It would be great if someone created a tutorial around this
       | explaining exactly how it works and how to do a test training
       | run. I'm aware it's not feasible to train a "real" model on
       | personal hardware but it would be nice to have a practical
       | learning experience. I'm not sure if there are good alternatives
       | for that.
        
       | qwertox wrote:
       | > direct CUDA implementation, which will be significantly faster
       | and probably come close to PyTorch.
       | 
        | It almost hurts to read that PyTorch is faster.
        | 
        | But then again, with these GPU RAM prices, let's see how much
        | it can speed things up on the CPU.
        | 
        | We really need SO-DIMM slots on the RTX series (or the
        | AMD/Intel equivalent) so that we can expand the RAM as we
        | need. Is there a technical obstacle to that?
        
         | LatticeAnimal wrote:
          | > We really need SO-DIMM slots on the RTX series (or the
          | AMD/Intel equivalent) so that we can expand the RAM as we
          | need. Is there a technical obstacle to that?
          | 
          | I imagine it would incur a non-trivial latency and cost
          | penalty. The memory modules are placed pretty close to the
          | compute die right now. Cooling would also have to change
          | (the memory modules produce a lot of heat).
          | 
          | But there is also no reason for any of the GPU manufacturers
          | to do this. A SKU with twice as much memory can go for a lot
          | more than the difference in memory cost alone.
        
           | SunlitCat wrote:
            | And especially doing "interesting" combinations of GPU and
            | memory: like a lower-end GPU with 16 GB of VRAM, but
            | offering just 8 or 12 GB of VRAM in the mid-range, and
            | then 16 GB again at the upper end of the GPU lineup.
        
         | jsheard wrote:
         | Memory speed is more or less directly proportional to how close
         | the memory is to the processor, with the fastest memory being
         | literally inside the processor (SRAM cache), followed by memory
         | on the same package as the processor (HBM GPUs, Apple
         | M-series), followed by soldered down discrete memory chips
         | (regular GPUs, games consoles), followed by socketed DIMMs in
         | distant last place. There's not really any getting around it,
         | the bandwidth that GPUs crave just isn't compatible with
         | modularity.
         | 
            | Even CPUs are starting to move their memory closer to the
            | cores in the name of performance: as mentioned, Apple is
            | already doing it, Intel is making Xeons with on-package
            | memory now, and they have a version aimed at consumers on
            | their roadmap.
        
           | wtallis wrote:
           | FYI, most discrete GPUs with discrete memory packages
           | soldered to the board near the GPU are running at
           | substantially higher memory frequencies than the on-package
           | DRAM in Apple's chips. But running GDDR at those speeds costs
           | a lot of power.
        
           | tverbeure wrote:
           | For data rates, as in bandwidth per IO pin, distance is
           | really only a secondary factor. HBM memory, for example, runs
           | at substantially lower data rates than GDDR, yet it sits
           | right next to the GPU die compared to centimeters for the
           | GDDR. And high-speed serial links run at speeds that are an
           | order of magnitude higher than even the internal register
           | files of a CPU.
        
         | tverbeure wrote:
         | Check out PCB back drilling. It's a process where you remove a
         | few hundred microns from the vias that are used to connect GDDR
         | RAMs to the GPUs, to avoid reflections due to the impedance
         | mismatch that's caused by the stub.
         | 
         | When you have a pulse coded signal traveling at close to 10GHz,
         | everything becomes an antenna. The technical problem is that
         | you can't do this with a flimsy connector like the ones used
         | for DIMMs. The reason GDDR can have a bandwidth per pin that is
         | 4 times higher than regular DDR is because they are soldered
         | down on the PCB.
        
       | patrick-fitz wrote:
       | https://twitter.com/karpathy/status/1777427944971083809
       | 
        | > And once this is in a bit more stable state: videos on
        | building this in more detail and from scratch.
       | 
       | Looking forward to watching the videos.
        
         | 0cf8612b2e1e wrote:
         | I love his videos. They are dense, but I get a lot out of them.
        
           | sghiassy wrote:
           | +100 thank you karpathy!
        
       | fori1to10 wrote:
       | It should be rewritten in Rust. (Just joking)
        
       | idkwhatimdoin wrote:
       | If I was starting from scratch, what resources should I start
       | with to build up an understanding of what this code does and how
       | to read it? It's quite dense and my knowledge of LLMs is quite
       | minimal. Are these terse variable names standard in LLM-land?
        
         | tayo42 wrote:
          | Check out his Zero to Hero series, which builds this with
          | Python and later PyTorch, and then probably his other mini
          | C-based projects.
        
       | flockonus wrote:
        | Question, apologies if slightly off-topic; it's something I'd
        | like to use this project for: is there an example of how to
        | train GPT-2 on time series, in particular with covariates?
        | 
        | As far as my understanding of LLMs goes, at a basic level it's
        | predicting the next token from previous tokens, which sounds
        | directionally similar to time series (perhaps leaving aside
        | periodicity).
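        | 
        | To make that concrete, the naive framing I have in mind is
        | something like the sketch below (the bin count and toy series
        | are made up, purely for illustration): quantize the series
        | into a small vocabulary of bins so forecasting becomes
        | next-token prediction, with covariates appended as extra
        | tokens per step.
        | 
        |     import numpy as np
        | 
        |     # toy univariate series; covariates would get their own
        |     # tokens interleaved per timestep
        |     series = np.sin(np.linspace(0, 20, 1024))
        |     series += 0.1 * np.random.randn(1024)
        | 
        |     n_bins = 256  # vocabulary size
        |     edges = np.quantile(series,
        |                         np.linspace(0, 1, n_bins + 1)[1:-1])
        |     tokens = np.digitize(series, edges)  # ids in [0, n_bins)
        | 
        |     # next-token training pairs, just like language modelling
        |     inputs, targets = tokens[:-1], tokens[1:]
        |     print(inputs[:8], targets[:8])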
        
         | teruakohatu wrote:
          | Yes, general LLM models can be used for time series
          | forecasting:
         | 
         | https://github.com/KimMeen/Time-LLM
        
       | andy99 wrote:
       | I'd like to think he took the name from my llm.f90 project
       | https://github.com/rbitr/llm.f90
       | 
       | It was originally based off of Karpathy's llama2.c but I renamed
       | it when I added support for other architectures.
       | 
        | Probably a coincidence :)
        
       ___________________________________________________________________
       (page generated 2024-04-08 23:00 UTC)