[HN Gopher] Llm.c - LLM training in simple, pure C/CUDA
       ___________________________________________________________________
        
       Llm.c - LLM training in simple, pure C/CUDA
        
       Author : tosh
       Score  : 956 points
       Date   : 2024-04-08 20:38 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | tosh wrote:
       | > LLM training in simple, pure C/CUDA. There is no need for 245MB
       | of PyTorch or 107MB of cPython
        
         | api wrote:
         | Python has been popular for this because it's convenient to
         | quickly hack on and experiment with, not because it's the most
         | efficient thing.
        
           | im3w1l wrote:
            | The overhead really isn't that bad, is it? The Python
            | code mostly just says "multiply matrix A with matrix
            | B", and the actual computation is done by optimized
            | low-level code.
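            | 
            | For a sense of scale, roughly what that one Python line
            | bottoms out in when a CBLAS library is linked (my
            | sketch, not NumPy/PyTorch's actual internals):
            | 
            |     #include <cblas.h>
            |     
            |     /* C = A * B for row-major MxK and KxN matrices.
            |        One library call does essentially all the work,
            |        so interpreter overhead is amortized over
            |        O(M*N*K) float operations. */
            |     void matmul(const float *A, const float *B,
            |                 float *C, int M, int N, int K) {
            |         cblas_sgemm(CblasRowMajor, CblasNoTrans,
            |                     CblasNoTrans, M, N, K,
            |                     1.0f, A, K, B, N,
            |                     0.0f, C, N);
            |     }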
        
             | jiggawatts wrote:
              | I suspect that this has a high chance of running afoul
              | of Amdahl's Law. Even if you can parallelise the bulk
              | of the computation, the serial parts remain single-
              | threaded and start to dominate the total runtime.
        
               | taneq wrote:
               | I don't think the serial parts of ML training are
               | Python's fault, are they? It's all "operation B depends
               | on the output of operation A".
        
             | bongodongobob wrote:
             | For that stuff, yeah you're correct.
             | 
              | What I've seen are issues with how those libraries are
              | used in a project.
             | 
              | I don't remember exactly, but I was playing with
              | someone's wrapper for some kind of machine learning
              | snake game, and it was taking way longer than it
              | should have by back-of-the-napkin math.
             | 
             | The issue was using either a dict or a list in a hot loop
             | and changing it to the other sped it up like 1000x.
             | 
              | So it's easy to think "yeah this library is
              | optimized", but then you build something on top of it
              | that slows it down in ways that aren't obvious.
             | 
             | But, that's the Python tradeoff.
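              | 
              | To illustrate the class of bug (a made-up C analogy,
              | not the actual code I was debugging): the difference
              | between an O(n) scan and an O(1) lookup inside a hot
              | loop.
              | 
              |     #include <stdbool.h>
              |     
              |     /* O(n): scan an array for membership. Call this
              |        millions of times per step and it dominates. */
              |     bool seen_slow(const int *items, int n, int x) {
              |         for (int i = 0; i < n; i++)
              |             if (items[i] == x) return true;
              |         return false;
              |     }
              |     
              |     /* O(1): a presence table indexed by value,
              |        assuming 0 <= x < table size. A hash table (or
              |        a Python dict) generalizes this. */
              |     bool seen_fast(const bool *table, int x) {
              |         return table[x];
              |     }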
        
               | 0cf8612b2e1e wrote:
               | That sounds irrelevant to Python and just a matter of
               | slow code cropping up in libraries until someone runs a
               | profiler.
        
               | littlestymaar wrote:
                | But then again, if your program has places where
                | choosing the right Python primitive matters for
                | performance, then using Python is hurting
                | performance there, since even the best algorithm in
                | Python will be slower than the equivalent C.
                | 
                | Most of the time it doesn't matter because there's
                | nothing hot on the Python side, but if there is,
                | then Python is going to be slowing your stuff down.
        
               | hcarvalhoalves wrote:
               | > The issue was using either a dict or a list in a hot
               | loop and changing it to the other sped it up like 1000x.
               | 
               | The programmer using the wrong data structure is not a
               | problem with the language.
        
               | CamperBob2 wrote:
               | It really is with Python. There are simply too many
               | containers and container-like concepts. Lists, arrays,
               | sets, dicts...
        
               | 0cf8612b2e1e wrote:
               | What modern language doesn't have those?
               | 
               | Go kind of cheats and has maps play double duty as sets.
        
               | bongodongobob wrote:
               | Kinda. I guess my native tongue is C/C++ and I wouldn't
               | expect such a huge performance difference when using an
               | array vs a linked list or something.
               | 
               | It's not like I had millions of items in that structure
               | either, it was like 100. I think it contained the batch
               | training data from each round. I tried to find the
               | project but couldn't.
               | 
               | I was just shocked that there was such a huge difference
               | between primitive data structures. In that situation, I
               | wouldn't have guessed it would make a difference.
        
             | llm_nerd wrote:
              | It depends on how you define overhead. Runtime
              | overhead and memory usage are absolutely marginal, and
              | the tightest, most perfect implementation will have
              | trouble beating them.
              | 
              | Instead people are trying to optimize the install size
              | of dependencies, which, while maybe a fun hacking
              | project... who really cares?
        
         | QuadmasterXLII wrote:
         | 107MB of cPython defeated
         | 
         | Go to try for self
         | 
         | Step 1 download 2.4GB of CUDA
        
           | simonw wrote:
           | The size of CUDA really is astonishing. Any chance someone
           | might figure out how to slim that down?
        
             | dartos wrote:
             | Nvidia is the only one who could, since they own it.
        
             | xiphias2 wrote:
             | Talking directly to the kernel / driver / firmware.
             | 
              | As others have said, George Hotz is doing his best to
              | reverse-engineer and skip layers.
        
             | jsheard wrote:
              | Taking a peek inside the package, it seems to mostly
              | be the libraries - cuFFT alone is about 350MB for
              | example, twice over for the debug and release
              | versions. I'm guessing those are probably fat binaries
              | pre-compiled for every generation of Nvidia hardware
              | rather than just the PTX bytecode, which would help to
              | speed up fresh builds, at the expense of being huge.
        
             | maille wrote:
             | Raise your voice on their forum:
             | https://forums.developer.nvidia.com/t/how-to-overcome-the-
             | hu... Tried my luck 2 years ago but it keeps increasing.
        
             | phdelightful wrote:
             | Here's a blog that breaks down how large different pieces
             | of CUDA are:
             | 
             | https://carlpearson.net/post/20231023-cuda-releases/
        
           | gcr wrote:
            | I mean, to be fair, the 2.4GB CUDA SDK is absolutely
            | required for the cPython implementation as well.
        
           | dwroberts wrote:
           | A bunch of install methods for torch via pip include ~1.5GB
           | of lib/ because of CUDA. libtorch_cuda.so is like 800MB on
           | its own
        
           | fsloth wrote:
           | I don't think it's about the byte size, but the inherent
           | complexity of the implementation. 1000 lines of C code is
           | extremely simple by any standard. Whereas a sundry collection
           | of Python and PyTorch libraries is anything but.
        
       | brcmthrowaway wrote:
        | Very sad, should've used an agnostic framework instead of
        | CUDA
        
         | jsheard wrote:
         | It's only ~1000 LoC, seems like a pretty good case study to
         | port over to other runtimes and show they can stand up to CUDA.
        
         | geph2021 wrote:
          | As far as I can tell, its optional dependency is OpenMP,
          | not CUDA. It doesn't seem directly dependent on CUDA.
        
           | gpderetta wrote:
            | Yes, a quick skim of the code only shows an OpenMP
            | dependency. The C/CUDA reference might have been meant
            | as C/OMP.
            | 
            | Although I wonder if it would work well with GCC's PTX
            | OMP offloading.
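            | 
            | For anyone curious what an "optional OpenMP dependency"
            | looks like in practice, it's just pragmas on the hot
            | loops (a minimal sketch of the pattern, not the repo's
            | actual code); without -fopenmp the pragma is ignored and
            | the loop simply runs serially:
            | 
            |     /* Parallelize over the batch dimension across CPU
            |        threads. */
            |     void add_bias(float *out, const float *bias,
            |                   int B, int C) {
            |         #pragma omp parallel for
            |         for (int b = 0; b < B; b++) {
            |             for (int c = 0; c < C; c++) {
            |                 out[b * C + c] += bias[c];
            |             }
            |         }
            |     }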
        
           | dlazaro wrote:
           | The plan is to eventually implement with CUDA:
           | 
           | "Currently, I am working on [...] direct CUDA implementation,
           | which will be significantly faster and probably come close to
           | PyTorch."
        
         | robrenaud wrote:
         | Are there any strong LLMs trained without CUDA?
        
           | blackeyeblitzar wrote:
           | Yes, there are several. See this blog post from Databricks
           | describing the landscape of LLMs trained on AMD hardware for
           | example: https://www.databricks.com/blog/training-llms-scale-
           | amd-mi25...
           | 
           | The most interesting one IMO is OLMo from AI2, which is truly
           | open. You can read their blog post about it
           | (https://blog.allenai.org/hello-olmo-a-truly-open-
           | llm-43f7e73...) but basically it is open everything - they
           | released everything you need to reproduce their weights
           | (training data, training code, evaluation code, and weights)
           | with a friendly (Apache) license.
        
           | ZoomerCretin wrote:
           | Gemini and Gemma were trained on Google's TPUs.
        
         | exe34 wrote:
         | Looking forward to your patches!
        
       | andrewstuart wrote:
        | OT but a question from someone curious..... is CUDA still
        | entrenched as the only option for doing AI, or is there
        | growing support for AMD/Intel/other ways of doing AI?
        
         | BraverHeart wrote:
         | George Hotz is attempting to solve this:
         | https://github.com/tinygrad/tinygrad
        
           | ZoomerCretin wrote:
            | He loudly gave up on AMD after they did not fix a
            | blocker of his for 5+ months and gave him the runaround
            | the entire time when he asked for the code so he could
            | fix it himself. He is still shipping the AMD tinybox,
            | with huge warning labels.
        
             | Art9681 wrote:
              | Didn't they recently announce that everything was open
              | sourced? Would be cool if he took another look at it
              | once all of the source code is available (if not
              | already).
        
               | xjay wrote:
               | > They haven't open sourced anything. They posted a
               | tweet. [1]
               | 
               | [1@2024-04-06]
               | https://www.youtube.com/watch?v=j7MRj4N2Cyk&t=429s
               | 
               | [Twitch] https://twitch.tv/georgehotz
        
             | magicalhippo wrote:
              | Randomly stumbled over this[1] post from another fed-
              | up open source contributor, due to several serious
              | issues with AMD's GPU drivers and firmware that have
              | remained unresolved for years. It also references the
              | geohot decision you mention.
             | 
             | Some quotes:
             | 
             |  _I find it incredible that these companies that have large
             | support contracts with you and have invested hundreds of
             | thousands of dollars into your products, have been forced
             | to turn to me, a mostly unknown self-employed hacker with
             | very limited resources to try to work around these bugs
             | (design faults?) in your hardware._
             | 
             |  _In the VFIO space we no longer recommend AMD GPUs at all,
             | in every instance where people ask for which GPU to use for
             | their new build, the advise is to use NVidia._
             | 
             | [1]: https://www.reddit.com/r/Amd/comments/1bsjm5a/letter_t
             | o_amd_...
        
           | fwip wrote:
           | I'm not sure if he's "attempting to solve it" so much as he's
           | looking for yet another way to keep himself famous.
           | 
           | The guy did one good jailbreak for the iPhone, and as near as
           | I can tell, the rest of his work has been a lot of boasting,
           | half-assed hyped-up implementations (e.g: his self-driving
           | car), and trying to befriend other powerful people in tech
           | (see: his promise to single-handedly fix Musk's Twitter). He
           | might be a smart dude, but he vastly overrates his own
            | accomplishments, and doesn't finish nearly anything he
            | starts.
        
         | WithinReason wrote:
         | There are some stirrings but don't hold your breath
        
         | blackeyeblitzar wrote:
         | See my comment on this here:
         | https://news.ycombinator.com/item?id=39973816
        
         | sigmoid10 wrote:
         | There are a few attempts here and there in various stages of
         | progression. But right now, nothing matches Nvidia+CUDA in
         | speed and usability.
        
         | adam_arthur wrote:
         | You can run inference today on pretty much any card.
         | 
          | Download Ollama on a modern MacBook and you can run 13B
          | and even larger models (if your RAM allows) at fast
          | speeds. People run smaller models locally on their phones.
         | 
         | Google has trained their latest models on their own TPUs... not
         | using Nvidia to my knowledge.
         | 
         | So, no, there are alternatives. CUDA has the largest mindshare
         | on the training side though.
        
         | taminka wrote:
         | there are obv alternatives from both intel and amd, performant
         | blas/dnn packages, but small teams don't use them bc cuda is
         | easier to use and has more support, and larger teams don't use
         | them bc they have deals w/ nvidia or not enough GPUs are
         | available or they're after the absolute best performance (which
         | is still nvidia) or bc of other stuff like unstable drivers or
         | smth
        
         | towelpluswater wrote:
          | Modular's Mojo is the best-funded effort, full of
          | respectable players, for making an alternative possible.
        
           | pavelstoev wrote:
           | Check out Hidet [1]. Not as well funded, but delivers Python
           | based ML acceleration with GPU support (unlike Mojo).
           | 
           | [1] https://github.com/hidet-org/hidet
        
       | triyambakam wrote:
        | When Lex recently talked to Andrej, Andrej said that he gets
        | positively obsessed with a problem and says "this must
        | exist". I imagine this must be one of those outputs.
        
       | yinser wrote:
        | I've seen his nanoGPT implemented using JAX, now we have
        | C/CUDA. I'd love to see if nanoGPT could be doable in Mojo.
        | I took a stab at a Mojo conversion of his WaveNet project
        | (from Andrej's Zero to Hero course) and I gotta say...
        | Python has so many nice features lol. Stating the obvious I
        | know, but what you see done in 6 lines of Python takes so
        | much more work in other languages.
        
         | cb321 wrote:
          | For a prior generation of karpathy-splaining there is this
          | Nim port: https://github.com/Vindaar/llama2nim - maybe of
          | interest if you are interested in Mojo.
        
           | yinser wrote:
           | Thank you!
        
         | pavelstoev wrote:
            | How do you support GPU data parallelism, and all the
            | benefits it brings, in Mojo?
        
           | KeplerBoy wrote:
           | You don't. Mojo doesn't support GPUs at the moment, which
           | says a lot about a language which claims to be AI first.
        
             | pjmlp wrote:
              | They only made Mojo available outside the preview
              | circle a couple of months ago, and it is yet to run on
              | researchers' Windows laptops.
              | 
              | I love the attitude of considering 0.x languages
              | production-ready for all imaginable kinds of
              | workloads.
        
             | yinser wrote:
              | If you want CUDA up front, go write PyTorch. No one is
              | stopping you. Modular's goal was to leverage MLIR
              | first and bring GPUs in later. They're barely a year-
              | old company.
        
         | auraham wrote:
         | Where is the GPT implementation in JAX? I only found this [1]
         | in PyTorch and NumPy.
         | 
         | [1] https://github.com/karpathy/nanoGPT
        
       | blackeyeblitzar wrote:
       | It would be great if someone created a tutorial around this
       | explaining exactly how it works and how to do a test training
       | run. I'm aware it's not feasible to train a "real" model on
       | personal hardware but it would be nice to have a practical
       | learning experience. I'm not sure if there are good alternatives
       | for that.
        
         | vineyardmike wrote:
         | The author has a whole series where he does exactly that.
         | YouTube videos, code examples, documentation, everything.
         | Explains the math, explains how to code it, explains the
         | architecture. Everything.
        
         | karpathy wrote:
         | I wrote this, which might be a bit helpful:
         | https://github.com/karpathy/llm.c/blob/master/doc/layernorm/...
         | 
         | But if you don't have the background, I'd recommend my YouTube
         | videos, see the Zero To Hero playlist:
         | https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...
        
           | blackeyeblitzar wrote:
           | Thank you so much for responding. I will definitely check
           | these out and also pass it on to others who might be
           | interested.
        
           | MAMAMassakali wrote:
           | Thank you so much for the Zero To Hero playlist!
        
       | qwertox wrote:
       | > direct CUDA implementation, which will be significantly faster
       | and probably come close to PyTorch.
       | 
        | It almost hurts to read that PyTorch is faster.
        | 
        | But then again, with these GPU-RAM prices, let's see how
        | fast it runs on the CPU.
        | 
        | We really need SO-DIMM slots on the RTX series (or the
        | AMD/Intel equivalent) so that we can expand the RAM as
        | needed. Is there a technical problem with that?
        
         | LatticeAnimal wrote:
          | > We really need SO-DIMM slots on the RTX series (or the
          | AMD/Intel equivalent) so that we can expand the RAM as
          | needed. Is there a technical problem with that?
          | 
          | I imagine it would incur a non-trivial latency and cost
          | penalty. The memory modules are placed pretty close to the
          | compute die right now. Cooling would also have to change
          | (the memory modules produce a lot of heat).
          | 
          | But there is also no reason for any of the GPU
          | manufacturers to do this. A SKU with twice as much memory
          | can go for a lot more than the difference in memory cost
          | alone.
        
           | SunlitCat wrote:
            | And especially doing "interesting" combinations of GPU
            | and memory.
            | 
            | Like a lower-end GPU with 16 GB of VRAM, but just 8 or
            | 12 GB in the middle class, and then again 16 GB in the
            | upper class of the GPU lineup.
        
           | ItsBob wrote:
           | I don't disagree but (I know nothing about this btw...) would
           | it not benefit in terms of, say, a L3 cache kind of thing?
           | 
           | Imagine you could stick 2 x 64GB DDR5 DIMMS on the GPU in
           | sockets, would that not be faster to access than the
           | motherboard DIMMS? It won't be as fast as on-die memory of
           | course but could it not act like a sort of halfway house?
        
         | jsheard wrote:
         | Memory speed is more or less directly proportional to how close
         | the memory is to the processor, with the fastest memory being
         | literally inside the processor (SRAM cache), followed by memory
         | on the same package as the processor (HBM GPUs, Apple
         | M-series), followed by soldered down discrete memory chips
         | (regular GPUs, games consoles), followed by socketed DIMMs in
         | distant last place. There's not really any getting around it,
         | the bandwidth that GPUs crave just isn't compatible with
         | modularity.
         | 
         | Even CPUs are starting to move their memory closer to the core
         | in the name of performance, as mentioned Apple is already doing
         | it, Intel is making Xeons with on-chip memory now, and they
         | have a version aimed at consumers on their roadmap.
        
           | wtallis wrote:
           | FYI, most discrete GPUs with discrete memory packages
           | soldered to the board near the GPU are running at
           | substantially higher memory frequencies than the on-package
           | DRAM in Apple's chips. But running GDDR at those speeds costs
           | a lot of power.
        
             | osigurdson wrote:
              | I watched a presentation on this today. The presenter
              | focused on the soldering and proximity as well. Is
              | this really the only difference, or is this
              | transistor-based memory (like L1, L2, etc.)? I get the
              | proximity factor of course (the 1ft/ns EE rule of
              | thumb). In any case, soldering and proximity don't
              | seem like breakthrough innovations (but maybe I am
              | wrong).
        
               | MobiusHorizons wrote:
                | GPU RAM is typically GDDR6 or GDDR6X, which is a
                | different standard from the chips used for DDR5, for
                | example. GPUs have terrible latency to RAM, but
                | enormous throughput, and I assume the chips are
                | internally optimized for that. Many aspects of a
                | design change when you choose different latency or
                | clock-speed targets, translating into different
                | power/area calculations.
        
           | tverbeure wrote:
           | For data rates, as in bandwidth per IO pin, distance is
           | really only a secondary factor. HBM memory, for example, runs
           | at substantially lower data rates than GDDR, yet it sits
           | right next to the GPU die compared to centimeters for the
           | GDDR. And high-speed serial links run at speeds that are an
           | order of magnitude higher than even the internal register
           | files of a CPU.
        
           | viraptor wrote:
            | It's true that it has an impact, but I think there's
            | still space available for "slightly slower with 2x
            | memory" models. For many local uses, new cards are way
            | past the "fast enough" line, but having 64GB on them
            | would be really beneficial.
            | 
            | I'd love to see some experiments / different SKUs in
            | this area, given people are already DIY-ing extra memory
            | onto NVIDIA cards.
            | (https://hackaday.com/2021/01/29/add-an-extra-8gb-of-
            | vram-to-... there were stable experiments later on, but
            | I don't have a link now)
        
             | schneehertz wrote:
                | Graphics card manufacturers believe that selling
                | high-memory consumer graphics cards would eat into
                | the market for commercial compute cards, so they
                | won't do it, that's all.
        
               | airspresso wrote:
               | Nice room for a new player to disrupt then
        
               | einsteinx2 wrote:
               | Problem is, making a board design using an existing GPU
               | chip and sticking more RAM into it is (relatively) simple
               | but of course none of the GPU chip makers would allow
               | partners to do that. Making your own GPU chip that's
               | competitive with Nvidia or AMD's current offerings is a
               | massive undertaking and pretty much impossible for a
               | newcomer.
               | 
               | Just look at how much trouble Intel has had breaking into
               | the discrete GPU market or even just how hard it's been
               | for AMD to compete with Nvidia even with decades of
               | experience in the market.
               | 
                | And if some newcomer _could_ make a competitive GPU
                | with large memory capacity, they'd be crazy not to
                | sell it at datacenter prices, maybe just undercutting
                | the others by a few grand but still way more
                | expensive than any consumer GPU you can buy today,
                | even a 4090.
        
               | tlb wrote:
               | It's not doable at the board level. High-end GPUs use HBM
               | in the same package, connected using a silicon
               | interposer.
        
         | tverbeure wrote:
         | Check out PCB back drilling. It's a process where you remove a
         | few hundred microns from the vias that are used to connect GDDR
         | RAMs to the GPUs, to avoid reflections due to the impedance
         | mismatch that's caused by the stub.
         | 
         | When you have a pulse coded signal traveling at close to 10GHz,
         | everything becomes an antenna. The technical problem is that
         | you can't do this with a flimsy connector like the ones used
         | for DIMMs. The reason GDDR can have a bandwidth per pin that is
         | 4 times higher than regular DDR is because they are soldered
         | down on the PCB.
        
         | hahnchen wrote:
         | > It almost hurts, to read that PyTorch is faster.
         | 
         | Why?
        
         | theGeatZhopa wrote:
         | NVIDIA hates that trick.
        
       | patrick-fitz wrote:
       | https://twitter.com/karpathy/status/1777427944971083809
       | 
        | > And once this is in a bit more stable state: videos on
       | building this in more detail and from scratch.
       | 
       | Looking forward to watching the videos.
        
         | 0cf8612b2e1e wrote:
         | I love his videos. They are dense, but I get a lot out of them.
        
           | sghiassy wrote:
           | +100 thank you karpathy!
        
       | fori1to10 wrote:
       | It should be rewritten in Rust. (Just joking)
        
         | eclectic29 wrote:
         | Sshh! I asked why it was written in C and got flagged.
        
         | ddggdd wrote:
          | I just pasted the code into Claude and I'm reading the
          | converted Rust now; it definitely needs extra work.
        
         | naruhodo wrote:
         | I think you just cracked AI safety.
        
       | idkwhatimdoin wrote:
       | If I was starting from scratch, what resources should I start
       | with to build up an understanding of what this code does and how
       | to read it? It's quite dense and my knowledge of LLMs is quite
       | minimal. Are these terse variable names standard in LLM-land?
        
         | tayo42 wrote:
          | Check out his Zero to Hero series, which builds this with
          | Python and later PyTorch, then probably his other mini
          | C-based projects.
        
         | vineyardmike wrote:
         | Terse variables are a C thing.
         | 
         | "What resources would I need" -> you're literally commenting on
         | a teachers content. Karpathy (the author) has a very
         | informative YouTube channel where he goes step by step through
         | everything. He has a ton of repos and tutorials. Dig a little.
         | 
         | If all else fails... Google it.
        
           | viraptor wrote:
           | > Terse variables are a C thing.
           | 
           | They're a math / toy code thing. Large C projects have long
           | descriptive names just like other languages.
        
           | idkwhatimdoin wrote:
            | > you're literally commenting on a teacher's content.
           | 
           | How am I supposed to know that?
           | 
           | > Karpathy (the author) has a very informative YouTube
           | channel where he goes step by step through everything.
           | 
           | Or that, without knowing that he's a teacher?
           | 
           | > Terse variables are a C thing.
           | 
           | I didn't realize variables had to be so short in C. Glad I
           | write C++ professionally where they've added support for
           | longer variable names.
           | 
           | > If all else fails... Google it.
           | 
           | There's a lot of LLM garbage out there. I got an answer here
           | in a few minutes pointing to Karpathy's course which seems
           | very high quality.
           | 
           | Be kinder.
        
             | vineyardmike wrote:
             | > How am I supposed to know that?
             | 
             | You're not supposed to know that. You asked a question, and
             | this is you being told the answer.
             | 
             | It's very convenient that the author of the post is quite
             | literally the world's most prolific teacher on this topic.
             | Makes it easy to find Karpathy. You shouldn't be expected
             | to otherwise know that (or else why ask if you knew).
             | 
             | > I didn't realize variables had to be so short in C. Glad
             | I write C++ professionally where they've added support for
             | longer variable names.
             | 
              | This feels like a joke, but old C compilers did have
              | limits on identifier length. This is part of why C
              | historically had shorter variable names than other,
              | more modern languages.
             | 
             | Sorry if it came off rude, the internet is hard to
             | communicate over.
             | 
             | https://publications.gbdirect.co.uk/c_book/chapter2/keyword
             | s...
        
         | satokema wrote:
         | As siblings have said, his video series are quite good. But if
         | you're just looking at this repo only, you probably want to
         | look at the python reference implementation. (The C is designed
         | to exactly replicate its functionality.)
        
       | flockonus wrote:
        | Question, apologies if slightly off-topic, it's something
        | I'd like to use this project for: is there an example of how
        | to train GPT-2 on time series, in particular with
        | covariates?
        | 
        | As far as my understanding of LLMs goes, at a basic level
        | they predict the next token from previous tokens, which
        | sounds directionally similar to time series (perhaps leaving
        | aside periodicity).
        
         | teruakohatu wrote:
         | Yes general LLM models can be used for time series forecasting:
         | 
         | https://github.com/KimMeen/Time-LLM
        
         | EricLeer wrote:
          | Yes, there are many attempts at applying transformers to
          | time-series forecasting. For instance (but there are many
          | more): TimeGPT (https://arxiv.org/abs/2310.03589) and
          | Chronos
          | (https://github.com/amazon-science/chronos-forecasting).
          | 
          | These kinds of papers often talk a big game, but often
          | lack a proper baseline model. They only compare against
          | very simple (naive forecast) or untuned models. In my
          | experience a gradient boosting model will probably solve
          | 95% of your forecasting problems, and trying to get fancy
          | with a transformer (or even just a simple neural net) is
          | more trouble than it is worth.
        
       | andy99 wrote:
       | I'd like to think he took the name from my llm.f90 project
       | https://github.com/rbitr/llm.f90
       | 
       | It was originally based off of Karpathy's llama2.c but I renamed
       | it when I added support for other architectures.
       | 
        | Probably a coincidence :)
        
         | matteogrella wrote:
         | I'm the creator behind https://github.com/nlpodyssey/rwkv.f90.
         | How about joining forces?
        
           | andy99 wrote:
           | I'll send you an email
        
         | bee_rider wrote:
         | In f90? That's pretty cool.
         | 
          | On a related note, IMO it would be pretty cool if we could
          | get an LLM implementation that provides an RCI interface
          | like all the old computational codes used to.
        
       | tehsauce wrote:
        | Another awesome project! Note that as of this moment the
        | CUDA part is aspirational: there is no GPU code in the repo
        | yet.
        
       | richrichie wrote:
       | See, C does it very well. Great stuff. Karpathy has a gift for
       | teaching.
        
       | osigurdson wrote:
       | Kind of amazing that something that can be expressed in ~1000
       | lines of code has completely turned the world on its head.
        
         | daniel_reetz wrote:
         | Echoes of DeCSS ;)
        
         | KeplerBoy wrote:
         | Which important concept or algorithm can't be expressed in
         | <=1000 lines? Seems like a pretty common theme among
         | groundbreaking ideas.
        
           | Y_Y wrote:
           | That's a good question. Unfortunately I think you're asking
           | to compute the Kolmogorov complexity of every interesting
           | concept we have that doesn't yet have an implementation less
           | than n=1000 lines, which is equivalent to the halting problem
           | (modulo unbounded memory).
           | 
            | If you could exhaustively list all the interesting
            | algorithms (hard but feasible) you could potentially
            | prove a lower bound for each one's complexity by writing
            | a shorter-than-n implementation (hard, probably
            | infeasible) and show positively that GP's proposition
            | isn't true. On the other hand, showing that it was true
            | would require either some very clever proof which can't
            | apply to all programs, but somehow only these
            | interesting ones (very likely impossible), or
            | enumerating all C^n programs, where C is the number of
            | possible lines (something like 64^80), and showing that
            | none of them implements at least one of the interesting
            | algorithms (absurdly impossible).
        
             | nextaccountic wrote:
             | You are right but I think that there's a more interesting
             | question: do humans stumble upon those large
             | interesting/great algorithms in practice?
             | 
              | The key point here is that we are looking at
              | algorithms already discovered in human history rather
              | than enumerating all possible interesting algorithms.
              | Of course there are interesting algorithms that are
              | very large, but humans don't discover them in
              | practice. If you look up a list of the greatest
              | algorithms in history, they will be rather small in
              | length. Many of them can be sketched on a whiteboard.
             | 
             | I think that what is happening here is that our minds just
             | can't hold billions of concepts at once. So if you have an
             | algorithm with billions of things, it was most likely
             | produced by a machine. Handcrafted things, on the other
              | hand, are smaller in comparison.
             | 
              | Another thing is that our minds like conceptual
              | simplicity and view simplicity as a kind of beauty. So
              | if we have a great algorithm but it is too large, we
              | look for ways to express it succinctly (the right
              | abstractions can help with that, and also help with
              | understanding the algorithm better). We end up
              | succeeding because the algorithms themselves had low
              | Kolmogorov complexity (and thus, if they are too
              | large, they probably can be further compressed).
        
           | magnat wrote:
            | Most modern A/V codecs won't fit in that limit, by
            | several orders of magnitude.
            | 
            | Even a standard-compliant JPEG decoder would be hard to
            | squeeze in without some serious code golfing. Discarding
            | some barely-used features gets you close to that limit,
            | though [1].
            | 
            | The smallest popular TCP/IP stack [2] is ~20 kLoC.
           | 
           | [1] https://github.com/richgel999/picojpeg
           | 
           | [2] https://savannah.nongnu.org/projects/lwip/
        
             | epr wrote:
             | A JPEG decoder or TCP stack are very clearly not individual
             | concepts though. There's obviously some subjectivity as to
             | what constitutes a single "concept" or "algorithm", but I'm
             | not sure either of those two examples are in a gray area.
             | 
             | A single concept might be implementing just ARP or a
             | discrete cosine transform. If you wanted to do a full TCP
             | stack or JPEG decoder, that would make a lot more sense
             | after building their internal components one by one.
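              | 
              | For instance, a whole 1-D DCT-II (the core transform
              | in JPEG) fits in a dozen lines in its naive O(N^2)
              | form (my sketch; real codecs use a fast factored
              | version):
              | 
              |     #include <math.h>
              |     
              |     /* DCT-II:
              |        X[k] = sum_n x[n]*cos(pi/N*(n + 0.5)*k) */
              |     void dct2(const double *x, double *X, int N) {
              |         for (int k = 0; k < N; k++) {
              |             double sum = 0.0;
              |             for (int n = 0; n < N; n++)
              |                 sum += x[n] * cos(M_PI / N *
              |                                   (n + 0.5) * k);
              |             X[k] = sum;
              |         }
              |     }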
        
             | naasking wrote:
             | JPEGDEC seems to be about 500 lines:
             | https://github.com/bitbank2/JPEGDEC
        
               | jiripospisil wrote:
                | More like 5000:
                | https://github.com/bitbank2/JPEGDEC/blob/master/src/jpeg.inl
        
         | datascienced wrote:
          | Err... and the exabytes of training data
        
           | rnewme wrote:
           | Ah, not really
        
             | datascienced wrote:
             | "Here's one I trained earlier"?
        
         | holoduke wrote:
          | Speed of hardware did. Back in the 80s they already knew
          | the principles of LLM training. It only took one week to
          | train 10,000 tokens.
        
           | toxik wrote:
           | Got a reference on that claim?
        
       | mrbonner wrote:
        | Wow, and this is done after a recent trip to Bhutan to clear
        | his head! I follow karpathy on Twitter, and he posted that 2
        | weeks without constantly checking his phone kind of turned
        | off the always-on radio in his head.
        
       | convexstrictly wrote:
       | Candle is a minimalist ML framework for Rust with a focus on
       | performance (including GPU support) and ease of use
       | 
       | https://github.com/huggingface/candle
        
         | imjonse wrote:
         | Candle focuses on inference though.
        
           | l-m-z wrote:
            | Candle dev here, we also support training/backprop! We
            | certainly focus on optimizing inference performance, but
            | hopefully that should improve the training efficiency
            | too.
        
           | revskill wrote:
              | What is inferencing?
        
             | HarHarVeryFunny wrote:
             | Inference means using the neural net, as opposed to
             | training it.
             | 
             | During inference you feed an input into the NN and it
             | passes through it in "forwards" direction (i.e. from input
             | to output), being modified according to the "weights" that
             | were learnt during training, to derive the output.
             | 
             | During training, each training sample is first fed forwards
             | through the NN, the same way as for inference, but then the
             | output of the model (which at the beginning of training
             | will be random/wrong) is compared to the correct/desired
             | output for that training sample, and a corresponding error
             | value will then be fed backwards (from output to input)
             | through the NN according to the "backpropagation" mechanism
             | to update the weights.
             | 
             | Training is a lot more involved than inference since it
             | involves this backpropagation step.
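              | 
              | In toy form, for a one-weight "network" y = w*x with a
              | squared-error loss (my sketch, just to show the
              | shape): inference is only the first line of the body,
              | training is all of it.
              | 
              |     /* Returns the updated weight after one training
              |        step on a single (x, target) example. */
              |     float train_step(float w, float x,
              |                      float target, float lr) {
              |         float y  = w * x;               /* forward =
              |                                            inference */
              |         float dy = 2.0f * (y - target); /* dLoss/dy */
              |         float dw = dy * x;              /* chain rule:
              |                                            dLoss/dw */
              |         return w - lr * dw;             /* SGD step */
              |     }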
        
         | basbuller wrote:
          | Not nearly as minimal as Karpathy's implementation
        
         | jeroenvlek wrote:
         | Love Candle! I actually ported Karpathy's previous GPT tutorial
         | to candle, including training [0]
         | 
         | [0] https://www.perceptivebits.com/building-gpt-from-scratch-
         | in-...
        
         | 0xfedbee wrote:
          | I wouldn't call it "minimalist" after seeing Karpathy's
          | code.
        
       | rurban wrote:
        | https://github.com/robjinman/Richard uses Vulkan, and is
        | thus portable across GPUs and much faster. It also has more
        | kernels. In simple C++.
        
         | flohofwoe wrote:
         | Or rather GLSL... The C++ code looks like it's mostly just
         | scaffolding to kick off the actually important GPU work, and
         | for that it's a surprising amount of code. Quite typical both
         | for Vulkan and C++ though ;)
        
       | robot wrote:
       | very cool, also the coding style looks good.
        
       | antirez wrote:
       | Is this able to replace PyTorch, ... in normal practice? No.
       | 
       | Does this show that in general the most used ML frameworks are a
       | mess? Yes.
        
         | bootsmann wrote:
          | This is a bit of an apples-and-oranges comparison. PyTorch
          | is a research framework, not a transformer inference
          | library.
        
           | antirez wrote:
            | This post is about _training_, not inference. And
            | llama.cpp has similarly simple LoRA training code. There
            | is nothing in neural networks themselves so complex as
            | to justify the amount of complexity the Python-ML
            | community has piled up. MLX, for instance, is a
            | similarly general-purpose research framework that is a
            | fraction of the size.
        
             | HarHarVeryFunny wrote:
              | Sure, neural networks in and of themselves are
              | conceptually simple and not difficult to code. Andrew
              | Ng's original Coursera class is all you need to go
              | from zero knowledge to building MATLAB-based neural
              | nets in this same hard-coded style.
             | 
             | However, there is a huge difference in functionality (hence
             | complexity) in a framework such as PyTorch vs hardcoding a
             | single NN. It's a bit like the difference between writing a
             | toy compiler in CompSci class vs a production one that
             | supports optimization, multiple targets, etc, etc.
             | 
             | The first step in convenience beyond hardcoding models, was
             | frameworks like the original Torch, and original
             | TensorFlow. Those frameworks let you explicitly assemble a
             | neural net out of modular "lego blocks" (tensor
             | operations), then just call model.forward() or
             | model.backward() - no need to yourself write the forwards
             | and backwards functions.
             | 
              | What PyTorch (successor to Torch) did was increase the
              | complexity of the framework but bring massive ease of
              | use to the developer, by getting rid of the explicit
              | lego-block assembly process and instead letting the
              | developer just write arbitrary Python code
              | corresponding to what they want the model to do;
              | PyTorch itself then builds the model internally and is
              | therefore able to infer the backward function. This
              | extra functionality/ease-of-use, with corresponding
              | internal complexity, is what differentiated PyTorch
              | from TensorFlow, made it so successful, and caused
              | most developers to switch to it.
             | 
             | There is also a lot of other functionality in PyTorch that
             | adds to the complexity - supporting multiple back ends,
             | custom CUDA/etc kernels beyond what is provided by cuDNN,
             | etc, etc.
        
               | antirez wrote:
                | I know all these things. Again: look at MLX.
        
         | HarHarVeryFunny wrote:
         | > Does this show that in general the most used ML frameworks
         | are a mess? Yes.
         | 
         | Not really ... there is little to no overlap with what a
         | framework like PyTorch does. There is no tensor class, no
          | autograd, etc. Just malloc, a bunch of hand-calculated
          | pointers into that chunk of memory, and hand-written
          | gradient functions.
         | I assume the intent here is to be educational by stripping away
         | the layers of abstraction to make it clearer what is going on.
         | 
         | Frankly though, this code (all that pointer math!) is a mess
         | too, maybe written this way to make it easy to port to cuDNN
         | which is at a similarly low level (other than having tensor
         | descriptors which make the memory layout more flexible).
         | 
         | If you want to write your own tensor class and reusable NN
         | framework, then the lines of code go up very rapidly. I did one
         | in C++ a while back, and the tensor class alone was 20K LOC.
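          | 
          | To make "hand-calculated pointers" concrete, the style is
          | roughly this (my sketch of the pattern, not llm.c's actual
          | code):
          | 
          |     #include <stdlib.h>
          |     
          |     /* One big allocation; every tensor is carved out of
          |        it at a hand-computed offset. Simple and fast, but
          |        any layout change means re-deriving offsets. */
          |     typedef struct {
          |         float *memory;
          |         float *wte;  /* token embeddings, (V, C) */
          |         float *wpe;  /* position embeddings, (T, C) */
          |     } Params;
          |     
          |     Params alloc_params(int V, int T, int C) {
          |         Params p;
          |         size_t n = (size_t)V * C + (size_t)T * C;
          |         p.memory = malloc(n * sizeof(float));
          |         p.wte = p.memory;            /* first V*C floats */
          |         p.wpe = p.memory
          |                 + (size_t)V * C;     /* next T*C floats */
          |         return p;
          |     }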
        
       | milansuk wrote:
        | This is an implementation of a transformer, and in the
        | README it's presented as text->text. Tokens are just
        | integers going in and out.
        | 
        | Is it possible to use it to train other types of LLMs
        | (text->image, image->text, speech->text, etc.)?
        
         | bootsmann wrote:
         | The transformer itself just takes arrays of numbers and turns
         | them into arrays of numbers. What you are interested in is the
         | process that happens before and after the transformer.
        
         | _giorgio_ wrote:
         | Yes, anything can be an input token.
         | 
          | Patch of pixels ---> token
          | Fragment of input audio ---> token
          | etc.
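          | 
          | Inside the model, a token id is just an index into the
          | embedding table, so anything you can map to an integer can
          | go in. A sketch of the lookup (mine, in the spirit of the
          | llm.c style):
          | 
          |     #include <stddef.h>
          |     
          |     /* Each token id selects one length-C row of the
          |        (V, C) embedding matrix; the model never sees text,
          |        pixels, or audio, only these rows. */
          |     void embed(float *out, const int *tokens,
          |                const float *wte, int T, int C) {
          |         for (int t = 0; t < T; t++) {
          |             const float *row = wte
          |                 + (size_t)tokens[t] * C;
          |             for (int c = 0; c < C; c++)
          |                 out[t * C + c] = row[c];
          |         }
          |     }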
        
       | zzbn00 wrote:
       | Very nice.
       | 
       | In my experience much of the complexity of numerical software is
       | to enable the search for the algorithm that works well with the
       | problem/data you have. Once you know the exact algorithm you
       | want, it is possible to make a nice clean minimalistic
       | implementation, but that does not mean such an implementation
       | would have been easy at the beginning.
        
       | davedx wrote:
       | On one hand, really nice to see the whole thing in 1000 lines of
       | C code.
       | 
        | On the other hand, that malloc function low-key terrifies
        | me. :)
        
         | flohofwoe wrote:
          | Better to be explicit than to hide unsafe memory accesses
          | under C++ stdlib classes like std::vector, which doesn't
          | do range checking in operator[] either. And in this sort
          | of code, automatically injected runtime range checks would
          | most likely hurt performance enough to matter.
         | 
         | I would still run the code through the Clang static analyzer
         | and a couple of test runs in ASAN and UBSAN to be sure that
         | nothing slipped through.
        
       | sirsinsalot wrote:
        | Karpathy's code, teaching, and contributions to the body of
        | knowledge in this area really are admirable.
       | 
       | Sadly I am a generalist, but if I were a specialist, I would hope
       | to contribute as openly and widely as Karpathy.
       | 
       | Not clout chasing, click-bait, "top 5 javascript frameworks of
       | 2023!" ... just high quality output that marks a specialist.
       | 
       | Sorry to gush.
        
       | lubesGordi wrote:
        | Quick question: is this just pure C code that can be loaded
        | onto an Nvidia GPU and run (via the Python code)? I scanned
        | the C and didn't see anything CUDA-related (maybe I missed
        | something, I'm not a GPU programmer!). K mentions something
        | about a direct CUDA implementation coming soon; how would
        | that be different from what this is?
        
         | whb07 wrote:
          | It's not. If you look at his X account, he talks about his
          | ongoing work adding the CUDA parts.
        
       | waynecochran wrote:
       | Fantastic -- gotta love Andrej. I am sick of the ball and chain
       | that is Python and all of its environment dependencies. It is
       | nice to shed all the weight and get down to the metal.
        
         | toxik wrote:
         | Yeah, as long as you don't want to change the network
         | architecture.
         | 
          | Edit: and you trust that Andrej didn't screw up anywhere
          | while hand-rolling all the gradient calculations.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:01 UTC)