[HN Gopher] Llm.c - LLM training in simple, pure C/CUDA
___________________________________________________________________
Llm.c - LLM training in simple, pure C/CUDA
Author : tosh
Score : 956 points
Date : 2024-04-08 20:38 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| tosh wrote:
| > LLM training in simple, pure C/CUDA. There is no need for 245MB
| of PyTorch or 107MB of cPython
| api wrote:
| Python has been popular for this because it's convenient to
| quickly hack on and experiment with, not because it's the most
| efficient thing.
| im3w1l wrote:
| The overhead really isn't that bad, is it? Since the
| Python code is mostly about saying "multiply matrix A with
| matrix B", and the actual computation is done by
| optimized low-level code.
| jiggawatts wrote:
| I suspect that this has a high chance of running afoul of
| Amdahl's Law. Even if you can parallelise the bulk of the
| computation, the serial parts remain single-threaded and
| start to dominate the total runtime.
| taneq wrote:
| I don't think the serial parts of ML training are
| Python's fault, are they? It's all "operation B depends
| on the output of operation A".
| bongodongobob wrote:
| For that stuff, yeah you're correct.
|
| What I've seen is issues with the implementation of those
| libraries in a project.
|
| I don't remember exactly, but I was playing with someone's
| wrapper for some kind of machine learning snake game and it
| was taking way longer than it should have on back of the
| napkin math.
|
| The issue was using either a dict or a list in a hot loop
| and changing it to the other sped it up like 1000x.
|
| So it's easy to think "yeah this library is optimized" but
| then you build something on top of it that is not obviously
| going to slow it down.
|
| But, that's the Python tradeoff.
| 0cf8612b2e1e wrote:
| That sounds irrelevant to Python and just a matter of
| slow code cropping up in libraries until someone runs a
| profiler.
| littlestymaar wrote:
| But then again, if your program has places where choosing
| the right Python primitive is important for performance,
| then using Python is already affecting performance, since
| even the best algorithm in Python would be slower than
| the equivalent C.
|
| Most of the time it doesn't matter because there's
| nothing hot on the Python side, but if there is, then
| Python is going to be slowing your stuff down.
| hcarvalhoalves wrote:
| > The issue was using either a dict or a list in a hot
| loop and changing it to the other sped it up like 1000x.
|
| The programmer using the wrong data structure is not a
| problem with the language.
| CamperBob2 wrote:
| It really is with Python. There are simply too many
| containers and container-like concepts. Lists, arrays,
| sets, dicts...
| 0cf8612b2e1e wrote:
| What modern language doesn't have those?
|
| Go kind of cheats and has maps play double duty as sets.
| bongodongobob wrote:
| Kinda. I guess my native tongue is C/C++ and I wouldn't
| expect such a huge performance difference when using an
| array vs a linked list or something.
|
| It's not like I had millions of items in that structure
| either, it was like 100. I think it contained the batch
| training data from each round. I tried to find the
| project but couldn't.
|
| I was just shocked that there was such a huge difference
| between primitive data structures. In that situation, I
| wouldn't have guessed it would make a difference.
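|
| Purely as an illustration of why layout alone can matter (a
| minimal C sketch, nothing to do with the project in question):
| the array walk streams through contiguous memory that the
| prefetcher can keep feeding, while the list walk is a chain of
| dependent pointer loads.
|
|     #include <stddef.h>
|
|     typedef struct Node { double val; struct Node *next; } Node;
|
|     /* Contiguous array: several elements per cache line,
|        prefetcher-friendly. */
|     double sum_array(const double *a, size_t n) {
|         double s = 0.0;
|         for (size_t i = 0; i < n; i++) s += a[i];
|         return s;
|     }
|
|     /* Linked list: each element is a dependent load, often to
|        a different cache line, so the CPU mostly waits. */
|     double sum_list(const Node *head) {
|         double s = 0.0;
|         for (const Node *p = head; p; p = p->next) s += p->val;
|         return s;
|     }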
| llm_nerd wrote:
| It depends on how you define overhead. Runtime overhead and
| memory usage is absolutely marginal, and the tightest, most
| perfect implementation will have trouble beating it.
|
| Instead people are trying to optimize install size of
| dependencies, which while maybe a fun hacking project...who
| really cares?
| QuadmasterXLII wrote:
| 107MB of cPython defeated
|
| Go to try for self
|
| Step 1 download 2.4GB of CUDA
| simonw wrote:
| The size of CUDA really is astonishing. Any chance someone
| might figure out how to slim that down?
| dartos wrote:
| Nvidia is the only one who could, since they own it.
| xiphias2 wrote:
| Talking directly to the kernel / driver / firmware.
|
| As others have said, George Hotz is doing his best in
| reverse-engineering and skipping layers.
| jsheard wrote:
| Taking a peek inside the package it seems to mostly be the
| libraries - CuFFT alone is about 350MB for example, twice
| over for the debug and release versions. I'm guessing those
| are probably fat binaries pre-compiled for every generation
| of Nvidia hardware rather than just the PTX bytecode, which
| would help to speed up fresh builds, at the expense of
| being huge.
| maille wrote:
| Raise your voice on their forum:
| https://forums.developer.nvidia.com/t/how-to-overcome-the-
| hu... Tried my luck 2 years ago but it keeps increasing.
| phdelightful wrote:
| Here's a blog that breaks down how large different pieces
| of CUDA are:
|
| https://carlpearson.net/post/20231023-cuda-releases/
| gcr wrote:
| I mean, being fair, the 2.4GB CUDA SDK is absolutely required
| for the cPython implementation as well
| dwroberts wrote:
| A bunch of install methods for torch via pip include ~1.5GB
| of lib/ because of CUDA. libtorch_cuda.so is like 800MB on
| its own
| fsloth wrote:
| I don't think it's about the byte size, but the inherent
| complexity of the implementation. 1000 lines of C code is
| extremely simple by any standard. Whereas a sundry collection
| of Python and PyTorch libraries is anything but.
| brcmthrowaway wrote:
| Very sad, should've used an agnostic framework instead of CUDA
| jsheard wrote:
| It's only ~1000 LoC, seems like a pretty good case study to
| port over to other runtimes and show they can stand up to CUDA.
| geph2021 wrote:
| As far as I can tell, its optional dependency is OpenMP, not
| CUDA. It doesn't seem directly dependent on CUDA.
| gpderetta wrote:
| Yes, a quick skim of the code only shows an OpenMP dependency.
| The C/CUDA reference might have been meant to be C/OMP.
|
| Although I wonder if it would work well with GCC PTX OMP
| offloading.
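|
| For a sense of what that dependency buys, here is a minimal
| sketch of an OpenMP-parallelized matmul loop (illustrative,
| not the repo's actual code): the pragma spreads the
| independent (batch, output-channel) iterations across CPU
| threads, and in principle a target directive could offload
| the same loop via GCC's PTX backend.
|
|     /* out[b*OC+o] = bias[o] + sum_i inp[b*C+i] * weight[o*C+i] */
|     void matmul_forward(float *out, const float *inp,
|                         const float *weight, const float *bias,
|                         int B, int C, int OC) {
|         #pragma omp parallel for collapse(2)
|         for (int b = 0; b < B; b++) {
|             for (int o = 0; o < OC; o++) {
|                 float val = (bias != NULL) ? bias[o] : 0.0f;
|                 for (int i = 0; i < C; i++) {
|                     val += inp[b * C + i] * weight[o * C + i];
|                 }
|                 out[b * OC + o] = val;
|             }
|         }
|     }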
| dlazaro wrote:
| The plan is to eventually implement with CUDA:
|
| "Currently, I am working on [...] direct CUDA implementation,
| which will be significantly faster and probably come close to
| PyTorch."
| robrenaud wrote:
| Are there any strong LLMs trained without CUDA?
| blackeyeblitzar wrote:
| Yes, there are several. See this blog post from Databricks
| describing the landscape of LLMs trained on AMD hardware for
| example: https://www.databricks.com/blog/training-llms-scale-
| amd-mi25...
|
| The most interesting one IMO is OLMo from AI2, which is truly
| open. You can read their blog post about it
| (https://blog.allenai.org/hello-olmo-a-truly-open-
| llm-43f7e73...) but basically it is open everything - they
| released everything you need to reproduce their weights
| (training data, training code, evaluation code, and weights)
| with a friendly (Apache) license.
| ZoomerCretin wrote:
| Gemini and Gemma were trained on Google's TPUs.
| exe34 wrote:
| Looking forward to your patches!
| andrewstuart wrote:
| OT, but a question from someone curious: is CUDA still
| entrenched as the only option for doing AI, or is there growing
| support for AMD/Intel/other ways of doing AI?
| BraverHeart wrote:
| George Hotz is attempting to solve this:
| https://github.com/tinygrad/tinygrad
| ZoomerCretin wrote:
| He loudly gave up on AMD after they did not fix a blocker he
| had for 5+ months and gave him the runaround the entire time
| when he asked for the code to fix it himself. He is still
| shipping the AMD tinybox with huge warning labels.
| Art9681 wrote:
| Didn't they recently announce that everything was open
| sourced? Would be cool if he took another look at it once
| all of the source code is available (if not already).
| xjay wrote:
| > They haven't open sourced anything. They posted a
| tweet. [1]
|
| [1@2024-04-06]
| https://www.youtube.com/watch?v=j7MRj4N2Cyk&t=429s
|
| [Twitch] https://twitch.tv/georgehotz
| magicalhippo wrote:
| Randomly stumbled over this[1] post with another fed up
| open source contributor, due to several serious issues with
| AMDs GPU drivers and firmware that remain unresolved for
| years. It also references the geohot decision you mention.
|
| Some quotes:
|
| _I find it incredible that these companies that have large
| support contracts with you and have invested hundreds of
| thousands of dollars into your products, have been forced
| to turn to me, a mostly unknown self-employed hacker with
| very limited resources to try to work around these bugs
| (design faults?) in your hardware._
|
| _In the VFIO space we no longer recommend AMD GPUs at all,
| in every instance where people ask for which GPU to use for
| their new build, the advise is to use NVidia._
|
| [1]: https://www.reddit.com/r/Amd/comments/1bsjm5a/letter_t
| o_amd_...
| fwip wrote:
| I'm not sure if he's "attempting to solve it" so much as he's
| looking for yet another way to keep himself famous.
|
| The guy did one good jailbreak for the iPhone, and as near as
| I can tell, the rest of his work has been a lot of boasting,
| half-assed hyped-up implementations (e.g: his self-driving
| car), and trying to befriend other powerful people in tech
| (see: his promise to single-handedly fix Musk's Twitter). He
| might be a smart dude, but he vastly overrates his own
| accomplishments, and hardly finishes anything he starts.
| WithinReason wrote:
| There are some stirrings but don't hold your breath
| blackeyeblitzar wrote:
| See my comment on this here:
| https://news.ycombinator.com/item?id=39973816
| sigmoid10 wrote:
| There are a few attempts here and there in various stages of
| progression. But right now, nothing matches Nvidia+CUDA in
| speed and usability.
| adam_arthur wrote:
| You can run inference today on pretty much any card.
|
| Download Ollama on a modern MacBook and you can run 13B and even
| larger models (if your RAM allows) at fast speeds. People run
| smaller models locally on their phones.
|
| Google has trained their latest models on their own TPUs... not
| using Nvidia to my knowledge.
|
| So, no, there are alternatives. CUDA has the largest mindshare
| on the training side though.
| taminka wrote:
| There are obviously alternatives from both Intel and AMD, with
| performant BLAS/DNN packages, but small teams don't use them
| because CUDA is easier to use and has more support, and larger
| teams don't use them because they have deals with Nvidia, or not
| enough GPUs are available, or they're after the absolute best
| performance (which is still Nvidia), or because of other stuff
| like unstable drivers or something.
| towelpluswater wrote:
| Modular's Mojo is the best-funded effort, with respected people
| behind it, working to make an alternative possible.
| pavelstoev wrote:
| Check out Hidet [1]. Not as well funded, but delivers Python
| based ML acceleration with GPU support (unlike Mojo).
|
| [1] https://github.com/hidet-org/hidet
| triyambakam wrote:
| When Lex recently talked to Andrej, Andrej said that he gets
| positively obsessed with a problem and says "this must exist". I
| imagine this must be one of those outputs.
| yinser wrote:
| I've seen his nano GPT implemented using JAX, now we have C/CUDA.
| I'd love to see if nano GPT could be doable in Mojo. I took a
| stab at a Mojo conversion of his Wavenet project (Andrej's zero
| to hero course) and I gotta say... python has so many nice
| features lol. Stating the obvious I know but what you see done in
| 6 lines of python takes so much more work in other languages.
| cb321 wrote:
| For a prior generation of karpathy-splaining there is this Nim
| port: https://github.com/Vindaar/llama2nim - maybe of interest
| if you are interested in Mojo.
| yinser wrote:
| Thank you!
| pavelstoev wrote:
| How in Mojo do you support GPU data parallelism and all the
| benefits it brings ?
| KeplerBoy wrote:
| You don't. Mojo doesn't support GPUs at the moment, which
| says a lot about a language which claims to be AI first.
| pjmlp wrote:
| They only made Mojo available outside the preview circle
| about a couple of months ago, and it is yet to run on
| Windows laptops of researchers.
|
| I love the attitude of considering 0.x languages production
| ready for all imaginable kinds of workloads.
| yinser wrote:
| If you want CUDA up front go write PyTorch. No one is
| stopping you. Modular's goal was to leverage MLIR first and
| bring GPUs in later. They're barely a year old company.
| auraham wrote:
| Where is the GPT implementation in JAX? I only found this [1]
| in PyTorch and NumPy.
|
| [1] https://github.com/karpathy/nanoGPT
| blackeyeblitzar wrote:
| It would be great if someone created a tutorial around this
| explaining exactly how it works and how to do a test training
| run. I'm aware it's not feasible to train a "real" model on
| personal hardware but it would be nice to have a practical
| learning experience. I'm not sure if there are good alternatives
| for that.
| vineyardmike wrote:
| The author has a whole series where he does exactly that.
| YouTube videos, code examples, documentation, everything.
| Explains the math, explains how to code it, explains the
| architecture. Everything.
| karpathy wrote:
| I wrote this, which might be a bit helpful:
| https://github.com/karpathy/llm.c/blob/master/doc/layernorm/...
|
| But if you don't have the background, I'd recommend my YouTube
| videos, see the Zero To Hero playlist:
| https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...
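|
| For context, the operation that doc walks through is roughly
| the following (a minimal C sketch of a LayerNorm forward pass,
| not code copied from the repo): normalize each C-length row to
| zero mean and unit variance, then scale and shift.
|
|     #include <math.h>
|
|     void layernorm_forward(float *out, const float *inp,
|                            const float *weight, const float *bias,
|                            int N, int C) {
|         const float eps = 1e-5f;
|         for (int n = 0; n < N; n++) {
|             const float *x = inp + n * C;
|             float mean = 0.0f;
|             for (int i = 0; i < C; i++) mean += x[i];
|             mean /= C;
|             float var = 0.0f;
|             for (int i = 0; i < C; i++) {
|                 float d = x[i] - mean;
|                 var += d * d;
|             }
|             var /= C;
|             float rstd = 1.0f / sqrtf(var + eps);
|             for (int i = 0; i < C; i++) {
|                 /* normalize, then apply learned scale and shift */
|                 out[n * C + i] = (x[i] - mean) * rstd * weight[i]
|                                  + bias[i];
|             }
|         }
|     }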
| blackeyeblitzar wrote:
| Thank you so much for responding. I will definitely check
| these out and also pass it on to others who might be
| interested.
| MAMAMassakali wrote:
| Thank you so much for the Zero To Hero playlist!
| qwertox wrote:
| > direct CUDA implementation, which will be significantly faster
| and probably come close to PyTorch.
|
| It almost hurts to read that PyTorch is faster.
|
| But then again, with these GPU RAM prices, let's see how fast
| it can get on the CPU.
|
| We really need SO-DIMM slots on the RTX series (or AMD/Intel
| equivalent) so that we can expand the RAM as we need it to. Is
| there a technical problem to it?
| LatticeAnimal wrote:
| > We really need SO-DIMM slots on the RTX series (or AMD/Intel
| equivalent) so that we can expand the RAM as we need it to. Is
| there a technical problem to it?
|
| I imagine it would incur a non trivial latency and cost
| penalty. The memory modules are placed pretty close to the
| compute die right now. Cooling would also have to change (the
| memory modules produce a lot of heat).
|
| But there is also no reason for any of the GPU manufacturers to
| do this. A SKU with twice as much memory can go for a lot more
| than the difference in memory cost alone.
| SunlitCat wrote:
| And they especially like doing "interesting" combinations of
| GPU and memory.
|
| Like a lower-end GPU with 16 GB of VRAM, but offering just 8 or
| 12 GB of VRAM in the mid-range, and then 16 GB again at the
| upper end of the GPU lineup.
| ItsBob wrote:
| I don't disagree but (I know nothing about this btw...) would
| it not benefit in terms of, say, a L3 cache kind of thing?
|
| Imagine you could stick 2 x 64GB DDR5 DIMMS on the GPU in
| sockets, would that not be faster to access than the
| motherboard DIMMS? It won't be as fast as on-die memory of
| course but could it not act like a sort of halfway house?
| jsheard wrote:
| Memory speed is more or less directly proportional to how close
| the memory is to the processor, with the fastest memory being
| literally inside the processor (SRAM cache), followed by memory
| on the same package as the processor (HBM GPUs, Apple
| M-series), followed by soldered down discrete memory chips
| (regular GPUs, games consoles), followed by socketed DIMMs in
| distant last place. There's not really any getting around it,
| the bandwidth that GPUs crave just isn't compatible with
| modularity.
|
| Even CPUs are starting to move their memory closer to the core
| in the name of performance, as mentioned Apple is already doing
| it, Intel is making Xeons with on-chip memory now, and they
| have a version aimed at consumers on their roadmap.
| wtallis wrote:
| FYI, most discrete GPUs with discrete memory packages
| soldered to the board near the GPU are running at
| substantially higher memory frequencies than the on-package
| DRAM in Apple's chips. But running GDDR at those speeds costs
| a lot of power.
| osigurdson wrote:
| I watched a presentation on this today. The presenter
| focused on the soldering and proximity as well. Is this
| really the only difference or is this transistor based
| memory (like L1, L2, etc.)? I get the proximity factor of
| course (1ft / ns EE rule of thumb). In any case, soldering
| and proximity don't seem like breakthrough innovations (but
| maybe I am wrong).
| MobiusHorizons wrote:
| GPU RAM is typically GDDR6 or GDDR6X, which is a different
| standard from the chips used in DDR5, for example. GPUs have
| terrible latency to ram, but enormous throughput, and I
| assume the chips are internally optimized for that. Many
| aspects of a design change when you choose different
| latency or clockspeed targets translating into different
| power / area calculations.
| tverbeure wrote:
| For data rates, as in bandwidth per IO pin, distance is
| really only a secondary factor. HBM memory, for example, runs
| at substantially lower data rates than GDDR, yet it sits
| right next to the GPU die compared to centimeters for the
| GDDR. And high-speed serial links run at speeds that are an
| order of magnitude higher than even the internal register
| files of a CPU.
| viraptor wrote:
| That's true, it's got an impact, but I think there's still
| space available for "slightly slower with 2x memory" models.
| For many local uses, new cards are way past the "fast enough"
| line, but having 64gb on them would be really beneficial.
|
| I'd love to see some experiments / different SKUs in this
| area, given people are already diy-ing extra memory on
| NVIDIA. (https://hackaday.com/2021/01/29/add-an-extra-8gb-of-
| vram-to-... there were stable experiments later on, but I
| don't have a link now)
| schneehertz wrote:
| Graphics card manufacturers believe that selling high-
| memory consumer graphics cards would hurt the market for
| commercial compute cards, so they will not do it. That's
| all.
| airspresso wrote:
| Nice room for a new player to disrupt then
| einsteinx2 wrote:
| Problem is, making a board design using an existing GPU
| chip and sticking more RAM into it is (relatively) simple
| but of course none of the GPU chip makers would allow
| partners to do that. Making your own GPU chip that's
| competitive with Nvidia or AMD's current offerings is a
| massive undertaking and pretty much impossible for a
| newcomer.
|
| Just look at how much trouble Intel has had breaking into
| the discrete GPU market or even just how hard it's been
| for AMD to compete with Nvidia even with decades of
| experience in the market.
|
| And if some newcomer _could_ make a competitive GPU with
| large memory capacity they'd be crazy not to sell it at
| datacenter prices, maybe just undercutting the others by
| a few grand, but still way more expensive than any
| consumer GPU you can buy today, even a 4090.
| tlb wrote:
| It's not doable at the board level. High-end GPUs use HBM
| in the same package, connected using a silicon
| interposer.
| tverbeure wrote:
| Check out PCB back drilling. It's a process where you remove a
| few hundred microns from the vias that are used to connect GDDR
| RAMs to the GPUs, to avoid reflections due to the impedance
| mismatch that's caused by the stub.
|
| When you have a pulse coded signal traveling at close to 10GHz,
| everything becomes an antenna. The technical problem is that
| you can't do this with a flimsy connector like the ones used
| for DIMMs. The reason GDDR can have a bandwidth per pin that is
| 4 times higher than regular DDR is because they are soldered
| down on the PCB.
| hahnchen wrote:
| > It almost hurts, to read that PyTorch is faster.
|
| Why?
| theGeatZhopa wrote:
| NVIDIA hates that trick.
| patrick-fitz wrote:
| https://twitter.com/karpathy/status/1777427944971083809
|
| > And once this is in a bit more stable state: videos on
| building this in more detail and from scratch.
|
| Looking forward to watching the videos.
| 0cf8612b2e1e wrote:
| I love his videos. They are dense, but I get a lot out of them.
| sghiassy wrote:
| +100 thank you karpathy!
| fori1to10 wrote:
| It should be rewritten in Rust. (Just joking)
| eclectic29 wrote:
| Sshh! I asked why it was written in C and got flagged.
| ddggdd wrote:
| I just pasted the code into Claude and I'm reading the
| converted Rust now; it definitely needs extra work.
| naruhodo wrote:
| I think you just cracked AI safety.
| idkwhatimdoin wrote:
| If I was starting from scratch, what resources should I start
| with to build up an understanding of what this code does and how
| to read it? It's quite dense and my knowledge of LLMs is quite
| minimal. Are these terse variable names standard in LLM-land?
| tayo42 wrote:
| Check out his zero to hero series. Which builds this with
| python and later pytorch, then probably his other mini C based
| projects.
| vineyardmike wrote:
| Terse variables are a C thing.
|
| "What resources would I need" -> you're literally commenting on
| a teacher's content. Karpathy (the author) has a very
| informative YouTube channel where he goes step by step through
| everything. He has a ton of repos and tutorials. Dig a little.
|
| If all else fails... Google it.
| viraptor wrote:
| > Terse variables are a C thing.
|
| They're a math / toy code thing. Large C projects have long
| descriptive names just like other languages.
| idkwhatimdoin wrote:
| > you're literally commenting on a teacher's content.
|
| How am I supposed to know that?
|
| > Karpathy (the author) has a very informative YouTube
| channel where he goes step by step through everything.
|
| Or that, without knowing that he's a teacher?
|
| > Terse variables are a C thing.
|
| I didn't realize variables had to be so short in C. Glad I
| write C++ professionally where they've added support for
| longer variable names.
|
| > If all else fails... Google it.
|
| There's a lot of LLM garbage out there. I got an answer here
| in a few minutes pointing to Karpathy's course which seems
| very high quality.
|
| Be kinder.
| vineyardmike wrote:
| > How am I supposed to know that?
|
| You're not supposed to know that. You asked a question, and
| this is you being told the answer.
|
| It's very convenient that the author of the post is quite
| literally the world's most prolific teacher on this topic.
| Makes it easy to find Karpathy. You shouldn't be expected
| to otherwise know that (or else why ask if you knew).
|
| > I didn't realize variables had to be so short in C. Glad
| I write C++ professionally where they've added support for
| longer variable names.
|
| This feels like a joke, but old C compilers did limit the
| number of significant characters in identifiers. This is
| part of why C historically had shorter variable names than
| other more modern languages.
|
| Sorry if it came off rude, the internet is hard to
| communicate over.
|
| https://publications.gbdirect.co.uk/c_book/chapter2/keyword
| s...
| satokema wrote:
| As siblings have said, his video series are quite good. But if
| you're just looking at this repo only, you probably want to
| look at the python reference implementation. (The C is designed
| to exactly replicate its functionality.)
| flockonus wrote:
| Question, apologies if slightly off-topic, it's something I'd
| like to use this project for: is there an example of how to
| train GPT-2 on time series, in particular with covariates?
|
| As far as my understanding of LLMs goes, at a basic level it's
| predicting the next token from previous tokens, which sounds
| directionally similar to time series (perhaps setting aside
| periodicity).
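|
| One common way to bridge the two (an illustrative sketch, not
| something this repo provides): quantize each real-valued
| observation into one of K bins and treat the bin index as the
| token, so "predict the next token" becomes "predict the next
| value"; covariates can be interleaved as extra tokens in the
| same stream.
|
|     /* Hypothetical helpers: uniform quantization over [lo, hi]. */
|     int value_to_token(float x, float lo, float hi, int n_bins) {
|         if (x <= lo) return 0;
|         if (x >= hi) return n_bins - 1;
|         int bin = (int)((x - lo) / (hi - lo) * n_bins);
|         return bin < n_bins ? bin : n_bins - 1; /* rounding guard */
|     }
|
|     float token_to_value(int tok, float lo, float hi, int n_bins) {
|         float w = (hi - lo) / n_bins;
|         return lo + (tok + 0.5f) * w;   /* center of the bin */
|     }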
| teruakohatu wrote:
| Yes general LLM models can be used for time series forecasting:
|
| https://github.com/KimMeen/Time-LLM
| EricLeer wrote:
| Yes, there are many attempts at applying transformers to time
| series forecasting. For instance (but there are many more):
|
| - TimeGPT: https://arxiv.org/abs/2310.03589
|
| - Chronos: https://github.com/amazon-science/chronos-forecasting
|
| These kinds of papers often promise the world, but often lack a
| proper baseline model. They only compare against very simple
| (naive forecast) or untuned models. In my experience a
| gradient boosting model will probably solve 95% of your
| forecasting problems, and trying to get fancy with a
| transformer (or even just a simple neural net) is more trouble
| than it is worth.
| andy99 wrote:
| I'd like to think he took the name from my llm.f90 project
| https://github.com/rbitr/llm.f90
|
| It was originally based off of Karpathy's llama2.c but I renamed
| it when I added support for other architectures.
|
| Probably a coincidence :)
| matteogrella wrote:
| I'm the creator behind https://github.com/nlpodyssey/rwkv.f90.
| How about joining forces?
| andy99 wrote:
| I'll send you an email
| bee_rider wrote:
| In f90? That's pretty cool.
|
| On a related note, IMO it would be pretty cool if we could get
| an LLM implementation that provides an RCI interface like all
| the old computational codes used to.
| tehsauce wrote:
| Another awesome project! Note that as of this moment the CUDA
| part is aspirational. There is no gpu code in the repo yet.
| richrichie wrote:
| See, C does it very well. Great stuff. Karpathy has a gift for
| teaching.
| osigurdson wrote:
| Kind of amazing that something that can be expressed in ~1000
| lines of code has completely turned the world on its head.
| daniel_reetz wrote:
| Echoes of DeCSS ;)
| KeplerBoy wrote:
| Which important concept or algorithm can't be expressed in
| <=1000 lines? Seems like a pretty common theme among
| groundbreaking ideas.
| Y_Y wrote:
| That's a good question. Unfortunately I think you're asking
| to compute the Kolmogorov complexity of every interesting
| concept we have that doesn't yet have an implementation less
| than n=1000 lines, which is equivalent to the halting problem
| (modulo unbounded memory).
|
| If you could exhaustively list all the interesting algorithms
| (hard but feasible) you could potentially prove a lower bound
| for each one's complexity by writing a shorter than n
| implementation (hard, probably infeasible) and show
| positively that GP's prop isn't true. On the other hand
| showing that it was true would require either some very
| clever proof which can't apply to all programs, but somehow
| only these interesting ones (very likely impossible) or
| enumerate all C^n programs where C is the number of possible
| lines (something like 64^80) and show that none of them
| implements at least one of the interesting algorithms
| (absurdly impossible).
| nextaccountic wrote:
| You are right but I think that there's a more interesting
| question: do humans stumble upon those large
| interesting/great algorithms in practice?
|
| The key point here is that we are looking at algorithms
| already discovered in human history rather than enumerating
| all possible interesting algorithms. Of course there are
| interesting algorithms that are very large, but humans don't
| discover them in practice. If you look up a list of the
| greatest algorithms in history, they will be rather small
| in length. Many of them can be sketched on a whiteboard.
|
| I think that what is happening here is that our minds just
| can't hold billions of concepts at once. So if you have an
| algorithm with billions of things, it was most likely
| produced by a machine. Handcrafted things, on the other
| hand, are smaller in comparison
|
| Another thing is that our minds like conceptual simplicity
| and view simplicity as a kind of beauty. So if we have a
| great algorithm but it is too large, we look for ways to
| express them in succinct ways (the right abstractions can
| help with that, and also help with understanding the
| algorithm better). We end up succeeding because the
| algorithms themselves had low Kolmogorov complexity (and
| thus, if they are too large they probably can be further
| compressed)
| magnat wrote:
| Most modern A/V codecs exceed that limit by several
| orders of magnitude.
|
| Even a standard-compliant JPEG decoder would be hard to squeeze
| without some serious codegolfing. Discarding some barely used
| features gets you close to that limit, though [1].
|
| Smallest popular TCP/IP stack [2] is ~20kLoC.
|
| [1] https://github.com/richgel999/picojpeg
|
| [2] https://savannah.nongnu.org/projects/lwip/
| epr wrote:
| A JPEG decoder or TCP stack are very clearly not individual
| concepts though. There's obviously some subjectivity as to
| what constitutes a single "concept" or "algorithm", but I'm
| not sure either of those two examples are in a gray area.
|
| A single concept might be implementing just ARP or a
| discrete cosine transform. If you wanted to do a full TCP
| stack or JPEG decoder, that would make a lot more sense
| after building their internal components one by one.
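|
| As a concrete example of how small a single concept can be,
| here is a naive DCT-II in C (a sketch of the O(N^2) textbook
| definition; real codecs use fast factorizations and proper
| normalization):
|
|     #include <math.h>
|
|     void dct_ii(const double *in, double *out, int N) {
|         const double PI = 3.14159265358979323846;
|         for (int k = 0; k < N; k++) {
|             double s = 0.0;
|             for (int n = 0; n < N; n++) {
|                 s += in[n] * cos(PI / N * (n + 0.5) * k);
|             }
|             out[k] = s;
|         }
|     }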
| naasking wrote:
| JPEGDEC seems to be about 500 lines:
| https://github.com/bitbank2/JPEGDEC
| jiripospisil wrote:
| More like 5000 https://github.com/bitbank2/JPEGDEC/blob/m
| aster/src/jpeg.inl
| datascienced wrote:
| Err... and the exabytes of training data
| rnewme wrote:
| Ah, not really
| datascienced wrote:
| "Here's one I trained earlier"?
| holoduke wrote:
| Speed of hardware did. Back in the 80s they already knew the
| principles of LLM training. It just took a week to train on
| 10,000 tokens.
| toxik wrote:
| Got a reference on that claim?
| mrbonner wrote:
| Wow, and this is done after a recent trip to Bhutan to clear his
| head! I follow karpathy on Twitter and he posted that 2 weeks
| without constantly looking at and checking his phone kind of
| turns off the always-on radio in his head.
| convexstrictly wrote:
| Candle is a minimalist ML framework for Rust with a focus on
| performance (including GPU support) and ease of use
|
| https://github.com/huggingface/candle
| imjonse wrote:
| Candle focuses on inference though.
| l-m-z wrote:
| Candle dev here, we also support training/backprop! We
| certainly focus on optimizing inference performance but
| hopefully that should improve the training efficiency too.
| revskill wrote:
| What is inferencing?
| HarHarVeryFunny wrote:
| Inference means using the neural net, as opposed to
| training it.
|
| During inference you feed an input into the NN and it
| passes through it in "forwards" direction (i.e. from input
| to output), being modified according to the "weights" that
| were learnt during training, to derive the output.
|
| During training, each training sample is first fed forwards
| through the NN, the same way as for inference, but then the
| output of the model (which at the beginning of training
| will be random/wrong) is compared to the correct/desired
| output for that training sample, and a corresponding error
| value will then be fed backwards (from output to input)
| through the NN according to the "backpropagation" mechanism
| to update the weights.
|
| Training is a lot more involved than inference since it
| involves this backpropagation step.
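|
| A tiny concrete example of the difference (a sketch with a
| single linear unit, not anyone's real training code): the
| forward pass is all inference needs, while a training step
| also computes gradients and updates the weights.
|
|     typedef struct { float w, b; } Neuron;   /* y = w*x + b */
|
|     float forward(const Neuron *n, float x) {
|         return n->w * x + n->b;
|     }
|
|     /* One step of gradient descent on squared error. */
|     void train_step(Neuron *n, float x, float target, float lr) {
|         float y = forward(n, x);              /* forward pass  */
|         float dy = 2.0f * (y - target);       /* dLoss/dy      */
|         n->w -= lr * dy * x;                  /* dy/dw = x     */
|         n->b -= lr * dy;                      /* dy/db = 1     */
|     }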
| basbuller wrote:
| Not nearly as minimal as Karpathy's implementation
| jeroenvlek wrote:
| Love Candle! I actually ported Karpathy's previous GPT tutorial
| to candle, including training [0]
|
| [0] https://www.perceptivebits.com/building-gpt-from-scratch-
| in-...
| 0xfedbee wrote:
| I wouldn't call it "minimalist" after seeing Karpathy's code.
| rurban wrote:
| https://github.com/robjinman/Richard uses Vulkan, thus is
| portable across GPUs and much faster. It also has more kernels.
| In simple C++.
| flohofwoe wrote:
| Or rather GLSL... The C++ code looks like it's mostly just
| scaffolding to kick off the actually important GPU work, and
| for that it's a surprising amount of code. Quite typical both
| for Vulkan and C++ though ;)
| robot wrote:
| very cool, also the coding style looks good.
| antirez wrote:
| Is this able to replace PyTorch, ... in normal practice? No.
|
| Does this show that in general the most used ML frameworks are a
| mess? Yes.
| bootsmann wrote:
| This is a bit of an apples-and-oranges comparison. PyTorch is a
| research framework not a transformer inference library.
| antirez wrote:
| This post is about _training_, not inference. And llama.cpp
| has similarly simple LoRA training code. There is nothing in
| neural networks themselves complex enough to justify the
| amount of complexity the Python-ML community piled up. MLX,
| for instance, is a similarly general-purpose research
| framework that is a fraction of the size.
| HarHarVeryFunny wrote:
| Sure neural networks in of themselves are conceptually
| simple, and not difficult to code. Andrew Ng's original
| Coursera class is all you need to go from zero knowledge to
| building MATLAB based neural nets in this same hard coded
| style.
|
| However, there is a huge difference in functionality (hence
| complexity) in a framework such as PyTorch vs hardcoding a
| single NN. It's a bit like the difference between writing a
| toy compiler in CompSci class vs a production one that
| supports optimization, multiple targets, etc, etc.
|
| The first step in convenience beyond hardcoding models, was
| frameworks like the original Torch, and original
| TensorFlow. Those frameworks let you explicitly assemble a
| neural net out of modular "lego blocks" (tensor
| operations), then just call model.forward() or
| model.backward() - no need to yourself write the forwards
| and backwards functions.
|
| What PyTorch (successor to Torch) did was increase the
| complexity of the framework, but bring massive ease-of-use
| to the developer, by getting rid of the explicit lego-block
| assembly process, and instead let the developer just write
| arbitrary Python code corresponding to what they want the
| model to do, and then PyTorch itself build the model
| internally and therefore is able to infer the backward
| function. This extra functionality/ease-of-use, but with
| corresponding internal complexity, is what differentiated
| PyTorch from TensorFlow, made it so successful, and caused
| most developers to switch to it.
|
| There is also a lot of other functionality in PyTorch that
| adds to the complexity - supporting multiple back ends,
| custom CUDA/etc kernels beyond what is provided by cuDNN,
| etc, etc.
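|
| In C terms, the "lego block" style amounts to something like
| this (an illustrative sketch of the idea, not how Torch or
| PyTorch is actually implemented): every block exposes the same
| forward/backward interface and a model is just a list of
| blocks walked forwards and then in reverse.
|
|     typedef struct Layer {
|         void (*forward)(struct Layer *, const float *in,
|                         float *out, int n);
|         void (*backward)(struct Layer *, const float *dout,
|                          float *din, int n);
|         float param;                /* toy: one scalar weight */
|     } Layer;
|
|     /* Example block: out = param * in, with matching backward. */
|     void scale_forward(Layer *l, const float *in,
|                        float *out, int n) {
|         for (int i = 0; i < n; i++) out[i] = l->param * in[i];
|     }
|     void scale_backward(Layer *l, const float *dout,
|                         float *din, int n) {
|         for (int i = 0; i < n; i++) din[i] = l->param * dout[i];
|     }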
| antirez wrote:
| I know all these things. Again: look at MLX.
| HarHarVeryFunny wrote:
| > Does this show that in general the most used ML frameworks
| are a mess? Yes.
|
| Not really ... there is little to no overlap with what a
| framework like PyTorch does. There is no tensor class, no
| autograd, etc. Just malloc, a bunch of hand calculated pointers
| into that chunk of memory, and hand written gradient functions.
| I assume the intent here is to be educational by stripping away
| the layers of abstraction to make it clearer what is going on.
|
| Frankly though, this code (all that pointer math!) is a mess
| too, maybe written this way to make it easy to port to cuDNN
| which is at a similarly low level (other than having tensor
| descriptors which make the memory layout more flexible).
|
| If you want to write your own tensor class and reusable NN
| framework, then the lines of code go up very rapidly. I did one
| in C++ a while back, and the tensor class alone was 20K LOC.
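|
| The pattern being described looks roughly like this (a sketch
| of the style, with made-up tensor names and sizes, not llm.c's
| actual layout): one allocation, and every tensor is a pointer
| at a hand-computed offset into it.
|
|     #include <stdlib.h>
|
|     typedef struct {
|         float *wte;   /* token embeddings,    V * C floats */
|         float *wpe;   /* position embeddings, T * C floats */
|         float *lnw;   /* layernorm weights,       C floats */
|         float *mem;   /* the single backing allocation     */
|     } Params;
|
|     int params_alloc(Params *p, int V, int T, int C) {
|         size_t n = (size_t)V * C + (size_t)T * C + (size_t)C;
|         p->mem = (float *)malloc(n * sizeof(float));
|         if (p->mem == NULL) return -1;
|         float *ptr = p->mem;
|         p->wte = ptr; ptr += (size_t)V * C;  /* carve out tensors */
|         p->wpe = ptr; ptr += (size_t)T * C;
|         p->lnw = ptr;
|         return 0;
|     }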
| milansuk wrote:
| This is an implementation of a transformer, and in the README
| it's presented as text->text. Tokens are just integers going in
| and out.
|
| Is it possible to use it to train other types of
| LLMs(text->image, image->text, speech->text, etc.)?
| bootsmann wrote:
| The transformer itself just takes arrays of numbers and turns
| them into arrays of numbers. What you are interested in is the
| process that happens before and after the transformer.
| _giorgio_ wrote:
| Yes, anything can be an input token.
|
| Patch of pixels ---> token
|
| Fragment of input audio ---> token
|
| etc.
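|
| One simple way that mapping can work (an illustrative VQ-style
| sketch with a hypothetical learned codebook, not how any
| particular model does it): flatten the patch and emit the id
| of its nearest codebook entry.
|
|     #include <float.h>
|
|     int patch_to_token(const float *patch, const float *codebook,
|                        int patch_dim, int vocab_size) {
|         int best = 0;
|         float best_dist = FLT_MAX;
|         for (int v = 0; v < vocab_size; v++) {
|             float d = 0.0f;
|             for (int i = 0; i < patch_dim; i++) {
|                 float diff = patch[i] - codebook[v * patch_dim + i];
|                 d += diff * diff;
|             }
|             if (d < best_dist) { best_dist = d; best = v; }
|         }
|         return best;   /* token id of the nearest codebook entry */
|     }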
| zzbn00 wrote:
| Very nice.
|
| In my experience much of the complexity of numerical software is
| to enable the search for the algorithm that works well with the
| problem/data you have. Once you know the exact algorithm you
| want, it is possible to make a nice clean minimalistic
| implementation, but that does not mean such an implementation
| would have been easy at the beginning.
| davedx wrote:
| On one hand, really nice to see the whole thing in 1000 lines of
| C code.
|
| On the other hand, that malloc function low key terrifies me. :)
| flohofwoe wrote:
| Better to be explicit than hiding unsafe memory accesses under
| C++ stdlib classes like std::vector which don't do range
| checking either in operator[]. And in this sort of code,
| automatically injected runtime range checks would most likely
| hurt performance enough to matter.
|
| I would still run the code through the Clang static analyzer
| and a couple of test runs in ASAN and UBSAN to be sure that
| nothing slipped through.
| sirsinsalot wrote:
| Karpathy's code, teaching and contribution to the body of
| knowledge in this area really is admirable.
|
| Sadly I am a generalist, but if I were a specialist, I would hope
| to contribute as openly and widely as Karpathy.
|
| Not clout chasing, click-bait, "top 5 javascript frameworks of
| 2023!" ... just high quality output that marks a specialist.
|
| Sorry to gush.
| lubesGordi wrote:
| Quick question, is this just pure C code that can be loaded into
| an Nvidia gpu and run (via the python code)? I scanned the C and
| didn't see anything CUDA related (maybe I missed something, I'm
| not a GPU programmer!). K mentions something about a direct CUDA
| implementation coming soon, how would that be different than what
| this is?
| whb07 wrote:
| It's not. If you look at his X account, he talks about his
| work on adding the CUDA parts.
| waynecochran wrote:
| Fantastic -- gotta love Andrej. I am sick of the ball and chain
| that is Python and all of its environment dependencies. It is
| nice to shed all the weight and get down to the metal.
| toxik wrote:
| Yeah, as long as you don't want to change the network
| architecture.
|
| Edit: and you trust that Andrej didn't screw up anywhere while
| hand rolling all the gradient calculations.
___________________________________________________________________
(page generated 2024-04-09 23:01 UTC)