[HN Gopher] GPU Survival Toolkit for the AI age
       ___________________________________________________________________
        
       GPU Survival Toolkit for the AI age
        
       Author : lordwiz
       Score  : 262 points
       Date   : 2023-11-12 14:37 UTC (8 hours ago)
        
 (HTM) web link (journal.hexmos.com)
 (TXT) w3m dump (journal.hexmos.com)
        
       | lordwiz wrote:
        | In this AI age, it is crucial for developers to have a
        | fundamental understanding of GPUs and their application to AI
        | development.
        
         | gmfawcett wrote:
         | Crucial, for all developers? The great majority will get AI
         | through an API.
        
           | athreyac8 wrote:
            | After a while there will come a time when you'll be the one
            | providing the API to others; for that, yes, crucial I guess.
        
       | anonylizard wrote:
       | I think python is dominant in AI, because the python-C
       | relationship mirrors the CPU-GPU relationship.
       | 
       | GPUs are extremely performant, and also very hard to code in, so
       | people just use highly abstracted API calls like pytorch to
       | command the GPU.
       | 
        | C is very performant, and hard to code in, so people just use
        | python as an abstraction layer over C.
       | 
        | It's not clear if people need to understand GPUs that much
        | (unless you are deep in AI training/ops land). In time, since
        | Moore's law has ended and multithreading has become the dominant
        | mode of speed increase, there'll probably be brand new languages
        | dedicated to this new paradigm of parallel programming. Mojo is a
        | start.
        
         | sitkack wrote:
         | Moore's law is far from over and multithreading is not the
         | answer. Your opening sentence is spot on tho.
        
           | bmc7505 wrote:
           | Care to elaborate?
        
             | guyomes wrote:
              | No idea about the precise future of Moore's law. Yet
             | recent research results show that there is still room for
             | faster semiconductors, as discussed on HN [0].
             | 
             | [0]: https://news.ycombinator.com/item?id=38201624
        
               | bmc7505 wrote:
               | I mean why isn't multithreading the answer?
        
               | samus wrote:
               | Multithreading has the following disadvantages:
               | 
               | * Overhead: the overhead to start and manage multiple
               | threads is considerable in practice. Most multithreaded
               | algorithms are in fact slower than optimized serial
               | implementations when n_threads=1
               | 
               | * Communication: threads have to communicate with each
               | other and synchronize access to shared resources.
               | "Embarrassingly parallel" problems don't require
               | synchronization, but many interesting problems are not of
               | that kind.
               | 
               | * Amdahl's law: there is a point of diminishing returns
               | on parallelizing an application since it quite likely
               | contains parts that are not easily parallelized.
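                | 
                | To put a rough number on that last point: Amdahl's law
                | gives speedup = 1 / ((1 - p) + p/n). If p = 90% of the
                | work parallelizes, 16 threads yield at most about 6.4x,
                | and no number of threads ever gets past 10x.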
        
               | bmc7505 wrote:
               | Well, yes. But assuming the question is how to reduce
               | overall latency, what alternative is there besides
               | algorithmic improvements or increasing clock frequency?
        
           | Kamq wrote:
           | > Moore's law is far from over and multithreading is not the
           | answer.
           | 
            | Wut? We hit the power wall back in 2004. There was a little
            | bit of optimization around the memory wall and ILP wall
            | afterwards, but really, cores haven't gotten faster since.
            | 
            | It's been all about being able to cram more cores in since
            | then, which implies at least multi-threading, but multi-
            | processing is basically required to get the most out of a CPU
            | these days.
        
             | ikura wrote:
             | Moore's law is "the observation that the number of
             | transistors in an integrated circuit doubles about every
             | two years". For a while clock speed was a proxy for that
             | metric, but it's not the 'law' itself.
        
               | Kamq wrote:
                | Yeah, but today the number of cores is the rough proxy
                | for that metric.
               | 
               | How do you operate in that world if "multithreading isn't
               | the answer"?
        
               | samus wrote:
               | Modern CPUs contain a lot more computing units than
               | cores. For a while, hyperthreading was thought to be a
               | useful way to make use of them. More recently, people
               | have turned to advanced instruction sets like SSE and
               | AVX.
        
               | FridgeSeal wrote:
               | Those things aren't mutually exclusive. Also firstly, I
               | suspect there's more "low hanging fruit" in making more
               | software make use of more cores. We're increasingly
               | getting better languages, tooling and libs for multi
               | threading stuff, and it's far more in the realm of your
               | average developer than writing SIMD compatible code and
               | making sure your code can pipeline properly.
        
               | samus wrote:
               | Threads are of course appropriate to implement high-level
               | concurrency and parallelism. But for fine-grained
               | parallelism, they are unwieldy and have high overhead.
               | 
               | Spreading an algorithm across multiple threads makes it
               | more difficult for an optimizing compiler to find
               | opportunities for SIMD optimization.
               | 
               | Similarly to how modern languages make it easier to
               | safely use threads, runtimes also make it easier to take
               | advantage of SIMD optimizations. For example, recently a
               | SIMD-optimized sorting algorithm was included in OpenJDK.
               | Apart from that, SIMD is way less brittle at runtime than
               | GPUs and other accelerators.
        
         | danielmarkbruce wrote:
         | GPUs aren't that difficult to program. CUDA is fairly
         | straightforward for many tasks and in many cases there is an
         | easy 100x improvement in processing speed just sitting there to
         | be had with <100 lines of code.
        
           | z2h-a6n wrote:
           | Sure, but if you've never used CUDA or any other GPU
           | framework, how many lines of documentation do you need to
           | read, and how many lines of code are you likely to write and
           | rewrite and delete before you end up with those <100 lines of
           | working code?
        
             | danielmarkbruce wrote:
             | They have nice getting started guides. Try it and see.
             | It's... really pretty simple. There is a reason they've
             | built a trillion $$ company - they've done a great job.
        
               | z2h-a6n wrote:
               | Ok, I tried it (for as long as my limited free time and
               | interest in CUDA allows). The closest I came to a getting
                | started guide is this [0], which by my (perhaps naive)
                | count is 25,561 lines of documentation, and I would
               | probably need to learn more C++ to understand it in
               | detail.
               | 
               | I'm sure CUDA is great, and if I had more free time
               | and/or better reasons to improve the performance of my
               | code it would probably be great for me. My point was
               | mainly that a few lines of code which may be trivial for
               | one person to write may not be for someone else with
               | different experience. Depending on what the code is being
               | used for even a vast increase in performance may not be
               | worth the extra time it takes to implement it.
               | 
               | [0] https://docs.nvidia.com/cuda/cuda-c-programming-
               | guide/index....
        
               | danielmarkbruce wrote:
               | https://developer.nvidia.com/blog/even-easier-
               | introduction-c...
        
               | jprete wrote:
               | I haven't seen it but I can believe it - parallel
               | programming of any variety is hard, and to be successful
               | as a vendor of such a system would require uncommonly
               | good API design to get the kind of traction that CUDA has
               | gotten.
        
             | Kamq wrote:
             | > Sure, but if you've never used CUDA or any other GPU
             | framework, how many lines of documentation do you need to
             | read, and how many lines of code are you likely to write
             | and rewrite and delete before you end up with those <100
             | lines of working code?
             | 
             | If you're already familiar with one of the languages that
             | the nvidia compiler supports? Not that many. For people
             | familiar with C or C++, it's a couple extra attributes, and
             | a slightly different syntax for launching kernels vs a
             | regular function call. I'm admittedly not experienced with
             | Fortran, which is the other language they support, so I
              | can't speak to that. There are C-style memory allocation
              | functions, which might be annoying to C++ devs, but it's
              | nothing that would confuse them.
              | 
              | Edit: There are also a couple of weird magic globals you have
              | access to in a kernel (blockIdx, blockDim, threadIdx), but
              | those are generally covered in the intros.
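              | 
              | For a concrete sense of scale, a minimal sketch (mine, not
              | from the article; it mirrors the shape of NVIDIA's intro
              | linked elsewhere in this thread) that squares an array on
              | the GPU looks roughly like this:
              | 
              |     #include <cuda_runtime.h>
              | 
              |     __global__ void square(float *x, int n) {
              |         // one thread per element
              |         int i = blockIdx.x * blockDim.x + threadIdx.x;
              |         if (i < n) x[i] = x[i] * x[i];
              |     }
              | 
              |     int main() {
              |         int n = 1 << 20;
              |         float *x;
              |         // unified memory, visible to host and device
              |         cudaMallocManaged(&x, n * sizeof(float));
              |         for (int i = 0; i < n; i++) x[i] = (float)i;
              |         // launch a grid of 256-thread blocks
              |         square<<<(n + 255) / 256, 256>>>(x, n);
              |         cudaDeviceSynchronize();
              |         cudaFree(x);
              |     }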
        
               | wiz21c wrote:
               | My experience is different. I had to code various
               | instances of "parallel reduction" and "prefix sum" and
               | it's not easy to get into it (took me a day or two).
                | Moreover, coming from an age when 640KB of RAM was
                | considered quite enough, truly realizing the power of the
                | GPU was not easy because my tasks are not quite parallel
                | and not quite thread coherent (I grant you that doing
                | naturally parallel stuff is dead simple). It took me a
                | while and a lot of nvidia-nsight to max out the GPU...
                | Moreover, I was a bit slow to actually understand how
                | powerful a GPU is (for example, my GPU gave poor
                | performance unless I gave it a big enough problem, so I
                | was (wrongly) disappointed when testing toy problems).
               | 
               | But once that challenge is overcome, GPU truly rocks.
               | 
                | Finally, debugging complex shaders (I do a specific kind
                | of computational fluid dynamics where the equations are
                | not that easy, full of "if/then" edge cases, etc.) is not
                | fun at all; tooling is sorely missed (unless I've missed
                | something).
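                | 
                | For anyone curious what the "parallel reduction" exercise
                | looks like, here is a rough block-level sketch of the
                | classic shared-memory version; it assumes blockDim.x is a
                | power of two, and real code would reach for cub/thrust
                | instead:
                | 
                |     // each block sums its slice into out[blockIdx.x]
                |     __global__ void block_sum(const float *in,
                |                               float *out, int n) {
                |         extern __shared__ float buf[];
                |         int tid = threadIdx.x;
                |         int i = blockIdx.x * blockDim.x + threadIdx.x;
                |         buf[tid] = (i < n) ? in[i] : 0.0f;
                |         __syncthreads();
                |         // tree reduction within the block
                |         for (int s = blockDim.x / 2; s > 0; s >>= 1) {
                |             if (tid < s) buf[tid] += buf[tid + s];
                |             __syncthreads();
                |         }
                |         if (tid == 0) out[blockIdx.x] = buf[0];
                |     }
                | 
                | launched as block_sum<<<blocks, threads,
                | threads * sizeof(float)>>>(d_in, d_out, n), then repeated
                | or finished on the host.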
        
               | Kamq wrote:
               | This is fair, and if you've got the time and inclination,
               | I'd love to hear about your experience and the tricks you
               | ended up pulling. There are definitely advanced areas of
               | CUDA, and you can go deeper on nearly anything.
               | 
               | But we are in a comment chain spawned by:
               | 
               | > CUDA is fairly straightforward for many tasks and in
               | many cases there is an easy 100x improvement in
               | processing speed just sitting there to be had with <100
               | lines of code.
               | 
               | And a follow up comment about how easy it would be to
               | write that "<100 lines of code", so I feel like we're
               | definitely talking about the easy case of naturally
               | parallel calculations, and sticking to that as an intro
               | seems fair to me.
        
               | llukas wrote:
               | There are libraries that help with more tricky stuff on-
               | device like cub or cuFFTDx.
        
               | samus wrote:
               | Parallel reduction and prefix sums are exercises to train
               | people how to reformulate algorithms for the GPU and how
               | to identify performance bottlenecks specific to GPUs. In
               | practice, you'd use library functions.
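                | 
                | E.g. with Thrust the whole sum collapses to something
                | like this (a sketch, not from the article):
                | 
                |     #include <thrust/device_vector.h>
                |     #include <thrust/reduce.h>
                | 
                |     int main() {
                |         // 1M ones on the device
                |         thrust::device_vector<float> d(1 << 20, 1.0f);
                |         float sum = thrust::reduce(d.begin(), d.end());
                |     }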
        
         | mft_ wrote:
         | I've wondered for a while: is there a space for a (new?)
         | language which _invisibly_ maximises performance, whatever
         | hardware it is run on?
         | 
          | As in, _every instruction_, from a simple loop of calculations
         | onward, is designed behind the scenes so that it intelligently
         | maximises usage of every available CPU core in parallel, and
         | also farms everything possible out to the GPU?
         | 
         | Has this been done? Is it possible?
        
           | teaearlgraycold wrote:
           | I'm not sure if that's a good idea at the moment, but we
           | should start with making development with vector instructions
           | more approachable. The code should look more or less the same
           | as working with u64s.
        
           | jfoutz wrote:
           | There's definitely a space for it. It may even be possible.
            | But if you consider the long history of lisp discussions
            | (flamewars?) about "a sufficiently smart compiler" and
            | comparisons to C, or maybe Java vs C++, it seems unlikely. At
            | least very, very difficult.
           | 
           | There are little bits of research on algorithm replacement.
           | Like, have the compiler detect that you're trying to sort,
            | and generate the code for quick sort or timsort. It works,
           | kinda. There are a lot of ways to hide a sort in code, and
           | the compiler can't readily find them all.
        
           | Simpliplant wrote:
            | Not exactly that, but Mojo sounds like the closest of the
            | available options
           | 
           | https://www.modular.com/mojo
        
           | huijzer wrote:
           | There are many languages doing that more or less. Jax and
           | Mojo for example.
        
           | pjc50 wrote:
           | I'm not sure that's even possible in principle; consider the
           | various anti-performance algorithms of proof-of-waste
           | systems, where every step is data-dependent on the previous
           | one and the table of intermediate results required may be
           | made arbitrarily big.
           | 
           | It's a bit like "design a zip algorithm which can compress
           | any file".
        
             | lwhi wrote:
             | I'd imagine it wouldn't be very difficult to build language
             | constructs that are able to denote when high parallelism is
             | desirable; and let the compiler deal with this information
             | as necessary.
        
             | drdeca wrote:
             | I don't see why such a "proof of waste" algorithm would be
             | an obstacle to such an optimizer existing. Wouldn't it just
             | be that for such computational problems, the optimal
             | implementation would still be rather costly? That doesn't
             | mean the optimizer failed. If it made the program as
             | efficient as possible, for the computational task it
             | implements, then the optimizer has done its job.
        
           | howling wrote:
           | You might be interested in
           | https://github.com/HigherOrderCO/HVM
        
             | runlaszlorun wrote:
             | HVM looks very interesting. Thx for posting.
        
           | kaba0 wrote:
            | Not for mixed CPU/GPU, but there is the concept of a
            | superoptimizer, which basically brute-forces the most optimal
            | correct code. But it is not practical except for very, very
            | short program snippets (and superoptimizers are usually
            | CPU-only, though there is no fundamental reason why one
            | couldn't target the GPU as well).
           | 
           | There is also https://futhark-lang.org/ , though I haven't
           | tried it, just heard about it.
        
         | sgbeal wrote:
         | > C is very performant, and hard to code in, so people just use
         | python as a abstraction layer over C.
         | 
         | C is a way of life. Those of us who code exclusively, or nearly
         | so, in C cannot stomach python's notion of "significant white-
         | space."
        
           | yoyohello13 wrote:
           | I code in both (c for hobbies and python professionally) and
           | "significant white space" is a non-issue if you spend any
           | amount of time getting used to it.
           | 
           | Complaining about significant white-space is like complaining
           | that lisp has too many parentheses. It's an aesthetic
           | preference that just doesn't matter in practice.
        
             | adventured wrote:
             | A form of Sayre's Law is very common in tech (eg spaces vs
             | tabs; framework vs framework; language vs language).
        
           | adolph wrote:
           | > cannot stomach python's notion of "significant white-
           | space."
           | 
           | Why belly ache about it? Whitespace is significant to one's
           | fellow humans.
        
             | boredtofears wrote:
             | Precisely why it should be of no significance to the
             | machine.
        
               | drdrey wrote:
               | Source code is not for the machine to read, it's for your
               | fellow humans
        
           | Phemist wrote:
           | Wait till you start using the black formatter tool.
           | 
           | Well-known for supporting any formatting style you like ;)
        
           | dboreham wrote:
           | You get used to the significant whitespace. (C programmer
           | since ~1978).
        
             | bigstrat2003 wrote:
             | I never did, and it's one of the things I hate most about
             | Python to this day. I still use Python because it's the
             | best tool a lot of the time, but it's such a terrible
             | language decision to have significant whitespace imo.
        
           | bart_spoon wrote:
           | All programming languages, including C, have significant
           | white space. Python just has slightly more.
        
           | mhh__ wrote:
           | I actually find python and C very similar in spirit.
           | 
           | Syntax is mostly an irrelevance, they have surprisingly
           | similar patterns in my opinion.
           | 
           | In a modern language I want a type system that both reduces
           | risk and reduces typing -- safety and metaprogramming. C
           | obviously doesn't, python doesn't really either.
           | 
           | Python's approach to dynamic-ness is very similar to how I'd
           | expect C to be as a dynamic language (if it had proper
           | arrays/lists).
        
         | mhh__ wrote:
          | You can pretty easily get C performance with pythonic
          | expressiveness in a more modern language (I'd argue that C's
          | lack of abstraction makes slow-but-simple code more appealing).
        
       | the__alchemist wrote:
       | Another tip that took me longer than I wished to figure out.
       | 
        | Use CUDA rather than graphics APIs + compute shaders. The latter
        | (Vulkan compute, etc.) is high friction. CUDA is far easier to
        | write in, and the resulting code is easier to maintain and update.
        
         | behnamoh wrote:
         | Yes, and in the process, contribute to Nvidia's monopoly.
        
           | bogwog wrote:
           | It isn't the consumer's responsibility to regulate the
           | market.
        
             | danielmarkbruce wrote:
             | And, in this case NVIDIA earned it. They built a very
             | useful software layer around their chips.
        
             | dzikimarian wrote:
             | If consumers don't care about their money, then who would?
        
             | calamari4065 wrote:
             | Then whose responsibility is it? Corporations? The
             | government? Or maybe the tooth fairy?
        
             | tjoff wrote:
             | That is a weak argument that could be used to justify tons
             | of behavior, very convenient.
             | 
              | Vote with your feet. Maybe you can't, or can't afford to;
              | then at least admit the problem to yourself, and maybe don't
              | try to persuade others in order to feel better about your
              | own decision.
        
           | tovej wrote:
           | Eh, CUDA can mostly be transformed to HIP, unless you use
           | specialized NVIDIA stuff.
        
           | the__alchemist wrote:
           | I'm with you, and am surprised there isn't comparable
           | competition.
        
           | raincole wrote:
            | _Very_ few professional artists refuse to use Photoshop or
           | After Effects because it will  "contribute to Adobe's
           | monopoly".
           | 
           | But for some reasons professional programmers are judged
           | under a much higher moral standard.
        
             | ickelbawd wrote:
              | I think it's because professional artists can't typically
              | make their own software tools. Whereas engineers could in
              | theory
             | make their own tools. Naturally few do in practice though
             | as tech has become far too large and specialized. But our
             | roots are where our values and ideals come from.
        
               | hutzlibu wrote:
               | That is a really theoretical point.
               | 
                | If I start to work on a tool, then I cannot work anymore
                | on what I actually wanted to do. And it just so happens
                | ... that this is exactly what I did, and I can just say it
                | usually takes way longer than the most pessimistic
                | estimate one can come up with. So yes, one can decide to
                | switch careers and try to get funding to (re)build what is
                | not offered under acceptable conditions (in my case the
                | tool simply did not exist, though).
                | 
                | Just like an artist can switch careers, study CS, build on
                | his own a tool that a professional company built with a
                | team over years - and then someday work with his tool to
                | accomplish his original work. In (simplified) theory, lots
                | of things are possible ..
        
             | Aurornis wrote:
             | > But for some reasons professional programmers are judged
             | under a much higher moral standard
             | 
             | Not in the real world. Most programmers who are trying to
             | get a job done won't avoid CUDA or AWS or other tools just
             | to avoid "contributing to a monopoly". When responsible
             | programmers have a job to do and tools are available to
             | help with the job, they get used.
             | 
             | A programmer who avoids mainstream tools on principle is
             | liable to get surpassed by their peers very quickly. I've
             | only met a few people like this in industry and they didn't
             | last very long trying to do everything the hard way just to
             | avoid tools from corporations or monopolies or open source
             | that wasn't pure enough for their standards.
             | 
             | It's only really in internet comment sections that people
             | push ideological purity like this.
        
               | z3phyr wrote:
                | The same attitude brought us the adoption of Linux. So IDK
        
               | arcanemachiner wrote:
               | Most lottery tickets aren't winners.
        
             | Karliss wrote:
             | Tool choice of artists has close to 0 impact on people
             | interacting with final work. Choices made by programmers
             | are amplified through the users of produced software.
        
             | timeon wrote:
             | And they ended up with Creative Cloud bloatware.
        
             | yowlingcat wrote:
             | > But for some reasons professional programmers are judged
             | under a much higher moral standard.
             | 
             | I believe the key word there is "professional" -- one of
              | the challenges of a venue like HN is that the professional
              | engineers and the less-professional ones interact from
              | worldviews and use cases so distinct that they may as well
             | be separate universes. In other spaces, we wouldn't let a
             | top doctor have to explain very basic concepts about the
             | commercial practice of medicine to an amateur "skeptic" and
             | yet so many discussions on HN degenerate along just these
             | lines.
             | 
             | On the other hand, it's that very same inclusiveness and
             | generally high discourse in spite of that wide expanse
             | which make HN such a special community, so I'm not sure
             | what to conclude besides this unfortunate characteristic
             | being a necessary "feature, not a bug" of the community.
             | There's no way around it that wouldn't make the community a
             | lesser place, I think.
        
           | gjsman-1000 wrote:
           | So sayeth the person who has never written OpenCL.
        
           | ilaksh wrote:
           | Is there something that is not vendor specific? Maybe a
           | parallel programming language that compiles to different
           | targets?
           | 
           | ..and doesn't suck.
        
             | atq2119 wrote:
             | The part that I don't understand is why AMD/Intel/somebody
             | else don't just implement at least the base CUDA for their
             | products.
             | 
             | HIP is basically that, but they still make you jump through
             | hoops to rename everything etc.
             | 
             | There are libraries written at a lower level that wouldn't
             | be immediately portable, but surely that could be addressed
             | over time as well.
        
         | nwoli wrote:
          | I agree CUDA is really nice to write in, but I'm curious, what
          | reason do you have to write raw CUDA? Usually I find that it's
          | pre-written kernels you deal with.
        
           | the__alchemist wrote:
           | Currently doing computational chemistry, but per the article,
           | it's a fundamental part of my toolkit going forward; I think
           | it will be a useful tool for many applications.
        
         | drdrey wrote:
         | What about on macOS? Is OpenCL viable?
        
         | dinosaurdynasty wrote:
          | I wish someone would make the latter much easier. I'm
          | definitely not the only person who has gotten interested in
          | this AI stuff lately, has a high-tier AMD card that could
          | _surely_ do this stuff, and would like to run it locally for
          | various reasons.
         | 
         | Currently I've given up and use runpod, but still...
        
       | password4321 wrote:
       | I need a buyers guide: what's the minimum to spend, and best at a
       | few budget tiers? Unfortunately that info changes occasionally
       | and I'm not sure if there's any resource that keeps on top of
       | things.
        
         | alsodumb wrote:
         | https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
         | 
         | This is the best one imo.
        
         | coffeebeqn wrote:
          | You can also rent compute online if you don't want to
          | immediately plop down 1-2k.
        
         | fulafel wrote:
         | For learning basics of GPU programming your iGPU will do fine.
         | Actual real-world applications are very varied of course.
        
         | johndough wrote:
         | Google Colab, Kaggle Notebooks and Paperspace Notebooks all
          | offer free GPU usage (within limits), so you do not need to
          | spend anything to learn GPU programming.
         | 
         | https://colab.google/
         | 
         | https://www.kaggle.com/docs/notebooks
         | 
         | https://www.paperspace.com/gradient/free-gpu
        
       | alberth wrote:
       | Nx / Axon
       | 
       | Given that most programming languages are designed for sequential
       | processing (like CPUs), but Erlang/Elixir is designed for
       | parallelism (like GPUs) ... I really wonder if Nx / Axon (Elixir)
       | will take off.
       | 
       | https://github.com/elixir-nx/
        
         | oytis wrote:
          | Erlang was designed for distributed systems with a lot of
          | _concurrency_, not for computation-heavy _parallelism_.
        
         | coffeebeqn wrote:
         | Would that run on a GPU? I think the future is having both.
         | Sequential programming is still the best abstraction for most
         | tasks that don't require immense parallel execution
        
           | dartos wrote:
            | Axon runs compute graphs on the GPU, but Elixir's parallelism
            | abstractions run on the CPU.
        
         | matrss wrote:
         | I am really wondering how well Elixir with Nx would perform for
         | computation heavy workloads on a HPC cluster. Architecturally,
         | it isn't _that_ dissimilar to MPI, which is often used in that
         | field. It should be a lot more accessible though, like numpy
         | and the entire scientific python stack.
        
         | zoogeny wrote:
         | I've been investigating this and I wonder if the combination of
         | Elixir and Nx/Axon might be a good fit for architectures like
         | NVIDIA Grace Hopper where there is a mix of CPU and GPU.
        
       | arriu wrote:
        | Are AMD GPUs still to be avoided, or are they workable at this
        | point?
        
         | JonChesterfield wrote:
         | The cuda happy path is very polished and works reliably. The
         | amdgpu happy path fights you a little but basically works. I
         | think the amd libraries starting to be packaged under Linux is
         | a big deal.
         | 
         | If you don't want to follow the happy path, on Nvidia you get
         | to beg them to maybe support your use case in future. On
         | amdgpu, you get the option to build it yourself, where almost
         | all the pieces are open source and pliable. The driver ships in
         | Linux. The userspace is on GitHub. It's only GPU firmware which
         | is an opaque blob at present, and that's arguably equivalent to
         | not being able to easily modify the silicon.
        
         | latchkey wrote:
         | AMD GPUs work great, the issue is that people don't want to
         | mess with ROCm/HIP when CUDA is kind of the documented
         | workflow. Along with the fact that ROCm was stagnant for a long
         | time. AMD missed the first AI wave, but are now committed to
         | making ROCm into the best it can be.
         | 
         | The other problem is that there aren't any places to rent the
         | high end AMD AI/ML GPUs, like the MI250's and soon to be
         | released MI300's. They are only available on things like the
         | Frontier super computer, which few developers have access to.
         | "regular" developers are stuck without easy access to this
         | equipment.
         | 
          | I'm working on the latter problem. I'd like to create more of a
         | flywheel effect. Get more developers interested in AMD by
         | enabling them to inexpensively rent and do development on them,
         | which will create more demand. @gmail if you'd like to be an
         | early adopter.
        
       | k1ns wrote:
       | I am very new to GPU programming in general and this article was
        | a fun read. It's amazing how far we've come, a prime example
        | being how easily you can train a simple "dog or cat" NN.
        
       | z5h wrote:
       | We have compilers (languages) like Futhark that aim to optimize
       | explicitly parallel operations. And universal models of
       | computation like interaction nets that are inherently parallel.
       | 
        | Is it lazy of me to expect we'll be getting a lot more
        | "parallel-on-the-GPU-for-free" in the future?
        
         | zozbot234 wrote:
         | You can already convert a compute graph to GPU-optimized code
         | using something like Aesara (formerly known as Theano) or
         | TensorFlow. There are also efforts in the systems space that
         | ought to make this kind of thing more widespread in the future,
         | such as the MLIR backend for LLVM.
        
       | rushingcreek wrote:
       | Good read. However, the AWS P5 instance (along with P4d and P4de)
       | is most certainly oriented towards training, not inference. The
       | most inference-friendly instance types are the G4dn and the G5,
       | which feature T4 and A10G GPUs, respectively.
        
         | axpy906 wrote:
         | Came here to say this author forgot G5.
        
       | bilsbie wrote:
        | Can I do this on my GeForce GTX 970 4GB?
        
       | dartos wrote:
       | I don't think transformer models generate multiple tokens in
       | parallel (how could they?)
       | 
       | They just leverage parallelism in making a single prediction
        
         | atomicnature wrote:
         | Transformers tend to be _trained_ in parallel. BERT = 512
         | tokens per context, in parallel. GPT too is trained while
         | feeding in multiple words in parallel. This enables us to build
          | larger models. Older models, such as RNNs, couldn't be trained
          | this way, limiting their power/quality.
        
           | dartos wrote:
           | Ahh, that makes a lot of sense
        
           | zozbot234 wrote:
           | This is only sort of true, since you can still train RNNs
           | (including LSTM, etc.) in big batches-- which is usually
           | plenty enough to make use of your GPU's parallel
           | capabilities. The inherently serial part only applies to the
           | _length_ of your context. Transformer architectures thus
            | happen to be helpful if you have lots of idle GPUs such
           | that you're actually constrained by not being able to
           | parallelize along the context dimension.
        
             | atomicnature wrote:
              | In RNNs, hidden states have to be computed sequentially; in
              | transformers with the attention mechanism, we break free of
              | the sequential requirement. Transformers are more amenable
              | to parallelism, and make the most use of GPUs (within the
              | context axis, and outside it).
        
         | DougBTX wrote:
         | A quick search uncovers [0] with a hint towards an answer: just
         | train the model to output multiple tokens at once.
         | 
         | [0] https://arxiv.org/abs/2111.12701
        
       | JonChesterfield wrote:
        | "CPUs are good at serial code and GPUs are good at parallel
        | code" is kind of true but something of an approximation. Assume
        | an equivalent power budget in the roughly hundreds-of-watts
        | range, then:
       | 
        | A CPU has ~100 "cores", each running one (and a hyperthread)
        | independent thing, and it hides memory latency by branch
        | prediction and pipelining.
       | 
       | A GPU has ~100 "compute units", each running ~80 independent
       | things interleaved, and it hides memory latency by executing the
       | next instruction from one of the other 80 things.
       | 
       | Terminology is a bit of a mess, and the CPU probably has a 256bit
       | wide vector unit while the GPU probably has a 2048bit wide vector
       | unit, but from a short distance the two architectures look rather
       | similar.
        
         | bee_rider wrote:
         | I'm always surprised there isn't a movement toward pairing a
         | few low latency cores with a large number of high throughput
          | cores. Surround a single Intel P core with a bunch of E cores.
         | Then, hanging off the E cores, stick a bunch of iGPU cores
         | and/or AVX-512 units.
         | 
         | Call it Xeon Chi.
        
           | softfalcon wrote:
           | Neat idea, probably even viable!
           | 
           | I think they may have a hurdle of getting folks to buy into
           | the concept though.
           | 
            | I imagine it would be analogous to how Arria FPGAs were
            | included with certain Xeon CPUs, which further backs up your
           | point that this could happen in the near future!
        
           | Const-me wrote:
            | I think one possible reason for that is that, ideally, these
            | things need different memory.
           | 
            | If you use high-bandwidth, high-latency GDDR memory, CPU
            | cores will underperform due to high latency, as seen here:
           | https://www.tomshardware.com/reviews/amd-4700s-desktop-
           | kit-r...
           | 
           | If you use low-latency memory, GPU cores will underperform
           | due to low bandwidth, see modern AMD APUs with many RDNA3
            | cores connected to DDR5 memory. On paper, the Radeon 780M
            | delivers up to 9 FP32 TFLOPS, a figure close to the desktop
            | Radeon RX 6700, which is substantially faster in gaming.
        
             | bee_rider wrote:
             | Hmm, that is a good point. Since it is a dream-computer
             | anyway, maybe we can do 2.5d packaging; put the ddr memory
             | right on top so the P cores can reach it quickly, then
             | surround the whole thing with GDDR.
        
           | pixelpoet wrote:
           | You mean like an iGPU?
           | 
           | Edit: Oh, thanks for the downvote, with no discussion of the
           | question. I'll just sit here quietly with my commercial
           | OpenCL software that happily exploits these vector units
           | attached to the normal CPU cores.
        
         | mmoskal wrote:
         | GPU has 10x the memory bandwidth of the CPU though, which
         | becomes relevant for the LLMs where you essentially have to
         | read the whole memory (if you're batching optimally, that is
         | using all the memory either for weights or for KV cache) to
         | produce one token of output.
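          | 
          | Rough arithmetic, just for scale (my numbers, assuming fp16
          | weights): a 7B-parameter model is ~14 GB, so ~1 TB/s of GPU
          | memory bandwidth caps unbatched decoding somewhere around 70
          | tokens/s, while ~50 GB/s of CPU DRAM bandwidth caps it at
          | roughly 3-4 tokens/s, regardless of FLOPS.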
        
           | winwang wrote:
           | GPUs also have 10x-100x FP/INT8 throughput watt-for-watt.
        
           | hurryer wrote:
           | GPU also has 10x memory latency compared to CPU.
           | 
            | And memory access order is much more important than on a
            | CPU. Truly random access has very bad performance.
        
       | Matumio wrote:
       | > When faced with multiple tasks, a CPU allocates its resources
       | to address each task one after the other
       | 
       | Ha! I wish CPUs were still that simple.
       | 
       | Granted, it is legitimate for the article to focus on the
       | programming model. But "CPUs execute instructions sequentially"
       | is basically wrong if you talk about performance. (There are
       | pipelines executing instructions in parallel, there is SIMD, and
       | multiple cores can work on the same problem.)
        
         | pclmulqdq wrote:
         | I think this post focused on the wrong things here. CPUs with
         | AVX-512 also have massive data parallelism, and CPUs can
         | execute many instructions at the same time. The big difference
         | is that CPUs spend a lot of their silicon and power handling
         | control flow to execute one thread efficiently, while GPUs
         | spend that silicon on more compute units and hide control flow
         | and memory latency by executing a lot of threads.
        
         | mhh__ wrote:
          | It will do multiple SIMD instructions at the same time, too.
        
       | aunty_helen wrote:
       | We're back to "every developer must know" clickbait articles?
        
         | igh4st wrote:
          | It seems so... one should take this article's statements with
          | a grain of salt.
        
         | mhh__ wrote:
          | Although I think they'll be replaced by ChatGPT, a _good_
          | article in that style is actually quite valuable.
         | 
         | I like attacking complexity head on, and have a good knowledge
         | of both quantitative methods & qualitative details of (say)
         | computer hardware so having an article that can tell me the
         | nitty gritty details of a field is appreciated.
         | 
         | Take "What every programmer should know about memory" -- should
         | _every_ programmer know? Perhaps not, but every _good_
         | programmer should at least have an appreciation of how a
         | computer actually works. This pays dividends _everywhere_ --
         | locality (the main idea that you should take away from that
         | article) is fast, easy to follow, and usually a result of good
         | code that fits a problem well.
        
       | shortrounddev2 wrote:
       | This article claims to be something every developer must know,
       | but it's a discussion of how GPUs are used in AI. Most developers
       | are not AI developers, nor do they interact with AI or use GPUs
        | directly. Not to mention the fact that this article barely
        | mentions 3D graphics at all, the reason GPUs exist.
        
         | lucb1e wrote:
         | One can benefit from knowing fundamentals of an adjacent field,
         | especially something as broadly applicable as machine learning.
         | 
         | - You might want to use some ML in the project you are assigned
         | next month
         | 
         | - It can help collaborating with someone who tackles that
         | aspect of a project
         | 
         | - Fundamental knowledge helps you understand the "AI" stuff
         | being marketed to your manager
         | 
         | The "I don't need this adjacent field" mentality feels familiar
         | from schools I went to: first I did system administration where
         | my classmates didn't care about programming because they felt
         | like they didn't understand it anyway and they would never need
          | it (scripting, anyone?); then I switched to a software
          | development school where, guess what, the kids didn't care
          | about networking and would never need it anyway. I don't
          | understand it; to me both are interesting, but more
          | practically: fast-forward five years and the term devops became
          | popular in job ads.
         | 
         | The article is 1500 words at a rough count. Average reading
         | speed is 250wpm, but for studying something, let's assume half
         | of that: 1500/125 = 12 minutes of your time. Perhaps you toy
         | around with it a little, run the code samples, and spend two
         | hours learning. That's not a huge time investment. Assuming
         | this is a good starting guide in the first place.
        
           | mrec wrote:
           | The objection isn't to the notion that "One can benefit from
           | knowing fundamentals of an adjacent field". It's that this is
           | "The bare minimum every developer must know". That's a much,
           | _much_ stronger claim.
           | 
           | I've come to see this sort of clickbait headline as playing
           | on the prevalence of imposter-syndrome insecurity among devs,
           | and try to ignore them on general principle.
        
             | lucb1e wrote:
             | Fair enough! I can kind of see the point that, if every
             | developer knew some basics, it would help them make good
             | decisions about their own projects, even if the answer is
             | "no, this doesn't need ML". On the other hand, you're of
             | course right that if you don't use ML, then it's clearly
             | not something you "must" know to do your job well.
        
         | sigmonsays wrote:
         | yeah a lot of assumptions were made that are inaccurate.
         | 
         | I agree that most developers are not AI developers... OP seems
         | to be a bit out of touch with the general population and
         | otherwise is assuming the world around them based on their own
         | perception.
        
         | Der_Einzige wrote:
         | Don't worry, you'll either be an AI developer or unemployed
         | within 5 years. This is indeed important for you, regardless if
         | you recognize this yet or not.
        
         | pixelpoet wrote:
         | Not to mention their passing example of Mandelbrot set
         | rendering only gets a 10x speedup, despite being the absolute
         | posterchild of FLOPs-limited computation.
         | 
         | Terrible article IMO.
        
           | pclmulqdq wrote:
           | You would expect at least 1000x, and that's probably where it
           | would be if they didn't include JIT compile time in their
           | time. Mandelbrot sets are a perfect example of a calculation
           | a GPU is good at.
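            | 
            | The per-pixel kernel is tiny and embarrassingly parallel; a
            | sketch (my CUDA C, not the article's Python version) looks
            | something like:
            | 
            |     // one thread per pixel, escape-time iteration count
            |     __global__ void mandelbrot(int *out, int w, int h,
            |                                int max_iter) {
            |         int px = blockIdx.x * blockDim.x + threadIdx.x;
            |         int py = blockIdx.y * blockDim.y + threadIdx.y;
            |         if (px >= w || py >= h) return;
            |         // map the pixel into roughly [-2,1] x [-1.5,1.5]
            |         float cx = -2.0f + 3.0f * px / w;
            |         float cy = -1.5f + 3.0f * py / h;
            |         float x = 0.0f, y = 0.0f;
            |         int it = 0;
            |         while (x * x + y * y <= 4.0f && it < max_iter) {
            |             float xt = x * x - y * y + cx;
            |             y = 2.0f * x * y + cy;
            |             x = xt;
            |             ++it;
            |         }
            |         out[py * w + px] = it;
            |     }
            | 
            | launched with one 2D grid over the image, e.g.
            | mandelbrot<<<dim3((w+15)/16, (h+15)/16), dim3(16,16)>>>(
            | d_out, w, h, 1000);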
        
         | j45 wrote:
          | Understanding how hardware is used is very beneficial for
          | programmers.
          | 
          | Lots of programmers started with an understanding of what
          | happens physically on the hardware when code runs, and it is an
          | unfair advantage when debugging at times.
        
         | oytis wrote:
         | > Most developers are not AI developers
         | 
         | I remember how I joined a startup after working for a
         | traditional embedded shop and a colleague made (friendly) fun
         | of me for not knowing how to use curl to post a JSON request. I
         | learned a lot since then about backend, frontend and
         | infrastructure despite still being an embedded developer. It
         | seems likely that people all around the industry will be in a
         | similar position when it comes to AI in the next years.
        
           | hhjinks wrote:
           | What do you think the industry will look like in the near
           | future?
        
         | outside1234 wrote:
         | And honestly, for most "AI developers" if you are training your
         | own model these days (versus using an already trained one) -
         | you are probably doing it wrong.
        
         | sbmthakur wrote:
         | I would have probably opened it if it weren't for the title
         | bait.
        
         | bigstrat2003 wrote:
         | I've noticed that every time I see an article claiming that its
         | subject is something "every developer must know", that claim is
         | false. Maybe there are articles which contain information that
         | everyone must know, but all I encounter is clickbait.
        
         | BlueTemplar wrote:
         | Even worse, it says "GPUs", but isn't CUDA a closed feature
         | limited to Nvidia cards, and maybe even a subset of them ?
         | 
         | (I'm not touching Nvidia since they don't provide open source
         | drivers.)
        
       | lucb1e wrote:
       | > AWS GPU Instances: A Beginner's Guide [...] Here are the
       | different types of AWS GPU instances and their use cases
       | 
       | The section goes on to teach Amazon-specific terminology and
       | products.
       | 
       | A "bare minimum everyone must know" guide should not include
       | vendor-specific guidance. I had this in school with Microsoft
       | already, with never a mention of Linux because they already paid
       | for Windows Server licenses for all of us...
       | 
       | Edit: and speaking of inclusivity, the screenshots-of-text have
       | their alt text set to "Alt text". Very useful. It doesn't need to
       | be verbatim copies, but it could at least summarize in a few
       | words what you're meant to get from the terminal screenshot to
       | help people that use screen readers.
       | 
       | Since this comment floated to the top, I want to also say that I
       | didn't mean for this to dominate the conversation! The guide may
       | not be perfect, but it helped me by showing how to run arbitrary
       | code on my GPU. A few years ago I also looked into it, but came
       | away thinking it's dark magic that I can't make use of. The
       | practical examples in both high- and low-level languages are
       | useful
       | 
       | Another edit: cool, this comment went from all the way at the top
       | to all the way at the bottom, without losing a single vote. I
       | agree it shouldn't be the very top thing, but this moderation
       | also feels weird
        
         | mikehollinger wrote:
         | Agreed. This isn't actually that useful of a guide in the first
         | place.
         | 
         | Tbh the most basic question is: "are you innovating inside the
         | AI box or outside the AI box?"
         | 
         | If inside - this guide doesn't really share anything practical.
         | Like if you're going to be tinkering with a core algorithm and
         | trying to optimize it, understanding BLAS and cuBLAS or
         | whatever AMD / Apple / Google equivalent, then understanding
         | what pandas, torch, numpy and a variety of other tools are
         | doing for you, then being able to wield these effectively makes
         | more sense.
         | 
         | If outside the box - understanding how to spot the signs of
         | inefficient use of resource - whether that's network, storage,
         | accelerator, cpu, or memory, and then reasoning through how to
         | reduce that bottleneck.
         | 
          | Like - I'm certain we will see this in the near future, but off
          | the top of my head the innocent but incorrect things people do:
          | 
          |   1. Sending single requests, instead of batching
          |   2. Using a synchronous programming model when asynchronous is
          |      probably better
          |   3. Sending data across a compute boundary unnecessarily
          |   4. Sending too much data
          |   5. Assuming all accelerators are the same. That T4 gpu is
          |      cheaper than an H100 for a reason.
          |   6. Ignoring bandwidth limitations
          |   7. Ignoring access patterns
        
         | StableAlkyne wrote:
          | Are there any surveys of just how many Windows Server boxes
          | exist?
         | 
         | Even when I was working at an Azure-only shop, I've never
         | actually seen anyone use Windows Server. Lots of CentOS (before
         | IBM ruined it) and other Unixes, but never a Windows Server.
        
           | lucb1e wrote:
           | We come across them all the time when doing internal network
           | pentests (most organizations use AD for managing their fleet
           | of end-user systems), and occasionally external tests as
           | well. Stackoverflow is a site that comes to mind as being
           | known for running their production systems on Windows Server.
           | 
           | It's useful to have experienced, but I do take issue with
           | exclusively (or primarily) focusing on one ecosystem as a
           | mostly-publicly-funded school.
        
             | StableAlkyne wrote:
             | Huh, TIL StackOverflow is on Windows Server
        
       | Matumio wrote:
       | The Mandelbrot example seems to make interpreted Python stand in
       | for "the CPU performance"?
       | 
       | If that's true, then I'm surprised they only see a 10x speed-up.
       | I would expect more from only compiling that loop for the CPU.
       | (Comparing to interpreted Python without numpy.) Given they
       | already have a numba version, why not compile it for the CPU and
       | compare?
       | 
       | Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores
       | these days?) They go on suggest to rent an AWS GPU for $3 per
       | hour. You're more likely to get 128 cores for that price, still
       | on a single VM.
       | 
       | Not saying it will be easy to write multi-threaded code for the
       | CPU. But if you're lucky, the Python library you're using already
       | does it.
        
         | lucb1e wrote:
         | > Also, they say consumer CPUs have 2-16 cores. (Who has 2
         | cores these days?)
         | 
         | Pretty sure my mom's laptop has 2 cores; I can't think of
         | anyone whose daily driver has 16 cores. Real cores, not
         | hyperthread stuff running at 0.3x the performance of a real
         | core.
         | 
         | As for the 128-core server system, note that those cores are
         | typically about as powerful as a 2008 notebook. My decade-old
         | laptop CPU outperforms what you get at DigitalOcean today, and
         | storage performance is a similar story. The sheer number makes
         | up for it, of course, but "number of cores" is not a 1:1
         | comparable metric.
         | 
         | Agree, though, that the 10x speedup seems low. Perhaps, at
         | 0.4s, a relatively large fraction of that time is spent on
         | initializing the Python runtime (`time python3 -c
         | 'print("1337")'` = 60ms), the module they import that needs to
         | do device discovery, etc.? Hashcat, for example, takes like 15
         | seconds to get started even if it then runs very fast after
         | that.
        
           | 65a wrote:
           | My 2016 desktop had 22 cores and 44 threads. You can have the
           | same processor for < $200 on ebay right now.
        
       | shmerl wrote:
       | Shouldn't the article mention SIMD? I haven't seen it even being
       | brought up.
        
       | johndough wrote:
       | The code in this article is incorrect. The CUDA kernel is never
       | called:
       | https://github.com/RijulTP/GPUToolkit/blob/f17fec12e008d0d37...
       | 
        | I'd also like to point out that 90% of the time taken to
        | "compute" the Mandelbrot set with the JIT-compiled code is spent
        | on compiling the function, not on computation.
       | 
       | If you actually want to learn something about CUDA, implementing
       | matrix multiplication is a great exercise. Here are two
       | tutorials:
       | 
       | https://cnugteren.github.io/tutorial/pages/page1.html
       | 
       | https://siboehm.com/articles/22/CUDA-MMM
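        | 
        | Both tutorials start from a naive kernel, which is only a handful
        | of lines (roughly like the sketch below) and already maps one
        | thread to one output element; the interesting part is then making
        | it fast with tiling, shared memory, etc.
        | 
        |     // naive C = A * B for square n x n row-major matrices
        |     __global__ void matmul(const float *A, const float *B,
        |                            float *C, int n) {
        |         int row = blockIdx.y * blockDim.y + threadIdx.y;
        |         int col = blockIdx.x * blockDim.x + threadIdx.x;
        |         if (row < n && col < n) {
        |             float acc = 0.0f;
        |             for (int k = 0; k < n; ++k)
        |                 acc += A[row * n + k] * B[k * n + col];
        |             C[row * n + col] = acc;
        |         }
        |     }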
        
         | Handprint4469 wrote:
          | Thank you for this; comments like yours are exactly why I keep
          | coming back to HN.
        
         | sevagh wrote:
         | >If you actually want to learn something about CUDA,
         | implementing matrix multiplication is a great exercise.
         | 
         | There is SAXPY (matrix math A*X+Y), purportedly ([1]) the hello
         | world of parallel math code.
         | 
         | >SAXPY stands for "Single-Precision A*X Plus Y". It is a
         | function in the standard Basic Linear Algebra Subroutines
          | (BLAS) library. SAXPY is a combination of scalar multiplication
         | and vector addition, and it's very simple: it takes as input
         | two vectors of 32-bit floats X and Y with N elements each, and
         | a scalar value A. It multiplies each element X[i] by A and adds
         | the result to Y[i].
         | 
         | [1]: https://developer.nvidia.com/blog/six-ways-saxpy/
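          | 
          | In CUDA C that description maps almost one-to-one onto the
          | kernel; a sketch:
          | 
          |     __global__ void saxpy(int n, float a,
          |                           const float *x, float *y) {
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i < n) y[i] = a * x[i] + y[i];
          |     }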
        
       | convexstrictly wrote:
       | A great beginner guide to GPU programming concepts:
       | 
       | https://github.com/srush/GPU-Puzzles
        
       ___________________________________________________________________
       (page generated 2023-11-12 23:01 UTC)