[HN Gopher] GPU Survival Toolkit for the AI age
___________________________________________________________________
GPU Survival Toolkit for the AI age
Author : lordwiz
Score : 262 points
Date : 2023-11-12 14:37 UTC (8 hours ago)
(HTM) web link (journal.hexmos.com)
(TXT) w3m dump (journal.hexmos.com)
| lordwiz wrote:
| In this AI age, it is crucial for developers to have a
| fundamental understanding of GPUs and their application to AI
| development.
| gmfawcett wrote:
| Crucial, for all developers? The great majority will get AI
| through an API.
| athreyac8 wrote:
| After a while there will come a time when you are the one
| providing that API to others; for that, yes, crucial I guess.
| anonylizard wrote:
| I think Python is dominant in AI because the Python-C
| relationship mirrors the CPU-GPU relationship.
|
| GPUs are extremely performant, and also very hard to code in, so
| people just use highly abstracted API calls like PyTorch to
| command the GPU.
|
| C is very performant, and hard to code in, so people just use
| Python as an abstraction layer over C.
|
| It's not clear if people need to understand GPUs that much
| (unless you are deep in AI training/ops land). In time, since
| Moore's law has ended and multithreading has become the dominant
| mode of speed increases, there'll probably be brand-new languages
| dedicated to this new paradigm of parallel programming. Mojo is a
| start.
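|
| (A concrete illustration of that abstraction: a minimal sketch
| using PyTorch, assuming it is installed; the sizes and device
| check are illustrative, not taken from the article. The same
| line of Python dispatches to CPU code or to CUDA kernels
| depending on where the tensors live.)
|
|     import torch
|
|     # pick the GPU if one is visible, otherwise fall back to the CPU
|     device = "cuda" if torch.cuda.is_available() else "cpu"
|
|     a = torch.randn(4096, 4096, device=device)
|     b = torch.randn(4096, 4096, device=device)
|
|     # one line of Python; the heavy lifting runs in C++/CUDA underneath
|     c = a @ b
|     print(c.device, c.shape)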
| sitkack wrote:
| Moore's law is far from over and multithreading is not the
| answer. Your opening sentence is spot on tho.
| bmc7505 wrote:
| Care to elaborate?
| guyomes wrote:
| No idea about the future of Moore's law precisely. Yet recent
| research results show that there is still room for faster
| semiconductors, as discussed on HN [0].
|
| [0]: https://news.ycombinator.com/item?id=38201624
| bmc7505 wrote:
| I mean why isn't multithreading the answer?
| samus wrote:
| Multithreading has the following disadvantages:
|
| * Overhead: the overhead to start and manage multiple
| threads is considerable in practice. Most multithreaded
| algorithms are in fact slower than optimized serial
| implementations when n_threads=1
|
| * Communication: threads have to communicate with each
| other and synchronize access to shared resources.
| "Embarrassingly parallel" problems don't require
| synchronization, but many interesting problems are not of
| that kind.
|
| * Amdahl's law: there is a point of diminishing returns
| on parallelizing an application since it quite likely
| contains parts that are not easily parallelized.
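|
| (Amdahl's law in one line: if a fraction p of the work can be
| spread over n threads, the best possible speedup is
| 1 / ((1 - p) + p / n). A quick Python sketch of that ceiling:)
|
|     def amdahl_speedup(p, n):
|         # best-case speedup when a fraction p runs on n threads
|         return 1.0 / ((1.0 - p) + p / n)
|
|     # even at 95% parallel, 64 threads give only ~15x,
|     # and the ceiling as n grows is 20x
|     for n in (2, 8, 64, 1024):
|         print(n, round(amdahl_speedup(0.95, n), 1))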
| bmc7505 wrote:
| Well, yes. But assuming the question is how to reduce
| overall latency, what alternative is there besides
| algorithmic improvements or increasing clock frequency?
| Kamq wrote:
| > Moore's law is far from over and multithreading is not the
| answer.
|
| Wut? We hit the power wall back in 2004. There was a little
| bit of optimization around the memory wall and ILP wall
| afterwards, but really, cores haven't gotten faster since.
|
| It's been all about being able to cram more cores in since
| then, which implies at least multi-threading, but multi-
| processing is basically required to get the most out of a CPU
| these days.
| ikura wrote:
| Moore's law is "the observation that the number of
| transistors in an integrated circuit doubles about every
| two years". For a while clock speed was a proxy for that
| metric, but it's not the 'law' itself.
| Kamq wrote:
| Yeah, but today number of cores is the rough proxy for
| that metric.
|
| How do you operate in that world if "multithreading isn't
| the answer"?
| samus wrote:
| Modern CPUs contain a lot more computing units than
| cores. For a while, hyperthreading was thought to be a
| useful way to make use of them. More recently, people
| have turned to advanced instruction sets like SSE and
| AVX.
| FridgeSeal wrote:
| Those things aren't mutually exclusive. Also, for starters, I
| suspect there's more "low-hanging fruit" in getting more
| software to make use of more cores. We're increasingly
| getting better languages, tooling and libs for multithreading,
| and it's far more in the realm of your average developer than
| writing SIMD-compatible code and making sure your code can
| pipeline properly.
| samus wrote:
| Threads are of course appropriate to implement high-level
| concurrency and parallelism. But for fine-grained
| parallelism, they are unwieldy and have high overhead.
|
| Spreading an algorithm across multiple threads makes it
| more difficult for an optimizing compiler to find
| opportunities for SIMD optimization.
|
| Similarly to how modern languages make it easier to
| safely use threads, runtimes also make it easier to take
| advantage of SIMD optimizations. For example, recently a
| SIMD-optimized sorting algorithm was included in OpenJDK.
| Apart from that, SIMD is way less brittle at runtime than
| GPUs and other accelerators.
| danielmarkbruce wrote:
| GPUs aren't that difficult to program. CUDA is fairly
| straightforward for many tasks and in many cases there is an
| easy 100x improvement in processing speed just sitting there to
| be had with <100 lines of code.
| z2h-a6n wrote:
| Sure, but if you've never used CUDA or any other GPU
| framework, how many lines of documentation do you need to
| read, and how many lines of code are you likely to write and
| rewrite and delete before you end up with those <100 lines of
| working code?
| danielmarkbruce wrote:
| They have nice getting started guides. Try it and see.
| It's... really pretty simple. There is a reason they've
| built a trillion $$ company - they've done a great job.
| z2h-a6n wrote:
| Ok, I tried it (for as long as my limited free time and
| interest in CUDA allows). The closest I came to a getting
| started guide is this [0], which by my (perhaps naive) count
| is 25561 lines of documentation, and I would
| probably need to learn more C++ to understand it in
| detail.
|
| I'm sure CUDA is great, and if I had more free time
| and/or better reasons to improve the performance of my
| code it would probably be great for me. My point was
| mainly that a few lines of code which may be trivial for
| one person to write may not be for someone else with
| different experience. Depending on what the code is being
| used for even a vast increase in performance may not be
| worth the extra time it takes to implement it.
|
| [0] https://docs.nvidia.com/cuda/cuda-c-programming-
| guide/index....
| danielmarkbruce wrote:
| https://developer.nvidia.com/blog/even-easier-
| introduction-c...
| jprete wrote:
| I haven't seen it but I can believe it - parallel
| programming of any variety is hard, and to be successful
| as a vendor of such a system would require uncommonly
| good API design to get the kind of traction that CUDA has
| gotten.
| Kamq wrote:
| > Sure, but if you've never used CUDA or any other GPU
| framework, how many lines of documentation do you need to
| read, and how many lines of code are you likely to write
| and rewrite and delete before you end up with those <100
| lines of working code?
|
| If you're already familiar with one of the languages that
| the nvidia compiler supports? Not that many. For people
| familiar with C or C++, it's a couple extra attributes, and
| a slightly different syntax for launching kernels vs a
| regular function call. I'm admittedly not experienced with
| Fortran, which is the other language they support, so I
| can't speak to that. There are C-style memory allocation
| functions, which might be annoying to C++ devs, but it's
| nothing that would confuse them.
|
| Edit: There are also a couple of weird magic globals you have
| access to in a kernel (blockIdx, blockDim, threadIdx), but
| those are generally covered in the intros.
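|
| (For reference, those pieces look roughly like the sketch below.
| It uses the Python/Numba flavour of CUDA rather than C/C++, so
| the launch brackets stand in for <<<blocks, threads>>> and
| @cuda.jit stands in for __global__. Assumes numba is installed
| and a CUDA-capable GPU is present; the array and sizes are made
| up for illustration.)
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit                  # counterpart of C/C++'s __global__
|     def add_one(arr):
|         # the "magic globals": block index, block width, thread index
|         i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
|         if i < arr.size:       # the grid may be larger than the array
|             arr[i] += 1.0
|
|     data = np.zeros(1_000_000, dtype=np.float32)
|     d_data = cuda.to_device(data)       # explicit host-to-device copy
|     threads = 256
|     blocks = (data.size + threads - 1) // threads
|     add_one[blocks, threads](d_data)    # a launch, not a plain call
|     data = d_data.copy_to_host()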
| wiz21c wrote:
| My experience is different. I had to code various
| instances of "parallel reduction" and "prefix sum" and
| it's not easy to get into it (took me a day or two).
| Moreover, coming from an age where 640KB of RAM were
| considered quite enough, truly realizing the power of the
| GPU was not easy because my tasks are not quite parallel
| and not quite thread coherent (I grant you that doing
| naturally parallel stuff is dead simple). It took me a
| while and a lot of nvidia-nsight to max out the GPU...
| Moreover, I was a bit slow to actually understand how
| powerful a GPU is (for example, my GPU gave poor
| performance unless I gave it a big enough problem, so I
| was (wrongly) disappointed when testing toy problems)
|
| But once that challenge is overcome, GPU truly rocks.
|
| Finally, debugging complex shaders (I do a specific kind
| of computational fluid dynamics where the equations are
| not that easy, full of "if/then" edge cases, etc.) is not
| fun at all; tooling is sorely lacking (unless I've missed
| something).
| Kamq wrote:
| This is fair, and if you've got the time and inclination,
| I'd love to hear about your experience and the tricks you
| ended up pulling. There are definitely advanced areas of
| CUDA, and you can go deeper on nearly anything.
|
| But we are in a comment chain spawned by:
|
| > CUDA is fairly straightforward for many tasks and in
| many cases there is an easy 100x improvement in
| processing speed just sitting there to be had with <100
| lines of code.
|
| And a follow up comment about how easy it would be to
| write that "<100 lines of code", so I feel like we're
| definitely talking about the easy case of naturally
| parallel calculations, and sticking to that as an intro
| seems fair to me.
| llukas wrote:
| There are libraries that help with trickier stuff on-device,
| like CUB or cuFFTDx.
| samus wrote:
| Parallel reduction and prefix sums are exercises to train
| people how to reformulate algorithms for the GPU and how
| to identify performance bottlenecks specific to GPUs. In
| practice, you'd use library functions.
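|
| (For example, Numba exposes a ready-made GPU reduction, so the
| hand-rolled tree-reduction kernel stays a teaching exercise. A
| minimal sketch, assuming numba and a CUDA-capable GPU:)
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.reduce                # library-provided parallel reduction
|     def sum_reduce(a, b):
|         return a + b
|
|     values = np.arange(1_000_000, dtype=np.float64)
|     total = sum_reduce(values)  # the reduction runs on the GPU
|     print(np.isclose(total, values.sum()))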
| mft_ wrote:
| I've wondered for a while: is there a space for a (new?)
| language which _invisibly_ maximises performance, whatever
| hardware it is run on?
|
| As in, _every instruction_, from a simple loop of calculations
| onward, is designed behind the scenes so that it intelligently
| maximises usage of every available CPU core in parallel, and
| also farms everything possible out to the GPU?
|
| Has this been done? Is it possible?
| teaearlgraycold wrote:
| I'm not sure if that's a good idea at the moment, but we
| should start with making development with vector instructions
| more approachable. The code should look more or less the same
| as working with u64s.
| jfoutz wrote:
| There's definitely a space for it. It may even be possible.
| But if you consider the long history of Lisp discussions
| (flamewars?) about "a sufficiently smart compiler" and
| comparisons to C, or maybe Java vs C++, it seems unlikely.
| At least very, very difficult.
|
| There are little bits of research on algorithm replacement.
| Like, have the compiler detect that you're trying to sort,
| and generate the code for quicksort or timsort. It works,
| kinda. There are a lot of ways to hide a sort in code, and
| the compiler can't readily find them all.
| Simpliplant wrote:
| Not exactly that, but Mojo sounds closest among the available options:
|
| https://www.modular.com/mojo
| huijzer wrote:
| There are many languages doing that more or less. Jax and
| Mojo for example.
| pjc50 wrote:
| I'm not sure that's even possible in principle; consider the
| various anti-performance algorithms of proof-of-waste
| systems, where every step is data-dependent on the previous
| one and the table of intermediate results required may be
| made arbitrarily big.
|
| It's a bit like "design a zip algorithm which can compress
| any file".
| lwhi wrote:
| I'd imagine it wouldn't be very difficult to build language
| constructs that are able to denote when high parallelism is
| desirable; and let the compiler deal with this information
| as necessary.
| drdeca wrote:
| I don't see why such a "proof of waste" algorithm would be
| an obstacle to such an optimizer existing. Wouldn't it just
| be that for such computational problems, the optimal
| implementation would still be rather costly? That doesn't
| mean the optimizer failed. If it made the program as
| efficient as possible, for the computational task it
| implements, then the optimizer has done its job.
| howling wrote:
| You might be interested in
| https://github.com/HigherOrderCO/HVM
| runlaszlorun wrote:
| HVM looks very interesting. Thx for posting.
| kaba0 wrote:
| Not for mixed CPU/GPU, but there is the concept of a
| superoptimizer, which basically brute-forces the most
| optimal correct code. But it is not practical except for
| very, very short program snippets (and they are usually
| CPU-only, though there is no fundamental reason why one
| couldn't utilize the GPU as well).
|
| There is also https://futhark-lang.org/ , though I haven't
| tried it, just heard about it.
| sgbeal wrote:
| > C is very performant, and hard to code in, so people just use
| python as a abstraction layer over C.
|
| C is a way of life. Those of us who code exclusively, or nearly
| so, in C cannot stomach python's notion of "significant white-
| space."
| yoyohello13 wrote:
| I code in both (C for hobbies and Python professionally) and
| "significant white space" is a non-issue if you spend any
| amount of time getting used to it.
|
| Complaining about significant white-space is like complaining
| that lisp has too many parentheses. It's an aesthetic
| preference that just doesn't matter in practice.
| adventured wrote:
| A form of Sayre's Law is very common in tech (eg spaces vs
| tabs; framework vs framework; language vs language).
| adolph wrote:
| > cannot stomach python's notion of "significant white-
| space."
|
| Why belly ache about it? Whitespace is significant to one's
| fellow humans.
| boredtofears wrote:
| Precisely why it should be of no significance to the
| machine.
| drdrey wrote:
| Source code is not for the machine to read, it's for your
| fellow humans
| Phemist wrote:
| Wait till you start using the black formatter tool.
|
| Well-known for supporting any formatting style you like ;)
| dboreham wrote:
| You get used to the significant whitespace. (C programmer
| since ~1978).
| bigstrat2003 wrote:
| I never did, and it's one of the things I hate most about
| Python to this day. I still use Python because it's the
| best tool a lot of the time, but it's such a terrible
| language decision to have significant whitespace imo.
| bart_spoon wrote:
| All programming languages, including C, have significant
| white space. Python just has slightly more.
| mhh__ wrote:
| I actually find python and C very similar in spirit.
|
| Syntax is mostly an irrelevance, they have surprisingly
| similar patterns in my opinion.
|
| In a modern language I want a type system that both reduces
| risk and reduces typing -- safety and metaprogramming. C
| obviously doesn't, python doesn't really either.
|
| Python's approach to dynamic-ness is very similar to how I'd
| expect C to be as a dynamic language (if it had proper
| arrays/lists).
| mhh__ wrote:
| You can pretty easily get C performance (I'd argue that C's
| lack of abstraction makes slow but simple code more appealing)
| with Pythonic expressiveness using a more modern language.
| the__alchemist wrote:
| Another tip that took me longer than I wished to figure out.
|
| Use CUDA rather than graphics APIs + compute shaders. The
| latter approach (Vulkan compute, etc.) is high friction. CUDA
| is far easier to write in, and the resulting code is easier to
| maintain and update.
| behnamoh wrote:
| Yes, and in the process, contribute to Nvidia's monopoly.
| bogwog wrote:
| It isn't the consumer's responsibility to regulate the
| market.
| danielmarkbruce wrote:
| And, in this case NVIDIA earned it. They built a very
| useful software layer around their chips.
| dzikimarian wrote:
| If consumers don't care about their money, then who would?
| calamari4065 wrote:
| Then whose responsibility is it? Corporations? The
| government? Or maybe the tooth fairy?
| tjoff wrote:
| That is a weak argument that could be used to justify tons
| of behavior, very convenient.
|
| Vote with your feet. Maybe you can't or can't afford it,
| then at least admit the problem to yourself and maybe don't
| try to persuade others in order to feel better for your own
| decision.
| tovej wrote:
| Eh, CUDA can mostly be transformed to HIP, unless you use
| specialized NVIDIA stuff.
| the__alchemist wrote:
| I'm with you, and am surprised there isn't comparable
| competition.
| raincole wrote:
| _Very_ few professional artists refuse to use Photoshop or
| After Effects because it would "contribute to Adobe's
| monopoly".
|
| But for some reason professional programmers are held to a
| much higher moral standard.
| ickelbawd wrote:
| I think it's because professional artists typically can't
| make their own software tools, whereas engineers could in
| theory make their own. Naturally few do in practice, as tech
| has become far too large and specialized. But our roots are
| where our values and ideals come from.
| hutzlibu wrote:
| That is a really theoretical point.
|
| If I start to work on a tool, then I cannot work anymore
| on what I actually wanted to do. And it just so happens...
| that this is exactly what I did, and I can just say it
| usually takes way longer than the most pessimistic
| estimate one can come up with. So yes, one can decide to
| switch careers and try to get funding to (re)build what
| is not offered on acceptable terms (though in my case the
| tool simply did not exist).
|
| Just like an artist can switch careers, study CS, build on
| his own a tool that a professional company built with a
| team over years - and then someday use that tool to
| accomplish his original work. In (simplified) theory,
| lots of things are possible...
| Aurornis wrote:
| > But for some reason professional programmers are held to a
| much higher moral standard
|
| Not in the real world. Most programmers who are trying to
| get a job done won't avoid CUDA or AWS or other tools just
| to avoid "contributing to a monopoly". When responsible
| programmers have a job to do and tools are available to
| help with the job, they get used.
|
| A programmer who avoids mainstream tools on principle is
| liable to get surpassed by their peers very quickly. I've
| only met a few people like this in industry and they didn't
| last very long trying to do everything the hard way just to
| avoid tools from corporations or monopolies or open source
| that wasn't pure enough for their standards.
|
| It's only really in internet comment sections that people
| push ideological purity like this.
| z3phyr wrote:
| The same attitude brought us the adoption of Linux. So IDK.
| arcanemachiner wrote:
| Most lottery tickets aren't winners.
| Karliss wrote:
| Tool choice of artists has close to 0 impact on people
| interacting with final work. Choices made by programmers
| are amplified through the users of produced software.
| timeon wrote:
| And they ended up with Creative Cloud bloatware.
| yowlingcat wrote:
| > But for some reason professional programmers are held to a
| much higher moral standard.
|
| I believe the key word there is "professional" -- one of
| the challenges of a venue like HN is the professional
| engineers and the less-professional ones interact from
| worldviews and use cases so distinct that they may as well
| be separate universes. In other spaces, we wouldn't let a
| top doctor have to explain very basic concepts about the
| commercial practice of medicine to an amateur "skeptic" and
| yet so many discussions on HN degenerate along just these
| lines.
|
| On the other hand, it's that very same inclusiveness and
| generally high discourse in spite of that wide expanse
| which make HN such a special community, so I'm not sure
| what to conclude besides this unfortunate characteristic
| being a necessary "feature, not a bug" of the community.
| There's no way around it that wouldn't make the community a
| lesser place, I think.
| gjsman-1000 wrote:
| So sayeth the person who has never written OpenCL.
| ilaksh wrote:
| Is there something that is not vendor specific? Maybe a
| parallel programming language that compiles to different
| targets?
|
| ..and doesn't suck.
| atq2119 wrote:
| The part that I don't understand is why AMD/Intel/somebody
| else don't just implement at least the base CUDA for their
| products.
|
| HIP is basically that, but they still make you jump through
| hoops to rename everything etc.
|
| There are libraries written at a lower level that wouldn't
| be immediately portable, but surely that could be addressed
| over time as well.
| nwoli wrote:
| I agree CUDA is really nice to write in, but I'm curious what
| reason you have to write raw CUDA? Usually I find that it's
| pre-written kernels you deal with.
| the__alchemist wrote:
| Currently doing computational chemistry, but per the article,
| it's a fundamental part of my toolkit going forward; I think
| it will be a useful tool for many applications.
| drdrey wrote:
| What about on macOS? Is OpenCL viable?
| dinosaurdynasty wrote:
| I wish someone would make the latter much easier. I'm
| definitely not the only person who has gotten interested in
| this AI stuff lately, has a high-tier AMD card that could
| _surely_ handle it, and would like to run things locally
| for various reasons.
|
| Currently I've given up and use runpod, but still...
| password4321 wrote:
| I need a buyers guide: what's the minimum to spend, and best at a
| few budget tiers? Unfortunately that info changes occasionally
| and I'm not sure if there's any resource that keeps on top of
| things.
| alsodumb wrote:
| https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
|
| This is the best one imo.
| coffeebeqn wrote:
| You can also rent compute online if you don't want to
| immediately plop down $1-2k.
| fulafel wrote:
| For learning basics of GPU programming your iGPU will do fine.
| Actual real-world applications are very varied of course.
| johndough wrote:
| Google Colab, Kaggle Notebooks and Paperspace Notebooks all
| offer free GPU usage (within limits), so you do not need to
| spend anything to learn GPU programming.
|
| https://colab.google/
|
| https://www.kaggle.com/docs/notebooks
|
| https://www.paperspace.com/gradient/free-gpu
| alberth wrote:
| Nx / Axon
|
| Given that most programming languages are designed for sequential
| processing (like CPUs), while Erlang/Elixir is designed for
| parallelism (like GPUs)... I really wonder if Nx / Axon (Elixir)
| will take off.
|
| https://github.com/elixir-nx/
| oytis wrote:
| Erlang was designed for distributed systems with a lot of
| _concurrency_, not for computation-heavy _parallelism_.
| coffeebeqn wrote:
| Would that run on a GPU? I think the future is having both.
| Sequential programming is still the best abstraction for most
| tasks that don't require immense parallel execution
| dartos wrote:
| Axon runs compute graphs on the GPU, but Elixir's parallelism
| abstractions run on the CPU.
| matrss wrote:
| I am really wondering how well Elixir with Nx would perform for
| computation heavy workloads on a HPC cluster. Architecturally,
| it isn't _that_ dissimilar to MPI, which is often used in that
| field. It should be a lot more accessible though, like numpy
| and the entire scientific python stack.
| zoogeny wrote:
| I've been investigating this and I wonder if the combination of
| Elixir and Nx/Axon might be a good fit for architectures like
| NVIDIA Grace Hopper where there is a mix of CPU and GPU.
| arriu wrote:
| Are AMD GPUs still to be avoided, or are they workable at this
| point?
| JonChesterfield wrote:
| The cuda happy path is very polished and works reliably. The
| amdgpu happy path fights you a little but basically works. I
| think the amd libraries starting to be packaged under Linux is
| a big deal.
|
| If you don't want to follow the happy path, on Nvidia you get
| to beg them to maybe support your use case in future. On
| amdgpu, you get the option to build it yourself, where almost
| all the pieces are open source and pliable. The driver ships in
| Linux. The userspace is on GitHub. It's only GPU firmware which
| is an opaque blob at present, and that's arguably equivalent to
| not being able to easily modify the silicon.
| latchkey wrote:
| AMD GPUs work great, the issue is that people don't want to
| mess with ROCm/HIP when CUDA is kind of the documented
| workflow. Along with the fact that ROCm was stagnant for a long
| time. AMD missed the first AI wave, but are now committed to
| making ROCm into the best it can be.
|
| The other problem is that there aren't any places to rent the
| high-end AMD AI/ML GPUs, like the MI250s and the soon-to-be-
| released MI300s. They are only available on things like the
| Frontier supercomputer, which few developers have access to.
| "Regular" developers are stuck without easy access to this
| equipment.
|
| I'm working on the latter problem. I'd like to create more of a
| flywheel effect. Get more developers interested in AMD by
| enabling them to inexpensively rent and do development on them,
| which will create more demand. @gmail if you'd like to be an
| early adopter.
| k1ns wrote:
| I am very new to GPU programming in general and this article was
| a fun read. It's amazing how far we've come, prime example being
| able to train a simple "dog or cat" NN that easily.
| z5h wrote:
| We have compilers (languages) like Futhark that aim to optimize
| explicitly parallel operations. And universal models of
| computation like interaction nets that are inherently parallel.
|
| Is it lazy of me to expect we'll be getting a lot more "parallel-
| on-the-GPU-for-free" in the future?
| zozbot234 wrote:
| You can already convert a compute graph to GPU-optimized code
| using something like Aesara (formerly known as Theano) or
| TensorFlow. There are also efforts in the systems space that
| ought to make this kind of thing more widespread in the future,
| such as the MLIR backend for LLVM.
| rushingcreek wrote:
| Good read. However, the AWS P5 instance (along with P4d and P4de)
| is most certainly oriented towards training, not inference. The
| most inference-friendly instance types are the G4dn and the G5,
| which feature T4 and A10G GPUs, respectively.
| axpy906 wrote:
| Came here to say this author forgot G5.
| bilsbie wrote:
| Can I do this on my GeForce GTX 970 4GB?
| dartos wrote:
| I don't think transformer models generate multiple tokens in
| parallel (how could they?)
|
| They just leverage parallelism in making a single prediction
| atomicnature wrote:
| Transformers tend to be _trained_ in parallel. BERT = 512
| tokens per context, in parallel. GPT too is trained while
| feeding in multiple words in parallel. This enables us to build
| larger models. Older models, such as RNNs, couldn't be trained
| this way, limiting their power/quality.
| dartos wrote:
| Ahh, that makes a lot of sense
| zozbot234 wrote:
| This is only sort of true, since you can still train RNNs
| (including LSTMs, etc.) in big batches, which is usually
| plenty enough to make use of your GPU's parallel
| capabilities. The inherently serial part only applies to the
| _length_ of your context. Transformer architectures thus
| happen to be helpful if you have lots of idle GPUs such
| that you're actually constrained by not being able to
| parallelize along the context dimension.
| atomicnature wrote:
| In RNNs, hidden states have to be computed sequentially; in
| transformers, with the attention mechanism, we break free of
| the sequential requirement. Transformers are more amenable
| to parallelism, and make the most of GPUs (along the context
| axis and beyond).
| DougBTX wrote:
| A quick search uncovers [0] with a hint towards an answer: just
| train the model to output multiple tokens at once.
|
| [0] https://arxiv.org/abs/2111.12701
| JonChesterfield wrote:
| "CPUs are good at serial code and GPUs are good at parallel
| code" is kind of true, but something of an approximation.
| Assume an equivalent power budget in the roughly hundreds of
| watts range, then:
|
| A CPU has ~100 "cores", each running one (and a hyperthread)
| independent thing, and it hides memory latency by branch
| prediction and pipelining.
|
| A GPU has ~100 "compute units", each running ~80 independent
| things interleaved, and it hides memory latency by executing the
| next instruction from one of the other 80 things.
|
| Terminology is a bit of a mess, and the CPU probably has a 256bit
| wide vector unit while the GPU probably has a 2048bit wide vector
| unit, but from a short distance the two architectures look rather
| similar.
| bee_rider wrote:
| I'm always surprised there isn't a movement toward pairing a
| few low latency cores with a large number of high throughput
| cores. Surround a single Intel P core with a bunch of E cores.
| Then, hanging off the E cores, stick a bunch of iGPU cores
| and/or AVX-512 units.
|
| Call it Xeon Chi.
| softfalcon wrote:
| Neat idea, probably even viable!
|
| I think they may have a hurdle of getting folks to buy into
| the concept though.
|
| I imagine it would be analogous to how Arria FPGAs were
| included with certain Xeon CPUs, which further backs up your
| point that this could happen in the near future!
| Const-me wrote:
| I think one possible reason for that is that, ideally, these
| things need different memory.
|
| If you use high-bandwidth, high-latency GDDR memory, CPU cores
| will underperform due to high latency, as shown here:
| https://www.tomshardware.com/reviews/amd-4700s-desktop-
| kit-r...
|
| If you use low-latency memory, GPU cores will underperform
| due to low bandwidth, see modern AMD APUs with many RDNA3
| cores connected to DDR5 memory. On paper, the Radeon 780M
| delivers up to 9 FP32 TFLOPS, a figure close to the desktop
| Radeon RX 6700, which is substantially faster in gaming.
| bee_rider wrote:
| Hmm, that is a good point. Since it is a dream-computer
| anyway, maybe we can do 2.5d packaging; put the ddr memory
| right on top so the P cores can reach it quickly, then
| surround the whole thing with GDDR.
| pixelpoet wrote:
| You mean like an iGPU?
|
| Edit: Oh, thanks for the downvote, with no discussion of the
| question. I'll just sit here quietly with my commercial
| OpenCL software that happily exploits these vector units
| attached to the normal CPU cores.
| mmoskal wrote:
| GPU has 10x the memory bandwidth of the CPU though, which
| becomes relevant for the LLMs where you essentially have to
| read the whole memory (if you're batching optimally, that is
| using all the memory either for weights or for KV cache) to
| produce one token of output.
| winwang wrote:
| GPUs also have 10x-100x FP/INT8 throughput watt-for-watt.
| hurryer wrote:
| The GPU also has 10x the memory latency compared to the CPU.
|
| And memory access order is much more important than on a CPU.
| Truly random access has very bad performance.
| Matumio wrote:
| > When faced with multiple tasks, a CPU allocates its resources
| to address each task one after the other
|
| Ha! I wish CPUs were still that simple.
|
| Granted, it is legitimate for the article to focus on the
| programming model. But "CPUs execute instructions sequentially"
| is basically wrong if you talk about performance. (There are
| pipelines executing instructions in parallel, there is SIMD, and
| multiple cores can work on the same problem.)
| pclmulqdq wrote:
| I think this post focused on the wrong things here. CPUs with
| AVX-512 also have massive data parallelism, and CPUs can
| execute many instructions at the same time. The big difference
| is that CPUs spend a lot of their silicon and power handling
| control flow to execute one thread efficiently, while GPUs
| spend that silicon on more compute units and hide control flow
| and memory latency by executing a lot of threads.
| mhh__ wrote:
| It will do multiple SIMD instructions at the same time, too.
| aunty_helen wrote:
| We're back to "every developer must know" clickbait articles?
| igh4st wrote:
| It seems so... I'd take this article's statements with a
| grain of salt.
| mhh__ wrote:
| Although I think they'll be replaced by ChatGPT, a _good_
| article in that style is actually quite valuable.
|
| I like attacking complexity head on, and have a good knowledge
| of both quantitative methods & qualitative details of (say)
| computer hardware so having an article that can tell me the
| nitty gritty details of a field is appreciated.
|
| Take "What every programmer should know about memory" -- should
| _every_ programmer know? Perhaps not, but every _good_
| programmer should at least have an appreciation of how a
| computer actually works. This pays dividends _everywhere_ --
| code with good locality (the main idea you should take away
| from that article) is fast, easy to follow, and usually the
| result of code that fits the problem well.
| shortrounddev2 wrote:
| This article claims to be something every developer must know,
| but it's a discussion of how GPUs are used in AI. Most developers
| are not AI developers, nor do they interact with AI or use GPUs
| directly. Not to mention the fact that this article barely
| mentions 3D graphics at all, the reason GPUs exist.
| lucb1e wrote:
| One can benefit from knowing fundamentals of an adjacent field,
| especially something as broadly applicable as machine learning.
|
| - You might want to use some ML in the project you are assigned
| next month
|
| - It can help collaborating with someone who tackles that
| aspect of a project
|
| - Fundamental knowledge helps you understand the "AI" stuff
| being marketed to your manager
|
| The "I don't need this adjacent field" mentality feels familiar
| from schools I went to: first I did system administration where
| my classmates didn't care about programming because they felt
| like they didn't understand it anyway and they would never need
| it (scripting, anyone?); then I switched to a software
| development school where, guess what, the kids couldn't care
| less about networking and would never need it anyway. I don't
| understand it; to me both are interesting, but more
| practically: fast-forward five years and the term DevOps became
| popular in job ads.
|
| The article is 1500 words at a rough count. Average reading
| speed is 250wpm, but for studying something, let's assume half
| of that: 1500/125 = 12 minutes of your time. Perhaps you toy
| around with it a little, run the code samples, and spend two
| hours learning. That's not a huge time investment. Assuming
| this is a good starting guide in the first place.
| mrec wrote:
| The objection isn't to the notion that "One can benefit from
| knowing fundamentals of an adjacent field". It's that this is
| "The bare minimum every developer must know". That's a much,
| _much_ stronger claim.
|
| I've come to see this sort of clickbait headline as playing
| on the prevalence of imposter-syndrome insecurity among devs,
| and try to ignore them on general principle.
| lucb1e wrote:
| Fair enough! I can kind of see the point that, if every
| developer knew some basics, it would help them make good
| decisions about their own projects, even if the answer is
| "no, this doesn't need ML". On the other hand, you're of
| course right that if you don't use ML, then it's clearly
| not something you "must" know to do your job well.
| sigmonsays wrote:
| yeah a lot of assumptions were made that are inaccurate.
|
| I agree that most developers are not AI developers... OP seems
| to be a bit out of touch with the general population and
| otherwise is assuming the world around them based on their own
| perception.
| Der_Einzige wrote:
| Don't worry, you'll either be an AI developer or unemployed
| within 5 years. This is indeed important for you, regardless if
| you recognize this yet or not.
| pixelpoet wrote:
| Not to mention their passing example of Mandelbrot set
| rendering only gets a 10x speedup, despite being the absolute
| posterchild of FLOPs-limited computation.
|
| Terrible article IMO.
| pclmulqdq wrote:
| You would expect at least 1000x, and that's probably where it
| would be if they didn't include JIT compile time in their
| time. Mandelbrot sets are a perfect example of a calculation
| a GPU is good at.
| j45 wrote:
| Understanding how hardware is used is very beneficial for
| programmers.
|
| Lots of programmers started with an understanding of what
| happens physically on the hardware when code runs, and it is an
| unfair advantage when debugging at times.
| oytis wrote:
| > Most developers are not AI developers
|
| I remember how I joined a startup after working for a
| traditional embedded shop and a colleague made (friendly) fun
| of me for not knowing how to use curl to post a JSON request. I
| learned a lot since then about backend, frontend and
| infrastructure despite still being an embedded developer. It
| seems likely that people all around the industry will be in a
| similar position when it comes to AI in the next years.
| hhjinks wrote:
| What do you think the industry will look like in the near
| future?
| outside1234 wrote:
| And honestly, for most "AI developers" if you are training your
| own model these days (versus using an already trained one) -
| you are probably doing it wrong.
| sbmthakur wrote:
| I would have probably opened it if it weren't for the title
| bait.
| bigstrat2003 wrote:
| I've noticed that every time I see an article claiming that its
| subject is something "every developer must know", that claim is
| false. Maybe there are articles which contain information that
| everyone must know, but all I encounter is clickbait.
| BlueTemplar wrote:
| Even worse, it says "GPUs", but isn't CUDA a closed feature
| limited to Nvidia cards, and maybe even a subset of them?
|
| (I'm not touching Nvidia since they don't provide open source
| drivers.)
| lucb1e wrote:
| > AWS GPU Instances: A Beginner's Guide [...] Here are the
| different types of AWS GPU instances and their use cases
|
| The section goes on to teach Amazon-specific terminology and
| products.
|
| A "bare minimum everyone must know" guide should not include
| vendor-specific guidance. I had this in school with Microsoft
| already, with never a mention of Linux because they already paid
| for Windows Server licenses for all of us...
|
| Edit: and speaking of inclusivity, the screenshots-of-text have
| their alt text set to "Alt text". Very useful. It doesn't need to
| be verbatim copies, but it could at least summarize in a few
| words what you're meant to get from the terminal screenshot to
| help people that use screen readers.
|
| Since this comment floated to the top, I want to also say that I
| didn't mean for this to dominate the conversation! The guide may
| not be perfect, but it helped me by showing how to run arbitrary
| code on my GPU. A few years ago I also looked into it, but came
| away thinking it's dark magic that I can't make use of. The
| practical examples in both high- and low-level languages are
| useful
|
| Another edit: cool, this comment went from all the way at the top
| to all the way at the bottom, without losing a single vote. I
| agree it shouldn't be the very top thing, but this moderation
| also feels weird
| mikehollinger wrote:
| Agreed. This isn't actually that useful of a guide in the first
| place.
|
| Tbh the most basic question is: "are you innovating inside the
| AI box or outside the AI box?"
|
| If inside - this guide doesn't really share anything practical.
| Like if you're going to be tinkering with a core algorithm and
| trying to optimize it, understanding BLAS and cuBLAS or
| whatever AMD / Apple / Google equivalent, then understanding
| what pandas, torch, numpy and a variety of other tools are
| doing for you, then being able to wield these effectively makes
| more sense.
|
| If outside the box - understanding how to spot the signs of
| inefficient use of resource - whether that's network, storage,
| accelerator, cpu, or memory, and then reasoning through how to
| reduce that bottleneck.
|
| Like - I'm certain we will see this in the near future, but off
| the top of my head the innocent but incorrect things people do:
| 1. Sending single requests, instead of batching
| 2. Using a synchronous programming model when asynchronous is
| probably better
| 3. Sending data across a compute boundary unnecessarily
| 4. Sending too much data
| 5. Assuming all accelerators are the same. That T4 GPU is
| cheaper than an H100 for a reason.
| 6. Ignoring bandwidth limitations
| 7. Ignoring access patterns
| StableAlkyne wrote:
| Are there any surveys of just how many Windows Server boxes
| exist?
|
| Even when I was working at an Azure-only shop, I've never
| actually seen anyone use Windows Server. Lots of CentOS (before
| IBM ruined it) and other Unixes, but never a Windows Server.
| lucb1e wrote:
| We come across them all the time when doing internal network
| pentests (most organizations use AD for managing their fleet
| of end-user systems), and occasionally external tests as
| well. Stackoverflow is a site that comes to mind as being
| known for running their production systems on Windows Server.
|
| It's useful to have experienced, but I do take issue with
| exclusively (or primarily) focusing on one ecosystem as a
| mostly-publicly-funded school.
| StableAlkyne wrote:
| Huh, TIL StackOverflow is on Windows Server
| Matumio wrote:
| The Mandelbrot example seems to make interpreted Python stand in
| for "the CPU performance"?
|
| If that's true, then I'm surprised they only see a 10x speed-up.
| I would expect more from only compiling that loop for the CPU.
| (Comparing to interpreted Python without numpy.) Given they
| already have a numba version, why not compile it for the CPU and
| compare?
|
| Also, they say consumer CPUs have 2-16 cores. (Who has 2 cores
| these days?) They go on to suggest renting an AWS GPU for $3 per
| hour. You're more likely to get 128 cores for that price, still
| on a single VM.
|
| Not saying it will be easy to write multi-threaded code for the
| CPU. But if you're lucky, the Python library you're using already
| does it.
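|
| (The CPU-compiled comparison asked for above is a small change:
| the same escape-time loop under Numba's @njit with parallel=True
| and prange spreads rows across cores. A minimal sketch with
| made-up image dimensions, not the article's code:)
|
|     import numpy as np
|     from numba import njit, prange
|
|     @njit(parallel=True)
|     def mandelbrot_cpu(width, height, max_iter):
|         image = np.zeros((height, width), dtype=np.uint16)
|         for row in prange(height):   # rows spread across CPU cores
|             for col in range(width):
|                 c = complex(-2.0 + 3.0 * col / width,
|                             -1.5 + 3.0 * row / height)
|                 z = 0.0j
|                 for i in range(max_iter):
|                     z = z * z + c
|                     if z.real * z.real + z.imag * z.imag >= 4.0:
|                         image[row, col] = i
|                         break
|         return image
|
|     # first call includes JIT compile time; time the second call
|     img = mandelbrot_cpu(2000, 2000, 100)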
| lucb1e wrote:
| > Also, they say consumer CPUs have 2-16 cores. (Who has 2
| cores these days?)
|
| Pretty sure my mom's laptop has 2 cores; I can't think of
| anyone whose daily driver has 16 cores. Real cores, not
| hyperthread stuff running at 0.3x the performance of a real
| core.
|
| As for the 128-core server system, note that those cores are
| typically about as powerful as a 2008 notebook. My decade-old
| laptop CPU outperforms what you get at DigitalOcean today, and
| storage performance is a similar story. The sheer number makes
| up for it, of course, but "number of cores" is not a 1:1
| comparable metric.
|
| Agree, though, that the 10x speedup seems low. Perhaps, at
| 0.4s, a relatively large fraction of that time is spent on
| initializing the Python runtime (`time python3 -c
| 'print("1337")'` = 60ms), the module they import that needs to
| do device discovery, etc.? Hashcat, for example, takes like 15
| seconds to get started even if it then runs very fast after
| that.
| 65a wrote:
| My 2016 desktop had 22 cores and 44 threads. You can have the
| same processor for < $200 on ebay right now.
| shmerl wrote:
| Shouldn't the article mention SIMD? I haven't seen it even being
| brought up.
| johndough wrote:
| The code in this article is incorrect. The CUDA kernel is never
| called:
| https://github.com/RijulTP/GPUToolkit/blob/f17fec12e008d0d37...
|
| I'd also like to point out that 90% of the time taken to
| "compute" the Mandelbrot set with the JIT-compiled code is spent
| on compiling the function, not on computation.
|
| If you actually want to learn something about CUDA, implementing
| matrix multiplication is a great exercise. Here are two
| tutorials:
|
| https://cnugteren.github.io/tutorial/pages/page1.html
|
| https://siboehm.com/articles/22/CUDA-MMM
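|
| (A naive starting point for that exercise, in the same
| Python/Numba style as the article, with the launch line included
| since that is exactly what the linked repo forgot. Sizes are
| illustrative; it assumes numba and a CUDA-capable GPU, and the
| tutorials above cover the tiled/shared-memory versions that
| actually perform well.)
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def matmul_kernel(A, B, C):
|         i, j = cuda.grid(2)      # one thread per output element
|         if i < C.shape[0] and j < C.shape[1]:
|             acc = 0.0
|             for k in range(A.shape[1]):
|                 acc += A[i, k] * B[k, j]
|             C[i, j] = acc
|
|     n = 512
|     A = np.random.rand(n, n).astype(np.float32)
|     B = np.random.rand(n, n).astype(np.float32)
|     d_A, d_B = cuda.to_device(A), cuda.to_device(B)
|     d_C = cuda.device_array((n, n), dtype=np.float32)
|
|     threads = (16, 16)
|     blocks = ((n + 15) // 16, (n + 15) // 16)
|     matmul_kernel[blocks, threads](d_A, d_B, d_C)   # kernel launch
|     print(np.allclose(d_C.copy_to_host(), A @ B, atol=1e-2))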
| Handprint4469 wrote:
| Thank you for this, comments like yours are exactly why I keep
| coming back to HN.
| sevagh wrote:
| >If you actually want to learn something about CUDA,
| implementing matrix multiplication is a great exercise.
|
| There is SAXPY (vector math A*X+Y), purportedly ([1]) the hello
| world of parallel math code.
|
| >SAXPY stands for "Single-Precision A*X Plus Y". It is a
| function in the standard Basic Linear Algebra Subroutines
| (BLAS) library. SAXPY is a combination of scalar multiplication
| and vector addition, and it's very simple: it takes as input
| two vectors of 32-bit floats X and Y with N elements each, and
| a scalar value A. It multiplies each element X[i] by A and adds
| the result to Y[i].
|
| [1]: https://developer.nvidia.com/blog/six-ways-saxpy/
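|
| (In the Python/Numba form used elsewhere in this thread, SAXPY
| is only a few lines; a minimal sketch, assuming numba and a
| CUDA-capable GPU:)
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def saxpy(a, x, y):
|         i = cuda.grid(1)         # absolute thread index
|         if i < y.size:
|             y[i] = a * x[i] + y[i]
|
|     n = 1 << 20
|     x = np.random.rand(n).astype(np.float32)
|     y = np.random.rand(n).astype(np.float32)
|     d_x, d_y = cuda.to_device(x), cuda.to_device(y)
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     saxpy[blocks, threads](np.float32(2.0), d_x, d_y)
|     print(np.allclose(d_y.copy_to_host(), 2.0 * x + y))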
| convexstrictly wrote:
| A great beginner guide to GPU programming concepts:
|
| https://github.com/srush/GPU-Puzzles
___________________________________________________________________
(page generated 2023-11-12 23:01 UTC)