[HN Gopher] Render a neural network into CUDA/HIP code
___________________________________________________________________
Render a neural network into CUDA/HIP code
Author : fzliu
Score : 122 points
Date : 2023-06-02 17:14 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Havoc wrote:
| Interesting that AMD GPUs seem to be first-class citizens
| here. Consumer-class gear is much cheaper per unit of VRAM, by
| the looks of it.
| brucethemoose2 wrote:
| Also here are some other interesting projects in the ML
| compilation space:
|
| - Apache TVM (mlc-llm is a good demo)
|
| - Hidet (a torch.compile backend)
|
| - Alibaba BladeDISC
|
| - Nvidia TensorRT (a classic, but much less of a nightmare to
| install now)
|
| - Torch MLIR (SHARK has some demos/implementations)
| jahewson wrote:
| And of course, Chris Lattner's Modular AI
| https://www.modular.com/
| homarp wrote:
| CUDA: NVIDIA GPU 'framework'
|
| HIP: AMD GPU 'framework'
|
| This takes neural networks defined in Python and converts them
| to C++ code that calls CUDA / HIP for maximum inference speed.
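|
| A minimal sketch of that workflow, based on AITemplate's
| published examples (treat the exact module paths and attribute
| names as assumptions; they may differ between releases):
|
|     from aitemplate.compiler import compile_model
|     from aitemplate.frontend import nn, Tensor
|     from aitemplate.testing import detect_target
|
|     class SimpleNet(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.fc = nn.Linear(64, 32)
|
|         def forward(self, x):
|             return self.fc(x)
|
|     # Trace the graph symbolically, then emit and build
|     # C++/CUDA (or HIP) source, getting back a runnable
|     # inference module.
|     x = Tensor(shape=[1, 64], dtype="float16", name="x",
|                is_input=True)
|     y = SimpleNet()(x)
|     y._attrs["is_output"] = True
|     y._attrs["name"] = "y"
|     module = compile_model(y, detect_target(), "./tmp",
|                            "simple_net")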
| iaw wrote:
| Has anyone seen details on whether this can handle splitting
| a model across GPUs?
| sosodev wrote:
| The latency improvements are impressive but the ability to run
| models beyond their typical memory limitations is way cooler.
| hintymad wrote:
| I like this humility: "AITemplate is co-created by Meta
| engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry
| Chen, with major contributions coming from more talented
| engineers."
| samstave wrote:
| ELI5 what this means?
|
| I am losing track of the bibliography, etymology, and
| vocabulary with every single AI advancement article.
|
| Where does one learn AI vocab, please?
|
| -
|
| I need an FN AI teacher to just give me daily updates on AI
| and its verbiage, models, etc...
|
| Hey AI - if you're so smart, build a podcast that teaches me
| about yourself and how to be a better meat parent who made
| you.
| bagels wrote:
| What is an FN AI?
| samstave wrote:
| A 'fuckin' AI'
| dragonwriter wrote:
| An autonomous Belgian firearm.
| skirmish wrote:
| Starting with a trained PyTorch model, it builds optimized C++
| binaries for running inference (not training) on NVIDIA and
| AMD GPUs. Various optimizations are mentioned, so presumably
| models run faster than they would under regular PyTorch.
| stevenwliao wrote:
| How much faster is it?
| pumanoir wrote:
| Depends on the model and GPU. Here is an example of almost
| 2x on a 3060 for Stable Diffusion:
| https://www.youtube.com/watch?v=_6BsUijOWoM
| prsutherland wrote:
| I'm curious why that is called "rendering" rather than
| "compiling". Is the code boilerplate and just a change in
| the NN's representation?
| iaw wrote:
| Very much not an expert here, but my understanding is that
| most deep learning frameworks (PyTorch, TensorFlow, etc.)
| carry some overhead on top of the work that actually runs on
| the graphics card. This takes PyTorch code and removes that
| overhead by translating the network into a "native" language
| for the card (CUDA for NVIDIA).
|
| What I'm not sure about is what "HIP" is in this context.
|
| The way I'm reading this, it's the difference between running
| code in an interpreter vs. on the bare metal (for the GPU).
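|
| One way to see that overhead (a hedged sketch using
| torch.compile as the fused baseline rather than this project;
| the timings are whatever your hardware gives):
|
|     import time
|     import torch
|
|     x = torch.randn(4096, 4096, device="cuda")
|
|     def many_small_ops(x):
|         # Eager mode pays Python dispatch plus kernel-launch
|         # overhead for every one of these tiny ops.
|         for _ in range(200):
|             x = x * 1.0001 + 0.0001
|         return x
|
|     fused = torch.compile(many_small_ops)
|
|     for fn in (many_small_ops, fused):
|         fn(x); torch.cuda.synchronize()  # warm-up / compile
|         t0 = time.perf_counter()
|         fn(x); torch.cuda.synchronize()
|         print(time.perf_counter() - t0)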
| entropicdrifter wrote:
| HIP is AMD's open-source re-implementation of CUDA libraries.
| viewtransform wrote:
| HIP is AMD's contribution to the open-source community to
| overcome Nvidia's CUDA software moat.
|
| You write in the HIP C++ language and run on either NVIDIA
| or AMD platforms. This way you get cross-platform code and
| are not stuck with Nvidia.
|
| Use the HIPify tool to automatically convert existing sources
| from CUDA to HIP.
|
| It's been around for many years, but the fact that so many
| people still don't know about it speaks to the sad state of
| AMD's communication.
|
| https://docs.amd.com/bundle/HIP-Programming-Guide-v5.3/page/...
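|
| A toy illustration of the mechanical renaming HIPify does (the
| mappings below are real, but the actual tool is a full
| source-to-source translator that also handles headers, types,
| and kernel-launch syntax):
|
|     # Toy stand-in for hipify's CUDA -> HIP renaming.
|     CUDA_TO_HIP = {
|         "cudaMalloc": "hipMalloc",
|         "cudaMemcpy": "hipMemcpy",
|         "cudaFree": "hipFree",
|         "cudaDeviceSynchronize": "hipDeviceSynchronize",
|         "cudaStream_t": "hipStream_t",
|     }
|
|     def toy_hipify(source: str) -> str:
|         for cuda_name, hip_name in CUDA_TO_HIP.items():
|             source = source.replace(cuda_name, hip_name)
|         return source
|
|     print(toy_hipify("cudaMalloc(&p, n); cudaFree(p);"))
|     # -> hipMalloc(&p, n); hipFree(p);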
| my123 wrote:
| HIP is a pathetic CUDA API clone. Gratuitous renames
| don't do it any good and are more representative of NIH
| than anything else (sadly).
|
| They should have shipped a proper header set instead of
| hipify.
| femto113 wrote:
| It doesn't really help understand what they are, but for
| completeness CUDA is an acronym for "Compute Unified Device
| Architecture" while HIP is "Heterogeneous-compute Interface for
| Portability"
| born-jre wrote:
| At first glance I thought maybe it's like tinygrad, but it
| looks like it has many more ops than tinygrad, though most map
| to underlying hardware-provided ops?
|
| I wonder how well tinygrad's approach will work out. Op fusion
| sounds easy: just walk a graph, pattern-match it, and lower to
| hardware-provided ops? (A sketch of that idea follows below.)
|
| Anyway, if anyone wants to understand the philosophy behind
| tinygrad, this file is a great start:
| https://github.com/geohot/tinygrad/blob/master/docs/abstract...
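|
| A minimal sketch of that walk-and-pattern-match idea over a
| hypothetical list-shaped IR (nothing here is from tinygrad
| itself):
|
|     # Each node is (op, inputs). Fuse a mul followed by an
|     # add into one node, the way hardware often provides a
|     # fused multiply-add instruction.
|     def fuse_mul_add(graph):
|         fused, i = [], 0
|         while i < len(graph):
|             op, args = graph[i]
|             nxt = graph[i + 1] if i + 1 < len(graph) else None
|             if op == "mul" and nxt and nxt[0] == "add":
|                 fused.append(("fma", args + nxt[1]))
|                 i += 2
|             else:
|                 fused.append((op, args))
|                 i += 1
|         return fused
|
|     print(fuse_mul_add([("mul", ["x", "y"]), ("add", ["z"])]))
|     # -> [('fma', ['x', 'y', 'z'])]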
| bguberfain wrote:
| It reminds me of Theano.
| antinucleon wrote:
| AITemplate's original designer here. We quit Meta in January
| and started HippoML (https://hippoml.com/). We just disclosed
| our new engine's performance on LLMs:
| https://blog.hippoml.com/large-language-model-inference-from...
| On an Apple M2 Max, our new engine's encode/decode is
| 13.8x/2.4x faster than llama.cpp's.
| huevosabio wrote:
| Any idea how Hippo, AITemplate, and TVM compare in performance?
| antinucleon wrote:
| Hippo is faster than AITemplate and supports more generative
| models. We haven't compared it against TVM, but in absolute
| tokens/s, Hippo can run LLaMA decoding on an M2 Max at
| performance comparable to datacenter-level GPUs (running
| other software).
| huevosabio wrote:
| Thanks, I've added myself to the waitlist. Please let us
| know when this can be tried!
| ralfd wrote:
| What is your planned business model here?
| antinucleon wrote:
| We will disclose more details very soon.
| brucethemoose2 wrote:
| Very interesting.
|
| Is 8bit/4bit support in the works? Will it work with
| bitsandbytes out of the box? Speedy inference is great, but in
| practice many users are running the biggest ~4-bit LLM that
| will fit into their RAM/VRAM pool these days. This is why
| llama.cpp is so good: it's (AFAIK) the only implementation
| that will split a 4-bit quantized model so easily.
| antinucleon wrote:
| Yes. We support models from 1-bit to 16-bit out of the box,
| across a variety of models.
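|
| For reference, a minimal sketch of the block-wise 4-bit scheme
| such quantized models typically use (simplified; real formats
| like llama.cpp's Q4 variants pack two 4-bit codes per byte and
| differ in detail):
|
|     import numpy as np
|
|     def quantize_q4(w, block=32):
|         # One scale per block, codes clipped to [-8, 7].
|         w = w.reshape(-1, block)
|         scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
|         q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
|         return q, scale
|
|     def dequantize_q4(q, scale):
|         return (q * scale).astype(np.float32)
|
|     w = np.random.randn(64).astype(np.float32)
|     q, s = quantize_q4(w)
|     print(np.abs(dequantize_q4(q, s).ravel() - w).max())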
| sroussey wrote:
| Would it work with instructor-xl or similar, which is designed
| for embeddings and retrieval? On-device for privacy is key.
| antinucleon wrote:
| Yes
| mhh__ wrote:
| Really doesn't surprise me that much. llama.cpp seems like an
| OK first pass, but I assume there is a lot of time left on the
| table in terms of graph optimizations and using the memory
| hierarchy properly.
| brucethemoose2 wrote:
| It also doesn't use Apple GPUs at all. It's 100% CPU
| inference, with some CUDA/OpenCL offload (but no Metal and
| no zero-copy) at the moment.
| antinucleon wrote:
| It is actually non-trivial to get a GPU to run fast,
| especially on an SoC with a strong CPU like the M2.
| hutzlibu wrote:
| GPU programming in general is definitely not trivial, as
| I can confirm while struggling to learn WebGPU right now.
|
| But it really depends on the problem: simple math
| operations on lots of data are usually indeed trivially
| faster, and AI is mostly that, math on matrices.
|
| For example, I just implemented a simple 2D raycast
| solver in WGSL, and as a first project it is totally
| unoptimized - but even on my old laptop with a crappy
| integrated GPU and a (relatively) fast CPU, I can now do
| 10,000 raycasts per frame easily, while the CPU (WASM!)
| struggles with 500.
|
| The raw power of the GPU is really awesome. But every
| step is hard and debugging is a mess, which is why only a
| handful of people seem to be doing it. But now would
| probably be a good time to get into it, as I think GPU
| compute has just gotten started and will get big.
| paulmd wrote:
| I've been out of the space for a long time, and it's
| possible you know these already, but these are a couple
| weird tricks that can help:
|
| * Radix sort is your friend. Fun fact: O(n log n) is not
| the fastest a sort can run, it's the fastest a
| _comparison-based_ sort can run. Radix sort runs in O(n)
| time and in fact parallelizes extremely well on GPUs.
| Extremely. They are great at it. And there are in-place
| radix sorts too, just a bit slower (same asymptotic
| performance, though).
|
| * "Insert an element into this collection" style steps
| can be replaced by a sort and a prefix-sum operation. If
| you know the offset of the first element with key K, and
| you know the offset of the first element with key J, you
| know the offset and size of that "collection" for K
| within a flat array ("size(K) = offset(J) - offset(K)").
| Both of these run super fast in parallel and if you can
| tweak your problem around to be some kind of sorting
| operation that usually produces good speedups like this.
| Easiest way to get a speedup from everything I've heard.
|
| * Recomputing is often much faster than storing
| intermediate results. "Procedural generation" is
| interesting because you can re-compute the generation
| step on demand. Random123 is also very nice compared to a
| (cuRAND) Mersenne twister/etc. - why are you, a believer
| in the cryptographic maxim that hash(key, msg) is
| uncorrelated with hash(key, msg+1), still storing RNG
| state? Being able to play back arbitrary parts of a
| datastream at will is incredibly powerful: you can
| fast-forward and rewind through the data previously used
| to interact with an item, as long as you know the epoch
| of the interaction you want for a particular key. And
| because computation is cheaper than memory and memory
| bandwidth, it's practically free in wall-clock terms to
| just do some math. This is a form of data compression
| and performance enhancement.
|
| * Generally you must understand the idea of coalescing
| and divergence and keep those lanes executing. And it is
| highly preferable to use sorts and scans and butterfly
| operations (reduction, etc) even within a warp, because
| traditional "mutex/atomic" paradigms don't work well with
| 100k threads. But this is just the programming idioms of
| this particular platform, I am sure LISP is similar too
| in terms of "oh that's how you do that" once you're
| accustomed.
|
| * Texture maps aren't just for graphics, they are a black
| box that lets the GPU perform 2D and 3D coalescing and
| some interpolation.
|
| * Intelligent use of constant memory is another one, as
| probably is the use of CPU memory. If a value will be
| seldom accessed, you can probably stuff it into host
| memory and just accept the slowdown. Or you can store
| only epochs on the GPU and recompute intermediate values
| as needed. Try to ensure that all threads in a warp will
| do it together (sorting vs. recomputing).
|
| * Raytracing is of course impervious to all of this (so
| don't worry too much that you can't magically hammer a
| speedup out of it, nobody really can). You can accelerate
| the raycasting and intersection testing (and AMD and
| NVIDIA and Intel all do this differently) but as a
| general matter rays are completely random and
| uncoalesced. Ray sorting/shader execution reordering is
| something that needs hardware assistance, and Intel and
| NVIDIA both have hardware along these lines. Intel's idea
| of making a facility for async future/promise dispatch
| for sparse tasks (and then sorting the shaders to get
| good coalescing/etc.) is really neat, and they've said
| it's going to come to GPGPU.
| https://youtu.be/SA1yvWs3lHU?t=289
|
| * You can, however, use your rays more efficiently. And
| that's an area of active focus for everyone. And I think
| more efficient use of TAAU samples is probably where
| raster is going too.
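|
| A small sketch of the sort-plus-prefix-sum trick from the
| second bullet above (torch here only for the array ops; on a
| GPU each step maps to a standard parallel primitive):
|
|     import torch
|
|     # Group values into per-key "collections" without any
|     # concurrent inserts: sort by key, count per key, then
|     # take an exclusive prefix sum of the counts to get each
|     # key's offset into one flat array.
|     keys = torch.randint(0, 4, (12,))
|     vals = torch.arange(12)
|
|     order = torch.argsort(keys)
|     sorted_keys, sorted_vals = keys[order], vals[order]
|
|     counts = torch.bincount(sorted_keys, minlength=4)
|     offsets = torch.cumsum(counts, 0) - counts  # exclusive scan
|
|     # Elements with key k occupy one contiguous slice:
|     k = 2
|     print(sorted_vals[offsets[k] : offsets[k] + counts[k]])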
| boywitharupee wrote:
| zero-copy with mmap was added to llama.cpp, but the way it
| was implemented sparked controversy.
| fathyb wrote:
| I think GP meant zero-copy communication with the GPU,
| eg. through `newBufferWithBytesNoCopy` [0], which is only
| possible with unified memory architectures, eg.
| integrated GPUs.
|
| The mmap change was just about mapping the model files in
| memory instead of copying them, which has less overhead.
|
| [0]: https://developer.apple.com/documentation/metal/mtldevice/14...
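|
| A minimal sketch of the mmap idea in the abstract (Python's
| mmap module here; llama.cpp does the equivalent in C, and
| "model.bin" is a placeholder path):
|
|     import mmap
|
|     # Map the weights file into the address space instead of
|     # read()-ing it into a fresh buffer; pages are faulted in
|     # on demand and can be shared between processes.
|     with open("model.bin", "rb") as f:
|         weights = mmap.mmap(f.fileno(), 0,
|                             access=mmap.ACCESS_READ)
|         header = weights[:16]  # reads straight from the map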
| yeison wrote:
| Did Facebook invest in this? Is that why it's under
| facebookincubator?
| antinucleon wrote:
| We developed AITemplate mainly for Meta's focus at the time,
| e.g. ads/ranking needs. HippoML is a startup we are building
| for generative AI; HippoML is not using AITemplate.
| yeison wrote:
| Will this have some similarities to what Mojo is trying to
| solve?
| antinucleon wrote:
| Mojo is trying to create a new language to solve the problem,
| specialized for CPUs. We are taking a more pragmatic approach
| to the GPU AI computation problem.
| jph00 wrote:
| Mojo is not at all specialised for CPUs. It sits on top of
| MLIR, and excellent support for all major accelerators is
| planned.
| thewataccount wrote:
| Do you know how its speed compares to exllama, specifically
| on an Nvidia GPU, by chance?
| antinucleon wrote:
| We haven't compared yet.
| cypress66 wrote:
| I don't see any comparisons with torch.compile. Kind of unfair to
| compare it to eager mode.
| brucethemoose2 wrote:
| I just ran a 512x512 Stable Diffusion benchmark with this
| yesterday:
|
| Pytorch Eager Mode with some optimizations: ~6it/s
|
| Pytorch Inductor (torch.compile with dynamic=True): ~7it/s
|
| AITemplate: ~9it/s
|
| All of them support changing settings and such, albeit with
| some work-in-progress bugs/caveats.
|
| That is 512x512 on a 2060, so I would expect the gains to be
| bigger on newer GPUs, where there is more framework overhead
| to take advantage of.
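|
| For reference, the torch.compile variant in a comparison like
| this is roughly a one-liner on top of a diffusers pipeline (a
| sketch; the model id and step count are placeholders):
|
|     import torch
|     from diffusers import StableDiffusionPipeline
|
|     pipe = StableDiffusionPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5",
|         torch_dtype=torch.float16,
|     ).to("cuda")
|
|     # The UNet dominates per-step time; dynamic=True avoids
|     # recompiling when resolution or batch size changes.
|     pipe.unet = torch.compile(pipe.unet, dynamic=True)
|
|     image = pipe("an astronaut riding a horse",
|                  num_inference_steps=20).images[0]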
| maxilevi wrote:
| Did you try TensorRT?
| brucethemoose2 wrote:
| Not yet. TRT diffusion has been an _enormous_ pain in the
| past, so I have kinda avoided it, but Nvidia just recently
| contributed an img2img pipeline to HF diffusers.
___________________________________________________________________
(page generated 2023-06-02 23:00 UTC)