[HN Gopher] What happens when you vectorize wide PyTorch express...
___________________________________________________________________
What happens when you vectorize wide PyTorch expressions?
Author : mrcslws
Score : 107 points
Date : 2023-10-26 13:41 UTC (9 hours ago)
(HTM) web link (probablymarcus.com)
(TXT) w3m dump (probablymarcus.com)
| intalentive wrote:
| Did you leave Numenta? Enjoyed the paper discussions you all
| posted to YT.
| mrcslws wrote:
| Glad to hear :)
|
| Yes, I'm off doing my own thing now. Deep Learning went so much
| further than I ever expected, and now I'm drawn to all the
| things that can be built today. Who knows, maybe I'll swing
| back into neuroscience in a few years. (Still friends with my
| old coworkers / bosses.)
| gregjm wrote:
| > My so-called CPU "active" time is actually an inferred value;
| CUDA spins the CPU 100% constantly, even when the CPU is just
| waiting for the GPU
|
| The CUDA Runtime and Driver APIs allow you to use "blocking
| synchronization" where the CPU will go to sleep while waiting for
| synchronization with the device. However, it seems that PyTorch
| doesn't expose this functionality in any of its Python APIs:
|
| https://github.com/pytorch/pytorch/issues/28224
|
| What happens when you try using ctypes to call into libcudart.so
| to set the device flags as described in the above issue? You'll
| have to call torch.cuda.init() for it to work, and unfortunately
| it won't work if PyTorch is launching kernels from other threads.
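|
| A rough ctypes sketch of what I have in mind (the soname and the
| init ordering are assumptions on my part, and whether the flag
| still takes effect after PyTorch has created its context depends
| on your CUDA version):
|
|     import ctypes
|     import torch
|
|     cudaDeviceScheduleBlockingSync = 0x04  # from driver_types.h
|
|     torch.cuda.init()  # make sure the CUDA runtime is loaded/initialized
|
|     # Adjust the soname (e.g. "libcudart.so.12") to match your install.
|     libcudart = ctypes.CDLL("libcudart.so")
|     err = libcudart.cudaSetDeviceFlags(
|         ctypes.c_uint(cudaDeviceScheduleBlockingSync))
|     assert err == 0, f"cudaSetDeviceFlags returned error {err}"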
| mrcslws wrote:
| Aha, I was hoping to learn about something like this, thanks
| for sharing. I'll try this some time. PyTorch does use
| different threads for the forward and backward pass, so as you
| suggest, setting that flag might only improve the forward pass.
| gregjm wrote:
| The CUDA Runtime and Driver APIs have per-thread state, so
| using threads would unfortunately bypass our trick here to
| set the flag. Assuming you're on Linux, I might suggest
| creating a shared library to intercept calls to the Driver
| API, as all Runtime functions are implemented as wrappers
| around Driver functions. You'd have to intercept all calls to
| context creation and flag setting:
|
|   * `cuCtxCreate`
|   * `cuCtxCreate_v3`
|   * `cuCtxSetFlags`
|   * `cuDevicePrimaryCtxRetain`
|   * `cuDevicePrimaryCtxSetFlags`
|
| ... and make sure that the three least significant bits of
| any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.
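|
| To be concrete about the flag handling (the intercepting shim
| itself is omitted here; constants are from cuda.h, and the logic
| is the same in whatever language the shim ends up in):
|
|     CU_CTX_SCHED_MASK = 0x07           # the three least-significant bits
|     CU_CTX_SCHED_BLOCKING_SYNC = 0x04
|
|     def force_blocking(flags: int) -> int:
|         # Clear whatever scheduling policy was requested, then ask
|         # for blocking synchronization instead.
|         return (flags & ~CU_CTX_SCHED_MASK) | CU_CTX_SCHED_BLOCKING_SYNC
|
|     assert force_blocking(0x00) == 0x04  # CU_CTX_SCHED_AUTO  -> blocking
|     assert force_blocking(0x01) == 0x04  # CU_CTX_SCHED_SPIN  -> blocking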
|
| cuDevicePrimaryCtxSetFlags:
| https://docs.nvidia.com/cuda/cuda-driver-
| api/group__CUDA__PR...
|
| dlsym(3): https://man.archlinux.org/man/dlsym.3.en
|
| ld.so(8): https://man.archlinux.org/man/ld.so.8.en
| bee_rider wrote:
| I'm somewhat confused as to what _is_ exposed, as the
| description in the quote sounds like a blocking call, but with
| a busy wait, which seems like it couldn't be the only or main
| thing that PyTorch exposes.
| Filligree wrote:
| That is indeed the only API that it exposes.
| pixelpoet wrote:
| I really hope those pow(x, 2) calls are getting turned into x *
| x, else it's a performance catastrophe / extreme beginner mistake
| even with vectorisation.
|
| Also, this kind of ultra wide buffering consumes a ton of memory
| bandwidth for each operation, instead of keeping a small portion
| in cache/registers. FLOPs are scaling sort of infinitely, whereas
| memory speed is flat, so this is increasingly a losing game; just
| because it's faster than glacial Python doesn't mean it's fast
| compared to a language which actually concerns itself with
| performance or a more cache aware approach.
|
| For an extreme example of how you can even sometimes beat ultra
| optimised GPU ML libraries in this way, check out
| https://github.com/NVlabs/tiny-cuda-nn
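|
| FWIW, a throwaway micro-benchmark is enough to check the
| pow(x, 2) point on any given setup (whether it matters at all
| depends on whether the backend strength-reduces or fuses it):
|
|     import torch
|     from torch.utils import benchmark
|
|     device = "cuda" if torch.cuda.is_available() else "cpu"
|     x = torch.randn(10_000_000, device=device)
|
|     for stmt in ("torch.pow(x, 2)", "x * x"):
|         t = benchmark.Timer(stmt=stmt, globals={"torch": torch, "x": x})
|         print(stmt, "->", t.blocked_autorange().median * 1e6, "us")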
| mrcslws wrote:
| I wondered about this same thing. Your logic about
| cache/registers is certainly true on CPUs, but what about GPUs?
| Hence this blurb:
|
| > I studied the CUDA traces closely and found that
| vectorization does indeed reduce many aspects of the GPU
| workload, greatly reducing the number of operations and
| decreasing the total amount of time spent on the fundamental
| computations of the algorithm. However it also introduces
| overhead (mentioned above) by interspersing operations that
| permute and reorder the tensors, or splitting them into groups
| then concatenating results. Sometimes the reduced "fundamental"
| time outweighs the additional overhead, while other times the
| overhead outweighs the reduction in fundamental time.
|
| Here are some examples not included in the blog post:
|
| - Total time spent in aten::cdist kernel
|   - Baseline: 2.834s (4900 calls)
|   - Vectorized: 2.686s (500 calls)
|
| - Total time spent in aten::mul kernel
|   - Baseline: 5.745s (80700 calls)
|   - Vectorized: 5.555s (8100 calls)
|
| This nice little win applies to tons of other kernels, almost
| across the board. As you point out, CPU intuition suggests this
| should have been _slower_, so this was an interesting outcome.
|
| On the other hand, some specific increases occur:
|
| - Total time spent in aten::cat kernel
|   - Baseline: 0.680s
|   - Vectorized: 1.849s
|
| So working in fewer, larger batches doesn't _only_ enable
| outrunning the GPU. It decreases the total GPU workload... then
| adds some overhead. But some of this overhead could be removed
| with custom CUDA kernels, so I think this is an interesting
| direction even if you solve the CPU problem some other way.
|
| (The pow(x, 2) is only there in the toy code, not my actual
| kernel, so I didn't performance-tune it.)
| nixpulvis wrote:
| What's the state-of-the-art in terms of compiler optimization
| here? Seems like auto-vectorization could be a somewhat simple
| transform, no?
| voz_ wrote:
| Pretty cool to see people using compile in the wild :)
| mrcslws wrote:
| Yeah, one unspoken theme of this blog post is "look how nice
| torch.compile is" :)
|
| Fun fact, I had to put in extra work to get torch.compile
| working with my code, for understandable reasons. My library,
| Vexpr, literally runs an interpreter inside of Python, reading
| a big tree-like namedtuple-of-namedtuples "expression" data
| structure and evaluating it recursively. That data structure
| was way too fancy for torch.compile's guards, so I actually
| wrote code [1] that converts a Vexpr expression into a big
| Python code string and evals it, factoring the interpreter out
| of the code, then I pass _that_ eval'd string into
| torch.compile.
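|
| Roughly this pattern, with made-up node types as a toy stand-in
| (not the actual Vexpr code):
|
|     import torch
|     from collections import namedtuple
|
|     # Tiny expression tree standing in for Vexpr's
|     # namedtuple-of-namedtuples expressions.
|     Add = namedtuple("Add", ["lhs", "rhs"])
|     Mul = namedtuple("Mul", ["lhs", "rhs"])
|     Var = namedtuple("Var", ["name"])
|
|     def to_source(expr):
|         # Emit a Python expression string instead of interpreting
|         # the tree at call time.
|         if isinstance(expr, Var):
|             return expr.name
|         if isinstance(expr, Mul):
|             return f"({to_source(expr.lhs)} * {to_source(expr.rhs)})"
|         if isinstance(expr, Add):
|             return f"({to_source(expr.lhs)} + {to_source(expr.rhs)})"
|         raise TypeError(type(expr))
|
|     tree = Add(Mul(Var("x"), Var("x")), Var("y"))
|     src = f"lambda x, y: {to_source(tree)}"
|     fn = eval(src)                # plain Python function, no interpreter
|     compiled = torch.compile(fn)  # guards now see ordinary code
|
|     x, y = torch.randn(1000), torch.randn(1000)
|     print(torch.allclose(compiled(x, y), x * x + y))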
|
| One torch.compile capability I would be excited to see is
| compatibility with torch.vmap. One selling point of Vexpr is
| that you can use vmap with it, so I was sad when I found I
| couldn't use vmap and still support torch.compile. This made me
| convert a bunch of my GP kernels [2] to be batch-aware. (This
| missing capability is also understandable -- both vmap and
| compile are new.)
|
| Anyway, I'm a fan of what y'all are doing!
|
| [1]
| https://github.com/outergroup/vexpr/blob/e732e034768443386f9...
| [2] https://github.com/outergroup/outer-loop-
| cookbook/blob/5d94c...
| bcoates wrote:
| "For example, what if the parallel sums are of different lengths?
| On GPUs, fast parallel reductions only work when inputs all have
| the same length. [...] Vexpr's vectorizer groups the inputs by
| length and performs a reduced number of operations--one for each
| unique length."
|
| I'm surprised this is necessary; I thought modern vectorization
| on both CPU and GPU handled heterogeneous cases like this handily,
| with conditional execution (on SIMT GPUs) or mask registers (on
| SIMD CPUs).
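|
| For reference, my reading of the grouping trick the post
| describes, as a toy version (not Vexpr's actual code):
|
|     import torch
|
|     # Sum many segments of varying lengths by grouping segments that
|     # share a length, so each unique length needs only one batched
|     # reduction instead of one kernel launch per segment.
|     segments = [torch.randn(3), torch.randn(5),
|                 torch.randn(3), torch.randn(5)]
|
|     by_length = {}
|     for i, seg in enumerate(segments):
|         by_length.setdefault(seg.shape[0], []).append((i, seg))
|
|     out = torch.empty(len(segments))
|     for length, group in by_length.items():
|         idx, segs = zip(*group)
|         stacked = torch.stack(segs)          # (n_with_this_length, length)
|         out[list(idx)] = stacked.sum(dim=1)  # one reduction per unique length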
___________________________________________________________________
(page generated 2023-10-26 23:00 UTC)