[HN Gopher] Scuda - Virtual GPU over IP
___________________________________________________________________
Scuda - Virtual GPU over IP
Author : kevmo314
Score : 188 points
Date : 2024-10-09 13:07 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ranger_danger wrote:
| This appears to only support CUDA on nvidia. I'm curious why they
| didn't just expose /dev/nvidia-uvm as a socket and forward that
| over the network instead of hooking hundreds of functions (maybe
| it's not that simple and I just don't know).
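|
| (For context: the hooking presumably looks something like the
| sketch below. The opcode and wire format here are made up for
| illustration, not scuda's actual protocol. Build it as a
| shared library and load it with LD_PRELOAD so each
| intercepted call gets serialized to a server that holds the
| real GPU.)
|
|     #include <stddef.h>
|     #include <stdint.h>
|     #include <unistd.h>
|
|     typedef int cudaError_t;  /* stand-in for the real type */
|
|     extern int remote_fd;  /* connected socket, set up at
|                               library load time (assumed) */
|
|     /* shadows the real cudaMalloc when LD_PRELOADed */
|     cudaError_t cudaMalloc(void **devPtr, size_t size) {
|         uint8_t op = 1;               /* made-up opcode */
|         uint64_t sz = size, rptr;
|         int32_t err;
|         write(remote_fd, &op, sizeof op);
|         write(remote_fd, &sz, sizeof sz);
|         read(remote_fd, &err, sizeof err);
|         read(remote_fd, &rptr, sizeof rptr);
|         *devPtr = (void *)rptr;  /* only valid server-side */
|         return (cudaError_t)err;
|     }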
| monocasa wrote:
| You can't mmap a socket, and mmap is core to how /dev/nvidia-
| uvm works.
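|
| Easy to check; a minimal demo (on Linux this typically fails
| with ENODEV, since sockets don't implement mmap):
|
|     #include <errno.h>
|     #include <stdio.h>
|     #include <string.h>
|     #include <sys/mman.h>
|     #include <sys/socket.h>
|
|     int main(void) {
|         int sv[2];
|         socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
|         void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
|                        MAP_SHARED, sv[0], 0);
|         if (p == MAP_FAILED)
|             printf("mmap on a socket: %s\n", strerror(errno));
|         return 0;
|     }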
| XorNot wrote:
| Which seems weird to me: if we're going to have device
| files, it's super annoying that they don't really act like
| files.
|
| Like we really should just have enough RDMA in the kernel
| to let that work.
| monocasa wrote:
| At its core, this device file is responsible for managing
| a GPU local address space, and for sharing memory securely
| with that address space in order to have a place to write
| command buffers and data that the gpu can see. It doesn't
| really make sense without a heavy memory mapping component.
|
| A Plan 9-like model, where the device is basically just a
| standard file, would massively cut into gpu performance.
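|
| The usage pattern is roughly the sketch below. The ioctl
| number and request struct are invented for illustration;
| nvidia-uvm's real interface differs.
|
|     #include <fcntl.h>
|     #include <stdint.h>
|     #include <sys/ioctl.h>
|     #include <sys/mman.h>
|
|     void setup_command_buffer(void) {
|         int fd = open("/dev/nvidia-uvm", O_RDWR);
|         /* hypothetical ioctl: ask the driver to create a
|            GPU-visible allocation of 1 MiB */
|         struct { uint64_t size, gpu_va; } req = { 1 << 20 };
|         ioctl(fd, 0xC0DE /* made-up ALLOC */, &req);
|         /* map it so the CPU can write command buffers that
|            the GPU reads directly, with no copying */
|         uint32_t *cmds = mmap(NULL, req.size,
|                               PROT_READ | PROT_WRITE,
|                               MAP_SHARED, fd, 0);
|         cmds[0] = 0x1;  /* the GPU sees this store */
|     }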
| gorkish wrote:
| I agree with you that making RDMA a more accessible
| commodity technology is very important for "the future of
| compute". Properly configuring something like RoCEv2 or
| InfiniBand is expensive and difficult. These technologies
| need to be made more robust in order to be able to run on
| commodity networks.
| majke wrote:
| It's the first time I've heard of /dev/nvidia-uvm. Is there
| any documentation on how the nvidia API works? In
| particular, how strong is the multi-tenancy story? Can two
| users use one GPU and expect reasonable security?
|
| Last time I checked, the GPU did offer some kind of memory
| isolation, but that was only for their datacenter cards,
| not consumer cards.
| monocasa wrote:
| There's not a lot of documentation on how it works. It used
| to live entirely in the closed-source driver; now it's
| mainly a thin bridge to the closed-source firmware blob.
|
| But yes, for more than a decade now even with consumer
| cards, separate user processes have separate hardware
| enforced contexts. This is as true for consumer cards as it
| is for datacenter cards. This is core to how something like
| webgl works without exposing everything else being rendered
| on your desktop to the public Internet. There have been
| bugs, but per-process hardware isolation with a GPU-local
| mmu has been table stakes for a modern gpu for nearly
| twenty years.
|
| What datacenter gpus expose in addition to that is multiple
| virtual gpus, sort of like sr-iov, where a single gpu can
| be exposed to multiple CPU kernels running in virtual
| machines.
| afr0ck wrote:
| Well, it's not impossible. It's just software after all. You
| can mmap a remote device file, but you need OS support to do
| the magical paging for you, probably some sort of page
| ownership tracking protocol like in HMM [1], but outside a
| coherence domain.
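|
| (The closest existing hook I know of is userfaultfd, which
| lets a userspace thread service page faults; that's where a
| remote-paging protocol would plug in. A minimal sketch,
| eliding error handling and the separate fault-handling
| thread a real implementation would need:)
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <linux/userfaultfd.h>
|     #include <sys/ioctl.h>
|     #include <sys/mman.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     int main(void) {
|         int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
|         struct uffdio_api api = { .api = UFFD_API };
|         ioctl(uffd, UFFDIO_API, &api);
|
|         /* register a range so faults on it are reported to
|            us instead of being handled by the kernel */
|         size_t len = 4096;
|         char *region = mmap(NULL, len,
|                             PROT_READ | PROT_WRITE,
|                             MAP_PRIVATE | MAP_ANONYMOUS,
|                             -1, 0);
|         struct uffdio_register reg = {
|             .range = { (unsigned long)region, len },
|             .mode = UFFDIO_REGISTER_MODE_MISSING,
|         };
|         ioctl(uffd, UFFDIO_REGISTER, &reg);
|
|         struct uffd_msg msg;
|         read(uffd, &msg, sizeof msg);  /* blocks on a fault */
|         char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
|                           MAP_PRIVATE | MAP_ANONYMOUS,
|                           -1, 0);
|         /* ...fetch the page from the remote device here... */
|         struct uffdio_copy cp = {
|             .dst = msg.arg.pagefault.address & ~(len - 1),
|             .src = (unsigned long)page,
|             .len = len,
|         };
|         ioctl(uffd, UFFDIO_COPY, &cp); /* resolve the fault */
|         return 0;
|     }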
|
| I was once working on CXL [2] and memory ownership tracking
| in the Linux kernel and wanted to play with Nvidia GPUs,
| but I hit a wall when I realised that a lot of the
| functionality runs on the GSP or in the firmware blob, with
| very little to no documentation. I ended up generally not
| liking Nvidia's system software stack and gave up the
| project. The UVM subsystem in the open kernel driver is a
| bit of an exception, but a lot of the control path is still
| handled from closed-source CUDA libraries in userspace.
|
| tl;dr: it's very hard to do systems hacking with Nvidia GPUs.
|
| [1] https://www.kernel.org/doc/html/v5.0/vm/hmm.html [2]
| https://en.wikipedia.org/wiki/Compute_Express_Link
| monocasa wrote:
| Yeah, the Nvidia stuff isn't really made to be hacked on.
|
| I'd check out the AMD side since you can at least have a
| full open source GPU stack to play with, and they make a
| modicum of effort to document their gpus.
| gorkish wrote:
| Granted, it requires additional support from your
| NICs/switches, but it is probably straightforward to remote
| nvidia-uvm with an RDMA server.
| ghxst wrote:
| This looks more like CUDA over IP, or am I missing something?
| gpuhacker wrote:
| As this mentions some prior art but not rCUDA
| (https://en.m.wikipedia.org/wiki/RCUDA), I'm a bit confused
| about what makes scuda different.
| kevmo314 wrote:
| I've updated the README! rCUDA is indeed an inspiration; in
| fact it inspired scuda's name too :)
| dschuetz wrote:
| More like "virtual CUDA-only GPU" over IP.
| Ey7NFZ3P0nzAe wrote:
| Well, scuda has cuda in the name.
| saurik wrote:
| Reminds me of this, from a couple months ago.
|
| https://news.ycombinator.com/item?id=41203475
| friedtofu wrote:
| Was going to post a reference to the same thing! Not sure
| about you, but when I tested it the network performance was
| incredibly poor, though that may just have been it getting
| hugged to death at the time.
|
| As a user I find having something you can self-host really
| neat, but what I really want is something more like
|
| https://github.com/city96/ComfyUI_NetDist + OP's project
| mashed together.
|
| Say I'm almost able to execute a workflow that would
| normally require ~16GB of VRAM, and I have an nvidia 3060
| 12GB running headless with prime, executing the workflow
| via the CLI.
|
| Right now, I'd probably just have to run the workflow in a
| Paperspace (or any other cloud compute) container, or
| borrow the power of a local Apple M1 when using the second
| repository I mentioned.
|
| I wish I had something that could lend me extra resources
| and temporarily act as either the host GPU or a secondary
| one depending on the memory needed, only when I needed it
| (if that makes sense).
| kbumsik wrote:
| I have heard NVSwitch is used for GPU-to-GPU
| interconnection over a network.
|
| How is this different?
| thelastparadise wrote:
| Orders of magnitude slower.
| nsteel wrote:
| Isn't this GPU-to-CPU? And really slow. And only CUDA. And over
| IP. And implemented in software. I think it's really very
| different.
| meowzor wrote:
| nice
| AkashKaStudio wrote:
| Would this let an Nvidia card be accessible on Apple
| Silicon over TB4, for training on an e-GPU caddy? I'd
| happily relegate my desktop to HTPC/gaming duties.
| some1else wrote:
| You might have a problem using CUDA as part of the name, since
| Nvidia has it trademarked. Maybe you can switch to Scuba if they
| give you trouble; it sounds like a good name for the tool.
| teeray wrote:
| We need to do for CUDA what was done for Jell-O and Kleenex.
| n3storm wrote:
| Buda may Be a Better name
| gchamonlive wrote:
| I have a laptop with a serviceable GPU but only 16GB of
| RAM, and another with a low-tier GPU but 32GB of RAM.
| Wondering: would it be too slow to use the latter as the
| control plane and delegate inference to the former laptop,
| using something like comfyui to run text-to-image models?
| Technetium wrote:
| It would be nice to have a description added.
| rtghrhtr wrote:
| Everyone hates nvidia but treats ATI as an afterthought. Another
| completely useless tool to throw on the pile.
| dahart wrote:
| > Everyone hates nvidia but treats ATI as an afterthought.
|
| Hehe, do you mean AMD?
| chpatrick wrote:
| What year is it?
| gorkish wrote:
| ATI? afterthought, indeed
| kkielhofner wrote:
| This is very interesting, but many of the motivations
| listed are far better served by alternative approaches.
|
| For "remote" model training there is NCCL + Deepspeed/FSDP/etc.
| For remote inference there are solutions like Triton
| Inference Server[0] that can do very high-performance
| hosting of any model. For LLMs specifically there are
| nearly countless implementations.
|
| That said, the ability to use this for testing is
| interesting, but I wonder about GPU contention, and as
| others have noted, the performance of such a solution will
| be terrible even with a relatively high-speed interconnect
| (100/400Gb Ethernet, etc).
|
| NCCL has been optimized to support DMA directly between network
| interfaces and GPUs which is of course considerably faster than
| solutions like this. Triton can also make use of shared memory,
| mmap, NCCL, MPI, etc which is one of the many tricks it uses for
| very performant inference - even across multiple chassis over
| another network layer.
|
| [0] - https://github.com/triton-inference-server/server
| theossuary wrote:
| I don't think NCCL + Deepspeed/FSDP are really an alternative
| to Scuda, as they all require the models in question to be
| designed for distributed training. They also require a lot of
| support in the libraries being used.
|
| This has been a struggle for data scientists for a while now. I
| haven't seen a good solution to allow a data scientist to work
| locally, but utilize GPUs remotely, without basically just
| developing remotely (through a VM or Jupyter), or submitting
| remote jobs (through SLURM or a library-specific Kubernetes
| integration). Scuda is an interesting step towards a better
| solution for utilizing remote GPUs easily across a wide range
| of libraries, not just Pytorch and Tensorflow.
| seattleeng wrote:
| Why is working locally important?
| theossuary wrote:
| Working locally still matters, and this is from someone who
| normally works in tmux/nvim. When working on vision and 3D
| ML work, being able to quickly open a visualizer window is
| imperative to understanding what's going on. For Gaussian
| Splatting, point cloud work, SLAM, etc. you have to have
| access to a desktop environment to see visualizations; they
| very rarely work well remotely (even if they have some
| Jupyter support).
|
| Working remotely when you have to use a desktop environment
| is painful, no matter the technology. The best I've come up
| with is using tmux/vim and sunshine/moonlight, but even
| still I'd rather just have access to everything locally.
| elintknower wrote:
| Curious if this could be simplified to provide NVENC over IP?
___________________________________________________________________
(page generated 2024-10-11 23:02 UTC)