[HN Gopher] Show HN: Attaching to a virtual GPU over TCP
       ___________________________________________________________________
        
       Show HN: Attaching to a virtual GPU over TCP
        
       We developed a tool to trick your computer into thinking it's
       attached to a GPU which actually sits across a network. This allows
       you to switch the number or type of GPUs you're using with a single
       command.
        
       Author : bmodel
       Score  : 130 points
       Date   : 2024-08-09 16:50 UTC (6 hours ago)
        
 (HTM) web link (www.thundercompute.com)
 (TXT) w3m dump (www.thundercompute.com)
        
       | talldayo wrote:
       | > Access serverless GPUs through a simple CLI to run your
       | existing code on the cloud while being billed precisely for usage
       | 
       | Hmm... well I just watched you run nvidia-smi in a Mac terminal,
       | which is a platform it's explicitly not supported on. My instant
       | assumption is that your tool copies my code into a private server
       | instance and communicates back and forth to run the commands.
       | 
       | Does this platform expose eGPU capabilities if my host machine
       | supports it? Can I run raster workloads or network it with my own
        | CUDA hardware? The actual way your tool and service connect
        | isn't very clear to me and I assume other developers will be
        | confused too.
        
         | bmodel wrote:
          | Great questions! To clarify the demo, we were ssh'd into a
          | Linux machine with no GPU.
          | 
          | Going into more detail on how this works: we intercept
          | communication between the CPU and the GPU, so only GPU code
          | and commands are sent across the network to a GPU that we are
          | hosting. This way we are able to virtualize a remote GPU and
          | make your computer think it's directly attached to that GPU.
          | 
          | We are not copying your CPU code and running it on our
          | machines. The CPU code runs entirely on your instance (meaning
          | no files need to be copied over or packages installed on the
          | GPU machine). One of the benefits of this approach is that you
          | can easily scale to a more or less powerful GPU without
          | needing to set up a new server.
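          | 
          | For a rough mental model of the general approach (purely
          | illustrative; the shim below, its rpc_call() helper, and the
          | opcode are made-up placeholders, not our actual code): a
          | userspace library is loaded ahead of the real CUDA runtime,
          | re-implements its entry points, and ships the arguments to
          | the GPU host over the network.
          | 
          |     /* shim.c - toy sketch of userspace API interception.
          |      * Build as a shared library and preload it ahead of
          |      * the real CUDA runtime, e.g.:
          |      *   gcc -shared -fPIC shim.c -o libshim.so
          |      *   LD_PRELOAD=./libshim.so python train.py
          |      */
          |     #include <stddef.h>
          |     #include <stdint.h>
          | 
          |     typedef int cudaError_t;
          |     enum { OP_MALLOC = 1 };
          | 
          |     /* Stand-in for the transport: a real shim would
          |      * serialize the request, send it to the GPU host over
          |      * TCP, and wait for the reply. Stubbed here so the
          |      * sketch compiles. */
          |     static cudaError_t rpc_call(uint32_t op, const void *req,
          |                                 size_t req_len, void *resp,
          |                                 size_t resp_len) {
          |         (void)op; (void)req; (void)req_len;
          |         (void)resp; (void)resp_len;
          |         return 0; /* pretend success */
          |     }
          | 
          |     /* The app calls cudaMalloc() as usual; instead of
          |      * touching a local driver, we forward the request and
          |      * hand back the remote device pointer as an opaque
          |      * handle. */
          |     cudaError_t cudaMalloc(void **devPtr, size_t size) {
          |         uint64_t remote = 0;
          |         cudaError_t err = rpc_call(OP_MALLOC, &size,
          |                                    sizeof(size), &remote,
          |                                    sizeof(remote));
          |         *devPtr = (void *)(uintptr_t)remote;
          |         return err;
          |     }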
        
           | billconan wrote:
           | does this mean you have a customized/dummy kernel gpu driver?
           | 
           | will that cause system instability, say, if the network
           | suddenly dropped?
        
             | bmodel wrote:
              | We are not writing any kernel drivers; this runs entirely
              | in userspace (so it won't result in a CrowdStrike-level
              | crash, haha).
             | 
             | Given that, if the network suddenly dropped then only the
             | process using the GPU would fail.
        
               | ZeroCool2u wrote:
               | How do you do that exactly? Are you using eBPF or
               | something else?
               | 
               | Also, for my ML workloads the most common bottleneck is
               | GPU VRAM <-> RAM copies. Doesn't this dramatically
               | increase latency? Or is it more like it increases latency
               | on first data transfer, but as long as you dump
               | everything into VRAM all at once at the beginning you're
               | fine? I'd expect this wouldn't play super well with stuff
               | like PyTorch data loaders, but would be curious to hear
                | how you've fared when testing.
        
               | bmodel wrote:
                | We intercept API calls and use our own implementation to
                | forward them to a remote machine. No eBPF (which I
                | believe needs to run in the kernel).
               | 
               | As for latency, we've done a lot of work to minimize that
               | as much as possible. You can see the performance we get
               | running inference on BERT from huggingface here:
               | https://youtu.be/qsOBFQZtsFM?t=64. It's still slower than
               | local (mainly for training workloads) but not by as much
               | as you'd expect. We're aiming to reach near parity in the
               | next few months!
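                | 
                | On the "dump everything into VRAM at once" question:
                | yes, that pattern is generally much friendlier to any
                | network-attached GPU, because the per-call round trip
                | is what hurts. A minimal sketch of the difference
                | (illustrative only, not our internal pipelining):
                | 
                |     /* sketch.c - fewer, larger copies amortize the
                |      * per-call overhead, which matters far more when
                |      * the "bus" is a TCP connection. */
                |     #include <cuda_runtime.h>
                |     #include <stdlib.h>
                | 
                |     int main(void) {
                |         size_t step = 1 << 20, n = 256 * step;
                |         char *h = malloc(n), *d;
                |         cudaMalloc((void **)&d, n);
                | 
                |         /* Pattern A: one copy per batch; each call
                |          * pays a full round trip to the GPU host. */
                |         for (size_t off = 0; off < n; off += step)
                |             cudaMemcpy(d + off, h + off, step,
                |                        cudaMemcpyHostToDevice);
                | 
                |         /* Pattern B: stage everything once up front,
                |          * then reuse it from VRAM across epochs. */
                |         cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
                | 
                |         cudaFree(d);
                |         free(h);
                |         return 0;
                |     }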
        
               | samstave wrote:
               | When you release a self-host version, what would be
                | really neat would be to see it across HFT-focused NICs
                | that have huge TCP buffers...
                | 
                | https://www.arista.com/assets/data/pdf/HFT/HFTTradingNetwork...
                | 
                | Basically, taking into account the large buffers and
                | super-time-sensitive nature of HFT networking
                | optimizations, I wonder if your TCP<-->GPU might benefit
                | from both the HW and the learnings of HFT stylings?
        
       | billconan wrote:
       | is this a remote nvapi?
       | 
       | this is awesome. can it do 3d rendering (vulkan/opengl)
        
         | czbond wrote:
          | I am not in this "space", but I second the "this is cool to
          | see"; more stuff like this is needed on HN.
        
           | cpeterson42 wrote:
           | Appreciate the praise!
        
         | bmodel wrote:
         | Thank you!
         | 
         | > is this a remote nvapi
         | 
          | Essentially yes! Just to be clear, this covers the entire GPU,
          | not just the NVAPI (i.e. all of CUDA). This functions like you
          | have the physical card directly plugged into the machine.
          | 
          | Right now we don't support Vulkan or OpenGL since we're mostly
          | focusing on AI workloads; however, we plan to support these in
          | the future (especially if there is interest!)
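          | 
          | To make "directly plugged into the machine" concrete: an
          | unmodified CUDA program shouldn't need to know where the
          | card lives. A standard device query like the one below
          | (nothing Thunder-specific in it) just reports whatever GPU
          | the session is currently attached to:
          | 
          |     /* query.c - plain CUDA runtime device query. */
          |     #include <cuda_runtime.h>
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         int n = 0;
          |         if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
          |             printf("no CUDA device visible\n");
          |             return 1;
          |         }
          |         for (int i = 0; i < n; i++) {
          |             struct cudaDeviceProp p;
          |             cudaGetDeviceProperties(&p, i);
          |             printf("device %d: %s, %zu MiB\n", i, p.name,
          |                    (size_t)(p.totalGlobalMem >> 20));
          |         }
          |         return 0;
          |     }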
        
           | billconan wrote:
           | sorry, I didn't mean nvapi, I meant rmapi.
           | 
           | I bet you saw this https://github.com/mikex86/LibreCuda
           | 
           | they implemented the cuda driver by calling into rmapi.
           | 
           | My understanding is if there is a remote rmapi, other user
           | mode drivers should work out of the box?
        
       | doctorpangloss wrote:
       | I don't get it. Why would I start an instance in ECS, to use your
       | GPUs in ECS, when I could start an instance for the GPUs I want
       | in ECS? Separately, why would I want half of Nitro, instead of
       | real Nitro?
        
         | billconan wrote:
          | It's more transparent to your system. For example, if you have
          | a GUI application that needs GPU acceleration on a thin client
          | (MATLAB, SolidWorks, Blender), you can do so without setting
          | up ECS. You can develop without any GPU, but suddenly have one
          | when you need to run a simulation. This will be way cheaper
          | than AWS.
          | 
          | I think essentially this is solving the same problem Ray
          | (https://www.ray.io/) is solving, but in a more generic way.
          | 
          | It potentially can have finer-grained GPU sharing, like half a
          | GPU.
         | 
         | I'm very excited about this.
        
           | bmodel wrote:
            | Exactly! The finer-grained sharing is one of the key things
            | on our radar right now.
        
             | goku-goku wrote:
             | www.juicelabs.co does all this today, including the GPU
             | sharing and fractionalization.
        
         | bmodel wrote:
         | Great point, there are a few benefits:
         | 
          | 1. If you're actively developing and need a GPU, then you
          | typically would be paying the entire time the instance is
          | running. Using Thunder means you only pay for the GPU while
          | actively using it. Essentially, if you are running CPU-only
          | code you would not be paying for any GPU time. The alternative
          | is to manually turn the instance on and off, which can be
          | annoying.
         | 
          | 2. This allows you to easily scale the type and number of GPUs
          | you're using. For example, say you want to do development on a
          | cheap T4 instance and run a full DL training job on a set of 8
          | A100s. Instead of needing to swap instances and set everything
          | up again, you can just run a command and then start running on
          | the more powerful GPUs.
        
           | doctorpangloss wrote:
           | Okay, but your GPUs are in ECS. Don't I just want this
           | feature from Amazon, not you, and natively via Nitro? Or even
           | Google has TPU attachments.
           | 
           | > 1. If you're actively developing and need a GPU [for
           | fractional amounts of time]...
           | 
           | Why would I need a GPU for a short amount of time during
           | development? For testing?
           | 
           | I don't get it - what would testing an H100 over a TCP
           | connection tell me? It's like, yeah, I can do that, but it
           | doesn't represent an environment I am going to use for real.
           | Nobody runs applications to GPUs on buses virtualized over
           | TCP connections, so what exactly would I be validating?
        
             | bmodel wrote:
              | I don't believe Nitro would allow you to access a GPU
              | that's not directly connected to the CPU that the VM is
              | running on. So swapping between GPU types or scaling to
              | multiple GPUs is still a problem.
             | 
             | From the developer perspective, you wouldn't know that the
             | H100 is across a network. The experience will be as if your
             | computer is directly attached to an H100. The benefit here
             | is that if you're not actively using the H100 (such as when
             | you're setting up the instance or after the training job
             | completes) you are not paying for the H100.
        
               | doctorpangloss wrote:
               | Okay, a mock H100 object would also save me money. I
               | could pretend a 3090 is an A100. "The experience would be
               | that a 3090 is an A100." Apples to oranges comparison?
               | It's using a GPU attached to the machine versus a GPU
               | that crosses a VPC boundary. Do you see what I am saying?
               | 
               | I would never run a training job on a GPU virtualized
               | over TCP connection. I would never run a training job
               | that requires 80GB of VRAM on a 24GB VRAM device.
               | 
               | Whom is this for? Who needs to save kopecks on a single
               | GPU who needs H100s?
        
       | steelbrain wrote:
        | Ah, this is quite interesting! I had a use case where I needed
        | GPU-over-IP, but only for transcoding videos. I had a not-so-
        | powerful AMD GPU in my homelab server that somehow kept crashing
        | the kernel any time I tried to encode videos with it, and also
        | an NVIDIA RTX 3080 in a gaming machine.
       | 
        | So I wrote https://github.com/steelbrain/ffmpeg-over-ip and had
        | the server running on the Windows machine and the client on the
        | media server (could be Plex, Emby, Jellyfin, etc.), and it
        | worked flawlessly.
        
         | crishoj wrote:
         | Interesting. Do you know if your tool supports conversions
         | resulting in multiple files, such as HLS and its myriad of
         | timeslice files?
        
         | bhaney wrote:
         | This is more or less what I was hoping for when I saw the
         | submission title. Was disappointed to see that the submission
         | wasn't actually a useful generic tool but instead a paid cloud
         | service. Of course the real content is in the comments.
         | 
         | As an aside, are there any uses for GPU-over-network other than
         | video encoding? The increased latency seems like it would
         | prohibit anything machine learning related or graphics
         | intensive.
        
           | trws wrote:
            | Some computation tasks can tolerate the latency if they're
            | written with enough overlap and can keep enough of the data
            | resident, but they usually need more performant networking
            | than this. See older efforts like rCUDA for remote CUDA over
            | InfiniBand as an example. It's not ideal, but sometimes
            | worth it. Usually the win is in taking a multi-GPU app and
            | giving it 16 or 32 GPUs rather than a single remote one,
            | though.
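            | 
            | For the curious, the usual shape of that overlap is pinned
            | host memory plus async copies on multiple CUDA streams,
            | roughly like the generic sketch below (details elided):
            | 
            |     /* overlap.cu - copy chunk i+1 while chunk i computes,
            |      * the standard way to hide transfer latency. */
            |     #include <cuda_runtime.h>
            | 
            |     #define CHUNK (1 << 20)
            |     #define NCHUNKS 8
            | 
            |     __global__ void scale(float *x, int n) {
            |         int i = blockIdx.x * blockDim.x + threadIdx.x;
            |         if (i < n) x[i] *= 2.0f;
            |     }
            | 
            |     int main(void) {
            |         float *h, *d;
            |         size_t bytes = (size_t)NCHUNKS * CHUNK * sizeof(float);
            |         cudaMallocHost((void **)&h, bytes); /* pinned */
            |         cudaMalloc((void **)&d, bytes);
            | 
            |         cudaStream_t s[NCHUNKS];
            |         for (int i = 0; i < NCHUNKS; i++)
            |             cudaStreamCreate(&s[i]);
            | 
            |         /* Queue copy + kernel per chunk on its own stream
            |          * so uploads overlap with compute. */
            |         for (int i = 0; i < NCHUNKS; i++) {
            |             float *hp = h + (size_t)i * CHUNK;
            |             float *dp = d + (size_t)i * CHUNK;
            |             cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float),
            |                             cudaMemcpyHostToDevice, s[i]);
            |             scale<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(
            |                 dp, CHUNK);
            |         }
            |         cudaDeviceSynchronize();
            | 
            |         for (int i = 0; i < NCHUNKS; i++)
            |             cudaStreamDestroy(s[i]);
            |         cudaFree(d);
            |         cudaFreeHost(h);
            |         return 0;
            |     }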
        
         | toomuchtodo wrote:
         | Have you done a Show HN yet? If not, please consider doing so!
         | 
         | https://gist.github.com/tzmartin/88abb7ef63e41e27c2ec9a5ce5d...
         | 
         | https://news.ycombinator.com/showhn.html
         | 
         | https://news.ycombinator.com/item?id=22336638
        
       | cpeterson42 wrote:
       | Given the interest here we decided to open up T4 instances for
       | free. Would love for y'all to try it and let us know your
       | thoughts!
        
       | tptacek wrote:
       | This is neat. Were you able to get MIG or vGPUs working with it?
        
         | bmodel wrote:
         | We haven't tested with MIG or vGPU, but I think it would work
         | since it's essentially physically partitioning the GPU.
         | 
         | One of our main goals for the near future is to allow GPU
         | sharing. This would be better than MIG or vGPU since we'd allow
         | users to use the entire GPU memory instead of restricting them
         | to a fraction.
        
           | tptacek wrote:
           | We had a hell of a time dealing with the licensing issues and
           | ultimately just gave up and give people whole GPUs.
           | 
           | What are you doing to reset the GPU to clean state after a
           | run? It's surprisingly complicated to do this securely (we're
           | writing up a back-to-back sequence of audits we did with
           | Atredis and Tetrel; should be publishing in a month or two).
        
             | bmodel wrote:
              | We kill the process to reset the GPU. Since we only store
              | GPU state, that's the only cleanup we need to do.
        
               | tptacek wrote:
               | Hm. Ok. Well, this is all very cool! Congrats on
               | shipping.
        
       | kawsper wrote:
       | Cool idea, nice product page!
       | 
       | Does anyone know if this is possible with USB?
       | 
        | I have a DaVinci Resolve license USB dongle I'd rather not have
        | plugged into my laptop.
        
         | kevmo314 wrote:
         | You can do that with USB/IP: https://usbip.sourceforge.net/
        
       | orsorna wrote:
       | So what exactly is the pricing model? Do I need a quote? Because
       | otherwise I don't see how to determine it without creating an
       | account which is needlessly gatekeeping.
        
         | bmodel wrote:
         | We're still in our beta so it's entirely free for now (we can't
         | promise a bug-free experience)! You have to make an account but
         | it won't require payment details.
         | 
         | Down the line we want to move to a pay-as-you-go model.
        
       | Cieric wrote:
        | This is interesting, but I'm more interested in self-hosting. I
        | already have a lot of GPUs (some running, some not). Does this
        | have a self-hosting option so I can use the GPUs I already have?
        
         | cpeterson42 wrote:
         | We don't support self hosting yet but the same technology
         | should work well here. Many of the same benefits apply in a
         | self-hosted setting, namely efficient workload scheduling, GPU-
         | sharing, and ease-of-use. Definitely open to this possibility
         | in the future!
        
         | goku-goku wrote:
         | Juice does have this ability today! :)
         | 
         | www.juicelabs.co
        
           | Cieric wrote:
           | Thanks, but I've already evaluated JuiceLabs and it does not
           | handle what I need it to. Plus with the whole project going
           | commercial and the community edition being neglected I no
           | longer have any interest in trying to support the project
           | either.
        
         | covi wrote:
         | If you want to use your own GPUs or cloud accounts but with a
         | great dev experience, see SkyPilot.
        
       | cpeterson42 wrote:
       | We created a discord for the latest updates, bug reports, feature
       | suggestions, and memes. We will try to respond to any issues and
       | suggestions as quickly as we can! Feel free to join here:
       | https://discord.gg/nwuETS9jJK
        
       | throwaway888abc wrote:
       | Does it work for gaming on windows ? or even linux ?
        
         | cpeterson42 wrote:
          | In theory, yes. In practice, however, latency between the CPU
          | and the remote GPU makes this impractical.
        
       | rubatuga wrote:
       | What ML packages do you support? In the comments below it says
       | you do not support Vulkan or OpenGL. Does this support AMD GPUs
       | as well?
        
         | bmodel wrote:
          | We have tested this with PyTorch and Hugging Face and it is
          | mostly stable (we know there are issues with PyCUDA and JAX).
          | In theory this should work with any library; however, we're
          | still actively developing this, so bugs will show up.
        
       | the_reader wrote:
       | Would be possible to mix it with Blender?
        
         | bmodel wrote:
          | At the moment our tech is Linux-only, so it would not work
          | with Blender.
         | 
         | Down the line, we could see this being used for batched render
         | jobs (i.e. to replace a render farm).
        
           | comex wrote:
           | Blender can run on Linux...
        
             | bmodel wrote:
             | Oh nice, I didn't know that! In that case it might work,
             | you could try running `tnr run ./blender` (replace the
             | ./blender with how you'd launch blender from the CLI) to
             | see what happens. We haven't tested it so I can't make
             | promises about performance or stability :)
        
       | teaearlgraycold wrote:
       | This could be perfect for us. We need very limited bandwidth but
       | have high compute needs.
        
         | bmodel wrote:
         | Awesome, we'd love to chat! You can reach us at
         | founders@thundercompute.com or join the discord
         | https://discord.gg/nwuETS9jJK!
        
         | goku-goku wrote:
         | Feel free to reach out www.juicelabs.co
        
       | bkitano19 wrote:
       | this is nuts
        
         | cpeterson42 wrote:
         | We think so too, big things coming :)
        
         | goku-goku wrote:
         | www.juicelabs.co
        
       | dishsoap wrote:
       | For anyone curious about how this actually works, it looks like a
       | library is injected into your process to hook these functions [1]
       | in order to forward them to the service.
       | 
       | [1] https://pastebin.com/raw/kCYmXr5A
        
       | Zambyte wrote:
       | Reminds me of Plan9 :)
        
         | K0IN wrote:
         | can you elaborate a bit on why? (noob here)
        
       | radarsat1 wrote:
        | I'm confused: if this operates at the CPU/GPU boundary, doesn't
        | it create a massive I/O bottleneck for any dataset that doesn't
        | fit into VRAM? I'm probably misunderstanding how it works, but
        | if it intercepts GPU I/O then it must stream your entire dataset
        | to a remote machine on every epoch, which sounds wasteful.
        | Probably I'm not getting this right.
        
         | bmodel wrote:
         | That understanding of the system is correct. To make it
         | practical we've implemented a bunch of optimizations to
         | minimize I/O cost. You can see how it performs on inference
         | with BERT here: https://youtu.be/qsOBFQZtsFM?t=69.
         | 
         | The overheads are larger for training compared to inference,
         | and we are implementing more optimizations to approach native
         | performance.
        
           | radarsat1 wrote:
           | Aah ok thanks, that was my basic misunderstanding, my mind
           | just jumped straight to my current training needs but for
           | inference it makes a lot of sense. Thanks for the
           | clarification.
        
       | winecamera wrote:
       | I saw that in the tnr CLI, there are hints of an option to self-
       | host a GPU. Is this going to be a released feature?
        
         | cpeterson42 wrote:
         | We don't support self-hosting yet but are considering adding it
         | in the future. We're a small team working as hard as we can :)
         | 
         | Curious where you see this in the CLI, may be an oversight on
         | our part. If you can join the Discord and point us to this bug
         | we would really appreciate it!
        
       | test20240809 wrote:
       | pocl (Portable Computing Language) [1] provides a remote backend
       | [2] that allows for serialization and forwarding of OpenCL
       | commands over a network.
       | 
       | Another solution is qCUDA [3] which is more specialized towards
       | CUDA.
       | 
       | In addition to these solutions, various virtualization solutions
       | today provide some sort of serialization mechanism for GPU
       | commands, so they can be transferred to another host (or
       | process). [4]
       | 
        | One example is the QEMU-based Android Emulator. It uses special
        | translator libraries and a "QEMU Pipe" to efficiently
        | communicate GPU commands from the virtualized Android OS to the
        | host OS [5].
       | 
       | The new Cuttlefish Android emulator [6] uses Gallium3D for
       | transport and the virglrenderer library [7].
       | 
        | I'd expect that the current virtio-gpu implementation in QEMU
        | [8] might make this job even easier, because it includes
        | Android's gfxstream [9] (formerly called "Vulkan Cereal"), which
        | should already support communication over network sockets out of
        | the box.
       | 
        | [1] https://github.com/pocl/pocl
        | 
        | [2] https://portablecl.org/docs/html/remote.html
        | 
        | [3] https://github.com/coldfunction/qCUDA
        | 
        | [4] https://www.linaro.org/blog/a-closer-look-at-virtio-and-gpu-...
        | 
        | [5] https://android.googlesource.com/platform/external/qemu/+/em...
        | 
        | [6] https://source.android.com/docs/devices/cuttlefish/gpu
        | 
        | [7] https://cs.android.com/android/platform/superproject/main/+/...
        | 
        | [8] https://www.qemu.org/docs/master/system/devices/virtio-gpu.h...
        | 
        | [9] https://android.googlesource.com/platform/hardware/google/gf...
        
         | fpoling wrote:
         | Zscaler uses a similar approach in their remote browser. WebGL
         | in the local browser exposed as a GPU to a Chromium instance in
         | the cloud.
        
       | mmsc wrote:
       | What's it like to actually use this for any meaningful
       | throughput? Can this be used for hash cracking? Every time I
       | think about virtual GPUs over a network, I think about botnets.
       | Specifically from
       | https://www.hpcwire.com/2012/12/06/gpu_monster_shreds_passwo...
       | "Gosney first had to convince Mosix co-creator Professor Amnon
       | Barak that he was not going to "turn the world into a giant
       | botnet.""
        
         | cpeterson42 wrote:
          | This is definitely an interesting thought experiment; however,
          | in practice our system is closer to AWS than a botnet, as the
          | GPUs are not distributed. This technology does lend itself to
          | some interesting applications we are exploring, such as
          | creating very flexible clusters within data centers.
        
       ___________________________________________________________________
       (page generated 2024-08-09 23:00 UTC)