[HN Gopher] Using Libc for GPUs
___________________________________________________________________
Using Libc for GPUs
Author : hochmartinez
Score : 182 points
Date : 2024-12-11 15:29 UTC (4 days ago)
(HTM) web link (libc.llvm.org)
(TXT) w3m dump (libc.llvm.org)
| almostgotcaught wrote:
| this was presented at llvm this year
| https://www.youtube.com/watch?v=4TxGWis1mws - it was a nice talk.
| kelsey98765431 wrote:
| Running normal c directly on gpu has been the dream for a long
| time. this looks excellent
| C-programmer wrote:
| Genuinely, why?
|
| - For new code: you can do just fine without any of the functions
| [here](https://libc.llvm.org/gpu/support.html#libc-gpu-support).
|
| - For old code, either:
|     * Your project is large enough that you are likely using
|       an unsupported libc function somewhere, or
|     * Your project is small enough that you would benefit from
|       just implementing a new kernel yourself.
|
| I am biased because I avoid the C standard library even on the
| CPU, but this seems like a technology that raises the floor not
| the ceiling of what is possible.
| tredre3 wrote:
| > this seems like a technology that raises the floor not the
| ceiling of what is possible.
|
| In your view, how is making GPU programming easier a bad
| thing?
| convolvatron wrote:
| that's clearly not a bad thing. however encouraging people
| to run mutating, procedural code with explicit loops and
| aliasing maybe isn't the right path to get there.
| particularly if you just drag forward all the weird old
| baggage with libc and its horrible string conventions.
|
| I think any programming environment that treats a gpu as a
| really slow serial cpu isn't really what you want(?)
| quotemstr wrote:
| What if it encourages people to write parallel and
| functional code on CPUs? That'd be a good thing.
| Influence works both ways.
|
| The bigger problem is that GPUs have various platform
| features (shared memory, explicit cache residency and
| invalidation management) that CPUs sadly don't yet. Sure,
| you _could_ expose these facilities via compiler
| intrinsics, but then you end up with code that might be
| syntactically valid C but is alien both to CPUs and human
| minds.
| fc417fc802 wrote:
| > is alien both to CPUs and human minds
|
| On the contrary I would love that. The best case scenario
| in my mind is being able to express the native paradigms
| of all relevant platforms while writing a single piece of
| code that can then be compiled for any number of backends
| and dynamically retargeted between them at runtime. It
| would make debugging and just about everything else SO
| MUCH EASIER.
|
| The equivalent of being able to compile some subset of
| functions for both ARM and x86 and then being able to
| dynamically dispatch to either version at runtime, except
| replace ARM with a list of all the GPU ISAs that you care
| about.
| JonChesterfield wrote:
| One thing this gives you is syscall on the gpu. Functions
| like sprintf are just blobs of userspace code, but others
| like fopen require support from the operating system (or
| whatever else the hardware needs you to do). That plumbing
| was decently annoying to write for the gpu.
|
| These aren't gpu kernels. They're functions to call from
| kernels.
| almostgotcaught wrote:
| > One thing this gives you is syscall on the gpu
|
| i wish people in our industry would stop (forever,
| completely, absolutely) using metaphors/allusions. it's a
| complete disservice to anyone that isn't in on the trick.
| it doesn't give you syscalls. that's impossible because
| there's no sys/os on a gpu and your actual os does not
| (necessarily) have any way to peer into the address
| space/scheduler/etc of a gpu core.
|
| what it gives you is something that's working really really
| hard to pretend to be a syscall:
|
| > Traditionally, the C library abstracts over several
| functions that interface with the platform's operating
| system through system calls. The GPU however does not
| provide an operating system that can handle target
| dependent operations. Instead, we implemented remote
| procedure calls to interface with the host's operating
| system while executing on a GPU.
|
| https://libc.llvm.org/gpu/rpc.html.
| quotemstr wrote:
| It's a matter of perspective. If you think of the GPU as
| a separate computer, you're right. If you think of it as
| a coprocessor, then the use of RPC is just an
| implementation detail of the system call mechanism, not a
| semantically different thing.
|
| When an old school 486SX delegates a floating point
| instruction to a physically separate 487DX coprocessor,
| is it executing an instruction or doing an RPC? If RPC,
| does the same instruction start being a real instruction
| when you replace your 486SX with a 486DX, with an
| integrated FPU? The program can't tell the difference!
| almostgotcaught wrote:
| > It's a matter of perspective. If you think of the GPU
| as a separate computer, you're right.
|
| this perspective is a function of exactly one thing: do
| you care about the performance of your program? if not
| then sure indulge in whatever abstract perspective you
| want ("it's magic, i just press buttons and the lights
| blink"). but if you don't care about perf then why are
| you using a GPU at all...? so for people that aren't just
| randomly running code on a GPU (for shits and giggles),
| the distinction is very significant between "syscall" and
| syscall.
|
| people who say these things don't program GPUs for a
| living. there are no abstractions unless you don't care
| about your program's performance (in which case why are
| you using a GPU at all).
| quotemstr wrote:
| Not everything in every program is performance critical.
| A pattern I've noticed repeatedly among CUDAheads is the
| idea that "every cycle matters" and therefore we should
| uglify and optimize even cold parts of our CUDA programs.
| That's as much BS on GPU as it is on CPU. In CPU land, we
| moved past this sophomoric attitude decades ago. The GPU
| world might catch up one day.
|
| Are you planning on putting fopen() in an inner loop or
| something? LOL
| nickysielicki wrote:
| genuinely asking: where else should ML engineers focus
| their time, if not on looking at datapath bottlenecks in
| either kernel execution or the networking stack?
| lmm wrote:
| The point is that you should focus on the bottlenecks,
| not on making every random piece of code "as fast as
| possible". And that sometimes other things
| (maintainability, comprehensibility, debuggability) are
| more important than maximum possible performance, even on
| the GPU.
| nickysielicki wrote:
| That's fair, but I didn't understand OP to be claiming
| above that "cudaheads" aren't looking at their
| performance bottlenecks before driving work, just that
| they're looking at the problem incorrectly (and eg: maybe
| should prioritize redesigns over squeezing perf out of
| flawed approaches.)
| almostgotcaught wrote:
| > A pattern I've noticed repeatedly among CUDAheads is
| the idea that "every cycle matters" and therefore we
| should uglify and optimize even cold parts of our CUDA
| programs
|
| I don't know what a "cudahead" is but if you're gonna
| build up a strawman just to chop it down have at it.
| Doesn't change anything about my point - these aren't
| syscalls because there's no sys. I mean the dev here
| literally spells it out correctly so I don't understand
| why there's any debate.
| oivey wrote:
| The whole reason CUDA/GPUs are fast is that they
| explicitly don't match the architecture of CPUs. The
| truly sophomoric attitude is that all compute devices
| should work like CPUs. The point of CUDA/GPUs is to
| provide a different set of abstractions than CPUs that
| enable much higher performance for certain problems.
| Forcing your GPU to execute CPU-like code is a bad
| abstraction.
|
| Your comment about putting fopen in an inner loop really
| betrays that. Every thread in your GPU kernel is going to
| have to wait for your libc call. You're really confused
| if you're talking about hot loops in a GPU kernel.
| saagarjha wrote:
| > A pattern I've noticed repeatedly among CUDAheads is
| the idea that "every cycle matters" and therefore we
| should uglify and optimize even cold parts of our CUDA
| programs.
|
| You're talking to the wrong people; this is definitely
| not true in general.
| JonChesterfield wrote:
| The "proper syscall" isn't a fast thing either. The
| context switch blows out your caches. Part of why I like
| the name syscall is it's an indication to not put it on
| the fast path.
|
| The implementation behind this puts a lot of emphasis on
| performance, though the protocol was heavily simplified in
| upstreaming. Running on pcie instead of the APU systems
| makes things rather laggy too. Design is roughly a mashup
| of io_uring and occam, made much more annoying by the GPU
| scheduler constraints.
|
| The two authors of this thing probably count as people
| who program GPUs for a living for what it's worth.
| saati wrote:
| A 486SX never delegates floating point instructions, the
| 487 is a full 486DX that disables the SX and fully takes
| over, you are thinking of 386 and older.
| JonChesterfield wrote:
| Well, I called it syscall because it's a void function of
| 8 u64 arguments which your code stumbles into, gets
| suspended, then restored with new values for those
| integers. That it's a function instead of an instruction
| doesn't change the semantics. My favourite of the uses of
| that is to pass six of those integers to the x64 syscall
| operation.
|
| This isn't misnaming. It's a branch into a trampoline
| that messes about with shared memory to give the effect
| of the x64 syscall you wanted, or some other thing that
| you'd rather do on the cpu.
|
| There's a gpu thing called trap which is closer in
| behaviour to what you're thinking of but it's really
| annoying to work with.
|
| Side note, RPC has a terrible rep for introducing failure
| modes into APIs, but that's completely missing here
| because pcie either works or your machine is gonna have
| to reboot. There are no errors on the interface that can
| be handled by the application.
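|
| Roughly, the host side of that shape looks something like the
| sketch below (illustrative only - the packet layout and state
| values are made up, not the actual LLVM libc protocol): the GPU
| parks 8 u64s in shared memory, a host thread forwards six of
| them to the real x64 syscall, and the return value travels back
| the same way.
|
|       #define _GNU_SOURCE           /* for syscall() */
|       #include <stdatomic.h>
|       #include <stdint.h>
|       #include <unistd.h>
|
|       struct gpu_syscall_packet {   /* hypothetical layout */
|         _Atomic uint32_t state;     /* 0 = empty, 1 = request, 2 = reply */
|         uint64_t arg[8];            /* arg[0] = number, arg[1..6] = args */
|       };
|
|       /* Host-side service loop: forward six of the eight integers
|        * to the x64 syscall instruction via libc's syscall(). */
|       void host_service(struct gpu_syscall_packet *p) {
|         for (;;) {
|           if (atomic_load_explicit(&p->state, memory_order_acquire) != 1)
|             continue;               /* real code sleeps/yields here */
|           long r = syscall(p->arg[0], p->arg[1], p->arg[2], p->arg[3],
|                            p->arg[4], p->arg[5], p->arg[6]);
|           p->arg[0] = (uint64_t)r;  /* result rides back in the packet */
|           atomic_store_explicit(&p->state, 2, memory_order_release);
|         }
|       }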
| almostgotcaught wrote:
| > Well, I called it syscall because it's a void function
| of 8 u64 arguments which your code stumbles into, gets
| suspended, then restored with new values for those
| integers
|
| I'll put it really simply: is there a difference (in perf,
| semantics, whatever) between using these "syscalls" to
| implement fopen on GPU and using a syscall to implement
| fopen on CPU? Note that's a rhetorical question because
| we both already know that the answer is yes. So again
| you're just playing sleight of hand in calling them
| syscalls and I'll emphasize: this is a sleight of hand
| that the dev himself doesn't play (so why would I take
| your word over his).
| JonChesterfield wrote:
| Wonderfully you don't need to trust my words, you've got
| my code :)
|
| If semantics are different, that's a bug/todo. It'll have
| worse latency than a CPU thread making the same kernel
| request. Throughput shouldn't be way off. The GPU writes
| some integers to memory that the CPU will need to read,
| and then write other integers, and then load those again.
| Plus whatever the x64 syscall itself does. That's a bunch
| of cache line invalidation and reads. It's not as fast as
| if the hardware guys were on board with the strategy but
| I'm optimistic it can be useful today and thus help
| justify changing the hardware/driver stack.
|
| The whole point of libc is to paper over the syscall
| interface. If you start from musl, "syscall" can be a
| table of function pointers or asm. Glibc is more
| obstructive. This libc open codes a bunch of things, with
| a rpc.h file dealing with synchronising memcpy of
| arguments to/from threads running on the CPU which get to
| call into the Linux kernel directly. It's mainly
| carefully placed atomic operations to keep the data
| accesses well defined.
|
| There's also nothing in here which random GPU devs can't
| build themselves. The header files are (now) self
| contained if people would like to use the same mechanism
| for other functionality and don't want to handroll the
| data structure. The most subtle part is getting this to
| work correctly under arbitrary warp divergence on volta.
| It should be an out of the box thing under openmp early
| next year too.
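|
| As a flavour of what "carefully placed atomics" means here, a
| single-slot version of the idea might look like this (an
| illustrative sketch, not rpc.h itself - the real thing handles
| multiple ports and warp divergence): ordinary memcpy of the
| payload, with acquire/release ordering on an ownership flag so
| the data accesses stay well defined.
|
|       #include <stdatomic.h>
|       #include <stdint.h>
|       #include <string.h>
|
|       enum { SLOT_SIZE = 1024 };
|
|       struct mailbox {            /* lives in CPU+GPU visible memory */
|         _Atomic uint32_t owner;   /* 0 = device writes, 1 = host reads */
|         uint32_t length;
|         unsigned char payload[SLOT_SIZE];
|       };
|
|       /* Device side (the real code uses the GPU's own atomics). */
|       void device_send(struct mailbox *m, const void *data, uint32_t len) {
|         while (atomic_load_explicit(&m->owner, memory_order_acquire) != 0)
|           ;                       /* wait for the host to drain the slot */
|         memcpy(m->payload, data, len);
|         m->length = len;
|         atomic_store_explicit(&m->owner, 1, memory_order_release);
|       }
|
|       /* Host side: take ownership, consume, hand the slot back. */
|       uint32_t host_receive(struct mailbox *m, void *out) {
|         while (atomic_load_explicit(&m->owner, memory_order_acquire) != 1)
|           ;                       /* poll; real code yields or sleeps */
|         uint32_t len = m->length;
|         memcpy(out, m->payload, len);
|         atomic_store_explicit(&m->owner, 0, memory_order_release);
|         return len;
|       }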
| almostgotcaught wrote:
| > Wonderfully you don't need to trust my words, you've
| got my code :)
|
| My friend it's so incredibly bold of you to claim credit
| for this work when
|
| 1. Joe presented it
|
| 2. Joe's name is the only name on the git blame
|
| 3. _I know Joe_ and _I know_ he did the lion's share of
| the work
|
| And so I'll repeat: Joe himself calls it rpc so I'm gonna
| keep calling it rpc and not syscall.
| jhuber6 wrote:
| The RPC implementation in LLVM is an adaptation of Jon's
| original state machine (see
| https://github.com/JonChesterfield/hostrpc). It looks
| very different at this point, but we collaborated on the
| initial design before I fleshed out everything else.
| Syscall or not is a bit of a semantic argument, but I
| lean more towards syscall 'inspired'.
| JonChesterfield wrote:
| Here's the algorithm
| https://doi.org/10.1145/3458744.3473357. My paper with
| Joseph on the implementation is at
| https://doi.org/10.1007/978-3-031-40744-4_15.
|
| The syscall layer this runs on was written at
| https://github.com/JonChesterfield/hostrpc, 800 commits
| from May 2020 until Jan 2023. I deliberately wrote that
| in the open, false paths and mistakes and all. Took ages
| for a variety of reasons, not least that this was my side
| project.
|
| You'll find the upstream of that scattered across the
| commits to libc, mostly authored by Joseph (log shows 300
| for him, of which I reviewed 40, and 25 for me). You
| won't find the phone calls and offline design
| discussions. You can find the tricky volta solution at
| https://reviews.llvm.org/D159276 and the initial patch to
| llvm at https://reviews.llvm.org/D145913.
|
| GPU libc is definitely Joseph's baby, not mine, and this
| wouldn't be in trunk if he hadn't stubbornly fought
| through the headwinds to get it there. I'm excited to see
| it generating some discussion on here.
|
| But yeah, I'd say the syscall implementation we're
| discussing here has my name adequately written on it to
| describe it as "my code".
| rowanG077 wrote:
| Why does a perf difference factor into it? There is no
| requirement for a syscall to be this fast or else it
| isn't a syscall. If you have a hot loop you shouldn't be
| putting a syscall in it, not even on the CPU.
| saagarjha wrote:
| It's not a syscall any more than a call into a Wine thunk
| is a syscall. Sure, it implements and emulates a syscall.
| But it's not a syscall.
| nickysielicki wrote:
| https://developer.nvidia.com/blog/simplifying-gpu-
| applicatio...
| JonChesterfield wrote:
| > Genuinely, why?
|
| > ... this seems like a technology that raises the floor not
| the ceiling of what is possible.
|
| The root cause reason for this project existing is to show
| that GPU programming is not synonymous with CUDA (or the
| other offloading languages).
|
| It's nominally to help people run existing code on GPUs.
| Disregarding that use case, it shows that GPUs can actually
| do things like fprintf or open sockets. This is obvious to
| the implementation but seems largely missed by application
| developers. Lots of people think GPUs can only do floating
| point math.
|
| Especially on an APU, where the GPU units and the CPU cores
| can hammer on the same memory, it is a travesty to persist
| with the "offloading to accelerator" model. Raw C++ isn't an
| especially sensible language to program GPUs in but it's
| workable and I think it's better than CUDA.
| rbanffy wrote:
| > Lots of people think GPUs can only do floating point
| math.
|
| IIRC, every Raspberry Pi is brought up by the GPU setting
| up the system before the CPU is brought out of reset and
| the bootloader looks for the OS.
|
| > it is a travesty to persist with the "offloading to
| accelerator" model.
|
| Operating systems would need to support heterogeneous
| processors running programs with different ISAs accessing
| the same pools of memory. I'd _LOVE_ to see that. It'd be
| extremely convenient to have first-class processes running
| on the GPU MIMD cores.
|
| I'm not sure there is much research done in that space. I
| believe IBM mainframe OSs have something like that because
| programmers are exposed to the various hardware assists
| that run as coprocessors sharing the main memory with the
| OS and applications.
| als0 wrote:
| > I'm not sure there is much research done in that space.
|
| There is. And the finest example I can think of is
| Barrelfish https://barrelfish.org
| rbanffy wrote:
| Interesting - it resembles a network of heterogeneous
| systems that can share a memory space used primarily for
| explicit data exchange. Not quite what I was imagining,
| but probably much simpler to implement than a Unix where
| the kernel can see processes running on different ISAs on
| a shared memory space.
|
| I guess hardware availability is an issue, as there
| aren't many computers with, say, an ARM, a RISC-V, an
| x86, and an AMD iGPU sharing a common memory pool.
|
| OTOH, there are many where a 32-bit ARM shares the memory
| pool with 64-bit cores. Usually the big cores run
| applications while the small ARM does housekeeping or
| other low-latency tasks.
| als0 wrote:
| > Not quite what I was imagining, but probably much
| simpler to implement than a Unix where the kernel can see
| processes running on different ISAs on a shared memory
| space.
|
| Indeed. The other argument is that treating the computer
| as a distributed system can make it scale better to say
| hundreds of cores compared to a lock-based SMP system.
| rbanffy wrote:
| > treating the computer as a distributed system
|
| Sure, but where's the fun in that?
|
| Up to GPGPUs, there was no reason to build a machine with
| multiple CPUs of different architectures except running
| different OSs on them (such as the Macs, Suns and Unisys
| mainframes with x86 boards for running Windows side-by-
| side with a more civilized OS). With GPGPUs you have
| machines with a set of processors that are good on many
| things, but not great at SIMD and one that's awesome at
| SIMD, but sucks for most other things.
|
| And, as I mentioned before, there are lots of ARM
| machines with 64-bit and ultra-low-power 32-bit cores
| sharing the same memory map. Also, even x86 variants with
| different ISA extensions can be treated as different
| architectures by the OS - Intel had to limit the fast
| cores of its early asymmetric parts because the low-power
| cores couldn't do AVX512 and OSs would not support
| migrating a process to the right core on an invalid
| instruction fault.
| saagarjha wrote:
| The problem is that GPUs are kind of bad at being
| general-purpose, so it doesn't really make sense to
| expose the hardware that way.
| rbanffy wrote:
| If the OS supports it, you can make programs that start
| threads on CPUs and GPUs and let those communicate. You
| run the SIMD-ish functions on the GPUs and the non-SIMD-
| heavy functions on the CPU cores.
|
| I have a strong suspicion GPUs aren't as bad at general-
| purpose stuff as we perceive and we underutilize them
| because it's inconvenient to shuttle data over an
| architectural wall that's not really there in iGPUs.
|
| Maybe it doesn't make sense, but it'd be worth looking
| into just to know where the borders of the problem lie.
| krackers wrote:
| > Disregarding that use case, it shows that GPUs can
| actually do things like fprintf or open sockets.
|
| Can you elaborate on this? My mental model of GPU is
| basically like a huge vector coprocessor. How would things
| like printf or sockets work directly from the GPU when they
| require syscalls to trap into the OS kernel? Given that the
| kernel code is running on the CPU, that seems to imply that
| there needs to be a handover at some point. Or conversely
| even if there was unified memory and the GPU could directly
| address memory-mapped peripherals, you'd basically need to
| reimplement drivers wouldn't you?
| JonChesterfield wrote:
| It's mostly terminology and conventions. On the standard
| system setup, the linux kernel running in a special
| processor mode does these things. Linux userspace asks
| the kernel to do stuff using syscall and memory which
| both kernel and userspace can access. E.g. the io_uring
| register followed by writing packets into the memory.
|
| What the GPU has is read/write access to memory that the
| CPU can also access. And network peripherals etc. You can
| do things like alternately compare-and-swap on the same
| page from x64 threads and amdgpu kernels and it works,
| possibly not quickly on some systems. That's also all
| that the x64 CPU threads have though, modulo the magic
| syscall instruction to ask the kernel to do stuff.
|
| People sometimes get quite cross at my claim that the GPU
| can do fprintf. Cos _actually_ all it can do is write
| numbers into shared memory or raise interrupts such that
| the effect of fprintf is observed. But that's also all
| the userspace x64 threads do, and this is all libc
| anyway, so I don't see what people are so cross about.
| You're writing C, you call `fprintf(stderr, "Got to
| L42\n");` or whatever, and you see the message on the
| console.
|
| If fprintf compiles into a load of varargs mangling with
| a fwrite underneath, and the varargs stuff runs on the
| GPU silicon and the fwrite goes through a staging buffer
| before some kernel thread deals with it, that seems fine.
|
| I'm pretty sure you could write to an nvme drive directly
| from the gpu, no talking to the host kernel at all, at
| which point you've arguably implemented (part of?) a
| driver for it. You can definitely write to network cards
| from them, without using any of this machinery.
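|
| To make that concrete, the split could look roughly like this
| (an illustrative sketch; stage_to_host stands in for the shared
| staging-buffer machinery and is a made-up name): the varargs
| formatting happens on the GPU, and only the finished bytes cross
| over for the host thread to fwrite.
|
|       #include <stdarg.h>
|       #include <stdio.h>
|
|       /* Hypothetical: hands len bytes to a staging buffer that a
|        * host thread drains with fwrite(buf, 1, len, stderr). */
|       extern void stage_to_host(const char *buf, size_t len);
|
|       int gpu_fprintf_stderr(const char *fmt, ...) {
|         char buf[256];                /* formatted on the GPU silicon */
|         va_list ap;
|         va_start(ap, fmt);
|         int n = vsnprintf(buf, sizeof buf, fmt, ap);
|         va_end(ap);
|         if (n > 0)
|           stage_to_host(buf, (size_t)n < sizeof buf ? (size_t)n
|                                                     : sizeof buf - 1);
|         return n;
|       }
|
| Calling gpu_fprintf_stderr("Got to L42\n") from a kernel would
| then show up on the console once the host drains the buffer.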
| saagarjha wrote:
| We don't actually allow a GPU to directly fprintf,
| because GPU can't syscall. Only userspace can do that.
| You can have userspace keep polling and then do it on
| behalf of the GPU, but that's not the GPU doing it.
| adrian_b wrote:
| The GPU could do the equivalent of fprintf, if the
| concerned peripherals used only memory-mapped I/O and the
| IOMMU were configured to allow the GPU to access those
| peripherals directly, without any involvement from
| the OS kernel that runs on the CPU.
|
| This is the same as on the CPU, where the kernel can
| allow a user process to access directly a peripheral,
| without using system calls, by mapping that peripheral in
| the memory space of the user process.
|
| In both cases the peripheral must be assigned exclusively
| to the GPU or the user process. What is lost by not using
| system calls is the ability to share the peripheral
| between multiple processes, but the performance for the
| exclusive user of the peripheral can be considerably
| increased. Of course, the complexity of the user process
| or GPU code is also increased, because it must include
| the equivalent of the kernel device driver for that
| peripheral.
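|
| The CPU-side version of that last paragraph is well trodden; a
| minimal sketch (assuming a device exposed through the UIO
| framework at /dev/uio0; the register offsets are made up):
|
|       #include <fcntl.h>
|       #include <stdint.h>
|       #include <stdio.h>
|       #include <sys/mman.h>
|       #include <unistd.h>
|
|       int main(void) {
|         int fd = open("/dev/uio0", O_RDWR | O_SYNC);
|         if (fd < 0) { perror("open"); return 1; }
|
|         /* Map the device's first memory region into this process.
|          * From here on, talking to the peripheral is plain loads
|          * and stores - no system calls. */
|         size_t len = 0x1000;
|         volatile uint32_t *regs =
|             mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
|         if (regs == MAP_FAILED) { perror("mmap"); return 1; }
|
|         uint32_t status = regs[0];    /* hypothetical status register */
|         regs[1] = 0x1;                /* hypothetical doorbell write */
|         printf("status = 0x%x\n", status);
|
|         munmap((void *)regs, len);
|         close(fd);
|         return 0;
|       }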
| jhuber6 wrote:
| At some point I was looking into using io_uring for
| something like this. The uring interface just works off
| of `mmap()` memory, which can be registered with the
| GPU's MMU. There's a submission polling setting, which
| means that the GPU can simply write to the pointer and
| the kernel will eventually pick up the write syscall
| associated with it. That would allow you to use
| `snprintf` locally into a buffer and then block on its
| completion. The issue is that the kernel thread goes to
| sleep after some time, so you'd still need a syscall from
| the GPU to wake it up. AMD GPUs actually support software
| level interrupts which could be routed to a syscall, but
| I didn't venture too deep down that rabbit hole.
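|
| For reference, the host-side setup being described is roughly
| the following (a sketch using liburing; the speculative part is
| mapping the ring memory so the GPU can write entries, and note
| that older kernels want registered files and extra privileges
| for SQPOLL):
|
|       #include <liburing.h>
|       #include <stdio.h>
|       #include <string.h>
|
|       int main(void) {
|         struct io_uring ring;
|         struct io_uring_params params;
|         memset(&params, 0, sizeof(params));
|         params.flags = IORING_SETUP_SQPOLL; /* kernel thread polls the SQ */
|         params.sq_thread_idle = 2000;       /* ms before the poller sleeps */
|
|         int ret = io_uring_queue_init_params(8, &ring, &params);
|         if (ret < 0) {
|           fprintf(stderr, "init failed: %d\n", ret);
|           return 1;
|         }
|
|         const char msg[] = "hello from a polled ring\n";
|         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|         io_uring_prep_write(sqe, 1 /* stdout */, msg, sizeof(msg) - 1, 0);
|         io_uring_submit(&ring);             /* mostly wakes the poller */
|
|         struct io_uring_cqe *cqe;
|         io_uring_wait_cqe(&ring, &cqe);     /* block until the write lands */
|         io_uring_cqe_seen(&ring, cqe);
|         io_uring_queue_exit(&ring);
|         return 0;
|       }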
| einpoklum wrote:
| > The root cause reason for this project existing is to
| show that GPU programming is not synonymous with CUDA (or
| the other offloading languages).
|
| 1. The ability to use a particular library does not reflect
| much on which languages can be used.
|
| 2. Once you have PTX as a backend target for a compiler,
| obviously you can use all sorts of languages on the
| frontend - which NVIDIA's drivers and libraries won't even
| know about. Or you can just use PTX as your language -
| making your point that GPU programming is not synonymous
| with CUDA C++.
|
| > It's nominally to help people run existing code on GPUs.
|
| I'm worried you might be right. But - we should really not
| encourage people to run existing CPU-side code on GPUs,
| that's rarely (or maybe never?) a good idea.
|
| > Raw C++ isn't an especially sensible language to program
| GPUs in but it's workable and I think it's better than
| CUDA.
|
| CUDA is an execution ecosystem. The programming language
| for writing kernel code is "CUDA C++", which _is_ C++, plus
| a few builtins functions ... or maybe I'm misunderstanding
| this sentence.
| JonChesterfield wrote:
| GPU offloading languages - cuda, openmp etc - work
| something like:
|
| 1. Split the single source into host parts and gpu parts
|
| 2. Optionally mark up some parts as "kernels", i.e. have
| entry points
|
| 3. Compile them separately, maybe for many architectures
|
| 4. Emit a bunch of metadata for how they're related
|
| 5. Embed the GPU code in marked up sections of the host
| executable
|
| 6. Embed some startup code to find GPUs into the x64
| parts
|
| 7. At runtime, go crawling around the elf section
| launching kernels
|
| This particular library (which happens to be libc) is
| written in C++, compiled with ffreestanding
| target=amdgpu, to LLVM bitcode. If you build a test, it
| compiles to an amdgpu elf file - no x64 code in it, no
| special metadata, no elf-in-elf structure. The entry
| point is called _start. There's a small "loader" program
| which initialises hsa (or cuda) and passes it the address
| of _start.
|
| I'm not convinced by the clever convenience cut-up-and-
| paste-together style embraced by cuda or openmp. This
| approach brings the lack of magic to the forefront. It
| also means we can add it to openmp etc when the reviews
| go through so users of that suddenly find fopen works.
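|
| In other words, building against it looks like building for any
| other freestanding target. A hedged sketch (the flags and -mcpu
| value in the comment are indicative, not copied from the docs):
|
|       /* main.c - plain C, no kernel annotations, no host code.
|        * Something along the lines of:
|        *   clang --target=amdgcn-amd-amdhsa -mcpu=gfx90a \
|        *         -ffreestanding -flto main.c -lc -o main.gpu
|        * then hand the resulting ELF to the small loader, which
|        * launches _start from the libc startup code. */
|       #include <stdio.h>
|
|       int main(int argc, char **argv) {
|         /* Ordinary libc calls; fputs goes through the RPC plumbing. */
|         fputs("hello from the GPU\n", stdout);
|         return 0;
|       }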
| einpoklum wrote:
| CUDA C++ _can_ work like that. But I would say that these
| are mostly kiddie wheels for convenience. And because, in
| GPU programming, performance is king, most (?) kernel
| developers are likely to eventually need to drop those
| wheels. And then:
|
| * No single source (although some headers might be
| shared)
|
| * Kernels are compiled and linked at runtime, for the
| platform you're on, but also, in the general case, with
| extra definitions not known apriori (and which are
| different for different inputs / over the course of
| running your program), and which have massive effect on
| the code.
|
| * You may or may not use some kind of compiled kernel
| caching mechanism, but you certainly don't have all
| possible combinations of targets and definitions
| available, since that would be millions of compiled
| kernels.
|
| It should also be mentioned that OpenCL never included
| the kiddie wheels to begin with; although I have to admit
| it makes it less convenient to start working with.
| amelius wrote:
| I wonder what C would look like if CPUs would evolve into what
| GPUs are today.
|
| What if a CPU had assembly instructions for everything a GPU
| can do? Would compiler/language designers support them?
| gpderetta wrote:
| there is no reason to wonder: https://ispc.github.io/
| quotemstr wrote:
| I've never understood why people say you "can't" do this or
| that on GPU. A GPU is made of SMs, and each SM is just a CPU
| with very wide SIMD pipes and very good hyperthreading. You can
| take one thread of a warp in a SM and do exactly the same
| things a CPU would do. Would you get 1/32 potential
| performance? Sure. But so what? Years ago, we did plenty of
| useful work with less than 1/32 of a modest CPU, and we can
| again.
|
| One of the more annoying parts of the Nvidia experience is PTX.
| I know perfectly well that your CPU/SM/whatever has a
| program counter. Let me manipulate it directly!
| bee_rider wrote:
| I think people are just saying it isn't very cost effective
| to use a whole GPU as 1/32 of a modest (modern?) CPU.
| quotemstr wrote:
| Cars are slow and inefficient in reverse gear but a car
| that couldn't drive in reverse would be broken.
| bee_rider wrote:
| You put "can't" in quotes, so I guess you are quoting
| somebody, but I don't see where the quote is from, so I'm
| not sure what they actually meant.
|
| But I suspect they are using "can't" informally. Like:
| You can't run a drag race in reverse. Ok, you technically
| could, but it would be a silly thing to do.
| imtringued wrote:
| Why would that matter? Adding extra CPU cores to the GPU
| sounds like a much dumber idea to me. You'd be wasting
| silicon on something that is meant to be used infrequently.
|
| Is it really that difficult to imagine 1% of your workload
| doing some sort of management code, except since it is
| running on your GPU, you can now freely intersperse those
| 1% every 50 microseconds without too much overhead?
| Archit3ch wrote:
| No direct Metal target.
| almostgotcaught wrote:
| this is an LLVM project... you want this to work on Metal, ask
| apple to add a Metal backend to LLVM
|
| https://github.com/llvm/llvm-project/tree/main/llvm/lib/Targ...
| rbanffy wrote:
| I am surprised there isn't.
| JonChesterfield wrote:
| No Intel either. The port would be easy - gpuintrin.h abstracts
| over the intrinsics: provide an implementation for those, and
| write a loader in terms of opencl or whatever if you want to run
| the test suite.
|
| The protocol needs ordered load/store on shared memory but
| nothing else. I wrote a paper trying to make it clear that
| load/store on shmem was sufficient which doesn't seem to be
| considered persuasive. It's specifically designed to tolerate
| architectures doing sloppy things with cache invalidation. It
| could run much faster with fetch_or / fetch_and instructions
| (as APUs have, but PCIe does not). It could also hang off DMA
| but that isn't implemented (I want to have the GPU push packets
| over the network without involving the x64 CPU at all).
| gdiamos wrote:
| Nice to see this finally after 15 years.
| amelius wrote:
| I hate libc. It's such a common cause of versioning problems on
| my system.
|
| Can't they just stop making new versions of it?
| wbl wrote:
| Dynamic linking has some benefits, but many detractors.
| rbanffy wrote:
| It's nice not having to recompile all your software because
| of a CVE impacting libc, or any other fundamental component
| of the system.
| adrian_b wrote:
| In decades of experiencing various computing environments,
| the amount of time wasted to fix applications broken by
| updates to dynamically-linked libraries has dwarfed by a
| few orders of magnitude the amount of time saved by not
| recompiling statically-linked applications because of a new
| library version that fixes security problems.
| rbanffy wrote:
| What scares me about the need to recompile statically
| linked binaries is that the problem is invisible until a
| bug hits. You don't know the statically linked library is
| vulnerable unless you keep track of all versions that
| went into that binary and almost no organization does
| that.
|
| DLL problems are very easy to see and very obvious when
| they happen. But it's been a long while since I last saw
| one.
| adrian_b wrote:
| If some people like dynamically-linked applications, let them
| have such applications.
|
| What annoys me greatly is the existence of unavoidable
| libraries that have been designed intentionally to not work
| when linked statically, which make the life difficult for
| those who prefer to use only statically-linked applications.
| The main offending library is glibc, but there are too many
| other such cases.
| fuhsnn wrote:
| I wonder how the GPU is going to access an unknown-size NULL-
| terminated string in system RAM; the strchr() source looks like
| normal C++. In my minimal Vulkan GPGPU experience the data needs
| to be bound to VkDeviceMemory to be accessible through the PCI
| bus with a compute shader. Is the LLVM libc runtime doing similar
| set-ups in the background, and if so, is it faster than glibc's
| hand-tuned AVX implementation?
| JonChesterfield wrote:
| This is libc running on the GPU, not a libc spanning CPU and
| GPU. The primary obstruction to doing that is persuading people
| to let go of glibc. The spec of "host runs antique glibc, GPU
| runs some other thing that interops transparently with glibc"
| is a nightmare of hacks and tragedy.
|
| What would be relatively easy to put together is llvm libc
| running on the host x64 and also llvm libc running on the GPU.
| There's then the option to do things like malloc() on the GPU
| and free() the same pointer on the CPU. Making it genuinely
| seamless also involves persuading people to change what
| function pointers are, do some work on the ABI, and preferably
| move to APUs because PCIe is unhelpful.
|
| There's an uphill battle to bring people along on the journey
| of "program the things differently". For example, here's a
| thread trying to drum up enthusiasm for making function
| pointers into integers as that makes passing them between
| heterogenous architectures far easier
| https://discourse.llvm.org/t/rfc-function-pointers-as-
| intege....
| fuhsnn wrote:
| >making function pointers into integers as that makes passing
| them between heterogenous architectures
|
| This is interesting, though function pointers have long been
| expected to be addresses in the binary. C-brained people like
| me would probably adapt more easily to the concept of "pointer
| to a heterogeneous lambda object" or "shared id across
| heterogeneous runtimes".
| JonChesterfield wrote:
| Yeah, I might need to take another angle at the branding /
| sales part, possibly with a prototype in hand.
|
| I was hopeful that wasm was sufficient prior art. Whether
| function pointers are absolute addresses, or relative to
| the start of text, or resolved through trampolines filled
| in by the loader, or are offsets into some table is all
| (nominally) invisible to the language.
|
| Integer in [0, k) has the nice feature that multiple GPUs
| can each index into their own lookup table containing
| pointers to the local code section. Or for calling into
| another machine - it's essentially a cheap serialisation /
| perfect hash done at linker time. It makes indirect calls
| slower, but indirect calls are slow anyway so hopefully
| saleable.
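|
| A toy version of the integer-in-[0, k) idea, just for the
| flavour of it (illustrative only; in practice the linker would
| generate the table):
|
|       #include <stdint.h>
|       #include <stdio.h>
|
|       typedef int (*binop_t)(int, int);
|
|       static int add(int a, int b) { return a + b; }
|       static int mul(int a, int b) { return a * b; }
|
|       /* Every architecture compiles its own table in the same
|        * order, so index 1 means "mul" on the CPU build and on
|        * each GPU build alike - the handle travels, not the
|        * address. */
|       static const binop_t local_table[] = { add, mul };
|
|       static int call_indirect(uint32_t handle, int a, int b) {
|         return local_table[handle](a, b);   /* slower, but portable */
|       }
|
|       int main(void) {
|         uint32_t handle = 1;  /* could have arrived from another device */
|         printf("%d\n", call_indirect(handle, 6, 7));   /* 42 */
|         return 0;
|       }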
| int_19h wrote:
| One could argue that the MMU is a prior art of sorts. We
| haven't really been dealing with direct physical memory
| addresses on most platforms for a very long time.
| yjftsjthsd-h wrote:
| If the host has an antique glibc, surely it has an antique
| llvm libc? Does llvm just have a more stable ABI or
| something?
| einpoklum wrote:
| I am pretty sure this is just a gimmick. I would not call libc
| code in a GPU kernel. It would mean dragging in a whole bunch of
| stuff I don't want, and can't control or pick-and-choose. That
| makes sense for regular processes on the CPU; it does _not_ make
| sense in code you run millions of times in GPU threads.
|
| I see people saying they've "dreamed" of this or have waited so
| long for this to happen... well, my friends, you should not have;
| and I'm afraid you're in for a disappointment.
| JonChesterfield wrote:
| It uses a few MB of contiguous shared memory and a host thread
| periodically calling a function. Unless you only want
| sprintf or similar in which case neither is needed. The unused
| code deadstrips pretty well. Won't help compilation time.
| Generally you don't want libc calls in numerical kernels doing
| useful stuff - the most common request was for printf as a
| debugging crutch, I mostly wanted mmap.
| einpoklum wrote:
| > It uses a few MB of contiguous shared memory and a host
| thread periodically calling a function.
|
| Well, if someone wants to shoot themselves in the head, then
| by all means...
|
| > Unless you only want sprintf or similar in which case
| neither is needed. ... Generally you don't want libc calls
| in numerical kernels doing useful stuff - the most common
| request was for printf as a debugging crutch
|
| I have actually adapted a library for that particular case:
|
| https://github.com/eyalroz/printf/
|
| I started with a standalone printf-family implementation
| targeting embedded devices, and (among other things) adapted
| it for use also with CUDA.
|
| > I mostly wanted mmap.
|
| Does it really make sense to make a gazillion mmap calls from
| the threads of your GPU kernel? I mean, is it really not
| always better to mmap on the CPU side? At most, I might do it
| asynchronously using a CUDA callback or some other mechanism.
| But I will admit I've not had that use-case.
| imtringued wrote:
| I'm not sure what disappointment you're predicting. Unless your
| GPU is connected through a cache coherent protocol like CXL to
| your CPU, you are unlikely to make your code run faster by
| transferring data back to the CPU and back again to the GPU.
| You have 128 compute units on the 4090; even at a lower
| frequency and with higher memory latency, you will probably not
| end up too far away from the performance of an 8-core CPU running
| at 4.5GHz. Nobody is running millions of CPU threads in the
| first place, so you seem to be completely misunderstanding the
| workload here. Nobody wants to speed up their CPU code by
| running it on the GPU, they want to stop slowing down their GPU
| code by waiting for data transfers to and from the CPU.
| jhuber6 wrote:
| Very cool to see my project posted here!
|
| The motivation behind a lot of this was to have community LLVM
| implementations of runtime functions normally provided by the
| vendor libraries (C math, printf, malloc), but if you implement
| one function you may as well implement them all.
|
| Realistically, the infrastructure behind this project is more
| relevant than the C library calls themselves. The examples in the
| linked documentation can be used for any arbitrary C/C++ just as
| well as the LLVM C library; it's simply static linking. This
| is what allowed me to compile and run more complicated things
| like libc++ and DOOM on the GPU as well. The RPC interface can
| also be used to implement custom host services from the GPU, or
| used to communicate between any two shared memory processes.
| noxa wrote:
| Just wanted to say thanks for pushing on this front! I'm not
| using the libc portion but the improvements to clang/llvm that
| allow this to work have been incredible. When I was looking a
| few months back the only options that felt practical for
| writing large amounts of device code were cuda/hip or opencl
| and a friend suggested I just try C _and it worked_. Definitely
| made my "most practical/coolest that it actually works" list
| for 2024 :)
| selimnairb wrote:
| It seems like unified memory has to be the goal. This all just
| feels like a kludgy workaround until that happens (kind of like
| segmented memory in the 16-bit era).
| LorenDB wrote:
| Is unified memory practical for a "normal" desktop/server
| configuration though? Apple has been doing unified memory, but
| they also have the GPU on the CPU die. I would be interested to
| know if a discrete GPU plugged into a PCIe slot would have
| enough latency to make unified memory impractical.
| selimnairb wrote:
| It's clearly not practical now, but that doesn't mean it
| won't be at some point.
| bsder wrote:
| Why is RBAR insufficient? It's been pretty much supported for
| at least 5 years now.
| gavinsyancey wrote:
| GPUs benefit from extremely high memory bandwidth. RBAR
| helps, but it's a lot worse than a fat bus to a bunch of on-
| card GDDR6X.
|
| A PCIe 4.0x16 link gives 32 GB/s bandwidth; an RTX 4090 has
| over 1 TB/s bandwidth to its on-card memory.
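|
| For the arithmetic behind those figures (a rough check; the
| encoding and bus-width numbers are the commonly quoted ones):
|
|       #include <stdio.h>
|
|       int main(void) {
|         /* PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 16 lanes. */
|         double pcie4_x16 = 16e9 * (128.0 / 130.0) / 8.0 * 16 / 1e9;
|
|         /* RTX 4090: 384-bit GDDR6X bus at 21 Gbit/s per pin. */
|         double rtx4090 = 384.0 / 8.0 * 21e9 / 1e9;
|
|         printf("PCIe 4.0 x16 ~ %.1f GB/s, 4090 VRAM ~ %.0f GB/s\n",
|                pcie4_x16, rtx4090);   /* ~31.5 vs ~1008 */
|         return 0;
|       }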
| Hydrocarb0n wrote:
| This is over my head; will it have any application or performance
| improvements for gpu passthrough in VMs? Or games, or make
| hashcat faster?
| saagarjha wrote:
| No. This is mostly a toy rather than a production thing.
| LegNeato wrote:
| If you are interested in this, you might be interested in Rust
| GPU: https://rust-gpu.github.io/
___________________________________________________________________
(page generated 2024-12-15 23:02 UTC)