[HN Gopher] Using Libc for GPUs
       ___________________________________________________________________
        
       Using Libc for GPUs
        
       Author : hochmartinez
       Score  : 182 points
       Date   : 2024-12-11 15:29 UTC (4 days ago)
        
 (HTM) web link (libc.llvm.org)
 (TXT) w3m dump (libc.llvm.org)
        
       | almostgotcaught wrote:
       | this was presented at llvm this year
       | https://www.youtube.com/watch?v=4TxGWis1mws - it was a nice talk.
        
       | kelsey98765431 wrote:
        | Running normal C directly on the GPU has been the dream for a
        | long time. This looks excellent.
        
         | C-programmer wrote:
         | Genuinely, why?
         | 
         | - For new code, all of the functions
         | [here](https://libc.llvm.org/gpu/support.html#libc-gpu-support)
         | you can do without just fine.
         | 
          | - For old code:
          | 
          |   * Your project is large enough that you are likely using an
          |     unsupported libc function somewhere.
          | 
          |   * Your project is small enough that you would benefit from
          |     just implementing a new kernel yourself.
         | 
         | I am biased because I avoid the C standard library even on the
         | CPU, but this seems like a technology that raises the floor not
         | the ceiling of what is possible.
        
           | tredre3 wrote:
           | > this seems like a technology that raises the floor not the
           | ceiling of what is possible.
           | 
           | In your view, how is making GPU programming easier a bad
           | thing?
        
             | convolvatron wrote:
             | that's clearly not a bad thing. however encouraging people
             | to run mutating, procedural code with explicit loops and
             | aliasing maybe isn't the right path to get there.
             | particularly if you just drag forward all the weird old
             | baggage with libc and its horrible string conventions.
             | 
             | I think any programming environment that treats a gpu as a
             | really slow serial cpu isn't really what you want(?)
        
               | quotemstr wrote:
               | What if it encourages people to write parallel and
               | functional code on CPUs? That'd be a good thing.
               | Influence works both ways.
               | 
               | The bigger problem is that GPUs have various platform
               | features (shared memory, explicit cache residency and
               | invalidation management) that CPUs sadly don't yet. Sure,
               | you _could_ expose these facilities via compiler
                | intrinsics, but then you end up with code that might be
                | syntactically valid C but is alien both to CPUs and human
                | minds.
        
               | fc417fc802 wrote:
               | > is alien both to CPUs and human minds
               | 
               | On the contrary I would love that. The best case scenario
               | in my mind is being able to express the native paradigms
               | of all relevant platforms while writing a single piece of
               | code that can then be compiled for any number of backends
               | and dynamically retargeted between them at runtime. It
               | would make debugging and just about everything else SO
               | MUCH EASIER.
               | 
               | The equivalent of being able to compile some subset of
               | functions for both ARM and x86 and then being able to
               | dynamically dispatch to either version at runtime, except
               | replace ARM with a list of all the GPU ISAs that you care
               | about.
        
           | JonChesterfield wrote:
           | One thing this gives you is syscall on the gpu. Functions
           | like sprintf are just blobs of userspace code, but others
           | like fopen require support from the operating system (or
           | whatever else the hardware needs you to do). That plumbing
           | was decently annoying to write for the gpu.
           | 
           | These aren't gpu kernels. They're functions to call from
           | kernels.
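            | 
            | Roughly, something like this can sit in device code and be
            | called from inside a kernel; the formatting runs on the GPU
            | and the file operations are serviced by the host (a sketch,
            | not code from the project):
            | 
            |     #include <stdio.h>
            | 
            |     /* Plain C, compiled for the GPU, called from a kernel. */
            |     void log_result(double x) {
            |       FILE *f = fopen("results.txt", "a"); /* forwarded to host OS */
            |       if (f) {
            |         fprintf(f, "got %g\n", x); /* formatting on the GPU */
            |         fclose(f);
            |       }
            |     }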
        
             | almostgotcaught wrote:
             | > One thing this gives you is syscall on the gpu
             | 
             | i wish people in our industry would stop (forever,
             | completely, absolutely) using metaphors/allusions. it's a
             | complete disservice to anyone that isn't in on the trick.
             | it doesn't give you syscalls. that's impossible because
             | there's no sys/os on a gpu and your actual os does not
             | (necessarily) have any way to peer into the address
              | space/scheduler/etc of a gpu core.
              | 
              | what it gives you is something that's working really really
              | hard to pretend to be a syscall:
             | 
             | > Traditionally, the C library abstracts over several
             | functions that interface with the platform's operating
             | system through system calls. The GPU however does not
             | provide an operating system that can handle target
             | dependent operations. Instead, we implemented remote
             | procedure calls to interface with the host's operating
             | system while executing on a GPU.
             | 
             | https://libc.llvm.org/gpu/rpc.html.
        
               | quotemstr wrote:
               | It's a matter of perspective. If you think of the GPU as
               | a separate computer, you're right. If you think of it as
               | a coprocessor, then the use of RPC is just an
               | implementation detail of the system call mechanism, not a
               | semantically different thing.
               | 
               | When an old school 486SX delegates a floating point
               | instruction to a physically separate 487DX coprocessor,
               | is it executing an instruction or doing an RPC? If RPC,
               | does the same instruction start being a real instruction
               | when you replace your 486SX with a 486DX, with an
               | integrated GPU? The program can't tell the difference!
        
               | almostgotcaught wrote:
               | > It's a matter of perspective. If you think of the GPU
               | as a separate computer, you're right.
               | 
               | this perspective is a function of exactly one thing: do
               | you care about the performance of your program? if not
               | then sure indulge in whatever abstract perspective you
               | want ("it's magic, i just press buttons and the lights
               | blink"). but if you don't care about perf then why are
               | you using a GPU at all...? so for people that aren't just
               | randomly running code on a GPU (for shits and giggles),
               | the distinction is very significant between "syscall" and
               | syscall.
               | 
               | people who say these things don't program GPUs for a
               | living. there are no abstractions unless you don't care
               | about your program's performance (in which case why are
               | you using a GPU at all).
        
               | quotemstr wrote:
               | Not everything in every program is performance critical.
               | A pattern I've noticed repeatedly among CUDAheads is the
               | idea that "every cycle matters" and therefore we should
               | uglify and optimize even cold parts of our CUDA programs.
               | That's as much BS on GPU as it is on CPU. In CPU land, we
               | moved past this sophomoric attitude decades ago. The GPU
               | world might catch up one day.
               | 
               | Are you planning on putting fopen() in an inner loop or
               | something? LOL
        
               | nickysielicki wrote:
               | genuinely asking: where else should ML engineers focus
               | their time, if not on looking at datapath bottlenecks in
               | either kernel execution or the networking stack?
        
               | lmm wrote:
               | The point is that you should focus on the bottlenecks,
               | not on making every random piece of code "as fast as
               | possible". And that sometimes other things
               | (maintainability, comprehensibility, debuggability) are
               | more important than maximum possible performance, even on
               | the GPU.
        
               | nickysielicki wrote:
               | That's fair, but I didn't understand OP to be claiming
               | above that "cudaheads" aren't looking at their
               | performance bottlenecks before driving work, just that
               | they're looking at the problem incorrectly (and eg: maybe
               | should prioritize redesigns over squeezing perf out of
               | flawed approaches.)
        
               | almostgotcaught wrote:
               | > A pattern I've noticed repeatedly among CUDAheads is
               | the idea that "every cycle matters" and therefore we
               | should uglify and optimize even cold parts of our CUDA
               | programs
               | 
               | I don't know what a "cudahead" is but if you're gonna
               | build up a strawman just to chop it down have at it.
               | Doesn't change anything about my point - these aren't
               | syscalls because there's no sys. I mean the dev here
               | literally spells it out correctly so I don't understand
               | why there's any debate.
        
               | oivey wrote:
               | The whole reason CUDA/GPUs are fast is that they
               | explicitly don't match the architecture of CPUs. The
               | truly sophomoric attitude is that all compute devices
               | should work like CPUs. The point of CUDA/GPUs is to
               | provide a different set of abstractions than CPUs that
               | enable much higher performance for certain problems.
               | Forcing your GPU to execute CPU-like code is a bad
               | abstraction.
               | 
               | Your comment about putting fopen in an inner loop really
               | betrays that. Every thread in your GPU kernel is going to
               | have to wait for your libc call. You're really confused
               | if you're talking about hot loops in a GPU kernel.
        
               | saagarjha wrote:
               | > A pattern I've noticed repeatedly among CUDAheads is
               | the idea that "every cycle matters" and therefore we
               | should uglify and optimize even cold parts of our CUDA
               | programs.
               | 
               | You're talking to the wrong people; this is definitely
               | not true in general.
        
               | JonChesterfield wrote:
               | The "proper syscall" isn't a fast thing either. The
               | context switch blows out your caches. Part of why I like
               | the name syscall is it's an indication to not put it on
               | the fast path.
               | 
               | The implementation behind this puts a lot of emphasis on
                | performance, though the protocol was heavily simplified in
               | upstreaming. Running on pcie instead of the APU systems
               | makes things rather laggy too. Design is roughly a mashup
               | of io_uring and occam, made much more annoying by the GPU
               | scheduler constraints.
               | 
               | The two authors of this thing probably count as people
               | who program GPUs for a living for what it's worth.
        
               | saati wrote:
               | A 486SX never delegates floating point instructions, the
               | 487 is a full 486DX that disables the SX and fully takes
               | over, you are thinking of 386 and older.
        
               | JonChesterfield wrote:
               | Well, I called it syscall because it's a void function of
               | 8 u64 arguments which your code stumbles into, gets
               | suspended, then restored with new values for those
               | integers. That it's a function instead of an instruction
               | doesn't change the semantics. My favourite of the uses of
               | that is to pass six of those integers to the x64 syscall
               | operation.
               | 
               | This isn't misnaming. It's a branch into a trampoline
               | that messes about with shared memory to give the effect
               | of the x64 syscall you wanted, or some other thing that
               | you'd rather do on the cpu.
               | 
               | There's a gpu thing called trap which is closer in
               | behaviour to what you're thinking of but it's really
               | annoying to work with.
               | 
               | Side note, RPC has a terrible rep for introducing failure
               | modes into APIs, but that's completely missing here
               | because pcie either works or your machine is gonna have
               | to reboot. There are no errors on the interface that can
               | be handled by the application.
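                | 
                | As a rough sketch (names invented, not the actual rpc.h
                | interface): the GPU fills a small fixed-size packet, and
                | the CPU thread that picks it up can forward six of the
                | integers straight to the kernel:
                | 
                |     #include <stdint.h>
                |     #include <unistd.h> /* syscall() */
                | 
                |     /* Hypothetical packet: an opcode plus 8 u64 slots. */
                |     struct gpu_call { uint64_t op; uint64_t arg[8]; };
                | 
                |     /* Host-side handler for the "do an x64 syscall" case. */
                |     void handle(struct gpu_call *c) {
                |       c->arg[0] = (uint64_t)syscall((long)c->op,
                |                       c->arg[0], c->arg[1], c->arg[2],
                |                       c->arg[3], c->arg[4], c->arg[5]);
                |     }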
        
               | almostgotcaught wrote:
               | > Well, I called it syscall because it's a void function
               | of 8 u64 arguments which your code stumbles into, gets
               | suspended, then restored with new values for those
               | integers
               | 
                | I'll put it really simply: is there a difference (in perf,
                | semantics, whatever) between using these "syscalls" to
               | implement fopen on GPU and using a syscall to implement
               | fopen on CPU? Note that's a rhetorical question because
               | we both already know that the answer is yes. So again
                | you're just playing sleight of hand in calling them
                | syscalls and I'll emphasize: this is a sleight of hand
               | that the dev himself doesn't play (so why would I take
               | your word over his).
        
               | JonChesterfield wrote:
               | Wonderfully you don't need to trust my words, you've got
               | my code :)
               | 
               | If semantics are different, that's a bug/todo. It'll have
               | worse latency than a CPU thread making the same kernel
               | request. Throughput shouldn't be way off. The GPU writes
               | some integers to memory that the CPU will need to read,
               | and then write other integers, and then load those again.
               | Plus whatever the x64 syscall itself does. That's a bunch
               | of cache line invalidation and reads. It's not as fast as
               | if the hardware guys were on board with the strategy but
               | I'm optimistic it can be useful today and thus help
               | justify changing the hardware/driver stack.
               | 
               | The whole point of libc is to paper over the syscall
               | interface. If you start from musl, "syscall" can be a
               | table of function pointers or asm. Glibc is more
               | obstructive. This libc open codes a bunch of things, with
               | a rpc.h file dealing with synchronising memcpy of
               | arguments to/from threads running on the CPU which get to
               | call into the Linux kernel directly. It's mainly
               | carefully placed atomic operations to keep the data
               | accesses well defined.
               | 
               | There's also nothing in here which random GPU devs can't
               | build themselves. The header files are (now) self
               | contained if people would like to use the same mechanism
               | for other functionality and don't want to handroll the
               | data structure. The most subtle part is getting this to
               | work correctly under arbitrary warp divergence on volta.
               | It should be an out of the box thing under openmp early
               | next year too.
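                | 
                | The basic shape, for anyone handrolling it, is a mailbox
                | both sides can see plus acquire/release ordering on the
                | flags. Simplified sketch, not the upstream protocol:
                | 
                |     #include <stdatomic.h>
                |     #include <stdint.h>
                | 
                |     struct mailbox {
                |       _Atomic uint32_t ready; /* GPU sets after writing payload */
                |       _Atomic uint32_t done;  /* CPU sets after writing result  */
                |       uint64_t payload[8];
                |     };
                | 
                |     /* Device side: publish a request, wait for service. */
                |     void gpu_call(struct mailbox *m) {
                |       /* ... fill m->payload ... */
                |       atomic_store_explicit(&m->ready, 1, memory_order_release);
                |       while (!atomic_load_explicit(&m->done, memory_order_acquire))
                |         ; /* payload[] now holds the result */
                |     }
                | 
                |     /* Host side: poll for requests and service them. */
                |     void cpu_serve(struct mailbox *m) {
                |       while (!atomic_load_explicit(&m->ready, memory_order_acquire))
                |         ;
                |       /* ... act on m->payload, e.g. do the write() ... */
                |       atomic_store_explicit(&m->done, 1, memory_order_release);
                |     }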
        
               | almostgotcaught wrote:
               | > Wonderfully you don't need to trust my words, you've
               | got my code :)
               | 
               | My friend it's so incredibly bold of you to claim credit
               | for this work when
               | 
               | 1. Joe presented it
               | 
               | 2. Joe's name is the only name on the git blame
               | 
                | 3. _I know Joe_ and _I know_ he did the lion's share of
               | the work
               | 
               | And so I'll repeat: Joe himself calls it rpc so I'm gonna
               | keep calling it rpc and not syscall.
        
               | jhuber6 wrote:
               | The RPC implementation in LLVM is an adaptation of Jon's
               | original state machine (see
               | https://github.com/JonChesterfield/hostrpc). It looks
               | very different at this point, but we collaborated on the
               | initial design before I fleshed out everything else.
               | Syscall or not is a bit of a semantic argument, but I
               | lean more towards syscall 'inspired'.
        
               | JonChesterfield wrote:
               | Here's the algorithm
               | https://doi.org/10.1145/3458744.3473357. My paper with
               | Joseph on the implementation is at
               | https://doi.org/10.1007/978-3-031-40744-4_15.
               | 
               | The syscall layer this runs on was written at
               | https://github.com/JonChesterfield/hostrpc, 800 commits
               | from May 2020 until Jan 2023. I deliberately wrote that
               | in the open, false paths and mistakes and all. Took ages
               | for a variety of reasons, not least that this was my side
               | project.
               | 
               | You'll find the upstream of that scattered across the
               | commits to libc, mostly authored by Joseph (log shows 300
               | for him, of which I reviewed 40, and 25 for me). You
               | won't find the phone calls and offline design
               | discussions. You can find the tricky volta solution at
               | https://reviews.llvm.org/D159276 and the initial patch to
               | llvm at https://reviews.llvm.org/D145913.
               | 
               | GPU libc is definitely Joseph's baby, not mine, and this
               | wouldn't be in trunk if he hadn't stubbornly fought
               | through the headwinds to get it there. I'm excited to see
               | it generating some discussion on here.
               | 
               | But yeah, I'd say the syscall implementation we're
               | discussing here has my name adequately written on it to
               | describe it as "my code".
        
               | rowanG077 wrote:
               | Why does a perf difference factor into it? There is no
               | requirement for a syscall to be this fast or else it
               | isn't a syscall. If you have a hot loop you shouldn't be
               | putting a syscall in it, not even on the CPU.
        
               | saagarjha wrote:
               | It's not a syscall any more than a call into a Wine thunk
               | is a syscall. Sure, it implements and emulates a syscall.
               | But it's not a syscall.
        
           | nickysielicki wrote:
           | https://developer.nvidia.com/blog/simplifying-gpu-
           | applicatio...
        
           | JonChesterfield wrote:
           | > Genuinely, why?
           | 
           | > ... this seems like a technology that raises the floor not
           | the ceiling of what is possible.
           | 
           | The root cause reason for this project existing is to show
           | that GPU programming is not synonymous with CUDA (or the
           | other offloading languages).
           | 
           | It's nominally to help people run existing code on GPUs.
           | Disregarding that use case, it shows that GPUs can actually
           | do things like fprintf or open sockets. This is obvious to
           | the implementation but seems largely missed by application
           | developers. Lots of people think GPUs can only do floating
           | point math.
           | 
           | Especially on an APU, where the GPU units and the CPU cores
           | can hammer on the same memory, it is a travesty to persist
           | with the "offloading to accelerator" model. Raw C++ isn't an
           | especially sensible language to program GPUs in but it's
           | workable and I think it's better than CUDA.
        
             | rbanffy wrote:
             | > Lots of people think GPUs can only do floating point
             | math.
             | 
             | IIRC, every Raspberry Pi is brought up by the GPU setting
             | up the system before the CPU is brought out of reset and
             | the bootloader looks for the OS.
             | 
             | > it is a travesty to persist with the "offloading to
             | accelerator" model.
             | 
             | Operating systems would need to support heterogeneous
             | processors running programs with different ISAs accessing
              | the same pools of memory. I'd _LOVE_ to see that. It'd be
             | extremely convenient to have first-class processes running
             | on the GPU MIMD cores.
             | 
             | I'm not sure there is much research done in that space. I
             | believe IBM mainframe OSs have something like that because
             | programmers are exposed to the various hardware assists
             | that run as coprocessors sharing the main memory with the
             | OS and applications.
        
               | als0 wrote:
               | > I'm not sure there is much research done in that space.
               | 
               | There is. And the finest example I can think of is
               | Barrelfish https://barrelfish.org
        
               | rbanffy wrote:
               | Interesting - it resembles a network of heterogeneous
               | systems that can share a memory space used primarily for
               | explicit data exchange. Not quite what I was imagining,
               | but probably much simpler to implement than a Unix where
               | the kernel can see processes running on different ISAs on
               | a shared memory space.
               | 
               | I guess hardware availability is an issue, as there
               | aren't many computers with, say, an ARM, a RISC-V, an
               | x86, and an AMD iGPU sharing a common memory pool.
               | 
               | OTOH, there are many where a 32-bit ARM shares the memory
               | pool with 64-bit cores. Usually the big cores run
               | applications while the small ARM does housekeeping or
                | other low-latency tasks.
        
               | als0 wrote:
               | > Not quite what I was imagining, but probably much
               | simpler to implement than a Unix where the kernel can see
               | processes running on different ISAs on a shared memory
               | space.
               | 
               | Indeed. The other argument is that treating the computer
               | as a distributed system can make it scale better to say
               | hundreds of cores compared to a lock-based SMP system.
        
               | rbanffy wrote:
               | > treating the computer as a distributed system
               | 
               | Sure, but where's the fun in that?
               | 
               | Up to GPGPUs, there was no reason to build a machine with
               | multiple CPUs of different architectures except running
               | different OSs on them (such as the Macs, Suns and Unisys
               | mainframes with x86 boards for running Windows side-by-
               | side with a more civilized OS). With GPGPUs you have
               | machines with a set of processors that are good on many
               | things, but not great at SIMD and one that's awesome at
               | SIMD, but sucks for most other things.
               | 
               | And, as I mentioned before, there are lots of ARM
               | machines with 64-bit and ultra-low-power 32-bit cores
               | sharing the same memory map. Also, even x86 variants with
               | different ISA extensions can be treated as different
               | architectures by the OS - Intel had to limit the fast
               | cores of its early asymmetric parts because the low-power
               | cores couldn't do AVX512 and OSs would not support
               | migrating a process to the right core on an invalid
               | instruction fault.
        
               | saagarjha wrote:
               | The problem is that GPUs are kind of bad at being
               | general-purpose, so it doesn't really make sense to
               | expose the hardware that way.
        
               | rbanffy wrote:
               | If the OS supports it, you can make programs that start
               | threads on CPUs and GPUs and let those communicate. You
               | run the SIMD-ish functions on the GPUs and the non-SIMD-
               | heavy functions on the CPU cores.
               | 
               | I have a strong suspicion GPUs aren't as bad at general-
               | purpose stuff as we perceive and we underutilize them
               | because it's inconvenient to shuttle data over an
               | architectural wall that's not really there in iGPUs.
               | 
               | Maybe it doesn't make sense, but it'd be worth looking
               | into just to know where the borders of the problem lie.
        
             | krackers wrote:
             | >Disregarding that use case, it shows that GPUs can
             | actually do things like fprintf or open sockets.
             | 
             | Can you elaborate on this? My mental model of GPU is
             | basically like a huge vector coprocessor. How would things
             | like printf or sockets work directly from the GPU when they
             | require syscalls to trap into the OS kernel? Given that the
             | kernel code is running on the CPU, that seems to imply that
             | there needs to be a handover at some point. Or conversely
             | even if there was unified memory and the GPU could directly
             | address memory-mapped peripherals, you'd basically need to
             | reimplement drivers wouldn't you?
        
               | JonChesterfield wrote:
               | It's mostly terminology and conventions. On the standard
               | system setup, the linux kernel running in a special
               | processor mode does these things. Linux userspace asks
               | the kernel to do stuff using syscall and memory which
               | both kernel and userspace can access. E.g. the io_uring
               | register followed by writing packets into the memory.
               | 
               | What the GPU has is read/write access to memory that the
               | CPU can also access. And network peripherals etc. You can
               | do things like alternately compare-and-swap on the same
               | page from x64 threads and amdgpu kernels and it works,
               | possibly not quickly on some systems. That's also all
               | that the x64 CPU threads have though, modulo the magic
               | syscall instruction to ask the kernel to do stuff.
               | 
               | People sometimes get quite cross at my claim that the GPU
               | can do fprintf. Cos _actually_ all it can do is write
               | numbers into shared memory or raise interrupts such that
               | the effect of fprintf is observed. But that 's also all
               | the userspace x64 threads do, and this is all libc
               | anyway, so I don't see what people are so cross about.
               | You're writing C, you call `fprintf(stderr, "Got to
               | L42\n");` or whatever, and you see the message on the
               | console.
               | 
               | If fprintf compiles into a load of varargs mangling with
               | a fwrite underneath, and the varargs stuff runs on the
               | GPU silicon and the fwrite goes through a staging buffer
               | before some kernel thread deals with it, that seems fine.
               | 
               | I'm pretty sure you could write to an nvme drive directly
               | from the gpu, no talking to the host kernel at all, at
               | which point you've arguably implemented (part of?) a
               | driver for it. You can definitely write to network cards
               | from them, without using any of this machinery.
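                | 
                | In spirit it's something like this (rpc_fwrite is a made
                | up name for "hand the bytes to the host", not the real
                | interface):
                | 
                |     #include <stdarg.h>
                |     #include <stdio.h>
                | 
                |     /* Hypothetical host-forwarding hook. */
                |     extern size_t rpc_fwrite(const void *p, size_t sz,
                |                              size_t n, FILE *f);
                | 
                |     int gpu_fprintf(FILE *f, const char *fmt, ...) {
                |       char buf[256];
                |       va_list ap;
                |       va_start(ap, fmt);
                |       int n = vsnprintf(buf, sizeof buf, fmt, ap); /* on GPU */
                |       va_end(ap);
                |       if (n > 0) /* staged through shared memory to the host */
                |         rpc_fwrite(buf, 1, (size_t)n, f);
                |       return n;
                |     }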
        
               | saagarjha wrote:
               | We don't actually allow a GPU to directly fprintf,
               | because GPU can't syscall. Only userspace can do that.
               | You can have userspace keep polling and then do it on
               | behalf of the GPU, but that's not the GPU doing it.
        
               | adrian_b wrote:
               | The GPU could do the equivalent of fprintf, if the
                | concerned peripherals used only memory-mapped I/O and the
               | IOMMU would be configured to allow the GPU to access
               | directly those peripherals, without any involvement from
               | the OS kernel that runs on the CPU.
               | 
               | This is the same as on the CPU, where the kernel can
               | allow a user process to access directly a peripheral,
               | without using system calls, by mapping that peripheral in
               | the memory space of the user process.
               | 
               | In both cases the peripheral must be assigned exclusively
               | to the GPU or the user process. What is lost by not using
               | system calls is the ability to share the peripheral
               | between multiple processes, but the performance for the
               | exclusive user of the peripheral can be considerably
               | increased. Of course, the complexity of the user process
               | or GPU code is also increased, because it must include
               | the equivalent of the kernel device driver for that
               | peripheral.
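                | 
                | The user-process version of that trick looks roughly like
                | this (the device path is just an example; error handling
                | omitted):
                | 
                |     #include <fcntl.h>
                |     #include <stdint.h>
                |     #include <sys/mman.h>
                | 
                |     /* Map a PCI BAR and poke registers directly; no
                |        syscall per access once the mapping exists. */
                |     volatile uint32_t *map_bar(void) {
                |       int fd = open(
                |           "/sys/bus/pci/devices/0000:01:00.0/resource0",
                |           O_RDWR | O_SYNC);
                |       return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                |                   MAP_SHARED, fd, 0);
                |     }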
        
               | jhuber6 wrote:
               | At some point I was looking into using io_uring for
               | something like this. The uring interface just works off
               | of `mmap()` memory, which can be registered with the
               | GPU's MMU. There's a submission polling setting, which
               | means that the GPU can simply write to the pointer and
               | the kernel will eventually pick up the write syscall
               | associated with it. That would allow you to use
               | `snprintf` locally into a buffer and then block on its
               | completion. The issue is that the kernel thread goes to
               | sleep after some time, so you'd still need a syscall from
               | the GPU to wake it up. AMD GPUs actually support software
               | level interrupts which could be routed to a syscall, but
               | I didn't venture too deep down that rabbit hole.
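                | 
                | For reference, the CPU-side half of that experiment is the
                | standard liburing SQPOLL setup (a sketch; the GPU-MMU
                | registration part is what's hand-waved above):
                | 
                |     #include <liburing.h>
                | 
                |     struct io_uring ring;
                | 
                |     void setup(void) {
                |       struct io_uring_params p = {0};
                |       p.flags = IORING_SETUP_SQPOLL; /* kernel thread polls SQ */
                |       p.sq_thread_idle = 2000;       /* ms before it sleeps   */
                |       io_uring_queue_init_params(64, &ring, &p);
                |       /* The mmap()ed SQ/CQ rings are what could, in principle,
                |          be registered with the GPU so a kernel fills SQEs. */
                |     }
                | 
                |     void queue_write(int fd, const void *buf, unsigned len) {
                |       struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                |       io_uring_prep_write(sqe, fd, buf, len, 0);
                |       io_uring_submit(&ring); /* mostly just bumps the SQ tail */
                |     }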
        
             | einpoklum wrote:
             | > The root cause reason for this project existing is to
              | > show that GPU programming is not synonymous with CUDA (or
              | > the other offloading languages).
             | 
             | 1. The ability to use a particular library does not reflect
             | much on which languages can be used.
             | 
              | 2. Once you have PTX as a backend target for a compiler,
             | obviously you can use all sorts of languages on the
             | frontend - which NVIDIA's drivers and libraries won't even
             | know about. Or you can just use PTX as your language -
             | making your point that GPU programming is not synonymous
             | with CUDA C++.
             | 
             | > It's nominally to help people run existing code on GPUs.
             | 
             | I'm worried you might be right. But - we should really not
             | encourage people to run existing CPU-side code on GPUs,
             | that's rarely (or maybe never?) a good idea.
             | 
             | > Raw C++ isn't an especially sensible language to program
              | > GPUs in but it's workable and I think it's better than
              | > CUDA.
             | 
             | CUDA is an execution ecosystem. The programming language
             | for writing kernel code is "CUDA C++", which _is_ C++, plus
             | a few builtins functions ... or maybe I'm misunderstanding
             | this sentence.
        
               | JonChesterfield wrote:
               | GPU offloading languages - cuda, openmp etc - work
               | something like:
               | 
               | 1. Split the single source into host parts and gpu parts
               | 
               | 2. Optionally mark up some parts as "kernels", i.e. have
               | entry points
               | 
               | 3. Compile them separately, maybe for many architectures
               | 
               | 4. Emit a bunch of metadata for how they're related
               | 
               | 5. Embed the GPU code in marked up sections of the host
               | executable
               | 
               | 6. Embed some startup code to find GPUs into the x64
               | parts
               | 
               | 7. At runtime, go crawling around the elf section
               | launching kernels
               | 
               | This particular library (which happens to be libc) is
               | written in C++, compiled with ffreestanding
               | target=amdgpu, to LLVM bitcode. If you build a test, it
               | compiles to an amdgpu elf file - no x64 code in it, no
               | special metadata, no elf-in-elf structure. The entry
               | point is called _start. There's a small "loader" program
               | which initialises hsa (or cuda) and passes it the address
               | of _start.
               | 
               | I'm not convinced by the clever convenience cut-up-and-
               | paste-together style embraced by cuda or openmp. This
               | approach brings the lack of magic to the forefront. It
               | also means we can add it to openmp etc when the reviews
               | go through so users of that suddenly find fopen works.
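                | 
                | For the nvptx case the loader is conceptually something
                | like this via the CUDA driver API (a sketch; the real one
                | also passes argv and runs the RPC server thread):
                | 
                |     #include <cuda.h>
                | 
                |     /* No elf-in-elf, no metadata: load the freestanding
                |        GPU executable and launch its _start. Error
                |        handling omitted. */
                |     int main(int argc, char **argv) {
                |       CUdevice dev; CUcontext ctx;
                |       CUmodule mod; CUfunction start;
                |       (void)argc;
                |       cuInit(0);
                |       cuDeviceGet(&dev, 0);
                |       cuCtxCreate(&ctx, 0, dev);
                |       cuModuleLoad(&mod, argv[1]);
                |       cuModuleGetFunction(&start, mod, "_start");
                |       cuLaunchKernel(start, 1, 1, 1, 1, 1, 1, 0,
                |                      NULL, NULL, NULL);
                |       cuCtxSynchronize();
                |       return 0;
                |     }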
        
               | einpoklum wrote:
               | CUDA C++ _can_ work like that. But I would say that these
               | are mostly kiddie wheels for convenience. And because, in
               | GPU programming, performance is king, most (?) kernel
               | developers are likely to eventually need to drop those
               | wheels. And then:
               | 
               | * No single source (although some headers might be
               | shared)
               | 
               | * Kernels are compiled and linked at runtime, for the
               | platform you're on, but also, in the general case, with
                | extra definitions not known a priori (and which are
               | different for different inputs / over the course of
               | running your program), and which have massive effect on
               | the code.
               | 
               | * You may or may not use some kind of compiled kernel
               | caching mechanism, but you certainly don't have all
               | possible combinations of targets and definitions
                | available, since that would be millions of compiled
               | kernels.
               | 
               | It should also be mentioned that OpenCL never included
               | the kiddie wheels to begin with; although I have to admit
               | it makes it less convenient to start working with.
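                | 
                | Concretely, on the NVIDIA side that workflow is the
                | NVRTC + driver API path, roughly (sketch; caching and
                | error handling omitted):
                | 
                |     #include <nvrtc.h>
                |     #include <stdlib.h>
                | 
                |     /* Compile kernel source at runtime with an
                |        input-dependent define, e.g. "-DTILE=128". */
                |     char *compile_to_ptx(const char *src, const char *def) {
                |       nvrtcProgram prog;
                |       nvrtcCreateProgram(&prog, src, "k.cu", 0, NULL, NULL);
                |       const char *opts[] = {
                |           "--gpu-architecture=compute_80", def };
                |       nvrtcCompileProgram(prog, 2, opts);
                |       size_t n;
                |       nvrtcGetPTXSize(prog, &n);
                |       char *ptx = malloc(n);
                |       nvrtcGetPTX(prog, ptx); /* load with cuModuleLoadData */
                |       nvrtcDestroyProgram(&prog);
                |       return ptx;
                |     }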
        
         | amelius wrote:
          | I wonder what C would look like if CPUs evolved into what
         | GPUs are today.
         | 
         | What if a CPU had assembly instructions for everything a GPU
         | can do? Would compiler/language designers support them?
        
           | gpderetta wrote:
           | there is no reason to wonder: https://ispc.github.io/
        
         | quotemstr wrote:
         | I've never understood why people say you "can't" do this or
         | that on GPU. A GPU is made of SMs, and each SM is just a CPU
         | with very wide SIMD pipes and very good hyperthreading. You can
         | take one thread of a warp in a SM and do exactly the same
         | things a CPU would do. Would you get 1/32 potential
         | performance? Sure. But so what? Years ago, we did plenty of
         | useful work with less than 1/32 of a modest CPU, and we can
         | again.
         | 
         | One of the more annoying parts of the Nvidia experience is PTX.
          | I know perfectly well that your CPU/SM/whatever has a
         | program counter. Let me manipulate it directly!
        
           | bee_rider wrote:
           | I think people are just saying it isn't very cost effective
           | to use a whole GPU as 1/32 of a modest (modern?) CPU.
        
             | quotemstr wrote:
             | Cars are slow and inefficient in reverse gear but a car
             | that couldn't drive in reverse would be broken.
        
               | bee_rider wrote:
               | You put "can't" in quotes, so I guess you are quoting
               | somebody, but I don't see where the quote is from, so I'm
               | not sure what they actually meant.
               | 
               | But I suspect they are using "can't" informally. Like:
               | You can't run a drag race in reverse. Ok, you technically
               | could, but it would be a silly thing to do.
        
             | imtringued wrote:
             | Why would that matter? Adding extra CPU cores to the GPU
             | sounds like a much dumber idea to me. You'd be wasting
             | silicon on something that is meant to be used infrequently.
             | 
             | Is it really that difficult to imagine 1% of your workload
             | doing some sort of management code, except since it is
             | running on your GPU, you can now freely intersperse those
             | 1% every 50 microseconds without too much overhead?
        
       | Archit3ch wrote:
       | No direct Metal target.
        
         | almostgotcaught wrote:
         | this is an LLVM project... you want this to work on Metal, ask
         | apple to add a Metal backend to LLVM
         | 
         | https://github.com/llvm/llvm-project/tree/main/llvm/lib/Targ...
        
           | rbanffy wrote:
           | I am surprised there isn't.
        
         | JonChesterfield wrote:
         | No Intel either. The port would be easy - gpuintrin.h abstracts
         | over the intrinsics, provide an implementation for those, write
         | a loader in terms of opencl or whatever if you want to run the
         | test suite.
         | 
         | The protocol needs ordered load/store on shared memory but
         | nothing else. I wrote a paper trying to make it clear that
         | load/store on shmem was sufficient which doesn't seem to be
         | considered persuasive. It's specifically designed to tolerate
          | architectures doing sloppy things with cache invalidation. It
         | could run much faster with fetch_or / fetch_and instructions
         | (as APUs have, but PCIe does not). It could also hang off DMA
         | but that isn't implemented (I want to have the GPU push packets
         | over the network without involving the x64 CPU at all).
        
       | gdiamos wrote:
       | Nice to see this finally after 15 years.
        
       | amelius wrote:
       | I hate libc. It's such a common cause of versioning problems on
       | my system.
       | 
       | Can't they just stop making new versions of it?
        
         | wbl wrote:
         | Dynamic linking has some benefits, but many detractors.
        
           | rbanffy wrote:
           | It's nice not having to recompile all your software because
           | of a CVE impacting libc, or any other fundamental component
           | of the system.
        
             | adrian_b wrote:
             | In decades of experiencing various computing environments,
             | the amount of time wasted to fix applications broken by
             | updates to dynamically-linked libraries has dwarfed by a
             | few orders of magnitude the amount of time saved by not
             | recompiling statically-linked applications because of a new
             | library version that fixes security problems.
        
               | rbanffy wrote:
               | What scares me about the need to recompile statically
               | linked binaries is that the problem is invisible until a
               | bug hits. You don't know the statically linked library is
               | vulnerable unless you keep track of all versions that
               | went into that binary and almost no organization does
               | that.
               | 
               | DLL problems are very easy to see and very obvious when
               | they happen. But it's been a long while since I last saw
               | one.
        
           | adrian_b wrote:
           | If some people like dynamically-linked applications, let them
           | have such applications.
           | 
           | What annoys me greatly is the existence of unavoidable
           | libraries that have been designed intentionally to not work
           | when linked statically, which make the life difficult for
           | those who prefer to use only statically-linked applications.
           | The main offending library is glibc, but there are too many
           | other such cases.
        
       | fuhsnn wrote:
        | I wonder how the GPU is going to access an unknown-size NULL-
        | terminated string in system RAM; the strchr() source looks like
        | normal C++. In my minimal Vulkan GPGPU experience the data needs
        | to be bound to VkDeviceMemory to be accessible through the PCI
        | bus with a compute shader. Is the LLVM libc runtime doing
        | similar set-up in the background, and if so, is it faster than
        | glibc's hand-tuned AVX implementation?
        
         | JonChesterfield wrote:
         | This is libc running on the GPU, not a libc spanning CPU and
         | GPU. The primary obstruction to doing that is persuading people
         | to let go of glibc. The spec of "host runs antique glibc, GPU
         | runs some other thing that interops transparently with glibc"
         | is a nightmare of hacks and tragedy.
         | 
         | What would be relatively easy to put together is llvm libc
         | running on the host x64 and also llvm libc running on the GPU.
         | There's then the option to do things like malloc() on the GPU
         | and free() the same pointer on the CPU. Making it genuinely
         | seamless also involves persuading people to change what
         | function pointers are, do some work on the ABI, and preferably
         | move to APUs because PCIe is unhelpful.
         | 
         | There's an uphill battle to bring people along on the journey
         | of "program the things differently". For example, here's a
         | thread trying to drum up enthusiasm for making function
         | pointers into integers as that makes passing them between
          | heterogeneous architectures far easier
         | https://discourse.llvm.org/t/rfc-function-pointers-as-
         | intege....
        
           | fuhsnn wrote:
           | >making function pointers into integers as that makes passing
            | them between heterogeneous architectures
           | 
            | This is interesting, though function pointers have long been
            | expected to be addresses in the binary. C-brained people like
            | me would probably adapt more easily to the concept of a
            | "pointer to a heterogeneous lambda object" or a "shared id
            | across heterogeneous runtimes".
        
             | JonChesterfield wrote:
             | Yeah, I might need to take another angle at the branding /
             | sales part, possibly with a prototype in hand.
             | 
             | I was hopeful that wasm was sufficient prior art. Whether
             | function pointers are absolute addresses, or relative to
             | the start of text, or resolved through trampolines filled
             | in by the loader, or are offsets into some table is all
             | (nominally) invisible to the language.
             | 
             | Integer in [0, k) has the nice feature that multiple GPUs
             | can each index into their own lookup table containing
             | pointers to the local code section. Or for calling into
             | another machine - it's essentially a cheap serialisation /
             | perfect hash done at linker time. It makes indirect calls
             | slower, but indirect calls are slow anyway so hopefully
             | saleable.
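              | 
              | In code terms it's roughly this (names invented):
              | 
              |     #include <stdint.h>
              | 
              |     typedef void (*fn_t)(void *);
              | 
              |     /* Each target (x64, each GPU) links its own table; the
              |        integer id is what crosses the architecture boundary. */
              |     extern fn_t fn_table[]; /* filled at link/load time */
              | 
              |     typedef uint32_t fn_id_t; /* "function pointer" in [0, k) */
              | 
              |     static inline void call_indirect(fn_id_t id, void *arg) {
              |       fn_table[id](arg); /* one extra load vs a raw pointer */
              |     }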
        
               | int_19h wrote:
               | One could argue that the MMU is a prior art of sorts. We
               | haven't really been dealing with direct physical memory
               | addresses on most platforms for a very long time.
        
           | yjftsjthsd-h wrote:
           | If the host has an antique glibc, surely it has an antique
           | llvm libc? Does llvm just have a more stable ABI or
           | something?
        
       | einpoklum wrote:
       | I am pretty sure this is just a gimmick. I would not call libc
       | code in a GPU kernel. It would mean dragging in a whole bunch of
        | stuff I don't want, and can't control or pick-and-choose. That
       | makes sense for regular processes on the CPU; it does _not_ make
       | sense in code you run millions of times in GPU threads.
       | 
       | I see people saying they've "dreamed" of this or have waited so
       | long for this to happen... well, my friends, you should not have;
       | and I'm afraid you're in for a disappointment.
        
         | JonChesterfield wrote:
         | It uses a few mb of contiguous shared memory and periodically
         | calling a function from a host thread. Unless you only want
         | sprintf or similar in which case neither is needed. The unused
         | code deadstrips pretty well. Won't help compilation time.
         | Generally you don't want libc calls in numerical kernels doing
         | useful stuff - the most common request was for printf as a
         | debugging crutch, I mostly wanted mmap.
        
           | einpoklum wrote:
           | > It uses a few mb of contiguous shared memory and
           | periodically calling a function from a host thread.
           | 
           | Well, if someone wants to shoot themselves in the head, then
           | by all means...
           | 
           | > Unless you only want sprintf or similar in which case
            | > neither is needed. ... Generally you don't want libc calls
            | > in numerical kernels doing useful stuff - the most common
            | > request was for printf as a debugging crutch
           | 
           | I have actually adapted a library for that particular case:
           | 
           | https://github.com/eyalroz/printf/
           | 
           | I started with a standalone printf-family implementation
            | targeting embedded devices, and (among other things) adapted
           | it for use also with CUDA.
           | 
           | > I mostly wanted mmap.
           | 
           | Does it really make sense to make a gazillion mmap calls from
           | the threads of your GPU kernel? I mean, is it really not
           | always better to mmap on the CPU side? At most, I might do it
           | asynchronously using a CUDA callback or some other mechanism.
           | But I will admit I've not had that use-case.
        
         | imtringued wrote:
         | I'm not sure what disappointment you're predicting. Unless your
         | GPU is connected through a cache coherent protocol like CXL to
         | your CPU, you are unlikely to make your code run faster by
         | transferring data back to the CPU and back again to the GPU.
         | You have 128 compute units on the 4090, even at a lower
         | frequency and higher memory latency, you will probably not end
         | up too far away from the performance of an 8 core CPU running
         | at 4.5GHz. Nobody is running millions of CPU threads in the
         | first place, so you seem to be completely misunderstanding the
         | workload here. Nobody wants to speed up their CPU code by
         | running it on the GPU, they want to stop slowing down their GPU
         | code by waiting for data transfers to and from the CPU.
        
       | jhuber6 wrote:
       | Very cool to see my project posted here!
       | 
       | The motivation behind a lot of this was to have community LLVM
       | implementations of runtime functions normally provided by the
       | vendor libraries (C math, printf, malloc), but if you implement
       | one function you may as well implement them all.
       | 
       | Realistically, the infrastructure behind this project is more
       | relevant than the C library calls themselves. The examples in the
       | linked documentation can be used for any arbitrary C/C++ just as
        | well as the LLVM C library; it's simply static linking. This
       | is what allowed me to compile and run more complicated things
       | like libc++ and DOOM on the GPU as well. The RPC interface can
       | also be used to implement custom host services from the GPU, or
       | used to communicate between any two shared memory processes.
        
         | noxa wrote:
         | Just wanted to say thanks for pushing on this front! I'm not
         | using the libc portion but the improvements to clang/llvm that
         | allow this to work have been incredible. When I was looking a
         | few months back the only options that felt practical for
         | writing large amounts of device code were cuda/hip or opencl
         | and a friend suggested I just try C _and it worked_. Definitely
         | made my "most practical/coolest that it actually works" list
         | for 2024 :)
        
       | selimnairb wrote:
       | It seems like unified memory has to be the goal. This all just
       | feels like a kludgy workaround until that happens (kind of like
       | segmented memory in the 16-bit era).
        
         | LorenDB wrote:
         | Is unified memory practical for a "normal" desktop/server
         | configuration though? Apple has been doing unified memory, but
         | they also have the GPU on the CPU die. I would be interested to
         | know if a discrete GPU plugged into a PCIe slot would have
         | enough latency to make unified memory impractical.
        
           | selimnairb wrote:
           | It's clearly not practical now, but that doesn't mean it
           | won't be at some point.
        
         | bsder wrote:
         | Why is RBAR insufficient? It's been pretty much supported for
         | at least 5 years now.
        
           | gavinsyancey wrote:
           | GPUs benefit from extremely high memory bandwidth. RBAR
           | helps, but it's a lot worse than a fat bus to a bunch of on-
           | card GDDR6X.
           | 
           | A PCIe 4.0x16 link gives 32 GB/s bandwidth; an RTX 4090 has
           | over 1 TB/s bandwidth to its on-card memory.
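            | 
            | (Roughly: PCIe 4.0 is 16 GT/s per lane with 128b/130b
            | encoding, so ~2 GB/s per lane and ~32 GB/s across 16 lanes
            | each way, while the 4090's 384-bit GDDR6X bus at 21 Gb/s per
            | pin works out to about 1 TB/s.)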
        
       | Hydrocarb0n wrote:
       | This is over my head, will it have any application or performance
       | improvements for gpu passthrough in VMs? Or games, or make
       | hashcat faster?
        
         | saagarjha wrote:
         | No. This is mostly a toy rather than a production thing.
        
       | LegNeato wrote:
       | If you are interested in this, you might be interested in Rust
       | GPU: https://rust-gpu.github.io/
        
       ___________________________________________________________________
       (page generated 2024-12-15 23:02 UTC)