[HN Gopher] Full-scale file system acceleration on GPU [pdf]
       ___________________________________________________________________
        
       Full-scale file system acceleration on GPU [pdf]
        
       Author : west0n
       Score  : 138 points
       Date   : 2024-03-30 01:21 UTC (21 hours ago)
        
 (HTM) web link (dl.gi.de)
 (TXT) w3m dump (dl.gi.de)
        
       | west0n wrote:
        | According to this paper, GPU4FS is a file system that runs on
        | the GPU and is accessed by applications running there. Since
        | GPUs cannot make system calls, GPU4FS uses shared video memory
        | (VRAM) and a parallel queue implementation. Applications
        | running on the GPU can use GPU4FS after modifying their code,
        | eliminating the need for a CPU-side file system when they
        | access files. The experiments are done on Optane persistent
        | memory.
       | 
       | It would be interesting to know if this approach could optimize
       | the performance of training and inference for large models.
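        | 
        | Conceptually, the shared-VRAM queue submission might look
        | something like the sketch below. This is not the paper's
        | actual API; the names, request layout and polling scheme are
        | made up for illustration:
        | 
        |     // CUDA sketch: an application kernel submits a read
        |     // request into a ring buffer that lives in VRAM shared
        |     // with the GPU4FS worker kernel. All names hypothetical.
        |     #define GPU4FS_SLOTS     256
        |     #define GPU4FS_OP_READ   1
        |     #define GPU4FS_REQ_READY 1
        |     #define GPU4FS_REQ_DONE  2
        | 
        |     struct Gpu4fsRequest {
        |         int          op;
        |         char         path[64];
        |         void*        dst;        // destination buffer in VRAM
        |         size_t       offset, len;
        |         volatile int state;      // READY -> DONE
        |     };
        | 
        |     struct Gpu4fsQueue {         // ring buffer in shared VRAM
        |         Gpu4fsRequest slots[GPU4FS_SLOTS];
        |         unsigned int  tail;
        |     };
        | 
        |     __device__ void gpu4fs_read(Gpu4fsQueue* q,
        |                                 const char* path, void* dst,
        |                                 size_t off, size_t len) {
        |         unsigned s = atomicAdd(&q->tail, 1u) % GPU4FS_SLOTS;
        |         Gpu4fsRequest* r = &q->slots[s];
        |         r->op = GPU4FS_OP_READ;
        |         r->dst = dst; r->offset = off; r->len = len;
        |         // copy path (sketch assumes < 64 chars)
        |         for (int i = 0;
        |              i < 64 && (r->path[i] = path[i]) != 0; ++i) {}
        |         __threadfence();         // publish fields device-wide
        |         r->state = GPU4FS_REQ_READY;
        |         // GPU4FS worker blocks poll the ring, walk the file
        |         // system structures in PMem, copy the data into dst,
        |         // then mark the request DONE.
        |         while (r->state != GPU4FS_REQ_DONE) {}
        |     }
        | 
        | Instead of a system call, the request is just a memory write
        | that another GPU kernel (or, in other designs, a CPU helper)
        | picks up and services.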
        
         | t-3 wrote:
          | GPUs seem to have a lot of memory these days. From my
          | limited knowledge, games and other graphics-intensive
          | applications would use too much of it for this approach to
          | be particularly useful, but do other applications have a
          | similar level of utilization?
        
       | magicalhippo wrote:
       | Given that PCIe allows data to be piped directly from one device
       | to another without going through the host CPU[1][2], I guess it
       | might make sense to just have the GPU read blocks straight from
        | the NVMe (or even NVMe-oF[3]) rather than having the CPU do a
        | lot of work.
       | 
       | edit: blind as a bat, says so right in the paper of course:
       | 
       |  _PMem is mapped directly to the GPU, and NVMe memory is accessed
       | via Peer to Peer-DMA (P2PDMA)_
       | 
       | [1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-
       | NVMe-...
       | 
       | [2]: https://lwn.net/Articles/767281/
       | 
       | [3]: https://www.nvmexpress.org/wp-
       | content/uploads/NVMe_Over_Fabr...
        
         | wtallis wrote:
         | I'm not sure they're actually doing NVMe yet; using Optane PMem
         | is a bit of a cheat so that accessing storage is just plain
         | memory reads and writes over PCIe. Implementing an NVMe device
         | driver to set up and interact with command queues would be an
         | extra layer of complexity that I think they left for future
         | work.
        
           | magicalhippo wrote:
           | Sure, but my point was that it should be quite possible to
           | get regular NVMes working.
           | 
            | Once you've got that, the CPU is just the orchestrator and
            | wouldn't necessarily need to be so beefy.
        
             | dragontamer wrote:
             | That's just called DirectStorage and was added as part of
             | Windows 10 (erm.... some update in Windows 10).
             | 
             | The PS5 and Xbox both have GPU-access of NVMe Flash.
             | 
             | ------
             | 
             | So you are right. But what you are talking about happened
             | like 5 years ago.
             | 
             | EDIT: https://devblogs.microsoft.com/directx/directstorage-
             | develop...
             | 
              | Looks like 3 years ago for Win10, but I feel like I
              | heard about it earlier than that as Nvidia- or AMD-
              | specific API calls.
        
           | nine_k wrote:
           | Didn't they stop making Optane? :(
           | 
           | Also, Optane was like $4 per GB, so a moderately-sized drive,
           | like 256GB, is already above $1000.
        
             | amarcheschi wrote:
             | Yes, Optane isn't produced anymore
        
             | wtallis wrote:
             | The Optane NVMe drives were more like $1-2 per GB when they
             | were new, and are a fair bit cheaper now that they're
             | basically on clearance: https://www.newegg.com/intel-
             | optane-ssd-905p-series-960gb/p/...
             | 
             | But this work used the Optane DC Persistent Memory DIMMs
             | that only work with certain Intel server CPUs. I'm not sure
             | what the typical price people actually paid for those was,
             | but it probably was not actually more expensive than DRAM.
        
       | ec109685 wrote:
       | Interesting they would discuss system call overhead of opening a
       | file, reading from it and closing it. Seems like in almost all
       | cases the open and close calls would be overwhelmed by the other
       | operations.
        
         | eru wrote:
         | For lots of small files, that might not be the case.
         | 
         | (I worked on a FUSE filesystem that had these issues.)
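          | 
          | To make it concrete, the per-file pattern is three syscalls
          | for one small read (plain POSIX, hypothetical helper); for a
          | few-KB file the fixed open/close cost is on the same order
          | as the read itself:
          | 
          |     #include <fcntl.h>
          |     #include <unistd.h>
          | 
          |     // Reading one small file costs three syscalls: the
          |     // open() (path lookup, permission checks) and close()
          |     // are a fixed per-file overhead comparable to the
          |     // single read() when the file is only a few KB.
          |     static void read_small_file(const char* path,
          |                                 char* buf, size_t cap) {
          |         int fd = open(path, O_RDONLY);  // syscall 1
          |         if (fd < 0)
          |             return;
          |         (void)read(fd, buf, cap);       // syscall 2
          |         close(fd);                      // syscall 3
          |     }
          | 
          | Multiply that by millions of small files and the open/close
          | overhead stops being negligible.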
        
           | loeg wrote:
           | It seems more straightforward to fix your data-in-files
           | layout than to implement a novel in-GPU filesystem, though.
           | 
           | I think the main benefit here is not having to do memory
           | copies through the CPU, which frees up memory bandwidth for
           | other things.
        
             | maxcoder4 wrote:
              | There are plenty of cases where you can't just change
              | the file layout. And the GPU filesystem is being
              | implemented by someone else, so the choice is: migrate
              | your data to another filesystem, or fix the data-in-
              | files layout yourself. But the files may come from a
              | completely different source than your application, the
              | layout may be a standard, other applications may depend
              | on it, or you may not be able to change it easily for
              | some other reason.
        
               | loeg wrote:
               | If you can get the data into the GPU-native filesystem,
               | you can change the data layout at least as easily. The
               | point is there is some sort of data ingestion pipeline
               | involved.
        
             | eru wrote:
             | > It seems more straightforward to fix your data-in-files
             | layout than to implement a novel in-GPU filesystem, though.
             | 
             | You can improve file-open overhead in conventional
             | filesystems, too. Including the FUSE one I was working on.
        
               | loeg wrote:
               | Sure!
        
       | KingOfCoders wrote:
       | Like Microsoft DirectStorage?
        
         | wtallis wrote:
         | Nope. This is an implementation of one of several things that
         | people often imagine Microsoft's DirectStorage to be, but the
         | real DirectStorage is a lot more mundane.
        
           | KingOfCoders wrote:
            | I have no clue, that's why I asked: where is the
            | difference?
        
             | wtallis wrote:
             | DirectStorage is mostly an API for CPU code to
             | asynchronously issue high-level storage requests such as
             | asking for a file to be read from storage and the contents
             | placed in a particular GPU buffer. Behind the scenes, the
             | file contents could in theory be transferred from an SSD to
             | the GPU using P2P DMA, because the OS now has enough of a
             | big-picture view of what's going on to set up that kind of
             | transfer when it's possible. But everything about parsing
             | the filesystem data structures to locate the requested file
             | data and issue commands to the SSD is still done on the CPU
             | by the OS, and the application originating those high-level
             | requests is a process running on the CPU and making system
             | calls.
             | 
             | Making the requests asynchronous and issuing lots of
             | requests in parallel is what makes it possible to get good
             | performance out of flash-based storage; P2P DMA would be a
             | relatively minor optimization on top of that. DirectStorage
             | isn't the only way to asynchronously issue batches of
             | storage requests; Windows has long had IOCP and more
             | recently cloned io_uring from Linux.
             | 
             | DirectStorage 1.1 introduced an optional feature for GPU
             | decompression, so that data which is stored on disk in a
             | (the) supported compressed format can be streamed to the
             | GPU and decompressed there instead of needing a round-trip
             | through the CPU and its RAM for decompression. This could
             | help make the P2P DMA option more widely usable by reducing
             | the cases which need to fall back to the CPU, but
             | decompressing on the GPU is nothing that applications
             | couldn't already implement for themselves; DirectStorage
             | just provides a convenient standardized API for this so
             | that GPU vendors can provide a well-optimized decompression
             | implementation. When P2P DMA isn't available, you can still
             | get some computation offloaded from the CPU to the GPU
             | after the compressed data makes a trip through the CPU's
             | RAM.
             | 
             | (Note: official docs about DirectStorage don't really say
             | anything about P2P DMA, but it's clearly being designed to
             | allow for it in the future.)
             | 
             | The GPU4FS described here is a project to implement the
             | filesystem entirely on the GPU: the code to eg. walk the
             | directory hierarchy and locate what address actually holds
             | the file contents is not on the CPU but on the GPU. This
             | approach means the application running on the GPU needs
             | exclusive ownership of the device holding the filesystem.
             | For now, they're using persistent memory as the backing
             | store, but in the future they could implement NVMe and have
             | storage requests originate from the GPU and be delivered
             | directly to the SSD with no CPU or OS involvement.
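              | 
              | For concreteness, the CPU-side DirectStorage flow looks
              | roughly like this, condensed from Microsoft's public
              | DirectStorage samples; the device, buffer, fence, size
              | and file name are placeholders assumed to exist, and
              | error handling is omitted:
              | 
              |     #include <d3d12.h>
              |     #include <dstorage.h>
              |     #include <wrl/client.h>
              |     using Microsoft::WRL::ComPtr;
              | 
              |     // Enqueue an async read of a file into a GPU
              |     // buffer. The filesystem walk and the NVMe
              |     // commands are still issued by the OS on the CPU;
              |     // only the payload may take a more direct path.
              |     void LoadIntoGpuBuffer(ID3D12Device* dev,
              |                            ID3D12Resource* gpuBuf,
              |                            ID3D12Fence* fence,
              |                            UINT32 size) {
              |         ComPtr<IDStorageFactory> factory;
              |         DStorageGetFactory(IID_PPV_ARGS(&factory));
              | 
              |         ComPtr<IDStorageFile> file;
              |         factory->OpenFile(L"level.bin",  // placeholder
              |                           IID_PPV_ARGS(&file));
              | 
              |         DSTORAGE_QUEUE_DESC qd{};
              |         qd.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
              |         qd.Priority   = DSTORAGE_PRIORITY_NORMAL;
              |         qd.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
              |         qd.Device     = dev;
              |         ComPtr<IDStorageQueue> q;
              |         factory->CreateQueue(&qd, IID_PPV_ARGS(&q));
              | 
              |         DSTORAGE_REQUEST r{};
              |         r.Options.SourceType =
              |             DSTORAGE_REQUEST_SOURCE_FILE;
              |         r.Options.DestinationType =
              |             DSTORAGE_REQUEST_DESTINATION_BUFFER;
              |         r.Source.File.Source = file.Get();
              |         r.Source.File.Offset = 0;
              |         r.Source.File.Size   = size;
              |         r.Destination.Buffer.Resource = gpuBuf;
              |         r.Destination.Buffer.Offset   = 0;
              |         r.Destination.Buffer.Size     = size;
              | 
              |         q->EnqueueRequest(&r);   // batch many of these
              |         q->EnqueueSignal(fence, 1);
              |         q->Submit();             // kick off the batch
              |         // Wait on / poll the fence before using gpuBuf.
              |     }
              | 
              | Every call above runs on the CPU; GPU4FS moves that
              | whole control path onto the GPU itself.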
        
               | KingOfCoders wrote:
               | Thanks!
        
       | multimind wrote:
        | A friend of mine used to work for a GPU database startup as
        | an integration engineer. He got frustrated because GPU
        | drivers (not just AMD but also Nvidia) are intrinsically
        | unstable and not designed for long, flawless runs. If a few
        | bits have the wrong value in a deep neural network, or a
        | pixel is wrong in a game, it does not matter much. In
        | databases (or file systems, for that matter) it means
        | everything! It is hard to believe at first, but his former
        | company now offers solutions without GPU acceleration that
        | simply work. But they also lost their USP.
        
         | yosefk wrote:
         | If the game or your training crashes though, it matters a lot.
          | What sort of bugs give you wrong values without crashing,
          | especially driver bugs? Something is strange here.
        
         | amelius wrote:
          | Yeah, I've had a lot of Nvidia GPUs suddenly disappear mid-
          | training, when even nvidia-smi couldn't find them; this was
          | on different (Linux) systems and only a reboot fixed it.
         | 
         | You don't want this kind of thing happening when it is running
         | a filesystem.
        
           | LeanderK wrote:
            | Strange. I've never had any problems with Nvidia GPUs, but
            | I've only ever used data center GPUs like the V100 (and
            | don't set them up myself). There are a lot of things that
            | can go wrong, but at least my Nvidia GPUs have always
            | worked.
        
         | solardev wrote:
         | Could you use some sort of RAID array of GPUs to compensate...?
        
       | yeison wrote:
        | How to get hired by NVIDIA! If it does work, it's a brilliant
        | idea.
        
       | afr0ck wrote:
        | I didn't fully read the paper, but a few questions come to
        | mind.
       | 
       | 1) How does this work differ from Mark Silberstein's GPUfs from
       | 2014 [1]?
       | 
       | 2) Does this work assume the storage device is only accessed by
       | the GPU? Otherwise, how do you guarantee consistency when
       | multiple processes can map, read and write the same files? You
       | mention POSIX. POSIX has MAP_SHARED. How is this situation
       | handled?
       | 
       | 3) Related to (2), on the device level, how do you sync CPU (on
       | an SMP, multiple cores) and GPU accesses?
       | 
       | [1] https://dl.acm.org/doi/10.1145/2553081
        
       | touisteur wrote:
        | Now this is all fun, but has anyone managed to make these
        | mechanisms work with multicast PCIe? I really need GPUDirect
        | and StorageDirect to support this, until PCIe catches up to
        | today's (or Blackwell's) NVLink ... around PCIe 12?
        
       | molticrystal wrote:
        | While it is not a 1:1 comparison, there has been a driver for
        | Windows that allows the creation of a RAM drive backed by VRAM
        | for NVIDIA cards.
       | 
       | >GpuRamDrive
       | 
       | >Create a virtual drive backed by GPU RAM.
       | 
       | https://github.com/prsyahmi/GpuRamDrive
       | 
       | Fork with AMD support:
       | 
       | https://github.com/brzz/GpuRamDrive/
       | 
       | Fork that has fixes and support for other cards and additional
       | features:
       | 
       | https://github.com/Ado77/GpuRamDrive
        
         | Zambyte wrote:
         | For Linux: https://wiki.archlinux.org/title/Swap_on_video_RAM
        
         | amarcheschi wrote:
          | I tried it on my 5700 XT. In CrystalDiskMark (5 repeated
          | runs on 1 GiB) I got, in MB/s (read / write):
          | 
          |     SEQ1M Q8T1:  2339 / 2620
          |     SEQ1M Q1T1:  2205 / 2190
          |     RND Q32:    41.31 / 38.77
          |     RND Q1T1:   34.70 / 32.80
          | 
          | To be honest I didn't know what to expect, aside from very
          | high read and write speeds. I was a bit disappointed to see
          | that random reads and writes were so slow. The only use I
          | could think of would be keeping photo sets or things like
          | that over there and then saving the session to an SSD when
          | closing the program, but that is just as easily solved by
          | using a newer NVMe SSD.
        
       | amelius wrote:
       | A GPU seems overkill when the bottleneck is the I/O.
        
         | OlivierLi wrote:
          | In systems performance I would advise never thinking of any
          | workload as unidimensional (i.e. that any file system
          | optimization must either improve IO latency or be useless).
          | 
          | Issuing individual truncates of 1B files, for example, can
          | be just as much of a CPU problem as an IO one.
        
           | amelius wrote:
           | But why wouldn't using one of many CPU cores be sufficient?
        
       | hieu229 wrote:
        | I hope GPU file systems lead to faster databases.
        
       | brcmthrowaway wrote:
        | Is this implementing a file system using shader code? That's
        | insane.
        | 
        | Are shaders Turing complete? ;)
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:01 UTC)