[HN Gopher] Full-scale file system acceleration on GPU [pdf]
___________________________________________________________________
Full-scale file system acceleration on GPU [pdf]
Author : west0n
Score : 138 points
Date : 2024-03-30 01:21 UTC (21 hours ago)
(HTM) web link (dl.gi.de)
(TXT) w3m dump (dl.gi.de)
| west0n wrote:
| According to this paper, GPU4FS is a file system that can run on
| the GPU and be accessed by applications running there. Since GPUs
| cannot make system calls, GPU4FS uses shared video memory (VRAM)
| and a parallel queue implementation. Applications running on the
| GPU can use GPU4FS after modifying their code, eliminating the
| need for a CPU-side file system implementation when accessing
| files. The experiments were done on Optane memory.
|
| It would be interesting to know if this approach could optimize
| the performance of training and inference for large models.
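|
| A rough sketch (in CUDA, with made-up names, not GPU4FS's actual
| interface) of what a request queue in VRAM shared between GPU
| application threads and a GPU-resident file system service might
| look like:
|
|     // Hypothetical VRAM-resident request queue; illustrative only.
|     #include <cstdio>
|     #include <cuda_runtime.h>
|
|     enum OpCode { FS_OPEN = 0, FS_READ = 1, FS_CLOSE = 2 };
|
|     struct FsRequest {
|         int  op;      // requested operation
|         int  handle;  // file handle (illustrative)
|         long offset;  // byte offset for FS_READ
|         long length;  // byte count for FS_READ
|     };
|
|     constexpr int QUEUE_CAP = 1024;
|
|     struct FsQueue {
|         FsRequest    slots[QUEUE_CAP];
|         unsigned int tail;  // next free slot, advanced by producers
|     };
|
|     // Application side: each GPU thread publishes one request.
|     __global__ void app_kernel(FsQueue *q) {
|         int tid = blockIdx.x * blockDim.x + threadIdx.x;
|         unsigned int slot = atomicAdd(&q->tail, 1u) % QUEUE_CAP;
|         q->slots[slot] = { FS_READ, tid, tid * 4096L, 4096L };
|     }
|
|     // File system side: drain what was enqueued. A real service
|     // would run persistently and concurrently with the application,
|     // with proper memory ordering; here it simply runs afterwards.
|     __global__ void fs_service_kernel(FsQueue *q, int nreq) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < nreq) {
|             FsRequest r = q->slots[i];
|             printf("fs: op=%d handle=%d off=%ld len=%ld\n",
|                    r.op, r.handle, r.offset, r.length);
|         }
|     }
|
|     int main() {
|         FsQueue *q;
|         cudaMalloc(&q, sizeof(FsQueue));
|         cudaMemset(q, 0, sizeof(FsQueue));
|         app_kernel<<<1, 32>>>(q);            // 32 application threads
|         fs_service_kernel<<<1, 32>>>(q, 32); // service handles them
|         cudaDeviceSynchronize();
|         cudaFree(q);
|         return 0;
|     }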
| t-3 wrote:
| GPUs seem to have a lot of memory these days. From my limited
| knowledge, games and other graphics-intensive applications will
| use too much of it for this approach to be particularly useful,
| but do other applications have a similar level of utilization?
| magicalhippo wrote:
| Given that PCIe allows data to be piped directly from one device
| to another without going through the host CPU[1][2], I guess it
| might make sense to just have the GPU read blocks straight from
| the NVMe (or even NVMe-of[3]) rather than having the CPU do a lot
| of work.
|
| edit: blind as a bat, says so right in the paper of course:
|
| _PMem is mapped directly to the GPU, and NVMe memory is accessed
| via Peer to Peer-DMA (P2PDMA)_
|
| [1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-
| NVMe-...
|
| [2]: https://lwn.net/Articles/767281/
|
| [3]: https://www.nvmexpress.org/wp-
| content/uploads/NVMe_Over_Fabr...
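|
| For comparison, this is roughly what the existing GPUDirect
| Storage (cuFile) path looks like on Linux today: the kernel on
| the CPU still resolves the file, but the payload can be DMA'd
| from the NVMe directly into VRAM. A minimal sketch (placeholder
| path and size, error handling omitted):
|
|     #include <fcntl.h>
|     #include <unistd.h>
|     #include <cuda_runtime.h>
|     #include <cufile.h>
|
|     int main() {
|         cuFileDriverOpen();
|         int fd = open("/mnt/nvme/data.bin", O_RDONLY | O_DIRECT);
|
|         CUfileDescr_t descr = {};
|         descr.handle.fd = fd;
|         descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
|         CUfileHandle_t fh;
|         cuFileHandleRegister(&fh, &descr);
|
|         size_t len = 1 << 20;            // 1 MiB
|         void *devPtr = nullptr;
|         cudaMalloc(&devPtr, len);
|         cuFileBufRegister(devPtr, len, 0);
|
|         // Read 1 MiB at file offset 0 straight into the GPU buffer.
|         ssize_t n = cuFileRead(fh, devPtr, len, 0, 0);
|
|         cuFileBufDeregister(devPtr);
|         cuFileHandleDeregister(fh);
|         cudaFree(devPtr);
|         close(fd);
|         cuFileDriverClose();
|         return n < 0;
|     }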
| wtallis wrote:
| I'm not sure they're actually doing NVMe yet; using Optane PMem
| is a bit of a cheat so that accessing storage is just plain
| memory reads and writes over PCIe. Implementing an NVMe device
| driver to set up and interact with command queues would be an
| extra layer of complexity that I think they left for future
| work.
| magicalhippo wrote:
| Sure, but my point was that it should be quite possible to
| get regular NVMes working.
|
| Once you've got that, the CPU is just the orchestrator and
| wouldn't necessarily need to be so beefy.
| dragontamer wrote:
| That's just called DirectStorage and was added as part of
| Windows 10 (erm.... some update in Windows 10).
|
| The PS5 and Xbox both have GPU-access of NVMe Flash.
|
| ------
|
| So you are right. But what you are talking about happened
| like 5 years ago.
|
| EDIT: https://devblogs.microsoft.com/directx/directstorage-
| develop...
|
| Looks like 3 years ago for Win10. But I feel like I heard
| about it earlier than that, as NVidia- or AMD-specific API
| calls.
| nine_k wrote:
| Didn't they stop making Optane? :(
|
| Also, Optane was like $4 per GB, so a moderately-sized drive,
| like 256GB, is already above $1000.
| amarcheschi wrote:
| Yes, Optane isn't produced anymore
| wtallis wrote:
| The Optane NVMe drives were more like $1-2 per GB when they
| were new, and are a fair bit cheaper now that they're
| basically on clearance: https://www.newegg.com/intel-
| optane-ssd-905p-series-960gb/p/...
|
| But this work used the Optane DC Persistent Memory DIMMs
| that only work with certain Intel server CPUs. I'm not sure
| what the typical price people actually paid for those was,
| but it probably was not actually more expensive than DRAM.
| ec109685 wrote:
| Interesting that they would discuss the system call overhead of
| opening a file, reading from it, and closing it. It seems like
| in almost all cases the open and close calls would be dwarfed by
| the other operations.
| eru wrote:
| For lots of small files, that might not be the case.
|
| (I worked on a FUSE filesystem that had these issues.)
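|
| To make the overhead concrete: with many small files you pay a
| fixed open/read/close cost per file, and for a ~4 KiB payload
| that bookkeeping can rival the read itself. A rough host-side
| illustration (hypothetical paths):
|
|     #include <cstdio>
|     #include <fcntl.h>
|     #include <unistd.h>
|
|     int main() {
|         char path[64], buf[4096];
|         long total = 0;
|         for (int i = 0; i < 100000; i++) {
|             snprintf(path, sizeof(path), "data/%05d.bin", i);
|             int fd = open(path, O_RDONLY);          // syscall 1
|             if (fd < 0) continue;
|             ssize_t n = read(fd, buf, sizeof(buf)); // syscall 2
|             if (n > 0) total += n;
|             close(fd);                              // syscall 3
|         }
|         printf("%ld bytes, three syscalls per file\n", total);
|         return 0;
|     }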
| loeg wrote:
| It seems more straightforward to fix your data-in-files
| layout than to implement a novel in-GPU filesystem, though.
|
| I think the main benefit here is not having to do memory
| copies through the CPU, which frees up memory bandwidth for
| other things.
| maxcoder4 wrote:
| There are plenty of cases where you can't just change the file
| layout. And the GPU filesystem is being implemented by someone
| else, so the choice is: migrate your data to another filesystem
| OR fix the data-in-files layout. But fixing the layout may not
| be an option: the files may come from a completely different
| source than your application, the layout may be a standard,
| other applications may depend on it, or you may not be able to
| change it easily for some other reason.
| loeg wrote:
| If you can get the data into the GPU-native filesystem,
| you can change the data layout at least as easily. The
| point is there is some sort of data ingestion pipeline
| involved.
| eru wrote:
| > It seems more straightforward to fix your data-in-files
| layout than to implement a novel in-GPU filesystem, though.
|
| You can improve file-open overhead in conventional
| filesystems, too. Including the FUSE one I was working on.
| loeg wrote:
| Sure!
| KingOfCoders wrote:
| Like Microsoft DirectStorage?
| wtallis wrote:
| Nope. This is an implementation of one of several things that
| people often imagine Microsoft's DirectStorage to be, but the
| real DirectStorage is a lot more mundane.
| KingOfCoders wrote:
| I have no clue, so I've asked, where is the difference?
| wtallis wrote:
| DirectStorage is mostly an API for CPU code to
| asynchronously issue high-level storage requests such as
| asking for a file to be read from storage and the contents
| placed in a particular GPU buffer. Behind the scenes, the
| file contents could in theory be transferred from an SSD to
| the GPU using P2P DMA, because the OS now has enough of a
| big-picture view of what's going on to set up that kind of
| transfer when it's possible. But everything about parsing
| the filesystem data structures to locate the requested file
| data and issue commands to the SSD is still done on the CPU
| by the OS, and the application originating those high-level
| requests is a process running on the CPU and making system
| calls.
|
| Making the requests asynchronous and issuing lots of
| requests in parallel is what makes it possible to get good
| performance out of flash-based storage; P2P DMA would be a
| relatively minor optimization on top of that. DirectStorage
| isn't the only way to asynchronously issue batches of
| storage requests; Windows has long had IOCP and more
| recently cloned io_uring from Linux.
|
| DirectStorage 1.1 introduced an optional feature for GPU
| decompression, so that data which is stored on disk in a
| (the) supported compressed format can be streamed to the
| GPU and decompressed there instead of needing a round-trip
| through the CPU and its RAM for decompression. This could
| help make the P2P DMA option more widely usable by reducing
| the cases which need to fall back to the CPU, but
| decompressing on the GPU is nothing that applications
| couldn't already implement for themselves; DirectStorage
| just provides a convenient standardized API for this so
| that GPU vendors can provide a well-optimized decompression
| implementation. When P2P DMA isn't available, you can still
| get some computation offloaded from the CPU to the GPU
| after the compressed data makes a trip through the CPU's
| RAM.
|
| (Note: official docs about DirectStorage don't really say
| anything about P2P DMA, but it's clearly being designed to
| allow for it in the future.)
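|
| Concretely, the CPU-side flow described above looks roughly like
| this (an abridged sketch based on the public dstorage.h samples;
| it assumes a D3D12 device, destination buffer, and fence already
| exist, and omits error handling):
|
|     #include <d3d12.h>
|     #include <dstorage.h>
|     #include <wrl/client.h>
|     using Microsoft::WRL::ComPtr;
|
|     void enqueue_read(ID3D12Device *device, ID3D12Resource *gpuBuf,
|                       ID3D12Fence *fence, UINT64 fenceValue)
|     {
|         ComPtr<IDStorageFactory> factory;
|         DStorageGetFactory(IID_PPV_ARGS(&factory));
|
|         ComPtr<IDStorageFile> file;
|         factory->OpenFile(L"assets/level0.bin", IID_PPV_ARGS(&file));
|
|         DSTORAGE_QUEUE_DESC queueDesc = {};
|         queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
|         queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
|         queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
|         queueDesc.Device     = device;
|         ComPtr<IDStorageQueue> queue;
|         factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));
|
|         // "Read this file region into that GPU buffer." The OS
|         // still walks the filesystem and talks to the SSD on the
|         // application's behalf.
|         DSTORAGE_REQUEST request = {};
|         request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
|         request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
|         request.Source.File.Source      = file.Get();
|         request.Source.File.Offset      = 0;
|         request.Source.File.Size        = 4 * 1024 * 1024;
|         request.Destination.Buffer.Resource = gpuBuf;
|         request.Destination.Buffer.Offset   = 0;
|         request.Destination.Buffer.Size     = 4 * 1024 * 1024;
|
|         queue->EnqueueRequest(&request);
|         queue->EnqueueSignal(fence, fenceValue);  // completion signal
|         queue->Submit();
|     }
|
| (The 1.1 GPU decompression mentioned above is requested per
| request via a compression-format option, not shown here.)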
|
| The GPU4FS described here is a project to implement the
| filesystem entirely on the GPU: the code to eg. walk the
| directory hierarchy and locate what address actually holds
| the file contents is not on the CPU but on the GPU. This
| approach means the application running on the GPU needs
| exclusive ownership of the device holding the filesystem.
| For now, they're using persistent memory as the backing
| store, but in the future they could implement NVMe and have
| storage requests originate from the GPU and be delivered
| directly to the SSD with no CPU or OS involvement.
| KingOfCoders wrote:
| Thanks!
| multimind wrote:
| A friend of mine used to work for a GPU database startup as an
| integration engineer. He got frustrated because GPU drivers (not
| just AMD's but also Nvidia's) are intrinsically unstable and not
| designed for long, flawless runs. If a few bits have the wrong
| value in a deep neural network, or a pixel is wrong in a game,
| it does not matter much. In databases (or file systems, for that
| matter) it means everything! It is hard to believe at first, but
| his former company now offers solutions without GPU acceleration
| that simply work, though they also lost their USP.
| yosefk wrote:
| If the game or your training crashes, though, it matters a lot.
| What sort of bugs give you wrong values without crashing,
| especially driver bugs? Something is strange here.
| amelius wrote:
| Yeah, I had a lot of nVidia GPUs suddenly disappear mid-
| training when even nvidia-smi couldn't find them; this was on
| different systems (Linux) and only a reboot fixed it.
|
| You don't want this kind of thing happening when it is running
| a filesystem.
| LeanderK wrote:
| Strange. I never had any problems with Nvidia GPUs, but I've
| only ever used data center GPUs like the V100 (and don't set
| them up myself). There are a lot of things that can go wrong,
| but at least my Nvidia GPUs have always worked.
| solardev wrote:
| Could you use some sort of RAID array of GPUs to compensate...?
| yeison wrote:
| How to get hired by NVIDIA! If it does work it's a brilliant
| idea.
| afr0ck wrote:
| I didn't fully read the paper, but a few questions come to mind.
|
| 1) How does this work differ from Mark Silberstein's GPUfs from
| 2014 [1]?
|
| 2) Does this work assume the storage device is only accessed by
| the GPU? Otherwise, how do you guarantee consistency when
| multiple processes can map, read and write the same files? You
| mention POSIX. POSIX has MAP_SHARED. How is this situation
| handled?
|
| 3) Related to (2), on the device level, how do you sync CPU (on
| an SMP, multiple cores) and GPU accesses?
|
| [1] https://dl.acm.org/doi/10.1145/2553081
| touisteur wrote:
| Now this is all fun, but has anyone managed to make these
| mechanisms work with Multicast PCIe? I really need GPUDirect and
| StorageDirect to support this, until PCIe catches up to today's
| (or Blackwell's) NVLink... around PCIe 12?
| molticrystal wrote:
| While it is not a 1:1 comparison, there has been a driver for
| Windows that allows the creation of a RAM drive backed by VRAM
| on NVIDIA cards.
|
| >GpuRamDrive
|
| >Create a virtual drive backed by GPU RAM.
|
| https://github.com/prsyahmi/GpuRamDrive
|
| Fork with AMD support:
|
| https://github.com/brzz/GpuRamDrive/
|
| Fork that has fixes and support for other cards and additional
| features:
|
| https://github.com/Ado77/GpuRamDrive
| Zambyte wrote:
| For Linux: https://wiki.archlinux.org/title/Swap_on_video_RAM
| amarcheschi wrote:
| I tried and tested it on my 5700 XT. In CrystalDiskMark (5
| repeated runs on 1 GiB) I got, in MB/s:
|
|                   Read     Write
|     SEQ1M Q8T1    2339     2620
|     SEQ1M Q1T1    2205     2190
|     RND Q32       41.31    38.77
|     RND Q1T1      34.70    32.80
|
| To be honest I didn't know what to expect, aside from very high
| read and write speeds. I was a bit disappointed to see that
| random reads and writes were so slow. The only use I could think
| of would be keeping photo sets or similar over there, then
| saving the session to an SSD when closing the program, but that
| is just as easily solved by using a newer NVMe SSD.
| amelius wrote:
| A GPU seems overkill when the bottleneck is the I/O.
| OlivierLi wrote:
| In systems performance I would advise never thinking of any
| workload as unidimensional (i.e., as if any file system
| optimization must either improve IO latency or be useless).
|
| Issuing individual truncates of 1B files, for example, can be
| just as much of a CPU problem as an IO one.
| amelius wrote:
| But why wouldn't using one of many CPU cores be sufficient?
| hieu229 wrote:
| I hope GPU file systems lead to faster databases.
| brcmthrowaway wrote:
| Is this implementing a file system using shader code? That's
| insane.
|
| Are shaders Turing complete? ;)
___________________________________________________________________
(page generated 2024-03-30 23:01 UTC)