https://github.com/tinygrad/open-gpu-kernel-modules
tinygrad / open-gpu-kernel-modules (forked from NVIDIA/open-gpu-kernel-modules), 550.54.15-p2p

# NVIDIA Linux Open GPU with P2P support

This is a fork of NVIDIA's driver with P2P support added for 4090s.

Run `./install.sh` to install if that's all you want. You may need to uninstall the driver from DKMS first. Your system needs large BAR support and IOMMU off.

Not sure all the cache flushes are right, so please file issues here if you find any.

NOTE: This is not a hack, this is using PCIe according to the spec. With cleanups, this could potentially be upstreamed.

## How it works

Normally, P2P on NVIDIA cards uses MAILBOXP2P. This is a hardware interface designed to allow GPUs to transfer memory back in the days of small BAR. It is either not present or disabled in hardware on the 4090s, and that's why P2P doesn't work.
There was a bug in early versions of the driver that reported that it did work, and it was actually sending stuff on the PCIe bus. However, because the mailbox hardware wasn't present, these copies wouldn't go to the right place. You could even crash the system by doing something like `torch.zeros(10000,10000).cuda().to("cuda:1")`.

In some 3090s and all 4090s, NVIDIA added large BAR support.

    tiny@tiny14:~$ lspci -s 01:00.0 -v
    01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1) (prog-if 00 [VGA controller])
            Subsystem: Micro-Star International Co., Ltd. [MSI] Device 510b
            Physical Slot: 49
            Flags: bus master, fast devsel, latency 0, IRQ 377
            Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
            Memory at 28800000000 (64-bit, prefetchable) [size=32G]
            Memory at 28400000000 (64-bit, prefetchable) [size=32M]
            I/O ports at 3000 [size=128]
            Expansion ROM at b3000000 [virtual] [disabled] [size=512K]
            Capabilities:
            Kernel driver in use: nvidia
            Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Notice how BAR1 (the 64-bit prefetchable region at 28800000000) is size 32G.

In H100, they also added support for a PCIe mode that uses the BAR directly instead of the mailboxes, called BAR1P2P. So, what happens if we try to enable that on a 4090?

We do this by bypassing the HAL and calling a bunch of the GH100 methods directly. Methods like `kbusEnableStaticBar1Mapping_GH100`, which maps the entire VRAM into BAR1. This mostly just works, but we had to disable the use of that region in the `MapAperture` function for some reason. Shouldn't matter.

    [ 3491.654009] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
    [ 3491.793389] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000

Perfect, we now have the VRAM mapped. However, it's not that easy to get P2P. When you run `./simpleP2P` from cuda-samples, you get this error.

    [ 3742.840689] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU2 and GPU3
    [ 3742.840762] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU3 and GPU2
    [ 3742.841089] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
    [ 3742.841106] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
    [ 3742.841281] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
    [ 3742.841292] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
    [ 3742.865948] NVRM: GPU at PCI:0000:01:00: GPU-49c7a6c9-e3a8-3b48-f0ba-171520d77dd1
    [ 3742.865956] NVRM: Xid (PCI:0000:01:00): 31, pid=21804, name=simpleP2P, Ch 00000013, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7f97_94000000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED_KIND ACCESS_TYPE_VIRT_WRITE

It fails with an MMU fault. So you dive into this and find that it's using `GMMU_APERTURE_PEER` as the mapping type. That doesn't seem to be supported on the 4090. So let's see what types are supported: `GMMU_APERTURE_VIDEO`, `GMMU_APERTURE_SYS_NONCOH`, and `GMMU_APERTURE_SYS_COH`. We don't care about being coherent with the CPU's L2 cache, but it does have to go out the PCIe bus, so we rewrite `GMMU_APERTURE_PEER` to `GMMU_APERTURE_SYS_NONCOH`. We also no longer set the peer id that was corrupting the page table.

    cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.21GB/s
    Preparing host buffer and memcpy to GPU0...
    Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
    Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
    Copy data back to host from GPU0 and verify results...
    Verification error @ element 1: val = 0.000000, ref = 4.000000
    Verification error @ element 2: val = 0.000000, ref = 8.000000

Progress! `./simpleP2P` appears to work, however the copy isn't happening.
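
A note on the two `field_desc.h` assertions in the log above: read at face value, they look like the usual sanity checks for packing a value into a bit-field. The shifted value has to survive a round trip, and it has to stay inside the field's mask. One plausible reading (not confirmed anywhere in this README) is that a value which doesn't fit its field, such as the peer id mentioned above, trips them. Here is a minimal Python sketch of that kind of check. The `Field` class, the `pack` function, and the field widths are hypothetical and for illustration only. This is not the driver's code, just the two expressions from the log applied to a toy 64-bit word.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical bit-field descriptor: `width` bits starting at bit `shift`."""
    shift: int
    width: int

    @property
    def mask_pos(self) -> int:
        # Mask of the bits this field occupies, already shifted into position.
        return ((1 << self.width) - 1) << self.shift


def pack(field: Field, value: int, word_bits: int = 64) -> tuple[int, bool]:
    """Shift `value` into place and report whether both checks from the log hold.

    Returns (field bits after masking, True if the value fit cleanly).
    """
    # Emulate a fixed-width C integer: bits shifted past the word are lost.
    shifted = (value << field.shift) & ((1 << word_bits) - 1)
    round_trips = (shifted >> field.shift) == value       # first assertion
    fits_mask = (shifted & field.mask_pos) == shifted     # second assertion
    return shifted & field.mask_pos, round_trips and fits_mask


toy = Field(shift=4, width=3)   # made-up layout: 3 bits at bit 4
print(pack(toy, 5))             # (80, True): 5 fits in 3 bits
print(pack(toy, 9))             # (16, False): 9 needs 4 bits, so it spills
```

In the second call the unmasked value would have written into the neighbouring bits of the word, which is the sort of thing that quietly corrupts a page table entry.
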
The address is likely wrong. It turns out they have a separate field for the peer address called `fldAddrPeer`, so we change that to `fldAddrSysmem`. We also print out the addresses and note that the physical BAR address isn't being added properly. They provide a field `fabricBaseAddress` for `GMMU_APERTURE_PEER`, so we reuse it and put the BAR1 base address in there.

That's it. Thanks to NVIDIA for writing such a stable driver. And with this, the tinybox green is even better.

~ the tiny corp

## Functional

    Enabling peer access between GPU0 and GPU1...
    Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
    Creating event handles...
    cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.44GB/s
    Preparing host buffer and memcpy to GPU0...
    Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
    Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
    Copy data back to host from GPU0 and verify results...
    Disabling peer access...
    Shutting down...
    Test passed

## Fast

    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D      0      1      2      3      4      5
         0 919.39  50.11  50.15  51.22  50.59  51.22
         1  50.19 921.29  50.31  51.21  50.62  51.22
         2  50.23  50.55 921.83  51.22  50.39  51.22
         3  50.33  50.65  51.20 920.20  50.43  51.22
         4  50.18  50.68  50.26  51.22 922.30  51.23
         5  50.12  50.09  50.44  51.22  51.21 921.29

## And NCCL (aka torch) compatible!

    tiny@tiny14:~/build/nccl-tests/build$ ./all_reduce_perf -g 6
    # nThread 1 nGpus 6 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid  26230 on     tiny14 device  0 [0x01] NVIDIA GeForce RTX 4090
    #  Rank  1 Group  0 Pid  26230 on     tiny14 device  1 [0x42] NVIDIA GeForce RTX 4090
    #  Rank  2 Group  0 Pid  26230 on     tiny14 device  2 [0x81] NVIDIA GeForce RTX 4090
    #  Rank  3 Group  0 Pid  26230 on     tiny14 device  3 [0x82] NVIDIA GeForce RTX 4090
    #  Rank  4 Group  0 Pid  26230 on     tiny14 device  4 [0xc1] NVIDIA GeForce RTX 4090
    #  Rank  5 Group  0 Pid  26230 on     tiny14 device  5 [0xc2] NVIDIA GeForce RTX 4090
    #
    #                                           out-of-place                       in-place
    #     size       count   type  redop  root    time   algbw   busbw  #wrong    time   algbw   busbw  #wrong
    #      (B)  (elements)                         (us)  (GB/s)  (GB/s)           (us)  (GB/s)  (GB/s)
      33554432     8388608  float    sum    -1  2275.1   14.75   24.58       0  2282.5   14.70   24.50       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 24.5413
    #
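
If you want to sanity check the `algbw` and `busbw` columns above, here is a small Python sketch that recomputes them from the reported size and times. It assumes the conventions described in the nccl-tests performance notes: `algbw` is bytes divided by time with 1 GB = 1e9 bytes, and for all-reduce `busbw` is `algbw` scaled by `2*(n-1)/n`, where `n` is the number of ranks. The function names are made up for this example.

```python
def algbw_gb_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: bytes per second, expressed in GB/s (1 GB = 1e9 B)."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gb_s(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


size = 33554432  # bytes, from the "size" column
for label, time_us in (("out-of-place", 2275.1), ("in-place", 2282.5)):
    alg = algbw_gb_s(size, time_us)
    bus = allreduce_busbw_gb_s(alg, n_ranks=6)
    print(f"{label}: algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
# out-of-place: algbw=14.75 GB/s busbw=24.58 GB/s
# in-place: algbw=14.70 GB/s busbw=24.50 GB/s
```

Both rows reproduce the table, and averaging the two unrounded `busbw` values gives about 24.54, in line with the reported average bus bandwidth.
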