https://github.com/tinygrad/open-gpu-kernel-modules

tinygrad / open-gpu-kernel-modules Public
forked from NVIDIA/open-gpu-kernel-modules

 550.54.15-p2p

NVIDIA Linux Open GPU with P2P support

 

This is a fork of NVIDIA's driver with P2P support added for 4090s.

Run ./install.sh to install, if that's all you want.

You may need to uninstall the driver from DKMS. Your system needs
large BAR support and IOMMU off.

We're not sure all the cache flushes are right. Please file an issue
here if you find a problem.

NOTE: This is not a hack; it uses PCIe according to the spec. With
cleanups, this could potentially be upstreamed.

How it works

 

Normally, P2P on NVIDIA cards uses MAILBOXP2P, a hardware interface
designed to let GPUs transfer memory back in the days of small BAR.
It is either absent or disabled in hardware on the 4090s, and that's
why P2P doesn't work. There was a bug in early versions of the driver
that reported that it did work, and the driver was actually sending
stuff on the PCIe bus. However, because the mailbox hardware wasn't
present, these copies wouldn't go to the right place. You could even
crash the system by doing something like
torch.zeros(10000,10000).cuda().to("cuda:1")

In some 3090s and all 4090s, NVIDIA added large BAR support.

tiny@tiny14:~$ lspci -s 01:00.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device 510b
        Physical Slot: 49
        Flags: bus master, fast devsel, latency 0, IRQ 377
        Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 28800000000 (64-bit, prefetchable) [size=32G]
        Memory at 28400000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 3000 [size=128]
        Expansion ROM at b3000000 [virtual] [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Notice that BAR1 is 32G in size. On the H100, NVIDIA also added
support for a PCIe mode, called BAR1P2P, that uses the BAR directly
instead of the mailboxes. So, what happens if we try to enable that
on a 4090?
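
The BAR sizes can also be read out of the lspci output with a few
lines of Python. This is a minimal sketch: the sample text is copied
from the output above, and the helper name parse_memory_regions and
the shortcut of treating the largest memory region as BAR1 are
illustrative assumptions, not something lspci or the driver defines.

```python
import re

# "Memory at" lines copied from the lspci output above.
LSPCI_OUTPUT = """\
        Memory at b2000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 28800000000 (64-bit, prefetchable) [size=32G]
        Memory at 28400000000 (64-bit, prefetchable) [size=32M]
"""

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memory_regions(text):
    """Return (base_address, size_in_bytes) for each 'Memory at' line."""
    pattern = re.compile(r"Memory at ([0-9a-f]+) \(.*?\) \[size=(\d+)([KMG])\]")
    return [
        (int(base, 16), int(count) * UNITS[unit])
        for base, count, unit in pattern.findall(text)
    ]


regions = parse_memory_regions(LSPCI_OUTPUT)
# Assumption: the largest region is the VRAM aperture (BAR1).
largest_base, largest_size = max(regions, key=lambda region: region[1])
print(hex(largest_base), largest_size // (1 << 30), "GiB")  # 0x28800000000 32 GiB
```

On a system without large BAR support, the largest region would be
far smaller than the card's VRAM, which is the case the mailbox
interface was designed for.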

We do this by bypassing the HAL and calling a bunch of the GH100
methods directly, such as kbusEnableStaticBar1Mapping_GH100, which
maps the entire VRAM into BAR1. This mostly just works, but for some
reason we had to disable the use of that region in the MapAperture
function. That shouldn't matter.

[ 3491.654009] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
[ 3491.793389] NVRM: kbusEnableStaticBar1Mapping_GH100: Static bar1 mapped offset 0x0 size 0x5e9200000
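
As a quick sanity check on the log line above, the mapped size can be
converted to MiB and GiB and compared against the 32G BAR1 reported
by lspci. This is only arithmetic on the two numbers already shown;
the variable names are illustrative.

```python
# Values copied from the dmesg and lspci output above.
STATIC_MAP_SIZE = 0x5E9200000  # bytes mapped by kbusEnableStaticBar1Mapping_GH100
BAR1_SIZE = 32 << 30           # the 32G region reported by lspci

MIB = 1 << 20
GIB = 1 << 30

mapped_mib = STATIC_MAP_SIZE // MIB
mapped_gib = STATIC_MAP_SIZE / GIB
fits_in_bar1 = STATIC_MAP_SIZE <= BAR1_SIZE

# Prints: mapped: 24210 MiB (23.64 GiB), fits in BAR1: True
print(f"mapped: {mapped_mib} MiB ({mapped_gib:.2f} GiB), fits in BAR1: {fits_in_bar1}")
```

The whole mapping fits inside BAR1 with room to spare, which is what
makes a static one-to-one mapping of VRAM possible here.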

Perfect, we now have the VRAM mapped. However, it's not that easy to
get P2P. When you run ./simpleP2P from cuda-samples, you get this
error.

[ 3742.840689] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU2 and GPU3
[ 3742.840762] NVRM: kbusCreateP2PMappingForBar1P2P_GH100: added PCIe BAR1 P2P mapping between GPU3 and GPU2
[ 3742.841089] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841106] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.841281] NVRM: nvAssertFailed: Assertion failed: (shifted >> pField->shift) == value @ field_desc.h:272
[ 3742.841292] NVRM: nvAssertFailed: Assertion failed: (shifted & pField->maskPos) == shifted @ field_desc.h:273
[ 3742.865948] NVRM: GPU at PCI:0000:01:00: GPU-49c7a6c9-e3a8-3b48-f0ba-171520d77dd1
[ 3742.865956] NVRM: Xid (PCI:0000:01:00): 31, pid=21804, name=simpleP2P, Ch 00000013, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7f97_94000000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED_KIND ACCESS_TYPE_VIRT_WRITE

It fails with an MMU fault. Diving into this, you find that it's
using GMMU_APERTURE_PEER as the mapping type, which doesn't seem to
be supported on the 4090. The types that are supported are
GMMU_APERTURE_VIDEO, GMMU_APERTURE_SYS_NONCOH, and
GMMU_APERTURE_SYS_COH. We don't care about being coherent with the
CPU's L2 cache, but the access does have to go out over the PCIe bus,
so we rewrite GMMU_APERTURE_PEER to GMMU_APERTURE_SYS_NONCOH. We also
no longer set the peer id, which was corrupting the page table.
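
The aperture rewrite can be modelled in a few lines of Python. This
is a toy sketch of the decision described above, not the driver code:
the Aperture enum only mirrors the GMMU_APERTURE_* names, and
rewrite_aperture and pte_fields are hypothetical helpers invented for
illustration.

```python
from enum import Enum


class Aperture(Enum):
    # Values mirror the GMMU_APERTURE_* names in the text; the enum is illustrative.
    VIDEO = "GMMU_APERTURE_VIDEO"
    SYS_NONCOH = "GMMU_APERTURE_SYS_NONCOH"
    SYS_COH = "GMMU_APERTURE_SYS_COH"
    PEER = "GMMU_APERTURE_PEER"


def rewrite_aperture(requested: Aperture) -> Aperture:
    """Map PEER to SYS_NONCOH; leave every other aperture unchanged."""
    return Aperture.SYS_NONCOH if requested is Aperture.PEER else requested


def pte_fields(requested: Aperture, peer_id: int) -> dict:
    """Hypothetical PTE description: no peer id is written after the rewrite."""
    aperture = rewrite_aperture(requested)
    fields = {"aperture": aperture.value}
    if aperture is Aperture.PEER:  # never true once PEER has been rewritten
        fields["peer_id"] = peer_id
    return fields


print(pte_fields(Aperture.PEER, peer_id=1))  # {'aperture': 'GMMU_APERTURE_SYS_NONCOH'}
```

The point of the model is that both changes go together: once the
mapping is no longer a peer mapping, there is no peer id left to set.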

cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.21GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000

Progress! ./simpleP2P now runs, but the copy isn't happening; the
address is likely wrong. It turns out there is a separate field for
the peer address called fldAddrPeer, so we change that to
fldAddrSysmem. We also print out the addresses and note that the
physical BAR address isn't being added properly. The driver provides
a field, fabricBaseAddress, for GMMU_APERTURE_PEER, so we reuse it
and put the BAR1 base address in there.
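
The address fix amounts to adding the peer's BAR1 base to the offset
within its VRAM. The sketch below shows that arithmetic using the
BAR1 base from the lspci output and the static mapping size from the
dmesg output above. It assumes the static mapping starts at BAR1
offset 0x0, as the log reports; peer_bus_address is a hypothetical
helper, not a driver function.

```python
# Values copied from the lspci and dmesg output above.
BAR1_BASE = 0x28800000000      # physical address of the 32G BAR1 region
STATIC_MAP_SIZE = 0x5E9200000  # VRAM bytes statically mapped at BAR1 offset 0x0


def peer_bus_address(vram_offset, bar1_base=BAR1_BASE, mapped_size=STATIC_MAP_SIZE):
    """Return the bus address a peer would target for a given VRAM offset."""
    if not 0 <= vram_offset < mapped_size:
        raise ValueError(f"offset {vram_offset:#x} is outside the static mapping")
    return bar1_base + vram_offset


print(hex(peer_bus_address(0x1000)))  # 0x28800001000
```

Without the base added, the page table entry would point at the bare
VRAM offset in system address space, which matches the symptom of the
copy silently not arriving.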

That's it. Thanks to NVIDIA for writing such a stable driver. And
with this, the tinybox green is even better.

~ the tiny corp

Functional

 

Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.44GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

Fast

 

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5
     0 919.39  50.11  50.15  51.22  50.59  51.22
     1  50.19 921.29  50.31  51.21  50.62  51.22
     2  50.23  50.55 921.83  51.22  50.39  51.22
     3  50.33  50.65  51.20 920.20  50.43  51.22
     4  50.18  50.68  50.26  51.22 922.30  51.23
     5  50.12  50.09  50.44  51.22  51.21 921.29
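
To summarize the matrix, the diagonal (each GPU copying to itself)
can be separated from the off-diagonal entries, which are the actual
peer-to-peer links. A minimal sketch, with the rows copied from the
table above and illustrative helper names:

```python
# Rows copied from the bandwidth matrix above (first column is the device index).
MATRIX_TEXT = """\
     0 919.39  50.11  50.15  51.22  50.59  51.22
     1  50.19 921.29  50.31  51.21  50.62  51.22
     2  50.23  50.55 921.83  51.22  50.39  51.22
     3  50.33  50.65  51.20 920.20  50.43  51.22
     4  50.18  50.68  50.26  51.22 922.30  51.23
     5  50.12  50.09  50.44  51.22  51.21 921.29
"""


def parse_matrix(text):
    """Parse the rows into floats, dropping the leading device index."""
    return [[float(cell) for cell in line.split()[1:]]
            for line in text.strip().splitlines()]


def off_diagonal(matrix):
    """Return every entry where source and destination GPU differ."""
    return [value
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if i != j]


matrix = parse_matrix(MATRIX_TEXT)
peer = off_diagonal(matrix)
mean_peer = sum(peer) / len(peer)
print(f"peer links: {len(peer)}, min {min(peer):.2f}, "
      f"max {max(peer):.2f}, mean {mean_peer:.2f} GB/s")
```

The spread between the slowest and fastest pair is small, so no pair
of GPUs in this box is falling back to a much slower path.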

And NCCL (aka torch) compatible!

 

tiny@tiny14:~/build/nccl-tests/build$ ./all_reduce_perf -g 6
# nThread 1 nGpus 6 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  26230 on     tiny14 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid  26230 on     tiny14 device  1 [0x42] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid  26230 on     tiny14 device  2 [0x81] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid  26230 on     tiny14 device  3 [0x82] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid  26230 on     tiny14 device  4 [0xc1] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid  26230 on     tiny14 device  5 [0xc2] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1   2275.1   14.75   24.58      0   2282.5   14.70   24.50      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.5413
#
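
The algbw and busbw columns can be reproduced from the size and time
columns. The sketch below uses the definitions given in the
nccl-tests performance notes: algbw is bytes divided by time, and for
all-reduce busbw is algbw scaled by 2(n-1)/n, with n the number of
ranks. The function name is illustrative.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce of size_bytes."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Out-of-place row from the output above: 33554432 bytes in 2275.1 us on 6 GPUs.
algbw, busbw = allreduce_bandwidth(33554432, 2275.1, 6)
print(f"algbw {algbw:.2f} GB/s, busbw {busbw:.2f} GB/s")  # algbw 14.75 GB/s, busbw 24.58 GB/s
```

The bus bandwidth is the figure to compare against the simpleP2P
number earlier, since it corrects for the amount of data each rank
actually moves during the all-reduce.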
