# GPU Puzzles

**Solve puzzles. Learn CUDA.**

- by Sasha Rush - srush_nlp

GPU architectures are critical to machine learning, and they seem to become more important every day. However, you can be an expert in machine learning without ever touching GPU code, and it is hard to build intuition when you only ever work through abstractions.

This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels. The exercises use NUMBA, which maps Python code directly to CUDA kernels. It looks like Python, but it is basically identical to writing low-level CUDA code. In a few hours, I think you can go from the basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here: the NUMBA CUDA Guide.
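To make "maps Python directly to CUDA" concrete before the puzzles start, here is a minimal standalone Numba kernel. This is an orientation sketch, not part of the repository; the kernel name `add_ten` and the launch configuration are illustrative. The puzzles below hide this boilerplate behind a small `CudaProblem` harness so that you only ever write the kernel body.

```python
# Illustrative example (not from the repo). Requires a CUDA-capable GPU,
# e.g. a Colab GPU runtime.
from numba import cuda
import numpy as np


@cuda.jit
def add_ten(out, a, n):
    # One thread per position; each thread computes one output element.
    i = cuda.threadIdx.x
    if i < n:
        out[i] = a[i] + 10


a = np.arange(4, dtype=np.float32)
out = np.zeros_like(a)
add_ten[1, 4](out, a, 4)  # launch 1 block of 4 threads
print(out)                # [10. 11. 12. 13.]
```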
I recommend doing these puzzles in Colab, as it is easy to get started. Be sure to make your own copy, turn on GPU mode in the settings (Runtime / Change runtime type, then set Hardware accelerator to GPU), and then get to coding.

(If you are into this style of puzzle, also check out my Tensor Puzzles for PyTorch.)

Walkthrough Guide

```python
!pip install -qqq git+https://github.com/danoneata/chalk@srush-patch-1
!wget -q https://github.com/srush/GPU-Puzzles/raw/main/robot.png https://github.com/srush/GPU-Puzzles/raw/main/lib.py

import numba
import numpy as np
import warnings
from lib import CudaProblem, Coord

warnings.filterwarnings(
    action="ignore", category=numba.NumbaPerformanceWarning, module="numba"
)
```

## Puzzle 1: Map

Implement a "kernel" (GPU function) that adds 10 to each position of vector `a` and stores it in vector `out`. You have 1 thread per position.

**Warning:** This code looks like Python, but it is really CUDA! You cannot use standard Python tools like list comprehensions or ask for NumPy properties like `shape` or `size` (if you need the size, it is given as an argument). The puzzles only require simple operations: basically `+`, `*`, simple array indexing, `for` loops, and `if` statements. You are allowed to use local variables. If you get an error, it is probably because you did something fancy :).

*Tip: Think of the function `call` as being run 1 time for each thread. The only difference is that `cuda.threadIdx.x` changes each time.*

```python
def map_spec(a):
    return a + 10


def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()
```

```
# Map

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
```

## Puzzle 2 - Zip

Implement a kernel that adds together each position of `a` and `b` and stores it in `out`. You have 1 thread per position.

```python
def zip_spec(a, b):
    return a + b


def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()
```

```
# Zip

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [0 2 4 6]
```

## Puzzle 3 - Guards

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have more threads than positions.

```python
def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()
```

```
# Guard

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
```
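The three stubs above only need their `# FILL ME IN` bodies. If you want to check your work, here is one possible set of bodies (spoilers ahead; these are unofficial sketches, not the repository's solutions):

```python
# Puzzle 1 - Map: one thread per position, so the thread index
# is also the array index.
out[local_i] = a[local_i] + 10

# Puzzle 2 - Zip: same indexing, two reads instead of one.
out[local_i] = a[local_i] + b[local_i]

# Puzzle 3 - Guards: there are more threads than positions, so the
# extra threads must be guarded off.
if local_i < size:
    out[local_i] = a[local_i] + 10
```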
## Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. Input `a` is 2D and square. You have more threads than positions.

```python
def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()
```

```
# Map 2D

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[10 11]
 [12 13]]
```

## Puzzle 5 - Broadcast

Implement a kernel that adds `a` and `b` and stores it in `out`. Inputs `a` and `b` are vectors. You have more threads than positions.

```python
def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()
```

```
# Broadcast

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[0 1]
 [1 2]]
```

## Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have fewer threads per block than the size of `a`.

*Tip: A block is a group of threads. The number of threads per block is limited, but we can have many different blocks. Variable `cuda.blockIdx` tells us what block we are in.*

```python
def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()
```

```
# Blocks

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [10 11 12 13 14 15 16 17 18]
```

## Puzzle 7 - Blocks 2D

Implement the same kernel in 2D. You have fewer threads per block than the size of `a` in both directions.

```python
def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)

    return call


SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))
problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()
```

```
# Blocks 2D

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Spec : [[11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]]
```
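Possible bodies for Puzzles 4-7 (spoilers; unofficial sketches that drop into the `# FILL ME IN` slots above):

```python
# Puzzle 4 - Map 2D: guard both dimensions.
if local_i < size and local_j < size:
    out[local_i, local_j] = a[local_i, local_j] + 10

# Puzzle 5 - Broadcast: a is a column (size x 1), b is a row (1 x size),
# so index the singleton dimension with 0.
if local_i < size and local_j < size:
    out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

# Puzzle 6 - Blocks: i already combines block and thread index;
# guard as before.
if i < size:
    out[i] = a[i] + 10

# Puzzle 7 - Blocks 2D: build the second global index the same way,
# then guard both.
j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
if i < size and j < size:
    out[i, j] = a[i, j] + 10
```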
## Puzzle 8 - Shared

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have fewer threads per block than the size of `a`.

**Warning:** Each block can only have a *constant* amount of shared memory that threads in that block can read and write to. This needs to be a literal Python constant, not a variable. After writing to shared memory, you need to call `cuda.syncthreads` to ensure that threads do not cross. (This example does not really need shared memory or `syncthreads`, but it is a demo.)

```python
TPB = 4


def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()
```

```
# Shared

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             0 |            0 |             1 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [11. 11. 11. 11. 11. 11. 11. 11.]
```

## Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 positions of `a` and stores it in `out`. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

*Tip: Remember to be careful about syncing.*

```python
def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out


TPB = 8


def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()
```

```
# Pooling

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [ 0.  1.  3.  6.  9. 12. 15. 18.]
```

## Puzzle 10 - Dot Product

Implement a kernel that computes the dot-product of `a` and `b` and stores it in `out`. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

*Note: For this problem you don't need to worry about the number of shared reads. We will handle that challenge later.*

```python
def dot_spec(a, b):
    return a @ b


TPB = 8


def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)

    return call


SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()
```

```
# Dot

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0.]
Spec : 140
```
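Possible bodies for Puzzles 8-10 (spoilers; unofficial sketches). Note that Pooling and Dot both launch a single block (`blockspergrid=Coord(1, 1)`), so `local_i` and `i` coincide; the sketches rely on that.

```python
# Puzzle 8 - Shared: every in-range position was already staged into
# shared memory above, so read it back from there.
if i < size:
    out[i] = shared[local_i] + 10

# Puzzle 9 - Pooling: stage once (1 global read), sync, then sum up to
# three neighbors out of shared memory (1 global write). Single block,
# so all neighbors live in this block's shared array.
if i < size:
    shared[local_i] = a[i]
cuda.syncthreads()
if i < size:
    total = shared[local_i]
    if local_i >= 1:
        total += shared[local_i - 1]
    if local_i >= 2:
        total += shared[local_i - 2]
    out[i] = total

# Puzzle 10 - Dot: each thread writes one product into shared memory
# (2 global reads); thread 0 then accumulates and writes the result
# (many shared reads, which this puzzle explicitly allows).
if i < size:
    shared[local_i] = a[i] * b[i]
cuda.syncthreads()
if local_i == 0:
    total = 0.0
    for j in range(size):
        total += shared[j]
    out[0] = total
```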
## Puzzle 11 - 1D Convolution

Implement a kernel that computes a 1D convolution between `a` and `b` and stores it in `out`. You need to handle the general case. You only need 2 global reads and 1 global write per thread.

```python
def conv_spec(a, b):
    out = np.zeros(*a.shape)
    len = b.shape[0]
    for i in range(a.shape[0]):
        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j < a.shape[0]])
    return out


MAX_CONV = 4
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV


def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 17 lines)

    return call


# Test 1
SIZE = 6
CONV = 3
out = np.zeros(SIZE)
a = np.arange(SIZE)
b = np.arange(CONV)
problem = CudaProblem(
    "1D Conv (Simple)",
    conv_test,
    [a, b],
    out,
    [SIZE, CONV],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
```

```
# 1D Conv (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0.]
Spec : [ 5.  8. 11. 14.  5.  0.]
```

```python
# Test 2
out = np.zeros(15)
a = np.arange(15)
b = np.arange(4)
problem = CudaProblem(
    "1D Conv (Full)",
    conv_test,
    [a, b],
    out,
    [15, 4],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
```

```
# 1D Conv (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
```

## Puzzle 12 - Prefix Sum

Implement a kernel that computes a sum over `a` and stores it in `out`. If the size of `a` is greater than the block size, only store the sum of each block. We will do this using the parallel prefix sum algorithm in shared memory. That is, each step of the algorithm should sum together half the remaining numbers. Follow this diagram:

*(Prefix-sum diagram omitted from this text capture; see the image in the repository README.)*

```python
TPB = 8


def sum_spec(a):
    out = np.zeros((a.shape[0] + TPB - 1) // TPB)
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[j] = a[i : i + TPB].sum()
    return out


def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 12 lines)

    return call


# Test 1
SIZE = 8
out = np.zeros(1)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Simple)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Sum (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0.]
Spec : [28.]
```

```python
# Test 2
SIZE = 15
out = np.zeros(2)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Full)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Sum (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0.]
Spec : [28. 77.]
```
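Possible bodies for Puzzles 11 and 12 (spoilers; unofficial sketches). The convolution sketch assumes `TPB >= 2 * MAX_CONV`, which holds for the constants above: threads `b_size` through `2 * b_size - 1` are reused to stage the spill-over of `a` from the next block, so no thread exceeds 2 global reads.

```python
# Puzzle 11 - 1D Conv: stage this block's slice of a, its spill-over,
# and all of b into shared memory, then accumulate.
shared_a = cuda.shared.array(TPB_MAX_CONV, numba.float32)
shared_b = cuda.shared.array(MAX_CONV, numba.float32)
if i < a_size:
    shared_a[local_i] = a[i]
if local_i < b_size:
    shared_b[local_i] = b[local_i]
else:
    # Threads b_size..2*b_size-1 fetch the next block's prefix of a.
    local_j = local_i - b_size
    if local_j < b_size and i - b_size + TPB < a_size:
        shared_a[TPB + local_j] = a[i - b_size + TPB]
cuda.syncthreads()
if i < a_size:
    total = 0.0
    for k in range(b_size):
        if i + k < a_size:
            total += shared_a[local_i + k] * shared_b[k]
    out[i] = total

# Puzzle 12 - Sum: tree reduction in shared memory; each step sums
# together half the remaining numbers (TPB is a power of two here).
if i < size:
    cache[local_i] = a[i]
else:
    cache[local_i] = 0.0
cuda.syncthreads()
dist = TPB // 2
while dist > 0:
    if local_i < dist:
        cache[local_i] += cache[local_i + dist]
    cuda.syncthreads()
    dist //= 2
if local_i == 0:
    out[cuda.blockIdx.x] = cache[0]
```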
## Puzzle 13 - Axis Sum

Implement a kernel that computes a sum over each row of `a` (one batch per row, reducing along the last axis) and stores it in `out`.

```python
TPB = 8


def sum_spec(a):
    out = np.zeros((a.shape[0], (a.shape[1] + TPB - 1) // TPB))
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[..., j] = a[..., i : i + TPB].sum(-1)
    return out


def axis_sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        batch = cuda.blockIdx.y
        # FILL ME IN (roughly 12 lines)

    return call


BATCH = 4
SIZE = 6
out = np.zeros((BATCH, 1))
inp = np.arange(BATCH * SIZE).reshape((BATCH, SIZE))
problem = CudaProblem(
    "Axis Sum",
    axis_sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, BATCH),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Axis Sum

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0.]
 [0.]
 [0.]
 [0.]]
Spec : [[ 15.]
 [ 51.]
 [ 87.]
 [123.]]
```

## Puzzle 14 - Matrix Multiply!

Implement a kernel that multiplies square matrices `a` and `b` and stores the result in `out`.

*Tip: The most efficient algorithm here will copy a block into shared memory before computing each of the individual row-column dot products. This is easy to do if the matrix fits in shared memory. Do that case first. Then update your code to compute a partial dot-product and iteratively move the part you copied into shared memory. You should be able to do the hard case in 6 global reads.*

```python
def matmul_spec(a, b):
    return a @ b


TPB = 3


def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)

    return call


# Test 1
SIZE = 2
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Simple)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
```

```
# Matmul (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[ 1  3]
 [ 3 13]]
```

```python
# Test 2
SIZE = 8
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Full)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(3, 3),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
```

```
# Matmul (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
Spec : [[  140   364   588   812  1036  1260  1484  1708]
 [  364  1100  1836  2572  3308  4044  4780  5516]
 [  588  1836  3084  4332  5580  6828  8076  9324]
 [  812  2572  4332  6092  7852  9612 11372 13132]
 [ 1036  3308  5580  7852 10124 12396 14668 16940]
 [ 1260  4044  6828  9612 12396 15180 17964 20748]
 [ 1484  4780  8076 11372 14668 17964 21260 24556]
 [ 1708  5516  9324 13132 16940 20748 24556 28364]]
```
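Possible bodies for the last two puzzles (spoilers; unofficial sketches). The Axis Sum sketch reuses the Puzzle 12 tree reduction with the y block index selecting the row; the Matmul sketch is one possible tiling that matches the 6-global-read budget for the full test.

```python
# Puzzle 13 - Axis Sum: same tree reduction as Puzzle 12, with batch
# (= cuda.blockIdx.y) selecting which row to reduce.
if i < size:
    cache[local_i] = a[batch, i]
else:
    cache[local_i] = 0.0
cuda.syncthreads()
dist = TPB // 2
while dist > 0:
    if local_i < dist:
        cache[local_i] += cache[local_i + dist]
    cuda.syncthreads()
    dist //= 2
if local_i == 0:
    out[batch, 0] = cache[0]

# Puzzle 14 - Matmul: walk the shared dimension one TPB-wide tile at a
# time. Each pass stages one tile of a and one tile of b (2 global
# reads per thread per pass), accumulates the partial dot product, and
# syncs before the tiles are overwritten. One global write at the end.
total = 0.0
for k_start in range(0, size, TPB):
    if i < size and k_start + local_j < size:
        a_shared[local_i, local_j] = a[i, k_start + local_j]
    if j < size and k_start + local_i < size:
        b_shared[local_i, local_j] = b[k_start + local_i, j]
    cuda.syncthreads()
    for k in range(min(TPB, size - k_start)):
        total += a_shared[local_i, k] * b_shared[k, local_j]
    cuda.syncthreads()
if i < size and j < size:
    out[i, j] = total
```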