https://salykova.github.io/matmul-cpu

Beating NumPy's matrix multiplication in 150 lines of C code

Jul 1, 2024 · Aman Salykov

TL;DR The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.

[benchmark_mt]

The code is efficiently parallelized with just 3 lines of OpenMP directives, which keeps it both scalable and easy to understand. The implementation hasn't been tested on other CPUs, so I would appreciate feedback on its performance on your hardware. Although the code is portable and targets Intel Core and AMD Zen CPUs with FMA3 and AVX instructions (i.e., all modern Intel Core and AMD Zen CPUs), please don't expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel, and block sizes, unless you are running it on a Ryzen 7700(X). Additionally, on some Intel CPUs, the OpenBLAS implementation might be notably faster due to AVX-512 instructions, which were intentionally omitted here to support a broader range of processors.

Throughout this tutorial, we'll implement matrix multiplication from scratch, learning how to optimize and parallelize C code using matrix multiplication as an example. This is my first time writing a blog post. If you enjoy it, please subscribe and share it! I would be happy to hear feedback from all of you. This is the first part of my planned two-part blog series. In the second part, we will learn how to optimize matrix multiplication on GPUs. Stay tuned!

Intro

Matrix multiplication is an essential part of nearly all modern neural networks. For example, most of the time spent during inference in Transformers is actually taken up by matrix multiplications. Despite using matmul daily in PyTorch, NumPy, or JAX, I've never really thought about how it is designed and implemented to maximize hardware efficiency. To achieve such speeds, NumPy, for instance, relies on external BLAS (Basic Linear Algebra Subprograms) libraries. These libraries implement highly optimized common linear algebra operations such as dot product, matrix multiplication, vector addition, and scalar multiplication. Examples of BLAS implementations include:

1. Intel MKL - optimized for Intel CPUs
2. Accelerate - optimized for Apple CPUs
3. BLIS - open-source, multi-vendor support
4. GotoBLAS - open-source, multi-vendor support
5. OpenBLAS - open-source, based on GotoBLAS
6. etc.

If you look at the OpenBLAS code, you'll notice it's a mix of C and low-level assembly code. In fact, OpenBLAS, GotoBLAS, and BLIS are all written in C/FORTRAN/Assembly and contain matmul implementations handcrafted for different CPU types. During runtime, the appropriate function is called depending on the detected CPU device. I challenged myself and asked whether it is possible to write a high-performance matmul (across a wide range of matrix sizes) without diving deep into Assembly and Fortran code, at least for my CPU. After some searching on the internet, I found a few exciting and educational step-by-step tutorials on how to implement fast matmul from scratch, covering both theoretical and practical aspects:
1. Fast Multidimensional Matrix Multiplication on CPU from Scratch by Simon Boehm.
2. Matrix Multiplication by Sergey Slotin.
3. Geohot's famous stream "Can you multiply a matrix?"

I highly recommend checking out these well-written and well-spoken tutorials with alternative matmul implementations. They helped me better understand the topic and, in some sense, motivated me to write a different implementation. Why? The reason is that all three solutions above work only for specific matrix sizes and do not achieve NumPy's multi-threaded speed (except for Geohot's implementation, which is comparable to NumPy in terms of speed but again works only for specific matrix sizes and requires an extra preswizzle step, resulting in a full copy of one of the input matrices). So, I wasn't satisfied with the results and continued researching until I stumbled across two fascinating papers: "Anatomy of High-Performance Matrix Multiplication" and "Anatomy of High-Performance Many-Threaded Matrix Multiplication". The former presents the BLAS implementation known as GotoBLAS, developed by Kazushige Goto. The latter briefly reviews the design of the matmul operation used in BLIS (an extended version of GotoBLAS) and discusses different parallelization possibilities for the matmul algorithm. After reading these papers, I felt that the BLIS matmul design could potentially achieve all my goals:

* NumPy-like multi-threaded performance across a broad range of matrix sizes
* Simple, portable and scalable C code
* Support for a wide variety of processors

In the next sections, we will implement the algorithm from the paper and compare it against NumPy.

NumPy Performance

By default, if installed via pip, numpy uses OpenBLAS on AMD CPUs. Therefore, throughout this tutorial I will use numpy and OpenBLAS interchangeably. Before performing any benchmarks, it's always good practice to specify your hardware specs and development environment to ensure the results can be reproduced:

* CPU: Ryzen 7 7700, 8 cores, 16 threads
  + Frequency: 3.8 GHz
  + Turbo frequency: 5.3 GHz
  + L1 cache: 64 KB (per core)
  + L2 cache: 1 MB (per core)
  + L3 cache: 32 MB (shared), 16-way associative
* RAM: 32 GB DDR5 6000 MHz CL36
* NumPy 1.26.4
* Compiler: clang-17
* Compiler flags: -O2 -mno-avx512f -march=native
* OS: Ubuntu 22.04.4 LTS

To multiply two float32 matrices A of shape \(M \times K\) and B of shape \(K \times N\), for each element of the resulting matrix C of shape \(M \times N\) we need to calculate the dot product between a row of A and a column of B. This results in \(K\) additions + \(K\) multiplications = \(2K\) floating point operations (FLOP) per element of matrix C, or \(2KMN\) FLOP in total.

[matmul_nai]

We will measure performance in FLOP per second (FLOP/s = FLOPS). In Python, this can be done as follows:

```python
import numpy as np
import time

M, N, K = 1000, 1000, 1000  # example sizes

A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

FLOP = 2 * K * M * N

start = time.perf_counter()
C = A @ B
end = time.perf_counter()

exec_time = end - start
FLOPS = FLOP / exec_time
GFLOPS = FLOPS / 1e9
```

Important! When benchmarking code, try to minimize the number of running tasks, especially when measuring multi-threaded code. Results obtained on Windows are usually lower than on Linux.

To benchmark numpy's matmul, we will use benchmark_numpy.py, which executes the code snippet above for different matrix sizes in a loop and measures peak/average FLOPS.
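As a quick sanity check of the FLOP count formula above (the timing here is purely hypothetical): for square matrices with \(M = N = K = 1000\),

\[
\text{FLOP} = 2KMN = 2 \cdot 1000^3 = 2 \times 10^9,
\]

so a multiplication that took, say, 20 ms would correspond to \(2 \times 10^9 / 0.02\,\text{s} = 100\) GFLOPS.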
By default, numpy will use all available cores; however, we can easily change this by setting environment variables before importing numpy and matplotlib:

```python
import os

os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import matplotlib.pyplot as plt
```

To measure numpy's matmul performance, run python benchmark_numpy.py -NITER=200 -ST -SAVEFIG for the single-threaded benchmark and python benchmark_numpy.py -NITER=500 -SAVEFIG for the multi-threaded benchmark. On my machine I got the following results:

[benchmark_np_mt]
[benchmark_np_st]

How close are we to the theoretical upper limit achievable on the CPU?

Theoretical Limit

Recall the computer's memory hierarchy (for now, ignore the layers between registers and RAM; we will discuss them later).

[cpu_mem_no_cache]

To perform arithmetic operations on data stored in RAM (off-chip memory, slow and large), the data must first be transferred to the CPU and eventually stored in CPU registers (on-chip memory, fast and small). Modern x86 CPUs support SIMD (Single Instruction Multiple Data) extensions, which allow multiple pieces of data to be processed in parallel. There are various SIMD extensions, but the ones relevant to our discussion are Advanced Vector Extensions (AVX) and Fused Multiply-Add (FMA). Both AVX and FMA operate on data stored in special 256-bit YMM registers. Each YMM register can hold up to 256/32 = 8 packed single-precision (32-bit) floats. The FMA extension allows a multiply-add operation to be performed in one step on data stored in YMM registers. The corresponding assembly instruction is called VFMADD213PS (PS stands for Packed Single). It takes three registers (YMM1, YMM2, YMM3) as input, calculates YMM2 * YMM1 + YMM3 and stores the result in YMM1; the digits "213" encode the order in which the operands enter the multiply-add (there are also vfmadd132ps and vfmadd231ps variants).

[fmadd]

According to the Intel Intrinsics Guide or https://uops.info/table.html, the throughput (TP) of fused multiply-add is 0.5 cycles/instruction, i.e. 2 instructions/cycle:

[fmadd_uops]

Theoretically, the CPU can therefore execute 32 FLOP per cycle = 8 (floats in a YMM register) * 2 (add + mul) * 2 (instructions per cycle = 1/TP). On my machine, the CPU boosts up to 5.1 GHz in single-threaded tasks and up to 4.7 GHz in multi-threaded tasks. Therefore, a rough estimate of the maximum achievable FLOPS is 5.1 GHz * 32 FLOP/cycle = 163 GFLOPS for single-threaded matmul and 4.7 GHz * 32 FLOP/cycle * 8 cores = 1203 GFLOPS for multi-threaded matmul. Starting from \(M=N=K=1000\), numpy reaches on average 92% of the theoretical maximum single-threaded performance and 85% of the multi-threaded one. Can we compete with NumPy using plain C code, without thousands of lines of low-level assembly?

Naive Implementation

Without loss of generality, in this implementation we assume that matrices are stored in column-major order. A matrix A of shape MxN is stored as a contiguous array of length M*N, and an element A[row][col] is accessed through a raw C pointer as ptr[col*M + row], where 0 <= col <= N-1 and 0 <= row <= M-1.

[mem_layout]
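As a small, self-contained illustration of this convention (the IDX macro and the 2x3 example below are introduced purely for illustration and are not part of matmul.c):

```c
#include <stdio.h>

// Column-major indexing: element (row, col) of an M x N matrix lives at col*M + row.
#define IDX(row, col, M) ((col) * (M) + (row))

int main(void) {
    // A 2 x 3 matrix stored column by column:
    // | 1 3 5 |
    // | 2 4 6 |
    float A[6] = {1, 2, 3, 4, 5, 6};
    const int M = 2;

    printf("%.0f\n", A[IDX(0, 1, M)]); // row 0, col 1 -> 3
    printf("%.0f\n", A[IDX(1, 2, M)]); // row 1, col 2 -> 6
    return 0;
}
```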
The naive algorithm [matmul_nai] can be implemented as follows:

```c
void matmul_naive(float* A, float* B, float* C, const int M, const int N, const int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            for (int p = 0; p < K; p++) {
                C[j * M + i] += A[p * M + i] * B[j * K + p];
            }
        }
    }
}
```

We iterate over all rows (first loop) and all columns (second loop) of the matrix C, and for each element of C we calculate the dot product (third loop) between the corresponding row of A and column of B. It's always good to start with a simple and robust implementation that can later be used to test optimized implementations for correctness. The file matmul_naive.c contains this implementation. Running the naive implementation

```bash
clang-17 -O2 -mno-avx512f -march=native matmul_naive.c -o matmul_naive.out && ./matmul_naive.out
```

results in 2.7 GFLOPS on my machine. Nowhere near our target of 1 TFLOPS.

Kernel

Matrix multiplication $C=AB$ can be decomposed into smaller sub-problems. The idea is that if the smaller sub-problems can be solved quickly, then the entire matmul will be fast. We first partition the matrix $C$ of shape $M \times N$ into small sub-matrices of shape $m_R \times n_R$, where $n_R \ll N$ and $m_R \ll M$. To calculate $C=AB$, we iterate over $C$ and compute each of its $m_R \times n_R$ sub-matrices.

[matmul_ker]

The function that calculates these tiny $m_R \times n_R$ sub-matrices $\bar{C}$ of $C$ is called a kernel or micro-kernel. This is the heart of high-performance matrix multiplication. When we say that a matmul algorithm is optimized for a particular CPU architecture, it often involves kernel optimization. For example, in the BLIS library, the kernels optimized for different processor types can be found under kernels. Let's take a closer look at the kernel.

[kernel]

To calculate an $m_R \times n_R$ sub-matrix $\bar{C}$ of matrix $C$, we multiply a matrix $\bar{A}$ of size $m_R \times K$ and a matrix $\bar{B}$ of size $K \times n_R$. If we do this in a naive manner using dot products, we would need to fetch $2K$ elements (one dot product) from RAM to calculate a single element of $\bar{C}$, or $2K m_R n_R$ elements in total to calculate $\bar{C}$. However, there is an alternative strategy that reduces the number of fetched elements. We first load matrix $\bar{C}$ into SIMD (=YMM) registers (note that we can do this because both $n_R$ and $m_R$ are small). The subscript $R$ in $n_R$ and $m_R$ stands for "registers". Then we iterate over $K$, and in each iteration we load 1 column of $\bar{A}$ and 1 row of $\bar{B}$ into YMM registers (again, note that both the row and the column vectors are small and fit in the registers). Finally, we perform matrix multiplication between the column and the row vectors to update the matrix $\bar{C}$. After $K$ iterations (= rank-1 updates), the matrix $\bar{C}$ is fully computed.

[kernel_ran]

Example of matrix multiplication between a column and a row vector: each column of the resulting matrix is computed by multiplying the vector $\mathbf{u}$ with a scalar element of the row vector.

[outer_prod]

Overall we fetched $(m_R + n_R)K + m_R n_R \approx (m_R + n_R)K$ elements into registers. Compared to the naive strategy, we reduced the number by a factor of

\[\frac{2m_R n_R K}{(m_R + n_R)K} = \frac{2m_R n_R}{m_R + n_R}.\]

This factor is maximized when both $m_R$ and $n_R$ are large and $m_R = n_R$. In practice, the values of $m_R$ and $n_R$ are limited by the number of available registers.
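For example, for the $16 \times 6$ kernel that we will end up using later in this post, the reduction factor is

\[
\frac{2 \cdot 16 \cdot 6}{16 + 6} = \frac{192}{22} \approx 8.7,
\]

i.e. roughly 8.7 times fewer elements are fetched into registers than with the plain dot-product approach. A square $11 \times 11$ kernel would give a factor of 11, but, as we will see next, the register count and the 8-float vector width favor a non-square shape.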
Now, let's explore how a rank-1 update can be implemented using SIMD instructions. Each rank-1 update is a matrix multiplication between a column of $\bar{A}$ and a row of $\bar{B}$. Note how each individual column of $\bar{C}$ is updated via a scalar-vector multiplication between a column of $\bar{A}$ and the corresponding scalar element of a row of $\bar{B}$. Thanks to the FMA extension, the update + scalar-vector multiplication can be calculated efficiently with a single fused multiply-add instruction. Before executing the FMA instruction, we only need to broadcast the scalar element of the row of $\bar{B}$ to a vector and load it into a YMM register. The parameter $m_R$ determines how many elements are stored in the column vectors of $\bar{C}$ and $\bar{A}$, and how many YMM registers we need for this. Since each YMM register can store up to 8 floats, we assume that $m_R$ is a multiple of 8 (8, 16, 24, 32, ...) and the elements in column vectors are packed into blocks of size 8. The number of YMM registers required to store a column vector is then $m_R / 8$. Note that we don't need additional YMM registers for the broadcasted vector of $\bar{B}$, since the same 8-float vector (YMM register) can be reused to update all 8-float blocks of a column of $\bar{C}$.

[kernel_registers]

Thus, the complete algorithm for a single rank-1 update of the matrix $\bar{C}$ is as follows:

1. Load matrix $\bar{C}$ into YMM registers.
2. Load the column vector of matrix $\bar{A}$.
3. Set n = 1.
4. Load the n-th scalar element of the row of $\bar{B}$, broadcast it to a vector and place it into a YMM register.
5. Update the n-th column of $\bar{C}$ via fused multiply-add.
6. Increment n by 1.
7. Repeat steps 4-6 until all columns of $\bar{C}$ are updated.

The last thing we need to discuss before implementing the kernel in C is how to choose the kernel size, i.e. $m_R$ and $n_R$. CPUs that support AVX instructions have 16 YMM registers. From our previous observations, we know that we need $n_R m_R / 8$ registers to store the matrix $\bar{C}$, $m_R/8$ registers to store a column vector of $\bar{A}$, and 1 register for the broadcasted vector of $\bar{B}$. We want $m_R$ and $n_R$ to be as large as possible while satisfying the following conditions:

* $n_R m_R/8 + m_R/8 + 1 \le 16$
* $m_R$ is a multiple of 8

In theory we also want $m_R \approx n_R$ to minimize the number of fetched elements. In practice, however, I've found that the non-square $m_R \times n_R = 16 \times 6$ kernel shows the best results on my machine. You are free to try out different kernel sizes, for example $8 \times 12$, $8 \times 13$ or $8 \times 14$, and compare the performance on your CPU.
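As a quick check of the register budget for two of these candidates:

\[
16 \times 6: \;\; \frac{6 \cdot 16}{8} + \frac{16}{8} + 1 = 12 + 2 + 1 = 15 \le 16,
\qquad
8 \times 14: \;\; \frac{14 \cdot 8}{8} + \frac{8}{8} + 1 = 14 + 1 + 1 = 16 \le 16.
\]

Both fit into the 16 available YMM registers; the $16 \times 6$ variant simply spends more registers on rows of $\bar{C}$ and fewer on columns.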
Let's implement the $16 \times 6$ kernel in C. The code can be found in matmul_kernel.c. To use the SIMD intrinsics, we need to include the immintrin.h header:

```c
#include <immintrin.h>
```

The kernel function is declared as follows:

```c
void kernel_16x6(float* A, float* B, float* C, const int M, const int N, const int K);
```

The function takes as input the 3 matrices plus their dimensions and calculates a $16 \times 6$ sub-matrix $\bar{C}$ of $C$. Inside the function, we first declare the variables that reside in YMM registers:

```c
__m256 C_buffer[2][6];
__m256 b_packFloat8;
__m256 a0_packFloat8;
__m256 a1_packFloat8;
```

The __m256 datatype is a vector of 8 floats (8 x 32 = 256 bits) that resides in a YMM register. C_buffer is the 16x6 sub-matrix of $C$ stored in YMM registers; its first dimension is 2 because we need 16/8 = 2 registers to store 16 elements. a0_packFloat8 and a1_packFloat8 hold a 16-element column vector of $\bar{A}$ (again, two registers for 16 elements), and b_packFloat8 holds the broadcasted scalar from a row of $\bar{B}$.

Next, we load the sub-matrix $\bar{C}$ into YMM registers:

```c
for (int j = 0; j < 6; j++) {
    C_buffer[0][j] = _mm256_loadu_ps(&C[j * M]);
    C_buffer[1][j] = _mm256_loadu_ps(&C[j * M + 8]);
}
```

SIMD C functions are well documented and can be found in the Intel Intrinsics Guide, for example _mm256_loadu_ps:

[intel_intr]

In the next step, we iterate over K and, in each iteration, load a column vector of $\bar{A}$, broadcast a scalar value of $\bar{B}$ to a vector, and perform a fused multiply-add operation to update 1 column of C_buffer:

```c
for (int p = 0; p < K; p++) {
    a0_packFloat8 = _mm256_loadu_ps(&A[p * M]);
    a1_packFloat8 = _mm256_loadu_ps(&A[p * M + 8]);

    b_packFloat8 = _mm256_broadcast_ss(&B[p]);
    C_buffer[0][0] = _mm256_fmadd_ps(a0_packFloat8, b_packFloat8, C_buffer[0][0]);
    C_buffer[1][0] = _mm256_fmadd_ps(a1_packFloat8, b_packFloat8, C_buffer[1][0]);
    ...
}
```

We then repeat the broadcast + two FMAs for the remaining 5 columns. The loop over the 6 columns of C_buffer is unrolled manually so that clang can optimize the code. Finally, we write the sub-matrix C_buffer back to C:

```c
for (int j = 0; j < 6; j++) {
    _mm256_storeu_ps(&C[j * M], C_buffer[0][j]);
    _mm256_storeu_ps(&C[j * M + 8], C_buffer[1][j]);
}
```

To perform the full matrix multiplication, we simply iterate over the matrix $C$ and apply the kernel to its sub-matrices:

```c
#define MR 16
#define NR 6

void matmul_kernel(float* A, float* B, float* C, const int M, const int N, const int K) {
    assert(M % MR == 0);
    assert(N % NR == 0);
    for (int i = 0; i < M; i += MR) {
        for (int j = 0; j < N; j += NR) {
            kernel_16x6(&A[i], &B[j * K], &C[j * M + i], M, N, K);
        }
    }
}
```

Running the new implementation

```bash
clang-17 -O2 -mno-avx512f -march=native -DTEST -DNITER=100 matmul_kernel.c -o matmul_kernel.out && ./matmul_kernel.out
```

results in 147 GFLOPS - a gigantic gain compared to the initial 2.7 GFLOPS. Additionally, we can check the assembly code produced by the compiler via

```bash
clang-17 -O2 -mno-avx512f -march=native matmul_kernel.c -S > matmul_kernel.txt
```

to ensure that the SIMD instructions and the YMM registers are actually used:

```asm
vbroadcastss    (%rsi,%rbp,4), %ymm14
vbroadcastss    (%rbx,%rbp,4), %ymm15
vfmadd231ps     %ymm14, %ymm12, %ymm3   # ymm3 = (ymm12 * ymm14) + ymm3
vfmadd231ps     %ymm14, %ymm13, %ymm1   # ymm1 = (ymm13 * ymm14) + ymm1
vbroadcastss    (%r13,%rbp,4), %ymm14
vfmadd231ps     %ymm12, %ymm15, %ymm11  # ymm11 = (ymm15 * ymm12) + ymm11
vfmadd231ps     %ymm15, %ymm13, %ymm10  # ymm10 = (ymm13 * ymm15) + ymm10
vfmadd231ps     %ymm14, %ymm12, %ymm2   # ymm2 = (ymm12 * ymm14) + ymm2
vfmadd231ps     %ymm14, %ymm13, %ymm0   # ymm0 = (ymm13 * ymm14) + ymm0
vbroadcastss    (%r12,%rbp,4), %ymm14
vfmadd231ps     %ymm14, %ymm12, %ymm5   # ymm5 = (ymm12 * ymm14) + ymm5
vfmadd231ps     %ymm14, %ymm13, %ymm4   # ymm4 = (ymm13 * ymm14) + ymm4
vbroadcastss    (%r15,%rbp,4), %ymm14
vfmadd231ps     %ymm14, %ymm12, %ymm7   # ymm7 = (ymm12 * ymm14) + ymm7
vfmadd231ps     %ymm14, %ymm13, %ymm6   # ymm6 = (ymm13 * ymm14) + ymm6
vbroadcastss    (%r14,%rbp,4), %ymm14
vfmadd231ps     %ymm14, %ymm12, %ymm9   # ymm9 = (ymm12 * ymm14) + ymm9
vfmadd231ps     %ymm14, %ymm13, %ymm8   # ymm8 = (ymm13 * ymm14) + ymm8
```
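Before handling arbitrary matrix sizes, it's worth verifying the kernel-based version against the naive one, as suggested earlier. Below is a minimal sketch of such a check (it is not part of matmul.c; it assumes matmul_naive and matmul_kernel are compiled together with it, and the sizes and tolerance are arbitrary):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Assumed to be provided by the implementations shown above.
void matmul_naive(float* A, float* B, float* C, const int M, const int N, const int K);
void matmul_kernel(float* A, float* B, float* C, const int M, const int N, const int K);

int main(void) {
    // The kernel-only version requires M and N to be multiples of MR=16 and NR=6.
    const int M = 16 * 20, N = 6 * 50, K = 500;

    float* A = malloc(sizeof(float) * M * K);
    float* B = malloc(sizeof(float) * K * N);
    float* C_ref = calloc(M * N, sizeof(float)); // zero-initialized accumulators
    float* C_opt = calloc(M * N, sizeof(float));

    for (int i = 0; i < M * K; i++) A[i] = (float)rand() / RAND_MAX - 0.5f;
    for (int i = 0; i < K * N; i++) B[i] = (float)rand() / RAND_MAX - 0.5f;

    matmul_naive(A, B, C_ref, M, N, K);
    matmul_kernel(A, B, C_opt, M, N, K);

    // The two results should match up to float rounding error.
    float max_diff = 0.0f;
    for (int i = 0; i < M * N; i++) {
        float diff = fabsf(C_ref[i] - C_opt[i]);
        if (diff > max_diff) max_diff = diff;
    }
    printf("max abs difference: %e\n", max_diff);

    free(A); free(B); free(C_ref); free(C_opt);
    return 0;
}
```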
Masking And Packing

You might notice that the current kernel implementation works only for matrix sizes that are multiples of $m_R$ and $n_R$. To make the algorithm work for arbitrary matrix sizes, we need to handle the edge cases where the kernel doesn't fully overlap with matrix $C$.

[kernel_mas]

First of all, when loading and storing the elements of $C$, we should pick only the elements within the matrix boundary. The case where the number of overlapped columns $n$ is less than $n_R$ is straightforward - we simply iterate over the $n$ columns within the $C$ boundary:

# n - number of overlapped columns within C boundary # "j