[HN Gopher] Beating NumPy matrix multiplication in 150 lines of C
       ___________________________________________________________________
        
       Beating NumPy matrix multiplication in 150 lines of C
        
       Author : p1esk
       Score  : 326 points
       Date   : 2024-07-03 21:52 UTC (1 day ago)
        
 (HTM) web link (salykova.github.io)
 (TXT) w3m dump (salykova.github.io)
        
       | azornathogron wrote:
       | I've only skimmed so far, but this post has a lot of detail and
       | explanation. Looks like a pretty great description of how fast
       | matrix multiplications are implemented to take into account
       | architectural concerns. Goes on my reading list!
        
         | ghghgfdfgh wrote:
          | I don't save articles often, but maybe once every few months I
          | see something I know I'll enjoy reading again even after a year
          | or two. Keep up the great work, OP!
        
       | le-mark wrote:
       | > This is my first time writing a blog post. If you enjoy it,
       | please subscribe and share it!
       | 
        | Great job! Self-publishing things like this was a hallmark of
        | the early internet that I, for one, sorely miss.
        
       | KerrAvon wrote:
        | The article claims this is portable C. Given the use of Intel
        | intrinsics, what happens if you try to compile it for ARM64?
        
         | tempay wrote:
         | I think they mean portable between modern x86 CPUs as opposed
         | to truly portable.
        
         | rurban wrote:
          | You'd first need to convert it to portable SIMD intrinsics.
          | There are several libraries for that.
        
           | pmeira wrote:
           | Apparently it also needs Clang to achieve the same
           | performance: https://news.ycombinator.com/item?id=40875968
        
       | jstrong wrote:
       | in terms of comparing to numpy, how much overhead would there be
       | from Python (vs. running the numpy C code alone)?
        
         | SJC_Hacker wrote:
          | Python can efficiently call C libs if you use ctypes and native
          | pointers, which numpy uses. Of course, it depends on the
          | expected memory layout.
          | 
          | If you want to convert to Python lists, it's going to take
          | time. Not sure about Python arrays.
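          | 
          | For illustration, a minimal sketch of the C side of such a call
          | (the function name matmul_f32 and the naive kernel are
          | illustrative, not the article's code; it assumes row-major,
          | C-contiguous float32 buffers):
          | 
          |     // Plain-C entry point a ctypes wrapper can call directly,
          |     // passing numpy's float32 buffers without copying.
          |     void matmul_f32(const float *A, const float *B, float *C,
          |                     int M, int N, int K) {
          |         for (int i = 0; i < M; i++)
          |             for (int j = 0; j < N; j++) {
          |                 float acc = 0.0f;
          |                 for (int p = 0; p < K; p++)
          |                     acc += A[i * K + p] * B[p * N + j];
          |                 C[i * N + j] = acc;
          |             }
          |     }
          | 
          | On the Python side, an ndarray exposes the raw pointer via its
          | .ctypes attribute, so no conversion to Python lists is needed.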
        
           | brnt wrote:
           | If you use numpy, then you use an ndarray, which you can
           | create from a C array for 'free' (no copying, just set a
           | pointer).
        
           | dgan wrote:
           | I don't recall the link but there was a github repo with
           | comparisons of CFFI implementations in different languages,
           | and from what i remember Python was 'bout 3 orders of
           | magnitude slower than, say, Lua or Ocaml
           | 
           | Edit: ha, found it
           | https://news.ycombinator.com/from?site=github.com/dyu
        
       | ssivark wrote:
       | Most common coding patterns leave a lot of performance unclaimed,
       | by not fully specializing to the hardware. This article is an
       | interesting example. For another interesting demonstration, see
       | this CS classic "There's plenty of room at the top"
       | 
       | https://www.science.org/doi/10.1126/science.aam9744
        
         | hmaarrfk wrote:
         | Thanks for sharing. That was a great read
        
         | auselen wrote:
         | Title comes from:
         | https://en.m.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_...
        
       | ks2048 wrote:
        | This looks like a nice write-up and implementation. I'm left
        | wondering what the "trick" is. How does it manage to beat
        | OpenBLAS, which is assembly+C optimized over decades for this
        | exact problem? It goes into detail about caching, etc. - is BLAS
        | not taking advantage of these things, or is this more tuned to
        | this specific processor?
        
         | sbstp wrote:
         | Maybe -march=native gives it an edge as it compiles for this
         | exact CPU model whereas numpy is compiled for a more generic
         | (older) x86-64. -march=native would probably get v4 on a Ryzen
         | CPU where numpy is probably targeting v1 or v2.
         | 
         | https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...
        
           | stingraycharles wrote:
           | Doesn't numpy have runtime SIMD dispatching and whatnot based
           | on CPU flags?
           | 
            | E.g. https://github.com/numpy/numpy/blob/main/numpy/_core/src/com...
        
             | KeplerBoy wrote:
              | np.matmul just uses whatever BLAS library your NumPy
              | distribution was configured for/shipped with.
              | 
              | Could be MKL (I believe the conda version comes with it),
              | but it could also be an ancient version of OpenBLAS you
              | already had installed. So yeah, being faster than np.matmul
              | probably just means your NumPy is not installed optimally.
        
         | hansvm wrote:
         | - OpenBLAS isn't _that_ optimized for any specific modern
         | architecture.
         | 
         | - The matrices weren't that big. Numpy has cffi overhead.
         | 
          | - The perf difference was much more noticeable with _peak_
          | throughput rather than _mean_ throughput, which matters for
          | almost no applications (a few, admittedly, but even where
          | "peak" is close to the right measure you usually want something
          | like the mean of the top-k results or the proportion completing
          | under some latency bound, ...).
         | 
         | - The benchmarking code they displayed runs through Python's
         | allocator for numpy and is suggestive of not going through any
         | allocator for the C implementation. Everything might be fine,
         | but that'd be the first place I checked for microbenchmarking
         | errors or discrepancies (most numpy routines allow in-place
         | operations; given that that's known to be a bottleneck in some
         | applications of numpy, I'd be tempted to explicitly examine
         | benchmarks for in-place versions of both).
         | 
         | - Numpy has some bounds checking and error handling code which
         | runs regardless of the underlying implementation. That's part
         | of why it's so bleedingly slow for small matrices compared to
         | even vanilla Python lists (they tested bigger matrices too, so
         | this isn't the only effect, but I'll mention it anyway). It's
         | hard to make something faster when you add a few thousand
         | cycles of pure overhead.
         | 
         | - This was a very principled approach to saturating the
         | relevant caches. It's "obvious" in some sense, but clear
         | engineering improvements are worth highlighting in discussions
         | like this, in the sense that OpenBLAS, even with many man-
         | hours, likely hasn't thought of everything.
         | 
         | And so on. Anyone can rattle off differences. A proper
         | explanation requires an actual deep-dive into both chunks of
         | code.
        
           | robxorb wrote:
            | To your third point - it looks as if, even if the lines of
            | mean values were averaged, this post's code would still be a
            | clear winner.
        
         | ipsum2 wrote:
          | A comparison with NumPy 2.0 should be better for NumPy because
          | it integrates Google Highway for better SIMD across different
          | microarchitectures.
        
           | janwas wrote:
           | Highway TL here. The integration is a work in progress; what
           | has been integrated so far is VQSort :)
        
       | dzaima wrote:
       | Though not at all part of the hot path, the inefficiency of the
       | mask generation ('bit_mask' usage) nags me. Some more efficient
       | methods include creating a global constant array of
       | {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
       | 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element
       | offsets 16-m and 8-m, or comparing constant vector
       | {0,1,2,3,4,...} with broadcasted m and m-8.
       | 
       | Very moot nitpick though, given that this is for only one column
       | of the matrix, the following loops of maskload/maskstore will
       | take significantly more time (esp. store, which is still slow on
       | Zen 4[1] despite the AVX-512 instruction (whose only difference
       | is taking the mask in a mask register) being 6x faster), and
       | clang autovectorizes the shifting anyways (maybe like 2-3x slower
       | than my suggestions).
       | 
       | [1]:
       | https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
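        | 
        | For illustration, a sketch of the two suggestions above (names
        | and helpers are illustrative, not the article's code; assumes
        | AVX2; the offsets are written so that the first min(m,8) and
        | max(m-8,0) lanes end up set):
        | 
        |     #include <immintrin.h>
        |     #include <stdint.h>
        | 
        |     // 16 all-ones lanes followed by zeros: loading 8 ints at
        |     // offset 16-k gives a mask with the first k lanes set.
        |     static const int32_t mask_table[31] = {
        |         -1, -1, -1, -1, -1, -1, -1, -1,
        |         -1, -1, -1, -1, -1, -1, -1, -1,
        |          0,  0,  0,  0,  0,  0,  0,  0,
        |          0,  0,  0,  0,  0,  0,  0,
        |     };
        | 
        |     // Masks for an m-column edge case, 0 < m <= 16.
        |     static inline void edge_masks(int m, __m256i *lo, __m256i *hi) {
        |         int k_lo = m < 8 ? m : 8;
        |         int k_hi = m > 8 ? m - 8 : 0;
        |         *lo = _mm256_loadu_si256(
        |             (const __m256i *)(mask_table + 16 - k_lo));
        |         *hi = _mm256_loadu_si256(
        |             (const __m256i *)(mask_table + 16 - k_hi));
        |     }
        | 
        |     // Alternative: compare {0,1,...,7} with a broadcast count.
        |     static inline __m256i cmp_mask(int k) {
        |         const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        |         return _mm256_cmpgt_epi32(_mm256_set1_epi32(k), iota);
        |     }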
        
         | salykova wrote:
          | Hi! I'm the author of the article. It's really my first time
          | optimizing C code and using intrinsics, so I'm definitely not
          | an expert in this area, but I'm willing to learn more! Many
          | thanks for your feedback; I truly appreciate comments that
          | provide new perspectives.
         | 
         | Regarding "creating a constant global array and loading from
         | it" - if I recall correctly, I've tested this approach and it
         | was a bit slower than bit mask shifting. But let me re-test
         | this to be 100% sure.
         | 
         | "Comparing a constant vector {0, 1, 2, 3, 4, ...} with
         | broadcasted m and m-8" - good idea, I will try it!
        
         | Const-me wrote:
         | > creating a global constant array
         | 
         | Note you can keep int8_t elements in that array, and sign
         | extend bytes into int32_t while loading. The _mm_loadu_si64 /
         | _mm256_cvtepi8_epi32 combo should compile into a single
         | vpmovsxbd instruction with a memory operand. This way the
         | entire constant array fits in a single cache line, as long as
          | it's aligned properly with alignas(32).
          | 
          | This is a good fit for the OP's use case: since they need two
          | masks, the second vpmovsxbd instruction will be a guaranteed
          | L1D cache hit.
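          | 
          | A sketch of that byte-table variant (names are illustrative;
          | assumes AVX2 and C11 alignas via <stdalign.h>):
          | 
          |     #include <immintrin.h>
          |     #include <stdalign.h>
          |     #include <stdint.h>
          | 
          |     // 16 x -1 then 15 x 0; 31 bytes, so the aligned array
          |     // never crosses a cache line.
          |     alignas(32) static const int8_t mask_bytes[31] = {
          |         -1, -1, -1, -1, -1, -1, -1, -1,
          |         -1, -1, -1, -1, -1, -1, -1, -1,
          |          0,  0,  0,  0,  0,  0,  0,  0,
          |          0,  0,  0,  0,  0,  0,  0,
          |     };
          | 
          |     // Load 8 bytes and sign-extend to 8 x int32; compilers
          |     // should fold this into one vpmovsxbd with a memory
          |     // operand. k = lanes to keep, 0 <= k <= 8.
          |     static inline __m256i byte_mask(int k) {
          |         __m128i b = _mm_loadu_si64(mask_bytes + 16 - k);
          |         return _mm256_cvtepi8_epi32(b);
          |     }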
        
           | dzaima wrote:
            | vpmovsxbd ymm,[...] still presumably decomposes back into two
            | uops (definitely does on Intel, but uops.info doesn't show
            | memory uops for AMD); still better than broadcast+compare
           | though (which does still have a load for the constant; and,
           | for that matter, the original shifting version also has
           | multiple loads). Additionally, the int8_t elements mean no
           | cacheline-crossing loads. (there's the more compressed option
           | of only having a {8x-1, 8x0} array, at the cost of more
           | scalar offset computation)
        
       | leeoniya wrote:
       | what about jart's tinyBLAS?
       | 
       | https://hacks.mozilla.org/2024/04/llamafiles-progress-four-m...
       | 
       | also https://justine.lol/matmul/
        
         | salykova wrote:
          | Justine and I were chatting yesterday; it seems the
          | implementation is at least 2x faster than tinyBLAS on her
          | workstation. The whole discussion is in the Mozilla AI Discord:
          | https://discord.com/invite/NSnjHmT5xY
        
           | salykova wrote:
           | "off-topic" channel
        
       | marmaduke wrote:
        | Very nice write-up. Those are the kind of matrix sizes that MKL
        | is fairly good at - might be worth a comparison as well?
        | 
        | Also, if you were designing for smaller cases, say MNK=16 or 32,
        | how would you approach it differently? I'm implementing neural
        | ODEs and this is one point I've been considering.
        
         | pmeira wrote:
         | Don't forget BLIS itself!
        
         | gnufx wrote:
         | For very small sizes on amd64, you can, and likely should, use
         | libxsmm. MKL's improved performance in that region was due to
         | libxsmm originally showing it up, but this isn't an Intel CPU
         | anyway.
        
       | SushiHippie wrote:
       | In the README, it says:
       | 
       | > Important! Please don't expect peak performance without fine-
       | tuning the hyperparameters, such as the number of threads, kernel
       | and block sizes, unless you are running it on a Ryzen 7700(X).
       | More on this in the tutorial.
       | 
       | I think I'll need a TL;DR on what to change all these values to.
       | 
       | I have a Ryzen 7950X and as a first test I tried to only change
       | NTHREADS to 32 in benchmark.c, but matmul.c performed worse than
       | NumPy on my machine.
       | 
        | So I took a look at the other values present in benchmark.c,
        | but MC and NC are already calculated from the number of threads
        | (so these are probably already 'fine-tuned'?), and I couldn't
        | really understand how KC = 1000 fits the 7700(X) (the author's
        | CPU) or how I'd need to adjust it for the 7950X (with the
        | information from the article).
        
         | SushiHippie wrote:
         | Actually, leaving it on 16 threads performs a bit better than
         | setting it to 32 threads.
         | 
         | 16 threads: https://0x0.st/XaDB.png
         | 
         | 32 threads: https://0x0.st/XaDM.png
         | 
          | But it's still not as fast as it ran on your 7700(X), and NumPy
          | is 2-3x faster than matmul.c on my PC.
          | 
          | I also changed KC to some other values (500:
          | https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but it
          | didn't change much performance-wise.
        
           | salykova wrote:
           | as we discussed earlier, the code really needs Clang to
           | attain high performance
        
             | SushiHippie wrote:
             | Agreed https://0x0.st/XakD.png ;)
        
       | teo_zero wrote:
       | Does it make sense to compare a C executable with an interpreted
       | Python program that calls a compiled library? Is the difference
       | due to the algorithm or the call stack?
        
         | rustybolt wrote:
         | Yes
        
         | lamcny wrote:
         | Yes, as far as I can tell the array sizes were large enough to
         | make the wrapping overhead negligible.
        
       | david-gpu wrote:
        | There is no reason to burden one side with Python while the other
        | side is C, when they could just as easily have performed an
        | apples-to-apples comparison where both sides are written in C, one
        | calling a BLAS library while the other calls this other
        | implementation.
        
         | moomin wrote:
         | Python is the right thing to compare to here, because it is
         | easily the most popular way to perform these computations in
         | the modern day. Specifically using numpy. The overhead isn't
         | that high, but as mentioned elsewhere in this thread, calling
         | it correctly is important. Pitting naive numpy code against
         | tuned C code is definitely not a fair comparison.
        
           | stinos wrote:
           | > Python is the right thing to compare to here, because it is
           | easily the most popular way to perform these computations in
           | the modern day. Specifically using numpy.
           | 
           | By that reasoning, wouldn't it make more sense to wrap their
           | C code and maybe even make it operate on numpy's array
           | representation, so it can be called from Python?
        
             | moomin wrote:
             | I think it's okay to say "This is the benchmark, now I'm
             | going to compare it against something else." It's up to the
             | reader to decide if a 3% (or 300%) improvement is worth the
             | investment if it involves learning a whole other language.
        
               | david-gpu wrote:
                | If that was the goal, they should have compared NumPy to
                | BLAS. What they did was compare OpenBLAS wrapped in
                | NumPy with their C code. It is not a reasonable
                | comparison to make.
               | 
               | Look, I'm trying to be charitable to the authors, hard as
               | that might be.
        
               | cnity wrote:
               | There is some reason in this comparison. You might want
               | to answer the question: "if I pick the common approach to
               | matrix multiplication in the world of Data Science
               | (numpy), how far off is the performance from some
               | potential ideal reference implementation?"
               | 
                | I actually do have that question niggling in the back of
                | my mind when I use something like NumPy. I don't
                | necessarily care exactly _where_ the overhead comes from;
                | I might just be interested in whether it's close to ideal
                | or not.
        
               | david-gpu wrote:
               | If that was your question, you would compare against a
               | number of BLAS libraries, which are already well
               | optimized.
               | 
               | What they are doing here is patting themselves on the
               | back after handicapping the competition. Not to mention
               | that they have given themselves the chance to cherry pick
               | the very best hyperparameters for this particular
               | comparison while BLAS is limited to using heuristics to
               | guess which of their kernels will suit this particular
               | combination of hardware and parameters.
               | 
               | The authors need to be called out for this contrived
               | comparison.
        
               | pmeira wrote:
               | It's a muddy comparison given that NumPy is commonly used
               | with other BLAS implementations, which the author even
               | lists, but doesn't properly address. Anaconda defaults to
               | Intel oneAPI MKL, for example, and that's a widely used
                | distribution. Not that I think MKL would do great on AMD
                | hardware; BLIS is probably a better alternative.
               | 
               | The author also says "(...) implementation follows the
               | BLIS design", but then proceeds to compare *only* with
               | OpenBLAS. I'd love to see a more thorough analysis, and
               | using C directly would make it easier to compare multiple
               | BLAS libs.
        
             | lamcny wrote:
             | Popular does not mean best.
             | 
             | Suppose that this blog post were part of a series that
             | questions the axiom (largely bolstered by academic
             | marketing) that one needs Python to do array computing.
             | Then it is valid to compare C directly to NumPy.
             | 
              | It isn't even far-fetched. The quality of understanding
              | gained from having implemented something in C is far
              | greater than the understanding gained by rearranging
              | PyTorch or NumPy snippets.
              | 
              | That said, the Python overhead should not be very high if
              | M=1000, N=1000, K=1000 was used. The article is a bit quiet
              | about the array sizes; this figure is from somewhere in the
              | middle of the article.
        
               | CleanRoomClub wrote:
               | Python is popular precisely because non-programmers are
               | able to rearrange snippets and write rudimentary programs
               | without a huge time investment in learning the language
                | or tooling. It's a very high-level language with
                | syntactic sugar and a lot of data science libraries that
                | can call C code for performance, which makes it great for
                | data scientists.
                | 
                | It would be a huge detriment and time sink for these data
                | scientists to take the time to learn to write an
                | equivalent C program if their ultimate goal is to do data
                | science.
        
       | Bayes7 wrote:
        | This was a great read, thanks a lot! On a side note, does anyone
        | have a good guess what tool/software they used to create the
        | visualisations of the matrix multiplications and the memory
        | layout?
        
         | salykova wrote:
         | excalidraw <3
        
       | bjourne wrote:
       | Good writeup and commendable of you to make your benchmark so
       | easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz
       | matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices
       | when compiled with gcc -O3 and 1.47 seconds when compiled with
        | clang -O2, while NumPy does it in 1.07 seconds. I believe an
        | AVX-512 kernel would be significantly faster. Another reason for
        | the lackluster performance may be OpenMP. IME, you can reduce
        | overhead by managing the thread pool explicitly with pthreads
        | (and use sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding the
        | thread count).
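        | 
        | A small sketch of that runtime query (shown with the existing
        | OpenMP setup; a pthreads pool would use the same count):
        | 
        |     #include <unistd.h>
        |     #include <omp.h>
        | 
        |     // Pick the thread count at runtime instead of hard-coding
        |     // NTHREADS.
        |     static int runtime_threads(void) {
        |         long n = sysconf(_SC_NPROCESSORS_ONLN);
        |         return n > 0 ? (int)n : 1;
        |     }
        | 
        |     // e.g. before the parallel loops:
        |     // omp_set_num_threads(runtime_threads());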
        
       | epr wrote:
       | If the point of this article is that there's generally
       | performance left on the table, if anything it's understating how
       | much room there generally is for improvement considering how much
       | effort goes into matmul libraries compared to most other
       | software.
       | 
        | Getting a 10-1000x or more improvement on existing code is very
        | common without putting in a ton of effort, if the code was not
        | already heavily optimized. The items below are listed roughly in
        | order of importance, but performance is often such a non-
        | consideration for most developers that a little effort goes a
        | long way.
       | 
       | 1. Most importantly, is the algorithm a good choice? Can we
       | eliminate some work entirely? (this is what algo interviews are
       | testing for)
       | 
       | 2. Can we eliminate round trips to the kernel and similar heavy
       | operations? The most common huge gain here is replacing tons of
       | malloc calls with a custom allocator.
       | 
        | 3. Can we vectorize? Explicit vector intrinsics like in the blog
        | post are great, but you can often get the same machine code by
        | reorganizing your data into flat arrays / structs of arrays
        | rather than arrays of structs (see the sketch after this list).
       | 
       | 4. Can we optimize for cache efficiency? If you already
       | reorganized for vectors this might already be handled, but this
       | can get more complicated with parallel code if you can't isolate
       | data to one thread (false sharing, etc.)
       | 
       | 5. Can we do anything else that's hardware specific? This can be
       | anything from using intrinsics to hand-coding assembly.
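        | 
        | A minimal sketch of point 3 (the names are illustrative): with a
        | struct-of-arrays layout each field is contiguous, so a simple
        | loop is easy for the compiler to vectorize.
        | 
        |     // Array-of-structs: x, y, z interleaved, awkward for SIMD.
        |     struct point_aos { float x, y, z; };
        | 
        |     // Struct-of-arrays: each field is a contiguous array.
        |     struct points_soa { float *x, *y, *z; };
        | 
        |     // Unit-stride loads; -O3 will typically emit packed SIMD
        |     // for this loop without any intrinsics.
        |     void scale_x(struct points_soa *p, int n, float s) {
        |         for (int i = 0; i < n; i++)
        |             p->x[i] *= s;
        |     }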
        
       | 38 wrote:
       | > #define min(x, y) ((x) < (y) ? (x) : (y))
       | 
       | cursed code. I checked and every single invocation is just `int`,
       | so why do this? you can just write a function:
        |     func min(x, y int) int {
        |         if x < y { return x }
        |         return y
        |     }
       | 
       | and keep type safety
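        | 
        | In the C code itself, a type-safe equivalent would be a static
        | inline function (the name min_int is illustrative), which also
        | evaluates its arguments exactly once, unlike the macro:
        | 
        |     static inline int min_int(int x, int y) {
        |         return x < y ? x : y;
        |     }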
        
       | throwaway4good wrote:
       | What is the point of making the matrix multiplication itself
       | multithreaded (other than benchmarking)? Wouldn't it be more
        | beneficial in practice to put the multithreading in the
        | algorithm that uses the multiplication?
        
         | gnufx wrote:
          | That's indeed what's typically done in HPC. However,
          | substituting a parallel BLAS can be a simple way to speed up
          | the right sort of R code, for instance - but HPC codes
          | typically aren't bottlenecked on GEMM.
        
       | softmicro wrote:
        | Is there any explanation for the drops and rises in GFLOP/S at
        | certain matrix sizes, as shown in the plots?
        
       | gnufx wrote:
       | The papers referenced in the BLIS repo are the authoritative
       | reference to understand this stuff. I've no idea why people think
       | optimized BLASes aren't performant; you should expect >90% of CPU
       | peak for sufficiently large matrices, and serial OpenBLAS was
       | generally on a par with MKL last I saw. (BLAS implements GEMM,
       | not matmul, as the basic linear algebra building block.) I don't
       | understand using numpy rather than the usual benchmark
       | frameworks, and on Zen surely you should compare with AMD's BLAS
       | (based on BLIS). BLIS had a better parallelization story than
       | OpenBLAS last I knew. AMD BLIS also has the implementation switch
       | for "small" dimensions, and I don't know if current OpenBLAS
       | does.
       | 
       | SIMD intrinsics aren't needed for micro-kernel vectorization, as
       | a decent C compiler will fully vectorize and unroll it. BLIS'
       | pure C micro-kernel gets >80% of the performance of the hand-
       | optimized implementation on Haswell with appropriate block sizes.
        | The difference is likely to be due to prefetch, but I don't
        | properly understand it.
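        | 
        | For illustration, a pure-C micro-kernel in the BLIS style that a
        | decent compiler can vectorize and unroll on its own (MR/NR, the
        | packing layout and the column-major C with leading dimension ldc
        | are illustrative assumptions, not the article's code):
        | 
        |     #define MR 16
        |     #define NR 6
        | 
        |     // Ap: packed MR x kc panel of A, Bp: packed kc x NR panel
        |     // of B, C: column-major with leading dimension ldc.
        |     static void microkernel_c(const float *restrict Ap,
        |                               const float *restrict Bp,
        |                               float *restrict C, int ldc, int kc) {
        |         float acc[MR][NR] = {{0.0f}};
        |         for (int p = 0; p < kc; p++)
        |             for (int j = 0; j < NR; j++)
        |                 for (int i = 0; i < MR; i++)
        |                     acc[i][j] += Ap[p * MR + i] * Bp[p * NR + j];
        |         for (int j = 0; j < NR; j++)
        |             for (int i = 0; i < MR; i++)
        |                 C[j * ldc + i] += acc[i][j];
        |     }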
        
       ___________________________________________________________________
       (page generated 2024-07-04 23:01 UTC)