[HN Gopher] Beating NumPy matrix multiplication in 150 lines of C
___________________________________________________________________
Beating NumPy matrix multiplication in 150 lines of C
Author : p1esk
Score : 326 points
Date : 2024-07-03 21:52 UTC (1 days ago)
(HTM) web link (salykova.github.io)
(TXT) w3m dump (salykova.github.io)
| azornathogron wrote:
| I've only skimmed so far, but this post has a lot of detail and
| explanation. Looks like a pretty great description of how fast
| matrix multiplication is implemented while taking architectural
| concerns into account. Goes on my reading list!
| ghghgfdfgh wrote:
| I don't save articles often, but maybe once every few months I
| see something I know I will enjoy reading again even after a
| year or two. Keep up the great work, OP!
| le-mark wrote:
| > This is my first time writing a blog post. If you enjoy it,
| please subscribe and share it!
|
| Great job! Self-publishing things like this was a hallmark of
| the early internet that I, for one, sorely miss.
| KerrAvon wrote:
| The article claims this is portable C. Given the use of Intel
| intrinsics, what happens if you try to compile it for ARM64?
| tempay wrote:
| I think they mean portable between modern x86 CPUs as opposed
| to truly portable.
| rurban wrote:
| You'd first need to convert it to portable SIMD intrinsics.
| There are several libraries.
| pmeira wrote:
| Apparently it also needs Clang to achieve the same
| performance: https://news.ycombinator.com/item?id=40875968
| jstrong wrote:
| in terms of comparing to numpy, how much overhead would there be
| from Python (vs. running the numpy C code alone)?
| SJC_Hacker wrote:
| Python can efficiently call C libs if you use ctypes and native
| pointers, which numpy uses. Of course it depends on the expected
| layout.
|
| If you want to convert to Python lists, it is going to take
| time. Not sure about Python arrays.
| brnt wrote:
| If you use numpy, then you use an ndarray, which you can
| create from a C array for 'free' (no copying, just set a
| pointer).
| dgan wrote:
| I don't recall the link, but there was a GitHub repo with
| comparisons of CFFI implementations in different languages,
| and from what I remember Python was about 3 orders of
| magnitude slower than, say, Lua or OCaml.
|
| Edit: ha, found it
| https://news.ycombinator.com/from?site=github.com/dyu
| ssivark wrote:
| Most common coding patterns leave a lot of performance unclaimed,
| by not fully specializing to the hardware. This article is an
| interesting example. For another interesting demonstration, see
| this CS classic "There's plenty of room at the top"
|
| https://www.science.org/doi/10.1126/science.aam9744
| hmaarrfk wrote:
| Thanks for sharing. That was a great read
| auselen wrote:
| Title comes from:
| https://en.m.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_...
| ks2048 wrote:
| This looks like a nice write-up and implementation. I'm left
| wondering what is the "trick"? How does it manage to beat
| OpenBLAS, which is assembly+C optimized over decades for this
| exact problem? It goes into detail about caching, etc. - is BLAS
| not taking advantage of these things, or is this more tuned to
| this specific processor, etc.?
| sbstp wrote:
| Maybe -march=native gives it an edge as it compiles for this
| exact CPU model whereas numpy is compiled for a more generic
| (older) x86-64. -march=native would probably get v4 on a Ryzen
| CPU where numpy is probably targeting v1 or v2.
|
| https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...
| stingraycharles wrote:
| Doesn't numpy have runtime SIMD dispatching and whatnot based
| on CPU flags?
|
| E.g. https://github.com/numpy/numpy/blob/main/numpy/_core/src
| /com...
| KeplerBoy wrote:
| np.matmul just uses whatever blas library your NumPy
| distribution was configured for/shipped with.
|
| Could be MKL (I believe the conda version comes with it)
| but it could also be an ancient version of OpenBLAS you
| already had installed. So yeah, being faster than np.matmul
| probably just means your NumPy is not installed optimally.
| hansvm wrote:
| - OpenBLAS isn't _that_ optimized for any specific modern
| architecture.
|
| - The matrices weren't that big. Numpy has cffi overhead.
|
| - The perf difference was much more noticeable for _peak_
| throughput than for _mean_ throughput, and peak throughput
| matters for almost no applications (a few, admittedly, but even
| where "peak" is close to the right measure you usually want
| something like the mean of the top-k results or the proportion
| finishing under some latency bound, ...).
|
| - The benchmarking code they displayed runs through Python's
| allocator for numpy and is suggestive of not going through any
| allocator for the C implementation. Everything might be fine,
| but that'd be the first place I checked for microbenchmarking
| errors or discrepancies (most numpy routines allow in-place
| operations; given that that's known to be a bottleneck in some
| applications of numpy, I'd be tempted to explicitly examine
| benchmarks for in-place versions of both).
|
| - Numpy has some bounds checking and error handling code which
| runs regardless of the underlying implementation. That's part
| of why it's so bleedingly slow for small matrices compared to
| even vanilla Python lists (they tested bigger matrices too, so
| this isn't the only effect, but I'll mention it anyway). It's
| hard to make something faster when you add a few thousand
| cycles of pure overhead.
|
| - This was a very principled approach to saturating the
| relevant caches. It's "obvious" in some sense, but clear
| engineering improvements are worth highlighting in discussions
| like this, in the sense that OpenBLAS, even with many man-
| hours, likely hasn't thought of everything.
|
| And so on. Anyone can rattle off differences. A proper
| explanation requires an actual deep-dive into both chunks of
| code.
| robxorb wrote:
| To your third point - it looks as if, even if the lines of mean
| values were averaged, this post's code would still be a clear
| winner.
| ipsum2 wrote:
| A comparison with NumPy 2.0 should be better for NumPy, because
| it integrates Google Highway for better SIMD across different
| microarchitectures.
| janwas wrote:
| Highway TL here. The integration is a work in progress; what
| has been integrated so far is VQSort :)
| dzaima wrote:
| Though not at all part of the hot path, the inefficiency of the
| mask generation ('bit_mask' usage) nags me. Some more efficient
| methods include creating a global constant array of
| {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,
| 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element
| offsets 16-m and 8-m, or comparing constant vector
| {0,1,2,3,4,...} with broadcasted m and m-8.
|
| Very moot nitpick though, given that this is for only one column
| of the matrix, the following loops of maskload/maskstore will
| take significantly more time (esp. store, which is still slow on
| Zen 4[1] despite the AVX-512 instruction (whose only difference
| is taking the mask in a mask register) being 6x faster), and
| clang autovectorizes the shifting anyways (maybe like 2-3x slower
| than my suggestions).
|
| [1]:
| https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
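|
| For reference, the compare-based variant could look roughly like
| this (a sketch only, assuming AVX2; the helper name is mine, not
| from the article):
|
|       #include <immintrin.h>
|
|       /* 32-bit lane i of the result is all-ones iff i < m, for
|          m in 0..8: compare broadcast(m) > {0,1,2,...,7}. */
|       static inline __m256i make_mask8(int m) {
|           const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
|           return _mm256_cmpgt_epi32(_mm256_set1_epi32(m), iota);
|       }
|
| Calling it with m and with m - 8 gives the two masks; a
| non-positive argument simply yields an all-zero mask.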
| salykova wrote:
| Hi! I'm the author of the article. It's really my first time
| optimizing C code and using intrinsics, so I'm definitely not
| an expert in this area, but I'm willing to learn more! Many
| thanks for your feedback; I truly appreciate comments that
| provide new perspectives.
|
| Regarding "creating a constant global array and loading from
| it" - if I recall correctly, I've tested this approach and it
| was a bit slower than bit mask shifting. But let me re-test
| this to be 100% sure.
|
| "Comparing a constant vector {0, 1, 2, 3, 4, ...} with
| broadcasted m and m-8" - good idea, I will try it!
| Const-me wrote:
| > creating a global constant array
|
| Note you can keep int8_t elements in that array, and sign
| extend bytes into int32_t while loading. The _mm_loadu_si64 /
| _mm256_cvtepi8_epi32 combo should compile into a single
| vpmovsxbd instruction with a memory operand. This way the
| entire constant array fits in a single cache line, as long as
| it's aligned properly with alignas(32)
|
| This is a good fit for the OP's use case: since they need two
| masks, the second vpmovsxbd instruction will be a guaranteed
| L1D cache hit.
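|
| Roughly, that looks like the following (a sketch only, assuming
| AVX2; the table name, helper name and clamping are illustrative,
| not the article's actual code):
|
|       #include <immintrin.h>
|       #include <stdalign.h>
|       #include <stdint.h>
|
|       /* 16 "active" bytes followed by 16 "inactive" bytes; with
|          alignas(32) the whole table sits in one cache line. */
|       static alignas(32) const int8_t mask_tbl[32] = {
|           -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
|            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
|
|       /* Masks for the first m (0..16) lanes, split across two
|          8-lane vectors. Each load + sign-extend should compile
|          to a single vpmovsxbd with a memory operand. */
|       static inline void make_masks16(int m, __m256i *lo, __m256i *hi) {
|           int m_lo = m < 8 ? m : 8;       /* lanes 0..7  */
|           int m_hi = m > 8 ? m - 8 : 0;   /* lanes 8..15 */
|           *lo = _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_tbl + 16 - m_lo));
|           *hi = _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_tbl + 16 - m_hi));
|       }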
| dzaima wrote:
| vpmovsxbd ymm,[...] still presumably decomposes back into two
| uops (definitely does on Intel, but uops.info doesn't show
| memory uops for AMD); still better than broadcast+compare
| though (which does still have a load for the constant; and,
| for that matter, the original shifting version also has
| multiple loads). Additionally, the int8_t elements mean no
| cacheline-crossing loads. (there's the more compressed option
| of only having a {8x-1, 8x0} array, at the cost of more
| scalar offset computation)
| leeoniya wrote:
| what about jart's tinyBLAS?
|
| https://hacks.mozilla.org/2024/04/llamafiles-progress-four-m...
|
| also https://justine.lol/matmul/
| salykova wrote:
| We were chatting with Justine yesterday; it seems like
| the implementation is at least 2x faster than tinyBLAS on her
| workstation. The whole discussion is in the Mozilla AI Discord:
| https://discord.com/invite/NSnjHmT5xY
| salykova wrote:
| "off-topic" channel
| marmaduke wrote:
| Very nice write-up. Those are the kind of matrix sizes that MKL
| is fairly good at; might be worth a comparison as well?
|
| Also, if you were designing for smaller cases, say MNK=16 or 32,
| how would you approach it differently? I'm implementing neural
| ODEs and this is one point I've been considering.
| pmeira wrote:
| Don't forget BLIS itself!
| gnufx wrote:
| For very small sizes on amd64, you can, and likely should, use
| libxsmm. MKL's improved performance in that region was due to
| libxsmm originally showing it up, but this isn't an Intel CPU
| anyway.
| SushiHippie wrote:
| In the README, it says:
|
| > Important! Please don't expect peak performance without fine-
| tuning the hyperparameters, such as the number of threads, kernel
| and block sizes, unless you are running it on a Ryzen 7700(X).
| More on this in the tutorial.
|
| I think I'll need a TL;DR on what to change all these values to.
|
| I have a Ryzen 7950X and as a first test I tried to only change
| NTHREADS to 32 in benchmark.c, but matmul.c performed worse than
| NumPy on my machine.
|
| So I took a look at the other values present in the benchmark.c,
| but MC and NC are already calculated from the number of threads
| (so these are probably already 'fine-tuned'?), and I couldn't
| really understand how KC = 1000 fits the 7700(X) (the
| author's CPU) and how I'd need to adjust it for the 7950X (with
| the information from the article).
| SushiHippie wrote:
| Actually, leaving it on 16 threads performs a bit better than
| setting it to 32 threads.
|
| 16 threads: https://0x0.st/XaDB.png
|
| 32 threads: https://0x0.st/XaDM.png
|
| But it's still not as fast as it ran on your 7700(X), and NumPy
| is 2-3x faster than matmul.c on my PC.
|
| I also changed KC to some other values (500:
| https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but it
| didn't change much performance wise.
| salykova wrote:
| as we discussed earlier, the code really needs Clang to
| attain high performance
| SushiHippie wrote:
| Agreed https://0x0.st/XakD.png ;)
| teo_zero wrote:
| Does it make sense to compare a C executable with an interpreted
| Python program that calls a compiled library? Is the difference
| due to the algorithm or the call stack?
| rustybolt wrote:
| Yes
| lamcny wrote:
| Yes, as far as I can tell the array sizes were large enough to
| make the wrapping overhead negligible.
| david-gpu wrote:
| There is no reason to burden one side with Python while the other
| side is C, when they could just as easily have performed an
| apples-to-apples comparison where both sides are written in C,
| one calling a BLAS library while the other calls this other
| implementation.
| moomin wrote:
| Python is the right thing to compare to here, because it is
| easily the most popular way to perform these computations in
| the modern day. Specifically using numpy. The overhead isn't
| that high, but as mentioned elsewhere in this thread, calling
| it correctly is important. Pitting naive numpy code against
| tuned C code is definitely not a fair comparison.
| stinos wrote:
| > Python is the right thing to compare to here, because it is
| easily the most popular way to perform these computations in
| the modern day. Specifically using numpy.
|
| By that reasoning, wouldn't it make more sense to wrap their
| C code and maybe even make it operate on numpy's array
| representation, so it can be called from Python?
| moomin wrote:
| I think it's okay to say "This is the benchmark, now I'm
| going to compare it against something else." It's up to the
| reader to decide if a 3% (or 300%) improvement is worth the
| investment if it involves learning a whole other language.
| david-gpu wrote:
| If that was the goal, they should have compared NumPy to
| BLAS. What they did was compare OpenBLAS wrapped in
| NumPy with their C code. That is not a reasonable
| comparison to make.
|
| Look, I'm trying to be charitable to the authors, hard as
| that might be.
| cnity wrote:
| There is some reason in this comparison. You might want
| to answer the question: "if I pick the common approach to
| matrix multiplication in the world of Data Science
| (numpy), how far off is the performance from some
| potential ideal reference implementation?"
|
| I actually do have that question niggling in the back of
| my mind when I use something like NumPy. I don't
| necessarily care exactly _where_ the overhead comes from;
| I might just be interested in whether it's close to ideal
| or not.
| david-gpu wrote:
| If that was your question, you would compare against a
| number of BLAS libraries, which are already well
| optimized.
|
| What they are doing here is patting themselves on the
| back after handicapping the competition. Not to mention
| that they have given themselves the chance to cherry pick
| the very best hyperparameters for this particular
| comparison while BLAS is limited to using heuristics to
| guess which of their kernels will suit this particular
| combination of hardware and parameters.
|
| The authors need to be called out for this contrived
| comparison.
| pmeira wrote:
| It's a muddy comparison given that NumPy is commonly used
| with other BLAS implementations, which the author even
| lists, but doesn't properly address. Anaconda defaults to
| Intel oneAPI MKL, for example, and that's a widely used
| distribution. Not that I think MKL would do great on AMD
| hardware; BLIS is probably a better alternative.
|
| The author also says "(...) implementation follows the
| BLIS design", but then proceeds to compare *only* with
| OpenBLAS. I'd love to see a more thorough analysis, and
| using C directly would make it easier to compare multiple
| BLAS libs.
| lamcny wrote:
| Popular does not mean best.
|
| Suppose that this blog post were part of a series that
| questions the axiom (largely bolstered by academic
| marketing) that one needs Python to do array computing.
| Then it is valid to compare C directly to NumPy.
|
| It isn't even far-fetched. The quality of understanding
| something after having implemented it in C is far greater
| than the understanding gained by rearranging PyTorch or
| NumPy snippets.
|
| That said, the Python overhead should not be very high if
| M=1000, N=1000, K=1000 was used. The article is a bit
| quiet about the array sizes; these numbers appear somewhere
| in the middle of the article.
| CleanRoomClub wrote:
| Python is popular precisely because non-programmers are
| able to rearrange snippets and write rudimentary programs
| without a huge time investment in learning the language
| or tooling. It's a very high level language with
| syntactic sugar that has a lot of data science libraries
| which can call C code for performance, which makes it
| great for data scientists.
|
| It would be a huge detriment and time sink for these data
| scientists to take the time to learn to write an
| equivalent C program if their ultimate goal is to do data
| science.
| Bayes7 wrote:
| This was a great read, thanks a lot! On a side note, does anyone
| have a good guess what tool/software they used to create the
| visualisations of the matrix multiplications and memory layout?
| salykova wrote:
| excalidraw <3
| bjourne wrote:
| Good writeup and commendable of you to make your benchmark so
| easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz
| matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices
| when compiled with gcc -O3 and 1.47 seconds when compiled with
| clang -O2, while NumPy does it in 1.07 seconds. I believe an
| avx512 kernel would be significantly faster. Another reason for
| the lackluster performance may be OpenMP. IME, you can reduce the
| overhead by managing the thread pool explicitly with pthreads
| (and using sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding
| the thread count).
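|
| For reference, querying the core count at run time instead of
| hard-coding it is tiny (a sketch; the fallback value is my
| assumption, not from the repo):
|
|       #include <unistd.h>
|
|       /* Number of cores currently online; falls back to 1 if
|          the query fails. */
|       static long detect_nthreads(void) {
|           long n = sysconf(_SC_NPROCESSORS_ONLN);
|           return n < 1 ? 1 : n;
|       }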
| epr wrote:
| If the point of this article is that there's generally
| performance left on the table, then if anything it understates
| how much room for improvement there is, considering how much more
| effort goes into matmul libraries than into most other software.
|
| Getting a 10-1000x or greater improvement on existing code is
| very common without putting in a ton of effort, if the code was
| not already heavily optimized. The steps below are listed roughly
| in order of importance, but performance is often such a non-
| consideration for most developers that a little effort goes a
| long way.
|
| 1. Most importantly, is the algorithm a good choice? Can we
| eliminate some work entirely? (this is what algo interviews are
| testing for)
|
| 2. Can we eliminate round trips to the kernel and similar heavy
| operations? The most common huge gain here is replacing tons of
| malloc calls with a custom allocator.
|
| 3. Can we vectorize? Explicit vector intrinsics like in the blog
| post are great, but you can often get the same machine code by
| reorganizing your data into flat arrays / structs of arrays
| rather than arrays of structs (see the sketch after this list).
|
| 4. Can we optimize for cache efficiency? If you already
| reorganized for vectors this might already be handled, but this
| can get more complicated with parallel code if you can't isolate
| data to one thread (false sharing, etc.)
|
| 5. Can we do anything else that's hardware specific? This can be
| anything from using intrinsics to hand-coding assembly.
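|
| As an illustration of point 3 (a sketch only; the struct and
| field names are made up, not from the article):
|
|       #define N 1024
|
|       /* Array of structs: x, y, z interleaved in memory, which
|          makes unit-stride SIMD loads awkward. */
|       struct particle { float x, y, z; };
|       static struct particle aos[N];
|
|       /* Struct of arrays: each field is contiguous, so a plain
|          loop over one field is trivially auto-vectorizable. */
|       static struct { float x[N], y[N], z[N]; } soa;
|
|       static void scale_x(float s) {
|           for (int i = 0; i < N; i++)
|               soa.x[i] *= s;   /* unit-stride loads and stores */
|       }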
| 38 wrote:
| > #define min(x, y) ((x) < (y) ? (x) : (y))
|
| Cursed code. I checked and every single invocation is just `int`,
| so why do this? You can just write a function:
|
|       func min(x, y int) int {
|           if x < y { return x }
|           return y
|       }
|
| and keep type safety.
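|
| (In the article's C, a type-safe equivalent would be a static
| inline function - a sketch, with the name chosen to avoid
| clashing with the existing macro:)
|
|       static inline int min_int(int x, int y) {
|           return x < y ? x : y;
|       }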
| throwaway4good wrote:
| What is the point of making the matrix multiplication itself
| multithreaded (other than benchmarking)? Wouldn't it be more
| beneficial in practice to have the multithreading in the
| algorithms that use the multiplication?
| gnufx wrote:
| That's indeed what's typically done in HPC. However,
| substituting a parallel BLAS is a simple way to help the right
| sort of R code, for instance, but HPC codes typically aren't
| bottlenecked on GEMM.
| softmicro wrote:
| Is there any explanation for the dips and spikes in GFLOP/s at
| certain matrix sizes, as shown in the plots?
| gnufx wrote:
| The papers referenced in the BLIS repo are the authoritative
| reference to understand this stuff. I've no idea why people think
| optimized BLASes aren't performant; you should expect >90% of CPU
| peak for sufficiently large matrices, and serial OpenBLAS was
| generally on a par with MKL last I saw. (BLAS implements GEMM,
| not matmul, as the basic linear algebra building block.) I don't
| understand using numpy rather than the usual benchmark
| frameworks, and on Zen surely you should compare with AMD's BLAS
| (based on BLIS). BLIS had a better parallelization story than
| OpenBLAS last I knew. AMD BLIS also has the implementation switch
| for "small" dimensions, and I don't know if current OpenBLAS
| does.
|
| SIMD intrinsics aren't needed for micro-kernel vectorization, as
| a decent C compiler will fully vectorize and unroll it. BLIS'
| pure C micro-kernel gets >80% of the performance of the hand-
| optimized implementation on Haswell with appropriate block sizes.
| The difference is likely to be due to prefetch, but I don't
| properly understand it.
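|
| For reference, the GEMM building block computes C = alpha*A*B +
| beta*C; a typical single-precision CBLAS call for row-major
| matrices looks roughly like this (a sketch; the leading
| dimensions assume densely packed MxK, KxN and MxN buffers):
|
|       #include <cblas.h>
|
|       /* C (MxN) = 1.0 * A (MxK) * B (KxN) + 0.0 * C */
|       static void gemm(int M, int N, int K,
|                        const float *A, const float *B, float *C) {
|           cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
|                       M, N, K,
|                       1.0f, A, K,   /* lda = K */
|                             B, N,   /* ldb = N */
|                       0.0f, C, N);  /* ldc = N */
|       }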
___________________________________________________________________
(page generated 2024-07-04 23:01 UTC)