[HN Gopher] What Shapes Do Matrix Multiplications Like?
___________________________________________________________________
What Shapes Do Matrix Multiplications Like?
Author : skidrow
Score : 153 points
Date : 2024-11-05 22:08 UTC (1 days ago)
(HTM) web link (www.thonking.ai)
(TXT) w3m dump (www.thonking.ai)
| stoniejohnson wrote:
| Great post! I appreciate the diagrams.
| amelius wrote:
| TL;DR: make sure your matrix dimensions are divisible by 2 often.
| ykonstant wrote:
| I have always done that instinctively, even when there is no
| formal demand by the system. Every time I have implemented
| matmuls, at any level, with any optimization requirement,
| partitioning into dyadic blocks had always sped things up. So I
| try to feed the system the nicest numbers I can muster!
| zusammen wrote:
| D&C approaches are applicable to lots of problems and, as
| you've noticed, tend to do well with "round" (in binary)
| numbers.
| carlmr wrote:
| And if you can't make sure all of them are divisible by 2
| often, at least pick the inner dimensions of your matmuls (if
| possible).
| chillee wrote:
| Well, that'll help with a lot :) But dealing with wave
| quantization requires dimensions that aren't neceessarily a
| multiple of 2, and often are a multiple of the number of SMs on
| a GPU (i.e. 132 on an H100)
| jabl wrote:
| I recall some optimization advice to choose a leading dimension
| that is NOT a power of two, in order to avoid cache set
| associativity conflicts. Guess it's highly hardware dependent (in
| particular, that advice was for cpu's not GPU's).
| adrian_b wrote:
| Many Intel CPUs have caches that have a number of ways that is
| not a power of two, but it contains a 3 or 5 factor,
| corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB or 36 MB.
|
| Regardless of the cache associativity, the leading dimension
| must always correspond to a size that is a multiple of the
| cache line size (64 bytes for most CPUs) and the matrix must be
| aligned to the cache line size.
|
| With that condition fulfilled, the total size of a row or of a
| column (depending on whether the matrix uses C order or Fortran
| order) may be optimal as a power of two or not, which should be
| better tested experimentally on the target CPUs or GPUs.
| muziq wrote:
| DRAM page size surely ? All those cheap CASs' ?
| dekhn wrote:
| That only makes sense if you know for sure your application is
| running on a specific architecture. Otherwise, it's a highly
| specific optimization that is bound to violate another
| architecture's design, also would be extremely challenging to
| debug.
___________________________________________________________________
(page generated 2024-11-06 23:02 UTC)