[HN Gopher] What Shapes Do Matrix Multiplications Like?
       ___________________________________________________________________
        
       What Shapes Do Matrix Multiplications Like?
        
       Author : skidrow
       Score  : 153 points
       Date   : 2024-11-05 22:08 UTC (1 days ago)
        
 (HTM) web link (www.thonking.ai)
 (TXT) w3m dump (www.thonking.ai)
        
       | stoniejohnson wrote:
       | Great post! I appreciate the diagrams.
        
       | amelius wrote:
       | TL;DR: make sure your matrix dimensions are divisible by 2 often.
        
         | ykonstant wrote:
         | I have always done that instinctively, even when there is no
         | formal demand by the system. Every time I have implemented
         | matmuls, at any level, with any optimization requirement,
         | partitioning into dyadic blocks had always sped things up. So I
         | try to feed the system the nicest numbers I can muster!
        
           | zusammen wrote:
           | D&C approaches are applicable to lots of problems and, as
           | you've noticed, tend to do well with "round" (in binary)
           | numbers.
        
         | carlmr wrote:
         | And if you can't make sure all of them are divisible by 2
         | often, at least pick the inner dimensions of your matmuls (if
         | possible).
        
         | chillee wrote:
         | Well, that'll help with a lot :) But dealing with wave
         | quantization requires dimensions that aren't neceessarily a
         | multiple of 2, and often are a multiple of the number of SMs on
         | a GPU (i.e. 132 on an H100)
        
       | jabl wrote:
       | I recall some optimization advice to choose a leading dimension
       | that is NOT a power of two, in order to avoid cache set
       | associativity conflicts. Guess it's highly hardware dependent (in
       | particular, that advice was for cpu's not GPU's).
        
         | adrian_b wrote:
         | Many Intel CPUs have caches that have a number of ways that is
         | not a power of two, but it contains a 3 or 5 factor,
         | corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB or 36 MB.
         | 
         | Regardless of the cache associativity, the leading dimension
         | must always correspond to a size that is a multiple of the
         | cache line size (64 bytes for most CPUs) and the matrix must be
         | aligned to the cache line size.
         | 
         | With that condition fulfilled, the total size of a row or of a
         | column (depending on whether the matrix uses C order or Fortran
         | order) may be optimal as a power of two or not, which should be
         | better tested experimentally on the target CPUs or GPUs.
        
           | muziq wrote:
           | DRAM page size surely ? All those cheap CASs' ?
        
         | dekhn wrote:
         | That only makes sense if you know for sure your application is
         | running on a specific architecture. Otherwise, it's a highly
         | specific optimization that is bound to violate another
         | architecture's design, also would be extremely challenging to
         | debug.
        
       ___________________________________________________________________
       (page generated 2024-11-06 23:02 UTC)