[HN Gopher] X X^t can be faster
       ___________________________________________________________________
        
       X X^t can be faster
        
       Author : robinhouston
       Score  : 147 points
       Date   : 2025-05-16 15:45 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | BryanLegend wrote:
       | So if an AI company spends $5B on a cluster, is this optimization
       | worth $250m?
        
         | disofu234 wrote:
         | Don't think so. This is an optimization on multiplying a matrix
         | times its own transpose, not for generic matrix multiplication.
        
         | qoez wrote:
          | Definitely didn't cost that much to solve this particular
          | problem (that cost is amortized over the lifetime of the
          | cluster's existence). Compared to the CO2 it takes to feed
          | academics eating red meat, this is like wind power compared
          | to coal: it's way more environmentally friendly.
        
       | Scene_Cast2 wrote:
       | I can't name any applications off the top of my head, other than
       | iterative matrix multiplication for approximate eigenvector
        | finding in square matrices. But I don't know what's actually used
       | for finding eigenvectors (or other decompositions for that
       | matter).
        
         | TimorousBestie wrote:
         | It's a fairly common operation. The sample covariance of a
         | vector-valued random variable is XX^t/N.
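          | 
          | A minimal numpy sketch of that (variable names are just for
          | illustration; this uses the usual N-1 normalization so it
          | matches np.cov):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     # Rows = variables, columns = samples.
          |     X = rng.standard_normal((5, 1000))
          |     Xc = X - X.mean(axis=1, keepdims=True)  # center each row
          |     S = Xc @ Xc.T / (X.shape[1] - 1)  # covariance as XX^t
          |     assert np.allclose(S, np.cov(X))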
        
         | vqv wrote:
         | It's pretty fundamental in a variety of multivariate
          | statistical methods. If the rows of X are multivariate
         | observations, then XX' is the Gram matrix (of dot products).
         | This can be used in clustering and regression. If the columns
         | of X are (centered) multivariate observations, then XX' is a
         | scalar multiple of the sample covariance matrix. This is used
         | in PCA.
         | 
         | But in large scale applications you may not want to store XX'
         | but instead are interested in computing products of the form
         | XX' v on the fly.
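          | 
          | A minimal numpy sketch of that last point (shapes and names
          | are just for illustration): apply X' and then X to v, so the
          | n x n Gram matrix is never materialized:
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     X = rng.standard_normal((10000, 50))   # n observations
          |     v = rng.standard_normal(10000)
          |     
          |     # Same result as (X @ X.T) @ v, but O(n*d) work and
          |     # memory instead of an n x n intermediate.
          |     w = X @ (X.T @ v)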
        
           | adgjlsfhk1 wrote:
           | generally good implementations use QR instead
        
         | bee_rider wrote:
         | Could be useful when doing an SVD as well (although, really,
         | you don't want X'X but X'Xv for some vector...).
        
         | constantcrying wrote:
         | I think one particularly interesting result is that superior
         | algorithms for numerical linear algebra exist at all and can be
         | found by artificial intelligence.
         | 
          | XX^T is the matrix of all pairwise dot products of the vectors
          | in X and, as others have pointed out, there are legitimate
          | applications. Another one is e.g. transforming an
          | overdetermined linear system into a least squares problem.
        
           | FabHK wrote:
            | The solution to the least squares problem Ax ≈ b is given
            | by x = (A'A)^-1 A'b, but modern solvers never form the
            | matrix product A'A explicitly. Even for the SVD it isn't
            | formed.
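            | 
            | A small numpy sketch of the contrast (illustrative only,
            | not the solver internals):
            | 
            |     import numpy as np
            |     
            |     rng = np.random.default_rng(0)
            |     A = rng.standard_normal((100, 5))
            |     b = rng.standard_normal(100)
            |     
            |     # Textbook normal equations: forms A'A, which squares
            |     # the condition number.
            |     x1 = np.linalg.solve(A.T @ A, A.T @ b)
            |     
            |     # What libraries actually do: an SVD/QR-based solve;
            |     # A'A is never formed.
            |     x2, *_ = np.linalg.lstsq(A, b, rcond=None)
            |     
            |     assert np.allclose(x1, x2)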
        
           | jerf wrote:
           | Google put something out like this recently too. The original
           | is https://deepmind.google/discover/blog/alphaevolve-a-
           | gemini-p..., but it's sort of buried in there. Here's the
           | YouTube video that introduced it to me, forwarded to the spot
           | where the host talks about the improved 4x4 matrix
           | multiplication algorithm:
           | https://www.youtube.com/watch?v=sGCmu7YKgPA&t=396s It's only
           | a slight tweak, but it's a slight tweak to something pretty
           | highly studied and still impressive.
        
         | blobbers wrote:
         | Covariance.
        
         | Bootvis wrote:
         | I expect this algorithm or similar to work for X^tX as well.
         | Fun fact, that operation is common enough that a trading firm
         | was named after it: XTX Markets.
        
       | TimorousBestie wrote:
       | I wish they could have modeled memory cache movements as well,
       | somehow. It would have made the analysis more difficult but often
       | these large matrix algorithms live or die by how cache-friendly
       | the memory access patterns are.
       | 
       | Splitting into 4x4 blocks is typically very nice, though. Maybe
       | it doesn't matter so much to practical runtime.
        
         | grumpymuppet wrote:
         | I wish cache analysis was exposed more broadly _everywhere_.
         | Everyone I deal with seems to understand and know how to think
         | about it, but best practices seem to be wrapped up in a lot of
         | heuristics.
         | 
          | I am aware of being able to use hardware counters for profiling
         | systems, but the more fine-grained exposure of what's happening
         | at a lower level seems to be just hidden... Or I haven't found
         | the right reference resource yet.
        
       | meisel wrote:
       | Is this like the Karatsuba algorithm, where it's theoretically
       | faster but not actually faster when run on real hardware?
       | 
       | Btw, it's worth noting that if you know that the result will be
       | symmetric (such as is the case for X * X^T), you can make things
       | faster. For example in cuBLAS, cublas*syrk (the variant optimized
       | for when the result is symmetric) IME isn't faster than gemm, so
       | what you can do instead is just do smaller multiplications that
       | fill in one of the two triangles piece by piece, and then copy
       | that triangle to the other one.
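        | 
        | A rough numpy sketch of that triangle-filling idea (the block
        | size and function name are just illustrative; on the GPU each
        | block would be its own gemm call):
        | 
        |     import numpy as np
        |     
        |     def gram_lower_blocked(X, bs=256):
        |         # G = X @ X.T, filling only the lower triangle block
        |         # by block, then mirroring it into the upper triangle.
        |         n = X.shape[0]
        |         G = np.empty((n, n), dtype=X.dtype)
        |         for i in range(0, n, bs):
        |             for j in range(0, i + bs, bs):
        |                 ib = slice(i, min(i + bs, n))
        |                 jb = slice(j, min(j + bs, n))
        |                 G[ib, jb] = X[ib] @ X[jb].T
        |         il, jl = np.tril_indices(n, -1)
        |         G[jl, il] = G[il, jl]   # mirror lower -> upper
        |         return G
        |     
        |     X = np.random.default_rng(0).standard_normal((1000, 64))
        |     assert np.allclose(gram_lower_blocked(X), X @ X.T)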
        
         | optimalsolver wrote:
         | >where it's theoretically faster but not actually faster when
         | run on real hardware
         | 
         | Aka a galactic algorithm:
         | 
         | https://en.wikipedia.org/wiki/Galactic_algorithm
        
           | thrance wrote:
           | I feel like a galactic algorithm is slow even theoretically,
           | unless you work with out-of-this-world data.
           | 
           | Whereas many algorithms are theoretically fine for real data
           | but don't make good use of cache, or instruction sets...
        
         | thrance wrote:
         | They tested it against BLAS primitives and found theirs to be
         | faster, in hardware.
        
         | qoez wrote:
          | Since this is 2x2, there are SIMD instructions that can do
          | this (or it can be done with two SIMD dot products), both on
          | CPU and inside each GPU core. So with current hardware you
          | won't beat writing this out manually.
        
         | bee_rider wrote:
          | They mention 5% improvements and small matrices in the
          | abstract, so my gut says (I haven't read the actual paper
          | yet) that it probably is a practical-type algorithm.
        
         | odo1242 wrote:
          | It's faster for matrices of at least roughly 256x256, though
          | it depends on hardware.
        
         | pbsd wrote:
         | Karatsuba is definitely faster than schoolbook multiplication
         | at practical sizes. You presumably mean Strassen.
        
       | ttoinou wrote:
        | Are there researchers interested in accelerating these
        | algorithms while also keeping the maximum accuracy of the
        | results? These and other optimizations trade multiplications
        | for more additions, but from the little I know, floating-point
        | additions and subtractions are risky precision-wise, while
        | multiplications are harmless. Also, using FMA (fused
        | multiply-add) operations would be beneficial.
        
         | constantcrying wrote:
          | The question of the stability of the algorithm is
          | interesting; the paper does not seem to discuss it, though.
          | You are correct that we should expect the algorithms to
          | deliver different results.
          | 
          | >These and other optimizations trade multiplications for more
          | additions, but from the little I know, floating-point
          | additions and subtractions are risky precision-wise, while
          | multiplications are harmless.
          | 
          | Certainly not true. In general, different orders of magnitude
          | are a problem. With multiplication and division these are
          | more severe problems, as they lead to greater changes in
          | magnitude, although even that is a generalization.
        
           | wizzwizz4 wrote:
           | Different orders of magnitude aren't a problem when
           | multiplying _floats_ , because the order-of-magnitude portion
           | gets translated into addition of exponents.
           | 
           | They're a problem with fixed point: might you have been
           | thinking of that?
        
             | Dylan16807 wrote:
              | We're not looking at instructions in a vacuum. No matter
              | what, you have both multiplications and additions
              | happening.
             | 
             | So they're making a point that if you apply more
             | multiplications before the inevitable addition, you are
             | likely increasing the danger levels of that addition.
        
               | wizzwizz4 wrote:
               | Oh, if one of the multiplications yields a positive
               | number, and another yields a negative number of similar
               | magnitude? That makes sense; I forgot linear spaces were
               | over fields.
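                | 
                | A tiny Python illustration of that failure mode
                | (the values are chosen purely to force the
                | rounding):
                | 
                |     a = 1.0 + 2**-30
                |     p = a * a            # exact: 1 + 2**-29 + 2**-60,
                |                          # rounded to 1 + 2**-29
                |     q = -(1.0 + 2**-29)
                |     print(p + q)         # 0.0, but the true answer
                |                          # is 2**-60 ~= 8.7e-19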
        
         | almostgotcaught wrote:
         | > while also keeping the maximum accuracy of the results
         | 
         | All of these papers/algos are for the ML hype-train. ML algos
         | are approximate anyway so no one cares about absolute accuracy,
         | only the precision of the overall pipeline (class labels
         | shouldn't change, at least not too much). Consider that very
         | many papers/techniques quantize down to 8 or even 4 bits (yes
         | sometimes even during training) for the purposes of perf.
         | 
         | This whole research area should just be renamed to something
         | like approximate computing so that people don't confuse it (and
         | the goals) with classical numerical analysis.
        
           | saagarjha wrote:
           | Surely nobody is tiling 4x4 blocks for ML training
        
             | almostgotcaught wrote:
             | Why surely? Eg on GPU style chips today you'd tile to
             | warpsize (or half or something like that) to target warp
             | cooperative primitives (so called tensor cores). On AMD and
             | NV that's like 16x16 or 32x32 depending on the data type.
             | That's not that far from 4x4 and it's not like all chips in
             | the world have 32-lane warps. Anyway if a trick is good
             | enough and people start trying to shoehorn it into
             | everything (not saying this one is) then the next gen (or
             | the one after that or after) will just have units to
             | support the trick (it's an expensive way to gain ground on
             | mlperf rankings).
        
           | ttoinou wrote:
            | I get your point: neural networks could be some kind of
            | black box, and how the sequence of operations evolves might
            | not be so dependent on the accuracy of each operation -- at
            | least, "accuracy at each step matters" would be a good
            | intuition to believe, but is it rigorously proven? We could
            | empirically check this by training a neural net and adding
            | some random noise at each step.
        
         | HappyPanacea wrote:
          | Yes, see this: https://arxiv.org/abs/1507.00687 . There is
          | also this answer ( https://mathoverflow.net/a/421380 ) on
          | MathOverflow which states that you can't do better than cubic
          | time if you want a strong form of numerical stability.
        
           | ttoinou wrote:
            | Wow, thank you. I do not understand everything written and
            | didn't take the time to do so (and can't even find the
            | reference 23 paper on Brent stability), but I was thinking
            | we don't necessarily need a new algorithm; we could tweak
            | existing ones.
            | 
            | For example, swapping columns and rows before applying a
            | product, so that accuracy is best preserved (then swapping
            | back at the end of the operation). It seems to me the
            | Strassen-like algorithms arbitrarily choose some elements
            | of the matrices over others to form their temporary
            | sum-product variables, so we could exploit this asymmetry
            | (not sure if it's the correct word here) to pick which
            | elements get chosen.
        
       | toolslive wrote:
        | As a general tip: X^-1 doesn't mean you have to invert the
        | matrix. It's often a notational shorthand for "you can solve
        | the system". Same remark for X^t: it doesn't mean you have to
        | build a new matrix, it just means you have to use the one you
        | have in a different way. I've seen this butchered by scientists
        | who then complain that their performance sucks.
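        | 
        | A small numpy sketch of the difference (illustrative, not a
        | benchmark):
        | 
        |     import numpy as np
        |     
        |     rng = np.random.default_rng(0)
        |     A = rng.standard_normal((500, 500))
        |     b = rng.standard_normal(500)
        |     
        |     x_bad  = np.linalg.inv(A) @ b    # forms A^-1 explicitly
        |     x_good = np.linalg.solve(A, b)   # factorize and solve;
        |                                      # no inverse is built
        |     
        |     # Likewise, A.T is just another view of the same memory;
        |     # no new matrix is created.
        |     assert A.T.base is A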
        
         | jerf wrote:
         | Even computer programmers can get it wrong, and even reify the
         | wrongness into interview questions. The correct answer to the
         | common question of "how do you reverse an array" is that you
         | don't reverse it. You read it backwards. Details vary from
         | language to language; languages with iterators can find this
         | particularly easy. But this is often not the "correct" answer
         | according to the interviewer.
         | 
         | In the end, all data structures are functions. The academic
         | applications of that may make your eyes glaze over, but the
         | _practical_ application is that if you want a reversed array
         | /tree/etc., you just need the reading/iteration "function" for
         | the array to return the data as if it was reversed. A matrix
         | can be "transposed" just by reading its transpose. (Caching or
         | other consequences may result in you choosing to manifest it in
         | memory anyhow, but if it was stored cleverly in the first
         | place, or you're only going to read it once anyhow such that
         | using it once is as expensive as reading it once to reconstruct
         | it anyhow, then there's no reason to.) An array can be randomly
         | permuted simply by wrapping the read function with a
         | permutation on the index, it need not be manifested in memory.
         | etc.
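          | 
          | A toy Python sketch of the "read it backwards" idea (the
          | names are made up for illustration):
          | 
          |     data = [3, 1, 4, 1, 5, 9]
          |     
          |     # "Reverse" without moving anything: iterate back to
          |     # front instead of building a reversed copy.
          |     for x in reversed(data):
          |         print(x)
          |     
          |     # "Permute" without moving anything: wrap the read with
          |     # an index mapping.
          |     perm = [2, 0, 5, 1, 4, 3]
          |     read_permuted = lambda i: data[perm[i]]
          |     print([read_permuted(i) for i in range(len(data))])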
        
           | kazinator wrote:
           | The "invert binary tree" interview question can be solved by
            | swapping the accessors of the node type:
            | 
            |     tmpfn = tree.left
            |     tree.left = tree.right
            |     tree.right = tmpfn
           | 
           | Now when tree.left(node) is applied, it returns the right
           | child, and vice versa. All traversals use the accessors, QED.
        
         | cmovq wrote:
         | I see this a lot in graphics, where people will for example say
         | to create a view matrix you first create your camera transform
         | and then invert it. When the better solution is to just
         | directly construct the inverted matrix.
         | 
         | There is almost never a good reason to invert a full 4x4/3x3
         | transform, yet I see it all the time.
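          | 
          | A small numpy sketch of what "directly construct the inverse"
          | can look like for a rigid camera transform (rotation R plus
          | translation t, so no general 4x4 inverse is needed; the names
          | are illustrative):
          | 
          |     import numpy as np
          |     
          |     def view_from_camera(R, t):
          |         # Inverse of [R t; 0 1] is [R^T  -R^T t; 0 1].
          |         V = np.eye(4)
          |         V[:3, :3] = R.T
          |         V[:3, 3] = -R.T @ t
          |         return V
          |     
          |     # 90-degree rotation about Z, camera at (1, 2, 3).
          |     c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
          |     R = np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])
          |     t = np.array([1., 2., 3.])
          |     
          |     M = np.eye(4); M[:3, :3] = R; M[:3, 3] = t
          |     assert np.allclose(view_from_camera(R, t),
          |                        np.linalg.inv(M))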
        
       | selimthegrim wrote:
       | I guess the news is the "for small sizes" part
        
       | kazinator wrote:
       | Under multiplication, rows go into columns and all that. So if we
       | transpose things, rows are being dot-producted with (what were
        | originally) rows:
        | 
        |     [ a b ] [ a c ]   [ (a b) . (a b)  (a b) . (c d) ]
        |     [ c d ] [ b d ] = [ (c d) . (a b)  (c d) . (c d) ]
        | 
        |                       [ |(a b)|^2      (a b) . (c d) ]
        |                     = [ (c d) . (a b)  |(c d)|^2     ]
       | 
       | Down the diagonal we have squared norms of (a b) and (c d) which
       | is interesting and could somehow be used.
       | 
       | We also have repeated values between the upper and lower triangle
       | because (a b) . (c d) is the same as (c d) . (a b): the dot
       | product commutes.
       | 
        | Whereas just squaring the matrix:
        | 
        |     [ a b ] [ a b ]   [ (a b) . (a c)  (a b) . (b d) ]
        |     [ c d ] [ c d ] = [ (c d) . (a c)  (c d) . (b d) ]
       | 
       | all four dot products are unique.
       | 
        | So right off the bat, we can find interesting things about X X^t
       | without going deep, and an obvious shortcut.
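        | 
        | A quick numpy check of those two observations (symmetry and
        | squared norms on the diagonal):
        | 
        |     import numpy as np
        |     
        |     X = np.array([[1., 2.],
        |                   [3., 4.]])
        |     G = X @ X.T
        |     
        |     # Diagonal = squared row norms; off-diagonals mirrored.
        |     assert np.allclose(np.diag(G), (X**2).sum(axis=1))
        |     assert np.allclose(G, G.T)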
        
       ___________________________________________________________________
       (page generated 2025-05-16 23:00 UTC)