[HN Gopher] X X^t can be faster
___________________________________________________________________
X X^t can be faster
Author : robinhouston
Score : 147 points
Date : 2025-05-16 15:45 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| BryanLegend wrote:
| So if an AI company spends $5B on a cluster, is this optimization
| worth $250m?
| disofu234 wrote:
| Don't think so. This is an optimization for multiplying a matrix
| by its own transpose, not for generic matrix multiplication.
| qoez wrote:
| Definitely didn't cost that much to solve this particular
| problem (that cost is amortized over the lifetime of the
| cluster's existence). Compared to the CO2 needed to feed
| academics eating red meat, this is like wind power compared to
| coal: it's way more environmentally friendly.
| Scene_Cast2 wrote:
| I can't name any applications off the top of my head, other than
| iterative matrix multiplication for approximate eigenvector
| finding in square matrices. But I don't know what's actually used
| for finding eigenvectors (or other decompositions for that
| matter).
| TimorousBestie wrote:
| It's a fairly common operation. The sample covariance of a
| vector-valued random variable is XX^t/N.
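|
| A minimal NumPy sketch of that identity (a made-up example,
| assuming X holds one already-centered observation per column):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       X = rng.standard_normal((3, 1000))   # 3 variables, 1000 observations
|       X -= X.mean(axis=1, keepdims=True)   # center each variable
|
|       N = X.shape[1]
|       C = X @ X.T / N                      # sample covariance via XX^t/N
|
|       # np.cov with bias=True uses the same 1/N normalization
|       assert np.allclose(C, np.cov(X, bias=True))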
| vqv wrote:
| It's pretty fundamental in a variety of multivariate
| statistical methods. If the rows of X are multivariate
| observations, then XX' is the Gram matrix (of dot products).
| This can be used in clustering and regression. If the columns
| of X are (centered) multivariate observations, then XX' is a
| scalar multiple of the sample covariance matrix. This is used
| in PCA.
|
| But in large scale applications you may not want to store XX'
| but instead are interested in computing products of the form
| XX' v on the fly.
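|
| A small sketch of that matrix-free product in NumPy (shapes and
| names are made up; the point is only the parenthesization):
|
|       import numpy as np
|
|       rng = np.random.default_rng(1)
|       X = rng.standard_normal((2000, 50))   # tall data matrix
|       v = rng.standard_normal(2000)
|
|       # Never form the 2000 x 2000 Gram matrix XX'; associativity
|       # lets us compute X (X' v) with two thin mat-vec products.
|       w = X @ (X.T @ v)
|
|       # Same result as the explicit, memory-hungry version:
|       assert np.allclose(w, (X @ X.T) @ v)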
| adgjlsfhk1 wrote:
| Generally, good implementations use QR instead.
| bee_rider wrote:
| Could be useful when doing an SVD as well (although, really,
| you don't want X'X but X'Xv for some vector...).
| constantcrying wrote:
| I think one particularly interesting result is that superior
| algorithms for numerical linear algebra exist at all and can be
| found by artificial intelligence.
|
| XX^T is the matrix of all pairwise dot products of the vectors in
| X, and as others have pointed out there are legitimate
| applications. Another one is, e.g., transforming an
| underdetermined linear system into a least squares problem.
| FabHK wrote:
| The solution to the least squares problem Ax ≈ b is given
| by x = (A'A)^-1 A'b, but modern solvers never form the matrix
| product A'A explicitly. Even for the SVD it isn't formed.
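|
| For illustration, a small NumPy sketch contrasting the normal
| equations with what solvers do instead (sizes and names are
| arbitrary):
|
|       import numpy as np
|
|       rng = np.random.default_rng(2)
|       A = rng.standard_normal((100, 5))
|       b = rng.standard_normal(100)
|
|       # Normal equations: form A'A explicitly (this squares the
|       # condition number, which is what good solvers avoid).
|       x_normal = np.linalg.solve(A.T @ A, A.T @ b)
|
|       # QR route: never forms A'A; solve R x = Q' b instead.
|       Q, R = np.linalg.qr(A)
|       x_qr = np.linalg.solve(R, Q.T @ b)
|
|       # Library one-liner (SVD-based underneath, also avoids A'A).
|       x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
|
|       assert np.allclose(x_normal, x_qr) and np.allclose(x_qr, x_lstsq)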
| jerf wrote:
| Google put something out like this recently too. The original
| is https://deepmind.google/discover/blog/alphaevolve-a-
| gemini-p..., but it's sort of buried in there. Here's the
| YouTube video that introduced it to me, forwarded to the spot
| where the host talks about the improved 4x4 matrix
| multiplication algorithm:
| https://www.youtube.com/watch?v=sGCmu7YKgPA&t=396s It's only
| a slight tweak, but it's a slight tweak to something pretty
| highly studied and still impressive.
| blobbers wrote:
| Covariance.
| Bootvis wrote:
| I expect this algorithm or a similar one to work for X^tX as well.
| Fun fact: that operation is common enough that a trading firm was
| named after it: XTX Markets.
| TimorousBestie wrote:
| I wish they could have modeled memory cache movements as well,
| somehow. It would have made the analysis more difficult but often
| these large matrix algorithms live or die by how cache-friendly
| the memory access patterns are.
|
| Splitting into 4x4 blocks is typically very nice, though. Maybe
| it doesn't matter so much to practical runtime.
| grumpymuppet wrote:
| I wish cache analysis was exposed more broadly _everywhere_.
| Everyone I deal with seems to understand and know how to think
| about it, but best practices seem to be wrapped up in a lot of
| heuristics.
|
| I am aware of being able to use hardware counters for profiling
| systems, but the more fine-grained exposure of what's happening
| at a lower level seems to be just hidden... Or I haven't found
| the right reference resource yet.
| meisel wrote:
| Is this like the Karatsuba algorithm, where it's theoretically
| faster but not actually faster when run on real hardware?
|
| Btw, it's worth noting that if you know that the result will be
| symmetric (such as is the case for X * X^T), you can make things
| faster. For example in cuBLAS, cublas*syrk (the variant optimized
| for when the result is symmetric) IME isn't faster than gemm, so
| what you can do instead is just do smaller multiplications that
| fill in one of the two triangles piece by piece, and then copy
| that triangle to the other one.
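|
| A rough CPU-side sketch of that triangle-filling idea in NumPy
| (block size and shapes are arbitrary; a real version would issue
| the per-block GEMMs through cuBLAS):
|
|       import numpy as np
|
|       def gram_lower_blocks(X, bs=256):
|           """C = X @ X.T, filling only the lower triangle in
|           bs x bs blocks and mirroring it to the upper one."""
|           n = X.shape[0]
|           C = np.empty((n, n), dtype=X.dtype)
|           for i in range(0, n, bs):
|               for j in range(0, i + bs, bs):      # blocks with j <= i
|                   blk = X[i:i+bs] @ X[j:j+bs].T
|                   C[i:i+bs, j:j+bs] = blk
|                   if i != j:
|                       C[j:j+bs, i:i+bs] = blk.T   # mirror the block
|           return C
|
|       X = np.random.default_rng(3).standard_normal((1000, 64))
|       assert np.allclose(gram_lower_blocks(X), X @ X.T)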
| optimalsolver wrote:
| >where it's theoretically faster but not actually faster when
| run on real hardware
|
| Aka a galactic algorithm:
|
| https://en.wikipedia.org/wiki/Galactic_algorithm
| thrance wrote:
| I feel like a galactic algorithm is slow even theoretically,
| unless you work with out-of-this-world data.
|
| Whereas many algorithms are theoretically fine for real data
| but don't make good use of cache, or instruction sets...
| thrance wrote:
| They tested it against BLAS primitives and found theirs to be
| faster, in hardware.
| qoez wrote:
| Since this is 2x2, there are SIMD instructions that can do this
| (or with two SIMD dot products), both on CPU and inside each GPU
| core. So with current hardware you won't beat writing this out
| manually.
| bee_rider wrote:
| They mention 5% improvements and small matrices in the abstract,
| so my gut says (I haven't read the actual paper yet) that it
| probably is a practical-type algorithm.
| odo1242 wrote:
| It's faster for matrices of at least roughly 256x256, though it
| depends on the hardware.
| pbsd wrote:
| Karatsuba is definitely faster than schoolbook multiplication
| at practical sizes. You presumably mean Strassen.
| ttoinou wrote:
| Are there researchers interested in accelerating these algorithms
| while also keeping the maximum accuracy of the results? This and
| other optimizations trade multiplications for additions, but from
| the little I know, in floating point, additions and subtractions
| are risky precision-wise, while multiplications are harmless.
| Also, using FMA (fused multiply-add) operations would be
| beneficial.
| constantcrying wrote:
| The question of the stability of the algorithm is interesting;
| the paper does not seem to discuss it, though. You are correct
| that we should expect the algorithms to deliver different
| results.
|
| >This and other optimizations trade multiplications for
| additions, but from the little I know, in floating point,
| additions and subtractions are risky precision-wise, while
| multiplications are harmless.
|
| Certainly not true. In general, different orders of magnitude
| are a problem. With multiplication and division these are more
| severe problems, as they lead to greater changes in magnitude,
| although even that is a generalization.
| wizzwizz4 wrote:
| Different orders of magnitude aren't a problem when
| multiplying _floats_, because the order-of-magnitude portion
| gets translated into addition of exponents.
|
| They're a problem with fixed point: might you have been
| thinking of that?
| Dylan16807 wrote:
| We're not looking at instructions in a vacuum. No matter
| what, you have both multiplications and additions happening.
|
| So the point they're making is that if you apply more
| multiplications before the inevitable addition, you are
| likely increasing the danger level of that addition.
| wizzwizz4 wrote:
| Oh, if one of the multiplications yields a positive
| number, and another yields a negative number of similar
| magnitude? That makes sense; I forgot linear spaces were
| over fields.
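|
| A tiny concrete illustration of that effect in double precision
| (the numbers are contrived so the cancellation is total):
|
|       # Each product is computed with tiny relative error, but when
|       # the two are added the leading digits cancel and that tiny
|       # error becomes the entire answer.
|       p = (1.0 + 2**-29) * (1.0 - 2**-29)  # exactly 1 - 2**-58, rounds to 1.0
|       q = (-1.0) * 1.0                     # exactly -1
|       print(p + q)                         # 0.0 -- the true sum is lost
|       print(-(2.0**-58))                   # -3.47e-18, the exact result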
| almostgotcaught wrote:
| > while also keeping the maximum accuracy of the results
|
| All of these papers/algos are for the ML hype-train. ML algos
| are approximate anyway so no one cares about absolute accuracy,
| only the precision of the overall pipeline (class labels
| shouldn't change, at least not too much). Consider that very
| many papers/techniques quantize down to 8 or even 4 bits (yes
| sometimes even during training) for the purposes of perf.
|
| This whole research area should just be renamed to something
| like approximate computing so that people don't confuse it (and
| the goals) with classical numerical analysis.
| saagarjha wrote:
| Surely nobody is tiling 4x4 blocks for ML training
| almostgotcaught wrote:
| Why surely? E.g. on GPU-style chips today you'd tile to warp
| size (or half of it, or something like that) to target warp-
| cooperative primitives (so-called tensor cores). On AMD and
| NV that's like 16x16 or 32x32 depending on the data type.
| That's not that far from 4x4, and it's not like all chips in
| the world have 32-lane warps. Anyway, if a trick is good
| enough and people start trying to shoehorn it into
| everything (not saying this one is), then the next gen (or
| the one after that, or the one after that) will just have
| units to support the trick (it's an expensive way to gain
| ground on MLPerf rankings).
| ttoinou wrote:
| I get your point: neural networks could be some kind of black
| box, and how the sequence of operations evolves might not be
| so dependent on the accuracy of each operation -- at least,
| it would be a good intuition to believe "accuracy at each
| step matters", but is it rigorously proven? We could
| empirically check this by training a neural net and adding
| some random noise at each step.
| HappyPanacea wrote:
| Yes, see this: https://arxiv.org/abs/1507.00687 . There is also
| this answer ( https://mathoverflow.net/a/421380 ) on
| MathOverflow, which states that you can't do better than cubic
| time if you want a strong form of numerical stability.
| ttoinou wrote:
| Wow, thank you. I don't understand everything written and
| didn't take the time to do so (and can't even find the
| reference 23 paper on Brent stability), but I was thinking we
| don't necessarily need a new algorithm; we could tweak
| existing ones.
|
| For example, swapping columns and rows before applying a
| product, so that accuracy is best kept (then swapping back
| at the end of the operation). It seems to me the Strassen-
| like algorithms arbitrarily choose some elements of the
| matrices over others to form their temporary sum-product
| variables, so we could exploit this asymmetry (not sure if
| it's the correct word here) to choose which elements are
| picked.
| toolslive wrote:
| As a general tip: X^-1 doesn't mean you have to invert the
| matrix. It's often a notational shorthand for "you can solve the
| system". Same remark for X^t. It doesn't mean you have to build a
| new matrix. It just means you have to use the one you have in a
| different way. I've seen this being butchered by scientists, who
| then complain that their performance sucks.
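|
| A small NumPy illustration of the same tip (sizes are arbitrary):
|
|       import numpy as np
|
|       rng = np.random.default_rng(4)
|       A = rng.standard_normal((500, 500))
|       b = rng.standard_normal(500)
|
|       # "x = A^-1 b" as written on paper:
|       x_inv   = np.linalg.inv(A) @ b    # explicitly builds the inverse
|       x_solve = np.linalg.solve(A, b)   # factorize once, then solve:
|                                         # faster and more accurate
|       assert np.allclose(x_inv, x_solve)
|
|       # "A^t" likewise needs no new matrix: .T is a strided view of
|       # the same memory, and the BLAS call underneath handles it.
|       assert A.T.base is A
|       y = A.T @ b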
| jerf wrote:
| Even computer programmers can get it wrong, and even reify the
| wrongness into interview questions. The correct answer to the
| common question of "how do you reverse an array" is that you
| don't reverse it. You read it backwards. Details vary from
| language to language; languages with iterators can find this
| particularly easy. But this is often not the "correct" answer
| according to the interviewer.
|
| In the end, all data structures are functions. The academic
| applications of that may make your eyes glaze over, but the
| _practical_ application is that if you want a reversed array
| /tree/etc., you just need the reading/iteration "function" for
| the array to return the data as if it was reversed. A matrix
| can be "transposed" just by reading its transpose. (Caching or
| other consequences may lead you to materialize it in memory
| anyway, but if it was stored cleverly in the first place, or
| you're only going to read it once, so that using it directly is
| no more expensive than reading it once to reconstruct it, then
| there's no reason to.) An array can be randomly permuted simply
| by wrapping the read function with a permutation on the index;
| it need not be materialized in memory. Etc.
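|
| A toy sketch of that "data structures are functions" view (purely
| illustrative Python):
|
|       class PermutedView:
|           """Read-only view of a sequence through an index
|           permutation; nothing is copied or moved."""
|           def __init__(self, data, perm):
|               self.data, self.perm = data, perm
|           def __len__(self):
|               return len(self.data)
|           def __getitem__(self, i):
|               if not 0 <= i < len(self.data):
|                   raise IndexError(i)
|               return self.data[self.perm(i)]
|
|       xs = [10, 20, 30, 40]
|
|       # "Reversed" array: just read it backwards.
|       rev = PermutedView(xs, lambda i: len(xs) - 1 - i)
|       print(list(rev))   # [40, 30, 20, 10], xs itself is untouched
|
|       # Any other permutation works the same way.
|       rot = PermutedView(xs, lambda i: (i + 1) % len(xs))
|       print(list(rot))   # [20, 30, 40, 10]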
| kazinator wrote:
| The "invert binary tree" interview question can be solved by
| swapping the accessors of the node type:
      tmpfn = tree.left
      tree.left = tree.right
      tree.right = tmpfn
|
| Now when tree.left(node) is applied, it returns the right
| child, and vice versa. All traversals use the accessors, QED.
| cmovq wrote:
| I see this a lot in graphics, where people will, for example, say
| that to create a view matrix you first create your camera
| transform and then invert it, when the better solution is to just
| directly construct the inverted matrix.
|
| There is almost never a good reason to invert a full 4x4/3x3
| transform, yet I see it all the time.
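|
| For a rigid camera transform (rotation R, translation t) the
| inverse can be written down directly; a NumPy sketch, assuming
| column vectors (conventions vary):
|
|       import numpy as np
|
|       def view_from_camera(R, t):
|           """World-to-view matrix, i.e. the inverse of the rigid
|           camera-to-world transform [R | t], built directly as
|           [R^T | -R^T t] instead of inverting a 4x4."""
|           V = np.eye(4)
|           V[:3, :3] = R.T
|           V[:3, 3] = -R.T @ t
|           return V
|
|       # Camera rotated 30 degrees about Y and moved to (1, 2, 3).
|       c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
|       R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
|       t = np.array([1.0, 2.0, 3.0])
|
|       M = np.eye(4); M[:3, :3] = R; M[:3, 3] = t   # camera-to-world
|       assert np.allclose(view_from_camera(R, t), np.linalg.inv(M))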
| selimthegrim wrote:
| I guess the news is the "for small sizes" part
| kazinator wrote:
| Under multiplication, rows go into columns and all that. So if we
| transpose things, rows are being dot-producted with (what were
| originally) rows:
|
|       [ a b ] [ a c ]   [ (a b) . (a b)   (a b) . (c d) ]
|       [ c d ] [ b d ] = [ (c d) . (a b)   (c d) . (c d) ]
|
|                         [ |(a b)|^2       (a b) . (c d) ]
|                       = [ (c d) . (a b)   |(c d)|^2     ]
|
| Down the diagonal we have squared norms of (a b) and (c d) which
| is interesting and could somehow be used.
|
| We also have repeated values between the upper and lower triangle
| because (a b) . (c d) is the same as (c d) . (a b): the dot
| product commutes.
|
| Whereas just squaring the matrix:
|
|       [ a b ] [ a b ]   [ (a b) . (a c)   (a b) . (b d) ]
|       [ c d ] [ c d ] = [ (c d) . (a c)   (c d) . (b d) ]
|
| all four dot products are unique.
|
| So right off the bat, we can find interesting things about X X^t
| without going deep, and an obvious shortcut.
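|
| That shortcut is essentially what symmetric rank-k (syrk) routines
| exploit; a naive sketch just to show the count of dot products:
|
|       import numpy as np
|
|       def gram_symmetric(X):
|           """X @ X.T using symmetry: n(n+1)/2 dot products (squared
|           row norms on the diagonal) instead of n^2."""
|           n = X.shape[0]
|           C = np.empty((n, n))
|           for i in range(n):
|               C[i, i] = X[i] @ X[i]                # |row_i|^2
|               for j in range(i):
|                   C[i, j] = C[j, i] = X[i] @ X[j]  # each pair once
|           return C
|
|       X = np.random.default_rng(5).standard_normal((50, 8))
|       assert np.allclose(gram_symmetric(X), X @ X.T)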
___________________________________________________________________
(page generated 2025-05-16 23:00 UTC)