[HN Gopher] Show HN: Matrix Multiplication with Half the Multipl...
       ___________________________________________________________________
        
       Show HN: Matrix Multiplication with Half the Multiplications
        
       Author : emacs28
       Score  : 199 points
       Date   : 2024-03-15 10:36 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | halflings wrote:
        | This looks pretty cool! What's the catch? E.g. why isn't this
        | already implemented in accelerators? Is it really just a
        | forgotten algorithm, or does it have some implications for the
        | cost of building the accelerator, or something else?
        
         | emacs28 wrote:
          | IMHO, for fixed-point MM accelerators, there is no catch; I
          | think it's an overlooked algorithm. It's based on an algorithm
          | by Winograd, who coincidentally also proposed another, unrelated
          | algorithm that later became very popular for CNN acceleration,
          | which may have taken some visibility away from this one... But
          | that is speculative.
        
           | yorwba wrote:
           | On the other hand, if you tried it with floating point, you'd
           | lose significant digits. Since the approach is to sum (a[i] +
           | b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1]
           | and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you
           | may be taking the difference of two large values to get a
           | small value, losing precision.
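            | 
            | As a concrete check, here's a minimal NumPy sketch of that
            | identity (not the repo's code, just an illustration, for
            | even-length vectors):
            | 
            |     import numpy as np
            | 
            |     def winograd_dot(a, b):
            |         # pair elements: a0 = a[i], a1 = a[i+1] for even i
            |         a0, a1 = a[0::2], a[1::2]
            |         b0, b1 = b[0::2], b[1::2]
            |         cross = np.sum((a0 + b1) * (a1 + b0))
            |         # the correction sums depend only on a or only on b,
            |         # so a full matmul computes them once per row/column
            |         return cross - np.sum(a0 * a1) - np.sum(b0 * b1)
            | 
            |     a = np.arange(6.0)
            |     b = np.arange(6.0, 12.0)
            |     print(winograd_dot(a, b), np.dot(a, b))  # both 145.0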
        
             | tmp8086 wrote:
             | Here's a matrix multiplication using Winograd's algorithm
             | (that you describe), implemented in Go (very small code):
             | 
             | https://bugfix-66.com/983cb3a3a37dbe8e82edea6298988f03d2a1a
             | 4...
        
           | OJFord wrote:
           | LLM hype and this submission in particular keep making me
           | think of a lecturer I had for _Topics in Large Dimensional
            | Data Processing_, circa 2016: as I recall he was
           | enthusiastically adamant that the most important thing,
           | breakthroughs etc., in years/decades to come was going to be
           | faster matrix operations. Anyway, I'm pretty sure I recognise
           | FIP (not FFIP of course) from that course.
           | 
           | I wish I could remember his name, I believe he left academia
           | after my year and went to work in industry, I'd just be
            | curious to see what he's up to now. I'm not saying it was a
            | particularly novel or prescient comment/attitude - we may not
            | have had quite such ML hype, but 'big data' was certainly all
            | the rage at the time - it's just something that's stuck in my
            | mind. One of those areas I always meant to study more;
            | realistically I probably never had the mathematical chops for
            | it, and certainly those I did have atrophied.
        
             | bee_rider wrote:
             | Maybe I'm joking, but: our society is just a vehicle for
             | economics at this point, our economy is built around
             | science, our science has mostly been turned into
             | observations about engineering, some time ago we changed
             | all of engineering into differential equations, and
             | differential equations can be solved by discretizing them
             | and doing linear algebra, and most of linear algebra can be
             | done with matrix multiplications (triangular solves and
             | orthonormalizations if you are fancy). All you need is
             | matmul.
        
               | OJFord wrote:
               | > our science has mostly been turned into observations
               | about engineering
               | 
               | You may be joking but that in particular seems pretty
               | astute.
               | 
               | Superficially it seems accurate, and reasonably
                | ascribable to economic forces: fewer concentrations of
               | capital in people (aristocrats) spending it on a hobby
               | interest or academic pursuit of their own - today's
               | equivalents mostly prefer philanthropy (Musk is, I
               | suppose, for whatever else you might think of him, a
               | notable exception - preferring to explore space, AI, etc.
               | not really it seems for personal monetary gain). But I
               | wonder if that _is_ fair, to modern scientists, or is it
               | just that  'all the low-hanging stuff's been done'?
        
         | mariocesar wrote:
         | Perhaps it's less of a hidden gem and more of a spotlight
         | moment.
        
         | pclmulqdq wrote:
         | There are a lot of matrix multiplication algorithms out there
         | with a lot of pluses and minuses. It's always a balance of
         | accuracy, runtime, and scaling. This one probably has bad
         | accuracy in floating point.
        
           | waynecochran wrote:
           | The document said it outputs the _exact_ same values as the
           | conventional method. There is _no_ accuracy trade off here.
        
             | p1esk wrote:
             | For floating point? Are you sure?
        
               | waynecochran wrote:
                | Opening statement of the README:
                | 
                |     This repository contains the source code for ML
                |     hardware architectures that require nearly half the
                |     number of multiplier units to achieve the same
                |     performance, by executing alternative inner-product
                |     algorithms that trade nearly half the multiplications
                |     for cheap low-bitwidth additions, while still
                |     producing identical output as the conventional inner
                |     product.
        
               | p1esk wrote:
               | I just looked at the paper: the answer is no, floating
               | point is not supported.
        
             | pclmulqdq wrote:
             | The paper cited is about hardware, where there is no
             | accuracy tradeoff because you control the numerical
             | precision completely and use fixed point. In a software
             | implementation, neither is true. There is _no chance_ that
             | you will get the exact same values out of this method that
             | you do out of other FP matmuls.
        
           | abetusk wrote:
           | I don't know why this answer is getting downvoted. This is
           | absolutely correct.
           | 
            | W. Miller has a paper showing that, under certain numerical
            | stability conditions, O(n^3) multiplications are necessary
            | [0]. Any algorithm that achieves sub-cubic runtime for matrix
            | multiplication, like Strassen's or Coppersmith-Winograd, must
            | sacrifice some amount of precision or stability.
           | 
           | [0] https://epubs.siam.org/doi/10.1137/0204009
        
             | gyrovagueGeist wrote:
             | Another relevant paper is:
             | https://epubs.siam.org/doi/10.1137/15M1032168
        
           | emacs28 wrote:
           | For everyone discussing the reduced accuracy/numerical
           | stability of the algorithms in floating-point, this is true.
           | But note that the application of the algorithms in the work
           | is explored for fixed-point MM/quantized integer NN
           | inference, not floating-point MM/inference. Hence, there is
          | no reduction in accuracy for that application compared to
          | using conventional fixed-point MM.
        
             | abetusk wrote:
             | I'm no expert but I suspect this is wrong. To me, this is
             | like saying you don't need to worry about integer overflow
             | because your operations are only working on fixed integers.
             | Really? You don't care if you multiply or add two large
             | numbers and they spill over?
             | 
             | The more appropriate answer, I suspect, is that the
             | numerical precision and stability sacrifices are more than
             | adequate for normal usage.
             | 
             | If I'm wrong about this, I would certainly like to know.
        
               | pclmulqdq wrote:
               | In hardware, you control your integer widths completely,
               | so if you add two 32-bit ints to a 33-bit int, there is
               | no chance of overflow. The same goes for multiplications,
               | etc.
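                | 
                | A toy way to see the bit-width bookkeeping (illustrative
                | Python only, not the paper's hardware):
                | 
                |     import math
                |     # two k-bit signed operands -> 2k-bit product;
                |     # summing n products needs ~ceil(log2(n)) extra bits,
                |     # so the accumulator can be sized to never overflow.
                |     k, n = 8, 1024          # e.g. int8, 1024-term dot
                |     acc_bits = 2 * k + math.ceil(math.log2(n))
                |     print(acc_bits)         # 26-bit accumulator suffices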
        
             | pclmulqdq wrote:
             | "Conventional fixed-point MM" is a large suite of
             | algorithms. It is correct that this is a 2x reduction in
             | MULs compared to _naive_ fixed-point matrix multiply, but
             | there is a large body of literature out there with other
             | circuits. This is a cool trick to add to the group.
        
             | p1esk wrote:
              | The inference world is gradually switching from INT formats
              | to FP formats. FP8 is already supported in modern hardware,
             | and FP4 support is coming. In my experiments I get better
             | perplexity in language models with FP4 than with INT4.
        
         | lupire wrote:
         | It's not just a software algorithm. It's a hardware
         | architecture optimization. To benefit, you have to build
         | hardware that matches the dimensions of the algorithm. That's
         | an expensive commitment.
        
           | emacs28 wrote:
           | > you have to build hardware that matches the dimensions of
           | the algorithm
           | 
            | Yes, the benefits are realized in custom hardware designs as
            | opposed to software. However, the hardware architectures work
            | for multiplying matrices of arbitrary dimensions by splitting
            | larger matrices into smaller tiles, then summing the tile
            | products to form the final, larger matrix product (i.e. GEMM).
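            | 
            | A rough sketch of that tiling idea in NumPy (illustrative
            | only, not the repo's hardware code; the tile size T is
            | arbitrary):
            | 
            |     import numpy as np
            | 
            |     def tiled_matmul(A, B, T=4):
            |         # split A (m x k) and B (k x n) into T x T tiles and
            |         # accumulate the tile products into C
            |         m, k = A.shape
            |         _, n = B.shape
            |         assert m % T == 0 and k % T == 0 and n % T == 0
            |         C = np.zeros((m, n))
            |         for i in range(0, m, T):
            |             for j in range(0, n, T):
            |                 for p in range(0, k, T):
            |                     C[i:i+T, j:j+T] += \
            |                         A[i:i+T, p:p+T] @ B[p:p+T, j:j+T]
            |         return C
            | 
            |     A = np.random.rand(8, 12)
            |     B = np.random.rand(12, 8)
            |     print(np.allclose(tiled_matmul(A, B), A @ B))  # True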
        
         | adastra22 wrote:
         | I've only glanced at it so someone correct me if I'm wrong, but
         | IIUC this is not a replacement for matrix multiplication but
         | rather an approximation that only gives decent-ish results for
         | the types of linear systems you see in AI/ML. But for that use
         | case it is totally fine?
        
         | pbsd wrote:
         | It's not quite forgotten. It kind of lives on in the pseudo-dot
         | product Wegman-Carter authenticators like UMAC. See Section 3
         | of [1] for context.
         | 
         | [1] https://cr.yp.to/antiforgery/pema-20071022.pdf
        
       | Drakim wrote:
        | I'm surprised this actually works; usually detecting whether to
        | use multiplication or addition is slower than simply using
        | multiplication, especially if it's massive amounts of work being
        | done in parallel.
        
       | barfbagginus wrote:
       | This readme does a really poor job of explaining what the
       | improvement is or how they drop half the multiplications. What is
       | the Big O run time on this? Is this shifting the known best
       | bounds?
       | 
       | And the diagrams are chaotic and don't really explain anything
       | about why this approach is fast or good. The result is that I'm
       | reluctant to even click-through to the PDF.
       | 
       | If you want to improve the project credibility please consider
       | being honest and open about what is actually going on and giving
       | some clear explanations and illustrations, rather than things
        | that may as well be designed to hype up people too busy to tell
        | you that you are cranks. It's hard to tell if this is incredibly
        | groundbreaking or just a nothingburger. Sadly I almost feel like
        | that must be an intentional decision motivated by the poor merits
        | of the work and a desire to exploit AI hype. The alternative -
        | which I prefer to believe is the case - is that the author simply
        | needs to revise and better contextualize.
        
         | mariocesar wrote:
         | It's actually fairly clear
        
           | hackyhacky wrote:
           | Not to everyone. If it's clear to you, you could helpfully
           | explain it.
        
             | VogonPoetry wrote:
              | This is an analogy: a^2 - b^2 = a*a - b*b, which can be
              | factored to (a+b)(a-b). In the first expression there are
              | two multiplies; in the factored version there is only one.
             | 
             | However, from a numerical analysis / accuracy standpoint,
             | evaluating the factored expression can result in loss of
             | precision in the result when a is close to b. This is
             | especially true if you repeatedly and sequentially do a lot
             | of these operations. Loss of precision can be a problem in
              | numerical modeling (like climate simulation) -- long term
             | predictions diverge.
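              | 
              | A tiny illustration of that kind of cancellation in float32
              | (just toy numbers): once a small value has been absorbed
              | into a large intermediate, subtracting the large part back
              | out can't recover it.
              | 
              |     import numpy as np
              | 
              |     big = np.float32(1.0e8)
              |     small = np.float32(1.0)
              |     # mathematically (big + small) - big == 1.0, but the
              |     # small value is rounded away inside the large sum
              |     print((big + small) - big)  # prints 0.0 in float32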
             | 
             | Given that there is a drive to use greatly reduced
             | precision in ML engines, loss of precision might have an
             | effect on how a model performs. Then again, it might not. I
             | haven't read a lot of papers on ML, but I don't recall
             | seeing ones that try to quantify how sensitive a model is
             | to error propagation. (I am making a distinction between
             | tests where the precision is reduced to see where it breaks
              | down vs. calculating / understanding what the error level
             | actually is in a model)
        
               | jart wrote:
               | With LLMs they start showing signs of brain damage once
               | the errors get too high. In my experience it reduces
               | their ability to reason, they stop counting correctly,
               | and they start homogenizing categories, like calling a
               | lemur a monkey. Compare this with quantizing weights,
               | which instead of brain damage leads to ignorance, forcing
               | them to hallucinate more.
        
         | stephencanon wrote:
         | The readme doesn't explain much, but the introduction to the
         | paper itself is quite approachable.
         | 
         | As for whether it's groundbreaking or not ... it's a neat
         | readily-accessible constant-factor improvement for area-
         | constrained fixed-point accelerators. That doesn't change
         | everything overnight, but neither is it nothing. It's nice
         | work.
        
         | Someone wrote:
         | > What is the Big O run time on this?
         | 
         | The claim is they're dropping half the multiplications, so it
         | isn't doing anything for Big O.
         | 
         | > If you want to improve the project credibility please
         | consider being honest and open about what is actually going on
         | and giving some clear explanations and illustrations,
         | 
         | The math explaining how to halve the number of multiplications
         | in the paper (https://arxiv.org/abs/2311.12224) isn't hard to
         | understand.
         | 
         | You only have to read formulas 2 (traditional matrix
         | multiplication) and 3 to 6.
         | 
         | I think it's clear it does do what's being advertised, halving
         | the number of multiplications at the cost of lots of extra
         | additions/subtractions.
         | 
          | They then go on to better vectorize that algorithm. That, as is
          | usual, quickly gets messy-looking.
         | 
         | My main concern would be numerical stability.
        
           | emacs28 wrote:
           | Thanks, good summary. Regarding numerical stability, the
           | application is for fixed-point arithmetic, and therefore
            | numerical stability is not an issue (the result is identical
            | to that of the conventional inner product)
        
         | eigenket wrote:
         | > This readme does a really poor job of explaining what the
         | improvement is or how they drop half the multiplications. What
         | is the Big O run time on this? Is this shifting the known best
         | bounds?
         | 
          | Without wishing to sound elitist, I don't understand the
         | point of this comment at all. If you don't understand Big O
         | notation enough to know that "half the multiplications" doesn't
         | change it then why are you even asking about it?
        
       | ixaxaar wrote:
       | Man I remembered something similar I had tried working on in
       | 2018, but gave up after all my PhD applications got rejected.
       | 
       | https://github.com/ixaxaar/pytorch-dni
       | 
       | The concept here goes a bit further and tries to replicate
       | backprop with an external network, arguing that that's probably
       | what the brain actually does.
        
         | yorwba wrote:
         | I'm not seeing the connection. This work is about low-level
         | optimization of matrix multiplication. The repo you linked
         | seems to be about replacing back-propagated gradients with a
         | cheaper estimate. What's the similarity you see between these
         | two?
        
           | ixaxaar wrote:
            | Correct, I think I mistook it as "use a small neural net to
            | approximate matrix multiplication" when it actually seems to
            | be "use cheaper replacements for matrix mul without much
            | accuracy loss".
           | 
           | Wellll that means I can give dni another try :D
        
         | rollingtide wrote:
         | Unrelated to the technical discussion but I was wondering what
         | you made that architecture gif with? Looks neat!
        
           | ixaxaar wrote:
           | I think that image is from the paper and was not created by
           | me. Looks cool indeed!
        
         | jebarker wrote:
         | This feels like a "no free lunch" situation. I would imagine
         | that any time saving in approximating the gradients this way
         | would be lost to needing to train for more iterations due to
         | the loss in gradient accuracy. Is that not the case?
        
           | ixaxaar wrote:
            | I think that's the reason it's a dead end.
            | 
            | However, if this really is the biological analogue of credit
            | assignment, it might scale better than training LLMs from
            | scratch every time. Even if, say, it could only approximate
            | gradients to a certain degree for a new network, normal
            | backprop could then fine-tune for a few epochs or so,
            | dramatically reducing overall training costs.
        
       | Lucasoato wrote:
       | If you're interested in the mathematical theory behind sub-cubic
       | algorithms for matrix multiplications, you can start from here:
       | https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
       | 
       | I conjecture that for every j > 0 in R, a number n exists so that
       | any two n x n matrices can be multiplied together in O(n^(2+j))
       | steps.
       | 
        | (Now proven for 2+j = ω = 2.3728596, i.e. for j >= 0.3728596)
        
         | abeppu wrote:
         | Predicting that this holds for any j > 0 seems rather bold.
         | Would you care to share your intuition why you think that's the
         | case?
        
           | roflmaostc wrote:
           | Two matrices with size NxN each can be multiplied naively
           | with the schoolbook algorithm in O(N^3).
           | 
           | It's clear that the algorithm needs at least O(N^2) because
           | to access each element of the matrices once, you need a
            | double for loop, which is O(N^2).
            | 
            |     for i in rows:
            |         for j in cols:
            |             # do something with matrix1[i, j], matrix2[i, j], ...
           | 
           | so it has to be j >= 0
        
             | JohnKemeny wrote:
              | His question was: what is the reasoning behind there
              | existing an algorithm running in time n^(2+epsilon) for
              | arbitrarily small epsilon?
        
             | abeppu wrote:
             | Yeah, we're agreed that j cannot be less than 0. But your
             | conjecture was about _every j > 0_. Do you have any
             | specific line of reasoning which suggests that j can be
             | arbitrarily close to 0 (0 is the greatest lower bound)? Why
             | do you not think there's some other specific limit k \in
             | (0, 0.3728596] beyond which j cannot be improved?
        
         | gizmo686 wrote:
         | > I conjecture that for every j > 0 in R, a number n exists so
         | that any two n x n matrices can be multiplied together in
         | O(n^(2+j)) steps.
         | 
         | Is this stated correctly? Because it seems almost meaningless
         | as stated. You start with "for every j, there exists an n such
         | that...". That would mean that for the rest of the statement, n
         | and j are constant. So you are just saying that you can
         | multiply constant sized matrices in constant time. Technically
         | true, but I feel like you are trying to claim something
         | stronger.
        
           | JohnKemeny wrote:
           | It should simply say:
           | 
           | for any j>0 there exists an algorithm multiplying nxn
           | matrices in time O(n^{2+j}).
        
             | Lucasoato wrote:
             | You are correct, I apologize for the confusion! :)
        
         | bee_rider wrote:
         | It does seem to be harder to make progress over time though.
         | Maybe it'll bottom out at j=1/e. I won't call my guess even a
         | conjecture though, it just happens to be a convenient constant
         | near where the current value is. It would be a funny prank for
         | math to pull on us.
        
       | michelpp wrote:
       | This is very cool and a real interesting read! For those in the
       | comments confused about how this is better, the paper is talking
       | about synthesizing matrix multiplication pipelines in hardware,
       | like an FPGA or ASIC. On a CPU or GPU you won't notice because
        | adds and multiplications generally take the same amount of time,
        | but multiplication units take up many more transistors, so if
       | you can reduce the circuit complexity you can increase the speed
       | and parallel throughput and reduce power and routing complexity.
       | This approach could be particularly useful for efficient sparse
       | matrix multiplication accelerators.
       | 
       | Another cool way to eliminate multiplication in matrix
       | multiplication is to use different semirings [1]. The Tropical
       | Semiring [2] for example substitutes addition for multiplication
       | and min (or max) for addition. It's still matrix multiplication
       | but with substituted binary operations. The research in this
       | relatively new field of Tropical Algebra [3] is quite active and
       | rich right now, being used for all kinds of optimization problems
       | and in research for optimizing neural networks [4] . This
       | approach also lends itself to hardware synthesis since most FPGA
        | configurable logic blocks can add/min/max in one clock cycle,
       | whereas efficient multiplication requires fixed dedicated on-chip
       | hardware multipliers.
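        | 
        | For the curious, a min-plus ("tropical") matrix product is only a
        | couple of lines in NumPy (an illustrative sketch, not a reference
        | implementation; with the zero-diagonal edge-cost matrix below,
        | tropical_matmul(A, A) gives shortest costs over at most two hops):
        | 
        |     import numpy as np
        | 
        |     def tropical_matmul(A, B):
        |         # C[i, j] = min_k (A[i, k] + B[k, j]):
        |         # ordinary multiply -> +, ordinary add -> min
        |         return np.min(A[:, :, None] + B[None, :, :], axis=1)
        | 
        |     INF = np.inf
        |     A = np.array([[0.0, 3.0, INF],
        |                   [INF, 0.0, 1.0],
        |                   [2.0, INF, 0.0]])
        |     print(tropical_matmul(A, A))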
       | 
       | Another way to efficiently remove multiplications with a
       | different but related semiring is to use a Log Semiring [5]. If
       | you have to multiply chains of probabilities (like Markov chains)
       | then the numbers quickly become very small and floating point
       | loses its accuracy to represent the numbers. By scaling the
       | numbers first by taking the log, multiplication becomes addition
       | and addition becomes x + log1p(exp(y - x)).
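        | 
        | In code, the log-semiring "add" is the usual two-value log-sum-exp
        | trick (illustrative sketch):
        | 
        |     import numpy as np
        | 
        |     def log_add(x, y):
        |         # x = log(p), y = log(q); returns log(p + q) in log space.
        |         # Pulling out the max keeps exp() from underflowing.
        |         x, y = max(x, y), min(x, y)
        |         return x + np.log1p(np.exp(y - x))
        | 
        |     def log_mul(x, y):
        |         # log(p * q) = log(p) + log(q)
        |         return x + y
        | 
        |     p, q = 1e-300, 3e-301
        |     print(np.exp(log_add(np.log(p), np.log(q))))  # ~1.3e-300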
       | 
       | [1] https://en.wikipedia.org/wiki/Semiring
       | 
       | [2] https://en.wikipedia.org/wiki/Tropical_semiring
       | 
       | [3] https://en.wikipedia.org/wiki/Tropical_geometry
       | 
       | [4] https://proceedings.mlr.press/v80/zhang18i/zhang18i.pdf
       | 
       | [5] https://en.wikipedia.org/wiki/Log_semiring
        
         | pk-protect-ai wrote:
         | > By scaling the numbers first by taking the log,
         | multiplication becomes addition and addition becomes x +
         | log1p(exp(y - x)).
         | 
         | Isn't this the same approach as in GF(2^x), which has been in
         | use for decades? The only limitation that comes to mind is the
         | field size.
        
           | jhj wrote:
           | Finite field log/antilog lookup tables are used for
           | efficient-ish multiplication, similar to addition/subtraction
           | tables used for logarithmic number systems.
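            | 
            | For anyone who hasn't seen them, a minimal sketch of such
            | tables for GF(2^8), assuming the AES reduction polynomial
            | 0x11B and generator 3 (real implementations vary):
            | 
            |     # build exp/log tables over the powers of the generator 3
            |     EXP = [0] * 512
            |     LOG = [0] * 256
            |     x = 1
            |     for i in range(255):
            |         EXP[i] = x
            |         LOG[x] = i
            |         # x *= 3 in GF(2^8): x*2 (reduced mod 0x11B) xor x
            |         x ^= (x << 1) ^ (0x11B if x & 0x80 else 0)
            |     for i in range(255, 512):
            |         EXP[i] = EXP[i - 255]
            | 
            |     def gf_mul(a, b):
            |         # multiply via one table add: 3^(log a + log b)
            |         return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]
            | 
            |     print(gf_mul(2, 3))  # 6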
        
         | gatane wrote:
         | Whoah, this is what Unified Algebra is all about!
         | 
         | http://www.cs.toronto.edu/~hehner/UA.pdf
        
           | buybackoff wrote:
           | This is what HN is about :)
           | 
           | I understood a fraction but instantly wanted to dive into the
           | topic just after reading such a knowledgeable comment.
        
         | jhj wrote:
         | > By scaling the numbers first by taking the log,
         | multiplication becomes addition and addition becomes x +
         | log1p(exp(y - x)).
         | 
         | Addition/subtraction in a logarithmic number system is way more
         | expensive than what you would spend on multiplication,
         | especially if you care about correctly rounded results, as the
         | (hardware) LUTs required are rather big.
        
         | bearzoo wrote:
         | somewhat related is the number theoretic transform
         | 
         | https://ieeexplore.ieee.org/abstract/document/1451721
        
         | btown wrote:
         | The paper in [4] is absolutely fascinating - I'm very much a
         | neophyte here, but I believe it shows that practically any ReLU
         | network can be represented as a tropical ratio of two tropical
         | polynomials, and thus can be analyzed with geometric principles
         | including visualizations of surfaces. It's been cited in more
         | recent work:
         | https://scholar.google.com/scholar?cites=1003719112553620451...
         | - does anyone know if there's been any significant advancements
         | here?
        
       | skykooler wrote:
       | I find it fascinating that this is using a process invented in
       | 1968 and hasn't been used for this purpose until now!
        
         | pk-protect-ai wrote:
         | Hey, nobody knew what to do with GF(2^x) up until mid last
         | century either... Oh wait, CS was not really a thing almost up
         | until mid last century...
        
       ___________________________________________________________________
       (page generated 2024-03-15 23:00 UTC)