[HN Gopher] Show HN: Matrix Multiplication with Half the Multipl...
___________________________________________________________________
Show HN: Matrix Multiplication with Half the Multiplications
Author : emacs28
Score : 199 points
Date : 2024-03-15 10:36 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| halflings wrote:
| This looks pretty cool! What's the catch? E.g. why isn't this
| already implemented in accelerators? Is it really just a
| forgotten algorithm, or does it have implications for the cost
| of building the accelerator, or something else?
| emacs28 wrote:
| IMHO, for fixed-point MM accelerators, there is no catch; I
| think it's an overlooked algorithm. It's based on an algorithm
| by Winograd, who coincidentally also proposed another,
| unrelated algorithm that later became very popular for CNN
| acceleration and probably took some visibility away from this
| one... But that is speculative.
| yorwba wrote:
| On the other hand, if you tried it with floating point, you'd
| lose significant digits. Since the approach is to sum (a[i] +
| b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1]
| and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you
| may be taking the difference of two large values to get a
| small value, losing precision.
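|
| Here's a minimal Python sketch of that pairing (my own
| illustration, not code from the repo); with random floats the
| trailing digits of the two results typically differ:
|
|     import numpy as np
|
|     def dot_paired(a, b):
|         # assumes len(a) == len(b) and the length is even
|         s = sum((a[i] + b[i+1]) * (a[i+1] + b[i])
|                 for i in range(0, len(a), 2))
|         ra = sum(a[i] * a[i+1] for i in range(0, len(a), 2))
|         rb = sum(b[i] * b[i+1] for i in range(0, len(b), 2))
|         return s - ra - rb
|
|     rng = np.random.default_rng(0)
|     a = rng.normal(size=64)
|     b = rng.normal(size=64) * 1e-6   # cross terms dwarf the result
|     print(np.dot(a, b), dot_paired(a, b))   # last digits differ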
| tmp8086 wrote:
| Here's a matrix multiplication using Winograd's algorithm
| (that you describe), implemented in Go (very small code):
|
| https://bugfix-66.com/983cb3a3a37dbe8e82edea6298988f03d2a1a
| 4...
| OJFord wrote:
| LLM hype and this submission in particular keep making me
| think of a lecturer I had for _Topics in Large Dimensional
| Data Processing_, circa 2016: as I recall he was
| enthusiastically adamant that the most important breakthroughs
| in the years and decades to come were going to be in faster
| matrix operations. Anyway, I'm pretty sure I recognise
| FIP (not FFIP of course) from that course.
|
| I wish I could remember his name; I believe he left academia
| after my year and went to work in industry, and I'd just be
| curious to see what he's up to now. I'm not saying it was a
| particularly novel or prescient comment/attitude, and while we
| may not have had quite such ML hype, 'big data' was certainly
| all the rage at the time; it's just something that's stuck in
| my mind. It's one of those areas I always meant to study more,
| but realistically I probably never had the mathematical chops
| for it, and those I did have have certainly atrophied.
| bee_rider wrote:
| Maybe I'm joking, but: our society is just a vehicle for
| economics at this point, our economy is built around
| science, our science has mostly been turned into
| observations about engineering, some time ago we changed
| all of engineering into differential equations, and
| differential equations can be solved by discretizing them
| and doing linear algebra, and most of linear algebra can be
| done with matrix multiplications (triangular solves and
| orthonormalizations if you are fancy). All you need is
| matmul.
| OJFord wrote:
| > our science has mostly been turned into observations
| about engineering
|
| You may be joking but that in particular seems pretty
| astute.
|
| Superficially it seems accurate, and reasonably
| ascribable to economic forces: fewer concentrations of
| capital in people (aristocrats) spending it on a hobby
| interest or academic pursuit of their own - today's
| equivalents mostly prefer philanthropy (Musk is, I
| suppose, for whatever else you might think of him, a
| notable exception, preferring to explore space, AI, etc.,
| not really, it seems, for personal monetary gain). But I
| wonder if that _is_ fair to modern scientists, or is it
| just that 'all the low-hanging stuff's been done'?
| mariocesar wrote:
| Perhaps it's less of a hidden gem and more of a spotlight
| moment.
| pclmulqdq wrote:
| There are a lot of matrix multiplication algorithms out there
| with a lot of pluses and minuses. It's always a balance of
| accuracy, runtime, and scaling. This one probably has bad
| accuracy in floating point.
| waynecochran wrote:
| The document said it outputs the _exact_ same values as the
| conventional method. There is _no_ accuracy trade-off here.
| p1esk wrote:
| For floating point? Are you sure?
| waynecochran wrote:
| Opening statement of README:
|
|     This repository contains the source code for ML hardware
|     architectures that require nearly half the number of
|     multiplier units to achieve the same performance, by
|     executing alternative inner-product algorithms that trade
|     nearly half the multiplications for cheap low-bitwidth
|     additions, while still producing identical output as the
|     conventional inner product.
| p1esk wrote:
| I just looked at the paper: the answer is no, floating
| point is not supported.
| pclmulqdq wrote:
| The paper cited is about hardware, where there is no
| accuracy tradeoff because you control the numerical
| precision completely and use fixed point. In a software
| implementation, neither is true. There is _no chance_ that
| you will get the exact same values out of this method that
| you do out of other FP matmuls.
| abetusk wrote:
| I don't know why this answer is getting downvoted. This is
| absolutely correct.
|
| W. Miller has a paper showing that, under certain conditions
| of numerical stability, O(n^3) multiplications are necessary
| [0]. Any algorithm with sub-cubic runtime for matrix
| multiplication, like Strassen's or Coppersmith-Winograd, must
| sacrifice some amount of precision or stability.
|
| [0] https://epubs.siam.org/doi/10.1137/0204009
| gyrovagueGeist wrote:
| Another relevant paper is:
| https://epubs.siam.org/doi/10.1137/15M1032168
| emacs28 wrote:
| For everyone discussing the reduced accuracy/numerical
| stability of the algorithms in floating-point, this is true.
| But note that the application of the algorithms in the work
| is explored for fixed-point MM/quantized integer NN
| inference, not floating-point MM/inference. Hence, there is
| no reduction in accuracy for that application compared to
| using conventional fixed-point MM.
| abetusk wrote:
| I'm no expert but I suspect this is wrong. To me, this is
| like saying you don't need to worry about integer overflow
| because your operations are only working on fixed integers.
| Really? You don't care if you multiply or add two large
| numbers and they spill over?
|
| The more appropriate answer, I suspect, is that the
| numerical precision and stability sacrifices are more than
| adequate for normal usage.
|
| If I'm wrong about this, I would certainly like to know.
| pclmulqdq wrote:
| In hardware, you control your integer widths completely,
| so if you add two 32-bit ints into a 33-bit result, there is
| no chance of overflow. The same goes for multiplications,
| etc.
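|
| To make the width bookkeeping concrete, a quick Python sketch
| (my own illustration, using unsigned values to keep signed
| corner cases out of the way):
|
|     import math
|
|     w, n = 8, 16    # operand width, number of accumulated products
|     prod_bits = 2 * w                             # (2^w - 1)^2 < 2^(2w)
|     acc_bits = prod_bits + math.ceil(math.log2(n))
|     print(prod_bits, acc_bits)                    # 16, 20
|
|     # worst case fits: n products of maximal operands
|     assert n * (2**w - 1)**2 < 2**acc_bits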
| pclmulqdq wrote:
| "Conventional fixed-point MM" is a large suite of
| algorithms. It is correct that this is a 2x reduction in
| MULs compared to _naive_ fixed-point matrix multiply, but
| there is a large body of literature out there with other
| circuits. This is a cool trick to add to the group.
| p1esk wrote:
| Inference world is gradually switching from INT formats to
| FP formats. FP8 is already supported in modern hardware,
| and FP4 support is coming. In my experiments I get better
| perplexity in language models with FP4 than with INT4.
| lupire wrote:
| It's not just a software algorithm. It's a hardware
| architecture optimization. To benefit, you have to build
| hardware that matches the dimensions of the algorithm. That's
| an expensive commitment.
| emacs28 wrote:
| > you have to build hardware that matches the dimensions of
| the algorithm
|
| Yes, the benefits are realized in custom hardware designs as
| opposed to software. However, the hardware architectures work
| for multiplying matrices of arbitrary dimensions by splitting
| larger matrices into smaller tiles, then summing the tile
| products to form the final larger matrix product (i.e. GEMM).
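|
| A rough software sketch of the tiling idea (illustrative
| only; the actual designs do this with fixed-size compute
| arrays in hardware, not Python):
|
|     import numpy as np
|
|     def tiled_matmul(A, B, T=4):
|         # assumes all dimensions are multiples of the tile size T
|         m, n = A.shape
|         _, p = B.shape
|         C = np.zeros((m, p), dtype=A.dtype)
|         for i in range(0, m, T):
|             for j in range(0, p, T):
|                 for k in range(0, n, T):
|                     # each T x T tile product maps onto the fixed-size
|                     # accelerator; partial products are accumulated
|                     C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
|         return C
|
|     A = np.arange(64).reshape(8, 8)
|     B = np.arange(64).reshape(8, 8)
|     assert (tiled_matmul(A, B) == A @ B).all()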
| adastra22 wrote:
| I've only glanced at it so someone correct me if I'm wrong, but
| IIUC this is not a replacement for matrix multiplication but
| rather an approximation that only gives decent-ish results for
| the types of linear systems you see in AI/ML. But for that use
| case it is totally fine?
| pbsd wrote:
| It's not quite forgotten. It kind of lives on in the pseudo-dot
| product Wegman-Carter authenticators like UMAC. See Section 3
| of [1] for context.
|
| [1] https://cr.yp.to/antiforgery/pema-20071022.pdf
| Drakim wrote:
| I'm surprised this actually works; usually, detecting whether to
| use multiplication or addition is slower than simply using
| multiplication. Especially if it's massive amounts of work being
| done in parallel.
| barfbagginus wrote:
| This readme does a really poor job of explaining what the
| improvement is or how they drop half the multiplications. What is
| the Big O run time on this? Is this shifting the known best
| bounds?
|
| And the diagrams are chaotic and don't really explain anything
| about why this approach is fast or good. The result is that I'm
| reluctant to even click through to the PDF.
|
| If you want to improve the project credibility please consider
| being honest and open about what is actually going on and giving
| some clear explanations and illustrations, rather than things
| that may as well be designed to hype people too busy to tell you
| that you are cranks. It's hard to tell if this is incredibly
| groundbreaking or just a nothingburger. Sadly, I almost feel
| like that must be an intentional decision motivated by the poor
| merits of the work and a desire to exploit AI hype. The
| alternative, which I prefer to believe is the case, is that the
| author simply needs to revise and better contextualize.
| mariocesar wrote:
| It's actually fairly clear
| hackyhacky wrote:
| Not to everyone. If it's clear to you, you could helpfully
| explain it.
| VogonPoetry wrote:
| This is an analogy. a^2 - b^2 = a*a - b*b. This can be
| factored to (a+b)(a-b). In the first expression there are
| two multiplies, in the factored version there is only one.
|
| However, from a numerical analysis / accuracy standpoint,
| evaluating the factored expression can result in loss of
| precision in the result when a is close to b. This is
| especially true if you repeatedly and sequentially do a lot
| of these operations. Loss of precision can be a problem in
| numerical modeling (like climate simulation) -- long-term
| predictions diverge.
|
| Given that there is a drive to use greatly reduced
| precision in ML engines, loss of precision might have an
| effect on how a model performs. Then again, it might not. I
| haven't read a lot of papers on ML, but I don't recall
| seeing ones that try to quantify how sensitive a model is
| to error propagation. (I am making a distinction between
| tests where the precision is reduced to see where it breaks
| down v.s. calculating / understanding what the error level
| actually is in a model)
| jart wrote:
| With LLMs they start showing signs of brain damage once
| the errors get too high. In my experience it reduces
| their ability to reason, they stop counting correctly,
| and they start homogenizing categories, like calling a
| lemur a monkey. Compare this with quantizing weights,
| which instead of brain damage leads to ignorance, forcing
| them to hallucinate more.
| stephencanon wrote:
| The readme doesn't explain much, but the introduction to the
| paper itself is quite approachable.
|
| As for whether it's groundbreaking or not ... it's a neat
| readily-accessible constant-factor improvement for area-
| constrained fixed-point accelerators. That doesn't change
| everything overnight, but neither is it nothing. It's nice
| work.
| Someone wrote:
| > What is the Big O run time on this?
|
| The claim is they're dropping half the multiplications, so it
| isn't doing anything for Big O.
|
| > If you want to improve the project credibility please
| consider being honest and open about what is actually going on
| and giving some clear explanations and illustrations,
|
| The math explaining how to halve the number of multiplications
| in the paper (https://arxiv.org/abs/2311.12224) isn't hard to
| understand.
|
| You only have to read formulas 2 (traditional matrix
| multiplication) and 3 to 6.
|
| I think it's clear it does do what's being advertised, halving
| the number of multiplications at the cost of lots of extra
| additions/subtractions.
|
| They then go on to vectorize that algorithm further. As usual,
| that gets messy quickly.
|
| My main concern would be numerical stability.
| emacs28 wrote:
| Thanks, good summary. Regarding numerical stability, the
| application is for fixed-point arithmetic, and therefore
| numerical stability is not an issue (the result is identical
| to that of the conventional inner product).
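|
| A small end-to-end check in Python (my own sketch of the
| underlying inner-product identity, not the paper's hardware
| architecture): with integer inputs the result matches the
| conventional matmul exactly, and the row/column correction
| terms are computed once and reused, which is where the
| multiplication count drops toward half per output.
|
|     import numpy as np
|
|     def fip_matmul(A, B):
|         # A: m x n, B: n x p, n even; plain Python ints are exact
|         m, n, p = len(A), len(A[0]), len(B[0])
|         ra = [sum(A[i][k] * A[i][k+1] for k in range(0, n, 2))
|               for i in range(m)]
|         rb = [sum(B[k][j] * B[k+1][j] for k in range(0, n, 2))
|               for j in range(p)]
|         C = [[0] * p for _ in range(m)]
|         for i in range(m):
|             for j in range(p):
|                 s = sum((A[i][k] + B[k+1][j]) * (A[i][k+1] + B[k][j])
|                         for k in range(0, n, 2))
|                 C[i][j] = s - ra[i] - rb[j]
|         return C
|
|     rng = np.random.default_rng(1)
|     A = rng.integers(-128, 128, size=(4, 8)).tolist()
|     B = rng.integers(-128, 128, size=(8, 4)).tolist()
|     assert (np.array(fip_matmul(A, B)) == np.array(A) @ np.array(B)).all()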
| eigenket wrote:
| > This readme does a really poor job of explaining what the
| improvement is or how they drop half the multiplications. What
| is the Big O run time on this? Is this shifting the known best
| bounds?
|
| Without wishing to sound elitist, I don't understand the
| point of this comment at all. If you don't understand Big O
| notation well enough to know that "half the multiplications"
| doesn't change it, then why are you even asking about it?
| ixaxaar wrote:
| Man I remembered something similar I had tried working on in
| 2018, but gave up after all my PhD applications got rejected.
|
| https://github.com/ixaxaar/pytorch-dni
|
| The concept here goes a bit further and tries to replicate
| backprop with an external network, arguing that that's probably
| what the brain actually does.
| yorwba wrote:
| I'm not seeing the connection. This work is about low-level
| optimization of matrix multiplication. The repo you linked
| seems to be about replacing back-propagated gradients with a
| cheaper estimate. What's the similarity you see between these
| two?
| ixaxaar wrote:
| Correct, I think I mistook it as "use a small neural net to
| approximate matrix multiplication", whereas it actually seems
| to be "use cheaper replacements for matrix mul without much
| accuracy loss".
|
| Wellll that means I can give dni another try :D
| rollingtide wrote:
| Unrelated to the technical discussion but I was wondering what
| you made that architecture gif with? Looks neat!
| ixaxaar wrote:
| I think that image is from the paper and was not created by
| me. Looks cool indeed!
| jebarker wrote:
| This feels like a "no free lunch" situation. I would imagine
| that any time saving in approximating the gradients this way
| would be lost to needing to train for more iterations due to
| the loss in gradient accuracy. Is that not the case?
| ixaxaar wrote:
| I think that's the reason it ended up a dead end.
|
| However, if this really is the biological analogue of credit
| assignment, it might scale better than training LLMs from
| scratch every time. Even if it could only approximate
| gradients to a certain degree for a new network, normal
| backprop could then fine-tune for a few epochs or so,
| dramatically reducing overall training costs.
| Lucasoato wrote:
| If you're interested in the mathematical theory behind sub-cubic
| algorithms for matrix multiplications, you can start from here:
| https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
|
| I conjecture that for every j > 0 in R, a number n exists so that
| any two n x n matrices can be multiplied together in O(n^(2+j))
| steps.
|
| (Now proven for 2+j = w = 2.3728596, or j > 0.3728596)
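|
| For anyone who wants something concrete to play with, here is
| a minimal, unoptimized Python sketch of Strassen's O(n^2.807)
| scheme for power-of-two sizes (the asymptotically faster
| algorithms on that page have impractically large constants):
|
|     import numpy as np
|
|     def strassen(A, B, leaf=64):
|         n = A.shape[0]            # assumes square, power-of-two n
|         if n <= leaf:
|             return A @ B          # fall back to the schoolbook product
|         h = n // 2
|         A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
|         B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
|         M1 = strassen(A11 + A22, B11 + B22, leaf)
|         M2 = strassen(A21 + A22, B11, leaf)
|         M3 = strassen(A11, B12 - B22, leaf)
|         M4 = strassen(A22, B21 - B11, leaf)
|         M5 = strassen(A11 + A12, B22, leaf)
|         M6 = strassen(A21 - A11, B11 + B12, leaf)
|         M7 = strassen(A12 - A22, B21 + B22, leaf)
|         C = np.empty_like(A)
|         C[:h, :h] = M1 + M4 - M5 + M7
|         C[:h, h:] = M3 + M5
|         C[h:, :h] = M2 + M4
|         C[h:, h:] = M1 - M2 + M3 + M6
|         return C
|
|     A = np.random.default_rng(0).integers(0, 10, (128, 128))
|     B = np.random.default_rng(1).integers(0, 10, (128, 128))
|     assert (strassen(A, B) == A @ B).all()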
| abeppu wrote:
| Predicting that this holds for any j > 0 seems rather bold.
| Would you care to share your intuition why you think that's the
| case?
| roflmaostc wrote:
| Two matrices with size NxN each can be multiplied naively
| with the schoolbook algorithm in O(N^3).
|
| It's clear that the algorithm needs at least O(N^2), because
| to access each element of the matrices once you need a
| double for loop, which is O(N^2):
|
|     for i in rows:
|         for j in cols:
|             # do something with matrix1[i, j], matrix2[i, j], ...
|
| so it has to be j >= 0
| JohnKemeny wrote:
| His question was: what is the reasoning behind there
| existing an algorithm running in time n^(2+epsilon) for
| arbitrarily small epsilon?
| abeppu wrote:
| Yeah, we're agreed that j cannot be less than 0. But your
| conjecture was about _every j > 0_. Do you have any
| specific line of reasoning which suggests that j can be
| arbitrarily close to 0 (0 is the greatest lower bound)? Why
| do you not think there's some other specific limit k \in
| (0, 0.3728596] beyond which j cannot be improved?
| gizmo686 wrote:
| > I conjecture that for every j > 0 in R, a number n exists so
| that any two n x n matrices can be multiplied together in
| O(n^(2+j)) steps.
|
| Is this stated correctly? Because it seems almost meaningless
| as stated. You start with "for every j, there exists an n such
| that...". That would mean that for the rest of the statement, n
| and j are constant. So you are just saying that you can
| multiply constant sized matrices in constant time. Technically
| true, but I feel like you are trying to claim something
| stronger.
| JohnKemeny wrote:
| It should simply say:
|
| for any j>0 there exists an algorithm multiplying nxn
| matrices in time O(n^{2+j}).
| Lucasoato wrote:
| You are correct, I apologize for the confusion! :)
| bee_rider wrote:
| It does seem to be harder to make progress over time though.
| Maybe it'll bottom out at j=1/e. I won't call my guess even a
| conjecture though, it just happens to be a convenient constant
| near where the current value is. It would be a funny prank for
| math to pull on us.
| michelpp wrote:
| This is very cool and a really interesting read! For those in the
| comments confused about how this is better, the paper is talking
| about synthesizing matrix multiplication pipelines in hardware,
| like an FPGA or ASIC. On a CPU or GPU you won't notice because
| adds and multiplications take the same amount of time generally,
| but multiplication units take up many more transistors, so if
| you can reduce the circuit complexity you can increase the speed
| and parallel throughput and reduce power and routing complexity.
| This approach could be particularly useful for efficient sparse
| matrix multiplication accelerators.
|
| Another cool way to eliminate multiplication in matrix
| multiplication is to use different semirings [1]. The Tropical
| Semiring [2] for example substitutes addition for multiplication
| and min (or max) for addition. It's still matrix multiplication
| but with substituted binary operations. The research in this
| relatively new field of Tropical Algebra [3] is quite active and
| rich right now, being used for all kinds of optimization problems
| and in research on optimizing neural networks [4]. This
| approach also lends itself to hardware synthesis, since most
| FPGA configurable logic blocks can add/min/max in one clock
| cycle, whereas efficient multiplication requires fixed,
| dedicated on-chip hardware multipliers.
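|
| A tiny Python sketch of a (min, +) tropical matmul (my own
| illustration): squaring a weight matrix this way gives the
| cheapest two-hop path costs, with no multiplier in sight.
|
|     import numpy as np
|
|     def tropical_matmul(A, B):
|         # (min, +) semiring: "multiply" is +, "add" is min, so
|         # C[i, j] = min over k of (A[i, k] + B[k, j])
|         m = A.shape[0]
|         p = B.shape[1]
|         C = np.full((m, p), np.inf)
|         for i in range(m):
|             for j in range(p):
|                 C[i, j] = np.min(A[i, :] + B[:, j])
|         return C
|
|     W = np.array([[0.0,    1.0,    np.inf],
|                   [np.inf, 0.0,    2.0],
|                   [5.0,    np.inf, 0.0]])
|     print(tropical_matmul(W, W))   # entry [0, 2] is 3.0 (path 0 -> 1 -> 2)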
|
| Another way to efficiently remove multiplications with a
| different but related semiring is to use a Log Semiring [5]. If
| you have to multiply chains of probabilities (like Markov chains)
| then the numbers quickly become very small and floating point
| can no longer represent them accurately. By scaling the
| numbers first by taking the log, multiplication becomes addition
| and addition becomes x + log1p(exp(y - x)).
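|
| A short Python sketch of that trick (my own illustration): a
| chain of tiny probabilities that would underflow to zero as
| ordinary floats stays representable in log space.
|
|     import math
|
|     def log_mul(x, y):
|         # log(a * b) = log a + log b
|         return x + y
|
|     def log_add(x, y):
|         # log(a + b), evaluated stably as x + log1p(exp(y - x)), x >= y
|         if x < y:
|             x, y = y, x
|         return x + math.log1p(math.exp(y - x))
|
|     total = 0.0                        # log(1)
|     for _ in range(40):                # product of forty 1e-12 factors
|         total = log_mul(total, math.log(1e-12))
|     print(total)                       # about -1105.2, no underflow
|     print(log_add(math.log(0.5), math.log(0.25)))   # log(0.75)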
|
| [1] https://en.wikipedia.org/wiki/Semiring
|
| [2] https://en.wikipedia.org/wiki/Tropical_semiring
|
| [3] https://en.wikipedia.org/wiki/Tropical_geometry
|
| [4] https://proceedings.mlr.press/v80/zhang18i/zhang18i.pdf
|
| [5] https://en.wikipedia.org/wiki/Log_semiring
| pk-protect-ai wrote:
| > By scaling the numbers first by taking the log,
| multiplication becomes addition and addition becomes x +
| log1p(exp(y - x)).
|
| Isn't this the same approach as in GF(2^x), which has been in
| use for decades? The only limitation that comes to mind is the
| field size.
| jhj wrote:
| Finite field log/antilog lookup tables are used for
| efficient-ish multiplication, similar to addition/subtraction
| tables used for logarithmic number systems.
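|
| A small Python sketch of the GF(2^8) version of that idea
| (the standard construction over the AES polynomial 0x11B with
| generator 0x03, not anything taken from the linked work):
|
|     def xtime(a):                     # multiply by x (i.e. 2) mod 0x11B
|         a <<= 1
|         return a ^ 0x11B if a & 0x100 else a
|
|     EXP, LOG = [0] * 255, [0] * 256
|     g = 1
|     for i in range(255):
|         EXP[i], LOG[g] = g, i
|         g = xtime(g) ^ g              # multiply by the generator 0x03
|
|     def gf_mul_table(a, b):           # one add and two lookups, no multiply
|         if a == 0 or b == 0:
|             return 0
|         return EXP[(LOG[a] + LOG[b]) % 255]
|
|     def gf_mul_direct(a, b):          # reference shift-and-xor multiply
|         r = 0
|         while b:
|             if b & 1:
|                 r ^= a
|             a, b = xtime(a), b >> 1
|         return r
|
|     assert all(gf_mul_table(a, b) == gf_mul_direct(a, b)
|                for a in range(256) for b in range(256))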
| gatane wrote:
| Whoah, this is what Unified Algebra is all about!
|
| http://www.cs.toronto.edu/~hehner/UA.pdf
| buybackoff wrote:
| This is what HN is about :)
|
| I understood a fraction but instantly wanted to dive into the
| topic just after reading such a knowledgeable comment.
| jhj wrote:
| > By scaling the numbers first by taking the log,
| multiplication becomes addition and addition becomes x +
| log1p(exp(y - x)).
|
| Addition/subtraction in a logarithmic number system is way more
| expensive than what you would spend on multiplication,
| especially if you care about correctly rounded results, as the
| (hardware) LUTs required are rather big.
| bearzoo wrote:
| somewhat related is the number theoretic transform
|
| https://ieeexplore.ieee.org/abstract/document/1451721
| btown wrote:
| The paper in [4] is absolutely fascinating - I'm very much a
| neophyte here, but I believe it shows that practically any ReLU
| network can be represented as a tropical ratio of two tropical
| polynomials, and thus can be analyzed with geometric principles
| including visualizations of surfaces. It's been cited in more
| recent work:
| https://scholar.google.com/scholar?cites=1003719112553620451...
| - does anyone know if there's been any significant advancements
| here?
| skykooler wrote:
| I find it fascinating that this is using a process invented in
| 1968 and hasn't been used for this purpose until now!
| pk-protect-ai wrote:
| Hey, nobody knew what to do with GF(2^x) up until mid last
| century either... Oh wait, CS was not really a thing almost up
| until mid last century...
___________________________________________________________________
(page generated 2024-03-15 23:00 UTC)