[HN Gopher] Multiplying Matrices Without Multiplying
___________________________________________________________________
Multiplying Matrices Without Multiplying
Author : moinnadeem
Score : 168 points
Date : 2021-09-01 00:02 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| random314 wrote:
| Ha ha! What the hell! This is revolutionary if true!
| cs702 wrote:
| Every element of a matrix multiplication is a dot-product of two
| vectors. The dot-product of two vectors quantifies their
| similarity -- in fact, we call it "dot-product similarity" in a
| nearest-neighbors context: If the dot product > 0, the vectors
| point in similar directions; if < 0, the two vectors point in
| dissimilar directions; if = 0, the two vectors are orthogonal.
|
| It's not too hard to imagine that it might be possible to learn
| representative K-means clusters of training vectors and then, at
| run-time, find similar vectors using an efficient hashing scheme
| and do the equivalent of a table lookup to get the approximate
| dot-product similarity scores -- without having to perform any
| multiplications.
|
| This isn't quite correct (I'm oversimplifying and ignoring many
| important details), but I think it's a helpful mental image.
|
| Very clever stuff.
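| To make that mental image concrete, here is a minimal numpy
| sketch of the lookup idea (plain product quantization with
| made-up sizes; note the encoding step below still computes
| distances, whereas MADDNESS replaces it with learned hash
| functions):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   D, K, C = 32, 16, 4   # vector length, prototypes per subspace, subspaces
|   d = D // C
|
|   # Offline "training": pick K prototype chunks per subspace. A real
|   # system would run K-means here; random training rows keep it short.
|   train = rng.normal(size=(1000, D))
|   protos = np.stack([train[rng.choice(1000, K, replace=False),
|                            c*d:(c+1)*d]
|                      for c in range(C)])          # shape (C, K, d)
|
|   def encode(a):
|       # Index of the nearest prototype in each subspace.
|       return np.array([((protos[c] - a[c*d:(c+1)*d])**2).sum(1).argmin()
|                        for c in range(C)])
|
|   def make_table(b):
|       # Precompute every prototype's dot product with b: shape (C, K).
|       return np.stack([protos[c] @ b[c*d:(c+1)*d] for c in range(C)])
|
|   a, b = rng.normal(size=D), rng.normal(size=D)
|   table = make_table(b)                          # once per column of B
|   approx = table[np.arange(C), encode(a)].sum()  # query: lookups + adds only
|   print(approx, a @ b)                           # crude approximation vs. exact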
| amelius wrote:
| One question is: how does it scale? E.g. how does the
| time/space complexity increase in the big-O sense with
| increasing dimension N?
| xyzzyz wrote:
| _The dot-product of two vectors quantifies their similarity --
| in fact, we call it "dot-product similarity" in a nearest-
| neighbors context: If the dot product > 0, the vectors point in
| similar directions; if < 0, the two vectors point in dissimilar
| directions; if = 0, the two vectors are orthogonal._
|
| To make it more explicit, the dot product of two vectors is just
| the cosine of the angle between them, multiplied by their lengths.
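| In numpy terms, as a tiny sanity check of that identity:
|
|   import numpy as np
|
|   a = np.array([1.0, 2.0, 3.0])
|   b = np.array([4.0, -1.0, 0.5])
|
|   dot = a @ b
|   cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
|   angle = np.arccos(cos_theta)        # radians
|
|   # a @ b  ==  |a| * |b| * cos(angle)
|   print(dot, np.linalg.norm(a) * np.linalg.norm(b) * np.cos(angle))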
| kazinator wrote:
| The angle and cosine start to lose their geometric intuition
| when we go beyond 3D.
|
| The concept of correlation has no issue with additional
| components. The concept of similarity of two 17-element
| vectors is clear. In fact correlation intuitively scales to
| "infinite component vectors": the dot product becomes
| multiplying two functions together and then taking an
| integral.
|
| The Fourier transform of a periodic signal is based on the
| concept of how similar the signal is to a certain basis of
| sine/cosine quadratures spaced along a frequency spectrum.
| This is like a projection of a vector into a space; only the
| vector has an infinite number of components since it is an
| interval of a smooth function.
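| As a rough numerical sketch of that last point, a Fourier
| coefficient really is just a dot product against a sampled
| sinusoid (arbitrary example signal):
|
|   import numpy as np
|
|   t = np.linspace(0, 1, 1000, endpoint=False)   # one second, 1000 samples
|   signal = 3 * np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 9 * t)
|
|   # "How similar is the signal to a 5 Hz sine?" -- an inner product,
|   # i.e. the discretized integral of signal(t) * basis(t).
|   basis = np.sin(2 * np.pi * 5 * t)
|   coeff = (signal @ basis) * (2 / len(t))
|   print(coeff)                                  # ~3, the 5 Hz amplitude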
| Y_Y wrote:
| This idea generalises to the concept of
| https://en.wikipedia.org/wiki/Inner_product_space and the
| equivalent of a change-of-basis.
| tpoacher wrote:
| > The angle and cosine start to lose their geometric
| intuition when we go beyond 3D
|
| ... they do?
|
| "Geometry" in general loses intuition beyond 3D, but apart
| from that, angles between two vectors are probably the one
| thing that still remains intuitive in higher dimensions
| (since the two vectors can always be reduced to their
| common plane).
| Certhas wrote:
| Even angles behave very counter-intuitively in high
| dimensions. E.g. in high dimensional spaces uniformly
| randomly chosen vectors always have the same inner
| product. Why? Sum x_i y_i is a sum of iid random
| variables, so the variance goes to zero by the central
| limit theorem.
| atq2119 wrote:
| The variance goes to 0 only if you normalize, which is to
| say that two random high-dimensional vectors are very
| likely to be close to orthogonal (under mild assumptions
| on their distribution).
|
| I agree that that's one of those important but initially
| unintuitive facts about high dimensions. Just like almost
| all of the volume of a reasonably round convex body is
| near its surface. But it also doesn't really contradict
| the GP comment.
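| A quick simulation makes the near-orthogonality visible (random
| unit vectors, so the lengths are normalized away; the dimensions
| are arbitrary):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   for n in (3, 30, 300, 3000):
|       x = rng.normal(size=(10000, n))
|       x /= np.linalg.norm(x, axis=1, keepdims=True)
|       y = rng.normal(size=(10000, n))
|       y /= np.linalg.norm(y, axis=1, keepdims=True)
|       dots = np.einsum('ij,ij->i', x, y)
|       # spread of the inner product shrinks like 1/sqrt(n)
|       print(n, dots.std())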
| bollu wrote:
| I would say that this is intuitive. For any direction you
| pick, there are (n-1) orthogonal directions in nD space.
| It's only natural that the expected inner product drops
| to zero.
| adwn wrote:
| > _dot product of two vectors is just cosine of the angle
| between them, multiplied by their lengths_
|
| How do you define the "angle" between two n-dimensional
| vectors? Most likely using the dot-product and the arccos.
| "cos(angle) times lengths" might give a good intuition for 2D
| or 3D space, but it doesn't help with higher-dimensional
| vectors.
| BongoMcCat wrote:
| take the first line/vector, then just call that line/vector
| "first-new-dimension".
|
| Then take the second line/vector, and decide that this
| vector can be written as a 2-dimensional vector made out of
| "first-new-dimension" and "second-new-dimension", now you
| just have to figure out what "second-new-dimension" is.
|
| A simple trick would be to measure the length of the line,
| and then add/remove a multiple of the first dimension until
| the length of the line becomes as short as possible, this
| new line is your "second-new-dimension".
|
| Now, even if you are working with a 10-dimensional space,
| you have two lines that only exist in the first two (new)
| dimensions, so, you can treat them as two-dimensional
| objects and find an angle using your old 2-dimensional
| methods.
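| That recipe is essentially one step of Gram-Schmidt followed by
| ordinary 2-D trigonometry; a small numpy sketch of it:
|
|   import numpy as np
|
|   def angle_between(u, v):
|       e1 = u / np.linalg.norm(u)        # "first-new-dimension"
|       w = v - (v @ e1) * e1             # remove the part of v along e1
|       e2 = w / np.linalg.norm(w)        # "second-new-dimension"
|       # v lives in the (e1, e2) plane: ordinary 2-D angle from its
|       # coordinates in that plane
|       return np.arctan2(v @ e2, v @ e1)
|
|   rng = np.random.default_rng(0)
|   u, v = rng.normal(size=10), rng.normal(size=10)   # 10-dimensional
|   print(angle_between(u, v),
|         np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))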
| Nevermark wrote:
| It generalizes perfectly. The angle between two lines in
| any dimensions is the same concept.
|
| Two (non-collinear) lines share a plane. The angle on that
| plane is just the ordinary angle, no matter how many
| dimensions the two lines are embedded in.
|
| In the case they are collinear, the angle between them is
| zero on any plane that intersects them. So that corner case
| works too, regardless of numbers of dimensions.
| davesque wrote:
| Thanks for your succinct summary of the result.
|
| Just the other day, I was thinking about how useful clusterings
| seem in summarizing the landscape of a set of points in any
| number of dimensions. This work seems to be built on top of the
| same insight as well as the innovation of using hashing to
| speed up the similarity checks.
| 0-_-0 wrote:
| >It's not too hard to imagine that it might be possible to
| learn representative K-means clusters of training vectors and
| then, at run-time, find similar vectors using an efficient
| hashing scheme and do the equivalent of a table lookup to get
| the approximate dot-product similarity scores -- without having
| to perform any multiplications.
|
| Isn't that basically the same as replacing your hardware
| multiplication operations with a hardware lookup table?
| leecarraher wrote:
| For some reason they run all their tests single threaded. Seems
| like parallelism is where all computing hardware is inevitably
| going. I also wish they had runtime comparisons to more recent
| matrix sketching and multiplication methods such as frequent
| directions, newer FJLT implementations, and RIP matrices based
| on hashing.
| ffast-math wrote:
| We compared to several frequent directions variants, Fast
| Johnson-Lindenstrauss, some other hashing-based methods, and a
| bunch of other approximate matrix multiplication approaches. We
| had to omit some of them from the results section though
| because they were 10-100x worse than exact matrix products and
| they ruined the plots. More info in appendix E5.
|
| As far as single threaded, there's a simple answer and a subtle
| one. The simple answer is that we consider the core subroutine
| a single thread would run in a multithreaded context, not how
| to do the multithreading. These are basically orthogonal issues
| since matrix multiplication is embarrassingly parallel and our
| method would parallelize just like any other. More details in
| appendix E2.
|
| The subtler answer is that we could do _even better_ in the
| multithreaded context _if_ you could fuse the encoding step
| with whatever earlier code produces the larger matrix. This is
| a result of matrix products becoming memory-bandwidth-bound in
| the limit of sufficient cores, combined with our approach
| reducing the size of the matrix by a huge amount.
| alfor wrote:
| I wonder how our brains can train billions of neurons without
| matrix multiplication. What is the biological process that gets a
| similar result?
| tgv wrote:
| The "original" and easy to grasp idea behind neural learning is
| reinforcement: every time a good result is obtained, the
| connections that contributed get strengthened (reinforced). Bad
| results lead to weakening. Back-prop is a specific
| implementation, and it is usually expressed in terms of matrix
| operations, but you can describe it without them as well, at the
| single-connection level. Implementing it that way isn't
| efficient, though, hence matrix operations.
| jvanderbot wrote:
| Perhaps addition and multiplication are the tools we use to get
| a computer to appear to act like a biological system with all
| its chemical pathways and simultaneous stimuli.
|
| Let's take that one step further. Who taught orbiting bodies how
| to do differential equations?
| fnord77 wrote:
| btw neurons are not the only units of computation in the human
| body.
|
| Allosteric activation in enzyme binding sites acts almost like a
| transistor. Inside every cell there are various computations
| occurring.
| yangjunpro wrote:
| An interesting work, with some to-be-addressed questions:
|
| 1. The paper only covers the GEMM part, with small-scale
| experiments (CIFAR-10/100); it does not cover convolution, nor
| the GEMM parts of more popular networks such as Transformer/BERT.
|
| 2. It is still an approximate method, meaning potential accuracy
| loss. So I think this method is less attractive for training
| acceleration, and maybe more useful as a complementary method for
| inference acceleration.
|
| 3. No results are evaluated on GPUs equipped with Tensor Cores. I
| am a little bit curious: since modern AI accelerators (including
| NVIDIA GPUs) all incorporate Tensor Cores, which by design
| support GEMM acceleration, what is the add-on value brought by
| the approximating method mentioned in this paper?
| Ar-Curunir wrote:
| If it works better for inference, it could enable fast inference
| on devices which don't have good tensor cores/GPUs.
| ffast-math wrote:
| Great observations. I see this paper as the first in a three-
| part series. The second part is specializing it for convolution
| (which has additional structure to exploit), and the third is
| hooking these approximate ops into deep neural nets the way
| people currently do with scalar quantization / pruning.
|
| I'm not optimistic about beating tensor cores when running on
| GPUs, at least until/unless we get similar hardware support.*
|
| Barring better hardware support, the killer app is probably CPU
| inference--once there are Conv implementations and the
| necessary GPU kernels to train the network.
|
| *Aside: this support would be pretty doable since the kernels
| look almost identical to GEMM kernels--you just need a
| multiplex-add rather than a multiply-add. On an x86 machine,
| all it would take is a vpshufb-add and a 4-bit unpack
| instruction.
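| (For intuition, a rough numpy analogue of that multiplex-add
| inner loop, with all names and sizes made up and the SIMD
| details ignored: 4-bit codes select entries from 16-entry
| lookup tables, and the selected entries are summed.)
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   C = 8                                      # number of codebooks
|   codes = rng.integers(0, 16, size=C)        # 4-bit codes for one row of A
|   luts = rng.integers(0, 256, size=(C, 16))  # 8-bit LUTs for one column of B
|
|   # The inner loop is a gather + add instead of a multiply + add.
|   out = luts[np.arange(C), codes].sum()
|   print(out)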
| criticaltinker wrote:
| > I think this method is less attractive to training
| acceleration scenario
|
| The proposed hash based encoding function is not
| differentiable, so it doesn't appear this method can be used
| for training at all.
|
| I'm not aware of any hash functions that are analytically
| differentiable, so to support efficient back-propagation I
| suspect that some fundamental changes to this method would be
| necessary.
| infogulch wrote:
| I wonder how much alternative number representations have been
| studied for use in matrices. Like, storing the log of values
| instead so that multiplication can be done by just adding. Or
| something like Zech's logarithm. Or even, take the log of whole
| matrices (which is a thing apparently [0]), then adding them to
| compute their multiplication. I wonder if the logarithm of a
| matrix can be computed using ML.
|
| [0]: https://en.wikipedia.org/wiki/Logarithm_of_a_matrix
| rocqua wrote:
| Storing log values helps with the multiply part of multiply and
| add.
|
| But it seems like it would make the add part of multiply and
| add quite a bit more difficult.
| thwoeri2324 wrote:
| Quick note: the matrix exponential does not work the way it does
| with scalars.
|
| https://en.wikipedia.org/wiki/Matrix_exponential#Elementary_...
| ffast-math wrote:
| There's been some work on doing adds instead of multiplies
| (e.g., https://arxiv.org/abs/2012.03458). And I think float8
| will roughly be doing this under the hood.
|
| Personally, I'm not sure whether this is the path to go down or
| not. Doing everything in log space could help from a hardware
| perspective, since multiply-adds are much more complex than
| adds. But it 1) doesn't reduce the number of operations being
| performed, 2) might increase the necessary bit width for a
| given level of fidelity (depending on your input distribution
| and other factors), and 3) might make accumulate operations
| expensive, since these aren't cleanly expressed in the log
| domain.
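| To make point 3 concrete: in log space a product becomes an
| add, but a sum needs log-sum-exp-style machinery (plus separate
| sign handling), e.g.:
|
|   import numpy as np
|
|   a, b, c = 2.0, 3.0, 5.0
|   la, lb, lc = np.log(a), np.log(b), np.log(c)
|
|   print(np.exp(la + lb))               # a * b: just an add in log space
|   print(np.exp(np.logaddexp(la, lc)))  # a + c: needs exp/log machinery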
| ffast-math wrote:
| Primary author here. Happy to answer questions!
|
| Also, feel free to email me at the address in the paper if you're
| interested in talking about it in more detail. E.g., I've already
| heard from some hardware folks looking at expanding on this work.
| ausbah wrote:
| as a total outsider to this sort of stuff, doesn't this have the
| big issue of error propagation?
| echelon wrote:
| Probably, but hill climbing will avoid loss in the long run.
| The speed boost is probably more than worth it.
|
| I wouldn't be surprised if we started using lossier math for
| faster state space search. Once you find a peak, you could swap
| out the maths to be more exact.
| erosenbe0 wrote:
| Suppose you have a set membership question. If you
| approximate yes, you do a more complex inquiry to grab
| details. If you approximate no, you tell the user "no, I don't
| know that face [answer, command, etc.]". So this is ripe for
| approximate methods as you can tune to get very few false
| negatives (respond no when it's really yes), but allow a
| healthy dose of false positives.
| antonzabirko wrote:
| I had this same exact thought as an undergrad, like 3 years ago!
| I kinda gave up due to the massive barrier and the difficult
| financial burdens faced by PhD students. It feels nice to know I
| wasn't crazy.
| Exuma wrote:
| How does it work?
| ssivark wrote:
| Do you mean the exact idea behind MADDNESS or just "ML for
| multiplying matrices"? The latter doesn't mean much without the
| former :-)
| antonzabirko wrote:
| Not sure how they did it, but basically I thought about using
| neural nets to reduce things that gained complexity through
| layers. Computer architecture and the layers of complexity
| between groups of system instructions and function code, for
| example. There have got to be inefficiencies that can be
| collapsed with the help of ML.
| honie wrote:
| I don't disagree with you on the potential financial burdens
| faced by PhD students but, in this case, and if I haven't
| missed anything, the infrastructural barrier is not high:
|
| > All experiments use a single thread on a Macbook Pro with a
| 2.6GHz Intel Core i7-4960HQ processor. Unless stated otherwise,
| all timing results use five trials, with each trial reporting
| the fastest among 20 executions.
|
| It may also be worth noting that, even as a hobby, one can
| actually do a lot with modern hardware if you scope your
| project well. There are also a lot of free resources, such as
| CoLab, available to those who have very limited computing power
| at home/work.
|
| Last but not least, there is also nothing stopping you from
| announcing your results on arXiv and, if you can be bothered
| (as a hobbyist), getting it published in a peer-reviewed journal.
|
| So if you still have ideas, I encourage you to go ahead and try
| them! :)
| antonzabirko wrote:
| I meant learning barrier. If i had the money i likely would
| go for it though!
| ltbarcly3 wrote:
| I really don't know how to respond other than to say:
|
| "I would have done this years ago if I knew how to do it"
| is fantastically narcissistic.
| m00x wrote:
| Another form of: "I could totally have built Google or
| Amazon if I wanted to, I'm just a lazy genius."
| jmount wrote:
| In my opinion, that is the thing with theoretical computer
| science. Many of the ideas are comprehensible. And if one has
| the opportunity to do the work, one may get the result.
| jvanderbot wrote:
| Well, taken to the extreme, and from what I can tell, that's
| the truth for any field.
|
| If you roll everything (financial incentives, emotional
| readiness, support and preparation, etc.) into 'opportunity',
| then yeah, I believe it.
| Ar-Curunir wrote:
| tbh I don't think most ideas in TCS are comprehensible to
| outsiders without spending at least a year or two building
| background with the state of the art. Even if you can design
| an algorithm, to get a good TCS paper you have to prove
| bounds on how well it approximates and/or on its running
| time.
| mhh__ wrote:
| The biggest blocker for me is usually working out how I can
| implement a given idea without either writing a bunch of code
| (I'll get bored) or trying to verify if the paper even works
| for my use case without doing the aforementioned. One field
| where I notice this problem is abstract interpretation: with my
| background at least, the methods are very theoretically clean
| and impressive, but actually implementing them as code _and_
| knowing how to write _good_ implementations seems to be quite
| obtuse.
|
| I genuinely don't understand why we allow papers to be
| published on _computer_ science, with graphs plotted of the
| supposed efficacy of research (i.e. not just a theoretical
| paper), with no code attached.
| jijji wrote:
| On page six of the paper:
|
| To assess MADDNESS's effectiveness, we implemented both it
| and existing algorithms in C++ and Python. All of our code
| and raw numerical results are publicly available at
| https://smarturl.it/Maddness. All experiments use a single
| thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ
| processor.
|
| edit: the url above redirects to:
| https://github.com/dblalock/bolt
| Ar-Curunir wrote:
| Computer Science doesn't mean empirical stuff only, nor
| does it mean evaluations with code.
|
| Theory has its place, and so does systems work. No need to
| denigrate one or the other.
| mhh__ wrote:
| I was trying to be clear that I was referring to papers
| which are talking about code they wrote. If you wrote
| (say) a program to predict the throughput of machine
| code, then I want to be able to reproduce the results you
| claim - thats a real example, no hint of any source yet
| and I've been looking.
|
| If we can't reproduce it, it isn't really science. I know
| academics often write bad code and don't like to publish
| their dirty work, but the buck has to stop somewhere.
| robotresearcher wrote:
| Did you ask them for their code?
| esjeon wrote:
| > MADDNESS
|
| I think this name really fits the concept well. Replacing
| matrix multiplication with some hash lookups! (warning: an
| overly simplified statement)
|
| This is a really interesting application of PQ (product
| quantization), which itself also requires learning (usually
| K-means). Paper:
| https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_wi...
|
| Considering that ANN has survived through many approximations
| (e.g. lower precision, pruning), and that many ANN applications
| are anyway subjective (e.g. image processing, recommendation), I
| think this can be highly useful in many cases.
| nynx wrote:
| Huh, that's kinda fascinating. Maybe it'd be worth running ml on
| matrix multiplication that's approximated by ml?
| TrainedMonkey wrote:
| The ironic thing is that a large part of how ML works is via
| matrix multiplication.
| r00fus wrote:
| That's not ironic, it could be revolutionary for this
| algorithm and ML in general.
| [deleted]
| noobermin wrote:
| It's important to remember that dense matrix multiplication
| (doing it by hand) is an O(N^3) operation; this is about
| approximations to multiplication that beat that already harsh
| complexity. There is a whole field that develops approximations
| to matrix multiplication of large matrices; I'm assuming this
| article is about using ML to find good
| approximations.
|
| To the replies, the very act of evaluating a prediction from an
| NN is matrix multiplication (a linear transform of a column
| vector is a matrix multiplication). This doesn't replace matrix
| multiplication wholesale, lol. This is about multiplication
| in a specific case.
| Nevermark wrote:
| As long as there is enough consistency between
| approximating a multiply X*W and Z*Wt (Wt = W transpose),
| it is possible it could be used in NN training.
|
| Y = X*W is the forward propagation. If Z is an error or
| derivative of error, Z*Wt is the back propagation.
|
| It's an interesting question how well that would work.
| Anything that speeds up matrix multiplies in NNs and deep
| learning in general would be a big deal.
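| For reference, the two products in question in plain numpy
| (dense versions; an approximate kernel would have to behave
| consistently for W and its transpose, and there is also the
| X^T * dY product for the weight gradient):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   X = rng.normal(size=(64, 128))   # batch of activations
|   W = rng.normal(size=(128, 32))   # layer weights
|
|   Y = X @ W                        # forward pass
|   dY = rng.normal(size=Y.shape)    # gradient arriving from the next layer
|
|   dX = dY @ W.T                    # back-propagated gradient (Z * Wt)
|   dW = X.T @ dY                    # weight gradient, a third product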
| uluyol wrote:
| Lots of things are like this. I program my network using a
| network. I run a server (BMC) on my server. I start my engine
| with an engine
| (https://en.m.wikipedia.org/wiki/Lockheed_SR-71_Blackbird).
| trilinearnz wrote:
| Clever and logical. It reminds me of when John Carmack used a
| precomputed table of sine values for fast lookup in Quake, rather
| than running the actual function on the CPU.
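| The classic trick, roughly (a 256-entry table indexed by a
| scaled angle; purely illustrative, not Quake's actual code):
|
|   import math
|
|   N = 256
|   SINE_TABLE = [math.sin(2 * math.pi * i / N) for i in range(N)]
|
|   def fast_sin(angle):
|       # Nearest-entry lookup: trades accuracy for an index + table read.
|       i = int(round(angle / (2 * math.pi) * N)) % N
|       return SINE_TABLE[i]
|
|   print(fast_sin(math.pi / 6), math.sin(math.pi / 6))   # ~0.5 vs 0.5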
| richrichardsson wrote:
| I was using a precomputed table of sine values in 3D graphics
| way before Quake ever hit the scene, and I certainly didn't
| invent that idea either.
| trilinearnz wrote:
| Ah that's useful to know that it was conventional up until
| then, thanks. It was my first exposure, personally :)
| kazinator wrote:
| If machine learning depends on matrix multiplications, but you
| fib them using machine learning ... do you see the problem?
| qsort wrote:
| If a C++ compiler depends on a C++ compiler to be compiled
| itself... do you see the problem?
|
| I have only superficially looked at the paper, but it's pretty
| clear they are using an offline lookup table. There are no such
| problems involved.
| kazinator wrote:
| Yes, I do; your bootstrapping is dependent on a binary C++
| compiler, which could be hiding something that isn't in the
| source code, but which propagates to newly bootstrapped
| compiler binaries which again pass it on to the next round of
| bootstrapping, _ad infinitum_.
|
| Basically, we can't be sure that you really have a C++
| compiler. The source code can be verified to be a C++
| compiler, but the binaries may deviate in some way, possibly
| maliciously.
| robertlagrant wrote:
| Sure, but you seem to have gone off on a wild tangent. How
| does that relate to matrix multiplication?
| joe_the_user wrote:
| Shouldn't the solution to fast matrix multiplication be
| logarithms, similarly to fast scalar multiplication?
|
| https://en.wikipedia.org/wiki/Logarithm_of_a_matrix
| PythonNut wrote:
| Perhaps surprisingly, the matrix logarithm does not satisfy
| log(A) + log(B) = log(AB) in general (only when AB = BA) which
| is why you cannot use it to "multiply by adding".
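| This is easy to check numerically (needs scipy; arbitrary small
| example matrices):
|
|   import numpy as np
|   from scipy.linalg import expm, logm
|
|   A = np.array([[2.0, 1.0], [0.0, 1.0]])
|   B = np.array([[1.0, 0.0], [1.0, 3.0]])   # A @ B != B @ A
|
|   print(expm(logm(A) + logm(B)))   # not equal to A @ B
|   print(A @ B)
|
|   D = np.diag([2.0, 3.0])          # commuting pair: both diagonal
|   E = np.diag([5.0, 7.0])
|   print(expm(logm(D) + logm(E)))   # equals D @ E
|   print(D @ E)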
| adgjlsfhk1 wrote:
| More importantly, matrix logarithms only exist for square
| matrices.
| xyzzyz wrote:
| That's not that big of a problem, as you can easily reduce
| the problem of multiplication of rectangular matrices to
| multiplication of square ones.
___________________________________________________________________
(page generated 2021-09-01 10:01 UTC)