[HN Gopher] Multiplying Matrices Without Multiplying
       ___________________________________________________________________
        
       Multiplying Matrices Without Multiplying
        
       Author : moinnadeem
       Score  : 168 points
       Date   : 2021-09-01 00:02 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | random314 wrote:
       | Ha ha! What the hell! This is revolutionary if true!
        
       | cs702 wrote:
       | Every element of a matrix multiplication is a dot-product of two
       | vectors. The dot-product of two vectors quantifies their
       | similarity -- in fact, we call it "dot-product similarity" in a
       | nearest-neighbors context: If the dot product > 0, the vectors
       | point in similar directions; if < 0, the two vectors point in
       | dissimilar directions; if = 0, the two vectors are orthogonal.
       | 
       | It's not too hard to imagine that it might be possible to learn
       | representative K-means clusters of training vectors and then, at
       | run-time, find similar vectors using an efficient hashing scheme
       | and do the equivalent of a table lookup to get the approximate
       | dot-product similarity scores -- without having to perform any
       | multiplications.
       | 
       | This isn't quite correct (I'm oversimplifying and ignoring many
       | important details), but I think it's a helpful mental image.
       | 
       | Very clever stuff.
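        | 
        | As a rough sketch of that mental image (toy sizes and cluster
        | counts chosen arbitrarily, plain numpy/sklearn, and emphatically
        | not the paper's actual MADDNESS algorithm -- the encoding step
        | below still does arithmetic, whereas MADDNESS replaces it with
        | learned hash functions):
        | 
        | import numpy as np
        | from sklearn.cluster import KMeans
        | 
        | rng = np.random.default_rng(0)
        | A_train = rng.normal(size=(5000, 16))  # rows seen at run-time
        | b = rng.normal(size=16)                # fixed vector to multiply by
        | 
        | km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(A_train)
        | lut = km.cluster_centers_ @ b          # precomputed dot products
        | 
        | a = rng.normal(size=16)                # a new row at run-time
        | code = km.predict(a[None, :])[0]       # index of nearest prototype
        | print(lut[code], a @ b)                # table lookup vs. exact
        | 
        | With structureless Gaussian rows the estimate is crude; the whole
        | approach relies on real data having structure the prototypes can
        | capture.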
        
         | amelius wrote:
         | One question is: how does it scale? E.g. how does the
         | time/space complexity increase in the big-O sense with
         | increasing dimension N?
        
         | xyzzyz wrote:
         | _The dot-product of two vectors quantifies their similarity --
         | in fact, we call it "dot-product similarity" in a nearest-
         | neighbors context: If the dot product > 0, the vectors point in
         | similar directions; if < 0, the two vectors point in dissimilar
         | directions; if = 0, the two vectors are orthogonal._
         | 
         | To make it more explicit, dot product of two vectors is just
         | cosine of the angle between them, multiplied by their lengths.
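          | 
          | Concretely, in 2D, where the angle is easy to see (a small
          | numpy check with arbitrary numbers):
          | 
          | import numpy as np
          | 
          | theta = np.deg2rad(40.0)
          | a = np.array([3.0, 0.0])        # length 3, along the x-axis
          | b = 2.0 * np.array([np.cos(theta), np.sin(theta)])  # length 2
          | print(a @ b, 3.0 * 2.0 * np.cos(theta))             # both ~4.596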
        
           | kazinator wrote:
           | The angle and cosine start to lose their geometric intuition
           | when we go beyond 3D.
           | 
           | The concept of correlation has no issue with additional
           | components. The concept of similarity of two 17-element
           | vectors is clear. In fact correlation intuitively scales to
           | "infinite component vectors": the dot product becomes
           | multiplying two functions together and then taking an
           | integral.
           | 
           | The Fourier transform of a periodic signal is based in the
           | concept of how similar the signal is to a certain basis of
           | sine/cosine quadratures spaced along a frequency spectrum.
           | This is like a projection of a vector into a space; only the
           | vector has an infinite number of components since it is an
           | interval of a smooth function.
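            | 
            | A small numerical illustration of that last point (a toy
            | signal, nothing to do with the paper): the 5 Hz Fourier
            | coefficient is just the average of the signal times a 5 Hz
            | basis sinusoid, i.e. a dot product in the limit.
            | 
            | import numpy as np
            | 
            | t = np.linspace(0.0, 1.0, 10000, endpoint=False)
            | signal = 3.0 * np.sin(2 * np.pi * 5 * t) \
            |          + 0.5 * np.sin(2 * np.pi * 9 * t)
            | basis = np.sin(2 * np.pi * 5 * t)
            | 
            | similarity = (signal * basis).mean()  # ~ integral over [0, 1]
            | print(similarity)                     # ~1.5 = 3.0 * 1/2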
        
             | Y_Y wrote:
             | This idea generalises to the concept of
              | https://en.wikipedia.org/wiki/Inner_product_space and the
              | equivalent of a change-of-basis.
        
             | tpoacher wrote:
             | > The angle and cosine start to lose their geometric
             | intuition when we go beyond 3D
             | 
             | ... they do?
             | 
             | "Geometry" in general loses intuition beyond 3D, but apart
             | from that, angles between two vectors are probably the one
             | thing that still remains intuitive in higher dimensions
             | (since the two vectors can always be reduced to their
             | common plane).
        
               | Certhas wrote:
               | Even angles behave very counter-intuitively in high
                | dimensions. E.g. in high-dimensional spaces, uniformly
                | randomly chosen vectors always have the same inner
                | product. Why? The sum of x_i y_i is a sum of iid random
                | variables, so the variance goes to zero by the central
               | limit theorem.
        
               | atq2119 wrote:
               | The variance goes to 0 only if you normalize, which is to
               | say that two random high-dimensional vectors are very
               | likely to be close to orthogonal (under mild assumptions
               | on their distribution).
               | 
               | I agree that that's one of those important but initially
               | unintuitive facts about high dimensions. Just like almost
                | all of the volume of a reasonably round convex body is
               | near its surface. But it also doesn't really contradict
               | the GP comment.
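                | 
                | A quick simulation of that concentration effect (toy
                | numbers, not from the paper): the spread of the cosine
                | similarity between random Gaussian vectors shrinks
                | roughly like 1/sqrt(d).
                | 
                | import numpy as np
                | 
                | rng = np.random.default_rng(0)
                | for d in (3, 30, 300, 3000):
                |     x = rng.normal(size=(1000, d))
                |     y = rng.normal(size=(1000, d))
                |     nx = np.linalg.norm(x, axis=1)
                |     ny = np.linalg.norm(y, axis=1)
                |     cos = (x * y).sum(axis=1) / (nx * ny)
                |     print(d, cos.std())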
        
               | bollu wrote:
               | I would say that this is intuitive. For any direction you
               | pick, there are (n-1) orthogonal directions in nD space.
               | It's only natural that the expected inner product drops
               | to zero.
        
           | adwn wrote:
           | > _dot product of two vectors is just cosine of the angle
           | between them, multiplied by their lengths_
           | 
           | How do you define the "angle" between two n-dimensional
           | vectors? Most likely using the dot-product and the arccos.
           | "cos(angle) times lengths" might give a good intuition for 2D
            | or 3D space, but it doesn't help with higher-dimensional
            | vectors.
        
             | BongoMcCat wrote:
              | Take the first line/vector, then just call that line/vector
             | "first-new-dimension".
             | 
             | Then take the second line/vector, and decide that this
             | vector can be written as a 2-dimensional vector made out of
             | "first-new-dimension" and "second-new-dimension", now you
             | just have to figure out what "second-new-dimension" is.
             | 
             | A simple trick would be to measure the length of the line,
             | and then add/remove a multiple of the first dimension until
             | the length of the line becomes as short as possible, this
             | new line is your "second-new-dimension".
             | 
             | Now, even if you are working with a 10-dimensional space,
             | you have two lines that only exist in the first two (new)
             | dimensions, so, you can treat them as two-dimensional
             | objects and find an angle using your old 2-dimensional
             | methods.
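              | 
              | That construction in code (a sketch with made-up
              | 10-dimensional vectors; it is just Gram-Schmidt applied to
              | the pair):
              | 
              | import numpy as np
              | 
              | rng = np.random.default_rng(0)
              | u = rng.normal(size=10)
              | v = rng.normal(size=10)
              | 
              | e1 = u / np.linalg.norm(u)          # "first-new-dimension"
              | resid = v - (v @ e1) * e1           # shortest leftover
              | e2 = resid / np.linalg.norm(resid)  # "second-new-dimension"
              | 
              | angle_2d = np.arctan2(v @ e2, v @ e1)      # old 2D method
              | nu, nv = np.linalg.norm(u), np.linalg.norm(v)
              | angle_nd = np.arccos((u @ v) / (nu * nv))  # dot-product way
              | print(angle_2d, angle_nd)                  # same angle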
        
             | Nevermark wrote:
             | It generalizes perfectly. The angle between two lines in
             | any dimensions is the same concept.
             | 
             | Two (non-collinear) lines share a plane. The angle on that
             | plane is just the ordinary angle, no matter how many
             | dimensions the two lines are embedded in.
             | 
             | In the case they are collinear, the angle between them is
              | zero on any plane that contains them. So that corner case
             | works too, regardless of numbers of dimensions.
        
         | davesque wrote:
         | Thanks for your succinct summary of the result.
         | 
         | Just the other day, I was thinking about how useful clusterings
         | seem in summarizing the landscape of a set of points in any
         | number of dimensions. This work seems to be built on top of the
         | same insight as well as the innovation of using hashing to
         | speed up the similarity checks.
        
         | 0-_-0 wrote:
         | >It's not too hard to imagine that it might be possible to
         | learn representative K-means clusters of training vectors and
         | then, at run-time, find similar vectors using an efficient
         | hashing scheme and do the equivalent of a table lookup to get
         | the approximate dot-product similarity scores -- without having
         | to perform any multiplications.
         | 
         | Isn't that basically the same as replacing your hardware
         | multiplication operations with a hardware lookup table?
        
       | leecarraher wrote:
        | For some reason they run all their tests single-threaded. Seems
        | like parallelism is where all computing hardware is inevitably
        | going. I also wish they had runtime comparisons to more recent
        | matrix sketching and multiplication methods such as frequent
       | directions, newer FJLT implementations, and RIP matrices based on
       | hashing.
        
         | ffast-math wrote:
         | We compared to several frequent directions variants, Fast
         | Johnson-Lindenstrauss, some other hashing-based methods, and a
         | bunch of other approximate matrix multiplication approaches. We
         | had to omit some of them from the results section though
         | because they were 10-100x worse than exact matrix products and
         | they ruined the plots. More info in appendix E5.
         | 
          | As for single-threading, there's a simple answer and a subtle
         | one. The simple answer is that we consider the core subroutine
         | a single thread would run in a multithreaded context, not how
         | to do the multithreading. These are basically orthogonal issues
         | since matrix multiplication is embarrassingly parallel and our
         | method would parallelize just like any other. More details in
         | appendix E2.
         | 
         | The subtler answer is that we could do _even better_ in the
         | multithreaded context _if_ you could fuse the encoding step
         | with whatever earlier code produces the larger matrix. This is
         | a result of matrix products becoming memory-bandwidth-bound in
         | the limit of sufficient cores, combined with our approach
         | reducing the size of the matrix by a huge amount.
        
       | alfor wrote:
        | I wonder how our brain can train billions of neurons without matrix
        | multiplication. What is the biological process that gets a similar
        | result?
        
         | tgv wrote:
         | The "original" and easy to grasp idea behind neural learning is
         | reinforcement: every time a good result is obtained, the
         | connections that contributed get strengthened (reinforced). Bad
          | results lead to weakening. Back-prop is a specific
          | implementation, and it is usually expressed in terms of matrix
          | operations, but you can also describe it without them, at the
          | single-connection level. Implementing it that way isn't efficient,
         | though, hence matrix operations.
        
         | jvanderbot wrote:
         | Perhaps addition and multiplication are the tools we use to get
         | a computer to appear to act like a biological system with all
         | its chemical pathways and simultaneous stimuli.
         | 
          | Let's take that one step further. Who taught orbiting bodies how
         | to do differential equations?
        
         | fnord77 wrote:
         | btw neurons are not the only units of computation in the human
         | body.
         | 
          | Allosteric activation in enzyme binding sites acts almost like
          | a transistor. Inside every cell there are various
          | computations occurring.
        
       | yangjunpro wrote:
        | An interesting work, with some to-be-addressed questions:
        | 
        | 1. The paper only covers the GEMM part, with small-scale
        | experiments (CIFAR-10/100); it does not cover convolution, or the
        | GEMM parts of more popular networks such as Transformer/BERT.
        | 
        | 2. It is still an approximate method, meaning potential accuracy
        | loss. So I think this method is less attractive for the training
        | acceleration scenario; it could potentially serve as a
        | complementary method for inference acceleration.
        | 
        | 3. No results are evaluated on GPUs with Tensor Cores. I am a
        | little bit curious: since modern AI accelerators (including NV
        | GPUs) all incorporate Tensor Cores, which by design support GEMM
        | acceleration, what is the add-on value brought by the
        | approximation method described in this paper?
        
         | Ar-Curunir wrote:
         | If it works better for inference, it could enable fast
          | inference on devices that don't have good tensor cores/GPUs.
        
         | ffast-math wrote:
         | Great observations. I see this paper as the first in a three-
         | part series. The second part is specializing it for convolution
         | (which has additional structure to exploit), and the third is
         | hooking these approximate ops into deep neural nets the way
         | people currently do with scalar quantization / pruning.
         | 
         | I'm not optimistic about beating tensor cores when running on
         | GPUs, at least until/unless we get similar hardware support.*
         | 
         | Barring better hardware support, the killer app is probably CPU
         | inference--once there are Conv implementations and the
         | necessary GPU kernels to train the network.
         | 
         | *Aside: this support would be pretty doable since the kernels
         | look almost identical to GEMM kernels--you just need a
         | multiplex-add rather than a multiply-add. On an x86 machine,
         | all it would take is a vpshufb-add and a 4-bit unpack
         | instruction.
        
         | criticaltinker wrote:
          | > I think this method is less attractive for the training
          | acceleration scenario
         | 
         | The proposed hash based encoding function is not
         | differentiable, so it doesn't appear this method can be used
         | for training at all.
         | 
         | I'm not aware of any hash functions that are analytically
         | differentiable, so to support efficient back-propagation I
         | suspect that some fundamental changes to this method would be
         | necessary.
        
       | infogulch wrote:
       | I wonder how much alternative number representations have been
       | studied for use in matrices. Like, storing the log of values
       | instead so that multiplication can be done by just adding. Or
       | something like Zech's logarithm. Or even, take the log of whole
       | matrices (which is a thing apparently [0]), then adding them to
       | compute their multiplication. I wonder if the logarithm of a
       | matrix can be computed using ML.
       | 
       | [0]: https://en.wikipedia.org/wiki/Logarithm_of_a_matrix
        
         | rocqua wrote:
          | Storing log values helps with the multiply part of multiply and
         | add.
         | 
         | But it seems like it would make the add part of multiply and
         | add quite a bit more difficult.
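          | 
          | A tiny sketch of that tradeoff (scalars only, positive values
          | assumed): multiplies become adds, but the adds need something
          | like log-sum-exp, which is more work than a plain add.
          | 
          | import numpy as np
          | 
          | a, b = 3.0, 7.0
          | log_prod = np.log(a) + np.log(b)              # multiply -> add
          | print(np.exp(log_prod))                       # 21.0
          | 
          | log_sum = np.logaddexp(np.log(a), np.log(b))  # log(a + b)
          | print(np.exp(log_sum))                        # 10.0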
        
         | thwoeri2324 wrote:
          | Quick note: the matrix exponential does not behave the way the
          | scalar exponential does.
         | 
         | https://en.wikipedia.org/wiki/Matrix_exponential#Elementary_...
        
         | ffast-math wrote:
         | There's been some work on doing adds instead of multiplies
         | (e.g., https://arxiv.org/abs/2012.03458). And I think float8
         | will roughly be doing this under the hood.
         | 
         | Personally, I'm not sure whether this is the path to go down or
         | not. Doing everything in log space could help from a hardware
         | perspective, since multiply-adds are much more complex than
         | adds. But it 1) doesn't reduce the number of operations being
         | performed, 2) might increase the necessary bit width for a
         | given level of fidelity (depending on your input distribution
         | and other factors), and 3) might make accumulate operations
         | expensive, since these aren't cleanly expressed in the log
         | domain.
        
       | ffast-math wrote:
       | Primary author here. Happy to answer questions!
       | 
       | Also, feel free to email me at the address in the paper if you're
       | interested in talking about it in more detail. E.g., I've already
       | heard from some hardware folks looking at expanding on this work.
        
       | ausbah wrote:
        | As a total outsider to this sort of stuff, doesn't this have the
       | big issue of error propagation?
        
         | echelon wrote:
         | Probably, but hill climbing will avoid loss in the long run.
         | The speed boost is probably more than worth it.
         | 
         | I wouldn't be surprised if we started using lossier math for
         | faster state space search. Once you find a peak, you could swap
         | out the maths to be more exact.
        
           | erosenbe0 wrote:
           | Suppose you have a set membership question. If you
           | approximate yes, you do a more complex inquiry to grab
           | details. If you approximate no, you tell the user no, don't
           | know that face [answer, command, etc]. So this is ripe for
           | approximate methods as you can tune to get very few false
           | negatives (respond no when it's really yes), but allow a
           | healthy dose of false positives.
        
       | antonzabirko wrote:
        | Had this same exact thought as an undergrad like 3 years ago! I
        | kinda gave up due to the massive barrier and the financial
        | burdens faced by PhD students. It feels nice to know I wasn't
        | crazy.
        
         | Exuma wrote:
         | How does it work?
        
         | ssivark wrote:
         | Do you mean the exact idea behind MADDNESS or just "ML for
         | multiplying matrices"? The latter doesn't mean much without the
         | former :-)
        
           | antonzabirko wrote:
            | Not sure how they did it, but basically I thought about
           | reducing things that gained complexity through layers with
           | neural nets. Computer architecture and the layers of
           | complexity between groups of system instructions and function
           | code for example. There have got to be inefficiencies that
            | can be collapsed with the help of ML.
        
         | honie wrote:
         | I don't disagree with you on the potential financial burdens
         | faced by PhD students but, in this case, and if I haven't
         | missed anything, the infrastructural barrier is not high:
         | 
          | > All experiments use a single thread on a Macbook Pro with a
         | 2.6GHz Intel Core i7-4960HQ processor. Unless stated otherwise,
         | all timing results use five trials, with each trial reporting
         | the fastest among 20 executions.
         | 
          | It may also be worth noting that, even as a hobby, one can
          | actually do a lot with modern hardware if you scope your
          | project well. There are also a lot of free resources, such as
          | CoLab, available to those who have very limited computing power
         | at home/work.
         | 
         | Last but not least, there is also nothing stopping you from
         | announcing your results on arXiv and, if you can be bothered
         | (as a hobbyist), get it published in a peer-reviewed journal.
         | 
         | So if you still have ideas, I encourage you to go ahead and try
         | them! :)
        
           | antonzabirko wrote:
            | I meant the learning barrier. If I had the money I likely would
            | go for it, though!
        
             | ltbarcly3 wrote:
             | I really don't know how to respond other than to say:
             | 
             | "I would have done this years ago if I knew how to do it"
             | is fantastically narcissistic.
        
               | m00x wrote:
               | Another form of: "I could totally have built Google or
               | Amazon if I wanted to, I'm just a lazy genius."
        
         | jmount wrote:
         | In my opinion, that is the thing with theoretical computer
          | science. Many of the ideas are comprehensible. And if one has
          | the opportunity to do the work, you may get the result.
        
           | jvanderbot wrote:
           | Well, taken to the extreme, and from what I can tell, that's
           | the truth for any field.
           | 
           | If you roll everything, financial incentives, emotional
           | readiness, support and preparation, etc, into 'opportunity'
           | then yeah, I believe it.
        
           | Ar-Curunir wrote:
           | tbh I don't think most ideas in TCS are comprehensible to
           | outsiders without spending at least a year or two building
           | background with the state of the art. Even if you can design
           | an algorithm, to get a good TCS paper you have to prove
           | bounds on how well it approximates and/or on its running
           | time.
        
           | mhh__ wrote:
           | The biggest blocker for me is usually working out how I can
           | implement a given idea without either writing a bunch of code
           | (I'll get bored) or trying to verify if the paper even works
            | for my use case without doing the aforementioned. One field
            | where I notice this problem is abstract interpretation: with
            | my background at least, the methods are very theoretically
            | clean and impressive, but actually implementing them as code
            | _and_ knowing how to write _good_ implementations seems to be
            | quite obtuse.
           | 
           | I genuinely don't understand why we allow papers to be
           | published on _computer_ science, with graphs plotted of the
           | supposed efficacy of research (i.e. not just a theoretical
           | paper), with no code attached.
        
             | jijji wrote:
             | On page six of the paper:
             | 
             | To assess MADDNESS's effectiveness, we implemented both it
             | and existing algorithms in C++ and Python. All of our code
             | and raw numerical results are publicly available at
             | https://smarturl.it/Maddness. All experiments use a single
             | thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ
             | processor.
             | 
             | edit: the url above redirects to:
             | https://github.com/dblalock/bolt
        
             | Ar-Curunir wrote:
             | Computer Science doesn't mean empirical stuff only, nor
             | does it mean evaluations with code.
             | 
             | Theory has its place, and so does systems work. No need to
             | denigrate one or the other.
        
               | mhh__ wrote:
               | I was trying to be clear that I was referring to papers
               | which are talking about code they wrote. If you wrote
               | (say) a program to predict the throughput of machine
               | code, then I want to be able to reproduce the results you
                | claim - that's a real example, no hint of any source yet
               | and I've been looking.
               | 
                | If we can't reproduce it, it isn't really science. I know
               | academics often write bad code and don't like to publish
               | their dirty work, but the buck has to stop somewhere.
        
               | robotresearcher wrote:
               | Did you ask them for their code?
        
       | esjeon wrote:
       | > MADDNESS
       | 
        | I think this name really fits the concept well. Replacing
       | matrix-multiplication with some hash-lookups! (warning: an overly
       | simplified statement)
       | 
        | This is a really interesting application of PQ (product
       | quantization), which itself also requires learning (usually
       | K-means). Paper:
       | https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_wi...
       | 
        | Considering that ANNs have survived many approximations
        | (e.g. lower precision, pruning), and that many ANN applications
        | are subjective anyway (e.g. image processing, recommendation), I
       | think this can be highly useful in many cases.
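        | 
        | For anyone who hasn't seen PQ before, here is a minimal sketch of
        | the flavor (extending the single-codebook sketch upthread; toy
        | sizes, not the paper's MADDNESS): split each row into subspaces,
        | learn a small codebook per subspace, and approximate a dot
        | product as a sum of per-subspace table lookups.
        | 
        | import numpy as np
        | from sklearn.cluster import KMeans
        | 
        | rng = np.random.default_rng(0)
        | D, S, K = 32, 4, 16            # dims, subspaces, codewords each
        | A_train = rng.normal(size=(5000, D))
        | b = rng.normal(size=D)
        | 
        | sub = D // S
        | parts = []
        | for s in range(S):
        |     cols = slice(s * sub, (s + 1) * sub)
        |     km = KMeans(n_clusters=K, n_init=4, random_state=0)
        |     km.fit(A_train[:, cols])
        |     lut = km.cluster_centers_ @ b[cols]   # K partial dot products
        |     parts.append((cols, km, lut))
        | 
        | a = rng.normal(size=D)
        | approx = sum(lut[km.predict(a[None, cols])[0]]
        |              for cols, km, lut in parts)
        | print(approx, a @ b)           # S table lookups vs. exact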
        
       | nynx wrote:
        | Huh, that's kinda fascinating. Maybe it'd be worth running ML on
        | matrix multiplication that's approximated by ML?
        
         | TrainedMonkey wrote:
          | The ironic thing is that a large part of how ML works is via
          | matrix multiplication.
        
           | r00fus wrote:
           | That's not ironic, it could be revolutionary for this
           | algorithm and ML in general.
        
           | [deleted]
        
           | noobermin wrote:
            | It's important to remember that dense matrix multiplication
            | (doing it by hand) is an O(N^3) operation; this is about
           | approximations to multiplication that beat that already harsh
           | complexity. There is a whole field that develops
            | approximations to matrix multiplication of large matrices;
            | I'm assuming this article is about using ML to find good
           | approximations.
           | 
            | To the replies: the very act of evaluating a prediction from a
            | NN is matrix multiplication (a linear transform of a column
           | vector is matrix multiplication). This doesn't replace matrix
           | multiplication wholesale, lol. This is about multiplication
           | in a specific case.
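            | 
            | For reference, the O(N^3) in question is just the three
            | nested loops of the schoolbook algorithm (a plain-Python
            | sketch):
            | 
            | def matmul_naive(A, B):
            |     n, m, k = len(A), len(B), len(B[0])
            |     C = [[0.0] * k for _ in range(n)]
            |     for i in range(n):          # N rows...
            |         for j in range(k):      # ...times N columns...
            |             for p in range(m):  # ...times N multiply-adds
            |                 C[i][j] += A[i][p] * B[p][j]
            |     return C
            | 
            | print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
            | # [[19.0, 22.0], [43.0, 50.0]]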
        
             | Nevermark wrote:
              | As long as there is enough consistency between
             | approximating a multiply X*W and Z*Wt (Wt = W transpose),
             | then it is possible it could be used in NN training.
             | 
             | Y = X*W is the forward propagation. If Z is an error or
             | derivative of error, Z*Wt is the back propagation.
             | 
              | It's an interesting question how well that would work.
             | Anything that speeds up matrix multiply in NN and deep
             | learning in general would be a big deal.
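              | 
              | The two products in question, spelled out in numpy (a
              | sketch with made-up shapes; grad_Y stands in for the
              | error signal):
              | 
              | import numpy as np
              | 
              | rng = np.random.default_rng(0)
              | X = rng.normal(size=(64, 128))     # activations
              | W = rng.normal(size=(128, 32))     # layer weights
              | 
              | Y = X @ W                          # forward propagation
              | grad_Y = rng.normal(size=Y.shape)  # upstream error
              | grad_X = grad_Y @ W.T              # back propagation
              | grad_W = X.T @ grad_Y              # weight gradient
              | print(grad_X.shape, grad_W.shape)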
        
           | uluyol wrote:
           | Lots of things are like this. I program my network using a
           | network. I run a server (BMC) on my server. I start my engine
           | with an engine
           | (https://en.m.wikipedia.org/wiki/Lockheed_SR-71_Blackbird).
        
       | trilinearnz wrote:
       | Clever and logical. It reminds me of when John Carmack used a
        | precomputed table of sine values for fast lookup in Quake, rather
       | than running the actual function on the CPU.
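        | 
        | A tiny table in the same spirit (not Carmack's actual code, and
        | the table size here is arbitrary): precompute N samples once,
        | then index instead of calling sin().
        | 
        | import math
        | 
        | N = 1024
        | SIN_TABLE = [math.sin(2 * math.pi * i / N) for i in range(N)]
        | 
        | def fast_sin(radians):
        |     # truncate to a table entry; error is on the order of 2*pi/N
        |     idx = int(radians / (2 * math.pi) * N) % N
        |     return SIN_TABLE[idx]
        | 
        | print(fast_sin(math.pi / 6), math.sin(math.pi / 6))  # both ~0.5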
        
         | richrichardsson wrote:
         | I was using a precomputed table of sine values in 3D graphics
         | way before Quake ever hit the scene, and I certainly didn't
         | invent that idea either.
        
           | trilinearnz wrote:
           | Ah that's useful to know that it was conventional up until
           | then, thanks. It was my first exposure, personally :)
        
       | kazinator wrote:
       | If machine learning depends on matrix multiplications, but you
       | fib them using machine learning ... do you see the problem?
        
         | qsort wrote:
         | If a C++ compiler depends on a C++ compiler to be compiled
         | itself... do you see the problem?
         | 
         | I have only superficially looked at the paper, but it's pretty
         | clear they are using an offline lookup table. There are no such
         | problems involved.
        
           | kazinator wrote:
           | Yes, I do; your bootstrapping is dependent on a binary C++
           | compiler, which could be hiding something that isn't in the
           | source code, but which propagates to newly bootstrapped
           | compiler binaries which again pass it on to the next round of
           | bootstrapping, _ad infinitum_.
           | 
           | Basically, we can't be sure that you really have a C++
           | compiler. The source code can be verified to be a C++
           | compiler, but the binaries deviate in some way, possibly
           | malicious.
        
             | robertlagrant wrote:
             | Sure, but you seem to have gone off on a wild tangent. How
             | does that relate to matrix multiplication?
        
       | joe_the_user wrote:
       | Shouldn't the solution to fast matrix multiplication be
       | logarithms, similarly to fast scalar multiplication?
       | 
       | https://en.wikipedia.org/wiki/Logarithm_of_a_matrix
        
         | PythonNut wrote:
         | Perhaps surprisingly, the matrix logarithm does not satisfy
         | log(A) + log(B) = log(AB) in general (only when AB = BA) which
         | is why you cannot use it to "multiply by adding".
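          | 
          | A quick counterexample with scipy (made-up 2x2 matrices that
          | don't commute):
          | 
          | import numpy as np
          | from scipy.linalg import expm, logm
          | 
          | A = np.array([[2.0, 1.0], [1.0, 2.0]])
          | B = np.array([[3.0, 0.0], [0.0, 1.0]])
          | 
          | print(np.allclose(A @ B, B @ A))   # False
          | lhs = expm(logm(A) + logm(B))
          | print(np.allclose(lhs, A @ B))     # False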
        
           | adgjlsfhk1 wrote:
           | More importantly, matrix logarithms only exist for square
           | matrices.
        
             | xyzzyz wrote:
             | That's not that big of a problem, as you can easily reduce
             | the problem of multiplication of rectangular matrices to
             | multiplication of square ones.
        
       ___________________________________________________________________
       (page generated 2021-09-01 10:01 UTC)