[HN Gopher] Matrix Multiplication Inches Closer to Mythic Goal
___________________________________________________________________
Matrix Multiplication Inches Closer to Mythic Goal
Author : yboris
Score : 225 points
Date : 2021-03-23 14:39 UTC (8 hours ago)
(HTM) web link (quantamagazine.org)
(TXT) w3m dump (quantamagazine.org)
| [deleted]
| woopwoop wrote:
| On the off chance an expert reads this comment and feels like
| answering, is there a consensus opinion (or even just your
| opinion) on what the best result is likely to be? Do people think
| O(n^2) is actually possible?
|
| Edit: shoulda read to the end, it says that Strassen says no.
| phkahler wrote:
| I'm no expert, but IMHO it feels like a problem that might have
| O(n^2 log(n)) time complexity. That would be more satisfying
| than some arbitrary non-integer exponent.
| waynecochran wrote:
| Get that n out of the exponent -- that is _waay_ too slow.
| Maybe you meant O(n^3*log(2))? That exponent is definitely a
| constant.
| btilly wrote:
| We have better than that. I think that what was meant is
| O(n^2*log(n))
| jessriedel wrote:
| I don't know if he edited his comment, but as it is
| currently written the log is not in the exponent under
| standard order of operations.
|
| Also, putting a constant factor like log(2) inside big-O
| notation is meaningless.
| kadoban wrote:
| Constant factors in exponents are _not_ meaningless. One
| of those cases where the rules-of-thumb bite you.
| jessriedel wrote:
| I didn't claim that. The log(2) term is multiplied, and
| that's implied by my use of "factor", which in this
| context is always understood to mean multiplicative
| factor.
| kadoban wrote:
| You're correct. I think this comes down to confusion over
| operator precedence. I read it as O(n^(3 log 2)) because
| otherwise it's very obviously not correct (no way it's n^3,
| that's already known), but I should have stated that, since
| under the usual notation/precedence it's not that.
| jessriedel wrote:
| The impression I had is that many experts believe that no
| single algorithm A can achieve O(n^2), but there exists a
| sequence of algorithms A_i that achieve O(n^{2+d_i}), with
| d_i>0 getting arbitrarily small, but with the algorithms
| getting arbitrarily long and the constant overhead factors out
| front getting arbitrarily large.
|
| Unfortunately this was just word of mouth, and I might be
| misremembering.
| zests wrote:
| This is what I choose to believe. Even if it is not true I
| will still believe it because it is so satisfying.
| dekhn wrote:
| In the time spent improving matrix multiplication by a tiny
| incremental amount, many gigawatt hours have been spent making
| existing CPUs, GPUs, and TPUs do matrix multiplication using far
| less efficient algorithms. It's unclear whether this work would
| really be important in production computing (I understand the
| theoretical desire) compared to time spent making other things
| work better.
| MauranKilom wrote:
| > It's unclear this work would really be important in
| production computing
|
| It most certainly won't [0]. But neither will something like
| 99% of mathematical research right now.
|
| [0]: See first example in
| https://en.wikipedia.org/wiki/Galactic_algorithm
| snapcore wrote:
| I was under the impression the Strassen method was rarely used
| because it has a number of drawbacks such as harder to
| parallelize, uses more memory, has bigger constants for
| complexity, less stable, etc.
|
| It's not really designed to be optimal for the current
| hardware, but just to lower the number of steps. It could be
| more useful in the future though.
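For concreteness, Strassen's scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8, at the cost of extra additions. A minimal sketch (the function name is just for this illustration):

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications (Strassen's
    identities) instead of the naive 8, trading them for additions."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applied recursively to matrices of 2x2 blocks, this yields the O(n^2.81) bound, with the extra additions and temporaries illustrating the overheads mentioned above.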
| pjsg wrote:
| There are related problems in computer science -- e.g. how
| quickly can you multiply a pair of 65536x65536 extended double
| precision matrices together. Note that each matrix consumes ~68GB
| of memory, so you need (probably) at least 200GB to store the two
| input matrices and the output matrix. So far, so good.
|
| Now you want to see how fast it can really go if you are prepared
| to spend (say) $100k on the hardware (e.g. 64 times as many CPUs
| with some fancy interconnect fabric). How about spending $100M?
| It turns out that moving all this data around is also quite
| costly.....
| goldenkey wrote:
| That's not a problem in C.S. That's a problem in engineering.
| ovi256 wrote:
| > "Exponent two" refers to the ideal speed -- in terms of number
| of steps required -- of performing one of the most fundamental
| operations in math: matrix multiplication. If exponent two is
| achievable, then it's possible to carry out matrix multiplication
| as fast as physically possible.
|
| Can anyone clarify what "as physically possible" means here? Are
| there physical systems that perform matrix multiplication in
| O(n^2) ? (I feel silly even asking that about a physical, non-
| computation system)
|
| I remember there were applications of optical computation,
| including a startup that wanted to accelerate deep network
| inference through optical computation: they had a way to do
| matrix multiplication using a projector (for the data matrix) and
| a 3D printed lens (for the network weights).
| aliceryhl wrote:
| It's badly phrased. What they mean is that we know that it will
| take at least n^2 operations, and that it would be nice to find
| an algorithm that is actually that fast.
| OneDayIGoHome wrote:
| I think if you could somehow start computing the resultant
| matrix elements as soon as you read a row/column from the input
| ones, you could reach their "physically possible" limit.
|
| I'm only a couch expert on computer architecture, but could a
| large enough systolic array be used to achieve their
| "physically possible" limit? [0]
|
| Newer CUDA GPUs have been shipping with tensor cores, which are
| just systolic arrays. Google's TPUs are the same.
|
| Could someone with more knowledge on systolic arrays comment on
| whether a large systolic array can achieve this?
|
| [0] https://medium.com/intuitionmachine/googles-ai-processor-
| is-...
| btilly wrote:
| The answer is no but sort of.
|
| You cannot reduce the number of operations by waving a magic
| architecture wand. Hence the no. But you can do operations in
| parallel rather than in series. Same number of operations,
| but it takes less time.
|
| So if a systolic array is built for the same scale as your
| problem, your bottleneck does indeed become how quickly you
| can send data to/from that dedicated system.
| slt2021 wrote:
| O(N^2) is the time you need just to write down the correct
| answer, provided that it takes 0 time to calculate each cell in
| the resulting matrix
| hyperbovine wrote:
| Provided it takes O(1) time not 0.
| klyrs wrote:
| An (n x n) square systolic array can perform matrix
| multiplication in time O(n) (not counting I/O, which takes
| O(n^2) without more specialty hardware) -- this is how TPUs
| work. Simulate the same algorithm on a single processor and it
| takes O(n^3). Use the same hardware to compute larger matrix
| products and it only gives a constant factor speedup.
| sdenton4 wrote:
| I don't think the sentence is very good.
|
| You need n^2 time just to read the two matrices, so I always
| thought of that as a natural lower bound. (Of course, we're
| talking about multiplications eventually here, so may not be
| exactly what they meant...)
| [deleted]
| jwmerrill wrote:
| I don't think they're actually intending to refer to physics
| here--they could have said "as fast as theoretically possible"
| instead.
|
| A square matrix with n rows and columns has n^2 entries, so if
| you need to look at all the entries in two matrices to compute
| their product, then you need to do O(n^2) work just to look at
| all the relevant entries.
|
| The only way you could get around this is if you know many
| entries are 0, or identical, or have some other structure such
| that you don't need to look at them separately.
| fastball wrote:
| I don't think they're talking about total operations here, as
| pointed out in the article this is just about the number of
| multiplications.
|
| Every matrix multiplication can be boiled down to multiplying
| rows by columns. This is definitionally required for matrix
| multiplication, so that is the lower bound for how many
| multiplications are required. One multiplication for every
| final element in the n x n matrix, or n^2.
| staticfloat wrote:
| They are talking about total operations, but since big O
| notation ignores constant factors, and they explicitly
| ignore the addition steps (because for large n
| multiplication dominates) it is, in effect, only looking at
| the multiplications.
|
| To answer the GP, the minimum amount of work necessary is
| to write a number for each element in the array, which
| scales by n^2 since there are n^2 elements. It is of course
| only possible to beat this if you can make assumptions like
| "the diagonal of my matrix is nonzero and everything else
| is zero", but that's not general matrix multiplication
| anymore, that's a specialized sparse matrix routine. For a
| dense, "normal" matrix, O(n^2) is the absolute best we
| could hope for.
| fastball wrote:
| We're saying approximately the same thing, though I'd
| argue that your phrasing is slightly more confusing to
| people that aren't super familiar with Big O notation
| (and therefore don't exactly understand what is meant by
| "ignores constant factors" and similar).
|
| Reads from the original two matrices and writes to the
| final matrix aren't at all the crux of the problem, so I
| don't think it makes sense to focus on either of those
| things in your explanation. The goal of the problem is to
| reduce the number of multiplications which are required
| to _find_ the values that you then write (of which there
| will always be n^2 writes). There will also be 2n^2
| reads, but likewise that isn't relevant because it's not
| what you're trying to optimize.
|
| The relevant idea here is that the naive method of
| multiplying a matrix involves n multiplications for every
| row-column pair, for a total time complexity of n^3 (n
| rows, n cols, n multiplications). The goal here is to
| bring down that third number, from n down to something >=
| 1, with the holy grail obviously being 1 itself.
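The count described above (n multiplications per row-column pair, n^2 pairs) can be made concrete with a small sketch in plain Python that tallies the scalar multiplications of the textbook method:

```python
def matmul_naive(A, B):
    """Textbook matrix multiplication: n multiplications per output
    entry, n^2 output entries, so n^3 scalar multiplications total."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):          # n rows
        for j in range(n):      # n columns
            for k in range(n):  # n multiplications per row-column pair
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults
```

For n = 4 the counter comes out to 64 = 4^3; bringing the innermost factor down from n toward 1 is exactly the game described above.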
| MauranKilom wrote:
| > with the holy grail obviously being 1 itself.
|
| It's not at all obvious to me why the number of
| _multiplications_ must be at least n^2. Can you prove
| that? (To be clear, n^2 multiplications is most certainly
| the lower bound, but I can't come up with a good
| argument as to why.)
|
| This is why the trivial "you must read n^2 entries from
| each input matrix and write n^2 entries in the end" is
| helpful: It's an extremely obvious lower bound on the
| asymptotic complexity. Sure, it's not what we're trying
| to reduce (and getting < n^2 multiplications would be a
| ridiculous result), but anyone can see that it's true.
| hansvm wrote:
| The other answers are good (it takes O(n^2) time just to read
| two matrices, so you can't possibly multiply them more quickly
| than that), but they leave out a critical part of that analysis
| that I want to highlight just in case it would have otherwise
| bitten you in the future:
|
| You _do_ actually have to read all the entries in both matrices
| -- if you know K-1 out of K values then in general you still
| can't uniquely compute the output of matrix multiplication.
| Consequently, any algorithm relying on only some subset of the
| entries is at best an approximation.
|
| If you have more information (e.g., that one or both of the
| matrices is sufficiently sparse) then you can potentially do
| better.
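A tiny illustration of the point that every entry matters: changing a single entry of one input changes the product, so an exact algorithm cannot skip any entry. (The helper name `matmul` is just for this sketch.)

```python
def matmul(A, B):
    """Plain definition-based matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A  = [[1, 2], [3, 4]]
A2 = [[1, 2], [3, 5]]   # differs from A in a single entry
B  = [[1, 0], [0, 1]]
# matmul(A, B) != matmul(A2, B): an algorithm that never read
# A[1][1] could not distinguish these inputs, so it could not
# compute the exact product for both.
```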
| CalChris wrote:
| If n^2 is possible then it would be possible for a given n, say
| 4. Can't it be shown by example or exhaustive search that this is
| or isn't possible for 4?
| dragontamer wrote:
| O(n^2) is a scaling factor: what happens as n->infinity.
|
| An optimal arrangement for n==4 already exists. Both AMD MI100
| and NVidia Volta / Ampere perform 4x4 FP16 matrix
| multiplication in a single assembly language statement.
|
| An 8x8 matrix is just an arrangement of four 4x4 matrices
| (!!!!). As such, you can define an 8x8 matrix multiplication as
| a 4x4 matrix multiplication over 4x4 matrices. This recursive
| relationship is often key to how they manage to get faster and
| faster, getting to O(n^2.8), O(n^2.6), etc.
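A minimal sketch of this recursive blocking idea (hypothetical helper names; the base case is a plain cubic multiply standing in for a hardware 4x4 instruction). Replacing the eight recursive block products below with Strassen's seven is what pushes the exponent below 3:

```python
def block_matmul(A, B, base=4):
    """Recursive block multiply: an n x n product (n a power of two)
    is a 2x2 product whose 'scalars' are n/2 x n/2 blocks."""
    n = len(A)
    if n <= base:  # stand-in for a hardware 4x4 multiply
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2
    def quad(M):  # split into four h x h blocks
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # Eight half-size products, exactly like a 2x2 scalar multiply
    C11 = add(block_matmul(A11, B11, base), block_matmul(A12, B21, base))
    C12 = add(block_matmul(A11, B12, base), block_matmul(A12, B22, base))
    C21 = add(block_matmul(A21, B11, base), block_matmul(A22, B21, base))
    C22 = add(block_matmul(A21, B12, base), block_matmul(A22, B22, base))
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```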
| kadoban wrote:
| Since we're looking for a O(n^2) algorithm, not specifically
| n^2 operations, this actually doesn't help. You could for
| example need 1000n^2+1000000 operations, which would look
| terrible if you only look at n=4, but you're golden if
| n=10000000 (and even if there's no practical n where you win,
| it's still theoretically very interesting).
| [deleted]
| throwaway34u29 wrote:
| Huh, it's been a long time since I heard the name Josh Alman.
| Last time I heard about him he was cheating in academic trivia.
| Looks like he's managed to contribute more positively to society
| since then.
| dewiz wrote:
| How would one calculate the $$$ value of exp2 multiplication? Eg
| if one finds a way, would it be worth keeping it a secret and
| creating a business around it?
| hypersoar wrote:
| First of all, you couldn't. The "fastest" known algorithms for
| matrix multiplication, which are roughly O(n^2.3), aren't used
| in practice. They're "galactic algorithms" (a term I just now
| learned), meaning that the problem size at which they would
| overtake the more common O(n^2.8) algorithm is too large for
| today's computers. O(n^2) might be theoretically valuable but
| practically useless.
|
| Second of all, algorithms aren't patentable. The only way to
| keep it exclusive would be to keep it secret. I can't think of
| any examples of a private company making a major scientific or
| mathematical breakthrough and then keeping it secret long
| enough to profit off of it.
| ben509 wrote:
| > Second of all, algorithms aren't patentable.
|
| Algorithms absolutely are patentable because they are
| computing "machines." It's mathematical equations that aren't
| patentable, because they're closer to laws of nature.
| desertrider12 wrote:
| > algorithms aren't patentable
|
| I wish. The RSA cryptosystem was patented and under license
| until 2000 in the US, even though the math wasn't secret.
| thomasahle wrote:
| If you are interested in how those tensor methods work, the
| authors have another very nice paper that gives a great
| introduction https://arxiv.org/abs/1810.08671 It is actually not
| that different from Strassen after all. It just uses a bit more
| notation to help construct more clever ways of saving
| multiplication.
| joosters wrote:
| Shouldn't that first graph have the axes the other way round?
| Time (date) should be on the x-axis, and the measurement on the
| y-axis? As it is, it looks terribly confusing to me.
| balefrost wrote:
| _However, the number of additions required to add those sets of
| matrices is much closer together. Generally, the number of
| additions is equal to the number of entries in the matrix, so
| four for the two-by-two matrices and 16 for the four-by-four
| matrices._
|
| Maybe my linear algebra is REALLY rusty, but that doesn't seem
| right to me.
|
| For two 4x4 matrices, each element in the output matrix will be
| the sum of four products. So each of the 16 output elements will
| require 3 additions, and you'll need a total of 48 additions.
|
| I'm not sure where they're getting "16 additions" from.
|
| To multiply two nxn matrices according to the textbook method, I
| think you'll need n^3 multiplications and n^3-n^2 additions.
|
| I don't doubt that the multiplication time dominates the
| algorithm. Maybe this is just a nitpick.
| soVeryTired wrote:
| They're counting the additions needed to _add_ the matrices,
| not to multiply them. The point being that matrix addition is
| order n^2, so even if you need to do a large number of matrix
| additions in some intermediate calculation, they don't hurt
| you as n gets large if you're trying to speed up matrix
| multiplication.
| ur-whale wrote:
| A graph with time on the Y axis ... a.k.a why make things simple
| when you can instead confuse 80% of your readership with a simple
| mirror trick.
| munk-a wrote:
| Yea this surprised me at first - especially with the slope to
| n^2 seeming to indicate that the Y axis quantity remaining was
| known and measurable.
|
| Additionally the quote below it:
|
| > Mathematicians have been getting closer to their goal of
| reaching exponent two -- meaning multiplying a pair of n-by-n
| matrices in only n^2 steps -- but will they ever reach it?
|
| seems particularly odd when the height of those steps is
| irrelevant in the graphic - if the entire graph was just a flat
| plane going from n^3 to n^2 it would signify the "best" or at
| least fastest innovation, the "steppedness" of it and
| especially steep steps, indicate lags in progress rather than
| great leaps.
| tooltower wrote:
| This is a common convention for annotating historical
| timelines. Even completely unrelated fields like relativity
| routinely place time on the y-axis.
|
| Please take your snark elsewhere.
| mschuetz wrote:
| It is kind of weird and unintuitive. This one here, posted by
| another commenter, is much much more understandable: https://
| commons.wikimedia.org/wiki/File:MatrixMultComplexity...
| the8472 wrote:
| The WP image has an unfortunate offset on the Y axis,
| giving the impression that we're much closer to optimal
| than we actually are.
| stephencanon wrote:
| We don't know this. I think a lot of people expect that
| there's an O(n^(2+eps)) algorithm for any epsilon > 0,
| but that's conjecture; we might already be at the
| optimum.
| cmarschner wrote:
| It is only unintuitive if your mental representation of
| time flows to the right.
|
| We all create a mental model of time at some point in our
| life, and then stick to it - often without ever talking
| about it to anyone.
|
| But the representations vary substantially between people.
| Many people might have a timeline where the past is on the
| left and the future is on the right - maybe along the
| reading direction. But for some, future expands to the
| left. For others, future expands in front of them while the
| past is behind them (quite inconvenient).
|
| It can make for a great party conversation to explore each
| others' timelines.
|
| If there are parties, that is.
| jvanderbot wrote:
| What is with that banner image? Time on the Y axis? Factor on the
| X? Some mythical line-fit of factor over time that doesn't match
| the data?
|
| Great article though.
|
| Yuck.
| ziddoap wrote:
| I definitely agree, but thinking about it I wonder if it was
| simply to get a "gut reaction" to the graphic - that things are
| getting better and we are reaching the limit of what is
| possible (in the context of matrix multiplication).
|
| It's easier to get that feeling at a glance when the graphic is
| increasing left-to-right. With normal conventions the graphic
| would be decreasing. Which of course, that's what it is doing
| (and the goal). But the "gut reaction" to a decreasing graph is
| "things are doing worse". The gut reaction to graph increasing
| is "things are doing better".
|
| I could be completely off-base, but that's the only reason I
| can think of to eschew normal axis conventions.
| MauranKilom wrote:
| So put time on x axis and exponent on y, but make the
| exponent decrease from bottom to top. They've already
| reversed the x axis here, so they could've just done that for
| y instead.
| ziddoap wrote:
| Yep, that would be another option and perhaps a more
| suitable one to achieve the same effect -- assuming my
| guess as to "why" is even correct in the first place.
| black_puppydog wrote:
| Came here for exactly that. WTF?
| ur-whale wrote:
| >What is with that banner image?
|
| You beat me to it.
| daureg wrote:
| I felt the same about that picture, the original source is less
| pretty but much easier to read:
| https://commons.wikimedia.org/wiki/File:MatrixMultComplexity...
| 1-more wrote:
| It made me think of the progression of men's 100m dash times. I
| was really hoping it would line up better. I think long jump
| might have been a better choice for a spurious correlation
| graph.
|
| https://imgur.com/a/5quyLWC
| 1-more wrote:
| I think I cracked it: it lags about 30 years behind the long
| jump progression
|
| https://imgur.com/a/YHvXq7G
| melling wrote:
| I am doing Andrew Ng's ML Coursera course with Matlab.
|
| I've now got the crazy desire to see matrices and vectors built
| into every language in a clean and succinct way.
| mkaic wrote:
| Just finished the course and couldn't agree more. Everything
| seemingly is faster with matrices!
| bobbybabylon wrote:
| Have you looked at languages like J or APL?
| centimeter wrote:
| Beyond matrices and vectors, it would be nice to have
| generalized (typed) tensors. What you can do with matrices
| alone is really somewhat limited.
| Robotbeat wrote:
| From a data structure perspective: You mean like an
| n-dimensional array? That's what Matlab does. Matrices (2D
| arrays) are just a special case of array in Matlab.
| whimsicalism wrote:
| I suspect by the usage of the phrase "typed" they mean
| something like that, but where the axes/dimensions/whatever
| you call them are typed - ie. you can't naively take a
| product along the time dimension of A with the batch
| dimension of B.
| jpeloquin wrote:
| xarray seems to fit the bill. As long as you're ok with
| runtime axis/dimension/type checking.
|
| https://xarray.pydata.org/en/stable/
| montalbano wrote:
| Convenient matrix algebra is one of the selling points of
| Julia.
| nerdponx wrote:
| I'm really hopeful that Julia will continue to gain adoption.
|
| That said, I would _love_ to have good data science library
| support (bindings to stuff like BLAS /LAPACK, Torch, Stan,
| etc) in Idris.
|
| Imagine vector and matrix dimensions checked at compile time.
| Probabilities guaranteed to be between 0 and 1, checked at
| compile time. Compile-time enforcement that you are handling
| "nil/null" data inputs correctly when reading data from a
| database or CSV file. Writing a web scraper with linear types
| to help catch bugs & performance leaks. Maybe even _proving
| theorems_ about the math code you're writing ("is this a
| valid way to express X?").
|
| Maybe not ideal for rapid exploratory model development, but
| it'd be pretty damn cool for writing data science
| tooling/libraries and "production" ML code in my opinion.
| [deleted]
| taxemeEvasion wrote:
| It's a main reason why Fortran still has such a hold on the
| SciComp/HPC community.
| gowld wrote:
| Fortran is also optimized and debugged over decades of
| testing.
| karma_fountain wrote:
| Yeah, those algorithms are numerically tested, haven't
| changed in years, and are really fast. You can certainly get to
| convenient matrix manipulation in C++, but then your
| compile times will increase significantly.
| Zardoz84 wrote:
| You can have a fast implementation and fast compilation
| time with DLang
| andi999 wrote:
| Performance might be worse in c++
| DesiLurker wrote:
| Can the fortran implementations do auto-vectorization on
| simd hardware? or map to gpu using cuda-like library?
|
| not rhetorical, genuine question.
| cygx wrote:
| Yes. See e.g.
|
| https://gcc.gnu.org/projects/tree-
| ssa/vectorization.html#exa...
|
| https://developer.nvidia.com/cuda-fortran
| jhayward wrote:
| Yes, in fact all the original research for those things
| was done in Fortran compilers.
| tasty_freeze wrote:
| One of the unexpected things is Dartmouth BASIC from 1964 had
| matrix primitives, despite being an otherwise primitive
| language. MS never took up the feature, but I know Wang BASIC
| did on their 2200 family of machines.
|
| For example:
|
|     10 DIM A(3,3),B(3,3),C(3,3)
|     20 MAT READ A
|     30 DATA 1,-5,7, .45,23,9, -5,-1,0
|     40 MAT B=A        :REM assignment
|     50 MAT B=ZER      :REM zero array
|     60 MAT B=CON      :REM all 1s
|     70 MAT B=IDN      :REM identity
|     80 MAT B=TRN(A)   :REM transpose
|     90 MAT B=(2)*A    :REM scale
|     100 MAT B=INV(A),D :REM invert array, determinant in scalar D
|     110 MAT C=A*B     :REM matrix multiplication
|     120 MAT PRINT C   :REM prints something close to an IDN array
| Someone wrote:
| The first version of Dartmouth BASIC ran on a system with a
| whopping 20kB of memory (8k words at 20 bits each; see
| https://en.wikipedia.org/wiki/GE-200_series)
|
| The GE-635 (https://en.wikipedia.org/wiki/GE-600_series) that
| it ran on most of its life could have 4 CPUs, had protected
| memory and floating point hardware, and supported at least
| 200 kilowords of memory (http://www.bitsavers.org/pdf/ge/GE-6
| xx/CPB-371A_GE-635_Syste...)
|
| At 36 bits/word, that's 900 kilobytes.
|
| The first version of Microsoft basic
| (https://en.wikipedia.org/wiki/Altair_BASIC) ran in 4
| kilobytes (leaving about 3/4 kilobytes for programs and
| data), so it's understandable that it lacked some features
| (it didn't even have else clauses in if statements, had SIN
| but not COS or TAN, didn't support integer or string
| variables, etc.)
| coliveira wrote:
| Believe it or not, engineers were heavy users of BASIC in the
| first few years. And FORTRAN was created with the notion of
| matrix multiplication, so it was just natural that BASIC
| implementations supported matrix multiplication.
| pavlov wrote:
| My grandfather was a construction engineer. I remember
| visiting his office around 1990 just before he retired.
|
| His only computer was a Commodore 64 and he had written all
| the software himself in BASIC over the years, just
| translating a lifetime of professional experience into
| these little interactive programs on floppies.
|
| Maybe the modern equivalent would be to use Excel.
| mamcx wrote:
| Yeah, this is something pretty useful. I also add to even go
| beyond and go full relational, that is what I'm trying with
| https://tablam.org
| teruakohatu wrote:
| The base type in R is a vector. A number such as 3.14 is simply
| a vector of length 1. So functions are vectorized by default
| (as long as you don't do any operations that are explicitly not
| vectorizable).
|
| Once you get your head around everything being a vector, R can
| be a lot of fun.
|
| I love Julia but I wish I didn't need to explicitly broadcast
| functions across a vector.
| make3 wrote:
| Yes, have you tried Julia? It is pretty much how you describe.
| There are other things that I don't like about that language,
| but the matrix part is really great
| yrral wrote:
| Does anyone know how matrix multiplications are implemented in
| CPUs GPUs or ML asics? Are there optimizations used or do the
| optimizations take too much circuitry and thus the n^3 method is
| still used?
| OneDayIGoHome wrote:
| For GPUs, it's actually much faster than O(n^3) because
| computing each entry in the result matrix is independent.
| Hence, the problem is embarrassingly parallel in a way.
|
| I don't know how to use O() notation for GPUs, but it should be
| something like O(n^2/k^2) where k is the tile size [0].
|
| Lower memory bandwidth also becomes a bottleneck here, so a lot
| of optimization goes into how data is transferred from CPU to
| GPU and then within the GPU to efficiently access the matrices.
|
| [0]https://docs.nvidia.com/deeplearning/performance/dl-
| performa...
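The tiling idea from the linked guide can be sketched sequentially. This is a hedged illustration, not GPU code: the operation count is still O(n^3), but the loops are grouped into T x T tiles the way a GPU assigns one output tile per thread block and stages the needed sub-blocks in fast shared memory (T is a hypothetical tile size):

```python
def matmul_tiled(A, B, T=2):
    """Tiled matrix multiply: same total work as the naive triple
    loop, but grouped so each (ii, jj) output tile is computed from
    staged T x T sub-blocks of A and B, mirroring GPU blocking."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):          # tile row of C
        for jj in range(0, n, T):      # tile column of C
            for kk in range(0, n, T):  # sub-blocks of A and B staged together
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

On real hardware the payoff is memory locality, not fewer operations: each staged sub-block is reused T times before being evicted.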
| overboard2 wrote:
| Just because you can do a finite number of operations at the
| same time, it doesn't mean you've changed O().
| dragontamer wrote:
| > For GPUs, it's actually much faster than O(n^3) because
| computing each entry in the result matrix is independent.
| Hence, the problem is embarrassingly parallel in a way.
|
| That's not how you do big-O with parallel systems.
|
| When talking a parallel algorithm, you often perform two
| different big-O calculations:
|
| * Breadth: "If a single-processor" executed the parallel
| algorithm, how long would it take? (If the best sequential
| algorithm and best parallel algorithm are both within a
| constant-factor of breadth, it's called "work-efficient").
| Also called "Total Work"
|
| * Depth: "If an infinite number of processors" executed the
| parallel algorithm, how long would it take? Also called
| "Length of the longest dependency chain".
|
| A naive matrix-multiplication would be O(n^3) breadth. I
| don't know the depth calculation unfortunately...
|
| ---------
|
| For example, Bitonic Sort is one of the best GPU-sorting
| algorithms, but it's O(n * log^2(n)) breadth... asymptotically
| slower than the sequential O(n*log(n)) time.
|
| But Bitonic Sort is often used because its Depth is
| O(log^2(n)). Which means that "with enough processors", you
| approach O(log^2(n)) sorting time, which is pretty amazing.
|
| Note: "GPU Mergepath" is pretty damn amazing if you've never
| seen it before. O(n) breadth to perform a merge operation
| (part of merge-sort), so for large arrays, Mergesort wins as
| long as you use the GPU Mergepath algorithm to perform the
| individual merge steps.
|
| But if you have more processors than data (lol, it happens:
| Vega64 supports 163,840 threads of execution: Occupancy 10 x
| 4096 physical cores x innate hardware parallelism over 4
| clock ticks), Bitonic sort is an obvious choice.
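For reference, a sequential sketch of the bitonic sorting network being described (the standard iterative formulation for power-of-two lengths). On parallel hardware all compare-exchanges of one (k, j) stage run concurrently, and there are O(log^2 n) stages, which is the depth quoted above; a sequential sweep like this one does O(n log^2 n) total work:

```python
def bitonic_sort(a):
    """Bitonic sorting network, ascending, for power-of-two lengths."""
    a = list(a)
    n = len(a)
    k = 2
    while k <= n:       # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:   # compare-exchange distance within this stage
            for i in range(n):  # all of these are independent: one parallel stage
                l = i ^ j
                if l > i:
                    up = (i & k) == 0  # sort this region ascending?
                    if (up and a[i] > a[l]) or (not up and a[i] < a[l]):
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```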
| sdenton4 wrote:
| For depth:
|
| If I have at least n^2 processors, I can send a row and
| column to each processor, which can compute the inner
| product in linear time. So O(n^2) time to coordinate the
| work, and O(n) to actually do it.
| dragontamer wrote:
| Hmmmm. Your O(n^2) step seems unnecessary to me.
| Therefore: the answer is O(n) depth for the naive case.
|
| -----------
|
| Processors are generally assumed to be self-numbered. Ex:
| processor #100 knows it's processor #100.
|
|     xloc = (threadIdx.x % matrix_width);
|     yloc = (threadIdx.x / matrix_width);
|     performMatrixCalc(xloc, yloc);
|
| Therefore, only O(n) depth apparently. O(log(n)) to
| broadcast matrix_width to all processors, which seems to
| be the only communication needed to organize the
| calculation.
| sdenton4 wrote:
| Nice, thanks!
| colanderman wrote:
| O notation doesn't apply to hardware in that way. The naive
| algorithm is still O(n^3) on a GPU. Remember that O notation
| is concerned with behavior at the limits and ignores constant
| factors. Parallel hardware can only provide a constant
| speedup for problems of arbitrary size, hence it does not
| show up in O notation.
| sdenton4 wrote:
| Which, alas, is how we get kids at Harvard chasing a couple
| decimal points on a dead end path instead of doing
| something useful. I'm actually a bit perplexed as to why
| anyone thinks extending the laser method to improve the
| fourth decimal point is worth the effort. No one will ever
| implement this thing, and no one believes it actually
| brings us closer to an actual solution to the exponent 2
| conjecture. So it seems entirely like a way for PhD students
| to cut their teeth, perhaps? But ultimately not much more
| helpful than finding the next digit of pi.
| gugagore wrote:
| > No one will ever implement this thing, and no one
| believes it actually brings us closer to an actual
| solution to the exponent 2 conjecture.
|
| It brings us closer more than anything I've done, that's
| for sure.
|
| I agree with your sense of taste about which problems are
| personally interesting, and likely to be significant in
| my lifetime. But I still think it's cool that there are
| theoretical developments like this. Refining bounds is a
| cool way to see theoretical progress quantitatively.
|
| I think also it's more like discovering what pi is than
| trying to find the next digit. We know a lot more about
| pi than we do about the lowest achievable exponent.
| sdenton4 wrote:
| What's needed are new methods, though. I saw some really
| interesting work ten years ago using group Fourier
| transforms to attack the problem. I think it didn't end
| up working out, but was fundamentally more interesting
| than another extension of the laser approach.
|
| One of the major failure modes of academia is that
| students are generally not very good at picking problems,
| and can end up following their advisors down a blind
| alley. The underlying question here is absolutely
| worthwhile, but spending a year extending laser will have
| no impact. It's like you're trying to dig to the center
| of the earth and someone hands you a shovel: do you take
| it and start digging, or walk away and try to find/invent
| an oil drill?
| Mauricebranagh wrote:
| Faster, larger, more accurate models, off the top of my
| head.
| sdenton4 wrote:
| Laser improvements are galactic algorithms, though, and
| don't give you that at all.
| gugagore wrote:
| I want to clarify that your first sentence likely means
| something like "O notation is insensitive to hardware in
| the way you're suggesting." Not "you can't apply O notation
| to GPUs."
| colanderman wrote:
| Yes correct. Edited to clarify. Another way of phrasing
| it -- O notation is insensitive to the specific
| implementation of a computing model.
| whatshisface wrote:
| On hardware, for fixed-size problems, O notation applies in
| the form of circuit size, which maps, unsurprisingly, to
| actual circuit size.
| anon_tor_12345 wrote:
| r/confidentlyincorrect
|
| complexity is wrt a particular model of computation - a
| Turing machine. Turing machines are sequential (NTMs
| aren't). Run time on a parallel model is not necessarily a
| constant off from run time on a sequential model.
| colanderman wrote:
| Snark aside, GP is asking about GPUs. GPUs are not
| nondeterministic Turing machines.
| anon_tor_12345 wrote:
| Like the other comment alludes to, it's not constant; it's
| a function of block size and the size of the matrices.
|
| http://www.ijcee.org/vol9/949-E1621.pdf
|
| you edited your comment. it said what's the speed up on
| GPUs (which i provided).
|
| >GPUs are not nondeterministic Turing machines
|
| for small problem sizes that's exactly what they are (or
| any multicore CPU, for that matter)
|
| >The other is to imagine that the machine "branches" into
| many copies, each of which follows one of the possible
| transitions. Whereas a DTM has a single "computation
| path" that it follows, an NTM has a "computation tree".
|
| https://en.wikipedia.org/wiki/Nondeterministic_Turing_machin...
| colanderman wrote:
| > for small problem size that's exactly what they are
|
| O notation does not apply to small problems. It is
| strictly concerned with asymptotic behavior.
| anon_tor_12345 wrote:
| you're completely missing the point (again). the point is
| that complexities (even asymptotic) of sequential
| algorithms don't apply to parallelizations of those
| algorithms.
| tsimionescu wrote:
| An NTM is one which can run arbitrarily many branches in
| parallel. So parallel processors are not NTMs, since they
| can only run a fixed number of branches in parallel.
|
| It's true that for small problems they are
| indistinguishable, but in the context of discussing big O
| notation that is irrelevant.
|
| For the purposes of computing asymptotic time complexity,
| whether the algorithm is run on a 1 core system or an M
| core system is usually irrelevant.
| AnotherGoodName wrote:
| Note that GPUs are almost always multiplying 4x4 matrices. Sure,
| they multiply many matrices together, but each is 4x4. The 4^3
| complexity isn't at all an issue (64 steps), and the constant
| factor for some of the n^2.x methods is high.
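To make this concrete, here is a minimal pure-Python sketch (not any library's actual kernel) of the naive cubic algorithm with a multiply counter; at n = 4 the n^3 term is just 64 scalar multiplies, far too few for a subcubic method's overhead to pay off.

```python
# Naive O(n^3) matrix multiply, counting scalar multiplications.
# For 4x4 inputs the count is 4^3 = 64, small enough that the
# asymptotically faster methods can't recoup their overhead.
def matmul_naive(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C, mults = matmul_naive(I4, I4)
print(mults)  # 64 scalar multiplies for n = 4
```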
| taxemeEvasion wrote:
| It's typically still the O(N^3) method that's implemented in
| OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high
| performance implementations of Strassen's Algorithm with a
| fixed level of recursion that start to perform much better as
| the matrix size gets large (see the work from UT Austin's
| BLIS). From my understanding for the more complex algorithms
| the hidden constant simply grows too large to be worth it on
| conventional hardware.
|
| As another poster commented, these are O(N^3) in work, not
| necessarily the wall clock time you see due to parallel speedup
| and cache effects, but will scale as O(N^3) once you're at a
| size past all of these. These implementations are highly tuned
| and optimized for hardware, much much more so than the naive 3
| loop implementation. The predictable and 'easy' access patterns
| make the simple algorithm hard to beat.
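A hedged sketch of the fixed-recursion idea mentioned above (my own toy version, not the BLIS implementation): apply Strassen's 7-multiplication recursion for a fixed number of levels, then fall back to the classical algorithm. It assumes square matrices of power-of-two size; real implementations handle padding, blocking, and hand-tuned kernels.

```python
# Strassen with a fixed recursion depth, crossing over to the
# classical O(n^3) kernel once the level budget is exhausted.
def add(X, Y): return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]
def sub(X, Y): return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def strassen(A, B, levels=1):
    n = len(A)
    if levels == 0 or n == 1:
        return naive(A, B)  # fall back to the classical kernel
    h = n // 2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # 7 half-size products instead of 8
    M1 = strassen(add(A11, A22), add(B11, B22), levels - 1)
    M2 = strassen(add(A21, A22), B11, levels - 1)
    M3 = strassen(A11, sub(B12, B22), levels - 1)
    M4 = strassen(A22, sub(B21, B11), levels - 1)
    M5 = strassen(add(A11, A12), B22, levels - 1)
    M6 = strassen(sub(A21, A11), add(B11, B12), levels - 1)
    M7 = strassen(sub(A12, A22), add(B21, B22), levels - 1)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([C11[i] + C12[i] for i in range(h)] +
            [C21[i] + C22[i] for i in range(h)])
```

Each level trades 8 recursive products for 7 at the cost of 18 extra matrix additions, which is why the crossover only pays off for large matrices.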
| taxemeEvasion wrote:
| I think its also important to mention that these results are
| for the multiplication of two general dense matrices and full
| floating point accuracy. If your matrix is structured
| (sparse, sparse in the Fourier domain, low rank, H, HSS, etc)
| you can usually exploit that structure to break up the
| multiplication and reduce the work complexity dramatically.
| (These rely on building blocks of the dense general matrix
| results)
| kkylin wrote:
| Another issue of both practical and theoretical importance is
| whether these asymptotically-faster algorithms are numerically
| stable. Skimming the article very quickly (and not having
| looked at any original sources), this doesn't seem to be
| addressed (yet).
| oscardssmith wrote:
| At least for Strassen, the result is slightly less stable
| (increases the condition number by a constant factor). I
| think Higham shows that any sub-cubic algorithm must have
| some conditioning issues, but that in practice they're not
| that bad for most matrices.
| jmblpati wrote:
| Strassen is reasonably numerically stable (although not as
| good as the naive algorithm), and everything beyond
| Strassen is extremely unstable. This is due to a trick that
| many of these algorithms employ: you technically only need
| to do multiplication of 0-1 matrices to fully solve the
| matrix multiplication problem in theory, and so having an
| algorithm which computes the entries of the matrix with
| additive error 0.1 is sufficient to exactly solve the
| problem (just round entries to the nearest integer). As you
| can imagine, this means your algorithm can only give O(1)
| bits of precision unless you employ this reduction to the
| 0-1 case first.
|
| To understand why this even happens, let's say we want to
| compute the expression A*B + B*A for matrices A and B. One
| way we can do this is to compute the products A*B and B*A
| naively: two multiplications are needed. A trickier way to
| do this is to introduce a parameter x: we will instead
| compute A*A and (x*A + B/x)*(x*A + B/x) = x^2 A*A + (A*B +
| B*A) + x^-2 B*B. Thus (x*A + B/x)*(x*A + B/x) - x^2 A*A =
| (A*B + B*A) + x^-2 B*B: if x is large enough the second
| term vanishes and we can employ the rounding trick from
| before. Now this still needed two multiplications, but here
| one of our multiplications was A*A. If later we needed to
| compute A*C + C*A in the algorithm, we could then do that
| in only 1 additional matrix multiplication by repeating the
| trick. A more sophisticated version of this algorithm
| underlies all known approaches for matrix multiplication
| with exponent w below roughly 2.8.
| foolinaround wrote:
| quanta magazine has been the best find for me this year!
| JulianChastain wrote:
| The article doesn't specify, and I don't have the prior knowledge
| to understand the paper; Is there a coded implementation of this
| in some language or is it purely a mathematical description of
| how theoretically multiplication can always be done with slightly
| fewer multiplications?
| ovi256 wrote:
| I think the way it'll go is: it'll be implemented in Fortran,
| BLAS and so on until it percolates to numpy / CUDA so mortals
| like us can use it.
| MPSimmons wrote:
| Agree. I don't think we'll notice, but the next gen of CUDA
| will be better because of it.
| amluto wrote:
| These particular algorithms tend to have large constant
| factors, so you need astronomically large matrices before
| they become faster than algorithms with worse exponents but
| better constant factors.
| sdenton4 wrote:
| The last time I paid attention to this stuff, ten years ago,
| the algorithms achieving these theoretical bounds were
| 'galactic': constant terms such that you'd need matrices with
| n in the many billions to start to be competitive with normal
| matmul or Strassen.
| taxemeEvasion wrote:
| Strassen's is sometimes competitive in the thousands!
|
| https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf
|
| You're right though afaik all the other algorithms are still
| 'galactic' and completely impractical
| nonameiguess wrote:
| Crap. I should have refreshed the page. Didn't see you
| already posted what I did.
| nonameiguess wrote:
| Strassen's algorithm is still the fastest to be widely
| implemented. The Coppersmith-Winograd variants that have
| achieved better asymptotic complexity are all galactic
| algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm.
| ableal wrote:
| Is anyone looking into the memory accesses for the operation?
|
| Fifty years ago multiplication was vastly slower than RAM access.
| Nowadays it's practically free compared to even getting data out
| of cache ...
| MauranKilom wrote:
| Nobody is remotely looking at running these algorithms in the
| real world. These are very much _asymptotic_ complexities, and
| to draw benefit from even the 1990 version you'd have to
| literally use galactic input sizes. None of this is about
| making the matrix multiplications on your PC faster.
|
| https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...
| fastball wrote:
| > no, no, no, no, no
|
| Confirmed Strassen thinks the lower bound is n^2.32192809489.
| lsb wrote:
| log2(5)???
| vessenes wrote:
| Man, Quanta magazine is a gift. I love that there is a
| publication with a significantly higher reading level assumed
| than most pop science content, but not written for industry
| insiders, it's just great.
|
| I pitched them a paper magazine a few years ago, because I wish I
| could leave these around for my kids to flip through, but no joy
| yet..
| gspr wrote:
| I agree wholeheartedly! If you don't know of it already, I also
| highly recommend http://nautil.us
| martin_balsam wrote:
| I think Quanta magazine is directly funded by Jim Simons, the
| billionaire mathematician who founded Renaissance Technologies.
| carterschonwald wrote:
| Yup. It's affiliated with the Simons Foundation too, I think?
| beefield wrote:
| I'm trying to get rid of my news addiction (so far failing
| spectacularly...), and one of my plans would be to have a
| couple of sources that do not push a constant stream of crappy
| news, but occasional selected high quality articles. At the
| moment I have Quanta magazine and The Economist Leaders on my
| RSS reader news section. Does anyone have any improvements or
| additions to this list - the niche of the articles can be
| almost anything, as long as it is only a few per week and
| quality is top notch?
| medstrom wrote:
| Some news sources on my Atom reader:
|
| Aeon https://aeon.co
|
| Undark Magazine https://undark.org
|
| The Conversation https://theconversation.com/global
|
| Curie https://www.tidningencurie.se/en/home/ (based in
| Sweden)
| gspr wrote:
| http://nautil.us (don't let their broken https fool you into
| thinking they're not excellent!)
| cambalache wrote:
| > and The Economist Leaders.
|
| That's fine, but that will give you an extremely limited
| vision of the world: the technocratic, slightly
| anarcho-capitalistic, America-knows-best, social-liberal
| world view. If you choose that way, at least go to
| https://www.project-syndicate.org/, which has a little wider
| spectrum (with actual good economists and not the
| stenographer of the day). Too bad the redesign is totally
| crap.
| coliveira wrote:
| You now mentioned something important: older generations have
| left behind journals, magazines, books, all kinds of documents
| that represent their generation. When our generation is gone,
| there will be very little physical traces left. Someone will
| have to search archive.org or similar to find anything.
| mjburgess wrote:
| It's the function of historians to do the archeology and
| summarise for a broader audience.
| goatinaboat wrote:
| _It's the function of historians to do the archeology and
| summarise for a broader audience_
|
| Those summaries will be written in a similarly ephemeral
| format. A book in a library can sit there and be
| rediscovered in a hundred years. A scroll or an engraving
| in a thousand. But we create things that have evaporated in
| a mere decade.
| Robotbeat wrote:
| I've been thinking of backing up a few dozen megabytes of
| my most important files in like QR codes or maybe
| something simpler like dual-parity data matrix with
| instructions in plain text and printed out FORTRAN code.
| Printed on long lived (waterproof?) paper with a
| laserjet. Along with plain text printouts of text and
| pictures.
|
| Even thought of etching that on stainless plates or
| something.
|
| Could be scanned with a smartphone or even decoded by
| hand (include an ASCII lookup table).
| KineticLensman wrote:
| Maybe, but most people won't have the equivalent of boxes in
| the attic containing their own or their family's memorabilia.
| jrumbut wrote:
| We may be a lot like the much older generations that left
| behind only a handful of prized possessions.
|
| I have a wedding photograph printed on non-biodegradable
| material (not intentionally for preservation, it's
| decorative), and perhaps that will eventually be the only
| tangible evidence of me in the same way that wedding
| photographs are all I've seen of some relatives born
| before 1900.
| echelon wrote:
| How many generations are even left before we _don't have
| any more humans?_
|
| I'd wager we're 20 - 30 years out from regenerative
| cloning, and 50 - 75 out from reproductive cloning. We
| could start that process today if we weren't so squeamish.
|
| ML is coming to a head, and we'll potentially have machines
| that outclass us in basic tasks in 25 - 50 years. Within
| 200, I'd say the human experiment is at an end.
|
| What then? Either we evolve along with our machines, we
| become one with the machines, or the machines take over.
|
| We're living in the last stages of the human species right
| now, and not many people think to contemplate that and
| integrate it into their worldview.
|
| Climate, space exploration, etc. is all incredibly human-
| centric. Fighting for resources, defining legacies, ... I
| mean, it's what we as humans do. One day it'll just
| suddenly end, and it's probably only a handful of
| generations until the pieces fall into place.
| zentiggr wrote:
| > ML is coming to a head, and we'll potentially have
| machines that outclass us in basic tasks in 25 - 50
| years.
|
| Just this line makes me skeptical of the rest of your
| thinking as well... what I've seen of ML application
| leaves me strongly convinced that every usage of ML will
| need human supervision and override control. Every case
| of "can't reach a human to get this corrected" is a call
| for boycotting / suing to force a human override channel
| for any automated system.
|
| The "human experiment" is not an experiment, it is the
| fundamental reality, and no automation will ever truly
| replace human thoughts plus emotions plus inherent
| irrational priorities / preferences / biases.
| vidarh wrote:
| Maybe, but one of the things that frustrates me is that I
| _know_ I picked up a lot of things because of the books my
| dad had on the shelves, for example. Of course he showed me
| many of them, but I discovered many more myself because
| they were _there_ in front of me.
|
| My Kindle books etc. are not visible to my son in the same
| way. Spatial, in your face, visibility matters in
| transferring culture.
| ableal wrote:
| But you may be slightly sad that a newer generation,
| growing with shelves full of books right in front of
| their eyes, is never seen picking up and leafing through
| a single one of them. They were into some children's
| books on their way to being teenagers, but now they read
| and watch videos on their phones, and only read on paper
| for school.
|
| Then you get over it. They're whip smart and doing things
| their way, trying to force them into yours probably would
| not work anyway.
|
| Mostly over it.
| rfrey wrote:
| It's not universal. I recently discovered that some weird
| behaviors in my kid were because he was emulating Richard
| Feynman, having discovered Surely You're Joking on the
| basement bookshelf.
| hooande wrote:
| That said, archive.org is accessible to anyone with an
| internet connection. This is good for those who don't have
| access to private libraries or family that have science
| magazine subscriptions.
|
| Future generations will have a different experience than
| those in the past. Theirs will be slightly more fair
| PicassoCTs wrote:
| Did you consider print on demand? Some of them seem to have
| specialized in magazines.
___________________________________________________________________
(page generated 2021-03-23 23:01 UTC)