[HN Gopher] WebAssembly techniques to speed up matrix multiplica...
___________________________________________________________________
WebAssembly techniques to speed up matrix multiplication
Author : brrrrrm
Score : 201 points
Date : 2022-01-25 15:52 UTC (7 hours ago)
(HTM) web link (jott.live)
(TXT) w3m dump (jott.live)
| VHRanger wrote:
| This'll end up inevitably as a WASM BLAS library
|
| Which wouldn't be a bad thing
| LeanderK wrote:
| Why can't we just compile BLAS to WASM? Isn't that the point
| of WASM?
| brandmeyer wrote:
| The BLAS have historically relied on a wide range of
| microarchitecture-specific optimizations to get the most out
| of each processor generation. An ideal solution would be for
| the browser to provide that to the application in such a way
| that it is difficult to fingerprint.
|
| See also the history of Atlas, GotoBLAS, Intel MKL, etc.
| bee_rider wrote:
| libflame/BLIS might be a good starting point, they've
| created a framework where you bring your compute kernels,
| and they make them into a BLAS (plus some other nice
| functionality). I believe most of the framework itself is
| in C, so I guess that could somehow be made to spit out
| wasm (I know nothing about wasm). Then, getting the browser
| to be aware of the actual real assembly kernels might be a
| pain.
| injidup wrote:
| https://www.google.com/search?q=wasm%20blas&ie=utf-8&oe=utf-...
| pcwalton wrote:
| I really like this writeup. Note that it may not be worth using
| the SIMD in this way (horizontal SIMD) if you know you will be
| multiplying many matrices that are the same size. It may be
| better to do vertical SIMD and simply perform the scalar
| algorithm on 4 or 8 matrices at a time, like GPUs would do for
| vertex shaders. This does mean that you may have to interleave
| your matrices in an odd way to optimize memory access, though.
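| A minimal sketch of that vertical approach in plain JS (the
| lane arithmetic stands in for real wasm f32x4 instructions;
| the names and the interleaved layout are illustrative, not
| from the article):

```javascript
// Sketch of "vertical" batching: multiply LANES same-size matrices at once.
// Element (i, j) of matrix k is stored at (i * n + j) * LANES + k, so the
// four values a SIMD implementation would load into one 128-bit register
// sit contiguously. Plain JS loops stand in for the vector ops here.
const LANES = 4;

function batchedMatmul(a, b, c, n) {
  // a, b, c are Float32Array of length n * n * LANES, interleaved as above.
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      const acc = new Float32Array(LANES); // one accumulator per lane
      for (let k = 0; k < n; k++) {
        const ai = (i * n + k) * LANES;
        const bi = (k * n + j) * LANES;
        for (let l = 0; l < LANES; l++) {
          acc[l] += a[ai + l] * b[bi + l]; // would be one f32x4 mul + add
        }
      }
      const ci = (i * n + j) * LANES;
      for (let l = 0; l < LANES; l++) c[ci + l] = acc[l];
    }
  }
}
```

| With this interleaving, each inner-loop step touches four
| contiguous floats, which is exactly what one 128-bit SIMD
| load wants.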
| bertman wrote:
| Very cool writeup!
|
| Unfortunately, the bar graphs at the bottom of the article have
| different y-axis scaling, but even so:
|
| It's sad how Firefox's performance pales in comparison to Chrome's.
| zwieback wrote:
| Very cool!
|
| My question: will this kind of thing become more mainstream? I've
| seen the web emerge, go from static pages to entire apps being
| delivered and executed in the browser. The last bastion of native
| apps and libraries seems to be highly optimized algorithms but
| maybe those will also migrate to a deliver-from-the-web and
| execute in some kind of browser sandbox.
|
| Java promised to deliver some version of native code execution
| but the Java app/applet idea never seemed to take off. In some
| ways it seems superior to what we have now but maybe the security
| concerns we had during that era held Java back too much. Or am I
| misunderstanding what WebAssembly can bring to the game?
| brrrrrm wrote:
| I'm not really equipped to predict anything, but I think the
| recent surge in popularity of simple RISC-y[1] architectures
| like ARM will allow the WebAssembly standard to stay small yet
| efficient. I'm hopeful, but standards often have a way of not
| keeping up with the newest technology, so we'll see.
|
| [1]
| https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...
| pjscott wrote:
| I don't expect ARM to have much effect here. The wasm
| instruction set is basically a low-level compiler
| intermediate representation, and to go from that to machine
| code for x86 or ARM is about equally difficult in both cases.
| MauranKilom wrote:
| Well, here is a related talk:
| https://www.destroyallsoftware.com/talks/the-birth-and-death...
| halpert wrote:
| The web always feels terrible compared to a native app,
| especially on a phone. Some of the difference is due to Safari
| being a bad browser (probably intentional), but a bigger part
| is that the threading model makes it really difficult to have a
| responsive UI. Not to mention the browser's gestures often
| clash with the application's gestures.
| eyelidlessness wrote:
| I generally find Safari's performance better than other
| browsers. Also, every mainstream JS runtime is multithreaded.
| There are limitations on what can be shared between threads,
| but you can optimize a _lot_ despite those limitations
| (including using WASM on the web, and native extensions/FFI
| on Node/Deno).
| Uehreka wrote:
| In my experience, if the thing I'm working on runs in
| Safari, it's buttery smooth. And if it doesn't run, it
| completely shits the bed. Stuff like "we don't support that
| way of doing shadows in SVG, so rather than simply not
| implement that property, we've turned your entire SVG
| element black".
| johncolanduoni wrote:
| Every mainstream JS runtime has some sort of background
| garbage collection and parsing/compilation, but for any
| execution that has to interact with JavaScript heap or
| stack state you're still in single threaded territory.
| SharedArrayBuffer can help if you're willing to give up on
| JavaScript objects entirely, and for WebAssembly this is
| less of a burden, but that's not going to help you perform
| rendering on most web apps concurrently. JSCore goes a
| little bit further and can run javascript in the same
| runtime instance on multiple threads with some significant
| limitations, but this isn't exposed to developers on
| Safari.
| halpert wrote:
| On iOS, the perf of Safari is definitely better than every
| other browser, because every other browser is mandated to
| use WkWebView by Apple. They aren't allowed to implement
| their own engine. Of course Apple isn't subject to the same
| restriction.
| amelius wrote:
| Soon, other browsers can simply run themselves inside
| WASM which then runs inside a WkWebView :)
| jacobolus wrote:
| On a Mac, Safari Javascript generally outperforms Chrome
| and Firefox (other browsing tasks are also generally
| better performing), but there are some workloads where
| Safari turns out slower, especially when the developer
| has put a lot of work into Chrome-specific optimization.
|
| Safari also generally uses a lot less memory and CPU for
| the same websites. Chrome in particular burns through my
| battery very quickly, and is basically completely
| incapable of keeping up with my browser use style (it
| just crashes when I try to open a few hundred tabs).
| Presumably nobody with authority at Google is a heavy
| laptop web-user or prioritizes client-side resource use:
| Google's websites are also among the biggest browser
| resource hogs, even when sitting idle in a background
| tab.
|
| Safari often takes a couple years longer than other
| browsers to implement cutting-edge features. This seems
| to me like a perfectly reasonable design decision; some
| web developers love complaining about it though, and some
| sites that were only developed against the most recent
| versions of Chrome don't work correctly in Safari.
| jsheard wrote:
| WASM threads are available in all modern browsers now,
| including Safari. It's very early days in terms of ecosystem
| but we're steadily getting there.
| halpert wrote:
| The issue isn't so much having additional threads, it's
| needing two main threads. The browser has one main thread
| for accepting user input and then dispatches the relevant
| user events to the JS main thread.
|
| The browser can either dispatch the events asynchronously,
| leading to the events being handled in a noticeably delayed
| way, or the browser can block its main thread until the JS
| dispatch finishes, leading to fewer UI events being
| handled. Either way is an inferior experience.
| danielvaughn wrote:
| Also aren't service workers technically multi-threading?
| That's been a thing for a while in the browser now.
| jsheard wrote:
| Technically yeah, but the threads could only communicate
| through message passing which isn't ideal for
| performance. The more recent major improvement is for
| workers to be able to share a single memory space similar
| to how threading works in native applications.
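| A minimal sketch of that difference (both views live on one
| thread here purely for illustration; in a browser you would
| postMessage the buffer to a Worker, which requires the cross-
| origin isolation headers):

```javascript
// Unlike postMessage, which copies data between workers, a SharedArrayBuffer
// gives multiple threads views onto the *same* memory, with Atomics for
// synchronization. Both views are created on one thread below just to show
// that a write through one view is visible through the other.
const sab = new SharedArrayBuffer(16);
const producerView = new Int32Array(sab); // "main thread" view
const consumerView = new Int32Array(sab); // what a worker would see

Atomics.store(producerView, 0, 42);         // atomic write...
const seen = Atomics.load(consumerView, 0); // ...visible via the other view
console.log(seen); // 42
```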
| arendtio wrote:
| > The web always feels terrible compared to a native app
|
| 'always' is certainly not true. Yes, with modern frameworks
| it is very easy to build websites which are slow. But it is
| also possible to build websites with butter smooth animations
| and instant responses.
|
| I hope that in the future we will get frameworks that make it
| easier to create lightweight web apps, so that we will see
| more high performance apps.
| halpert wrote:
| I made another comment further down, but basically web
| apps can't run as fast as native apps for a variety of
| reasons.
|
| One reason is the thread model. There are two main threads
| that need to be synchronized (browser main thread and JS
| main thread) which will always be slower than a single main
| thread.
|
| Another reason is that layout and measurement of elements
| in HTML is really complicated. Native apps heavily
| encourage deferred measurement which lets the app measure
| and lay itself out once per render pass. In JavaScript,
| layouts may need to happen immediately based on what
| properties of the dom you're reading and setting.
| arendtio wrote:
| I think nobody will argue against the claim that the majority
| of native apps are faster than web apps. But the key point
| isn't which is faster, but what is fast enough.
|
| In general, 60fps is considered sufficient for smooth
| rendering, and even 5 years ago mobile hardware was fast
| enough for 60fps web page rendering. However, many web pages
| are built in ways that prevent browsers from achieving that
| goal.
|
| So yes, it is harder for developers to create a pleasant
| experience and as a result there are more bad apples in
| the web app basket.
| halpert wrote:
| I disagree. Yes, if you have a static webpage and all you
| need to do is scroll, then you can easily get 60 fps,
| notably because the scrolling is handled natively by the
| browser and basically is just a translation on the GPU.
| If the web app accepts user input, especially touch with
| dragging, then the page will not feel native with the
| current batch of browsers for the reasons I mentioned
| above.
| arendtio wrote:
| So what would be needed to prove you wrong?
|
| How about a 240fps video of a 60Hz display, with 2
| implementations
|
| 1. Qt
|
| 2. Web
|
| Both times a finger dragging a slider from point A to
| point B?
| engmgrmgr wrote:
| You don't have to always use the DOM. You can render in
| another thread, or even run compute in another thread and
| use the animation frame system to handle updates.
|
| Having said that, maybe a little less than 10 years ago,
| we achieved the desired performance with touch-screen
| dragging of DOM elements. I don't remember specifics, but
| we didn't use any frameworks.
| nicoburns wrote:
| > Some of the difference is due to Safari being a bad browser
| (probably intentional), but a bigger part is that only having
| one thread makes it really difficult to have a responsive UI
|
| Interestingly Safari is actually generally better than other
| browsers for running a responsive smooth UI. Not sure how
| much of that is the Safari engine and how much of it is
| better CPU on iPhones. But even on a first generation iPad or
| an iPhone 4 it was possible to get 60fps rendering fairly
| easily. The same could not be said for even higher end
| android phones of the time.
| kitsunesoba wrote:
| Anecdotally, the only time I've had issues with
| unresponsiveness for pages in Safari is with sites that
| were written with Chrome specifically in mind.
| themerone wrote:
| So, basically everything.
| kitsunesoba wrote:
| The impact is minimal to nonexistent on a light-to-
| moderate-JS "site" and only really shows up in heavy
| "apps", like YouTube or GDocs.
| halpert wrote:
| Really? Even something simple like Wordle feels janky with
| the way Safari's chrome overlaps the keyboard.
| acdha wrote:
| Do you have some kind of extensions or something like
| text zooming enabled? On a clean install it doesn't
| overlap at all.
| halpert wrote:
| Hmm the layout is working for me now, but the tile
| animation is broken. They flicker instead of smoothly
| flipping over.
| acdha wrote:
| Any chance you have reduced motion enabled? They flip for
| me but are obviously on a short timer.
| javajosh wrote:
| The web solved software distribution. Full stop. There only
| remain the edge cases, and that's where webasm wants to help.
|
| Sun/Java wanted badly to solve this problem, but tried to do
| too much too soon. Java gave devs in 1999 cutting edge OOP
| tools for doing GUIs (e.g. Swing) but distribution was always
| the problem. Installing and running browser plugins was always
| error prone, and it turned out the browser was itself just good
| enough to deliver value, so it won. (With the happy side-effect
| of giving the world-wide community of devs one of the gooeyist
| languages ever to express their fever dreams of what
| application code should be).
|
| The question in my mind is whether there is enough demand for
| the kinds of software webasm enables, especially given that
| other routes of distribution (app stores) have filled in the
| gaps of what the web delivers, and are generally a lot more
| pleasant and predictable to native devs.
| ginko wrote:
| Isn't the idea of webassembly to compile native
| C/C++/Rust/Whatever code to be able to run in the browser?
|
| Why not just compile OpenBLAS or another numerical computing
| library like that to WA?
| brrrrrm wrote:
| that's exactly how TF.js does it:
| https://github.com/google/XNNPACK/blob/master/src/f32-gemm/M...
| remus wrote:
| I'm by no means an expert but my understanding is that a lot of
| the performance from libraries like openBLAS comes from
| targeting specific architectures (e.g. particular instruction
| sets on a series of processors). You can probably milk some
| more performance by targeting the WebAssembly architecture
| specifically (assuming openBLAS hasn't started doing something
| similar itself).
| bee_rider wrote:
| People are asking about BLAS already in various threads, but if
| they know the size of their matrices beforehand, it might be
| interesting to try EIGEN. EIGEN also has the benefit that it is
| an all-template C++ library so I guess it should be somehow
| possible to spit out the WASM (I know nothing about WASM).
|
| Of course usually BLAS beats EIGEN, but for small, known-sized
| matrices, it might have a chance.
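| For a sense of the fixed-size idea: when the dimension is a
| constant, a JIT (or a template library like Eigen, at compile
| time) can fully unroll the loops and keep everything in
| registers. A hypothetical sketch over flat row-major arrays:

```javascript
// Matmul specialized to 4x4: the constant trip counts let a compiler/JIT
// fully unroll all three loops. Flat row-major Float32Arrays assumed.
function matmul4x4(a, b, out) {
  for (let i = 0; i < 4; i++) {
    for (let j = 0; j < 4; j++) {
      let s = 0;
      for (let k = 0; k < 4; k++) s += a[i * 4 + k] * b[k * 4 + j];
      out[i * 4 + j] = s;
    }
  }
}
```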
| MauranKilom wrote:
| Can you specify what "small" constitutes for you?
| LeanderK wrote:
| I was hopeful WebAssembly would speed up the web. What's missing?
| Native Browser APIs? Native IO Apis?
|
| Let's say I want to interactively plot some complicated function
| without slowing down the rest. Can I do this in WebAssembly now?
| acdha wrote:
| > Let's say I want to interactively plot some complicated
| function without slowing down the rest. Can I do this in
| WebAssembly now?
|
| You've been able to do that for a while now and it would likely
| be fast enough even in pure JavaScript. The things which tend
| to slow the web down come back to developers not caring about
| performance (and thus not even measuring) or the cross-purposes
| of things like ad sales where a significant amount of
| JavaScript is deployed to do things the user doesn't care
| about.
| sharikous wrote:
| > Let's say I want to interactively plot some complicated
| function without slowing down the rest. Can I do this in
| WebAssembly now?
|
| Yep, with a Web Worker for the secondary thread. However, the
| environment is still young and it's difficult to use heavy
| computation libraries. Besides, for some reason SIMD
| instructions are present only for 128-bit units (2 doubles or 4
| floats). Another problem is that no matter what you do, it is a
| layer over the hardware, so it will be slower than specialized
| machine code (if what you do is not in the JS API).
| LeanderK wrote:
| > difficult to use heavy computation libraries
|
| what do you mean? What's the blocker? Something like numpy
| for js would fill this role, calling wasm in the background.
| Just the missing SIMD-instructions? Some quick googling shows
| that one can't compile BLAS for wasm yet. This might be due
| to wasm64 not being available yet, I think? So would this
| help to tap into the existing ecosystem of optimised
| mathematical routines?
|
| Ideally...I would leave js and just use python ;) It has the
| whole ecosystem at hands with numpy, scipy, statsmodels etc.
| But nobody is doing it and idk why. I think it might be due
| to Fortran not compiling to wasm.
| sharikous wrote:
| Yes, I forgot about wasm64 still not being available. That's
| a big blocker.
|
| About numpy for js I believe js is still not comfortable
| enough for this kind of use, especially with the lack of
| operator overloading.
|
| Anyway there are some builds of BLAS (or equivalents) to
| wasm and even of Python. Check out Pyodide and Brython.
| [deleted]
| onion2k wrote:
| _Let 's say I want to interactively plot some complicated
| function without slowing down the rest_
|
| If you can draw your plot in a shader then you can do it in
| WebGL very easily. You'd only need to update the input uniforms
| and everything else would happen on the GPU.
| johndough wrote:
| Browsers will always lag behind desktop applications by a
| decade or two because everything is designed by committee and
| takes forever to arrive (see e.g. SSE, AVX, GPGPU compute). And
| even if everyone can eventually agree on and implement the
| smallest common denominator, hardware will already have
| evolved beyond that.
|
| In addition, browsers have to fight all kinds of nefarious
| attackers, so it is a very hostile environment to develop in.
| For example, we can't even measure time accurately (or do
| proper multithreading with shared memory) in the browser thanks
| to the Spectre and Meltdown vulnerabilities.
| https://meltdownattack.com/
|
| That being said, WebGL implements extremely gimped shaders.
| Yet, they are still more than enough to render all kinds of
| functions. For example, see https://www.shadertoy.com/ or
| https://glslsandbox.com/ which are huge collections of
| functions which take screen coordinates and time as input and
| compute pixel colors from that, i.e. f(x, y, t) -> (r, g, b).
| This might not sound very impressive at first glance, but
| people have been amazingly creative within this extremely
| constrained environment, resulting in all kinds of fancy 3D
| renderings.
| wdroz wrote:
| Since wasm supports threads, I wonder if you can speed up these
| operations even further by using multiple threads.
| brrrrrm wrote:
| That's a good point: you certainly could. There's some fun
| exploration to be done with atomic operations.
|
| The issue is that threaded execution requires cross-origin
| isolation, which isn't trivial to integrate. (Example server
| that will serve the required headers: https://github.com/bwasti
| /wasmblr/blob/main/thread_example/s...)
| phkahler wrote:
| Another technique is to transpose the left matrix so each dot
| product is scanned in row-order and hence more cache friendly.
|
| Another one I tried ages ago is to use a single loop counter and
| "de-interleave" the bits to get what would normally be 3 distinct
| loop variables. For this you need to modify the entry in the
| result matrix rather than having it write-only. It has the effect
| of accessing like a z-order curve but in 3 dimensions. It's a bit
| of overhead, but you can also unroll say 8 iterations (2x2x2)
| which helps make up for it. This ends up making good use of both
| caches and even virtual memory if things don't fit in RAM. OTOH
| it tends to prefer sizes that are a power of 2.
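| The transpose technique mentioned above, sketched in plain JS
| over flat row-major arrays (illustrative, not the article's
| code):

```javascript
// Sketch of the transpose trick: with row-major storage, the inner dot
// product walks a's row contiguously but strides down b's column, missing
// cache. Transposing b up front makes both scans sequential.
function matmulTransposed(a, b, n) {
  const bt = new Float32Array(n * n);
  for (let i = 0; i < n; i++)          // one O(n^2) transpose pass...
    for (let j = 0; j < n; j++) bt[j * n + i] = b[i * n + j];

  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let s = 0;
      for (let k = 0; k < n; k++)
        s += a[i * n + k] * bt[j * n + k]; // ...then both reads are unit-stride
      c[i * n + j] = s;
    }
  }
  return c;
}
```

| The O(n^2) cost of the transpose is quickly repaid by the
| O(n^3) inner loop becoming cache-friendly.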
| gfd wrote:
| I really like these set of lecture notes for optimizing matrix
| multiplication: https://ppc.cs.aalto.fi/ch2/v7/ (The transpose
| trick is used in v1)
| progbits wrote:
| This deserves its own submission, wonderful resource!
| eigenvalue wrote:
| I find it surprising that, even after using all those tricks,
| they are still only able to achieve around 50% of the
| theoretical peak performance of the chip in terms of GFLOPS.
| And that's for matrix multiplication, which is a nearly ideal
| case for these techniques.
| dralley wrote:
| The compiler will sometimes do this transpose for you, but as
| with all compiler optimizations it might sometimes break.
| melissalobos wrote:
| That sounds very interesting, is there anywhere I can read more
| about this optimization? I didn't know any compiler could do
| optimizations like that.
| kanaffa12345 wrote:
| there is no way a general purpose compiler will figure this
| out. op is probably talking about something like halide or
| tvm or torchscript jit.
| [deleted]
| cerved wrote:
| You can get extremely creative in optimizing matrix
| multiplication for cache and SIMD.
| jacobolus wrote:
| For contexts like the web, also check out cache-oblivious
| matrix multiplication https://dspace.mit.edu/bitstream/handle
| /1721.1/80568/4355819...
| mynameismon wrote:
| Another very very interesting optimisation can be found in
| these lecture slides [0]. (Scroll to slide 28, although the
| entire slide deck is amazing)
|
| [0]: https://ocw.mit.edu/courses/electrical-engineering-and-
| compu...
| magoghm wrote:
| I tested it on my M1 Mac and it reached 46.78 Gigaflops, which is
| quite amazing for a CPU running at 3.2 GHz. Isn't that like an
| average of 14.6 floating point operations per clock cycle?
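| The arithmetic, worked through (the SIMD breakdown below
| assumes the benchmark's 128-bit, 4-lane f32 operations):

```javascript
// Worked version of the estimate: GFLOP/s divided by GHz gives floating
// point operations per clock cycle.
const gflops = 46.78;  // measured throughput from the comment
const ghz = 3.2;       // M1 performance-core clock
const flopsPerCycle = gflops / ghz;
console.log(flopsPerCycle.toFixed(1)); // 14.6

// With 128-bit SIMD (4 f32 lanes), that is ~3.7 four-wide vector operations
// retiring per cycle -- plausible for a wide core like the M1, which can
// issue several SIMD multiplies/adds each cycle.
const vectorOpsPerCycle = flopsPerCycle / 4;
console.log(vectorOpsPerCycle.toFixed(2)); // 3.65
```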
| lostmsu wrote:
| If you look at the comment above about the regular GEMM
| implementation, M1 actually can do that at 1.6 Teraflops.
| danieldk wrote:
| I hate to post this multiple times, but the M1 has a dedicated
| matrix co-processor, it can do matrix multiplication at
| >1300GFLOP/s if you use the native Accelerate framework (which
| uses the standard BLAS API). The M1 Pro/Max can even do double
| that (>2600 GFLOP/s) [1].
|
| 46.78 GFLOP/s is not even that great on non-specialized
| hardware. E.g., a Ryzen 5900X, can do ~150 GFLOP/s single-
| threaded with MKL.
|
| [1] https://github.com/danieldk/gemm-benchmark#1-to-16-threads
| owlbite wrote:
| How does this compare to the native BLAS in the Accelerate
| library?
| conradludgate wrote:
| Accelerate on the M1 is ridiculously fast (thanks to its
| special core set and specific instructions).
|
| Some benchmarks I've done have it beating CUDA on my RTX
| 2070. I have yet to get a proper GFLOPS number though.
| danieldk wrote:
| It's going to absolutely blow this away. Here are some of my
| single precision GEMM benchmarks for the M1 and M1 Pro:
|
| https://github.com/danieldk/gemm-benchmark#1-to-16-threads
|
| tl;dr, the M1 can do ~1300 GFLOP/s and the M1 Pro up to
| ~2700GFLOP/s.
|
| On the vanilla M1, that's 28 times faster than the best result
| in the post.
|
| The difference (besides years of optimizing linear algebra
| libraries) is that Accelerate uses the AMX matrix
| multiplication co-processors through Apple's proprietary
| instructions.
| [deleted]
| riddleronroof wrote:
| This is very cool
| [deleted]
| wheelerof4te wrote:
| I admire people who can read and understand this.
| Kilenaitor wrote:
| Which parts are you unable to read and understand? I'm sure
| some of us here could help explain if you have specific
| questions or hangups.
| wheelerof4te wrote:
| The math stuff :)
|
| JavaScript code is readable, at least.
| Kilenaitor wrote:
| Is "the math stuff" all the optimizations being performed
| e.g. vectorizing multiplication?
|
| Not trying to sound dismissive here but the core math the
| post is working with is actually a pretty straightforward
| matrix multiplication.
|
| The bulk of the discussion focuses on optimizing the
| execution of that straightforward multiplication algorithm
| [triple-nested for loop; O(n^3)] rather than making
| algorithmic/mathematic optimizations.
|
| And again, specific questions are easier to answer. :)
| djur wrote:
| Matrix multiplication isn't exactly intuitive if you've
| never worked with it before.
| bruce343434 wrote:
| I don't understand the naming and notation of this article
| because the author is assuming context that I don't have.
|
| Section baseline: What are N, M, K? 3 matrices or? Laid out
| as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k
| * N + n];`, ah, apparently a b and c are the matrices? How
| does this work?
|
| Section body: What is the mathy "C' = aC + A.B"? Derivative of
| a constant is the angle times the constant plus the dot
| product of A and B???
| conradludgate wrote:
| There are 3 matrices in question: A, B and C. They have
| dimensions (M * K), (K * N) and (M * N) respectively.
|
| They are laid out, rather than as nested arrays, as a single
| contiguous block of bytes that can be interpreted as having a
| matrix shape. That's where the `m * N + n` comes from (m rows
| down and n cols in).
|
| C' = alpha * C + A.B
|
| This is the 'generalised matrix-matrix multiplication'
| (GEMM) operation. It's multiplying the matrices A and B,
| adding the result to a scaled version of C, and storing it
| back into C. Setting alpha to 0 gets you basic matmul.
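| Putting that description into runnable form, a sketch of the
| GEMM loop over flat row-major arrays (following the comment's
| naming, not the article's exact code):

```javascript
// GEMM as described above: C' = alpha * C + A.B, with all matrices stored
// as flat row-major arrays. A is M x K, B is K x N, C is M x N.
function gemm(alpha, a, b, c, M, N, K) {
  for (let m = 0; m < M; m++) {
    for (let n = 0; n < N; n++) {
      let acc = alpha * c[m * N + n];       // scale the existing entry...
      for (let k = 0; k < K; k++) {
        acc += a[m * K + k] * b[k * N + n]; // ...then accumulate the dot product
      }
      c[m * N + n] = acc;                   // alpha = 0 gives plain matmul
    }
  }
}
```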
| wheelerof4te wrote:
| Thank you for this detailed explanation. Making the
| matrix one-dimensional makes sense from the performance
| standpoint.
| [deleted]
| marginalia_nu wrote:
| It's kind of bizarre how it's an accomplishment to get your code
| closer to what the hardware is capable of. In a sane world, that
| should be the default environment you're working in. Anything
| else is wasteful.
| Kilenaitor wrote:
| There's always been a tradeoff in writing code between
| developer experience and taking full advantage of what the
| hardware is capable of. That "waste" in execution efficiency is
| often worth it for the sake of representing helpful
| abstractions and generally helping developer productivity.
|
| The real win here is when we can have both because of smart
| toolchains that can transform those high-level constructs and
| representations into the most efficient implementation for the
| hardware.
|
| Posts like this demonstrate what's possible with the right
| optimizations so tools like compilers and assemblers are able
| to take advantage of these when given the high-level code. That
| way we can achieve what you're hoping for: the default being
| optimal implementations.
| AnIdiotOnTheNet wrote:
| > That "waste" in execution efficiency is often worth it for
| the sake of representing helpful abstractions and generally
| helping developer productivity.
|
| That's arguable at best. I for one am sick of 'developer
| productivity' being the excuse for why my goddamned
| supercomputer crawls when performing tasks that were trivial
| even on hardware 15 years older.
|
| > The real win here is when we can have both because of smart
| toolchains that can transform those high-level constructs and
| representations into the most efficient implementation for
| the hardware.
|
| That's been the promise for a long time and it still hasn't
| been realized. If anything things seem to be getting less and
| less optimal.
| adamc wrote:
| No, it's really not even arguable. Lots and lots of
| software is written in business contexts where the cost of
| developing reliable code is a lot more important than its
| performance. Not everything is a commercial product aimed
| at a wide audience.
|
| What you're "sick of" is mostly irrelevant unless you
| represent a market that is willing to pay more for a more
| efficient product. I use commercial apps every day that
| clearly could work a lot better than they do. But... would
| I pay a lot for that? No. They are too small a factor in my
| workday.
|
| Saving money is part of engineering too.
| bruce343434 wrote:
| People have been sick of slow programs and slow computers
| since literally forever. I think you live in a bubble or
| are complacent.
|
| No one I know has anything good to say about microsoft
| teams, for instance. And that's just one of the recent
| "dekstop apps" which are actually framed browsers.
| lijogdfljk wrote:
| > when performing tasks that were trivial even on hardware
| 15 years older.
|
| Did the software to perform those tasks stop working?
| nicoburns wrote:
| > I for one am sick of 'developer productivity' being the
| excuse for why my goddamned supercomputer crawls when
| performing tasks that were trivial even on hardware 15
| years older.
|
| The problem here is developer salaries. So long as
| developers are as expensive as they are the incentive will
| be to optimise for developer productivity over runtime
| efficiency.
| dogleash wrote:
| If developers cost one fifth of what they do now, how
| many projects that let performance languish today would
| staff up to the extent that doing a perf pass would make
| its way to the top of the backlog queue?
|
| Come on now. Let's be honest here. The answer for >90% of
| projects is either a faster pace on new features, or to
| pocket the payroll savings. They'd never prioritize
| something that they've already determined can be ignored.
| Kilenaitor wrote:
| We've been making developer experience optimizations
| _long_ before they started demanding high salaries. The
| whole reason to go from assembly to C was to improve
| developer experience and efficiency.
|
| It seems fairly reductive to dismiss the legitimate
| advantages of increased productivity. It's faster to
| iterate on ideas and products, we gain back time to focus
| on more complex concepts, and, more broadly, we further
| open up this field to more and more people. And those
| folks can then go on to invest in these kind of
| performance improvements.
| AnIdiotOnTheNet wrote:
| > It seems fairly reductive to dismiss the legitimate
| advantages of increased productivity.
|
| Certainly there are some, but I think we passed the point
| of diminishing returns long long ago and we're now well
| into the territory of regression. I would argue that we
| are actually experiencing negative productivity increases
| from a lot of the abstractions we employ, because we've
| built up giant abstraction stacks where each new
| abstraction has new leaks to deal with and everything is
| much more complicated than it needs to be because of it.
| nicoburns wrote:
| Hmm... I think our standards for application
| functionality are also a lot higher. For example, how
| many applications from the 90s dealt flawlessly with
| unicode text.
| AnIdiotOnTheNet wrote:
| How much added slowness do you think Unicode is
| responsible for? Because as much of a complex nightmarish
| standard as it is[0], there are plenty of applications
| that are fast that handle it just fine as far as I can
| tell. They're built with native widgets and written in
| (probably) C.
|
| [0] plenty of slow as fuck modern software doesn't handle
| it even close to 'flawlessly'
| danieldk wrote:
| _There 's always been a tradeoff in writing code between
| developer experience and taking full advantage of what the
| hardware is capable of. That "waste" in execution efficiency
| is often worth it for the sake of representing helpful
| abstractions and generally helping developer productivity._
|
| The GFLOP/s is 1/28th of what you'd get when using the native
| Accelerate framework on M1 Macs [1]. I am all in for powerful
| abstractions, but not using native APIs for this (even if
| it's just the browser calling Accelerate in some way) is just
| a huge waste of everyone's CPU cycles and electricity.
|
| [1] https://github.com/danieldk/gemm-
| benchmark#1-to-16-threads
| Salgat wrote:
| Once you realize that it's a completely sandboxed environment
| that works on any platform, it's a lot more impressive.
| dr_zoidberg wrote:
| Wasteful of computing resources, yes, but for a long time we've
| been prioritizing developer time. That happens because you can
| get faster hardware cheaper than you can get more developer
| time (and not all developers' time is equal; say, Carmack can do
| in a few hours things I couldn't do in months).
|
| I do agree that we'd get fantastic performance out of our
| systems if we had the important layers optimized like this (or
| more), but it seems few (if any) have been pushing in that
| direction.
| terafo wrote:
| But you can't get faster hardware cheaper anymore. Not
| naively faster hardware anyways. You are getting more and
| more optimization opportunities nowadays though. Vectorize
| your code, offload some work to the GPU or one of the
| countless other accelerators present on a modern SoC,
| change your I/O stack so you can utilize SSDs efficiently,
| etc. I think it's a matter of time until someone puts an FPGA
| onto a mainstream SoC, and the gap between efficient and
| mainstream software will only widen from that point.
| dr_zoidberg wrote:
| You are precisely telling me the ways in which I can get
| faster hardware: GPU, accelerators, the I/O stack and SSDs,
| etc.
|
| I agree that the software layer has become slow, crufty,
| bloated, etc. But it's still cheaper to get faster hardware
| (or wait a bit for it, see M1, Alder Lake, Zen 3, to name a
| few, and those are getting successors later on this year)
| than to get a good programmer to optimize your code.
|
| And I know that we'd get much better performance out of
| current (and probably future) hardware if we had more
| optimized software, but it's rare to see companies and
| projects take on such optimization efforts.
| terafo wrote:
| But you can't get all these things in the browser. You
| don't just increase CPU frequency and get free
| performance anymore. You need conscious effort to use GPU
| computing, conscious effort to ditch current I/O stack
| for io_uring. Modern hardware gives performance to those
| who are willing to fight for it. The disparity between the
| naive approach and the optimized one grows every year.
| peterhunt wrote:
| The real issue here is that the hardware isn't capable of
| sandboxing without introducing tons of side channel attacks.
| Lots of applications are willing to sacrifice a lot of
| performance in order to gain the distribution advantages from a
| safe, sandboxed execution environment.
| not2b wrote:
| In a sane world (which is the world that we live in), it's best
| to find a well-optimized library for common operations like
| matrix multiplication. But if you want to do something unusual
| (multiply large matrices inside a browser, quickly) you've
| exited the sane world so you'll have to work at it.
| ska wrote:
| > Anything else is wasteful.
|
| Everything has a cost. If the developer is a slave to machine
| architecture, development is slow and error prone. If the
| machine is a slave to an abstraction, everything will run
| slowly. Unsurprisingly, the real trick is finding appropriate
| balance for your situation.
|
| Of course you can make things worse, in both directions.
| Zababa wrote:
| On the other hand, in your sane world, productivity would be a
| fraction of what it currently is, for developers and users. You
| favor computer time over developer time. While computer time
| can be a proxy for user time, it isn't always one, since
| developer time can be used to speed up user time too. A
| single-minded focus on computer time sounds like a case of
| throwing out metrics like
| developer time and user time because they are harder to measure
| than computer time. In any case, it sounds like a mistake to
| me.
| bruce343434 wrote:
| I don't understand the naming and notation of this article
| because the author is assuming context that I don't have.
|
| Section baseline: What are N, M, K? 3 matrices or? Laid out as a
| flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N +
| n];`, ah, apparently a b and c are the matrices? How does this
| work?
|
| Section body: What is the mathy "C' = αC + A·B"? Derivative of
| a constant is the angle times the constant plus the dot product
| of A and B???
|
| Please, if you write a public blog post, use your head. Not
| everybody will understand your terse notes.
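For readers in the same position: in the article's baseline, M, N, and K are the matrix dimensions, and a, b, and c are the matrices stored as flat row-major arrays. A Python sketch of the quoted loop under that reading (an interpretation, not the article's own code):

```python
def matmul_baseline(a, b, M, N, K):
    # a is an M x K matrix, b is K x N, and the result c is M x N.
    # All three are flat row-major arrays: element (i, j) of an
    # R x C matrix is stored at index i * C + j.
    c = [0.0] * (M * N)
    for m in range(M):
        for n in range(N):
            for k in range(K):
                c[m * N + n] += a[m * K + k] * b[k * N + n]
    return c

# Multiplying a 2 x 2 matrix by the 2 x 2 identity leaves it unchanged:
print(matmul_baseline([1, 2, 3, 4], [1, 0, 0, 1], 2, 2, 2))
# [1.0, 2.0, 3.0, 4.0]
```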
| engmgrmgr wrote:
| Not to be too snarky, but perhaps the onus is on you to do some
| homework if you want to understand a niche article for which
| you lack context?
|
| Laying out matrices like that is pretty standard, especially
| for a post about vectorization.
| ausbah wrote:
| shouldn't compilers handle stuff like this?
| brrrrrm wrote:
| In an ideal world, absolutely! It's a hard problem and there
| are many attempts to make that happen automatically including
| polyhedral optimization (Polly[1]) and tensor compiler
| libraries (XLA[2] and TVM[3]). I work on a project called
| LoopTool[4] which is researching ways to dramatically reduce
| the representations of the other projects to simplify
| optimization scope.
|
| [1] https://polly.llvm.org
|
| [2] https://www.tensorflow.org/xla
|
| [3] https://tvm.apache.org
|
| [4] https://github.com/facebookresearch/loop_tool
| visarga wrote:
| If they worked so well, AMD would not be in such a bad position
| with their GPUs in ML. They would just need to compile to their
| arch.
___________________________________________________________________
(page generated 2022-01-25 23:00 UTC)