[HN Gopher] Faster TypedArrays: Vector Addition in WebAssembly
___________________________________________________________________
Faster TypedArrays: Vector Addition in WebAssembly
Author : brrrrrm
Score : 72 points
Date : 2022-01-02 16:43 UTC (6 hours ago)
(HTM) web link (jott.live)
(TXT) w3m dump (jott.live)
| lmeyerov wrote:
| Cool to see this stuff getting out into the wild, long time
| coming!
|
| What is the current state of these -- do SIMD wasm ops run in
| Firefox/Chrome/Edge/Safari, and does the data have to be slowly
| bulk copied back and forth from a worker (the post-Spectre
| removal of zero-copy ownership transfers)?
|
| We love JS typed arrays and columnar analytics -- Graphistry
| contributed the first years of Apache Arrow JS -- but
| intentionally didn't do wasm for kernels because of this kind of
| stuff, despite promising internal multicore etc. prototypes. A
| surprising win of offloading GPU Python/JS work to the server
| has been not just scale but perf reliability. Curious how much
| it has improved in the typical case, as it always made sense on
| paper!
| brrrrrm wrote:
| Yep, you still need to bulk copy memory around if it already
| exists. If you own the memory, though, you can avoid copying (a
| wasm module can import or export its linear memory), but you'll
| need to manage that memory manually.
|
| The cleanest approach I've found is to have modules allocate
| space for inputs and outputs and then get callers to write
| directly into the input space.
|
| Either way, nothing available is super friendly for
| user-provided arrays or canvas interactions.
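|
| A minimal sketch of that layout (hypothetical names throughout;
| assumes a compiled module `wasmBytes` exporting its memory, an
| `alloc`, and an `add(aPtr, bPtr, cPtr, len)` kernel):
|
|     const { instance } = await WebAssembly.instantiate(wasmBytes);
|     const { memory, alloc, add } = instance.exports;
|
|     const n = 1024;
|     // Reserve space inside wasm linear memory for inputs/output.
|     const aPtr = alloc(n * 4);
|     const bPtr = alloc(n * 4);
|     const cPtr = alloc(n * 4);
|
|     // Views directly into wasm memory: producers write inputs
|     // here, so no bulk copy is needed before calling the kernel.
|     const a = new Float32Array(memory.buffer, aPtr, n);
|     const b = new Float32Array(memory.buffer, bPtr, n);
|     const c = new Float32Array(memory.buffer, cPtr, n);
|
|     a.fill(1); b.fill(2);
|     add(aPtr, bPtr, cPtr, n);
|     // c now reads 3s. Note: views go stale if memory.grow() runs.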
| IvanK_net wrote:
| BTW, you can unroll loops in pure JavaScript too, and it also
| makes the code several times faster.
|
| 4x unroll: https://jsfiddle.net/49j7htdz/1/
|
| 8x unroll: https://jsfiddle.net/49j7htdz/2/
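|
| For reference, a 4x unroll of the add loop looks roughly like
| this (a sketch, not the exact fiddle code):
|
|     // Plain loop:
|     //   for (let i = 0; i < n; i++) c[i] = a[i] + b[i];
|     // 4x unrolled: four elements per iteration, so far fewer
|     // loop-condition checks per element.
|     function add4(a, b, c, n) {
|       let i = 0;
|       const limit = n - (n % 4);
|       for (; i < limit; i += 4) {
|         c[i]     = a[i]     + b[i];
|         c[i + 1] = a[i + 1] + b[i + 1];
|         c[i + 2] = a[i + 2] + b[i + 2];
|         c[i + 3] = a[i + 3] + b[i + 3];
|       }
|       for (; i < n; i++) c[i] = a[i] + b[i]; // remainder tail
|     }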
| brrrrrm wrote:
| It's definitely a bit faster with the unroll, but on my machine
| not by much. I've added that idea to the interactive benchmark
| if you'd like to check it out!
| IvanK_net wrote:
| You probably implemented it the wrong way. On my machine, the
| JSFiddle version is 3.5x faster, while in your benchmark it is
| only 1.4x faster.
| [deleted]
| Matheus28 wrote:
| If we're not counting the time to zero out the array, it seems
| that typed arrays are slower than plain JavaScript because every
| access converts between float32 and double. Try the same with
| Float64Array. My microbenchmark says f32 is 15% slower than f64
| on Chrome: https://jsbench.me/gnkxxkjag8/1
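|
| The conversion cost is easy to see with a sketch like this (a
| rough microbenchmark; numbers vary by engine and machine):
|
|     // Float32Array: each load is widened to a double for the
|     // add, then narrowed back to float32 on store.
|     const n = 1 << 16;
|     const a32 = new Float32Array(n).fill(1);
|     const c32 = new Float32Array(n);
|     // Float64Array: the arithmetic is already double-precision,
|     // so no conversion happens on load or store.
|     const a64 = new Float64Array(n).fill(1);
|     const c64 = new Float64Array(n);
|
|     function bench(label, a, c) {
|       const t = performance.now();
|       for (let r = 0; r < 1000; r++)
|         for (let i = 0; i < n; i++) c[i] = a[i] + a[i];
|       console.log(label, (performance.now() - t).toFixed(1), "ms");
|     }
|     bench("f32", a32, c32);
|     bench("f64", a64, c64);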
| brrrrrm wrote:
| Great catch! I assumed the JIT would identify pure-f32
| arithmetic and use it, but I guess that wouldn't be numerically
| valid. I wonder if there's a way to use Math.fround[1] on your
| benchmark to get the expected speedups?
|
| [1] https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe...
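|
| For reference, the usual trick is to round every intermediate
| through Math.fround; the result is then bit-identical to
| single-precision arithmetic, which frees the JIT to emit f32
| instructions. A sketch (whether a given engine exploits it
| varies):
|
|     function addF32(a, b, c, n) {
|       for (let i = 0; i < n; i++) {
|         // Math.fround rounds to the nearest float32, so a
|         // double add followed by fround is bit-identical to a
|         // single f32 add -- the JIT may use one directly.
|         c[i] = Math.fround(a[i] + b[i]);
|       }
|     }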
| olliej wrote:
| Yup, the spec requires the computation to be done as doubles,
| and since the end result is observable, it has to actually be
| done that way.
|
| I didn't know about fround, but I suspect its primary use case
| is trying to catch float32 overflow during double arithmetic,
| because again the precision difference is observable.
| greggman3 wrote:
| And as usual for me, JavaScriptCore blows away V8 in most
| microbenchmarks I've run.
|
| Same machine, JSC (Safari) is 3x faster than Chrome:
| https://jsbenchit.org/?src=c792550e65de1d038b1f24b446c74592
| visarga wrote:
| It would have been nice to see CPU and GPU benchmark scores
| alongside.
|
| I get 1.5 million iterations in numpy on the CPU, which is on
| par with typed arrays and much slower than wasmblr.
| brrrrrm wrote:
| I believe numpy is bound by Python's interpreter (which takes
| about 1us per dispatch to a C function, from what I recall
| while working on PyTorch). The number you get is definitely
| what I'd expect.
|
| If you use larger arrays (and the "out" variant,
| `numpy.add(a, b, out=c)`), you might get similar total
| throughput, at least:
|
|     import numpy as np
|     import time
|
|     N = 1024 * 128
|     # float32 to match the Float32Array benchmark (4 bytes/elem)
|     A = np.random.randn(N).astype(np.float32)
|     B = np.random.randn(N).astype(np.float32)
|     C = np.random.randn(N).astype(np.float32)
|
|     # warmup
|     for _ in range(1000):
|         np.add(A, B, out=C)
|
|     iters = 1000
|     t = time.time()
|     for _ in range(iters):
|         np.add(A, B, out=C)
|     d = time.time() - t
|     # 3 arrays touched per add (2 reads + 1 write), 4 bytes each
|     print(f"{iters * N * 3 * 4 / d / 1e9:.2f} GB/s")
| formerly_proven wrote:
| (~13 GFLOPS over a ~13 kB working set)
| still_grokking wrote:
| Oh, someone measuring cache bandwidth?
| brrrrrm wrote:
| Mostly L1 data cache, yeah (~50% of peak BW), which is a good
| sign for an interpreted ISA.
| olliej wrote:
| Any WASM that runs for more than a few milliseconds is
| going to be compiled to native code. I do wonder just where
| the remaining 50% of bandwidth is going. IIRC bounds
| checking is only around a 10% penalty in most studies. I
| guess there are other checks needed in JS (stripping
| signaling NaNs, etc., but I don't think WASM requires that).
| twoodfin wrote:
| I'd be interested to see if the manual loop unrolling is
| necessary, or if the WASM toolchain would perform that
| optimization automatically if `len` were a known constant.
| brrrrrm wrote:
| That's a good question, and I think there's some data in the
| benchmark dump to help answer it. The tuning logic for the
| results labeled `wasmblr (tuned X)` sweeps through a couple of
| values for X (the unroll size) and shows the best one. On some
| browsers (and in node.js) this value goes to 1, which means
| unrolling wasn't found to be necessary at that size.
|
| Mostly, though, from what I've seen it tunes to a value greater
| than 1, so I think manual unrolling is generally still
| necessary today.
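|
| A sketch of that kind of tuning sweep (assuming a hypothetical
| `buildAdd(unroll)` that compiles a wasm add kernel with a given
| unroll factor):
|
|     // Time each candidate unroll factor and keep the fastest
|     // one for this particular browser/machine.
|     function tune(buildAdd, a, b, c, n) {
|       let best = { unroll: 1, ms: Infinity };
|       for (const unroll of [1, 2, 4, 8, 16]) {
|         const add = buildAdd(unroll);
|         const t = performance.now();
|         for (let r = 0; r < 1000; r++) add(a, b, c, n);
|         const ms = performance.now() - t;
|         if (ms < best.ms) best = { unroll, ms };
|       }
|       return best; // e.g. { unroll: 8, ms: 12.3 }
|     }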
___________________________________________________________________
(page generated 2022-01-02 23:00 UTC)