[HN Gopher] JavaScript Benchmarking Is a Mess
___________________________________________________________________
JavaScript Benchmarking Is a Mess
Author : joseneca
Score : 104 points
Date : 2024-12-24 12:37 UTC (10 hours ago)
(HTM) web link (byteofdev.com)
(TXT) w3m dump (byteofdev.com)
| diggan wrote:
| > Essentially, these differences just mean you should benchmark
| across all engines that you expect to run your code to ensure
| code that is fast in one isn't slow in another.
|
| In short, the JavaScript backend people now need to do what we
| JavaScript frontend people have been doing since SPAs became a
| thing: run benchmarks across multiple engines instead of just one.
| sylware wrote:
| If you use javascript, use a lean engine coded in a lean SDK,
| certainly not the c++ abominations in Big Tech web engines.
|
| Look at quickjs, and use your own very lean OS interfaces.
| joseneca wrote:
| QuickJS is great for cases where you are more limited by
| startup and executable size than anything else but it tends to
| perform quite terribly (https://bellard.org/quickjs/bench.html)
| compared to V8 and anything else with JIT compilation.
|
| More code does not inherently mean worse performance.
| sylware wrote:
| But its SDK is not the c++ abominations like v8, and that
| alone is enough to choose quickjs or similar, since we all
| know here that c++ (and similar, namely rust/java/etc) is a
| definitive nono (when it is not forced down our throats as
| users, don't forget to thank the guys doing that, usually
| well hidden behind the internet...).
|
| For performance, don't use javascript anyway...
|
| That said, a much less bad middle ground would be to have
| performance-critical blocks written in assembly (RISC-V Now!)
| orchestrated by javascript.
| ramon156 wrote:
| I could be reading your comment wrong, but what do you mean
| by "c++ is a definitive nono"? Also, how is a complicated
| repository enough of a reason to choose quickjs?
| sylware wrote:
| quickjs or similar. Namely, small and depending on a
| reasonable SDK (which does include the computer
| language).
| surajrmal wrote:
| I'm not sure I understand the point of mentioning the
| language. If it was written in a language like C but still
| "shoved down your throat", would you still have qualms with
| it? Do you just dislike things written by corporate
| entities, even though those languages tend to be popular
| because they scale well to larger teams? Or do you dislike
| software that is too large to understand and optimized for
| the needs of larger teams? Because it doesn't matter whether
| it's one piece of software or some grouping of distinct
| software: at some point it becomes challenging to understand
| the full set of software.
|
| If I were to create an analogy, it feels like you're
| complaining about civil engineers who design skyscrapers to
| be built out of steel and concrete instead of wood and
| brick like we use for houses. Sure the former is not really
| maintainable by a single person but it's also built for
| higher capacity occupancy and teams of folks to maintain.
| sylware wrote:
| The "need of larger teams" does not justify to delegate
| the core of the technical interfaces to a grotesquely and
| absurdely complex and gigantic computer language with its
| compiler (probably very few real life ones).
|
| This is accute lack of perspective, border-line fraud.
| dan-robertson wrote:
| Re VM warmup, see
| https://tratt.net/laurie/blog/2022/more_evidence_for_problem...
| and the linked earlier research for some interesting discussion.
| Roughly, there is a belief when benchmarking that one can work
| around not having the most-optimised JIT-compiled version by
| running your benchmark a number of times and then throwing away
| the result before doing 'real' runs. But it turns out that:
|
| (a) sometimes the jit doesn't run
|
| (b) sometimes it makes performance worse
|
| (c) sometimes you don't even get to a steady state with
| performance
|
| (d) and obviously in the real world you may not end up with the
| same jitted version that you get in your benchmarks
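|
| The pattern in question looks roughly like this (a sketch,
| with a hypothetical work() function; (a)-(d) above are
| reasons it can still mislead):
|
|     function bench(work, warmupIters = 10000, iters = 100000) {
|       // run "warmup" iterations hoping the JIT compiles work()
|       for (let i = 0; i < warmupIters; i++) work();
|       // then measure, assuming a steady state was reached
|       const start = performance.now();
|       for (let i = 0; i < iters; i++) work();
|       return (performance.now() - start) / iters; // ms per call
|     }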
| vitus wrote:
| In general, this isn't even a JS problem, or a JIT problem. You
| have similar issues even in a lower-level language like C++:
| branch prediction, cache warming, heck, even power state
| transitions if you're using AVX512 instructions on an older
| CPU. Stop-the-world GC causing pauses? Variation in memory
| management exists in C, too -- malloc and free are not fixed-
| cost, especially under high churn.
|
| Benchmarks can be a useful tool, but they should not be
| mistaken for real-world performance.
| dan-robertson wrote:
| I think that sort of thing is a bit different because you can
| have more control over it. If you're serious about
| benchmarking or running particularly performance sensitive
| code, you'll be able to get reasonably consistent benchmark
| results run-to-run, and you'll have a big checklist of things
| like hugepages, pgo, tickless mode, writing code a specific
| way, and so on to get good and consistent performance. I
| think you end up being less at the mercy of choices made by
| the VM.
| marcosdumay wrote:
| The amount of control you have varies along a continuum from
| hand-written assembly to SQL queries. But there isn't really
| a difference of kind here, it's just a continuum.
|
| If there's anything unique about Javascript, it's that it has
| an unusually high ratio of "unpredictability" to "abstraction
| level". But again, it has pretty normal values for both of
| those; it's just that the relation between them is away from
| the norm.
| dan-robertson wrote:
| I'm quite confused by your comment. I think this
| subthread is about reasons one might see variance on
| modern hardware in your 'most control' end of the
| continuum.
|
| The start of the thread was about some ways where VM
| warmup (and benchmarking) may behave differently from how
| many reasonably experienced people expect.
|
| I claim it's reasonably different because one cannot tame
| parts of the VM warmup issues whereas one can tame many
| of the sources of variability one sees in high-
| performance systems outside of VMs, eg by cutting out the
| OS/scheduler, by disabling power saving and properly
| cooling the chip, by using CAT to limit interference with
| the L3 cache, and so on.
| marcosdumay wrote:
| > eg by cutting out the OS/scheduler, by disabling power
| saving and properly cooling the chip, by using CAT to
| limit interference with the L3 cache, and so on
|
| That's not very different from targeting a single JS
| interpreter.
|
| In fact, you get much more predictability by targeting a
| single JS interpreter than by trying to guess your
| hardware limitations... Except, of course if you target a
| single hardware specification.
| hinkley wrote:
| When we were upgrading to ES6 I was surprised/relieved to
| find that moving some leaf node code in the call graph to
| classes from prototypes did help. The common wisdom at
| the time was that classes were still relatively
| expensive. But they force strict, which we were using
| inconsistently, and they flatten the object
| representation (I discovered these issues in the heap
| dump, rather than the flame graph). Reducing memory
| pressure can overcome the cost of otherwise suboptimal
| code.
|
| OpenTelemetry not having been invented yet, someone
| implemented server-side HAR reports, and the data collection
| for that was a substantial bottleneck, particularly in the
| original implementation.
| hinkley wrote:
| Running benchmarks on the old Intel MacBook a previous job
| gave me was like pulling teeth. Thermal throttling all the
| time. Anything less than at least a 2x speed up was just
| noise and I'd have to push my changes to CI/CD to test, which
| is how our build process sprouted a benchmark pass. And a
| grafana dashboard showing the trend lines over time.
|
| My wheelhouse is making lots of 4-15% improvements and
| laptops are no good for those.
| hyperpape wrote:
| I think Tratt's work is great, but most of the effects that
| article highlights seem small enough that I think they're most
| relevant to VM implementors measuring their own internal
| optimizations.
|
| Iirc, the effects on long running benchmarks in that paper are
| usually < 1%, which is a big deal for runtime optimizations,
| but typically dwarfed by the differences between two methods
| you might measure.
| hinkley wrote:
| Cross-cutting concerns run into these sorts of problems. And
| they can sneak up on you: as you add these calls to your
| coding conventions, they get added incrementally in new code
| and in substantial edits to old code, so what added a few
| tenths of a ms at the beginning may be tens of milliseconds a
| few years later. Someone put me on a trail of this sort last
| year and I managed to find about 75 ms of improvement (and
| another 50 ms in stupid mistakes adjacent to the search).
|
| And since I didn't eliminate the logic, just halved its cost,
| that means we were spending about twice that much. But I did
| lower the slope of the regression line quite a lot, I believe
| enough that new Node.js versions now improve response time
| faster than it organically decays. There were times it took
| EC2 instance-type updates to see forward progress.
| hyperpape wrote:
| I think you might be responding to a different point than
| the one I made. The 1% I'm referring to is a 1% variation
| between subsequent runs of the same code. This is a
| measurement error, and it inhibits your ability to
| accurately compare the performance of two pieces of code
| that differ by a very small amount.
|
| Now, it reads like you think I'm saying you shouldn't care
| about a method if it's only 1% of your runtime. I
| definitely don't believe that. Sure, start with the big
| pieces, but once you've reached the point of diminishing
| returns, you're often left optimizing methods that
| individually are very small.
|
| It sounds like you're describing a case where some method
| starts off taking < 1% of the runtime of your overall
| program, and grows over time. It's true that if you do a
| full-program benchmark, you might be unable to detect the
| difference between that method and an alternate
| implementation (and even that's not guaranteed; you can often
| use statistics to overcome the variance in the runtime).
|
| However, you often still will be able to use micro-
| benchmarks to measure the difference between implementation
| A and implementation B, because odds are they differ not by
| 1% in their own performance, but 10% or 50% or something.
|
| That's why I say that Tratt's work is great, but I think
| the variance it describes is a modest obstacle to most
| application developers, even if they're very performance
| minded.
| __alexs wrote:
| Benchmarking that isn't based on communicating the distribution
| of execution times is fundamentally wrong on almost any
| platform.
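|
| For example, even something this small says more than a
| single mean (a sketch; samples is assumed to be an array of
| per-run timings in ms):
|
|     function summarize(samples) {
|       const s = samples.slice().sort((a, b) => a - b);
|       const q = (p) =>
|         s[Math.min(s.length - 1, Math.floor(p * s.length))];
|       return { min: s[0], p50: q(0.5), p90: q(0.9),
|                p99: q(0.99), max: s[s.length - 1] };
|     }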
| blacklion wrote:
| Very strange take on "JIT introduces a lot of error into the
| results". I'm from the JVM/Java world, which is a JITted VM
| too, and in our world the question is: why would you want to
| benchmark interpreted code at all!?
|
| Only final-stage, fully JITted and profile-optimized code is
| what matters.
|
| Short-lived interpreted / level-1 JITted code is not
| interesting at all from a benchmarking perspective, because
| it will be compiled fast enough not to matter in the grand
| scheme of things.
| the_mitsuhiko wrote:
| > I'm from JVM/Java world, but it is JITted VM too, and in our
| world question is: why you want to benchmark interpreted code
| at all!?
|
| Java gives you exceptional control over the JVM, allowing you
| to create really good benchmark harnesses. That is not the
| case with JavaScript today, and the proliferation of
| different runtimes makes it even harder. To the best of my
| knowledge there is no JMH equivalent for JavaScript today.
| dzaima wrote:
| JIT can be very unpredictable. I've seen cases with the JVM
| where running the exact same benchmark twice in the same VM
| had the second run be 2x slower than the first, cases where
| running one benchmark before another made the latter 5x
| slower, and similar.
|
| Sure, if you make a 100% consistent environment of a VM running
| just the single microbenchmark you may get a consistent result
| on one system, but is a consistent result in any way meaningful
| if it may be a massive factor away from what you'd get in a
| real environment? And even then I've had cases of like 1.5x-2x
| differences for the exact same benchmark run-to-run.
|
| Granted, this may be less of a benchmarking issue, more just a
| JIT performance issue, but it's nevertheless also a
| benchmarking issue.
|
| Also, for JS, in browser specifically, pre-JIT performance is
| actually a pretty meaningful measurement, as each website load
| starts anew.
| gmokki wrote:
| How long did you run the benchmark if you got such large
| variation?
|
| For simple methods I usually run the benchmarked method 100k
| times; 10k is the minimum for full JIT.
|
| For large programs I have noticed the performance keeps
| getting better for the first 24 hours, after which I take a
| profiling dump.
| dzaima wrote:
| Most of the simple benches I do are for ~1 second. The
| order-dependent things definitely were reproducible
| (something along the lines of rerunning resulting in some
| rare virtual method case finally being invoked enough
| times/with enough cases to heavily penalize the vastly more
| frequent case). And in the case of very different results, it
| was C2 deciding to compile the code differently (looking at
| the assembly was problematic, as adding PrintAssembly or
| whatever skewed which path it took), and it stayed stable for
| tens of seconds after the first ~second IIRC (though,
| granted, it was preview jdk.incubator.vector code).
| Etheryte wrote:
| Agreed, comparing functions in isolation can give you
| drastically different results from the real world, where your
| application can have vastly different memory access patterns.
| natdempk wrote:
| Does anyone know how well the JIT/cache on the browser works
| eg. how useful it is to profile JIT'd vs non-JIT'd and what
| those different scenarios might represent in practice? For
| example is it just JIT-ing as the page loads/executes, or are
| there cached functions that persist across page loads, etc?
| ufo wrote:
| Javascript code is often short-lived and doesn't have enough
| time to wait for the JIT to warm up.
| pizlonator wrote:
| When JITing Java, the main profiling inputs are for call
| devirtualization. That has a lot of randomness, but it's
| confined to just those callsites where the JIT would need
| profiling to devirtualize.
|
| When JITing JavaScript, every single fundamental operation has
| profiling. Adding stuff has multiple bits of profiling. Every
| field access. Every array access. Like, basically everything,
| including also callsites. And without that profiling, the JS
| JIT can't do squat, so it depends entirely on that profiling.
| So the randomness due to profiling has a much more extreme
| effect on what the compiler can even do.
| munificent wrote:
| _> Short-lived interpreted / level-1 JITted code is not
| interesting at all from benchmarking perspective, because it
| will be compiled fast enough to doesn't matter in grand scheme
| of things._
|
| This is true for servers but extremely not true for client-side
| GUI applications and web apps. Often, the entire process of [
| user starts app > user performs a few tasks > user exits app ]
| can be done in a second. Often, the JIT never has a chance to
| warm up.
| vitus wrote:
| > This effort, along with a move to prevent timing attacks, led
| to JavaScript engines intentionally making timing inaccurate, so
| hackers can't get precise measurements of the current computers
| performance or how expensive a certain operation is.
|
| The primary motivation for limiting timer resolution was the rise
| of speculative execution attacks (Spectre / Meltdown), where
| high-resolution timers are integral for differentiating between
| timings within the memory hierarchy.
|
| https://github.com/google/security-research-pocs/tree/master...
|
| If you look at when various browsers changed their timer
| resolutions, it's entirely a response to Spectre.
|
| https://blog.mozilla.org/security/2018/01/03/mitigations-lan...
|
| https://issues.chromium.org/issues/40556716 (SSCA -> "speculative
| side channel attacks")
| hyperpape wrote:
| For anyone interested in this subject, I'd recommend reading
| about JMH. The JVM isn't 100% the same as JS VMs, but as a
| benchmarking environment it shares the same constraint of JIT
| compilation.
|
| The right design is probably one that:
|
| 1) runs different tests in different forked processes, to avoid
| variance based on the order in which tests are run changing the
| JIT's decisions.
|
| 2) runs tests for a long time (seconds or more per test) to
| ensure full JIT compilation and statistically meaningful results
|
| Then you need to realize that your micro benchmarks give you
| information and help you understand, but the acid test is
| improving the performance of actual code.
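|
| A minimal sketch of point 1 in Node (assuming each benchmark
| lives in its own file and prints a number when run to
| completion):
|
|     const { execFileSync } = require("node:child_process");
|
|     const results = {};
|     for (const file of ["benchmarks/a.js", "benchmarks/b.js"]) {
|       // a fresh process per benchmark, so one test's JIT state
|       // can't leak into another's
|       const out = execFileSync(process.execPath, [file],
|                                { encoding: "utf8" });
|       results[file] = Number(out.trim());
|     }
|     console.log(results);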
| thecodrr wrote:
| Benchmarking is a mess _everywhere_. Sure you can get some level
| of accuracy but reproducing any kind of benchmark results across
| machines is impossible. That's why perf people focus on things
| like CPU cycles, heap size, cache access etc instead of time.
| Even with multiple runs and averaged out results you can only get
| a surface level idea of how your code is actually performing.
| skybrian wrote:
| Performance is inherently non-portable. In fact, ignoring
| performance differences is what _enables_ portability.
|
| Not knowing what performance to expect is what allows you to
| build a website and expect it to run properly years later, on
| browsers that haven't been released yet, running on future mobile
| phones that use chips that haven't been designed yet, over a
| half-working WiFi connection in some cafe somewhere.
|
| Being ignorant of performance is what allows you to create Docker
| images that work on random servers in arbitrary datacenters, at
| the same time that perfect strangers are running _their_ jobs and
| arbitrarily changing what hardware is available for your code to
| use.
|
| It's also what allows you to depend on a zillion packages written
| by others and available for free, and _upgrade_ those packages
| without things horribly breaking due to performance differences,
| at least most of the time.
|
| If you want fixed performance, you have to deploy on fixed,
| dedicated hardware, like video game consoles or embedded devices,
| and test on the same hardware that you'll use in production. And
| then you drastically limit your audience. It's sometimes useful,
| but it's not what the web is about.
|
| But faster is better than slower, so we try anyway. Understanding
| the performance of portable code is a messy business because it's
| mostly not the code, it's our assumptions about the environment.
|
| We run tests that don't generalize. For scientific studies, this
| is called the "external validity" problem. We're often doing the
| equivalent of testing on mice and assuming the results are
| relevant for humans.
| Max-q wrote:
| Ignoring performance is what gives you slow code, costing you
| a lot if the code you write becomes a success, because you
| have to throw a lot more hardware at it. Think back to early
| Twitter, which crashed and went down for hours many days.
|
| Most optimizations will improve things on all or at least
| some VMs. Most will not make things slower on others.
|
| If you write code that will be scaled up, optimization can
| save a lot of money and give better uptime, and it's not a
| bad thing: the better code is not less portable in most
| cases.
| ericyd wrote:
| Maybe I'm doing it wrong, but when I benchmark code, my goal is
| to compare two implementations of the same function and see which
| is faster. This article seems to be concerned with finding some
| absolute metric of performance, but to me that isn't what
| benchmarking is for. Performance will vary based on hardware and
| runtime which often aren't in your control. The limitations
| described in this article are interesting notes, but I don't see
| how they would stop me from getting a reasonable assessment of
| which implementation is faster for a single benchmark.
| epolanski wrote:
| Well, the issue is that micro benchmarking in JS is borderline
| useless.
|
| You can have some function that iterates over something and
| benchmark two different implementations and draw conclusions
| that one is better than the other.
|
| Then, in real world, when it's in the context of some other
| code, you just can't draw conclusions because different engines
| will optimize the very same paths differently in different
| contexts.
|
| Also, your micro benchmark may tell you that A is faster than
| B... when it's a hot function that has been optimized due to
| being used frequently. But then you find that, when the
| function is used only a few times and doesn't get optimized,
| B runs faster by default.
|
| It is really not easy nor obvious to benchmark different
| implementations. Let alone the fact that you have differences
| across engines, browsers, devices and OSs (which will use
| different OS calls and compiler behaviors).
| ericyd wrote:
| I guess I've just never seen any alternative to
| microbenchmarking in the JS world. Do you know of any
| projects that do "macrobenchmarking" to a significant degree
| so I could see that approach?
| epolanski wrote:
| Real world app and some other projects focus on entire app
| benchmarks.
| dizhn wrote:
| Isn't that more profiling than benchmarking?
| mort96 wrote:
| No? Profiling tells you which parts of your code take time.
| dizhn wrote:
| Thanks
| hyperpape wrote:
| The basic problem is that if the compiler handles your code in
| an unusual way in the benchmark, you haven't really measured
| the two implementations against each other, you've measured
| something different.
|
| Dead code elimination is the most obvious way this happens, but
| you can also have issues where you give the branch predictor
| "help", or you can use a different number of implementations of
| a method so you get different inlining behavior (this can make
| a benchmark better or worse than reality), and many others.
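|
| A sketch of the dead-code-elimination case (hypothetical
| names): unless the result escapes somewhere the compiler
| can't prove unused, the measured work may be dropped
| entirely.
|
|     let sink = 0; // "side effect" that keeps the work alive
|     function benchSum(arr, iters) {
|       const start = performance.now();
|       for (let i = 0; i < iters; i++) {
|         // without the sink, a smart enough JIT could elide this
|         sink += arr.reduce((a, b) => a + b, 0);
|       }
|       return (performance.now() - start) / iters;
|     }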
|
| As for runtime, if you're creating a library, you probably care
| at least a little bit about alternate runtimes, though you may
| well just target node/V8 (on the JVM, I've done limited
| benchmarking on runtimes other than HotSpot, though if any of
| my projects get more traction, I'd anticipate needing to do
| more).
| wpollock wrote:
| You're not wrong, but there are cases where "absolute"
| performance matters. For example, when your app must meet a
| performance SLA.
| DJBunnies wrote:
| The whole JavaScript ecosystem is a mess.
| egberts1 wrote:
| As one who mapped the evolution of JavaScript and actually
| benchmarked each of those iterations (company proprietary
| info), it doesn't get any more accurate than the OP's
| reiteration of the article's title.
|
| I upvoted that.
|
| Evolution chart:
|
| https://egbert.net/blog/articles/javascript-jit-engines-time...
| pygy_ wrote:
| I have been sleeping on this for quite a while (long covid is a
| bitch), but I have built a benchmarking lib that sidesteps quite
| a few of these problems, by
|
| - running the benchmark in thin slices, interspersed and
| shuffled, rather than in one big batch per item (which also
| avoids having one scenario penalized by transient noise)
|
| - displaying graphs that show possible multi-modal
| distributions when the JIT gets in the way
|
| - varying the lengths of the thin slices between runs to work
| around the poor timer resolution in browsers
|
| - assigning the results of the benchmark to a global (or a
| variable in the parent scope, as in the WEB demo below) to
| avoid dead code elimination (see the sketch after this list)
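|
| A rough sketch of the interleaved-slice idea (hypothetical
| tasks array of zero-argument functions):
|
|     globalThis.__sink = undefined; // defeat dead code elimination
|     function run(tasks, rounds = 200, sliceIters = 1000) {
|       const samples = tasks.map(() => []);
|       for (let r = 0; r < rounds; r++) {
|         const order = shuffle([...tasks.keys()]);
|         for (const t of order) {
|           const start = performance.now();
|           for (let i = 0; i < sliceIters; i++)
|             globalThis.__sink = tasks[t]();
|           samples[t].push((performance.now() - start) / sliceIters);
|         }
|       }
|       return samples; // one distribution per task, not one number
|     }
|     function shuffle(a) { // Fisher-Yates
|       for (let i = a.length - 1; i > 0; i--) {
|         const j = Math.floor(Math.random() * (i + 1));
|         [a[i], a[j]] = [a[j], a[i]];
|       }
|       return a;
|     }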
|
| This isn't a panacea, but it is better than the existing
| solutions, as far as I'm aware.
|
| There are still issues because, sometimes, even if the task order
| is shuffled for each slice, the literal source order can
| influence how/if a bit of code is compiled, resulting in
| unreliable results. The "thin slice" approach can also dilute the
| GC runtime between scenarios if the amount of garbage isn't
| identical between scenarios.
|
| I think it is, however, a step in the right direction.
|
| - CLI runner for NODE:
| https://github.com/pygy/bunchmark.js/tree/main/packages/cli
|
| - WIP WEB UI:
| https://flems.io/https://gist.github.com/pygy/3de7a5193989e0...
|
| In both cases, if you've used JSPerf you should feel right at
| home in the WEB UI. The CLI UI is meant to replicate the WEB
| UI as closely as possible (see the example file).
| pygy_ wrote:
| I hadn't run these in a while, but in the current Chrome
| version, you can clearly see the multi-modality of the results
| with the dummy Math.random() benchmark.
| sroussey wrote:
| For the love of god, please do not do this:
|
|     for (let i = 0; i < 1000; i++) {
|       console.time()
|       // do some expensive work
|       console.timeEnd()
|     }
|
| Take your timing before and after the loop and divide by the
| count. Too much jitter otherwise.
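|
| Something like this instead (a sketch, with a hypothetical
| doExpensiveWork()):
|
|     const iterations = 1000;
|     const start = performance.now();
|     for (let i = 0; i < iterations; i++) {
|       doExpensiveWork();
|     }
|     const perCall = (performance.now() - start) / iterations;
|     console.log(`${perCall.toFixed(4)} ms per call`);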
|
| d8 and node have many options for benchmarking and if you really
| care, go command line. JSC is what is behind Bun so you can go
| that direction as well.
|
| And BTW: console.time et al. do a bunch of stuff themselves.
| You will get the JIT looking to optimize them as well in that
| loop above, lol.
| igouy wrote:
| > and divide by the count
|
| Which gives an average rather than a time?
| gmokki wrote:
| I usually do a
|
|     var innerCount = 2000; // should run about 2 seconds
|     for (var i = 0; i < 1000; i++) {
|         var start = currentMillis();
|         for (var j = 0; j < innerCount; j++) {
|             benchmarkMethod();
|         }
|         best = min(best, (currentMillis() - start) / (double) innerCount);
|     }
|
| That way I can both get enough precision from the millisecond
| resolution and run the whole thing enough times to get the
| best result without JIT/GC pauses. The result is usually very
| stable, even when benchmarking calls to a database (running
| locally).
| igouy wrote:
| No interest in a more general tool?
|
| https://github.com/sosy-lab/benchexec
| gred wrote:
| If you find yourself benchmarking JavaScript, you chose the wrong
| language.
| pizlonator wrote:
| (I designed JavaScriptCore's optimizing JITs and its garbage
| collector and a bunch of the runtime. And I often benchmark
| stuff.)
|
| Here's my advice for how to run benchmarks and be happy with the
| results.
|
| - Any experiment you perform has the risk of producing an outcome
| that misleads you. You have to viscerally and spiritually accept
| this fact if you run any benchmarks. Don't rely on the outcome of
| a benchmark as if it's some kind of Truth. Even if you do
| everything right, there's something like a 1/10 risk that you're
| fooling yourself. This is true for any experiment, not just ones
| involving JavaScript, or JITs, or benchmarking.
|
| - Benchmark large code. Language implementations (including ahead
| of time compilers for C!) have a lot of "winning in the average"
| kind of optimizations that will kick in or not based on
| heuristics, and those heuristics have broad visibility into large
| chunks of your code. AOTs get there by looking at the entire
| compilation unit, or sometimes even your whole program. JITs get
| to see a random subset of the whole program. So, if you have a
| small snippet of code then the performance of that snippet will
| vary wildly depending on how it's used. Therefore, putting some
| small operation in a loop and seeing how long it runs tells you
| almost nothing about what will happen when you use that snippet
| in anger as part of a larger program.
|
| How do you benchmark large code? Build end-to-end benchmarks that
| measure how your whole application is doing perf-wise. This is
| sometimes easy (if you're writing a database you can easily
| benchmark TPS, and then you're running the whole DB impl and not
| just some small snippet of the DB). This is sometimes very hard
| (if you're building UX then it can be hard to measure what it
| means for your UX to be responsive, but it is possible). Then, if
| you want to know whether some function should be implemented one
| way or another way, run an A:B test where you benchmark your
| whole app with one implementation versus the other.
|
| Why is that better? Because then, you're measuring how your
| snippet of code is performing in the context of how it's used,
| rather than in isolation. So, your measurement will account for
| how your choices impact the language implementation's heuristics.
|
| Even then, you might end up fooling yourself, but it's much less
| likely.
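|
| One way that A:B test can look (a sketch; runWholeApp() is a
| hypothetical end-to-end driver that exercises the full
| application with the given implementation):
|
|     async function ab(implA, implB, trials = 30) {
|       const time = async (impl) => {
|         const start = performance.now();
|         await runWholeApp(impl);
|         return performance.now() - start;
|       };
|       const a = [], b = [];
|       for (let i = 0; i < trials; i++) { // interleave A and B runs
|         a.push(await time(implA));
|         b.push(await time(implB));
|       }
|       const sorted = (xs) => xs.slice().sort((x, y) => x - y);
|       const median = (xs) => sorted(xs)[xs.length >> 1];
|       console.log("A median:", median(a), "B median:", median(b));
|     }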
| leeoniya wrote:
| great points! i do a lot of JS benchmarking + optimization and
| whole-program measurement is key. sometimes fixing one hotspot
| changes the whole profile, not just shifts the bottleneck to
| the next biggest thing in the original profile. GC behaves
| differently in different JS vms. sometimes if you benchmark
| something like CSV parsers which can stress the GC,
| Benchmark.js does a poor job by not letting the GC collect
| properly between cycles. there's a lengthy discussion about why
| i use a custom benchmark runner for this purpose [1]. i can
| recommend js-framework-benchmark [2] as a good example of one
| that is done well, also WebKit's speedometer [3].
|
| [1] https://github.com/leeoniya/uDSV/issues/2
|
| [2] https://github.com/krausest/js-framework-benchmark
|
| [3] https://github.com/WebKit/Speedometer
| kmiller68 wrote:
| I completely agree with this advice. Micro-benchmarking can
| work well as long as you already have an understanding of
| what's happening behind the scenes. Without that it greatly
| increases the chance that you'll get information unrelated to
| how your code would perform in the real world. Even worse, I've
| found a lot of the performance micro-benchmarking websites can
| actually induce performance issues. Here's an example of a
| recent performance bug that appears to have been entirely
| driven by the website's harness.
| https://bugs.webkit.org/show_bug.cgi?id=283118
| hinkley wrote:
| > there's something like a 1/10 risk that you're fooling
| yourself.
|
| You're being generous or a touch ironic. It's at least 1/10 and
| probably more like 1/5 on average and 1/3 for people who don't
| take advice.
|
| Beyond testing changes in a larger test fixture, I also find
| that sometimes multiplying the call count for the code under
| examination can help clear things up. Putting a loop in to run
| the offending code 10 times instead of once is a clearer
| signal. Of course it still may end up being a false signal.
|
| I like a two-phase approach, where you use a small-scale
| benchmark while iterating on optimization ideas, then check
| the larger context once you feel you've made progress, and
| again before you file a PR.
|
| At the end of the day, eliminating accidental duplication of
| work is the most reliable form of improvement, and one that
| current and previous generation analysis tools don't do well.
| Make your test cases deterministic and look at invocation
| counts to verify that you expect n calls of a certain shape to
| call the code in question exactly kn times. Then figure out why
| it's mn instead. (This is why I say caching is the death of
| perf analysis. Once it's added this signal disappears)
| aardvark179 wrote:
| Excellent advice. It's also very important to know what any
| micro benchmarks you do have are really measuring. I've seen
| enough that actually measured the time to setup or parse
| something because they dominated and wasn't cached correctly.
| Conversely I've seen cases where the JIT correctly optimised
| away almost everything because there was a check on the final
| value.
|
| Oh, and if each op takes under a nanosecond then your
| benchmark is almost certainly completely broken.
| croes wrote:
| Do the users care?
|
| I think they are used to waiting because they no longer know the
| speed of desktop applications.
| leeoniya wrote:
| it's fun to write "fast" js code and watch people's amazement.
| as hardware becomes cheaper and faster, devs become lazier
| and more careless.
|
| it's all fun and games until your battery dies 3 hours too
| soon.
|
| https://en.m.wikipedia.org/wiki/Jevons_paradox
| CalChris wrote:
| Laurence Tratt's paper _Virtual machine warmup blows hot and
| cold_ [1] has been posted several times and never really
| discussed. It covers this problem for Java VMs and also
| presents a benchmarking methodology.
|
| [1] https://dl.acm.org/doi/10.1145/3133876
| 1oooqooq wrote:
| kids don't recall when chrome was cheating left and right to
| be faster than firefox (after being honest for a couple of
| months).
|
| you'd have to run benchmarks for all sorts of little things
| because no browser would leave things be. If they thought one
| popular benchmark was using string+string it was all or
| nothing to optimize that, harming everything else. next week
| if that benchmark changed to string[].join... you get the
| idea. your code was all over the place in performance. Flying
| today, molasses next week... sometimes chrome and ff would
| switch the optimizations, so you'd serve string+string to one
| and array.join to the other. sigh.
| evnwashere wrote:
| That's why i created mitata, which greatly improves on
| javascript (micro-)benchmarking tooling.
|
| it provides a bunch of features to help avoid jit optimization
| foot-guns during benchmarking and dips into more advanced
| stuff like hardware cpu counters to see what the end result
| of the jit is on the cpu.
| henning wrote:
| While there may be challenges, caring about frontend performance
| is still worth it. When I click the Create button in JIRA and
| start typing, the text field lags behind my typing. I use a 2019
| MacBook Pro. Unforgivable. Whether one alternate implementation
| that lets me type normally is 10% faster than another or not or
| whatever may be harder to answer. If I measure how bad the UI is
| and it's actually 60x slower than vanilla JS rather than 70x
| because of measurement error, the app is still a piece of shit.
| spankalee wrote:
| My old team at Google created a tool to help do better browser
| benchmarking called Tachometer:
| https://github.com/google/tachometer
|
| It tries to deal with the uncertainties of different browsers,
| JITs, GCs, CPU throttling, varying hardware, etc., several ways:
|
| - Runs benchmarks round-robin to hopefully subject each
| implementation to varying CPU load and thermal properties evenly.
|
| - It reports the confidence interval for an implementation, not
| the mean. Doesn't throw out outlier samples.
|
| - For multiple implementations, compares the distributions of
| samples, de-emphasizing the mean
|
| - For comparisons, reports an NxM difference table, showing how
| each impl compares to the other.
|
| - Can auto-run until confidence intervals for different
| implementations no longer overlap, giving high confidence that
| there is an actual difference.
|
| - Uses WebDriver to run benchmarks in multiple browsers, also
| round-robin, and compares results.
|
| - Can manage npm dependencies, so you can run the same benchmark
| with different dependencies and see how different versions change
| the result.
|
| Lit and Preact use Tachometer to tease out performance changes of
| PRs, even on unreliable GitHub Action hardware. We needed the
| advanced statistical comparisons exactly because certain things
| could be faster or slower in different JIT tiers, different
| browsers, or different code paths.
|
| We wanted to be able to test changes that might have small but
| reliable overall perf impact, in the context of a non-micro-
| benchmark, and get reliable results.
|
| Tachometer is browser-focused, but we made it before there were
| so many server runtimes. It'd be really interesting to make it
| run benchmarks against Node, Bun, Deno, etc. too.
| nemomarx wrote:
| how relevant is browser benchmarking now that chrome owns most
| of the space?
| austin-cheney wrote:
| There is a common sentiment I see there that I see regularly
| repeated in software. Here is my sarcastic take:
|
| _I hate measuring things because accuracy is hard. I wish I
| could just make up my own numbers to make myself feel better._
|
| It is surprising to me how many developers cannot measure things,
| do so incorrectly, and then look for things to blame for their
| emotional turmoil.
|
| Here is quick guide to solve for this:
|
| 1. Know what you are measuring and what its relevance is to your
| product. It is never about big or small because numerous small
| things make big things.
|
| 2. Measuring things means generating numbers and comparing
| those numbers against other numbers from a different but
| similar measure. The numbers are meaningless if there is no
| comparison.
|
| 3. If precision is important, use the high-resolution timers
| provided by the browser and Node for measuring things (see
| the sketch after this list). You can get nanosecond-resolution
| measurements in Node, and then account for the variance, that
| plus/minus range, in your results. If you are measuring real
| world usage and your numbers get smaller due to performance
| refactoring, expect variance to increase. It's ok, I promise.
|
| 4. Measure a whole bunch of different shit. The point of
| measuring things isn't about speed. It's about identifying bias.
| The only way to get faster is to know what's really happening and
| just how off base your assumptions are.
|
| 5. Never ever trust performance indicators from people lacking
| objectivity. Expect to have your results challenged and be glad
| when they are. Rest on the strength of your evidence and ease of
| reproduction that you provide.
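|
| A sketch of the timers mentioned in point 3 (doWork() is a
| hypothetical function under test; process.hrtime.bigint() is
| Node-only, while performance.now() exists in both Node and
| browsers, though browsers coarsen it):
|
|     const t0 = process.hrtime.bigint(); // nanosecond resolution
|     doWork();
|     const t1 = process.hrtime.bigint();
|     console.log(`${Number(t1 - t0)} ns`);
|
|     const p0 = performance.now(); // fractional milliseconds
|     doWork();
|     console.log(`${(performance.now() - p0).toFixed(3)} ms`);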
| xpl wrote:
| I once created this tool for benchmarking JS:
| https://github.com/xpl/what-code-is-faster
|
| It does JIT warmup and ensures that your code doesn't get
| optimized out (by making it produce a side effect in result).
___________________________________________________________________
(page generated 2024-12-24 23:00 UTC)