[HN Gopher] JavaScript Benchmarking Is a Mess
       ___________________________________________________________________
        
       JavaScript Benchmarking Is a Mess
        
       Author : joseneca
       Score  : 104 points
       Date   : 2024-12-24 12:37 UTC (10 hours ago)
        
 (HTM) web link (byteofdev.com)
 (TXT) w3m dump (byteofdev.com)
        
       | diggan wrote:
       | > Essentially, these differences just mean you should benchmark
       | across all engines that you expect to run your code to ensure
       | code that is fast in one isn't slow in another.
       | 
       | In short, the JavaScript backend people now need to do what we
       | JavaScript frontend people have been doing since SPAs became a
       | thing: run benchmarks across multiple engines instead of just one.
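       | 
       | A tiny sketch of what that can look like: one benchmark file that
       | sticks to standard globals so the same script runs unchanged under
       | several engines (file name and workload below are made up):
       | 
       |     // bench.js -- runs as-is under Node, Deno, and Bun.
       |     const N = 1e6;
       |     let sink = 0;
       |     const start = performance.now();
       |     for (let i = 0; i < N; i++) sink += Math.sqrt(i);
       |     const ms = (performance.now() - start).toFixed(2);
       |     console.log(`${ms} ms`, sink > 0);
       | 
       |     // node bench.js && deno run bench.js && bun bench.js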
        
       | sylware wrote:
       | If you use javascript, use a lean engine coded in a lean SDK,
       | certainly not the c++ abominations in Big Tech web engines.
       | 
       | Look at quickjs, and use your own very lean OS interfaces.
        
         | joseneca wrote:
         | QuickJS is great for cases where you are more limited by
         | startup and executable size than anything else but it tends to
         | perform quite terribly (https://bellard.org/quickjs/bench.html)
         | compared to V8 and anything else with JIT compilation.
         | 
         | More code does not inherently mean worse performance.
        
           | sylware wrote:
            | But its SDK is not a c++ abomination like v8's, and that alone
            | is enough to choose quickjs or similar, since we all know here
            | that c++ (and similar, namely rust/java/etc) is a definitive
            | no-no (when it is not forced down our throats as users; don't
            | forget to thank the guys doing that, usually well hidden
            | behind the internet...).
           | 
           | For performance, don't use javascript anyway...
           | 
            | That said, a much less bad middle ground would be to have
            | performance-critical blocks written in assembly (RISC-V Now!)
            | orchestrated by javascript.
        
             | ramon156 wrote:
              | I could be reading your comment wrong, but what do you mean
              | by "c++ is a definitive nono"? Also, how is a complicated
              | repository enough of a reason to choose quickjs?
        
               | sylware wrote:
                | quickjs or similar. Namely, something small that depends
                | on a reasonable SDK (which does include the computer
                | language).
        
             | surajrmal wrote:
              | I'm not sure I understand the point of mentioning the
              | language. If it were written in a language like C but still
              | "shoved down your throat", would you still have qualms with
              | it? Do you just dislike things written by corporate
              | entities, and those languages tend to be popular because
              | they scale well to larger teams? Or do you dislike software
              | that is too large to understand and optimized for the needs
              | of larger teams? Because it doesn't matter whether it's one
              | piece of software or some grouping of distinct software - at
              | some point it becomes challenging to understand the full set
              | of software.
             | 
             | If I were to create an analogy, it feels like you're
             | complaining about civil engineers who design skyscrapers to
             | be built out of steel and concrete instead of wood and
             | brick like we use for houses. Sure the former is not really
             | maintainable by a single person but it's also built for
             | higher capacity occupancy and teams of folks to maintain.
        
               | sylware wrote:
               | The "need of larger teams" does not justify to delegate
               | the core of the technical interfaces to a grotesquely and
               | absurdely complex and gigantic computer language with its
               | compiler (probably very few real life ones).
               | 
               | This is accute lack of perspective, border-line fraud.
        
       | dan-robertson wrote:
       | Re VM warmup, see
       | https://tratt.net/laurie/blog/2022/more_evidence_for_problem...
       | and the linked earlier research for some interesting discussion.
       | Roughly, there is a belief when benchmarking that one can work
       | around not having the most-optimised JIT-compiled version by
       | running your benchmark a number of times and then throwing away
       | the result before doing 'real' runs. But it turns out that:
       | 
       | (a) sometimes the jit doesn't run
       | 
       | (b) sometimes it makes performance worse
       | 
       | (c) sometimes you don't even get to a steady state with
       | performance
       | 
       | (d) and obviously in the real world you may not end up with the
       | same jitted version that you get in your benchmarks
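       | 
       | A minimal sketch of the warmup-then-measure pattern described
       | above, so it's clear what the research is poking holes in
       | (hypothetical work() function, timings via performance.now()):
       | 
       |     function benchmark(work, warmup = 1000, measured = 1000) {
       |       // Warmup: run and discard, hoping the JIT settles on its
       |       // most-optimised tier before measurement starts.
       |       for (let i = 0; i < warmup; i++) work();
       |       // "Real" runs: measure and keep.
       |       const samples = [];
       |       for (let i = 0; i < measured; i++) {
       |         const start = performance.now();
       |         work();
       |         samples.push(performance.now() - start);
       |       }
       |       return samples;
       |     }
       | 
       | The findings above are about the hidden assumption in the warmup
       | loop: that it reliably reaches a steady, fastest state.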
        
         | vitus wrote:
         | In general, this isn't even a JS problem, or a JIT problem. You
         | have similar issues even in a lower-level language like C++:
         | branch prediction, cache warming, heck, even power state
         | transitions if you're using AVX512 instructions on an older
         | CPU. Stop-the-world GC causing pauses? Variation in memory
         | management exists in C, too -- malloc and free are not fixed-
         | cost, especially under high churn.
         | 
         | Benchmarks can be a useful tool, but they should not be
         | mistaken for real-world performance.
        
           | dan-robertson wrote:
           | I think that sort of thing is a bit different because you can
           | have more control over it. If you're serious about
           | benchmarking or running particularly performance sensitive
           | code, you'll be able to get reasonably consistent benchmark
           | results run-to-run, and you'll have a big checklist of things
           | like hugepages, pgo, tickless mode, writing code a specific
           | way, and so on to get good and consistent performance. I
           | think you end up being less at the mercy of choices made by
           | the VM.
        
             | marcosdumay wrote:
              | The amount of control you have varies on a continuum from
              | hand-written assembly to SQL queries. But there isn't really
              | a difference of kind here; it's just a continuum.
              | 
              | If there's anything unique about Javascript, it's that it
              | has an unusually high ratio of "unpredictability" to
              | "abstraction level". But again, it has pretty normal values
              | of both of those; it's just the relation between them that
              | is away from the norm.
        
               | dan-robertson wrote:
               | I'm quite confused by your comment. I think this
               | subthread is about reasons one might see variance on
               | modern hardware in your 'most control' end of the
               | continuum.
               | 
               | The start of the thread was about some ways where VM
               | warmup (and benchmarking) may behave differently from how
               | many reasonably experienced people expect.
               | 
               | I claim it's reasonably different because one cannot tame
               | parts of the VM warmup issues whereas one can tame many
               | of the sources of variability one sees in high-
               | performance systems outside of VMs, eg by cutting out the
               | OS/scheduler, by disabling power saving and properly
               | cooling the chip, by using CAT to limit interference with
               | the L3 cache, and so on.
        
               | marcosdumay wrote:
               | > eg by cutting out the OS/scheduler, by disabling power
               | saving and properly cooling the chip, by using CAT to
               | limit interference with the L3 cache, and so on
               | 
               | That's not very different from targeting a single JS
               | interpreter.
               | 
               | In fact, you get much more predictability by targeting a
               | single JS interpreter than by trying to guess your
               | hardware limitations... Except, of course if you target a
               | single hardware specification.
        
               | hinkley wrote:
                | When we were upgrading to ES6 I was surprised/relieved to
                | find that moving some leaf-node code in the call graph
                | from prototypes to classes did help. The common wisdom at
                | the time was that classes were still relatively expensive.
                | But they force strict mode, which we were using
                | inconsistently, and they flatten the object representation
                | (I discovered these issues in the heap dump, rather than
                | the flame graph). Reducing memory pressure can overcome
                | the cost of otherwise suboptimal code.
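                | 
                | For the curious, the change was essentially between these
                | two styles (illustrative names, not our actual code):
                | 
                |     // Prototype style: sloppy mode unless "use strict"
                |     // is added explicitly.
                |     function Point(x, y) { this.x = x; this.y = y; }
                |     Point.prototype.norm = function () {
                |       return Math.sqrt(this.x * this.x + this.y * this.y);
                |     };
                | 
                |     // Class style: the class body is always strict mode.
                |     class PointC {
                |       constructor(x, y) { this.x = x; this.y = y; }
                |       norm() {
                |         return Math.sqrt(this.x * this.x + this.y * this.y);
                |       }
                |     }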
               | 
                | OpenTelemetry not having been invented yet, someone
                | implemented server-side HAR reports, and the data
                | collection for that was a substantial bottleneck.
                | Particularly the original implementation.
        
           | hinkley wrote:
           | Running benchmarks on the old Intel MacBook a previous job
           | gave me was like pulling teeth. Thermal throttling all the
           | time. Anything less than at least a 2x speed up was just
           | noise and I'd have to push my changes to CI/CD to test, which
           | is how our build process sprouted a benchmark pass. And a
           | grafana dashboard showing the trend lines over time.
           | 
           | My wheelhouse is making lots of 4-15% improvements and
           | laptops are no good for those.
        
         | hyperpape wrote:
         | I think Tratt's work is great, but most of the effects that
         | article highlights seem small enough that I think they're most
         | relevant to VM implementors measuring their own internal
         | optimizations.
         | 
         | Iirc, the effects on long running benchmarks in that paper are
         | usually < 1%, which is a big deal for runtime optimizations,
         | but typically dwarfed by the differences between two methods
         | you might measure.
        
           | hinkley wrote:
            | Cross-cutting concerns run into these sorts of problems. And
            | they can sneak up on you: as you add these calls to your
            | coding conventions, they get added incrementally in new code
            | and in substantial edits to old code, so what added a few
            | tenths of a ms at the beginning may be tens of milliseconds a
            | few years later. Someone put me on a trail of this sort last
            | year and I managed to find about 75 ms of improvement (and
            | another 50 ms in stupid mistakes adjacent to the search).
            | 
            | And since I didn't eliminate the logic, I just halved the
            | cost, that means we were spending about twice that much. But
            | I did lower the slope of the regression line quite a lot;
            | enough, I believe, that new nodeJS versions now improve
            | response time faster than it organically decays. There were
            | times it took EC2 instance type updates to see forward
            | progress.
        
             | hyperpape wrote:
             | I think you might be responding to a different point than
             | the one I made. The 1% I'm referring to is a 1% variation
             | between subsequent runs of the same code. This is a
             | measurement error, and it inhibits your ability to
             | accurately compare the performance of two pieces of code
             | that differ by a very small amount.
             | 
              | Now, it reads like you think I'm saying you shouldn't care
              | about a method if it's only 1% of your runtime. I
              | definitely don't believe that. Sure, start with the big
              | pieces, but once you've reached the point of diminishing
              | returns, you're often left optimizing methods that
              | individually are very small.
              | 
              | It sounds like you're describing a case where some method
              | starts off taking < 1% of the runtime of your overall
              | program, and grows over time. It's true that if you do a
              | full-program benchmark, you might be unable to detect the
              | difference between that method and an alternate
              | implementation (even that's not guaranteed; you can often
              | use statistics to overcome the variance in the runtime).
             | 
             | However, you often still will be able to use micro-
             | benchmarks to measure the difference between implementation
             | A and implementation B, because odds are they differ not by
             | 1% in their own performance, but 10% or 50% or something.
             | 
             | That's why I say that Tratt's work is great, but I think
             | the variance it describes is a modest obstacle to most
             | application developers, even if they're very performance
             | minded.
        
         | __alexs wrote:
         | Benchmarking that isn't based on communicating the distribution
         | of execution times is fundamentally wrong on almost any
         | platform.
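         | 
         | A tiny sketch of what "communicating the distribution" can mean
         | in practice: report quantiles of the samples rather than a single
         | mean (hypothetical samples array, assumed in milliseconds):
         | 
         |     // Report selected quantiles from an array of sample times.
         |     function quantiles(samples, qs = [0.5, 0.9, 0.99]) {
         |       const sorted = [...samples].sort((a, b) => a - b);
         |       return qs.map(q => {
         |         const idx = Math.min(sorted.length - 1,
         |                              Math.floor(q * sorted.length));
         |         return sorted[idx];
         |       });
         |     }
         |     console.log(quantiles([3.2, 3.1, 3.4, 9.8, 3.2, 3.3]));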
        
       | blacklion wrote:
       | Very strange take on "the JIT introduces a lot of error into the
       | results". I'm from the JVM/Java world, which is a JITted VM too,
       | and in our world the question is: why would you want to benchmark
       | interpreted code at all!?
       | 
       | Only final-stage, fully-JITted and profile-optimized code is what
       | matters.
       | 
       | Short-lived interpreted / level-1 JITted code is not interesting
       | at all from a benchmarking perspective, because it will be compiled
       | fast enough that it doesn't matter in the grand scheme of things.
        
         | the_mitsuhiko wrote:
         | > I'm from JVM/Java world, but it is JITted VM too, and in our
         | world question is: why you want to benchmark interpreted code
         | at all!?
         | 
         | Java gives you exceptional control over the JVM, allowing you to
         | create really good benchmark harnesses. That is not the case
         | with JavaScript today, and the proliferation of different
         | runtimes makes it even harder. To the best of my knowledge there
         | is no JMH equivalent for JavaScript today.
        
         | dzaima wrote:
         | JIT can be very unpredictable. I've seen cases with the JVM
         | where running the exact same benchmark in the same VM twice had
         | the second run be 2x slower than the first, occurrences of
         | running one benchmark before another making the latter 5x
         | slower, and similar.
         | 
         | Sure, if you make a 100% consistent environment of a VM running
         | just the single microbenchmark you may get a consistent result
         | on one system, but is a consistent result in any way meaningful
         | if it may be a massive factor away from what you'd get in a
         | real environment? And even then I've had cases of like 1.5x-2x
         | differences for the exact same benchmark run-to-run.
         | 
         | Granted, this may be less of a benchmarking issue, more just a
         | JIT performance issue, but it's nevertheless also a
         | benchmarking issue.
         | 
         | Also, for JS, in browser specifically, pre-JIT performance is
         | actually a pretty meaningful measurement, as each website load
         | starts anew.
        
           | gmokki wrote:
            | How long did you run the benchmark if you got such large
            | variation?
            | 
            | For simple methods I usually run the benchmarked method 100k
            | times; 10k is the minimum for full JIT.
           | 
           | For large programs I have noticed the performance keeps
           | getting better for the first 24 hours, after which I take a
           | profiling dump.
        
             | dzaima wrote:
              | Most of the simple benches I do run for ~1 second. The
              | order-dependent things definitely were reproducible
              | (something along the lines of a rerun resulting in some
              | rare virtual method case finally being invoked enough
              | times/with enough cases to heavily penalize the vastly more
              | frequent case). And the case of very different results was
              | C2 deciding to compile the code differently (looking at the
              | assembly was problematic, as adding PrintAssembly or
              | whatever skewed which path it took), and it stayed stable
              | for tens of seconds after the first ~second, iirc (though,
              | granted, it was preview jdk.incubator.vector code).
        
         | Etheryte wrote:
         | Agreed, comparing functions in isolation can give you
         | drastically different results from the real world, where your
         | application can have vastly different memory access patterns.
        
           | natdempk wrote:
           | Does anyone know how well the JIT/cache on the browser works
           | eg. how useful it is to profile JIT'd vs non-JIT'd and what
           | those different scenarios might represent in practice? For
           | example is it just JIT-ing as the page loads/executes, or are
           | there cached functions that persist across page loads, etc?
        
         | ufo wrote:
         | Javascript code is often short-lived and doesn't have enough
         | time to wait for the JIT to warm up.
        
         | pizlonator wrote:
         | When JITing Java, the main profiling inputs are for call
         | devirtualization. That has a lot of randomness, but it's
         | confined to just those callsites where the JIT would need
         | profiling to devirtualize.
         | 
         | When JITing JavaScript, every single fundamental operation has
         | profiling. Adding stuff has multiple bits of profiling. Every
         | field access. Every array access. Like, basically everything,
         | including also callsites. And without that profiling, the JS
         | JIT can't do squat, so it depends entirely on that profiling.
         | So the randomness due to profiling has a much more extreme
         | effect on what the compiler can even do.
        
         | munificent wrote:
         | _> Short-lived interpreted  / level-1 JITted code is not
         | interesting at all from benchmarking perspective, because it
         | will be compiled fast enough to doesn't matter in grand scheme
         | of things._
         | 
         | This is true for servers but extremely not true for client-side
         | GUI applications and web apps. Often, the entire process of [
         | user starts app > user performs a few tasks > user exits app ]
         | can be done in a second. Often, the JIT never has a chance to
         | warm up.
        
       | vitus wrote:
       | > This effort, along with a move to prevent timing attacks, led
       | to JavaScript engines intentionally making timing inaccurate, so
       | hackers can't get precise measurements of the current computers
       | performance or how expensive a certain operation is.
       | 
       | The primary motivation for limiting timer resolution was the rise
       | of speculative execution attacks (Spectre / Meltdown), where
       | high-resolution timers are integral for differentiating between
       | timings within the memory hierarchy.
       | 
       | https://github.com/google/security-research-pocs/tree/master...
       | 
       | If you look at when various browsers changed their timer
       | resolutions, it's entirely a response to Spectre.
       | 
       | https://blog.mozilla.org/security/2018/01/03/mitigations-lan...
       | 
       | https://issues.chromium.org/issues/40556716 (SSCA -> "speculative
       | side channel attacks")
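       | 
       | You can see the clamping directly. A quick sketch (illustrative
       | only) that estimates the effective resolution of performance.now()
       | in whatever engine runs it:
       | 
       |     // Spin until performance.now() changes and record the smallest
       |     // observed increment across a number of attempts.
       |     function timerResolution(attempts = 1000) {
       |       let min = Infinity;
       |       for (let i = 0; i < attempts; i++) {
       |         const t0 = performance.now();
       |         let t1 = performance.now();
       |         while (t1 === t0) t1 = performance.now();
       |         min = Math.min(min, t1 - t0);
       |       }
       |       return min; // milliseconds
       |     }
       |     console.log(`~${timerResolution()} ms effective resolution`);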
        
       | hyperpape wrote:
       | For anyone interested in this subject, I'd recommend reading
       | about JMH. The JVM isn't 100% the same as JS VMs, but as a
       | benchmarking environment it shares the same constraint of JIT
       | compilation.
       | 
       | The right design is probably one that:
       | 
       | 1) runs different tests in different forked processes, to avoid
       | variance based on the order in which tests are run changing the
       | JIT's decisions.
       | 
       | 2) runs tests for a long time (seconds or more per test) to
       | ensure full JIT compilation and statistically meaningful results
       | 
       | Then you need to realize that your micro benchmarks give you
       | information and help you understand, but the acid test is
       | improving the performance of actual code.
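       | 
       | A minimal Node sketch of that design, not a real tool (file names
       | and the placeholder workload are made up):
       | 
       |     // run-benchmarks.js: one forked process per test file, so
       |     // test order can't leak into the JIT's decisions.
       |     const { execFileSync } = require("node:child_process");
       |     for (const file of ["bench-a.js", "bench-b.js"]) {
       |       const out = execFileSync(process.execPath, [file],
       |                                { encoding: "utf8" });
       |       console.log(file, out.trim());
       |     }
       | 
       |     // bench-a.js: run one test for a few seconds, report ops/sec.
       |     const doWork = () => Math.sqrt(Math.random()); // placeholder
       |     const seconds = 5;
       |     const end = Date.now() + seconds * 1000;
       |     let ops = 0;
       |     while (Date.now() < end) { doWork(); ops++; }
       |     console.log(`${Math.round(ops / seconds)} ops/sec`);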
        
       | thecodrr wrote:
       | Benchmarking is a mess _everywhere_. Sure you can get some level
       | of accuracy but reproducing any kind of benchmark results across
       | machines is impossible. That's why perf people focus on things
       | like CPU cycles, heap size, cache access, etc., instead of time.
       | Even with multiple runs and averaged out results you can only get
       | a surface level idea of how your code is actually performing.
        
       | skybrian wrote:
       | Performance is inherently non-portable. In fact, ignoring
       | performance differences is what _enables_ portability.
       | 
       | Not knowing what performance to expect is what allows you to
       | build a website and expect it to run properly years later, on
       | browsers that haven't been released yet, running on future mobile
       | phones that use chips that haven't been designed yet, over a
       | half-working WiFi connection in some cafe somewhere.
       | 
       | Being ignorant of performance is what allows you to create Docker
       | images that work on random servers in arbitrary datacenters, at
       | the same time that perfect strangers are running _their_ jobs and
       | arbitrarily changing what hardware is available for your code to
       | use.
       | 
       | It's also what allows you to depend on a zillion packages written
       | by others and available for free, and _upgrade_ those packages
       | without things horribly breaking due to performance differences,
       | at least most of the time.
       | 
       | If you want fixed performance, you have to deploy on fixed,
       | dedicated hardware, like video game consoles or embedded devices,
       | and test on the same hardware that you'll use in production. And
       | then you drastically limit your audience. It's sometimes useful,
       | but it's not what the web is about.
       | 
       | But faster is better than slower, so we try anyway. Understanding
       | the performance of portable code is a messy business because it's
       | mostly not the code, it's our assumptions about the environment.
       | 
       | We run tests that don't generalize. For scientific studies, this
       | is called the "external validity" problem. We're often doing the
       | equivalent of testing on mice and assuming the results are
       | relevant for humans.
        
         | Max-q wrote:
         | Ignoring performance is what gives you slow code, and that costs
         | you a lot if the code you write becomes a success, because you
         | have to throw a lot more hardware at it. Think back to early
         | Twitter, which crashed and went down for hours at a time.
         | 
         | Most optimizations will improve things on all or some VMs. Most
         | will not make things slower on others.
         | 
         | If you write code that will be scaled up, optimization can save
         | a lot of money and give better uptime, and it's not a bad thing;
         | the better code is not less portable in most cases.
        
       | ericyd wrote:
       | Maybe I'm doing it wrong, but when I benchmark code, my goal is
       | to compare two implementations of the same function and see which
       | is faster. This article seems to be concerned with finding some
       | absolute metric of performance, but to me that isn't what
       | benchmarking is for. Performance will vary based on hardware and
       | runtime which often aren't in your control. The limitations
       | described in this article are interesting notes, but I don't see
       | how they would stop me from getting a reasonable assessment of
       | which implementation is faster for a single benchmark.
        
         | epolanski wrote:
         | Well, the issue is that micro benchmarking in JS is borderline
         | useless.
         | 
         | You can have some function that iterates over something and
         | benchmark two different implementations and draw conclusions
         | that one is better than the other.
         | 
         | Then, in the real world, when it's in the context of some other
         | code, you just can't draw conclusions, because different engines
         | will optimize the very same paths differently in different
         | contexts.
         | 
         | Also, your micro benchmark may tell you that A is faster than
         | B... when it's a hot function that has been optimized due to
         | being used frequently. But then you find that B, which is used
         | only a few times and doesn't get optimized, runs faster by
         | default.
         | 
         | It is really neither easy nor obvious to benchmark different
         | implementations. Let alone the fact that you have differences
         | across engines, browsers, devices and OSs (which will use
         | different OS calls and compiler behaviors).
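         | 
         | The hot-vs-cold point is easy to see in a toy example (numbers
         | and thresholds are arbitrary; whether the JIT actually kicks in
         | is up to the engine):
         | 
         |     function f(n) {
         |       let s = 0;
         |       for (let i = 0; i < n; i++) s += i % 7;
         |       return s;
         |     }
         |     let t = performance.now();
         |     f(1e6);                  // first call: likely unoptimized
         |     const cold = performance.now() - t;
         |     // many calls so the optimizing JIT can (maybe) kick in
         |     for (let i = 0; i < 1e4; i++) f(1e3);
         |     t = performance.now();
         |     f(1e6);                  // same call, now (probably) hot
         |     const hot = performance.now() - t;
         |     console.log({ cold, hot });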
        
           | ericyd wrote:
           | I guess I've just never seen any alternative to
           | microbenchmarking in the JS world. Do you know of any
           | projects that do "macrobenchmarking" to a significant degree
           | so I could see that approach?
        
             | epolanski wrote:
             | Real world app and some other projects focus on entire app
             | benchmarks.
        
         | dizhn wrote:
         | Isn't that more profiling than benchmarking?
        
           | mort96 wrote:
           | No? Profiling tells you which parts of your code take time.
        
             | dizhn wrote:
             | Thanks
        
         | hyperpape wrote:
         | The basic problem is that if the compiler handles your code in
         | an unusual way in the benchmark, you haven't really measured
         | the two implementations against each other, you've measured
         | something different.
         | 
         | Dead code elimination is the most obvious way this happens, but
         | you can also have issues where you give the branch predictor
         | "help", or you can use a different number of implementations of
         | a method so you get different inlining behavior (this can make
         | a benchmark better or worse than reality), and many others.
         | 
         | As for runtime, if you're creating a library, you probably care
         | at least a little bit about alternate runtimes, though you may
         | well just target node/V8 (on the JVM, I've done limited
         | benchmarking on runtimes other than HotSpot, though if any of
         | my projects get more traction, I'd anticipate needing to do
         | more).
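         | 
         | To make the dead code elimination point concrete, a toy sketch
         | (any decent JIT may turn the first loop into a no-op):
         | 
         |     // Result never used: the engine is free to delete the work.
         |     for (let i = 0; i < 1e6; i++) {
         |       Math.sqrt(i);
         |     }
         | 
         |     // Result escapes via a sink, so the work has to happen.
         |     let sink = 0;
         |     for (let i = 0; i < 1e6; i++) {
         |       sink += Math.sqrt(i);
         |     }
         |     globalThis.__sink = sink; // publish so it can't be elided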
        
         | wpollock wrote:
         | You're not wrong, but there are cases where "absolute"
         | performance matters. For example, when your app must meet a
         | performance SLA.
        
       | DJBunnies wrote:
       | The whole JavaScript ecosystem is a mess.
        
         | egberts1 wrote:
         | As one who mapped the evolution of JavaScript and actually
         | benchmarked each of those iterations (company proprietary info),
         | it doesn't get any more accurate than the OP's reiteration of
         | the article's title.
         | 
         | I upvoted that.
         | 
         | Evolution chart:
         | 
         | https://egbert.net/blog/articles/javascript-jit-engines-time...
        
       | pygy_ wrote:
       | I have been sleeping on this for quite a while (long covid is a
       | bitch), but I have built a benchmarking lib that sidesteps quite
       | a few of these problems, by
       | 
       | - running the benchmark in thin slices, interspersed and shuffled,
       | rather than in one big batch per item (which also avoids having
       | one scenario penalized by transient noise)
       | 
       | - displaying graphs that show possible multi-modal distributions
       | when the JIT gets in the way
       | 
       | - varying the lengths of the thin slices between runs to work
       | around the poor timer resolution in browsers
       | 
       | - assigning the results of the benchmark to a global (or a
       | variable in the parent scope, as in the WEB demo below) to avoid
       | dead code elimination
       | 
       | This isn't a panacea, but it is better than the existing
       | solutions as far as I'm aware (see the sketch below).
       | 
       | There are still issues because, sometimes, even if the task order
       | is shuffled for each slice, the literal source order can
       | influence how/if a bit of code is compiled, resulting in
       | unreliable results. The "thin slice" approach can also dilute the
       | GC runtime between scenarios if the amount of garbage isn't
       | identical between scenarios.
       | 
       | I think it is, however, a step in the right direction.
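       | 
       | Schematically, the thin-slice interleaving looks something like
       | this (just the shape of the idea, not bunchmark's actual code):
       | 
       |     // Alternate short bursts of each task instead of one big
       |     // batch per task, so transient noise and JIT phases are
       |     // spread across all scenarios.
       |     function run(tasks, slices = 200, itersPerSlice = 1000) {
       |       const totals = new Map(tasks.map(t => [t.name, 0]));
       |       for (let s = 0; s < slices; s++) {
       |         for (const t of shuffle([...tasks])) {
       |           const start = performance.now();
       |           for (let i = 0; i < itersPerSlice; i++) t.fn();
       |           const elapsed = performance.now() - start;
       |           totals.set(t.name, totals.get(t.name) + elapsed);
       |         }
       |       }
       |       return totals; // total ms per task, equal iteration counts
       |     }
       |     function shuffle(a) { // Fisher-Yates, in place
       |       for (let i = a.length - 1; i > 0; i--) {
       |         const j = Math.floor(Math.random() * (i + 1));
       |         [a[i], a[j]] = [a[j], a[i]];
       |       }
       |       return a;
       |     }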
       | 
       | - CLI runner for NODE:
       | https://github.com/pygy/bunchmark.js/tree/main/packages/cli
       | 
       | - WIP WEB UI:
       | https://flems.io/https://gist.github.com/pygy/3de7a5193989e0...
       | 
       | In both cases, if you've used JSPerf you should feel right at home
       | in the WEB UI. The CLI UI is meant to replicate the WEB UI as
       | closely as possible (see the example file).
        
         | pygy_ wrote:
         | I hadn't run these in a while, but in the current Chrome
         | version, you can clearly see the multi-modality of the results
         | with the dummy Math.random() benchmark.
        
       | sroussey wrote:
       | For the love of god, please do not do this:
       | 
       |     for (let i = 0; i < 1000; i++) {
       |       console.time()
       |       // do some expensive work
       |       console.timeEnd()
       |     }
       | 
       | Take your timing before and after the loop and divide by the
       | count. Too much jitter otherwise.
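       | 
       | Something like this instead (sketch):
       | 
       |     const runs = 1000;
       |     const start = performance.now();
       |     for (let i = 0; i < runs; i++) {
       |       // do some expensive work
       |     }
       |     const perRun = (performance.now() - start) / runs;
       |     console.log(`${perRun} ms per run on average`);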
       | 
       | d8 and node have many options for benchmarking and if you really
       | care, go command line. JSC is what is behind Bun so you can go
       | that direction as well.
       | 
       | And BTW: console.time et al. do a bunch of stuff themselves. You
       | will get the JIT looking to optimize them as well in that loop
       | above, lol.
        
         | igouy wrote:
         | > and divide by the count
         | 
         | Which gives an average rather than a time?
        
           | gmokki wrote:
            | I usually do a
            | 
            |     var innerCount = 2000; // should run about 2 seconds
            |     var best = Double.MAX_VALUE;
            |     for (var i = 0; i < 1000; i++) {
            |       var start = System.currentTimeMillis();
            |       for (var j = 0; j < innerCount; j++) {
            |         benchmarkMethod();
            |       }
            |       best = Math.min(best,
            |           (System.currentTimeMillis() - start)
            |               / (double) innerCount);
            |     }
            | 
            | That way I can both get enough precision from the millisecond
            | resolution and run the whole thing enough times to get the
            | best result without JIT/GC pauses. The result is usually very
            | stable, even when benchmarking calls to a database (running
            | locally).
        
             | igouy wrote:
             | No interest in a more general tool?
             | 
             | https://github.com/sosy-lab/benchexec
        
       | gred wrote:
       | If you find yourself benchmarking JavaScript, you chose the wrong
       | language.
        
       | pizlonator wrote:
       | (I designed JavaScriptCore's optimizing JITs and its garbage
       | collector and a bunch of the runtime. And I often benchmark
       | stuff.)
       | 
       | Here's my advice for how to run benchmarks and be happy with the
       | results.
       | 
       | - Any experiment you perform has the risk of producing an outcome
       | that misleads you. You have to viscerally and spiritually accept
       | this fact if you run any benchmarks. Don't rely on the outcome of
       | a benchmark as if it's some kind of Truth. Even if you do
       | everything right, there's something like a 1/10 risk that you're
       | fooling yourself. This is true for any experiment, not just ones
       | involving JavaScript, or JITs, or benchmarking.
       | 
       | - Benchmark large code. Language implementations (including ahead
       | of time compilers for C!) have a lot of "winning in the average"
       | kind of optimizations that will kick in or not based on
       | heuristics, and those heuristics have broad visibility into large
       | chunks of your code. AOTs get there by looking at the entire
       | compilation unit, or sometimes even your whole program. JITs get
       | to see a random subset of the whole program. So, if you have a
       | small snippet of code then the performance of that snippet will
       | vary wildly depending on how it's used. Therefore, putting some
       | small operation in a loop and seeing how long it runs tells you
       | almost nothing about what will happen when you use that snippet
       | in anger as part of a larger program.
       | 
       | How do you benchmark large code? Build end-to-end benchmarks that
       | measure how your whole application is doing perf-wise. This is
       | sometimes easy (if you're writing a database you can easily
       | benchmark TPS, and then you're running the whole DB impl and not
       | just some small snippet of the DB). This is sometimes very hard
       | (if you're building UX then it can be hard to measure what it
       | means for your UX to be responsive, but it is possible). Then, if
       | you want to know whether some function should be implemented one
       | way or another way, run an A:B test where you benchmark your
       | whole app with one implementation versus the other.
       | 
       | Why is that better? Because then, you're measuring how your
       | snippet of code is performing in the context of how it's used,
       | rather than in isolation. So, your measurement will account for
       | how your choices impact the language implementation's heuristics.
       | 
       | Even then, you might end up fooling yourself, but it's much less
       | likely.
        
         | leeoniya wrote:
         | great points! i do a lot of JS benchmarking + optimization and
         | whole-program measurement is key. sometimes fixing one hotspot
         | changes the whole profile, not just shifts the bottleneck to
         | the next biggest thing in the original profile. GC behaves
         | differently in different JS vms. sometimes if you benchmark
         | something like CSV parsers which can stress the GC,
         | Benchmark.js does a poor job by not letting the GC collect
         | properly between cycles. there's a lengthy discussion about why
         | i use a custom benchmark runner for this purpose [1]. i can
         | recommend js-framework-benchmark [2] as a good example of one
         | that is done well, also WebKit's speedometer [3].
         | 
         | [1] https://github.com/leeoniya/uDSV/issues/2
         | 
         | [2] https://github.com/krausest/js-framework-benchmark
         | 
         | [3] https://github.com/WebKit/Speedometer
        
         | kmiller68 wrote:
         | I completely agree with this advice. Micro-benchmarking can
         | work well as long as you already have an understanding of
         | what's happening behind the scenes. Without that it greatly
         | increases the chance that you'll get information unrelated to
         | how your code would perform in the real world. Even worse, I've
         | found a lot of the performance micro-benchmarking websites can
         | actually induce performance issues. Here's an example of a
         | recent performance bug that appears to have been entirely
         | driven by the website's harness.
         | https://bugs.webkit.org/show_bug.cgi?id=283118
        
         | hinkley wrote:
         | > there's something like a 1/10 risk that you're fooling
         | yourself.
         | 
         | You're being generous or a touch ironic. It's at least 1/10 and
         | probably more like 1/5 on average and 1/3 for people who don't
         | take advice.
         | 
         | Beyond testing changes in a larger test fixture, I also find
         | that sometimes multiplying the call count for the code under
         | examination can help clear things up. Putting a loop in to run
         | the offending code 10 times instead of once is a clearer
         | signal. Of course it still may end up being a false signal.
         | 
         | I like a two-phase approach, where you use a small-scale
         | benchmark while iterating on optimization ideas, then check the
         | larger context once you feel you've made progress, and again
         | before you file a PR.
         | 
         | At the end of the day, eliminating accidental duplication of
         | work is the most reliable form of improvement, and one that
         | current and previous generation analysis tools don't do well.
         | Make your test cases deterministic and look at invocation
         | counts to verify that you expect n calls of a certain shape to
         | call the code in question exactly kn times. Then figure out why
         | it's mn instead. (This is why I say caching is the death of
         | perf analysis. Once it's added this signal disappears)
        
         | aardvark179 wrote:
         | Excellent advice. It's also very important to know what any
         | micro benchmarks you do have are really measuring. I've seen
         | enough that actually measured the time to set up or parse
         | something, because that dominated and wasn't cached correctly.
         | Conversely I've seen cases where the JIT correctly optimised
         | away almost everything because there was a check on the final
         | value.
         | 
         | Oh, and if each op takes under a nanosecond then your benchmark
         | is almost certainly completely broken.
        
       | croes wrote:
       | Do the users care?
       | 
       | I think they are used to waiting because they no longer know the
       | speed of desktop applications.
        
         | leeoniya wrote:
         | it's fun to write "fast" js code and watch people's amazement.
         | as hardware becomes cheaper and faster, devs become lazier and
         | more careless.
         | 
         | it's all fun and games until your battery dies 3 hours too
         | soon.
         | 
         | https://en.m.wikipedia.org/wiki/Jevons_paradox
        
       | CalChris wrote:
       | Laurence Tratt's paper _Virtual machine warmup blows hot and
       | cold_ [1] has been posted several times and never really
       | discussed. It covers this problem for Java VMs and also presents
       | a benchmarking methodology.
       | 
       | [1] https://dl.acm.org/doi/10.1145/3133876
        
       | 1oooqooq wrote:
       | kids dont recall when chrome was cheating left and right to be
       | faster than firefox (after they were honest for a couple of
       | months).
       | 
       | you'd have to run benchmarks for all sorts of little things
       | because no browser would leave things be. If they thought one
       | popular benchmark was using string+string it was all or nothing
       | to optimize that, harming everything else. next week if that
       | benchmark changed to string[].join... you get the idea. your code
       | was all over the place in performance. Flying today, molasses
       | next week... sometimes chrome and ff would switch the
       | optimizations, so you'd serve string+string to one and array.join
       | to the other. sigh.
        
       | evnwashere wrote:
       | That's why i created mitata, it greatly improves on javascript
       | (micro-)benchmarking tooling
       | 
       | it provides a bunch of features to help avoid jit optimization
       | foot-guns during benchmarking and dips into more advanced stuff
       | like hardware cpu counters to see what the end result of the jit
       | is on the cpu
        
       | henning wrote:
       | While there may be challenges, caring about frontend performance
       | is still worth it. When I click the Create button in JIRA and
       | start typing, the text field lags behind my typing. I use a 2019
       | MacBook Pro. Unforgivable. Whether one alternate implementation
       | that lets me type normally is 10% faster than another may be
       | harder to answer. If I measure how bad the UI is
       | and it's actually 60x slower than vanilla JS rather than 70x
       | because of measurement error, the app is still a piece of shit.
        
       | spankalee wrote:
       | My old team at Google created a tool to help do better browser
       | benchmarking called Tachometer:
       | https://github.com/google/tachometer
       | 
       | It tries to deal with the uncertainties of different browsers,
       | JITs, GCs, CPU throttling, varying hardware, etc., several ways:
       | 
       | - Runs benchmarks round-robin to hopefully subject each
       | implementation to varying CPU load and thermal properties evenly.
       | 
       | - It reports the confidence interval for an implementation, not
       | the mean. Doesn't throw out outlier samples.
       | 
       | - For multiple implementations, compares the distributions of
       | samples, de-emphasizing the mean
       | 
       | - For comparisons, reports an NxM difference table, showing how
       | each impl compares to the other.
       | 
       | - Can auto-run until confidence intervals for different
       | implementations no longer overlap, giving high confidence that
       | there is an actual difference.
       | 
       | - Uses WebDriver to run benchmarks in multiple browsers, also
       | round-robin, and compares results.
       | 
       | - Can manage npm dependencies, so you can run the same benchmark
       | with different dependencies and see how different versions change
       | the result.
       | 
       | Lit and Preact use Tachometer to tease out performance changes of
       | PRs, even on unreliable GitHub Action hardware. We needed the
       | advanced statistical comparisons exactly because certain things
       | could be faster or slower in different JIT tiers, different
       | browsers, or different code paths.
       | 
       | We wanted to be able to test changes that might have small but
       | reliable overall perf impact, in the context of a non-micro-
       | benchmark, and get reliable results.
       | 
       | Tachometer is browser-focused, but we made it before there were
       | so many server runtimes. It'd be really interesting to make it
       | run benchmarks against Node, Bun, Deno, etc. too.
        
         | nemomarx wrote:
         | how relevant is browser benchmarking now that chrome owns most
         | of the space?
        
       | austin-cheney wrote:
       | There is a common sentiment I see there that I see regularly
       | repeated in software. Here is my sarcastic take:
       | 
       |  _I hate measuring things because accuracy is hard. I wish I
       | could just make up my own numbers to make myself feel better._
       | 
       | It is surprising to me how many developers cannot measure things,
       | do so incorrectly, and then look for things to blame for their
       | emotional turmoil.
       | 
       | Here is quick guide to solve for this:
       | 
       | 1. Know what you are measuring and what its relevance is to your
       | product. It is never about big or small because numerous small
       | things make big things.
       | 
       | 2. Measuring things means generating numbers and comparing those
       | numbers against other numbers from a different but similar
       | measure. The numbers are meaningless if there is no comparison.
       | 
       | 3. If precision is important, use the high-performance tools
       | provided by the browser and Node for measuring things (see the
       | sketch after this list). You can get nanosecond-level precision
       | and then account for the variance, that plus/minus range, in your
       | results. If you are measuring real-world usage and your numbers
       | get smaller, due to performance refactoring, expect variance to
       | increase. It's ok, I promise.
       | 
       | 4. Measure a whole bunch of different shit. The point of
       | measuring things isn't about speed. It's about identifying bias.
       | The only way to get faster is to know what's really happening and
       | just how off base your assumptions are.
       | 
       | 5. Never ever trust performance indicators from people lacking
       | objectivity. Expect to have your results challenged and be glad
       | when they are. Rest on the strength of your evidence and ease of
       | reproduction that you provide.
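       | 
       | A minimal sketch of the high-resolution timers referred to in
       | point 3 (placeholder workload; Node's process.hrtime.bigint() and
       | the standard performance.now()):
       | 
       |     const doWork = () => {            // placeholder workload
       |       let s = 0;
       |       for (let i = 0; i < 1e5; i++) s += i;
       |       return s;
       |     };
       | 
       |     // Node: nanosecond-resolution monotonic clock (BigInt).
       |     const t0 = process.hrtime.bigint();
       |     doWork();
       |     console.log(`${process.hrtime.bigint() - t0} ns`);
       | 
       |     // Browser and Node: performance.now(), fractional
       |     // milliseconds, possibly clamped by the engine for security.
       |     const start = performance.now();
       |     doWork();
       |     console.log(`${performance.now() - start} ms`);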
        
       | xpl wrote:
       | I once created this tool for benchmarking JS:
       | https://github.com/xpl/what-code-is-faster
       | 
       | It does JIT warmup and ensures that your code doesn't get
       | optimized out (by making it produce a side effect in result).
        
       ___________________________________________________________________
       (page generated 2024-12-24 23:00 UTC)