[HN Gopher] Hyperfine: A command-line benchmarking tool
       ___________________________________________________________________
        
       Hyperfine: A command-line benchmarking tool
        
       Author : hundredwatt
       Score  : 221 points
       Date   : 2024-11-18 21:47 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mmastrac wrote:
       | Hyperfine is a great tool, but when I was using it at Deno to
       | benchmark startup time there was a lot of weirdness around the
       | operating system apparently caching inodes of executables.
       | 
       | If you are trying to shave off sub-20 ms numbers, be aware that
       | you may need to pull tricks, especially on macOS, to get real
       | numbers.
        
         | JackYoustra wrote:
         | I've found pretty good results with the System Trace template
         | in xcode instruments. You can also stack instruments, for
         | example combining the file inspector with a virtual memory
         | inspector.
         | 
         | I've run into some memory corruption with it sometimes, though,
         | so be wary of that. Emerge tools has an alternative for iOS at
         | least, maybe one day they'll port it to mac.
        
           | art049 wrote:
           | I've never tried Xcode Instruments. Is the UX good for this kind
           | of tool?
        
         | sharkdp wrote:
         | Caching is something that you almost always have to be aware of
         | when benchmarking command line applications, even if the
         | application itself has no caching behavior. Please see
         | https://github.com/sharkdp/hyperfine?tab=readme-ov-file#warm...
         | on how to run either warm-cache benchmarks or cold-cache
         | benchmarks.
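         | 
         | As a rough sketch (the `grep` invocation is just a placeholder;
         | the cache-dropping command is Linux-specific and needs root):
         | 
         |     # warm cache: let a few untimed warmup runs populate caches
         |     hyperfine --warmup 3 'grep -r TODO ./src'
         | 
         |     # cold cache: drop the page cache before every timed run
         |     hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
         |         'grep -r TODO ./src'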
        
           | mmastrac wrote:
           | I'm fully aware, but it's not a problem that warmup runs fix.
           | A freshly compiled executable will always benchmark
           | differently than one that has "cooled off" on macOS,
           | regardless of warmup runs.
           | 
           | I've tried to understand what the issue is (played with
           | resigning executables etc) but it's literally something about
           | the inode of the executable itself. Most likely part of the
           | OSX security system.
        
             | renewiltord wrote:
             | Interesting. I've encountered this obviously on first run
             | (because of the security checking it does on novel
             | executables) but didn't realize this expired. Probably
             | because I usually attribute it to a recompilation. Thanks.
        
         | maccard wrote:
         | Not being able to rely on numbers down to 20 ms is pretty poor.
         | That's longer than a frame in a video game.
         | 
         | Windows has microsecond precision counters (see
         | QueryPerformanceCounter and friends)
        
       | 7e wrote:
       | What database product does the community commonly send benchmark
       | results to? This tool is great, but I'd love to analyze results
       | relationally.
        
         | rmorey wrote:
         | Something like Geekbench for CLI tools would be awesome
        
       | forrestthewoods wrote:
       | Hyperfine is hyper frustrating because it only works with really
       | really fine microsecond level benchmarks. Once you get into the
       | millisecond range it's worthless.
        
         | anotherhue wrote:
         | It spawns a new process each time right? I would think that
         | would but a cap on how accurate it can get.
         | 
         | For my purposes I use it all the time though, quick and easy
         | sanity-check.
        
           | forrestthewoods wrote:
           | The issue is it runs a kajillion tests to try and be
           | "statistical". But there's no good way to say "just run it
           | for 5 seconds and give me the best answer you can". It's very
           | much designed for nanosecond to low microsecond benchmarks.
           | Trying to fight this is like trying to smash a square peg
           | through a round hole.
        
             | gforce_de wrote:
             | At least it gives some numbers and points in a direction:
             | 
             |     $ hyperfine --warmup 3 './hello-world-bin-sh.sh' './hello-world-env-python3.py'
             |     Benchmark 1: ./hello-world-bin-sh.sh
             |       Time (mean ± σ):       1.3 ms ±   0.4 ms    [User: 1.0 ms, System: 0.5 ms]
             |       ...
             |     Benchmark 2: ./hello-world-env-python3.py
             |       Time (mean ± σ):      43.1 ms ±   1.4 ms    [User: 33.6 ms, System: 8.4 ms]
             |       ...
        
             | PhilipRoman wrote:
             | I disagree that it is designed for nano/micro benchmarks.
             | If you want that level of detail, you need to stay within a
             | single process, pinned to a core which is isolated from
             | scheduler. At least I found it almost impossible to
             | benchmark assembly routines with it.
        
             | sharkdp wrote:
             | > The issue is it runs a kajillion tests to try and be
             | "statistical".
             | 
             | If you see any reason for putting "statistical" in quotes,
             | please let us know. hyperfine does not run a lot of tests,
             | but it does try to find outliers in your measurements. This
             | is really valuable in some cases. For example: we can
             | detect when the first run of your program takes much longer
             | than the rest of the runs. We can then show you a warning
             | to let you know that you probably want to either use some
             | warmup runs, or a "--prepare" command to clean (OS) caches
             | if you want a cold-cache benchmark.
             | 
             | > But there's no good way to say "just run it for 5 seconds
             | and give me the best answer you can".
             | 
             | What is the "best answer you can"?
             | 
             | > It's very much designed for nanosecond to low microsecond
             | benchmarks.
             | 
             | Absolutely not. With hyperfine, you can not measure
             | execution times in the "low microsecond" range, let alone
             | nanosecond range. See also my other comment.
        
           | oguz-ismail wrote:
           | It spawns a new _shell_ for each run and subtracts the
           | average shell startup time from final results. Too much noise
        
             | PhilipRoman wrote:
             | The shell can be disabled, leaving just fork+exec
        
               | sharkdp wrote:
               | Yes. If you don't make use of shell builtins/syntax, you
               | can use hyperfine's `--shell=none`/`-N` option to disable
               | the intermediate shell.
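               | 
               | A minimal sketch of the difference (the `fd` call is just
               | an example command):
               | 
               |     # default: the command is run via an intermediate shell
               |     hyperfine 'fd --extension rs'
               | 
               |     # -N / --shell=none: fork+exec the command directly,
               |     # avoiding shell startup/parsing overhead
               |     hyperfine -N 'fd --extension rs'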
        
               | oguz-ismail wrote:
               | You still need to quote the command though. `hyperfine -N
               | ls "$dir"' won't work, you need `hyperfine -N "ls
               | ${dir@Q}"' or something. It'd be better if you could
               | specify commands like with `find -exec'.
        
               | PhilipRoman wrote:
               | Oh that sucks, I really hate when programs impose useless
               | shell parsing instead of letting the user give an
               | argument vector natively.
        
               | sharkdp wrote:
               | I don't think it's useless. You can use hyperfine to run
               | multiple benchmarks at the same time, to get a comparison
               | between multiple tools. So if you want it to work without
               | quotes, you need to (1) come up with a way to separate
               | commands and (2) come up with a way to distinguish
               | hyperfine arguments from command arguments. It's doable,
               | but it's also not a great UX if you have to write
               | something like
               | 
               |     hyperfine -N -- ls "$dir" \; my_ls "$dir"
        
               | oguz-ismail wrote:
               | > not a great UX
               | 
               | Looks fine to me. Obviously it's too late to undo that
               | mistake, but a new flag to enable new behavior wouldn't
               | hurt anyone.
        
         | sharkdp wrote:
         | That doesn't make a lot of sense. It's more like the opposite
         | of what you are saying. The precision of hyperfine is typically
         | in the single-digit millisecond range. Maybe just below 1 ms if
         | you take special care to run the benchmark on a quiet system.
         | Everything _below_ that (microsecond or nanosecond range) is
         | something that you need to address with other forms of
         | benchmarking.
         | 
         | But for everything in the right range (milliseconds, seconds,
         | minutes or above), hyperfine is well suited.
        
           | forrestthewoods wrote:
           | No it's not.
           | 
           | Back in the day my goal for Advent of Code was to run all
           | solutions in under 1 second total. Hyperfine would take like
           | 30 minutes to benchmark a 1 second runtime.
           | 
           | It was hyper frustrating. I could not find a good way to get
           | Hyperfine to do what I wanted.
        
             | sharkdp wrote:
             | If that's the case, I would consider it a bug. Please feel
             | free to report it. In general, hyperfine should not take
             | longer than ~3 seconds, unless the command itself takes >
             | 300 ms second to run. In the latter case, we do a minimum
             | of 10 runs by default. So if your program takes 3 min for a
             | single iteration, it would take 30 min by default -- yes.
             | But this can be controlled using the `-m`/`--min-runs`
             | option. You can also specify the exact amount of runs using
             | `-r`/`--runs`, if you prefer that.
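             | 
             | For example (sketch; `./solve_all` stands in for the Advent
             | of Code runner):
             | 
             |     # perform exactly 3 runs instead of the default minimum of 10
             |     hyperfine --runs 3 './solve_all'
             | 
             |     # or just lower the minimum and let hyperfine decide
             |     hyperfine --min-runs 5 './solve_all'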
             | 
             | > I could not find a good way to get Hyperfine to do what I
             | wanted
             | 
             | This is all documented here:
             | https://github.com/sharkdp/hyperfine/tree/master?tab=readme-...
             | under "Basic benchmarks". The options to control the number
             | of runs are also listed
             | in `hyperfine --help` and in the man page. Please let us
             | know if you think we can improve the documentation /
             | discovery of those options.
        
             | fwip wrote:
             | I've been using it for about four or five years, and never
             | experienced this behavior.
             | 
             | Current defaults: "By default, it will perform at least 10
             | benchmarking runs and measure for at least 3 seconds." If
             | your program takes 1s to run, it should take 10 seconds to
             | benchmark.
             | 
             | Is it possible that your program was waiting for input that
             | never came? One "gotcha" is that it expects each argument
             | to be a full program, so if you ran `hyperfine ./a.out
             | input.txt`, it will first bench a.out with no args, then
             | try to bench input.txt (which will fail). If a.out reads
             | from stdin when no argument is given, then it would hang
             | forever, and I can see why you'd give up after a half hour.
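             | 
             | In other words (using the placeholder names from above):
             | 
             |     # two separate benchmarks: `./a.out` and `input.txt`
             |     hyperfine ./a.out input.txt
             | 
             |     # one benchmark of `./a.out input.txt`
             |     hyperfine './a.out input.txt'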
        
               | sharkdp wrote:
               | > Is it possible that your program was waiting for input
               | that never came?
               | 
               | We do close stdin to prevent this. So you can benchmark
               | `cat`, for example, and it works just fine.
        
               | fwip wrote:
               | Oh, my bad! Thank you for the correction, and for all
               | your work making hyperfine.
        
       | usrme wrote:
       | I've also had a good experience using the 'perf'[^1] tools when I
       | don't want to install 'hyperfine'. Shameless plug for a small
       | blog post about it, as I don't think it is that well known:
       | https://usrme.xyz/tils/perf-is-more-robust-for-repeated-timi....
       | 
       | ---
       | 
       | [^1]: https://www.mankier.com/1/perf
        
         | vdm wrote:
         | I too have scripted time(1) in a loop badly. perf stat is more
         | likely to be already installed than hyperfine. Thank you for
         | sharing!
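         | 
         | For reference, the repeated-timing equivalent looks roughly
         | like this (`./my-tool` is a placeholder):
         | 
         |     # run the command 10 times and report averaged stats with variance
         |     perf stat -r 10 ./my-tool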
        
         | CathalMullan wrote:
         | There's also 'poop', which is a nice middle-ground between
         | 'hyperfine' and 'perf'. https://github.com/andrewrk/poop
        
           | llimllib wrote:
           | Worth mentioning that it's Linux-only.
        
       | mosselman wrote:
       | Hyperfine is great. I use it sometimes for some quick web page
       | benchmarks:
       | 
       | https://abuisman.com/posts/developer-tools/quick-page-benchm...
       | 
       | As mentioned elsewhere in the thread, it is not the best approach
       | when you want to chase single-millisecond optimisations, since
       | there is a lot of overhead (especially the way I demonstrate
       | here), but it works very well for some sanity checks.
        
         | llimllib wrote:
         | I find k6 a lot nicer for HTTP benching, and no slower to set
         | up than hyperfine (which I love for CLI benching):
         | https://k6.io/
        
           | jiehong wrote:
           | Could hyperfine running curl be an alternative?
        
             | mosselman wrote:
             | That is what I do in my blog post.
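             | 
             | Roughly this kind of invocation (the URL is a placeholder,
             | and it measures curl plus the network round trip, not just
             | the server):
             | 
             |     hyperfine --warmup 3 'curl -s -o /dev/null https://example.com/'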
        
         | Sesse__ wrote:
         | > Hyperfine is great.
         | 
         | Is it, though?
         | 
       | What I would expect a system like this to have, at a minimum:
       | 
       |   * Robust statistics with p-values (not just min/max,
       |     compensation for multiple hypotheses, no Gaussian assumptions)
       |   * Multiple stopping points depending on said statistics.
       |   * Automatic isolation to the greatest extent possible (given
       |     appropriate permissions)
       |   * Interleaved execution, in case something external changes
       |     mid-way.
         | 
         | I don't see any of this in hyperfine. It just... runs things N
         | times and then does a naive average/min/max? At that rate, one
         | could just as well use a shell script and eyeball the results.
        
           | bee_rider wrote:
           | What do you suggest? Those sound like great features.
        
             | Sesse__ wrote:
             | I've only seen such things in internal tools so far,
             | unfortunately, so if you see anything in public, please
             | tell me :-) I'm just confused why everyone thinks
             | hyperfine is so awesome, when it does not meet what I'd
             | consider a fairly low bar for benchmarking tools? ("Best
             | publicly available" != "great", in my book.)
        
               | sharkdp wrote:
               | > "Best publicly available" != "great"
               | 
               | Of course. But it is free and open source. And everyone
               | is invited to make it better.
        
           | sharkdp wrote:
           | > Robust statistics with p-values (not just min/max,
           | compensation for multiple hypotheses, no Gaussian
           | assumptions)
           | 
           | This is not included in the core of hyperfine, but we do have
           | scripts to compute "advanced" statistics, and to perform
           | t-tests here:
           | https://github.com/sharkdp/hyperfine/tree/master/scripts
           | 
           | Please feel free to comment here if you think it should be
           | included in hyperfine itself:
           | https://github.com/sharkdp/hyperfine/issues/523
           | 
           | > Automatic isolation to the greatest extent possible (given
           | appropriate permissions)
           | 
           | This sounds interesting. Please feel free to open a ticket if
           | you have any ideas.
           | 
           | > Interleaved execution, in case something external changes
           | mid-way.
           | 
           | Please see the discussion here:
           | https://github.com/sharkdp/hyperfine/issues/21
           | 
           | > It just... runs things N times and then does a naive
           | average/min/max?
           | 
           | While there is nothing wrong with computing average/min/max,
           | this is not all hyperfine does. We also compute modified
           | Z-scores to detect outliers. We use that to issue warnings,
           | if we think the mean value is influenced by them. We also
           | warn if the first run of a command took significantly longer
           | than the rest of the runs and suggest counter-measures.
           | 
           | Depending on the benchmark I do, I tend to look at either the
           | `min` or the `mean`. If I need something more fine-grained, I
           | export the results and use the scripts referenced above.
           | 
           | > At that rate, one could just as well use a shell script and
           | eyeball the results.
           | 
           | Statistical analysis (which you can consider to be basic) is
           | just one reason why I wrote hyperfine. The other reason is
           | that I wanted to make benchmarking easy to use. I use warmup
           | runs, preparation commands and parametrized benchmarks all
           | the time. I also frequently use the Markdown export or the
           | JSON export to generate graphs or histograms. This is my
           | personal experience. If you are not interested in all of
           | these features, you can obviously "just as well use a shell
           | script".
        
             | Sesse__ wrote:
             | > This is not included in the core of hyperfine, but we do
             | have scripts to compute "advanced" statistics, and to
             | perform t-tests here:
             | https://github.com/sharkdp/hyperfine/tree/master/scripts
             | 
             | t-tests run afoul of the "no Gaussian assumptions", though.
             | Distributions arising from benchmarking frequently have
             | various forms of skew, which mess up t-tests and give
             | artificially narrow confidence intervals.
             | 
             | (I'll gladly give you credit for your outlier detection,
             | though!)
             | 
             | >> Automatic isolation to the greatest extent possible
             | >> (given appropriate permissions)
             | 
             | > This sounds interesting. Please feel free to open a
             | > ticket if you have any ideas.
             | 
             | Off the top of my head, some option that would:
             | 
             |   * Bind to isolated CPUs, if booted with them (isolcpus=)
             |   * Bind to a consistent set of cores/hyperthreads (the
             |     scheduler frequently sabotages benchmarking, especially
             |     if your cores have very different maximum frequencies)
             |   * Warn if thermal throttling is detected during the run
             |   * Warn if an inappropriate CPU governor is enabled
             |   * Lock the program into RAM (probably hard to do without
             |     some sort of help from the program)
             |   * Enable realtime priority if available (e.g., if
             |     isolcpus= is not enabled, or you're not on Linux)
             | 
             | Of course, sometimes you would _want_ to benchmark some of
             | these effects, and that's fine. But most people probably
             | won't, and won't know that they exist. I may easily have
             | forgotten some.
             | 
             | On the flip side (making things more random as opposed to
             | less), something that randomizes the initial stack pointer
             | would be nice, as I've sometimes seen this go really,
             | really wrong (renaming a binary from foo to foo_new made it
             | run >1% slower!).
        
               | sharkdp wrote:
               | > On the flip side (making things more random as opposed
               | to less), something that randomizes the initial stack
               | pointer would be nice, as I've sometimes seen this go
               | really, really wrong (renaming a binary from foo to
               | foo_new made it run >1% slower!).
               | 
               | This is something we do already. We set a
               | `HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET` environment
               | variable with a random-length value:
               | https://github.com/sharkdp/hyperfine/blob/87d77c861f1b6c761a...
        
           | renewiltord wrote:
           | Personally, I'm all about the UNIX philosophy of doing one
           | thing and doing it well. All I want is for the process to be
           | invoked k times, with warmup etc. If I want additional stats,
           | they're easy to calculate: I just `--export-json`, and once
           | it's in a dataframe I can do what I want with it.
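           | 
           | E.g. something like this before it goes into a dataframe
           | (`my_command` is a placeholder; the field names come from
           | hyperfine's JSON output):
           | 
           |     hyperfine --export-json results.json 'my_command'
           |     jq '.results[] | {command, mean, stddev}' results.json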
        
       | shawndavidson7 wrote:
       | "Hyperfine seems like an incredibly useful tool for anyone
       | working with command-line utilities. The ability to benchmark
       | processes straightforwardly is vital for optimizing performance.
       | I'm particularly impressed with how simple it is to use compared
       | to other benchmarking tools. I'd love to see more examples of how
       | Hyperfine can be integrated into different workflows, especially
       | for large-scale applications.
       | 
       | https://www.osplabs.com/
        
       | edwardzcn wrote:
       | Hyperfine is great! I remember learning about it when comparing
       | functions with/without tail recursion (not sure if it was from
       | the Go reference or the Rust reference). It provides simple
       | configuration for unit tests. But I have not tried it on a DBMS
       | (e.g. with sysbench). Has anyone given it a try?
        
       | smartmic wrote:
       | A capable alternative based on "boring, old" technology is
       | multitime [1].
       | 
       | Back when I needed it, multitime could report peak memory usage,
       | which hyperfine was not able to show. Maybe that has changed by
       | now.
       | 
       | [1] https://tratt.net/laurie/src/multitime/
        
       | accelbred wrote:
       | Hyperfine is a really useful tool.
       | 
       | Weirdest thing I've used it for is comparing I/O throughput on
       | various disks.
        
       | ratrocket wrote:
       | Perhaps interesting (for some) to note that hyperfine is from the
       | same author as at least a few other "ne{w,xt} generation" command
       | line tools (that could maybe be seen as part of "rewrite it in
       | Rust", but I don't want to paint the author with a brush they
       | disagree with!!): fd (find alternative;
       | https://github.com/sharkdp/fd), bat ("supercharged version of the
       | cat command"; https://github.com/sharkdp/bat), and hexyl (hex
       | viewer; https://github.com/sharkdp/hexyl). (And certainly others
       | I've missed!)
       | 
       | Pointing this out because I myself appreciate comments that do
       | this.
       | 
       | For myself, `fd` is the one most incorporated into my own
       | "toolbox" -- used it this morning prior to seeing this thread on
       | hyperfine! So, thanks for all that, sharkdp if you're reading!
       | 
       | Ok, end OT-ness.
        
         | varenc wrote:
         | ++ to `fd`
         | 
         | It's absolutely my preferred `find` replacement. Its CLI
         | interface just clicks for me and I can quickly express my
         | desires. Quite unlike `find`. `fd` is one of the first packages
         | I install on a new system.
        
           | ratrocket wrote:
           | The "funny" thing for me about `fd` is that the set of
           | operations I use for `find` are very hard-wired into my
           | muscle memory from using it for 20+ years, so when I reach
           | for `fd` I often have to reference the man page! I'm getting
           | a little better from more exposure, but it's just different
           | enough from `find` to create a bit of an uncanny valley
           | effect (I think that's the right use of the term...).
           | 
           | Even with that I reach for `fd` for some of its quality-of-
           | life features: respecting .gitignore, its speed, regex-
           | ability. (Though not its choices with color; I am a pretty
           | staunch "--color never" person, for better or worse!)
           | 
           | Anyway, that actually points to another good thing about
           | sharkdp's tools: they have good man pages!!
        
       ___________________________________________________________________
       (page generated 2024-11-19 23:01 UTC)