https://lemire.me/blog/2023/04/06/are-your-memory-bound-benchmarking-timings-normally-distributed/

Are your memory-bound benchmarking timings normally distributed?

When optimizing software, we routinely measure the time that a given function or task takes. The typical assumption is that the measurements follow a normal distribution, and that we should therefore report the average time. I believe that this hidden assumption (normality) is frequently violated in software benchmarks. I recently gave a talk on precise and accurate benchmarking (video, slides) at a "Benchmarking in the Data Center" workshop where I made this point in detail.

Why should it matter? If you have a normal distribution, you can mathematically increase the accuracy of the measure by taking more samples: your error should go down as the square root of the number of measures. If you have ever tried to increase the number of samples in the hope of getting a more accurate result, you may have been severely disappointed. That is because the timings often more closely resemble a log-normal distribution:

[Figure: a log-normal distribution]

A log-normal distribution is asymmetrical: the mean is relatively close to the minimum, and there is a long tail. As you take more and more samples, you may find more and more large values. You can often show that you do not have a normal distribution because you find 4-sigma, 5-sigma or even 13-sigma events: you measure values that are far above the average relative to your estimated standard deviation. In a normal distribution, it is all but impossible to land many standard deviations away from the mean; with a log-normal distribution, it happens much more easily. Of course, real data is messy. I am not claiming that your timings precisely follow a log-normal distribution. It is merely a model.
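To make the sigma argument concrete, here is a small simulation (my own illustration, not code from the talk; the distribution parameters are arbitrary). It draws many samples from a normal and from a log-normal distribution and reports how many standard deviations separate the maximum from the mean in each case; the log-normal maximum routinely lands far more sigmas above the mean.

```cpp
// Minimal sketch (illustration only, not code from the talk): compare how many
// standard deviations separate the maximum sample from the mean, for a
// normal and a log-normal distribution. Parameters are arbitrary.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// (max - mean) / standard deviation for a set of samples
double max_sigmas(const std::vector<double> &v) {
  double mean = 0;
  for (double x : v) { mean += x; }
  mean /= v.size();
  double var = 0;
  for (double x : v) { var += (x - mean) * (x - mean); }
  double sd = std::sqrt(var / (v.size() - 1));
  return (*std::max_element(v.begin(), v.end()) - mean) / sd;
}

int main() {
  std::mt19937_64 rng(1234);
  std::normal_distribution<double> normal(1.0, 0.1);        // symmetric model
  std::lognormal_distribution<double> lognormal(0.0, 0.5);  // long right tail
  std::vector<double> a, b;
  for (size_t i = 0; i < 100000; i++) {
    a.push_back(normal(rng));
    b.push_back(lognormal(rng));
  }
  printf("normal    : max is %.1f sigmas above the mean\n", max_sigmas(a));
  printf("log-normal: max is %.1f sigmas above the mean\n", max_sigmas(b));
}
```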
Nevertheless, it suggests that reporting the average and the standard error is inadequate. I like to measure the minimum and the average, and then use the distance between the average and the minimum as an easy-to-measure error metric: if the average is far from the minimum, then your real-world performance could differ drastically from the minimum.

The minimum is easier to measure accurately than the average. In practice, I routinely achieve a precision of 1% or better: if I rerun the same benchmark another day on the same machine, I can be almost certain that the result won't vary by much more than 1%. The average is somewhat more difficult to nail down. You can verify this with a model: if you generate log-normally distributed samples, you will find it easier to determine the minimum than the average with high accuracy (a simulation sketch is given below, after the conclusion).

For my talk, I used a compute-bound routine: the performance of my function was not bound by the speed of the RAM, the network or the disk. I took data that was in CPU cache and I generated more data in CPU cache. For these types of problems, I find that timings often resemble a log-normal distribution.

What about other types of tasks? What about memory-bound tasks? I took a memory benchmark that I like to use. It consists of a large array spanning hundreds of megabytes, and the software must jump from location to location in it, unable to predict the next jump before completing the current read. It is often called a "pointer chasing" routine. I can interleave several pointer-chasing routines so that I have several loads in flight at any given time: I call the number of interleaved routines the number of lanes. The more lanes I have, the more "bandwidth limited" I become; the fewer lanes I have, the more "memory latency" bound I become. I am never compute bound in these tests: the number of instructions retired per cycle is always low. (A sketch of such a routine is given below, after the conclusion.)

For each fixed number of lanes, I run the benchmark 30 times on an Amazon Intel c6i node running Ubuntu 22. The elapsed time varies from over 3 seconds per run (for one lane) to about 1/6 of a second (for 30 lanes). I compute the mean, the minimum and the maximum, and I estimate the standard deviation. I then compute the bottom sigma as the gap (in number of standard deviations) between the minimum and the average, and the top sigma as the gap between the average and the maximum. If the distribution were normal, I should have roughly the same number of sigmas on either side (min and max), and I should not exceed 3 sigmas. I find that one side (the maximum) easily exceeds 3 sigmas, so it is not a normal distribution. It is also clearly not symmetrical.

[Figure: sigma gaps between the minimum/maximum and the mean, per number of lanes]

Yet my measures are relatively precise. Looking at the relative distance between the minimum and the mean, I get a tiny margin, often under 1%.

[Figure: relative distance between the minimum and the mean, per number of lanes]

Interestingly, the more lanes I have, the more accurate the results. Intuitively, this is not entirely surprising: more lanes break up the data dependency, so one slow step has less impact on the whole processing.

Thus, for a wide range of performance-related timings, you should not assume that you have a normal distribution without checking first! Computing the distance between the maximum and the mean divided by the standard deviation is a useful indicator. I personally find that a log-normal distribution is a better model for my timings.
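As a sanity check of the claim that the minimum is easier to pin down than the average, here is a small simulation sketch. It is my own illustration, not the author's model: I assume that each timing is a hard floor plus log-normally distributed overhead, with arbitrary parameters, and I compare the run-to-run variability of the minimum and of the mean over repeated 30-sample benchmarks.

```cpp
// Minimal sketch (illustration only, not the author's model): simulate repeated
// 30-sample benchmarks where each timing is a hard floor (1.0) plus
// log-normally distributed overhead (assumed, arbitrary parameters), and
// compare the run-to-run variability of the minimum versus the mean.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// coefficient of variation: standard deviation divided by mean
double cv(const std::vector<double> &v) {
  double mean = 0;
  for (double x : v) { mean += x; }
  mean /= v.size();
  double var = 0;
  for (double x : v) { var += (x - mean) * (x - mean); }
  return std::sqrt(var / (v.size() - 1)) / mean;
}

int main() {
  std::mt19937_64 rng(42);
  std::lognormal_distribution<double> overhead(0.0, 1.0); // long right tail
  const size_t samples_per_run = 30, runs = 1000;
  std::vector<double> mins, means;
  for (size_t r = 0; r < runs; r++) {
    double mn = 1e300, sum = 0;
    for (size_t i = 0; i < samples_per_run; i++) {
      double t = 1.0 + 0.1 * overhead(rng); // assumed model of one timing
      if (t < mn) { mn = t; }
      sum += t;
    }
    mins.push_back(mn);
    means.push_back(sum / samples_per_run);
  }
  printf("run-to-run variation of the minimum: %.2f%%\n", 100 * cv(mins));
  printf("run-to-run variation of the mean   : %.2f%%\n", 100 * cv(means));
}
```

Under this model, the minimum settles to within a fraction of a percent from one simulated run to the next, while the mean moves by several percent because of the long right tail.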
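The post describes the pointer-chasing benchmark only in prose. The sketch below is my reconstruction, not the author's code: it builds a single random cycle over a 256 MB array using Sattolo's algorithm, fixes the total number of loads, and splits them across a configurable number of independent lanes so that several cache misses can be in flight at once. Array size, load count and seed are arbitrary choices.

```cpp
// Minimal sketch (reconstruction, not the author's benchmark code): a
// pointer-chasing benchmark over a large array, with a configurable number
// of interleaved "lanes" so that several loads can be in flight at once.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
  const size_t n = size_t(1) << 26; // 64M entries of 4 bytes = 256 MB
  std::vector<uint32_t> next(n);
  std::iota(next.begin(), next.end(), 0);
  // Sattolo's algorithm: turn `next` into a single random cycle so that
  // every entry is visited and the next address is unpredictable.
  std::mt19937_64 rng(42);
  for (size_t i = n - 1; i > 0; i--) {
    size_t j = std::uniform_int_distribution<size_t>(0, i - 1)(rng);
    std::swap(next[i], next[j]);
  }
  const size_t total_loads = 30'000'000; // fixed work, split across lanes
  for (size_t lanes = 1; lanes <= 30; lanes++) {
    const size_t steps = total_loads / lanes;
    std::vector<uint32_t> cursor(lanes);
    // start the cursors at distinct, spread-out positions of the cycle
    for (size_t l = 0; l < lanes; l++) { cursor[l] = uint32_t(l * (n / lanes)); }
    auto start = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; s++) {
      // the loads below are independent across lanes, so the processor
      // can issue several of them concurrently
      for (size_t l = 0; l < lanes; l++) { cursor[l] = next[cursor[l]]; }
    }
    auto stop = std::chrono::steady_clock::now();
    uint64_t sum = 0; // checksum prevents the loads from being optimized away
    for (size_t l = 0; l < lanes; l++) { sum += cursor[l]; }
    double seconds = std::chrono::duration<double>(stop - start).count();
    printf("%2zu lanes: %.3f s (checksum %llu)\n", lanes, seconds,
           (unsigned long long)sum);
  }
}
```

With one lane, every load depends on the previous one (latency bound); with many lanes, the processor can overlap the misses (bandwidth bound). Compile with optimizations (e.g. -O2) for meaningful timings.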
Published on April 6, 2023 by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).