https://lemire.me/blog/2019/01/29/data-scientists-need-to-learn-about-significant-digits/

Daniel Lemire's blog. Daniel Lemire is a computer science professor at the Data Science Laboratory of the Université du Québec (TELUQ) in Montreal. His research is focused on software performance and data engineering.

Data scientists need to learn about significant digits

Suppose that you classify people on income or gender. Your boss asks you about the precision of your model. Which answer do you give?
Whatever your software tells you (e.g., 87.14234%), or a number with a small, fixed number of significant digits (e.g., 87%)? The latter is the right answer in almost all instances. And the difference matters:

1. There is a general principle at play when communicating with human beings: you should give just the relevant information, nothing more. Most human beings are happy with a 1% error margin. There are, of course, exceptions. High-energy physicists might need the mass of a particle down to 6 significant digits. However, if you are doing data science or statistics, it is highly unlikely that people will care about more than two significant digits.

2. Overly precise numbers are often misleading because your actual accuracy is much lower. Wikipedia tells us that the number of significant digits implies some knowledge about your uncertainty:

   "Uncertainty may be implied by the last significant figure if it is not explicitly expressed. The implied uncertainty is ± half of the minimum scale at the last significant figure position. For example, if the mass of an object is reported as 3.78 kg without mentioning uncertainty, then a ±0.005 kg measurement uncertainty may be implied."

So if you give 4 digits, you are telling us that you know the true value very precisely. Yes, you have 10,000 samples and properly classified 5,124 of them, so your mathematical precision is 0.5124. But if you stop there, you show that you have not given much thought to your error margin.

First of all, you are probably working from a sample. If someone else redid your work, they might have a different sample. Even if they used exactly the same algorithm you have been using, implementation matters. Small things, like how your records are ordered, can change the results. Moreover, most software is not truly deterministic: even if you were to run exactly the same software twice on the same data, you probably would not get the same answers.
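Both points can be illustrated with a short Python sketch. The `round_sig` helper below is my own illustration, not code from the post: it rounds a measure to a chosen number of significant digits, and the summation experiment shows how something as small as record ordering perturbs the trailing digits of a result.

```python
import random
from math import floor, log10

def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, -int(floor(log10(abs(x)))) + (digits - 1))

# 5,124 correct answers out of 10,000 samples, as in the example above:
precision = 5124 / 10000
print(round_sig(precision, 2))   # 0.51, i.e., "51%"

# Record ordering matters: summing the same floats in a different
# order typically changes the last few digits of the result.
values = [random.random() for _ in range(100_000)]
total_forward = sum(values)
total_sorted = sum(sorted(values))
print(total_forward, total_sorted)  # agree to a few digits, rarely to all
```

The two totals agree to roughly the number of digits you would sensibly report, which is precisely the point: the digits beyond that are an artifact of the computation, not information.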
Software needs to break ties, and it often does so arbitrarily or randomly. Some algorithms involve sampling or other randomization. Cross-validation is often randomized.

I am not advocating that you go as far as reporting exact error margins for each and every measure: that gets cumbersome for both the reader and the author. Nor am I saying that you should never use many significant digits. However, if you write a report or a research paper in which you report measures, like precision or timings, and you have not given any thought to significant digits, you are doing it wrong. You must choose the number of significant digits deliberately.

There are objections to my view:

* "I have been using 6 significant digits for years and nobody ever objected." That is true. There are entire communities that have never heard of the concept of a significant digit. But that is not an excuse.

* "It sounds more serious to offer more precision; this way people know that I did not make it up." It may be true that some people are easily impressed by very precise answers, but serious people will not be so easily fooled, and non-specialists will be turned off by the excessive precision.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ). Posted on January 29, 2019.

2 thoughts on "Data scientists need to learn about significant digits"

1. John the Scott says (January 30, 2019 at 4:44 pm):

   most excellent post. i recommend gustafson's book for another angle on digital error. https://www.amazon.com/End-Error-Computing-Chapman-Computational/dp/1482239868

2. ttoinou says (January 31, 2019 at 6:25 am):

   Of course you're right.
   If you're exchanging information with scientists or engineers, you could also provide, with every figure F, its ±P "precision" (a Y% chance of lying in the Gaussian centered on F with k(Y)·P standard deviation, k to be computed from Y). That way, if the person you're giving the information to needs to compute a new statistic, they can combine the Gaussian models and obtain a new (F' ± P').