https://lemire.me/blog/2019/01/29/data-scientists-need-to-learn-about-significant-digits/

Daniel Lemire's blog. Daniel Lemire is a computer science professor at the Data Science Laboratory of the Université du Québec (TELUQ) in Montreal. His research is focused on software performance and data engineering.

Data scientists need to learn about significant digits

Suppose that you classify people on income or gender. Your boss asks you about the precision of your model. Which answer do you give?
Whatever your software tells you (e.g., 87.14234%), or a number with a small, fixed number of significant digits (e.g., 87%)? The latter is the right answer in almost all instances. And the difference matters:

1. There is a general principle at play when communicating with human beings: you should give just the relevant information, nothing more. Most human beings are happy with a 1% error margin. There are, of course, exceptions. High-energy physicists might need the mass of a particle down to 6 significant digits. However, if you are doing data science or statistics, it is highly unlikely that people will care about more than two significant digits.

2. Overly precise numbers are often misleading because your actual accuracy is much lower. Wikipedia tells us that the number of significant digits implies some knowledge about your uncertainty:

   "Uncertainty may be implied by the last significant figure if it is not explicitly expressed. The implied uncertainty is ± half of the minimum scale at the last significant figure position. For example, if the mass of an object is reported as 3.78 kg without mentioning uncertainty, then a ±0.005 kg measurement uncertainty may be implied."

So if you give 4 digits, you are telling us that you know the true value very precisely. Yes, you have 10,000 samples and properly classified 5,124 of them, so your mathematical precision is 0.5124. But if you stop there, you show that you have not given much thought to your error margin.

First of all, you are probably working from a sample. If someone else redid your work, they might have a different sample. Even if they used exactly the same algorithm you have been using, implementation matters. Small things, like how your records are ordered, can change the results. Moreover, most software is not truly deterministic: even if you were to run exactly the same software twice on the same data, you probably would not get the same answers.
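Both points can be illustrated with a short Python sketch. The `round_sig` helper below is my own illustration, not code from the post: it rounds a measure to a chosen number of significant digits, and the summation experiment shows how something as small as record ordering perturbs the trailing digits of a result.

```python
import random
from math import floor, log10

def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, -int(floor(log10(abs(x)))) + (digits - 1))

# 5,124 correct answers out of 10,000 samples, as in the example above:
precision = 5124 / 10000
print(round_sig(precision, 2))   # 0.51, i.e., "51%"

# Record ordering matters: summing the same floats in a different
# order typically changes the last few digits of the result.
values = [random.random() for _ in range(100_000)]
total_forward = sum(values)
total_sorted = sum(sorted(values))
print(total_forward, total_sorted)  # agree to a few digits, rarely to all
```

The two totals agree to roughly the number of digits you would sensibly report, which is precisely the point: the digits beyond that are an artifact of the computation, not information.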
Software needs to break ties, and it often does so arbitrarily or randomly. Some algorithms involve sampling or other randomization. Cross-validation is often randomized.

I am not advocating that you go as far as reporting exact error margins for each and every measure: that gets cumbersome for both the reader and the author. Nor am I saying that you should never use many significant digits. However, if you write a report or a research paper in which you report measures, like precision or timings, and you have not given any thought to significant digits, you are doing it wrong. You must choose the number of significant digits deliberately.

There are objections to my view:

* "I have been using 6 significant digits for years and nobody ever objected." That is true. There are entire communities that have never heard of the concept of a significant digit. But that is not an excuse.

* "It sounds more serious to offer more precision; this way people know that I did not make it up." It may be true that some people are easily impressed by very precise answers, but serious people will not be so easily fooled, and non-specialists will be turned off by the excessive precision.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ). Posted on January 29, 2019.

2 thoughts on "Data scientists need to learn about significant digits"

1. John the Scott says (January 30, 2019 at 4:44 pm):

   most excellent post. i recommend gustafson's book for another angle on digital error. https://www.amazon.com/End-Error-Computing-Chapman-Computational/dp/1482239868

2. ttoinou says (January 31, 2019 at 6:25 am):

   Of course you're right.
   If you're exchanging information with scientists or engineers, you could also provide, with every figure F, its ±P "precision" (a Y% chance of lying in the Gaussian centered on F with k(Y)·P standard deviation, k to be computed from Y). That way, if the person you're giving the information to needs to compute a new statistic, they can combine the Gaussian models and obtain a new (F' ± P').