https://lemire.me/blog/2021/01/29/number-parsing-at-a-gigabyte-per-second/

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the University of
Quebec (TELUQ) in Montreal. His research is focused on software
performance and data engineering. He is a techno-optimist.

Number Parsing at a Gigabyte per Second

Computers typically rely on binary floating-point numbers. Most often
they span 64 bits or 32 bits. Many programming languages call them
double and float. JavaScript represents all its numbers, by default,
with a 64-bit binary floating-point number type.

Human beings most often represent numbers in decimal notation,
such as 0.1 or 1e-1. Thus many systems store numbers in decimal
notation using ASCII text. The software must go from binary
floating-point numbers to ASCII and back. There has been much work
done on the serialization (from binary floating-point numbers to
ASCII) but comparatively less work on the deserialization (from ASCII
to binary floating-point numbers).
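
To make the two directions concrete, here is a minimal sketch in Python
(chosen only for brevity; it is not the language of the work described
in this post). It uses Python's built-in float() and repr() as stand-ins
for a parser and a serializer. It shows that 0.1 and 1e-1 denote the
same 64-bit value, and that this value is only the nearest binary
approximation of one tenth.

```python
from decimal import Decimal

# Deserialization: two decimal spellings of the same number
# parse to the same 64-bit binary floating-point value.
a = float("0.1")
b = float("1e-1")
same_value = (a == b)

# The stored value is not exactly one tenth: Decimal(a) expands
# the binary value digit for digit, without rounding.
exact_expansion = Decimal(a)

# Serialization: repr() produces a short decimal string that
# parses back to exactly the same double.
text = repr(a)
round_trip_ok = (float(text) == a)

print(same_value)        # True
print(exact_expansion)   # a long decimal that differs slightly from 0.1
print(text, round_trip_ok)
```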

Typically, reading decimal numbers and converting them to binary
floating-point numbers is slow. How slow? Often on the order of 200
MB/s. That is much slower than a fast disk: a PlayStation 5 has a
disk capable of over 5 GB/s in bandwidth.
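
For readers who want to see what such a figure measures, here is a rough
sketch of a throughput measurement in Python. The function name, the
value range and the number of inputs are all assumptions made for
illustration. It counts the bytes of decimal text consumed per second of
parsing. Interpreter overhead dominates in Python, so the number it
prints is not comparable to the figures quoted in this post.

```python
import random
import time


def parsing_throughput_mb_per_s(count: int = 200_000, seed: int = 1) -> float:
    """Parse `count` decimal strings and return megabytes of text per second."""
    rng = random.Random(seed)
    # Hypothetical input: random values written in decimal notation.
    lines = [repr(rng.uniform(-180.0, 180.0)) for _ in range(count)]
    # The strings are pure ASCII, so one character is one byte.
    total_bytes = sum(len(s) for s in lines)

    start = time.perf_counter()
    values = [float(s) for s in lines]
    elapsed = time.perf_counter() - start

    assert len(values) == count
    return total_bytes / elapsed / 1e6


print(f"{parsing_throughput_mb_per_s():.0f} MB/s")
```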

You can do much better. I finally published a manuscript that
explains a better approach: Number Parsing at a Gigabyte per Second.
Do not miss the acknowledgements section of the paper: this was joint
work with really smart people.

The benchmarks in the paper are mostly based on the C++ library
fast_float. The library requires a C++11 standard-compliant compiler.
It provides functions that closely emulate the standard C++
from_chars functions for the float and double types. It is used by
Apache Arrow and Yandex ClickHouse. It is also part of the fastest
YAML library in the world. The from_chars functions are part of the
C++17 standard. To my knowledge, only Microsoft has implemented them
at this point: they are not available in GNU GCC.
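
The from_chars calling convention is worth a short illustration: the
caller passes a range of characters and gets back both the parsed value
and the position where parsing stopped. The sketch below is a rough
Python analogue, not the C++ API. The name parse_prefix and the
simplified number pattern are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Simplified decimal pattern (an assumption for this sketch): optional
# minus sign, digits with an optional fraction, optional exponent.
_NUMBER = re.compile(r"-?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_prefix(buffer: str, first: int, last: int) -> Tuple[Optional[float], int]:
    """Parse a number starting at buffer[first], never reading past `last`.

    Returns (value, index of the first unparsed character), or
    (None, first) when no number starts at `first`.
    """
    match = _NUMBER.match(buffer, first, last)
    if match is None:
        return None, first
    return float(match.group()), match.end()


text = "1.5e3,-0.25,abc"
value, pos = parse_prefix(text, 0, len(text))
print(value, pos)  # 1500.0 5
```

Because the function reports where it stopped, the caller can skip the
separator and continue from there without copying or terminating
substrings.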

On my Apple M1 MacBook, using a realistic data file (canada), we find
that fast_float far exceeds a gigabyte per second and gets close
to 2 GB/s. The conventional C function (strtod) provided by the
default Apple standard library does quite poorly on this benchmark.
[Screen-Shot-2021-01-29-at-12]

What about other programming languages?

A simplified version of the approach is now part of the Go standard
library, thanks to Nigel Tao and other great engineers. It
accelerated Go processing while helping to provide exact parsing.
Nigel Tao has a nice post entitled The Eisel-Lemire ParseNumberF64
Algorithm.
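
One way to read "exact parsing", and this reading is an assumption of
the sketch below, is that the parser returns the 64-bit value nearest to
the decimal input. The following Python check verifies that property
with exact rational arithmetic. It exercises Python's own float(), not
the Go parser, and the helper name is made up for this example. It
requires Python 3.9 or later for math.nextafter.

```python
from fractions import Fraction
import math


def is_correctly_rounded(text: str) -> bool:
    """Check that float(text) is at least as close to the exact decimal
    value as both neighboring doubles. Intended for finite inputs only."""
    parsed = float(text)
    exact = Fraction(text)  # exact rational value of the decimal string
    error = abs(Fraction(parsed) - exact)
    neighbors = (
        math.nextafter(parsed, -math.inf),
        math.nextafter(parsed, math.inf),
    )
    for neighbor in neighbors:
        if math.isinf(neighbor):
            continue  # no finite neighbor on that side
        if abs(Fraction(neighbor) - exact) < error:
            return False
    return True


print(is_correctly_rounded("0.1"))
print(is_correctly_rounded("3.141592653589793238462643383279"))
```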

What about Rust? There is a Rust port. Unsurprisingly, the Rust
version is a match for the C++ version, speed-wise. Here are the
results using the same file and the same processor (Apple M1):

from_str (standard)    130 MB/s
lexical (popular lib.) 370 MB/s
fast-float             1200 MB/s

A C# port is in progress and preliminary results suggest we can beat
the standard library by a healthy margin. I am hoping to get a Swift
and Java port going this year (help and initiative are invited).

Video. Last year, I gave a talk at Go Systems Conf SF 2020
entitled Floating-point Number Parsing w/Perfect Accuracy at GB/sec.
It is on YouTube.

Further reading. See my earlier posts: Fast float parsing in practice
(March 2020 blog post) and Multiplying backward for profit (April
2020 blog post).

Published by

Daniel Lemire


Posted on January 29, 2021; updated January 30, 2021. Author: Daniel Lemire

2 thoughts on "Number Parsing at a Gigabyte per Second"

 1. Frank Astier says:
    January 30, 2021 at 5:17 pm

    But, when I have to store e.g. a big matrix of floating point
    numbers, I would do a copy of that contiguous chunk of memory to
    disk, and vice-versa, possibly throwing in mmap - precisely to
    avoid parsing from text?

     1. Daniel Lemire says:
        January 30, 2021 at 6:12 pm

        Right. If you serialize your numbers in binary form, you
        obviously have no parsing difficulty. In the paper, I also
        allude to another possibility: you can use hexadecimal
        floating-point numbers.
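
The two alternatives raised in this exchange can be sketched in a few
lines of Python. The sample values and the little-endian byte order are
assumptions made for illustration. Raw binary storage copies the 8 bytes
of each double. Hexadecimal floating-point text stays readable ASCII but
maps directly onto the binary digits, so no decimal rounding is
involved.

```python
import struct

values = [0.1, -2.5, 1e-300, 6.02214076e23]

# Option 1: raw binary. Each double is stored as its 8 IEEE 754 bytes
# (little-endian here), so reading it back involves no text parsing.
blob = struct.pack(f"<{len(values)}d", *values)
restored_binary = list(struct.unpack(f"<{len(values)}d", blob))

# Option 2: hexadecimal floating-point text. The digits encode the
# binary significand and exponent directly.
hex_text = [v.hex() for v in values]
restored_hex = [float.fromhex(s) for s in hex_text]

print(len(blob))     # 32
print(hex_text[0])   # 0x1.999999999999ap-4
print(restored_binary == values, restored_hex == values)
```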
