[HN Gopher] Number Parsing at a Gigabyte per Second
       ___________________________________________________________________
        
       Number Parsing at a Gigabyte per Second
        
       Author : mpweiher
       Score  : 70 points
       Date   : 2021-01-30 09:00 UTC (14 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | lenkite wrote:
        | The referenced paper is dense yet surprisingly readable. Looking
        | forward to all C++ implementations implementing the
        | std::from_chars function.
        
       | FrozenVoid wrote:
        | These are really nice libraries, but they indicate there is a
        | problem with input data having "gigabytes of ASCII floats",
        | where you store floats as strings instead of binary formats
        | (which don't require parsing anything). For scripts and
        | user input, I'd rather use the standard stuff that has been
        | tried and tested (strtod/strtold).
        
         | asicsp wrote:
         | > _On my Apple M1 MacBook, using a realistic data file
          | (canada), we get that fast_float can far exceed a gigabyte per
         | second, and get close to 2 GB /s. The conventional C function
         | (strtod) provided by the default Apple standard library does
         | quite poorly on this benchmark._
        
         | MaxBarraclough wrote:
         | If it's so cheap to decode that it's barely slower than copying
         | the bytes, the problem largely goes away, no?
         | 
         | No doubt you're right that it would be more efficient to store
         | the binary representation natively, especially assuming
         | portability isn't an issue, and especially if a zero-copy
         | solution were used, but many real-world systems have to cope
         | with data formats that aren't the most efficient. Huge amounts
         | of data are shipped as XML and JSON.
        
           | FrozenVoid wrote:
           | If performance matters, storage space should matter too.
        
             | MaxBarraclough wrote:
             | Representing numbers in decimal ASCII/Unicode shouldn't be
             | too bad in that regard, should it? If we're representing a
             | sparse matrix, the '0' character occupies one byte, whereas
             | a 32-bit representation occupies 4 (whether floating point
             | or fixed point).
             | 
             | Both representations should lend themselves to compression.
        
               | jstimpfle wrote:
               | Text requires at least two bytes (it needs a separator),
               | and negative numbers require yet another byte. So it's
               | hard to make a point that they occupy less storage. But
               | there is another advantage: it can represent infinitely
               | large numbers.
        
               | MaxBarraclough wrote:
               | > Text requires at least two bytes (it needs a
               | separator), and negative numbers require yet another
               | byte.
               | 
               | True.
               | 
               | > it's hard to make a point that they occupy less storage
               | 
               | No, like I said, it would occupy less storage for sparse
               | data. That's true even with separator characters (2 bytes
               | as against 4). Doubtless there are much better solutions
               | for representing sparse data that aren't human readable.
               | A simple _(index, value)_ dictionary, say.
               | 
               | > it can represent infinitely large numbers
               | 
               | Right, there's no upper bound beyond the limitations of
               | the systems, although again a non-human-readable bignum
               | format could do this more efficiently.
        
               | retrac wrote:
                | The algorithm given above is an excellent, extremely fast
               | compression algorithm for ASCII-encoded numbers :)
        
         | hacker_9 wrote:
          | Sometimes third parties give you JSON and that's it; in those
          | cases these libraries are useful if reacting fastest is a
          | constraint.
        
       | MaxBarraclough wrote:
       | Somewhat related, from the same author: _Base64 encoding and
       | decoding at almost the speed of a memory copy._ (2019) [0]
       | 
       | Also this discussion of algorithms for the inverse problem,
       | converting floats to strings. [1]
       | 
       | [0] https://news.ycombinator.com/item?id=21459839
       | 
       | [1] https://news.ycombinator.com/item?id=24939411
        
       | Out_of_Characte wrote:
        | This seems very useful, but only when you have gigabytes of
        | compliant ASCII text of floating-point numbers. I do wonder how
        | well this performs on non-compliant text and whether there are
        | systems out there that are limited by the ~130 MB/s transfer
        | speeds of standard libraries.
        | 
        | Now that I think about it, Excel spreadsheets, JSON, XML, and
        | text files are all major contributors to sometimes very flawed
        | ASCII-based workloads that should have a complementary binary
        | backing.
        
         | the8472 wrote:
          | Not all optimization work consists of attacking the dominating
          | function. Sometimes most of the low-hanging fruit has already
          | been plucked and you have to speed up dozens of smaller
          | things; float parsing can be one of those.
        
         | maccard wrote:
          | Just because something isn't the bottleneck doesn't mean it's
          | not worth optimising. If you spend 10% of the time decoding,
          | 80% working, and 10% saving the results, shaving that first
          | or last 10% is definitely worthwhile.
          | 
          | My $400 consumer motherboard has a 10 Gb network card, and my
          | SSD reads at over 50 Gb/s. Anything that brings IO closer to
          | those speeds is welcome.
        
       | di4na wrote:
        | Reading it, I get a strange sense of deja vu. Having spent the
        | past 3 months implementing Ryu to do the reverse operation,
        | this looks a lot like what Ulf Adams has been working on for
        | the reverse direction based on Ryu. The lookup tables are
        | really close and the maps look the same.
        | 
        | While Ryu and Ulf's work are indeed cited for the reverse, I do
        | not see any acknowledgement of his work on the string-to-float
        | version.... despite really close similarities.
        | 
        | I suppose sometimes it is too obvious...
        
       | jhokanson wrote:
        | I'm curious what the biggest win in terms of speed was here
        | (which approach? good lookup tables?). I'm also curious how
        | this compares to the many (?) JSON parsers that have rolled
        | their own number parsers because everyone knows the standard
        | library is so slow... (just more accurate? faster?).
        | Regardless, kudos to the authors on their work!
        
         | SloopJon wrote:
         | He touched on JSON parsers in a previous post about
         | fast_double_parser: "People who write fast parsers tend to roll
         | their own number parsers (e.g., RapidJSON, sajson), and so we
         | did. However, we sacrifice some standard compliance." (The "we"
         | in this context refers to simdjson.)
         | 
         | https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac...
         | 
          | He followed up in a comment: "RapidJSON has at least two
          | fast-parsing modes. The fast mode, which I think is what you
          | refer to, is indeed quite fast, but it can be off by one ULP,
          | so it is not standard compliant."
         | 
         | The Github README for this new project says, "The fast_float
         | library provides a performance similar to that of the
         | fast_double_parser library."
         | 
         | https://github.com/fastfloat/fast_float
         | 
         | However, the benchmarks show a significant improvement relative
         | to those in the fast_double_parser README:
         | 
         | https://github.com/lemire/fast_double_parser
         | 
         | I tried to run the benchmarks, but my CMake is apparently too
         | old, and Homebrew barfed all over the living room rug when I
         | tried to update it.
        
           | jhokanson wrote:
            | Wow, those are big performance differences (660 MB/s for
            | fast_double_parser vs 1042 MB/s for the newer fast_float),
            | although most of the numbers (for the different libraries
            | being tested) are all over the place, and even strtod more
            | than doubled in speed between the two tests (70 MB/s in the
            | fast_double_parser benchmark vs 190 MB/s in the fast_float
            | one). It wouldn't surprise me if those two code bases are
            | essentially the same.
           | 
           | That highlights the complexity of benchmarking in general and
           | the importance of comparing within the same benchmark. I
           | haven't looked at this in a while but I thought some of the
           | newer JSON parsers were standards compliant (maybe not?).
           | 
           | Anyway, that other blog post answers my question as it looks
           | like the big insight is that you use the fast approach (that
           | everyone uses) when you can, and fall back to slow if you
           | really have to. From that blog link:
           | 
           | "The full idea requires a whole blog post to explain, but the
           | gist of it is that we can attempt to compute the answer,
           | optimistically using a fast algorithm, and fall back on
           | something else (like the standard library) as needed. It
           | turns out that for the kind of numbers we find in JSON
           | documents, we can parse 99% of them using a simple approach.
           | All we have to do is correctly detect the error cases and
           | bail out."
           | 
           | Again, I swear I've seen this in one of the other JSON
           | parsers but maybe I'm misremembering. And again, good for
           | them for breaking it out into a header library for others to
           | use.
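The optimistic-fast-path-with-fallback pattern quoted above can be sketched roughly as follows (a toy illustration only — the real fast_float algorithm uses precomputed tables to stay correctly rounded on its fast path, which this sketch does not):

```cpp
#include <cctype>
#include <cstdlib>

// Toy fast path: accept only plain decimal digits with an optional
// fraction and few enough digits that the mantissa fits exactly in a
// double. Anything else (signs, exponents, long inputs) bails out.
bool fast_path(const char* s, double& out) {
    long long mantissa = 0;
    int digits = 0, frac_digits = 0;
    bool seen_dot = false;
    for (const char* p = s; *p; ++p) {
        if (*p == '.' && !seen_dot) { seen_dot = true; continue; }
        if (!std::isdigit((unsigned char)*p)) return false;  // bail out
        mantissa = mantissa * 10 + (*p - '0');
        if (seen_dot) ++frac_digits;
        if (++digits > 15) return false;  // bail: rounding not guaranteed
    }
    if (digits == 0) return false;
    double d = (double)mantissa;
    while (frac_digits--) d /= 10.0;  // NOTE: this division can itself
                                      // round; the real algorithm avoids it.
    out = d;
    return true;
}

double parse(const char* s) {
    double d;
    if (fast_path(s, d)) return d;   // handles the common case cheaply
    return std::strtod(s, nullptr);  // slow but fully general fallback
}
```

The point is the shape, not the details: detect the error cases cheaply, handle the overwhelmingly common inputs inline, and only pay for the general-purpose parser when you must.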
        
       ___________________________________________________________________
       (page generated 2021-01-30 23:02 UTC)