[HN Gopher] Leveraging SIMD: Splitting CSV Files at 3Gb/S
       ___________________________________________________________________
        
       Leveraging SIMD: Splitting CSV Files at 3Gb/S
        
       Author : __exit__
       Score  : 75 points
       Date   : 2021-12-16 10:09 UTC (12 hours ago)
        
 (HTM) web link (blog.tinybird.co)
 (TXT) w3m dump (blog.tinybird.co)
        
       | michaelg7x wrote:
       | Presumably solving the same kind of delimiter-finding issues as
       | Hyperscan? https://news.ycombinator.com/item?id=19270199
        
         | michaelg7x wrote:
         | I'm sorry, I don't mean Hyperscan, I mean simdjson [0]. I think
         | I got confused by my recollection of Lemire/Langdale.
         | 
         | [0] https://github.com/simdjson/simdjson
        
       | rwmj wrote:
       | Nice, but I'm afraid real world CSVs are a lot more complicated
       | than described so don't use this code in production.
        
         | mschuster91 wrote:
         | If you're doing user-supplied CSVs, definitely... but if you
         | are ingesting CSVs from a known source with known format
         | (<insert audible sigh here>) it can definitely make sense to
         | use a high-speed optimized ingester.
         | 
         | One might wonder if it might be worth the time to look into
         | optimising the runtimes of various languages. I took a look,
         | all operate on naive byte-by-byte scanning, and all sans PHP
         | are written in the respective language which means any form of
         | SIMD optimization is right off the table (okay, _maybe_
         | something could be done in Java, but it seems incredibly
         | complex, see https://www.morling.dev/blog/fizzbuzz-simd-
         | style/):
         | 
         | - PHP isn't optimized _anywhere_ , but at least it's C:
         | https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...
         | 
         | - Python's C implementation is the same:
         | https://github.com/python/cpython/blob/main/Modules/_csv.c
         | 
         | - Java doesn't have a "standard" way at all
         | (https://www.baeldung.com/java-csv-file-array), and OpenCSV
         | seems the usual object-oriented hell (https://sourceforge.net/p
         | /opencsv/source/ci/master/tree/src/...).
         | 
         | - Ruby's CSV is native Ruby:
         | https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...
        
           | __s wrote:
           | Python's csv imports _csv for core functionality, which is C:
           | https://github.com/python/cpython/blob/main/Modules/_csv.c
        
             | mschuster91 wrote:
             | Thanks! Updated accordingly.
        
               | jwandborg wrote:
               | You should update "all sans PHP" to reflect the update.
        
           | nickpeterson wrote:
           | It's funny, csv files are so common and yet many mainstream
           | languages don't even attempt a decent parser baked in. I
           | think dotnet has 3-4 different ones and as I recall they're
           | all pretty slow.
        
             | jwandborg wrote:
             | There's multiple dialects of CSV. Besides the more
             | standardish dialect there are some weird ones that prevent
             | some types of optimization. I remember Apple's "Enterprise
             | Partner Feed" had a dialect I've never seen elsewhere so
             | far. Columns were separated by 0x01, rows were separated by
             | 0x02 0x0A.
             | 
             | The row separator being two bytes throws a wrench in most
             | parsers.
        
               | ipdashc wrote:
               | What a bizarre choice. If they're going to commit to
               | weird ASCII control chars you'd think they could just use
               | 0x1C to 0x1F, which are explicitly intended as
               | delimiters/Separators... sigh. (I've always wondered why
               | more people don't use the various Separators, but I admit
               | human-readability is a big advantage)
        
               | mschuster91 wrote:
               | > The row separator being two bytes throws a wrench in
               | most parsers.
               | 
               | Huh? Anything that ingests Windows-origin files needs to
               | be capable with \r\n by default.
        
           | [deleted]
        
           | clscott wrote:
           | Perl's best known library Terxt::CSV has both a pure-perl and
           | a C implementation.
           | 
           | Here is the C version
           | 
           | https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs
        
       | liuliu wrote:
       | Splitting CSV file into chunks and process them independently
       | won't necessarily be wrong (although there are implementations
       | out there that I won't name would, because they do guess). The
       | trick however requires to scan twice:
       | https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
       | 
       | Nice article otherwise!
        
       | Tuna-Fish wrote:
       | Why is the unit expression in topic messed up?
        
       | jagrsw wrote:
       | Not sure how the author of this entry on HN managed to change
       | original title from
       | 
       | gigabytes per second
       | 
       | to
       | 
       | gigabits per siemens
       | 
       | :)
        
         | jeltz wrote:
         | Isn't it easier to write 3GbO instead of 3Gb/S?
        
         | __exit__ wrote:
         | Autocorrector issues + fast fingers to click on submit without
         | double checking. Sorry for that.
         | 
         | Whoever fixed the title, thank you :D
        
           | CodesInChaos wrote:
           | > fixed the title
           | 
           | It still shows as "3Gb/S" for me, instead of "3GB/s"
        
         | HPsquared wrote:
         | Staying with Physics, "Gb/S" is Gigabarns per Siemens. Some
         | relation of electrical conductance with cross-sectional area.
         | 
         | The barn is a unit of cross-sectional area, based on the
         | Uranium nucleus (area 1 barn). Uranium is pretty large in
         | atomic terms; the name is from the idiom "couldn't hit the
         | broad side of a barn".
        
           | jagrsw wrote:
           | And since 1/S = 1O, it'd be Gigabarnohm.
        
             | robertlagrant wrote:
             | Hence the old joke: how many Gigabarnohms does it take to
             | start a circus?
             | 
             | Answer, of course, one million.
        
         | Sebb767 wrote:
         | Probably auto-capitalization gone wrong. Or some very new code
         | ;)
        
           | MisterTea wrote:
           | Genetic code most likely.
        
       | zwegner wrote:
       | Pretty similar article from very recently:
       | https://nullprogram.com/blog/2021/12/04/
       | 
       | Discussion: https://news.ycombinator.com/item?id=29439403
       | 
       | The article mentions in an addendum (and BeeOnRope also pointed
       | it out in the HN thread) a nice CLMUL trick for dealing with
       | quotes originally discovered by Geoff Langdale. That should work
       | here for a nice speedup.
       | 
       | But without the CLMUL trick, I'd guess that the unaligned loads
       | that generally occur after a vector containing both quotes and
       | newlines in this version (the "else" case on lines 34-40) would
       | hamper the performance somewhat, since it would eat up twice as
       | much L1 cache bandwidth. I'd suggest dealing with the masks using
       | bitwise operations in a loop, and letting i stay divisible by 16.
       | Or just use CLMUL :)
        
         | davidm1729 wrote:
         | Hi, I'm one of the authors of the post
         | 
         | Thanks for pointing us to CLMUL, I'm not familiar with these
         | kind of multiplications, but, converting the quote bitmask to a
         | _quoted_ bitmask would certainly make it faster. With this new
         | bitmask, we could negate it and AND it with the newline mask,
         | generating a mask of newlines that are not inside quotes.
         | Getting the last newline then would be a simple CLZ of that
         | mask. And there wouldn 't be a need to resort to byte to byte
         | processing.
         | 
         | In our tests, going byte to byte for more iterations to keep
         | the alignment when hitting the "else case" performed worse than
         | making the unaligned loads, but as you say "just use CLMUL" (as
         | all loads will be aligned) :D
        
           | zwegner wrote:
           | CLMUL in general is a bit weird to wrap your head around, but
           | a CLMUL with -1 isn't too tricky: it's like a running 1-bit
           | sum, or in other words, each bit in the result is the parity
           | of all the bits up to that point in the multiplier.
           | 
           | > In our tests, going byte to byte for more iterations to
           | keep the alignment when hitting the "else case" performed
           | worse than making the unaligned loads, but as you say "just
           | use CLMUL" (as all loads will be aligned) :D
           | 
           | I was talking about using bitwise operations with the
           | quote/escape/newline masks already computed (like in the blog
           | post I linked), rather than a byte-by-byte loop. But yeah,
           | CLMUL is better anyways :)
        
           | jart wrote:
           | PMOVMSKB/BSF/POPCNT takes serious wizardry, but instructions
           | like PCLMULLQLQDQ make you feel like Gandalf. It's defined:
           | pair clmul(uint64_t a, uint64_t b) {           uint64_t t, x
           | = 0, y = 0;           if (a && b) {             if (bsr(a) <
           | bsr(b)) t = a, a = b, b = t; /* optional */             for
           | (t = 0; b; a <<= 1, b >>= 1) {               if (b & 1) x ^=
           | a, y ^= t;               t = t << 1 | a >> 63;             }
           | }           return (pair){x, y};         }
           | 
           | There's a famous paper on how it can perform polynomial
           | division at 40gbps. It's really cool that it has practical
           | applications in things like CSV too. https://www.intel.com/co
           | ntent/dam/www/public/us/en/documents...
        
           | gpderetta wrote:
           | CLMUL is quite interesting. I learned about it when going in
           | depth on how multiplications help with hashing.
           | 
           | A multiplication is in practice: - a sum over - a series
           | (i.e. one for each bit set in the multiplier) - of shifts
           | (where the shift amount is the index of that bit in the
           | multiplier)
           | 
           | The shifting and the combining are great for hashing as they
           | "distribute" each bit around.
           | 
           | CLMUL simply replaces the addition in step one with xor
           | (which can also be thought as the single bit carryless
           | addition).
        
         | pclmulqdq wrote:
         | The carryless multiplication instructions are amazing and
         | people should use them more often. They are just so poorly
         | explained that they feel like magic.
        
       ___________________________________________________________________
       (page generated 2021-12-16 23:01 UTC)