[HN Gopher] Leveraging SIMD: Splitting CSV Files at 3Gb/S
___________________________________________________________________
Leveraging SIMD: Splitting CSV Files at 3Gb/S
Author : __exit__
Score : 75 points
Date : 2021-12-16 10:09 UTC (12 hours ago)
(HTM) web link (blog.tinybird.co)
(TXT) w3m dump (blog.tinybird.co)
| michaelg7x wrote:
| Presumably solving the same kind of delimiter-finding issues as
| Hyperscan? https://news.ycombinator.com/item?id=19270199
| michaelg7x wrote:
| I'm sorry, I don't mean Hyperscan, I mean simdjson [0]. I think
| I got confused by my recollection of Lemire/Langdale.
|
| [0] https://github.com/simdjson/simdjson
| rwmj wrote:
| Nice, but I'm afraid real world CSVs are a lot more complicated
| than described so don't use this code in production.
| mschuster91 wrote:
| If you're doing user-supplied CSVs, definitely... but if you
| are ingesting CSVs from a known source with known format
| (<insert audible sigh here>) it can definitely make sense to
| use a high-speed optimized ingester.
|
| One might wonder if it might be worth the time to look into
| optimising the runtimes of various languages. I took a look,
| all operate on naive byte-by-byte scanning, and all sans PHP
| are written in the respective language which means any form of
| SIMD optimization is right off the table (okay, _maybe_
| something could be done in Java, but it seems incredibly
| complex, see https://www.morling.dev/blog/fizzbuzz-simd-
| style/):
|
| - PHP isn't optimized _anywhere_ , but at least it's C:
| https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...
|
| - Python's C implementation is the same:
| https://github.com/python/cpython/blob/main/Modules/_csv.c
|
| - Java doesn't have a "standard" way at all
| (https://www.baeldung.com/java-csv-file-array), and OpenCSV
| seems the usual object-oriented hell (https://sourceforge.net/p
| /opencsv/source/ci/master/tree/src/...).
|
| - Ruby's CSV is native Ruby:
| https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...
| __s wrote:
| Python's csv imports _csv for core functionality, which is C:
| https://github.com/python/cpython/blob/main/Modules/_csv.c
| mschuster91 wrote:
| Thanks! Updated accordingly.
| jwandborg wrote:
| You should update "all sans PHP" to reflect the update.
| nickpeterson wrote:
| It's funny, csv files are so common and yet many mainstream
| languages don't even attempt a decent parser baked in. I
| think dotnet has 3-4 different ones and as I recall they're
| all pretty slow.
| jwandborg wrote:
| There's multiple dialects of CSV. Besides the more
| standardish dialect there are some weird ones that prevent
| some types of optimization. I remember Apple's "Enterprise
| Partner Feed" had a dialect I've never seen elsewhere so
| far. Columns were separated by 0x01, rows were separated by
| 0x02 0x0A.
|
| The row separator being two bytes throws a wrench in most
| parsers.
| ipdashc wrote:
| What a bizarre choice. If they're going to commit to
| weird ASCII control chars you'd think they could just use
| 0x1C to 0x1F, which are explicitly intended as
| delimiters/Separators... sigh. (I've always wondered why
| more people don't use the various Separators, but I admit
| human-readability is a big advantage)
| mschuster91 wrote:
| > The row separator being two bytes throws a wrench in
| most parsers.
|
| Huh? Anything that ingests Windows-origin files needs to
| be capable with \r\n by default.
| [deleted]
| clscott wrote:
| Perl's best known library Terxt::CSV has both a pure-perl and
| a C implementation.
|
| Here is the C version
|
| https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs
| liuliu wrote:
| Splitting CSV file into chunks and process them independently
| won't necessarily be wrong (although there are implementations
| out there that I won't name would, because they do guess). The
| trick however requires to scan twice:
| https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
|
| Nice article otherwise!
| Tuna-Fish wrote:
| Why is the unit expression in topic messed up?
| jagrsw wrote:
| Not sure how the author of this entry on HN managed to change
| original title from
|
| gigabytes per second
|
| to
|
| gigabits per siemens
|
| :)
| jeltz wrote:
| Isn't it easier to write 3GbO instead of 3Gb/S?
| __exit__ wrote:
| Autocorrector issues + fast fingers to click on submit without
| double checking. Sorry for that.
|
| Whoever fixed the title, thank you :D
| CodesInChaos wrote:
| > fixed the title
|
| It still shows as "3Gb/S" for me, instead of "3GB/s"
| HPsquared wrote:
| Staying with Physics, "Gb/S" is Gigabarns per Siemens. Some
| relation of electrical conductance with cross-sectional area.
|
| The barn is a unit of cross-sectional area, based on the
| Uranium nucleus (area 1 barn). Uranium is pretty large in
| atomic terms; the name is from the idiom "couldn't hit the
| broad side of a barn".
| jagrsw wrote:
| And since 1/S = 1O, it'd be Gigabarnohm.
| robertlagrant wrote:
| Hence the old joke: how many Gigabarnohms does it take to
| start a circus?
|
| Answer, of course, one million.
| Sebb767 wrote:
| Probably auto-capitalization gone wrong. Or some very new code
| ;)
| MisterTea wrote:
| Genetic code most likely.
| zwegner wrote:
| Pretty similar article from very recently:
| https://nullprogram.com/blog/2021/12/04/
|
| Discussion: https://news.ycombinator.com/item?id=29439403
|
| The article mentions in an addendum (and BeeOnRope also pointed
| it out in the HN thread) a nice CLMUL trick for dealing with
| quotes originally discovered by Geoff Langdale. That should work
| here for a nice speedup.
|
| But without the CLMUL trick, I'd guess that the unaligned loads
| that generally occur after a vector containing both quotes and
| newlines in this version (the "else" case on lines 34-40) would
| hamper the performance somewhat, since it would eat up twice as
| much L1 cache bandwidth. I'd suggest dealing with the masks using
| bitwise operations in a loop, and letting i stay divisible by 16.
| Or just use CLMUL :)
| davidm1729 wrote:
| Hi, I'm one of the authors of the post
|
| Thanks for pointing us to CLMUL, I'm not familiar with these
| kind of multiplications, but, converting the quote bitmask to a
| _quoted_ bitmask would certainly make it faster. With this new
| bitmask, we could negate it and AND it with the newline mask,
| generating a mask of newlines that are not inside quotes.
| Getting the last newline then would be a simple CLZ of that
| mask. And there wouldn 't be a need to resort to byte to byte
| processing.
|
| In our tests, going byte to byte for more iterations to keep
| the alignment when hitting the "else case" performed worse than
| making the unaligned loads, but as you say "just use CLMUL" (as
| all loads will be aligned) :D
| zwegner wrote:
| CLMUL in general is a bit weird to wrap your head around, but
| a CLMUL with -1 isn't too tricky: it's like a running 1-bit
| sum, or in other words, each bit in the result is the parity
| of all the bits up to that point in the multiplier.
|
| > In our tests, going byte to byte for more iterations to
| keep the alignment when hitting the "else case" performed
| worse than making the unaligned loads, but as you say "just
| use CLMUL" (as all loads will be aligned) :D
|
| I was talking about using bitwise operations with the
| quote/escape/newline masks already computed (like in the blog
| post I linked), rather than a byte-by-byte loop. But yeah,
| CLMUL is better anyways :)
| jart wrote:
| PMOVMSKB/BSF/POPCNT takes serious wizardry, but instructions
| like PCLMULLQLQDQ make you feel like Gandalf. It's defined:
| pair clmul(uint64_t a, uint64_t b) { uint64_t t, x
| = 0, y = 0; if (a && b) { if (bsr(a) <
| bsr(b)) t = a, a = b, b = t; /* optional */ for
| (t = 0; b; a <<= 1, b >>= 1) { if (b & 1) x ^=
| a, y ^= t; t = t << 1 | a >> 63; }
| } return (pair){x, y}; }
|
| There's a famous paper on how it can perform polynomial
| division at 40gbps. It's really cool that it has practical
| applications in things like CSV too. https://www.intel.com/co
| ntent/dam/www/public/us/en/documents...
| gpderetta wrote:
| CLMUL is quite interesting. I learned about it when going in
| depth on how multiplications help with hashing.
|
| A multiplication is in practice: - a sum over - a series
| (i.e. one for each bit set in the multiplier) - of shifts
| (where the shift amount is the index of that bit in the
| multiplier)
|
| The shifting and the combining are great for hashing as they
| "distribute" each bit around.
|
| CLMUL simply replaces the addition in step one with xor
| (which can also be thought as the single bit carryless
| addition).
| pclmulqdq wrote:
| The carryless multiplication instructions are amazing and
| people should use them more often. They are just so poorly
| explained that they feel like magic.
___________________________________________________________________
(page generated 2021-12-16 23:01 UTC)