[HN Gopher] Fast CSV Processing with SIMD
___________________________________________________________________
Fast CSV Processing with SIMD
Author : ingve
Score : 151 points
Date : 2021-12-04 07:37 UTC (15 hours ago)
(HTM) web link (nullprogram.com)
(TXT) w3m dump (nullprogram.com)
| plafl wrote:
| CSVs are nice for data interchange, or even for storage if
| compressed (storage is interchange between the past and the
| future). But the very first thing one should do when working with
| CSV data is convert it to a binary representation. It will force
| you to make a full pass over all the data, understand it, think
| about the types of the columns, and you will increase iteration
| speed.
|
| More related to TFA: can someone recommend a good book or other
| resource about modern code optimization? I have internalized
| several lessons, but it would be nice to dig deeper.
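A minimal sketch of the convert-once-then-work-in-binary idea, in Python; the two-column schema (int64 id, float64 value) and all names are illustrative, not from the thread:

```python
import csv
import io
import struct

# Hypothetical fixed row layout: little-endian int64 id + float64 value.
ROW = struct.Struct("<qd")

def csv_to_binary(src, dst) -> int:
    """One full pass over the CSV: parse each column to its real type
    and write fixed-size packed rows to a binary stream."""
    reader = csv.reader(src)
    next(reader)  # skip the header row
    n = 0
    for row in reader:
        dst.write(ROW.pack(int(row[0]), float(row[1])))
        n += 1
    return n

# usage sketch
src = io.StringIO("id,value\n1,2.5\n2,3.5\n")
dst = io.BytesIO()
csv_to_binary(src, dst)  # dst now holds two 16-byte rows
```

The forced full pass is exactly where you discover the bad rows and settle the column types.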
| speedgoose wrote:
| Some years ago I got a huge performance increase simply by
| converting a csv dataset to parquet before processing it with
| Apache Spark.
| mgradowski wrote:
| Also, you will sleep better at night knowing that your column
| dtypes are safe from harm, exactly as you stored them. Moving
| from CSV (or god forbid, .xlsx) has been such a quality of
| life improvement.
|
| One thing I miss though is how easy it is to inspect .csv and
| .xlsx. I kinda solved it using [1], but it only works on
| Windows. More portable recommendations welcome!
|
| [1] https://github.com/mukunku/ParquetViewer
| speedgoose wrote:
| I used to use Zeppelin, some kind of Jupyter Notebook for
| Spark (that supports Parquet). But there may be better
| alternatives.
|
| https://zeppelin.apache.org/
| cb321 wrote:
| The "real" format being binary with debugging tools is
| absolutely the best way to go. For example, you can use
| `nio print` (or even just `nio p`) in the Nim multi-program
| https://github.com/c-blake/nio to get "text debugging
| output" of binary files.
| ZeroGravitas wrote:
| I really like Visidata for "exploring" csv-type data.
|
| It's a vi(m)-inspired tool.
|
| It also handles xls(x), sqlitedb and a bunch of other
| random things, and it appears to support parquet via
| pandas:
|
| https://www.visidata.org/docs/loading/
| mgradowski wrote:
| Nice one!
| shoo wrote:
| > recommend a good book or other resource about modern code
| optimization
|
| https://www.agner.org/optimize/
| plafl wrote:
| Thanks, and I must say, a very old-school web page :)
| errantmind wrote:
| Agner Fog's writings are all well worth the read if you
| care about optimization. Some of the techniques he uses in
| his optimized versions of common C functions are really
| interesting.
|
| For example, he uses SIMD in several of these, but one of
| the performance problems associated with SIMD is the last
| 'n' bytes. If you're processing, say, 16 or 32 bytes at a
| time, what do you do when you get towards the end of your
| buffer and that process would over-read past the end of the
| buffer? Most code I've seen just falls back to a simple scalar
| loop for the last few bytes, but this is often very slow.
| Agner's solution was to just read past the end of the
| buffer, but use some assembly tricks to ensure there is
| always valid memory available after the end of the buffer,
| that way reading past the end of the buffer would (almost)
| never cause a problem.
| jandrewrogers wrote:
| Another useful trick for reading the last few bytes,
| especially if suitable buffer padding is not possible and
| the read would cross a memory page boundary, is to use
| PSHUFB aligned against the memory page boundary to pull
| the last few bytes into a register. Safe and fast.
| burntsushi wrote:
| For the specific case of substring/byte searching,
| another useful trick here is to do an unaligned load that
| lines up with the end of the haystack. Some portion of
| that load will have already been searched, but you
| already know nothing is in it.
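The overlapping-load trick is easy to mimic in scalar code; here is a Python sketch where a chunked bytes.find stands in for a 16-byte SIMD compare (chunk size and names are illustrative):

```python
def find_byte(haystack: bytes, needle: int, chunk: int = 16) -> int:
    """Scan fixed-size chunks; instead of a scalar tail loop, the final
    "load" is shifted back so it lines up with the end of the haystack,
    overlapping bytes that were already searched and found empty."""
    n = len(haystack)
    i = 0
    while i + chunk <= n:
        j = haystack[i:i + chunk].find(needle)
        if j >= 0:
            return i + j
        i += chunk
    if i < n:  # leftover tail: overlapping load aligned to the end
        start = max(0, n - chunk)
        j = haystack[start:n].find(needle)
        if j >= 0:
            return start + j
    return -1
```

The re-searched overlap can never produce a new hit, since those bytes were already checked, so the overlap costs nothing but the one extra load.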
| ComputerGuru wrote:
| I have a one-liner alias/script csv_to_sqlite that just starts
| an sqlite3 shell with the csv converted to a table. It's a
| lifesaver.
| cb321 wrote:
| I get ~50% the speed of the article's variant with no SIMD at all
| in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim
|
| While in Nim, the main logic is really just about 1 screenful for
| me and should not be so hard to follow.
|
| As commented elsewhere in this thread, but it bears (much!)
| repeating: a better approach is to bulk convert text to binary
| and then operate off of that. One feature you get is fixed sized
| rows and thus random access to rows without an index. You can
| even mmap the file and cast it to a struct pointer if you like
| (though you need the struct pointer to be the right type). When
| DBs or DB-ish file formats are faster, being in binary is the 0th
| order reason why.
|
| The main reason _not_ to do this is if you have no disk space for
| it.
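A tiny Python illustration of the random-access point, using mmap as the moral equivalent of the read-only pointer cast (the int64+float64 row layout is made up for the example):

```python
import mmap
import struct

ROW = struct.Struct("<qd")  # hypothetical layout: int64 key, float64 value

def row_at(path: str, i: int):
    """Fixed-size rows mean row i lives at byte offset i * ROW.size,
    so no index is needed; mmap gives zero-copy access to it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ROW.unpack_from(mm, i * ROW.size)
```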
| secondcoming wrote:
| Isn't casting to a struct pointer full of Undefined Behaviour?
| I believe you have to memcpy or bitcast and hope that the
| compiler optimises it out.
| cb321 wrote:
| No.
|
| To be more clear, you can mmap read only and just cast the
| pointer. You probably want the struct type to be "packed"
| unless you are intimately familiar with your compiler's
| struct padding behavior. Every C compiler I am aware of has
| _some_ way to specify "packed", but it is admittedly not the
| most portable construct. In any event, if you are doing
| memcpy calls then you still must worry about struct layout.
| So, the memcpy is not buying you anything.
|
| You also do need to not exceed the size of the "file array",
| just like any array in C, but pre-mmap you usually open &
| fstat and the latter tells you the size of the file and
| sizeof(structType) tells you the size of the struct.
|
| So, there is an implicit assumption that the underlying
| file is not shrunk between that fstat and whenever you
| access it by virtual memory indirection. Outright file
| deletion would actually be ok - but truncating it would leave
| the kernel unable to satisfy the page fault.
|
| Other code/entities shrinking files on you is really very far
| from what one usually means by "casting pointers and
| undefined behavior", though. It applies to any prog.lang.
| Even "safe Rust" doesn't block this problem.
|
| Whether your suggested memcpy approach is any safer also depends
| upon the OS/sysadmin, really. For example, many Linux systems
| default to overcommitting memory. This means you could malloc
| and memcpy and still cast a (packed) struct, but if that were
| swapped out and then when the swap in starts to happen the
| system is low on real physical RAM you might segfault.
| Heuristics like out-of-memory killers are just that -
| heuristic not precise. If the data is on an actually
| persistent backing store, though, the kernel can always page
| that in. So, in some ways the mmap read only approach is
| safer than the alloc & copy approach (if you know that
| nothing will shrink the file on you post-fstat).
|
| Cheers
| amanaplanablog wrote:
| Seems a bit overly simplistic and not realistic. Most real-life
| data wouldn't be ASCII-based. I deal with a lot of large CSVs and
| do a lot of processing (multi-terabyte CSVs). Null bytes, invalid
| unicode, all kinds of issues are very common in real world data.
| Not to mention improperly formatted CSVs (not closing quotes,
| etc).
| jeffbee wrote:
| I don't know, huge amounts of real-world data are ASCII in my
| experience, but it is also easy to program and cheap at runtime
| to scan ahead in your input to check if the high bit is
| universally clear. Current computers can do this at absurd
| speeds (~64B/cycle or ~200GB/s).
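The scalar analog of that scan, sketched in Python (a SIMD version would OR 32- or 64-byte vectors and test the sign bits; bytes.isascii() is the stdlib shortcut):

```python
def high_bits_all_clear(buf: bytes) -> bool:
    """OR the input together in wide 64-byte chunks (as big integers),
    then test bit 7 of every byte lane at once; any set high bit means
    non-ASCII is present somewhere in the buffer."""
    acc = 0
    for i in range(0, len(buf), 64):
        acc |= int.from_bytes(buf[i:i + 64], "little")
    mask = int.from_bytes(b"\x80" * 64, "little")
    return acc & mask == 0
```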
| lettergram wrote:
| I really should write up how we did delimiter and quote detection
| in this library:
|
| https://github.com/capitalone/DataProfiler
|
| It turns out delimited files IMO are much harder to parse than
| say, JSON. Largely because they have so many different
| permutations. The article covers CSVs, but many files are tab or
| null separated. We've even seen @ separated with ' for quotes.
|
| Given the above, it should still be possible to use the method
| described. I'm guessing you'd have to detect the separators and
| quote chars first, however. You'd have to also handle empty rows
| and corrupted rows (which happen often enough).
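The consistency check behind delimiter detection can be sketched in a few lines (a toy version, nothing like DataProfiler's actual logic; csv.Sniffer is the stdlib equivalent):

```python
def sniff_delimiter(sample: str, candidates: str = ",\t;|@") -> str:
    """Pick the first candidate that appears a consistent, nonzero
    number of times on every nonempty line -- the simplest form of
    the consistency heuristic delimiter detectors use."""
    lines = [ln for ln in sample.splitlines() if ln]
    for c in candidates:
        counts = {ln.count(c) for ln in lines}
        if len(counts) == 1 and 0 not in counts:
            return c
    return ","  # fall back to the most common convention
```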
| bob1029 wrote:
| > It turns out delimited files IMO are much harder to parse
| than say, JSON.
|
| I have to agree, having recently done a deep-dive into the new
| Utf8JsonReader in .NET. Tracking state in a JSON parsing
| context can actually be quite elegant. CSV is a minefield of
| edge cases in my experience.
| pueblito wrote:
| That was a really fun and accessible read, thanks
| adamgordonbell wrote:
| This is an excellent post and I'm excited that my post on
| handling CSVs in AWK[1] at least somewhat inspired it.
|
| Some responses here are missing the idea that csvquote enables
| the use of CSV files with standard UNIX tools like AWK, head and
| so on. If you are building some data processing pipeline then
| converting to a binary format makes sense, but I'm not sure that
| is where tools like this shine either. It's for a different case.
|
| [1] https://earthly.dev/blog/awk-csv/
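For readers unfamiliar with csvquote, the core idea is roughly this (a simplified Python sketch that ignores escaped "" quotes; real csvquote streams stdin to stdout, and the 0x1F/0x1E substitutes follow its documented defaults):

```python
def csvquote_encode(text: str) -> str:
    """Replace separators *inside quoted fields* with nonprinting
    bytes, so line- and comma-oriented UNIX tools see one record per
    line and one field per comma."""
    out, in_quotes = [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes and ch == ',':
            ch = '\x1f'   # unit separator stands in for the comma
        elif in_quotes and ch == '\n':
            ch = '\x1e'   # record separator stands in for the newline
        out.append(ch)
    return ''.join(out)

def csvquote_restore(text: str) -> str:
    """Undo the substitution after the pipeline has run."""
    return text.replace('\x1f', ',').replace('\x1e', '\n')
```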
| BeeOnRope wrote:
| You can probably get another "few x" faster still:
|
| https://branchfree.org/2019/03/06/code-fragment-finding-quot...
|
| Based on a cursory scan, this is the same problem, so Geoff's
| trick should be directly applicable.
|
| The ideas are quite similar: using bit manipulation to detect the
| inside-quote and outside-quote regions. Geoff's trick does the
| "smearing" of the bits
| in a single SIMD "carryless multiply", plus some supporting
| stuff. It is constant time for an input word: there is no
| dependence on the number of bits. So it could be many times
| faster for inputs with very dense quotes.
|
| For an actual implementation, you'd want to handle the case
| where you don't have any quotes at all, or where they are very
| sparse (which is common for a lot of CSV), with a fast path
| that bails out when it sees a quote (possibly to be resumed
| later if there is again a quote-free run).
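A bit-level Python simulation of the smearing step: the shift-xor doubling below computes the same prefix-XOR that a carryless multiply by an all-ones word produces in a single PCLMULQDQ instruction. Bit i of the result is set when byte i sits inside a quoted region (opening quote included, closing quote excluded):

```python
def quote_mask(chunk: bytes, quote: int = ord('"')) -> int:
    # bit i of q is set iff chunk[i] is the quote character
    q = 0
    for i, b in enumerate(chunk):
        if b == quote:
            q |= 1 << i
    # prefix-XOR via shift-xor doubling: O(log n) steps, equivalent to
    # a carryless multiply of q by the all-ones word, truncated to width
    m, shift = q, 1
    while shift < len(chunk):
        m ^= m << shift
        shift <<= 1
    return m & ((1 << len(chunk)) - 1)
```

Commas whose bit is clear in the mask are structural; commas whose bit is set are data inside a quoted field.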
| henrydark wrote:
| Langdale also has a state machine simd idea that can be used
| for the full csv transition table:
| https://branchfree.org/2018/05/25/say-hello-to-my-little-fri...
|
| I implemented this at some point (the "noisy" version, in
| Langdale's words) and got a csv tokenizer that spent about 1.5
| cycles per byte.
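For reference, the transition table itself is tiny; a scalar Python sketch (the SIMD versions evaluate a table like this for many bytes per step via shuffles; the state and class names are mine):

```python
# states: U = unquoted, Q = inside quotes, E = just saw a quote while
# inside quotes (either an escaped "" or the closing quote)
U, Q, E = 0, 1, 2

def byte_class(b: int) -> int:
    if b == ord('"'):
        return 0
    if b == ord(','):
        return 1
    if b == ord('\n'):
        return 2
    return 3

# transition table indexed by (state, byte class)
T = {
    (U, 0): Q, (U, 1): U, (U, 2): U, (U, 3): U,
    (Q, 0): E, (Q, 1): Q, (Q, 2): Q, (Q, 3): Q,
    (E, 0): Q, (E, 1): U, (E, 2): U, (E, 3): U,  # "" re-enters quotes
}

def split_record(line: bytes) -> list:
    """Split one CSV record on the commas the DFA says are structural."""
    fields, start, state = [], 0, U
    for i, b in enumerate(line):
        c = byte_class(b)
        if state != Q and c == 1:  # structural comma
            fields.append(line[start:i])
            start = i + 1
        state = T[(state, c)]
    fields.append(line[start:])
    return fields
```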
| nieve wrote:
| The author seems to assume that the bit patterns representing
| commas or quotation marks won't show up in any multibyte (not
| just unicode) patterns and each byte can be interpreted in
| isolation, but is that actually safe?
| Const-me wrote:
| Yes, assuming the input is valid UTF-8.
|
| UTF-8 bytes less than 0x80 are only used for the first 128
| Unicode codepoints. Characters that need multiple bytes in the
| UTF-8 representation encode into bytes with the high bit set, i.e. >=
| 0x80: https://en.wikipedia.org/wiki/UTF-8#Encoding
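This property is easy to verify directly; a small Python check (names are illustrative):

```python
def multibyte_utf8_bytes(s: str):
    """Yield the raw UTF-8 bytes of every multi-byte code point in s."""
    for ch in s:
        enc = ch.encode("utf-8")
        if len(enc) > 1:
            yield from enc

# every byte of a multi-byte sequence (lead and continuation alike)
# has the high bit set, so none can collide with ',' (0x2C),
# '"' (0x22), or '\n' (0x0A)
structural = {0x2C, 0x22, 0x0A}
assert all(b & 0x80 and b not in structural
           for b in multibyte_utf8_bytes('\u00e9, "caf\u00e9" \u20ac \U0001d11e'))
```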
| jacoblambda wrote:
| For an arbitrary multibyte encoding, no. However, I'd say it's
| safe to just declare "all input must be valid UTF-8" and
| require the user/service to validate that first.
|
| There are a number of algorithms out there that can validate
| UTF-8 with significantly less than 1 instruction per byte. I'd
| imagine the overhead for pipelining the two is significantly
| cheaper than trying to handle the cases in the same pass.
___________________________________________________________________
(page generated 2021-12-04 23:01 UTC)