[HN Gopher] Fast CSV Processing with SIMD
       ___________________________________________________________________
        
       Fast CSV Processing with SIMD
        
       Author : ingve
       Score  : 151 points
       Date   : 2021-12-04 07:37 UTC (15 hours ago)
        
 (HTM) web link (nullprogram.com)
 (TXT) w3m dump (nullprogram.com)
        
       | plafl wrote:
        | CSVs are nice for data interchange or even for storage if
       | compressed (storage is interchange between the past and the
       | future). But the very first thing one should do when working with
       | CSV data is convert it to a binary representation. It will force
       | you to make a full pass over all the data, understand it, think
       | about the types of the columns, and you will increase iteration
       | speed.
       | 
        | More related to TFA: can someone recommend a good book or
        | other resource about modern code optimization? I have
        | internalized several lessons but it would be nice to dig
        | deeper.
        
         | speedgoose wrote:
          | Some years ago I had a huge performance increase by simply
         | converting a csv dataset to parquet before processing it with
         | Apache Spark.
        
           | mgradowski wrote:
           | Also, you will sleep better at night knowing that your column
           | dtypes are safe from harm, exactly as you stored them. Moving
           | from CSV (or god forbid, .xlsx) has been such a quality of
           | life improvement.
           | 
           | One thing I miss though is how easy it is to inspect .csv and
           | .xlsx. I kinda solved it using [1], but it only works on
           | Windows. More portable recommendations welcome!
           | 
           | [1] https://github.com/mukunku/ParquetViewer
        
             | speedgoose wrote:
              | I used to use Zeppelin, some kind of Jupyter Notebook
              | for Spark (that supports Parquet). But there may be
              | better alternatives.
             | 
             | https://zeppelin.apache.org/
        
             | cb321 wrote:
             | The "real" format being binary with debugging tools is
             | absolutely the best way to go. For example, you can use
             | `nio print` (or even just `nio p`) in the Nim multi-program
             | https://github.com/c-blake/nio to get "text debugging
             | output" of binary files.
        
             | ZeroGravitas wrote:
             | I really like Visidata for "exploring" csv-type data.
             | 
              | It's a vi(m)-inspired tool.
             | 
             | It also handles xls(x), sqlitedb and a bunch of other
             | random things, and it appears to support parquet via
             | pandas:
             | 
             | https://www.visidata.org/docs/loading/
        
               | mgradowski wrote:
               | Nice one!
        
         | shoo wrote:
         | > recommend a good book or other resource about modern code
         | optimization
         | 
         | https://www.agner.org/optimize/
        
           | plafl wrote:
           | Thanks, and I must say, very old school web page:)
        
             | errantmind wrote:
             | Agner Fog's writings are all well worth the read if you
             | care about optimization. Some of the techniques he uses in
             | his optimized versions of common C functions are really
             | interesting.
             | 
             | For example, he uses SIMD in several of these, but one of
             | the performance problems associated with SIMD is the last
              | 'n' bytes. If you're processing, say, 16 or 32 bytes at a
             | time, what do you do when you get towards the end of your
             | buffer and that process would over-read past the end of the
             | buffer? Most code I've seen just uses a simple linear
             | process for the last few bytes but this is often very slow.
             | Agner's solution was to just read past the end of the
             | buffer, but use some assembly tricks to ensure there is
             | always valid memory available after the end of the buffer,
             | that way reading past the end of the buffer would (almost)
             | never cause a problem.
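              | 
              | A rough sketch of why over-reading can be safe, in C
              | (not Agner's actual code; assumes 4 KiB pages): a
              | 16-byte load can't fault as long as it doesn't cross
              | into the next page, because memory protection has page
              | granularity.
              | 
              |   #include <stdint.h>
              |   
              |   /* Illustration only: a 16-byte load at p cannot
              |    * fault if it stays inside p's 4 KiB page, even
              |    * when it reads past the logical end of the
              |    * buffer. This lives outside the C abstract
              |    * machine, which is rather the point of the
              |    * trick. */
              |   static int wide_load_is_safe(const void *p)
              |   {
              |       return ((uintptr_t)p & 4095u) <= 4096u - 16u;
              |   }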
        
               | jandrewrogers wrote:
               | Another useful trick for reading the last few bytes,
               | especially if suitable buffer padding is not possible and
               | the read would cross a memory page boundary, is to use
               | PSHUFB aligned against the memory page boundary to pull
               | the last few bytes into a register. Safe and fast.
        
               | burntsushi wrote:
               | For the specific case of substring/byte searching,
               | another useful trick here is to do an unaligned load that
               | lines up with the end of the haystack. Some portion of
               | that load will have already been searched, but you
               | already know nothing is in it.
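                | 
                | A minimal SSE2 sketch of that trick (GCC/Clang
                | intrinsics, illustrative only): scan 16 bytes at a
                | time, then finish with one unaligned load that ends
                | exactly at the end of the haystack. The overlapped
                | bytes were already checked and contained no match,
                | so any hit must be in the unseen tail.
                | 
                |   #include <emmintrin.h>
                |   #include <stddef.h>
                |   #include <stdint.h>
                |   
                |   /* Find the first byte equal to c in s[0..n). */
                |   static const uint8_t *
                |   find_byte(const uint8_t *s, size_t n, uint8_t c)
                |   {
                |       __m128i k = _mm_set1_epi8((char)c);
                |       size_t i = 0;
                |       if (n < 16) {  /* tiny input: plain loop */
                |           for (; i < n; i++)
                |               if (s[i] == c) return s + i;
                |           return NULL;
                |       }
                |       for (; i + 16 <= n; i += 16) {
                |           __m128i v = _mm_loadu_si128(
                |               (const __m128i *)(s + i));
                |           int m = _mm_movemask_epi8(
                |               _mm_cmpeq_epi8(v, k));
                |           if (m) return s + i + __builtin_ctz(m);
                |       }
                |       if (i < n) {  /* overlapping final load */
                |           __m128i v = _mm_loadu_si128(
                |               (const __m128i *)(s + n - 16));
                |           int m = _mm_movemask_epi8(
                |               _mm_cmpeq_epi8(v, k));
                |           /* bits in the overlap are zero, since
                |            * those bytes were already checked */
                |           if (m) return s + (n - 16)
                |                        + __builtin_ctz(m);
                |       }
                |       return NULL;
                |   }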
        
         | ComputerGuru wrote:
         | I have a one-liner alias/script csv_to_sqlite that just starts
         | an sqlite3 shell with the csv converted to a table. It's a
         | lifesaver.
        
       | cb321 wrote:
        | I get ~50% of the speed of the article's variant with no SIMD at all
       | in https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim
       | 
        | While in Nim, the main logic is really just about 1 screenful for
       | me and should not be so hard to follow.
       | 
        | As commented elsewhere in this thread, but it bears (much!)
        | repeating: a better approach is to bulk convert text to binary
       | and then operate off of that. One feature you get is fixed sized
       | rows and thus random access to rows without an index. You can
       | even mmap the file and cast it to a struct pointer if you like
       | (though you need the struct pointer to be the right type). When
       | DBs or DB-ish file formats are faster, being in binary is the 0th
       | order reason why.
       | 
       | The main reason _not_ to do this is if you have no disk space for
       | it.
        
         | secondcoming wrote:
         | Isn't casting to a struct pointer full of Undefined Behaviour?
         | I believe you have to memcpy or bitcast and hope that the
         | compiler optimises it out.
        
           | cb321 wrote:
           | No.
           | 
           | To be more clear, you can mmap read only and just cast the
           | pointer. You probably want the struct type to be "packed"
           | unless you are intimately familiar with your compiler's
           | struct padding behavior. Every C compiler I am aware of has
           | _some_ way to specify  "packed", but it is admittedly not the
            | most portable construct. In any event, if you are ever doing
           | memcpy calls then you still must worry about struct layout.
           | So, the memcpy is not buying you anything.
           | 
           | You also do need to not exceed the size of the "file array",
           | just like any array in C, but pre-mmap you usually open &
           | fstat and the latter tells you the size of the file and
           | sizeof(structType) tells you the size of the struct.
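            | 
            | A tiny sketch of the pattern (hypothetical Row layout,
            | POSIX mmap/fstat, GCC/Clang "packed" attribute):
            | 
            |   #include <fcntl.h>
            |   #include <stdint.h>
            |   #include <stdio.h>
            |   #include <sys/mman.h>
            |   #include <sys/stat.h>
            |   #include <unistd.h>
            |   
            |   /* Hypothetical fixed-size record; "packed" so the
            |    * on-disk bytes match the in-memory layout. */
            |   typedef struct __attribute__((packed)) {
            |       int64_t id;
            |       double  price;
            |       int32_t qty;
            |   } Row;
            |   
            |   int main(int argc, char **argv)
            |   {
            |       if (argc < 2) return 1;
            |       int fd = open(argv[1], O_RDONLY);
            |       struct stat st;
            |       if (fd < 0 || fstat(fd, &st) < 0) return 1;
            |       size_t n = (size_t)st.st_size / sizeof(Row);
            |       const Row *rows = mmap(NULL, (size_t)st.st_size,
            |                              PROT_READ, MAP_PRIVATE,
            |                              fd, 0);
            |       if (rows == MAP_FAILED) return 1;
            |       /* fixed-size rows: O(1) random access, no
            |        * index, nothing parsed until it is touched */
            |       if (n) printf("%f\n", rows[n - 1].price);
            |       munmap((void *)rows, (size_t)st.st_size);
            |       close(fd);
            |       return 0;
            |   }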
           | 
            | So, there is an implicit assumption that the underlying
            | file is not shrunk between said fstat and whenever you
           | access it by virtual memory indirection. Outright file
           | deletion would actually be ok - but truncating it would leave
           | the kernel unable to satisfy the page fault.
           | 
           | Other code/entities shrinking files on you is really very far
           | from what one usually means by "casting pointers and
           | undefined behavior", though. It applies to any prog.lang.
           | Even "safe Rust" doesn't block this problem.
           | 
           | Another similar aspect of your follow-up belief depends upon
           | the OS/sys admin, really. For example, many Linux systems
           | default to overcommitting memory. This means you could malloc
            | and memcpy and still cast a (packed) struct, but if that
            | were swapped out and the system is low on real physical RAM
            | when the swap-in starts to happen, you might segfault.
           | Heuristics like out-of-memory killers are just that -
           | heuristic not precise. If the data is on an actually
           | persistent backing store, though, the kernel can always page
           | that in. So, in some ways the mmap read only approach is
           | safer than the alloc & copy approach (if you know that
           | nothing will shrink the file on you post-fstat).
           | 
           | Cheers
        
       | amanaplanablog wrote:
       | Seems a bit overly simplistic and not realistic. Most real life
       | data wouldn't be ASCII based. I deal with a lot of large CSVs and
       | do a lot of processing (multi-terabyte CSVs). Null bytes, invalid
       | unicode, all kinds of issues are very common in real world data.
        | Not to mention improperly formatted CSVs (unclosed quotes,
        | etc.).
        
         | jeffbee wrote:
         | I don't know, huge amounts of real-world data are ASCII in my
         | experience, but it is also easy to program and cheap at runtime
         | to scan ahead in your input to check if the high bit is
         | universally clear. Current computers can do this at absurd
         | speeds (~64B/cycle or ~200GB/s).
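          | 
          | Roughly like this AVX2 sketch (illustrative, assumes
          | -mavx2): OR all the bytes together and check the high
          | bits once at the end.
          | 
          |   #include <immintrin.h>
          |   #include <stddef.h>
          |   #include <stdint.h>
          |   
          |   /* 1 if every byte of buf[0..n) has the high bit
          |    * clear, i.e. the input is pure ASCII. */
          |   static int all_ascii(const uint8_t *buf, size_t n)
          |   {
          |       __m256i acc = _mm256_setzero_si256();
          |       size_t i = 0;
          |       for (; i + 32 <= n; i += 32)
          |           acc = _mm256_or_si256(acc,
          |               _mm256_loadu_si256(
          |                   (const __m256i *)(buf + i)));
          |       uint8_t tail = 0;
          |       for (; i < n; i++) tail |= buf[i];
          |       return _mm256_movemask_epi8(acc) == 0
          |           && (tail & 0x80) == 0;
          |   }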
        
       | lettergram wrote:
       | I really should write up how we did delimiter and quote detection
       | in this library:
       | 
       | https://github.com/capitalone/DataProfiler
       | 
       | It turns out delimited files IMO are much harder to parse than
       | say, JSON. Largely because they have so many different
       | permutations. The article covers CSVs, but many files are tab or
       | null separated. We've even seen @ separated with ' for quotes.
       | 
       | Given the above, it should still be possible to use the method
       | described. I'm guessing you'd have to detect the separators and
       | quote chars first, however. You'd have to also handle empty rows
       | and corrupted rows (which happen often enough).
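        | 
        | As a crude illustration of detecting the separator (not the
        | DataProfiler approach, and it ignores quoting): count each
        | candidate delimiter over the first few lines and keep the
        | one whose per-line count is nonzero and consistent.
        | 
        |   #include <stdio.h>
        |   
        |   /* Crude delimiter sniffing: pick the candidate whose
        |    * count is the same (and nonzero) on every sampled
        |    * line. Quoted fields are ignored for brevity. */
        |   static char sniff_delim(FILE *f)
        |   {
        |       const char cand[5] = {',', '\t', ';', '|', '@'};
        |       int first[5] = {0}, stable[5] = {1, 1, 1, 1, 1};
        |       char line[4096];
        |       for (int ln = 0;
        |            ln < 8 && fgets(line, sizeof line, f); ln++) {
        |           for (int c = 0; c < 5; c++) {
        |               int n = 0;
        |               for (char *p = line; *p; p++)
        |                   if (*p == cand[c]) n++;
        |               if (ln == 0) first[c] = n;
        |               else if (n != first[c]) stable[c] = 0;
        |           }
        |       }
        |       for (int c = 0; c < 5; c++)
        |           if (stable[c] && first[c] > 0) return cand[c];
        |       return ',';   /* fall back to comma */
        |   }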
        
         | bob1029 wrote:
         | > It turns out delimited files IMO are much harder to parse
         | than say, JSON.
         | 
         | I have to agree, having recently done a deep-dive into the new
         | Utf8JsonReader in .NET. Tracking state in a JSON parsing
         | context can actually be quite elegant. CSV is a minefield of
         | edge cases in my experience.
        
       | pueblito wrote:
       | That was a really fun and accessible read, thanks
        
       | adamgordonbell wrote:
       | This is an excellent post and I'm excited that my post on
       | handling CSVs in AWK[1] at least somewhat inspired it.
       | 
       | Some responses here are missing the idea that csvquote enables
       | the use of CSV files with standard UNIX tools like AWK, head and
       | so on. If you are building some data processing pipeline then
       | converting to a binary format makes sense, but I'm not sure that
       | is where tools like this shine either. It's for a different case.
       | 
       | [1] https://earthly.dev/blog/awk-csv/
        
       | BeeOnRope wrote:
       | You can probably get another "few x" faster still:
       | 
       | https://branchfree.org/2019/03/06/code-fragment-finding-quot...
       | 
       | Based on a cursory scan this is the same problem so Geoff's trick
       | should be directly applicable.
       | 
        | The ideas are quite similar: using bit manipulation to detect
        | the inside-quote and outside-quote regions. Geoff's trick does
        | the "smearing" of the bits
       | in a single SIMD "carryless multiply", plus some supporting
       | stuff. It is constant time for an input word: there is no
       | dependence on the number of bits. So it could be many times
       | faster for inputs with very dense quotes.
       | 
        | For an actual implementation, you'd want to handle the case
        | where you don't have any quotes at all or where they are very
        | sparse, which is common to a lot of CSV, with a fast path that
        | bails out when it sees a quote (possibly to be resumed later if
        | there is again a quote-free run).
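        | 
        | The heart of it looks roughly like this (a sketch, needs
        | -mpclmul): a carry-less multiply of the quote bitmap by
        | all-ones gives, at every bit position, the XOR of all the
        | quote bits at or before it, i.e. the inside-quotes mask for
        | a whole 64-byte block at once.
        | 
        |   #include <immintrin.h>
        |   #include <stdint.h>
        |   
        |   /* quotes: bit i set if byte i of the block is '"'.
        |    * Returns a mask with bit i set if byte i sits in a
        |    * quoted region (odd number of quotes at or before
        |    * i). Carrying the parity across 64-byte blocks (an
        |    * unclosed quote) is left out to keep this short. */
        |   static uint64_t inside_quotes(uint64_t quotes)
        |   {
        |       __m128i q = _mm_set_epi64x(0, (long long)quotes);
        |       __m128i ones = _mm_set1_epi8((char)0xFF);
        |       __m128i r = _mm_clmulepi64_si128(q, ones, 0);
        |       return (uint64_t)_mm_cvtsi128_si64(r);
        |   }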
        
         | henrydark wrote:
          | Langdale also has a state-machine SIMD idea that can be used
         | for the full csv transition table:
         | https://branchfree.org/2018/05/25/say-hello-to-my-little-fri...
         | 
         | I implemented this at some point (the "noisy" version, in
          | Langdale's words) and got a CSV tokenizer that spent about 1.5
         | cycles per byte.
        
       | nieve wrote:
       | The author seems to assume that the bit patterns representing
       | commas or quotation marks won't show up in any multibyte (not
        | just Unicode) encodings and that each byte can be interpreted
        | in isolation, but is that actually safe?
        
         | Const-me wrote:
         | Yes, assuming the input is valid UTF-8.
         | 
         | UTF-8 bytes less than 0x80 are only used for the first 128
          | Unicode codepoints. Characters that need multiple bytes in
          | the UTF-8 representation encode into bytes with the high bit
          | set, i.e. >= 0x80:
          | https://en.wikipedia.org/wiki/UTF-8#Encoding
        
         | jacoblambda wrote:
          | For any multibyte representation, no; however, I'd say it's
          | safe to just say "all encodings must be valid UTF-8" and
          | require the user/service to validate that first.
         | 
         | There are a number of algorithms out there that can validate
         | UTF-8 with significantly less than 1 instruction per byte. I'd
         | imagine the overhead for pipelining the two is significantly
         | cheaper than trying to handle the cases in the same pass.
        
       ___________________________________________________________________
       (page generated 2021-12-04 23:01 UTC)