[HN Gopher] Show HN: High-speed UTF-8 validation in Rust
       ___________________________________________________________________
        
       Show HN: High-speed UTF-8 validation in Rust
        
       Author : kryps
       Score  : 111 points
       Date   : 2021-04-21 09:58 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | celeritascelery wrote:
       | > The implementation is similar to the one in simdjson except
       | that it aligns reads to the block size of the SIMD extension,
       | which leads to better peak performance compared to the
       | implementation in simdjson.
       | 
       | I didn't really understand this part. Aligned to what? to the
       | cache line? SIMD always reads the block size. Unless I am missing
       | something here.
        
         | unwind wrote:
         | I read it as "to the width of the SIMD registers" which I have
         | seen in other quick scanners, but did not read the code here.
        
           | celeritascelery wrote:
           | Aren't all reads aligned to the width of the SIMD register?
           | If I do an AVX512 command it will read 512 bits right?
        
             | jsheard wrote:
             | It's about where you read data from, not how much data gets
             | read. For example an AVX read is aligned if the address
             | being read from is a multiple of 32 bytes, otherwise it's
             | unaligned and runs slightly slower, and slower still if it
             | happens to straddle two cachelines. The same applies to
             | write instructions as well.
             | 
             | It's less of an issue than it used to be, the penalty for
             | unaligned access has steadily been reduced by newer CPU
             | architectures, but it's still there.
        
               | celeritascelery wrote:
               | ahh, so it does come back to cache line alignment.
               | Reading aligned data doesn't give any benefit in and of
               | itself[1]. At least not on modern hardware. I guess the
               | performance improvement would make sense since SIMD
               | instructions are sized to be a multiple of the cache line
               | size.
               | 
               | [1] https://lemire.me/blog/2012/05/31/data-alignment-for-
               | speed-m...
        
       | meisel wrote:
       | How does this compare, speed-wise, to the UTF-8 validation done
       | in simdjson?
        
         | tyingq wrote:
         | If someone is wanting to make a code comparison, most of the
         | UTF-8 validation in simdjson seems to be nicely summed up in
         | this pull request that sped it up:
         | https://github.com/simdjson/simdjson/pull/993/files
        
         | kryps wrote:
         | Check the benchmarks section
         | (https://github.com/rusticstuff/simdutf8#Benchmarks), second
         | table. simdutf8 is up to 28 % faster on my Comet Lake CPU.
         | However with pure ASCII clang does something magical with
         | simdjson and it beats my implementation by a lot. GCC-compiled
         | simdjson is slower all around except for a few outliers with
         | short byte sequences.
         | 
         | The algorithm is the one from simdjson, the main difference is
         | that it uses an extra step in the beginning to align reads to
         | the SIMD block size.
        
       | nerdponx wrote:
       | Does Rust compile code that can be used/called from other
       | languages with an FFI? That to me is always one of the persistent
       | advantages of C, I don't have to fully understand how compile
       | machine code works on a technical level, but there are lots of
       | different languages that let me use C libraries in their own
       | language runtimes.
       | 
       | It would be great to have things like high-performance unicode
       | handling with consistent semantics across multiple languages!
        
         | hardwaresofton wrote:
         | Yes it does -- one of the great parts about Rust. There are
         | lots of projects out there slotting rust into other languages
         | and getting crazy speedups.
         | 
         | - https://doc.rust-lang.org/reference/linkage.html
         | 
         | - https://doc.rust-lang.org/book/ch19-01-unsafe-
         | rust.html#call...
        
         | Disp4tch wrote:
         | Yes, from the python side of things there are tools like py03
         | that make integrating rust into python code really painless.
         | 
         | I have a sql parsing library (shameless plug) that is 50x
         | faster than any other python implementation, it is just a super
         | simple wrapper around a rust crate.
         | 
         | https://github.com/wseaton/sqloxide
        
         | gliptic wrote:
         | Yes, you can expose things with the C ABI. This is essential
         | for how it's used in Firefox.
        
         | jhgb wrote:
         | A few days ago I read https://dev.to/veer66/calling-rust-from-
         | common-lisp-45c5 , which illustrates presumably what you're
         | talking about.
        
         | kzrdude wrote:
         | Yes, Rust natively can expose C ffi functions (for example).
         | 
         | The https://cxx.rs/ project is also a major crate for C++
         | interoperability.
        
       | parhamn wrote:
       | I was wonder about what the ARM/M1 support for simd like
       | instructions are. It seems like it will be a while for these simd
       | packages to be as performant on ARM. Is this correct?
        
         | jsheard wrote:
         | ARM has it's own SIMD instruction sets (NEON and SVE) but the
         | intrinsics have yet to be stabilized in Rust, so the Rust SIMD
         | ecosystem is x86-centric for now.
         | 
         | The plan is for Rust to eventually have a portable SIMD
         | abstraction built into the standard library to reduce the need
         | for CPU-specific code.
        
       | TwoBit wrote:
       | Given the current ubiquity of UTF-8, is there value in having
       | native CPU instructions for it? Or do they exist somewhere
       | already?
        
         | dahfizz wrote:
         | What would these cpu instructions do?
        
       | kouteiheika wrote:
       | So are there any plans to eventually contribute this to the
       | standard library, e.g. just like hashbrown was?
        
         | mrec wrote:
         | The author discusses this in the corresponding /r/rust thread:
         | 
         | https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
         | 
         | tl;dr: it's not straightforward, mostly because the standard
         | functionality lives in `core` which doesn't have access to OS
         | support for CPU feature detection.
        
           | bombcar wrote:
           | Does Rust allow for libraries to "override" core
           | functionality when available?
        
             | Kinrany wrote:
             | If it was all userspace Rust, it would have been possible
             | to choose the implementation at compile time conditional on
             | the presence of std.
        
         | [deleted]
        
         | [deleted]
        
       | tialaramex wrote:
       | One flavour I would expect to be valuable that isn't present here
       | is this:
       | 
       | Process the input (as quickly as possible) and never fail, but
       | replace each invalid sequence of bytes with U+FFFD (bytes 0xEF
       | 0xBF 0xBD).
        
         | nerdponx wrote:
         | I don't know, you are throwing away the length of the invalid
         | sequence that way. Maybe useful in certain situations, but
         | probably not in most.
        
         | nezirus wrote:
         | Yeah, I'd like to see that as well, something like
         | to_string_lossy(), but really fast, so I can call it whenever I
         | expect occasional broken UTF-8 string (and don't care about few
         | replaced characters).
        
         | chrismorgan wrote:
         | One major problem with this: you can't do it in place. Invalid
         | byte sequences could be 1-4 bytes long, but U+FFFD is exactly
         | three bytes long.
        
           | cryptonector wrote:
           | Use an ASCII character like `?` or, even better, ASCII SUB
           | (0x1A, substitute), which is specifically intended for this
           | sort of thing. Failing that, there are a number of other
           | unused ASCII control characters like VT (vertical tab), FS
           | (file separator), GS (group separator), US (unit separator),
           | CAN (cancel), etc.
        
             | tialaramex wrote:
             | No. Don't do any of these things. The reason U+FFFD exists,
             | even though ASCII has any number of fun things you can
             | scribble in one byte is that U+FFFD specifically _isn 't_
             | any of the things your program probably didn't expect to
             | appear unexpectedly after unrelated processing.
             | 
             | It isn't a letter, or a digit, or whitespace, or
             | punctuation, or a word separator, or a control character,
             | it is neither uppercase nor lowercase, it doesn't have any
             | canonical equivalents - it's just a codepoint that exists
             | specifically for this purpose.
             | 
             | As a result it's _much_ less likely that if gibberish
             | sneaks into your system somehow and gets turned into U+FFFD
             | this causes something important to break elsewhere.
             | 
             | And when sooner or later a _human_ is shown this text, it
             | 's very obvious that U+FFFD isn't what they expected,
             | whether that was E-acute, a Euro currency symbol, a cat
             | emoji or whatever else, and the human will know something
             | went wrong and can decide if they care about that.
        
               | cryptonector wrote:
               | This is about what can be done in SIMD. See GP and GGP. I
               | think using ASCII SUB is a perfectly reasonable thing to
               | do in this context. You can always further post-process
               | the result to turn ASCII SUB into U+FFFD.
        
           | AjtevhihBudhirv wrote:
           | There're still faster approaches than naive and probably
           | common "copy valid byte sequences one by one into a resizable
           | result buffer". For instance, scan through the input bytes
           | all at once, keeping track of position and length of valid
           | sequences, then memcopy each valid sequence into a
           | preallocated buffer.
           | 
           | Edit: Although, it looks like Rust's std already does this,
           | except for preallocating an exactly correct size result
           | buffer: https://doc.rust-
           | lang.org/src/alloc/string.rs.html#538
        
             | duckerude wrote:
             | > preallocating an exactly correct size result buffer
             | 
             | Looks like it just uses the size of the original slice. If
             | the average broken chunk is less than three bytes (maybe
             | quite common?) then it'll have to grow the buffer, at least
             | doubling it.                 >> let bytestring =
             | b"foobar\xcc";       >> bytestring.len()       7       >>
             | let cleaned =
             | String::from_utf8_lossy(bytestring).into_owned();       >>
             | cleaned.len()       9       >> cleaned.capacity()       14
        
         | loa_in_ wrote:
         | I bet one could modify this to save indexes of invalid chars
         | and do what you say with just one additional copy
        
       | celeritascelery wrote:
       | I find it interesting that with the state of rust SIMD many
       | implementations just opted for reimplementing the code multiple
       | times for each intrinsic (SSE, Neon, AVX, etc). One would think
       | that one of the generic libraries (faster, simdeez, packed_simd)
       | would start to see widespread use, but they all seem to have
       | issues that have prevented adoption.
        
         | monocasa wrote:
         | The various SIMD ISAs are different enough from eachother that
         | you really want a custom implementation per ISA for perf
         | reasons. Particularly if you're doing something off the beaten
         | path like this rather than cranking through some vector math.
        
           | Fronzie wrote:
           | Even for vector math, there are subtle differences, such as
           | the numerical precision of the hardware reciprocal, leading
           | to different implementations for x/y.
           | 
           | The Eigen library did recently combine most SSE and AVX code
           | paths, however:
           | https://eigen.tuxfamily.org/index.php?title=3.4
        
           | celeritascelery wrote:
           | I don't know. Looking in their code, they have one for SSE
           | and another for AVX2. The algorithm is identical expect for
           | one is 128 and the other is 256 bits. It seems like something
           | that would be easy to abstract over so you only had a single
           | implementation (at least for x86 SIMD). But it is obviously a
           | hard problem or it would be solved already.
        
         | BelenusMordred wrote:
         | packed_simd is now packed_simd2, the original author is MIA and
         | no one else has crate or repo permissions
         | 
         | I really think the general push right now is towards stdsimd
         | being the future for portability.
        
         | mooman219 wrote:
         | I work on a SIMD optimized font library [0] and have stumbled
         | into the same situation of hand writing SIMD intrinsics. Some
         | things are just kinda hard to make sure they get optimized
         | correctly, and there is enough difference between the platforms
         | where that matters when fiddling with bits. I also kinda have
         | fun writing SIMD code like this too.
         | 
         | [0]:
         | https://github.com/mooman219/fontdue/blob/master/src/platfor...
        
       | jfk13 wrote:
       | I'd be curious how this compares with the encoding_rs crate
       | (https://github.com/hsivonen/encoding_rs), which also offers
       | UTF-8 validation (among many other things).
        
         | cryptonector wrote:
         | This is SIMD based. If the encoding_rs crate isn't SIMD based,
         | then this thing will beat its pants off.
        
           | burntsushi wrote:
           | It does use SIMD (when enabled). But also, the encoding_rs
           | crate is doing more. It isn't just checking for validity.
           | It's also doing transcoding and error detection.
        
       ___________________________________________________________________
       (page generated 2021-04-21 23:02 UTC)