[HN Gopher] Show HN: High-speed UTF-8 validation in Rust
___________________________________________________________________
Show HN: High-speed UTF-8 validation in Rust
Author : kryps
Score : 111 points
Date : 2021-04-21 09:58 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| celeritascelery wrote:
| > The implementation is similar to the one in simdjson except
| that it aligns reads to the block size of the SIMD extension,
| which leads to better peak performance compared to the
| implementation in simdjson.
|
| I didn't really understand this part. Aligned to what? to the
| cache line? SIMD always reads the block size. Unless I am missing
| something here.
| unwind wrote:
| I read it as "to the width of the SIMD registers" which I have
| seen in other quick scanners, but did not read the code here.
| celeritascelery wrote:
| Aren't all reads aligned to the width of the SIMD register?
| If I do an AVX512 command it will read 512 bits right?
| jsheard wrote:
| It's about where you read data from, not how much data gets
| read. For example an AVX read is aligned if the address
| being read from is a multiple of 32 bytes, otherwise it's
| unaligned and runs slightly slower, and slower still if it
| happens to straddle two cachelines. The same applies to
| write instructions as well.
|
| It's less of an issue than it used to be, the penalty for
| unaligned access has steadily been reduced by newer CPU
| architectures, but it's still there.
| celeritascelery wrote:
| ahh, so it does come back to cache line alignment.
| Reading aligned data doesn't give any benefit in and of
| itself[1]. At least not on modern hardware. I guess the
| performance improvement would make sense since SIMD
| instructions are sized to be a multiple of the cache line
| size.
|
| [1] https://lemire.me/blog/2012/05/31/data-alignment-for-
| speed-m...
| meisel wrote:
| How does this compare, speed-wise, to the UTF-8 validation done
| in simdjson?
| tyingq wrote:
| If someone is wanting to make a code comparison, most of the
| UTF-8 validation in simdjson seems to be nicely summed up in
| this pull request that sped it up:
| https://github.com/simdjson/simdjson/pull/993/files
| kryps wrote:
| Check the benchmarks section
| (https://github.com/rusticstuff/simdutf8#Benchmarks), second
| table. simdutf8 is up to 28 % faster on my Comet Lake CPU.
| However with pure ASCII clang does something magical with
| simdjson and it beats my implementation by a lot. GCC-compiled
| simdjson is slower all around except for a few outliers with
| short byte sequences.
|
| The algorithm is the one from simdjson, the main difference is
| that it uses an extra step in the beginning to align reads to
| the SIMD block size.
| nerdponx wrote:
| Does Rust compile code that can be used/called from other
| languages with an FFI? That to me is always one of the persistent
| advantages of C, I don't have to fully understand how compile
| machine code works on a technical level, but there are lots of
| different languages that let me use C libraries in their own
| language runtimes.
|
| It would be great to have things like high-performance unicode
| handling with consistent semantics across multiple languages!
| hardwaresofton wrote:
| Yes it does -- one of the great parts about Rust. There are
| lots of projects out there slotting rust into other languages
| and getting crazy speedups.
|
| - https://doc.rust-lang.org/reference/linkage.html
|
| - https://doc.rust-lang.org/book/ch19-01-unsafe-
| rust.html#call...
| Disp4tch wrote:
| Yes, from the python side of things there are tools like py03
| that make integrating rust into python code really painless.
|
| I have a sql parsing library (shameless plug) that is 50x
| faster than any other python implementation, it is just a super
| simple wrapper around a rust crate.
|
| https://github.com/wseaton/sqloxide
| gliptic wrote:
| Yes, you can expose things with the C ABI. This is essential
| for how it's used in Firefox.
| jhgb wrote:
| A few days ago I read https://dev.to/veer66/calling-rust-from-
| common-lisp-45c5 , which illustrates presumably what you're
| talking about.
| kzrdude wrote:
| Yes, Rust natively can expose C ffi functions (for example).
|
| The https://cxx.rs/ project is also a major crate for C++
| interoperability.
| parhamn wrote:
| I was wonder about what the ARM/M1 support for simd like
| instructions are. It seems like it will be a while for these simd
| packages to be as performant on ARM. Is this correct?
| jsheard wrote:
| ARM has it's own SIMD instruction sets (NEON and SVE) but the
| intrinsics have yet to be stabilized in Rust, so the Rust SIMD
| ecosystem is x86-centric for now.
|
| The plan is for Rust to eventually have a portable SIMD
| abstraction built into the standard library to reduce the need
| for CPU-specific code.
| TwoBit wrote:
| Given the current ubiquity of UTF-8, is there value in having
| native CPU instructions for it? Or do they exist somewhere
| already?
| dahfizz wrote:
| What would these cpu instructions do?
| kouteiheika wrote:
| So are there any plans to eventually contribute this to the
| standard library, e.g. just like hashbrown was?
| mrec wrote:
| The author discusses this in the corresponding /r/rust thread:
|
| https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
|
| tl;dr: it's not straightforward, mostly because the standard
| functionality lives in `core` which doesn't have access to OS
| support for CPU feature detection.
| bombcar wrote:
| Does Rust allow for libraries to "override" core
| functionality when available?
| Kinrany wrote:
| If it was all userspace Rust, it would have been possible
| to choose the implementation at compile time conditional on
| the presence of std.
| [deleted]
| [deleted]
| tialaramex wrote:
| One flavour I would expect to be valuable that isn't present here
| is this:
|
| Process the input (as quickly as possible) and never fail, but
| replace each invalid sequence of bytes with U+FFFD (bytes 0xEF
| 0xBF 0xBD).
| nerdponx wrote:
| I don't know, you are throwing away the length of the invalid
| sequence that way. Maybe useful in certain situations, but
| probably not in most.
| nezirus wrote:
| Yeah, I'd like to see that as well, something like
| to_string_lossy(), but really fast, so I can call it whenever I
| expect occasional broken UTF-8 string (and don't care about few
| replaced characters).
| chrismorgan wrote:
| One major problem with this: you can't do it in place. Invalid
| byte sequences could be 1-4 bytes long, but U+FFFD is exactly
| three bytes long.
| cryptonector wrote:
| Use an ASCII character like `?` or, even better, ASCII SUB
| (0x1A, substitute), which is specifically intended for this
| sort of thing. Failing that, there are a number of other
| unused ASCII control characters like VT (vertical tab), FS
| (file separator), GS (group separator), US (unit separator),
| CAN (cancel), etc.
| tialaramex wrote:
| No. Don't do any of these things. The reason U+FFFD exists,
| even though ASCII has any number of fun things you can
| scribble in one byte is that U+FFFD specifically _isn 't_
| any of the things your program probably didn't expect to
| appear unexpectedly after unrelated processing.
|
| It isn't a letter, or a digit, or whitespace, or
| punctuation, or a word separator, or a control character,
| it is neither uppercase nor lowercase, it doesn't have any
| canonical equivalents - it's just a codepoint that exists
| specifically for this purpose.
|
| As a result it's _much_ less likely that if gibberish
| sneaks into your system somehow and gets turned into U+FFFD
| this causes something important to break elsewhere.
|
| And when sooner or later a _human_ is shown this text, it
| 's very obvious that U+FFFD isn't what they expected,
| whether that was E-acute, a Euro currency symbol, a cat
| emoji or whatever else, and the human will know something
| went wrong and can decide if they care about that.
| cryptonector wrote:
| This is about what can be done in SIMD. See GP and GGP. I
| think using ASCII SUB is a perfectly reasonable thing to
| do in this context. You can always further post-process
| the result to turn ASCII SUB into U+FFFD.
| AjtevhihBudhirv wrote:
| There're still faster approaches than naive and probably
| common "copy valid byte sequences one by one into a resizable
| result buffer". For instance, scan through the input bytes
| all at once, keeping track of position and length of valid
| sequences, then memcopy each valid sequence into a
| preallocated buffer.
|
| Edit: Although, it looks like Rust's std already does this,
| except for preallocating an exactly correct size result
| buffer: https://doc.rust-
| lang.org/src/alloc/string.rs.html#538
| duckerude wrote:
| > preallocating an exactly correct size result buffer
|
| Looks like it just uses the size of the original slice. If
| the average broken chunk is less than three bytes (maybe
| quite common?) then it'll have to grow the buffer, at least
| doubling it. >> let bytestring =
| b"foobar\xcc"; >> bytestring.len() 7 >>
| let cleaned =
| String::from_utf8_lossy(bytestring).into_owned(); >>
| cleaned.len() 9 >> cleaned.capacity() 14
| loa_in_ wrote:
| I bet one could modify this to save indexes of invalid chars
| and do what you say with just one additional copy
| celeritascelery wrote:
| I find it interesting that with the state of rust SIMD many
| implementations just opted for reimplementing the code multiple
| times for each intrinsic (SSE, Neon, AVX, etc). One would think
| that one of the generic libraries (faster, simdeez, packed_simd)
| would start to see widespread use, but they all seem to have
| issues that have prevented adoption.
| monocasa wrote:
| The various SIMD ISAs are different enough from eachother that
| you really want a custom implementation per ISA for perf
| reasons. Particularly if you're doing something off the beaten
| path like this rather than cranking through some vector math.
| Fronzie wrote:
| Even for vector math, there are subtle differences, such as
| the numerical precision of the hardware reciprocal, leading
| to different implementations for x/y.
|
| The Eigen library did recently combine most SSE and AVX code
| paths, however:
| https://eigen.tuxfamily.org/index.php?title=3.4
| celeritascelery wrote:
| I don't know. Looking in their code, they have one for SSE
| and another for AVX2. The algorithm is identical expect for
| one is 128 and the other is 256 bits. It seems like something
| that would be easy to abstract over so you only had a single
| implementation (at least for x86 SIMD). But it is obviously a
| hard problem or it would be solved already.
| BelenusMordred wrote:
| packed_simd is now packed_simd2, the original author is MIA and
| no one else has crate or repo permissions
|
| I really think the general push right now is towards stdsimd
| being the future for portability.
| mooman219 wrote:
| I work on a SIMD optimized font library [0] and have stumbled
| into the same situation of hand writing SIMD intrinsics. Some
| things are just kinda hard to make sure they get optimized
| correctly, and there is enough difference between the platforms
| where that matters when fiddling with bits. I also kinda have
| fun writing SIMD code like this too.
|
| [0]:
| https://github.com/mooman219/fontdue/blob/master/src/platfor...
| jfk13 wrote:
| I'd be curious how this compares with the encoding_rs crate
| (https://github.com/hsivonen/encoding_rs), which also offers
| UTF-8 validation (among many other things).
| cryptonector wrote:
| This is SIMD based. If the encoding_rs crate isn't SIMD based,
| then this thing will beat its pants off.
| burntsushi wrote:
| It does use SIMD (when enabled). But also, the encoding_rs
| crate is doing more. It isn't just checking for validity.
| It's also doing transcoding and error detection.
___________________________________________________________________
(page generated 2021-04-21 23:02 UTC)