https://github.com/rusticstuff/simdutf8
# simdutf8 - High-speed UTF-8 validation for Rust

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs.

## Disclaimer

This software should be considered alpha quality and should not (yet) be used in production, though it has been tested with sample data as well as a fuzzer and there are no known bugs. It will be tested more rigorously before the first production release.

## Features

* `basic` API for the fastest validation, optimized for valid UTF-8
* `compat` API as a fully compatible replacement for `std::str::from_utf8()`
* Up to twenty times faster than the std library on non-ASCII, up to twice as fast on ASCII
* Up to 28% faster on non-ASCII input compared to the original simdjson implementation
* Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64. ARMv7 and ARMv8 neon support is planned
* Selects the fastest implementation at runtime based on CPU support
* Written in pure Rust
* No dependencies
* No-std support
* Falls back to the excellent std implementation if SIMD extensions are not supported

## Quick start

Add the dependency to your Cargo.toml file:

```toml
[dependencies]
simdutf8 = { version = "0.1.0" }
```

Use `simdutf8::basic::from_utf8` as a drop-in replacement for `std::str::from_utf8()`.

```rust
use simdutf8::basic::from_utf8;

fn main() {
    println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
}
```

If you need detailed information on validation failures, use `simdutf8::compat::from_utf8` instead.

```rust
use simdutf8::compat::from_utf8;

fn main() {
    // The trailing \xEF\xB8 is an incomplete three-byte sequence.
    let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
    assert_eq!(err.valid_up_to(), 5);
    assert_eq!(err.error_len(), Some(2));
}
```

## APIs

### Basic flavor

Use the `basic` API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. `simdutf8::basic::Utf8Error` is a zero-sized error struct.

### Compat flavor

The `compat` flavor is fully API-compatible with `std::str::from_utf8`. In particular, `simdutf8::compat::from_utf8()` returns a `simdutf8::compat::Utf8Error`, which has `valid_up_to()` and `error_len()` methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

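The replacement-character use case can be sketched as follows. This is a minimal illustration, not code from the crate: the helper name `decode_lossy` is hypothetical, and the sketch calls `std::str::from_utf8` so that it runs without extra dependencies. Since the compat flavor is described as API-compatible, swapping the import for `simdutf8::compat::from_utf8` is assumed to work unchanged.

```rust
// Assumption: `simdutf8::compat::from_utf8` could be imported here instead.
use std::str::from_utf8;

/// Hypothetical helper: decodes `input`, replacing each invalid
/// byte sequence with U+FFFD (the Unicode replacement character).
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(from_utf8(valid).expect("prefix was already validated"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep going.
                    Some(len) => input = &rest[len..],
                    // `None` means the input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    println!("{}", decode_lossy(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!"));
}
```

The standard library already offers `String::from_utf8_lossy` for this exact task; the loop is spelled out here only to show how `valid_up_to()` and `error_len()` work together.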
The compat flavor also fails early: errors are checked on-the-fly as the string is processed, and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a performance penalty compared to the basic API even if the input is valid UTF-8.

## Implementation selection

The fastest implementation is selected at runtime using the `std::is_x86_feature_detected!` macro unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with `RUSTFLAGS="-C target-cpu=native"` on a recent x86-64 machine, the AVX 2 implementation is selected at compile time and runtime selection is disabled.

For no-std support (compiled with `--no-default-features`) the implementation is always selected at compile time based on the targeted CPU. Use `RUSTFLAGS="-C target-feature=+avx2"` for the AVX 2 implementation or `RUSTFLAGS="-C target-feature=+sse4.2"` for the SSE 4.2 implementation.

If you want to be able to call a SIMD implementation directly, use the `public_imp` feature flag. The validation implementations are then accessible via `simdutf8::(basic|compat)::imp::x86::(avx2|sse42)::validate_utf8()`.

## When not to use

If you are only processing short byte sequences (fewer than 64 bytes), the excellent scalar algorithm in the standard library is likely faster. Also, this library uses unsafe code which has not been battle-tested and should not (yet) be used in production.

## Benchmarks

The benchmarks were run with criterion, and the tables were created with critcmp. Source code and data are in the bench directory.

The name schema is id-charset/size. _0-empty_ is the empty byte slice, _x-error/65536_ is a 64KiB slice where the very first character is invalid UTF-8. All benchmarks were run on a laptop with an Intel Core i7-10750H CPU (Comet Lake) on Windows with Rust 1.51.0.

### simdutf8 basic vs std library UTF-8 validation

[Figure: critcmp table, simdutf8 basic vs std lib]

simdutf8 performs better except for inputs <= 64 bytes.

### simdutf8 basic vs simdjson UTF-8 validation

[Figure: critcmp table, simdutf8 basic vs simdjson]

simdutf8 is faster than simdjson except for the pure ASCII loop, where clang applies some crazy optimization (to be investigated). simdjson is compiled using clang and gcc from MSYS.

### simdutf8 basic vs simdutf8 compat UTF-8 validation

[Figure: critcmp table, simdutf8 basic vs simdutf8 compat]

There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the _x-error/65536_ benchmark.

## Technical details

The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson. This alignment means that an incomplete block needs to be processed before the aligned data is read, which would lead to worse performance on short byte sequences. Thus, aligned reads are only used with 2048 bytes of data or more. Incomplete reads for the first unaligned and the last incomplete block are done in two aligned 64-byte buffers.

For the compat API we need to check the error buffer on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then `std::str::from_utf8()` is run to find the exact location of the error.

Care is taken that all functions are properly inlined up to the public interface.

## Thanks

* to the authors of simdjson for coming up with the high-performance SIMD implementation.
* to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.

## License

This code is made available under the Apache License 2.0.

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license. simdjson itself is distributed under the Apache License 2.0.

## References

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

Contributors: Hans Kratz (@hkratz), Daniel Lemire (@lemire). Latest release: Apr 21, 2021.