https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd-instructions-net-c-edition/ Skip to content Daniel Lemire's blog Daniel Lemire is a computer science professor at the Data Science Laboratory of the Universite du Quebec (TELUQ) in Montreal. His research is focused on software performance. Menu and widgets * My home page * GitHub profile Support my work! I do not accept any advertisement. However, you can you can sponsor my open-source work on GitHub. Join over 12,500 email subscribers: [ ][Go!] You can follow this blog on telegram. You can find me on twitter as @lemire or on Mastodon. Search for: [ ] [Search] Recent Posts * Scan HTML faster with SIMD instructions: .NET/C# Edition * How much memory does a call to 'malloc' allocate? * Performance tip: avoid unnecessary copies * Validating gigabytes of Unicode strings per second... in C#? * Rolling your own fast matrix multiplication: loop order and vectorization Recent Comments * Maks Verver on How much memory does a call to 'malloc' allocate? * Jatin Bhateja on Why are unrolled loops faster? * Clive Robinson on How much memory does a call to 'malloc' allocate? * Laurent on How much memory does a call to 'malloc' allocate? * Daniel Lemire on Measuring memory usage: virtual versus real memory Pages * A short history of technology * About me * Book recommendations * Cognitive biases * Interviews and talks * My bets * My favorite articles * My favorite quotes * My rules * Newsletter * Predictions * Privacy Policy * Recommended video games * Terms of use * Write good papers Archives Archives [Select Month ] Boring stuff * Log in * Entries feed * Comments feed * WordPress.org Scan HTML faster with SIMD instructions: .NET/C# Edition Recently, the two major Web engines (WebKit and Chromium) adopted fast SIMD routines to scan HTML content. The key insight is to use vectorized classification (Langdale and Lemire, 2019): you load blocks of characters and identify the characters you seek using a few instructions. In particular, we use 'SIMD instructions', special instructions that are available on practically all modern processors and can process 16 bytes or more at once. The problem that WebKit and Chromium solve is to jump to the next relevant characters: one of <, &, \r and \0. Thus we must identify quickly whether we have found one of these characters in a block. On my Apple macbook, a fast SIMD-based approach can scan an HTML page at about 7 GB/s, with code written in C/C++. But what about C#? The recent C# runtime (.NET8) supports fast SIMD instructions. Let us first consider a simple version of the function: public unsafe static void NaiveAdvanceString(ref byte* start, byte* end) { while (start < end) { if (*start == '<' || *start == '&' || *start == '\r' || *start == '\0') { return; } start++; } } This function just visits each character, one by one, and it compares it against the target characters. If one target character is found, we return. Let us consider a SIMD version of the same function. It is slightly more complicated. public unsafe static void SIMDAdvanceString(ref byte* start, byte* end) { Vector128 low_nibble_mask = Vector128.Create(0, 0, 0, 0, 0, 0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c, (byte)0xd, 0, 0); Vector128 v0f = Vector128.Create((byte)0x0F); Vector128 bit_mask = Vector128.Create((byte)16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1); const int stride = 16; while (start + (stride - 1) < end) { Vector128 data = AdvSimd.LoadVector128((byte*)start); Vector128 lowpart = AdvSimd.Arm64.VectorTableLookup(low_nibble_mask, data & v0f); Vector128 matchesones = AdvSimd.CompareEqual(lowpart, data); if (matchesones != Vector128.Zero) { Vector128 matches = AdvSimd.And(bit_mask, matchesones); int m = AdvSimd.Arm64.MaxAcross(matches).ToScalar(); start += 16 - m; return; } start += stride; } while (start < end) { if (*start == '<' || *start == '&' || *start == '\r' || *start == '\0') { return; } start++; } } The function takes two pointers (ref byte* start and byte* end) that mark the beginning and end of the byte array. The main loop continues as long as start is at least 16 bytes away from end. This ensures there's enough data for vectorized operations. We load in the variable 'data' 16 bytes from the memory pointed to by start. We use a vectorized lookup table and a comparison to quickly identify the target characters.The code checks if any element in matchesones is not zero. If there's a match, then we locate the first one (out of 16 characters), we advance start and return. If no match is found, we advance by 16 characters and repeat. We conclude with a fallback look that processes the leftover data (less than 16 bytes). As an optimization, it is helpful to use a local variable for the reference to the first pointer. Doing so improves the perfomance substantially: C# is not happy when we repeatedly modify a reference. Thus, at the start of the function, you may set byte* mystart = start, use mystart throughout, and then, just before a return, you set start = mystart. I wrote a benchmark (in C#) that you can run if you have an ARM-based processor. Conventional 1.4 GB/s SIMD (ARM NEON) 6.2 GB/s Incredibly, the SIMD-based function is over 4 times faster than the conventional function in these tests, and the accelerated C# function about 15% slower than the C++ version. The non-SIMD C# version is also slightly slower than the C++ version. Harold Aptroot provided support for x64 processor (up to AVX2) so I extended my benchmark to an Intel Ice Lake system: Conventional 1.0 GB/s SIMD (ARM AVX2) 7.5 GB/s This time, the SIMD version is over 7 times faster than the scalar. In fact, it matches the performance numbers that I get with C/C++. It is almost important for performance to write the code in such a way that the C# compiler tends to inline the scanning function, since it is called repeatedly. Initially, I had written the benchmark with some abstraction, using a delegate function, but it limited the best possible speed. In other words, .NET/C# allows you to write fast code using SIMD instructions. It may be well worth the effort. Daniel Lemire, "Scan HTML faster with SIMD instructions: .NET/C# Edition," in Daniel Lemire's blog, July 5, 2024. Published by [2ca999] Daniel Lemire A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire Posted on July 5, 2024July 5, 2024Author Daniel LemireCategories Leave a Reply Cancel reply Your email address will not be published. [ ] [ ] [ ] [ ] [ ] [ ] [ ] Comment * [ ] Name * [ ] Email * [ ] Website [ ] [ ] Save my name, email, and website in this browser for the next time I comment. [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] D[ ] You may subscribe to this blog by email. Post navigation Previous Previous post: How much memory does a call to 'malloc' allocate? Terms of use Proudly powered by WordPress