[HN Gopher] Faster substring search with SIMD in Zig
___________________________________________________________________
Faster substring search with SIMD in Zig
Author : todsacerdoti
Score : 169 points
Date : 2025-08-11 09:41 UTC (13 hours ago)
(HTM) web link (aarol.dev)
(TXT) w3m dump (aarol.dev)
| lokeg wrote:
| What about the worst case? I.e. something like searching for 1000
| 'a's in a long string of 'a's interspersed with 'b's every
| 500-1000 steps? Seems accidentally quadratic, unfortunately, in
| the absence of some KMP-like fallback.
| expenses3 wrote:
| How is it quadratic? You do 1000 checks for every character in
| the haystack, but that's still O(n).
| burntsushi wrote:
| The worst case is that `std.mem.eql`[1] is executed at every
| position in the haystack, which gives you `O(n*m)` time
| complexity. Several substring search algorithms are `O(n)`.
|
| [1]: https://github.com/aarol/substr/blob/9392f9557de735929dfb79e...
| MattPalmer1086 wrote:
| Worst case for these types of search is O(mn), where m is the
| length of the needle and n the length of the haystack. It is not
| linear in n.
|
| The absolute worst case is when the needle and haystack are both
| composed of the same byte repeated (e.g. all zeros).
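|
| A quick way to see that worst case concretely (an illustrative
| Zig sketch, not from the article): build the adversarial input
| described above at comptime and hand it to any search that only
| filters on a couple of bytes before doing a full comparison.
|
|     const std = @import("std");
|
|     test "pathological input for a first/last-byte filter" {
|         // Needle of 1000 'a's; haystack where a 'b' breaks every
|         // run of 999 'a's, so no match ever exists.
|         const needle = "a" ** 1000;
|         const haystack = ("a" ** 999 ++ "b") ** 64;
|         // The cheap filter passes at almost every offset, so each
|         // candidate pays a nearly full needle comparison before
|         // failing: roughly n*m byte compares instead of O(n+m).
|         try std.testing.expectEqual(
|             @as(?usize, null),
|             std.mem.indexOf(u8, haystack, needle),
|         );
|     }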
| suddenlybananas wrote:
| > The difference between 4ms vs 1ms is extremely small, but it's
| > slightly faster nonetheless.
|
| Put that in a loop and it's an enormous speed-up.
| codethief wrote:
| Nice article!
|
| Also, this might be a stupid question (I'm a Zig newbie) but...
| instead of calling std.mem.eql() in the while loop to look at
| each potential match individually, couldn't you repeat the same
| trick as before? That is, use SIMD to search for the second and
| second-to-last character of the needle, then third and third-to-
| last, and so on, and finally take a bitwise AND of all the
| resulting bit masks? This way, one would avoid looking at each
| potential match one by one, and instead look at all of them at
| the same time.
|
| Even if that doesn't work for some reason and you still need to
| loop over all potential matches individually, couldn't you use
| SIMD _inside_ the while loop to replace std.mem.eql and thereby
| speed up string comparison? My understanding was that std.mem.eql
| loops over bytes one by one and compares them?
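|
| A rough Zig sketch of that mask-ANDing idea (my own illustration,
| not the article's code; assumes the caller keeps i + p + 32 within
| the haystack):
|
|     const Block = @Vector(32, u8);
|
|     fn candidateMask(haystack: []const u8, needle: []const u8, i: usize) u32 {
|         var mask: u32 = ~@as(u32, 0);
|         // Compare a few needle positions per block and AND the
|         // bitmasks; each extra position filters out more false
|         // candidates before any call to std.mem.eql.
|         const positions = [_]usize{ 0, 1, needle.len - 1 };
|         for (positions) |p| {
|             const byte: Block = @splat(needle[p]);
|             const window: Block = haystack[i + p ..][0..32].*;
|             mask &= @as(u32, @bitCast(window == byte));
|         }
|         return mask; // set bits = candidate offsets inside the block
|     }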
| ncruces wrote:
| Knowing little about zig, std.mem.eql _very likely_ already
| uses SIMD.
|
| This is about using SIMD to _avoid even calling_ std.mem.eql
| for 99% of the possible attempts.
| llimllib wrote:
| std.mem.eql is here, super easy to read:
| https://github.com/ziglang/zig/blob/master/lib/std/mem.zig#L...
|
| My read is that it would use SIMD if T is a @Vector, and not
| otherwise? But I'm neither a Zig nor a SIMD expert.
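|
| For comparison, a hand-rolled vectorized equality check is easy to
| sketch in Zig (illustrative only; std.mem.eql may well end up
| vectorized anyway once inlined and optimized):
|
|     fn eqlSimd(a: []const u8, b: []const u8) bool {
|         if (a.len != b.len) return false;
|         const N = 16;
|         var i: usize = 0;
|         // Compare 16-byte chunks with a single vector compare each.
|         while (i + N <= a.len) : (i += N) {
|             const va: @Vector(N, u8) = a[i..][0..N].*;
|             const vb: @Vector(N, u8) = b[i..][0..N].*;
|             if (!@reduce(.And, va == vb)) return false;
|         }
|         // Scalar tail for the last few bytes.
|         while (i < a.len) : (i += 1) {
|             if (a[i] != b[i]) return false;
|         }
|         return true;
|     }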
| ozgrakkurt wrote:
| Pretty sure that compiles into assembly for any primitive
| type like integers, floats etc.
| IshKebab wrote:
| What do you mean it "compiles into assembly"?
| ozgrakkurt wrote:
| I meant to write SIMD
| jiehong wrote:
| Nice!
|
| But does that work with non-ASCII characters (a.k.a. Unicode)?
| llimllib wrote:
| Kind of! This code assumes that you're dealing with a byte
| slice, which means you've already encoded your Unicode data.
|
| If you just encoded your string to bytes naively, it will
| probably-mostly still work, but it will get some combining
| characters wrong if they're represented differently in the two
| sources you're comparing. (eg, e-with-an-accent-character vs.
| accent-combining-character+e)
|
| If you want to be correct-er you'll normalize your UTF
| string[1], but note that there are four different defined ways
| to do this, so you'll need to choose the one that is the best
| tradeoff for your particular application and data sources.
|
| [1]:
| https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...
| codethief wrote:
| > If you just encoded your string to bytes naively
|
| By "naively" I assume you mean you would just plug in UTF-8
| bytestrings for haystack & needle, without adjusting the
| implementation?
|
| Wouldn't the code still need to take into account where
| characters (code points) begin and end, though, in order to
| prevent incorrect matches?
| burntsushi wrote:
| IDK what "encoded your string to bytes naively" means
| personally. There is only one way to correctly UTF-8 encode
| a sequence of Unicode scalar values.
|
| In any case, no, this works because UTF-8 is
| self-synchronizing. As long as both your needle and your
| haystack are valid UTF-8, the byte offsets returned by the
| search will always fall on a valid codepoint boundary.
|
| In terms of getting "combining characters wrong," this is a
| reference to different Unicode normalization forms.
|
| To be more precise... Consider a needle and a haystack,
| represented by a sequence of Unicode scalar values
| (typically represented by a sequence of unsigned 32-bit
| integers). Now encode them to UTF-8 (a sequence of unsigned
| 8-bit integers) and run a byte level search as shown by the
| OP here. That will behave _as if_ you've executed the
| search on the sequence of Unicode scalar values.
|
| So semantically, a "substring search" is a "sequence of
| Unicode scalar values search." At the semantic level, this
| may or may not be what you want. For example, if you always
| want `office` to find substrings like `oﬃce` (written with the
| ﬃ ligature) in your haystack, then this byte level search will
| not do what you want.
|
| The standard approach for performing a substring search
| that accounts for normalization forms is to convert both
| the needle and haystack to the same normal form and then
| execute a byte level search.
|
| (One small caveat is when the needle is an empty string. If
| you want to enforce correct UTF-8 boundaries, you'll need
| to handle that specially.)
| llimllib wrote:
| By naively, I meant without normalization.
|
| You know much more about this than I do though
|
| edit: this is what I mean, for example, that `tést` != `tést`
| in rg, because \u00e9 (e with accent) != e\u0301 (e followed by
| a combining accent character):
|
|     $ printf "t\\u00E9st" > /tmp/a
|     $ xxd /tmp/a
|     00000000: 74c3 a973 74                             t..st
|     $ cat /tmp/a
|     tést
|     $ printf "te\\u0301st" > /tmp/b
|     $ xxd /tmp/b
|     00000000: 7465 cc81 7374                           te..st
|     $ cat /tmp/b
|     tést
|     $ printf "t\\u00E9st" | rg -f - /tmp/a
|     1:tést
|     $ printf "t\\u00E9st" | rg -f - /tmp/b    # ed: no result
|
| edit 2: if we normalize the UTF-8, the two strings will match:
|
|     $ printf "t\\u00E9st" | uconv -x any-nfc | xxd
|     00000000: 74c3 a973 74                             t..st
|     $ printf "te\\u0301st" | uconv -x any-nfc | xxd
|     00000000: 74c3 a973 74                             t..st
|
| Which you know, and indicate! Just working an example of
| it that maybe will help people understand, I dunno
| jiehong wrote:
| Thanks for this detailed answer!
| codethief wrote:
| I suppose generalizing the approach to UTF-32 should be
| straightforward, but variable-length encodings like UTF-8 and
| UTF-16 might be more involved(?) Either way, I'm sure
| BurntSushi found a solution and built it into ripgrep.
| burntsushi wrote:
| ripgrep always deals with UTF-8. When it sees a different
| encoding, like UTF-16, ripgrep first transcodes to UTF-8 and
| then searches.
|
| This is absolutely in part because of all of the byte
| oriented optimizations that are baked into ripgrep (and its
| regex engine). Note that I said a _part_. Making ripgrep (and
| its regex engine) work on things other than a sequence of
| bytes is far more difficult than just porting a bunch of SIMD
| algorithms. There are also many optimizations and
| architectural constraints _in the code_ based on the alphabet
| size. That is, with 8-bit integers, its alphabet size is 256.
| With 16-bit integers, the alphabet size is 65,536.
| tialaramex wrote:
| I think this is the right choice because in practice UTF-8
| "won", just like the two's complement machine integer won. It's
| pretty good: Wikipedia has a brief section explaining how Ken
| Thompson, for example, made it self-synchronizing, which seems
| like a "duh" feature today, but the concept before Ken touched
| it didn't have this. It's a Best Current Practice for the
| Internet, it's the default in most modern systems, and places
| such as Java's virtual machine or Windows, which can't easily
| "just" use UTF-8, have nevertheless gradually shifted toward
| being very friendly to it.
| ashvardanian wrote:
| I like that more people are getting involved with SIMD, and there
| have been several posts lately on both memmem-like and memcpy-
| like operations implemented in SIMD in different programming
| languages.
|
| In most cases, though, these still focus on AVX/NEON instructions
| from over 10 years ago, rather than newer and more powerful
| AVX-512 variations, SVE & SVE2, or RVV.
|
| These newer ISAs can noticeably change how one would implement a
| state-of-the-art substring search or copy/move operation. In my
| projects, such as StringZilla, I often use mask K registers
| (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...)
| and an input-dependent mix of temporal and non-temporal loads and
| stores
| (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...).
|
| In typical cases, the difference between the suggested SIMD
| kernels and the state-of-the-art can be as significant as 50% in
| throughput. As SIMD becomes more widespread, it would be
| beneficial to focus more on delivering software and bundling
| binaries, rather than just the kernels.
| moregrist wrote:
| I'm not as familiar with the NEON side, but AVX512 support is
| pretty variable on new processors. Alder Lake omits it
| entirely. So we're still in a world where AVX2 is the lowest
| common denominator for a system library that wants wide
| support.
| debugnik wrote:
| Even that is too high of a requirement if your target user
| runs low end hardware. Most Intel chips launched between 2017
| and 2021 under the Pentium Silver/Gold and Celeron brands
| lack AVX (the first one, let alone AVX2).
| teo_zero wrote:
| How strange! I was about to add a comment that I would
| probably stick to SSE2 or something like that to be sure my
| code suits as large an audience as possible, including CPUs
| from more than 10 years ago, ARM, etc.
|
| Case in point: I've been very disappointed lately when I
| wanted to try Ghostty on my laptop and the binary compiled
| for Debian failed to run due to an invalid instruction. I
| don't want to force the same experience on others.
| burntsushi wrote:
| This is sort of a category error. I don't know what ghostty
| is doing (or perhaps its distributor), but lots of widely
| used software (including my own ripgrep, but also even
| things like glibc) will query what the CPU supports at
| _runtime_ and dispatch to the correct routine based on
| that. So things like, "I'm only going to stick to SSE2 for
| maximal compatibility" have a false assumption baked into
| them.
| nromiun wrote:
| Because it is very hard to find new hardware to test it, let
| alone expect your users to take advantage of it on their
| machines.
|
| AVX512 is such a mess that Intel just removed it after a
| generation or two. And on the Arm SVE side it is even worse.
| There is already SVE2, but good luck finding even an SVE-enabled
| machine.
|
| Apple does not support it on their Apple Silicon(tm) (only
| SME), Snapdragon does not support it even on their latest 8
| Elite. 8 Elite Gen 2 is supposed to come with it.
|
| Only Mediatek and Neoverse chips support them. So finding one
| machine to develop and test such code can be a little
| difficult.
| ashvardanian wrote:
| PS: Finding CPUs that support AVX-512 and SVE is relatively
| trivial - practically every cloud has them by now. It's harder
| to find Arm CPUs with wide physical registers, but that's
| another story.
| nromiun wrote:
| But no one likes to develop on the cloud. The latency and
| storage synchronization can be very off-putting.
| ack_complete wrote:
| Sure, but I have to support a range of target CPUs in the
| consumer desktop market, and the _older_ CPUs are the ones that
| need optimizations the most. That means NEON on ARM64 and AVX2
| or SSE2-4 on x64. Time spent on higher vector instruction sets
| benefits a smaller fraction of the user base that already has
| better performance, and that's especially problematic if the
| algorithm has to be reworked to take best advantage of the
| higher extensions.
|
| AVX-512 is also in bad shape market-wise, despite its amazing
| feature set and how long it's been since initial release. The
| Steam Hardware Survey, which skews toward the higher end of the
| market, only shows 18% of the user base having AVX-512 support.
| And even that is despite Intel's best efforts to reverse
| progress by shipping all new consumer CPUs with AVX-512 support
| disabled.
| lukaslalinsky wrote:
| I really wish Zig decided to add SIMD intrinsics. There are many
| SIMD algorithms that can be implemented, but you have to switch
| back to C for those, because they depend on operations outside
| of what LLVM provides for vectors.
| steeve wrote:
| Like what?
|
| You can also directly call LLVM intrinsics in case this doesn't
| work
| lukaslalinsky wrote:
| Just a couple of days ago, I wanted to implement specialized
| StreamVByte decoding in Zig, but @shuffle() needs the mask to
| be compile-time known, while _mm_shuffle_epi8() works just fine
| with a dynamic mask. I remember that some time ago, I couldn't
| find an alternative to _mm_alignr_epi8().
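|
| For anyone unfamiliar with that constraint, a tiny example of it
| (my own sketch): @shuffle's mask must be comptime-known, unlike
| the runtime index register pshufb takes.
|
|     const std = @import("std");
|
|     test "@shuffle requires a comptime mask" {
|         const a = @Vector(4, u8){ 10, 20, 30, 40 };
|         // The mask is comptime-known here: reverse the lanes.
|         const rev = @shuffle(u8, a, undefined, @Vector(4, i32){ 3, 2, 1, 0 });
|         const expected = @Vector(4, u8){ 40, 30, 20, 10 };
|         try std.testing.expect(@reduce(.And, rev == expected));
|     }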
| aqrit wrote:
| `_mm_alignr_epi8` is a compile-time known shuffle that gets
| optimized well by LLVM [1].
|
| If you need the exact behavior of `pshufb` you can use asm
| or the llvm intrinsic [2]. iirc, I once got the compiler to
| emit a `pshufb` for a runtime shuffle... that always
| guaranteed indices in the 0..15 range?
|
| Ironically, I also wanted to try zig by doing a StreamVByte
| implementation, but got derailed by the lack of SSE/AVX
| intrinsics support.
|
| [1] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34...
| [2] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34...
| AndyKelley wrote:
| Missing SIMD functionality is welcome in the form of issue
| reports. Obviously we're not going to copy the C intrinsics
| directly, since Zig SIMD is portable, but everything should be
| expressible.
|
| It doesn't really have to do with what operations LLVM provides
| for vectors. LLVM supports all the SIMD intrinsics of clang,
| and LLVM is one of many backends of zig.
| burntsushi wrote:
| Good write-up. This is a very popular approach to substring
| search! It is still worst case `O(m*n)` though. Do you have a
| fallback implementation like the `memchr` crate has to guarantee
| `O(m+n)`?
|
| I'll also push back on some bits in the end:
|
|     > But if it's so much better, then why haven't I made a pull
|     > request to change std.mem.indexOf to use SIMD? Well, the
|     > reason is that
|     >
|     > std.mem.indexOf is generic over element size, and having a
|     > size larger than u8 makes the algorithm much slower
|     >
|     > The algorithm used in std.mem.indexOf is cross-platform,
|     > while the SIMD code wouldn't be. (not all platforms have
|     > SIMD registers at all, Arm has only 128-bit)
|
| Does Zig not have a way to specialize this for sequences of
| unsigned 8-bit integers? If not, and you're therefore forced to
| use a more generic algorithm, that seems pretty unfortunate.
|
|     > Substring searching is rarely the bottleneck in programs,
|     > especially ones written in a fast language like Zig. That's
|     > why I don't personally think it would be worth it to add it
|     > to the standard library.
|
| Oh I'm not sure I buy this at all! Substring search is a
| primitive operation and easily _can_ be a bottleneck. There's a
| reason why widely used substring search implementations tend to
| be highly optimized.
| ozgrakkurt wrote:
| It is very easy to specialise a function in Zig: you just put
| if (T == u8) or something like that inside the function and do
| whatever you want in there.
| unwind wrote:
| Can the compiler detect that and use the proper code so no
| test is needed at runtime?
|
| This is Zig so I guess the answer is "yeah, duh" but wanted
| to ask since it sounded like the solution is less, uh,
| "compiler-friendly" than I would expect.
| hansvm wrote:
| Yes, and if you're paranoid you can write
|
|     if (comptime T == u8) { // code }
|
| to guarantee that if you're wrong about how the compiler behaves
| then you'll get a compiler error.
| dpatterbee wrote:
| Zig has no type info at runtime so this is and always
| will be guaranteed to be a comptime check.
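|
| A compilable sketch of the pattern being discussed (illustrative;
| not how std.mem.indexOf is actually structured):
|
|     const std = @import("std");
|
|     fn indexOfSpecialized(comptime T: type, haystack: []const T, needle: []const T) ?usize {
|         if (comptime T == u8) {
|             // Byte-specialized branch: a SIMD path like the article's
|             // would live here. The branch is resolved at compile time,
|             // so other instantiations never compile this block.
|             return std.mem.indexOf(u8, haystack, needle);
|         }
|         return std.mem.indexOf(T, haystack, needle);
|     }
|
|     test "specialization is a comptime branch" {
|         try std.testing.expectEqual(
|             @as(?usize, 2),
|             indexOfSpecialized(u8, "abcde", "cd"),
|         );
|         try std.testing.expectEqual(
|             @as(?usize, 1),
|             indexOfSpecialized(u32, &[_]u32{ 7, 8, 9 }, &[_]u32{ 8, 9 }),
|         );
|     }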
| aarol wrote:
| I'm the author of the post, thanks for your feedback! I was
| inspired by your comment on HN a while back and started learning
| about this stuff; reading the source code of `memchr` was
| especially great.
|
| You're totally right about the first part: if there were serious
| consideration of adding this to Zig's standard library, there
| would definitely need to be a fallback to avoid the `O(m*n)`
| situation.
|
| I'll admit that there are a lot of false assumptions at the end:
| you could totally specialize it for u8, and also get the block
| size according to CPU features at compile time with
| `std.simd.suggestVectorSize()`.
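|
| For reference, a minimal sketch of that compile-time selection (in
| recent Zig versions the helper is std.simd.suggestVectorLength and
| returns null when the target has no suggestion, hence the fallback):
|
|     const std = @import("std");
|
|     // Pick the block width from the build target's CPU features.
|     const block_len = std.simd.suggestVectorLength(u8) orelse 16;
|     const Block = @Vector(block_len, u8);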
| hansvm wrote:
| Or at runtime, if you'd like. You can create a generic binary
| that runs faster on supported platforms.
| vlovich123 wrote:
| > Or at runtime, if you'd like
|
| You have to be careful about how you do it because those
| runtime checks can easily swamp the performance gains you
| get from SIMD.
|
| > also get the block size according to CPU features at
| compile time with `std.simd.suggestVectorSize()`
|
| You have to be careful with this, since
| std.simd.suggestVectorSize is going to return values for the
| minimum SIMD version you're targeting, I believe, which can be
| suboptimal for portable binaries.
|
| You probably want a mix where you carefully compute the
| vector size for the current platform globally once and have
| multiple compiled dispatch paths in your binary that you
| can pick based on that value & let the CPU prefetcher hide
| the cost of a check before each invocation.
| stouset wrote:
| > You have to be careful about how you do it because
| those runtime checks can easily swamp the performance
| gains you get from SIMD.
|
| That seems surprising, particularly given that
| autovectorizing compilers tend to insert pretty extensive
| preambles that check for whether or not it's likely the
| vectorized one will have a speedup over the looping
| version (e.g., based on the number of iterations) and
| postambles that handle the cases where the number of loop
| iterations isn't cleanly divisible by the number of
| elements in the chosen vector size.
|
| Why would checking for supported SIMD instructions cause
| that much additional work?
|
| Also, even if this is the case, you can always check once
| and then replace the function body with the chosen one,
| eliding the check.
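|
| A minimal sketch of that check-once-then-swap idea in Zig (my own
| illustration; the feature probe is a stand-in for a real
| cpuid-style check, and this version ignores thread-safety):
|
|     const std = @import("std");
|
|     const SearchFn = *const fn ([]const u8, []const u8) ?usize;
|
|     fn scalarSearch(haystack: []const u8, needle: []const u8) ?usize {
|         return std.mem.indexOf(u8, haystack, needle);
|     }
|
|     fn firstCall(haystack: []const u8, needle: []const u8) ?usize {
|         // A real implementation would probe CPU features here and
|         // pick the matching kernel; this sketch always settles on
|         // the scalar path.
|         search_impl = &scalarSearch;
|         return search_impl(haystack, needle);
|     }
|
|     // Later calls jump straight to the kernel chosen the first time.
|     var search_impl: SearchFn = &firstCall;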
| garganzol wrote:
| Every language tries to implement the best-in-class memory
| set/search primitives. Maybe we should move them to something
| called libos, where they would be implemented by every host OS?
| Then OS manufacturers could supply libos.so/libos.dll/libos.dylib
| as part of the official OS distributions.
|
| If the wheels get reinvented again and again, it means that they
| should be readily available.
| ozgrakkurt wrote:
| Isn't this what libc is? Like musl-libc or glibc
| garganzol wrote:
| libc is, de jure, for the C language on Unix-like systems. De
| facto, it is used for other things and by other languages,
| creating a bit of an architectural mess.
|
| What I am talking about is creating a cross-vendor standard.
| ww520 wrote:
| Excellent write up. I really appreciate the clear and detailed
| explanation of the algorithm.
|
| The nice thing about Zig's SIMD support is that the register
| size is transparent. You can just declare a 64-byte vector as
| the Block, and Zig will use one AVX-512 register or two AVX2
| (32-byte) registers behind the scenes. All other SIMD operations
| on the type are lowered transparently to whatever registers the
| targeted platform provides.
|
| Even using two AVX2 registers for 64 bytes of data is a win due
| to instruction pipelining. Most CPUs have multiple SIMD units,
| and the two 32-byte data chunks are independent, so the CPU can
| run them in parallel.
|
| The next optimization is to align the data to 64 bytes, matching
| the L1/L2 cache line size. Memory access is slow in general.
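|
| To make the register-size point concrete, a small illustration
| (the declared width is independent of the target's registers; the
| backend splits or widens the operations as needed):
|
|     const Block = @Vector(64, u8);
|
|     // One logical 64-byte block; on an AVX2-only target the compare
|     // and reduce below lower to two 32-byte register operations.
|     fn blockContains(block: Block, byte: u8) bool {
|         const pattern: Block = @splat(byte);
|         return @reduce(.Or, block == pattern);
|     }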
| burntsushi wrote:
| Another challenge may arise as a result of using a portable
| SIMD API instead of specific ISA instructions. I'm specifically
| thinking about computing the mask, which on x86-64 is I assume
| implemented via movemask. But aarch64 lacks such an
| instruction, so you need to do other shenanigans for the best
| codegen:
| https://github.com/BurntSushi/memchr/blob/ceef3c921b5685847e...
| ww520 wrote:
| A language-level or std-library-level polyfill for the missing
| SIMD operations on those platforms would be great.
| okr wrote:
| I remember reading Daniel Lemire's blog about SIMD a few days
| ago, in which he discussed substring search as well!
|
| https://lemire.me/blog/2025/08/09/why-do-we-even-need-simd-i...
___________________________________________________________________
(page generated 2025-08-11 23:00 UTC)