[HN Gopher] Faster substring search with SIMD in Zig
       ___________________________________________________________________
        
       Faster substring search with SIMD in Zig
        
       Author : todsacerdoti
       Score  : 169 points
       Date   : 2025-08-11 09:41 UTC (13 hours ago)
        
 (HTM) web link (aarol.dev)
 (TXT) w3m dump (aarol.dev)
        
       | lokeg wrote:
       | What about the worst case? I.e. something like searching for 1000
       | 'a's in a long string of 'a's interspersed with 'b's every
       | 500-1000 steps? Seems accidentally quadradic unfortunately in the
       | absence of some KMP-like fallback
        
         | expenses3 wrote:
         | How is it quadratic? You do 1000 checks every character in the
         | haystack but that's still O(n)
        
           | burntsushi wrote:
           | The worst case is that `std.mem.eql`[1] is executed at every
           | position in the haystack, which gives you `O(n*m)` time
           | complexity. Several substring search algorithms are `O(n)`.
           | 
           | [1]: https://github.com/aarol/substr/blob/9392f9557de735929df
           | b79e...
        
         | MattPalmer1086 wrote:
         | Worst case for these types of search is O(mn), m length of
         | needle, n length of haystack. It is not linear in n.
         | 
         | The absolute worst case is when both the needle and haystack
         | are both composed of the same byte repeated (e.g. all zero).
        
       | suddenlybananas wrote:
       | >The difference between 4ms vs 1ms is extremely small, but it's
       | slightly faster nonetheless.
       | 
       | Put that in a loop and its an enormous speed-up.
        
       | codethief wrote:
       | Nice article!
       | 
       | Also, this might be a stupid question (I'm a Zig newbie) but...
       | instead of calling std.mem.eql() in the while loop to look at
       | each potential match individually, couldn't you repeat the same
       | trick as before? That is, use SIMD to search for the second and
       | second-to-last character of the needle, then third and third-to-
       | last, and so on, and finally take a bitwise AND of all the
       | resulting bit masks? This way, one would avoid looking at each
       | potential match one by one, and instead look at all of them at
       | the same time.
       | 
       | Even if that doesn't work for some reason and you still need to
       | loop over all potential matches individually, couldn't you use
       | SIMD _inside_ the while loop to replace std.mem.eql and thereby
       | speed up string comparison? My understanding was that std.mem.eql
       | loops over bytes one by one and compares them?
        
         | ncruces wrote:
         | Knowing little about zig, std.mem.eql _very likely_ already
         | uses SIMD.
         | 
         | This is about using SIMD to _avoid even calling_ std.mem.eql
         | for 99% of the possible attempts.
        
           | llimllib wrote:
           | std.mem.eql is here, super easy to read: https://github.com/z
           | iglang/zig/blob/master/lib/std/mem.zig#L...
           | 
           | My read is it would use SIMD if T is @Vector, and not
           | otherwise? But I'm neither a zig nor SIMD expert
        
             | ozgrakkurt wrote:
             | Pretty sure that compiles into assembly for any primitive
             | type like integers, floats etc.
        
               | IshKebab wrote:
               | What do you mean it "compiles into assembly"?
        
               | ozgrakkurt wrote:
               | I meant to write SIMD
        
       | jiehong wrote:
       | Nice!
       | 
       | But, does that work with non-ascii characters? (aka Unicode).
        
         | llimllib wrote:
         | Kind of! This script is assuming that you're dealing with a
         | byte slice, which means you've already encoded your unicode
         | data.
         | 
         | If you just encoded your string to bytes naively, it will
         | probably-mostly still work, but it will get some combining
         | characters wrong if they're represented differently in the two
         | sources you're comparing. (eg, e-with-an-accent-character vs.
         | accent-combining-character+e)
         | 
         | If you want to be correct-er you'll normalize your UTF
         | string[1], but note that there are four different defined ways
         | to do this, so you'll need to choose the one that is the best
         | tradeoff for your particular application and data sources.
         | 
         | [1]:
         | https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...
        
           | codethief wrote:
           | > If you just encoded your string to bytes naively
           | 
           | By "naively" I assume you mean you would just plug in UTF-8
           | bytestrings for haystack & needle, without adjusting the
           | implementation?
           | 
           | Wouldn't the code still need to take into account where
           | characters (code points) begin and end, though, in order to
           | prevent incorrect matches?
        
             | burntsushi wrote:
             | IDK what "encoded your string to bytes naively" means
             | personally. There is only one way to correctly UTF-8 encode
             | a sequence of Unicode scalar values.
             | 
             | In any case, no, this works because UTF-8 is self
             | synchronizing. As long as both your needle and your
             | haystack are valid UTF-8, the byte offsets returned by the
             | search will always fall on a valid codepoint boundary.
             | 
             | In terms of getting "combining characters wrong," this is a
             | reference to different Unicode normalization forms.
             | 
             | To be more precise... Consider a needle and a haystack,
             | represented by a sequence of Unicode scalar values
             | (typically represented by a sequence of unsigned 32-bit
             | integers). Now encode them to UTF-8 (a sequence of unsigned
             | 8-bit integers) and run a byte level search as shown by the
             | OP here. That will behave _as if_ you 've executed the
             | search on the sequence of Unicode scalar values.
             | 
             | So semantically, a "substring search" is a "sequence of
             | Unicode scalar values search." At the semantic level, this
             | may or may not be what you want. For example, if you always
             | want `office` to find substrings like `office` in your
             | haystack, then this byte level search will not do what you
             | want.
             | 
             | The standard approach for performing a substring search
             | that accounts for normalization forms is to convert both
             | the needle and haystack to the same normal form and then
             | execute a byte level search.
             | 
             | (One small caveat is when the needle is an empty string. If
             | you want to enforce correct UTF-8 boundaries, you'll need
             | to handle that specially.)
        
               | llimllib wrote:
               | By naively, I meant without normalization.
               | 
               | You know much more about this than I do though
               | 
               | edit: this is what I mean for example, that `test` !=
               | `test` in rg, because \ue9 (e with accent) != e\u0301 (e
               | followed by combining character accent)
               | $ printf "t\\u00E9st" > /tmp/a          $ xxd /tmp/a
               | 00000000: 74c3 a973 74                             t..st
               | $ cat /tmp/a         test              $ printf
               | "te\\u0301st" > /tmp/b          $ xxd /tmp/b
               | 00000000: 7465 cc81 7374                           te..st
               | $ cat /tmp/b         test              $ printf
               | "t\\u00E9st" | rg -f - /tmp/a         1:test         $
               | printf "t\\u00E9st" | rg -f - /tmp/b         # ed: no
               | result
               | 
               | edit 2: if we normalize the UTF-8, the two strings will
               | match                   $ printf "t\\u00E9st" | uconv -x
               | any-nfc | xxd         00000000: 74c3 a973 74
               | t..st         $ printf "te\\u0301st" | uconv -x any-nfc |
               | xxd         00000000: 74c3 a973 74
               | t..st
               | 
               | Which you know, and indicate! Just working an example of
               | it that maybe will help people understand, I dunno
        
           | jiehong wrote:
           | Thanks for this detailed answer!
        
         | codethief wrote:
         | I suppose generalizing the approach to UTF-32 should be
         | straightforward, but variable-length encodings like UTF-8 and
         | UTF-16 might be more involved(?) Either way, I'm sure
         | BurntSushi found a solution and built it into ripgrep.
        
           | burntsushi wrote:
           | ripgrep always deals with UTF-8. When it sees a different
           | encoding, like UTF-16, ripgrep first transcodes to UTF-8 and
           | then searches.
           | 
           | This is absolutely in part because of all of the byte
           | oriented optimizations that are baked into ripgrep (and its
           | regex engine). Note that I said a _part_. Making ripgrep (and
           | its regex engine) work on things other than a sequence of
           | bytes is far more difficult than just porting a bunch of SIMD
           | algorithms. There are also many optimizations and
           | architectural constraints _in the code_ based on the alphabet
           | size. That is, with 8-bit integers, its alphabet size is 256.
           | With 16-bit integers, the alphabet size is 65,536.
        
             | tialaramex wrote:
             | I think this is the right choice because in practice UTF-8
             | "won" just like how the two's complement machine integer
             | won. It's pretty good, Wikipedia has a brief section
             | explaining how Ken Thompson for example made it self-
             | synchronizing, which seems like a "duh" feature today but
             | the concept before Ken touched it didn't have this. It's a
             | Best Common Practice for the Internet, it's the default in
             | most modern systems and places such as Java's virtual
             | machine or Windows which can't easily "just" use UTF-8 have
             | nevertheless gradually shifted toward being very friendly
             | toward it.
        
       | ashvardanian wrote:
       | I like that more people are getting involved with SIMD, and there
       | have been several posts lately on both memmem-like and memcpy-
       | like operations implemented in SIMD in different programming
       | languages.
       | 
       | In most cases, though, these still focus on AVX/NEON instructions
       | from over 10 years ago, rather than newer and more powerful
       | AVX-512 variations, SVE & SVE2, or RVV.
       | 
       | These newer ISAs can noticeably change how one would implement a
       | state-of-the-art substring search or copy/move operation. In my
       | projects, such as StringZilla, I often use mask K registers
       | (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...)
       | and an input-dependent mix of temporal and non-temporal loads and
       | stores (https://github.com/ashvardanian/StringZilla/blob/2f4b1386
       | ca2...).
       | 
       | In typical cases, the difference between the suggested SIMD
       | kernels and the state-of-the-art can be as significant as 50% in
       | throughput. As SIMD becomes more widespread, it would be
       | beneficial to focus more on delivering software and bundling
       | binaries, rather than just the kernels.
        
         | moregrist wrote:
         | I'm not as familiar with the NEON side, but AVX512 support is
         | pretty variable on new processors. Alder Lake omits it
         | entirely. So we're still in a world where AVX2 is the lowest
         | common denominator for a system library that wants wide
         | support.
        
           | debugnik wrote:
           | Even that is too high of a requirement if your target user
           | runs low end hardware. Most Intel chips launched between 2017
           | and 2021 under the Pentium Silver/Gold and Celeron brands
           | lack AVX (the first one, let alone AVX2).
        
           | teo_zero wrote:
           | How strange! I was about to add a comment that I would
           | probably stick to SSE2 or something like that to be sure my
           | code suits as large an audience as possible, including CPUs
           | from more than 10 years ago, ARM, etc.
           | 
           | Case in point: I've been very disappointed lately when I
           | wanted to try Ghostty on my laptop and the binary compiled
           | for Debian failed to run due to an invalid instruction. I
           | don't want to force the same experience to others.
        
             | burntsushi wrote:
             | This is sort of a category error. I don't know what ghostty
             | is doing (or perhaps its distributor), but lots of widely
             | used software (including my own ripgrep, but also even
             | things like glibc) will query what the CPU supports at
             | _runtime_ and dispatch to the correct routine based on
             | that. So things like,  "I'm only going to stick to SSE2 for
             | maximal compatibility" have a false assumption baked into
             | them.
        
         | nromiun wrote:
         | Because it is very hard to find new hardware to test it, let
         | alone expect your users to take advantage of it on their
         | machines.
         | 
         | AVX512 is such a mess that Intel just removed it after a
         | generation or two. And on ARM SVE side it is even worse. There
         | is already SVE2, but good luck finding even a SVE enabled
         | machine.
         | 
         | Apple does not support it on their Apple Silicon(tm) (only
         | SME), Snapdragon does not support it even on their latest 8
         | Elite. 8 Elite Gen 2 is supposed to come with it.
         | 
         | Only Mediatek and Neoverse chips support them. So finding one
         | machine to develop and test such code can be a little
         | difficult.
        
         | ashvardanian wrote:
         | PS: Finding CPUs that support AVX-512 and SVE is relatively
         | trivial - practically every cloud has them by now. It's harder
         | to find Arm CPUs with wide physical registers, but that's
         | another story.
        
           | nromiun wrote:
           | But no one likes to develop on the cloud. The latency and
           | storage synchronization can be very off putting.
        
         | ack_complete wrote:
         | Sure, but I have to support a range of target CPUs in the
         | consumer desktop market, and the _older_ CPUs are the ones that
         | need optimizations the most. That means NEON on ARM64 and AVX2
         | or SSE2-4 on x64. Time spent on higher vector instruction sets
         | benefits a smaller fraction of the user base that already has
         | better performance, and that 's especially problematic if the
         | algorithm has to be reworked to take best advantage of the
         | higher extensions.
         | 
         | AVX-512 is also in bad shape market-wise, despite its amazing
         | feature set and how long it's been since initial release. The
         | Steam Hardware Survey, which skews toward the higher end of the
         | market, only shows 18% of the user base having AVX-512 support.
         | And even that is despite Intel's best efforts to reverse
         | progress by shipping all new consumer CPUs with AVX-512 support
         | disabled.
        
       | lukaslalinsky wrote:
       | I really wish Zig decided to add SIMD intrisics. There are many
       | SIMD algorithms that can be done, but you have to switch back to
       | C for those, because they depend on operations outside of what
       | LLVM provides for vectors.
        
         | steeve wrote:
         | Like what?
         | 
         | You can also directly call LLVM intrinsics in case this doesn't
         | work
        
           | lukaslalinsky wrote:
           | Just a couple of days ago, I wanted to implement specialized
           | StreamVByte decoding in Zig, but @shuffle() needs to mask to
           | be compile time known, while _mm_shuffle_epi8() works just
           | fine with a dynamic mask. I remember that some time ago, I
           | couldn't find an alternative to _mm_alignr_epi8().
        
             | aqrit wrote:
             | `_mm_alignr_epi8` is a compile-time known shuffle that gets
             | optimized well by LLVM [1].
             | 
             | If you need the exact behavior of `pshufb` you can use asm
             | or the llvm intrinsic [2]. iirc, I once got the compiler to
             | emit a `pshufb` for a runtime shuffle... that always
             | guaranteed indices in the 0..15 range?
             | 
             | Ironically, I also wanted to try zig by doing a StreamVByte
             | implementation, but got derailed by the lack of SSE/AVX
             | intrinsics support.
             | 
             | [1] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5d
             | eec34... [2] https://github.com/aqrit/sse2zig/blob/444ed8d1
             | 29625ab5deec34...
        
         | AndyKelley wrote:
         | Missing SIMD functionality is welcome issue reports. Obviously
         | we're not going to copy the C intrinsics directly since Zig
         | SIMD is portable, but everything should be expressible.
         | 
         | It doesn't really have to do with what operations LLVM provides
         | for vectors. LLVM supports all the SIMD intrinsics of clang,
         | and LLVM is one of many backends of zig.
        
       | burntsushi wrote:
       | Good write-up. This is a very popular approach to substring
       | search! It is still worst case `O(m*n)` though. Do you have a
       | fallback implementation like the `memchr` crate has to guarantee
       | `O(m+n)`?
       | 
       | I'll also push back on some bits in the end:                   >
       | But if it's so much better, then why haven't I made a pull
       | request to         > change std.mem.indexOf to use SIMD? Well,
       | the reason is that         >         > std.mem.indexOf is generic
       | over element size, and having a size         > larger than u8
       | makes the algorithm much slower         >         > The algorithm
       | used in stdmem.indexOf is cross-platform, while the         >
       | SIMD code wouldn't be. (not all platforms have SIMD registers at
       | all,         > Arm has only 128-bit)
       | 
       | Does Zig not have a way to specialize this for sequences of
       | unsigned 8-bit integers? If not, and you're thereforce force to
       | used a more generic algorithm, that seems pretty unfortunate.
       | > Substring searching is rarely the bottleneck in programs,
       | > especially ones written in a fast language like Zig. That's why
       | > I don't personally think it would be worth it to add it to the
       | > standard library.
       | 
       | Oh I'm not sure I buy this at all! Substring search is a
       | primitive operation and easily _can_ be a bottleneck. There 's a
       | reason why widely used substring search implementations tend to
       | be highly optimized.
        
         | ozgrakkurt wrote:
         | It is very easy to specialise a function in zig, you just put
         | if(T == u8) or something like that inside the function and do
         | w/e in there
        
           | unwind wrote:
           | Can the compiler detect that and use the proper code so no
           | test is needed at runtime?
           | 
           | This is Zig so I guess the answer is "yeah, duh" but wanted
           | to ask since it sounded like the solution is less, uh,
           | "compiler-friendly" than I would expect.
        
             | hansvm wrote:
             | Yes, and if you're paranoid you can write
             | if (comptime T == u8) {         // code       }
             | 
             | to guarantee that if you're wrong about how the compiler
             | behaves then you'll get a compiler error.
        
               | dpatterbee wrote:
               | Zig has no type info at runtime so this is and always
               | will be guaranteed to be a comptime check.
        
         | aarol wrote:
         | I'm the author of the post, thanks for your feedback! I was
         | inspired by your comment on HN a while back and started
         | learning about this stuff, reading the source code of `memchr`
         | was especially great.
         | 
         | You're totally right about the first part there was a serious
         | consideration to add this to zig's standard library, there
         | would definitely need to be a fallback to avoid the `O(m*n)`
         | situation.
         | 
         | I'll admit that there are a lot of false assumptions at the
         | end, you could totally specialize it for u8 and also get the
         | block size according to CPU features at compile time with
         | `std.simd.suggestVectorSize()`
        
           | hansvm wrote:
           | Or at runtime, if you'd like. You can create a generic binary
           | that runs faster on supported platforms.
        
             | vlovich123 wrote:
             | > Or at runtime, if you'd like
             | 
             | You have to be careful about how you do it because those
             | runtime checks can easily swamp the performance gains you
             | get from SIMD.
             | 
             | > also get the block size according to CPU features at
             | compile time with `std.simd.suggestVectorSize()`
             | 
             | You have to be careful with this since
             | std.simd.suggestVectorSize is going to return values for
             | the minimum SIMD version you're targeting I believe which
             | can be suboptimal for portable binaries.
             | 
             | You probably want a mix where you carefully compute the
             | vector size for the current platform globally once and have
             | multiple compiled dispatch paths in your binary that you
             | can pick based on that value & let the CPU prefetcher hide
             | the cost of a check before each invocation.
        
               | stouset wrote:
               | > You have to be careful about how you do it because
               | those runtime checks can easily swamp the performance
               | gains you get from SIMD.
               | 
               | That seems surprising, particularly given that
               | autovectorizing compilers tend to insert pretty extensive
               | preambles that check for whether or not it's likely the
               | vectorized one will have a speedup over the looping
               | version (e.g., based on the number of iterations) and
               | postambles that handle the cases where the number of loop
               | iterations isn't cleanly divisible by the number of
               | elements in the chosen vector size.
               | 
               | Why would checking for supported SIMD instructions cause
               | that much additional work?
               | 
               | Also, even if this is the case, you can always check once
               | and then replace the function body with the chosen one,
               | eliding the check.
        
       | garganzol wrote:
       | Every language tries to implement the best in class memory
       | set/search primitives. Maybe we should move them to something
       | called libos, where they will be implemented by every host OS?
       | Then, OS manufacturers can supply libos.so/libos.dll/libos.dylib
       | as part of the official OS distributions.
       | 
       | If the wheels get reinvented again and again, it means that they
       | should be readily available.
        
         | ozgrakkurt wrote:
         | Isn't this what libc is? Like musl-libc or glibc
        
           | garganzol wrote:
           | libc is de jure for C language in Unix-like systems. De
           | facto, it is used for other things and by other languages
           | creating a bit of an architectural mess.
           | 
           | What I am talking about is creating a cross-vendor standard.
        
       | ww520 wrote:
       | Excellent write up. I really appreciate the clear and detailed
       | explanation of the algorithm.
       | 
       | The nice thing about Zig's SIMD operation is the register size
       | support is transparent. You can just declare a 64-byte vector as
       | the Block, and Zig would use an AVX256 or two AVX2 (32-byte)
       | registers behind the scene. All other SIMD operations on the type
       | are transparently done with regard to the registers when compiled
       | to the targeted platform.
       | 
       | Even using two AVX2 registers for 64 bytes of data is a win due
       | to instruction pipelining. Most CPU have multiple SIMD registers
       | and the two 32-byte data chunks are independent. CPU can run them
       | in parallel.
       | 
       | The next optimization is to line the data up at 64 byte
       | alignment, to match the L1/L2 cache line size. Memory access is
       | slow in general.
        
         | burntsushi wrote:
         | Another challenge may be as a result of using a portable SIMD
         | API instead of specific ISA instructions. I'm specifically
         | thinking about computing the mask, which on x86-64 is I assume
         | implemented via movemask. But aarch64 lacks such an
         | instruction, so you need to do other shenanigans for the best
         | codegen:
         | https://github.com/BurntSushi/memchr/blob/ceef3c921b5685847e...
        
           | ww520 wrote:
           | A language level or std library level poly-fill for the
           | missing SIMD operations for the platforms would be great.
        
       | okr wrote:
       | I remember reading Daniel Lemire's Blog about SIMD a few days
       | ago, in which he discussed substring search as well!
       | 
       | https://lemire.me/blog/2025/08/09/why-do-we-even-need-simd-i...
        
       ___________________________________________________________________
       (page generated 2025-08-11 23:00 UTC)