[HN Gopher] SIMD-friendly algorithms for substring searching (2018)
       ___________________________________________________________________
        
       SIMD-friendly algorithms for substring searching (2018)
        
       Author : Rendello
       Score  : 190 points
       Date   : 2025-06-14 03:31 UTC (19 hours ago)
        
 (HTM) web link (0x80.pl)
 (TXT) w3m dump (0x80.pl)
        
       | ncruces wrote:
       | I implemented these in my attempt to SIMD optimize Wasm/WASI
       | libc.
       | 
       | https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc
       | 
       | For known haystack lengths, and large needles, I found it useful
       | to augment the approach with Quick Search.
       | 
       | https://igm.univ-mlv.fr/~lecroq/string/node19.html
        
       | clauderoux wrote:
       | I implemented many different algorithms for searching and
       | splitting strings using SMID methods as well:
       | https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I
       | used a different algorithm than the ones presented here.
        
         | ncruces wrote:
         | Could you summarize your algorithm? Like a high-level
         | description?
         | 
         | Like algorithm 1 (from the article) checks N possible positions
         | at a time by matching the first and last character; algorithm 2
         | instead tests the 4 leading characters.
        
         | Rendello wrote:
         | I tried to implement a modified version for LZ77 window search
         | with Zig's generic SIMD a few years ago:
         | 
         | https://news.ycombinator.com/item?id=44273983
        
       | rurban wrote:
       | We know that the libc strstr performs horrible, but musl is fast
       | and state of the art.
       | 
       | Now we just need a name, and we can add it to the smart shootout.
       | Wonder how it compares to the best SIMD algos
        
         | ncruces wrote:
         | If you're interested in a rough benchmark, I compared musl's
         | "two way" with a variation on this algorithm for my SIMD
         | optimized libc linked in a sister comment.
         | 
         | The benchmark involves finding this line in that file:
         | https://github.com/ncruces/go-sqlite3/blob/v0.26.1/sqlite3/l...
         | 
         | The improvements over musl are in this table:
         | https://github.com/ncruces/go-sqlite3/tree/v0.26.1/sqlite3/l...
         | 
         | I'd say it's a decent improvement, if not spectacular. musl
         | does particularly well with known length strings (memmem),
         | whereas I'm allowed to cheat for unknown length strings
         | (strstr) because the Wasm environment allows me to skirt some
         | UB.
         | 
         | The NUL terminated string curve ball wrecks many good
         | algorithms.
        
         | MattPalmer1086 wrote:
         | Would be good to have more SIMD algorithms in smart. Maybe I'll
         | have a play with it if I get time.
        
           | MattPalmer1086 wrote:
           | Looking at them, would need a little work to make them safe
           | (not reading past needle or haystack). Probably not too much
           | effort, may need a different search for data mod block size
           | at the end.
        
       | mgaunard wrote:
       | missing (2018) ?
        
       | sylware wrote:
       | What is the string sizes thresold for efficiency?
       | 
       | Because the SIMD (aligned, size computations, etc) setup code is
       | usually more expensive that byte-oriented basic search for
       | "small" strings.
       | 
       | Yes, it depends heavily on the hardware architecture.
        
         | ncruces wrote:
         | Here, it's often the opposite.
         | 
         | These algorithms are basically brute force with very little
         | setup, and can become quadratic for large, periodic needles.
         | 
         | String search algorithms that avoid quadratic behaviour (and
         | may even be sublinear) have an higher cost setup that explores
         | needle structure
        
           | sylware wrote:
           | What? It all depends on the sizes of the strings. Hardware
           | benchmarks, for each significant hardware uarchitectures,
           | must be done on combinations of string sizes in order to know
           | "when" the SIMD setup code is worth it.
           | 
           | Often the setup code is not worth it, even the SIMD
           | complexity is not bringing enough performance in many use
           | cases (basically, kept for niche applications, little lib
           | statically linked for those).
        
             | ncruces wrote:
             | Did you look at the algorithm 1? The setup code is a pair
             | of splats. There's no consideration for alignment, it works
             | with unaligned data just fine. So what do you mean? What
             | threshold would you expect for it to be worth it (assuming
             | it's useful at all)?
        
               | sylware wrote:
               | Dude, I cannot be more explicit than that.
               | 
               | That said, when you work on such code, use assembly, C
               | code is for reference (have a look at dav1d av1 decoder).
        
               | burntsushi wrote:
               | What? You don't have to do that. Hell, the SIMD code that
               | ripgrep uses is written in Rust.
        
               | ncruces wrote:
               | I can't either.
               | 
               | Having implemented this, I'll claim that this is already
               | _competitive_ for needles as small as 2 bytes, and if the
               | needle doesn 't show up in the first couple "vector
               | sizes" of the haystack.
               | 
               | Also, look at burntsushi's comment. They use this
               | algorithm for similarly small sizes. They _do_ use Rabin-
               | Karp for "supremely short haystacks" (just because that
               | lib is awesome and uses every trick in the book), but
               | that wording should give you an idea of how small the
               | threshold will be.
               | 
               | And really, how common are "supremely small" _haystacks_?
        
       | eska wrote:
       | The swar algorithm has UB because it casts the 1 byte aligned
       | data to 8 byte aligned. Perhaps it suffers some performance
       | issues from unaligned loads?
        
         | burntsushi wrote:
         | I always took these as "idealized" algorithms or pseudo-code.
         | In addition to not doing correct unaligned loads (using
         | `memcpy`), the boundary checks are wrong. The code assumes the
         | length of the haystack is a multiple of 8. Additionally, if the
         | needle is empty, you'll get unsigned integer overflow leading
         | to an out-of-bounds access on the needle. That's three
         | different ways to get UB.
         | 
         | This is pretty common in my experience when demonstrating SIMD
         | code. Often only the relevant bits are shown, and not a lot of
         | effort is given to getting the boundary cases correct. It's
         | rarely acknowledged explicitly, but I would assume it's because
         | the boundary conditions are somewhat less interesting than the
         | "meat" of how to use vectors in an interesting way.
        
           | Rendello wrote:
           | When I modified the idealized algorithm to suit my needs a
           | few years ago, I learned just how painful the boundary cases
           | are. I realized I didn't want to ever touch SIMD string
           | searching unless:
           | 
           | - I had a very specific use-case that I really needed SIMD
           | for and was willing to really invest a lot in it; or
           | 
           | - The work had already been done for me in a generic sense
           | and I could reap the fruits of that labour (ripgrep / Rust
           | Regex, thanks for that by the way)
        
           | eska wrote:
           | True. The size assumption is mentioned in the article
           | however. The simd library by agner makes the same assumption
           | and demands that the user over-allocates their data
        
       | burntsushi wrote:
       | The "AVX2 (generic)" approach is roughly what ripgrep uses (via
       | Rust's `regex` crate) to accelerate most searches. Even something
       | like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck
       | `Sherlock` out of the regex and search that.
       | 
       | The actual implementation is here:
       | https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...
       | 
       | The main difference with the algorithm presented here is that
       | instead of always using the first and last bytes of the needle, a
       | heuristic is used to try to pick two bytes that occur less
       | frequently according to a background frequency distribution.
       | 
       | It ends up being quite a bit faster than just plain Two-Way or
       | even GNU libc's implementation of `memmem`. From the root of the
       | `memchr` repository:                   $ rebar rank
       | benchmarks/record/x86_64/2023-12-29.csv -e
       | '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e
       | '^libc/memmem/oneshot'         Engine
       | Version  Geometric mean of speed ratios  Benchmark count
       | ------                       -------
       | ------------------------------  ---------------
       | rust/memchr/memmem/prebuilt  2.7.1    1.07
       | 57         rust/memchr/memmem/oneshot   2.7.1    1.39
       | 54         libc/memmem/oneshot          unknown  3.15
       | 54         rust/memchr/memmem/twoway    2.7.1    5.26
       | 54
       | 
       | The "prebuilt" benchmarks also demonstrate the API deficiencies
       | of routines like `memmem`. Because of their API, they need to
       | rebuild the state necessary to execute a search for the needle
       | given. But it is often the case that the needle is invariant
       | across repeated calls with different haystacks. So you end up
       | doing that repeated work for no gain.
        
         | jakobnissen wrote:
         | Interesting. How do you know the background distribution of
         | bytes? Won't a scan of the haystack to get the distribution
         | take up a significant amount of time?
        
           | burntsushi wrote:
           | It's a _background_ distribution. That is, a guess. A
           | heuristic. It doesn't change with the input.
        
       | azhenley wrote:
       | Now if I can just use SIMD directly from Python without calling
       | out to another language.
        
         | adgjlsfhk1 wrote:
         | why would you want to? if you are performance bound with Python
         | code, you can pick up a 20-50x performance improvement by
         | switching language.
        
         | ashvardanian wrote:
         | Hi Austin! Was just checking your blog and the vowels detection
         | post (< https://austinhenley.com/blog/vowels.html>).
         | 
         | What exactly do you mean by "use SIMD directly without calling
         | out to another language"?
         | 
         | In some way Assembly will probably anyways be another
         | language... but that's a technicality.
         | 
         | I guess the spectrum of SIMD-related work in relation to Python
         | is quite broad. There are projects like PeachPy, to help one
         | write x86 Asm in Python, new Python'esque languages like Mojo,
         | or SIMD libraries with thin CPython bindings. Do you mean one
         | of those?
         | 
         | PS: I'm in the last camp. And in case you are focused on ASCII,
         | for vowel search, have you tried StringZilla's `find_first_of`
         | (< https://github.com/ashvardanian/StringZilla/blob/960967d78c7
         | ...>)? I think it should perform well ;)
        
       | klaussilveira wrote:
       | Reminds me of https://github.com/nikneym/hparse
       | 
       | Which uses SIMD algos for fast HTTP parsing.
        
       | mwsherman wrote:
       | C# has implemented some SIMD for IndexOf:
       | https://github.com/dotnet/runtime/pull/63285
        
       ___________________________________________________________________
       (page generated 2025-06-14 23:00 UTC)