[HN Gopher] SIMD-friendly algorithms for substring searching (2018)
___________________________________________________________________
SIMD-friendly algorithms for substring searching (2018)
Author : Rendello
Score : 190 points
Date : 2025-06-14 03:31 UTC (19 hours ago)
(HTM) web link (0x80.pl)
(TXT) w3m dump (0x80.pl)
| ncruces wrote:
| I implemented these in my attempt to SIMD optimize Wasm/WASI
| libc.
|
| https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc
|
| For known haystack lengths, and large needles, I found it useful
| to augment the approach with Quick Search.
|
| https://igm.univ-mlv.fr/~lecroq/string/node19.html
| clauderoux wrote:
| I implemented many different algorithms for searching and
| splitting strings using SMID methods as well:
| https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I
| used a different algorithm than the ones presented here.
| ncruces wrote:
| Could you summarize your algorithm? Like a high-level
| description?
|
| Like algorithm 1 (from the article) checks N possible positions
| at a time by matching the first and last character; algorithm 2
| instead tests the 4 leading characters.
| Rendello wrote:
| I tried to implement a modified version for LZ77 window search
| with Zig's generic SIMD a few years ago:
|
| https://news.ycombinator.com/item?id=44273983
| rurban wrote:
| We know that the libc strstr performs horrible, but musl is fast
| and state of the art.
|
| Now we just need a name, and we can add it to the smart shootout.
| Wonder how it compares to the best SIMD algos
| ncruces wrote:
| If you're interested in a rough benchmark, I compared musl's
| "two way" with a variation on this algorithm for my SIMD
| optimized libc linked in a sister comment.
|
| The benchmark involves finding this line in that file:
| https://github.com/ncruces/go-sqlite3/blob/v0.26.1/sqlite3/l...
|
| The improvements over musl are in this table:
| https://github.com/ncruces/go-sqlite3/tree/v0.26.1/sqlite3/l...
|
| I'd say it's a decent improvement, if not spectacular. musl
| does particularly well with known length strings (memmem),
| whereas I'm allowed to cheat for unknown length strings
| (strstr) because the Wasm environment allows me to skirt some
| UB.
|
| The NUL terminated string curve ball wrecks many good
| algorithms.
| MattPalmer1086 wrote:
| Would be good to have more SIMD algorithms in smart. Maybe I'll
| have a play with it if I get time.
| MattPalmer1086 wrote:
| Looking at them, would need a little work to make them safe
| (not reading past needle or haystack). Probably not too much
| effort, may need a different search for data mod block size
| at the end.
| mgaunard wrote:
| missing (2018) ?
| sylware wrote:
| What is the string sizes thresold for efficiency?
|
| Because the SIMD (aligned, size computations, etc) setup code is
| usually more expensive that byte-oriented basic search for
| "small" strings.
|
| Yes, it depends heavily on the hardware architecture.
| ncruces wrote:
| Here, it's often the opposite.
|
| These algorithms are basically brute force with very little
| setup, and can become quadratic for large, periodic needles.
|
| String search algorithms that avoid quadratic behaviour (and
| may even be sublinear) have an higher cost setup that explores
| needle structure
| sylware wrote:
| What? It all depends on the sizes of the strings. Hardware
| benchmarks, for each significant hardware uarchitectures,
| must be done on combinations of string sizes in order to know
| "when" the SIMD setup code is worth it.
|
| Often the setup code is not worth it, even the SIMD
| complexity is not bringing enough performance in many use
| cases (basically, kept for niche applications, little lib
| statically linked for those).
| ncruces wrote:
| Did you look at the algorithm 1? The setup code is a pair
| of splats. There's no consideration for alignment, it works
| with unaligned data just fine. So what do you mean? What
| threshold would you expect for it to be worth it (assuming
| it's useful at all)?
| sylware wrote:
| Dude, I cannot be more explicit than that.
|
| That said, when you work on such code, use assembly, C
| code is for reference (have a look at dav1d av1 decoder).
| burntsushi wrote:
| What? You don't have to do that. Hell, the SIMD code that
| ripgrep uses is written in Rust.
| ncruces wrote:
| I can't either.
|
| Having implemented this, I'll claim that this is already
| _competitive_ for needles as small as 2 bytes, and if the
| needle doesn 't show up in the first couple "vector
| sizes" of the haystack.
|
| Also, look at burntsushi's comment. They use this
| algorithm for similarly small sizes. They _do_ use Rabin-
| Karp for "supremely short haystacks" (just because that
| lib is awesome and uses every trick in the book), but
| that wording should give you an idea of how small the
| threshold will be.
|
| And really, how common are "supremely small" _haystacks_?
| eska wrote:
| The swar algorithm has UB because it casts the 1 byte aligned
| data to 8 byte aligned. Perhaps it suffers some performance
| issues from unaligned loads?
| burntsushi wrote:
| I always took these as "idealized" algorithms or pseudo-code.
| In addition to not doing correct unaligned loads (using
| `memcpy`), the boundary checks are wrong. The code assumes the
| length of the haystack is a multiple of 8. Additionally, if the
| needle is empty, you'll get unsigned integer overflow leading
| to an out-of-bounds access on the needle. That's three
| different ways to get UB.
|
| This is pretty common in my experience when demonstrating SIMD
| code. Often only the relevant bits are shown, and not a lot of
| effort is given to getting the boundary cases correct. It's
| rarely acknowledged explicitly, but I would assume it's because
| the boundary conditions are somewhat less interesting than the
| "meat" of how to use vectors in an interesting way.
| Rendello wrote:
| When I modified the idealized algorithm to suit my needs a
| few years ago, I learned just how painful the boundary cases
| are. I realized I didn't want to ever touch SIMD string
| searching unless:
|
| - I had a very specific use-case that I really needed SIMD
| for and was willing to really invest a lot in it; or
|
| - The work had already been done for me in a generic sense
| and I could reap the fruits of that labour (ripgrep / Rust
| Regex, thanks for that by the way)
| eska wrote:
| True. The size assumption is mentioned in the article
| however. The simd library by agner makes the same assumption
| and demands that the user over-allocates their data
| burntsushi wrote:
| The "AVX2 (generic)" approach is roughly what ripgrep uses (via
| Rust's `regex` crate) to accelerate most searches. Even something
| like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck
| `Sherlock` out of the regex and search that.
|
| The actual implementation is here:
| https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...
|
| The main difference with the algorithm presented here is that
| instead of always using the first and last bytes of the needle, a
| heuristic is used to try to pick two bytes that occur less
| frequently according to a background frequency distribution.
|
| It ends up being quite a bit faster than just plain Two-Way or
| even GNU libc's implementation of `memmem`. From the root of the
| `memchr` repository: $ rebar rank
| benchmarks/record/x86_64/2023-12-29.csv -e
| '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e
| '^libc/memmem/oneshot' Engine
| Version Geometric mean of speed ratios Benchmark count
| ------ -------
| ------------------------------ ---------------
| rust/memchr/memmem/prebuilt 2.7.1 1.07
| 57 rust/memchr/memmem/oneshot 2.7.1 1.39
| 54 libc/memmem/oneshot unknown 3.15
| 54 rust/memchr/memmem/twoway 2.7.1 5.26
| 54
|
| The "prebuilt" benchmarks also demonstrate the API deficiencies
| of routines like `memmem`. Because of their API, they need to
| rebuild the state necessary to execute a search for the needle
| given. But it is often the case that the needle is invariant
| across repeated calls with different haystacks. So you end up
| doing that repeated work for no gain.
| jakobnissen wrote:
| Interesting. How do you know the background distribution of
| bytes? Won't a scan of the haystack to get the distribution
| take up a significant amount of time?
| burntsushi wrote:
| It's a _background_ distribution. That is, a guess. A
| heuristic. It doesn't change with the input.
| azhenley wrote:
| Now if I can just use SIMD directly from Python without calling
| out to another language.
| adgjlsfhk1 wrote:
| why would you want to? if you are performance bound with Python
| code, you can pick up a 20-50x performance improvement by
| switching language.
| ashvardanian wrote:
| Hi Austin! Was just checking your blog and the vowels detection
| post (< https://austinhenley.com/blog/vowels.html>).
|
| What exactly do you mean by "use SIMD directly without calling
| out to another language"?
|
| In some way Assembly will probably anyways be another
| language... but that's a technicality.
|
| I guess the spectrum of SIMD-related work in relation to Python
| is quite broad. There are projects like PeachPy, to help one
| write x86 Asm in Python, new Python'esque languages like Mojo,
| or SIMD libraries with thin CPython bindings. Do you mean one
| of those?
|
| PS: I'm in the last camp. And in case you are focused on ASCII,
| for vowel search, have you tried StringZilla's `find_first_of`
| (< https://github.com/ashvardanian/StringZilla/blob/960967d78c7
| ...>)? I think it should perform well ;)
| klaussilveira wrote:
| Reminds me of https://github.com/nikneym/hparse
|
| Which uses SIMD algos for fast HTTP parsing.
| mwsherman wrote:
| C# has implemented some SIMD for IndexOf:
| https://github.com/dotnet/runtime/pull/63285
___________________________________________________________________
(page generated 2025-06-14 23:00 UTC)