[HN Gopher] Span<T>.SequenceEquals is faster than memcmp
___________________________________________________________________
Span<T>.SequenceEquals is faster than memcmp
Author : xnorswap
Score : 104 points
Date : 2025-03-30 14:53 UTC (8 hours ago)
(HTM) web link (richardcocks.github.io)
(TXT) w3m dump (richardcocks.github.io)
| xnorswap wrote:
| A more meaningful adventure into microbenchmarking than my last.
| I look at why we no longer need to P/Invoke memcmp to efficiently
| compare arrays in C# / .NET.
|
| Old stackoverflow answers are a dangerous form of bit-rot. They
| get picked up by well-meaning developers and LLMs alike and
| recreated years after they are out of date.
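|
| For anyone who hasn't seen the old advice, this is roughly its
| shape next to the modern one-liner (a minimal sketch; the
| DllImport signature is the classic Stack Overflow variant and
| the class/method names are just for illustration):
|
|     using System.Runtime.InteropServices;
|
|     static class Bytes
|     {
|         // Old advice: P/Invoke msvcrt's memcmp to compare arrays.
|         [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
|         static extern int memcmp(byte[] b1, byte[] b2, nuint count);
|
|         public static bool OldEquals(byte[] a, byte[] b) =>
|             a.Length == b.Length && memcmp(a, b, (nuint)a.Length) == 0;
|
|         // Current advice: the BCL's vectorised comparison.
|         public static bool NewEquals(byte[] a, byte[] b) =>
|             a.AsSpan().SequenceEqual(b);
|     }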
| neonsunset wrote:
| For the loop regression in .NET 9, please submit an issue at
| dotnet/runtime. If my guess is correct, it's yet another loop
| tearing miscompilation caused by suboptimal loop lowering
| changes.
| xnorswap wrote:
| No problem, I've raised the issue as
| https://github.com/dotnet/runtime/issues/114047 .
| neonsunset wrote:
| Thanks!
| timewizard wrote:
| The number of times I've caught developers wholesale copying
| stack overflow posts, errors and all, is far too high.
| guerrilla wrote:
| Indeed, the problems of LLMs are not new. We just automated
| what people who have no idea what they are doing were doing
| anyway. We... optimized incompetence.
| eddythompson80 wrote:
| This is a bit elitist, isn't it? It highly depends on the
| type of code copied, and it's a huge part of why software
| engineers are more bullish about LLMs than most other
| professions are.
|
| Regardless of how competent a programmer you are, you
| don't necessarily possess the knowledge/answer to "How to
| find open ports on Linux" or "How to enumerate child pids
| of a parent pid" or "what is most efficient way to compare
| 2 byte arrays in {insert language}" etc. A search engine or
| an LLM is a fine solution for those problems.
|
| You know that the answer to that question is what you're
| after. I'd generally argue that knowing the right question
| to ask is all that matters; the answer itself is not
| interesting. It most likely rests on deeply nested knowledge
| about how the Linux networking stack works, or how process
| management works on a particular OS. If that were the
| central point of the software we're building (for example,
| if we were a Linux networking stack company), then by all
| means: it would be silly to find a lead engineer in our
| company who is confused about how open ports work in Linux.
| timewizard wrote:
| In this case it was code to generate an "oauth2
| code_challenge" and then correctly URL-encode it. Instead
| of using replaceAll, the example used replace, so only the
| first occurrence in the string was getting converted.
|
| When pressed the developer said they thought their code
| was "too fast for the oauth server" and that's why it
| failed about 25% of the time.
|
| The level of disappointment I had when I found the
| problem was enough to be memorable, but to find the post
| he flat out copied on stack overflow, along with a
| comment below it highlighting the bug AND the fix, nearly
| brought me to apoplexy.
| eddythompson80 wrote:
| To me ".replace()" vs ".replaceAll()" (in JS at least) is
| a perfect example to evaluate a developer on. Any JS
| developer would know that replace()'s main gotcha is that
| it's not replaceAll(). I used C# professionally for years
| before using JS. And ".Replace()" in C# works the same
| way ".replaceAll()" does in JS. It was one of the first
| things I learned about JS and how I needed to reevaluate
| all my code in JS.
|
| In interviews, I'd often ask the interviewee "what is
| your background?" and "do you know that in JS .replace()
| is unlike .replace() in Java or .Replace() in .NET?". That
| question should make perfect sense to any developer who
| realizes the word "replace" is somewhat ambiguous. I
| would always argue that the behavior of Java and .NET is
| the right behavior, but it's an ambiguous word
| nonetheless.
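|
| A tiny sketch of the difference being described, with the
| JS behaviour noted in comments (the values are just an
| example):
|
|     // C#: String.Replace substitutes every occurrence, which is
|     // what JS's replaceAll() does.
|     var encoded = "a+b+c".Replace("+", "%2B");  // "a%2Bb%2Bc"
|
|     // JS, for contrast: "a+b+c".replace("+", "%2B") only rewrites
|     // the first match, giving "a%2Bb+c".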
| guerrilla wrote:
| > This is a bit elitist isn't it.
|
| Damn straight. Understand what you're doing or don't do
| it. Software is bad enough as it is. There's absolutely
| no room for the incompetent in this game. That science
| experiment has been done to death and we're certain of
| the results.
| sroussey wrote:
| Read the license. CC BY-SA.
|
| Copying code and breaking the license is a liability many
| companies don't want and therefore block SO when in the
| office.
|
| I've seen upvoted answers to questions that purposefully
| have a backdoor in them (one character away from being a
| correct answer, so you are vulnerable only if you actually
| copied and pasted).
|
| I think S.O. is great, and LLMs too, but any "lead"
| engineer would try to learn and refute the content.
|
| BTW: my favorite thing to do after an LLM gives a coding
| answer: now fix the bug.
|
| The answers are hilarious. Oh, I see the security
| vulnerabilities. Or oh, this won't work in an
| asynchronous environment. Etc, etc. Sometimes you have to
| be specific with the type of bug you spot (looking at
| you, sonnet 3.7). It's worth adding to your cursor rules
| or similar.
| eddythompson80 wrote:
| All of my 24-year career has been at 4 "very large"
| software companies and 1 startup. 3 out of the 4 had a
| culture of "// https://stackoverflow.com/xxxxx" type
| comments on top of any piece of code that someone learned
| about from stackoverflow. There was one where everyone
| made a big fuss about such things in code reviews. They'd
| ask "we don't have any functions in this project that use
| this Linux syscall. How do you know this is what needs to
| be called???" And you had 2 ways of answering. You could
| link a kernel.org url saying "I looked through the Linux
| sources and learned that to do X you need to call Y api",
| and everyone would reply "cool", "great find", etc. You
| could also say "I searched for X and found this
| stackoverflow response", to which everyone would reply
| "stackoverflow is often wrong", "do we have the right
| license to use that code", "don't use stackoverflow",
| "please reconsider this code".
| sroussey wrote:
| Or just put the link in the code as the license requires.
|
| Then... you could have a bot that watches for updates to
| the post in case it was wrong and someone points it out.
| mschuster91 wrote:
| > There was one where everyone made a big fuss about such
| things in code reviews.
|
| There are always dumb morons... sigh.
|
| Even if you _don't_ copy code from SO, it still makes
| sense to link to it if there is a decent explanation of
| whatever problem you were facing. When I write code and I
| hit some issue - particularly if it's some sort of weird
| ass edge case - I _always_ leave a link to SO, and if
| it's something that's also known upstream but not fixed
| yet (common if you use niche stuff), I'll also leave a
| TODO comment linking to the upstream issue.
|
| Code should not just be code, it should also be a
| document of knowledge and learning to your next fellow
| coder who touches the code.
|
| (This also means: FFS do not just link stackoverflow in
| the git commit history. _No one_ is looking there years
| later)
| knome wrote:
| It's hardly unreasonable to expect your peers to at least
| _try_ to understand what they are doing. Copypaste coding
| is never conducive to a good codebase.
| eddythompson80 wrote:
| I do expect them to understand the code they are
| copying/pasting, though only to an extent. I expect they
| would test the code, try different inputs, and check the
| results. I'd also expect them to test that code across
| all the different "Linux distros" we use, for example.
| After all, that code basically calls a Linux syscall, so
| I'd assume it's very stable.
|
| Then I learn that this particular syscall depends on a
| kernel build flag that Debian passes, but alpine doesn't.
| You can get it in alpine if you set that other flag. What
| are you, a "caveman who doesn't know that `pctxl: true`
| is the build flag to enable this feature"?
| SketchySeaBeast wrote:
| The problem with the LLM equivalent is that you can't see
| the timestamp of the knowledge it's drawing from. With
| stack overflow I can see a post is from 2010 and look for
| something more modern; that due diligence is no longer
| available with an LLM, which has little reason to choose
| the newest solution.
| junto wrote:
| It's astounding just how fast modern .NET has become. I'd be
| curious as to how the .NET (Framework excluded) benchmarks run in
| a Linux container.
| api wrote:
| I wonder how it compares to (1) Go, (2) the JVM, and (3) native
| stuff like Rust and C++.
|
| Obviously as with all such benchmarks the skill of the
| programmer doing the implementing matters a lot. You can write
| inefficient clunky code in any language.
| kfuse wrote:
| All modern popular languages are fast, except the most
| popular one.
| paulddraper wrote:
| And Python+Ruby
| api wrote:
| JavaScript is hella fast for a dynamically typed language,
| but that's because we've put insane amounts of effort into
| making fast JITing VMs for it.
| zamalek wrote:
| Sure, but "for a dynamically typed language" still means
| that it's slow amongst all languages.
| jeffbee wrote:
| Java and .NET (and JS or anything that runs under v8 or
| HotSpot) usually compare favorably to others because they
| come out of the box with PGO. The outcomes for peak-optimized
| C++ are very good, but few organizations are capable of
| actually getting from their C++ build what every .NET user
| gets for free.
| metaltyphoon wrote:
| .NET goes as far as having D(ynamic)PGO, which is enabled by
| default.
| junto wrote:
| https://medium.com/deno-the-complete-reference/net-vs-
| java-v...
| kristianp wrote:
| I would say Go is not in the same category of speed as Rust
| and C/C++. The level of optimisation done by them is next
| level. Go also doesn't inline your assembly functions, has
| less vectorisation in the standard libraries, and doesn't
| allow you to easily add vectorisation with intrinsics.
| Analemma_ wrote:
| I agree, .NET Core has improved by gigantic leaps and bounds.
| Which makes it all the more frustrating to me that .NET and
| Java both had "lost decades" of little to no improvement -
| Java's mostly only on the language side, since 3rd-party JVMs
| still saw decent changes, but .NET's on both the language and
| runtime side. I think this freeze made (and continues to make)
| people
| think the ceiling of both performance and developer ergonomics
| of these languages is much lower than it actually is.
| paavohtl wrote:
| I certainly agree that Java / JVM had a lost decade (or even
| more), but not really with C# / .NET. When do you consider
| that lost decade to have been? C# has had a major release
| with new language features every 1-3 years, consistently for
| the past 20+ years.
| merb wrote:
| > That's not a super helpful description, but the summary is that
| it's stack-allocated rather than heap allocated.
|
| I'm pretty sure that this is not 100% correct, since one can also
| use other allocation methods and use a span to represent the
| memory. Only with stackalloc will the memory it points to be
| stack-allocated. What it basically means is that the Span type
| itself is always stack-allocated, but not necessarily the memory
| it points to.
| john-h-k wrote:
| Yes, this is correct. The span itself - the (ptr, len) pair -
| is on stack (by default) but the data is almost always on the
| heap, with stackalloc being the most notable exception
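|
| A minimal illustration of the distinction (names are just for
| the example):
|
|     byte[] heapData = new byte[256];      // element data on the GC heap
|     Span<byte> a = heapData;              // the span (ref, length) lives on the stack
|     Span<byte> b = stackalloc byte[256];  // here the data itself is stack-allocated too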
| neonsunset wrote:
| The design of spans does not make assumptions about this,
| however. The `ref T` pointer inside the span can point to any
| memory location.
|
| It is not uncommon to wrap unmanaged memory in spans. Another
| popular case, even if it's something most developers don't
| realize, is readonly spans wrapping constant data embedded in
| the application binary. For example, if you pass '[1, 2, 3,
| 4]' to an argument accepting 'ReadOnlySpan<int>' - this will
| just pass a reference to constant data. It also works for new
| T[] { } as long as T is a primitive and the target of the
| expression is a read-only span. It's quite prevalent nowadays
| but the language tries to get out of your way when doing so.
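|
| Something like this, as a sketch (Sum is just a placeholder,
| and the unmanaged part assumes unsafe code is enabled and
| .NET 6+'s NativeMemory):
|
|     using System.Runtime.InteropServices;
|
|     static int Sum(ReadOnlySpan<int> values)
|     {
|         int total = 0;
|         foreach (var v in values) total += v;
|         return total;
|     }
|
|     Sum([1, 2, 3, 4]);  // no array allocation: the span points at
|                         // constant data embedded in the binary
|
|     unsafe
|     {
|         byte* p = (byte*)NativeMemory.Alloc(64);
|         var native = new Span<byte>(p, 64);  // span over unmanaged memory
|         native.Clear();
|         NativeMemory.Free(p);
|     }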
| MarkSweep wrote:
| Yeah, as written this is quite confusing and does not describe
| why a Span is useful. It seems to be a garbled quoting of the
| first sentence of the supplementary documentation about this API:
|
| https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...
|
| I think a better description of what a Span does is later in
| the article:
|
| > A Span<T> represents a contiguous region of arbitrary memory.
| A Span<T> instance is often used to hold the elements of an
| array or a portion of an array. Unlike an array, however, a
| Span<T> instance can point to managed memory, native memory, or
| memory managed on the stack.
|
| The fact that you can only put a Span<T> on the stack is a
| limitation worth knowing (and one enforced by the compiler). But
| it is not the most interesting thing about them.
| xnorswap wrote:
| Thank you, it was indeed a "garbled quoting" of that article.
| I am generally terrible at explaining things.
|
| Trying to improve my ability to explain things was part of my
| motivation for taking up blogging.
| npalli wrote:
| Any idea how it compares to std::span in C++? .NET is indeed
| quite fast nevertheless (at least since 6).
| loeg wrote:
| I think it is substantially equivalent to C++ std::span
| (essentially a typed pointer + length pair).
| Dwedit wrote:
| Span<T> is a "ref struct" type, and thus has a lot of
| restrictions on its use. For instance, you can't have one as
| a field or property in a class, so you can't assign it
| outside of the scope that it was declared in.
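|
| For example (a sketch of the compiler restriction):
|
|     class Cache
|     {
|         Span<byte> _buffer;  // compile error: a ref struct can only
|                              // be a field of another ref struct
|     }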
| neonsunset wrote:
| You can assign the span to an out-of-scope variable as long
| as it does not violate the scoping of the memory referenced
| by that span. The closest primitive to C#'s Span<T> is
| Rust's &mut [T], since both are subject to lifetime analysis
| (of course the one in C#[0] is quite rudimentary in
| comparison to Rust).
|
| [0]: https://em-tg.github.io/csborrow/
| jeffbee wrote:
| C++ std::span doesn't have comparison operators at all. If you
| were to write an equality operator you might use memcmp, in
| which case it will be exactly the same as memcmp, which LLVM
| will helpfully reinterpret as bcmp for best performance.
|
| See for example the difference between std::string::operator==
| and just calling memcmp yourself:
| https://godbolt.org/z/qn1crox8c
| f33d5173 wrote:
| bcmp() is identical to memcmp(3); use it instead.
| loeg wrote:
| This has gotta be some sort of modest overhead from calling into
| C memcmp that is avoided by using the native C# construct, right?
| There's no reason the two implementations shouldn't be doing
| essentially the same thing internally.
| lstodd wrote:
| might as well mean that msvcrt's memcmp is terrible
| xnorswap wrote:
| Outside of the 10-element case, I don't think it's an overhead
| issue; the overhead is surely minuscule compared to the 1GB of
| data in the final tests, which also show a large difference in
| performance.
|
| I suspect it's that the memcmp in the Visual C++
| redistributable isn't as optimised for modern processor
| instructions as the .NET runtime is.
|
| I'd be interested to see a comparison against a better, more
| optimised runtime library.
|
| Ultimately you're right that neither .NET nor C can magic out
| performance from a processor that isn't fundamentally there,
| but it's nice that doing the out-of-the-box approach performs
| well and doesn't require tricks.
| DeathArrow wrote:
| .NET got pretty fast.
| Dwedit wrote:
| The call to "memcmp" has overhead. It's an imported function
| which cannot be inlined, and the marshaller will automatically
| create pinned GC handles to the memory inside of the arrays as
| they are passed to the native code.
|
| I wonder how it would compare if you passed actual pointers to
| "memcmp" instead of marshalled arrays. You'd use "fixed (byte *p
| = bytes) {" on each array first so that the pinning happens
| outside of the function call.
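|
| i.e. something along these lines (a sketch; PinnedEquals is
| just a name for the example):
|
|     [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
|     static extern unsafe int memcmp(byte* b1, byte* b2, nuint count);
|
|     static unsafe bool PinnedEquals(byte[] a, byte[] b)
|     {
|         if (a.Length != b.Length) return false;
|         fixed (byte* pa = a, pb = b)   // pin here, outside the native call
|             return memcmp(pa, pb, (nuint)a.Length) == 0;
|     }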
| xnorswap wrote:
| I tried that but cut it from the code because it had the same
| performance.
| Rohansi wrote:
| Have you tried the newer [LibraryImport] attribute?
| xnorswap wrote:
| I haven't, I wasn't aware of that attribute. I would
| gratefully accept a PR with such a benchmark case.
|
| Edit: I've now tried it, and it reduced overhead by a small
| amount (e.g. an average of 7.5 ns vs 8 ns for the 10-byte
| array).
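|
| For reference, the source-generated version looks roughly like
| this (a sketch; "Native" is just an example name, and the
| containing class has to be partial):
|
|     using System.Runtime.CompilerServices;
|     using System.Runtime.InteropServices;
|
|     static unsafe partial class Native
|     {
|         [LibraryImport("msvcrt.dll", EntryPoint = "memcmp")]
|         [UnmanagedCallConv(CallConvs = new[] { typeof(CallConvCdecl) })]
|         public static partial int Memcmp(byte* b1, byte* b2, nuint count);
|     }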
| MarkSweep wrote:
| I'm pretty sure the marshaling code for the pinvoke is not
| creating GC handles. It is just using a pinned local, like a
| fixed statement in C# does. That is what LibraryImport does, at
| least, and I don't see why the built-in marshaller would be
| different. The author says in the peer comment that they
| confirmed the performance is the same.
|
| I think the blog post is quite good at showing that seemingly
| similar things can have different performance tradeoffs. A
| follow-up topic might be digging deeper into the why. For example,
| if you look at the disassembly of the p/invoke method, you can
| see the source of the overhead: setting up a p/invoke frame so
| the stack is walkable while in native code, doing a GC poll
| after returning from the native function, and removing the
| frame.
|
| https://gist.github.com/AustinWise/21d518fee314ad484eeec981a...
| neonsunset wrote:
| FWIW LINQ's SequenceEqual and many other CoreLib methods
| performing sequence comparison forward to the same underlying
| comparison routine used here whenever possible.
|
| All of this builds on top of very powerful portable SIMD
| primitives and platform intrinsics that ship with the standard
| library.
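|
| As a rough illustration of those primitives (a hand-rolled
| sketch, not the actual CoreLib code; assumes .NET 8 for the
| Vector128.Create(ReadOnlySpan<T>) overload):
|
|     using System.Runtime.Intrinsics;
|
|     static bool BytesEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
|     {
|         if (a.Length != b.Length) return false;
|         int i = 0;
|         // Compare 16 bytes per iteration with a portable 128-bit vector.
|         for (; i + Vector128<byte>.Count <= a.Length; i += Vector128<byte>.Count)
|             if (Vector128.Create(a.Slice(i, Vector128<byte>.Count)) !=
|                 Vector128.Create(b.Slice(i, Vector128<byte>.Count)))
|                 return false;
|         // Scalar tail for any remaining bytes.
|         for (; i < a.Length; i++)
|             if (a[i] != b[i]) return false;
|         return true;
|     }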
| bob1029 wrote:
| Span<T> is easily my favorite new abstraction. I've been using
| the hell out of it for building universal Turing machine
| interpreters. It's really great at passing arbitrary views of
| physical data around. I default to using it over arrays in most
| places now.
| OptionOfT wrote:
| I did some digging, and found that SequenceEquals is heavily
| optimized for when T = Byte:
| https://github.com/dotnet/runtime/blob/454673e1d6da406775064...
|
| Does memcmp do all of these things? Is msvcrt.dll checking at
| runtime which extensions the CPU supports?
|
| Because I don't think msvcrt.dll is recompiled per machine.
|
| I think a better test would be to create a DLL in C, expose a
| custom version of memcmp, and compile that with all the
| vectorization enabled.
| neonsunset wrote:
| It's not limited to bytes. It works with any bitwise-comparable
| primitive, i.e. int, long, char, etc.
|
| The logic which decides which path to use is here
| https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
| and here
| https://github.com/dotnet/runtime/blob/main/src/coreclr/tool...
| (this one is used by ILC for NativeAOT but the C++ impl. for
| the JIT is going to be similar)
|
| The [Intrinsic] annotation is present because such comparisons
| on strings/arrays/spans are specially recognized in the
| compiler to be unrolled and inlined whenever one of the
| arguments has constant length or is a constant string or a span
| which points to constant data.
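|
| For example (a sketch; IsPing is just an illustrative name),
| with a constant right-hand side the whole comparison can be
| unrolled and inlined at the call site:
|
|     static bool IsPing(ReadOnlySpan<byte> request) =>
|         request.SequenceEqual("PING\r\n"u8);  // the u8 literal is constant
|                                               // data baked into the binary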
| xnorswap wrote:
| The comparison isn't there to prove that .NET is always faster
| than C in all circumstances; it was to demonstrate that the advice to
| call out to C from .NET is outdated and now worse than the
| naive approach.
|
| Can C wizards write faster code? I'm sure they can, but I bet
| it takes longer than writing a.SequenceEquals(b) and moving on
| to the next feature, safe in the knowledge that the standard
| library is taking care of business.
|
| "Your standard library is more heavily optimised" isn't exactly
| a gotcha. Yes, the JIT nature of .NET means that it can
| leverage processor features at runtime, but that is a benefit
| to being compiled JIT.
| asveikau wrote:
| > Does memcmp do all of these things? Is msvcrt.dll checking at
| runtime which extensions the CPU supports
|
| It's possible for a C implementation to check the CPU at dynamic
| link time (when the DLL is loaded) and select which memcmp gets
| linked.
|
| The most heavily used libc string functions also have a
| tendency to use SIMD when the data sizes and offsets align, and
| fall back to the slow path for any odd/unaligned bytes.
|
| I don't know to what extent MSVCRT is using these techniques.
| Probably some.
|
| Also, it's common for a compiler to recognize references to
| common string functions and not even emit a call to a shared
| library, but provide an inline implementation.
| userbinator wrote:
| Interestingly, Intel made REP CMPS much faster in the latest
| CPUs:
|
| https://stackoverflow.com/questions/75309389/which-processor...
| CyanLite2 wrote:
| The article missed the biggest thing:
|
| SequenceEquals is SIMD accelerated. memcmp is not.
___________________________________________________________________
(page generated 2025-03-30 23:01 UTC)