[HN Gopher] Span<T>.SequenceEquals is faster than memcmp
       ___________________________________________________________________
        
       Span<T>.SequenceEquals is faster than memcmp
        
       Author : xnorswap
       Score  : 104 points
       Date   : 2025-03-30 14:53 UTC (8 hours ago)
        
 (HTM) web link (richardcocks.github.io)
 (TXT) w3m dump (richardcocks.github.io)
        
       | xnorswap wrote:
       | A more meaningful adventure into microbenchmarking than my last.
       | I look at why we no longer need to P/Invoke memcmp to efficiently
       | compare arrays in C# / .NET.
       | 
       | Old stackoverflow answers are a dangerous form of bit-rot. They
       | get picked up by well-meaning developers and LLMs alike and
       | recreated years after they are out of date.
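        | 
        | Roughly, the two approaches being compared (an illustrative
        | sketch, not the article's exact benchmark code; the memcmp
        | import is the shape the old answers usually suggest):
        | 
        |     using System;
        |     using System.Runtime.InteropServices;
        | 
        |     static class CompareDemo
        |     {
        |         // Old advice: P/Invoke into the C runtime's memcmp.
        |         [DllImport("msvcrt.dll",
        |             CallingConvention = CallingConvention.Cdecl)]
        |         static extern int memcmp(byte[] b1, byte[] b2, nuint n);
        | 
        |         static bool EqualViaMemcmp(byte[] a, byte[] b) =>
        |             a.Length == b.Length
        |             && memcmp(a, b, (nuint)a.Length) == 0;
        | 
        |         // Current advice: the vectorised comparison in the BCL.
        |         static bool EqualViaSpan(byte[] a, byte[] b) =>
        |             a.AsSpan().SequenceEqual(b);
        |     }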
        
         | neonsunset wrote:
          | Regarding the loop regression in .NET 9, please submit an
          | issue at dotnet/runtime. It's yet another loop tearing
          | miscompilation caused by suboptimal loop lowering changes,
          | if my guess is correct.
        
           | xnorswap wrote:
           | No problem, I've raised the issue as
           | https://github.com/dotnet/runtime/issues/114047 .
        
             | neonsunset wrote:
             | Thanks!
        
         | timewizard wrote:
         | The number of times I've caught developers wholesale copying
         | stack overflow posts, errors and all, is far too high.
        
           | guerrilla wrote:
           | Indeed, the problems of LLMs are not new. We just automated
           | what people who have no idea what they are doing were doing
           | anyway. We... optimized incompetence.
        
             | eddythompson80 wrote:
              | This is a bit elitist, isn't it? It highly depends on
              | the type of code being copied, and it's a huge part of
              | why software engineers are so bullish about LLMs
              | compared to most other professions.
             | 
              | Regardless of how competent a programmer you are, you
             | don't necessarily possess the knowledge/answer to "How to
             | find open ports on Linux" or "How to enumerate child pids
             | of a parent pid" or "what is most efficient way to compare
             | 2 byte arrays in {insert language}" etc. A search engine or
             | an LLM is a fine solution for those problems.
             | 
              | You know that the answer to that question is what you're
              | after. I'd generally argue that knowing the right
              | question to ask is all that matters; the answer itself
              | is not interesting. It's most likely deeply nested
              | knowledge about how the Linux networking stack works, or
              | how process management works on a particular OS. If that
              | were the central point of the software we're building
              | (like, for example, if we were a Linux networking stack
              | company), then by all means, it would be silly to find a
              | lead engineer in our company who is confused about how
              | open ports work in Linux.
        
               | timewizard wrote:
               | In this case it was code to generate an "oauth2
                | code_challenge" and then correctly URL-encode it.
                | Instead of using replaceAll, the example used replace,
                | so only the first occurrence in the string was getting
                | converted.
               | 
               | When pressed the developer said they thought their code
               | was "too fast for the oauth server" and that's why it
               | failed about 25% of the time.
               | 
               | The level of disappointment I had when I found the
               | problem was enough to be memorable, but to find the post
               | he flat out copied on stack overflow, along with a
                | comment below it highlighting the bug AND the fix,
                | nearly brought me to apoplexy.
        
               | eddythompson80 wrote:
               | To me ".replace()" vs ".replaceAll()" (in JS at least) is
               | a perfect example to evaluate a developer on. Any JS
               | developer would know that replace()'s main gotcha is that
               | it's not replaceAll(). I used C# professionally for years
               | before using JS. And ".Replace()" in C# works the same
               | way ".replaceAll()" does in JS. It was one of the first
               | things I learned about JS and how I needed to reevaluate
               | all my code in JS.
               | 
               | In interviews, I'd often ask the interviewee "what is
               | your background" and "do you know that in JS .replace()
               | is unlike .replace() in Java or .Replace() in .NET". That
               | statement should make perfect sense to any developer who
               | realizes the word "replace" is somewhat ambiguous. I
               | would always argue that the behavior of Java and .NET is
               | the right behavior, but it's an ambiguous word
               | nonetheless.
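                | 
                | A tiny C# illustration of that behaviour (made-up
                | string, just to show that Replace hits every
                | occurrence, which is what the JS code above needed
                | from replaceAll):
                | 
                |     string s = "a+b+c+";
                |     string r = s.Replace("+", "-");
                |     Console.WriteLine(r); // prints "a-b-c-"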
        
               | guerrilla wrote:
               | > This is a bit elitist isn't it.
               | 
               | Damn straight. Understand what you're doing or don't do
               | it. Software is bad enough as it is. There's absolutely
               | no room for the incompetent in this game. That science
               | experiment has been done to death and we're certain of
               | the results.
        
               | sroussey wrote:
               | Read the license. CC BY-SA.
               | 
               | Copying code and breaking the license is a liability many
                | companies don't want, and they therefore block SO in the
               | office.
               | 
                | I've seen upvoted answers floating around with stuff
               | that purposefully has a backdoor in it (one character
               | away from being a correct answer, so you are vulnerable
               | only if you actually copied and pasted).
               | 
                | I think S.O. is great, and LLMs too, but any "lead"
                | engineer would try to learn from and vet the content.
               | 
               | BTW: my favorite thing to do after an LLM gives a coding
               | answer: now fix the bug.
               | 
               | The answers are hilarious. Oh, I see the security
               | vulnerabilities. Or oh, this won't work in an
               | asynchronous environment. Etc, etc. Sometimes you have to
               | be specific with the type of bug you spot (looking at
               | you, sonnet 3.7). It's worth adding to your cursor rules
               | or similar.
        
               | eddythompson80 wrote:
                | All of my 24-year career has been at 4 "very large"
                | software companies and 1 startup. 3 out of the 4 had a
                | culture of
               | "// https://stackoverflow.com/xxxxx" type comments on top
               | of any piece of code that someone learned about from
               | stackoverflow. There was one where everyone made a big
               | fuss about such things in code reviews. They'll ask "we
               | don't have any functions in this project that use this
               | Linux syscall. How do you know this is what needs to be
               | called???" And you had 2 ways of answering. You could
               | link a kernel.org url saying "I looked through Linux
               | sources and learned that to do X you need to call Y api"
               | and everyone would reply "cool", "great find", etc. You
               | could also say "I searched for X and found this
                | stackoverflow response", to which everyone would reply
                | "stackoverflow is often wrong", "do we have the right
                | license to use that code", "don't use stackoverflow",
                | "please reconsider this code".
        
               | sroussey wrote:
               | Or just put the link in the code as the license requires.
               | 
               | Then... you could have a bot that watches for updates to
               | the post in case it was wrong and someone points it out.
        
               | mschuster91 wrote:
               | > There was one where everyone made a big fuss about such
               | things in code reviews.
               | 
               | There's always dumb morons... sigh.
               | 
                | Even if you _don't_ copy code from SO, it still makes
                | sense to link to it if there is a decent explanation of
                | whatever problem you were facing. When I write code and I
                | hit some issue - particularly if it's some sort of weird
                | ass edge case - I _always_ leave a link to SO, and if it's
                | something that's also known upstream but not fixed yet
                | (common if you use niche stuff), I'll also leave a TODO
                | comment linking to the upstream issue.
               | 
               | Code should not just be code, it should also be a
               | document of knowledge and learning to your next fellow
               | coder who touches the code.
               | 
               | (This also means: FFS do not just link stackoverflow in
               | the git commit history. _No one_ is looking there years
               | later)
        
               | knome wrote:
               | It's hardly unreasonable to expect your peers to at least
               | _try_ to understand what they are doing. Copypaste coding
               | is never conducive to a good codebase.
        
               | eddythompson80 wrote:
               | I do expect them to understand the code they are
               | copying/pasting. Though to an extent. I understand they
               | would test the code. They would try different inputs to
               | the code and its result. I'd also understand they would
               | test that code across all the different "Linux distros"
               | we use, for example. After all, that code basically calls
               | a Linux syscall, so I understand that's very stable.
               | 
               | Then I learn that this particular syscall depends on this
                | kernel build flag that Debian passes, but not Alpine.
                | You can get it in Alpine if you set that other flag.
                | What are you, a "caveman who doesn't know that
                | `pctxl: true` is the build flag to enable this
                | feature"?
        
             | SketchySeaBeast wrote:
             | The problem with the LLM equivalent is that you can't see
             | the timestamp of the knowledge it's drawing from. With
             | stack overflow I can see a post is from 2010 and look for
              | something more modern; that due diligence is no longer
             | available with an LLM, which has little reason to choose
             | the newest solution.
        
       | junto wrote:
       | It's astounding just how fast modern .NET has become. I'd be
       | curious as to how the .NET (Framework excluded) benchmarks run in
       | a Linux container.
        
         | api wrote:
         | I wonder how it compares to (1) Go, (2) the JVM, and (3) native
         | stuff like Rust and C++.
         | 
         | Obviously as with all such benchmarks the skill of the
         | programmer doing the implementing matters a lot. You can write
         | inefficient clunky code in any language.
        
           | kfuse wrote:
           | All modern popular languages are fast, except the most
           | popular one.
        
             | paulddraper wrote:
             | And Python+Ruby
        
             | api wrote:
             | JavaScript is hella fast for a dynamically typed language,
             | but that's because we've put insane amounts of effort into
             | making fast JITing VMs for it.
        
               | zamalek wrote:
               | Sure, but "for a dynamically typed language" still means
               | that it's slow amongst all languages.
        
           | jeffbee wrote:
           | Java and .NET (and JS or anything that runs under v8 or
           | HotSpot) usually compare favorably to others because they
           | come out of the box with PGO. The outcomes for peak-optimized
           | C++ are very good, but few organizations are capable of
           | actually getting from their C++ build what every .NET user
           | gets for free.
        
             | metaltyphoon wrote:
              | .NET goes as far as having D(ynamic)PGO, which is enabled by
             | default.
        
           | junto wrote:
           | https://medium.com/deno-the-complete-reference/net-vs-
           | java-v...
        
           | kristianp wrote:
            | I would say Go is not in the same category of speed as Rust
            | and C/C++. The level of optimisation done by them is next
           | level. Go also doesn't inline your assembly functions, has
           | less vectorisation in the standard libraries, and doesn't
           | allow you to easily add vectorisation with intrinsics.
        
         | Analemma_ wrote:
         | I agree, .NET Core has improved by gigantic leaps and bounds.
         | Which makes it all the more frustrating to me that .NET and
          | Java both had "lost decades" of little to no improvement:
          | Java's mostly on the language side, since 3rd-party JVMs
          | still saw decent changes, while .NET's was on both the
          | language and runtime side. I think this freeze made (and
          | continues to make) people think the ceiling of both
          | performance and developer ergonomics of these languages is
          | much lower than it actually is.
        
           | paavohtl wrote:
           | I certainly agree that Java / JVM had a lost decade (or even
           | more), but not really with C# / .NET. When do you consider
           | that lost decade to have been? C# has had a major release
           | with new language features every 1-3 years, consistently for
           | the past 20+ years.
        
       | merb wrote:
       | > That's not a super helpful description, but the summary is that
       | it's stack-allocated rather than heap allocated.
       | 
        | I'm pretty sure that this is not 100% correct, since one can also
        | use other allocation methods and use a span to represent the
        | memory. Only with stackalloc will the memory it points to be
        | stack allocated. What it basically means is that the span itself
        | is always stack allocated, but not necessarily the memory it
        | points to.
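        | 
        | A rough sketch of what I mean (the span value is a small
        | struct on the stack; the memory it views can be heap, stack
        | or native; assumes .NET 6+ for NativeMemory and an unsafe
        | context):
        | 
        |     using System;
        |     using System.Runtime.InteropServices;
        | 
        |     byte[] heapArray = new byte[16];
        |     Span<byte> overHeap = heapArray;            // heap memory
        | 
        |     Span<byte> overStack = stackalloc byte[16]; // stack memory
        | 
        |     unsafe
        |     {
        |         byte* p = (byte*)NativeMemory.Alloc(16);
        |         Span<byte> overNative = new(p, 16);     // native memory
        |         NativeMemory.Free(p);
        |     }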
        
         | john-h-k wrote:
         | Yes, this is correct. The span itself - the (ptr, len) pair -
         | is on stack (by default) but the data is almost always on the
         | heap, with stackalloc being the most notable exception
        
           | neonsunset wrote:
           | The design of spans does not make assumptions about this
           | however. `ref T` pointer inside the span can point to any
           | memory location.
           | 
           | It is not uncommon to wrap unmanaged memory in spans. Another
            | popular case, even if it's something most developers don't
            | realize, is readonly spans wrapping constant data embedded in
           | the application binary. For example, if you pass '[1, 2, 3,
           | 4]' to an argument accepting 'ReadOnlySpan<int>' - this will
           | just pass a reference to constant data. It also works for new
           | T[] { } as long as T is a primitive and the target of the
           | expression is a read-only span. It's quite prevalent nowadays
           | but the language tries to get out of your way when doing so.
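            | 
            | A small example of the constant-data case (assumes C# 12
            | collection expressions; the helper name is made up):
            | 
            |     static int Sum(ReadOnlySpan<int> values)
            |     {
            |         int total = 0;
            |         foreach (int v in values) total += v;
            |         return total;
            |     }
            | 
            |     // No array allocation here: the span refers to data
            |     // baked into the binary.
            |     int s = Sum([1, 2, 3, 4]);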
        
         | MarkSweep wrote:
         | Yeah, as written this is quite confusing and does not describe
         | why a Span is useful. It seems to be a garbled quoting of the
          | first sentence of the supplementary documentation about this API:
         | 
         | https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...
         | 
         | I think a better description of what a Span does is later in
         | the article:
         | 
         | > A Span<T> represents a contiguous region of arbitrary memory.
         | A Span<T> instance is often used to hold the elements of an
         | array or a portion of an array. Unlike an array, however, a
         | Span<T> instance can point to managed memory, native memory, or
         | memory managed on the stack.
         | 
          | The fact that a Span<T> itself can live only on the stack is
          | a limitation worth knowing (and enforced by the compiler),
          | but it is not the most interesting thing about them.
        
           | xnorswap wrote:
           | Thank you, it was indeed a "garbled quoting" of that article.
           | I am generally terrible at explaining things.
           | 
           | Trying to improve my ability to explain things was part of my
           | motivation for taking up blogging.
        
       | npalli wrote:
        | Any idea how it compares to std::span in C++? .NET is indeed
        | quite fast nevertheless (at least since 6).
        
         | loeg wrote:
         | I think it is substantially equivalent to C++ std::span
         | (essentially a typed pointer + length pair).
        
           | Dwedit wrote:
           | Span<T> is a "ref struct" type, and thus has a lot of
           | restrictions on its use. For instance, you can't have one as
            | a field or property in a class, so you can't assign it to
            | anything outside of the scope it was declared in.
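            | 
            | For example (the error code is from memory, but the
            | compiler does reject the field):
            | 
            |     class Holder
            |     {
            |         // Span<byte> Cached;  // CS8345: ref struct types
            |         //                     // can't be class fields
            | 
            |         byte[] Buffer = new byte[16]; // store the array...
            |         Span<byte> View() => Buffer;  // ...hand out spans
            |     }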
        
             | neonsunset wrote:
             | You can assign the span to an out of scope variable as long
             | as it does not violate the scoping of the memory referenced
              | by that span. The closest primitive to C#'s Span<T> is
              | Rust's &mut [T], since both are subject to lifetime analysis
             | (of course the one in C#[0] is quite rudimentary in
             | comparison to Rust).
             | 
             | [0]: https://em-tg.github.io/csborrow/
        
         | jeffbee wrote:
         | C++ std::span doesn't have comparison operators at all. If you
         | were to write an equality operator you might use memcmp, in
         | which case it will be exactly the same as memcmp, which LLVM
         | will helpfully reinterpret as bcmp for best performance.
         | 
         | See for example the difference between std::string::operator==
         | and just calling memcmp yourself:
         | https://godbolt.org/z/qn1crox8c
        
           | f33d5173 wrote:
           | bcmp() is identical to memcmp(3); use it instead.
        
       | loeg wrote:
       | This has gotta be some sort of modest overhead from calling into
       | C memcmp that is avoided by using the native C# construct, right?
       | There's no reason the two implementations shouldn't be doing
       | essentially the same thing internally.
        
         | lstodd wrote:
          | It might as well mean that msvcrt's memcmp is terrible.
        
         | xnorswap wrote:
          | Outside of the 10-element case, I don't think it's an overhead
          | issue; the overhead is surely minuscule compared to the 1GB of
          | data in the final tests, which also show a large difference in
          | performance.
         | 
         | I suspect it's that the memcmp in the Visual C++
         | redistributable isn't as optimised for modern processor
         | instructions as the .NET runtime is.
         | 
          | I'd be interested to see a comparison against a better, more
          | optimised runtime library.
         | 
         | Ultimately you're right that neither .NET nor C can magic out
         | performance from a processor that isn't fundamentally there,
          | but it's nice that the out-of-the-box approach performs
         | well and doesn't require tricks.
        
       | DeathArrow wrote:
       | .NET got pretty fast.
        
       | Dwedit wrote:
       | The call to "memcmp" has overhead. It's an imported function
       | which cannot be inlined, and the marshaller will automatically
       | create pinned GC handles to the memory inside of the arrays as
       | they are passed to the native code.
       | 
       | I wonder how it would compare if you passed actual pointers to
       | "memcmp" instead of marshalled arrays. You'd use "fixed (byte *p
       | = bytes) {" on each array first so that the pinning happens
       | outside of the function call.
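        | 
        | Something along these lines (a sketch with an unsafe pointer
        | overload of the import, not the article's benchmark code):
        | 
        |     [DllImport("msvcrt.dll",
        |         CallingConvention = CallingConvention.Cdecl)]
        |     static extern unsafe int memcmp(byte* b1, byte* b2, nuint n);
        | 
        |     static unsafe bool EqualViaPointers(byte[] a, byte[] b)
        |     {
        |         if (a.Length != b.Length) return false;
        |         fixed (byte* pa = a)   // pin before the native call
        |         fixed (byte* pb = b)
        |         {
        |             return memcmp(pa, pb, (nuint)a.Length) == 0;
        |         }
        |     }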
        
         | xnorswap wrote:
         | I tried that but cut it from the code because it had the same
         | performance.
        
           | Rohansi wrote:
           | Have you tried the newer [LibraryImport] attribute?
        
             | xnorswap wrote:
             | I haven't, I wasn't aware of that attribute. I would
             | gratefully accept a PR with such a benchmark case.
             | 
             | Edit: I've now tried it, and it reduced overhead a small
              | amount (e.g. average 7.5 ns vs 8 ns for the 10-byte
              | array).
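              | 
              | For anyone curious, a sketch of the [LibraryImport]
              | form (assumes .NET 7+, a partial containing class and
              | the generator's built-in span marshalling):
              | 
              |     [LibraryImport("msvcrt.dll")]
              |     [UnmanagedCallConv(
              |         CallConvs = new[] { typeof(CallConvCdecl) })]
              |     private static partial int memcmp(
              |         ReadOnlySpan<byte> b1,
              |         ReadOnlySpan<byte> b2,
              |         nuint count);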
        
         | MarkSweep wrote:
         | I'm pretty sure the marshaling code for the pinvoke is not
         | creating GC handles. It is just using a pinned local, like a
          | fixed statement in csharp does. This is what LibraryImport
          | does, at least, and I don't see why the built-in marshaller
          | would be different. The author says in the peer comment that they
         | confirmed the performance is the same.
         | 
         | I think the blog post is quite good at showing that seemingly
          | similar things can have different performance tradeoffs. A
          | follow-up topic might be digging deeper into the why. For example,
         | if you look at the disassembly of the p/invoke method, you can
         | see the source of the overhead: setting up a p/invoke frame so
         | the stack is walkable while in native code, doing a GC poll
         | after returning from the native function, and removing the
         | frame.
         | 
         | https://gist.github.com/AustinWise/21d518fee314ad484eeec981a...
        
       | neonsunset wrote:
       | FWIW LINQ's SequenceEqual and many other CoreLib methods
       | performing sequence comparison forward to the same underlying
       | comparison routine used here whenever possible.
       | 
       | All of this builds on top of very powerful portable SIMD
       | primitives and platform intrinsics that ship with the standard
       | library.
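        | 
        | i.e. for arrays of a bitwise-comparable element type, both of
        | these should end up in the same vectorised helper (a small
        | illustration, not benchmark code):
        | 
        |     using System;
        |     using System.Linq;
        | 
        |     byte[] a = new byte[1024];
        |     byte[] b = new byte[1024];
        | 
        |     bool viaLinq = a.SequenceEqual(b);          // Enumerable
        |     bool viaSpan = a.AsSpan().SequenceEqual(b); // MemoryExtensions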
        
       | bob1029 wrote:
       | Span<T> is easily my favorite new abstraction. I've been using
       | the hell out of it for building universal Turing machine
       | interpreters. It's really great at passing arbitrary views of
       | physical data around. I default to using it over arrays in most
       | places now.
        
       | OptionOfT wrote:
       | I did some digging, and found that SequenceEquals is heavily
       | optimized for when T = Byte:
       | https://github.com/dotnet/runtime/blob/454673e1d6da406775064...
       | 
       | Does memcmp do all of these things? Is msvcrt.dll checking at
        | runtime which extensions the CPU supports?
       | 
       | Because I don't think msvcrt.dll is recompiled per machine.
       | 
       | I think a better test would be to create a DLL in C, expose a
       | custom version of memcmp, and compile that with all the
       | vectorization enabled.
        
         | neonsunset wrote:
         | It's not limited to bytes. It works with any bitwise comparable
         | primitive i.e. int, long, char, etc.
         | 
         | The logic which decides which path to use is here
         | https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
         | and here
         | https://github.com/dotnet/runtime/blob/main/src/coreclr/tool...
         | (this one is used by ILC for NativeAOT but the C++ impl. for
         | the JIT is going to be similar)
         | 
         | The [Intrinsic] annotation is present because such comparisons
         | on strings/arrays/spans are specially recognized in the
         | compiler to be unrolled and inlined whenever one of the
         | arguments has constant length or is a constant string or a span
         | which points to constant data.
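          | 
          | For example, in something like the following the right-hand
          | side has a constant length, so the comparison can be
          | unrolled and inlined (assumes C# 11 u8 literals; the helper
          | name is made up):
          | 
          |     static bool IsRiffTag(ReadOnlySpan<byte> chunkId) =>
          |         chunkId.SequenceEqual("RIFF"u8);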
        
         | xnorswap wrote:
         | The comparison isn't to prove that .NET is always faster than C
          | in all circumstances; it was to demonstrate that the advice to
         | call out to C from .NET is outdated and now worse than the
         | naive approach.
         | 
         | Can C wizards write faster code? I'm sure they can, but I bet
         | it takes longer than writing a.SequenceEquals(b) and moving on
         | to the next feature, safe in the knowledge that the standard
         | library is taking care of business.
         | 
         | "Your standard library is more heavily optimised" isn't exactly
         | a gotcha. Yes, the JIT nature of .NET means that it can
          | leverage processor features at runtime, but that is a benefit
          | of JIT compilation.
        
         | asveikau wrote:
         | > Does memcmp do all of these things? Is msvcrt.dll checking at
         | runtime which extensions the CPU support
         | 
          | It's possible for a C implementation to check the CPU at dynamic
         | link time (when the DLL is loaded) and select which memcmp gets
         | linked.
         | 
         | The most heavily used libc string functions also have a
         | tendency to use SIMD when the data sizes and offsets align, and
         | fall back to the slow path for any odd/unaligned bytes.
         | 
         | I don't know to what extent MSVCRT is using these techniques.
         | Probably some.
         | 
         | Also, it's common for a compiler to recognize references to
         | common string functions and not even emit a call to a shared
         | library, but provide an inline implementation.
        
       | userbinator wrote:
       | Interestingly, Intel made REP CMPS much faster in the latest
       | CPUs:
       | 
       | https://stackoverflow.com/questions/75309389/which-processor...
        
       | CyanLite2 wrote:
       | The article missed the biggest thing:
       | 
       | SequenceEquals is SIMD accelerated. memcmp is not.
        
       ___________________________________________________________________
       (page generated 2025-03-30 23:01 UTC)