[HN Gopher] Less Slow C++
       ___________________________________________________________________
        
       Less Slow C++
        
        Earlier this year, I took a month to reexamine my coding habits and
        rethink some past design choices. I hope to rewrite and improve my
        FOSS libraries this year, and I needed answers to a few questions
        first. Perhaps some of these questions will resonate with others in
        the community, too.

        - Are coroutines viable for high-performance work?
        - Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
        - Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
        - How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
        - What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
        - How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
        - Which parts of the standard library hit performance hardest?
        - How do error-handling strategies compare overhead-wise?
        - What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
        - What practical, non-trivial use cases exist for meta-programming?
        - How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
        - How close are we to effectively using Networking TS or heterogeneous Executors in C++?
        - What are best practices for propagating stateful allocators in nested containers, and which libraries support them?

        These questions span from micro-kernel optimizations (nanoseconds)
        to distributed systems (micro/millisecond latencies). Rather than
        tackling them all in one post, I compiled my explorations into a
        repository--extending my previous Google Benchmark tutorial
        (<https://ashvardanian.com/posts/google-benchmark>)--to serve as a
        sandbox for performance experimentation.

        Some fun observations:
        - Compilers now auto-vectorize 3x3x3 and 4x4x4 single- and double-precision matrix multiplications well! The smaller kernel is ~60% slower despite needing ~70% fewer operations, and the compiler-generated code outperforms my vanilla SSE/AVX and comes within 10% of AVX-512.
        - Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
        - The AI wave is driving CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization, and SME is equally odd. Sierra Forest packs 288 cores/socket, and AVX10.2 drops the 256-bit-only mode in favor of mandatory 512-bit support... I wonder if discrete Intel GPUs are even needed, given CPU advances?
        - In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and had 1 & 3.5 ULP variants like Sleef. A minimal sketch of such an approximation follows this list.
        - Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs.
        - Kernel bypass (DPDK/SPDK) and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. While pre-5.5 io_uring can already boost UDP throughput by 4x on loopback IO, the newer zero-copy and concurrency optimizations remain challenging.
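
        For context, here is a minimal sketch of that kind of truncated-
        series sine (not the repo's exact code): the coefficients and the
        [-pi/2, pi/2] assumption are illustrative, and real implementations
        pair this with argument reduction.

            #include <cassert>
            #include <cmath>

            // Horner-form evaluation of sin(x) ~= x - x^3/3! + x^5/5!,
            // assuming the caller already reduced x to about [-pi/2, pi/2].
            inline double sin_approx(double x) noexcept {
                assert(std::fabs(x) <= 1.5707963267948966 && "reduce the argument first");
                double const x2 = x * x;
                return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0)));
            }
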
       The repository is loaded with links to favorite CppCon lectures,
       GitHub snippets, and tech blog posts. Recognizing that many high-
       level concepts are handled differently across languages, I've also
       started porting examples to Rust & Python in separate repos.
       Coroutines look bad everywhere :(  Overall, this research project
       was rewarding! Most questions found answers in code -- except
       pointer tagging and secure enclaves, which still elude me in public
       cloud. I'd love to hear from others, especially on comparing High-
       Level Synthesis for small matrix multiplications on FPGAs versus
       hand-written VHDL/Verilog for integral types. Let me know if you
       have ideas for other cool, obscure topics to cover!
        
       Author : ashvardanian
       Score  : 175 points
       Date   : 2025-04-18 13:09 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | kreetx wrote:
       | This is only from a cursory look, but it all looks really concise
       | and practical - thank you for putting it all together!
        
       | intelVISA wrote:
       | Awesome, thanks.
       | 
       | Does C++ have a good async ('coroutine') story for io_uring yet?
        
         | ashvardanian wrote:
         | Great question! This has been top of mind for me for the last
         | 2-3 years.
         | 
         | Short answer: sadly, no. I love the "usability" promise of
         | coroutines--and even have 2-3 FOSS projects that could be
         | rewritten entirely around C++ or Rust coroutines for better
         | debuggability and extensibility--but my experiments show that
         | the runtime cost of most coroutine-like abstractions is simply
         | too high. Frankly, I'm not even sure if a better design is
         | possible on modern hardware.
         | 
         | This leads me to conclude that, despite my passion for SIMD and
         | superscalar execution, the highest-impact new assembly
         | instructions that x86 and Arm could standardize would center on
         | async execution and lightweight context switching... yet I
         | haven't seen any movement in that direction.
         | 
         | [?]
         | 
         | I also wrote toy examples for various range/async/stream models
         | in C++, Rust, and Python, with measured latencies in inline
          | comments:
          | 
          |     - Python: <https://github.com/ashvardanian/less_slow.py/blob/2c80c9dda6ef9aec18264871cfb51a13ac34685d/less_slow.py#L711-L722>
          |     - C++: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc8976ad20c8cdf645caa5b39a1cf488b/less_slow.cpp#L3718-L3759>
          |     - Rust: <https://github.com/ashvardanian/less_slow.rs/blob/5add609533854bb6a5ddbf358cc15f06f68c82d2/less_slow.rs#L522-L533>
         | 
          | Aside from coroutines (toy hand-rolled implementations and
          | commonly used libraries), I've also played around with C++
          | executors, senders & receivers, but didn't have much success
          | with them either. Maybe it's a skill issue.
        
           | haberman wrote:
           | > my experiments show that the runtime cost of most
           | coroutine-like abstractions is simply too high
           | 
           | Which runtime cost do you mean?
           | 
           | The main one I am aware of is a heap allocation per
           | coroutine, though this can in some cases be elided if the
           | coroutine is being called from another coroutine.
           | 
            | The other cost I am aware of is the initialization of the
           | coroutine handle, but I think this is just a couple of
           | pointers.
           | 
           | In both cases I would expect these overheads to be relatively
           | modest compared to the cost of the I/O itself, though it's
           | definitely better to elide the heap allocation when possible.
           | 
           | I don't know much about coroutine libraries like unifex
           | (which I think your test is using), but a hand-coded
           | prototype I was playing with doesn't seem to add much
           | overhead: https://godbolt.org/z/8Kc1oKf15
           | 
           | If we can compile with -fno-exceptions, the code is even
           | tighter: https://godbolt.org/z/5Yo8Pqvd5
           | 
           | My exploration into coroutines and I/O is only in the
           | earliest stages, so I won't claim any of this to be
           | definitive. But I am very interested in this question of
           | whether the overhead is low enough to be a good match for
           | io_uring or not.
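            | 
            | For illustration, here is a bare-bones lazy task type -- a
            | sketch, not the prototype linked above or unifex -- that makes
            | those two costs concrete: the coroutine frame, which is
            | typically heap-allocated unless the compiler elides it (HALO),
            | and the handle, which is a single pointer.
            | 
            |     #include <coroutine>
            |     #include <cstdio>
            |     #include <exception>
            |     
            |     struct task {
            |         struct promise_type {
            |             task get_return_object() {
            |                 return task(std::coroutine_handle<promise_type>::from_promise(*this));
            |             }
            |             std::suspend_always initial_suspend() noexcept { return {}; } // lazy start
            |             std::suspend_always final_suspend() noexcept { return {}; }   // frame lives until destroy()
            |             void return_void() noexcept {}
            |             void unhandled_exception() { std::terminate(); }
            |         };
            |         explicit task(std::coroutine_handle<promise_type> h) : handle_(h) {}
            |         task(task &&other) noexcept : handle_(other.handle_) { other.handle_ = {}; }
            |         task(task const &) = delete;
            |         ~task() { if (handle_) handle_.destroy(); }
            |         void resume() { handle_.resume(); }
            |       private:
            |         std::coroutine_handle<promise_type> handle_;
            |     };
            |     
            |     task say_hello() {        // frame (promise + locals) allocated here, possibly on the heap
            |         std::puts("resumed");
            |         co_return;
            |     }
            |     
            |     int main() {
            |         task t = say_hello(); // nothing runs yet: initial_suspend() is suspend_always
            |         t.resume();           // one indirect call resumes the frame
            |     }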
        
           | PaulDavisThe1st wrote:
           | > lightweight context switching
           | 
            | the cost of a context switch consists of two parts, one of
            | which can be subdivided:
            | 
            |     1. register save/restore
            |        1a. user-space-only registers
            |        1b. full register save/restore, required in kernel space
            | 
            |     2. the cost of the TLB flush, which is in turn proportional
            |        to the working set size of the switched-to process (i.e.
            |        if you don't touch much memory after the context switch,
            |        the cost is lower than if you do)
           | 
           | I am not sure that any assembler instructions could address
           | either of these.
           | 
           | What do you have in mind?
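            | 
            | For what it's worth, here is a toy user-space switch with POSIX
            | ucontext, just to make 1a concrete: only user-visible registers
            | are saved and restored, and no TLB flush is involved because
            | both contexts share one address space (a sketch, not a
            | benchmark).
            | 
            |     #include <ucontext.h>
            |     #include <cstdio>
            |     
            |     static ucontext_t main_ctx, task_ctx;
            |     static char task_stack[64 * 1024];
            |     
            |     static void task_fn() {
            |         std::puts("in task");
            |         swapcontext(&task_ctx, &main_ctx);  // save task's registers, restore main's
            |     }
            |     
            |     int main() {
            |         getcontext(&task_ctx);
            |         task_ctx.uc_stack.ss_sp = task_stack;
            |         task_ctx.uc_stack.ss_size = sizeof task_stack;
            |         task_ctx.uc_link = &main_ctx;
            |         makecontext(&task_ctx, task_fn, 0);
            |         swapcontext(&main_ctx, &task_ctx);  // save main's registers, enter task_fn
            |         std::puts("back in main");
            |     }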
        
         | nickysielicki wrote:
         | https://github.com/NVIDIA/stdexec/blob/main/examples/io_urin...
        
       | spieswl wrote:
       | Hey! Your Google Benchmark post was one of my go-to resources
       | when I started picking that up a couple years ago. I love to see
       | the focus on performance benchmarking here, and the repository is
       | laid out well. Nice work!
        
       | jonstewart wrote:
       | This is great! There has been far too much sophistry in C++ where
       | the justification for some contortions are "performance", with no
       | empirical data.
       | 
       | I am surprised about CTRE giving good results--I will admit I
       | have thought of it more as a parlor trick than a viable engine. I
       | will need to dig into that more. I also want to dig into the
       | OpenMP & TBB threadpool benchmarks to see whether Boost::ASIO
       | threadpool can be added into it.
        
         | ashvardanian wrote:
         | I was also surprised about CTRE and I can imagine it being a
         | viable tool for implementing parsers for 99% of use cases,
         | where you may not need all the SIMD bells and whistles.
         | 
         | A word of caution, though: I remember the throughput differing
         | vastly between GCC and MSVC builds. The latter struggles with
         | heavy meta-programming and expression templates. I don't know
         | why.
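          | 
          | For anyone curious what that looks like, a hedged sketch (not
          | from the repo) of the kind of parsing CTRE is handy for: the
          | pattern is checked and compiled at compile time, and captures
          | come back as zero-copy string_views.
          | 
          |     #include <ctre.hpp>  // https://github.com/hanickadot/compile-time-regular-expressions
          |     #include <optional>
          |     #include <string_view>
          |     
          |     // Extract the host from a "http://host:port/" style string.
          |     std::optional<std::string_view> parse_host(std::string_view url) {
          |         if (auto m = ctre::match<"http://([a-z0-9.-]+):([0-9]{1,5})/?">(url))
          |             return m.get<1>().to_view();  // view into `url`, no allocation
          |         return std::nullopt;
          |     }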
        
           | nickysielicki wrote:
           | std::regex is such a nightmare, I didn't take the time to run
           | the code myself but I'd be curious if you would see the same
           | delta if you swapped it for boost::regex or re2.
           | 
           | I think there's a case to be made for libaries like
           | https://github.com/foonathan/lexy and/or
           | https://github.com/boostorg/parser instead of reaching for a
           | regex in the first place.
        
             | ashvardanian wrote:
             | You can beat CTRE throughput with a non-compile-time
             | solution. My first recommendation will be to look at
             | HyperScan. It has remarkable throughput:
             | https://github.com/intel/hyperscan
             | 
             | I've only avoided it in the tutorial, as I want to keep the
             | build system lean. I wouldn't be surprised if it's 10x
             | faster than Boost in the average case.
        
               | jonstewart wrote:
                | To echo Ash, HyperScan is the performance king. Use it
                | when shipping on x64, when you have control over the
                | pattern sets, and when you don't mind building it and
                | working through its idiosyncrasies. It has some nonstandard
                | matching
               | semantics, reflecting its focus on network intrusion
               | detection systems (NIDS), and does not have Unicode
               | support.
               | 
               | For general purpose usage, Google's RE2 and PCRE2 in JIT
               | mode will offer pretty good performance. Zoltan Herczeg's
               | work on the PCRE2's JIT is underappreciated. Both these
               | options are widely available and portable.
        
           | jonstewart wrote:
           | Oh MSVC, bless. Mingw setup might be a pain, but the
           | dividends accrue over time.
           | 
           | --
           | 
           | This was a good reminder that I need to pay more attention to
           | Unum's projects. I noticed this older blog article,
           | https://www.unum.cloud/blog/2021-01-03-texts, and that brings
           | up some questions. First, in 2025, is UStore a wholesale
           | replacement for UDisk or are the two complementary? Second,
           | what is the current Unum approach for replacing full-text
           | search engines (e.g., ElasticSearch)?
        
             | ashvardanian wrote:
              | I wish I had a short answer :)
             | 
             | For years, I've had a hope to build it in the form of an
             | open-core project: open-source SotA solutions for Storage,
             | Compute, and AI Modeling built bottom up. You can imagine
             | the financial & time burden of building something like that
             | with all the weird optimizations and coding practices
             | listed above.
             | 
             | A few years in, with millions spent out of my pocket,
             | without any venture support or revenue, I've decided to
             | change gears and focus on a few niche workloads until some
             | of the Unum tools become industry standards for something.
             | USearch was precisely that, a toy Vector Search engine that
             | would still, hopefully, be 10x better than alternatives, in
             | one way or another:
             | <https://www.unum.cloud/blog/2023-11-07-scaling-vector-
             | search...>.
             | 
              | Now, ScyllaDB (through its Rust SDK) and YugaByte (through
              | its C++ SDK) are the most recent DBMSs to announce features
              | built on USearch, joining the ranks of many other tech
              | products leveraging some of those optimizations. Last year I
              | was playing around with different open-source growth &
              | governance ideas, looking for a way to organize a more
              | collaborative, rather than competitive, environment among
              | our upstream users -- no major releases, just occasional
              | patches here and there.
             | 
             | It was an interesting period, but now I'm again deep in the
             | "CUDA-GDB" land, and the next major release to come is
             | precisely around Full-Text Search in StringZilla
             | <https://github.com/ashvardanian/stringzilla>, and will be
             | integrated into both USearch <https://github.com/unum-
             | cloud/usearch> and somewhere else ;)
        
               | jonstewart wrote:
               | It's kickass that USearch is in DuckDB. That's something
               | I will play with, for sure.
               | 
               | ElasticSearch has always seemed geared too much towards
               | concurrent querying with mixed workloads, and then it
               | gets applied to logs... and, well, with logs you care
               | about detection of known query sets at ingest, indexing
               | speed, compression, and ability to answer queries over
               | large cold indices in cheap object storage. And often
               | when searching logs, you want exact matching, preferably
               | with regex. Part of me wants to play with rolling my own
               | crazy FM-index library, part of me thinks it might be
               | easier to play games with Parquet dictionary tables (get
               | factors out of regexps, check dictionary tables for the
               | factors for great win), and part of me thinks I will be
               | better off waiting to see what comes of the Rust-based ES
               | challengers.
               | 
               | Will definitely follow announcements to come with
               | StringZilla.
        
               | ashvardanian wrote:
               | > Part of me wants to play with rolling my own crazy FM-
               | index library
               | 
               | Oh, absolutely, go for it! And check out what Pesho
               | <https://github.com/pesho-ivanov> & Ragnar are doing
               | <https://github.com/RagnarGrootKoerkamp> ;)
        
       | boricj wrote:
       | The following comes from the complete opposite side of computing,
       | microcontrollers. I've been working on an embedded system where
       | the heap is about 256 KiB and the biggest stack has 4 KiB. I do
       | write idiomatic modern C++ for the most part (even if I hate C++
       | metaprogramming with a passion), but not all tricks are suitable
       | in all situations:
       | 
       | - CTRE is fine as long as you don't overflow the stack. I tried
       | once to validate a string for a HTTP proxy configuration with an
       | exhaustive regex, CTRE tried to allocate 5 KiB of stack 40 call
       | frames in and therefore crashed the embedded system with a stack
       | overflow. I've had to remove port validation from the regex
       | (matching a number between 1 and 65535 was a bridge too far) and
        | check that part by hand instead. I've had to dumb down other
        | CTRE regexes in my code for similar reasons, too.
       | 
       | - Several constraints and design decisions led me to mostly ditch
       | JSON internally and write my own BSON library. Instead of the
       | traditional dynamically-allocated tree of nodes approach it works
       | directly in-place, so I can give it a std::vector with a chunk of
       | reserved memory upfront and not worry about memory allocation or
       | fragmentation later on. One major benefit is that since there are
       | no string escape sequences, I can return directly a
       | std::string_view for string values. There are downsides to this
       | approach, mostly revolving around modifications: one needs to be
       | very careful not to invalidate iterators (which are raw pointers
       | to the underlying buffer) while doing so and adding/removing
       | entries towards the beginning of a large document is expensive
       | due to the memmove().
       | 
       | - I ditched newlib for picolibc and exterminated anything that
       | pulled in the C/C++ standard library locale code (that was over
       | 130 kilobytes of Flash altogether IIRC), which includes among
       | other things C++ streams (they are bad for plenty of other
       | reasons too, but mine was program size).
       | 
       | You seem to have mostly aimed for throughput and raw performance
       | in your benchmarks, which is fine for a generic desktop or
       | server-class system with a MMU and plenty of resources. Just
       | wanna point out that other environments will have different
       | constraints that will mandate different kinds of optimizations,
       | like memory usage (heap/stack/program size), dynamic memory
       | fragmentation, real-time/jitter...
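        | 
        | (For readers unfamiliar with the in-place approach, a toy
        | illustration -- not boricj's library -- of why string values can
        | come back as zero-copy string_views: the bytes are read straight
        | out of the backing buffer, valid only while that buffer is not
        | reallocated or mutated.)
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        |     #include <cstring>
        |     #include <string_view>
        |     #include <vector>
        |     
        |     // BSON-style string field: int32 length prefix (including the
        |     // trailing '\0'), followed by the bytes themselves.
        |     std::string_view read_string_field(std::vector<std::byte> const &buffer, std::size_t offset) {
        |         std::int32_t length = 0;
        |         std::memcpy(&length, buffer.data() + offset, sizeof length);
        |         auto const *bytes = reinterpret_cast<char const *>(buffer.data() + offset + sizeof length);
        |         return std::string_view(bytes, static_cast<std::size_t>(length) - 1); // drop the '\0'
        |     }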
        
         | ashvardanian wrote:
         | Very true! I'd also go for similar optimizations when
         | processing texts or sparse linear algebra on Nvidia and AMD
         | GPUs. You only have ~50 KB of constant memory, ~50 MB of shared
         | memory, and ~50 GB of global memory. It is BIG compared to
         | microcontrollers but very little compared to the scope of
         | problems often solved on GPUs. So many optimizations revolve
         | around compressed representations and coalesced memory
         | accesses.
         | 
         | I am still looking for a short example of such CUDA kernels,
         | and I would love to see more embedded examples if you have
         | thoughts ;)
        
           | boricj wrote:
           | I haven't had to reach for them so far either professionally
           | or personally, but custom memory allocators (slab allocation,
           | bump allocator...) and allocation strategies is something
           | I've been meaning to look into. Too bad that the one game
           | I've done reverse-engineering on used dynamic memory
            | allocation for just about _everything_, with an allocator
           | that uses a singly-linked list of used/free chunks that
           | wouldn't look out of place in the 1980s.
           | 
           | I'm aware that the C++ standard library has polymorphic
           | allocators alongside a couple of memory resource
            | implementations. I've also heard that the dynamic dispatch
            | of polymorphic allocators can inhibit optimizations and carry
            | a speed penalty compared to a statically dispatched
            | allocator or the standard std::allocator that uses operator
            | new(), but I have no concrete data to judge either way.
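            | 
            | (For concreteness, a minimal std::pmr sketch of the idea: a
            | fixed buffer backs a monotonic resource, so the containers
            | below allocate without touching the global heap until the
            | buffer runs out.)
            | 
            |     #include <array>
            |     #include <cstddef>
            |     #include <memory_resource>
            |     #include <string>
            |     #include <vector>
            |     
            |     int main() {
            |         std::array<std::byte, 4096> buffer;
            |         std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());
            |     
            |         std::pmr::vector<std::pmr::string> lines(&pool); // allocator propagates to the strings
            |         lines.emplace_back("allocated from the stack buffer");
            |         lines.emplace_back("so is this one");
            |     }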
        
         | msarnoff wrote:
         | I've done C++ on a Cortex-M0+ with 8KB of flash. Code size is a
         | big issue. You have to disable a bunch of stuff (no exceptions,
         | nothing that does dynamic allocation) but you can still use
         | classes, virtual methods, templates, constexpr, etc. These are
         | all things that are a pain to do in C and usually require a
         | bunch of gross macros.
        
           | chrisrodrigue wrote:
           | Yeah, embedded C++ is a wildly different experience from
           | vanilla. I've worked in large embedded C++ codebases where we
           | couldn't use the STL and had to use homegrown containers for
           | everything.
           | 
           | I wonder how Rust is stacking up (no pun intended) in the
           | embedded game these days.
        
         | loeg wrote:
         | Not sure I'd reach for C++ or regexes in such a constrained
         | micro environment. Anything where you don't directly understand
         | the precise memory use is probably out.
        
           | boricj wrote:
           | The NumWorks N0100 graphical calculator had 1 MiB of Flash
           | and 256 KiB of RAM. It packed seven mathematical apps
           | (calculation, grapher, equations, statistics, regression,
           | sequences, distributions) with a decently powerful maths
           | engine/equation typesetter written in C++ _and_ a MicroPython
            | shell. They've paid a fair amount of attention to details in
           | order to fit all of that in (least of all no STL), but C++
           | wielded correctly for embedded is no more of a memory hog
           | than C.
           | 
           | Our target has ~1.5 MiB of Flash for program code and 512 KiB
           | of RAM. We're using half of the former and maybe a third of
           | the latter, the team barely paid any attention to program
            | size or memory consumption. One day the project lead became
            | slightly concerned about that, and by the end of the day I had
            | shaved off 20% of Flash and RAM usage by going for the
            | lowest-hanging fruit.
           | 
           | I find it a bit amusing to call a 250 MHz STM32H5 MCU a
           | constrained micro environment, if anything it's a bit
           | overkill for what we need.
        
         | leni536 wrote:
         | > CTRE is fine as long as you don't overflow the stack
         | 
         | Which is to say CTRE is mostly not fine, if you use it on user-
         | provided strings, regardless of target environment. It's
          | heavily recursion-based, never spills to the heap, and
          | otherwise has no safeguards for memory use or recursion depth.
        
           | boricj wrote:
            | The regex which was overflowing the stack was something like
            | this (simplified and from memory):
            | 
            |     ^http:\/\/([a-z0-9.-]+)\/?:([1-9]|[0-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9]|[1-5][0-9][0-9][0-9][0-9]|6[0-4][0-9][0-9][0-9]|65[0-4][0-9][0-9]|655[0-2][0-9]|6553[0-5])$
            | 
            | Once I gave up validating the port number with the regex, it
            | no longer blew up the stack:
            | 
            |     ^http:\/\/([a-z0-9.-]+)\/?:([1-9][0-9]{0,4})$
           | 
           | I'll admit I haven't done a thorough job of auditing the
           | stack usage afterwards, but not all regexes look like Perl
           | codegolf. For simple, straightforward patterns I don't see
           | any problems using CTRE, but I'd be interested to see some
           | proof to the contrary if you have some.
        
       | hwpythonner wrote:
       | This is impressive work--feels like it should be packaged as a
       | handbook for high-performance computing or even a practical guide
       | for HFT developers. Great resource!
        
       | nickysielicki wrote:
       | > Let me know if you have ideas for other cool, obscure topics to
       | cover!
       | 
       | Everything that Kris Jusiak has under https://github.com/qlibs/
       | is worth a look.
        
       | EliRivers wrote:
       | If you're taking priority requests:
       | 
       | "How coding FPGA differs from GPU and what is High-Level
       | Synthesis, Verilog, and VHDL? #36" Yes please!
        
         | ashvardanian wrote:
         | I've started implementing this, but organizing a good testing
         | environment was time-consuming. If someone here has a sandbox
         | and the right skill set, 3x3x3 or 4x4x4 matrix multiplications
         | can be a perfect candidate for FPGA ports and case studies!
        
           | synthos wrote:
           | What data type/precision? How much parallelism versus
           | iterative? Is this fully in fabric or part of a SoC? Easy to
           | get mired in many details.
           | 
           | HW design has so many degrees of freedom which makes it fun
           | but challenging.
           | 
           | What specific topics are you trying to address with this?
        
       | Icathian wrote:
       | I would pay some real money for the rust equivalent of this kind
       | of material. Anyone got a book or repo they like and would
       | recommend?
        
         | ashvardanian wrote:
         | Most hardware-level observations, like the latency of various
         | memory accesses or numeric operations, would be the same for
         | the Rust code. As for higher-level abstractions, I've already
         | started porting them to Rust
         | <https://github.com/ashvardanian/less_slow.rs>.
         | 
         | Next, it would be exciting to implement a concurrent job-
         | stealing graph algorithm in both languages to get a feel for
         | their ergonomics in non-trivial problems. I can imagine it
         | looks very different in Rust and C++, but before I get there,
         | I'm looking for best practices for implementing nested
         | associative containers with shared stateful allocators in Rust.
         | 
         | In C++, I've implemented them like this: <https://github.com/as
         | hvardanian/less_slow.cpp/blob/8f32d65cc...>, even though I
         | haven't seen many people doing that in public codebases. Any
         | good examples for Rust?
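          | 
          | (To make "nested containers with shared stateful allocators"
          | concrete for C++ readers, a hedged sketch with std::pmr, not the
          | linked repo code: the outer map, the inner maps, and the strings
          | all draw from one resource via uses-allocator construction.)
          | 
          |     #include <map>
          |     #include <memory_resource>
          |     #include <string>
          |     
          |     using inner_map = std::pmr::map<std::pmr::string, int>;
          |     using outer_map = std::pmr::map<std::pmr::string, inner_map>;
          |     
          |     int main() {
          |         std::pmr::unsynchronized_pool_resource pool;
          |         outer_map settings(&pool);
          |         // The inner map and both strings end up using `pool` too.
          |         settings[std::pmr::string("network", &pool)][std::pmr::string("port", &pool)] = 8080;
          |     }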
        
           | anp wrote:
           | I'm definitely interested in seeing this kind of content in
           | Rust, have you looked at Rayon's implementation for work
           | stealing yet? Can result in some very nice high-level code.
        
       | standy13 wrote:
       | I did not know about unnormalized floats. I sometimes wonder
       | about it staring at my GPU when it multiplies matrices.
        
         | ashvardanian wrote:
          | I made a bunch of bad measurements until someone reminded
          | me to:
          | 
          |     #if defined(__AVX512F__) || defined(__AVX2__)
          |     void configure_x86_denormals(void) {
          |         _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // Flush results to zero
          |         _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // Treat denormal inputs as zero
          |     }
          |     #endif
          | 
          | It had a 50x performance impact on Intel. As benchmarked on
          | `c7i.metal-48xl` instances:
          | 
          |     - `f64` throughput grew from 0.2 to 8.2 TFLOPS.
          |     - `f32` throughput grew from 0.6 to 15.1 TFLOPS.
          | 
          | Here is that section in the repo with more notes on AVX-512,
          | AMX, and other instructions:
          | <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>.
        
           | cwzwarich wrote:
           | Intel has always had terrible subnormal performance. It's not
           | that difficult to implement in HW, and even if you still want
           | to optimize for the normalized case, we're talking about a 1
           | cycle penalty, not an order(s)-of-magnitude penalty.
        
       | johnisgood wrote:
       | > Less Slow C, C++, and Assembly Code
       | 
       | I expected C code, too, not only .cpp and .S, however.
       | 
        | The less_slow.cpp uses a lot of C++-isms.
       | 
       | This may require fixing, or removal of "C" from the list.
        
       | kllrnohj wrote:
       | > 40x faster trigonometry: Speed-up standard library functions
       | like std::sin in just 3 lines of code.
       | 
        | Huh, ok, let's see how...
        | 
        |     *  By limiting the expansion to a few terms, we can
        |     *  approximate `sin(x)` as:
        |     *
        |     *      sin(x) ≈ x - (x^3)/3! + (x^5)/5!
        |     *
        |     *  This reduces the computational cost but comes with
        |     *  reduced accuracy.
       | 
       | I see. "reduced accuracy" is an understatement. It's just
       | _horrifically wrong_ for inputs outside the range of [-2, 2]
       | 
       | https://www.wolframalpha.com/input?i=plot+sin+x%2C+x+-+%28x%...
       | 
       | It cannot handle a single interval of a sin wave, much less the
       | repeating nature? What an absolutely useless "optimization"
        
         | Analemma_ wrote:
         | This is kind of a dumb objection. If your sine function has
         | good accuracy in [-pi/2, pi/2], you can compute all other
         | values by shifting the argument and/or multiplying the result
         | by -1.
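          | 
          | (A hedged sketch of that reduction, not from the repo: fold any
          | x into [-pi/2, pi/2] plus a sign flip, then call a narrow-range
          | approximation. A real libm uses a much more careful reduction
          | for huge |x|.)
          | 
          |     #include <cmath>
          |     
          |     double sin_reduced(double x, double (*sin_core)(double)) {
          |         double const pi = 3.14159265358979323846;
          |         double const k = std::nearbyint(x / pi); // nearest multiple of pi
          |         double const r = x - k * pi;             // now r is in [-pi/2, pi/2]
          |         double const s = sin_core(r);
          |         return (static_cast<long long>(k) % 2 != 0) ? -s : s; // sin(r + k*pi) = (-1)^k * sin(r)
          |     }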
        
           | SkiFire13 wrote:
           | But then you have to include this in the benchmark and it
           | will no longer be 40x faster.
        
             | dzaima wrote:
             | There are a bunch of real situations where you can assume
             | the input will be in a small range. And while reducing from
             | [-pi;pi] or [-2*pi;2*pi] or whatever is gonna slow it down
             | somewhat, I'm pretty sure it wouldn't be too significant,
             | compared to the FP arith. (and branching on inputs outside
             | of even that expanded target expected range is a fine
             | strategy realistically)
        
               | kllrnohj wrote:
               | > branching on inputs outside of the target expected
               | range is a fine strategy realistically
               | 
               | branches at this scale are actually significant, and so
               | will drastically impact being able to achieve 40x faster
               | as claimed
        
               | dzaima wrote:
               | That's only if they're unpredictable; sure, perhaps on
               | some workload it'll be unpredictable whether the input to
                | sin/cos is greater than 2*pi, but I'm pretty sure on most
               | it'll be nearly-always a "no". Perhaps not an
               | optimization to take in general, but if you've got a
               | workload where you're fine with 0.5% error, you can also
               | spend a couple seconds thinking about what range of
               | inputs to handle in the fast path. (hence "target
               | expected range" - unexpected inputs getting unexpected
               | branches isn't gonna slow down things if you've
               | calibrated your expectations correctly; edited my comment
               | slightly to make it clearer that that's about being out
               | of the expanded range, not just [-pi/2,pi/2])
        
               | kllrnohj wrote:
               | Assuming an even distribution over a single iteration of
               | sin, that is [0,pi], this will have a ~30% misprediction
               | rate. That's not rare.
        
               | dzaima wrote:
               | I'm of course not suggesting branching in cases where you
               | expect a 30% misprediction rate. You'd do branchless
               | reduction from [-2*pi;2*pi] or whatever you expect to be
               | frequent, and branch on inputs with magnitude greater
               | than 2*pi if you want to be extra sure you don't get
               | wrong results if usage changes.
               | 
               | Again, we're in a situation where we know we can tolerate
               | a 0.5% error, we can spare a bit of time to think about
               | what range needs to be handled fast or supported at all.
        
               | kllrnohj wrote:
               | Those reductions need to be part of the function being
               | benchmarked, though. Assuming a range limitation of
               | [-pi,pi] even would be reasonable, there's certainly
               | cases where you don't need multiple revolutions around a
               | circle. But this can't even do that, so it's simply not a
               | substitute for sin, and claiming 40x faster is a sham
        
               | dzaima wrote:
               | Right; the range reduction from [-pi;pi] would be like 5
               | instrs ("x -= copysign(pi/2 & (abs(x)>pi/2), x)" or so),
               | ~2 cycles throughput-wise or so, I think; that's slightly
               | more significant than I was imagining, hmm.
               | 
               | It's indeed not a substitute for sin in general, but it
               | could be in some use-cases, and for those it could really
               | be 40x faster (say, cases where you're already externally
               | doing range reduction because it's necessary for some
               | other reason (in general you don't want your angles
               | infinitely accumulating scale)).
        
           | dzaima wrote:
           | At pi/2 that approximation gets you 1.0045, i.e. half a
           | percent off, so it's not particularly good at that. (still
           | could be sufficient for some uses though; but not the best
           | achievable even with that performance)
        
           | TimorousBestie wrote:
           | Good argument reduction routines are not exactly easy for a
           | novice to write, so I think this is a valid objection.
        
           | kubav027 wrote:
            | At least do not name the function "sin". A former game dev
            | works at my company and he uses similar tricks all the
            | time. It makes the code so hard to read, and unless you are
            | computing "sin" a lot, the speedup is not measurable.
        
         | ashvardanian wrote:
         | You can always get more accuracy by expanding those 3 lines to
         | handle more of the Taylor components... but it's important to
         | remember that this is still educational material.
         | 
         | You can find more complete examples in my SimSIMD
         | (https://github.com/ashvardanian/SimSIMD), but they also often
         | assume that at a certain part of a kernel, a floating point
         | number is guaranteed to be in a certain range. This can greatly
         | simplify the implementation for kernels like Atan2. For
         | general-purpose inputs, go to SLEEF (https://sleef.org). Just
         | remember that every large, complicated optimization starts with
         | a small example.
        
           | kllrnohj wrote:
           | > but it's important to remember that this is still
           | educational material.
           | 
           | Then it should be educating on the applicability and
           | limitations of things like this instead of just saying
           | "reduced accuracy" and hoping the reader notices the massive
           | issues? Kinda like the ffast-math section does.
        
           | pclmulqdq wrote:
           | No. Do not use Taylor series approximations in your real
           | code. They are slow and inaccurate. You can do much, much
           | better with some basic numerical analysis. Chebyshev and
           | Remez approximations will give you more bang for your buck
           | every time.
        
           | jcranmer wrote:
           | Educational material that misinforms its readers isn't
           | educational, and it's insanely counterproductive.
           | 
           | People have already ragged on you for doing Taylor
           | approximation, and I'm not the best expert on the numerical
           | analysis of implementing transcendental functions, so I won't
           | pursue that further. But there's still several other
           | unaddressed errors in your trigonometric code:
           | 
           | * If your function is going to omit range reduction, _say so_
           | upfront. Saying  "use me to get a 40x speedup because I omit
           | part of the specification" is misleading to users, especially
           | because you should assume that most users are not
           | knowledgeable about floating-point and thus they aren't even
           | going to understand they're missing something without you
           | explicitly telling them!
           | 
            | * You're doing polynomial evaluation via `x * a + (x * x) * b
            | + (x * x * x) * c`, which is not the common way of doing so,
            | and also, it's a slow way of doing so. If you're trying to be
            | educational, do it via the Horner form `((c * x + b) * x + a)
            | * x`--that's how it's done, that's how it _should_ be done.
           | 
           | * Also, doing `x / 6.0` is a disaster for performance,
           | because fdiv is one of the slowest operations you can do. Why
           | not do `x * (1.0 / 6.0)` instead?
           | 
           | * Doing really, really dumb floating-point code and then
           | relying on -ffast-math to make the compiler unpick your
           | dumbness is... a bad way of doing stuff. Especially since
           | you're recommending people go for it for the easy speedup and
           | saying absolutely nothing about where it can go
           | catastrophically wrong. As Simon Byrne said, "Friends don't
           | let friends use -ffast-math" (and the title of my explainer
           | on floating-point will invariably be "Floating Point, or How
           | I Learned to Start Worrying and Hate -ffast-math").
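            | 
            | (For reference, the two evaluation orders being contrasted,
            | for p(x) = a*x + b*x^2 + c*x^3 -- a sketch, not code from the
            | repo.)
            | 
            |     // Naive: every term recomputes powers of x.
            |     double poly_naive(double x, double a, double b, double c) {
            |         return x * a + (x * x) * b + (x * x * x) * c;
            |     }
            |     
            |     // Horner form: one multiply-add per coefficient, maps onto FMAs.
            |     double poly_horner(double x, double a, double b, double c) {
            |         return ((c * x + b) * x + a) * x;
            |     }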
        
             | Certhas wrote:
             | I don't know who the target audience is supposed to be, but
             | who would be the type of person who tries to implement
             | performance critical numerical codes but doesn't know the
             | implications of Taylor expanding the sine function?
        
               | Const-me wrote:
               | > tries to implement performance critical numerical codes
               | but doesn't know the implications of Taylor expanding the
               | sine function?
               | 
               | That would be me, I'm afraid. I know little about Taylor
               | series, but I'm pretty sure it's less than ideal for the
               | use case.
               | 
               | Here's a better way to implement faster trigonometry
               | functions in C++ https://github.com/Const-
               | me/AvxMath/blob/master/AvxMath/AvxM... That particular
               | source file implements that for 4-wide FP64 AVX vectors.
        
               | ack_complete wrote:
               | There are lots of cases where you can get away with
               | moderate accuracy. Rotating a lot of batched sprites
               | would be one of them; could easily get away with a
               | mediocre Taylor series approximation, even though it's
               | leaving free accuracy on the table compared to minimax.
               | 
               | But not having _range reduction_ is a bigger problem, I
               | can't see many uses for a sin() approximation that's only
                | good for half a wave. And as others have said, if you need
               | range reduction for the approximation to work in its
               | intended use case, that needs to be included in the
               | benchmark because you're going to be paying that cost
               | relative to `std::sin()`.
        
               | jcranmer wrote:
               | People who found that sin is the performance bottleneck
                | in their code and are trying to find a way to speed it up.
               | 
               | One of the big problems with floating-point code in
               | general is that users are largely ignorant of floating-
               | point issues. Even something as basic as "0.1 + 0.2 !=
               | 0.3" shouldn't be that surprising to a programmer if you
               | spend about five minutes explaining it, but the evidence
               | is clear that it is a shocking surprise to a very large
               | fraction of programmers. And that's the most basic
               | floating-point issue, the one you're virtually guaranteed
               | to stumble across if you do anything with floating-point;
               | there's so many more issues that you're not going to
               | think about until you uncover them for the first time
               | (e.g., different hardware gives different results).
        
               | AlotOfReading wrote:
               | Thanks to a lot of effort by countless legions of people,
               | we're largely past the years when it was common for
               | different hardware to give different results. It's pretty
               | much just contraction, FTZ/DAZ, funsafe/ffast-math, and
               | NaN propagation. Anyone interested in practical
               | reproducibility really only has to consider the first two
               | among the basic parts of the language, and they're
               | relatively straightforward to manage.
        
               | jcranmer wrote:
                | Divergent math library implementations are the other main
                | category, and for many practical cases, you might have to
                | worry about the parallelization factor changing things. For
                | completeness' sake, I might as well add in approximate
                | functions, but if you're using an approximate inverse
                | square root instruction, well, you should probably expect
                | that to differ on different hardware.
               | 
               | On the plus side, x87 excess precision is largely a thing
               | of the past, and we've seen some major pushes towards
               | getting rid of FTZ/DAZ (I think we're at the point where
               | even the offload architectures are mandating denormal
               | support?). Assuming Intel figures out how to fully get
               | rid of denormal penalties on its hardware, we're probably
               | a decade or so out from making -ffast-math no longer
               | imply denormal flushing, yay. (Also, we're seeing a lot
               | of progress on high-speed implementations of correctly-
               | rounded libm functions, so I also expect to see standard
               | libraries require correctly-rounded implementations as
               | well).
        
               | AlotOfReading wrote:
               | The definition I use for determinism is "same inputs and
               | same order = same results", down to the compiler level.
               | All modern compilers on all modern platforms that I've
               | tested take steps to ensure that for everything except
               | transcendental and special functions (where it'd be an
               | unreasonable guarantee).
               | 
               | I'm somewhat less interested in correctness of the
               | results, so long as they're consistent. rlibm and related
               | are definitely neat, but I'm not optimistic they'll
               | become mainstream.
        
             | ashvardanian wrote:
              | I am genuinely quite surprised that the sine approximation
              | is the biggest eye-catcher in that entire repo.
             | 
             | It will only take a 5-line PR to add Horner's method and
             | Chebyshev's polynomials and probably around 20 lines of
             | explanations, and everyone passionate about the topic is
             | welcome to add them.
             | 
             | There are more than enough examples in the libraries
             | mentioned above ;)
        
               | croemer wrote:
               | It's eye catching because it's advertised as a 40x
               | speedup without any caveats.
        
               | ashvardanian wrote:
                | Oh, I missed that part. It's hard to fit the README's
                | bullet points on a single line, and I've probably removed
                | too many relevant words.
                | 
                | I'll update the README statement in a second; I've already
                | patched the sources to explicitly focus on the [-pi/2,
                | pi/2] range.
               | 
               | Thanks!
        
             | dingdingdang wrote:
              | Someone is still putting tremendous effort into this
              | project, so I reckon it would be worthwhile to submit this
              | obviously well-thought-through criticism as a PR to the
              | repo!
        
           | creato wrote:
           | I'd suggest simply adding `assert(-M_PI/2 <= x && x <=
           | M_PI/2)` to your function. It won't slow down the code in
           | optimized builds, and makes it obvious that it isn't designed
           | to work outside that range even if people copy/paste the code
           | without reading it or any comments.
           | 
           | Also, it would be good to have even in a "production" use of
           | a function like this, in case something outside that range
           | reaches it by accident.
        
             | ashvardanian wrote:
             | Yes, that's a very productive suggestion! Feel free to open
             | a PR, or I'll just patch it myself in a couple of hours
             | when I'm back to the computer. Thanks!
        
         | pclmulqdq wrote:
         | Oof this is bad. If you're going to ask people to approximate,
         | use a Chebyshev approximation please. You will do sin(x) faster
         | than this and more accurately.
        
         | meindnoch wrote:
         | And it's not even using the Horner scheme for evaluating the
         | polynomial.
        
         | boricj wrote:
         | It's not useless if it's good enough for the problem at hand.
         | 
         | Kaze Emanuar has two entire videos dedicated to optimizing
         | sin() on the Nintendo 64 and he's using approximations like
          | this without issues in his homebrew:
          | 
          |     - https://www.youtube.com/watch?v=xFKFoGiGlXQ
          |     - https://www.youtube.com/watch?v=hffgNRfL1XY
        
       | loeg wrote:
       | This is a hodgepodge of random stuff. Completely incoherent.
       | Would not really point anyone at this to try and dig out gems.
        
         | ashvardanian wrote:
         | I agree that it may not be the most readable format. So far,
         | the best-structured piece on similar topics I've seen is
         | Algorithmica: <https://en.algorithmica.org/hpc>.
         | 
         | I am sure it overlaps in terms of topics, maybe even some
         | examples, but the homepage suggests that the book is about 500
         | pages long. I generally don't have a time budget to read that
         | much, and in most cases, I want to play with the code more than
         | read the text, especially when some new IO_uring feature, a
         | metaprogramming tool, or an Assembly instruction comes out.
         | 
          | Another observation is that people don't like to read
         | 5000-word essays on each topic. At best, those become training
         | materials for the next LLMs and will affect future code only
         | indirectly...
         | 
         | I'm all ears for alternative formats if you have
         | recommendations, as I generally reimplement such projects every
         | couple of years ;)
        
           | johnisgood wrote:
            | Is there one for C? I do not like the "std::" whatnots.
        
         | gblargg wrote:
          | Agreed. I'm not seeing any use of an automatic profiler to find
         | the hot spots of one's code. If you don't know that, you're
         | aiming blind on what to optimize and how much. The only
         | profiling mentioned is something manual you have to add to each
         | scope you want to time.
        
       | samlinnfer wrote:
       | Looks like AI slop.
        
       | vitus wrote:
       | Why choose Abseil for associative containers? It was a big deal
       | when it first appeared on the scene, and it's certainly still
       | better than the standard library for the most part, but there are
       | now a number of alternatives that have since improved upon its
       | weaknesses. (I've been especially impressed by Boost's
       | unordered_flat_map.)
       | 
       | I have increasingly soured on Abseil over the past couple of
       | years. At Google, we've seen an exodus of many of the core
       | library maintainers, and some of the more recent design choices
       | have been questionable at best.
        
         | ashvardanian wrote:
         | I've never tried `unordered_flat_map` and would probably prefer
         | to keep Boost, Poco, Qt, and other mega-frameworks outside of
         | this repo to make the builds leaner, but I'm open to
         | alternatives. Which are the best to look into?
        
           | PaulDavisThe1st wrote:
            | boost::unordered is a header-only library.
        
       ___________________________________________________________________
       (page generated 2025-04-18 23:00 UTC)