[HN Gopher] Less Slow C++
___________________________________________________________________
Less Slow C++
Earlier this year, I took a month to reexamine my coding habits and
rethink some past design choices. I hope to rewrite and improve my
FOSS libraries this year, and I needed answers to a few questions
first. Perhaps some of these questions will resonate with others in
the community, too.

- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for
  easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512
  & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, &
  AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what
  gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise?
- What's the compile-time vs. run-time trade-off for lazily
  evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux kernel bypass with io_uring vs. POSIX
  sockets?
- How close are we to effectively using Networking TS or
  heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in
  nested containers, and which libraries support them?

These questions span from micro-kernel optimizations (nanoseconds)
to distributed systems (micro/millisecond latencies). Rather than
tackling them all in one post, I compiled my explorations into a
repository--extending my previous Google Benchmark tutorial
(<https://ashvardanian.com/posts/google-benchmark>)--to serve as a
sandbox for performance experimentation. Some fun observations:
- Compilers now vectorize 3x3x3 and 4x4x4 single/double precision
  multiplications well! The smaller one is ~60% slower despite 70%
  fewer operations, outperforming my vanilla SSE/AVX and coming
  within 10% of AVX-512.
- Nvidia TCs vary dramatically across generations in numeric types,
  throughput, tile shapes, thread synchronization
  (thread/quad-pair/warp/warp-groups), and operand storage.
  Post-Volta, manual PTX is often needed (as intrinsics lag), though
  the new TileIR (introduced at GTC) promises improvements for dense
  linear algebra kernels.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput
  & programming complexity. It took me a day to debug TMM register
  initialization, and SME is equally odd. Sierra Forest packs 288
  cores/socket, and AVX10.2 drops 256-bit support for 512-bit... I
  wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be
  up to 40x faster than standard implementations, even without SIMD.
  It's a bit hand-wavy, though; I wish more projects documented
  error bounds and had 1 & 3.5 ULP variants like Sleef.
- Meta-programming tools like CTRE can outperform typical RegEx
  engines by 5x and simplify building parsers compared to
  hand-crafted FSMs.
- Once clearly distinct in complexity and performance (DPDK/SPDK vs.
  io_uring), the gap is narrowing. While pre-5.5 io_uring can boost
  UDP throughput by 4x on loopback IO, newer zero-copy and
  concurrency optimizations remain challenging.
The repository is loaded with links to favorite CppCon lectures,
GitHub snippets, and tech blog posts. Recognizing that many high-
level concepts are handled differently across languages, I've also
started porting examples to Rust & Python in separate repos.
Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found
answers in code -- except
pointer tagging and secure enclaves, which still elude me in public
cloud. I'd love to hear from others, especially on comparing High-
Level Synthesis for small matrix multiplications on FPGAs versus
hand-written VHDL/Verilog for integral types. Let me know if you
have ideas for other cool, obscure topics to cover!
Author : ashvardanian
Score : 175 points
Date : 2025-04-18 13:09 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| kreetx wrote:
| This is only from a cursory look, but it all looks really concise
| and practical - thank you for putting it all together!
| intelVISA wrote:
| Awesome, thanks.
|
| Does C++ have a good async ('coroutine') story for io_uring yet?
| ashvardanian wrote:
| Great question! This has been top of mind for me for the last
| 2-3 years.
|
| Short answer: sadly, no. I love the "usability" promise of
| coroutines--and even have 2-3 FOSS projects that could be
| rewritten entirely around C++ or Rust coroutines for better
| debuggability and extensibility--but my experiments show that
| the runtime cost of most coroutine-like abstractions is simply
| too high. Frankly, I'm not even sure if a better design is
| possible on modern hardware.
|
| This leads me to conclude that, despite my passion for SIMD and
| superscalar execution, the highest-impact new assembly
| instructions that x86 and Arm could standardize would center on
| async execution and lightweight context switching... yet I
| haven't seen any movement in that direction.
|
| ...
|
| I also wrote toy examples for various range/async/stream models
| in C++, Rust, and Python, with measured latencies in inline
| comments:
|
| - Python: <https://github.com/ashvardanian/less_slow.py/blob/2c80c9dda6ef9aec18264871cfb51a13ac34685d/less_slow.py#L711-L722>
| - C++: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc8976ad20c8cdf645caa5b39a1cf488b/less_slow.cpp#L3718-L3759>
| - Rust: <https://github.com/ashvardanian/less_slow.rs/blob/5add609533854bb6a5ddbf358cc15f06f68c82d2/less_slow.rs#L522-L533>
|
| Aside from coroutines (toy hand-rolled implementations and
| commonly used libraries), I've also played around with C++
| executors, senders & receivers, but didn't have much success
| with them either. May be a skill issue.
| haberman wrote:
| > my experiments show that the runtime cost of most
| coroutine-like abstractions is simply too high
|
| Which runtime cost do you mean?
|
| The main one I am aware of is a heap allocation per
| coroutine, though this can in some cases be elided if the
| coroutine is being called from another coroutine.
|
| The other cost I am aware of is the initializing of the
| coroutine handle, but I think this is just a couple of
| pointers.
|
| In both cases I would expect these overheads to be relatively
| modest compared to the cost of the I/O itself, though it's
| definitely better to elide the heap allocation when possible.
|
| I don't know much about coroutine libraries like unifex
| (which I think your test is using), but a hand-coded
| prototype I was playing with doesn't seem to add much
| overhead: https://godbolt.org/z/8Kc1oKf15
|
| If we can compile with -fno-exceptions, the code is even
| tighter: https://godbolt.org/z/5Yo8Pqvd5
|
| My exploration into coroutines and I/O is only in the
| earliest stages, so I won't claim any of this to be
| definitive. But I am very interested in this question of
| whether the overhead is low enough to be a good match for
| io_uring or not.
| PaulDavisThe1st wrote:
| > lightweight context switching
|
| the cost of context switch consists of two parts, one of
| which can be subdivided:
|
| 1. register save/restore
|    1a. user-space only registers
|    1b. full register save/restore, required in kernel space
|
| 2. the cost of the TLB flush, which is in turn proportional
| to the working set size of the switched-to process (i.e. if
| you don't touch much memory after the context switch, the
| cost is lower than if you do)
|
| I am not sure that any assembler instructions could address
| either of these.
|
| What do you have in mind?
| nickysielicki wrote:
| https://github.com/NVIDIA/stdexec/blob/main/examples/io_urin...
| spieswl wrote:
| Hey! Your Google Benchmark post was one of my go-to resources
| when I started picking that up a couple years ago. I love to see
| the focus on performance benchmarking here, and the repository is
| laid out well. Nice work!
| jonstewart wrote:
| This is great! There has been far too much sophistry in C++ where
| the justification for some contortions are "performance", with no
| empirical data.
|
| I am surprised about CTRE giving good results--I will admit I
| have thought of it more as a parlor trick than a viable engine. I
| will need to dig into that more. I also want to dig into the
| OpenMP & TBB threadpool benchmarks to see whether Boost::ASIO
| threadpool can be added into it.
| ashvardanian wrote:
| I was also surprised about CTRE and I can imagine it being a
| viable tool for implementing parsers for 99% of use cases,
| where you may not need all the SIMD bells and whistles.
|
| A word of caution, though: I remember the throughput differing
| vastly between GCC and MSVC builds. The latter struggles with
| heavy meta-programming and expression templates. I don't know
| why.
| nickysielicki wrote:
| std::regex is such a nightmare, I didn't take the time to run
| the code myself but I'd be curious if you would see the same
| delta if you swapped it for boost::regex or re2.
|
| I think there's a case to be made for libraries like
| https://github.com/foonathan/lexy and/or
| https://github.com/boostorg/parser instead of reaching for a
| regex in the first place.
| ashvardanian wrote:
| You can beat CTRE throughput with a non-compile-time
| solution. My first recommendation will be to look at
| HyperScan. It has remarkable throughput:
| https://github.com/intel/hyperscan
|
| I've only avoided it in the tutorial, as I want to keep the
| build system lean. I wouldn't be surprised if it's 10x
| faster than Boost in the average case.
| jonstewart wrote:
| To echo Ash, HyperScan is the performance king. Use it
| when shipping on x64, when you have control over the
| pattern sets, and when you don't mind building it and
| working through its idiosyncrasies. It has some
| nonstandard matching semantics, reflecting its focus on
| network intrusion detection systems (NIDS), and does not
| have Unicode support.
|
| For general purpose usage, Google's RE2 and PCRE2 in JIT
| mode will offer pretty good performance. Zoltan Herczeg's
| work on the PCRE2's JIT is underappreciated. Both these
| options are widely available and portable.
| jonstewart wrote:
| Oh MSVC, bless. Mingw setup might be a pain, but the
| dividends accrue over time.
|
| --
|
| This was a good reminder that I need to pay more attention to
| Unum's projects. I noticed this older blog article,
| https://www.unum.cloud/blog/2021-01-03-texts, and that brings
| up some questions. First, in 2025, is UStore a wholesale
| replacement for UDisk or are the two complementary? Second,
| what is the current Unum approach for replacing full-text
| search engines (e.g., ElasticSearch)?
| ashvardanian wrote:
| I wish I had a short answer :)
|
| For years, I've hoped to build it as an open-core
| project: open-source SotA solutions for Storage,
| Compute, and AI Modeling built bottom-up. You can imagine
| the financial & time burden of building something like that
| with all the weird optimizations and coding practices
| listed above.
|
| A few years in, with millions spent out of my pocket,
| without any venture support or revenue, I've decided to
| change gears and focus on a few niche workloads until some
| of the Unum tools become industry standards for something.
| USearch was precisely that, a toy Vector Search engine that
| would still, hopefully, be 10x better than alternatives, in
| one way or another:
| <https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...>.
|
| Now, ScyllaDB (through the Rust SDK) and YugaByte (through the
| C++ SDK) are the most recent DBMSs to announce features built
| on USearch, joining the ranks of many other tech products
| leveraging some of those optimizations. I spent last year
| playing around with different open-source growth & governance
| ideas, looking for a way to foster a more collaborative,
| rather than competitive, environment among our upstream users
| -- no major releases, just occasional patches here and there.
|
| It was an interesting period, but now I'm again deep in the
| "CUDA-GDB" land, and the next major release to come is
| precisely around Full-Text Search in StringZilla
| <https://github.com/ashvardanian/stringzilla>, and will be
| integrated into both USearch <https://github.com/unum-
| cloud/usearch> and somewhere else ;)
| jonstewart wrote:
| It's kickass that USearch is in DuckDB. That's something
| I will play with, for sure.
|
| ElasticSearch has always seemed geared too much towards
| concurrent querying with mixed workloads, and then it
| gets applied to logs... and, well, with logs you care
| about detection of known query sets at ingest, indexing
| speed, compression, and ability to answer queries over
| large cold indices in cheap object storage. And often
| when searching logs, you want exact matching, preferably
| with regex. Part of me wants to play with rolling my own
| crazy FM-index library, part of me thinks it might be
| easier to play games with Parquet dictionary tables (get
| factors out of regexps, check dictionary tables for the
| factors for great win), and part of me thinks I will be
| better off waiting to see what comes of the Rust-based ES
| challengers.
|
| Will definitely follow announcements to come with
| StringZilla.
| ashvardanian wrote:
| > Part of me wants to play with rolling my own crazy FM-
| index library
|
| Oh, absolutely, go for it! And check out what Pesho
| <https://github.com/pesho-ivanov> & Ragnar are doing
| <https://github.com/RagnarGrootKoerkamp> ;)
| boricj wrote:
| The following comes from the complete opposite side of computing,
| microcontrollers. I've been working on an embedded system where
| the heap is about 256 KiB and the biggest stack has 4 KiB. I do
| write idiomatic modern C++ for the most part (even if I hate C++
| metaprogramming with a passion), but not all tricks are suitable
| in all situations:
|
| - CTRE is fine as long as you don't overflow the stack. I tried
| once to validate a string for a HTTP proxy configuration with an
| exhaustive regex, CTRE tried to allocate 5 KiB of stack 40 call
| frames in and therefore crashed the embedded system with a stack
| overflow. I've had to remove port validation from the regex
| (matching a number between 1 and 65535 was a bridge too far) and
| check that part by hand instead. I've also had to dumb down other
| CTRE regexes too in my code for similar reasons.
|
| - Several constraints and design decisions led me to mostly ditch
| JSON internally and write my own BSON library. Instead of the
| traditional dynamically-allocated tree-of-nodes approach, it works
| directly in-place, so I can give it a std::vector with a chunk of
| reserved memory upfront and not worry about memory allocation or
| fragmentation later on. One major benefit is that since there are
| no string escape sequences, I can directly return a
| std::string_view for string values. There are downsides to this
| approach, mostly revolving around modifications: one needs to be
| very careful not to invalidate iterators (which are raw pointers
| to the underlying buffer) while doing so and adding/removing
| entries towards the beginning of a large document is expensive
| due to the memmove().
|
| - I ditched newlib for picolibc and exterminated anything that
| pulled in the C/C++ standard library locale code (that was over
| 130 kilobytes of Flash altogether IIRC), which includes among
| other things C++ streams (they are bad for plenty of other
| reasons too, but mine was program size).
|
| You seem to have mostly aimed for throughput and raw performance
| in your benchmarks, which is fine for a generic desktop or
| server-class system with a MMU and plenty of resources. Just
| wanna point out that other environments will have different
| constraints that will mandate different kinds of optimizations,
| like memory usage (heap/stack/program size), dynamic memory
| fragmentation, real-time/jitter...
| ashvardanian wrote:
| Very true! I'd also go for similar optimizations when
| processing texts or sparse linear algebra on Nvidia and AMD
| GPUs. You only have ~50 KB of constant memory, ~50 MB of shared
| memory, and ~50 GB of global memory. It is BIG compared to
| microcontrollers but very little compared to the scope of
| problems often solved on GPUs. So many optimizations revolve
| around compressed representations and coalesced memory
| accesses.
|
| I am still looking for a short example of such CUDA kernels,
| and I would love to see more embedded examples if you have
| thoughts ;)
| boricj wrote:
| I haven't had to reach for them so far either professionally
| or personally, but custom memory allocators (slab allocation,
| bump allocator...) and allocation strategies are something
| I've been meaning to look into. Too bad that the one game
| I've done reverse-engineering on used dynamic memory
| allocation for just about _everything_, with an allocator
| that uses a singly-linked list of used/free chunks that
| wouldn't look out of place in the 1980s.
|
| I'm aware that the C++ standard library has polymorphic
| allocators alongside a couple of memory resource
| implementations. I've also heard that the dynamic dispatch
| for the polymorphic allocators could bring some optimization
| or speed penalties compared to a statically dispatched
| allocator or the standard std::allocator that uses operator
| new(), but I have no concrete data to judge either way.
| msarnoff wrote:
| I've done C++ on a Cortex-M0+ with 8KB of flash. Code size is a
| big issue. You have to disable a bunch of stuff (no exceptions,
| nothing that does dynamic allocation) but you can still use
| classes, virtual methods, templates, constexpr, etc. These are
| all things that are a pain to do in C and usually require a
| bunch of gross macros.
| chrisrodrigue wrote:
| Yeah, embedded C++ is a wildly different experience from
| vanilla. I've worked in large embedded C++ codebases where we
| couldn't use the STL and had to use homegrown containers for
| everything.
|
| I wonder how Rust is stacking up (no pun intended) in the
| embedded game these days.
| loeg wrote:
| Not sure I'd reach for C++ or regexes in such a constrained
| micro environment. Anything where you don't directly understand
| the precise memory use is probably out.
| boricj wrote:
| The NumWorks N0100 graphical calculator had 1 MiB of Flash
| and 256 KiB of RAM. It packed seven mathematical apps
| (calculation, grapher, equations, statistics, regression,
| sequences, distributions) with a decently powerful maths
| engine/equation typesetter written in C++ _and_ a MicroPython
| shell. They've paid a fair amount of attention to details in
| order to fit all of that in (least of all no STL), but C++
| wielded correctly for embedded is no more of a memory hog
| than C.
|
| Our target has ~1.5 MiB of Flash for program code and 512 KiB
| of RAM. We're using half of the former and maybe a third of
| the latter, the team barely paid any attention to program
| size or memory consumption. One day the project lead became
| slightly concerned about that, and by the end of the day I had
| shaved off 20% of Flash and RAM usage by going for the
| lowest-hanging fruit.
|
| I find it a bit amusing to call a 250 MHz STM32H5 MCU a
| constrained micro environment, if anything it's a bit
| overkill for what we need.
| leni536 wrote:
| > CTRE is fine as long as you don't overflow the stack
|
| Which is to say CTRE is mostly not fine, if you use it on user-
| provided strings, regardless of target environment. It's
| heavily recursion based, with never spilling to the heap and
| otherwise no safeguards for memory use/recursion depth.
| boricj wrote:
| The regex which was overflowing the stack was something like
| this (simplified and from memory):
|
|   ^http:\/\/([a-z0-9.-]+)\/?:([1-9]|[0-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9]|[1-5][0-9][0-9][0-9][0-9]|6[0-4][0-9][0-9][0-9]|65[0-4][0-9][0-9]|655[0-2][0-9]|6553[0-5])$
|
| Once I gave up validating the port number with the regex, it
| no longer blew up the stack:
|
|   ^http:\/\/([a-z0-9.-]+)\/?:([1-9][0-9]{0,4})$
|
| I'll admit I haven't done a thorough job of auditing the
| stack usage afterwards, but not all regexes look like Perl
| codegolf. For simple, straightforward patterns I don't see
| any problems using CTRE, but I'd be interested to see some
| proof to the contrary if you have some.
| hwpythonner wrote:
| This is impressive work--feels like it should be packaged as a
| handbook for high-performance computing or even a practical guide
| for HFT developers. Great resource!
| nickysielicki wrote:
| > Let me know if you have ideas for other cool, obscure topics to
| cover!
|
| Everything that Kris Jusiak has under https://github.com/qlibs/
| is worth a look.
| EliRivers wrote:
| If you're taking priority requests:
|
| "How coding FPGA differs from GPU and what is High-Level
| Synthesis, Verilog, and VHDL? #36" Yes please!
| ashvardanian wrote:
| I've started implementing this, but organizing a good testing
| environment was time-consuming. If someone here has a sandbox
| and the right skill set, 3x3x3 or 4x4x4 matrix multiplications
| can be a perfect candidate for FPGA ports and case studies!
| synthos wrote:
| What data type/precision? How much parallelism versus
| iterative? Is this fully in fabric or part of a SoC? Easy to
| get mired in many details.
|
| HW design has so many degrees of freedom which makes it fun
| but challenging.
|
| What specific topics are you trying to address with this?
| Icathian wrote:
| I would pay some real money for the rust equivalent of this kind
| of material. Anyone got a book or repo they like and would
| recommend?
| ashvardanian wrote:
| Most hardware-level observations, like the latency of various
| memory accesses or numeric operations, would be the same for
| the Rust code. As for higher-level abstractions, I've already
| started porting them to Rust
| <https://github.com/ashvardanian/less_slow.rs>.
|
| Next, it would be exciting to implement a concurrent job-
| stealing graph algorithm in both languages to get a feel for
| their ergonomics in non-trivial problems. I can imagine it
| looks very different in Rust and C++, but before I get there,
| I'm looking for best practices for implementing nested
| associative containers with shared stateful allocators in Rust.
|
| In C++, I've implemented them like this:
| <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>,
| even though I haven't seen many people doing that in public
| codebases. Any good examples for Rust?
| anp wrote:
| I'm definitely interested in seeing this kind of content in
| Rust, have you looked at Rayon's implementation for work
| stealing yet? Can result in some very nice high-level code.
| standy13 wrote:
| I did not know about unnormalized floats. I sometimes wonder
| about it staring at my GPU when it multiplies matrices.
| ashvardanian wrote:
| I made a bunch of bad measurements until someone reminded
| me to:
|
|   #if defined(__AVX512F__) || defined(__AVX2__)
|   void configure_x86_denormals(void) {
|       _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // Flush results to zero
|       _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // Treat denormal inputs as zero
|   }
|   #endif
|
| It had a 50x performance impact on Intel. As benchmarked on
| `c7i.metal-48xl` instances:
|
| - `f64` throughput grew from 0.2 to 8.2 TFLOPS.
| - `f32` throughput grew from 0.6 to 15.1 TFLOPS.
|
| Here is that section in the repo with more notes on AVX-512,
| AMX, and other instructions:
| <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>.
| cwzwarich wrote:
| Intel has always had terrible subnormal performance. It's not
| that difficult to implement in HW, and even if you still want
| to optimize for the normalized case, we're talking about a 1
| cycle penalty, not an order(s)-of-magnitude penalty.
| johnisgood wrote:
| > Less Slow C, C++, and Assembly Code
|
| I expected C code too, however, not only .cpp and .S.
|
| The less_slow.cpp uses a lot of C++-isms.
|
| This may require fixing, or removing "C" from the list.
| kllrnohj wrote:
| > 40x faster trigonometry: Speed-up standard library functions
| like std::sin in just 3 lines of code.
|
| Huh, ok, let's see how...
|
|   * By limiting the expansion to a few terms, we can
|   * approximate `sin(x)` as:
|   *
|   *   sin(x) ≈ x - (x^3)/3! + (x^5)/5!
|   *
|   * This reduces the computational cost but comes with
|   * reduced accuracy.
|
| I see. "reduced accuracy" is an understatement. It's just
| _horrifically wrong_ for inputs outside the range of [-2, 2]
|
| https://www.wolframalpha.com/input?i=plot+sin+x%2C+x+-+%28x%...
|
| It cannot handle a single interval of a sin wave, much less the
| repeating nature? What an absolutely useless "optimization"
| Analemma_ wrote:
| This is kind of a dumb objection. If your sine function has
| good accuracy in [-pi/2, pi/2], you can compute all other
| values by shifting the argument and/or multiplying the result
| by -1.
| SkiFire13 wrote:
| But then you have to include this in the benchmark and it
| will no longer be 40x faster.
| dzaima wrote:
| There are a bunch of real situations where you can assume
| the input will be in a small range. And while reducing from
| [-pi;pi] or [-2*pi;2*pi] or whatever is gonna slow it down
| somewhat, I'm pretty sure it wouldn't be too significant,
| compared to the FP arith. (and branching on inputs outside
| of even that expanded target expected range is a fine
| strategy realistically)
| kllrnohj wrote:
| > branching on inputs outside of the target expected
| range is a fine strategy realistically
|
| branches at this scale are actually significant, and so
| will drastically impact being able to achieve 40x faster
| as claimed
| dzaima wrote:
| That's only if they're unpredictable; sure, perhaps on
| some workload it'll be unpredictable whether the input to
| sin/cos is greater than 2*pi, but I'm pretty sure on most
| it'll be nearly-always a "no". Perhaps not an
| optimization to take in general, but if you've got a
| workload where you're fine with 0.5% error, you can also
| spend a couple seconds thinking about what range of
| inputs to handle in the fast path. (hence "target
| expected range" - unexpected inputs getting unexpected
| branches isn't gonna slow down things if you've
| calibrated your expectations correctly; edited my comment
| slightly to make it clearer that that's about being out
| of the expanded range, not just [-pi/2,pi/2])
| kllrnohj wrote:
| Assuming an even distribution over a single iteration of
| sin, that is [0,pi], this will have a ~30% misprediction
| rate. That's not rare.
| dzaima wrote:
| I'm of course not suggesting branching in cases where you
| expect a 30% misprediction rate. You'd do branchless
| reduction from [-2*pi;2*pi] or whatever you expect to be
| frequent, and branch on inputs with magnitude greater
| than 2*pi if you want to be extra sure you don't get
| wrong results if usage changes.
|
| Again, we're in a situation where we know we can tolerate
| a 0.5% error, we can spare a bit of time to think about
| what range needs to be handled fast or supported at all.
| kllrnohj wrote:
| Those reductions need to be part of the function being
| benchmarked, though. Assuming a range limitation of
| [-pi,pi] even would be reasonable, there's certainly
| cases where you don't need multiple revolutions around a
| circle. But this can't even do that, so it's simply not a
| substitute for sin, and claiming 40x faster is a sham
| dzaima wrote:
| Right; the range reduction from [-pi;pi] would be like 5
| instrs ("x -= copysign(pi/2 & (abs(x)>pi/2), x)" or so),
| ~2 cycles throughput-wise or so, I think; that's slightly
| more significant than I was imagining, hmm.
|
| It's indeed not a substitute for sin in general, but it
| could be in some use-cases, and for those it could really
| be 40x faster (say, cases where you're already externally
| doing range reduction because it's necessary for some
| other reason (in general you don't want your angles
| infinitely accumulating scale)).
| dzaima wrote:
| At pi/2 that approximation gets you 1.0045, i.e. half a
| percent off, so it's not particularly good at that. (still
| could be sufficient for some uses though; but not the best
| achievable even with that performance)
| TimorousBestie wrote:
| Good argument reduction routines are not exactly easy for a
| novice to write, so I think this is a valid objection.
| kubav027 wrote:
| At least do not name the function "sin". One former game dev
| works in my company and he is using similar tricks all the
| time. It makes the code so hard to read, and unless you are
| computing "sin" a lot, the speedup is not measurable.
| ashvardanian wrote:
| You can always get more accuracy by expanding those 3 lines to
| handle more of the Taylor components... but it's important to
| remember that this is still educational material.
|
| You can find more complete examples in my SimSIMD
| (https://github.com/ashvardanian/SimSIMD), but they also often
| assume that at a certain part of a kernel, a floating point
| number is guaranteed to be in a certain range. This can greatly
| simplify the implementation for kernels like Atan2. For
| general-purpose inputs, go to SLEEF (https://sleef.org). Just
| remember that every large, complicated optimization starts with
| a small example.
| kllrnohj wrote:
| > but it's important to remember that this is still
| educational material.
|
| Then it should be educating on the applicability and
| limitations of things like this instead of just saying
| "reduced accuracy" and hoping the reader notices the massive
| issues? Kinda like the ffast-math section does.
| pclmulqdq wrote:
| No. Do not use Taylor series approximations in your real
| code. They are slow and inaccurate. You can do much, much
| better with some basic numerical analysis. Chebyshev and
| Remez approximations will give you more bang for your buck
| every time.
| jcranmer wrote:
| Educational material that misinforms its readers isn't
| educational, and it's insanely counterproductive.
|
| People have already ragged on you for doing Taylor
| approximation, and I'm not the best expert on the numerical
| analysis of implementing transcendental functions, so I won't
| pursue that further. But there's still several other
| unaddressed errors in your trigonometric code:
|
| * If your function is going to omit range reduction, _say so_
| upfront. Saying "use me to get a 40x speedup because I omit
| part of the specification" is misleading to users, especially
| because you should assume that most users are not
| knowledgeable about floating-point and thus they aren't even
| going to understand they're missing something without you
| explicitly telling them!
|
| * You're doing polynomial evaluation via `x * a + (x * x) * b
| + (x * x * x) * c` which is not the common way of doing so,
| and also, it's a slow way of doing so. If you're trying to be
| educational, do it via the `((c * x + b) * x + a) * x`
| technique--that's how it's done, that's how it _should_ be
| done.
|
| * Also, doing `x / 6.0` is a disaster for performance,
| because fdiv is one of the slowest operations you can do. Why
| not do `x * (1.0 / 6.0)` instead?
|
| * Doing really, really dumb floating-point code and then
| relying on -ffast-math to make the compiler unpick your
| dumbness is... a bad way of doing stuff. Especially since
| you're recommending people go for it for the easy speedup and
| saying absolutely nothing about where it can go
| catastrophically wrong. As Simon Byrne said, "Friends don't
| let friends use -ffast-math" (and the title of my explainer
| on floating-point will invariably be "Floating Point, or How
| I Learned to Start Worrying and Hate -ffast-math").
| Certhas wrote:
| I don't know who the target audience is supposed to be, but
| who would be the type of person who tries to implement
| performance critical numerical codes but doesn't know the
| implications of Taylor expanding the sine function?
| Const-me wrote:
| > tries to implement performance critical numerical codes
| but doesn't know the implications of Taylor expanding the
| sine function?
|
| That would be me, I'm afraid. I know little about Taylor
| series, but I'm pretty sure it's less than ideal for the
| use case.
|
| Here's a better way to implement faster trigonometry
| functions in C++:
| https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM...
| That particular source file implements it for 4-wide FP64
| AVX vectors.
| ack_complete wrote:
| There are lots of cases where you can get away with
| moderate accuracy. Rotating a lot of batched sprites
| would be one of them; could easily get away with a
| mediocre Taylor series approximation, even though it's
| leaving free accuracy on the table compared to minimax.
|
| But not having _range reduction_ is a bigger problem; I
| can't see many uses for a sin() approximation that's only
| good for half a wave. And as others have said, if you need
| range reduction for the approximation to work in its
| intended use case, that needs to be included in the
| benchmark because you're going to be paying that cost
| relative to `std::sin()`.
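For reference, a minimal (not speed-optimized) range reduction that folds any argument into [-pi/2, pi/2] before calling such an approximation could look like this sketch; `sin_core` here is a stand-in degree-7 Taylor polynomial, not the repo's code:

```cpp
#include <cmath>

// Stand-in half-wave approximation, valid on [-pi/2, pi/2]:
// a degree-7 Taylor polynomial in Horner form.
static double sin_core(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 +
                x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

// Naive range reduction: std::remainder already returns a value in
// [-pi, pi]; the symmetry sin(x) = sin(pi - x) folds that into
// [-pi/2, pi/2]. This is the cost that belongs in any benchmark
// against std::sin.
double sin_reduced(double x) {
    const double pi = 3.141592653589793;
    double r = std::remainder(x, 2.0 * pi); // r in [-pi, pi]
    if (r > pi / 2.0)       r = pi - r;
    else if (r < -pi / 2.0) r = -pi - r;
    return sin_core(r);
}
```

A production version would use a Cody-Waite or Payne-Hanek style reduction instead, since `std::remainder` itself is not cheap.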
| jcranmer wrote:
| People who found that sin is the performance bottleneck
| in their code and are trying to find way to speed it up.
|
| One of the big problems with floating-point code in
| general is that users are largely ignorant of floating-
| point issues. Even something as basic as "0.1 + 0.2 !=
| 0.3" shouldn't be that surprising to a programmer if you
| spend about five minutes explaining it, but the evidence
| is clear that it is a shocking surprise to a very large
| fraction of programmers. And that's the most basic
| floating-point issue, the one you're virtually guaranteed
| to stumble across if you do anything with floating-point;
| there's so many more issues that you're not going to
| think about until you uncover them for the first time
| (e.g., different hardware gives different results).
| AlotOfReading wrote:
| Thanks to a lot of effort by countless legions of people,
| we're largely past the years when it was common for
| different hardware to give different results. It's pretty
| much just contraction, FTZ/DAZ, funsafe/ffast-math, and
| NaN propagation. Anyone interested in practical
| reproducibility really only has to consider the first two
| among the basic parts of the language, and they're
| relatively straightforward to manage.
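Contraction is easy to demonstrate: whether a compiler fuses `a * b + c` into an FMA (one rounding) or keeps a separate multiply and add (two roundings) changes the last bits of the result. A small sketch, using `std::fma` explicitly so the effect is visible regardless of `-ffp-contract`:

```cpp
#include <cmath>
#include <limits>

// The residual of a rounded product: exact(a*a) minus rounded(a*a),
// computed exactly by fma's single rounding. Whenever this is nonzero,
// a contracted a*b+c and an uncontracted one give different answers.
inline double product_residual(double a) {
    double rounded = a * a;           // one rounding of the product
    return std::fma(a, a, -rounded);  // exact a*a - rounded, to the bit
}
```

For example, with `a = 1.0 + DBL_EPSILON` the exact square carries an epsilon-squared term that the rounded product drops, so the residual is nonzero.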
| jcranmer wrote:
| Divergent math library implementations is the other main
| category, and for many practical cases, you might have to
| worry about parallelization factor changing things. For
| completeness' sake, I might as well add in approximate
| functions, but if you using an approximate inverse square
| root instruction, well, you should probably expect that
| to be differ on different hardware.
|
| On the plus side, x87 excess precision is largely a thing
| of the past, and we've seen some major pushes towards
| getting rid of FTZ/DAZ (I think we're at the point where
| even the offload architectures are mandating denormal
| support?). Assuming Intel figures out how to fully get
| rid of denormal penalties on its hardware, we're probably
| a decade or so out from making -ffast-math no longer
| imply denormal flushing, yay. (Also, we're seeing a lot
| of progress on high-speed implementations of correctly-
| rounded libm functions, so I also expect to see standard
| libraries require correctly-rounded implementations as
| well).
| AlotOfReading wrote:
| The definition I use for determinism is "same inputs and
| same order = same results", down to the compiler level.
| All modern compilers on all modern platforms that I've
| tested take steps to ensure that for everything except
| transcendental and special functions (where it'd be an
| unreasonable guarantee).
|
| I'm somewhat less interested in correctness of the
| results, so long as they're consistent. rlibm and related
| are definitely neat, but I'm not optimistic they'll
| become mainstream.
| ashvardanian wrote:
| I am genuinely quite surprised that the sine approximation
| is the eyeball catcher in that entire repo.
|
| It will only take a 5-line PR to add Horner's method and
| Chebyshev's polynomials and probably around 20 lines of
| explanations, and everyone passionate about the topic is
| welcome to add them.
|
| There are more than enough examples in the libraries
| mentioned above ;)
| croemer wrote:
| It's eye catching because it's advertised as a 40x
| speedup without any caveats.
| ashvardanian wrote:
| Oh, I missed that part. It's hard to fit a README's
| bullet points on a single line, and I probably removed
| too many relevant words.
|
| I'll update the README statement in a second, and I've
| already patched the sources to explicitly focus on the
| [-π/2, π/2] range.
|
| Thanks!
| dingdingdang wrote:
| Someone is still putting tremendous effort into this
| project so I reckon it would be worthwhile to submit this,
| obviously well thought through, criticism as a PR for the
| repo!
| creato wrote:
| I'd suggest simply adding `assert(-M_PI/2 <= x && x <=
| M_PI/2)` to your function. It won't slow down the code in
| optimized builds, and makes it obvious that it isn't designed
| to work outside that range even if people copy/paste the code
| without reading it or any comments.
|
| Also, it would be good to have even in a "production" use of
| a function like this, in case something outside that range
| reaches it by accident.
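Spelled out, the suggested guard might look like this sketch (`sin_approx` is a stand-in half-wave polynomial, not the repo's function). With `NDEBUG` defined, as in typical optimized builds, the assert compiles away, so release performance is unaffected while debug builds catch out-of-range callers:

```cpp
#include <cassert>
#include <cmath>

#ifndef M_PI // M_PI is POSIX, not standard C++, so define a fallback
#define M_PI 3.14159265358979323846
#endif

// Half-wave sine approximation guarded by the suggested assert.
inline double sin_approx(double x) {
    assert(-M_PI / 2 <= x && x <= M_PI / 2);
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0)));
}
```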
| ashvardanian wrote:
| Yes, that's a very productive suggestion! Feel free to open
| a PR, or I'll just patch it myself in a couple of hours
| when I'm back to the computer. Thanks!
| pclmulqdq wrote:
| Oof this is bad. If you're going to ask people to approximate,
| use a Chebyshev approximation please. You will do sin(x) faster
| than this and more accurately.
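For the curious, a Chebyshev fit is not much code. The sketch below builds the coefficients by interpolation at Chebyshev nodes (the classic "chebft" scheme from numerical-analysis texts) and evaluates them with the Clenshaw recurrence; in a real library the coefficients would be precomputed constants (or a Remez minimax fit), not built at runtime:

```cpp
#include <cmath>
#include <vector>

// Fit f on [a, b] with an n-term Chebyshev series by interpolating
// at the n Chebyshev nodes of [-1, 1], mapped onto [a, b].
template <class F>
std::vector<double> chebyshev_fit(F f, double a, double b, int n) {
    const double pi = 3.141592653589793;
    std::vector<double> fx(n), c(n);
    for (int k = 0; k < n; ++k) {
        double t = std::cos(pi * (k + 0.5) / n);      // node in [-1, 1]
        fx[k] = f(0.5 * (b - a) * t + 0.5 * (b + a)); // map to [a, b]
    }
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < n; ++k)
            s += fx[k] * std::cos(pi * j * (k + 0.5) / n);
        c[j] = 2.0 * s / n;
    }
    return c;
}

// Evaluate the series with the Clenshaw recurrence, the numerically
// stable analogue of Horner's method for the Chebyshev basis.
double chebyshev_eval(const std::vector<double>& c,
                      double a, double b, double x) {
    double t = (2.0 * x - a - b) / (b - a); // map x back to [-1, 1]
    double d = 0.0, dd = 0.0;
    for (int j = (int)c.size() - 1; j >= 1; --j) {
        double sv = d;
        d = 2.0 * t * d - dd + c[j];
        dd = sv;
    }
    return t * d - dd + 0.5 * c[0]; // c[0] enters with weight 1/2
}
```

Fitting sin on [-pi/2, pi/2] with n = 16 nodes already lands near double-precision accuracy, far beyond what a same-degree Taylor truncation achieves at the interval ends.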
| meindnoch wrote:
| And it's not even using the Horner scheme for evaluating the
| polynomial.
| boricj wrote:
| It's not useless if it's good enough for the problem at hand.
|
| Kaze Emanuar has two entire videos dedicated to optimizing
| sin() on the Nintendo 64 and he's using approximations like
| this without issues in his homebrew: -
| https://www.youtube.com/watch?v=xFKFoGiGlXQ -
| https://www.youtube.com/watch?v=hffgNRfL1XY
| loeg wrote:
| This is a hodgepodge of random stuff. Completely incoherent.
| Would not really point anyone at this to try and dig out gems.
| ashvardanian wrote:
| I agree that it may not be the most readable format. So far,
| the best-structured piece on similar topics I've seen is
| Algorithmica: <https://en.algorithmica.org/hpc>.
|
| I am sure it overlaps in terms of topics, maybe even some
| examples, but the homepage suggests that the book is about 500
| pages long. I generally don't have a time budget to read that
| much, and in most cases, I want to play with the code more than
| read the text, especially when some new IO_uring feature, a
| metaprogramming tool, or an Assembly instruction comes out.
|
| Another observation is that people don't like to read
| 5000-word essays on each topic. At best, those become training
| materials for the next LLMs and will affect future code only
| indirectly...
|
| I'm all ears for alternative formats if you have
| recommendations, as I generally reimplement such projects every
| couple of years ;)
| johnisgood wrote:
| Is there one for C? I do not like the "std::" whatnots.
| gblargg wrote:
| Agreed. I'm not seeing any use of an automatic profiler to
| find the hot spots of one's code. If you don't know that,
| you're
| aiming blind on what to optimize and how much. The only
| profiling mentioned is something manual you have to add to each
| scope you want to time.
| samlinnfer wrote:
| Looks like AI slop.
| vitus wrote:
| Why choose Abseil for associative containers? It was a big deal
| when it first appeared on the scene, and it's certainly still
| better than the standard library for the most part, but there are
| now a number of alternatives that have since improved upon its
| weaknesses. (I've been especially impressed by Boost's
| unordered_flat_map.)
|
| I have increasingly soured on Abseil over the past couple of
| years. At Google, we've seen an exodus of many of the core
| library maintainers, and some of the more recent design choices
| have been questionable at best.
| ashvardanian wrote:
| I've never tried `unordered_flat_map` and would probably prefer
| to keep Boost, Poco, Qt, and other mega-frameworks outside of
| this repo to make the builds leaner, but I'm open to
| alternatives. Which are the best to look into?
| PaulDavisThe1st wrote:
| boost::unordered is a header-only library.
___________________________________________________________________
(page generated 2025-04-18 23:00 UTC)