[HN Gopher] Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork...
       ___________________________________________________________________
        
       Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union
        
       Author : ashvardanian
       Score  : 118 points
       Date   : 2025-09-28 08:53 UTC (14 hours ago)
        
 (HTM) web link (ashvardanian.com)
 (TXT) w3m dump (ashvardanian.com)
        
       | seivan wrote:
        | Wow, that was a big difference between rayon and fork union. But
        | it's still missing convenience APIs for a drop-in par_iter().
        
         | ashvardanian wrote:
         | Implementing convenience APIs in Rust has been tricky, since a
         | lot of the usual memory-safety semantics start to break once
         | you push into fast concurrent code that shares buffers across
         | threads. My early drafts were riddled with unsafe and barely
         | functional, so I'd definitely welcome suggestions for quality-
         | of-life improvements.
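          | 
          | For a sense of the problem, here's a simplified sketch of mine
          | (not the library's actual API): an index-based parallel-for
          | over a shared output buffer forces a raw-pointer wrapper and
          | `unsafe`, whereas Rayon's
          | `out.par_iter_mut().for_each(|x| *x *= 2)` stays safe by
          | splitting the iterator instead of sharing it:
          | 
          |   use std::thread;
          | 
          |   /// Doubles every element, one disjoint chunk per thread.
          |   fn double_in_place(out: &mut [u64], threads: usize) {
          |       // The compiler can't prove index-based writes are
          |       // disjoint, so the pointer is smuggled across threads.
          |       struct SendPtr(*mut u64);
          |       unsafe impl Send for SendPtr {}
          |       unsafe impl Sync for SendPtr {}
          | 
          |       let ptr = SendPtr(out.as_mut_ptr());
          |       let len = out.len();
          |       let threads = threads.max(1);
          |       let chunk = (len + threads - 1) / threads;
          |       thread::scope(|s| {
          |           for t in 0..threads {
          |               let ptr = &ptr;
          |               s.spawn(move || {
          |                   let start = t * chunk;
          |                   let end = ((t + 1) * chunk).min(len);
          |                   for i in start..end {
          |                       // SAFETY: each thread owns a disjoint
          |                       // index range of `out`.
          |                       unsafe { *ptr.0.add(i) *= 2 };
          |                   }
          |               });
          |           }
          |       });
          |   }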
        
       | ashvardanian wrote:
       | I likely posted this a few months ago, and it looks like it just
       | got bumped by the platform. Since then the library has seen some
       | improvements (especially if you run on a Linux NUMA box), and the
       | GitHub README is probably the best place to start now:
       | https://github.com/ashvardanian/fork_union
       | 
       | There are still a few blind spots I'm working on hardening, and
       | if you have suggestions, I'd be glad to see them in issues or PRs
       | :)
        
       | alextingle wrote:
       | How does it compare to Intel's TBB?
        
         | ashvardanian wrote:
         | I was asked this a few months back but don't have the
         | measurements fresh anymore. In general, I think TBB is one of
         | the more thorough and feature-rich parallelism libraries out
         | there. That said, I just found a comparable usage example in my
         | benchmarks, and it doesn't look like TBB will have the same
         | low-latency profile as Fork Union:
         | https://github.com/ashvardanian/ParallelReductionsBenchmark/...
        
       | SkiFire13 wrote:
        | It's cool to see the end result, but I would prefer if the
        | article focused a bit more on how it achieves such a solution.
        | For example, how does it dispatch work to the various threads?
        | Do they sleep when there's no work to do? If so, how do you wake
        | them up? How does it handle cases where work is not uniformly
        | distributed between your work items (i.e. some of them are a lot
        | slower to process)? Is that even part of the end goal?
        
         | ashvardanian wrote:
         | Yes, non-uniform workloads are supported! See `for_n_dynamic`.
         | 
         | The threads "busy-wait" by running an infinite loop in a lower
         | energy state on modern CPUs.
         | 
         | And yes, there are more details in the actual implementation in
         | the repository itself. This section, for example, describes the
         | atomic variables needed to control all of the logic:
         | https://github.com/ashvardanian/fork_union?tab=readme-ov-fil...
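          | 
          | Conceptually, the busy-wait is just this (illustrative sketch,
          | not the actual implementation): poll an atomic flag and emit a
          | pause/yield-style hint so the core burns less power while it
          | spins:
          | 
          |   use std::hint::spin_loop;
          |   use std::sync::atomic::{AtomicBool, Ordering};
          | 
          |   /// Spins until work is published, hinting the CPU to idle
          |   /// cheaply between polls.
          |   fn wait_for_work(has_work: &AtomicBool) {
          |       while !has_work.load(Ordering::Acquire) {
          |           // Compiles to PAUSE on x86 / YIELD on AArch64:
          |           // lower power and fewer wasted pipeline slots.
          |           spin_loop();
          |       }
          |   }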
        
           | SkiFire13 wrote:
           | > The threads "busy-wait" by running an infinite loop in a
           | lower energy state on modern CPUs.
           | 
            | Doesn't that still consume the process's time slices from
            | the OS scheduler's POV?
        
       | mgaunard wrote:
        | I have built many asynchronous and parallel systems with task
        | queues.
        | 
        | None of them involve allocation or system calls after
        | initialization. Queues are pre-allocated, as they should be.
        
         | eska wrote:
          | I think thread pools are one of those solved problems that the
          | silent majority of C programmers has solved ages ago and
          | doesn't release open-source projects for. I've also written my
          | own pool and allocator together in ~300 lines and always
          | laughed at taskflow, rayon, etc. Even NUMA is easy with arena
          | allocation. When Casey Muratori of the Handmade Network said
          | the same thing, I remember agreeing, and he got made fun of
          | for it.
          | 
          | BTW the for-case can simply be supported by setting a
          | pool/global boolean and using that to decide how to wait for a
          | new task (during the parallel for the boolean will be true,
          | otherwise do sleeps with mutexes in the worst case for energy
          | saving). A rough sketch of that is below.
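          | 
          | Something in this spirit (hypothetical sketch, the names are
          | mine): a pool-wide flag picks the wait strategy, so workers
          | spin with a CPU hint while a parallel-for is in flight and
          | block on a mutex + condvar when the pool is idle:
          | 
          |   use std::hint::spin_loop;
          |   use std::sync::atomic::{AtomicBool, Ordering};
          |   use std::sync::{Condvar, Mutex};
          | 
          |   struct WaitPolicy {
          |       in_parallel_for: AtomicBool, // true during a parallel-for
          |       task_ready: AtomicBool,      // a new task was published
          |       lock: Mutex<()>,
          |       cv: Condvar,
          |   }
          | 
          |   impl WaitPolicy {
          |       fn wait_for_task(&self) {
          |           if self.in_parallel_for.load(Ordering::Acquire) {
          |               // Hot path: lowest wake-up latency, spin with a
          |               // pause hint.
          |               while !self.task_ready.load(Ordering::Acquire) {
          |                   spin_loop();
          |               }
          |           } else {
          |               // Cold path: idle workers sleep; the publisher
          |               // sets `task_ready` and calls `cv.notify_all()`
          |               // while holding `lock`.
          |               let mut guard = self.lock.lock().unwrap();
          |               while !self.task_ready.load(Ordering::Acquire) {
          |                   guard = self.cv.wait(guard).unwrap();
          |               }
          |           }
          |       }
          |   }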
        
           | ashvardanian wrote:
           | I totally agree -- most C/C++ developers with 10+ years of
           | experience have built similar thread pools or allocators in
           | their own codebases. I'd include myself and a few of my
           | former colleagues on that list.
           | 
           | That said, closed-source solutions for local use aren't quite
           | the same as an open-source project with wider validation.
           | With more third-party usage, reviews, and edge cases, you
           | often discover issues you'd never hit in-house. Some of the
           | most valuable improvements I've seen have come from external
           | bug reports or occasional PRs from people using the code in
           | very different environments.
        
             | mgaunard wrote:
              | There are many open-source and academic libraries for
              | parallel programming with performance similar to or better
              | than OpenMP's.
        
           | Ygg2 wrote:
           | > When casey muratori of the handmade network said the same
           | thing I remember agreeing and he got made fun of for it.
           | 
            | Casey Muratori, while a great programmer in his own right,
            | often disregards the use cases, which leads to apples-to-
            | oranges comparisons. E.g., why is this ASCII editor from 40
            | years ago much faster than this Unicode text editor (with
            | the full ZWJ emoji-joining suite)?
        
           | articulatepang wrote:
           | I'd love to learn more about this. What
           | resources/books/articles/code can I look at to understand
           | this more? Or, if you have some time, would you mind
           | expanding on it?
           | 
            | The parts I'm specifically interested in:
            | 
            | 1. What the ~300-line pool and allocator look like.
            | 
            | 2. What this means: "BTW the for-case can simply be
            | supported by setting a pool/global boolean and using that to
            | decide how to wait for a new task (during the parallel for
            | the boolean will be true, otherwise do sleeps with mutexes
            | in the worst case for energy saving)"
           | 
           | Thank you!
        
       | quackzar wrote:
       | Any comparison with heartbeat scheduling i.e.,
        | https://github.com/judofyr/spice or the Rust port
       | https://github.com/dragostis/chili ?
        
         | ashvardanian wrote:
         | As a rule of thumb, I find Zig projects often come across as
         | higher quality than many C, C++, or Rust counterparts -- it's a
         | surprisingly strong community. That said, I don't write much in
         | Zig myself, haven't explored the projects you linked in detail,
         | and wasn't even considering Zig alternatives as I needed
         | something for my C++ and Rust projects.
         | 
         | From a first glance, they seem to be tackling a slightly
         | different problem: focusing on efficient handling of nested
         | parallelism. Fork Union doesn't support nested parallelism at
         | all.
        
           | johnisgood wrote:
           | I think Ada is great with its builtin constructs for
           | concurrency. It helps you avoid data races, too. You can
           | formally verify your code as well if you so wish. Ada may be
           | too "serious" for people. :D
        
       | felixguendling wrote:
       | Would you recommend this as a "thread pool" / coroutine scheduler
       | replacement for an application web server?
        
         | ashvardanian wrote:
         | Yes, I was planning a similar experiment with UCall
         | (https://github.com/unum-cloud/ucall), leveraging the NUMA
         | functionality introduced in v2 of Fork Union. I don't currently
         | have the right hardware to test it properly, but it would be
         | very interesting to measure how pinning behaves on machines
         | with multiple NUMA nodes, NICs, and a balanced PCIe topology.
        
         | nextaccountic wrote:
         | Tokio actually has some similarities with Rayon. Tokio is used
         | in most Rust web servers, like Axum and Actix-web
        
           | ashvardanian wrote:
           | That's true -- though in my benchmarks Tokio came out as one
           | of the slower parallelism-enabling projects. The article
            | still included a comparison:
            | 
            |   $ PARALLEL_REDUCTIONS_LENGTH=1536 \
            |       cargo +nightly bench -- --output-format bencher
            | 
            |   test fork_union ... bench:   5,150 ns/iter (+/- 402)
            |   test rayon      ... bench:  47,251 ns/iter (+/- 3,985)
            |   test smol       ... bench:  54,931 ns/iter (+/- 10)
            |   test tokio      ... bench: 240,707 ns/iter (+/- 921)
           | 
           | ... but I now avoid comparing to Tokio since it doesn't seem
           | fair -- fork-join style parallel processing isn't really its
           | primary use case.
        
       | jcelerier wrote:
       | We've used fork_union in spatgris (https://github.com/GRIS-
        | UdeM/SpatGRIS) recently and got a nice speedup! That said, there
        | was some trouble with the busy-wait eating 100% of the CPU.
        
         | ashvardanian wrote:
         | Wow, I didn't realize someone had integrated it into their
         | project before I even tried it in mine -- thanks for the trust
         | and for sharing!
         | 
         | I completely agree that tuning is needed for better CPU sleep
         | scheduling. I'm hoping to look into it this October, ideally on
         | some Alder Lake CPUs with mixed Performance/Efficiency cores
         | and NUMA enabled.
        
           | jcelerier wrote:
           | Haha, when I saw the previous post I thought "this is exactly
           | what I need" - our problem maps 1:1 to what fork_union
           | provides and has a low-latency requirement (real-time audio
           | dsp)
        
       | _flux wrote:
        | Where does the improved performance come from? The project does
        | outline different factors, but I wonder which of them are the
        | biggest ones, or whether they are all equally important.
        | 
        | And how big a factor is the busy-loop locking? Yes, the code
        | tells the CPU to do it energy-efficiently, but it's not going to
        | beat the OS if the loop is waiting for more work that isn't
        | coming for a while. Is it doing this on every core?
        | 
        | One factor could be that when a subprocess dies, it doesn't need
        | to release any memory, as the OS deals with it in one go, versus
        | a thread teardown where you need to be neat. Though I suppose
        | this would not be a lot of work.
        
         | ashvardanian wrote:
         | Compared to Rayon or Taskflow, the biggest initial win is
         | cutting out heap allocations for all the promise/result objects
         | -- those act like mutexes once the allocator gets hammered by
         | many threads.
         | 
         | Hard to rank the rest without a proper breakdown. If I ever
         | tried, I'd probably end up writing a paper -- and I'd rather
         | write code :)
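          | 
          | To make the allocation point concrete, here's an illustrative
          | sketch (not fork_union's actual code) of an allocation-free
          | fork-join reduction: partial sums land in a pre-allocated slot
          | per worker, and chunks are handed out with a single atomic
          | counter instead of boxed futures or channels:
          | 
          |   use std::sync::atomic::{AtomicUsize, Ordering};
          |   use std::thread;
          | 
          |   fn parallel_sum(data: &[u64], threads: usize) -> u64 {
          |       let threads = threads.max(1);
          |       let chunk = 4096; // grain size, tuned per workload
          |       let next = AtomicUsize::new(0);
          |       // One result slot per worker, allocated exactly once.
          |       let mut partials = vec![0u64; threads];
          | 
          |       thread::scope(|s| {
          |           let next = &next;
          |           for slot in partials.iter_mut() {
          |               s.spawn(move || loop {
          |                   // One atomic increment per chunk: no per-task
          |                   // heap allocations, so the allocator never
          |                   // turns into a de-facto global lock.
          |                   let i = next.fetch_add(1, Ordering::Relaxed);
          |                   let start = i * chunk;
          |                   if start >= data.len() { break; }
          |                   let end = (start + chunk).min(data.len());
          |                   *slot += data[start..end].iter().sum::<u64>();
          |               });
          |           }
          |       });
          |       partials.into_iter().sum()
          |   }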
        
       | forrestthewoods wrote:
       | Interesting project. More so than I expected!
       | 
        | I wish it were benchmarked against some "real" work rather than
        | summing integers, though. I find such nano-benchmarks incredibly
        | unreliable and unhelpful.
        
       ___________________________________________________________________
       (page generated 2025-09-28 23:00 UTC)