[HN Gopher] Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork...
___________________________________________________________________
Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union
Author : ashvardanian
Score : 118 points
Date : 2025-09-28 08:53 UTC (14 hours ago)
(HTM) web link (ashvardanian.com)
(TXT) w3m dump (ashvardanian.com)
| seivan wrote:
| Wow, that was a big difference between Rayon and Fork Union.
| But it's still missing convenience APIs for a drop-in
| par_iter().
| ashvardanian wrote:
| Implementing convenience APIs in Rust has been tricky, since a
| lot of the usual memory-safety semantics start to break once
| you push into fast concurrent code that shares buffers across
| threads. My early drafts were riddled with unsafe and barely
| functional, so I'd definitely welcome suggestions for quality-
| of-life improvements.
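| For illustration, the "safe" core of such a convenience API
| can be sketched with scoped threads and chunks_mut; this is a
| hypothetical sketch (par_for_chunks is a made-up name, not
| Fork Union's Rust API), and the hard part is doing the same
| over a persistent, pinned thread pool:
|
|     use std::thread;
|
|     // Run `f` on disjoint chunks of `data`, one chunk per
|     // spawned worker. Scoped threads let each worker borrow
|     // its own &mut chunk without `unsafe` or `Arc`.
|     fn par_for_chunks<T, F>(data: &mut [T], workers: usize, f: F)
|     where
|         T: Send,
|         F: Fn(&mut [T]) + Sync,
|     {
|         let workers = workers.max(1);
|         let chunk = ((data.len() + workers - 1) / workers).max(1);
|         thread::scope(|s| {
|             for piece in data.chunks_mut(chunk) {
|                 s.spawn(|| f(piece));
|             }
|         });
|     }
|
|     fn main() {
|         let mut values: Vec<u64> = (0..1_000_000).collect();
|         par_for_chunks(&mut values, 8, |chunk| {
|             for x in chunk.iter_mut() {
|                 *x *= 2;
|             }
|         });
|         assert_eq!(values[10], 20);
|     }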
| ashvardanian wrote:
| I likely posted this a few months ago, and it looks like it just
| got bumped by the platform. Since then the library has seen some
| improvements (especially if you run on a Linux NUMA box), and the
| GitHub README is probably the best place to start now:
| https://github.com/ashvardanian/fork_union
|
| There are still a few blind spots I'm working on hardening, and
| if you have suggestions, I'd be glad to see them in issues or PRs
| :)
| alextingle wrote:
| How does it compare to Intel's TBB?
| ashvardanian wrote:
| I was asked this a few months back but don't have the
| measurements fresh anymore. In general, I think TBB is one of
| the more thorough and feature-rich parallelism libraries out
| there. That said, I just found a comparable usage example in my
| benchmarks, and it doesn't look like TBB will have the same
| low-latency profile as Fork Union:
| https://github.com/ashvardanian/ParallelReductionsBenchmark/...
| SkiFire13 wrote:
| It's cool to see the end result, but I would prefer if the
| article focused a bit more on how it achieves this. For
| example, how does it dispatch work to the various threads? Do
| they sleep when there's no work to do? If so, how do you wake
| them up? How does it handle work that is not uniformly
| distributed across your work items (i.e. some of them are a
| lot slower to process)? Is that even part of the end goal?
| ashvardanian wrote:
| Yes, non-uniform workloads are supported! See `for_n_dynamic`.
|
| The threads "busy-wait" by running an infinite loop in a lower
| energy state on modern CPUs.
|
| And yes, there are more details in the actual implementation in
| the repository itself. This section, for example, describes the
| atomic variables needed to control all of the logic:
| https://github.com/ashvardanian/fork_union?tab=readme-ov-fil...
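| To make the busy-wait idea concrete, here's a generic spin-
| wait sketch (illustrative only, not Fork Union's internals):
| idle workers spin on a shared counter, and spin_loop() asks
| the CPU for a lower-power spin (e.g. the PAUSE instruction on
| x86):
|
|     use std::hint::spin_loop;
|     use std::sync::atomic::{AtomicU64, Ordering};
|
|     // Idle workers spin on a generation counter: every new
|     // "fork" bumps `epoch`, and a worker returns as soon as
|     // it observes a value newer than the one it last handled.
|     fn wait_for_new_epoch(epoch: &AtomicU64, last_seen: u64) -> u64 {
|         loop {
|             let now = epoch.load(Ordering::Acquire);
|             if now != last_seen {
|                 return now;
|             }
|             spin_loop(); // low-power spin hint
|         }
|     }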
| SkiFire13 wrote:
| > The threads "busy-wait" by running an infinite loop in a
| lower energy state on modern CPUs.
|
| Doesn't that still consume part of the process's time slices
| from the OS scheduler's POV?
| mgaunard wrote:
| I have built many asynchronous and parallel systems with task
| queues.
|
| None of them involve allocation or system calls after
| initialization. Queues are pre-allocated, as they should be.
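| As a rough illustration of that "pre-allocate everything"
| discipline (my sketch, not the commenter's code), a fixed-
| capacity ring buffer allocates once at construction and never
| touches the allocator or the kernel afterwards; synchronization
| for concurrent producers and consumers is omitted for brevity:
|
|     // A fixed-capacity ring buffer: storage is allocated once,
|     // and push/pop never allocate or make system calls.
|     struct Ring<T> {
|         slots: Vec<Option<T>>, // capacity fixed at construction
|         head: usize,           // next slot to pop
|         tail: usize,           // next slot to push
|     }
|
|     impl<T> Ring<T> {
|         fn with_capacity(capacity: usize) -> Self {
|             Ring {
|                 slots: (0..capacity).map(|_| None).collect(),
|                 head: 0,
|                 tail: 0,
|             }
|         }
|
|         fn push(&mut self, item: T) -> Result<(), T> {
|             if self.tail - self.head == self.slots.len() {
|                 return Err(item); // full: caller retries or drops
|             }
|             let idx = self.tail % self.slots.len();
|             self.slots[idx] = Some(item);
|             self.tail += 1;
|             Ok(())
|         }
|
|         fn pop(&mut self) -> Option<T> {
|             if self.head == self.tail {
|                 return None; // empty
|             }
|             let idx = self.head % self.slots.len();
|             let item = self.slots[idx].take();
|             self.head += 1;
|             item
|         }
|     }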
| eska wrote:
| I think thread pools are one of those solved problems that the
| silent majority of C programmers solved ages ago and doesn't
| release open-source projects for. I've also written my own pool
| and allocator together in about 300 lines and always laughed at
| Taskflow, Rayon, etc. Even NUMA is easy with arena allocation.
| When Casey Muratori of the Handmade Network said the same
| thing, I remember agreeing, and he got made fun of for it.
|
| BTW, the parallel-for case can simply be supported by setting a
| pool/global boolean and using it to decide how to wait for a
| new task: during the parallel for, the boolean will be true;
| otherwise, sleep on a mutex in the worst case for energy
| saving.
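| A minimal sketch of that boolean-driven waiting policy (the
| names are made up; this is only an illustration): workers spin
| while a parallel-for is in flight and fall back to a condvar
| sleep otherwise:
|
|     use std::hint::spin_loop;
|     use std::sync::atomic::{AtomicBool, Ordering};
|     use std::sync::{Condvar, Mutex};
|
|     // `busy_phase` is flipped on around a parallel-for, so idle
|     // workers spin briefly; outside that phase they block on a
|     // condvar so the OS can park them and save energy.
|     struct WaitPolicy {
|         busy_phase: AtomicBool, // true while a parallel-for runs
|         has_work: AtomicBool,   // set by the submitter
|         sleep_lock: Mutex<()>,  // only used on the sleeping path
|         wakeup: Condvar,
|     }
|
|     impl WaitPolicy {
|         fn wait_for_work(&self) {
|             if self.busy_phase.load(Ordering::Acquire) {
|                 // Parallel-for in flight: work arrives soon, spin.
|                 while !self.has_work.load(Ordering::Acquire) {
|                     spin_loop();
|                 }
|             } else {
|                 // Idle phase: sleep until new work is signalled.
|                 let mut guard = self.sleep_lock.lock().unwrap();
|                 while !self.has_work.load(Ordering::Acquire) {
|                     guard = self.wakeup.wait(guard).unwrap();
|                 }
|             }
|         }
|
|         fn submit_work(&self) {
|             // Hold the lock so a sleeper can't miss the wakeup
|             // between its check and its wait().
|             let _guard = self.sleep_lock.lock().unwrap();
|             self.has_work.store(true, Ordering::Release);
|             self.wakeup.notify_all();
|         }
|     }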
| ashvardanian wrote:
| I totally agree -- most C/C++ developers with 10+ years of
| experience have built similar thread pools or allocators in
| their own codebases. I'd include myself and a few of my
| former colleagues on that list.
|
| That said, closed-source solutions for local use aren't quite
| the same as an open-source project with wider validation.
| With more third-party usage, reviews, and edge cases, you
| often discover issues you'd never hit in-house. Some of the
| most valuable improvements I've seen have come from external
| bug reports or occasional PRs from people using the code in
| very different environments.
| mgaunard wrote:
| There are many open-source and academic libraries for parallel
| programming with performance similar to or better than
| OpenMP's.
| Ygg2 wrote:
| > When Casey Muratori of the Handmade Network said the same
| thing, I remember agreeing, and he got made fun of for it.
|
| Casey Muratori, while a great programmer in his own right,
| often disregards the use cases, which leads to apples-to-
| oranges comparisons. E.g., why is this ASCII editor from 40
| years ago much faster than this Unicode text editor (with the
| full zero-width-joiner emoji suite)?
| articulatepang wrote:
| I'd love to learn more about this. What
| resources/books/articles/code can I look at to understand
| this more? Or, if you have some time, would you mind
| expanding on it?
|
| The parts I'm specifically interested in:
| 1. What the 300-line pool and allocator look like.
| 2. What this means: "BTW, the parallel-for case can simply be
|    supported by setting a pool/global boolean and using it to
|    decide how to wait for a new task: during the parallel for,
|    the boolean will be true; otherwise, sleep on a mutex in the
|    worst case for energy saving."
|
| Thank you!
| quackzar wrote:
| Any comparison with heartbeat scheduling i.e.,
| https://github.com/judofyr/spice or the rust port
| https://github.com/dragostis/chili ?
| ashvardanian wrote:
| As a rule of thumb, I find Zig projects often come across as
| higher quality than many C, C++, or Rust counterparts -- it's a
| surprisingly strong community. That said, I don't write much in
| Zig myself, haven't explored the projects you linked in detail,
| and wasn't even considering Zig alternatives as I needed
| something for my C++ and Rust projects.
|
| From a first glance, they seem to be tackling a slightly
| different problem: focusing on efficient handling of nested
| parallelism. Fork Union doesn't support nested parallelism at
| all.
| johnisgood wrote:
| I think Ada is great with its builtin constructs for
| concurrency. It helps you avoid data races, too. You can
| formally verify your code as well if you so wish. Ada may be
| too "serious" for people. :D
| felixguendling wrote:
| Would you recommend this as a "thread pool" / coroutine scheduler
| replacement for an application web server?
| ashvardanian wrote:
| Yes, I was planning a similar experiment with UCall
| (https://github.com/unum-cloud/ucall), leveraging the NUMA
| functionality introduced in v2 of Fork Union. I don't currently
| have the right hardware to test it properly, but it would be
| very interesting to measure how pinning behaves on machines
| with multiple NUMA nodes, NICs, and a balanced PCIe topology.
| nextaccountic wrote:
| Tokio actually has some similarities with Rayon. Tokio is used
| in most Rust web servers, like Axum and Actix-web
| ashvardanian wrote:
| That's true -- though in my benchmarks Tokio came out as one
| of the slower parallelism-enabling projects. The article
| still included a comparison:
|
|     $ PARALLEL_REDUCTIONS_LENGTH=1536 \
|         cargo +nightly bench -- --output-format bencher
|     test fork_union ... bench:   5,150 ns/iter (+/- 402)
|     test rayon      ... bench:  47,251 ns/iter (+/- 3,985)
|     test smol       ... bench:  54,931 ns/iter (+/- 10)
|     test tokio      ... bench: 240,707 ns/iter (+/- 921)
|
| ... but I now avoid comparing to Tokio since it doesn't seem
| fair -- fork-join style parallel processing isn't really its
| primary use case.
| jcelerier wrote:
| We've used fork_union in SpatGRIS
| (https://github.com/GRIS-UdeM/SpatGRIS) recently and got a nice
| speedup! That said, there was some trouble with the busy-wait
| eating 100% of the CPU.
| ashvardanian wrote:
| Wow, I didn't realize someone had integrated it into their
| project before I even tried it in mine -- thanks for the trust
| and for sharing!
|
| I completely agree that tuning is needed for better CPU sleep
| scheduling. I'm hoping to look into it this October, ideally on
| some Alder Lake CPUs with mixed Performance/Efficiency cores
| and NUMA enabled.
| jcelerier wrote:
| Haha, when I saw the previous post I thought "this is exactly
| what I need" -- our problem maps 1:1 to what fork_union
| provides and has a low-latency requirement (real-time audio
| DSP).
| _flux wrote:
| Where does the improved performance come from? The project does
| actually outline different factors, but I wonder which of them
| are the biggest ones, or are they all equally important?
|
| And how big a factor is the busy-loop locking? Yes, the code
| tells the CPU to do it energy-efficiently, but it's not going
| to beat the OS if the loop is waiting for more work that isn't
| coming for a while. Is it doing this on every core?
|
| One factor could be that when a subprocess dies, it doesn't
| need to release any memory, as the OS deals with it in one go,
| versus a thread teardown where you need to be neat. Though I
| suppose this would not be a lot of work.
| ashvardanian wrote:
| Compared to Rayon or Taskflow, the biggest initial win is
| cutting out heap allocations for all the promise/result objects
| -- those act like mutexes once the allocator gets hammered by
| many threads.
|
| Hard to rank the rest without a proper breakdown. If I ever
| tried, I'd probably end up writing a paper -- and I'd rather
| write code :)
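| For a rough feel of the allocation difference being described
| here (an illustrative sketch, not Fork Union's or Rayon's
| actual internals), compare returning a boxed result per task
| with writing into partial-sum slots allocated once up front:
|
|     use std::thread;
|
|     // Each task returns a Box: every worker hits the global
|     // allocator, which becomes a contention point under load.
|     fn sum_with_boxes(data: &[u64], workers: usize) -> u64 {
|         let workers = workers.max(1);
|         let chunk = ((data.len() + workers - 1) / workers).max(1);
|         thread::scope(|s| {
|             let handles: Vec<_> = data
|                 .chunks(chunk)
|                 .map(|c| s.spawn(move || Box::new(c.iter().sum::<u64>())))
|                 .collect();
|             handles.into_iter().map(|h| *h.join().unwrap()).sum()
|         })
|     }
|
|     // Partial sums go into a vector allocated once, up front:
|     // the hot path performs no heap allocation at all.
|     fn sum_with_slots(data: &[u64], workers: usize) -> u64 {
|         let workers = workers.max(1);
|         let chunk = ((data.len() + workers - 1) / workers).max(1);
|         let mut partials = vec![0u64; workers];
|         thread::scope(|s| {
|             for (slot, c) in partials.iter_mut().zip(data.chunks(chunk)) {
|                 s.spawn(move || *slot = c.iter().sum::<u64>());
|             }
|         });
|         partials.iter().sum()
|     }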
| forrestthewoods wrote:
| Interesting project. More so than I expected!
|
| I wish it was benchmarked against some "real" work and not
| summing integers though. I find such nano-benches incredibly
| unreliable and unhelpful.
___________________________________________________________________
(page generated 2025-09-28 23:00 UTC)