[HN Gopher] Disruptor-rs: better latency and throughput than cro...
___________________________________________________________________
Disruptor-rs: better latency and throughput than crossbeam
Author : nicholassm83
Score : 130 points
Date : 2024-07-13 13:47 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bluejekyll wrote:
| This is really cool to see. Is anyone potentially working on an
| integration between this and Tokio to bring these performance
| benefits to the async ecosystem? Or maybe I should ask first:
| would it make sense to look at this as a foundational library for
| the multi-thread async frameworks in Rust?
| nXqd wrote:
| Tokio focuses on high throughput by default, since it mostly
| uses a yield_now backoff strategy. That should work for most
| applications.
|
| Latency-sensitive applications tend to have a different goal:
| they mainly trade off CPU and RAM usage for lower latency first
| and throughput second.
| nicholassm83 wrote:
| I agree, the disruptor is more about low latency. And the
| cost is very high: a 100% utilized core. This is a great
| trade-off if you can make money by being faster such as in
| e-trading.
| lordnacho wrote:
| Suppose I have a trading system built on Tokio. How would I
| go about using this instead? What parts need replacing?
|
| Actually looking at the code a bit, it seems like you could
| replace the select statements with the various handlers,
| and hook up some threads to them. It would indeed cook your
| CPU but that's ok for certain use cases.
| nicholassm83 wrote:
| I would love to give you a good answer but I've been
| working on low latency trading systems for a decade so I
| have never used async/actors/fibers/etc. I would think it
| implies a rewrite as async is fundamentally baked into
| your code if you use Tokio.
| lordnacho wrote:
| Depends on what "fundamental" means. If we're talking
| about how stuff is scheduled, then yes of course you're
| right. Either we suspend stuff and take a hit on when to
| continue, or we hot-loop and latency is minimized at the
| cost of cooking a CPU.
|
| But there's a bunch of stuff in the trading system that isn't
| that scheduling part, though. All the code that deals with the
| format of the incoming exchange might still be useful
| somehow. All the internal messages as well might just
| have the same format. The logic of putting events on some
| sort of queue for some other worker (task/thread) to do
| seems pretty similar to me. You are just handling the
| messages immediately rather than waking up a thread for
| it, and that seems to be the tradeoff.
| _3u10 wrote:
| These libs are more about hot paths / cache coherency and
| allowing single CPU processing (no cache coherency issues
| / lock contention) than anything else. That is where the
| performance comes from, referred to as "mechanical
| sympathy" in the original LMAX paper.
|
| Originally computers were expensive and lots of users wanted
| to share a system, so a lot of OS design thought went into
| that. LMAX flips the script: computers are cheap, and you want
| the computer doing one thing as fast as possible, which isn't
| a good fit for modern OSes that have been designed around the
| exact opposite idea. This is also why bare metal is many times
| faster than VMs in practice: you aren't sharing someone else's
| computer with a bunch of other programs polluting the cache.
| lordnacho wrote:
| Yeah, I agree. But the ideas of mechanical sympathy carry
| over into more than one kind of design. You can still be
| thinking about caches and branch prediction while writing
| things in async. It's just the awareness of it that
| allows you to make the tradeoffs you care about.
| _3u10 wrote:
| High throughput networking does the same thing, it polls
| the network adapter rather than waiting for interrupts.
|
| The cost is not high; it's much less expensive to have a CPU
| operating efficiently than to have it processing nothing
| because it's syncing caches / context switching to handle an
| interrupt.
|
| These libraries are for busy systems, not systems waiting
| 30 minutes for the next request to come in.
|
| Basically, in an underutilized system, most of the time you
| poll there is nothing, so polling wastes CPU; in a high
| throughput system, when you poll there is almost ALWAYS data
| ready to be read, so interrupts are less efficient when
| utilization is high.
| nine_k wrote:
| Running half the cores of an industrial Xeon or Zen under
| 100% load implies very serious cooling. I suspect that
| running them all at 100% load for hours is just
| infeasible without e.g. water cooling.
| wmf wrote:
| Nah, it will just clock down. Server CPUs are designed to
| support all cores at 100% utilization indefinitely.
|
| Of course you can get different numbers if you invent a
| nonstandard definition of utilization.
| nine_k wrote:
| Of course server CPUs can run all cores at 100%
| indefinitely, as long as the cooling can handle it.
|
| With 300W to 400W TDP (Xeon Sapphire 9200) and two CPUs
| per typical 2U case, cooling is a real challenge, hence
| my mention of water cooling.
| wmf wrote:
| I disagree. Air cooling 1 KW per U is a commodity now.
| It's nothing special. (Whether your data center can
| handle it is another topic.)
| kprotty wrote:
| Tokio's focus is on low _tail_ -latencies for networking
| applications (as mentioned). But it doesn't employ yield_now
| for waiting on a concurrent condition to occur, even as a
| backoff strategy, as that fundamentally kills tail latency
| under the average OS scheduler.
| pca006132 wrote:
| Probably doesn't make sense. Busy wait is fast when you can
| dedicate a core to the task, but this means that you cannot
| have many tasks in parallel with a small set of physical cores.
| When you oversubscribe, performance will quickly degrade.
|
| Tokio and other libraries such as pthreads allow a thread to
| wait for something and wake that particular thread when the
| event occurs. This is what allows the scheduler to schedule
| many tasks onto a very small set of cores without running
| useless instructions checking for status.
|
| For a foundational library, I think you want things that are
| composable, and low-latency stuff is not that composable IMO.
|
| Not saying that they are bad, but low latency is something
| that requires a global effort in your system, and using such a
| library without being aware of these limitations will likely
| cause more harm than good.
| andrepd wrote:
| How would you? These are completely at odds: async is about
| suspending tasks that are waiting for something so that you can
| do other stuff in the meantime, and low-latency is about
| spinning a core at 100% to start working as fast as possible
| when the stuff you're waiting for arrives. You can't do both x)
| _3u10 wrote:
| No, not really; this is for synchronous processing. The events
| get overwritten, so by the time your async handler fires
| you're processing an item that has mutated.
|
| What you're looking for is io_uring on Linux or IOCP on
| Windows, I don't think osx has something similar, maybe kqueue.
| karmakaze wrote:
| I played around with the original (Java) LMAX disruptor which was
| an interesting and different way to achieve latency/throughput.
| Didn't find a whitepaper--here are some references[0], which
| include a Martin Fowler post[1].
|
| [0] https://github.com/LMAX-Exchange/disruptor/wiki/Blogs-And-
| Ar...
|
| [1] https://martinfowler.com/articles/lmax.html
| temporarely wrote:
| here you go:
|
| _Disruptor_ , Thompson, Farley, et al 2011
|
| https://lmax-exchange.github.io/disruptor/files/Disruptor-1....
| alchemist1e9 wrote:
| Is there anything specific to Rust that this library does which
| modern C++ can't match in performance? I'd be very interested to
| understand if there is.
| slashdev wrote:
| No, there shouldn't be.
|
| Rust is not magic and you can compile both with llvm (clang++).
|
| If you specify that the pointers don't alias, and don't use any
| language sugar that adds overhead on either side, the
| performance will be very similar.
| nicholassm83 wrote:
| I agree.
|
| The Rust implementation even needs to use a few unsafe blocks
| (to work with UnsafeCells internally) but is mostly safe
| code. Other than that you can achieve the same in C++. But I
| think the real benefit is that you can write the rest of your
| code in safe Rust.
| alchemist1e9 wrote:
| Unless the rest of your code is already in C++ and you're
| interested in this new, better disruptor implementation;
| that's probably a common situation for people interested in
| this topic. Any recommendations for those in that situation?
| Perhaps existing C++ implementations already match this, I
| don't know.
| bluejekyll wrote:
| While you're not explicitly saying this, C++, in Rust's terms,
| is all unsafe. In a multi-threading context like this, that's
| even more important.
| nicholassm83 wrote:
| I'm trying to be polite. :-) And there is a lot of great
| C++ code and developers out there - especially in the
| e-trading/HFT space.
| pornel wrote:
| For Rust users there's a significant difference:
|
| * it's a Cargo package, which is trivial to add to a project.
| Pure-Rust projects are easier to build cross-platform.
|
| * It exports a safe Rust interface. It has configurable levels
| of thread safety, which are protected from misuse at compile
| time.
|
| The point isn't whether C++ can match the performance, but
| that you don't have to use C++ to get that performance, plus
| other niceties.
|
| This is "is there anything specific to C++ that assembly can't
| match in performance?" one step removed.
| alchemist1e9 wrote:
| I had expected that's true. You just never know if perhaps
| Rust compilers have some more advanced/modern tricks that can
| only be accessed easily by writing in Rust without writing
| assembly directly.
| pornel wrote:
| There is a trick in truly exclusive references (marked
| noalias in LLVM). C++ doesn't even have the lesser form of
| C restrict pointers. However, a truly performance-focused C
| or C++ library would tweak the code to get the desired
| optimizations one way or another.
|
| A more nebulous Rust perf thing is the ability to rely on the
| compiler to check lifetimes and immutability/exclusivity of
| pointers. This allows using fine-grained multithreading,
| even with 3rd party code, without the worry it's going to
| cause heisenbugs. It allows library APIs to work with
| temporary complex references that would be footguns
| otherwise (e.g. prefer string_view instead of string; don't
| copy inputs defensively, because it's known they can't be
| mutated or freed even by a broken caller).
| BrokrnAlgorithm wrote:
| Is there also a decent c++ implementation of the disruptor out
| there?
| LtdJorge wrote:
| So nice: I was just reading about the disruptor, since I had
| an idea of using ring buffers with atomic operations to back
| Rust channels with lower latency for inter-thread
| communication without locks, and now I see this. Gonna take a
| read!
___________________________________________________________________
(page generated 2024-07-13 23:00 UTC)