[HN Gopher] Disruptor-rs: better latency and throughput than cro...
       ___________________________________________________________________
        
       Disruptor-rs: better latency and throughput than crossbeam
        
       Author : nicholassm83
       Score  : 130 points
       Date   : 2024-07-13 13:47 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bluejekyll wrote:
       | This is really cool to see. Is anyone potentially working on an
        | integration between this and Tokio to bring these performance
       | benefits to the async ecosystem? Or maybe I should ask first,
       | would it make sense to look at this as a foundational library for
       | the multi-thread async frameworks in Rust?
        
         | nXqd wrote:
          | Tokio focuses on high throughput by default, since it mostly
          | uses a yield_now backoff strategy. That should work for most
          | applications.
          | 
          | Latency-sensitive applications tend to have a different goal:
          | they mainly trade CPU and RAM usage for low latency first and
          | throughput second.
        
           | nicholassm83 wrote:
           | I agree, the disruptor is more about low latency. And the
           | cost is very high: a 100% utilized core. This is a great
           | trade-off if you can make money by being faster such as in
           | e-trading.
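            | To make the cost concrete, the waiting side of a busy-spin
            | strategy looks roughly like this (an illustrative sketch, not
            | the crate's actual code): a dedicated thread spins on an
            | atomic sequence counter until the producer publishes
            | something new, so the core never sleeps.
            | 
            |     use std::sync::Arc;
            |     use std::sync::atomic::{AtomicU64, Ordering};
            |     use std::thread;
            |     
            |     fn main() {
            |         // Sequence number of the last published event.
            |         let published = Arc::new(AtomicU64::new(0));
            |     
            |         let consumer = {
            |             let published = Arc::clone(&published);
            |             thread::spawn(move || {
            |                 let mut seen = 0u64;
            |                 while seen < 1_000_000 {
            |                     // Busy-spin: keeps the core at 100%
            |                     // instead of parking the thread.
            |                     while published.load(Ordering::Acquire) == seen {
            |                         std::hint::spin_loop();
            |                     }
            |                     seen += 1; // "process" event number `seen`
            |                 }
            |             })
            |         };
            |     
            |         for seq in 1..=1_000_000u64 {
            |             published.store(seq, Ordering::Release);
            |         }
            |         consumer.join().unwrap();
            |     }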
        
             | lordnacho wrote:
              | Suppose I have a trading system built on Tokio. How would I
             | go about using this instead? What parts need replacing?
             | 
             | Actually looking at the code a bit, it seems like you could
             | replace the select statements with the various handlers,
             | and hook up some threads to them. It would indeed cook your
             | CPU but that's ok for certain use cases.
        
               | nicholassm83 wrote:
               | I would love to give you a good answer but I've been
               | working on low latency trading systems for a decade so I
               | have never used async/actors/fibers/etc. I would think it
               | implies a rewrite as async is fundamentally baked into
               | your code if you use Tokio.
        
               | lordnacho wrote:
               | Depends on what "fundamental" means. If we're talking
               | about how stuff is scheduled, then yes of course you're
               | right. Either we suspend stuff and take a hit on when to
               | continue, or we hot-loop and latency is minimized at the
               | cost of cooking a CPU.
               | 
                | But there's a bunch of stuff in a trading system that
                | isn't that part. All the code that deals with the format
                | of the incoming exchange data might still be useful
                | somehow. All the internal messages might keep the same
                | format as well. The logic of putting events on some sort
                | of queue for some other worker (task/thread) to handle
                | seems pretty similar to me. You are just handling the
                | messages immediately rather than waking up a thread for
                | them, and that seems to be the tradeoff.
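                | Roughly what I mean, as a sketch (made-up types, nothing
                | from the actual crate): the handler code stays the same,
                | only the way the worker waits for the next message
                | changes.
                | 
                |     use std::sync::mpsc;
                |     use std::thread;
                |     
                |     struct MarketData { price: f64 }
                |     
                |     // This part survives a rewrite either way.
                |     fn handle(md: &MarketData) {
                |         if md.price > 100.0 { /* send an order, etc. */ }
                |     }
                |     
                |     fn main() {
                |         let (tx, rx) = mpsc::channel::<MarketData>();
                |     
                |         // Blocking worker: the thread sleeps until
                |         // woken, like an async task awaiting a channel.
                |         // A disruptor-style version would replace recv()
                |         // with a busy-spin on a ring buffer and call the
                |         // same handle() immediately.
                |         let worker = thread::spawn(move || {
                |             while let Ok(md) = rx.recv() {
                |                 handle(&md);
                |             }
                |         });
                |     
                |         tx.send(MarketData { price: 101.5 }).unwrap();
                |         drop(tx); // close the channel so the worker exits
                |         worker.join().unwrap();
                |     }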
        
               | _3u10 wrote:
               | These libs are more about hot paths / cache coherency and
               | allowing single CPU processing (no cache coherency issues
               | / lock contention) than anything else. That is where the
               | performance comes from, referred to as "mechanical
               | sympathy" in the original LMAX paper.
               | 
                | Originally computers were expensive and lots of users
                | wanted to share a system, so a lot of OS design went into
                | that. LMAX flips the script: computers are cheap, and you
                | want the computer doing one thing as fast as possible,
                | which isn't a good fit for modern OSes that have been
                | designed around the exact opposite idea. This is also why
                | bare metal is many times faster than VMs in practice:
                | you aren't sharing someone else's computer with a bunch
                | of other programs polluting the cache.
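                | One small, concrete example of that sympathy
                | (illustrative Rust, not code from this library): keep the
                | producer's and consumer's counters on separate cache
                | lines so the two cores don't keep invalidating each
                | other's caches (false sharing).
                | 
                |     use std::mem::size_of;
                |     use std::sync::atomic::AtomicU64;
                |     
                |     // Force 64-byte alignment, the usual cache line
                |     // size on x86-64.
                |     #[repr(align(64))]
                |     struct CachePadded<T>(T);
                |     
                |     struct Cursors {
                |         // Written by the producer core only.
                |         produced: CachePadded<AtomicU64>,
                |         // Written by the consumer core only.
                |         consumed: CachePadded<AtomicU64>,
                |     }
                |     
                |     fn main() {
                |         // Each counter occupies its own cache line, so
                |         // a store by one core doesn't evict the line the
                |         // other core is reading.
                |         println!("{} bytes", size_of::<Cursors>()); // 128
                |     }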
        
               | lordnacho wrote:
               | Yeah, I agree. But the ideas of mechanical sympathy carry
               | over into more than one kind of design. You can still be
               | thinking about caches and branch prediction while writing
               | things in async. It's just the awareness of it that
               | allows you to make the tradeoffs you care about.
        
             | _3u10 wrote:
             | High throughput networking does the same thing, it polls
             | the network adapter rather than waiting for interrupts.
             | 
              | The cost is not high: it's much cheaper to have a CPU
              | operating efficiently than to have it processing nothing
              | because it's syncing caches / context switching to handle
              | an interrupt.
             | 
             | These libraries are for busy systems, not systems waiting
             | 30 minutes for the next request to come in.
             | 
              | Basically, in an underutilized system, most of the time
              | you poll there is nothing there, so the poll wastes CPU;
              | in a high throughput system, when you poll there is almost
              | ALWAYS data ready to be read, so interrupts are less
              | efficient when utilization is high.
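              | The same idea in userspace terms (an illustrative sketch,
              | not how a NIC driver is written): the polling consumer only
              | "wastes" iterations when the queue is empty; on a busy
              | system nearly every poll finds data, so no wakeup cost is
              | paid.
              | 
              |     use std::sync::mpsc::{self, TryRecvError};
              |     use std::thread;
              |     
              |     fn main() {
              |         let (tx, rx) = mpsc::channel::<u64>();
              |     
              |         let poller = thread::spawn(move || {
              |             let mut sum = 0u64;
              |             loop {
              |                 match rx.try_recv() {
              |                     // Busy system: the common case, so the
              |                     // poll almost always does useful work.
              |                     Ok(v) => sum += v,
              |                     // Idle system: this branch burns CPU.
              |                     Err(TryRecvError::Empty) => {
              |                         std::hint::spin_loop();
              |                     }
              |                     Err(TryRecvError::Disconnected) => break,
              |                 }
              |             }
              |             sum
              |         });
              |     
              |         for v in 0..1_000u64 {
              |             tx.send(v).unwrap();
              |         }
              |         drop(tx); // producer gone: the poller exits
              |         println!("sum = {}", poller.join().unwrap());
              |     }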
        
               | nine_k wrote:
               | Running half the cores of an industrial Xeon or Zen under
               | 100% load implies very serious cooling. I suspect that
               | running them all at 100% load for hours is just
               | infeasible without e.g. water cooling.
        
               | wmf wrote:
               | Nah, it will just clock down. Server CPUs are designed to
               | support all cores at 100% utilization indefinitely.
               | 
               | Of course you can get different numbers if you invent a
               | nonstandard definition of utilization.
        
               | nine_k wrote:
               | Of course server CPUs can run all cores at 100%
               | indefinitely, as long as the cooling can handle it.
               | 
               | With 300W to 400W TDP (Xeon Sapphire 9200) and two CPUs
               | per typical 2U case, cooling is a real challenge, hence
               | my mention of water cooling.
        
               | wmf wrote:
               | I disagree. Air cooling 1 KW per U is a commodity now.
               | It's nothing special. (Whether your data center can
               | handle it is another topic.)
        
           | kprotty wrote:
            | Tokio's focus is on low _tail_ latencies for networking
            | applications (as mentioned). But it doesn't employ yield_now
            | for waiting on a concurrent condition to occur, even as a
            | backoff strategy, as that fundamentally kills tail latency
            | under the average OS scheduler.
        
         | pca006132 wrote:
         | Probably doesn't make sense. Busy wait is fast when you can
         | dedicate a core to the task, but this means that you cannot
         | have many tasks in parallel with a small set of physical cores.
         | When you oversubscribe, performance will quickly degrade.
         | 
          | Tokio and other libraries such as pthread allow a thread to
          | wait for something and wake that particular thread up when the
          | event occurs. This is what allows the scheduler to schedule
          | many tasks onto a very small set of cores without running
          | useless instructions checking for status.
          | 
          | For a foundational library, I think you want things that are
          | composable, and low latency stuff is not that composable IMO.
          | 
          | Not saying that they are bad, but low latency is something
          | that requires a global effort in your system, and using such a
          | library without being aware of these limitations will likely
          | cause more harm than good.
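          | The wait-and-wake approach in its simplest form (an
          | illustrative sketch, not Tokio's or pthread's internals): the
          | waiting thread parks and uses no CPU until the other side
          | wakes exactly that thread up.
          | 
          |     use std::sync::Arc;
          |     use std::sync::atomic::{AtomicBool, Ordering};
          |     use std::thread;
          |     
          |     fn main() {
          |         let ready = Arc::new(AtomicBool::new(false));
          |     
          |         let waiter = {
          |             let ready = Arc::clone(&ready);
          |             thread::spawn(move || {
          |                 // Park until the flag is set; the loop guards
          |                 // against spurious wakeups. While parked, the
          |                 // thread consumes no CPU, so many such tasks
          |                 // can share a few cores.
          |                 while !ready.load(Ordering::Acquire) {
          |                     thread::park();
          |                 }
          |                 println!("event observed");
          |             })
          |         };
          |     
          |         // The event occurs: set the flag, then wake that
          |         // particular thread.
          |         ready.store(true, Ordering::Release);
          |         waiter.thread().unpark();
          |         waiter.join().unwrap();
          |     }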
        
         | andrepd wrote:
         | How would you? These are completely at odds: async is about
         | suspending tasks that are waiting for something so that you can
         | do other stuff in the meantime, and low-latency is about
         | spinning a core at 100% to start working as fast as possible
         | when the stuff you're waiting for arrives. You can't do both x)
        
         | _3u10 wrote:
          | No, not really. This is for synchronous processing; the events
          | get overwritten, so by the time your async handler fires
          | you're processing an item that has mutated.
          | 
          | What you're looking for is io_uring on Linux or IOCP on
          | Windows. I don't think macOS has something similar, maybe
          | kqueue.
        
       | karmakaze wrote:
        | I played around with the original (Java) LMAX disruptor, which
        | was an interesting and different way to achieve low latency and
        | high throughput. Didn't find a whitepaper--here are some
        | references[0], which include a Martin Fowler post[1].
       | 
       | [0] https://github.com/LMAX-Exchange/disruptor/wiki/Blogs-And-
       | Ar...
       | 
       | [1] https://martinfowler.com/articles/lmax.html
        
         | temporarely wrote:
         | here you go:
         | 
          | _Disruptor_, Thompson, Farley, et al., 2011
         | 
         | https://lmax-exchange.github.io/disruptor/files/Disruptor-1....
        
       | alchemist1e9 wrote:
       | Is there anything specific to Rust that this library does which
       | modern C++ can't match in performance? I'd be very interested to
       | understand if there is.
        
         | slashdev wrote:
         | No, there shouldn't be.
         | 
         | Rust is not magic and you can compile both with llvm (clang++).
         | 
         | If you specify that the pointers don't alias, and don't use any
         | language sugar that adds overhead on either side, the
         | performance will be very similar.
        
           | nicholassm83 wrote:
           | I agree.
           | 
           | The Rust implementation even needs to use a few unsafe blocks
           | (to work with UnsafeCells internally) but is mostly safe
           | code. Other than that you can achieve the same in C++. But I
           | think the real benefit is that you can write the rest of your
           | code in safe Rust.
        
             | alchemist1e9 wrote:
              | Unless the rest of your code is already in C++ and you're
              | interested in this new, better disruptor
              | implementation--probably a common situation for people
              | interested in this topic. Any recommendations for those in
              | that situation? Perhaps existing C++ implementations
              | already match this, idk.
        
             | bluejekyll wrote:
              | While you're not explicitly saying this, C++, in Rust's
              | terms, is all unsafe. In a multi-threading context like
              | this, that's even more important.
        
               | nicholassm83 wrote:
               | I'm trying to be polite. :-) And there is a lot of great
               | C++ code and developers out there - especially in the
               | e-trading/HFT space.
        
         | pornel wrote:
         | For Rust users there's a significant difference:
         | 
         | * it's a Cargo package, which is trivial to add to a project.
         | Pure-Rust projects are easier to build cross-platform.
         | 
         | * It exports a safe Rust interface. It has configurable levels
         | of thread safety, which are protected from misuse at compile
         | time.
         | 
          | The point isn't whether C++ can match the performance, but
          | that you don't have to use C++ to get the performance, plus
          | other niceties.
         | 
         | This is "is there anything specific to C++ that assembly can't
         | match in performance?" one step removed.
        
           | alchemist1e9 wrote:
            | I had expected that to be true. You just never know if perhaps
           | Rust compilers have some more advanced/modern tricks that can
           | only be accessed easily by writing in Rust without writing
           | assembly directly.
        
             | pornel wrote:
             | There is a trick in truly exclusive references (marked
             | noalias in LLVM). C++ doesn't even have the lesser form of
             | C restrict pointers. However, a truly performance focused C
             | or C++ library would tweak the code to get the desired
             | optimizations one way or another.
             | 
              | A more nebulous Rust perf thing is the ability to rely on
              | the compiler to check lifetimes and the immutability or
              | exclusivity of pointers. This allows using fine-grained
              | multithreading, even with 3rd party code, without the
              | worry that it's going to cause heisenbugs. It allows
              | library APIs to work with temporary complex references
              | that would otherwise be footguns (e.g. prefer string_view
              | instead of string, and don't copy inputs defensively,
              | because it's known they can't be mutated or freed even by
              | a broken caller).
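              | A tiny example of the aliasing part (illustrative; whether
              | it changes codegen depends on the compiler version): dst
              | is &mut, so the compiler knows src can't overlap it,
              | roughly what restrict promises in C, except here it's
              | checked at the call site rather than trusted.
              | 
              |     // `dst` is exclusive (&mut) and `src` is shared (&),
              |     // so LLVM can treat them as noalias: `*s` never has
              |     // to be reloaded after a store through `d`.
              |     fn scale_into(dst: &mut [f64], src: &[f64], k: f64) {
              |         for (d, s) in dst.iter_mut().zip(src) {
              |             *d = *s * k;
              |         }
              |     }
              |     
              |     fn main() {
              |         let src = vec![1.0, 2.0, 3.0];
              |         let mut dst = vec![0.0; 3];
              |         scale_into(&mut dst, &src, 10.0);
              |         println!("{:?}", dst); // [10.0, 20.0, 30.0]
              |     
              |         // The borrow checker rejects aliasing at the call
              |         // site, so this would not compile:
              |         // scale_into(&mut dst, &dst, 10.0);
              |     }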
        
       | BrokrnAlgorithm wrote:
       | Is there also a decent c++ implementation of the disruptor out
       | there?
        
       | LtdJorge wrote:
        | So nice: I was just reading about the disruptor, since I had an
        | idea of using ring buffers with atomic operations to back Rust
        | channels with lower latency for inter-thread communication
        | without locks, and now I see this. Gonna take a read!
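        | The rough shape of that idea, as a toy sketch (single producer,
        | single consumer only; not this crate's implementation): two
        | monotonic counters, a fixed array of slots, and acquire/release
        | atomics instead of locks.
        | 
        |     use std::cell::UnsafeCell;
        |     use std::sync::Arc;
        |     use std::sync::atomic::{AtomicUsize, Ordering};
        |     use std::thread;
        |     
        |     struct Ring<T> {
        |         slots: Vec<UnsafeCell<Option<T>>>,
        |         head: AtomicUsize, // items written (producer only)
        |         tail: AtomicUsize, // items read (consumer only)
        |     }
        |     
        |     // Sound only because exactly one thread pushes and exactly
        |     // one thread pops.
        |     unsafe impl<T: Send> Sync for Ring<T> {}
        |     
        |     impl<T> Ring<T> {
        |         fn new(capacity: usize) -> Self {
        |             Ring {
        |                 slots: (0..capacity)
        |                     .map(|_| UnsafeCell::new(None))
        |                     .collect(),
        |                 head: AtomicUsize::new(0),
        |                 tail: AtomicUsize::new(0),
        |             }
        |         }
        |     
        |         fn push(&self, value: T) {
        |             let head = self.head.load(Ordering::Relaxed);
        |             // Spin while full (producer a whole lap ahead).
        |             while head - self.tail.load(Ordering::Acquire)
        |                 == self.slots.len()
        |             {
        |                 std::hint::spin_loop();
        |             }
        |             let idx = head % self.slots.len();
        |             unsafe { *self.slots[idx].get() = Some(value) };
        |             self.head.store(head + 1, Ordering::Release);
        |         }
        |     
        |         fn pop(&self) -> T {
        |             let tail = self.tail.load(Ordering::Relaxed);
        |             // Spin while empty.
        |             while self.head.load(Ordering::Acquire) == tail {
        |                 std::hint::spin_loop();
        |             }
        |             let idx = tail % self.slots.len();
        |             let value =
        |                 unsafe { (*self.slots[idx].get()).take().unwrap() };
        |             self.tail.store(tail + 1, Ordering::Release);
        |             value
        |         }
        |     }
        |     
        |     fn main() {
        |         let ring = Arc::new(Ring::new(1024));
        |         let consumer = {
        |             let ring = Arc::clone(&ring);
        |             thread::spawn(move || {
        |                 (0..100_000u64).map(|_| ring.pop()).sum::<u64>()
        |             })
        |         };
        |         for i in 0..100_000u64 {
        |             ring.push(i);
        |         }
        |         println!("sum = {}", consumer.join().unwrap());
        |     }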
        
       ___________________________________________________________________
       (page generated 2024-07-13 23:00 UTC)