[HN Gopher] Axboe Achieves 8M IOPS Per-Core with Newest Linux Op...
       ___________________________________________________________________
        
       Axboe Achieves 8M IOPS Per-Core with Newest Linux Optimization
       Patches
        
       Author : marcodiego
       Score  : 98 points
        Date   : 2021-10-17 04:05 UTC (1 day ago)
        
 (HTM) web link (www.phoronix.com)
 (TXT) w3m dump (www.phoronix.com)
        
       | Aissen wrote:
       | That's 125 ns per IO, on a single CPU core, with two devices.
       | Mind boggling.
        
         | megous wrote:
          | Is the IO in 512B sectors here, or 4KiB?
          | 
          | EDIT: 512, looking at the screenshot.
        
         | anarazel wrote:
          | It's worth noting that there's some batching involved, so it's
          | not 125ns for processing one IO, then again 125ns for the
          | next. Both combining some of the work and superscalar
          | execution are necessary to reach numbers this high...
        
           | benlwalker wrote:
           | To expand on the batching details, I haven't seen exactly
           | what he's doing here, but historically the numbers quoted are
           | from benchmarks that submit something like 128 total queue
           | depth in alternating batches of 64. The disks hit max
            | performance with a much lower queue depth than 64. So it's
            | unrealistically perfect batching that you'll never see in a
            | real application. The workload is only a useful tool for
            | optimizing the code, which is exactly what he's using it for.
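            | 
            | For concreteness, here's a minimal sketch of that kind of
            | alternating-batch workload using liburing. The device path,
            | batch size, and offset range are illustrative assumptions
            | (not the actual test rig), and error handling is trimmed:
            | 
            |     /* Keep ~128 random 512B reads in flight, submitting
            |      * and reaping in batches of 64. */
            |     #define _GNU_SOURCE /* O_DIRECT */
            |     #include <fcntl.h>
            |     #include <stdlib.h>
            |     #include <liburing.h>
            | 
            |     #define BATCH 64
            |     #define BLKSZ 512
            | 
            |     static void submit_batch(struct io_uring *ring, int fd,
            |                              void *buf)
            |     {
            |         for (int i = 0; i < BATCH; i++) {
            |             struct io_uring_sqe *sqe =
            |                 io_uring_get_sqe(ring);
            |             /* One shared buffer: the data is discarded,
            |              * as in a pure IOPS benchmark. */
            |             io_uring_prep_read(sqe, fd, buf, BLKSZ,
            |                 (rand() % (1 << 20)) * (__u64)BLKSZ);
            |         }
            |         io_uring_submit(ring); /* one syscall for all 64 */
            |     }
            | 
            |     int main(void)
            |     {
            |         struct io_uring ring;
            |         struct io_uring_cqe *cqe;
            |         void *buf;
            |         int fd = open("/dev/nvme0n1",
            |                       O_RDONLY | O_DIRECT);
            | 
            |         posix_memalign(&buf, 4096, BLKSZ);
            |         io_uring_queue_init(2 * BATCH, &ring, 0);
            | 
            |         /* Prime the device to a queue depth of 128... */
            |         submit_batch(&ring, fd, buf);
            |         submit_batch(&ring, fd, buf);
            | 
            |         for (;;) {
            |             /* ...then alternate: reap 64, submit 64, so
            |              * the queue never drains below 64. */
            |             for (int i = 0; i < BATCH; i++) {
            |                 io_uring_wait_cqe(&ring, &cqe);
            |                 io_uring_cqe_seen(&ring, cqe);
            |             }
            |             submit_batch(&ring, fd, buf);
            |         }
            |     }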
        
             | anarazel wrote:
             | I think you need batching for the nvme doorbell alone at
             | anything close to these kinds of rates...
             | 
              | For some workloads it's not that hard to have queues that
              | deep. What's harder is knowing when to use them and when
              | not to. There's really not enough information available to
              | make any of this self-tuning.
        
             | wtallis wrote:
             | The Intel Optane drives he's testing with hit max
             | performance at pretty low queue depths, but flash-based
             | SSDs with similar throughput will require those high queue
             | depths.
        
         | guerrilla wrote:
          | Faster than DRAM from the '90s.
        
           | terafo wrote:
           | Modern SSDs generally have more bandwidth than DDR2.
        
           | formerly_proven wrote:
            | The access time has actually not changed a lot; it's just
            | that transferring the smallest read a contemporary processor
            | would issue - one cache line (~64 bytes) - at somewhere
            | between 500-1000 MB/s already takes around 100 ns (64 B /
            | 640 MB/s = 100 ns). It's that transfer-related latency that
            | has been reduced drastically.
        
             | guerrilla wrote:
              | Didn't /RAS timing drop from like 120ns to 10ns too? That's
              | what I was thinking about. Then DDR came along, and the bus
              | is also 16 times wider.
        
         | devit wrote:
         | At around 4 GHz and 2 IPC, it's about 1000 CPU instructions per
         | I/O.
         | 
          | Assuming you have a hardware DMA ring buffer for I/O that is
          | directly mapped into userspace, the only thing that is really
          | needed is to write the operation type, size, disk position and
          | memory position to the ring buffer, update the buffer position
          | and check for flush - doable in around 8 CISC instructions
          | (plus the slowpath), so around 100x inefficient.
          | 
          | Without the direct-mapped ring buffer and with a filesystem,
          | you need a kernel to translate from the uring to the hardware
          | ring buffer, and here it still seems around 10x inefficient, as
          | around 100 instructions should be enough to do the translation
          | (assuming pages are already mapped in the IOMMU, that you have
          | the file block map in cache, and that the whole system is
          | architected to maximize the efficiency of this operation).
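          | 
          | To make that 8-instruction fast path concrete, here is a
          | hypothetical sketch. The entry layout and names are invented
          | for illustration - this is not the NVMe or io_uring format:
          | 
          |     #include <stdint.h>
          | 
          |     /* Hypothetical user-mapped submission ring. */
          |     struct ring_entry {
          |         uint32_t op;       /* operation type          */
          |         uint32_t len;      /* transfer size in bytes  */
          |         uint64_t disk_pos; /* device offset           */
          |         uint64_t mem_pos;  /* DMA address of buffer   */
          |     };
          | 
          |     struct ring {
          |         struct ring_entry *entries;  /* mapped ring   */
          |         uint32_t mask;               /* ring size - 1 */
          |         uint32_t tail;               /* producer pos  */
          |         volatile uint32_t *doorbell; /* device tail   */
          |     };
          | 
          |     static inline void submit(struct ring *r, uint32_t op,
          |         uint32_t len, uint64_t disk_pos, uint64_t mem_pos)
          |     {
          |         struct ring_entry *e =
          |             &r->entries[r->tail & r->mask];
          | 
          |         /* Four stores for the command itself... */
          |         e->op = op;
          |         e->len = len;
          |         e->disk_pos = disk_pos;
          |         e->mem_pos = mem_pos;
          | 
          |         /* ...plus the tail update and doorbell write (the
          |          * "check for flush" would batch this in practice). */
          |         r->tail++;
          |         *r->doorbell = r->tail;
          |     }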
        
       | reacharavindh wrote:
        | Does anyone know whether these optimisations in the block layer
        | and io_uring benefit major IO hitters like PostgreSQL, ZFS, NFS
        | writes, etc.? Or will it take several years and monumental
        | effort before they adopt io_uring?
        
         | wtallis wrote:
         | io_uring is an improved interface between userspace and the
         | kernel, so it doesn't provide any direct benefits to in-kernel
         | filesystem operations, and the performance benefits io_uring
         | does provide should be largely filesystem-agnostic. That said,
         | some of these recent optimizations may be low enough in the
         | kernel's io stack to also benefit io originating within the
         | kernel itself.
         | 
         | Userspace applications that already have some support for
         | asynchronous disk IO (either through the old libaio APIs or as
         | a cleanly-abstracted thread pool) should be able to switch to
         | using io_uring as their backend without too much trouble, and
         | reap the benefits of async that actually works reliably (if
         | switching from libaio) and with vastly lower overhead (if
         | switching from a thread pool). Databases like PostgreSQL were
         | some of the few applications that attempted to deal with the
         | limitations of libaio, but I'm not sure how close they are to
         | having a production-quality io_uring backend.
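          | 
          | For a sense of the API shape such a backend targets, a minimal
          | single-read example with liburing looks roughly like this (the
          | file name is an assumption; error handling omitted):
          | 
          |     #include <fcntl.h>
          |     #include <stdio.h>
          |     #include <liburing.h>
          | 
          |     int main(void)
          |     {
          |         struct io_uring ring;
          |         struct io_uring_cqe *cqe;
          |         char buf[4096];
          |         int fd = open("data.bin", O_RDONLY);
          | 
          |         io_uring_queue_init(8, &ring, 0);
          | 
          |         /* Queue one async read; submit with one syscall. */
          |         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          |         io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
          |         io_uring_submit(&ring);
          | 
          |         /* Wait for the completion and consume it. */
          |         io_uring_wait_cqe(&ring, &cqe);
          |         printf("read returned %d\n", cqe->res);
          |         io_uring_cqe_seen(&ring, cqe);
          | 
          |         io_uring_queue_exit(&ring);
          |         return 0;
          |     }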
        
           | sumtechguy wrote:
           | From an earlier article
           | https://www.phoronix.com/scan.php?page=news_item&px=Linux-
           | Ap...
           | 
           | "His patches pushing the greater performance have been
           | changes to the block code, NVMe, multi-queue blk-mq, and
           | IO_uring." https://git.kernel.dk/cgit/linux-
           | block/log/?h=perf-wip
           | 
           | So it looks like he is playing with a good portion of the IO
           | block stack with a very recent concentration on io_uring. So
           | maybe some of it?...
        
       | terafo wrote:
       | He just hit 9M
       | https://twitter.com/axboe/status/1450188650852065291
        
       | junon wrote:
        | I know this will be downvoted, but I've learned to be a bit
        | skeptical of Axboe's monumental claims. There have been some
        | unfair benchmarks posted in the past surrounding io_uring that
        | were called out by people on the liburing repositories. Take
        | these with a grain of salt - his results are notoriously
        | difficult to reproduce, even on identical hardware.
       | 
       | Nevertheless, io_uring is certainly the better design and I'm
       | happy to see progress still being made.
        
         | marcodiego wrote:
          | Not disagreeing with you, but even if the benchmarks are
          | unfair, these reports are still a good illustration of
          | progress. Unless these optimizations carry drawbacks of their
          | own, there should be no negative side-effects.
          | 
          | As far as I know, these changes shouldn't negatively impact
          | tasks that are not making any use of these features. So, even
          | if the improvements are not directly proportional to what is
          | being reported, they're still real.
        
           | junon wrote:
           | > Unless there are some drawbacks to these optimizations,
           | there should be no negative side-effects.
           | 
            | Historically, the benchmarks have been crafted in ways that
            | make io_uring seem faster than its real-world, well-formed
            | competition (e.g. the io_uring vs epoll benchmark). One issue
            | that was pointed out was that the io_uring benchmark eschewed
            | proper error handling in order to avoid some branches,
            | whereas the epoll benchmark error-checked properly. This
            | reduced the io_uring benchmark's times considerably, though
            | no correction was ever published after the fact.
            | 
            | This is why, in the GitHub issue I mentioned in another
            | comment, some people are understandably a bit annoyed. I
            | choose not to jump to conclusions about Axboe's intent, and
            | my original comment wasn't meant to do so.
        
         | Datagenerator wrote:
          | Nitpicking aside, he has twenty years of Linux kernel
          | development experience and does great things for everyone's
          | benefit. Congratulations, Jens Axboe!
        
           | junon wrote:
           | This wasn't nitpicking. It was claiming a nontrivial speedup
           | over epoll for identical test cases. What was the point of
           | your comment?
        
             | yuffaduffa wrote:
             | What was the point of yours? "Take benchmarks with a grain
             | of salt" is like saying refrigerating food is important or,
             | more appropriately, that it's important to verify
             | surprising claims. (You made it a personal observation for
             | some reason, but your whole point is still about as
             | insightful as both of those.)
             | 
             | The person you're going after here probably felt compelled
             | to counter the needless personal nature of your remarks.
              | It's difficult to experimentally verify relativity, but we
              | don't criticize Einstein as a result.
        
               | junon wrote:
                | Comparing very verifiable, applied science to very
                | theoretical science is a strawman. Science doesn't care
                | about feelings or tenure, so while his experience is
                | _relevant_, it does not _excuse_ the provably inflated
                | performance figures that have been boasted historically.
        
         | pengaru wrote:
         | > There have been some unfair benchmarks posted in the past
         | surrounding io_uring that were called out by people on the
         | liburing repositories.
         | 
         | Could you provide any links to these discussions?
        
           | junon wrote:
           | Sure. https://github.com/axboe/liburing/issues/189
           | 
            | The original claims were of a 90%-or-greater performance
            | increase over epoll. Then issues were found, and the figure
            | was adjusted to 60% over epoll. Then more issues were found,
            | and now real-world performance tests are showing minimal
            | speedups, if any.
            | 
            | Unfortunately the sibling commenters don't see "computer
            | science" as a science but instead as a "feel good hobby", it
            | seems. My point wasn't to hurt feelings; it was to provide a
            | word of caution about these sorts of groundbreaking claims,
            | specifically with respect to the io_uring efforts, as they
            | have been disingenuous quite a few times historically.
           | 
           | I don't doubt Jens does fantastic work. I don't doubt that
           | he's seen these speedups in very specific cases. But people
           | are celebrating this as a win where they'd be skeptical of
           | e.g. "breakthrough" treatments of cancer (footnote: in mice).
           | It's the same thing.
        
             | [deleted]
        
             | servytor wrote:
              | But if you read the thread, you realize he is testing on a
              | 2011 laptop, and the older architecture may be causing the
              | issues. He reported a 3x speedup running the benchmark on a
              | Raspberry Pi 4 in the same thread.[0]
             | 
             | Someone claimed that you don't get to see the benefit of
             | io_uring in a hypervisor[1], but they did not provide
             | benchmark results.
             | 
             | [0]: https://github.com/axboe/liburing/issues/189#issuecomm
             | ent-94...
             | 
             | [1]: https://github.com/axboe/liburing/issues/189#issuecomm
             | ent-73...
        
       | jpgvm wrote:
        | I remember when he first hit 1M, back when he was at Fusion-IO.
        | Man, how far things have come since those days.
        
       | tomc1985 wrote:
       | I keep seeing Axboe as a typo'd version of Adobe and it's really
       | bugging me
       | 
       | edit- It's someone's name, I thought it was a company or product
        
         | noir_lord wrote:
         | It is a wlel kwnon pohnnomeen taht you can sawp all the ltteers
         | in a wrod and as lnog as the fsrit and lsat is ccrreot it wlil
         | sitll be cibemreplohnse.
         | 
         | https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/
        
           | tomc1985 wrote:
           | It's wild that I was able to read that in my head at full
           | speed, with the correct word pronunciation (except the last
           | word... wonder what the upper limit on that is anyway?)
        
         | cteiosanu wrote:
         | Glad I'm not the only one.
        
       | egberts1 wrote:
       | I look forward to a faster `cp` command.
        
         | the8472 wrote:
          | io_uring certainly could help there, but (conditional on using
          | SSDs) even switching to something parallel like xcp or fcp
          | would speed things up compared to single-threaded GNU cp.
        
         | kerneltrap wrote:
          | The fastest `cp` actually does no data copying at all (relying
          | on copy-on-write) on filesystems with reflink support.
          | Incidentally, coreutils v9.0 switched cp to doing reflinks by
          | default [1], so there's already a faster cp.
         | 
         | [1]
         | https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=2...
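          | 
          | Under the hood that kind of copy boils down to the FICLONE
          | ioctl. A minimal sketch (paths are illustrative; assumes a
          | reflink-capable filesystem such as Btrfs or XFS):
          | 
          |     #include <fcntl.h>
          |     #include <sys/ioctl.h>
          |     #include <linux/fs.h> /* FICLONE */
          | 
          |     int main(void)
          |     {
          |         int src = open("big.img", O_RDONLY);
          |         int dst = open("copy.img",
          |                        O_WRONLY | O_CREAT, 0644);
          | 
          |         /* Share src's extents with dst instead of copying
          |          * the data; fails with EOPNOTSUPP where reflink is
          |          * unsupported. */
          |         return ioctl(dst, FICLONE, src) < 0;
          |     }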
        
       ___________________________________________________________________
       (page generated 2021-10-18 23:01 UTC)