[HN Gopher] Axboe Achieves 8M IOPS Per-Core with Newest Linux Op...
___________________________________________________________________
Axboe Achieves 8M IOPS Per-Core with Newest Linux Optimization
Patches
Author : marcodiego
Score : 98 points
Date : 2021-10-17 04:05 UTC (1 day ago)
(HTM) web link (www.phoronix.com)
(TXT) w3m dump (www.phoronix.com)
| Aissen wrote:
| That's 125 ns per IO, on a single CPU core, with two devices.
| Mind boggling.
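| (As a sanity check on that figure: 1 s / 8,000,000 IOPS comes
| out to 125 ns per completed IO, all on one core.)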
| megous wrote:
| IO is 512B sectors here? Or 4KiB?
|
| EDIT: 512, looking at the screenshot..
| anarazel wrote:
| It's worth noting that there's some batching involved, so it's
| not 125ns for processing one io, then again 125ns for the next.
| Both combining some of the work and superscalar execution are
| necessary to reach these high numbers...
| benlwalker wrote:
| To expand on the batching details, I haven't seen exactly
| what he's doing here, but historically the numbers quoted are
| from benchmarks that submit something like 128 total queue
| depth in alternating batches of 64. The disks hit max
| performance with a much lower queue depth than 64. So it's
| unrealistically perfect batching that you'll never see in a
| real application. The workload is only a useful tool for
| optimizing the code, which is exactly what he's using it for.
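| To make the pattern concrete, here's a rough liburing sketch of
| that kind of batched submission loop. The batch size, block
| size, device path, offset range and the missing error handling
| are all illustrative; this is not Axboe's actual test tool:
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <liburing.h>
|     #include <stdlib.h>
|
|     #define BATCH 64
|     #define BLKSZ 512
|
|     int main(void)
|     {
|         struct io_uring ring;
|         struct io_uring_cqe *cqes[BATCH];
|         unsigned inflight = 0;
|         void *buf;
|         int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
|
|         /* all reads land in the same buffer; fine for a toy */
|         posix_memalign(&buf, 4096, BLKSZ);
|         io_uring_queue_init(2 * BATCH, &ring, 0);
|
|         for (;;) {
|             /* top up to ~128 in flight, 64 SQEs per submit */
|             if (inflight <= BATCH) {
|                 for (int i = 0; i < BATCH; i++) {
|                     struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|                     io_uring_prep_read(sqe, fd, buf, BLKSZ,
|                                        (rand() % 1000000) * (off_t)BLKSZ);
|                 }
|                 io_uring_submit(&ring);  /* one syscall, one doorbell */
|                 inflight += BATCH;
|             }
|             /* reap whatever has completed, in one sweep */
|             struct io_uring_cqe *cqe;
|             io_uring_wait_cqe(&ring, &cqe);
|             unsigned done = io_uring_peek_batch_cqe(&ring, cqes, BATCH);
|             io_uring_cq_advance(&ring, done);
|             inflight -= done;
|         }
|     }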
| anarazel wrote:
| I think you need batching for the nvme doorbell alone at
| anything close to these kinds of rates...
|
| For some workloads it's not that hard to have queues that
| deep. What's harder is knowing when to use them and when not
| to. There's really not enough information available to make
| any of this self-tuning.
| wtallis wrote:
| The Intel Optane drives he's testing with hit max
| performance at pretty low queue depths, but flash-based
| SSDs with similar throughput will require those high queue
| depths.
| guerrilla wrote:
| Faster than DRAM from the 90's.
| terafo wrote:
| Modern SSDs generally have more bandwidth than DDR2.
| formerly_proven wrote:
| The access time has actually not changed a lot; it's just
| that transferring the smallest read a contemporary processor
| would issue - one cache line (~64 bytes) - at somewhere
| between 500 and 1000 MB/s already takes around 100 ns. It's
| that transfer-related latency that has been reduced
| drastically.
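| (For the arithmetic: 64 B at 500 MB/s is 128 ns, and at
| 1 GB/s it's still 64 ns, so at 90s-era bandwidths the
| transfer alone was on the order of the access time.)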
| guerrilla wrote:
| Didn't /RAS timing drop from like 120ns to 10ns too? That's
| what I was thinking about. Then there's DDR, and the bus is
| also 16 times wider.
| devit wrote:
| At around 4 GHz and 2 IPC, it's about 1000 CPU instructions per
| I/O.
|
| Assuming you have a hardware DMA ring buffer for I/O that is
| directly mapped in userspace, the only thing that is really
| needed is to write the operation type, size, disk position and
| memory position to the ring buffer, update the buffer position
| and check for flush, doable in around 8 CISC instructions (plus
| the slowpath), so around 100x inefficient.
|
| Without the direct mapped ring buffer and with a filesystem,
| you need a kernel to translate from uring to the hardware ring
| buffer, and here it still seems around 10x inefficient as
| around 100 instructions should be enough to do the translation
| (assuming pages already mapped in the IOMMU, that you have the
| file block map in cache, and that the whole system is
| architected to maximize the efficiency of this operation).
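|
| (125 ns at 4 GHz is ~500 cycles, so ~1000 instructions at 2
| IPC.) A purely hypothetical sketch of that user-mapped fast
| path - every struct, field and name here is invented for
| illustration, not a real device or io_uring ABI:
|
|     #include <stdatomic.h>
|     #include <stdint.h>
|
|     struct sub_entry {            /* one submission slot */
|         uint8_t  opcode;          /* read or write */
|         uint32_t len;             /* transfer size in bytes */
|         uint64_t lba;             /* disk position */
|         uint64_t dma_addr;        /* memory position (pre-pinned) */
|     };
|
|     struct sub_ring {
|         struct sub_entry *entries;    /* user-mapped ring memory */
|         uint32_t          mask;       /* ring size - 1 */
|         uint32_t          tail;       /* producer index */
|         _Atomic uint32_t *doorbell;   /* mapped doorbell register */
|     };
|
|     /* The hot path: a handful of stores plus one doorbell write. */
|     static inline void submit_read(struct sub_ring *r, uint64_t lba,
|                                    uint64_t dma_addr, uint32_t len)
|     {
|         struct sub_entry *e = &r->entries[r->tail & r->mask];
|         e->opcode   = 1;          /* 1 = read, by convention here */
|         e->len      = len;
|         e->lba      = lba;
|         e->dma_addr = dma_addr;
|         r->tail++;
|         /* "check for flush": ring the doorbell once per batch */
|         atomic_store_explicit(r->doorbell, r->tail,
|                               memory_order_release);
|     }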
| reacharavindh wrote:
| Does anyone know whether these optimisations in block layer and
| iouring benefit major IO hitters like PostgreSQL, ZFS, NFS writes
| etc? Will it take several years and monumental effort before they
| adopt iouring first?
| wtallis wrote:
| io_uring is an improved interface between userspace and the
| kernel, so it doesn't provide any direct benefits to in-kernel
| filesystem operations, and the performance benefits io_uring
| does provide should be largely filesystem-agnostic. That said,
| some of these recent optimizations may be low enough in the
| kernel's io stack to also benefit io originating within the
| kernel itself.
|
| Userspace applications that already have some support for
| asynchronous disk IO (either through the old libaio APIs or as
| a cleanly-abstracted thread pool) should be able to switch to
| using io_uring as their backend without too much trouble, and
| reap the benefits of async that actually works reliably (if
| switching from libaio) and with vastly lower overhead (if
| switching from a thread pool). Databases like PostgreSQL were
| some of the few applications that attempted to deal with the
| limitations of libaio, but I'm not sure how close they are to
| having a production-quality io_uring backend.
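| For a sense of what such a backend boils down to, here is a
| minimal liburing sketch of a single async read. The file name,
| buffer size, queue depth and the skipped error handling on
| open()/init are placeholders, not any particular database's
| code:
|
|     #include <liburing.h>
|     #include <fcntl.h>
|     #include <stdio.h>
|
|     int main(void)
|     {
|         struct io_uring ring;
|         struct io_uring_cqe *cqe;
|         char buf[4096];
|         int fd = open("datafile", O_RDONLY);
|
|         io_uring_queue_init(8, &ring, 0);
|
|         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|         io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
|         io_uring_sqe_set_data(sqe, buf);  /* tag for the completion */
|         io_uring_submit(&ring);
|
|         /* ...do other work; reap the completion when convenient... */
|         io_uring_wait_cqe(&ring, &cqe);
|         if (cqe->res < 0)
|             fprintf(stderr, "read failed: %d\n", cqe->res);
|         io_uring_cqe_seen(&ring, cqe);
|
|         io_uring_queue_exit(&ring);
|         return 0;
|     }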
| sumtechguy wrote:
| From an earlier article
| https://www.phoronix.com/scan.php?page=news_item&px=Linux-Ap...
|
| "His patches pushing the greater performance have been
| changes to the block code, NVMe, multi-queue blk-mq, and
| IO_uring." https://git.kernel.dk/cgit/linux-
| block/log/?h=perf-wip
|
| So it looks like he is playing with a good portion of the IO
| block stack with a very recent concentration on io_uring. So
| maybe some of it?...
| terafo wrote:
| He just hit 9M
| https://twitter.com/axboe/status/1450188650852065291
| junon wrote:
| I know this will be down voted but I've learned to be a bit
| skeptical of Axboe's monumental claims. There have been some
| unfair benchmarks posted in the past surrounding io_uring that
| were called out by people on the liburing repositories. Take
| these with a grain of salt - his results are notoriously very
| difficult to reproduce, even on identical hardware.
|
| Nevertheless, io_uring is certainly the better design and I'm
| happy to see progress still being made.
| marcodiego wrote:
| Not disagreeing with you, but even if benchmarks are unfair,
| these reports are still a good illustration of progress. Unless
| there are some drawbacks to these optimizations, there should
| be no negative side-effects.
|
| As far as I know, these changes shouldn't negatively impact
| tasks that are not making any use of these features. So, even
| if improvements are not directly proportional to what is being
| reported, they're still real.
| junon wrote:
| > Unless there are some drawbacks to these optimizations,
| there should be no negative side-effects.
|
| Historically the benchmarks have been crafted in a way that
| makes them seem faster than their real-world, well-formed
| counterparts (e.g. the io_uring vs epoll benchmark). One issue
| that was pointed out was that the io_uring benchmark eschewed
| proper error handling in order to avoid some branches,
| whereas the epoll benchmark properly error checked. This
| reduced the benchmark times considerably, though no
| correction was ever published after the fact.
|
| This is why, in the GitHub issue I mentioned in another
| comment, some people are understandably a bit annoyed. I
| don't choose to jump to conclusions about Axboe's intent, and
| my original comment wasn't meant to do so.
| Datagenerator wrote:
| Nitpicking aside, he has twenty years of Linux kernel
| development experience and does great things for everyone's
| benefit. Congratulations Jens Axboe!
| junon wrote:
| This wasn't nitpicking. It was claiming a nontrivial speedup
| over epoll for identical test cases. What was the point of
| your comment?
| yuffaduffa wrote:
| What was the point of yours? "Take benchmarks with a grain
| of salt" is like saying refrigerating food is important or,
| more appropriately, that it's important to verify
| surprising claims. (You made it a personal observation for
| some reason, but your whole point is still about as
| insightful as both of those.)
|
| The person you're going after here probably felt compelled
| to counter the needless personal nature of your remarks.
| It's difficult to experimentally verify relativity but we
| don't criticize Einstein as a result.
| junon wrote:
| Comparing very verifiable, applied science to very
| theoretical science is a strawman. Science doesn't care
| about feelings or tenure, so while his experience is
| _relevant_ , it does not _excuse_ the provably inflated
| performance figures that have been boasted historically.
| pengaru wrote:
| > There have been some unfair benchmarks posted in the past
| surrounding io_uring that were called out by people on the
| liburing repositories.
|
| Could you provide any links to these discussions?
| junon wrote:
| Sure. https://github.com/axboe/liburing/issues/189
|
| The original claims were of a 90%-or-more performance
| increase over epoll. Then issues were found, and the figure
| was adjusted to 60% over epoll. Then more issues were found,
| and now real-world performance tests are showing minimal
| speedups if any.
|
| Unfortunately the sibling commenters don't see "computer
| science" as a science but instead as a "feel good hobby", it
| seems. My point wasn't to hurt feelings, it was to provide a
| word of caution with these sorts of groundbreaking claims
| with respect specifically to the io_uring efforts, as they
| have been disingenuous quite a few times historically.
|
| I don't doubt Jens does fantastic work. I don't doubt that
| he's seen these speedups in very specific cases. But people
| are celebrating this as a win where they'd be skeptical of
| e.g. "breakthrough" treatments of cancer (footnote: in mice).
| It's the same thing.
| [deleted]
| servytor wrote:
| But if you read the thread, you realize he is testing on a
| 2011 laptop, and the older architecture may be causing the
| issues. He reported a 3x speedup running the benchmark on a
| Raspberry Pi 4 in the same thread [0].
|
| Someone claimed that you don't get to see the benefit of
| io_uring in a hypervisor[1], but they did not provide
| benchmark results.
|
| [0]: https://github.com/axboe/liburing/issues/189#issuecomment-94...
|
| [1]: https://github.com/axboe/liburing/issues/189#issuecomment-73...
| jpgvm wrote:
| I remember when he first hit 1M when he was at Fusion-IO. Man how
| far things have come since those days.
| tomc1985 wrote:
| I keep seeing Axboe as a typo'd version of Adobe and it's really
| bugging me
|
| edit- It's someone's name, I thought it was a company or product
| noir_lord wrote:
| It is a wlel kwnon pohnnomeen taht you can sawp all the ltteers
| in a wrod and as lnog as the fsrit and lsat is ccrreot it wlil
| sitll be cibemreplohnse.
|
| https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/
| tomc1985 wrote:
| It's wild that I was able to read that in my head at full
| speed, with the correct word pronunciation (except the last
| word... wonder what the upper limit on that is anyway?)
| cteiosanu wrote:
| Glad I'm not the only one.
| egberts1 wrote:
| I look forward to a faster `cp` command.
| the8472 wrote:
| io_uring certainly could help there, but (conditional on using
| SSDs) even switching to something parallel like xcp or fcp
| would speed things up compared to single-threaded gnu cp.
| kerneltrap wrote:
| The fastest `cp` is actually doing no data copies at all
| (relying on copy-on-write), on filesystems with reflink
| support. Incidentally, coreutils v9.0 cp switched to doing
| reflinks by default [1], so there's already a faster cp.
|
| [1] https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=2...
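|
| For reference, a minimal sketch of the reflink path such a cp
| can take on supporting filesystems (e.g. btrfs, or XFS with
| reflink enabled): the FICLONE ioctl shares extents instead of
| copying data. Error handling is trimmed to the essentials, and
| a real cp falls back to an ordinary copy when cloning fails:
|
|     #include <fcntl.h>
|     #include <linux/fs.h>     /* FICLONE */
|     #include <stdio.h>
|     #include <sys/ioctl.h>
|
|     int main(int argc, char **argv)
|     {
|         if (argc != 3)
|             return 1;
|
|         int src = open(argv[1], O_RDONLY);
|         int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
|
|         if (ioctl(dst, FICLONE, src) < 0) {
|             /* e.g. EOPNOTSUPP: would fall back to a real copy */
|             perror("FICLONE");
|             return 1;
|         }
|         return 0;
|     }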
___________________________________________________________________
(page generated 2021-10-18 23:01 UTC)