[HN Gopher] Achieving 11M IOPS and 66 GB/S IO on a Single Thread...
___________________________________________________________________
Achieving 11M IOPS and 66 GB/S IO on a Single ThreadRipper
Workstation
Author : tanelpoder
Score : 231 points
Date : 2021-01-29 12:45 UTC (10 hours ago)
(HTM) web link (tanelpoder.com)
(TXT) w3m dump (tanelpoder.com)
| secondcoming wrote:
| > For final tests, I even disabled the frequent gettimeofday
| system calls that are used for I/O latency measurement
|
| I was knocking up some profiling code and measured the
| performance of gettimeofday as a proof-of-concept test.
|
| The performance difference between running the test on my
| personal desktop Linux VM and running it on a cloud instance
| Linux VM was quite interesting (the cloud was worse).
|
| I think I read somewhere that cloud instances cannot use the VDSO
| code path because your app may be moved to a different machine.
| My recollection of the reason is somewhat cloudy.
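|
| (A quick way to check this yourself - a minimal sketch, assuming
| a small test program, here hypothetically called ./gtod_bench,
| that calls gettimeofday() in a loop. If the vDSO fast path is in
| use, gettimeofday/clock_gettime barely register as real system
| calls in the strace summary; on hosts that fall back to the
| syscall path they dominate it. Checking the active clocksource
| also helps, since "tsc" is what normally enables the vDSO path.)
|
|       strace -c -e trace=gettimeofday,clock_gettime ./gtod_bench
|       cat /sys/devices/system/clocksource/clocksource0/current_clocksource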
| ashkankiani wrote:
| When I bought a bunch of NVMe drives, I was disappointed with the
| maximum speed I could achieve with them, given my knowledge and
| the time I had available. Thanks for making this post - it gives
| me more points of insight into the problem.
|
| I'm on the same page with your thesis that "hardware is fast and
| clusters are usually overkill," and disk I/O was a piece that I
| hadn't really figured out yet despite making great strides in the
| software engineering side of things. I'm trying to make a startup
| this year and disk I/O will actually be a huge factor in how far
| I can scale without bursting costs for my application. Good
| stuff!
| whalesalad wrote:
| This post is fantastic. I wish there was more workstation porn
| like this for those of us who are not into the RGB light show
| ripjaw hacksaw aorus elite novelty stuff that gamers are so into.
| Benchmarks in the community are almost universally focused on
| gaming performance and FPS.
|
| I want to build an epic rig that will last a long time with
| professional grade hardware (with ECC memory for instance) and
| would love to get a lot of the bleeding-edge stuff without
| compromising on durability. Where do these people hang out
| online?
| piinbinary wrote:
| The Level1Techs forums seem to have a lot of people with
| similar interests.
| greggyb wrote:
| STH: https://www.youtube.com/user/ServeTheHomeVideo
| https://www.servethehome.com/
|
| GamersNexus (despite the name, they include a good amount of
| non-gaming benchmarks, and they have great content on cases and
| cooling): https://www.youtube.com/user/GamersNexus
| https://www.gamersnexus.net/
|
| Level1Techs (mentioned in another reply):
| https://www.youtube.com/c/Level1Techs
| https://www.level1techs.com/
|
| r/homelab (and all the subreddits listed in its sidebar):
| https://www.reddit.com/r/homelab/
|
| Even LinusTechTips has some decent content for server hardware,
| though they stay fairly superficial. And the forum definitely
| has people who can help out: https://linustechtips.com/
|
| And the thing is, depending on what metric you judge
| performance by, the enthusiast hardware may very well
| outperform the server hardware. For something that is sensitive
| to memory, e.g., you can get much faster RAM in enthusiast SKUs
| (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than
| you'll find in server hardware. Similarly, the HEDT SKUs out-
| clock the server SKUs for both Intel and AMD.
|
| I have a Threadripper system that outperforms most servers I
| work with on a daily basis, because most of my workloads,
| despite being multi-threaded, are sensitive to clockspeed.
| 1996 wrote:
| Indeed, serious people now use gamer computer parts because
| they're just faster!
| greggyb wrote:
| It's not "just faster".
|
| No one's using "gamer NICs" for high speed networking. Top
| of the line "gaming" networking is 802.11ax or 10GbE.
| 2x200Gb/s NICs are available now.
|
| Gaming parts are strictly single socket - software that can
| take advantage of >64 cores will need server hardware -
| either one of the giant Ampere ARM CPUs or a 2+ socket
| system.
|
| If something must run in RAM and needs TB of RAM, well then
| it's not even a question of faster or slower. The
| capability only exists on server platforms.
|
| _Some_ workloads will benefit from the performance
| characteristics of consumer hardware.
| vmception wrote:
| > RGB light show ripjaw hacksaw aorus elite novelty stuff
|
| haha yeah I bought a whole computer from someone and was
| wondering why the RAM looked like rupees from Zelda
|
| apparently that is common now
|
| but at least I'm not cosplaying as a karate day trader for my
| Wall Street Journal exposé
| philsnow wrote:
| I'm with you on this, I just built a (much more modest than the
| article's) workstation/homelab machine a few months ago, to
| replace my previous one which was going on 10 years old and
| showing its age.
|
| There's some folks in /r/homelab who are into this kind of
| thing, and I used their advice a fair bit in my build. While it
| is kind of mixed (there's a lot of people who build pi clusters
| as their homelab), there's still plenty of people who buy
| decommissioned "enterprise" hardware and make monstrous-for-
| home-use things.
| deagle50 wrote:
| Happy to help if you want feedback. Servethehome forums are
| also a great resource of info and used hardware, probably the
| best community for your needs.
| arminiusreturns wrote:
| Check out HardForum. Lots of very knowledgeable people on there
| helped me mature my hardware-level knowledge, back when I was
| building 4-CPU, 64-core Opteron systems. Also decent banter.
| tanelpoder wrote:
| Thanks! In case you're interested in building a ThreadRipper
| Pro WX-based system like mine, AMD apparently starts selling
| the CPUs independently from March 2021 onwards:
|
| https://www.anandtech.com/show/16396/the-amd-wrx80-chipset-a...
|
| Previously you could only get this CPU when buying the Lenovo
| ThinkStation P620 machine. I'm pretty happy with Lenovo
| Thinkstations though (I bought a P920 with dual Xeons 2.5 years
| ago)
| ksec wrote:
| And a just-in-time article:
|
| https://www.anandtech.com/show/16462/hands-on-with-the-
| asus-...
|
| I guess I should submit this on HN as well.
|
| Edit: I was getting ahead of myself - I thought these were
| for the TR Pro with Zen 3. Turns out those are not out yet.
| zhdc1 wrote:
| Look at purchasing used enterprise hardware. You can buy a
| reliable X9 or X10 generation Supermicro server (rack or tower)
| for around a couple hundred dollars.
| ashkankiani wrote:
| I've been planning to do this, but enterprise hardware seems
| to require a completely different set of knowledge on how to
| purchase and maintain it, especially as a consumer.
|
| The barrier to entry isn't as trivial as with consumer
| desktops, but I suppose that's the point. Still, it would be
| nice if there were a guide that could help me make good
| decisions to start.
| jqcoffey wrote:
| Also, purpose built data center chassis are designed for
| high airflow and are thus really quite loud.
| modoc wrote:
| Very true. I have a single rack mount server in my HVAC
| room, and it's still so loud I had to glue soundproofing
| foam on the nearby walls:)
| benlwalker wrote:
| Plug for a post I wrote a few years ago demonstrating nearly the
| same result but using only a single CPU core:
| https://spdk.io/news/2019/05/06/nvme/
|
| This is using SPDK to eliminate all of the overhead the author
| identified. The hardware is far more capable than most people
| expect, if the software would just get out of the way.
| tanelpoder wrote:
| Yes I had seen that one (even more impressive!)
|
| When I have more time again, I'll run fio with the SPDK plugin
| on my kit too. And I'd be interested in seeing what happens
| when doing 512B random I/Os.
| benlwalker wrote:
| The system that was tested there was PCIe bandwidth
| constrained because this was a few years ago. With your
| system, it'll get a bigger number - probably 14 or 15 million
| 4KiB IO per second per core.
|
| But while SPDK does have an fio plug-in, unfortunately you
| won't see numbers like that with fio. There's way too much
| overhead in the tool itself. We can't get beyond 3 to 4
| million with that. We rolled our own benchmarking tool in
| SPDK so we can actually measure the software we produce.
|
| Since the core is CPU bound, 512B IO are going to net the
| same IO per second as 4k. The software overhead in SPDK is
| fixed per IO, regardless of size. You can also run more
| threads with SPDK than just one - it has no locks or cross
| thread communication so it scales linearly with additional
| threads. You can push systems to 80-100M IO per second if you
| have disks and bandwidth that can handle it.
| StillBored wrote:
| Yah, this has been going on for a while. Before SPDK it was
| done with custom kernel bypasses and fast Infiniband/FC arrays.
| I was involved with a similar project in the early 2000s, where
| at the time the bottleneck was the shared Xeon bus; it then
| moved to the PCIe bus with Opterons/Nehalem+. In our case we
| ended up spending a lot of time tuning the application to avoid
| cross-socket communication as well, since that could become a
| big deal (of course after careful card placement).
|
| But SPDK has a problem you don't have with bypasses and
| io_uring, in that it needs the IOMMU enabled, and that can
| itself become a bottleneck. There are also issues for some
| applications that want to use interrupts rather than poll
| everything.
|
| What's really nice about io_uring is that it sort of
| standardizes a large part of what people were doing with
| bypasses.
| tanelpoder wrote:
| Yeah, that's what I wondered - I'm OK with using multiple
| cores; would I get even more IOPS when doing smaller I/Os?
| Is the benchmark suite you used part of the SPDK toolkit
| (and is it easy enough to run)?
| benlwalker wrote:
| Whether you get more IOPs with smaller I/Os depends on a
| number of things. Most drives these days are natively
| 4KiB blocks and are emulating 512B sectors for backward
| compatibility. This emulation means that 512B writes are
| often quite slow - probably slower than writing 4KiB
| (with 4KiB alignment). But 512B reads are typically very
| fast. On Optane drives this may not be true because the
| media works entirely differently - those may be able to
| do native 512B writes. Talk to the device vendor to get
| the real answer.
|
| For at least reads, if you don't hit a CPU limit you'll
| get 8x more IOPS with 512B than you will with 4KiB with
| SPDK. It's more or less perfect scaling. There's some
| additional hardware overheads in the MMU and PCIe
| subsystems with 512B because you're sending more messages
| for the same bandwidth, but my experience has been that
| it is mostly negligible.
|
| The benchmark builds to build/examples/perf and you can
| just run it with -h to get the help output. Random 4KiB
| reads at 32 QD to all available NVMe devices (all devices
| unbound from the kernel and rebound to vfio-pci) for 60
| seconds would be something like:
|
| perf -q 32 -o 4096 -w randread -t 60
|
| You can restrict the test to specific devices with the -r
| parameter (by BUS:DEVICE:FUNCTION, essentially). The tool
| can also benchmark kernel devices. Using -R will turn on
| io_uring (otherwise it uses libaio), and you simply list
| the block devices on the command line after the base
| options like this:
|
| perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
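|
| (For the unbind/rebind step mentioned above, SPDK ships a
| helper script - a rough sketch, assuming an SPDK source
| checkout and root privileges; HUGEMEM is the amount of
| hugepage memory to reserve, in MB:)
|
|       sudo HUGEMEM=4096 scripts/setup.sh    # bind NVMe devices to vfio-pci
|       sudo scripts/setup.sh status          # show current bindings
|       sudo scripts/setup.sh reset           # return devices to the kernel nvme driver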
|
| You can get ahold of help from the SPDK community at
| https://spdk.io/community. There will be lots of people
| willing to help.
|
| Excellent post by the way. I really enjoyed it.
| tanelpoder wrote:
| Thanks! Will add this to TODO list too.
| rektide wrote:
| Nice follow-up @tanelpoder to "RAM is the new disk" (2015)[1],
| which we talked about not even two weeks ago!
|
| I was quite surprised to hear in that thread that AMD's
| Infinity Fabric was so oversubscribed. There's 256 GB/s of PCIe
| on a 1P system, but it seems like this 66 GB/s is all the
| fabric can do. A little under a 4:1 oversubscription!
|
| [1] https://news.ycombinator.com/item?id=25863093
| electricshampo1 wrote:
| 66 GB/s is from each of 10 drives doing ~6.6 GB/s; I don't
| think the Infinity Fabric is the limiter here.
| [deleted]
| muro wrote:
| This article was great, thanks for sharing!
|
| Does anyone have advice on optimizing a Windows 10 system? I
| have a Haswell workstation (E5-1680 v3) that I find reasonably
| fast and that works very well under Linux. In Windows, I get
| lost. I tried running the UserBenchmark suite, which told me
| I'm below the median for most of my components. Is there any
| good advice on how to improve that? Which tools give good
| insight into what the machine is doing under Windows? I'd
| first like to try to optimize what I have before upgrading to
| the new shiny :).
| RobLach wrote:
| Excellent article. Worth a read even if you're not maxing IO.
| wiradikusuma wrote:
| I've been thinking about this. Would traditional co-location
| (e.g. 2x 2U from Dell) in a local data center be cheaper if,
| e.g., you're serving a local (country-level) market?
| derefr wrote:
| Depends on how long you need the server, and the ownership
| model you've chosen to pursue for it.
|
| If you _purchase_ a server and stick it in a co-lo somewhere,
| and your business plans to exist for 10+ years -- well, is that
| server still going to be powering your business 10 years from
| now? Or will you have moved its workloads to something newer?
| If so, you'll probably want to decommission and sell the
| server at some point. The time required to deal with that might
| not be worth the labor costs of your highly-paid engineers.
| Which means you might not actually end up re-capturing the
| depreciated value of the server, but instead will just let it
| rot on the shelf, or dispose of it as e-waste.
|
| Hardware _leasing_ is a lot simpler. When you lease servers
| from an OEM like Dell, there's a quick, well-known path to
| getting the EOLed hardware shipped back to Dell and the
| depreciated value paid back out to you.
|
| And, of course, hardware _renting_ is simpler still. Renting
| the hardware of the co-lo (i.e. "bare-metal unmanaged server"
| hosting plans) means never having to worry about the CapEx of
| the hardware in the first place. You just walk away at the end
| of your term. But, of course, that's when you start paying
| premiums on top of the hardware.
|
| Renting VMs, then, is like renting hardware on a micro-scale;
| you never have to think about what you're running on, as --
| presuming your workload isn't welded to particular machine
| features like GPUs or local SSDs -- you'll tend to
| automatically get migrated to newer hypervisor hardware
| generations as they become available.
|
| When you work it out in terms of "ten years of ops-staff labor
| costs of dealing with generational migrations and sell-offs"
| vs. "ten years of premiums charged by hosting rentiers", the
| pricing is surprisingly comparable. (In fact, this is basically
| the math hosting providers use to figure out what they _can_
| charge without scaring away their large enterprise customers,
| who are fully capable of taking a better deal if there is one.)
| rodgerd wrote:
| > If you purchase a server and stick it in a co-lo somewhere,
| and your business plans to exist for 10+ years -- well, is
| that server still going to be powering your business 10 years
| from now? Or will you have moved its workloads to something
| newer?
|
| Which, if you have even the remotest fiscal competence,
| you'll have funded by using the depreciation of the book
| value of the asset after 3 years.
| 37ef_ced3 wrote:
| Somebody please tell me how many ResNet50 inferences you can do
| per second on one of these chips
|
| Here is the standalone AVX-512 ResNet50 code (C99 .h and .c
| files):
|
| https://nn-512.com/browse/ResNet50
|
| Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible
| wyldfire wrote:
| Whoa, this code looks interesting. Must've been emitted by
| something higher-level? Something like PyTorch/TF/MLIR/TVM/Glow
| maybe?
|
| If that is the case, then maybe it could be emitted again while
| masking the instruction sets Ryzen doesn't support yet.
| tanelpoder wrote:
| You mean on the CPU, right? This CPU doesn't support AVX-512:
| $ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | \
|       sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
| avx avx2 misalignsse popcnt sse sse2 sse4_1 sse4_2 sse4a ssse3
|
| What compile/build options should I use?
| 37ef_ced3 wrote:
| No AVX-512, forget it then
| xxpor wrote:
| They don't have avx512 instructions.
| qaq wrote:
| Now, honestly, say how long two boxes like this behind a load
| balancer would be more than enough for your startup.
| pbalcer wrote:
| What I find interesting about the performance of this type of
| hardware is how it affects the software we are using for storage.
| The article talked about how the Linux kernel just can't keep up,
| but what about databases or KV stores? Are the trade-offs those
| types of solutions make still valid for this type of hardware?
|
| RocksDB, and LSM algorithms in general, seem to be designed with
| the assumption that random block I/O is slow. It appears that,
| for modern hardware, that assumption no longer holds, and the
| software only slows things down [0].
|
| [0] -
| https://github.com/BLepers/KVell/blob/master/sosp19-final40....
| ddorian43 wrote:
| Disappointed there was no lmdb comparison in there.
| tyingq wrote:
| A paper on making LSM more SSD friendly:
| https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf
| pbalcer wrote:
| Thanks for sharing this article - I found it very insightful.
| I've seen similar ideas being floated around before, and they
| often seem to focus on what software can be added on top of
| an already fairly complex solution (while LSM can appear to
| be conceptually simple, its implementations are anything
| but).
|
| To me, what the original article shows is an opportunity to
| remove - not add.
| jeffbee wrote:
| If you think about it from the perspective of the authors of
| large-scale databases, linear access is still a lot cheaper
| than random access in a datacenter filesystem.
| AtlasBarfed wrote:
| ScyllaDB once had a blog post about how surprisingly little
| CPU time is available to process packets on modern high-speed
| networks like 40Gbit and up.
|
| I can't find it now. I think they were trying to say that
| Cassandra can't keep up because of the JVM overhead and you
| need to be close to the metal for extreme performance.
|
| This is similar. Huge amounts of flooding I/O from modern PCIe
| SSDs really close the traditional gap between CPU and "disk".
|
| The biggest limiter in the cloud right now is EBS/SAN. Sure,
| you can use local storage in AWS if you don't mind it
| disappearing, but while gp3 is an improvement, it pales next
| to stuff like this.
|
| Also, this is fascinating:
|
| "Take the write speeds with a grain of salt, as TLC & QLC cards
| have slower multi-bit writes into the main NAND area, but may
| have some DIMM memory for buffering writes and/or a "TurboWrite
| buffer" (as Samsung calls it) that uses part of the SSDs NAND
| as faster SLC storage. It's done by issuing single-bit "SLC-
| like" writes into TLC area. So, once you've filled up the "SLC"
| TurboWrite buffer at 5000 MB/s, you'll be bottlenecked by the
| TLC "main area" at 2000 MB/s (on the 1 TB disks)."
|
| I didn't know controllers could swap between TLC/QLC and SLC.
| tanelpoder wrote:
| I learned the last bit from here (Samsung Solid State Drive
| TurboWrite Technology pdf):
|
| https://images-eu.ssl-images-
| amazon.com/images/I/914ckzwNMpS...
| StillBored wrote:
| Yes, a number of articles about these newer TLC drives talk
| about it. The end result is that an empty drive is going to
| benchmark considerably differently from one that's 99% full of
| incompressible files.
|
| for example:
|
| https://www.tomshardware.com/uk/reviews/intel-
| ssd-660p-qlc-n...
| 1996 wrote:
| > I didn't know controllers could swap between TLC/QLC and
| SLC.
|
| I wish I could control the % of SLC. Even dividing a QLC
| drive's space by 16 makes it cheaper than buying a similarly
| sized SLC drive.
| 1MachineElf wrote:
| Reminds me of the Solid-State Drive checkbox that VirtualBox
| has for any VM disks. Checking it will make sure that the VM
| hardware emulation doesn't wait for the filesystem journal to
| be written, which would normally be advisable with spinning
| disks.
| digikata wrote:
| Not only the assumptions at the application layer, but
| potentially the filesystem too.
| [deleted]
| bob1029 wrote:
| I have personally found that making even the most primitive
| efforts at the single-writer principle and batching I/O in your
| software can make a difference of many orders of magnitude.
|
| Saturating an NVMe drive with a single x86 thread is trivial if
| you change how you play the game. Using async/await and
| yielding to the OS is not going to cut it anymore. Latency with
| these drives is measured in microseconds. You are better off
| doing microbatches of writes (10-1000 uS wide) and pushing
| these to disk with a single thread that monitors a queue in a
| busy wait loop (sort of like LMAX Disruptor but even more
| aggressive).
|
| Thinking about high core count parts, sacrificing an entire
| thread to busy waiting so you can write your transactions to
| disk very quickly is not a terrible prospect anymore. This same
| ideology is also really useful for ultra-precise execution of
| future timed actions. Approaches in managed languages like
| Task.Delay or even Thread.Sleep are insanely inaccurate by
| comparison. The humble while(true) loop is certainly not energy
| efficient, but it is very responsive and predictable as long as
| you don't ever yield. What's one core when you have 63 more to
| go around?
| pbalcer wrote:
| The authors of the article I linked to earlier came to the
| same conclusions. And so did the SPDK folks. And the kernel
| community (or axboe :)) when coming up with io_uring. I'm
| just hoping that we will see software catching up.
| mikepurvis wrote:
| Isn't the use or non-use of async/await a bit orthogonal to
| the rest of this?
|
| I'm not an expert in this area, but wouldn't it be just as
| lightweight to have your async workers pushing onto a queue,
| and then have your async writer only wake up when the queue
| is at a certain level to create the batched write? Either
| way, you won't be paying the OS context switching costs
| associated with blocking a write thread, which I think is
| most of what you're trying to get out of here.
| pbalcer wrote:
| Right, I agree. I'd go even further and say that
| async/await is a great fit for a modern _asynchronous_ I/O
| stack (not read()/write()). Especially with io_uring using
| polled I/O (the worker thread is in the kernel, all the
| async runtime has to do is check for completion
| periodically), or with SPDK if you spin up your own I/O
| worker thread(s) like @benlwalker explained elsewhere in
| the thread.
| tyingq wrote:
| I wonder if "huge pages" would make a difference, since some of
| the bottlenecks seemed to be lock contention on memory pages.
| tanelpoder wrote:
| The Linux page cache doesn't use hugepages, but when doing
| direct I/O into application buffers it would definitely make
| sense to use hugepages for those buffers. I plan to run tests
| on various database engines next - and many of them support
| using hugepages (for shared memory areas at least).
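|
| (fio can already exercise this for direct I/O buffers - a
| minimal sketch, assuming hugepages have been reserved up
| front; the device name and sizes are just examples:)
|
|       sysctl vm.nr_hugepages=1024    # reserve 1024 x 2MiB hugepages
|       fio --name=hugetest --filename=/dev/nvme0n1 --direct=1 \
|           --ioengine=io_uring --rw=randread --bs=4k \
|           --iodepth=32 --iomem=shmhuge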
| guerby wrote:
| In the networking world (DPDK), huge pages and statically
| pinning everything are a huge deal, as you have very few CPU
| cycles per network packet.
| tanelpoder wrote:
| Yep - and there's SPDK for direct NVMe storage access
| without going through the Linux block layer:
| https://spdk.io
|
| (it's in my TODO list too)
| tyingq wrote:
| Thanks! Apparently, they did add it for tmpfs, and discussed
| it for ext4. https://lwn.net/Articles/718102/
| tanelpoder wrote:
| Good point - something to test, once I get to the
| filesystem benchmarks!
| tyingq wrote:
| I'm somewhat curious what happens to the long-standing 4P/4U
| servers from companies like Dell and HP. Ryzen/EPYC has really
| made going past 2P/2U a rarer need.
| thinkingkong wrote:
| You might be able to buy a smaller server, but the rack density
| doesn't necessarily change. You still have to worry about
| cooling and power, so lots of DCs would have 1/4 or 1/2 racks.
| tyingq wrote:
| Sure. I wasn't really thinking of density, just the
| interesting start of the "death" of 4 socket servers. Being
| an old-timer, it's interesting to me because "typical
| database server" has been synonymous with 4P/4U for a long,
| long time.
| vinay_ys wrote:
| I haven't seen a 4 socket machine in a long time.
| wtallis wrote:
| I think at this point the only reasons to go beyond 2U are to
| make room for either 3.5" hard drives, or GPUs.
| rektide wrote:
| Would love to see some very dense blade-style Ryzen
| offerings. The 4x 2P nodes in 2U format is great. Good way to
| share power supplies, fans, chassis, and ideally a multi-homed
| NIC too.
|
| Turn those sleds into blades though, put them on their side, &
| go even denser. It should be a way to save costs, but density,
| alas, is a huge upsell, even though it should be a way to
| scale costs down.
| tanelpoder wrote:
| Indeed, 128 EPYC cores in 2 sockets (with 16 memory channels in
| total) will give a lot of power. I guess it's worth mentioning
| that the 64-core chips have a much lower clock rate than the
| 16/32-core ones though. And with some expensive software that's
| licensed by CPU core (Oracle), you'd want faster cores, but
| possibly pay a higher NUMA price when going with a single 4- or
| 8-socket machine for your "sacred monolith".
| StillBored wrote:
| There always seem to be buyers for more exotic high-end
| hardware. That market has been shrinking and expanding since
| basically the first computer, as mainstream machines become
| more capable and people discover more uses for large coherent
| machines.
|
| But users of 16-socket machines will just step down to 4-socket
| EPYC machines with 512 cores (or whatever). And someone else
| will realize that moving their "web scale" cluster from 5k
| machines down to a single machine with 16 sockets results in
| lower latency and less cost (or whatever).
| anarazel wrote:
| Have you checked whether using the fio options
| (--iodepth_batch_*) to batch submissions helps? Fio doesn't do
| that by default, and I found that it can be a significant
| benefit.
|
| In particular, submitting multiple requests at once can
| amortize the cost of ringing the NVMe doorbell (the expensive
| part, as far as I understand it) across multiple requests.
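|
| (A minimal sketch of what that looks like on the fio command
| line - device and sizes are just examples; the batch options
| control how many requests are submitted and reaped at a time:)
|
|       fio --name=batch --filename=/dev/nvme0n1 --direct=1 \
|           --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 \
|           --iodepth_batch_submit=8 --iodepth_batch_complete_max=8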
| tanelpoder wrote:
| I tested various fio options, but didn't notice this one - I'll
| check it out! It might explain why I still kept seeing lots of
| interrupts raised even though I had enabled the I/O completion
| polling instead, with io_uring's --hipri option.
|
| edit: I ran a quick test with various IO batch sizes and it
| didn't make a difference - I guess because thanks to using
| io_uring, my bottleneck is not in IO submission, but deeper in
| the block IO stack...
| wtallis wrote:
| I think on recent kernels, using the hipri option doesn't get
| you interrupt-free polled IO unless you've configured the
| nvme driver to allocate some queues specifically for polled
| IO. Since these Samsung drives support 128 queues and you're
| only using a 16C/32T processor, you have more than enough for
| each drive to have one poll queue and one regular IO queue
| allocated to each (virtual) CPU core.
| tanelpoder wrote:
| That would explain it. Do you recommend any docs/links I
| should read about allocating queues for polled IO?
| anarazel wrote:
| It's terribly documented :(. You need to set the
| nvme.poll_queues to the number of queues you want, before
| the disks are attached. I.e. either at boot, or you need
| to set the parameter and then cause the NVMe to be
| rescanned (you can do that in sysfs, but I can't
| immediately recall the steps with high confidence).
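|
| (The boot-time variant is the simpler one - a minimal sketch;
| the value 8 and the file name are just examples:)
|
|       # on the kernel command line:
|       nvme.poll_queues=8
|       # or, if nvme is built as a module, e.g. in
|       # /etc/modprobe.d/nvme.conf:
|       options nvme poll_queues=8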
| anarazel wrote:
| Ah, yes, shell history ftw. Of course you should ensure
| no filesystem is mounted or such:
| root@awork3:~# echo 4 > /sys/module/nvme/parameters/poll_queues
| root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
| root@awork3:~# dmesg -c
| [749717.253101] nvme nvme1: 12/0/4 default/read/poll queues
| root@awork3:~# echo 8 > /sys/module/nvme/parameters/poll_queues
| root@awork3:~# dmesg -c
| root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
| root@awork3:~# dmesg -c
| [749736.513102] nvme nvme1: 8/0/8 default/read/poll queues
| tanelpoder wrote:
| Thanks for the pointers, I'll bookmark this and will try
| it out someday.
| anarazel wrote:
| > I tested various fio options, but didn't notice this one -
| I'll check it out! It might explain why I still kept seeing
| lots of interrupts raised even though I had enabled the I/O
| completion polling instead, with io_uring's --hipri option.
|
| I think that should be independent.
|
| > edit: I ran a quick test with various IO batch sizes and it
| didn't make a difference - I guess because thanks to using
| io_uring, my bottleneck is not in IO submission, but deeper
| in the block IO stack...
|
| It probably won't get you drastically higher speeds in an
| isolated test - but it should help reduce CPU overhead. E.g.
| on one of my SSDs,
|
|       fio --ioengine io_uring --rw randread --filesize 50GB \
|           --invalidate=0 --name=test --direct=1 --bs=4k \
|           --numjobs=1 --registerfiles --fixedbufs \
|           --gtod_reduce=1 --iodepth 48
|
| uses about 25% more CPU than when I add
| --iodepth_batch_submit=0 --iodepth_batch_complete_max=0. But
| the resulting iops are nearly the same as long as there are
| enough cycles available.
|
| This is via filesystem, so ymmv, but the mechanism should be
| mostly independent.
| tanelpoder wrote:
| Author here: This article was intended to explain some modern
| hardware bottlenecks (and non-bottlenecks), but unexpectedly
| ended up covering a bunch of Linux kernel I/O stack issues as
| well :-) AMA
| jeffbee wrote:
| Great article, I learned! Can you tell me if you looked into
| aspects of the NVMe device itself, such as whether it supports
| 4K logical blocks instead of 512B? Use `nvme id-ns` to read out
| the supported logical block formats.
| tanelpoder wrote:
| Doesn't seem to support 4k out of the box? Some drives - like
| Intel Optane SSDs allow changing this in firmware (and
| reformatting) with a manufacturer's utility...
| $ lsblk -t /dev/nvme0n1
| NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
| nvme0n1         0    512      0     512     512    0 none     1023 128    0B
|
| $ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
| LBA Format  0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
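|
| (For reference, on drives that do list a 4KiB LBA format,
| switching is done with a destructive "nvme format", selecting
| the LBA format index that id-ns reports - a sketch, assuming
| hypothetically that index 1 were the 4K format; this erases
| the namespace:)
|
|       sudo nvme format /dev/nvme0n1 --lbaf=1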
| jeffbee wrote:
| Thanks for checking. SSD review sites never mention this
| important detail. For some reason the Samsung datacenter
| SSDs support the 4K LBA format, and they are very similar to
| the retail SSDs, which don't seem to. I have a retail 970 Evo
| that only provides 512.
| wtallis wrote:
| I just checked my logs, and none of Samsung's consumer
| NVMe drives have ever supported sector sizes other than
| 512B. They seem to view this feature as part of their
| product segmentation strategy.
|
| Some consumer SSD vendors do enable 4kB LBA support. I've
| seen it supported on consumer drives from WD, SK hynix
| and a variety of brands using Phison or SMI SSD
| controllers (including Kingston, Seagate, Corsair,
| Sabrent). But I haven't systematically checked to see
| which brands consistently support it.
| 1996 wrote:
| Is it genuine 512?
|
| As in, what ashift value do you use with zfs?
| wtallis wrote:
| Regardless of what sector size you configure the SSD to
| expose, the drive's flash translation layer still manages
| logical to physical mappings at a 4kB granularity, the
| underlying media page size is usually on the order of
| 16kB, and the erase block size is several MB. So what
| ashift value you want to use depends very much on what
| kind of tradeoffs you're okay with in terms of different
| aspects of performance and write endurance/write
| amplification. But for most flash-based SSDs, there's no
| reason to set ashift to anything less than 12
| (corresponding to 4kB blocks).
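|
| (For reference, ashift is set per vdev at creation time - a
| minimal sketch, with pool and device names as examples only:)
|
|       zpool create -o ashift=12 fastpool /dev/nvme0n1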
| guerby wrote:
| Here is an article about the nvme-cli tool:
|
| https://nvmexpress.org/open-source-nvme-management-
| utility-n...
|
| On a Samsung SSD 970 EVO 1TB it seems only 512-byte LBAs are
| supported:
|
| # nvme id-ns /dev/nvme0n1 -n 1 -H | grep "^LBA Format"
| LBA Format  0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
| rafaelturk wrote:
| Thanks for the well-written article; it makes me think about
| the inefficiencies in our over-hyped cloud environments.
| tanelpoder wrote:
| Oh yes - and incorrectly configured on-premises systems too!
| sitkack wrote:
| Could you explain some of your thought processes and
| methodologies when approaching problems like this?
|
| What is your mental model like? How much experimentation do you
| do verses reading kernel code? How do you know what questions
| to start asking?
|
| *edit, btw I understand that a response to these questions
| could be an entire book, you get the question-space.
| tanelpoder wrote:
| Good question. I don't ever read kernel code as a starting
| point, only if some profiling or tracing tool points me
| towards an interesting function or codepath. And interesting
| usually is something that takes most CPU in perf output or
| some function call with an unusually high latency in ftrace,
| bcc/bpftrace script output. Or just a stack trace in a core-
| or crashdump.
|
| As far as mindset goes - I try to apply the developer mindset
| to system performance. In other words, I don't use much of
| what I call the "old school sysadmin mindset", from a time
| where better tooling was not available. I don't use
| systemwide utilization or various get/hit ratios for doing
| "metric voodoo" of Unix wizards.
|
| The developer mindset dictates that everything you run is an
| application. JVM is an application. Kernel is an application.
| Postgres, Oracle are applications. All applications execute
| one or more threads that run on CPU or do not run on CPU.
| There are only two categories of reasons why a thread does
| not run on CPU (is sleeping): The OS put the thread to sleep
| (involuntary blocking) or the thread voluntarily wanted to go
| to sleep (for example, it realized it can't get some
| application-level lock).
|
| And you drill down from there. Your OS/system is just a bunch
| of threads running on CPU, sleeping and sometimes
| communicating with each other. You can _directly_ measure all
| of these things easily nowadays with profilers, no need for
| metric voodoo.
|
| I have written my own tools to complement things like perf,
| ftrace and BPF stuff - as a consultant I regularly see 10+
| year old Linux versions, etc - and I find sampling thread
| states from /proc file system is a really good (and flexible)
| starting point for system performance analysis and even some
| drilldown - all this without having to install new software
| or upgrading to latest kernels. Some of the tools I showed in
| my article too:
|
| https://tanelpoder.com/psnapper & https://0x.tools
|
| At the end of my post I mentioned that I'll do a webinar
| "hacking session" next Thursday - I'll show more of how I work
| there :-)
| vinay_ys wrote:
| Very cool rig and benchmark. Kudos. Request: add network I/O
| load to your benchmarking load while the NVMe I/O load is
| running.
| tanelpoder wrote:
| Thanks, will do in a future article! I could share the disks
| out via NFS or iSCSI or something and hammer them from a
| remote machine...
| PragmaticPulp wrote:
| This is a great article. Thanks for writing it up and sharing.
| guerby wrote:
| 71 GB/s is 568 Gbit/s so you'll need about 3 dual 100 Gbit/s
| cards to pump data out at the rate you can read it from the
| NVMe drives.
|
| And Ethernet (unless using jumbo frames on the LAN) is about
| 1.5 kB per frame (not 4 kB).
|
| One such PC should be able to do 100k simultaneous 5 Mbps HD
| streams.
|
| Testing this would be fun :)
| zamadatix wrote:
| Mellanox has a 2x200 Gbps NIC these days. Haven't gotten to
| play with it yet though.
| tanelpoder wrote:
| Which NICs would you recommend for me to buy for testing at
| least 1x100 Gbps (ideally 200 Gbps?) networking between
| this machine (PCIe 4.0) and an Intel Xeon one that I have
| with PCIe 3.0. Don't want to spend much money, so the cards
| don't need to be too enterprisey, just fast.
|
| And - do such cards even allow direct "cross" connection
| without a switch in between?
| drewg123 wrote:
| All 100G is enterprisey.
|
| For a cheap solution, I'd get a pair of used Mellanox
| ConnectX4 or Chelsio T6, and a QSFP28 direct attach
| copper cable.
| zamadatix wrote:
| +1 on what the sibling comment said.
|
| As for directly connecting them: absolutely, it works great.
| I'd recommend a cheap DAC off fs.com to connect them in that
| case.
| drewg123 wrote:
| At Netflix, I'm playing with an EPYC 7502P with 16 NVME and
| dual 2x100 Mellanox ConnectX6-DX NICs. With hardware kTLS
| offload, we're able to serve about 350Gb/s of real customer
| traffic. This goes down to about 240Gb/s when using software
| kTLS, due to memory bandwidth limits.
|
| This is all FreeBSD, and is the evolution of the work
| described in my talk at the last EuroBSDCon in 2019:
| https://papers.freebsd.org/2019/eurobsdcon/gallatin-
| numa_opt...
| ksec wrote:
| >we're able to serve about 350Gb/s of real customer
| traffic.
|
| I still remember the post about breaking the 100Gbps barrier -
| that was maybe in 2016 or '17? And it wasn't that long ago that
| it was 200Gbps, and if I remember correctly it was hitting the
| memory bandwidth barrier as well.
|
| And now 350Gbps?!
|
| So what's next? Wait for DDR5? Or moving to some memory
| controller black magic like POWER10?
| drewg123 wrote:
| Yes, before hardware inline kTLS offload, we were limited
| to 200Gb/s or so with Naples. With Rome, it's a bit higher.
| But hardware inline kTLS with the Mellanox CX6-DX eliminates
| memory bandwidth as a bottleneck.
|
| The current bottleneck is IO-related, and it's unclear what
| the issue is. We're working with the hardware vendors to try
| to figure it out. We should be getting about 390Gb/s.
| ksec wrote:
| Oh wow! Can't wait to hear more about it.
| tanelpoder wrote:
| I should (finally) receive my RTX 3090 card today (PCIe 4.0
| too!), I guess here goes my weekend (and the following
| weekends over a couple of years)!
| tarasglek wrote:
| You should look at CPU usage. There is a good chance all your
| interrupts are hitting CPU 0. You can run hwloc to see what
| chiplet the PCIe cards are on and handle interrupts on those
| cores.
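|
| (A minimal sketch of how to look at that - the grep pattern is
| just an example:)
|
|       lstopo-no-graphics            # hwloc: PCIe devices vs NUMA nodes
|       numactl --hardware            # NUMA nodes, their CPUs and memory
|       grep nvme /proc/interrupts    # which CPUs take the NVMe interrupts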
| jeffbee wrote:
| Why would that happen with the Linux NVMe stack, which puts a
| completion queue on each CPU?
| wtallis wrote:
| I think that in addition to allocating a queue per CPU, you
| need to be able to allocate a MSI(-X) vector per CPU. That
| shouldn't be a problem for the Samsung 980 PRO, since it
| supports 128 queues and 130 interrupt vectors.
| tanelpoder wrote:
| Thanks for the "hwloc" tip. I hadn't thought about that.
|
| I was thinking of doing something like that. Weirdly I got
| sustained throughput differences when I killed & restarted
| fio. So, if I got 11M IOPS, it stayed at that level until I
| killed fio & restarted. If I got 10.8M next, it stayed like
| it until I killed & restarted it.
|
| This makes me think that I'm hitting some PCIe/memory
| bottleneck, dependent on process placement (which process
| happens to need to move data across infinity fabric due to
| accessing data through a "remote" PCIe root complex or
| something like that). But then I realized that Zen 2 has a
| central IO hub again, so there shouldn't be a "far edge of
| I/O" like on current gen Intel CPUs (?)
|
| But there's definitely some workload placement and
| I/O-memory-interrupt affinity that I've wanted to look into.
| I could even enable the NUMA-like-mode from BIOS, but again
| with Zen 2, the memory access goes through the central
| infinity-fabric chip too, I understand, so not sure if
| there's any value in trying to achieve memory locality for
| individual chiplets on this platform (?)
| wtallis wrote:
| The PCIe is all on a single IO die, but internally it is
| organized into quadrants that can produce some NUMA
| effects. So it is probably worth trying out the motherboard
| firmware settings to expose your CPU as multiple NUMA
| nodes, and using the FIO options to allocate memory only on
| the local node, and restricting execution to the right
| cores.
| tanelpoder wrote:
| Yep, I enabled the "numa-like-awareness" in BIOS and ran
| a few quick tests to see whether the NUMA-aware
| scheduler/NUMA balancing would do the right thing and
| migrate processes closer to their memory over time, but
| didn't notice any benefit. But yep I haven't manually
| locked down the execution and memory placement yet. This
| placement may well explain why I saw some ~5% throughput
| fluctuations _only if killing & restarting fio_ and not
| while the same test was running.
| syoc wrote:
| I have done some tests on AMD servers and the Linux
| scheduler does a pretty good job. I do however get
| noticeably (a couple percent) better performance by
| forcing the process to run on the correct NUMA node.
|
| Make sure you get as many NUMA domains as possible in
| your BIOS settings.
|
| I recommend using numactl with the cpu-exclusive and mem-
| exclusive flags. I have noticed a slight performance drop
| when the RAM cache fills beyond the sticks local to the
| CPUs doing the work.
|
| One last comment is that you mentioned interrupts being
| "striped" among CPUs. I would recommend pinning the
| interrupts from one disk to one NUMA-local CPU and using
| numactl to run fio for that disk on the same CPU. An
| additional experiment, if you have enough cores, is to
| pin interrupts to CPUs local to the disk but use other
| cores on the same NUMA node for fio. That has been my
| most successful setup so far.
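|
| (A minimal sketch of that kind of pinning - node, IRQ and
| device numbers are purely illustrative:)
|
|       # run fio on NUMA node 0's cores and memory only
|       numactl --cpunodebind=0 --membind=0 fio --name=pin \
|           --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring \
|           --rw=randread --bs=4k --iodepth=32
|       # pin one disk's interrupts to a CPU on the same node
|       echo 2 > /proc/irq/123/smp_affinity_list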
| ksec wrote:
| I just love this article, especially when the norm is always
| scaling out instead of scaling up. We can have 128-core CPUs,
| 2TB of memory and PCIe 4.0 SSDs (and soon PCIe 5.0). We could
| even fit a _petabyte_ of SSD storage in 1U.
|
| I remember WhatsApp used to operate its _500M_ users with only
| a dozen or so large FreeBSD boxes. (Only to be taken apart by
| Facebook.)
|
| So thank you for raising awareness. Hopefully the pendulum is
| swinging back to conceptually simple designs.
|
| >I also have a 380 GB Intel Optane 905P SSD for low latency
| writes
|
| I would love to see that. Although I am waiting for someone to
| do a review of the Optane SSD P5800X [1]: random 4K IOPS up to
| 1.5M with lower than 6 _us_ latency.
|
| [1] https://www.servethehome.com/new-intel-
| optane-p5800x-100-dwp...
| texasbigdata wrote:
| Second on Optane.
| phkahler wrote:
| >> I remember WhatsApp used to operate its 500M user with
| only a dozen of large FreeBSD boxes.
|
| With 1TB of RAM you can keep well over 100 bytes in memory for
| every person on earth. With SSDs as virtual memory, or by
| keeping an index in RAM, you can do meaningful work in real
| time, probably as fast as the network will allow.
| rektide wrote:
| Intel killing off prosumer Optane 2 weeks ago[1] made me so
| so so sad.
|
| The new P5800X should be sick.
|
| [1] https://news.ycombinator.com/item?id=25805779
| KaiserPro wrote:
| Excellent write up.
|
| I used to work for a VFX company in 2008. At that point we used
| Lustre to get high-throughput file storage.
|
| From memory we had something like 20 racks of servers/disks to
| get 3-6 GB/s of (sustained) throughput on a 300TB filesystem.
|
| It is hilarious to think that a 2U box can now theoretically
| saturate 2x100Gbit NICs.
| qaq wrote:
| Would be cool to see pgbench score for this setup
| namero999 wrote:
| You should be farming Chia on that thing [0]
|
| Amazing, congrats!
|
| [0] https://github.com/Chia-Network/chia-blockchain/wiki/FAQ
| jayonsoftware1 wrote:
| https://www.asus.com/us/Motherboard-Accessories/HYPER-M-2-X1...
| vs https://highpoint-tech.com/USA_new/nvme_raid_controllers.htm .
| One card is about 10x as expensive, but it looks like the
| performance is the same. Am I missing something?
| tanelpoder wrote:
| The ASUS one doesn't have its own RAID controller or PCIe
| switch onboard. It relies on motherboard-provided PCIe
| bifurcation, and if using hardware RAID, it'd use AMD's
| built-in RAID solution (but I'll use software RAID via Linux
| dm/md). The HighPoint SSD7500 seems to have a proprietary RAID
| controller built into it, and some management/monitoring
| features too (it's the "somewhat enterprisey" version).
| wtallis wrote:
| The HighPoint card doesn't have a hardware RAID controller,
| just a PCIe switch and an option ROM providing boot support
| for their software RAID.
|
| PCIe switch chips were affordable in the PCIe 2.0 era when
| multi-GPU gaming setups were popular, but Broadcom decided to
| price them out of the consumer market for PCIe 3 and later.
| tanelpoder wrote:
| Ok, thanks, good to know. I misunderstood from their
| website.
| rektide wrote:
| PCIe switches getting expensive really sucks.
| qaq wrote:
| Now price this in terms of AWS and marvel at the markup
| speedgoose wrote:
| I'm afraid Jeff Bezos himself couldn't afford such IOs on AWS.
| nwmcsween wrote:
| So Linus was wrong in his rant to Dave about the page cache being
| detrimental on fast devices.
| ogrisel wrote:
| As a nitpicking person, I really like to read a post that does
| not confuse GB/s with GiB/s :)
|
| https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
| ogrisel wrote:
| Actually now I realize that the title and the intro paragraph
| contradict each other...
| tanelpoder wrote:
| Yeah, I used the formally incorrect GB in the title when I
| tried to make it look as simple as possible... GiB just
| didn't look as nice in the "marketing copy" :-)
|
| I may have missed using the right unit in some other sections
| too. At least I hope that I've conveyed that there's a
| difference!
___________________________________________________________________
(page generated 2021-01-29 23:00 UTC)