[HN Gopher] Achieving 11M IOPS and 66 GB/S IO on a Single Thread...
___________________________________________________________________
Achieving 11M IOPS and 66 GB/S IO on a Single ThreadRipper
Workstation
Author : tanelpoder
Score : 429 points
Date : 2021-01-29 12:45 UTC (1 day ago)
(HTM) web link (tanelpoder.com)
(TXT) w3m dump (tanelpoder.com)
| secondcoming wrote:
| > For final tests, I even disabled the frequent gettimeofday
| system calls that are used for I/O latency measurement
|
| I was knocking up some profiling code and measured the
| performance of gettimeofday as a proof-of-concept test.
|
| The performance difference between running the test on my
| personal desktop Linux VM versus running it on a cloud instance
| Linux VM was quite interesting (cloud was worse)
|
| I think I read somewhere that cloud instances cannot use the VDSO
| code path because your app may be moved to a different machine.
| My recollection of the reason is somewhat cloudy.
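A minimal sketch of the kind of proof-of-concept the commenter describes, using Python's time.time() as a stand-in for gettimeofday (on Linux it is backed by a clock_gettime-style call that normally goes through the vDSO); the absolute numbers are illustrative only:

```python
# Hypothetical micro-benchmark: estimate the per-call cost of a wall-clock
# read. On bare-metal Linux this usually stays in user space via the vDSO;
# on some cloud VMs the clocksource forces a real syscall and the per-call
# cost jumps noticeably.
import time

def ns_per_time_call(iterations=1_000_000):
    start = time.perf_counter_ns()
    for _ in range(iterations):
        time.time()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations

if __name__ == "__main__":
    print(f"~{ns_per_time_call():.0f} ns per wall-clock read")
```

Running the same script on a local machine and on a cloud VM reproduces the comparison described above.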
| ashkankiani wrote:
| When I bought a bunch of NVMe drives, I was disappointed by the
| maximum speed I could achieve with them, given my knowledge and
| the time I had available. Thanks for making this post to give me
| more insight into the problem.
|
| I'm on the same page with your thesis that "hardware is fast and
| clusters are usually overkill," and disk I/O was a piece that I
| hadn't really figured out yet despite making great strides in the
| software engineering side of things. I'm trying to make a startup
| this year and disk I/O will actually be a huge factor in how far
| I can scale without bursting costs for my application. Good
| stuff!
| chx wrote:
| A terabyte of RAM on your desktop.
|
| It was less than a quarter century ago, in 1997, that Microsoft
| and Compaq launched the TerraServer, whose name was wordplay on
| terabyte -- it stored a terabyte of data and that was a Big Deal.
| Today that's not storage, that's main RAM, unencumbered by NUMA.
| whalesalad wrote:
| This post is fantastic. I wish there was more workstation porn
| like this for those of us who are not into the RGB light show
| ripjaw hacksaw aorus elite novelty stuff that gamers are so into.
| Benchmarks in the community are almost universally focused on
| gaming performance and FPS.
|
| I want to build an epic rig that will last a long time with
| professional grade hardware (with ECC memory for instance) and
| would love to get a lot of the bleeding-edge stuff without
| compromising on durability. Where do these people hang out
| online?
| piinbinary wrote:
| The level1techs forums seems to have a lot of people with
| similar interests
| [deleted]
| greggyb wrote:
| STH: https://www.youtube.com/user/ServeTheHomeVideo
| https://www.servethehome.com/
|
| GamersNexus (despite the name, they include a good amount of
| non-gaming benchmarks, and they have great content on cases and
| cooling): https://www.youtube.com/user/GamersNexus
| https://www.gamersnexus.net/
|
| Level1Techs (mentioned in another reply):
| https://www.youtube.com/c/Level1Techs
| https://www.level1techs.com/
|
| r/homelab (and all the subreddits listed in its sidebar):
| https://www.reddit.com/r/homelab/
|
| Even LinusTechTips has some decent content for server hardware,
| though they stay fairly superficial. And the forum definitely
| has people who can help out: https://linustechtips.com/
|
| And the thing is, depending on what metric you judge
| performance by, the enthusiast hardware may very well
| outperform the server hardware. For something that is sensitive
| to memory, e.g., you can get much faster RAM in enthusiast SKUs
| (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than
| you'll find in server hardware. Similarly, the HEDT SKUs out-
| clock the server SKUs for both Intel and AMD.
|
| I have a Threadripper system that outperforms most servers I
| work with on a daily basis, because most of my workloads,
| despite being multi-threaded, are sensitive to clockspeed.
| 1996 wrote:
| Indeed, serious people now use gamer computer parts because
| it's just faster!
| greggyb wrote:
| It's not "just faster".
|
| No one's using "gamer NICs" for high speed networking. Top
| of the line "gaming" networking is 802.11ax or 10GbE.
| 2x200Gb/s NICs are available now.
|
| Gaming parts are strictly single socket - software that can
| take advantage of >64 cores will need server hardware -
| either one of the giant Ampere ARM CPUs or a 2+ socket
| system.
|
| If something must run in RAM and needs TB of RAM, well then
| it's not even a question of faster or slower. The
| capability only exists on server platforms.
|
| _Some_ workloads will benefit from the performance
| characteristics of consumer hardware.
| mgerdts wrote:
| The desktop used in TFA supports 1 TB of RAM.
|
| https://www.lenovo.com/us/en/thinkstation-p620
| greggyb wrote:
| Workstations and desktops are distinct market segments.
| The machine in the article uses a workstation platform.
| And the workstation processors available in that Lenovo
| machine clock slower than something like a 5950X
| mainstream processor. The RDIMMs you need to get to 1 TB
| in the machine run much slower than the UDIMMs I linked
| above.
| vmception wrote:
| > RGB light show ripjaw hacksaw aorus elite novelty stuff
|
| haha yeah I bought a whole computer from someone and was
| wondering why the RAM looked like rupees from Zelda
|
| apparently that is common now
|
| but at least I'm not cosplaying as a karate day trader for my
| Wall Street Journal exposé
| ecf wrote:
| Because flashy RGB is the default mode used for marketing
| purposes.
|
| I'm not trying to be snarky here but you can always just turn
| off the lights or set it to be a solid color of your
| preference.
| philsnow wrote:
| I'm with you on this, I just built a (much more modest than the
| article's) workstation/homelab machine a few months ago, to
| replace my previous one which was going on 10 years old and
| showing its age.
|
| There's some folks in /r/homelab who are into this kind of
| thing, and I used their advice a fair bit in my build. While it
| is kind of mixed (there's a lot of people who build pi clusters
| as their homelab), there's still plenty of people who buy
| decommissioned "enterprise" hardware and make monstrous-for-
| home-use things.
| COGlory wrote:
| Check out Wendell Wilson, of Level1Techs on YouTube
| (https://www.youtube.com/channel/UCOWcZ6Wicl-1N34H0zZe38w or
| https://www.youtube.com/user/teksyndicate), and
| https://forum.level1techs.com
| deagle50 wrote:
| Happy to help if you want feedback. Servethehome forums are
| also a great resource of info and used hardware, probably the
| best community for your needs.
| gigatexal wrote:
| +1 to ServeTheHome, the forums have some of the nicest and
| smartest people I've ever met online.
|
| HardForum is cool too
| arminiusreturns wrote:
| Check out HardForum. Lots of very knowledgeable people there
| helped me mature my hardware-level knowledge back when I was
| building 4-CPU, 64-core Opteron systems. Also decent banter.
| tomc1985 wrote:
| Though I have to wonder.... would these be good gaming systems?
| Are there any scenarios where the perks (stupid numbers of
| cores, 8-channel memory, 128 PCI-E lanes, etc) would help?
| sp332 wrote:
| Not many games are written to scale out that far. I remember
| Ashes of the Singularity was used to showcase Ryzen CPUs
| though.
| eldelshell wrote:
| Not really. Gaming is bound by latency and rendering, not
| scale or bandwidth. Memory and IO usage is pretty constant
| while the game is running.
| tanelpoder wrote:
| Thanks! In case you're interested in building a ThreadRipper
| Pro WX-based system like mine, then AMD apparently starts
| selling the CPUs independently from March 2021 onwards:
|
| https://www.anandtech.com/show/16396/the-amd-wrx80-chipset-a...
|
| Previously you could only get this CPU when buying the Lenovo
| ThinkStation P620 machine. I'm pretty happy with Lenovo
| Thinkstations though (I bought a P920 with dual Xeons 2.5 years
| ago)
| ksec wrote:
| And a just-in-time article:
|
| https://www.anandtech.com/show/16462/hands-on-with-the-
| asus-...
|
| I guess I should submit this on HN as well.
|
| Edit: I was getting ahead of myself; I thought these were
| for TR Pro with Zen 3. Turns out they're not out yet.
| walrus01 wrote:
| My only quibble with that board is that I worry about how
| easily the chipset fan can be replaced. In my experience
| that exact type of fan inevitably fails in a moderately
| dusty environment... and it doesn't look like you could
| swap in any of the common industry-standard 40mm or 60mm
| 12VDC fans that come in various thicknesses.
| opencl wrote:
| Supermicro's WRX80 motherboard looks like it has an
| easily replaceable chipset fan, not sure about the
| Gigabyte one.
|
| https://www.anandtech.com/show/16396/the-amd-
| wrx80-chipset-a...
| zhdc1 wrote:
| Look at purchasing used enterprise hardware. You can buy a
| reliable X9 or X10 generation Supermicro server (rack or tower)
| for a couple hundred dollars.
| igorstellar wrote:
| The downside of buying enterprise for home use is noise: their
| turbofan coolers are insanely loud, while consumer-grade 120mm
| coolers (Noctua et al) are nearly silent.
| jordanbeiber wrote:
| It's mostly about casing though - density is important with
| enterprise stuff, and noise level is almost irrelevant
| hence small chassis with small, loud, fans.
|
| I've got a 3.5" x16 bay gooxi chassis that I've put a
| supermicro mb + xeon in.
|
| Something like this:
|
| https://www.xcase.co.uk/collections/3u-rackmount-
| cases/produ...
|
| I got this specific NAS chassis because it has a fan wall
| with 3x120mm fans, not because I need the bays.
|
| With a few rather cool SSD's for storage and quiet noctua
| fans it is barely a whisper.
|
| Also - vertical rack mounting behind a closet door! I can
| have a massive chassis that takes up basically no space at
| all. Can't believe I didn't figure that one out earlier...
| leptons wrote:
| Noise isn't the only downside - the power they consume can
| cost $$$. These things aren't typically the most energy
| efficient machines.
| girvo wrote:
| Would swapping them out for Noctuas be difficult?
| cfn wrote:
| Mostly yes, because server chassis are very compact and
| sometimes use proprietary connectors and fans. Still,
| many people have done it with good results; have a look
| on YouTube to see which server models are best suited
| to that kind of customization.
| alvern wrote:
| I've not been successful trying this with HPE servers.
| Most server fans (Foxconn/Delta) run 2.8 amp or higher.
| Not aware of any "silent" gaming grade fans that use more
| than 0.38 amps. That's not even considering the CFM.
| weehoo wrote:
| Why would current be relevant here? Shouldn't operating
| voltage be the only thing that matters?
| sokoloff wrote:
| Amps * Volts is power. Power is a proxy (a moderately
| good one) for air movement (a mix of volume/mass at a
| specific [back-]pressure).
|
| It's not likely that a silent 2W fan will move a similar
| amount of air as the stock 14W fans. The enterprise gear
| from HPE is pretty well engineered; I'm skeptical that
| they over-designed the fans by a 7x factor.
|
| Operating voltage tells you "this fan won't burn up when
| you plug it in". It doesn't tell you "will keep the
| components cool".
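The arithmetic behind that skepticism, using the amp figures quoted upthread and an assumed 12 V supply:

```python
# Fan power comparison. 12 V rails are an assumption; the current draws are
# the figures from the comments above.
volts = 12.0
server_fan_watts = 2.8 * volts    # typical Delta/Foxconn server fan
quiet_fan_watts = 0.38 * volts    # high-end "silent" consumer fan
ratio = server_fan_watts / quiet_fan_watts

print(f"server fan ~{server_fan_watts:.1f} W, "
      f"quiet fan ~{quiet_fan_watts:.1f} W, ~{ratio:.1f}x gap")
```

That works out to roughly 34 W vs 5 W, which is where the "7x factor" above comes from.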
| smartbit wrote:
| Another downside is power consumption at rest. A Supermicro
| board with 2x Xeon uses 80 W at minimum. Add a 10Gbit
| switch and a few more peripherals and you're looking at an
| _additional_ EUR 80/month electricity bill. _Year after
| year_ , that is EUR 10,000 in 10 years.
|
| Of course that is nothing compared to what you'd pay at
| Google/Azure/AWS for the AMD machine of this news item :-)
|
| 12V-_only_ PSUs like OEMs use, or ATX12VO, in combination
| with a motherboard without IPMI (similar to the German
| Fujitsu motherboards), have significantly lower power
| consumption at rest -- somewhere around 8-10 W without HDDs.
| Much better for home use IMHO.
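To make the idle-power math concrete (the EUR 0.30/kWh tariff is my assumption, not a figure from the thread):

```python
# Back-of-envelope: monthly cost of a constant electrical load at an
# assumed tariff of EUR 0.30/kWh.
def monthly_cost_eur(watts, eur_per_kwh=0.30, hours=24 * 30):
    return watts / 1000 * hours * eur_per_kwh

print(f"80 W board alone:      EUR {monthly_cost_eur(80):.0f}/month")
print(f"~370 W with periphery: EUR {monthly_cost_eur(370):.0f}/month, "
      f"EUR {monthly_cost_eur(370) * 120:.0f} over 10 years")
```

At roughly 370 W of total constant draw the numbers line up with the EUR 80/month and EUR 10,000/decade cited above.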
| [deleted]
| devonkim wrote:
| In the US, electricity rates are typically much cheaper
| than in the EU. My rate is roughly 0.08 EUR/kWh, for
| example, and I don't get any subsidies to convert to
| solar, so there's no way it would pay off for me within
| 15 years (longer than most people expect to stay in a
| home here). Other US states subsidize solar so heavily,
| or have such high electricity rates, that most people
| have panels (see Hawaii, with among the highest
| electricity costs in the US).
|
| Regardless of electricity cost, all that electricity
| usage winds up with a lot of heat in a dwelling. To help
| offset the energy consumption in the future I plan to use
| a hybrid water heater that can act as a heat pump and
| dehumidifier and capture the excess heat as a way to
| reduce energy consumption for hot water.
| ashkankiani wrote:
| I've been planning to do this, but enterprise hardware seems
| like it requires a completely different set of knowledge on
| how to purchase it and maintain it, and especially as a
| consumer.
|
| It's not quite as trivial of a barrier to entry as consumer
| desktops, but I suppose that's the point. Still, it would be
| nice if there was a guide that could help me make good
| decisions to start.
| bombcar wrote:
| It's actually much EASIER in my experience - enterprise gear
| is made to be easily repairable, and most if not all parts
| can be swapped without tools.
|
| Loud though - although most of them run pretty quiet when
| not doing anything.
| jqcoffey wrote:
| Also, purpose built data center chassis are designed for
| high airflow and are thus really quite loud.
| modoc wrote:
| Very true. I have a single rack mount server in my HVAC
| room, and it's still so loud I had to glue soundproofing
| foam on the nearby walls:)
| gogopuppygogo wrote:
| Most people who get into home labs spend some time on
| research and throw some money at gaining an education.
|
| Compute is so cheap second hand.
| [deleted]
| benlwalker wrote:
| Plug for a post I wrote a few years ago demonstrating nearly the
| same result but using only a single CPU core:
| https://spdk.io/news/2019/05/06/nvme/
|
| This is using SPDK to eliminate all of the overhead the author
| identified. The hardware is far more capable than most people
| expect, if the software would just get out of the way.
| tanelpoder wrote:
| Yes I had seen that one (even more impressive!)
|
| When I have more time again, I'll run fio with the SPDK plugin
| on my kit too. And would be interested in seeing what happens
| when doing 512B random I/Os?
| benlwalker wrote:
| The system that was tested there was PCIe bandwidth
| constrained because this was a few years ago. With your
| system, it'll get a bigger number - probably 14 or 15 million
| 4KiB IO per second per core.
|
| But while SPDK does have an fio plug-in, unfortunately you
| won't see numbers like that with fio. There's way too much
| overhead in the tool itself. We can't get beyond 3 to 4
| million with that. We rolled our own benchmarking tool in
| SPDK so we can actually measure the software we produce.
|
| Since the core is CPU bound, 512B IO are going to net the
| same IO per second as 4k. The software overhead in SPDK is
| fixed per IO, regardless of size. You can also run more
| threads with SPDK than just one - it has no locks or cross
| thread communication so it scales linearly with additional
| threads. You can push systems to 80-100M IO per second if you
| have disks and bandwidth that can handle it.
| StillBored wrote:
| Yah this has been going on for a while. Before SPDK it was
| with custom kernel bypasses and fast inifiband/FC arrays. I
| was involved with a similar project in the early 2000's.
| Where at the time the bottleneck was the shared xeon bus,
| and then it moved to the PCIe bus with opterons/nehalem+.
| In our case we ended up spending a lot of time tuning the
| application to avoid cross socket communication as well
| since that could become a big deal (of course after careful
| card placement).
|
| But SPDK has a problem you don't have with bypasses and
| io_uring, in that it needs the IOMMU enabled, and that can
| itself become a bottleneck. There are also issues for some
| applications that want to use interrupts rather than poll
| everything.
|
| What's really nice about io_uring is that it sort of
| standardizes a large part of what people were doing with
| bypasses.
| tanelpoder wrote:
| Yeah, that's what I wondered - I'm OK with using multiple
| cores; would I get even more IOPS when doing smaller I/Os?
| Is the benchmark suite you used part of the SPDK toolkit
| (and easy enough to run)?
| benlwalker wrote:
| Whether you get more IOPs with smaller I/Os depends on a
| number of things. Most drives these days are natively
| 4KiB blocks and are emulating 512B sectors for backward
| compatibility. This emulation means that 512B writes are
| often quite slow - probably slower than writing 4KiB
| (with 4KiB alignment). But 512B reads are typically very
| fast. On Optane drives this may not be true because the
| media works entirely differently - those may be able to
| do native 512B writes. Talk to the device vendor to get
| the real answer.
|
| For at least reads, if you don't hit a CPU limit you'll
| get 8x more IOPS with 512B than you will with 4KiB with
| SPDK. It's more or less perfect scaling. There's some
| additional hardware overheads in the MMU and PCIe
| subsystems with 512B because you're sending more messages
| for the same bandwidth, but my experience has been that
| it is mostly negligible.
|
| The benchmark builds to build/examples/perf and you can
| just run it with -h to get the help output. Random 4KiB
| reads at 32 QD to all available NVMe devices (all devices
| unbound from the kernel and rebound to vfio-pci) for 60
| seconds would be something like:
|
| perf -q 32 -o 4096 -w randread -t 60
|
| You can specify only test specific devices with the -r
| parameter (by BUS:DEVICE:FUNCTION essentially). The tool
| can also benchmark kernel devices. Using -R will turn on
| io_uring (otherwise it uses libaio), and you simply list
| the block devices on the command line after the base
| options like this:
|
| perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
|
| You can get ahold of help from the SPDK community at
| https://spdk.io/community. There will be lots of people
| willing to help.
|
| Excellent post by the way. I really enjoyed it.
| tanelpoder wrote:
| Thanks! Will add this to TODO list too.
| rektide wrote:
| Nice follow-up @tanelpoder to "RAM is the new disk" (2015)[1],
| which we talked about not even two weeks ago!
|
| I was quite surprised to hear in that thread that AMD's
| Infinity Fabric was so oversubscribed. There's 256 GB/s of
| PCIe on a 1P system, but it seems like this 66 GB/s is all
| the fabric can do. A little under a 4:1 oversubscription!
|
| [1] https://news.ycombinator.com/item?id=25863093
| electricshampo1 wrote:
| 66GBps is from each of 10 drives doing ~6.6 GBps; don't think
| the infinity fabric is the limiter here
| rektide wrote:
| I'd been going off this link[1] from the previous "RAM is the
| new disk" thread, but I think last time I read it I'd only
| counted one Infinity Fabric InterSocket link on the 1P diagram
| (which provides the PCIe). On review, I'm willing to bet the
| PCIe lanes aren't all sharing the one IFIS. The diagram is
| there to give an idea, not the actual configuration.
|
| [1] https://en.wikichip.org/wiki/amd/infinity_fabric#Scalable
| _Da...
| [deleted]
| muro wrote:
| This article was great, thanks for sharing!
|
| Does anyone have advice on optimizing a Windows 10 system? I
| have a Haswell workstation (E5-1680 v3) that I find reasonably
| fast and that works very well under Linux. In Windows, I get
| lost. I ran the UserBenchmark suite, which told me I'm below
| median for most of my components. Is there any good advice on
| how to improve that? Which tools give good insight into what
| the machine is doing under Windows? I'd like to try optimizing
| what I have before upgrading to the new shiny :).
| RobLach wrote:
| Excellent article. Worth a read even if you're not maxing IO.
| tutfbhuf wrote:
| This is a very synthetic fio benchmark; I would like to see how
| actual applications like a Postgres database would perform on
| such a tuned machine.
| tanelpoder wrote:
| Yep, some "real" workload tests are coming next (using
| filesystems). I wanted to start from low level basics and later
| build on top of that.
| wiradikusuma wrote:
| I've been thinking about this. Would traditional co-location
| (e.g. 2x 2U from DELL) in a local data center be cheaper if e.g.
| you're serving local (country-wise) market?
| derefr wrote:
| Depends on how long you need the server, and the ownership
| model you've chosen to pursue for it.
|
| If you _purchase_ a server and stick it in a co-lo somewhere,
| and your business plans to exist for 10+ years -- well, is that
| server still going to be powering your business 10 years from
| now? Or will you have moved its workloads to something newer?
| If so, you'll probably want to decommission and sell the
| server at some point. The time required to deal with that might
| not be worth the labor costs of your highly-paid engineers.
| Which means you might not actually end up re-capturing the
| depreciated value of the server, but instead will just let it
| rot on the shelf, or dispose of it as e-waste.
|
| Hardware _leasing_ is a lot simpler. When you lease servers
| from an OEM like Dell, there's a quick, well-known path to
| getting the EOLed hardware shipped back to Dell and the
| depreciated value paid back out to you.
|
| And, of course, hardware _renting_ is simpler still. Renting
| the hardware of the co-lo (i.e. "bare-metal unmanaged server"
| hosting plans) means never having to worry about the CapEx of
| the hardware in the first place. You just walk away at the end
| of your term. But, of course, that's when you start paying
| premiums on top of the hardware.
|
| Renting VMs, then, is like renting hardware on a micro-scale;
| you never have to think about what you're running on, as --
| presuming your workload isn't welded to particular machine
| features like GPUs or local SSDs -- you'll tend to
| automatically get migrated to newer hypervisor hardware
| generations as they become available.
|
| When you work it out in terms of "ten years of ops-staff labor
| costs of dealing with generational migrations and sell-offs"
| vs. "ten years of premiums charged by hosting rentiers", the
| pricing is surprisingly comparable. (In fact, this is basically
| the math hosting providers use to figure out what they _can_
| charge without scaring away their large enterprise customers,
| who are fully capable of taking a better deal if there is one.)
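A toy version of that own-vs-rent comparison; every number below is hypothetical, and only the shape of the calculation matters:

```python
# Hypothetical 10-year TCO comparison: owned hardware in a co-lo vs.
# renting bare metal. All figures are made up for illustration.
MONTHS = 120

def own_tco(server_cost, colo_per_month, ops_hours, ops_rate):
    # purchase + co-lo fees + ops-staff time for migrations/sell-offs
    return server_cost + colo_per_month * MONTHS + ops_hours * ops_rate

def rent_tco(rent_per_month):
    # premiums only; CapEx and disposal are the provider's problem
    return rent_per_month * MONTHS

owned = own_tco(server_cost=15_000, colo_per_month=150,
                ops_hours=80, ops_rate=150)
rented = rent_tco(rent_per_month=450)
print(f"own: ${owned:,}   rent: ${rented:,}")
```

With these made-up inputs the two totals land within about 20% of each other, which is the "surprisingly comparable" outcome the comment describes.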
| rodgerd wrote:
| > If you purchase a server and stick it in a co-lo somewhere,
| and your business plans to exist for 10+ years -- well, is
| that server still going to be powering your business 10 years
| from now? Or will you have moved its workloads to something
| newer?
|
| Which, if you have even the remotest fiscal competence,
| you'll have funded by using the depreciation of the book
| value of the asset after 3 years.
| 37ef_ced3 wrote:
| Somebody please tell me how many ResNet50 inferences you can do
| per second on one of these chips
|
| Here is the standalone AVX-512 ResNet50 code (C99 .h and .c
| files):
|
| https://nn-512.com/browse/ResNet50
|
| Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible
| wyldfire wrote:
| Whoa, this code looks interesting. Must've been emitted by
| something higher-level? Something like PyTorch/TF/MLIR/TVM/Glow
| maybe?
|
| If that is the case, then maybe it could be emitted again while
| masking the instruction sets Ryzen doesn't support yet.
| tanelpoder wrote:
| You mean on the CPU, right? This CPU doesn't support AVX-512:
| $ grep ^flags /proc/cpuinfo | sed 's/ /\n/g' \
|     | egrep "avx|sse|popcnt" | sort | uniq
| avx
| avx2
| misalignsse
| popcnt
| sse
| sse2
| sse4_1
| sse4_2
| sse4a
| ssse3
|
| What compile/build options should I use?
| 37ef_ced3 wrote:
| No AVX-512, forget it then
| xxpor wrote:
| They don't have avx512 instructions.
| qaq wrote:
| Now honestly say for how long two boxes like this behind a load
| balancer would be more than enough for your startup.
| pbalcer wrote:
| What I find interesting about the performance of this type of
| hardware is how it affects the software we are using for storage.
| The article talked about how the Linux kernel just can't keep up,
| but what about databases or KV stores? Are the trade-offs those
| types of solutions make still valid for this type of hardware?
|
| RocksDB, and LSM algorithms in general, seem to be designed with
| the assumption that random block I/O is slow. It appears that,
| for modern hardware, that assumption no longer holds, and the
| software only slows things down [0].
|
| [0] -
| https://github.com/BLepers/KVell/blob/master/sosp19-final40....
| ddorian43 wrote:
| Disappointed there was no lmdb comparison in there.
| tyingq wrote:
| A paper on making LSM more SSD friendly:
| https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf
| pbalcer wrote:
| Thanks for sharing this article - I found it very insightful.
| I've seen similar ideas being floated around before, and they
| often seem to focus on what software can be added on top of
| an already fairly complex solution (while LSM can appear to
| be conceptually simple, its implementations are anything
| but).
|
| To me, what the original article shows is an opportunity to
| remove - not add.
| jeffbee wrote:
| If you think about it from the perspective of the authors of
| large-scale databases, linear access is still a lot cheaper
| than random access in a datacenter filesystem.
| AtlasBarfed wrote:
| ScyllaDB had a blog post once about how surprisingly little
| CPU time is available to process packets on the fastest
| modern networks, 40Gbit and the like.
|
| I can't find it now. I think they were trying to say that
| Cassandra can't keep up because of the JVM overhead and you
| need to be close to the metal for extreme performance.
|
| This is similar. Huge amounts of flooding I/O from modern PCIe
| SSDs really close the traditional gap between CPU and "disk".
|
| The biggest limiter in cloud right now is the EBS/SAN. Sure you
| can use local storage in AWS if you don't mind it disappearing,
| but while gp3 is an improvement, it pales to stuff like this.
|
| Also, this is fascinating:
|
| "Take the write speeds with a grain of salt, as TLC & QLC cards
| have slower multi-bit writes into the main NAND area, but may
| have some DIMM memory for buffering writes and/or a "TurboWrite
| buffer" (as Samsung calls it) that uses part of the SSDs NAND
| as faster SLC storage. It's done by issuing single-bit "SLC-
| like" writes into TLC area. So, once you've filled up the "SLC"
| TurboWrite buffer at 5000 MB/s, you'll be bottlenecked by the
| TLC "main area" at 2000 MB/s (on the 1 TB disks)."
|
| I didn't know controllers could swap between TLC/QLC and SLC.
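A sketch of the SLC-cache behavior described in that quote. The 5000/2000 MB/s rates are the quoted ones; the buffer size is an assumption (Samsung's 1 TB models advertise a few tens of GB of TurboWrite region):

```python
# Model: writes stream at the SLC-cache rate until the buffer fills, then
# drop to the TLC main-area rate. The 42 GB buffer is assumed, not quoted.
def write_seconds(total_gb, slc_buffer_gb=42, slc_mbps=5000, tlc_mbps=2000):
    fast_gb = min(total_gb, slc_buffer_gb)
    slow_gb = max(0.0, total_gb - slc_buffer_gb)
    return fast_gb * 1000 / slc_mbps + slow_gb * 1000 / tlc_mbps

for gb in (10, 100):
    t = write_seconds(gb)
    print(f"{gb:>3} GB burst: {t:5.1f} s, ~{gb * 1000 / t:.0f} MB/s effective")
```

Short bursts see the full cache speed; sustained writes converge toward the slower main-area rate, which is why benchmark results depend so heavily on how full the drive and its cache are.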
| tanelpoder wrote:
| I learned the last bit from here (Samsung Solid State Drive
| TurboWrite Technology pdf):
|
| https://images-eu.ssl-images-
| amazon.com/images/I/914ckzwNMpS...
| PeterCorless wrote:
| Hi! From ScyllaDB here. There are a few things that help us
| really get the most out of hardware and network IO.
|
| 1. Async everywhere - We use AIO and io_uring to make sure
| that your inter-core communications are non-blocking.
|
| 2. Shard-per-core - It also helps if specific data is pinned
| to a specific CPU, so we partition on a per-core basis.
| Avoids cross-CPU traffic and, again, less blocking.
|
| 3. Schedulers - Yes, we have our own IO scheduler and CPU
| scheduler. We try to get every cycle out of a CPU. Java is
| very "slushy" and though you can tune a JVM it is never going
| to be as "tight" performance-wise.
|
| 4. Direct-attached NVMe > networked-attached block storage. I
| mean... yeah.
|
| We're making Scylla even faster now, so you might want to
| check out our blogs on Project Circe:
|
| * Introducing Project Circe:
| https://www.scylladb.com/2021/01/12/making-scylla-a-
| monstrou...
|
| * Project Circe January Update:
| https://www.scylladb.com/2021/01/28/project-circe-january-
| up...
|
| The latter has more on our new scheduler 2.0 design.
| StillBored wrote:
| Yes a number of articles about these newer TLC drives talk
| about it. The end result is that an empty drive is going to
| benchmark considerably different from one 99% full of
| uncompressable files.
|
| for example:
|
| https://www.tomshardware.com/uk/reviews/intel-
| ssd-660p-qlc-n...
| 1996 wrote:
| > I didn't know controllers could swap between TLC/QLC and
| SLC.
|
| I wish I could control the % of SLC. Even dividing a QLC
| space by 16 makes it cheaper than buying a similarly sized
| SLC
| 1MachineElf wrote:
| Reminds me of the Solid-State Drive checkbox that VirtualBox
| has for any VM disks. Checking it will make sure that the VM
| hardware emulation doesn't wait for the filesystem journal to
| be written, which would normally be advisable with spinning
| disks.
| digikata wrote:
| Not only the assumptions at the application layer, but
| potentially the filesystem too.
| [deleted]
| bob1029 wrote:
| I have personally found that making even the most primitive
| efforts at the single-writer principle and batching IO in your
| software can make many orders of magnitude of difference.
|
| Saturating an NVMe drive with a single x86 thread is trivial if
| you change how you play the game. Using async/await and
| yielding to the OS is not going to cut it anymore. Latency with
| these drives is measured in microseconds. You are better off
| doing microbatches of writes (10-1000 us wide) and pushing
| these to disk with a single thread that monitors a queue in a
| busy-wait loop (sort of like the LMAX Disruptor but even more
| aggressive).
|
| Thinking about high core count parts, sacrificing an entire
| thread to busy waiting so you can write your transactions to
| disk very quickly is not a terrible prospect anymore. This same
| ideology is also really useful for ultra-precise execution of
| future timed actions. Approaches in managed languages like
| Task.Delay or even Thread.Sleep are insanely inaccurate by
| comparison. The humble while(true) loop is certainly not energy
| efficient, but it is very responsive and predictable as long as
| you don't ever yield. What's one core when you have 63 more to
| go around?
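A minimal sketch of the pattern described above: a single dedicated writer busy-polls a queue and flushes micro-batches. Python is used only to show the structure; write_batch() is a hypothetical stand-in for the real batched pwrite()/io_uring submission:

```python
# Single-writer micro-batching: producers append, one thread busy-polls and
# flushes whatever has accumulated as one batch (one I/O instead of many).
import threading
from collections import deque

queue = deque()              # append/popleft are atomic in CPython
stop = threading.Event()
batches = []

def write_batch(items):
    batches.append(items)    # stand-in for one batched disk write

def writer_loop():
    while not stop.is_set() or queue:
        batch = []
        while queue:         # drain everything queued right now
            batch.append(queue.popleft())
        if batch:
            write_batch(batch)
        # deliberately no sleep/yield: the busy-wait keeps flush latency
        # in the microsecond range, at the cost of burning one core

writer = threading.Thread(target=writer_loop)
writer.start()
for i in range(10_000):      # producer side
    queue.append(i)
stop.set()
writer.join()
print(sum(len(b) for b in batches), "items in", len(batches), "batches")
```

Because there is exactly one consumer, write order is preserved and no locking is needed around the drain; that is the essence of the single-writer principle the comment advocates.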
| MrFoof wrote:
| >Latency with these drives is measured in microseconds.
|
| For context and to put numbers around this, the average read
| latency of the fastest, latest-generation PCIe 4.0 x4 U.2
| enterprise drives is 82-86 us, and the average write latency
| is 11-16 us.
| pbalcer wrote:
| The authors of the article I linked to earlier came to the
| same conclusions. And so did the SPDK folks. And the kernel
| community (or axboe :)) when coming up with io_uring. I'm
| just hoping that we will see software catching up.
| mikepurvis wrote:
| Isn't the use or non-use of async/await a bit orthogonal to
| the rest of this?
|
| I'm not an expert in this area, but wouldn't it be just as
| lightweight to have your async workers pushing onto a queue,
| and then have your async writer only wake up when the queue
| is at a certain level to create the batched write? Either
| way, you won't be paying the OS context switching costs
| associated with blocking a write thread, which I think is
| most of what you're trying to get out of here.
| pbalcer wrote:
| Right, I agree. I'd go even further and say that
| async/await is a great fit for a modern _asynchronous_ I/O
| stack (not read()/write()). Especially with io_uring using
| polled I/O (the worker thread is in the kernel, all the
| async runtime has to do is check for completion
| periodically), or with SPDK if you spin up your own I/O
| worker thread(s) like @benlwalker explained elsewhere in
| the thread.
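The batched-writer idea above can be sketched with asyncio (the queue, the shutdown sentinel, and the BATCH threshold are invented for illustration):

```python
import asyncio

BATCH = 8  # illustrative batch-size cap

async def writer(q, batches):
    while True:
        item = await q.get()        # sleeps only while the queue is empty
        if item is None:            # shutdown sentinel
            return
        batch = [item]
        while len(batch) < BATCH and not q.empty():
            item = q.get_nowait()   # drain whatever else is already queued
            if item is None:
                batches.append(batch)
                return
            batch.append(item)
        batches.append(batch)       # stand-in for one batched write

async def main():
    q, batches = asyncio.Queue(), []
    w = asyncio.create_task(writer(q, batches))
    for i in range(20):
        await q.put(i)              # async workers pushing onto the queue
    await q.put(None)
    await w
    return batches

batches = asyncio.run(main())
```

Each wakeup of the writer turns however many items are queued into a single batched write, amortizing the per-item syscall and context-switch cost.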
| throwawaygimp wrote:
| Very interesting. I'm currently designing and building a
| system which has a separate MCU just for timing-accurate
| stuff rather than having the burden of realtime kernel stuff,
| but I never considered just dedicating a core. Then I could
| also use that specifically to handle some IO queues too
| perhaps, so it could do double duty and not necessarily be
| wasteful. Thanks... now I need to go figure out why I either
| didn't consider that - or perhaps I did and discarded it for
| some reason beyond me right now. Hmm... thought provoking
| post of the day for me
| tyingq wrote:
| I wonder if "huge pages" would make a difference, since some of
| the bottlenecks seemed to be lock contention on memory pages.
| tanelpoder wrote:
| Linux pagecache doesn't use hugepages, but when doing direct
| I/O into application buffers it would definitely make sense
| to use hugepages. I plan to run tests on various database
| engines next - and many of them support using hugepages (for
| shared memory areas at least).
| guerby wrote:
| In the networking world (DPDK) huge pages and static pinning
| everything is a huge deal as you have very few cpu cycles per
| network packet.
| tanelpoder wrote:
| Yep - and there's SPDK for direct NVMe storage access
| without going through the Linux block layer:
| https://spdk.io
|
| (it's in my TODO list too)
| tyingq wrote:
| Thanks! Apparently, they did add it for tmpfs, and discussed
| it for ext4. https://lwn.net/Articles/718102/
| tanelpoder wrote:
| Good point - something to test, once I get to the
| filesystem benchmarks!
| tyingq wrote:
| I'm somewhat curious what happens to the long-standing 4P/4U
| servers from companies like Dell and HP. Ryzen/EPYC has
| really made going past 2P/2U a much rarer need.
| toast0 wrote:
| At least when I was actively looking at hardware (2011-2018), 4
| socket Xeon was available off the shelf, but at quite the
| premium over 2 socket Xeon. If your load scaled horizontally,
| it still made sense to get a 2P Xeon over 2x 1P Xeon, but 2x 2P
| Xeon was way more cost efficient than a 4P Xeon. 8P or 16P
| seemed to exist, but maybe only in catalogs.
|
| I'm not really in the market anymore, but Epyc looks like 1P is
| going to solve a lot of needs, and 2P will be available at a
| reasonable premium, but 4P will probably be out of reach.
| thinkingkong wrote:
| You might be able to buy a smaller server but the rack density
| doesn't necessarily change. You still have to worry about
| cooling and power so lots of DCs would have 1/4 or 1/2 racks.
| tyingq wrote:
| Sure. I wasn't really thinking of density, just the
| interesting start of the "death" of 4 socket servers. Being
| an old-timer, it's interesting to me because "typical
| database server" has been synonymous with 4P/4U for a long,
| long time.
| vinay_ys wrote:
| I haven't seen a 4 socket machine in a long time.
| wtallis wrote:
| I think at this point the only reasons to go beyond 2U are to
| make room for either 3.5" hard drives, or GPUs.
| rektide wrote:
| Would love to see some very dense blade style ryzen
| offerings. The 4 2P nodes in 2U is great. Good way to share
| some power supplies, fans, and chassis, ideally a multi-homed NIC too.
|
| Turn those sleds into blades though, put em on their side, &
| go even denser. It should be a way to save costs, but density
| alas is a huge upsell, even though it should be a way to
| scale costs down.
| tanelpoder wrote:
| Indeed, 128 EPYC cores in 2 sockets (with total 16 memory
| channels) will give a lot of power. I guess it's worth
| mentioning that the 64-core chips have much lower clock rate
| than 16/32 core ones though. And with some expensive software
| that's licensed by CPU core (Oracle), you'd want faster cores,
| but possibly pay a higher NUMA price when going with a single
| 4- or 8-socket machine for your "sacred monolith".
| StillBored wrote:
| There always seems to be buyers for more exotic high end
| hardware. That market has been shrinking and expanding, well
| since the first computer, as mainstream machines become more
| capable and people discover more uses for large coherent
| machines.
|
| But users of 16-socket machines will just step down to
| 4-socket EPYC machines with 512 cores (or whatever). And someone
| else will realize that moving their "web scale" cluster from 5k
| machines, down to a single machine with 16 sockets results in
| lower latency and less cost. (or whatever).
| maerF0x0 wrote:
| > Shouldn't I be building a 50-node cluster in the cloud "for
| scalability"? This is exactly the point of my experiment - do you
| really want to have all the complexity of clusters or performance
| implications of remote storage if you can run your I/O heavy
| workload on just one server with local NVMe storage?
|
| Anyone have a story to share about their company doing just this?
| "Scale out" has basically been the only acceptable answer across
| most of my career. Not to mention High Availability.
| tanelpoder wrote:
| You can get high availability without a "distributed system",
| just an active/passive failover cluster may be enough for some
| requirements. Even failover (sometimes seamless) on a VMWare
| cluster can help with planned maintenance scenarios without
| downtime, etc.
|
| Another way of achieving HA together with satisfying disaster
| recovery requirements is replication (either app level or
| database log replication, etc). So, no distributed system is
| necessary unless you have _legit_ scaling requirements.
|
| If you work on ERP-like databases for traditional Fortune
| 500-like companies, few people run such "sacred monolith"
| applications on modern distributed NoSQL databases, it's all
| Oracle, MSSQL or some Postgres nowadays. Data warehouses used
| to be all Oracle and Teradata too - although these DBs support
| some cluster scale-out, they're still "sacred monoliths" from a
| different era (and they still do what they were designed for
| very well). Now of course Snowflake, BigQuery, etc are
| taking over the DW/analytics world for new greenfield projects,
| existing systems usually stay as they are due to lock-in &
| extremely high cost of rewriting decades of existing reports
| and apps.
| anarazel wrote:
| Have you checked if using the fio options (--iodepth_batch_*) to
| batch submissions helps? Fio doesn't do that by default, and I
| found that that can be a significant benefit.
|
| Particularly, submitting multiple requests at once can amortize the
| cost of setting the nvme doorbell (the expensive part as far as I
| understand it) across multiple requests.
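A toy cost model of that amortization (the units and numbers are purely illustrative, not measured): if each request costs s to prepare and the doorbell (MMIO) write costs d, batching N submissions per doorbell cuts the per-IO overhead from s + d to s + d/N:

```python
def per_io_cost(submit_cost, doorbell_cost, batch):
    # Average per-IO submission overhead when `batch` requests
    # share a single doorbell write.
    return submit_cost + doorbell_cost / batch

unbatched = per_io_cost(0.2, 1.0, 1)  # 1.2 cost units per IO
batched = per_io_cost(0.2, 1.0, 8)    # 0.325 cost units per IO
```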
| tanelpoder wrote:
| I tested various fio options, but didn't notice this one - I'll
| check it out! It might explain why I still kept seeing lots of
| interrupts raised even though I had enabled the I/O completion
| polling instead, with io_uring's --hipri option.
|
| edit: I ran a quick test with various IO batch sizes and it
| didn't make a difference - I guess because thanks to using
| io_uring, my bottleneck is not in IO submission, but deeper in
| the block IO stack...
| wtallis wrote:
| I think on recent kernels, using the hipri option doesn't get
| you interrupt-free polled IO unless you've configured the
| nvme driver to allocate some queues specifically for polled
| IO. Since these Samsung drives support 128 queues and you're
| only using a 16C/32T processor, you have more than enough for
| each drive to have one poll queue and one regular IO queue
| allocated to each (virtual) CPU core.
| tanelpoder wrote:
| That would explain it. Do you recommend any docs/links I
| should read about allocating queues for polled IO?
| anarazel wrote:
| It's terribly documented :(. You need to set the
| nvme.poll_queues to the number of queues you want, before
| the disks are attached. I.e. either at boot, or you need
| to set the parameter and then cause the NVMe to be
| rescanned (you can do that in sysfs, but I can't
| immediately recall the steps with high confidence).
| anarazel wrote:
| Ah, yes, shell history ftw. Of course you should ensure
| no filesystem is mounted or such:
|       root@awork3:~# echo 4 > /sys/module/nvme/parameters/poll_queues
|       root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
|       root@awork3:~# dmesg -c
|       [749717.253101] nvme nvme1: 12/0/4 default/read/poll queues
|       root@awork3:~# echo 8 > /sys/module/nvme/parameters/poll_queues
|       root@awork3:~# dmesg -c
|       root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
|       root@awork3:~# dmesg -c
|       [749736.513102] nvme nvme1: 8/0/8 default/read/poll queues
| tanelpoder wrote:
| Thanks for the pointers, I'll bookmark this and will try
| it out someday.
| anarazel wrote:
| > I tested various fio options, but didn't notice this one -
| I'll check it out! It might explain why I still kept seeing
| lots of interrupts raised even though I had enabled the I/O
| completion polling instead, with io_uring's --hipri option.
|
| I think that should be independent.
|
| > edit: I ran a quick test with various IO batch sizes and it
| didn't make a difference - I guess because thanks to using
| io_uring, my bottleneck is not in IO submission, but deeper
| in the block IO stack...
|
| It probably won't get you drastically higher speeds in an
| isolated test - but it should help reduce CPU overhead. E.g.
| on one of my SSDs,
|       fio --ioengine io_uring --rw randread --filesize 50GB
|         --invalidate=0 --name=test --direct=1 --bs=4k
|         --numjobs=1 --registerfiles --fixedbufs
|         --gtod_reduce=1 --iodepth 48
| uses about 25% more CPU than when I add
|       --iodepth_batch_submit=0 --iodepth_batch_complete_max=0.
| But the resulting iops are nearly the same as long as there are
| enough cycles available.
|
| This is via filesystem, so ymmv, but the mechanism should be
| mostly independent.
| tanelpoder wrote:
| Author here: This article was intended to explain some modern
| hardware bottlenecks (and non-bottlenecks), but unexpectedly
| ended up covering a bunch of Linux kernel I/O stack issues as
| well :-) AMA
| jeffbee wrote:
| Great article, I learned! Can you tell me if you looked into
| aspects of the NVMe device itself, such as whether it supports
| 4K logical blocks instead of 512B? Use `nvme id-ns` to read out
| the supported logical block formats.
| tanelpoder wrote:
| Doesn't seem to support 4k out of the box? Some drives - like
| Intel Optane SSDs allow changing this in firmware (and
| reformatting) with a manufacturer's utility...
|       $ lsblk -t /dev/nvme0n1
|       NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
|       nvme0n1         0    512      0     512     512    0 none     1023 128    0B
|       $ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
|       LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512
|       bytes - Relative Performance: 0 Best (in use)
| jeffbee wrote:
| Thanks for checking. SSD review sites never mention this
| important detail. For some reason the Samsung datacenter
| SSDs support 4K LBA format, and they are very similar to
| the retail SSDs which don't seem to. I have a retail
| 970 Evo that only provides 512.
| wtallis wrote:
| I just checked my logs, and none of Samsung's consumer
| NVMe drives have ever supported sector sizes other than
| 512B. They seem to view this feature as part of their
| product segmentation strategy.
|
| Some consumer SSD vendors do enable 4kB LBA support. I've
| seen it supported on consumer drives from WD, SK hynix
| and a variety of brands using Phison or SMI SSD
| controllers (including Kingston, Seagate, Corsair,
| Sabrent). But I haven't systematically checked to see
| which brands consistently support it.
| floatboth wrote:
| At least _early_ WD Black models don't really seem to
| have 4K LBA support. The format option is listed, but it
| refuses to actually run the command to reformat the drive
| to the new "sector" size.
| wtallis wrote:
| Put your system to sleep and wake it back up. (I use
| `rtcwake -m mem -s 10`). Power-cycling the drive like
| this resets whatever security lock the motherboard
| firmware enables on the drive during the boot process,
| allowing the drive to accept admin commands like NVMe
| format and ATA secure erase that would otherwise be
| rejected. Works on both the WD Black SN700 and SN750
| models, doesn't seem to be necessary on the very first
| (Marvell-based) WD Black or the latest SN850.
| floatboth wrote:
| I'm pretty sure this is the very first one though --
| WDS250G2X0C, firmware 101110WD.
| wtallis wrote:
| I think that's the second-gen WD Black, but the first one
| that had their in-house SSD controller rather than a
| third-party controller. The marketing and packaging
| didn't prominently use a more specific model number to
| distinguish it from the previous WD Black, but on the
| drive's label it does say "PC SN700". Also, the first-gen
| WD Black was 256GB and 512GB capacities, while the later
| generations are 250/500/1000/2000GB. Firmware version
| strings for the first-gen WD Black were stuff like
| "B35200WD", while the SN700/720/730/750 family have
| versions like "102000WD" and "111110WD". So I would
| definitely expect your drive to require the sleep-wake
| cycle before it'll let you reformat to 4k sectors.
| jeffbee wrote:
| You seem to have a lot of info on this topic. Do you run
| a blog or some other way you disseminate this stuff?
| 1996 wrote:
| Is it genuine 512?
|
| As in, what ashift value do you use with zfs?
| wtallis wrote:
| Regardless of what sector size you configure the SSD to
| expose, the drive's flash translation layer still manages
| logical to physical mappings at a 4kB granularity, the
| underlying media page size is usually on the order of
| 16kB, and the erase block size is several MB. So what
| ashift value you want to use depends very much on what
| kind of tradeoffs you're okay with in terms of different
| aspects of performance and write endurance/write
| amplification. But for most flash-based SSDs, there's no
| reason to set ashift to anything less than 12
| (corresponding to 4kB blocks).
| 1996 wrote:
| > for most flash-based SSDs, there's no reason to set
| ashift to anything less than 12 (corresponding to 4kB
| blocks).
|
| matching the page size?
|
| > the underlying media page size is usually on the order
| of 16kB
|
| I'd say that's a good reason to set ashift=14, as
| 2^14 = 16kB
| wtallis wrote:
| There _are_ downsides to forcing the OS/FS to always use
| larger block sizes for IO. You might simply be moving
| some write amplification out of the SSD and into the
| filesystem, while losing some performance in the process.
| Which is why it really depends on your workload, and to
| some extent on the specific SSD in question. I'm not
| convinced that ashift=14 is a sensible one size fits all
| recommendation, even if we're talking only about recent-
| model consumer-grade NAND SSDs.
| mgerdts wrote:
| FWIW, WD SN850 has similar performance and supports 512
| and 4k sectors.
| guerby wrote:
| Here is an article about nvme-cli tool :
|
| https://nvmexpress.org/open-source-nvme-management-
| utility-n...
|
| On Samsung SSD 970 EVO 1TB it seems only 512-byte LBAs are
| supported:
|       # nvme id-ns /dev/nvme0n1 -n 1 -H | grep "^LBA Format"
|       LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512
|       bytes - Relative Performance: 0 Best (in use)
| rafaelturk wrote:
| Thanks for well written article, makes me think about
| inefficiencies in our over-hyped cloud environment.
| tanelpoder wrote:
| Oh yes - and incorrectly configured on-premises systems too!
| sitkack wrote:
| Could you explain some of your thought processes and
| methodologies when approaching problems like this?
|
| What is your mental model like? How much experimentation do you
| do versus reading kernel code? How do you know what questions
| to start asking?
|
| *edit, btw I understand that a response to these questions
| could be an entire book, you get the question-space.
| tanelpoder wrote:
| Good question. I don't ever read kernel code as a starting
| point, only if some profiling or tracing tool points me
| towards an interesting function or codepath. And interesting
| usually is something that takes most CPU in perf output or
| some function call with an unusually high latency in ftrace,
| bcc/bpftrace script output. Or just a stack trace in a core-
| or crashdump.
|
| As far as mindset goes - I try to apply the developer mindset
| to system performance. In other words, I don't use much of
| what I call the "old school sysadmin mindset", from a time
| when better tooling was not available. I don't use
| systemwide utilization or various get/hit ratios for doing
| "metric voodoo" of Unix wizards.
|
| The developer mindset dictates that everything you run is an
| application. JVM is an application. Kernel is an application.
| Postgres, Oracle are applications. All applications execute
| one or more threads that run on CPU or do not run on CPU.
| There are only two categories of reasons why a thread does
| not run on CPU (is sleeping): the OS put the thread to sleep
| (involuntary blocking) or the thread voluntarily went to
| sleep (for example, it realized it can't get some
| application-level lock).
|
| And you drill down from there. Your OS/system is just a bunch
| of threads running on CPU, sleeping and sometimes
| communicating with each other. You can _directly_ measure all
| of these things easily nowadays with profilers, no need for
| metric voodoo.
|
| I have written my own tools to complement things like perf,
| ftrace and BPF stuff - as a consultant I regularly see 10+
| year old Linux versions, etc - and I find sampling thread
| states from /proc file system is a really good (and flexible)
| starting point for system performance analysis and even some
| drilldown - all this without having to install new software
| or upgrading to latest kernels. Some of the tools I showed in
| my article too:
|
| https://tanelpoder.com/psnapper & https://0x.tools
|
| At the end of my post I mentioned that I'll do a webinar
| "hacking session" next Thursday, I'll show more how I work
| there :-)
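The thread-state sampling approach can be sketched in a few lines (a toy version of what psnapper/0x.tools do, assuming the stat file layout described in proc(5)):

```python
import glob

def thread_states(pid="self"):
    # Tally the scheduler state (R, S, D, ...) of every thread of a
    # process by sampling /proc/<pid>/task/*/stat once.
    states = {}
    for path in glob.glob(f"/proc/{pid}/task/*/stat"):
        with open(path) as f:
            line = f.read()
        # comm may contain spaces or parens, so parse from the last ')'
        state = line[line.rfind(")") + 2]
        states[state] = states.get(state, 0) + 1
    return states

print(thread_states())  # e.g. {'R': 1} for a single busy thread
```

Sampling this in a loop (together with /proc/&lt;pid&gt;/task/&lt;tid&gt;/wchan or stack) gives exactly the "running on CPU vs. why sleeping" breakdown described above, with no extra software installed.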
| vinay_ys wrote:
| Very cool rig and benchmark. Kudos. Request: add network io
| load to your benchmarking load while nvme io load is running.
| tanelpoder wrote:
| Thanks, will do in a future article! I could share the disks
| out via NFS or iSCSI or something and hammer them from a
| remote machine...
| PragmaticPulp wrote:
| This is a great article. Thanks for writing it up and sharing.
| guerby wrote:
| 71 GB/s is 568 Gbit/s so you'll need about 3 dual 100 Gbit/s
| cards to pump data out at the rate you can read it from the
| NVMe drives.
|
| And Ethernet (unless using jumbo frames on the LAN) carries
| about 1.5 kB per frame (not 4 kB).
|
| One such PC should be able to do 100k simultaneous 5 Mbps HD
| streams.
|
| Testing this would be fun :)
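A quick back-of-the-envelope check of those numbers:

```python
gbytes_per_s = 71
gbits_per_s = gbytes_per_s * 8             # 568 Gbit/s of disk reads
dual_100g_cards = -(-gbits_per_s // 200)   # ceil(568/200) = 3 cards

stream_mbps = 5                            # one HD stream
streams = 100_000
needed_gbits = streams * stream_mbps / 1000  # 500 Gbit/s, fits in 568
```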
| zamadatix wrote:
| Mellanox has a 2x200 Gbps NIC these days. Haven't gotten to
| play with it yet though.
| tanelpoder wrote:
| Which NICs would you recommend for me to buy for testing at
| least 1x100 Gbps (ideally 200 Gbps?) networking between
| this machine (PCIe 4.0) and an Intel Xeon one that I have
| with PCIe 3.0. Don't want to spend much money, so the cards
| don't need to be too enterprisey, just fast.
|
| And - do such cards even allow direct "cross" connection
| without a switch in between?
| namibj wrote:
| If you care about price, check out (used, ofc) infiniband
| cards.
|
| They all seem to offer/suggest daisy-chain connectivity
| at least for those with two ports per card as one
| potential topology.
| drewg123 wrote:
| All 100G is enterprisey.
|
| For a cheap solution, I'd get a pair of used Mellanox
| ConnectX4 or Chelsio T6, and a QSFP28 direct attach
| copper cable.
| zamadatix wrote:
| +1 on what the sibling comment said.
|
| As for directly connecting them absolutely, works great.
| Id recommend a cheap DAC off fs.com to connect them in
| that case.
| drewg123 wrote:
| At Netflix, I'm playing with an EPYC 7502P with 16 NVME and
| dual 2x100G Mellanox ConnectX6-DX NICs. With hardware kTLS
| offload, we're able to serve about 350Gb/s of real customer
| traffic. This goes down to about 240Gb/s when using software
| kTLS, due to memory bandwidth limits.
|
| This is all FreeBSD, and is the evolution of the work
| described in my talk at the last EuroBSDCon in 2019:
| https://papers.freebsd.org/2019/eurobsdcon/gallatin-
| numa_opt...
| ksec wrote:
| >we're able to serve about 350Gb/s of real customer
| traffic.
|
| I still remember the post about breaking 100Gbps barrier,
| that was may be in 2016 or 17 ? And wasn't that long ago it
| was 200Gbps and if I remember correct it was hitting memory
| bandwidth barrier as well.
|
| And now 350Gbps?!
|
| So what's next? Wait for DDR5? Or moving to some memory
| controller black magic like POWER10?
| drewg123 wrote:
| Yes, before hardware inline kTLS offload, we were limited
| to 200Gb/s or so with Naples. With Rome, it's a bit
| higher. But hardware inline kTLS with the Mellanox CX6-DX
| eliminates memory bandwidth as a bottleneck.
|
| The current bottleneck is IO related, and it's unclear
| what the issue is. We're working with the hardware
| vendors to try to figure it out. We should be getting
| about 390Gb/s.
| ksec wrote:
| Oh wow! Can't wait to hear more about it.
| jiggawatts wrote:
| > But hardware inline kTLS with the Mellanox CX6-DX
| eliminates memory bandwidth as a bottleneck.
|
| For a while now I had operated under the assumption that
| CPU-based crypto with AES-GCM was faster than most
| hardware offload cards. What makes the Mellanox NIC
| perform better?
|
| I.e.: Why does memory bandwidth matter to TLS? Aren't you
| encrypting data "on the fly", while it is still resident
| in the CPU caches?
|
| > We're working with the hardware vendors to try to
| figure it out. We should be getting about 390Gb/s
|
| Something I explained to a colleague recently is that a
| modern CPU gains or loses more compute power from a 1 deg
| C temperature difference in the room's air than my first
| four computers had combined.
|
| You're basically complaining that you're missing a
| mere 10% of the expected throughput. But put in absolute
| terms, that's 40 Gbps, which is about 10x more than what
| a typical server in 2020 can put out on the network.
| (Just because you have 10 Gbps NICs doesn't mean you can
| get 10 Gbps! Try iperf3 and you'll be shocked that you're
| lucky if you can crack 5 Gbps in practice)
| floatboth wrote:
| > Try iperf3 and you'll be shocked that you're lucky if
| you can crack 5 Gbps in practice
|
| Easy line rate if you crank the MTU all the way to 9000
| :D
|
| > modern CPU gains or loses more computer power from a
| 1deg C temperature difference in the room's air
|
| If you're using the boost algorithm rather than a static
| overclock, _and_ when that boost is thermally limited
| rather than current limited. With a good cooler it's not
| too hard to always have thermal headroom.
| jiggawatts wrote:
| > Easy line rate if you crank the MTU all the way to 9000
| :D
|
| In my experience jumbo frames provide _at best_ an
| improvement of about 20% in rare cases, such as ping-pong
| UDP protocols such as TFTP or Citrix PVS streaming.
| magila wrote:
| > I.e.: Why does memory bandwidth matter to TLS? Aren't
| you encrypting data "on the fly", while it is still
| resident in the CPU caches?
|
| I assume NF's software pipeline is zero copy, so if TLS
| is done in the NIC data only gets read from memory once
| when it is DMA'd to the NIC. With software TLS you need
| to read the data from memory (assuming it's not already
| in cache, which given the size of data NF deals with is
| unlikely), encrypt it, then write it back out to main
| memory so it can be DMA'd to the NIC. I know Intel has
| some fancy tech that can DMA directly to/from the CPU's
| cache, but I don't think AMD has that capability (yet).
| toast0 wrote:
| > For a while now I had operated under the assumption
| that CPU-based crypto with AES-GCM was faster than most
| hardware offload cards. What makes the Mellanox NIC
| perform better?
|
| > I.e.: Why does memory bandwidth matter to TLS? Aren't
| you encrypting data "on the fly", while it is still
| resident in the CPU caches?
|
| It may depend on what you're sending. Netflix's use case
| is generally sending files. If you're doing software
| encryption you would load the plain text file into memory
| (via the filesystem/unified buffer cache), then write the
| (session specific) encrypted text into separate memory,
| then give that memory to the NIC to send out.
|
| If the NIC can do the encryption, you would load the
| plain text into memory, then tell the NIC to read from
| that memory to encrypt and send out. That saves at least
| a write pass, and probably a read pass. (256 MB of L3
| cache on latest EPYC is a lot, but it's not enough to
| expect cached reads from the filesystem to hit L3 that
| often, IMHO)
|
| If my guesstimate is right, a cold file would go from
| hitting memory 4 times to hitting it twice. And a file in
| disk cache would go from 3 times to once; the CPU doesn't
| need to touch the memory if it's in the disk cache.
|
| Note that this is a totally different case from encrypting
| dynamic data that's necessarily touched by the CPU.
|
| > You're basically complaining that you're unable to get
| a mere 10% of the expected throughput. But put in
| absolute terms, that's 40 Gbps, which is about 10x more
| than what a typical server in 2020 can put out on the
| network. (Just because you have 10 Gbps NICs doesn't mean
| you can get 10 Gbps! Try iperf3 and you'll be shocked
| that you're lucky if you can crack 5 Gbps in practice)
|
| I had no problem serving 10 Gbps of files on a dual Xeon
| E5-2690 (v1; a 2012 CPU), although that CPU isn't great
| at AES, so I think it only did 8 Gbps or so with TLS; the
| next round of servers for that role had 2x 10G and 2690
| v3 or v4 (2014 or 2016; but I can't remember when we got
| them) and thanks to better AES instructions, they were
| able to do 20 G (and a lot more handshakes/sec too). If
| your 2020 servers aren't as good as my circa 2012 servers
| were, you might need to work on your stack. OTOH, bulk
| file serving for many clients can be different than a
| single connection iperf.
| drewg123 wrote:
| > If my guesstimate is right, a cold file would go from
| hitting memory 4 times to hitting it twice. And a file in
| disk cache would go from 3 times to once; the CPU doesn't
| need to touch the memory if it's in the disk cache.
|
| You're spot on. I have a slide that I like to show NIC
| vendors when they question why TLS offload is important.
| See pages 21 and 22 of:
| https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf
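The pass-counting argument above can be turned into a rough ceiling model (the bandwidth figure is purely illustrative): if every served byte must cross the memory bus N times, maximum egress is memory bandwidth divided by N, so halving the passes doubles the ceiling:

```python
def egress_ceiling_gbps(mem_bw_gbps, memory_passes):
    # Upper bound on network throughput when each served byte must
    # cross the memory bus `memory_passes` times.
    return mem_bw_gbps / memory_passes

MEM_BW = 1600  # illustrative: ~200 GB/s of DRAM bandwidth, in Gbit/s
software_tls = egress_ceiling_gbps(MEM_BW, 4)  # cold file, CPU crypto
ktls_offload = egress_ceiling_gbps(MEM_BW, 2)  # cold file, NIC crypto
```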
| tanelpoder wrote:
| I should (finally) receive my RTX 3090 card today (PCIe 4.0
| too!), I guess here goes my weekend (and the following
| weekends over a couple of years)!
| tarasglek wrote:
| You should look at cpu usage. There is a good chance all your
| interrupts are hitting cpu-0. you can run hwloc to see what
| chiplet the pci cards are on and handle interrupts on those
| cores.
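One quick way to test that hypothesis is to tally per-CPU interrupt counts for the nvme IRQ lines (a rough sketch assuming the usual /proc/interrupts column layout):

```python
import os

def irq_counts_per_cpu(text, match="nvme"):
    # Sum interrupt counts per CPU across all IRQ lines whose
    # description contains `match`, given /proc/interrupts text.
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        vals = line.split()[1:1 + ncpus]
        if match in line and all(v.isdigit() for v in vals):
            for i, v in enumerate(vals):
                totals[i] += int(v)
    return totals

if os.path.exists("/proc/interrupts"):  # Linux only
    with open("/proc/interrupts") as f:
        print(irq_counts_per_cpu(f.read()))
```

If nearly all counts land in the first column, interrupts really are piling up on cpu-0 and IRQ affinity is worth tuning.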
| jeffbee wrote:
| Why would that happen with the linux nvme stack that puts a
| completion queue on each CPU?
| wtallis wrote:
| I think that in addition to allocating a queue per CPU, you
| need to be able to allocate a MSI(-X) vector per CPU. That
| shouldn't be a problem for the Samsung 980 PRO, since it
| supports 128 queues and 130 interrupt vectors.
| tanelpoder wrote:
| Thanks for the "hwloc" tip. I hadn't thought about that.
|
| I was thinking of doing something like that. Weirdly I got
| sustained throughput differences when I killed & restarted
| fio. So, if I got 11M IOPS, it stayed at that level until I
| killed fio & restarted. If I got 10.8M next, it stayed like
| it until I killed & restarted it.
|
| This makes me think that I'm hitting some PCIe/memory
| bottleneck, dependent on process placement (which process
| happens to need to move data across infinity fabric due to
| accessing data through a "remote" PCIe root complex or
| something like that). But then I realized that Zen 2 has a
| central IO hub again, so there shouldn't be a "far edge of
| I/O" like on current gen Intel CPUs (?)
|
| But there's definitely some workload placement and
| I/O-memory-interrupt affinity that I've wanted to look into.
| I could even enable the NUMA-like-mode from BIOS, but again
| with Zen 2, the memory access goes through the central
| infinity-fabric chip too, I understand, so not sure if
| there's any value in trying to achieve memory locality for
| individual chiplets on this platform (?)
| wtallis wrote:
| The PCIe is all on a single IO die, but internally it is
| organized into quadrants that can produce some NUMA
| effects. So it is probably worth trying out the motherboard
| firmware settings to expose your CPU as multiple NUMA
| nodes, and using the FIO options to allocate memory only on
| the local node, and restricting execution to the right
| cores.
| tanelpoder wrote:
| Yep, I enabled the "numa-like-awareness" in BIOS and ran
| a few quick tests to see whether the NUMA-aware
| scheduler/NUMA balancing would do the right thing and
| migrate processes closer to their memory over time, but
| didn't notice any benefit. But yep I haven't manually
| locked down the execution and memory placement yet. This
| placement may well explain why I saw some ~5% throughput
| fluctuations _only if killing & restarting fio_ and not
| while the same test was running.
| syoc wrote:
| I have done some tests on AMD servers and I think the Linux
| scheduler does a pretty good job. I do, however, get
| noticeably (a couple percent) better performance by
| forcing the process to run on the correct numa node.
|
| Make sure you get as many numa domains as possible in
| your BIOS settings.
|
| I recommend using numactl with the cpu-exclusive and mem-
| exclusive flags. I have noticed a slight performance drop
| when RAM cache fills beyond the sticks local to the cpus
| doing work.
|
| One last comment is that you mentioned interrupts being
| "striped" among CPUs. I would recommend pinning the
| interrupts from one disk to one numa-local CPU and using
| numactl to run fio for that disk on the same CPU. An
| additional experiment is to, if you have enough cores,
| pin interrupts to CPUs local to disk, but use other cores
| on the same numa node for fio. That has been my most
| successful setup so far.
| tarasglek wrote:
| So there are 2 parts to cpu affinity. a) cpu assigned to
| ssd for handling interrupts and b) cpu assigned to fio.
| numactl is your friend for experimenting with with changing
| fio affinity.
|
| https://access.redhat.com/documentation/en-
| us/red_hat_enterp... tells you how to tweak irq handlers.
|
| You usually want to change both. pinning each fio process +
| each interrupt handler to specific cpus will reach highest
| perf.
|
| You can even use the isolcpus kernel parameter to reduce
| jitter from things you don't care about, to minimize
| latency (won't do much for bandwidth).
| mgerdts wrote:
| I have the same box, but with the 32 core CPU and fewer
| NVMe drives. I've not poked at all the PCIe slots yet, but
| all that I've looked at are in NUMA node 1. This includes
| the on board M.2 slots. It is in NPS=4 mode.
| tanelpoder wrote:
| Mine goes only up to 2 NUMA nodes (as shown in numactl
| --hardware), despite setting NPS4 in BIOS. I guess it's
| because I have only 2 x 8-core chiplets enabled (?)
| perryizgr8 wrote:
| It would be interesting to know what you intend to use this rig
| for, if that is not some secret :)
| tanelpoder wrote:
| Valid question!
|
| 1) Learning & researching capabilities of modern HW
|
| 2) Running RDBMS stress tests (until breaking point), Oracle,
| Postgres+TimescaleDB, MySQL, probably ScyllaDB soon too
|
| 3) Why? As a performance troubleshooter consultant+trainer, I
| regularly have to reproduce complex problems that show up
| only under high concurrency & load - stuff that you can't
| just reproduce in a VM in a laptop.
|
| 4) Fun - seeing if the "next gen" hardware's promised
| performance is actually possible!
|
| FYI I have some videos from my past complex problem
| troubleshooting adventures, mostly Oracle stuff so far and
| some Linux performance troubleshooting:
|
| https://tanelpoder.com/videos/
| nicioan wrote:
| Excellent article, thank you! I really like the analysis and
| profiling part of the evaluation. I also have some experience
| with I/O performance on Linux -- we measured 30GiB/s in a
| PCIe Gen3 box (shameless plug[0]).
|
| I have one question / comment: did you use multiple jobs for
| the BW (large IO) experiments? If yes, then did you set
| randrepeat to 0? I'm asking this because fio by default uses
| the same sequence of offsets for each job, in which case there
| might be data re-used across jobs. I had verified that with
| blktrace a few years back, but it might have changed recently.
|
| [0]https://www.usenix.org/conference/fast19/presentation/kourti
| ...
|
| edit: fixed typo
| tanelpoder wrote:
| Looks interesting! I wonder whether there'd be interesting
| new database applications on NVMe when doing I/Os as small
| as 512 bytes (with a more efficient "IO engine" than Linux
| bio, which has too high a CPU overhead for such small
| requests).
|
| I mean, currently OLTP RDBMS engines tend to use 4k, 8k
| (and sometimes 16k) block sizes when doing completely
| random I/O (say, traversing an index on customer_id that
| now needs to read occasional customer orders across years
| of history). So you may end up reading 1000 x 8 kB blocks
| just to fetch 1000 x 100 B order records "randomly"
| scattered across the table from inserts done over the
| years.
|
| Optane persistent memory can do small, cache line sized I/O I
| understand, but that's a different topic. When being able to
| do random 512B I/O on "commodity" NVMe SSDs efficiently, this
| would open some interesting opportunities for retrieving
| records that are scattered "randomly" across the disks.
|
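| To put rough numbers on that example (assuming ~100-byte
| records and one I/O per record, which is an idealization):

```shell
# Back-of-envelope read amplification for 1000 random record probes.
awk 'BEGIN {
  wanted  = 1000 * 100     # bytes the query actually needs
  read8k  = 1000 * 8192    # bytes transferred with 8 kB blocks
  read512 = 1000 * 512     # bytes transferred with 512 B I/Os
  printf "8kB blocks: %.0fx amplification\n", read8k / wanted
  printf "512B I/Os:  %.1fx amplification\n", read512 / wanted
}'
```
|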
| edit: to answer your question, I used 10 separate fio
| commands with numjobs=3 or 4 for each and randrepeat was set
| to default.
| ksec wrote:
| I just love this article. Especially when the norm is always
| about scaling out instead of scaling up. We can have 128 Core
| CPU, 2TB Memory, PCI-E 4.0 SSD (and soon PCI-E 5.0). We
| could even fit a _Petabyte_ of SSD storage in 1U.
|
| I remember WhatsApp used to serve its _500M_ users with
| only a dozen large FreeBSD boxes. ( Only to be taken apart
| by Facebook )
|
| So Thank you for raising awareness. Hopefully the pendulum is
| swinging back to conceptually simple design.
|
| >I also have a 380 GB Intel Optane 905P SSD for low latency
| writes
|
| I would love to see that. Although I am waiting for someone to
| do a review on the Optane SSD P5800X [1]. Random 4K IOPS up to
| 1.5M with lower than 6 _us_ Latency.
|
| [1] https://www.servethehome.com/new-intel-
| optane-p5800x-100-dwp...
| texasbigdata wrote:
| Second on Optane.
| phkahler wrote:
| >> I remember WhatsApp used to serve its 500M users with
| only a dozen large FreeBSD boxes.
|
| With 1TB of RAM you can have 256 bytes for every person on
| earth live in memory. With SSD either as virtual memory or
| keeping an index in RAM, you can do meaningful work in real
| time, probably as fast as the network will allow.
| swader999 wrote:
| Faster than they can type!
| zie wrote:
| My math doesn't compute with yours:
|
| Depending on how you define a TB (memory tends to favour
| the binary definition, but YMMV):
|
| 1,000,000,000,000 / 7.8billion = 128.21 bytes per human.
|
| 1,099,511,627,776 / 7.8billion = 140.96 bytes per human.
|
| population source via Wikipedia.
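|
| A quick awk check of both figures:

```shell
# Bytes of RAM per human for 1 TB (decimal) vs 1 TiB (binary),
# using the ~7.8 billion population figure above.
awk 'BEGIN {
  pop = 7.8e9
  printf "1 TB  (10^12): %.2f bytes/human\n", 1e12 / pop
  printf "1 TiB (2^40):  %.2f bytes/human\n", 2^40 / pop
}'
```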
| rektide wrote:
| Intel killing off prosumer optane 2 weeks ago[1] made me so
| so so sad.
|
| The new P5800X should be sick.
|
| [1] https://news.ycombinator.com/item?id=25805779
| maerF0x0 wrote:
| When I first moved to the bay area, the company that hired me
| asked me what kind of computer I wanted and gave me a budget
| (like $3000 or something)... I spent a few days crafting a
| parts list so I could build an awesome workstation. Once I sent
| it over they were like "Uh, we just meant which macbook do you
| want?" and kind of gave me some shade about it. They joked, so
| how are you going to do meetings or on call?
|
| I rolled with it, but really wondered if they knew I could get
| 2x the hardware and have a computer at home and at work for
| less money than the MBP... Most of the people didn't seem
| to understand that laptop CPUs are not the same as
| desktop/workstation ones, especially when they hit thermal
| throttling.
| noir_lord wrote:
| At my last-but-one job, the boss offered me an iMac Pro. I
| asked if I could just have the equivalent money for
| hardware and he said sure.
|
| Which is how I ended up with an absolute _monster_ of a
| work machine. These days I WFH, and while work issued me a
| Macbook Pro it sits on the shelf behind me.
|
| Fedora on a (still fast) Ryzen/2080 with 2x 4K 27" screens
| vs a Macbook Pro is a hilarious no-brainer for me.
|
| Upgrading soon but can't decide whether I _need_ the 5950X
| or merely _want_ it - realistically, except for gaming I'm
| nowhere near tapping out this machine (and it's still
| awesome for that and VR, which is why the step-son is about
| to get, in his words, a "sick" PC).
| walrus01 wrote:
| I mean it would have been a totally valid answer to say that
| you intended to use a $600 laptop as effectively a thin
| client, and spend $2400 on a powerful workstation PC to drive
| remotely.
| antongribok wrote:
| Great article!
|
| Any chance you could post somewhere the output of:
| lstopo --of ascii
|
| Or similar?
| tanelpoder wrote:
| I can do it tomorrow, please drop me an email (email listed
| in my blog)
| KaiserPro wrote:
| Excellent write up.
|
| I used to work for a VFX company in 2008. At that point we used
| lustre to get high throughput file storage.
|
| From memory we had something like 20 racks of servers and
| disks to get 3-6 gigabytes/s (sustained) throughput on a
| 300TB filesystem.
|
| It is hilarious to think that a 2U box can now
| theoretically saturate 2x 100GbE NICs.
| drmadera wrote:
| Great article. Did you consider doing Optane tests? I built
| a 3990X WS with all-Optane storage and I get blazing fast
| access times, but 3 GB/s top speeds. It might be
| interesting to look at them for these tests, especially in
| time-sensitive scenarios.
| tanelpoder wrote:
| I have 2 Optane 905P M.2 cards and I intend to run some
| database engine tests, putting their transaction logs (and
| possibly temporary spill areas for sorts, hashes) on Optane.
|
| When I think about Optane, I think about optimizing for low
| latency where it's needed and not that much about bandwidth of
| large ops.
| jacquesm wrote:
| Lovely article, zero fluff, tons of good content and modest to
| boot. Thank you for this write-up, I'll pass it around to some
| people who feel that the need for competent system administration
| skills has passed.
| qaq wrote:
| Would be cool to see pgbench score for this setup
| namero999 wrote:
| You should be farming Chia on that thing [0]
|
| Amazing, congrats!
|
| [0] https://github.com/Chia-Network/chia-blockchain/wiki/FAQ
| jayonsoftware1 wrote:
| https://www.asus.com/us/Motherboard-Accessories/HYPER-M-2-X1...
| vs https://highpoint-tech.com/USA_new/nvme_raid_controllers.htm .
| One card is about 10x more expensive, but the performance
| looks the same. Am I missing something?
| tanelpoder wrote:
| The ASUS one doesn't have its own RAID controller nor PCIe
| switch onboard. It relies on the motherboard-provided PCIe
| bifurcation and if using hardware RAID, it'd use AMD's built-in
| RAID solution (but I'll use software RAID via Linux dm/md). The
| HighPoint SSD7500 seems to have a proprietary RAID
| controller built into it and some management/monitoring
| features too (it's the "somewhat enterprisey" version).
| wtallis wrote:
| The HighPoint card doesn't have a hardware RAID controller,
| just a PCIe switch and an option ROM providing boot support
| for their software RAID.
|
| PCIe switch chips were affordable in the PCIe 2.0 era when
| multi-GPU gaming setups were popular, but Broadcom decided to
| price them out of the consumer market for PCIe 3 and later.
| tanelpoder wrote:
| Ok, thanks, good to know. I misunderstood from their
| website.
| rektide wrote:
| pcie switches getting expensive is so the suck.
| MrFoof wrote:
| U.2 form factor drives (also NVMe protocol) can achieve
| higher IOPS (particularly writes) than M.2 form factor
| drives (especially M.2 2280), with higher durability, but
| you'll need your own controllers, which are sparse on the
| market for the moment. Throughput (MB/sec, not IOPS) will
| be about the same, but the U.2 drives can sustain it for
| longer.
|
| U.2 means more NAND to parallelize over, more spare area (and
| higher overall durability), potentially larger DRAM caches, and a
| far larger area to dissipate heat. Plus it has all the fancy
| bleeding-edge features you aren't going to see on consumer-grade
| drives.
|
| -- -----
|
| The big issue with U.2 for "end user" applications like
| workstations is you can't get drivers from Samsung for things
| like the PM1733 or PM9A3 (which blow the doors off the 980 Pro,
| especially for writes and $/GB, plus other neat features like
| Fail-In-Place) unless you're an SI, in which case you also
| co-developed the firmware. The same goes for SanDisk,
| KIOXIA and other makers of enterprise SSDs.
|
| The kicker is enterprise U.2 drives are about the same
| $/GB as SATA drives but, being NVMe PCIe 4.0 x4, blow the
| doors off just about everything. There's also the EDSFF,
| NF1 and now E1.L form factors, but U.2 is very prevalent.
| Enterprise SSDs are
| attractive as that's where the huge volume is (hence the low
| $/GB), but end-user support is really limited. You can use
| "generic drivers", but you won't see anywhere near the peak
| performance of the drives.
|
| The good news is both Micron and Intel have great support for
| end-users, where you can get optimized drivers and updated
| firmware. Intel has the D7-P5510 probably hitting VARs and some
| retail sellers (maybe NewEgg) within about 60 days. Similar
| throughput to the Samsung drives, far more write IOPS (especially
| sustained), lower latencies, FAR more durability (with a big
| warranty), far more capacity, and not too bad a price (looking
| like ~$800USD for 3.84TB with ~7.2PB of warrantied writes over 5
| years).
|
| -- -----
|
| My plan once Genesis Peak (Threadripper 5XXX) hits is four 3.84TB
| Intel D7-P5510s in RAID10, connected to a HighPoint SSD7580 PCIe
| 4.0 x16 controller. Figure ~$4,000 for a storage setup of ~7.3TB
| usable space after formatting, ~26GB/sec peak reads,
| ~8GB/sec peak writes, 2.8M 4K read IOPS, 700K 4K write
| IOPS, and ~14.3PB of warrantied write durability.
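|
| The capacity and endurance figures follow from mirroring
| (RAID10 halves both across the four drives):

```shell
# Sanity check of the RAID10 plan above: 4 x 3.84 TB drives,
# ~7.2 PB warrantied writes each, mirrored in pairs.
awk 'BEGIN {
  drives = 4; tb = 3.84; pbw = 7.2
  printf "raw usable space: %.2f TB\n", drives * tb / 2
  printf "write endurance:  %.1f PB\n", drives * pbw / 2
}'
```
|
| (7.68 TB raw lines up with ~7.3 TB after formatting, and
| 14.4 PB with the ~14.3 PB quoted.)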
| floatboth wrote:
| How would a model-specific driver for something that speaks
| NVMe even work? Is it for Linux? Is it open? Is it just
| modifications to the stock Linux NVMe driver that take some
| drive specifics into account? Or is it some stupid proprietary
| NVMe stack?
| tutfbhuf wrote:
| This article focuses on IOPS and throughput, but what is also
| important for many applications is I/O latency, which can be
| measured with ioping (apt-get install ioping). Unfortunately,
| even 10x PCIe 4.0 NVMe drives do not provide any better
| latency than a single NVMe drive. If you are constrained
| by disk latency then 11M IOPS won't gain you much.
| cheeze wrote:
| Does this come up in practice? What kind of use cases suffer
| from disk latency?
|
| This stuff is all fascinating to me. I have a zfs NAS but I
| feel like I've barely scratched the surface of SSDs
| tutfbhuf wrote:
| > Does this come up in practice? What kind of use cases
| suffer from disk latency?
|
| One popular example is HFT.
|
| And from my experience on a desktop PC, it is better to
| disable swap and have the OOM killer do its work, instead
| of swapping to disk, which makes my system noticeably
| laggy, even with a fast NVMe.
| sitkack wrote:
| Anything with transaction SLOs in the microsecond or
| millisecond range. Adtech, fintech, fraud detection, call
| records, shopping carts.
|
| Two big players in this space are Aerospike and ScyllaDB.
| qaq wrote:
| Now price this in terms of AWS and marvel at the markup
| speedgoose wrote:
| I'm afraid Jeff Bezos himself couldn't afford such IOs on AWS.
| nwmcsween wrote:
| So Linus was wrong in his rant to Dave about the page
| cache being detrimental on fast devices
| svacko wrote:
| I wonder, does the increasing temperature of the M.2 NVMe
| disks affect the measured performance? Or is the P620
| cooling system efficient enough to keep the temperature of
| that many disks low?
|
| Anyway, thanks for the inspiring post!
| tanelpoder wrote:
| Both quad-SSD adapters had fans on them and the built-in
| M.2 ones had heatsinks, right in front of one large
| chassis fan & air intake. I didn't measure the SSD
| temperatures, but the I/O
| rate didn't drop over time. I was bottlenecked by CPU when
| doing small I/O tests, I monitored the current MHz from
| /proc/cpuinfo to make sure that the CPU speeds didn't drop
| lower than their nominal 3.9 GHz (and they didn't).
|
| Btw, even the DIMMs have dedicated fans and enclosure (one per
| 4 DIMMs) on the P620.
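|
| The clock check mentioned above is a one-liner (Linux-only;
| the values are instantaneous, so sample it in a loop while
| the benchmark runs):

```shell
# Average the per-core "cpu MHz" readings from /proc/cpuinfo.
# Some platforms (e.g. many ARM boxes) omit this field entirely.
grep "cpu MHz" /proc/cpuinfo | awk '
  { sum += $4; n++ }
  END {
    if (n) printf "avg %.0f MHz across %d cores\n", sum/n, n
    else   print "no cpu MHz field on this platform"
  }'
```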
| ogrisel wrote:
| As a nitpicking person, I really like to read a post that does
| not confuse GB/s for GiB/s :)
|
| https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
| ogrisel wrote:
| Actually now I realize that the title and the intro paragraph
| are contradicting each other...
| tanelpoder wrote:
| Yeah, I used the formally incorrect GB in the title when I
| tried to make it look as simple as possible... GiB just
| didn't look as nice in the "marketing copy" :-)
|
| I may have missed using the right unit in some other sections
| too. At least I hope that I've conveyed that there's a
| difference!
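|
| For reference, the gap the distinction hides at this
| article's scale:

```shell
# 66 GB/s (decimal) expressed in GiB/s (binary), and vice versa.
awk 'BEGIN {
  printf "66 GB/s  = %.2f GiB/s\n", 66e9 / 2^30
  printf "66 GiB/s = %.2f GB/s\n", 66 * 2^30 / 1e9
}'
```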
___________________________________________________________________
(page generated 2021-01-30 23:02 UTC)