[HN Gopher] Achieving 11M IOPS and 66 GB/S IO on a Single Thread...
       ___________________________________________________________________
        
       Achieving 11M IOPS and 66 GB/S IO on a Single ThreadRipper
       Workstation
        
       Author : tanelpoder
       Score  : 429 points
       Date   : 2021-01-29 12:45 UTC (1 day ago)
        
 (HTM) web link (tanelpoder.com)
 (TXT) w3m dump (tanelpoder.com)
        
       | secondcoming wrote:
       | > For final tests, I even disabled the frequent gettimeofday
       | system calls that are used for I/O latency measurement
       | 
       | I was knocking up some profiling code and measured the
       | performance of gettimeofday as a proof-of-concept test.
       | 
       | The performance difference between running the test on my
       | personal desktop Linux VM and running it on a cloud-instance
       | Linux VM was quite interesting (the cloud was worse).
       | 
       | I think I read somewhere that cloud instances cannot use the
       | vDSO code path because your app may be moved to a different
       | machine. My recollection of the reason is somewhat cloudy.
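The gettimeofday cost is easy to measure directly. A rough sketch in Python, whose `time.time()` goes through the same clock_gettime/gettimeofday path via libc; the "tens of ns" and clocksource names in the comments are typical values, not guarantees:

```python
import time

def ns_per_call(n=200_000):
    """Average cost of a wall-clock read, in nanoseconds per call."""
    start = time.perf_counter_ns()
    for _ in range(n):
        time.time()  # clock_gettime(CLOCK_REALTIME) under the hood
    return (time.perf_counter_ns() - start) / n

# With the "tsc" clocksource the vDSO fast path keeps this cheap
# (tens of ns plus interpreter overhead); if the kernel falls back to
# a real syscall (e.g. the "xen" clocksource on some cloud VMs), it is
# several times slower. The active clocksource can be checked in
# /sys/devices/system/clocksource/clocksource0/current_clocksource
print(f"{ns_per_call():.0f} ns per call")
```

Comparing the number from a desktop VM against a cloud VM reproduces the difference described above.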
        
       | ashkankiani wrote:
       | When I bought a bunch of NVMe drives, I was disappointed with
       | the maximum speed I could achieve from them, given my knowledge
       | and available time. Thanks for making this post to give me more
       | insight into the problem.
       | 
       | I'm on the same page with your thesis that "hardware is fast and
       | clusters are usually overkill," and disk I/O was a piece that I
       | hadn't really figured out yet despite making great strides in the
       | software engineering side of things. I'm trying to make a startup
       | this year and disk I/O will actually be a huge factor in how far
       | I can scale without bursting costs for my application. Good
       | stuff!
        
       | chx wrote:
       | A terabyte of RAM on your desktop.
       | 
       | It was less than a quarter century ago, in 1997, that Microsoft
       | and Compaq launched the TerraServer, whose name was a wordplay
       | on terabyte -- it stored a terabyte of data and it was a Big
       | Deal. Today that's not storage, that's main RAM, unencumbered by
       | NUMA.
        
       | whalesalad wrote:
       | This post is fantastic. I wish there was more workstation porn
       | like this for those of us who are not into the RGB light show
       | ripjaw hacksaw aorus elite novelty stuff that gamers are so into.
       | Benchmarks in the community are almost universally focused on
       | gaming performance and FPS.
       | 
       | I want to build an epic rig that will last a long time with
       | professional grade hardware (with ECC memory for instance) and
       | would love to get a lot of the bleeding-edge stuff without
       | compromising on durability. Where do these people hang out
       | online?
        
         | piinbinary wrote:
         | The Level1Techs forums seem to have a lot of people with
         | similar interests
        
         | [deleted]
        
         | greggyb wrote:
         | STH: https://www.youtube.com/user/ServeTheHomeVideo
         | https://www.servethehome.com/
         | 
         | GamersNexus (despite the name, they include a good amount of
         | non-gaming benchmarks, and they have great content on cases and
         | cooling): https://www.youtube.com/user/GamersNexus
         | https://www.gamersnexus.net/
         | 
         | Level1Techs (mentioned in another reply):
         | https://www.youtube.com/c/Level1Techs
         | https://www.level1techs.com/
         | 
         | r/homelab (and all the subreddits listed in its sidebar):
         | https://www.reddit.com/r/homelab/
         | 
         | Even LinusTechTips has some decent content for server hardware,
         | though they stay fairly superficial. And the forum definitely
         | has people who can help out: https://linustechtips.com/
         | 
         | And the thing is, depending on what metric you judge
         | performance by, the enthusiast hardware may very well
         | outperform the server hardware. For something that is
         | sensitive to memory speed, for example, you can get much
         | faster RAM in enthusiast SKUs
         | (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than
         | you'll find in server hardware. Similarly, the HEDT SKUs
         | outclock the server SKUs for both Intel and AMD.
         | 
         | I have a Threadripper system that outperforms most servers I
         | work with on a daily basis, because most of my workloads,
         | despite being multi-threaded, are sensitive to clockspeed.
        
           | 1996 wrote:
           | Indeed, serious people now use gamer computer parts because
           | they're just faster!
        
             | greggyb wrote:
             | It's not "just faster".
             | 
             | No one's using "gamer NICs" for high-speed networking. Top-
             | of-the-line "gaming" networking is 802.11ax or 10GbE;
             | 2x200Gb/s NICs are available now.
             | 
             | Gaming parts are strictly single socket - software that can
             | take advantage of >64 cores will need server hardware -
             | either one of the giant Ampere ARM CPUs or a 2+ socket
             | system.
             | 
             | If something must run in RAM and needs TB of RAM, well then
             | it's not even a question of faster or slower. The
             | capability only exists on server platforms.
             | 
             |  _Some_ workloads will benefit from the performance
             | characteristics of consumer hardware.
        
               | mgerdts wrote:
               | The desktop used in TFA supports 1 TB of RAM.
               | 
               | https://www.lenovo.com/us/en/thinkstation-p620
        
               | greggyb wrote:
               | Workstations and desktops are distinct market segments.
               | The machine in the article uses a workstation platform.
               | And the workstation processors available in that Lenovo
               | machine clock slower than something like a 5950X
               | mainstream processor. The RDIMMs you need to get to 1TB
               | in the machine run much slower than the UDIMMs I linked
               | above.
        
         | vmception wrote:
         | > RGB light show ripjaw hacksaw aorus elite novelty stuff
         | 
         | haha yeah I bought a whole computer from someone and was
         | wondering why the RAM looked like rupees from Zelda
         | 
         | apparently that is common now
         | 
         | but at least I'm not cosplaying as a karate day trader for my
         | Wall Street Journal exposé
        
           | ecf wrote:
           | Because flashy RGB is the default mode used for marketing
           | purposes.
           | 
           | I'm not trying to be snarky here but you can always just turn
           | off the lights or set it to be a solid color of your
           | preference.
        
         | philsnow wrote:
         | I'm with you on this. I just built a (much more modest than
         | the article's) workstation/homelab machine a few months ago,
         | to replace my previous one, which was going on 10 years old
         | and showing its age.
         | 
         | There are some folks in /r/homelab who are into this kind of
         | thing, and I used their advice a fair bit in my build. While
         | the crowd is kind of mixed (a lot of people build Pi clusters
         | as their homelab), there are still plenty of people who buy
         | decommissioned "enterprise" hardware and make monstrous-for-
         | home-use things.
        
         | COGlory wrote:
         | Check out Wendell Wilson, of Level1Techs on YouTube
         | (https://www.youtube.com/channel/UCOWcZ6Wicl-1N34H0zZe38w or
         | https://www.youtube.com/user/teksyndicate), and
         | https://forum.level1techs.com
        
         | deagle50 wrote:
         | Happy to help if you want feedback. The ServeTheHome forums
         | are also a great source of info and used hardware, probably
         | the best community for your needs.
        
           | gigatexal wrote:
           | +1 to ServeTheHome, the forums have some of the nicest and
           | smartest people I've ever met online.
           | 
           | HardForum is cool too
        
         | arminiusreturns wrote:
         | Check out HardForum. Lots of very knowledgeable people on
         | there helped me mature my hardware-level knowledge, back when
         | I was building 4-CPU, 64-core Opteron systems. Also decent
         | banter.
        
         | tomc1985 wrote:
         | Though I have to wonder.... would these be good gaming systems?
         | Are there any scenarios where the perks (stupid numbers of
         | cores, 8-channel memory, 128 PCI-E lanes, etc) would help?
        
           | sp332 wrote:
           | Not many games are written to scale out that far. I remember
           | Ashes of the Singularity was used to showcase Ryzen CPUs
           | though.
        
           | eldelshell wrote:
           | Not really. Gaming is bound by latency and rendering, not
           | scale or bandwidth. Memory and I/O usage is pretty constant
           | while the game is running.
        
         | tanelpoder wrote:
         | Thanks! In case you're interested in building a ThreadRipper
         | Pro WX-based system like mine, then AMD apparently starts
         | selling the CPUs independently from March 2021 onwards:
         | 
         | https://www.anandtech.com/show/16396/the-amd-wrx80-chipset-a...
         | 
         | Previously you could only get this CPU when buying the Lenovo
         | ThinkStation P620 machine. I'm pretty happy with Lenovo
         | ThinkStations though (I bought a P920 with dual Xeons 2.5
         | years ago).
        
           | ksec wrote:
           | And a just-in-time article:
           | 
           | https://www.anandtech.com/show/16462/hands-on-with-the-
           | asus-...
           | 
           | I guess I should submit this on HN as well.
           | 
           | Edit: I was getting ahead of myself. I thought these were
           | for TR Pro with Zen 3; turns out they are not out yet.
        
             | walrus01 wrote:
             | My only quibble with that board is that I worry about how
             | easily the fan on the chipset can be replaced. In my
             | experience that exact type of fan will inevitably fail in
             | a moderately dusty environment... And it doesn't look like
             | anything you could swap for the common industry-standard
             | 40mm or 60mm 12VDC fans that come in various thicknesses.
        
               | opencl wrote:
               | Supermicro's WRX80 motherboard looks like it has an
               | easily replaceable chipset fan, not sure about the
               | Gigabyte one.
               | 
               | https://www.anandtech.com/show/16396/the-amd-
               | wrx80-chipset-a...
        
         | zhdc1 wrote:
         | Look at purchasing used enterprise hardware. You can buy a
         | reliable X9- or X10-generation Supermicro server (rack or
         | tower) for around a couple hundred dollars.
        
           | igorstellar wrote:
           | The downside of buying enterprise gear for home use is
           | noise - their turbofan coolers are insanely loud, while
           | consumer-grade 120mm coolers (Noctua et al.) are nearly
           | silent.
        
             | jordanbeiber wrote:
             | It's mostly about the chassis though - density is
             | important with enterprise stuff, and noise level is almost
             | irrelevant, hence small chassis with small, loud fans.
             | 
             | I've got a 3.5" x16 bay gooxi chassis that I've put a
             | supermicro mb + xeon in.
             | 
             | Something like this:
             | 
             | https://www.xcase.co.uk/collections/3u-rackmount-
             | cases/produ...
             | 
             | I got this specific NAS chassis because it has a fan wall
             | with 3x120mm fans, not because I need the bays.
             | 
             | With a few rather cool SSD's for storage and quiet noctua
             | fans it is barely a whisper.
             | 
             | Also - vertical rack mounting behind a closet door! I can
             | have a massive chassis that basically takes up no space at
             | all. Can't believe I didn't figure that one out earlier...
        
             | leptons wrote:
             | Noise isn't the only downside - the power they consume can
             | cost $$$. These things aren't typically the most energy
             | efficient machines.
        
             | girvo wrote:
             | Would swapping them out for Noctuas be difficult?
        
               | cfn wrote:
               | Mostly yes, because server chassis are very compact and
               | sometimes use proprietary connectors and fans. Still,
               | many people have done it with good results; have a look
               | on YouTube to see which server models are best suited to
               | that kind of customization.
        
               | alvern wrote:
               | I've not been successful trying this with HPE servers.
               | Most server fans (Foxconn/Delta) run at 2.8 amps or
               | higher. I'm not aware of any "silent" gaming-grade fans
               | that draw more than 0.38 amps. And that's not even
               | considering the CFM.
        
               | weehoo wrote:
               | Why would current be relevant here? Shouldn't operating
               | voltage be the only thing that matters?
        
               | sokoloff wrote:
               | Amps * Volts is power. Power is a proxy (a moderately
               | good one) for air movement (a mix of volume/mass at a
               | specific [back-]pressure).
               | 
               | It's not likely that a silent 2W fan will move a similar
               | amount of air as the stock 14W fans. The enterprise gear
               | from HPE is pretty well engineered; I'm skeptical that
               | they over-designed the fans by a 7x factor.
               | 
               | Operating voltage tells you "this fan won't burn up when
               | you plug it in". It doesn't tell you "will keep the
               | components cool".
        
             | smartbit wrote:
             | Another downside is power consumption at rest. A
             | Supermicro board with 2x Xeon uses 80 watts at minimum.
             | Add a 10Gbit switch and a few more peripherals and you're
             | looking at an  _additional_ EUR 80/month electricity bill.
             |  _Year after year_ , that is EUR 10,000 in 10 years.
             | 
             | Of course that is nothing compared to what you'd pay at
             | Google/Azure/AWS for the AMD machine of this news item :-)
             | 
             | 12V _only_ PSUs like OEMs use, or ATX12VO, in combination
             | with a motherboard without IPMI (similar to the German
             | Fujitsu motherboards), have significantly lower power
             | consumption at rest - somewhere around 8-10 watts without
             | HDDs. Much better for home use IMHO.
        
               | [deleted]
        
               | devonkim wrote:
               | In the US, electricity rates are typically much cheaper
               | than in the EU. My rate is roughly 0.08 EUR/kWh, for
               | example, and I don't get any subsidies to convert to
               | solar, so there's no way it would pay off for me within
               | 15 years (longer than most people expect to stay in a
               | home here). Other US states subsidize so heavily, or
               | have such high electricity rates, that most people have
               | solar panels (see: Hawaii, with among the highest
               | electricity costs in the US).
               | 
               | Regardless of electricity cost, all that electricity
               | usage winds up with a lot of heat in a dwelling. To help
               | offset the energy consumption in the future I plan to use
               | a hybrid water heater that can act as a heat pump and
               | dehumidifier and capture the excess heat as a way to
               | reduce energy consumption for hot water.
        
           | ashkankiani wrote:
           | I've been planning to do this, but enterprise hardware
           | seems to require a completely different body of knowledge
           | to purchase and maintain, especially as a consumer.
           | 
           | It's not as trivial a barrier to entry as with consumer
           | desktops, but I suppose that's the point. Still, it would
           | be nice if there were a guide to help me make good
           | decisions from the start.
        
             | bombcar wrote:
             | It's actually much EASIER in my experience - enterprise
             | gear is made to be easily repairable, and most if not all
             | parts can be swapped without tools.
             | 
             | Loud though - most of them run pretty quiet if not doing
             | anything.
        
             | jqcoffey wrote:
             | Also, purpose-built data center chassis are designed for
             | high airflow and are thus really quite loud.
        
               | modoc wrote:
               | Very true. I have a single rack mount server in my HVAC
               | room, and it's still so loud I had to glue soundproofing
               | foam on the nearby walls:)
        
             | gogopuppygogo wrote:
             | Most people who get into home labs spend some time on
             | research and throw some money at gaining an education.
             | 
             | Compute is so cheap second hand.
        
           | [deleted]
        
       | benlwalker wrote:
       | Plug for a post I wrote a few years ago demonstrating nearly the
       | same result but using only a single CPU core:
       | https://spdk.io/news/2019/05/06/nvme/
       | 
       | This is using SPDK to eliminate all of the overhead the author
       | identified. The hardware is far more capable than most people
       | expect, if the software would just get out of the way.
        
         | tanelpoder wrote:
         | Yes, I had seen that one (even more impressive!).
         | 
         | When I have more time again, I'll run fio with the SPDK
         | plugin on my kit too. And I'd be interested in seeing what
         | happens when doing 512B random I/Os.
        
           | benlwalker wrote:
           | The system that was tested there was PCIe bandwidth
           | constrained because this was a few years ago. With your
           | system, it'll get a bigger number - probably 14 or 15 million
           | 4KiB IO per second per core.
           | 
           | But while SPDK does have an fio plug-in, unfortunately you
           | won't see numbers like that with fio. There's way too much
           | overhead in the tool itself. We can't get beyond 3 to 4
           | million with that. We rolled our own benchmarking tool in
           | SPDK so we can actually measure the software we produce.
           | 
           | Since the core is CPU bound, 512B IO are going to net the
           | same IO per second as 4k. The software overhead in SPDK is
           | fixed per IO, regardless of size. You can also run more
           | threads with SPDK than just one - it has no locks or cross
           | thread communication so it scales linearly with additional
           | threads. You can push systems to 80-100M IO per second if you
           | have disks and bandwidth that can handle it.
        
             | StillBored wrote:
             | Yah this has been going on for a while. Before SPDK it was
             | done with custom kernel bypasses and fast InfiniBand/FC
             | arrays. I was involved with a similar project in the early
             | 2000s, where at the time the bottleneck was the shared
             | Xeon bus, and then it moved to the PCIe bus with
             | Opterons/Nehalem+. In our case we ended up spending a lot
             | of time tuning the application to avoid cross-socket
             | communication as well, since that could become a big deal
             | (after careful card placement, of course).
             | 
             | But SPDK has a problem you don't have with bypasses and
             | io_uring, in that it needs the IOMMU enabled, and that can
             | itself become a bottleneck. There are also issues for some
             | applications that want to use interrupts rather than poll
             | everything.
             | 
             | What's really nice about io_uring is that it sort of
             | standardizes a large part of what people were doing with
             | bypasses.
        
             | tanelpoder wrote:
             | Yeah, that's what I wondered - I'm OK with using multiple
             | cores; would I get even more IOPS when doing smaller I/Os?
             | Is the benchmark suite you used part of the SPDK toolkit
             | (and easy enough to run)?
        
               | benlwalker wrote:
               | Whether you get more IOPs with smaller I/Os depends on a
               | number of things. Most drives these days are natively
               | 4KiB blocks and are emulating 512B sectors for backward
               | compatibility. This emulation means that 512B writes are
               | often quite slow - probably slower than writing 4KiB
               | (with 4KiB alignment). But 512B reads are typically very
               | fast. On Optane drives this may not be true because the
               | media works entirely differently - those may be able to
               | do native 512B writes. Talk to the device vendor to get
               | the real answer.
               | 
               | For at least reads, if you don't hit a CPU limit you'll
               | get 8x more IOPS with 512B than you will with 4KiB with
               | SPDK. It's more or less perfect scaling. There's some
               | additional hardware overheads in the MMU and PCIe
               | subsystems with 512B because you're sending more messages
               | for the same bandwidth, but my experience has been that
               | it is mostly negligible.
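The 8x figure for 512B vs 4KiB is plain bandwidth arithmetic. A sketch using the article's ~66 GB/s aggregate as an assumed ceiling:

```python
def bandwidth_limited_iops(bytes_per_sec, block_size):
    """Max IOPS at a given block size if only bandwidth limits you."""
    return bytes_per_sec / block_size

BW = 66e9  # ~66 GB/s aggregate, per the article

iops_4k = bandwidth_limited_iops(BW, 4096)  # ~16.1M IOPS
iops_512 = bandwidth_limited_iops(BW, 512)  # ~128.9M IOPS, 8x more
print(f"4KiB: {iops_4k / 1e6:.1f}M IOPS, 512B: {iops_512 / 1e6:.1f}M IOPS")
```

In practice the per-IO CPU cost, not bandwidth, becomes the binding limit first, which is exactly the point made above.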
               | 
               | The benchmark builds to build/examples/perf and you can
               | just run it with -h to get the help output. Random 4KiB
               | reads at 32 QD to all available NVMe devices (all devices
               | unbound from the kernel and rebound to vfio-pci) for 60
               | seconds would be something like:
               | 
               | perf -q 32 -o 4096 -w randread -t 60
               | 
               | You can test only specific devices with the -r
               | parameter (by BUS:DEVICE:FUNCTION, essentially). The
               | tool can also benchmark kernel devices. Using -R turns on
               | io_uring (otherwise it uses libaio), and you simply list
               | the block devices on the command line after the base
               | options like this:
               | 
               | perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
               | 
               | You can get ahold of help from the SPDK community at
               | https://spdk.io/community. There will be lots of people
               | willing to help.
               | 
               | Excellent post by the way. I really enjoyed it.
        
               | tanelpoder wrote:
               | Thanks! Will add this to TODO list too.
        
       | rektide wrote:
       | Nice follow-up @tanelpoder to "RAM is the new disk" (2015)[1],
       | which we talked about not even two weeks ago!
       | 
       | I was quite surprised to hear in that thread that AMD's
       | Infinity Fabric was so oversubscribed. There's 256GB/s of PCIe
       | on a 1P, but it seems like this 66GB/s is all the fabric can
       | do. A little under a 4:1 oversubscription!
       | 
       | [1] https://news.ycombinator.com/item?id=25863093
        
         | electricshampo1 wrote:
         | 66GB/s is from each of 10 drives doing ~6.6 GB/s; I don't
         | think the Infinity Fabric is the limiter here
        
           | rektide wrote:
           | I'd been going off this link[1] from the previous "RAM is
           | the new disk" thread, but I think last time I read it I'd
           | only counted one Infinity Fabric Inter-Socket link on the
           | 1P diagram (which provides the PCIe). On review, I'm
           | willing to bet the PCIe lanes aren't all sharing the one
           | IFIS. The diagram is there to give an idea, not the actual
           | configuration.
           | 
           | [1] https://en.wikichip.org/wiki/amd/infinity_fabric#Scalable
           | _Da...
        
       | [deleted]
        
       | muro wrote:
       | This article was great, thanks for sharing!
       | 
       | Does anyone have advice on optimizing a Windows 10 system? I
       | have a Haswell workstation (E5-1680 v3) that I find reasonably
       | fast and that works very well under Linux. In Windows, I get
       | lost. I ran the UserBenchmark suite, which told me I'm below
       | median for most of my components. Is there any good advice on
       | how to improve that? Which tools give good insight into what
       | the machine is doing under Windows? I'd like to first try to
       | optimize what I have before upgrading to the new shiny :).
        
       | RobLach wrote:
       | Excellent article. Worth a read even if you're not maxing IO.
        
       | tutfbhuf wrote:
       | This is a very synthetic fio benchmark; I would like to see how
       | actual applications like a Postgres database would perform on
       | such a tuned machine.
        
         | tanelpoder wrote:
         | Yep, some "real" workload tests are coming next (using
         | filesystems). I wanted to start from low level basics and later
         | build on top of that.
        
       | wiradikusuma wrote:
       | I've been thinking about this. Would traditional co-location
       | (e.g. 2x 2U from Dell) in a local data center be cheaper if,
       | e.g., you're serving a local (country-level) market?
        
         | derefr wrote:
         | Depends on how long you need the server, and the ownership
         | model you've chosen to pursue for it.
         | 
         | If you _purchase_ a server and stick it in a co-lo somewhere,
         | and your business plans to exist for 10+ years -- well, is that
         | server still going to be powering your business 10 years from
         | now? Or will you have moved its workloads to something newer?
         | If so, you'll probably want to decommission and sell the
         | server at some point. The time required to deal with that might
         | not be worth the labor costs of your highly-paid engineers.
         | Which means you might not actually end up re-capturing the
         | depreciated value of the server, but instead will just let it
         | rot on the shelf, or dispose of it as e-waste.
         | 
         | Hardware _leasing_ is a lot simpler. When you lease servers
         | from an OEM like Dell, there's a quick, well-known path to
         | getting the EOLed hardware shipped back to Dell and the
         | depreciated value paid back out to you.
         | 
         | And, of course, hardware _renting_ is simpler still. Renting
         | the hardware of the co-lo (i.e.  "bare-metal unmanaged server"
         | hosting plans) means never having to worry about the CapEx of
         | the hardware in the first place. You just walk away at the end
         | of your term. But, of course, that's when you start paying
         | premiums on top of the hardware.
         | 
         | Renting VMs, then, is like renting hardware on a micro-scale;
         | you never have to think about what you're running on, as --
         | presuming your workload isn't welded to particular machine
         | features like GPUs or local SSDs -- you'll tend to
         | automatically get migrated to newer hypervisor hardware
         | generations as they become available.
         | 
         | When you work it out in terms of "ten years of ops-staff labor
         | costs of dealing with generational migrations and sell-offs"
         | vs. "ten years of premiums charged by hosting rentiers", the
         | pricing is surprisingly comparable. (In fact, this is basically
         | the math hosting providers use to figure out what they _can_
         | charge without scaring away their large enterprise customers,
         | who are fully capable of taking a better deal if there is one.)
        
           | rodgerd wrote:
           | > If you purchase a server and stick it in a co-lo somewhere,
           | and your business plans to exist for 10+ years -- well, is
           | that server still going to be powering your business 10 years
           | from now? Or will you have moved its workloads to something
           | newer?
           | 
           | Which, if you have even the remotest fiscal competence,
           | you'll have funded by using the depreciation of the book
           | value of the asset after 3 years.
        
       | 37ef_ced3 wrote:
       | Somebody please tell me how many ResNet50 inferences you can do
       | per second on one of these chips
       | 
       | Here is the standalone AVX-512 ResNet50 code (C99 .h and .c
       | files):
       | 
       | https://nn-512.com/browse/ResNet50
       | 
       | Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible
        
         | wyldfire wrote:
         | Whoa, this code looks interesting. Must've been emitted by
         | something higher-level? Something like PyTorch/TF/MLIR/TVM/Glow
         | maybe?
         | 
         | If that is the case, then maybe it could be emitted again while
         | masking the instruction sets Ryzen doesn't support yet.
        
         | tanelpoder wrote:
         | You mean on the CPU, right? This CPU doesn't support AVX-512:
         | 
         |   $ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" |
         |     sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
         |   avx  avx2  misalignsse  popcnt  sse  sse2  sse4_1
         |   sse4_2  sse4a  ssse3
         | 
         | What compile/build options should I use?
        
           | 37ef_ced3 wrote:
           | No AVX-512, forget it then
        
         | xxpor wrote:
         | They don't have avx512 instructions.
        
       | qaq wrote:
       | Now, honestly: for how long would two boxes like this behind a
       | load balancer be more than enough for your startup?
        
       | pbalcer wrote:
       | What I find interesting about the performance of this type of
       | hardware is how it affects the software we use for storage. The
       | article talked about how the Linux kernel just can't keep up,
       | but what about databases or KV stores? Are the trade-offs those
       | types of solutions make still valid for this type of hardware?
       | 
       | RocksDB, and LSM algorithms in general, seem to be designed with
       | the assumption that random block I/O is slow. It appears that,
       | for modern hardware, that assumption no longer holds, and the
       | software only slows things down [0].
       | 
       | [0] -
       | https://github.com/BLepers/KVell/blob/master/sosp19-final40....
        
         | ddorian43 wrote:
         | Disappointed there was no lmdb comparison in there.
        
         | tyingq wrote:
         | A paper on making LSM more SSD friendly:
         | https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf
        
           | pbalcer wrote:
           | Thanks for sharing this article - I found it very insightful.
           | I've seen similar ideas being floated around before, and they
           | often seem to focus on what software can be added on top of
           | an already fairly complex solution (while LSM can appear to
           | be conceptually simple, its implementations are anything
           | but).
           | 
           | To me, what the original article shows is an opportunity to
           | remove - not add.
        
         | jeffbee wrote:
         | If you think about it from the perspective of the authors of
         | large-scale databases, linear access is still a lot cheaper
         | than random access in a datacenter filesystem.
        
         | AtlasBarfed wrote:
          | ScyllaDB had a blog post once about how surprisingly little
          | CPU time is available to process each packet on today's
          | fastest networks, 40 Gbit/s and up.
          | 
          | I can't find it now. I think they were trying to say that
          | Cassandra can't keep up because of the JVM overhead and you
          | need to be close to the metal for extreme performance.
         | 
          | This is similar. Huge amounts of flooding I/O from modern PCIe
          | SSDs really closes the traditional gap between CPU and "disk".
         | 
          | The biggest limiter in the cloud right now is EBS/SAN. Sure,
          | you can use local storage in AWS if you don't mind it
          | disappearing, and while gp3 is an improvement, it pales in
          | comparison to stuff like this.
         | 
         | Also, this is fascinating:
         | 
         | "Take the write speeds with a grain of salt, as TLC & QLC cards
         | have slower multi-bit writes into the main NAND area, but may
         | have some DIMM memory for buffering writes and/or a "TurboWrite
         | buffer" (as Samsung calls it) that uses part of the SSDs NAND
         | as faster SLC storage. It's done by issuing single-bit "SLC-
         | like" writes into TLC area. So, once you've filled up the "SLC"
         | TurboWrite buffer at 5000 MB/s, you'll be bottlenecked by the
         | TLC "main area" at 2000 MB/s (on the 1 TB disks)."
         | 
         | I didn't know controllers could swap between TLC/QLC and SLC.
        
           | tanelpoder wrote:
           | I learned the last bit from here (Samsung Solid State Drive
           | TurboWrite Technology pdf):
           | 
           | https://images-eu.ssl-images-
           | amazon.com/images/I/914ckzwNMpS...
        
           | PeterCorless wrote:
           | Hi! From ScyllaDB here. There are a few things that help us
           | really get the most out of hardware and network IO.
           | 
           | 1. Async everywhere - We use AIO and io_uring to make sure
           | that your inter-core communications are non-blocking.
           | 
           | 2. Shard-per-core - It also helps if specific data is pinned
           | to a specific CPU, so we partition on a per-core basis.
           | Avoids cross-CPU traffic and, again, less blocking.
           | 
           | 3. Schedulers - Yes, we have our own IO scheduler and CPU
           | scheduler. We try to get every cycle out of a CPU. Java is
           | very "slushy" and though you can tune a JVM it is never going
           | to be as "tight" performance-wise.
           | 
           | 4. Direct-attached NVMe > networked-attached block storage. I
           | mean... yeah.
           | 
           | We're making Scylla even faster now, so you might want to
           | check out our blogs on Project Circe:
           | 
           | * Introducing Project Circe:
           | https://www.scylladb.com/2021/01/12/making-scylla-a-
           | monstrou...
           | 
           | * Project Circe January Update:
           | https://www.scylladb.com/2021/01/28/project-circe-january-
           | up...
           | 
           | The latter has more on our new scheduler 2.0 design.
        
           | StillBored wrote:
            | Yes, a number of articles about these newer TLC drives talk
            | about it. The end result is that an empty drive is going to
            | benchmark considerably differently from one that is 99% full
            | of incompressible files.
           | 
           | for example:
           | 
           | https://www.tomshardware.com/uk/reviews/intel-
           | ssd-660p-qlc-n...
        
           | 1996 wrote:
           | > I didn't know controllers could swap between TLC/QLC and
           | SLC.
           | 
           | I wish I could control the % of SLC. Even dividing a QLC
           | space by 16 makes it cheaper than buying a similarly sized
           | SLC
        
         | 1MachineElf wrote:
         | Reminds me of the Solid-State Drive checkbox that VirtualBox
         | has for any VM disks. Checking it will make sure that the VM
         | hardware emulation doesn't wait for the filesystem journal to
         | be written, which would normally be advisable with spinning
         | disks.
        
         | digikata wrote:
         | Not only the assumptions at the application layer, but
         | potentially the filesystem too.
        
         | [deleted]
        
         | bob1029 wrote:
         | I have personally found that making even the most primitive
         | efforts at single-writer principle and batching IO in your
         | software can make many orders of magnitude difference.
         | 
         | Saturating an NVMe drive with a single x86 thread is trivial if
         | you change how you play the game. Using async/await and
         | yielding to the OS is not going to cut it anymore. Latency with
         | these drives is measured in microseconds. You are better off
          | doing micro-batches of writes (10-1000 µs wide) and pushing
         | these to disk with a single thread that monitors a queue in a
         | busy wait loop (sort of like LMAX Disruptor but even more
         | aggressive).
         | 
          | Thinking about high core count parts, sacrificing an entire
          | thread to busy waiting so you can write your transactions to
          | disk very quickly is not a terrible prospect anymore. This same
          | ideology is also really useful for ultra-precise execution of
          | future timed actions. Approaches in managed languages like
          | Task.Delay or even Thread.Sleep are insanely inaccurate by
          | comparison. The humble while(true) loop is certainly not energy
          | efficient, but it is very responsive and predictable as long as
          | you don't ever yield. What's one core when you have 63 more to
          | go around?
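          | 
          | (Illustrative sketch of this micro-batching idea - the
          | MicroBatchWriter class and its parameters are invented for
          | this example, and a real implementation would issue O_DIRECT
          | writes from a pinned thread rather than append to a list:)

```python
import threading
import time
from collections import deque

class MicroBatchWriter:
    """Single-writer principle: one dedicated thread drains a queue in a
    busy-wait loop, grouping everything that arrived within a short time
    window into one batched write."""

    def __init__(self, sink, window_us=500):
        self.q = deque()       # producers append, one consumer pops
        self.sink = sink       # callable that "writes" one batch (a list)
        self.window = window_us / 1e6
        self.running = True
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def submit(self, item):
        self.q.append(item)

    def _loop(self):
        while self.running or self.q:
            if not self.q:
                continue       # busy wait: never yield to the OS
            deadline = time.perf_counter() + self.window
            batch = []
            while time.perf_counter() < deadline:
                while self.q:
                    batch.append(self.q.popleft())
            self.sink(batch)   # one write call for the whole micro-batch

    def stop(self):
        self.running = False
        self.thread.join()

batches = []
w = MicroBatchWriter(batches.append)
for i in range(1000):
    w.submit(i)
w.stop()
print(sum(len(b) for b in batches))  # -> 1000 items, in far fewer writes
```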
        
           | MrFoof wrote:
           | >Latency with these drives is measured in microseconds.
           | 
            | For context and to put numbers around this, the average read
            | latency of the fastest, latest-generation PCIe 4.0 x4 U.2
            | enterprise drives is 82-86us, and the average write latency
            | is 11-16us.
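            | 
            | (Rough arithmetic implied by those numbers - a sketch using
            | midpoints of the quoted ranges:)

```python
# Queue-depth-1 arithmetic: with one request in flight, IOPS = 1 / latency,
# so those microsecond latencies directly cap per-request throughput.
read_lat_us = 84    # midpoint of the 82-86 us average read latency
write_lat_us = 13   # within the 11-16 us average write latency range
print(1_000_000 // read_lat_us)   # -> 11904 reads/s per outstanding request
print(1_000_000 // write_lat_us)  # -> 76923 writes/s per outstanding request
```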
        
           | pbalcer wrote:
           | The authors of the article I linked to earlier came to the
           | same conclusions. And so did the SPDK folks. And the kernel
           | community (or axboe :)) when coming up with io_uring. I'm
           | just hoping that we will see software catching up.
        
           | mikepurvis wrote:
           | Isn't the use or non-use of async/await a bit orthogonal to
           | the rest of this?
           | 
           | I'm not an expert in this area, but wouldn't it be just as
           | lightweight to have your async workers pushing onto a queue,
           | and then have your async writer only wake up when the queue
           | is at a certain level to create the batched write? Either
           | way, you won't be paying the OS context switching costs
           | associated with blocking a write thread, which I think is
           | most of what you're trying to get out of here.
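            | 
            | (A minimal asyncio sketch of that pattern - the
            | batched_writer helper and its parameters are hypothetical,
            | with a list standing in for the actual disk write:)

```python
import asyncio

async def batched_writer(queue, flush, batch_size=64):
    """Wake when work arrives, then also grab whatever else is already
    queued, so many logical writes become one batched write."""
    while True:
        item = await queue.get()        # sleeps; no busy wait needed
        if item is None:                # shutdown sentinel
            return
        batch = [item]
        while len(batch) < batch_size and not queue.empty():
            nxt = queue.get_nowait()
            if nxt is None:
                flush(batch)
                return
            batch.append(nxt)
        flush(batch)                    # one "write" per batch

async def main():
    q = asyncio.Queue()
    batches = []
    writer = asyncio.create_task(batched_writer(q, batches.append))
    for i in range(256):                # simulate a burst of small writes
        q.put_nowait(i)
    await q.put(None)
    await writer
    return batches

batches = asyncio.run(main())
print(sum(len(b) for b in batches))     # -> 256, in far fewer flushes
```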
        
             | pbalcer wrote:
             | Right, I agree. I'd go even further and say that
              | async/await is a great fit for a modern _asynchronous_ I/O
              | stack (not read()/write()). Especially with io_uring using
             | polled I/O (the worker thread is in the kernel, all the
             | async runtime has to do is check for completion
             | periodically), or with SPDK if you spin up your own I/O
             | worker thread(s) like @benlwalker explained elsewhere in
             | the thread.
        
           | throwawaygimp wrote:
           | Very interesting. I'm currently desiging and building a
           | system which has a separate MCU just for timing accurate
           | stuff rather than having the burdon of realtime kernel stuff,
           | but I never considered just dedicating a core. Then I could
           | also use that specifically to handle some IO queues too
           | perhaps, so it could do double duty and not necessarily be
           | wasteful. Thanks... now I need to go figure out why I either
           | didn't consider that - or perhaps I did and discarded it for
           | some reason beyond me right now. Hmm... thought provoking
           | post of the day for me
        
       | tyingq wrote:
       | I wonder if "huge pages" would make a difference, since some of
       | the bottlenecks seemed to be lock contention on memory pages.
        
         | tanelpoder wrote:
          | The Linux page cache doesn't use huge pages, but when doing
          | direct I/O into application buffers, it would definitely make
          | sense to use huge pages for those. I plan to run tests on
          | various database engines next - and many of them support using
          | huge pages (for shared memory areas, at least).
        
           | guerby wrote:
            | In the networking world (DPDK), huge pages and statically
            | pinning everything are a huge deal, as you have very few CPU
            | cycles per network packet.
        
             | tanelpoder wrote:
             | Yep - and there's SPDK for direct NVMe storage access
             | without going through the Linux block layer:
             | https://spdk.io
             | 
             | (it's in my TODO list too)
        
           | tyingq wrote:
           | Thanks! Apparently, they did add it for tmpfs, and discussed
           | it for ext4. https://lwn.net/Articles/718102/
        
             | tanelpoder wrote:
             | Good point - something to test, once I get to the
             | filesystem benchmarks!
        
       | tyingq wrote:
        | I'm somewhat curious what happens to the long-standing 4P/4U
        | servers from companies like Dell and HP. Ryzen/EPYC has really
        | made going past 2P/2U a much rarer need.
        
         | toast0 wrote:
         | At least when I was actively looking at hardware (2011-2018), 4
         | socket Xeon was available off the shelf, but at quite the
         | premium over 2 socket Xeon. If your load scaled horizontally,
         | it still made sense to get a 2P Xeon over 2x 1P Xeon, but 2x 2P
         | Xeon was way more cost efficient than a 4P Xeon. 8P or 16P
         | seemed to exist, but maybe only in catalogs.
         | 
         | I'm not really in the market anymore, but Epyc looks like 1P is
         | going to solve a lot of needs, and 2P will be available at a
         | reasonable premium, but 4P will probably be out of reach.
        
         | thinkingkong wrote:
         | You might be able to buy a smaller server but the rack density
          | doesn't necessarily change. You still have to worry about
          | cooling and power, so lots of DCs would have 1/4 or 1/2 racks.
        
           | tyingq wrote:
           | Sure. I wasn't really thinking of density, just the
           | interesting start of the "death" of 4 socket servers. Being
           | an old-timer, it's interesting to me because "typical
           | database server" has been synonymous with 4P/4U for a long,
           | long time.
        
             | vinay_ys wrote:
             | I haven't seen a 4 socket machine in a long time.
        
         | wtallis wrote:
         | I think at this point the only reasons to go beyond 2U are to
         | make room for either 3.5" hard drives, or GPUs.
        
           | rektide wrote:
            | Would love to see some very dense blade-style Ryzen
            | offerings. The 4 2P nodes in 2U format is great - a good
            | way to share power supplies, fans, chassis, and ideally a
            | multi-homed NIC too.
            | 
            | Turn those sleds into blades though, put them on their
            | side, and go even denser. Density alas is a huge upsell,
            | even though it should be a way to scale costs down.
        
         | tanelpoder wrote:
          | Indeed, 128 EPYC cores in 2 sockets (with 16 memory channels
          | total) will give a lot of power. It's worth mentioning that
          | the 64-core chips have a much lower clock rate than the
          | 16/32-core ones, though. And with some expensive software
          | that's licensed by CPU core (Oracle), you'd want faster cores,
          | but you'd possibly pay a higher NUMA price when going with a
          | single 4- or 8-socket machine for your "sacred monolith".
        
         | StillBored wrote:
          | There always seem to be buyers for more exotic high-end
          | hardware. That market has been shrinking and expanding since
          | the first computer, as mainstream machines become more capable
          | and people discover more uses for large coherent machines.
          | 
          | But users of 16-socket machines will just step down to
          | 4-socket EPYC machines with 512 cores (or whatever). And
          | someone else will realize that moving their "web scale"
          | cluster from 5k machines down to a single machine with 16
          | sockets results in lower latency and lower cost (or whatever).
        
       | maerF0x0 wrote:
       | > Shouldn't I be building a 50-node cluster in the cloud "for
       | scalability"? This is exactly the point of my experiment - do you
       | really want to have all the complexity of clusters or performance
       | implications of remote storage if you can run your I/O heavy
       | workload on just one server with local NVMe storage?
       | 
       | Anyone have a story to share about their company doing just this?
       | "Scale out" has basically been the only acceptable answer across
       | most of my career. Not to mention High Availability.
        
         | tanelpoder wrote:
          | You can get high availability without a "distributed system";
          | just an active/passive failover cluster may be enough for some
          | requirements. Even failover (sometimes seamless) on a VMware
          | cluster can help with planned maintenance scenarios without
          | downtime, etc.
          | 
          | Another way of achieving HA together with satisfying disaster
          | recovery requirements is replication (either app-level or
          | database log replication, etc). So, no distributed system is
          | necessary unless you have _legit_ scaling requirements.
         | 
          | If you work on ERP-like databases for traditional Fortune
          | 500-like companies, few people run such "sacred monolith"
          | applications on modern distributed NoSQL databases; it's all
          | Oracle, MSSQL or some Postgres nowadays. Data warehouses used
          | to be all Oracle and Teradata too - although these DBs support
          | some cluster scale-out, they're still "sacred monoliths" from
          | a different era (and they still do what they were designed for
          | very well). Now of course Snowflake, BigQuery, etc. are taking
          | over the DW/analytics world for new greenfield projects, but
          | existing systems usually stay as they are due to lock-in and
          | the extremely high cost of rewriting decades of existing
          | reports and apps.
        
       | anarazel wrote:
       | Have you checked if using the fio options (--iodepth_batch_*) to
       | batch submissions helps? Fio doesn't do that by default, and I
       | found that that can be a significant benefit.
       | 
        | In particular, submitting multiple requests at once can
        | amortize the cost of ringing the NVMe doorbell register (the
        | expensive part, as far as I understand it) across multiple
        | requests.
        
         | tanelpoder wrote:
         | I tested various fio options, but didn't notice this one - I'll
         | check it out! It might explain why I still kept seeing lots of
         | interrupts raised even though I had enabled the I/O completion
         | polling instead, with io_uring's --hipri option.
         | 
         | edit: I ran a quick test with various IO batch sizes and it
         | didn't make a difference - I guess because thanks to using
         | io_uring, my bottleneck is not in IO submission, but deeper in
         | the block IO stack...
        
           | wtallis wrote:
           | I think on recent kernels, using the hipri option doesn't get
           | you interrupt-free polled IO unless you've configured the
           | nvme driver to allocate some queues specifically for polled
           | IO. Since these Samsung drives support 128 queues and you're
           | only using a 16C/32T processor, you have more than enough for
           | each drive to have one poll queue and one regular IO queue
           | allocated to each (virtual) CPU core.
        
             | tanelpoder wrote:
             | That would explain it. Do you recommend any docs/links I
             | should read about allocating queues for polled IO?
        
               | anarazel wrote:
               | It's terribly documented :(. You need to set the
               | nvme.poll_queues to the number of queues you want, before
               | the disks are attached. I.e. either at boot, or you need
               | to set the parameter and then cause the NVMe to be
               | rescanned (you can do that in sysfs, but I can't
               | immediately recall the steps with high confidence).
        
               | anarazel wrote:
               | Ah, yes, shell history ftw. Of course you should ensure
               | no filesystem is mounted or such:
                | root@awork3:~# echo 4 > /sys/module/nvme/parameters/poll_queues
                | root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
                | root@awork3:~# dmesg -c
                | [749717.253101] nvme nvme1: 12/0/4 default/read/poll queues
                | root@awork3:~# echo 8 > /sys/module/nvme/parameters/poll_queues
                | root@awork3:~# dmesg -c
                | root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
                | root@awork3:~# dmesg -c
                | [749736.513102] nvme nvme1: 8/0/8 default/read/poll queues
        
               | tanelpoder wrote:
               | Thanks for the pointers, I'll bookmark this and will try
               | it out someday.
        
           | anarazel wrote:
           | > I tested various fio options, but didn't notice this one -
           | I'll check it out! It might explain why I still kept seeing
           | lots of interrupts raised even though I had enabled the I/O
           | completion polling instead, with io_uring's --hipri option.
           | 
           | I think that should be independent.
           | 
           | > edit: I ran a quick test with various IO batch sizes and it
           | didn't make a difference - I guess because thanks to using
           | io_uring, my bottleneck is not in IO submission, but deeper
           | in the block IO stack...
           | 
            | It probably won't get you drastically higher speeds in an
            | isolated test - but it should help reduce CPU overhead. E.g.
            | on one of my SSDs,
            | 
            |     fio --ioengine io_uring --rw randread --filesize 50GB \
            |         --invalidate=0 --name=test --direct=1 --bs=4k \
            |         --numjobs=1 --registerfiles --fixedbufs \
            |         --gtod_reduce=1 --iodepth 48
            | 
            | uses about 25% more CPU than when I add
            | 
            |     --iodepth_batch_submit=0 --iodepth_batch_complete_max=0
            | 
            | But the resulting iops are nearly the same as long as there
            | are enough cycles available.
           | 
           | This is via filesystem, so ymmv, but the mechanism should be
           | mostly independent.
        
       | tanelpoder wrote:
       | Author here: This article was intended to explain some modern
       | hardware bottlenecks (and non-bottlenecks), but unexpectedly
       | ended up covering a bunch of Linux kernel I/O stack issues as
       | well :-) AMA
        
         | jeffbee wrote:
         | Great article, I learned! Can you tell me if you looked into
         | aspects of the NVMe device itself, such as whether it supports
         | 4K logical blocks instead of 512B? Use `nvme id-ns` to read out
         | the supported logical block formats.
        
           | tanelpoder wrote:
           | Doesn't seem to support 4k out of the box? Some drives - like
           | Intel Optane SSDs allow changing this in firmware (and
           | reformatting) with a manufacturer's utility...
            | 
            | $ lsblk -t /dev/nvme0n1
            | NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
            | nvme0n1         0    512      0     512     512    0 none     1023 128    0B
            | 
            | $ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
            | LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512
            | bytes - Relative Performance: 0 Best (in use)
        
             | jeffbee wrote:
             | Thanks for checking. SSD review sites never mention this
             | important detail. For some reason the Samsung datacenter
             | SSDs support 4K LBA format, and they are very similar to
              | the retail SSDs, which don't seem to. I have a retail 970
              | Evo that only provides 512.
        
               | wtallis wrote:
               | I just checked my logs, and none of Samsung's consumer
               | NVMe drives have ever supported sector sizes other than
               | 512B. They seem to view this feature as part of their
               | product segmentation strategy.
               | 
               | Some consumer SSD vendors do enable 4kB LBA support. I've
               | seen it supported on consumer drives from WD, SK hynix
               | and a variety of brands using Phison or SMI SSD
               | controllers (including Kingston, Seagate, Corsair,
               | Sabrent). But I haven't systematically checked to see
               | which brands consistently support it.
        
               | floatboth wrote:
                | At least _early_ WD Black models don't really seem to
                | have 4K LBA support. The format option is listed, but it
                | refuses to actually run the command to reformat the
                | drive to the new "sector" size.
        
               | wtallis wrote:
               | Put your system to sleep and wake it back up. (I use
               | `rtcwake -m mem -s 10`). Power-cycling the drive like
               | this resets whatever security lock the motherboard
               | firmware enables on the drive during the boot process,
               | allowing the drive to accept admin commands like NVMe
               | format and ATA secure erase that would otherwise be
               | rejected. Works on both the WD Black SN700 and SN750
               | models, doesn't seem to be necessary on the very first
               | (Marvell-based) WD Black or the latest SN850.
        
               | floatboth wrote:
               | I'm pretty sure this is the very first one though --
               | WDS250G2X0C, firmware 101110WD.
        
               | wtallis wrote:
               | I think that's the second-gen WD Black, but the first one
               | that had their in-house SSD controller rather than a
               | third-party controller. The marketing and packaging
               | didn't prominently use a more specific model number to
               | distinguish it from the previous WD Black, but on the
               | drive's label it does say "PC SN700". Also, the first-gen
               | WD Black was 256GB and 512GB capacities, while the later
               | generations are 250/500/1000/2000GB. Firmware version
               | strings for the first-gen WD Black were stuff like
               | "B35200WD", while the SN700/720/730/750 family have
               | versions like "102000WD" and "111110WD". So I would
               | definitely expect your drive to require the sleep-wake
               | cycle before it'll let you reformat to 4k sectors.
        
               | jeffbee wrote:
               | You seem to have a lot of info on this topic. Do you run
               | a blog or some other way you disseminate this stuff?
        
               | 1996 wrote:
               | Is it genuine 512?
               | 
               | As in, what ashift value do you use with zfs?
        
               | wtallis wrote:
               | Regardless of what sector size you configure the SSD to
               | expose, the drive's flash translation layer still manages
               | logical to physical mappings at a 4kB granularity, the
               | underlying media page size is usually on the order of
               | 16kB, and the erase block size is several MB. So what
               | ashift value you want to use depends very much on what
               | kind of tradeoffs you're okay with in terms of different
               | aspects of performance and write endurance/write
               | amplification. But for most flash-based SSDs, there's no
               | reason to set ashift to anything less than 12
               | (corresponding to 4kB blocks).
        
               | 1996 wrote:
               | > for most flash-based SSDs, there's no reason to set
               | ashift to anything less than 12 (corresponding to 4kB
               | blocks).
               | 
               | matching the page size?
               | 
               | > the underlying media page size is usually on the order
               | of 16kB
               | 
                | I'd say that's a good reason to set ashift=14, as
                | 2^14 = 16 kB.
        
               | wtallis wrote:
                | There _are_ downsides to forcing the OS/FS to always use
                | larger block sizes for IO. You might simply be moving
                | some write amplification out of the SSD and into the
                | filesystem, while losing some performance in the process.
               | Which is why it really depends on your workload, and to
               | some extent on the specific SSD in question. I'm not
               | convinced that ashift=14 is a sensible one size fits all
               | recommendation, even if we're talking only about recent-
               | model consumer-grade NAND SSDs.
        
               | mgerdts wrote:
               | FWIW, WD SN850 has similar performance and supports 512
               | and 4k sectors.
        
             | guerby wrote:
              | Here is an article about the nvme-cli tool:
             | 
             | https://nvmexpress.org/open-source-nvme-management-
             | utility-n...
             | 
              | On a Samsung SSD 970 EVO 1TB it seems only 512-byte LBAs
              | are supported:
              | 
              | # nvme id-ns /dev/nvme0n1 -n 1 -H | grep "^LBA Format"
              | LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512
              | bytes - Relative Performance: 0 Best (in use)
        
         | rafaelturk wrote:
          | Thanks for the well-written article; it makes me think about
          | inefficiencies in our over-hyped cloud environment.
        
           | tanelpoder wrote:
           | Oh yes - and incorrectly configured on-premises systems too!
        
         | sitkack wrote:
         | Could you explain some of your thought processes and
         | methodologies when approaching problems like this?
         | 
          | What is your mental model like? How much experimentation do
          | you do versus reading kernel code? How do you know what
          | questions to start asking?
         | 
         | *edit, btw I understand that a response to these questions
         | could be an entire book, you get the question-space.
        
           | tanelpoder wrote:
           | Good question. I don't ever read kernel code as a starting
           | point, only if some profiling or tracing tool points me
           | towards an interesting function or codepath. And interesting
           | usually is something that takes most CPU in perf output or
           | some function call with an unusually high latency in ftrace,
           | bcc/bpftrace script output. Or just a stack trace in a core-
           | or crashdump.
           | 
           | As far as mindset goes - I try to apply the developer mindset
           | to system performance. In other words, I don't use much of
           | what I call the "old school sysadmin mindset", from a time
           | where better tooling was not available. I don't use
           | systemwide utilization or various get/hit ratios for doing
           | "metric voodoo" of Unix wizards.
           | 
           | The developer mindset dictates that everything you run is an
           | application. JVM is an application. Kernel is an application.
           | Postgres, Oracle are applications. All applications execute
           | one or more threads that run on CPU or do not run on CPU.
           | There are only two categories of reasons why a thread does
            | not run on CPU (is sleeping): the OS put the thread to sleep
            | (involuntary blocking) or the thread voluntarily wanted to
            | go to sleep (for example, it realized it can't get some
            | application-level lock).
           | 
           | And you drill down from there. Your OS/system is just a bunch
           | of threads running on CPU, sleeping and sometimes
           | communicating with each other. You can _directly_ measure all
           | of these things easily nowadays with profilers, no need for
           | metric voodoo.
           | 
           | I have written my own tools to complement things like perf,
           | ftrace and BPF stuff - as a consultant I regularly see 10+
           | year old Linux versions, etc - and I find sampling thread
           | states from /proc file system is a really good (and flexible)
           | starting point for system performance analysis and even some
           | drilldown - all this without having to install new software
           | or upgrading to latest kernels. Some of the tools I showed in
           | my article too:
           | 
           | https://tanelpoder.com/psnapper & https://0x.tools
           | 
           | At the end of my post I mentioned that I'll do a webinar
           | "hacking session" next Thursday, where I'll show more of how
           | I work :-)
        
         | vinay_ys wrote:
         | Very cool rig and benchmark. Kudos. Request: add network io
         | load to your benchmarking load while nvme io load is running.
        
           | tanelpoder wrote:
           | Thanks, will do in a future article! I could share the disks
           | out via NFS or iSCSI or something and hammer them from a
           | remote machine...
        
         | PragmaticPulp wrote:
         | This is a great article. Thanks for writing it up and sharing.
        
         | guerby wrote:
         | 71 GB/s is 568 Gbit/s so you'll need about 3 dual 100 Gbit/s
         | cards to pump data out at the rate you can read it from the
         | NVMe drives.
         | 
         | And ethernet (unless LAN jumbo frames) is about 1.5kByte per
         | frame (not 4kB).
         | 
         | One such PC should be able to do 100k simultaneous 5 Mbps HD
         | streams.
         | 
         | Testing this would be fun :)
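guerby's back-of-envelope numbers can be checked in a couple of lines (decimal units throughout; the stream count comes out a bit above 100k before protocol overhead):

```python
# Back-of-envelope check of the figures above (decimal units).
read_gbyte_s = 71                    # GB/s read from the NVMe drives
read_gbit_s = read_gbyte_s * 8
print(read_gbit_s)                   # 568 Gbit/s

dual_100g_nics = read_gbit_s / 200   # each dual-100G card carries 200 Gbit/s
print(dual_100g_nics)                # 2.84 -> about 3 cards

hd_stream_mbit = 5                   # one 5 Mbps HD stream
print(read_gbit_s * 1000 / hd_stream_mbit)   # 113600.0 -> ~100k streams
```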
        
           | zamadatix wrote:
           | Mellanox has a 2x200 Gbps NIC these days. Haven't gotten to
           | play with it yet though.
        
             | tanelpoder wrote:
             | Which NICs would you recommend for me to buy for testing at
             | least 1x100 Gbps (ideally 200 Gbps?) networking between
             | this machine (PCIe 4.0) and an Intel Xeon one that I have
             | with PCIe 3.0. Don't want to spend much money, so the cards
             | don't need to be too enterprisey, just fast.
             | 
             | And - do such cards even allow direct "cross" connection
             | without a switch in between?
        
               | namibj wrote:
               | If you care about price, check out (used, ofc) infiniband
               | cards.
               | 
               | They all seem to offer/suggest daisy-chain connectivity
               | at least for those with two ports per card as one
               | potential topology.
        
               | drewg123 wrote:
                | All 100G is enterprisey.
               | 
               | For a cheap solution, I'd get a pair of used Mellanox
               | ConnectX4 or Chelsio T6, and a QSFP28 direct attach
               | copper cable.
        
               | zamadatix wrote:
               | +1 on what the sibling comment said.
               | 
                | As for directly connecting them: absolutely, works
                | great. I'd recommend a cheap DAC off fs.com to connect
                | them in that case.
        
           | drewg123 wrote:
           | At Netflix, I'm playing with an EPYC 7502P with 16 NVME and
           | dual 2x100 Mellanox ConnectX6-DX NICs. With hardware kTLS
           | offload, we're able to serve about 350Gb/s of real customer
           | traffic. This goes down to about 240Gb/s when using software
           | kTLS, due to memory bandwidth limits.
           | 
           | This is all FreeBSD, and is the evolution of the work
           | described in my talk at the last EuroBSDCon in 2019:
           | https://papers.freebsd.org/2019/eurobsdcon/gallatin-
           | numa_opt...
        
             | ksec wrote:
             | >we're able to serve about 350Gb/s of real customer
             | traffic.
             | 
              | I still remember the post about breaking the 100Gbps
              | barrier, that was maybe in 2016 or '17? And it wasn't
              | that long ago that it was 200Gbps, and if I remember
              | correctly it was hitting the memory bandwidth barrier as
              | well.
             | 
             | And now 350Gbps?!
             | 
             | So what's next? Wait for DDR5? Or moving to some memory
             | controller black magic like POWER10?
        
               | drewg123 wrote:
               | Yes, before hardware inline kTLS offload, we were limited
                | to 200Gb/s or so with Naples. With Rome, it's a bit
                | higher. But hardware inline kTLS with the Mellanox CX6-DX
                | eliminates memory bandwidth as a bottleneck.
                | 
                | The current bottleneck is IO related, and it's unclear
                | what the issue is. We're working with the hardware
                | vendors to try to figure it out. We should be getting
                | about 390Gb/s.
        
               | ksec wrote:
                | Oh wow! Can't wait to hear more about it.
        
               | jiggawatts wrote:
               | > But hardware inline kTLS with the Mellanox CX6-DX
               | eliminates memory bandwidth as a bottleneck.
               | 
               | For a while now I had operated under the assumption that
               | CPU-based crypto with AES-GCM was faster than most
               | hardware offload cards. What makes the Mellanox NIC
               | perform better?
               | 
               | I.e.: Why does memory bandwidth matter to TLS? Aren't you
               | encrypting data "on the fly", while it is still resident
               | in the CPU caches?
               | 
               | > We're working with the hardware vendors to try to
               | figure it out. We should be getting about 390Gb/s
               | 
                | Something I explained to a colleague recently is that a
                | modern CPU gains or loses more compute power from a 1
                | degC temperature difference in the room's air than my
                | first four computers had combined.
               | 
                | You're basically complaining that you're missing a mere
                | 10% of the expected throughput. But put in absolute
               | terms, that's 40 Gbps, which is about 10x more than what
               | a typical server in 2020 can put out on the network.
               | (Just because you have 10 Gbps NICs doesn't mean you can
               | get 10 Gbps! Try iperf3 and you'll be shocked that you're
               | lucky if you can crack 5 Gbps in practice)
        
               | floatboth wrote:
               | > Try iperf3 and you'll be shocked that you're lucky if
               | you can crack 5 Gbps in practice
               | 
               | Easy line rate if you crank the MTU all the way to 9000
               | :D
               | 
               | > modern CPU gains or loses more computer power from a
               | 1deg C temperature difference in the room's air
               | 
               | If you're using the boost algorithm rather than a static
               | overclock, _and_ when that boost is thermally limited
                | rather than current limited. With a good cooler it's not
                | too hard to always have thermal headroom.
        
               | jiggawatts wrote:
               | > Easy line rate if you crank the MTU all the way to 9000
               | :D
               | 
                | In my experience jumbo frames provide _at best_ an
                | improvement of about 20%, and only in rare cases such
                | as ping-pong UDP protocols like TFTP or Citrix PVS
                | streaming.
        
               | magila wrote:
               | > I.e.: Why does memory bandwidth matter to TLS? Aren't
               | you encrypting data "on the fly", while it is still
               | resident in the CPU caches?
               | 
               | I assume NF's software pipeline is zero copy, so if TLS
               | is done in the NIC data only gets read from memory once
               | when it is DMA'd to the NIC. With software TLS you need
               | to read the data from memory (assuming it's not already
               | in cache, which given the size of data NF deals with is
               | unlikely), encrypt it, then write it back out to main
               | memory so it can be DMA'd to the NIC. I know Intel has
               | some fancy tech that can DMA directly to/from the CPU's
               | cache, but I don't think AMD has that capability (yet).
        
               | toast0 wrote:
               | > For a while now I had operated under the assumption
               | that CPU-based crypto with AES-GCM was faster than most
               | hardware offload cards. What makes the Mellanox NIC
               | perform better?
               | 
               | > I.e.: Why does memory bandwidth matter to TLS? Aren't
               | you encrypting data "on the fly", while it is still
               | resident in the CPU caches?
               | 
               | It may depend on what you're sending. Netflix's use case
               | is generally sending files. If you're doing software
               | encryption you would load the plain text file into memory
               | (via the filesystem/unified buffer cache), then write the
               | (session specific) encrypted text into separate memory,
                | then give that memory to the NIC to send out.
               | 
               | If the NIC can do the encryption, you would load the
               | plain text into memory, then tell the NIC to read from
               | that memory to encrypt and send out. That saves at least
               | a write pass, and probably a read pass. (256 MB of L3
               | cache on latest EPYC is a lot, but it's not enough to
               | expect cached reads from the filesystem to hit L3 that
               | often, IMHO)
               | 
                | If my guesstimate is right, a cold file would go from
               | hitting memory 4 times to hitting it twice. And a file in
               | disk cache would go from 3 times to once; the CPU doesn't
               | need to touch the memory if it's in the disk cache.
               | 
                | Note that this is a totally different case from
                | encrypting dynamic data that's necessarily touched by
                | the CPU.
               | 
               | > You're basically complaining that you're unable to get
               | a mere 10% of the expected throughput. But put in
               | absolute terms, that's 40 Gbps, which is about 10x more
               | than what a typical server in 2020 can put out on the
               | network. (Just because you have 10 Gbps NICs doesn't mean
               | you can get 10 Gbps! Try iperf3 and you'll be shocked
               | that you're lucky if you can crack 5 Gbps in practice)
               | 
               | I had no problem serving 10 Gbps of files on a dual Xeon
               | E5-2690 (v1; a 2012 CPU), although that CPU isn't great
               | at AES, so I think it only did 8 Gbps or so with TLS; the
               | next round of servers for that role had 2x 10G and 2690
               | v3 or v4 (2014 or 2016; but I can't remember when we got
               | them) and thanks to better AES instructions, they were
               | able to do 20 G (and a lot more handshakes/sec too). If
               | your 2020 servers aren't as good as my circa 2012 servers
               | were, you might need to work on your stack. OTOH, bulk
               | file serving for many clients can be different than a
               | single connection iperf.
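toast0's pass-counting above can be turned into rough memory-bandwidth figures. A quick sketch, using the pass counts estimated in the comment and the 350Gb/s serving rate mentioned upthread; the closing comparison to an 8-channel DDR4 socket is my own illustrative assumption, not a figure from the thread:

```python
# Rough memory traffic implied by the pass counts estimated above.
target_gbit = 350                 # serving rate, Gb/s
payload_gbyte = target_gbit / 8   # 43.75 GB/s of payload

passes = {
    "software kTLS, cold file": 4,  # DMA in, read plain, write cipher, DMA out
    "software kTLS, cached":    3,
    "NIC kTLS, cold file":      2,  # DMA in, DMA out
    "NIC kTLS, cached":         1,  # DMA out only
}
for case, n in passes.items():
    print(f"{case}: ~{n * payload_gbyte:.0f} GB/s of memory traffic")
# Software kTLS on cold files implies ~175 GB/s of DRAM traffic at
# 350 Gb/s -- uncomfortably close to what an 8-channel DDR4 socket
# can sustain in practice, which is consistent with the reported
# memory-bandwidth bottleneck.
```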
        
               | drewg123 wrote:
                | > If my guesstimate is right, a cold file would go from
               | hitting memory 4 times to hitting it twice. And a file in
               | disk cache would go from 3 times to once; the CPU doesn't
               | need to touch the memory if it's in the disk cache.
               | 
               | You're spot on. I have a slide that I like to show NIC
               | vendors when they question why TLS offload is important.
                | See pages 21 and 22 of:
                | https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf
        
           | tanelpoder wrote:
           | I should (finally) receive my RTX 3090 card today (PCIe 4.0
           | too!), I guess here goes my weekend (and the following
           | weekends over a couple of years)!
        
         | tarasglek wrote:
         | You should look at cpu usage. There is a good chance all your
         | interrupts are hitting cpu-0. You can run hwloc to see what
         | chiplet the pci cards are on and handle interrupts on those
         | cores.
        
           | jeffbee wrote:
           | Why would that happen with the linux nvme stack that puts a
           | completion queue on each CPU?
        
             | wtallis wrote:
             | I think that in addition to allocating a queue per CPU, you
             | need to be able to allocate a MSI(-X) vector per CPU. That
             | shouldn't be a problem for the Samsung 980 PRO, since it
             | supports 128 queues and 130 interrupt vectors.
        
           | tanelpoder wrote:
           | Thanks for the "hwloc" tip. I hadn't thought about that.
           | 
           | I was thinking of doing something like that. Weirdly I got
           | sustained throughput differences when I killed & restarted
           | fio. So, if I got 11M IOPS, it stayed at that level until I
           | killed fio & restarted. If I got 10.8M next, it stayed like
           | it until I killed & restarted it.
           | 
           | This makes me think that I'm hitting some PCIe/memory
           | bottleneck, dependent on process placement (which process
           | happens to need to move data across infinity fabric due to
           | accessing data through a "remote" PCIe root complex or
           | something like that). But then I realized that Zen 2 has a
           | central IO hub again, so there shouldn't be a "far edge of
           | I/O" like on current gen Intel CPUs (?)
           | 
           | But there's definitely some workload placement and
           | I/O-memory-interrupt affinity that I've wanted to look into.
           | I could even enable the NUMA-like-mode from BIOS, but again
           | with Zen 2, the memory access goes through the central
           | infinity-fabric chip too, I understand, so not sure if
           | there's any value in trying to achieve memory locality for
           | individual chiplets on this platform (?)
        
             | wtallis wrote:
             | The PCIe is all on a single IO die, but internally it is
             | organized into quadrants that can produce some NUMA
             | effects. So it is probably worth trying out the motherboard
             | firmware settings to expose your CPU as multiple NUMA
             | nodes, and using the FIO options to allocate memory only on
             | the local node, and restricting execution to the right
             | cores.
        
               | tanelpoder wrote:
               | Yep, I enabled the "numa-like-awareness" in BIOS and ran
               | a few quick tests to see whether the NUMA-aware
               | scheduler/NUMA balancing would do the right thing and
               | migrate processes closer to their memory over time, but
               | didn't notice any benefit. But yep I haven't manually
               | locked down the execution and memory placement yet. This
               | placement may well explain why I saw some ~5% throughput
               | fluctuations _only if killing & restarting fio_ and not
               | while the same test was running.
        
               | syoc wrote:
                | I have done some tests on AMD servers and the Linux
                | scheduler does a pretty good job. I do however get
                | noticeably (a couple percent) better performance by
                | forcing the process to run on the correct NUMA node.
               | 
               | Make sure you get as many numa domains as possible in
               | your BIOS settings.
               | 
                | I recommend using numactl with the cpu-exclusive and
                | mem-exclusive flags. I have noticed a slight
                | performance drop when the RAM cache fills beyond the
                | sticks local to the CPUs doing work.
               | 
                | One last comment is that you mentioned interrupts being
                | "striped" among CPUs. I would recommend pinning the
                | interrupts from one disk to one NUMA-local CPU and using
               | numactl to run fio for that disk on the same CPU. An
               | additional experiment is to, if you have enough cores,
               | pin interrupts to CPUs local to disk, but use other cores
               | on the same numa node for fio. That has been my most
               | successful setup so far.
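The IRQ-pinning advice above can be sketched as follows. This is a hypothetical helper, not from the article: it parses /proc/interrupts for a device's queue interrupt vectors (e.g. lines mentioning "nvme0q") and writes a CPU number into each IRQ's smp_affinity_list. Actually applying it requires root, and the device pattern and CPU choice below are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of pinning one NVMe drive's queue interrupts
to a NUMA-local CPU, per the advice above (applying it needs root)."""

def parse_irqs(interrupts_text, pattern):
    """Return IRQ numbers whose /proc/interrupts line mentions
    `pattern` (e.g. 'nvme0q' matches nvme0's I/O queue vectors)."""
    irqs = []
    for line in interrupts_text.splitlines():
        fields = line.split()
        # numeric first column = a real IRQ line (skips NMI:, LOC:, ...)
        if fields and fields[0].rstrip(":").isdigit() and pattern in line:
            irqs.append(int(fields[0].rstrip(":")))
    return irqs

def pin_irqs(irqs, cpu):
    """Point each IRQ at a single CPU via its smp_affinity_list."""
    for irq in irqs:
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(cpu))

if __name__ == "__main__":
    # Demo on a synthetic /proc/interrupts excerpt; on a real system,
    # read /proc/interrupts and call pin_irqs() as root.
    sample = (" 45:  0  0  IR-PCI-MSI nvme0q1\n"
              " 46:  0  0  IR-PCI-MSI nvme1q1\n")
    print(parse_irqs(sample, "nvme0q"))
```

The matching fio side would then be launched for that disk with numactl (e.g. restricting it to CPUs and memory on the same NUMA node).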
        
             | tarasglek wrote:
             | So there are 2 parts to cpu affinity. a) cpu assigned to
             | ssd for handling interrupts and b) cpu assigned to fio.
              | numactl is your friend for experimenting with changing
              | fio affinity.
             | 
             | https://access.redhat.com/documentation/en-
             | us/red_hat_enterp... tells you how to tweak irq handlers.
             | 
              | You usually want to change both. Pinning each fio process
              | and each interrupt handler to specific CPUs will reach
              | the highest perf.
             | 
              | You can even use the isolcpus kernel parameter to reduce
              | jitter from things you don't care about, to minimize
              | latency (it won't do much for bandwidth).
        
             | mgerdts wrote:
             | I have the same box, but with the 32 core CPU and fewer
             | NVMe drives. I've not poked at all the PCIe slots yet, but
             | all that I've looked at are in NUMA node 1. This includes
             | the on board M.2 slots. It is in NPS=4 mode.
        
               | tanelpoder wrote:
               | Mine goes only up to 2 NUMA nodes (as shown in numactl
               | --hardware), despite setting NPS4 in BIOS. I guess it's
               | because I have only 2 x 8-core chiplets enabled (?)
        
         | perryizgr8 wrote:
         | It would be interesting to know what you intend to use this rig
         | for, if that is not some secret :)
        
           | tanelpoder wrote:
           | Valid question!
           | 
           | 1) Learning & researching capabilities of modern HW
           | 
           | 2) Running RDBMS stress tests (until breaking point), Oracle,
           | Postgres+TimescaleDB, MySQL, probably ScyllaDB soon too
           | 
           | 3) Why? As a performance troubleshooter consultant+trainer, I
           | regularly have to reproduce complex problems that show up
           | only under high concurrency & load - stuff that you can't
           | just reproduce in a VM in a laptop.
           | 
           | 4) Fun - seeing if the "next gen" hardware's promised
           | performance is actually possible!
           | 
           | FYI I have some videos from my past complex problem
           | troubleshooting adventures, mostly Oracle stuff so far and
           | some Linux performance troubleshooting:
           | 
           | https://tanelpoder.com/videos/
        
         | nicioan wrote:
         | Excellent article, thank you! I really like the analysis and
         | profiling part of the evaluation. I also have some experience
         | in I/O performance in linux -- we measured 30GiB/s in a pcie
         | Gen3 box (shameless plug[0]).
         | 
         | I have one question / comment: did you use multiple jobs for
         | the BW (large IO) experiments? If yes, then did you set
         | randrepeat to 0? I'm asking this because fio by default uses
         | the same sequence of offsets for each job, in which case there
         | might be data re-used across jobs. I had verified that with
         | blktrace a few years back, but it might have changed recently.
         | 
         | [0]https://www.usenix.org/conference/fast19/presentation/kourti
         | ...
         | 
         | edit: fixed typo
        
           | tanelpoder wrote:
            | Looks interesting! I wonder whether there'd be interesting
            | new database applications on NVMe when doing I/Os as small
            | as 512 bytes (with a more efficient "IO engine" than Linux
            | bio, which has too high a CPU overhead with such small
            | requests).
           | 
            | I mean, currently OLTP RDBMS engines tend to use 4k, 8k
            | (and sometimes 16k) block sizes. When doing completely
            | random I/O (say, traversing an index on customer_id that
            | needs to read occasional customer orders across years of
            | history), you may end up reading 1000 x 8 kB blocks just
            | to read 1000 x 100B order records "randomly" scattered
            | across the table from inserts done over the years.
           | 
           | Optane persistent memory can do small, cache line sized I/O I
            | understand, but that's a different topic. Being able to do
            | random 512B I/O on "commodity" NVMe SSDs efficiently would
            | open some interesting opportunities for retrieving records
            | that are scattered "randomly" across the disks.
           | 
           | edit: to answer your question, I used 10 separate fio
           | commands with numjobs=3 or 4 for each and randrepeat was set
           | to default.
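For reference, a sketch of what one of those per-device fio jobs might look like with nicioan's suggestion applied: randrepeat=0 gives each job its own random offset sequence, avoiding cross-job data reuse. This is a hypothetical job file under assumed parameters, not the one used in the article:

```ini
; hypothetical per-device fio job, one of several run in parallel
[global]
ioengine=io_uring
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=4
randrepeat=0      ; different random sequence per job

[nvme0]
filename=/dev/nvme0n1
```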
        
         | ksec wrote:
         | I just love this article. Especially when the norm is always
         | about scaling out instead of scaling up. We can have 128 Core
         | CPU, 2TB Memory, PCI-E 4.0 SSD, ( and soon PCI-E 5.0 ). We
         | could even fit a _Petabyte_ in 1U for SSD Storage.
         | 
          | I remember WhatsApp used to operate its _500M_ users with
          | only a dozen large FreeBSD boxes (only to be taken apart by
          | Facebook).
         | 
         | So Thank you for raising awareness. Hopefully the pendulum is
         | swinging back to conceptually simple design.
         | 
         | >I also have a 380 GB Intel Optane 905P SSD for low latency
         | writes
         | 
         | I would love to see that. Although I am waiting for someone to
         | do a review on the Optane SSD P5800X [1]. Random 4K IOPS up to
         | 1.5M with lower than 6 _us_ Latency.
         | 
         | [1] https://www.servethehome.com/new-intel-
         | optane-p5800x-100-dwp...
        
           | texasbigdata wrote:
           | Second on Optane.
        
           | phkahler wrote:
            | >> I remember WhatsApp used to operate its 500M users with
            | only a dozen large FreeBSD boxes.
           | 
           | With 1TB of RAM you can have 256 bytes for every person on
           | earth live in memory. With SSD either as virtual memory or
           | keeping an index in RAM, you can do meaningful work in real
           | time, probably as fast as the network will allow.
        
             | swader999 wrote:
              | Faster than they can type!
        
             | zie wrote:
             | My math doesn't compute with yours:
             | 
              | Depending on how you define a TB (memory tends to favour
              | the latter definition, but YMMV):
             | 
             | 1,000,000,000,000 / 7.8billion = 128.21 bytes per human.
             | 
             | 1,099,511,627,776 / 7.8billion = 140.96 bytes per human.
             | 
             | population source via Wikipedia.
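The arithmetic, for the record (population figure as in the comment):

```python
population = 7.8e9                 # world population, per the comment
print(1e12 / population)           # ~128.2 bytes/person (decimal TB)
print(2**40 / population)          # ~141.0 bytes/person (binary TiB)
# 256 bytes per person would need roughly 2 TB:
print(256 * population / 2**40)    # ~1.82 TiB
```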
        
           | rektide wrote:
           | Intel killing off prosumer optane 2 weeks ago[1] made me so
           | so so sad.
           | 
           | The new P5800X should be sick.
           | 
           | [1] https://news.ycombinator.com/item?id=25805779
        
         | maerF0x0 wrote:
         | When I first moved to the bay area, the company that hired me
         | asked me what kind of computer I wanted and gave me a budget
         | (like $3000 or something)... I spent a few days crafting a
         | parts list so I could build an awesome workstation. Once I sent
         | it over they were like "Uh, we just meant which macbook do you
          | want?" and kind of gave me some shade about it. They joked,
          | "so how are you going to do meetings or on call?"
         | 
         | I rolled with it, but really wondered if they knew I could get
         | 2x the hardware and have a computer at home and at work for
          | less money than the MBP ... Most of the people didn't seem to
          | understand that laptop CPUs are not the same as
          | desktop/workstation ones, especially when they hit thermal
          | throttling.
        
           | noir_lord wrote:
            | At my last-but-one job the boss offered me an iMac Pro; I
            | asked if I could just have the equivalent money for
            | hardware and he said sure.
           | 
           | Which is how I ended up with an absolute _monster_ of a work
           | machine, these days I WFH and while work issued me a Macbook
           | Pro it sits on the shelf behind me.
           | 
            | Fedora on a (still fast) Ryzen/2080 and 2x4K 27" screens vs
            | a Macbook Pro is a hilarious no-brainer for me.
            | 
            | Upgrading soon but can't decide whether I _need_ the 5950X
            | or merely _want_ it - realistically, except for gaming I'm
            | nowhere near tapping out this machine (and it's still
            | awesome for that and VR, which is why the step-son is about
            | to get a, in his words, "sick" PC).
        
           | walrus01 wrote:
           | I mean it would have been a totally valid answer to say that
           | you intended to use a $600 laptop as effectively a thin
           | client, and spend $2400 on a powerful workstation PC to drive
           | remotely.
        
         | antongribok wrote:
         | Great article!
         | 
         | Any chance you could post somewhere the output of:
         | lstopo --of ascii
         | 
         | Or similar?
        
           | tanelpoder wrote:
           | I can do it tomorrow, please drop me an email (email listed
           | in my blog)
        
         | KaiserPro wrote:
         | Excellent write up.
         | 
         | I used to work for a VFX company in 2008. At that point we used
         | lustre to get high throughput file storage.
         | 
         | From memory we had something like 20 racks of server/disks to
         | get a 3-6 gigabyte/s (sustained) throughput on a 300tb
         | filesystem.
         | 
         | It is hilarious to think that a 2u box can now theoretically
         | saturate 2x100gig nics.
        
       | drmadera wrote:
       | Great article. Did you consider doing Optane tests? I built a
       | 3990x WS with all Optanes and I get blazing fast access times,
       | but 3 GB/s top speeds. It might be interesting to look at them
       | for these tests, especially in time-sensitive scenarios.
        
         | tanelpoder wrote:
         | I have 2 Optane 905P M.2 cards and I intend to run some
         | database engine tests, putting their transaction logs (and
         | possibly temporary spill areas for sorts, hashes) on Optane.
         | 
         | When I think about Optane, I think about optimizing for low
         | latency where it's needed and not that much about bandwidth of
         | large ops.
        
       | jacquesm wrote:
       | Lovely article, zero fluff, tons of good content and modest to
       | boot. Thank you for this write-up, I'll pass it around to some
       | people who feel that the need for competent system administration
       | skills has passed.
        
       | qaq wrote:
       | Would be cool to see pgbench score for this setup
        
       | namero999 wrote:
       | You should be farming Chia on that thing [0]
       | 
       | Amazing, congrats!
       | 
       | [0] https://github.com/Chia-Network/chia-blockchain/wiki/FAQ
        
       | jayonsoftware1 wrote:
       | https://www.asus.com/us/Motherboard-Accessories/HYPER-M-2-X1...
       | vs https://highpoint-tech.com/USA_new/nvme_raid_controllers.htm .
       | One card is about 10x more expensive, but it looks like the
       | performance is the same. Am I missing something?
        
         | tanelpoder wrote:
         | The ASUS one doesn't have its own RAID controller nor PCIe
         | switch onboard. It relies on the motherboard-provided PCIe
         | bifurcation and if using hardware RAID, it'd use AMD's built-in
         | RAID solution (but I'll use software RAID via Linux dm/md). The
         | HighPoint SSD7500 seems to have a proprietary RAID controller
         | built in to it and some management/monitoring features too
         | (it's the "somewhat enterprisey" version)
        
           | wtallis wrote:
           | The HighPoint card doesn't have a hardware RAID controller,
           | just a PCIe switch and an option ROM providing boot support
           | for their software RAID.
           | 
           | PCIe switch chips were affordable in the PCIe 2.0 era when
           | multi-GPU gaming setups were popular, but Broadcom decided to
           | price them out of the consumer market for PCIe 3 and later.
        
             | tanelpoder wrote:
              | Ok, thanks, good to know. I misunderstood that from
              | their website.
        
             | rektide wrote:
             | pcie switches getting expensive is so the suck.
        
       | MrFoof wrote:
       | U.2 form factor drives (also NVMe protocol) can achieve higher
       | IOPS (particularly writes) than M.2 form factor drives
       | (especially M.2 2280), with higher durability, but you'll need
       | your own controllers, which are sparse on the market for the
       | moment. Throughput (MB/sec, not IOPS) will be about the same,
       | but the U.2 drives can do it for longer.
       | 
       | U.2 means more NAND to parallelize over, more spare area (and
       | higher overall durability), potentially larger DRAM caches, and a
       | far larger area to dissipate heat. Plus it has all the fancy
       | bleeding-edge features you aren't going to see on consumer-grade
       | drives.
       | 
       | -- -----
       | 
       | The big issue with U.2 for "end user" applications like
       | workstations is you can't get drivers from Samsung for things
       | like the PM1733 or PM9A3 (which blow the doors off the 980 Pro,
       | especially for writes and $/GB, plus other neat features like
       | Fail-In-Place) unless you're an SI, in which case you also co-
       | developed the firmware. The same goes for SanDisk, KIOXIA and
       | other makers of enterprise SSDs.
       | 
       | The kicker is that enterprise U.2 drives are about the same
       | $/GB as SATA drives but, being NVMe PCIe 4.0 x4, blow the doors
       | off about everything. There's also the EDSFF, NF1 and now E1.L
       | factors, but U.2 is very prevalent. Enterprise SSDs are
       | attractive as that's where the huge volume is (hence the low
       | $/GB), but end-user support is really limited. You can use
       | "generic drivers", but you won't see anywhere near the peak
       | performance of the drives.
       | 
       | The good news is both Micron and Intel have great support for
       | end-users, where you can get optimized drivers and updated
       | firmware. Intel has the D7-P5510 probably hitting VARs and some
       | retail sellers (maybe NewEgg) within about 60 days. Similar
       | throughput to the Samsung drives, far more write IOPS (especially
       | sustained), lower latencies, FAR more durability (with a big
       | warranty), far more capacity, and not too bad a price (looking
       | like ~$800USD for 3.84TB with ~7.2PB of warrantied writes over 5
       | years).
       | 
       | -- -----
       | 
       | My plan once Genesis Peak (Threadripper 5XXX) hits is four
       | 3.84TB Intel D7-P5510s in RAID10, connected to a HighPoint
       | SSD7580 PCIe 4.0 x16 controller. Figure ~$4,000 for a storage
       | setup of ~7.3TB usable space after formatting, ~26GB/sec peak
       | reads, ~8GB/sec peak writes, with 2.8M 4K read IOPS, 700K 4K
       | write IOPS, and ~14.3PB of warrantied write durability.
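       | (The RAID10 arithmetic above can be sketched roughly as
       | follows; the per-drive sequential figures in the sketch are
       | illustrative assumptions, not official vendor specs.)

```python
# Rough RAID10 sizing math for 4x Intel D7-P5510 3.84TB.
# Per-drive throughput figures are assumed for illustration only.
drives = 4
capacity_tb = 3.84
read_gbps = 6.5    # assumed per-drive sequential read, GB/s
write_gbps = 4.0   # assumed per-drive sequential write, GB/s

usable_tb = drives * capacity_tb / 2     # mirroring halves raw capacity
peak_read = drives * read_gbps           # reads can be striped across all drives
peak_write = (drives / 2) * write_gbps   # each write must land on a mirror pair

print(f"usable: {usable_tb:.2f} TB, "
      f"reads: {peak_read:.0f} GB/s, writes: {peak_write:.0f} GB/s")
```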
        
         | floatboth wrote:
         | How would a model-specific driver for something that speaks
         | NVMe even work? Is it for Linux? Is it open? Is it just
         | modifications to the stock Linux NVMe driver that take some
         | drive specifics into account? Or is it some stupid proprietary
         | NVMe stack?
        
       | tutfbhuf wrote:
       | This article focuses on IOPS and throughput, but what is also
       | important for many applications is I/O latency, which can be
       | measured with ioping (apt-get install ioping). Unfortunately,
       | even 10x PCIe 4.0 NVMe drives do not provide any better latency
       | than a
       | single NVMe drive. If you are constrained by disk latency then
       | 11M IOPS won't gain you much.
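       | For a rough DIY version of what ioping measures, you can time
       | small synchronous writes yourself; a minimal Python sketch
       | (file path, request size, and sample count are arbitrary
       | choices for illustration):

```python
import os, time, tempfile, statistics

def io_latency_us(path=".", count=10, size=4096):
    """Median latency (microseconds) of small synchronous writes, ioping-style."""
    fd, name = tempfile.mkstemp(dir=path)
    buf = os.urandom(size)
    samples = []
    try:
        for _ in range(count):
            t0 = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # force the write down through the page cache
            samples.append((time.perf_counter() - t0) * 1e6)
    finally:
        os.close(fd)
        os.unlink(name)
    return statistics.median(samples)

print(f"median sync-write latency: {io_latency_us():.1f} us")
```

       | Note this goes through the filesystem and page cache, so it is
       | only a rough proxy; ioping can also issue direct I/O for a
       | cleaner device-latency number.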
        
         | cheeze wrote:
         | Does this come up in practice? What kind of use cases suffer
         | from disk latency?
         | 
         | This stuff is all fascinating to me. I have a zfs NAS but I
         | feel like I've barely scratched the surface of SSDs
        
           | tutfbhuf wrote:
           | > Does this come up in practice? What kind of use cases
           | suffer from disk latency?
           | 
           | One popular example is HFT.
           | 
            | And from my experience on a desktop PC, it is better to
            | disable swap and let the OOM killer do its work, instead
            | of swapping to disk, which makes my system noticeably
            | laggy, even with a fast NVMe drive.
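            | Before disabling swap it helps to know whether it is
            | actually being used; a small Linux-only Python sketch that
            | reads /proc/meminfo:

```python
# Report swap usage from /proc/meminfo (Linux-only).
def swap_status():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields[key] = int(rest.split()[0])  # values are in kB
    total, free = fields["SwapTotal"], fields["SwapFree"]
    return total, total - free

total_kb, used_kb = swap_status()
print(f"swap: {used_kb} kB used of {total_kb} kB")
```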
        
             | sitkack wrote:
             | Anything with transaction SLOs in the microsecond or
             | millisecond range. Adtech, fintech, fraud detection, call
             | records, shopping carts.
             | 
             | Two big players in this space are Aerospike and ScyllaDB.
        
       | qaq wrote:
       | Now price this in terms of AWS and marvel at the markup
        
         | speedgoose wrote:
         | I'm afraid Jeff Bezos himself couldn't afford such IOs on AWS.
        
       | nwmcsween wrote:
       | So Linus was wrong in his rant to Dave about the page cache
       | being detrimental on fast devices
        
       | svacko wrote:
       | I wonder, is increasing temperature of the M.2 NVMe disks
       | affecting the measured performance? Or is P620 cooling system
       | efficient enough to keep temp of the number of disks low?
       | 
       | Anyway, thanks for the inspirative post!
        
         | tanelpoder wrote:
          | Both quad-SSD adapters had a fan on them and the built-in
          | M.2 ones had a heatsink, right in front of one large chassis
          | fan & air intake. I didn't measure the SSD temperatures, but
          | the I/O rate didn't drop over time. I was bottlenecked by
          | CPU when doing small I/O tests; I monitored the current MHz
          | from /proc/cpuinfo to make sure that the CPU speeds didn't
          | drop below their nominal 3.9 GHz (and they didn't).
         | 
         | Btw, even the DIMMs have dedicated fans and enclosure (one per
         | 4 DIMMs) on the P620.
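          | A minimal Python version of that /proc/cpuinfo check
          | (Linux-only; assumes the x86 "cpu MHz" field is present):

```python
# Sample current per-core clock speeds from /proc/cpuinfo (Linux-only).
def cpu_mhz():
    speeds = []
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("cpu MHz"):
                speeds.append(float(line.split(":")[1]))
    return speeds

speeds = cpu_mhz()
if speeds:
    print(f"{len(speeds)} cores, min {min(speeds):.0f} MHz, max {max(speeds):.0f} MHz")
else:
    print("no 'cpu MHz' field on this platform")
```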
        
       | ogrisel wrote:
       | As a nitpicking person, I really like to read a post that does
       | not confuse GB/s for GiB/s :)
       | 
       | https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
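       | The difference is about 7.4% at these scales, which is worth
       | keeping straight; a quick sanity check in Python:

```python
GB = 10**9    # decimal gigabyte (SI)
GiB = 2**30   # binary gibibyte (IEC)

throughput = 66 * GB  # the 66 GB/s from the title
print(f"66 GB/s = {throughput / GiB:.2f} GiB/s")  # -> 61.47 GiB/s
```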
        
         | ogrisel wrote:
         | Actually now I realize that the title and the intro paragraph
         | are contradicting each other...
        
           | tanelpoder wrote:
           | Yeah, I used the formally incorrect GB in the title when I
           | tried to make it look as simple as possible... GiB just
           | didn't look as nice in the "marketing copy" :-)
           | 
           | I may have missed using the right unit in some other sections
           | too. At least I hope that I've conveyed that there's a
           | difference!
        
       ___________________________________________________________________
       (page generated 2021-01-30 23:02 UTC)