[HN Gopher] Extreme HTTP Performance Tuning
___________________________________________________________________
Extreme HTTP Performance Tuning
Author : talawahtech
Score : 880 points
Date : 2021-05-20 20:01 UTC (1 day ago)
(HTM) web link (talawah.io)
(TXT) w3m dump (talawah.io)
| HugoDaniel wrote:
| "Many of these specific optimizations won't really benefit you
| unless you are already serving more than 50k req/s to begin
| with."
| jeffbee wrote:
| Very nice round-up of techniques. I'd throw out a few that might
| or might not be worth trying: 1) I always disable C-states deeper
| than C1E. Waking from C6 takes upwards of 100 microseconds, way
| too much for a latency-sensitive service, and it doesn't save
| _you_ any money when you are running on EC2; 2) Try receive flow
| steering for a possible boost above and beyond what you get from
| RSS.
|
| Would also be interesting to discuss the impacts of turning off
| the xmit queue discipline. fq is designed to reduce frame drops
| at the switch level. Transmitting as fast as possible can cause
| frame drops which will totally erase all your other tuning work.
| talawahtech wrote:
| Thanks!
|
| > I always disable C-states deeper than C1E
|
| AWS doesn't let you mess with c-states for instances smaller
| than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
| just for kicks, but it didn't make a difference. Once this test
| starts, all CPUs are 99+% Busy for the duration of the test. I
| think it would factor in more if there were lots of CPUs, and
| some were idle during the test.
|
| > Try receive flow steering for a possible boost
|
| I think the stuff I do in the "perfect locality" section[2]
| (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
| flow steering would be trying to do, but more efficiently.
|
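| For reference, the gist of that option is a two-instruction
| classic BPF program that returns the current CPU id, which the
| kernel then uses as the index of the listener (in the reuseport
| group) that gets the connection. Roughly (lfd being one of the
| per-CPU listeners):
|
|       #include <linux/filter.h>
|       #include <sys/socket.h>
|
|       struct sock_filter code[] = {
|           /* A = current CPU id */
|           { BPF_LD | BPF_W | BPF_ABS, 0, 0,
|             SKF_AD_OFF + SKF_AD_CPU },
|           /* return A as the socket index */
|           { BPF_RET | BPF_A, 0, 0, 0 },
|       };
|       struct sock_fprog prog = { 2, code };
|
|       /* lfd: one of N SO_REUSEPORT listeners, one per CPU */
|       setsockopt(lfd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
|                  &prog, sizeof(prog));
|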
| > Would also be interesting to discuss the impacts of turning
| off the xmit queue discipline
|
| Yea, noqueue would definitely be a no-go on a constrained
| network, but when running the (t)wrk benchmark in the cluster
| placement group I didn't see any evidence of packet drops or
| retransmits. Drops only happened with the iperf test.
|
| 1.
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
|
| 2. https://talawah.io/blog/extreme-http-performance-tuning-
| one-...
| duskwuff wrote:
| Does C-state tuning even do anything on EC2? My intuition says
| it probably doesn't pass through to the underlying hardware --
| once the VM exits, it's up to the host OS what power state the
| CPU goes into.
| jeffbee wrote:
| It definitely works and you can measure the effect. There's
| official documentation on what it does and how to tune it:
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo.
| ..
| duskwuff wrote:
| Okay, so it looks as though it only applies to certain
| large instance types -- presumably ones which are large
| enough that it makes sense for the host to statically
| allocate CPU cores (or even sockets) to a guest.
| Interesting.
| xtacy wrote:
| I suspect that the web server's CPU usage will be pretty high
| (almost 100%), so C-state tuning may not matter as much?
|
| EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
| so it might not be as effective. For a uniform request workload
| like the one in the article, statically binding flows to a NIC
| queue should be sufficient. :)
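|
| For completeness, RFS is just two knobs (eth0/queue 0 and the
| table sizes here are placeholders):
|
|       sysctl -w net.core.rps_sock_flow_entries=32768
|       echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt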
| fierro wrote:
| How can you be sure the estimated max server capability is not
| actually just a limitation in the _client_, i.e., the client
| maxes out at _sending_ 224k requests / second.
|
| I see that this is clearly not the case here, but in general how
| can one be sure?
| mh- wrote:
| You parallelize the load from multiple clients (running on
| separate hardware). There are some open source projects that
| facilitate this sort of workload (and the subsequent
| aggregation of results/stats.)
| trashcan wrote:
| https://locust.io/ is a good example
| 0xEFF wrote:
| Use N clients. Increase N until you're sure.
| alinspired wrote:
| What was the MTU in the test, and how does increasing it affect
| the results?
|
| Reminds me of how complicated it was to generate 40Gbit/sec of
| http traffic (with default MTU) to test F5 BIG-IP appliances;
| luckily TCL iRules had `HTTP::retry`
| talawahtech wrote:
| The MTU is 9001 within the VPC, but the packets are less than
| 250 bytes so the MTU doesn't really come into play.
|
| This test is more about packets/s than bytes/s.
| volta83 wrote:
| I'm missing one thing from the article, that is commonly missing
| from performance-related articles.
|
| When you talk about playing whack-a-mole with the optimizations,
| this is what you are missing:
|
| > What's the best the hardware can do?
|
| You don't say in the article. The article only says that you
| start at 250k req/s, and end at 1.2M req/s.
|
| Is that good? Is your optimization work done? Can you open a beer
| and celebrate?
|
| The article doesn't say.
|
| If the best the hardware can technically do is 1.3M req/s, then
| you probably can call it a day.
|
| But if the best the hardware can do is technically 100M req/s,
| then you just went from very very bad (0.25% of hardware peak) to
| just very bad (1.2% of hardware peak).
|
| Knowing how many reqs per second the hardware should be able to
| do is the only way to put things in perspective here.
| notacoward wrote:
| "What the hardware can do" is not the only useful definition of
| "done" and doesn't even mean anything for an arbitrarily chosen
| workload of sufficient complexity for user-level software to be
| involved. You can open a beer and celebrate when your
| performance lets you handle X load for less than Y operational
| cost. Or Z% less than current cost. Masturbating over "wire
| speed" when the wire has never been the limit for any prior
| implementation is kind of pointless.
|
| P.S. I've worked in both network and storage realms where
| getting close to wire speed really was a meaningful goal.
| Probably for longer than some interlocutors' entire careers.
| But this is not that case.
| slver wrote:
| TCP is not typically a hardware feature so how'd you know
| exactly?
|
| Maybe you wanna write a dedicated OS for it? Interesting
| project but I can't blame them for not doing it.
| stingraycharles wrote:
| Offloading TCP to hardware is, in fact, something that is
| very common, especially once you get into the 10gbit
| connections area. I would be surprised if AWS didn't do this.
| slver wrote:
| It's available; whether it's very common, I can't claim.
|
| Googling stuff like "Amazon AWS hardware TCP TOE" doesn't
| reveal anything. So we can't assume that either.
| jiggawatts wrote:
| Typically with public cloud vendors you get SR-IOV
| networking above a certain VM size, but you may have to
| jump through hoops to enable it.
|
| I'm not sure about AWS, but in Azure it is called
| "Accelerated Networking" and it is available in most
| recent VM sizes that have 4 CPUs or more.
|
| It enables direct hardware connectivity and all offload
| options. In my testing it reduces latency dramatically,
| with typical applications seeing 5x faster small
| transactions. Similarly, you can get "wire speed" for
| single TCP streams without any special coding.
| volta83 wrote:
| Your network link supports certain throughputs and latencies
| depending on the packet sizes. Your vendor should tell you
| what these are, and provide you with benchmarks to reproduce
| their claims (OSU reproduces these for, e.g., MPI).
|
| The network card also has hardware limits on the bandwidth it
| can handle, and its own latency. It is usually connected to the
| CPU via PCI-e, which also has latency and bandwidth limits, etc.
|
| All this goes to the CPU, which has latencies and bandwidths
| for the different caches and DRAM, etc.
|
| So you should be able to model the theoretical maximum number
| of requests that the network can handle, and then the network
| interface, the PCI-e bus, etc. up to DRAM.
|
| The amount that they can handle differs, so the bottleneck is
| going to be the slowest part of the chain.
|
| For example, as an extremely simplified example, say you have
| a 100 GB/s network, connected to a network adapter that can
| handle 200 GB/s, connected with PCI-e 3 to the CPU at 12 GB/s,
| which is connected to DRAM at 200 GB/s.
|
| If each request has to receive or send 1 GB, then you can at
| most handle 12 req/s because that's all what your PCI-e bus
| can support.
|
| If you are then delivering 1 req/s, then either your "model"
| is wrong, or your app is poorly implemented.
|
| If you are then delivering 11 req/s, then either your "model"
| is wrong, or your app is well implemented.
|
| But if you are far away from your model, e.g., at 1 req/s,
| you can still validate your model, e.g., by using two PCI-e
| buses, which you then expect to be 2x as fast. Maybe your data
| about your PCI-e bandwidth is incorrect, or you are not
| understanding something about how the packets get transferred,
| but the model guides you through the hardware bottlenecks.
|
| The blog post lacks a "model", and focuses on "what the
| software does" without ever putting it into the context of
| "what the hardware can do".
|
| That is enough to allow you to compare whether software A is
| faster than software B, but if you are the fastest, it
| doesn't tell you how far you can go.
| slver wrote:
| Handling request response isn't just about packet count. I
| might as well claim it's all just electric current and
| short some wires for max throughput /s
| gpderetta wrote:
| The computation given by parent allows you to compute
| upper bounds and order of magnitude estimates. He is
| correct that you need these values to guide your
| optimizations.
| volta83 wrote:
| Sure, its more complex than that, and an accurate model
| would be more complex as well.
|
| But hey, doing science[0] is hard, better not be
| scientific instead /s
|
| [0] science as in the scientific method:
| model->hypothesis->test, improve model->iterate. In
| contrast to the "shotgun", or like the blog author
| called it, "whack-a-mole" method: try many things, be
| grateful if one sticks, no ragrets. /s
| slver wrote:
| Doing science is great, but first we need to make sure
| we're not comparing apples and oranges.
|
| OP has defined the problem as speeding up an HTTP server
| (libreactor based) on Linux. So that's a context we
| assume as a base, questions like "what can the hardware
| do without libreactor and without Linux" are not posed
| here.
| volta83 wrote:
| If your problem is "speeding up X", one of, if not the
| first question you should ask is: "how fast can X be"?
|
| If you don't know, find out, because maybe X is already
| as fast as it can be, and there is nothing to speed up.
|
| Sure, the OP just looks around and sees that others are
| faster, and they want to be as fast as they are.
|
| That's one way to go. But if all others are only 1% as
| fast as _they should be_, then...
|
| - either you have fundamentally misunderstood the problem
| and the answer to "how fast can X be?" (maybe it's not as
| fast as you thought for reasons worth learning)
|
| - what everyone else is doing is not the right way to
| make X as fast as X can be
|
| The value in having a model of your problem is not the
| model, but rather what you can learn from it.
|
| You can optimize "what an application does", but if what
| it does is the wrong thing to do, that's not going to get
| you close to what the performance of that application
| should be.
| jiggawatts wrote:
| I had this literal debate with a "network engineer" that
| was trying to convince me that 14 Mbps coming out of a
| Windows box with dual 10 Gbps NICs was expected. You
| know... because "Windows is slow"!
|
| I aim for 9 Gbps per NIC, but I still see people settling
| for 3 Gbps total as if that's "normal".
| hansel_der wrote:
| > but I still see people settling for 3 Gbps total as if
| that's "normal".
|
| y'know - it might be enough
| talawahtech wrote:
| The answer to that question is not quite as straight-forward as
| you might think. In many ways, this experiment/post is about
| _figuring out_ the answer to the question of "what is the best
| the hardware can do".
|
| I originally started running these tests using the c5.xlarge
| (not c5n.xlarge) instance type, which is capable of a maximum
| 1M packets per second. That is an artificial limit set by AWS
| at the network hardware level. Now mind you, it is not an
| arbitrary limit, I am sure they used several factors to decide
| what limits make the most sense based on the instance size,
| customer use cases, and overall network health. If I had to
| hazard a guess I would say that 99% of AWS customers don't even
| begin to approach that limit, and those that do are probably
| doing high speed routing and/or using UDP.
|
| Virtually no-one would have been hitting 1M req/s with 4 vCPUs
| doing synchronous HTTP request/response over TCP. Those that
| did would have been using a kernel bypass solution like DPDK.
| So this blog post is actually about trying to _find_ "the
| limit", which is in quotes because it is qualified with
| multiple conditions: (1) TCP (2) request/response (3) Standard
| kernel TCP/IP stack.
|
| While working on the post, I actively tried to find a network
| performance testing tool that would let me determine the upper
| limit for this TCP request/response use case. I looked at
| netperf, sockperf and uperf (iPerf doesn't do req/resp). For
| the TCP request/response case they were *all slower* than
| wrk+libreactor. So it was up to me to _find_ the limit.
|
| When I realized that I might hit the 1M req/s limit I switched
| to the c5n.xlarge whose hardware limit is 1.8M pps. Again, this
| is just a limit set by AWS.
|
| Future tests using a Graviton2 instance + io_uring +
| recompiling the kernel using profile-guided optimizations might
| allow us to push past the 1.8M pps limit. Future instances from
| AWS may just raise the pps limit again...
|
| Either way, it should be fun to find out.
| injinj wrote:
| Great work, thanks!
|
| I'm curious whether disabling the slow kernel network features
| competes with a TCP bypass stack. I did my own wrk benchmark
| [0], but I did not try to optimize the kernel stack beyond
| pinning CPUs and busypoll, because the bypass was about 6 times
| as fast. I assumed that there is no way the kernel stack could
| compete with that. This article shows that I may be wrong. I will
| definitely check out SO_ATTACH_REUSEPORT_CBPF in the future.
|
| [0] https://github.com/raitechnology/raids/#using-wrk-httpd-
| load...
| talawahtech wrote:
| That is an area I am curious about as well, especially if you
| throw io_uring into the mix. I think most kernel bypass
| solutions get some of their gains by just forcing you to use
| the same strategies covered in the perfect locality section. It
| doesn't all just come from the "bypass" part.
|
| Even if it isn't quite as fast as DPDK and co, it might be close
| enough for _some_ people to start opting to stick with the
| tried and true kernel stack instead of the more exotic
| alternatives.
| injinj wrote:
| My gut feeling with io_uring is that it wouldn't help as much
| with messaging applications with 100 byte request/reply
| patterns. It would be better in a pipelined situation, e.g.
| through a load balancing front end. I would love to be proven
| wrong, though.
| talawahtech wrote:
| 1.2M req/s means 2.4M (send/recv) syscalls per second. I
| definitely think io_uring will make a difference. Just not
| sure if it will be 5% or 25%.
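|
| The appeal is batching: with liburing you can queue many
| recv/send operations and submit them with a single syscall. A
| rough sketch (fds/bufs/nconns stand in for the server's own
| connection state):
|
|       #include <liburing.h>
|
|       struct io_uring ring;
|       io_uring_queue_init(1024, &ring, 0);
|
|       /* queue a recv on every connection... */
|       for (int i = 0; i < nconns; i++) {
|           struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|           io_uring_prep_recv(sqe, fds[i], bufs[i], BUF_SZ, 0);
|           io_uring_sqe_set_data(sqe, &fds[i]);
|       }
|       /* ...then submit them all with one syscall */
|       io_uring_submit(&ring);
|
|       struct io_uring_cqe *cqe;
|       while (io_uring_wait_cqe(&ring, &cqe) == 0) {
|           /* parse request, queue the matching send, repeat */
|           io_uring_cqe_seen(&ring, cqe);
|       }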
| alufers wrote:
| That is one hell of a comprehensive article. I wonder how much
| impact such extreme optimizations would have on a real-world
| application, one which, for example, does DB queries.
|
| This experiment feels similar to people who buy old cars and
| remove everything from the inside except the engine, which they
| tune up so that the car runs faster :).
| talawahtech wrote:
| This comprehensive level of extreme tuning is not going to be
| _directly_ useful to most people; but there are a few things in
| there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
| servers and frameworks adopt. Similarly I think it is good to
| be aware of the adaptive interrupt capabilities of AWS
| instances, and the impacts of speculative execution
| mitigations, even if you stick to the defaults.
|
| More importantly it is about the idea of using tools like
| Flamegraphs (or other profiling tools) to identify and
| eliminate _your_ bottlenecks. It is also just fun to experiment
| and share the results (and the CloudFormation template). Plus
| it establishes a high water mark for what is possible, which
| also makes it useful for future experiments. At some point I
| would like to do a modified version of this that includes DB
| queries.
| longhairedhippy wrote:
| Wow, I haven't seen SO_ATTACH_REUSEPORT_CBPF before, I didn't
| even know it existed. That is a pretty ingenious and powerful
| primitive to cut down on cross-NUMA chatter. I always like it
| when folks push things to the extreme, it really shows
| exactly what is going on under the hood.
| BiteCode_dev wrote:
| What does SO_ATTACH_REUSEPORT_CBPF do, and how does one use it?
| bboreham wrote:
| That is covered in the article.
| mkoubaa wrote:
| Speaking of which I wonder if anyone did this to the Linux
| kernel for a variant that's tuned only for http
| astrange wrote:
| He's cheating by assuming all http responses fit in one TCP
| packet, but you could use FreeBSD which is already tuned like
| this and has optimizations like ACCEPT_FILTER_HTTP not
| mentioned in this article.
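|
| On FreeBSD that's roughly: kldload accf_http, then tell the
| listening socket lfd to defer accept() until a full request
| has arrived:
|
|       #include <sys/types.h>
|       #include <sys/socket.h>
|       #include <string.h>
|
|       struct accept_filter_arg afa = {0};
|       strcpy(afa.af_name, "httpready");
|       setsockopt(lfd, SOL_SOCKET, SO_ACCEPTFILTER,
|                  &afa, sizeof(afa));
|
| called after listen().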
| 101008 wrote:
| Yes, my experience (not much) is that what makes YouTube or
| Google or any of those products really impressive is the speed.
|
| YouTube or Google Search suggestion is good, and I think it
| could be replicable with that amount of data. What is insane is
| the speed. I can't think how they do it. I am doing something
| similar for the company I work for and it takes seconds (and the
| amount of data isn't that much), so I can't wrap my head around
| it.
|
| The point is that doing only speed is not _that_ complicated,
| and doing some algorithms alone is not _that_ complicated. What
| is really hard is to do both.
| jiggawatts wrote:
| Latency. Latency. Latency!
|
| It's hard to measure, so nobody does.
|
| Throughput is easy to measure, so everybody does.
|
| Latency is hard to buy, so few people try.
|
| Throughput is easy to buy, so everybody does.
|
| Latency is what matters to every user.
|
| Throughput matters only to a few people.
|
| Turn on SR-IOV. Disable ACPI C-states. Stop tunnelling
| internal traffic through virtual firewalls. Use binary
| protocols instead of JSON over HTTPS.
|
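| The C-states bit is just a kernel command line or cpupower
| setting, though the exact knobs vary by platform:
|
|       # boot params on Intel:
|       #   intel_idle.max_cstate=1 processor.max_cstate=1
|       # or at runtime, disable idle states with an exit
|       # latency above ~10us:
|       cpupower idle-set -D 10
|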
| I've seen just those alone improve end-user experience
| tenfold.
| ecnahc515 wrote:
| A lot of this is just spending more money and resources to
| make it possible to optimize for speed.
|
| Sufficient caching and a lot of parallelism make this
| possible. That costs money though. Caching means storing
| data twice. Parallelism means more servers (since you'll
| probably be aiming to saturate the network bandwidth for each
| host).
|
| Pre-aggregating data is another part of the strategy, as that
| avoids using CPU cycles in the fast-path, but it means
| storing even more copies of the data!
|
| My personal anecdotal experience with this is with SQL on
| object storage. Query engines that use object storage can
| still perform well with the above techniques, even though
| querying large amounts of data from object storage is slow. You can
| bypass the slowness of object storage if you pre-cache the
| data somewhere else that's closer/faster for recent data. You
| can have materialized views/tables for rollups of data over
| longer periods of time, which reduces the data needed to be
| fetched and cached. It also requires less CPU due to working
| with a smaller amount of pre-calculated data.
|
| Apply this to every layer, every system, etc, and you can get
| good performance even with tons of data. It's why doing
| machine-learning in real-time is way harder than pre-computing
| models. Streaming platforms make this all much easier as you
| can constantly be pre-computing as much as you can, and pre-
| filling caches, etc.
|
| Of course, having engineers work on 1% performance
| improvements in the OS kernel, or memory allocators, etc will
| add up and help a lot too.
| QuercusMax wrote:
| One interesting thing to note is that there are lots of
| internal tools (CLI, web UI, etc.) that are REALLY slow.
| Things that are heavily used in the fast-path for
| development (e.g. code search, code review, test results)
| are generally pretty quick, but if there's a random system
| that has a UI, it's probably going to be very slow -
| because there's no budget for speeding them up, and the
| only people it annoys are engineers from other teams.
| simcop2387 wrote:
| I've had them take seconds for suggestions before when doing
| more esoteric searches. I think there's an inordinate amount
| of cached suggestions and they have an incredible way to look
| them up efficiently.
| mtoddsmith wrote:
| At a previous job they tracked down some slow https performance
| in a game server to the OpenSSL lib allocating/reallocating new
| buffers for each zip'd request. Patching that gave a huge
| performance increase and saved them from buying some fancy $500k
| hardware to offload the https processing.
| hinkley wrote:
| I can still remember the days when /dev/random slowed down SSL
| session handshakes.
| HugoDaniel wrote:
| "Disabling these mitigations gives us a performance boost of
| around 28%. "
|
| This can't be serious. Can someone flag this article? Highly
| inappropriate.
| zdw wrote:
| I wonder what the results would be if all the optimizations were
| applied except for the security-related mitigations, which were
| left enabled.
| fabioyy wrote:
| did you try DPDK?
| ameyv wrote:
| Hi Marc,
|
| Fantastic work! Keep it up.
| Bellamy wrote:
| I have done some performance optimization but this article has
| 30% stuff I have never heard of. Great work and thanks!
| pornel wrote:
| Interesting that most of the gains are from better
| utilization/configuration of Linux, not from code optimizations.
| The userland code was, and remained, a tiny fraction of time
| spent.
| MichaelMoser123 wrote:
| When is it advisable to turn off spectre/meltdown mitigations in
| practice? My guess is that if you are on a server and not running
| any user supplied code then you are on the safe side; on the
| condition that you can exclude buffer overruns by running
| managed code/Java or by using Rust.
| toast0 wrote:
| So the unspoken part of your question is when is it useful to
| turn off mitigations. The answer to that is when your
| application makes a lot of syscalls / when syscalls are a
| bottleneck beyond the actual work of the syscalls.
|
| This case, where it's all connection handling and serving a
| small static piece of data is a clear example; there's almost
| no userland work to be done before it goes to another syscall
| so any additional cost for the user/kernel barrier is going to
| hurt.
|
| Then the question becomes who can run code on your server; also
| considering maybe there's a remote code execution vulnerability
| in your code, or library code you use. Is there a meaningful
| barrier that spectre/meltdown mitigations would help enforce?
| Or would getting RCE get control over everything of substance
| anyway?
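|
| A quick way to see which mitigations you are currently paying
| for:
|
|       grep . /sys/devices/system/cpu/vulnerabilities/*
|
| And on recent kernels, booting with mitigations=off turns them
| all off at once.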
| MichaelMoser123 wrote:
| If you have an event driven system then you end up with very
| frequent system calls.
| anarazel wrote:
| Partially that can be amortized with io_uring... At the
| cost of some complexity, of course.
| 3gg wrote:
| Very educational and well-written, thank you.
| strawberrysauce wrote:
| Your website is super snappy. I see that it has a perfect
| lighthouse score too. Can you explain the stack you used and how
| you set it up?
| [deleted]
| talawahtech wrote:
| It is a statically generated site created with vitepress[1] and
| hosted on Cloudflare Pages[2]. The only dynamic functionality
| is the contact form which sends a JSON request to a Cloudflare
| Worker[3], which in turn dispatches the message to me via
| SNS[4].
|
| It is modeled off of the code used to generate the Vue blog[5],
| but
| I made a ton of little modifications, including some changes
| directly to vitepress.
|
| Keep in mind that vitepress is very much an early work in
| progress and the blog functionality is just kinda tacked on,
| the default use case is documentation. It also definitely has
| bugs and is under heavy development so wouldn't recommend it
| quite yet unless you are actually interested in getting your
| hands dirty with Vue 3. I am glad I used it because it gave me
| an excuse to start learning Vue, but unless you are just using
| the default theme to create a documentation site, it will
| require some work.
|
| 1. https://vitepress.vuejs.org/
|
| 2. https://pages.cloudflare.com/
|
| 3. https://workers.cloudflare.com/
|
| 4. https://aws.amazon.com/sns/
|
| 5. https://github.com/vuejs/blog
| strawberrysauce wrote:
| Thanks :). Found one flaw in your already crazy optimized
| vitepress site - the images aren't cached :P
| ricktdotorg wrote:
| cf-cache-status: HIT
| remram wrote:
| On the other hand you could probably make the table of
| contents always visible when the screen size allows it.
| Clicking on the burger in the site menu to get a page-
| specific sidebar is a bit counter-intuitive.
| throwdbaaway wrote:
| > EC2 X-factor?
|
| > Even after taking all the steps above, I still regularly saw a
| 5-10% variance in performance across two seemingly identical EC2
| server instances
|
| > To work around this variance, I tried to use the same instance
| consistently across all benchmark runs. If I had to redo a test,
| I painstakingly stopped/started my server instance until I got an
| instance that matched the established performance of previous
| runs.
|
| We noticed similar performance variance when running benchmarks on
| GCP and Azure. In the worst case, there can be a 20% variance on
| GCP. On Azure, the variance between identical instances is not as
| bad, perhaps about 10%, but there is an extra 5% variance between
| normal hours and off-peak hours, which further complicates
| things.
|
| It can be very frustrating to stop/start hundreds of times for
| hours to get back an instance with the same performance
| characteristics. For now, I use a simple bash for-loop that checks
| the "CPU MHz" value from lscpu output, and that seems to be
| reliable enough.
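|
| Roughly something like this, where 3400 is just an example
| cutoff for what we consider a "fast" host:
|
|       while :; do
|         mhz=$(lscpu | awk '/^CPU MHz/ {print int($3)}')
|         [ "$mhz" -ge 3400 ] && break
|         # ...stop/start the instance and check again...
|       done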
| Matumio wrote:
| On AWS you can rent ".metal" instances which are probably more
| stable for benchmarking. I tried this once for fun on a1.metal
| because I wanted access to all hardware performance counters.
| For that it worked. My computation was also running slightly
| faster (something around 5% IIRC). But of course you'll have to
| pay for all its cores and memory while you use it.
| throwdbaaway wrote:
| Yeah, that's exactly what the GCP engineer recommends, and
| likely why the final benchmark in the article was done using
| a c5n.9xlarge.
|
| Still, there is no guarantee that after stopping the instance
| on Friday evening, you would get back the same physical host
| on Monday morning. So, while using dedicated hardware does
| avoid the noisy neighbor problem, the "silicon lottery"
| problem remains. And so far, the data that I gathered
| indicates that the latter is the more likely cause, i.e. a
| "fast" virtual machine would remain fast indefinitely, while
| a "slow" virtual machine would remain slow indefinitely,
| despite both relying on a bunch of shared resources.
| jiggawatts wrote:
| Why would you expect two different virtual machines to have
| identical performance?
|
| I would expect that _just_ the cache usage characteristics of
| "neighbouring" workloads alone would account for at least a 10%
| variance! Not to mention system bus usage, page table entry
| churn, etc, etc...
|
| If you need more than 5% accuracy for a benchmark, you
| absolutely have to use dedicated hosts. Even then, just the
| _temperature of the room_ would have an effect if you leave
| Turbo Boost enabled! Not to mention the "silicon lottery" that
| all overclockers are familiar with...
|
| This feels like those engineering classes where we had to
| calculate stresses in every truss of a bridge to seven figures,
| and then multiply by ten for safety.
| throwdbaaway wrote:
| I didn't expect identical performance, but a 10~20% variance
| is just too big. For example, if
| https://www.cockroachlabs.com/guides/2021-cloud-report/ got a
| "slow" GCP virtual machine but a "fast" azure virtual
| machine, the final result could totally flip.
|
| The more problematic scenario, as mentioned in the article,
| is when you need to do some sort of performance tuning that
| can take weeks/months to complete. On the cloud, you either
| have to keep the virtual machine running all the time (and
| hope that a live migration doesn't happen behind the scenes
| to move it to a different physical host), or do the painful
| stop/start until you get back the "right" virtual machine
| before proceeding to do the actual work.
|
| We discovered this variance a couple of months ago. And this
| article from talawah.io is actually the first time I have
| seen anyone else mention it. It still remains a
| mystery, because we too can't figure out what contributes to
| the variance using tools like stress-ng, but the variance is
| real when looking at MySQL commits/s metric.
|
| > If you need more than 5% accuracy for a benchmark, you
| absolutely have to use dedicated hosts.
|
| After this ordeal, I am arriving at that conclusion as well.
| Just the perfect excuse to build a couple of ryzen boxes.
| jiggawatts wrote:
| This is a bit like someone being mystified that their
| arrival time at a destination across the city is not
| repeatable to within plus-minus a minute.
|
| There are traffic lights on the way! Other cars! Weather!
| Etc...
|
| I've heard that Google's internal servers (not GCP!) use
| special features of the Intel Xeon processors to logically
| partition the CPU caches. This enables non-prod workloads
| to coexist with prod workloads with a minimal risk of cache
| thrashing of the prod workload. IBM mainframes go further,
| splitting at the hardware level, with dedicated expansion
| slots and the like.
|
| You can't reasonably expect 4-core _virtual_ machines to
| behave identically to within 5% on a shared platform! That
| tiny little VM is probably shoulder-to-shoulder with 6 or 7
| other tenants on a 28 or 32 core processor. The host itself
| is likely dual-socket, and some other VMs sizes may be
| present, so up to 60 other VMs running on the same host.
| All sharing memory, network, disk, etc...
|
| The original article was also a network test. Shared
| fabrics aren't going to return 100% consistent results
| either. For that, you'd need a crossover cable.
| throwdbaaway wrote:
| Well, I'll be the first one to admit that I was naive to
| expect <5% variance prior to this experience. But I guess
| you are going too far by framing this as common wisdom?
|
| In the HN discussion about cockroachdb cloud report 2021
| (https://news.ycombinator.com/item?id=25811532), there
| was only 1 comment thread that talks about "cloud
| weather".
|
| In https://engineering.mongodb.com/post/reducing-
| variability-in..., high profile engineers still claimed
| that it is perfectly fine to use cloud for performance
| testing, and "EC2 instances are neither good nor bad".
|
| Of course, both the cockroachdb and mongodb cases could
| be related, as any performance variance at the instance
| level could be masked when the instances form a cluster,
| and the workload can be served by any node within the
| cluster.
| jiggawatts wrote:
| You do have a point. I also have seen many benchmarks use
| cloud instances without any disclaimers, and it always
| made me raise an eyebrow quizzically.
|
| Any such benchmark I do is averaged over a few instances
| in several availability zones. I also benchmark
| specifically in the local region that I will be deploying
| production to. They're not all the same!
|
| Where the cloud is useful for benchmarking is that it's
| possible to spin up a wide range of "scenarios" at low
| cost. Want to run a series of tests ranging from 1 to 100
| cores in a single box? You can! That's very useful for
| many kinds of multi-threaded development.
| truth_seeker wrote:
| Very impressive analysis. Thanks for sharing.
| miohtama wrote:
| How much headroom would there be if one were to use a unikernel
| and skip the application space altogether?
| cakoose wrote:
| This was great!
|
| Reminds me a lot of this classic CS paper: Improving IPC by
| Kernel Design, by Jochen Liedtke (1993)
|
| https://www.cse.unsw.edu.au/~cs9242/19/papers/Liedtke_93.pdf
| 0xbadcafebee wrote:
| Very well written, bravo. TOC and reference links makes it even
| better.
| baybal2 wrote:
| Take note: no quick cheat like DPDK was used.
|
| This shows you can make a regular Linux program using the Linux
| network stack approach something handcoded with DPDK.
| Adiqq wrote:
| Can anyone recommend similar articles/blogs that focus on
| optimization of networking/computing in Linux/cloud environments?
| These kinds of articles are very informative, because they refer
| to advanced mechanisms that I either haven't heard about or never
| saw in practical use.
| the8472 wrote:
| Since it's CPU-bound and spends a lot of time in the kernel would
| compiling the kernel for the specific CPU used make sense? Or are
| the CPU cycles wasted on things the compiler can't optimize?
| talawahtech wrote:
| Recompiling the kernel using profile guided optimizations[1] is
| yet another thing on the (never-ending) to-do list.
|
| 1. https://lwn.net/Articles/830300/
| ta988 wrote:
| Could you make a profile of just a bunch of functions on a
| running system?
| drenvuk wrote:
| I'm of two minds with regards to this: This is cool, but unless
| you have no authentication and no data to fetch remotely or on
| disk, this is really just telling you what the ceiling is for
| everything you could possibly run.
|
| As for this article, there are so many knobs that you tweaked to
| get this to run faster; it's incredibly informative. Thank you for
| sharing.
| joshka wrote:
| > this is really just telling you what the ceiling is
|
| That's a useful piece of info to know when performance tuning a
| real world app with auth / data / etc.
| londons_explore wrote:
| Some of these things could be fixed upstream and everyone would
| see real perf gains...
|
| For example, having dhclient (a very popular dhcp client) leave
| open an AF_PACKET socket causing a 3% slowdown in incoming packet
| processing for all network packets seems... suboptimal!
|
| Surely it can be patched to not cause a systemwide 3% slowdown
| (or at least to only do it very briefly while actively refreshing
| the DHCP lease)?
| talawahtech wrote:
| I would also love to see that dhclient issue resolved upstream,
| or at least a cleaner way to work around it. But we should also
| be mindful that for most workloads the impact is probably way,
| way less.
|
| Some of these things really only show up when you push things
| to their extremes, so it probably just wasn't on the
| developer's radar before.
| lttlrck wrote:
| I believe systemd-networkd has its own implementation of DHCP
| and therefore doesn't use dhclient. But I wonder if its
| behavior is any better in this respect.
|
| This has piqued my interest.
| mercora wrote:
| systemd-networkd keeps open that kind of socket for LLDP
| but apparently not for the DHCP client code. wpa_supplicant
| also keeps open this type of socket on my local system. and
| the dhcpd daemons on my routers have some of those too for
| each interface...
|
| i wonder if the slow path here could be avoided by using
| separate network namespaces in a way these sockets don't
| even get to see the packets...
| lttlrck wrote:
| Interesting.
|
| Looks like LLDP can be switched off in the network
| config.
|
| https://systemd.network/systemd.network.html
| zokier wrote:
| Specifically on EC2 I don't think you actually need to keep
| dhcp client running anyways, afaik EC2 instance ips are static
| so you can just keep using the one you got on boot.
| SaveTheRbtz wrote:
| The analysis itself is quite impressive: a very systematic top-
| down approach. We need more people doing stuff like this!
|
| But! Be careful applying tunables from the article "as-is"[1]:
| some of them would destroy TCP performance:
         net.ipv4.tcp_sack=0
         net.ipv4.tcp_dsack=0
         net.ipv4.tcp_timestamps=0
         net.ipv4.tcp_moderate_rcvbuf=0
         net.ipv4.tcp_congestion_control=reno
         net.core.default_qdisc=noqueue
|
| Not to mention that `gro off` will bump CPU usage by ~10-20% on
| most real world workloads, your Security Team would be really
| against turning off mitigations, and usage of `-march=native`
| will cause a lot of core dumps in heterogeneous production
| environments.
|
| [1] This is usually the case with single purpose micro-
| benchmarks: most of the tunables have side effects that may not
| be captured by a single workflow. Always verify how the "tunings"
| you found on the internet behave in _your_ environment.
| habibur wrote:
| That can be done with HTTP. But right now it's all HTTPS,
| especially when you are serving APIs over the Internet.
|
| And once I switch to HTTPS I see a dramatic drop in throughput,
| like 10x.
|
| An HTTP server doing 15k req/sec drops down to 400 req/sec once
| I start serving it over HTTPS.
|
| I see no solution to it, as everything has to be HTTPS now.
| astrange wrote:
| HTTPS, especially TLS 1.3, is not slow. x86 has had AES
| acceleration since 2010.
|
| It might need different tuning or you might be negotiating a
| slow cipher.
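|
| Both are easy to check (example.com being a placeholder):
|
|       # AES-NI throughput on the box
|       openssl speed -evp aes-128-gcm
|       # protocol and cipher actually negotiated
|       openssl s_client -connect example.com:443 \
|           </dev/null 2>/dev/null | grep -E 'Protocol|Cipher'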
| ComputerGuru wrote:
| The SSL handshake (which affects TTFB) isn't AES.
| astrange wrote:
| Right, but TLS1.3 improves that especially with 0RTT.
| Before that you had things like session resumption for
| repeat clients, or if your server was overloaded you could
| use an external HTTPS proxy.
| micropoet wrote:
| Impressive stuff
| jart wrote:
| > Disabling [spectre] mitigations gives us a performance boost of
| around 28%
|
| Every couple months these last several years there always seems
| to be some bug where the fix only costs us 3% performance. Since
| those tiny performance hits add up over time, security is sort of
| like inflation in the compute economy. What I want to know is how
| high can we make that 28% go? The author could likely build a
| custom kernel that turns off stuff like pie, aslr, retpoline,
| etc. which would likely yield another 10%. Can anyone think of
| anything else?
| ronsor wrote:
| Most of these mitigations are worse than useless in an
| environment not executing untrusted code. Simply put, if you
| have a dedicated server and you aren't running user code, you
| don't need them.
| vlz wrote:
| But of course other exploits (e.g. in your webapp) might lead
| to "running user code" where you didn't expect it and then
| the mitigations could prevent privilege escalation, couldn't
| they?
| vladvasiliu wrote:
| But if you have a dedicated server for your web app, if
| there's some kind of exploit in it allowing for random code
| to be run, said code already has access to everything it
| needs, right?
|
| The interesting data will probably be whatever secrets the
| app handles, say database credentials, so the attacker is
| off to the races. They probably don't care about having
| root in particular.
| ex_amazon_sde wrote:
| > if there's some kind of exploit in it allowing for
| random code to be run, said code already has access to
| everything it needs
|
| On the same host there could be SSL certificates,
| credentials in a local MTA, credentials used to run
| backups and so on.
|
| Or the application itself could be made of multiple
| components where the vulnerable one is sandboxed.
| vladvasiliu wrote:
| All those points are true - though I'd argue this is
| stretching the "one app per VM" thing -, but I guess this
| is just the usual case of understanding your situation
| and realizing there's no one size fits all.
|
| My take on this question is rather that there shouldn't
| be any dogma around this: e.g. disabling mitigations
| should not be considered absolutely, 100% harmful and
| something to never, ever do.
|
| In the context of the OP, where the application is
| running on AWS, backups, email, etc are all likely to be
| handled either externally (say EBS snapshots) in which
| case there's no issue, or via "trusting the machine", so
| getting credentials via the instance role which every
| process on the VM can do, so no need for privilege
| escalation.
|
| So I guess if you trust EC2 or Task roles or similar (not
| familiar with EKS) to access sensitive data and only run
| a "single" application, there's likely little to no
| reason to use the mitigations.
|
| But, yeah, if you're running an application with multiple
| components, each in their own processes and don't use
| instance roles for sensitive access, maybe leave them on.
| Also, maybe, this means you're not running a single app
| per vm?
| ex_amazon_sde wrote:
| Why "app"? These are services.
|
| > there shouldn't be any dogma around this
|
| Like everything in security, it's about tradeoffs.
|
| > Also, maybe, this means you're not running a single app
| per vm?
|
| This is an argument for unikernels.
|
| Instead, on 99.9% of your services you want to run
| multiple independent processes, especially in a
| datacenter environment: your service, web server, sshd,
| logging forwarder, monitoring daemon, dhcp client, NTP
| client, backup service.
|
| Often some additional "bigcorp" services like HIDS,
| credential provider, asset management, power management,
| deployment tools.
| vladvasiliu wrote:
| > Why "app"? These are services.
|
| Yes, but I was using my initial post's parent's
| terminology. But I agree, in my mind, the subject was one
| single "service", as in process (or a process hierarchy,
| like say with gunicorn for python deployments).
|
| > This is an argument for unikernels.
|
| It is. And I'm also very interested in the developments
| around Firecracker and similar technologies. If we'd be
| able to have the kind of isolation AWS promises between
| ec2 instances on a single physical machine, while at the
| same time being able to launch a process in an isolated
| container as easy as with docker right now, I'd consider
| that really great. And all the other "infrastructure"
| services you talk about could just live their lives in
| their dedicated containers.
|
| Not sure how all this would compare, performance-wise,
| with just enabling the mitigations.
| astrange wrote:
| PIE and ASLR are free on x86-64, unless someone has a bad ABI I
| don't know of. Spectre mitigations are also free or not needed
| on new enough hardware.
|
| Many security changes also help you find memory corruption
| bugs, which is good for developer productivity.
| seoaeu wrote:
| The puzzling thing was that spectre V2 mitigations were cited
| as the main culprit. They were responsible by themselves for a
| 15-20% slowdown, which is about an order of magnitude worse
| than in my experience. I wonder if the system had IBRS enabled
| instead of using retpolines as the mitigation strategy?
| imhoguy wrote:
| I am not fully deep in SecOps these days and would gladly hear
| the opinion of an expert:
|
| Can disabling these mitigations bring any risks assuming the
| server is sending static content to the Internet over port
| 80/443 and it is practically stateless with read-only file
| system?
| syoc wrote:
| I am not an expert but you shall have my take either way. The
| most important question here is "Am I executing arbitrary
| untrusted code?". HTTP servers will parse the incoming
| requests so they are executed to some extent. But I would not
| worry about it unless there is some backend application doing
| more involved processing with the data. repl.it should not
| disable mitigations.
| jiggawatts wrote:
| Does anyone know of a quick & easy PowerShell script I can run
| on Windows servers to disable Spectre mitigations?
|
| The last time I looked I found a lot of waffle but no simple
| way I can just turn that stuff off...
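|
| The closest thing I've found since is Microsoft's documented
| FeatureSettingsOverride registry pair, which in PowerShell
| would be something like:
|
|       $k = 'HKLM:\SYSTEM\CurrentControlSet\Control' +
|            '\Session Manager\Memory Management'
|       Set-ItemProperty $k FeatureSettingsOverride 3
|       Set-ItemProperty $k FeatureSettingsOverrideMask 3
|       Restart-Computer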
| bigredhdl wrote:
| I really like the "Optimizations That Didn't Work" section. This
| type of information should be shared more often.
| Thaxll wrote:
| There was a similar article from Dropbox years ago:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| still very relevant
| 120bits wrote:
| Very well written.
|
| - I have a nodejs server for the APIs and it's running on an
| m5.xlarge instance. I haven't done much research on what
| instance type I should go for. I looked it up and it seems like
| c5n.xlarge (mentioned in the article) is compute optimized.
| The cost difference isn't much between m5.xlarge and c5n.xlarge.
| So, I'm assuming that switching to a c5 instance would be
| better, right?
|
| - Is having nginx handle the requests a better option here,
| with a reverse proxy set up for NodeJS? I'm thinking of taking
| small steps on scaling an existing framework.
| talawahtech wrote:
| Thanks!
|
| The c5 instance type is about 10-15% faster than the m5, but
| the m5 has twice as much memory. So if memory is not a concern
| then switching to c5 is both a little cheaper and a little
| faster.
|
| You shouldn't need the c5n, the regular c5 should be fine for
| most use cases, and it is cheaper.
|
| Nginx in front of nodejs sounds like a solid starting point,
| but I can't claim to have a ton of experience with that combo.
| danielheath wrote:
| For high level languages like node, the graviton2 instances
| offer vastly cheaper cpu time (as in, 40%). That's the m6g /
| c6g series.
|
| As in all things, check the results on your own workload!
| [deleted]
| nodesocket wrote:
| The m5 has more memory; if your application is memory bound,
| stick with that instance type.
|
| I'd recommend just using a standard AWS application load
| balancer in front of your Node.js app. Terminate SSL at the ALB
| as well using certificate manager (free). Will run you around
| $18 a month more.
| secondcoming wrote:
| Fantastic article. Disabling spectre mitigations on all my team's
| GCE instances is something I'm going to check out.
|
| Regarding core pinning, the usual advice is to pin to the CPU
| socket physically closest to the NIC. Is there any point doing
| this on cloud instances? Your actual cores could be anywhere. So
| just isolate one and hope for the best?
| brobinson wrote:
| There are a bunch more mitigations that can be disabled than he
| disables in the article. I usually refer to
| https://make-linux-fast-again.com/
| atatatat wrote:
| Make Linux Even More Insecure Again
| mappu wrote:
| In this list, mitigations=off implies all the others.
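|
| e.g. appended to GRUB_CMDLINE_LINUX_DEFAULT in
| /etc/default/grub (followed by update-grub and a reboot):
|
|       GRUB_CMDLINE_LINUX_DEFAULT="... mitigations=off"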
| halz wrote:
| Pinning to the physically closest core is a bit misleading.
| Take a look at output from something like `lstopo`
| [https://www.open-mpi.org/projects/hwloc/], where you can
| filter pids across the NUMA topology and trace which components
| are routed into which nodes. Pin the network based workloads
| into the corresponding NUMA node and isolate processes from
| hitting the IRQ that drives the NIC.
| ricktdotorg wrote:
| wow, i had wondered about pinning in the cloud. this is a
| fantastic tip - thank you!
| nhoughto wrote:
| I'd love to have the time (and ability!) to do this level of
| digging. Amazing write up too, very well presented.
| ArtWomb wrote:
| Wow. Such impressive bpftrace skill! Keeping this article under
| my pillow ;)
|
| Wonder where the next optimization path leads? Using huge memory
| pages. io_uring, which was briefly mentioned. Or kernel bypass,
| which is supported on c5n instances as of late...
| ta988 wrote:
| Kernel bypass?
| brendangregg wrote:
| Great work, thanks for sharing! Systems performance at its best.
| Nice to see the use of the custom palette.map (I forget to do
| that myself and I often end up hacking in highlights in the Perl
| code.)
|
| BTW, those disconnected kernel stacks can probably be reconnected
| with the user stacks by switching out the libc for one with frame
| pointers; e.g., the new libc6-prof package.
| talawahtech wrote:
| Thank you for sharing all your amazing tools and resources
| brendangregg! I wouldn't have been able to do most of these
| optimizations without FlameGraph and bpftrace.
|
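| For anyone who hasn't used them, the basic FlameGraph recipe
| is roughly:
|
|       perf record -F 99 -a -g -- sleep 30
|       perf script | ./stackcollapse-perf.pl \
|                   | ./flamegraph.pl > flame.svg
|
| with the two perl scripts coming from the FlameGraph repo.
|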
| I actually did the same thing and hacked up the perl code to
| generate my custom palette.map
|
| Thanks for the tip re: the disconnected kernel stacks. They
| actually kinda started to grow on me for this experiment,
| especially since most of the work was on the kernel side.
| anarazel wrote:
| Is libc6-prof just glibc recompiled with -fno-omit-frame-
| pointer? I did that a couple times and found that while it
| fixes a few system calls, it doesn't fix all of them. I think
| the main issue was several syscalls being called from asm,
| which unsurprisingly isn't affected by -fno-omit-frame-
| pointer.
| sigg3 wrote:
| I'm digging the website layout. What's the CSS framework he's
| using? I'm on mobile, and can't see the source.
| bbeausej wrote:
| Thank you for the amazing article and detailed insights. Great
| writing style and approaches.
|
| How long did you spend researching this subject to produce such
| an in depth report?
| talawahtech wrote:
| Hard to say exactly. I have been working on this in my spare
| time, but pretty consistently since covid-19 started. A lot of
| this was new to me, so it wasn't all as straight-forward as it
| seems in the blog.
|
| As a ballpark I would say I invested hundreds of hours in this
| experiment. Lots of sidetracks and dead ends along the way, but
| also an amazing learning experience.
| diroussel wrote:
| Did you consider wrk2?
|
| https://github.com/giltene/wrk2
|
| Maybe you duplicated some of these fixes?
| ikoveshnik wrote:
| I really like that wrk2 allows you to configure a fixed request
| rate; latency measurement works much better in this case. But
| wrk2 itself has bugs that prevent using it in more complicated
| cases, e.g. lua scripts are not working properly.
| talawahtech wrote:
| Yea, I looked at wrk2 but it was a no-go right out the gate.
| From what I recall the changes to handle coordinated omission
| use a timer that has a 1ms resolution. So basically things
| broke immediately because all requests were under 1ms.
| throwdbaaway wrote:
| If I understand correctly, coordinated omission handling only
| matters if the benchmark is done with a fixed RPS rate, right?
| In this case, it looks like a closed model benchmark where a
| fixed number of client threads just go as fast as they can.
|
| edit: Oh, perhaps wrk2 still relies on the timer even when
| not specifying a fixed rate RPS.
| skyde wrote:
| so twrk doesn't handle coordinated omission or you found a
| different way to do it?
| talawahtech wrote:
| I didn't make any coordinated omission changes (I really
| didn't make many changes in general), so twrk does what wrk
| does. It attempts to correct it after the fact by looking
| for requests that took twice as long as average and doing
| some backfilling[1].
|
| I am no expert where coordinated omission is concerned, but
| my understanding is that it is most problematic in
| scenarios where your p90+ latency is high. Looking at the
| results for the 1.2M req/s test you have the following
| latencies:
|       p50     203.00us
|       p90     236.00us
|       p99     265.00us
|       p99.99  317.00us
|       pMAX    626.00us
|
| If you were to apply wrk's coordinated omission hack to
| these result, the backfilling only starts for requests that
| took longer than p50 x 2 (roughly) = 406us, which is
| probably somewhere between p99.999 and pMAX; a very, very
| small percentage.
|
| I am not claiming that wrk's hack is "correct", just that I
| don't think coordinated omission is a major concern for
| *this specific workload/environment*
|
| 1. https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b9
| 5513...
| paracyst wrote:
| I don't have anything to add to the conversation other than to
| say that this is fantastic technical writing (and content too).
| Most of the time, when similar articles like this one are posted
| to company blogs, they bore me to tears and I can't finish them,
| but this is very engaging and informative. Cheers
| talawahtech wrote:
| Thanks, that actually means a lot. It took a lot of work, not
| just on the server/code, but also the writing. I asked a lot of
| people to review it (some multiple times) and made a ton of
| changes/edits over the last couple months.
|
| Thanks again to my reviewers!
| specialist wrote:
| What is the theoretical max req/s for a 4 vCPU c5n.xlarge
| instance?
| talawahtech wrote:
| There is no published limit, but based on my tests the network
| device for the c5n.xlarge has a hard limit of 1.8M pps (which
| translates directly to req/s for small requests without
| pipelining).
|
| There is also a quota system in place, so even though that is
| the hard limit, you can only operate at those speeds for a
| short time before you start getting rate-limited.
| specialist wrote:
| Improving from 12.4% to 66.6% of theoretical max is kinda
| amazing.
|
| Presenting it this way may help noobs like me with capacity
| planning.
| romanitalian wrote:
| Did you compare with Japronto?
| romanitalian wrote:
| Have you seen "Japronto" [https://github.com/squeaky-pl/japronto]?
___________________________________________________________________
(page generated 2021-05-21 23:02 UTC)