[HN Gopher] Extreme HTTP Performance Tuning
       ___________________________________________________________________
        
       Extreme HTTP Performance Tuning
        
       Author : talawahtech
       Score  : 880 points
       Date   : 2021-05-20 20:01 UTC (1 day ago)
        
 (HTM) web link (talawah.io)
 (TXT) w3m dump (talawah.io)
        
       | HugoDaniel wrote:
       | "Many of these specific optimizations won't really benefit you
       | unless you are already serving more than 50k req/s to begin
       | with."
        
       | jeffbee wrote:
       | Very nice round-up of techniques. I'd throw out a few that might
       | or might not be worth trying: 1) I always disable C-states deeper
       | than C1E. Waking from C6 takes upwards of 100 microseconds, way
       | too much for a latency-sensitive service, and it doesn't save
       | _you_ any money when you are running on EC2; 2) Try receive flow
       | steering for a possible boost above and beyond what you get from
       | RSS.
       | 
       | Would also be interesting to discuss the impacts of turning off
       | the xmit queue discipline. fq is designed to reduce frame drops
       | at the switch level. Transmitting as fast as possible can cause
       | frame drops which will totally erase all your other tuning work.
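        | 
        | On the C-state piece: besides boot-time options, there's the PM
        | QoS interface. A process that holds /dev/cpu_dma_latency open
        | with a low value keeps the CPUs out of deep C-states for as
        | long as the fd stays open. A minimal sketch (mine, not from the
        | article; the 10us cap is a guess that roughly excludes states
        | deeper than C1/C1E, and it needs root):
        | 
        |     #include <fcntl.h>
        |     #include <stdint.h>
        |     #include <stdio.h>
        |     #include <unistd.h>
        | 
        |     int main(void)
        |     {
        |         int32_t max_latency_us = 10; /* max tolerated exit latency */
        |         int fd = open("/dev/cpu_dma_latency", O_WRONLY);
        |         if (fd < 0) { perror("open"); return 1; }
        |         if (write(fd, &max_latency_us,
        |                   sizeof(max_latency_us)) < 0) {
        |             perror("write");
        |             return 1;
        |         }
        |         pause(); /* constraint holds only while fd is open */
        |         close(fd);
        |         return 0;
        |     }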
        
         | talawahtech wrote:
         | Thanks!
         | 
         | > I always disable C-states deeper than C1E
         | 
         | AWS doesn't let you mess with c-states for instances smaller
         | than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
         | just for kicks, but it didn't make a difference. Once this test
         | starts, all CPUs are 99+% Busy for the duration of the test. I
         | think it would factor in more if there were lots of CPUs, and
         | some were idle during the test.
         | 
         | > Try receive flow steering for a possible boost
         | 
         | I think the stuff I do in the "perfect locality" section[2]
         | (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
         | flow steering would be trying to do, but more efficiently.
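          | 
          | For anyone curious, the CBPF program involved is tiny. A
          | minimal sketch (mine, not the article's exact code), assuming
          | one reuseport listener per CPU, created in CPU order, with
          | `fd` being any socket in the group: the filter returns the
          | current CPU index, which the kernel uses to pick the socket,
          | so a flow stays on the CPU that received it.
          | 
          |     #include <linux/filter.h>
          |     #include <sys/socket.h>
          | 
          |     static int attach_cpu_steering(int fd)
          |     {
          |         struct sock_filter code[] = {
          |             /* A = current CPU number */
          |             { BPF_LD | BPF_W | BPF_ABS, 0, 0,
          |               SKF_AD_OFF + SKF_AD_CPU },
          |             /* return A: socket index == CPU index */
          |             { BPF_RET | BPF_A, 0, 0, 0 },
          |         };
          |         struct sock_fprog prog = {
          |             .len = sizeof(code) / sizeof(code[0]),
          |             .filter = code,
          |         };
          |         return setsockopt(fd, SOL_SOCKET,
          |                           SO_ATTACH_REUSEPORT_CBPF,
          |                           &prog, sizeof(prog));
          |     }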
         | 
         | > Would also be interesting to discuss the impacts of turning
         | off the xmit queue discipline
         | 
         | Yea, noqueue would definitely be a no-go on a constrained
         | network, but when running the (t)wrk benchmark in the cluster
         | placement group I didn't see any evidence of packet drops or
          | retransmits. Drops only happened with the iperf test.
         | 
         | 1.
         | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
         | 
         | 2. https://talawah.io/blog/extreme-http-performance-tuning-
         | one-...
        
         | duskwuff wrote:
         | Does C-state tuning even do anything on EC2? My intuition says
         | it probably doesn't pass through to the underlying hardware --
         | once the VM exits, it's up to the host OS what power state the
         | CPU goes into.
        
           | jeffbee wrote:
           | It definitely works and you can measure the effect. There's
           | official documentation on what it does and how to tune it:
           | 
           | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo.
           | ..
        
             | duskwuff wrote:
             | Okay, so it looks as though it only applies to certain
             | large instance types -- presumably ones which are large
             | enough that it makes sense for the host to statically
             | allocate CPU cores (or even sockets) to a guest.
             | Interesting.
        
         | xtacy wrote:
         | I suspect that the web server's CPU usage will be pretty high
         | (almost 100%), so C-state tuning may not matter as much?
         | 
         | EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
         | so it might not be as effective. For a uniform request workload
         | like the one in the article, statically binding flows to a NIC
         | queue should be sufficient. :)
        
       | fierro wrote:
        | How can you be sure the estimated max server capability is not
        | actually just a limitation in the _client_, i.e., the client
        | maxes out at _sending_ 224k requests/second?
       | 
       | I see that this is clearly not the case here, but in general how
       | can one be sure?
        
         | mh- wrote:
         | You parallelize the load from multiple clients (running on
         | separate hardware). There are some open source projects that
         | facilitate this sort of workload (and the subsequent
         | aggregation of results/stats.)
        
           | trashcan wrote:
           | https://locust.io/ is a good example
        
         | 0xEFF wrote:
         | Use N clients. Increase N until you're sure.
        
       | alinspired wrote:
        | What was the MTU in the test, and how does increasing it affect
        | the results?
        | 
        | Reminds me of how complicated it was to generate 40Gbit/sec of
        | http traffic (with default MTU) to test F5 BIG-IP appliances;
        | luckily Tcl iRules had `HTTP::retry`
        
         | talawahtech wrote:
         | The MTU is 9001 within the VPC, but the packets are less than
         | 250 bytes so the MTU doesn't really come into play.
         | 
         | This test is more about packets/s than bytes/s.
        
       | volta83 wrote:
       | I'm missing one thing from the article, that is commonly missing
       | from performance-related articles.
       | 
       | When you talk about playing whack-a-mole with the optimizations,
       | this is what you are missing:
       | 
       | > What's the best the hardware can do?
       | 
        | You don't say in the article. The article only says that you
        | start at 250k req/s and end at 1.2M req/s.
       | 
       | Is that good? Is your optimization work done? Can you open a beer
       | and celebrate?
       | 
       | The article doesn't say.
       | 
       | If the best the hardware can technically do is 1.3M req/s, then
       | you probably can call it a day.
       | 
       | But if the best the hardware can do is technically 100M req/s,
       | then you just went from very very bad (0.25% of hardware peak) to
       | just very bad (1.2% of hardware peak).
       | 
        | Knowing how many requests per second the hardware should be able
        | to do is the only way to put things in perspective here.
        
         | notacoward wrote:
         | "What the hardware can do" is not the only useful definition of
         | "done" and doesn't even mean anything for an arbitrarily chosen
         | workload of sufficient complexity for user-level software to be
         | involved. You can open a beer and celebrate when your
         | performance lets you handle X load for less than Y operational
         | cost. Or Z% less than current cost. Masturbating over "wire
         | speed" when the wire has never been the limit for any prior
         | implementation is kind of pointless.
         | 
         | P.S. I've worked in both network and storage realms where
         | getting close to wire speed really was a meaningful goal.
         | Probably for longer than some interlocutors' entire careers.
         | But this is not that case.
        
         | slver wrote:
        | TCP is not typically a hardware feature, so how would you know
        | exactly?
         | 
         | Maybe you wanna write a dedicated OS for it? Interesting
         | project but I can't blame them for not doing it.
        
           | stingraycharles wrote:
           | Offloading TCP to hardware is, in fact, something that is
           | very common, especially once you get into the 10gbit
           | connections area. I would be surprised if AWS didn't do this.
        
             | slver wrote:
              | It's available; whether it's very common, I can't say.
             | 
             | Googling stuff like "Amazon AWS hardware TCP TOE" doesn't
             | reveal anything. So we can't assume that either.
        
               | jiggawatts wrote:
               | Typically with public cloud vendors you get SR-IOV
               | networking above a certain VM size, but you may have to
               | jump through hoops to enable it.
               | 
               | I'm not sure about AWS, but in Azure it is called
               | "Accelerated Networking" and it is available in most
               | recent VM sizes that have 4 CPUs or more.
               | 
               | It enables direct hardware connectivity and all offload
               | options. In my testing it reduces latency dramatically,
                | with typical applications seeing 5x faster small
               | transactions. Similarly, you can get "wire speed" for
               | single TCP streams without any special coding.
        
           | volta83 wrote:
            | Your network link supports certain throughputs and latencies
            | depending on the packet sizes. Your vendor should tell you
            | what these are, and provide you with benchmarks to reproduce
            | their claims (OSU reproduces these for, e.g., MPI).
            | 
            | The network card also has hardware limits on the bandwidth
            | it can handle and on its latency. It is usually connected to
            | the CPU via PCI-e, which also has its own latency and
            | bandwidth, etc.
            | 
            | All this goes to the CPU, which has latencies and bandwidths
            | across the different caches and DRAM, etc.
            | 
            | So you should be able to model the theoretical maximum
            | number of requests that the network can handle, and then the
            | network interface, the PCI-e bus, etc., all the way up to
            | DRAM.
           | 
           | The amount that they can handle differs, so the bottleneck is
           | going to be the slowest part of the chain.
           | 
            | As an extremely simplified example, say you have a 100 GB/s
            | network, connected to a network adapter that can handle
            | 200 GB/s, connected with PCI-e 3 to the CPU at 12 GB/s,
            | which is connected to DRAM at 200 GB/s.
            | 
            | If each request has to receive or send 1 GB, then you can at
            | most handle 12 req/s, because that's all your PCI-e bus can
            | support.
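            | 
            | In code form, the toy model is just a min() over the chain
            | (numbers are the made-up ones above):
            | 
            |     #include <stdio.h>
            | 
            |     int main(void)
            |     {
            |         /* link, NIC, PCI-e 3, DRAM, in GB/s */
            |         double chain[] = { 100, 200, 12, 200 };
            |         double bottleneck = chain[0];
            |         for (int i = 1; i < 4; i++)
            |             if (chain[i] < bottleneck)
            |                 bottleneck = chain[i];
            |         double gb_per_req = 1.0;
            |         /* prints "12 req/s max" */
            |         printf("%.0f req/s max\n", bottleneck / gb_per_req);
            |         return 0;
            |     }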
           | 
            | If you are then delivering 1 req/s, either your "model" is
            | wrong, or your app is poorly implemented.
            | 
            | If you are delivering 11 req/s, either your "model" is
            | wrong, or your app is well implemented.
           | 
            | But if you are far away from your model, e.g., at 1 req/s,
            | you can still validate the model, e.g., by using two PCI-e
            | buses, which you would then expect to be 2x as fast. Maybe
            | your data about your PCI-e bandwidth is incorrect, or you
            | are not understanding something about how the packets get
            | transferred, but the model guides you through the hardware
            | bottlenecks.
           | 
            | The blog post lacks a "model", and focuses on "what the
            | software does" without ever putting it into the context of
            | "what the hardware can do".
            | 
            | That is enough to let you compare whether software A is
            | faster than software B, but if you are the fastest, it
            | doesn't tell you how far you can go.
        
             | slver wrote:
             | Handling request response isn't just about packet count. I
             | might as well claim it's all just electric current and
             | short some wires for max throughput /s
        
               | gpderetta wrote:
               | The computation given by parent allows you to compute
               | upper bounds and order of magnitude estimates. He is
               | correct that you need these values to guide your
               | optimizations.
        
               | volta83 wrote:
               | Sure, its more complex than that, and an accurate model
               | would be more complex as well.
               | 
               | But hey, doing science[0] is hard, better not be
               | scientific instead /s
               | 
                | [0] science as in the scientific method:
               | model->hypothesis->test , improve model->iterate. In
               | contrast to the "shoot gun", or like the blog author
               | called it, "whack-a-mole" method: try many things, be
               | grateful if one sticks, no ragrets. /s
        
               | slver wrote:
               | Doing science is great, but first we need to make sure
               | we're not comparing apples and oranges.
               | 
               | OP has defined the problem as speeding up an HTTP server
               | (libreactor based) on Linux. So that's a context we
               | assume as a base, questions like "what can the hardware
               | do without libreactor and without Linux" are not posed
               | here.
        
               | volta83 wrote:
               | If your problem is "speeding up X", one of, if not the
               | first question you should ask is: "how fast can X be"?
               | 
               | If you don't know, find out, because maybe X is already
               | as fast as it can be, and there is nothing to speed up.
               | 
               | Sure, the OP just looks around and sees that others are
               | faster, and they want to be as fast as they are.
               | 
               | That's one way to go. But if all others are only 1% as
               | fast as _they should be_, then...
               | 
               | - either you have fundamentally misunderstood the problem
                | and the answer to "how fast can X be?" (maybe it's not
                | as fast as you thought, for reasons worth learning)
               | 
                | - or what everyone else is doing is not the right way to
                | make X as fast as X can be
               | 
               | The value in having a model of your problem is not the
               | model, but rather what you can learn from it.
               | 
               | You can optimize "what an application does", but if what
               | it does is the wrong thing to do, that's not going to get
               | you close to what the performance of that application
               | should be.
        
             | jiggawatts wrote:
             | I had this literal debate with a "network engineer" that
             | was trying to convince me that 14 Mbps coming out of a
             | Windows box with dual 10 Gbps NICs was expected. You
             | know... because "Windows is slow"!
             | 
             | I aim for 9 Gbps per NIC, but I still see people settling
             | for 3 Gbps total as if that's "normal".
        
               | hansel_der wrote:
               | > but I still see people settling for 3 Gbps total as if
               | that's "normal".
               | 
               | y'know - it might be enough
        
         | talawahtech wrote:
          | The answer to that question is not quite as straightforward as
         | you might think. In many ways, this experiment/post is about
         | _figuring out_ the answer to the question of  "what is the best
         | the hardware can do".
         | 
         | I originally started running these tests using the c5.xlarge
         | (not c5n.xlarge) instance type, which is capable of a maximum
         | 1M packets per second. That is an artificial limit set by AWS
         | at the network hardware level. Now mind you, it is not an
         | arbitrary limit, I am sure they used several factors to decide
         | what limits make the most sense based on the instance size,
         | customer use cases, and overall network health. If I had to
         | hazard a guess I would say that 99% of AWS customers don't even
         | begin to approach that limit, and those that do are probably
         | doing high speed routing and/or using UDP.
         | 
         | Virtually no-one would have been hitting 1M req/s with 4 vCPUs
         | doing synchronous HTTP request/response over TCP. Those that
         | did would have been using a kernel bypass solution like DPDK.
         | So this blog post is actually about trying to _find_ "the
         | limit", which is in quotes because it is qualified with
         | multiple conditions: (1) TCP (2) request/response (3) Standard
         | kernel TCP/IP stack.
         | 
         | While working on the post, I actively tried to find a network
         | performance testing tool that would let me determine the upper
         | limit for this TCP request/response use case. I looked at
         | netperf, sockperf and uperf (iPerf doesn't do req/resp). For
         | the TCP request/response case they were *all slower* than
         | wrk+libreactor. So it was up to me to _find_ the limit.
         | 
         | When I realized that I might hit the 1M req/s limit I switched
         | to the c5n.xlarge whose hardware limit is 1.8M pps. Again, this
         | is just a limit set by AWS.
         | 
         | Future tests using a Graviton2 instance + io_uring +
         | recompiling the kernel using profile-guided optimizations might
         | allow us to push past the 1.8M pps limit. Future instances from
         | AWS may just raise the pps limit again...
         | 
         | Either way, it should be fun to find out.
        
       | injinj wrote:
       | Great work, thanks!
       | 
        | I'm curious whether disabling the slow kernel network features
        | competes with a TCP bypass stack. I did my own wrk benchmark
       | [0], but I did not try to optimize the kernel stack beyond
       | pinning CPUs and busypoll, because the bypass was about 6 times
       | as fast. I assumed that there is no way the kernel stack could
       | compete with that. This article shows that I may be wrong. I will
       | definitely check out SO_ATTACH_REUSEPORT_CBPF in the future.
       | 
       | [0] https://github.com/raitechnology/raids/#using-wrk-httpd-
       | load...
        
         | talawahtech wrote:
         | That is an area I am curious about as well, especially if you
         | throw io_uring into the mix. I think most kernel bypass
         | solutions get some of their gains by just forcing you to use
         | the same strategies covered in the perfect locality section. It
         | doesn't all just come from the "bypass" part.
         | 
          | Even if it isn't quite as fast as DPDK and co, it might be close
         | enough for _some_ people to start opting to stick with the
         | tried and true kernel stack instead of the more exotic
         | alternatives.
        
           | injinj wrote:
            | My gut feeling with io_uring is that it wouldn't help as much
            | with messaging applications with 100-byte request/reply
            | patterns. It would be better in a pipelined situation,
            | through a load-balancing front end. I would love to be proven
           | wrong, though.
        
             | talawahtech wrote:
             | 1.2M req/s means 2.4M (send/recv) syscalls per second. I
             | definitely think io_uring will make a difference. Just not
             | sure if it will be 5% or 25%.
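              | 
              | A rough sketch of the idea with liburing (hypothetical,
              | not libreactor's actual code): queue the response send
              | and the next recv together, so both cost a single
              | io_uring_enter() instead of two syscalls.
              | 
              |     #include <liburing.h>
              | 
              |     /* one request/response round trip on socket fd */
              |     static int round_trip(struct io_uring *ring, int fd,
              |                           char *req, size_t req_len,
              |                           const char *resp, size_t resp_len)
              |     {
              |         struct io_uring_sqe *sqe;
              | 
              |         /* assumes the SQ ring has room for two entries */
              |         sqe = io_uring_get_sqe(ring);
              |         io_uring_prep_send(sqe, fd, resp, resp_len, 0);
              | 
              |         sqe = io_uring_get_sqe(ring);
              |         io_uring_prep_recv(sqe, fd, req, req_len, 0);
              | 
              |         /* one syscall submits both SQEs */
              |         if (io_uring_submit(ring) < 0)
              |             return -1;
              | 
              |         struct io_uring_cqe *cqe;
              |         for (int i = 0; i < 2; i++) {
              |             if (io_uring_wait_cqe(ring, &cqe) < 0)
              |                 return -1;
              |             io_uring_cqe_seen(ring, cqe);
              |         }
              |         return 0;
              |     }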
        
       | alufers wrote:
        | That is one hell of a comprehensive article. I wonder how much
        | impact such extreme optimizations would have on a real-world
        | application, which for example does DB queries.
       | 
       | This experiment feels similar to people who buy old cars and
       | remove everything from the inside except the engine, which they
       | tune up so that the car runs faster :).
        
         | talawahtech wrote:
         | This comprehensive level of extreme tuning is not going to be
         | _directly_ useful to most people; but there are a few things in
         | there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
         | servers and frameworks adopt. Similarly I think it is good to
         | be aware of the adaptive interrupt capabilities of AWS
         | instances, and the impacts of speculative execution
         | mitigations, even if you stick to the defaults.
         | 
         | More importantly it is about the idea of using tools like
         | Flamegraphs (or other profiling tools) to identify and
         | eliminate _your_ bottlenecks. It is also just fun to experiment
         | and share the results (and the CloudFormation template). Plus
         | it establishes a high water mark for what is possible, which
         | also makes it useful for future experiments. At some point I
         | would like to do a modified version of this that includes DB
         | queries.
        
           | longhairedhippy wrote:
           | Wow, I haven't seen SO_ATTACH_REUSEPORT_CBPF before, I didn't
           | even know it existed. That is a pretty ingenious and powerful
           | primitive to cut down on cross-NUMA chatter. I always like it
           | when folks push things to the extreme, it really shows
           | exactly what is going on under the hood.
        
           | BiteCode_dev wrote:
            | What does SO_ATTACH_REUSEPORT_CBPF do, and how does one
            | use it?
        
             | bboreham wrote:
             | That is covered in the article.
        
         | mkoubaa wrote:
         | Speaking of which I wonder if anyone did this to the Linux
         | kernel for a variant that's tuned only for http
        
           | astrange wrote:
           | He's cheating by assuming all http responses fit in one TCP
           | packet, but you could use FreeBSD which is already tuned like
           | this and has optimizations like ACCEPT_FILTER_HTTP not
           | mentioned in this article.
        
         | 101008 wrote:
         | Yes, my experience (not much) is that what makes YouTube or
         | Google or any of those products really impressive is the speed.
         | 
         | YouTube or Google Search suggestion is good, and I think it
         | could be replicable with that amount of data. What is insane is
          | the speed. I can't think how they do it. I am doing something
          | similar for the company I work for, and it takes seconds (and
          | the amount of data isn't that much), so I can't wrap my head
          | around it.
         | 
         | The point is that doing only speed is not _that_ complicated,
         | and doing some algorithms alone is not _that_ complicated. What
         | is really hard is to do both.
        
           | jiggawatts wrote:
           | Latency. Latency. Latency!
           | 
           | It's hard to measure, so nobody does.
           | 
           | Throughput is easy to measure, so everybody does.
           | 
           | Latency is hard to buy, so few people try.
           | 
           | Throughput is easy to buy, so everybody does.
           | 
           | Latency is what matters to every user.
           | 
           | Throughput matters only to a few people.
           | 
           | Turn on SR-IOV. Disable ACPI C-states. Stop tunnelling
           | internal traffic through virtual firewalls. Use binary
           | protocols instead of JSON over HTTPS.
           | 
           | I've seen just those alone improve end-user experience
           | tenfold.
        
           | ecnahc515 wrote:
           | A lot of this is just spending more money and resources to
           | make it possible to optimize for speed.
           | 
            | Sufficient caching and a lot of parallelism make this
            | possible. That costs money though. Caching means storing
           | data twice. Parallelism means more servers (since you'll
           | probably be aiming to saturate the network bandwidth for each
           | host).
           | 
           | Pre-aggregating data is another part of the strategy, as that
           | avoids using CPU cycles in the fast-path, but it means
           | storing even more copies of the data!
           | 
           | My personal anecdotal experience with this is with SQL on
           | object storage. Query engines that use object storage can
           | still perform well with the above techniques, even though
           | querying large amounts of data from object is slow. You can
           | bypass the slowness of object storage if you pre-cache the
           | data somewhere else that's closer/faster for recent data. You
           | can have materialized views/tables for rollups of data over
           | longer periods of time, which reduces the data needed to be
           | fetched and cached. It also requires less CPU due to working
           | with a smaller amount of pre-calculated data.
           | 
            | Apply this to every layer, every system, etc., and you can
            | get good performance even with tons of data. It's why doing
            | machine learning in real-time is way harder than pre-
            | computing models. Streaming platforms make this all much
            | easier as you can constantly be pre-computing as much as you
            | can, and pre-filling caches, etc.
           | 
           | Of course, having engineers work on 1% performance
           | improvements in the OS kernel, or memory allocators, etc will
           | add up and help a lot too.
        
             | QuercusMax wrote:
             | One interesting thing to note is that there are lots of
             | internal tools (CLI, web UI, etc.) that are REALLY slow.
             | Things that are heavily used in the fast-path for
             | development (e.g. code search, code review, test results)
             | are generally pretty quick, but if there's a random system
             | that has a UI, it's probably going to be very slow -
             | because there's no budget for speeding them up, and the
             | only people it annoys are engineers from other teams.
        
           | simcop2387 wrote:
           | I've had them take seconds for suggestions before when doing
           | more esoteric searches. I think there's an inordinate amount
           | of cached suggestions and they have an incredible way to look
           | them up efficiently.
        
       | mtoddsmith wrote:
       | At a previous job they tracked down some slow https performance
        | in a game server to the OpenSSL lib allocating/reallocating new
       | buffers for each zip'd request. Patching that gave a huge
       | performance increase and saved them from buying some fancy $500k
       | hardware to offload the https processing.
        
         | hinkley wrote:
         | I can still remember the days when /dev/random slowed down SSL
         | session handshakes.
        
       | HugoDaniel wrote:
       | "Disabling these mitigations gives us a performance boost of
       | around 28%. "
       | 
       | This can't be serious. Can someone flag this article? Highly
       | inappropriate.
        
       | zdw wrote:
       | I wonder what the results would be if all the optimizations were
       | applied except for the security-related mitigations, which were
       | left enabled.
        
       | fabioyy wrote:
       | did you try DPDK?
        
       | ameyv wrote:
       | Hi Marc,
       | 
       | Fantastic work! Keep it up.
        
       | Bellamy wrote:
        | I have done some performance optimization, but this article is
        | 30% stuff I have never heard of. Great work and thanks!
        
       | pornel wrote:
       | Interesting that most of the gains are from better
       | utilization/configuration of Linux, not from code optimizations.
        | The userland code was, and remained, a tiny fraction of the
        | time spent.
        
       | MichaelMoser123 wrote:
        | When is it advisable to turn off spectre/meltdown mitigations in
        | practice? My guess is that if you are on a server and not running
        | any user-supplied code then you are on the safe side; on
        | condition that you can exclude buffer overruns by running managed
        | code/Java or by using Rust.
        
         | toast0 wrote:
         | So the unspoken part of your question is when is it useful to
         | turn off mitigations. The answer to that is when your
         | application makes a lot of syscalls / when syscalls are a
         | bottleneck beyond the actual work of the syscalls.
         | 
         | This case, where it's all connection handling and serving a
          | small static piece of data, is a clear example; there's almost
         | no userland work to be done before it goes to another syscall
         | so any additional cost for the user/kernel barrier is going to
         | hurt.
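          | 
          | A crude way to see the per-syscall tax on a given box (my
          | sketch, not from the article): time a few million cheap
          | syscalls, then reboot with mitigations=off and compare.
          | 
          |     #include <stdio.h>
          |     #include <sys/syscall.h>
          |     #include <time.h>
          |     #include <unistd.h>
          | 
          |     int main(void)
          |     {
          |         const long n = 5000000;
          |         struct timespec a, b;
          |         clock_gettime(CLOCK_MONOTONIC, &a);
          |         for (long i = 0; i < n; i++)
          |             syscall(SYS_getpid); /* raw, avoids libc caching */
          |         clock_gettime(CLOCK_MONOTONIC, &b);
          |         double ns = (b.tv_sec - a.tv_sec) * 1e9
          |                   + (b.tv_nsec - a.tv_nsec);
          |         printf("%.0f ns/syscall\n", ns / n);
          |         return 0;
          |     }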
         | 
         | Then the question becomes who can run code on your server; also
          | considering maybe there's a remote code execution vulnerability
         | in your code, or library code you use. Is there a meaningful
         | barrier that spectre/meltdown mitigations would help enforce?
         | Or would getting RCE get control over everything of substance
         | anyway?
        
           | MichaelMoser123 wrote:
            | if you have an event-driven system then you end up with very
            | frequent system calls.
        
             | anarazel wrote:
             | Partially that can be amortized with io_uring... At the
             | cost of some complexity, of course.
        
       | 3gg wrote:
       | Very educational and well-written, thank you.
        
       | strawberrysauce wrote:
       | Your website is super snappy. I see that it has a perfect
       | lighthouse score too. Can you explain the stack you used and how
       | you set it up?
        
         | [deleted]
        
         | talawahtech wrote:
         | It is a statically generated site created with vitepress[1] and
         | hosted on Cloudflare Pages[2]. The only dynamic functionality
         | is the contact form which sends a JSON request to a Cloudflare
         | Worker[3], which in turn dispatches the message to me via
         | SNS[4].
         | 
         | It is modeled off of the code used to generate Vue blog[5], but
         | I made a ton of little modifications, including some changes
         | directly to vitepress.
         | 
         | Keep in mind that vitepress is very much an early work in
         | progress and the blog functionality is just kinda tacked on,
         | the default use case is documentation. It also definitely has
         | bugs and is under heavy development so wouldn't recommend it
         | quite yet unless you are actually interested in getting your
          | hands dirty with Vue 3. I am glad I used it because it gave me
         | an excuse to start learning Vue, but unless you are just using
         | the default theme to create a documentation site, it will
         | require some work.
         | 
         | 1. https://vitepress.vuejs.org/
         | 
         | 2. https://pages.cloudflare.com/
         | 
         | 3. https://workers.cloudflare.com/
         | 
         | 4. https://aws.amazon.com/sns/
         | 
          | 5. https://github.com/vuejs/blog
        
           | strawberrysauce wrote:
            | Thanks :). Found one flaw in your already crazy optimized
            | vitepress site - the images aren't cached :P
        
             | ricktdotorg wrote:
             | cf-cache-status: HIT
        
           | remram wrote:
            | On the other hand you could probably make the table of
            | contents always visible when the screen size allows it.
            | Clicking on the burger in the site menu to get a page-
            | specific sidebar is a bit counter-intuitive.
        
       | throwdbaaway wrote:
       | > EC2 X-factor?
       | 
       | > Even after taking all the steps above, I still regularly saw a
       | 5-10% variance in performance across two seemingly identical EC2
       | server instances
       | 
       | > To work around this variance, I tried to use the same instance
       | consistently across all benchmark runs. If I had to redo a test,
       | I painstakingly stopped/started my server instance until I got an
       | instance that matched the established performance of previous
       | runs.
       | 
        | We notice similar performance variance when running benchmarks on
       | GCP and Azure. In the worst case, there can be a 20% variance on
       | GCP. On Azure, the variance between identical instances is not as
       | bad, perhaps about 10%, but there is an extra 5% variance between
       | normal hours and off-peak hours, which further complicates
       | things.
       | 
       | It can be very frustrating to stop/start hundreds of times for
       | hours to get back an instance with the same performance
       | characteristic. For now, I use a simple bash for-loop that checks
       | the "CPU MHz" value from lscpu output, and that seems to be
       | reliable enough.
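        | 
        | In case it's useful, the check amounts to something like this
        | (a sketch of my script's logic as a small C program; the
        | "acceptable" threshold is instance-specific):
        | 
        |     #include <stdio.h>
        | 
        |     int main(void)
        |     {
        |         FILE *f = fopen("/proc/cpuinfo", "r");
        |         if (!f) { perror("fopen"); return 1; }
        |         char line[256];
        |         double mhz = 0;
        |         while (fgets(line, sizeof(line), f)) {
        |             /* lscpu's "CPU MHz" comes from here */
        |             if (sscanf(line, "cpu MHz : %lf", &mhz) == 1)
        |                 break;
        |         }
        |         fclose(f);
        |         printf("%.0f MHz\n", mhz);
        |         return 0;
        |     }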
        
         | Matumio wrote:
         | On AWS you can rent ".metal" instances which are probably more
         | stable for benchmarking. I tried this once for fun on a1.metal
         | because I wanted access to all hardware performance counters.
         | For that it worked. My computation was also running slightly
         | faster (something around 5% IIRC). But of course you'll have to
         | pay for all its cores and memory while you use it.
        
           | throwdbaaway wrote:
           | Yeah, that's exactly what the GCP engineer recommends, and
           | likely why the final benchmark in the article was done using
           | a c5n.9xlarge.
           | 
           | Still, there is no guarantee that after stopping the instance
           | on Friday evening, you would get back the same physical host
           | on Monday morning. So, while using dedicated hardware does
           | avoid the noisy neighbor problem, the "silicon lottery"
           | problem remains. And so far, the data that I gathered
           | indicates that the latter is the more likely cause, i.e. a
           | "fast" virtual machine would remain fast indefinitely, while
           | a "slow" virtual machine would remain slow indefinitely,
           | despite both relying on a bunch of shared resources.
        
         | jiggawatts wrote:
         | Why would you expect two different virtual machines to have
         | identical performance?
         | 
         | I would expect that _just_ the cache usage characteristics of
         | "neighbouring" workloads alone would account for at least a 10%
         | variance! Not to mention system bus usage, page table entry
         | churn, etc, etc...
         | 
         | If you need more than 5% accuracy for a benchmark, you
         | absolutely have to use dedicated hosts. Even then, just the
         | _temperature of the room_ would have an effect if you leave
         | Turbo Boost enabled! Not to mention the  "silicon lottery" that
         | all overclockers are familiar with...
         | 
         | This feels like those engineering classes where we had to
         | calculate stresses in every truss of a bridge to seven figures,
         | and then multiply by ten for safety.
        
           | throwdbaaway wrote:
           | I didn't expect identical performance, but a 10~20% variance
           | is just too big. For example, if
           | https://www.cockroachlabs.com/guides/2021-cloud-report/ got a
           | "slow" GCP virtual machine but a "fast" azure virtual
           | machine, the final result could totally flip.
           | 
           | The more problematic scenario, as mentioned in the article,
           | is when you need to do some sort of performance tuning that
            | can take weeks/months to complete. On the cloud, you either
            | have to keep the virtual machine running all the time (and
            | hope that a live migration doesn't happen behind the scenes
            | to move it to a different physical host), or do the painful
            | stop/start until you get back the "right" virtual machine
            | before proceeding to do the actual work.
           | 
           | We discovered this variance a couple of months ago. And this
           | article from talawah.io is actually the first time I have
            | seen anyone else mentioning it. It still remains a
           | mystery, because we too can't figure out what contributes to
           | the variance using tools like stress-ng, but the variance is
           | real when looking at MySQL commits/s metric.
           | 
           | > If you need more than 5% accuracy for a benchmark, you
           | absolutely have to use dedicated hosts.
           | 
           | After this ordeal, I am arriving at that conclusion as well.
           | Just the perfect excuse to build a couple of ryzen boxes.
        
             | jiggawatts wrote:
             | This is a bit like someone being mystified that their
             | arrival time at a destination across the city is not
             | repeatable to within plus-minus a minute.
             | 
             | There are traffic lights on the way! Other cars! Weather!
             | Etc...
             | 
             | I've heard that Google's internal servers (not GCP!) use
             | special features of the Intel Xeon processors to logically
             | partition the CPU caches. This enables non-prod workloads
             | to coexist with prod workloads with a minimal risk of cache
              | thrashing of the prod workload. IBM mainframes go further,
             | splitting at the hardware level, with dedicated expansion
             | slots and the like.
             | 
             | You can't reasonably expect 4-core _virtual_ machines to
             | behave identically to within 5% on a shared platform! That
             | tiny little VM is probably shoulder-to-shoulder with 6 or 7
             | other tenants on a 28 or 32 core processor. The host itself
             | is likely dual-socket, and some other VMs sizes may be
             | present, so up to 60 other VMs running on the same host.
             | All sharing memory, network, disk, etc...
             | 
             | The original article was also a network test. Shared
             | fabrics aren't going to return 100% consistent results
             | either. For that, you'd need a crossover cable.
        
               | throwdbaaway wrote:
               | Well, I'll be the first one to admit that I was naive to
               | expect <5% variance prior to this experience. But I guess
                | you are going too far by framing this as common wisdom?
               | 
               | In the HN discussion about cockroachdb cloud report 2021
               | (https://news.ycombinator.com/item?id=25811532), there
               | was only 1 comment thread that talks about "cloud
               | weather".
               | 
               | In https://engineering.mongodb.com/post/reducing-
               | variability-in..., high profile engineers still claimed
               | that it is perfectly fine to use cloud for performance
               | testing, and "EC2 instances are neither good nor bad".
               | 
               | Of course, both the cockroachdb and mongodb cases could
               | be related, as any performance variance at the instance
               | level could be masked when the instances form a cluster,
               | and the workload can be served by any node within the
               | cluster.
        
               | jiggawatts wrote:
               | You do have a point. I also have seen many benchmarks use
               | cloud instances without any disclaimers, and it always
               | made me raise an eyebrow quizzically.
               | 
               | Any such benchmark I do is averaged over a few instances
               | in several availability zones. I also benchmark
               | specifically in the local region that I will be deploying
               | production to. They're not all the same!
               | 
               | Where the cloud is useful for benchmarking is that it's
               | possible to spin up a wide range of "scenarios" at low
               | cost. Want to run a series of tests ranging from 1 to 100
               | cores in a single box? You can! That's very useful for
               | many kinds of multi-threaded development.
        
       | truth_seeker wrote:
       | Very impressive analysis. Thanks for sharing.
        
       | miohtama wrote:
        | How much headroom would there be if one were to use a unikernel
        | and skip the application space altogether?
        
       | cakoose wrote:
       | This was great!
       | 
       | Reminds me a lot of this classic CS paper: Improving IPC by
        | Kernel Design, by Jochen Liedtke (1993)
       | 
       | https://www.cse.unsw.edu.au/~cs9242/19/papers/Liedtke_93.pdf
        
       | 0xbadcafebee wrote:
       | Very well written, bravo. TOC and reference links makes it even
       | better.
        
       | baybal2 wrote:
        | Take note: no quick cheat like DPDK was used.
        | 
        | This shows that a regular Linux program using the Linux network
        | stack can approach something handcoded with DPDK.
        
       | Adiqq wrote:
        | Can anyone recommend similar articles/blogs that focus on
        | optimization of networking/computing in Linux/cloud environments?
        | This kind of article is very informative, because it refers to
        | advanced mechanisms that I either haven't heard about or never
        | saw in practical use.
        
       | the8472 wrote:
       | Since it's CPU-bound and spends a lot of time in the kernel would
       | compiling the kernel for the specific CPU used make sense? Or are
       | the CPU cycles wasted on things the compiler can't optimize?
        
         | talawahtech wrote:
         | Recompiling the kernel using profile guided optimizations[1] is
         | yet another thing on the (never-ending) to-do list.
         | 
         | 1. https://lwn.net/Articles/830300/
        
           | ta988 wrote:
           | Could you make a profile of just a bunch of functions on a
           | running system?
        
       | drenvuk wrote:
        | I'm of two minds with regard to this: this is cool, but unless
        | you have no authentication and no data to fetch remotely or on
        | disk, this is really just telling you what the ceiling is for
        | everything you could possibly run.
       | 
       | As for this article, there are so many knobs that you tweaked to
       | get this to run faster it's incredibly informative. Thank you for
       | sharing.
        
         | joshka wrote:
         | > this is really just telling you what the ceiling is
         | 
         | That's a useful piece of info to know when performance tuning a
         | real world app with auth / data / etc.
        
       | londons_explore wrote:
       | Some of these things could be fixed upstream and everyone see
       | real perf gains...
       | 
       | For example, having dhclient (a very popular dhcp client) leave
       | open an AF_PACKET socket causing a 3% slowdown in incoming packet
       | processing for all network packets seems... suboptimal!
       | 
       | Surely it can be patched to not cause a systemwide 3% slowdown
       | (or at least to only do it very briefly while actively refreshing
       | the DHCP lease)?
        
         | talawahtech wrote:
         | I would also love to see that dhclient issue resolved upstream,
         | or at least a cleaner way to work around it. But we should also
         | be mindful that for most workloads the impact is probably way,
         | way less.
         | 
         | Some of these things really only show up when you push things
         | to their extremes, so it probably just wasn't on the
         | developer's radar before.
        
           | lttlrck wrote:
           | I believe systemd-networkd has its own implementation of DHCP
           | and therefore doesn't use dhclient. But I wonder if it's
           | behavior is any better in this respect.
           | 
           | This has piqued my interest.
        
             | mercora wrote:
             | systemd-networkd keeps open that kind of socket for LLDP
             | but apparently not for the DHCP client code. wpa_supplicant
              | also keeps open this type of socket on my local system, and
              | the dhcpd daemons on my routers have some of those too for
              | each interface...
              | 
              | I wonder if the slow path here could be avoided by using
              | separate network namespaces in a way where these sockets
              | don't even get to see the packets...
        
               | lttlrck wrote:
               | Interesting.
               | 
               | Looks like LLDP can be switched off in the network
               | config.
               | 
               | https://systemd.network/systemd.network.html
        
         | zokier wrote:
         | Specifically on EC2 I don't think you actually need to keep
         | dhcp client running anyways, afaik EC2 instance ips are static
         | so you can just keep using the one you got on boot.
        
       | SaveTheRbtz wrote:
       | The analysis itself is quite impressive: a very systematic top-
       | down approach. We need more people doing stuff like this!
       | 
        | But! Be careful applying tunables from the article "as-is"[1]:
        | some of them would destroy TCP performance:
        | 
        |     net.ipv4.tcp_sack=0
        |     net.ipv4.tcp_dsack=0
        |     net.ipv4.tcp_timestamps=0
        |     net.ipv4.tcp_moderate_rcvbuf=0
        |     net.ipv4.tcp_congestion_control=reno
        |     net.core.default_qdisc=noqueue
       | 
        | Not to mention that `gro off` will bump CPU usage by ~10-20% on
        | most real-world workloads, the Security Team would really be
        | against turning off mitigations, and usage of `-march=native`
        | will cause a lot of core dumps in heterogeneous production
        | environments.
       | 
       | [1] This is usually the case with single purpose micro-
       | benchmarks: most of the tunables have side effects that may not
       | be captured by a single workflow. Always verify how the "tunings"
       | you found on the internet behave in _your_ environment.
        
       | habibur wrote:
        | That can be done with HTTP. But right now it's all HTTPS,
        | especially when you are serving APIs over the Internet.
        | 
        | And once I switch to HTTPS I see a dramatic drop in throughput,
        | like 10x.
        | 
        | An HTTP server doing 15k req/sec drops down to 400 req/sec once
        | I start serving it over HTTPS.
        | 
        | I see no solution to it as everything has to be HTTPS now.
        
         | astrange wrote:
          | HTTPS, especially TLS 1.3, is not slow. x86 has had AES
          | acceleration since 2010.
         | 
         | It might need different tuning or you might be negotiating a
         | slow cipher.
        
           | ComputerGuru wrote:
           | The SSL handshake (which affects TTFB) isn't AES.
        
             | astrange wrote:
              | Right, but TLS 1.3 improves that, especially with 0-RTT.
             | Before that you had things like session resumption for
             | repeat clients, or if your server was overloaded you could
             | use an external HTTPS proxy.
        
       | micropoet wrote:
       | Impressive stuff
        
       | jart wrote:
       | > Disabling [spectre] mitigations gives us a performance boost of
       | around 28%
       | 
       | Every couple months these last several years there always seems
       | to be some bug where the fix only costs us 3% performance. Since
       | those tiny performance hits add up over time, security is sort of
       | like inflation in the compute economy. What I want to know is how
        | high can we make that 28% go? The author could likely build a
        | custom kernel that turns off stuff like PIE, ASLR, retpoline,
        | etc., which would yield perhaps another 10%. Can anyone think of
        | anything else?
        
         | ronsor wrote:
         | Most of these mitigations are worse than useless in an
         | environment not executing untrusted code. Simply put, if you
         | have a dedicated server and you aren't running user code, you
         | don't need them.
        
           | vlz wrote:
           | But of course other exploits (e.g. in your webapp) might lead
           | to "running user code" where you didn't expect it and then
           | the mitigations could prevent privilege escalation, couldn't
           | they?
        
             | vladvasiliu wrote:
             | But if you have a dedicated server for your web app, if
             | there's some kind of exploit in it allowing for random code
             | to be run, said code already has access to everything it
             | needs, right?
             | 
             | The interesting data will probably be whatever secrets the
             | app handles, say database credentials, so the attacker is
             | off to the races. They probably don't care about having
             | root in particular.
        
               | ex_amazon_sde wrote:
               | > if there's some kind of exploit in it allowing for
               | random code to be run, said code already has access to
               | everything it needs
               | 
               | On the same host there could be SSL certificates,
               | credentials in a local MTA, credentials used to run
               | backups and so on.
               | 
               | Or the application itself could be made of multiple
               | components where the vulnerable one is sandboxed.
        
               | vladvasiliu wrote:
               | All those points are true - though I'd argue this is
               | stretching the "one app per VM" thing -, but I guess this
               | is just the usual case of understanding your situation
               | and realizing there's no one size fits all.
               | 
                | My take on this question is rather that there shouldn't
                | be any dogma around this, such as treating disabling
                | mitigations as absolutely, 100% harmful, something that
                | should never, ever be done.
               | 
               | In the context of the OP, where the application is
               | running on AWS, backups, email, etc are all likely to be
               | handled either externally (say EBS snapshots) in which
               | case there's no issue, or via "trusting the machine", so
               | getting credentials via the instance role which every
               | process on the VM can do, so no need for privilege
               | escalation.
               | 
               | So I guess if you trust EC2 or Task roles or similar (not
               | familiar with EKS) to access sensitive data and only run
               | a "single" application, there's likely little to no
               | reason to use the mitigations.
               | 
               | But, yeah, if you're running an application with multiple
               | components, each in their own processes and don't use
               | instance roles for sensitive access, maybe leave them on.
               | Also, maybe, this means you're not running a single app
               | per vm?
        
               | ex_amazon_sde wrote:
               | Why "app"? These are services.
               | 
               | > there shouldn't be any dogma around this
               | 
               | Like everything in security, it's about tradeoffs.
               | 
               | > Also, maybe, this means you're not running a single app
               | per vm?
               | 
               | This is an argument for unikernels.
               | 
               | Instead, on 99.9% of your services you want to run
               | multiple independent processes, especially in a
               | datacenter environment: your service, web server, sshd,
               | logging forwarder, monitoring daemon, dhcp client, NTP
               | client, backup service.
               | 
               | Often some additional "bigcorp" services like HIDS,
               | credential provider, asset management, power management,
               | deployment tools.
        
               | vladvasiliu wrote:
               | > Why "app"? These are services.
               | 
               | Yes, but I was using my initial post's parent's
               | terminology. But I agree, in my mind, the subject was one
               | single "service", as in process (or a process hierarchy,
               | like say with gunicorn for python deployments).
               | 
               | > This is an argument for unikernels.
               | 
               | It is. And I'm also very interested in the developments
                | around Firecracker and similar technologies. If we'd be
               | able to have the kind of isolation AWS promises between
               | ec2 instances on a single physical machine, while at the
               | same time being able to launch a process in an isolated
               | container as easy as with docker right now, I'd consider
               | that really great. And all the other "infrastructure"
               | services you talk about could just live their lives in
               | their dedicated containers.
               | 
               | Not sure how all this would compare, performance-wise,
               | with just enabling the mitigations.
        
         | astrange wrote:
         | PIE and ASLR are free on x86-64, unless someone has a bad ABI I
         | don't know of. Spectre mitigations are also free or not needed
         | on new enough hardware.
         | 
         | Many security changes also help you find memory corruption
         | bugs, which is good for developer productivity.
        
         | seoaeu wrote:
         | The puzzling thing was that spectre V2 mitigations were cited
         | as the main culprit. They were responsible by themselves for a
         | 15-20% slowdown, which is about an order of magnitude worse
         | than in my experience. I wonder if the system had IBRS enabled
          | instead of using retpolines as the mitigation strategy?
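          | 
          | For anyone checking their own systems, the kernel reports the
          | active strategy in sysfs (a trivial sketch; `cat` on the same
          | path works just as well):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void)
          |     {
          |         FILE *f = fopen("/sys/devices/system/cpu"
          |                         "/vulnerabilities/spectre_v2", "r");
          |         if (!f) { perror("fopen"); return 1; }
          |         char buf[256];
          |         /* e.g. "Mitigation: Full generic retpoline, ..." */
          |         if (fgets(buf, sizeof(buf), f))
          |             fputs(buf, stdout);
          |         fclose(f);
          |         return 0;
          |     }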
        
         | imhoguy wrote:
          | I am not that deep in SecOps these days and would gladly hear
          | the opinion of an expert:
         | 
         | Can disabling these mitigations bring any risks assuming the
         | server is sending static content to the Internet over port
         | 80/443 and it is practically stateless with read-only file
         | system?
        
           | syoc wrote:
           | I am not an expert but you shall have my take either way. The
           | most important question here is "Am I executing arbitrary
           | untrusted code?". HTTP servers will parse the incoming
            | requests, so untrusted input is processed to some extent.
            | But I would not worry about it unless there is some backend
            | application doing more involved processing with the data.
            | repl.it should not disable mitigations.
        
         | jiggawatts wrote:
         | Does anyone know of a quick & easy PowerShell script I can run
         | on Windows servers to disable Spectre mitigations?
         | 
         | The last time I looked I found a lot of waffle but no simple
         | way I can just turn that stuff off...
        
       | bigredhdl wrote:
       | I really like the "Optimizations That Didn't Work" section. This
       | type of information should be shared more often.
        
       | Thaxll wrote:
       | There was a similar article from Dropbox years ago:
       | https://dropbox.tech/infrastructure/optimizing-web-servers-f...
       | still very relevant
        
       | 120bits wrote:
       | Very well written.
       | 
       | - I have a nodejs server for the APIs and it's running on an
       | m5.xlarge instance. I haven't done much research on what
       | instance type I should go for. I looked it up and it seems
       | like c5n.xlarge (mentioned in the article) is compute
       | optimized. The cost difference between m5.xlarge and
       | c5n.xlarge isn't much. So I'm assuming that switching to a c5
       | instance would be better, right?
       | 
       | - Is having nginx handle requests a better option here, set up
       | as a reverse proxy for NodeJS? I'm thinking of taking small
       | steps on scaling an existing framework.
        
         | talawahtech wrote:
         | Thanks!
         | 
         | The c5 instance type is about 10-15% faster than the m5, but
         | the m5 has twice as much memory. So if memory is not a concern
         | then switching to c5 is both a little cheaper and a little
         | faster.
         | 
         | You shouldn't need the c5n, the regular c5 should be fine for
         | most use cases, and it is cheaper.
         | 
         | Nginx in front of nodejs sounds like a solid starting point,
         | but I can't claim to have a ton of experience with that combo.
        
         | danielheath wrote:
         | For high level languages like node, the graviton2 instances
         | offer vastly cheaper cpu time (as in, 40%). That's the m6g /
         | c6g series.
         | 
         | As in all things, check the results on your own workload!
        
         | [deleted]
        
         | nodesocket wrote:
         | m5 has more memory; if your application is memory bound,
         | stick with that instance type.
         | 
         | I'd recommend just using a standard AWS application load
         | balancer in front of your Node.js app. Terminate SSL at the ALB
         | as well using certificate manager (free). Will run you around
         | $18 a month more.
        
       | secondcoming wrote:
       | Fantastic article. Disabling spectre mitigations on all my team's
       | GCE instances is something I'm going to check out.
       | 
       | Regarding core pinning, the usual advice is to pin to the CPU
       | socket physically closest to the NIC. Is there any point doing
       | this on cloud instances? Your actual cores could be anywhere. So
       | just isolate one and hope for the best?
        
         | brobinson wrote:
         | There are a bunch more mitigations that can be disabled than
         | he disables in the article. I usually refer to
         | https://make-linux-fast-again.com/
        
           | atatatat wrote:
           | Make Linux Even More Insecure Again
        
           | mappu wrote:
           | In this list, mitigations=off implies all the others.
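           | 
           | A rough way to verify after a reboot that the flag took
           | effect is to walk the sysfs vulnerabilities directory; a
           | minimal sketch, assuming that interface is present:
           | 
           |     import os
           | 
           |     VULNS = "/sys/devices/system/cpu/vulnerabilities"
           | 
           |     # With mitigations=off, entries should read
           |     # "Vulnerable" (or "Not affected") rather than
           |     # "Mitigation: ...".
           |     for name in sorted(os.listdir(VULNS)):
           |         with open(os.path.join(VULNS, name)) as f:
           |             print(name, "->", f.read().strip())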
        
         | halz wrote:
         | Pinning to the physically closest core is a bit misleading.
         | Take a look at output from something like `lstopo`
         | [https://www.open-mpi.org/projects/hwloc/], where you can
         | filter pids across the NUMA topology and trace which components
         | are routed into which nodes. Pin the network-based workloads
         | to the corresponding NUMA node, and keep other processes off
         | the CPUs that service the IRQs driving the NIC.
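         | 
         | A minimal sketch of the first half of that (pinning a
         | process to its NIC's NUMA node; "eth0" is just a placeholder
         | interface name, and the IRQ side is left out):
         | 
         |     import os
         | 
         |     NIC = "eth0"  # hypothetical interface name
         | 
         |     # The PCI device behind the NIC reports its NUMA node
         |     # (-1 if unknown).
         |     with open(f"/sys/class/net/{NIC}/device/numa_node") as f:
         |         node = int(f.read())
         | 
         |     if node >= 0:
         |         # cpulist looks like "0-3,8-11"; expand to CPU ids.
         |         with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
         |             parts = f.read().strip().split(",")
         |         cpus = set()
         |         for part in parts:
         |             lo, _, hi = part.partition("-")
         |             cpus.update(range(int(lo), int(hi or lo) + 1))
         |         # Pin the current process to the NIC's NUMA node.
         |         os.sched_setaffinity(0, cpus)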
        
           | ricktdotorg wrote:
           | wow, i had wondered about pinning in the cloud. this is a
           | fantastic tip - thank you!
        
       | nhoughto wrote:
       | I'd love to have the time (and ability!) to do this level of
       | digging. Amazing write up too, very well presented.
        
       | ArtWomb wrote:
       | Wow. Such impressive bpftrace skill! Keeping this article under
       | my pillow ;)
       | 
       | Wonder where the next optimization path leads? Using huge memory
       | pages. io_uring, which was briefly mentioned. Or kernel bypass,
       | which is supported on c5n instances as of late...
        
         | ta988 wrote:
         | Kernel bypass?
        
       | brendangregg wrote:
       | Great work, thanks for sharing! Systems performance at its best.
       | Nice to see the use of the custom palette.map (I forget to do
       | that myself and I often end up hacking in highlights in the Perl
       | code.)
       | 
       | BTW, those disconnected kernel stacks can probably be reconnected
       | with the user stacks by switching out the libc for one with frame
       | pointers; e.g., the new libc6-prof package.
        
         | talawahtech wrote:
         | Thank you for sharing all your amazing tools and resources
         | brendangregg! I wouldn't have been able to do most of these
         | optimizations without FlameGraph and bpftrace.
         | 
         | I actually did the same thing and hacked up the perl code to
         | generate my custom palette.map.
         | 
         | Thanks for the tip re: the disconnected kernel stacks. They
         | actually kinda started to grow on me for this experiment,
         | especially since most of the work was on the kernel side.
        
         | anarazel wrote:
         | Is libc6-prof just glibc recompiled with -fno-omit-frame-
         | pointer? I did that a couple times and found that while that
         | fixes a few system calls, it doesn't fix all of them. I think
         | the main issue was several syscalls being called from asm,
         | which, unsurprisingly, isn't affected by -fno-omit-frame-
         | pointer.
        
       | sigg3 wrote:
       | I'm digging the website layout. What's the CSS framework he's
       | using? I'm on mobile and can't see the source.
        
       | bbeausej wrote:
       | Thank you for the amazing article and detailed insights. Great
       | writing style and approaches.
       | 
       | How long did you spend researching this subject to produce such
       | an in depth report?
        
         | talawahtech wrote:
         | Hard to say exactly. I have been working on this in my spare
         | time, but pretty consistently since covid-19 started. A lot of
         | this was new to me, so it wasn't all as straightforward as
         | it seems in the blog.
         | 
         | As a ballpark I would say I invested hundreds of hours in this
         | experiment. Lots of sidetracks and dead ends along the way, but
         | also an amazing learning experience.
        
       | diroussel wrote:
       | Did you consider wrk2?
       | 
       | https://github.com/giltene/wrk2
       | 
       | Maybe you duplicated some of these fixes?
        
         | ikoveshnik wrote:
         | I really like that wrk2 lets you configure a fixed request
         | rate; latency measurement works much better in that case.
         | But wrk2 itself has bugs that prevent using it in more
         | complicated cases, e.g. lua scripts don't work properly.
        
         | talawahtech wrote:
          | Yea, I looked at wrk2 but it was a no-go right out the gate.
         | From what I recall the changes to handle coordinated omission
         | use a timer that has a 1ms resolution. So basically things
         | broke immediately because all requests were under 1ms.
        
           | throwdbaaway wrote:
           | If I understand correctly, coordinated omission handling
           | only matters if the benchmark is done at a fixed RPS,
           | right? In this case, it looks like a closed-model
           | benchmark, where a fixed number of client threads just go
           | as fast as they can.
           | 
           | edit: Oh, perhaps wrk2 still relies on the timer even when
           | not specifying a fixed rate RPS.
        
           | skyde wrote:
           | so twrk doesn't handle coordinated omission or you found a
           | different way to do it?
        
             | talawahtech wrote:
             | I didn't make any coordinated omission changes (I really
             | didn't make many changes in general), so twrk does what wrk
             | does. It attempts to correct it after the fact by looking
             | for requests that took twice as long as average and doing
             | some backfilling[1].
             | 
             | I am no expert where coordinated omission is concerned, but
             | my understanding is that it is most problematic in
             | scenarios where your p90+ latency is high. Looking at the
             | results for the 1.2M req/s test you have the following
             | latencies:
             | 
             |     p50     203.00us
             |     p90     236.00us
             |     p99     265.00us
             |     p99.99  317.00us
             |     pMAX    626.00us
             | 
             | If you were to apply wrk's coordinated omission hack to
             | these results, the backfilling only starts for requests that
             | took longer than p50 x 2 (roughly) = 406us, which is
             | probably somewhere between p99.999 and pMAX; a very, very
             | small percentage.
             | 
             | I am not claiming that wrk's hack is "correct", just that I
             | don't think coordinated omission is a major concern for
             | *this specific workload/environment*
             | 
             | 1. https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513...
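             | 
             | A rough Python sketch of that backfilling idea
             | (paraphrasing wrk's stats correction from memory, not
             | the exact code):
             | 
             |     # Any sample more than 2x the expected interval is
             |     # assumed to have blocked later requests; synthetic
             |     # samples are backfilled at interval-sized steps.
             |     def backfill(samples, expected):
             |         corrected = list(samples)
             |         for latency in samples:
             |             synthetic = latency - expected
             |             while synthetic > expected:
             |                 corrected.append(synthetic)
             |                 synthetic -= expected
             |         return corrected
             | 
             |     # e.g. backfill([250, 980], expected=210) leaves 250
             |     # alone but adds samples at 770, 560 and 350 for the
             |     # 980us outlier.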
        
       | paracyst wrote:
       | I don't have anything to add to the conversation other than to
       | say that this is fantastic technical writing (and content too).
       | Most of the time, when similar articles like this one are posted
       | to company blogs, they bore me to tears and I can't finish them,
       | but this is very engaging and informative. Cheers
        
         | talawahtech wrote:
         | Thanks, that actually means a lot. It took a lot of work, not
         | just on the server/code, but also the writing. I asked a lot of
         | people to review it (some multiple times) and made a ton of
         | changes/edits over the last couple months.
         | 
         | Thanks again to my reviewers!
        
       | specialist wrote:
       | What is the theoretical max req/s for a 4 vCPU c5n.xlarge
       | instance?
        
         | talawahtech wrote:
         | There is no published limit, but based on my tests the network
         | device for the c5n.xlarge has a hard limit of 1.8M pps (which
         | translates directly to req/s for small requests without
         | pipelining).
         | 
         | There is also a quota system in place, so even though that is
         | the hard limit, you can only operate at those speeds for a
         | short time before you start getting rate-limited.
        
           | specialist wrote:
           | Improving from 12.4% to 66.6% of the theoretical max (1.2M
           | req/s against the 1.8M pps ceiling) is kinda amazing.
           | 
           | Presenting it this way may help noobs like me with
           | capacity planning.
        
       | romanitalian wrote:
       | Did you compare with Japronto?
        
       | romanitalian wrote:
       | Have you seen Japronto [https://github.com/squeaky-pl/japronto]?
        
       ___________________________________________________________________
       (page generated 2021-05-21 23:02 UTC)