https://dgroshev.com/blog/bufferbloat/
Published: 2024-01-22

Unbloating the buffers

Here is a square.

[Picture of a square]

Wait, it's not just a square.

[Same square annotated as my computer]

There's more of those.

[Two squares, one says my computer, the other says your computer]

They feel lonely. Can we help them?

[Same as before, but with a double sided arrow saying network cable]

Yes! See, they are now connected and can talk to each other. Much better!

Except... Oh no.

As you awoke one morning from uneasy dreams you found yourself transformed in your bed into a software engineer. A calamity that I find all too familiar. Now it's on us to figure out how that network should work.

With time and effort, you and I can come up with a protocol that will let bytes go through that cable. We can even make it decently quick and send pictures of exotic tropical fish to each other. It's relatively easy: both sides of the cable are under our control, so we can agree on the speed of exchange, making sure that the sending computer does not overload the receiving one. All's good, and the sun is shining.

Except... Oh no.

[A network of computers in the same style as before]

I have more than one cable plugged into my computer. You have an ongoing amphibian emergency that can only be helped by receiving frog pictures. I don't have any, but someone in my network has them.

[Same network, but one link is marked red]

Computers are now interconnected into a large network, and all links are different. A family of magpies made a nest in Ben's network cabinet, so every time the adult magpies bring food back, the link between me and Ben slows down.

Ben's a smart guy. He knows that electrons are fickle, and the connection speed varies, so he adds a buffer to his computer. If his link to me is slow, he can still receive more from Li, filling the buffer, and transmit it further as fast as he can.
This way, the bandwidth is utilised as efficiently as possible, and transient spikes of traffic or slowdowns are buffered.

Let's zoom in.

[Diagram of three computers and two links, with schematic buffers inside the computers]

Look closely: magpie parents are working hard, and the buffer is full. Packets trickle through, but they can only get into the buffer as fast as it is emptied on the other side. Ben's computer asks Li's to slow down.

Now imagine that there is more than one type of packet. Some of the packets belong to a voice call that is happening over the same link!

[Same diagram, but some packets are marked yellow]

Voice packets need to get through quickly, but instead they wait in a queue that isn't going anywhere. The buffer no longer helps throughput: it isn't absorbing a transient spike anymore, it just sits there full. The buffer is actively harmful -- it adds a delay! The entire link from Li to me is not only at maximum throughput, it's also slow to react. The latency gets higher for no benefit.

This is called bufferbloat.

Bufferbloat is so common that we don't even think about it. Of course video calls drop out and glitch when you download something, duh! The internet is working hard, it's only natural that it's a bit slow to react.

It doesn't have to be this way^1.

How bad is bufferbloat, really?

I tested my home connection with Flent^2:

[Flent graph before configuring CAKE]

Here is what happens on this graph:

* Flent starts pinging the remote server (both via ICMP and UDP)
* five seconds into the test, Flent starts uploading and downloading in parallel
* Flent runs multiple TCP streams with different priorities in parallel. BE/BK/CS5/EF are different priorities
* both bandwidth and ping are charted five times a second

Some conclusions:

* the download is pretty stable at 75Mbit per stream
* upload is all over the place, fluctuating between 5Mbit and 10Mbit per stream
* ping rises from a few ms to almost 300ms -- 0.3s just for one packet to go there and back again!
That makes video calls unusable
* priorities are ignored

I also tested pure, non-parallel bandwidth to the same server with iperf3. On my connection, iperf3 shows a total speed of 305Mbit down/48Mbit up. Compared with the numbers above, total download matches Flent precisely, but upload slows down 2-4 times because of the parallel download.

This is not ideal.

What can I do about it?

Let's look again at what happens when I'm trying to download something from Li:

[A diagram of three computers and two links annotated with their bandwidths. One link is much faster than the other]

The reason why packets bunch up is that there is a bandwidth bottleneck on the way. If I could just sign up for a faster contract, the problem would go away, right? To an extent this is true: a faster connection makes it harder to saturate the network. However, my provider still has a much faster network on their side, so the problem is still there, it's just less likely to occur.

If only there was a way to adjust those queues dynamically, shrinking them if the network is saturated!

This is precisely what Active Queue Management (AQM) does^3.

CAKE is real

There are many AQM implementations, but the most modern, performant, non-fiddly one is CAKE (a very tortured backronym).

Better internet is one configuration option away^4:

[Flent graph after configuring CAKE]

I traded about 14% of bandwidth (263Mbit down/41Mbit up per iperf3, down from 305Mbit/48Mbit) for:

* constant average bandwidth on both upload and download
* no impact of download on upload
* network load has no visible impact on latency
* effective traffic prioritisation

I believe I can be less conservative with the bandwidth and fiddle with CAKE settings more, but I am happy with this trade as it is.

Here's the entirety of VyOS^5 config changes^6:

    interfaces {
        input ifb0 {
        }
        pppoe pppoe0 {
            ...
            redirect ifb0
        }
    }
    qos {
        interface ifb0 {
            egress CAKE-WAN-IN
        }
        interface pppoe0 {
            egress CAKE-WAN-OUT
        }
        policy {
            cake CAKE-WAN-IN {
                bandwidth 280mbit
                flow-isolation {
                    nat
                }
                rtt 30
            }
            cake CAKE-WAN-OUT {
                bandwidth 45mbit
                flow-isolation {
                    nat
                }
                rtt 30
            }
        }
    }

No more frog picture delays!

[Wikipedia's picture of a common frog (Rana temporaria)]

---------------------------------------------------------------------

^1 Chances are, your connection is faster than 30Mbit up/down. You only need less than 10% of it for a perfectly good Zoom call. If bandwidth is split equally, it should be pretty hard for a home network to not be able to support a video call. Yet in practice, latency-sensitive traffic glitches all the time when the network is busy! People often think it's the lack of bandwidth, but it's usually the extra latency caused by bufferbloat.

^2 If you want to try this at home, I recommend bringing your own Flent test server in the closest AWS data centre instead of the ones provided by Flent. Also, install fping instead of using netperf's native pinger.

^3 I'm glossing over a lot of complexity here. For one, upload and download are very different. While my router can delay upload, it doesn't really control download traffic that comes from my provider -- when a packet falls out of my provider's fibre, it's already here, the router can't push it back. It can only signal to the sender that they need to slow down. However, in practice a Good Enough AQM implementation can throttle senders effectively using ECN and packet drops, so it's alright to assume that AQM can control senders.

^4 That, and a beefy enough router to do AQM at your network speed.

^5 I really like VyOS on my home router. It's good old Debian adapted for routing and coming with a handy configuration tool. Don't be scared by the "official" pricing, it's free to use non-commercially on the rolling release or if you're happy to build it yourself with just a few commands.
^6 VyOS supports CAKE, it's just not documented yet. The ifb0 business is described here.
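
To put a rough number on why a full buffer hurts latency, here is a back-of-the-envelope sketch in Python. It is not how Flent or CAKE compute anything. It only assumes a single first-in-first-out buffer draining at a fixed rate, so a new packet waits for everything queued ahead of it. The 48Mbit uplink and the 300ms figure come from the measurements above. The function names and the buffer sizes in the loop are made up for illustration.

```python
"""Back-of-the-envelope bufferbloat arithmetic (an illustrative sketch)."""


def queue_delay_ms(queued_bytes: float, drain_rate_mbit: float) -> float:
    """Milliseconds a packet waits behind `queued_bytes` in a FIFO buffer
    that drains at `drain_rate_mbit` megabits per second."""
    queued_bits = queued_bytes * 8
    return queued_bits / (drain_rate_mbit * 1_000_000) * 1000


def bytes_for_delay(delay_ms: float, drain_rate_mbit: float) -> float:
    """Inverse of queue_delay_ms: bytes of backlog that cause `delay_ms`."""
    return delay_ms / 1000 * drain_rate_mbit * 1_000_000 / 8


if __name__ == "__main__":
    uplink_mbit = 48  # upload speed reported by iperf3 in the post

    # Hypothetical buffer sizes, picked only to show the trend.
    for size_kb in (64, 256, 1024, 4096):
        delay = queue_delay_ms(size_kb * 1024, uplink_mbit)
        print(f"{size_kb:>5} KiB queued -> {delay:7.1f} ms of added delay")

    # Backlog that would explain ~300 ms if it all sat in one 48 Mbit queue
    # (a simplifying assumption: the real delay is spread over several hops).
    backlog = bytes_for_delay(300, uplink_mbit)
    print(f"300 ms at {uplink_mbit} Mbit ~= {backlog / 1_000_000:.1f} MB queued")
```

Under that single-queue assumption, 300ms of extra ping on a 48Mbit link corresponds to roughly 1.8MB of packets sitting in a buffer somewhere. The delay grows linearly with the backlog, which is why shrinking the queue when the link is saturated, as an AQM does, brings latency back down without needing more bandwidth.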