[HN Gopher] It's always TCP_NODELAY
       ___________________________________________________________________
        
       It's always TCP_NODELAY
        
       Author : todsacerdoti
       Score  : 333 points
       Date   : 2024-05-09 17:54 UTC (5 hours ago)
        
 (HTM) web link (brooker.co.za)
 (TXT) w3m dump (brooker.co.za)
        
       | theamk wrote:
        | I don't buy the reasoning for never needing Nagle anymore.
        | Sure, telnet isn't a thing today, but I bet there are still
        | plenty of apps which do the equivalent of:
        | 
        |     write(fd, "Host: ")
        |     write(fd, hostname)
        |     write(fd, "\r\n")
        |     write(fd, "Content-type: ")
        |     etc...
        | 
        | This may not be 40x overhead, but it'd still be 5x or so.
        
         | otterley wrote:
         | Marc addresses that: "That's going to make some "write every
         | byte" code slower than it would otherwise be, but those
         | applications should be fixed anyway if we care about
         | efficiency."
        
         | Arnt wrote:
         | Those aren't the ones you debug, so they won't be seen by OP.
         | Those are the ones you don't need to debug because Nagle saves
         | you.
        
         | rwmj wrote:
         | The comment about telnet had me wondering what openssh does,
         | and it sets TCP_NODELAY on every connection, even for
         | interactive sessions. (Confirmed by both reading the code and
         | observing behaviour in 'strace').
        
           | c0l0 wrote:
           | _Especially_ for interactive sessions, it absolutely should!
           | :)
        
             | syncsynchalt wrote:
             | Ironic since Nagle's Algorithm (which TCP_NODELAY disables)
             | was invented for interactive sessions.
             | 
             | It's hard to imagine interactive sessions making more than
             | the tiniest of blips on a modern network.
        
         | temac wrote:
          | Fix the apps. Nobody expects magical perf if you do that when
          | writing to files, even though the OS also has its own
          | buffers. There is no reason to expect otherwise when writing
          | to a socket, and Nagle already doesn't save you from syscall
          | overhead anyway.
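          | 
          | A minimal sketch of the fix in C (hypothetical field names,
          | untested): build the headers in a userspace buffer and hand
          | the kernel one write per message instead of one per
          | fragment.
          | 
          |     #include <stdio.h>
          |     #include <unistd.h>
          | 
          |     /* One write per request instead of one per header
          |        fragment; no reliance on Nagle to coalesce, and far
          |        fewer syscalls. */
          |     ssize_t send_request(int fd, const char *hostname)
          |     {
          |         char buf[1024];
          |         int n = snprintf(buf, sizeof(buf),
          |                          "Host: %s\r\n"
          |                          "Content-type: text/plain\r\n"
          |                          "\r\n",
          |                          hostname);
          |         if (n < 0 || (size_t)n >= sizeof(buf))
          |             return -1;   /* formatting error or truncation */
          |         return write(fd, buf, (size_t)n);
          |     }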
        
           | toast0 wrote:
           | Nagle doesn't save the derpy side from syscall overhead, but
           | it would save the other side.
           | 
           | It's not just apps doing this stuff, it also lives in system
           | libraries. I'm still mad at the Android HTTPS library for
           | sending chunked uploads as so many tinygrams. I don't
           | remember exactly, but I think it's reasonable packetization
           | for the data chunk (if it picked a reasonable size anyway),
           | then one packet for \r\n, one for the size, and another for
           | another \r\n. There's no reason for that, but it doesn't hurt
           | the client enough that I can convince them to avoid the
           | system library so they can fix it and the server can manage
           | more throughput. Ugh. (It might be that it's just the TLS
           | packetization that was this bogus and the TCP packetization
           | was fine, it's been a while)
           | 
           | If you take a pcap for some specific issue, there's always so
           | many of these other terrible things in there. </rant>
        
           | meinersbur wrote:
            | Those are the apps that are quickly written and do not care
            | if they unnecessarily congest the network. The ones that do
            | get properly maintained can set TCP_NODELAY. Seems like a
            | reasonable default to me.
        
           | ale42 wrote:
           | Apps can always misbehave, you never know what people
           | implement, and you don't always have source code to patch. I
           | don't think the role of the OS is to let the apps do whatever
           | they wish, but it should give the possibility of doing it if
            | it's needed. So I'd rather say: if you know you're doing
            | things properly and you're latency sensitive, just set
            | TCP_NODELAY on all your sockets and you're fine, and nobody
            | will blame you for doing it.
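            | 
            | For reference, setting it is a one-liner in C (a sketch;
            | error handling beyond the return value omitted):
            | 
            |     #include <netinet/in.h>
            |     #include <netinet/tcp.h>
            |     #include <sys/socket.h>
            | 
            |     /* Disable Nagle's algorithm on an existing TCP
            |        socket. Returns 0 on success, -1 on error. */
            |     int set_nodelay(int fd)
            |     {
            |         int one = 1;
            |         return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
            |                           &one, sizeof(one));
            |     }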
        
         | grishka wrote:
         | And they really shouldn't do this. Even disregarding the
         | network aspect of it, this is still bad for performance because
         | syscalls are kinda expensive.
        
         | jrockway wrote:
         | Does this matter? Yes, there's a lot of waste. But you also
         | have a 1Gbps link. Every second that you don't use the full
         | 1Gbps is also waste, right?
        
           | tedunangst wrote:
           | This is why I always pad out the end of my html files with a
           | megabyte of &nbsp;. A half empty pipe is a half wasted pipe.
        
             | dessimus wrote:
             | Just be sure HTTP Compression is off though, or you're
             | still half-wasting the pipe.
             | 
             | Better to just dump randomized uncompressible data into
             | html comments.
        
             | arp242 wrote:
             | I am finally starting to understand some of these
             | OpenOffice/LibreOffice commit messages like
             | https://github.com/LibreOffice/core/commit/a0b6744d3d77
        
         | eatonphil wrote:
         | I imagine the write calls show up pretty easily as a bottleneck
         | in a flamegraph.
        
           | wbl wrote:
            | They don't. Maybe if you're really good you notice the
            | higher overhead, but you expect to be spending time writing
            | to the network. The actual impact shows up when bandwidth
            | consumption is way up due to packet and TCP header
            | overhead, which won't show on a flamegraph that easily.
        
         | silisili wrote:
         | We shouldn't penalize the internet at large because some
         | developers write terrible code.
        
           | littlestymaar wrote:
            | Isn't that how SMTP works, though?
        
             | leni536 wrote:
             | No?
        
         | loopdoend wrote:
         | Ah yeah I fixed this exact bug in net-http in Ruby core a
         | decade ago.
        
         | tptacek wrote:
         | The discussion here mostly seems to miss the point. The
          | argument is to _change the default_, not to eliminate the
         | behavior altogether.
        
         | the8472 wrote:
          | Shouldn't autocorking help even without Nagle?
        
         | asveikau wrote:
          | I don't think that's actually super common anymore. When
          | you're doing asynchronous I/O, the only sane way to do it is
          | to put the data into a buffer rather than blocking on every
          | small write(2).
          | 
          | And asynchronous I/O is usually necessary both on the server
          | (otherwise you don't scale well) and the client (because
          | blocking on network calls is a terrible experience,
          | especially in today's world of frequent network changes,
          | falling out of network range, etc.)
        
         | sophacles wrote:
         | TCP_CORK handles this better than nagle tho.
        
       | mannyv wrote:
       | We used to call them "packlets."
       | 
       | His "tinygrams" is pretty good too, but that sort of implies UDP
       | (D -> datagrams)
        
         | chuckadams wrote:
         | > We used to call them "packlets."
         | 
         | setsockopt(fd, IPPROTO_TCP, TCP_MAKE_IT_GO, &go, sizeof(go));
        
       | obelos wrote:
       | Not every time. Sometimes it's DNS.
        
         | jeffrallen wrote:
         | Once every 50 years and 2 billion kilometers, it's a failing
         | memory chip. But you can usually just patch around them, so no
         | big deal.
        
         | skunkworker wrote:
         | Don't forget BGP or running out of disk space without an alert.
        
         | p_l wrote:
          | Once it was a failing line card in a router zeroing the last
          | bit of IPv4 addresses, resulting in a ticket about "only even
          | IPv4 addresses are accessible" ...
        
           | jcgrillo wrote:
           | For some reason this reminded me of the "500mi email" bug
           | [1], maybe a similar level of initial apparent absurdity?
           | 
           | [1] https://www.ibiblio.org/harris/500milemail.html
        
             | chuckadams wrote:
             | The most absurd thing to me about the 500 mile email
             | situation is that sendmail just happily started up and
             | soldiered on after being given a completely alien config
             | file. Could be read as another example of "be liberal in
             | what you accept" going awry, but sendmail's wretched config
             | format is really a volume of war stories all its own...
        
               | jcgrillo wrote:
               | Configuration changes are one of those areas where having
               | some kind of "are you sure? (y/n)" check can really pay
               | off. It wouldn't have helped in this case, because there
               | wasn't really any change management process to speak of,
               | but we haven't fully learned the lesson yet.
        
               | unconed wrote:
               | Confirmations are mostly useless unless you explicitly
               | spell out the implications of the change. They are also
               | inferior to being able to undo changes.
               | 
               | That's a lesson many don't know.
        
         | marcosdumay wrote:
         | When it fails, it's DNS. When it just stops moving, it's either
         | TCP_NODELAY or stream buffering.
         | 
         | Really complex systems (the Web) also fail because of caching.
        
         | drivers99 wrote:
         | Or SELinux
        
         | rickydroll wrote:
         | Not every time. Sometimes, the power cord is only connected at
         | one end.
        
         | sophacles wrote:
         | One time for me it was: the glass was dirty.
         | 
         | Some router near a construction site had dust settle into the
         | gap between the laser and the fiber, and it attenuated the
         | signal enough to see 40-50% packet loss.
         | 
         | We figured out where the loss was and had our NOC email the
         | relevant transit provider. A day later we got an email back
         | from the tech they dispatched with the story.
        
         | Sohcahtoa82 wrote:
         | I chuckle whenever I see this meme, because in my experience,
         | the issue is usually DHCP.
        
       | batmanthehorse wrote:
       | Does anyone know of a good way to enable TCP_NODELAY on sockets
       | when you don't have access to the source for that application? I
       | can't find any kernel settings to make it permanent, or commands
       | to change it after the fact.
       | 
       | I've been able to disable delayed acks using `quickack 1` in the
       | routing table, but it seems particularly hard to enable
       | TCP_NODELAY from outside the application.
       | 
       | I've been having exactly the problem described here lately, when
       | communicating between an application I own and a closed source
       | application it interacts with.
        
         | tedunangst wrote:
         | LD_PRELOAD.
        
           | batmanthehorse wrote:
           | Thank you, found this: https://github.com/sschroe/libnodelay
        
         | coldpie wrote:
         | Would some kind of LD_PRELOAD interception for socket(2) work?
         | Call the real function, then do setsockopt or whatever, and
         | return the modified socket.
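          | 
          | A rough sketch of the idea in C (not the library linked
          | above, just an illustration; assumes glibc and a dynamically
          | linked target, built with something like `gcc -shared -fPIC
          | -o nodelay.so nodelay.c -ldl` and run with
          | `LD_PRELOAD=./nodelay.so yourapp`):
          | 
          |     #define _GNU_SOURCE
          |     #include <dlfcn.h>
          |     #include <netinet/in.h>
          |     #include <netinet/tcp.h>
          |     #include <sys/socket.h>
          | 
          |     /* Intercept socket(2): create the socket via the real
          |        libc function, then try to set TCP_NODELAY. Non-TCP
          |        sockets just reject the option, which we ignore. */
          |     int socket(int domain, int type, int protocol)
          |     {
          |         static int (*real_socket)(int, int, int);
          |         if (!real_socket)
          |             real_socket = (int (*)(int, int, int))
          |                 dlsym(RTLD_NEXT, "socket");
          | 
          |         int fd = real_socket(domain, type, protocol);
          |         if (fd >= 0 &&
          |             (domain == AF_INET || domain == AF_INET6)) {
          |             int one = 1;
          |             setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
          |                        &one, sizeof(one));
          |         }
          |         return fd;
          |     }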
        
           | cesarb wrote:
           | > Would some kind of LD_PRELOAD interception for socket(2)
           | work?
           | 
           | That would only work if the call goes through libc, and it's
           | not statically linked. However, it's becoming more and more
           | common to do system calls directly, bypassing libc; the Go
           | language is infamous for doing that, but there's also things
           | like the rustix crate for Rust
           | (https://crates.io/crates/rustix), which does direct system
           | calls by default.
        
             | zbowling wrote:
              | And Go is wrong for doing that, at least on Linux. It
              | bypasses optimizations in the vDSO in some cases. On
              | Fuchsia, we made direct syscalls that don't go through
              | the vDSO illegal, and it was funny to see the hacks that
              | required in Go. The system ABI of Linux really isn't the
              | syscall interface, it's the system libc. That's because
              | the C ABI (and the behaviors of the triple it was
              | compiled for) and its isms for that platform are the
              | lingua franca of that system. Going around that to call
              | syscalls directly, at least for the 90% of useful
              | syscalls on the system that are wrapped by libc, is
              | asinine and creates odd bugs, and makes crash reporters,
              | heuristic unwinders, debuggers, etc. all more painful to
              | write. It also prevents the system vendor from
              | implementing user-mode optimizations that avoid mode and
              | context switches where possible. We tried to solve these
              | issues in Fuchsia, but for Linux, Darwin, and hell, even
              | Windows, if you are making direct syscalls and it's not
              | for something really special and bespoke, you are just
              | flat-out wrong.
        
               | JoshTriplett wrote:
               | > The system ABI of Linux really isn't the syscall
               | interface, its the system libc.
               | 
               | You might have reasons to prefer to use libc; some
               | software has reason to not use libc. Those preferences
               | are in conflict, but one of them is not automatically
               | right and the other wrong in all circumstances.
               | 
               | Many UNIX systems _did_ follow the premise that you
               | _must_ use libc and the syscall interface is unstable.
               | Linux pointedly did not, and decided to have a stable
               | syscall ABI instead. This means it 's possible to have
               | multiple C libraries, as well as other libraries, which
               | have different needs or goals and interface with the
               | system differently. That's a _useful_ property of Linux.
               | 
                | There are a couple of established mechanisms on Linux
                | for intercepting syscalls: ptrace and BPF. If you want
                | to intercept all uses of a syscall, intercept the
                | syscall.
                | If you want to intercept a particular glibc function
                | _in programs using glibc_, or for that matter a musl
               | function in a program using musl, go ahead and use
               | LD_PRELOAD. But the Linux syscall interface is a valid
               | and stable interface to the system, and that's why
               | LD_PRELOAD is not a complete solution.
        
               | zbowling wrote:
                | It's true that Linux has a stable-ish syscall table.
                | What is funny is that this caused a whole series of
                | Samsung Android phones to reboot randomly with some
                | apps: Samsung added a syscall at the same position
                | someone else did in upstream Linux, and folks
                | statically linking their own libc to avoid Bionic were
                | rebooting phones when calling certain functions,
                | because the Samsung syscall caused kernel panics when
                | called wrong. Goes back to it being a bad idea to
                | subvert your system libc. Now, distro vendors do give
                | out multiple versions of a libc that all work with your
                | kernel. This generally works. When we had to fix ABI
                | issues this happened a few times. But I wouldn't trust
                | building your own libc and assuming that libc is
                | portable to any Linux machine you copy it to.
        
               | cesarb wrote:
               | > It's true that Linux has a stable-ish syscall table.
               | 
               | It's not "stable-ish", it's fully stable. Once a syscall
               | is added to the syscall table on a released version of
               | the official Linux kernel, it might later be replaced by
               | a "not implemented" stub (which always returns -ENOSYS),
               | but it will never be reused for anything else. There's
               | even reserved space on some architectures for the STREAMS
               | syscalls, which were AFAIK never on any released version
               | of the Linux kernel.
               | 
               | The exception is when creating a new architecture; for
               | instance, the syscall table for 32-bit x86 and 64-bit x86
               | has a completely different order.
        
               | withinboredom wrote:
                | I think what they meant (judging by the example you
                | ignored) is that the table changes (even if
                | append-only) and you don't know which version you
                | actually have when you statically compile your own
                | version. Thus, your syscalls might be using a newer
                | version of the table, but the syscall might a) not
                | actually be implemented, or b) be implemented with
                | something bespoke.
        
               | tedunangst wrote:
               | The complication with the linux syscall interface is that
               | it turns the worse is better up to 11. Like setuid works
               | on a per thread basis, which is seriously not what you
               | want, so every program/runtime must do this fun little
               | thread stop and start and thunk dance.
        
               | JoshTriplett wrote:
               | Yeah, agreed. One of the items on my _long_ TODO list is
               | adding `setuid_process` and `setgid_process` and similar,
               | so that perhaps a decade later when new runtimes can
               | count on the presence of those syscalls, they can stop
               | duplicating that mechanism in userspace.
        
               | toast0 wrote:
               | > The system ABI of Linux really isn't the syscall
               | interface, its the system libc.
               | 
               | Which one? The Linux Kernel doesn't provide a libc. What
               | if you're a static executable?
               | 
               | Even on Operating Systems with a libc provided by the
               | kernel, it's almost always allowed to upgrade the kernel
               | without upgrading the userland (including libc); that
               | works because the interface between userland and kernel
               | is syscalls.
               | 
               | That certainly ties something that makes syscalls to a
               | narrow range of kernel versions, but it's not as if
               | dynamically linking libc means your program will be
               | compatible forever either.
        
               | jimmaswell wrote:
               | > That certainly ties something that makes syscalls to a
               | narrow range of kernel versions
               | 
               | I don't think that's right, wouldn't it be the earliest
               | kernel supporting that call and onwards? The Linux ABI
               | intentionally never breaks userland.
        
               | toast0 wrote:
               | In the case where you're running an Operating System that
               | provides a libc and is OK with removing older syscalls,
               | there's a beginning and an end to support.
               | 
               | Looking at FreeBSD under /usr/include/sys/syscall.h,
               | there's a good number of retired syscalls.
               | 
               | On Linux under /usr/include/x86_64-linux-
               | gnu/asm/unistd_32.h I see a fair number of missing
               | numbers --- not sure what those are about, but 222, 223,
               | 251, 285, and 387-392 are missing. (on Debian 12.1 with
               | linux-image-6.1.0-12-amd64 version 6.1.52-1, if it
               | matters)
        
               | assassinator42 wrote:
               | The proliferation of Docker containers seems to go
               | against that. Those really only work well since the
               | kernel has a stable syscall ABI. So much so that you see
               | Microsoft switching to a stable syscall ABI with Windows
               | 11.
        
               | sophacles wrote:
               | Linux is also weird because there are syscalls not
               | supported in most (any?) libc - things like io_uring, and
               | netlink fall into this.
        
               | gpderetta wrote:
               | Futex for a very long time was only accessible via
               | syscall.
        
               | Thaxll wrote:
               | Those are very strong words...
        
               | leni536 wrote:
               | It should be possible to use vDSO without libc, although
               | probably a lot of work.
        
               | pie_flavor wrote:
               | You seem to be saying 'it was incorrect on Fuchsia, so
               | it's incorrect on Linux'. No, it's correct on Linux, and
               | incorrect on every other platform, as each platform's
               | documentation is very clear on. Go did it incorrectly on
               | FreeBSD, but that's Go being Go; they did it in the first
               | place because it's a Linux-first system and it's correct
               | on Linux. And glibc does not have any special privilege,
               | the vdso optimizations it takes advantage of are just as
               | easily taken advantage of by the Go compiler. There's no
               | reason to bucket Linux with Windows on the subject of
               | syscalls when the Linux manpages are very clear that
               | syscalls are there to be used and exhaustively documents
               | them, while MSDN is very clear that the system interface
               | is kernel32.dll and ntdll.dll, and shuffles the syscall
               | numbers every so often so you don't get any funny ideas.
        
               | asveikau wrote:
               | Linux doesn't even have consensus on what libc to use,
               | and ABI breakage between glibc and musl is not unheard
               | of. (Probably not for syscalls but for other things.)
        
         | praptak wrote:
         | Attach debugger (ptrace), call setsockopt?
        
         | the8472 wrote:
         | opening `/proc/<pid>/fd/<fd number>` and setting the socket
         | option may work (not tested)
        
         | tuetuopay wrote:
         | you could try ebpf and hook on the socket syscall. might be
         | harder than LD_PRELOAD as suggested by other commenters though
        
         | jdadj wrote:
         | Depending on the specifics, you might be able to add socat in
         | the middle.
         | 
         | Instead of: your_app --> server
         | 
         | you'd have: your_app -> localhost_socat -> server
         | 
         | socat has command line options for setting tcp_nodelay. You'd
         | need to convince your closed source app to connect to
         | localhost, though. But if it's doing a dns lookup, you could
         | probably convince it to connect to localhost with an /etc/hosts
         | entry
         | 
         | Since your app would be talking to socat over a local socket,
         | the app's tcp_nodelay wouldn't have any effect.
        
       | mirekrusin wrote:
       | Can't it have "if payload is 1 byte (or less than X) then wait,
       | otherwise don't" condition?
        
         | chuckadams wrote:
         | Some network stacks like those in Solaris and HP/UX let you
         | tune the "Nagle limit" in just such a fashion, up to disabling
         | it entirely by setting it to 1. I'm not aware of it being
         | tunable on Linux, though you can manually control the buffering
         | using TCP_CORK. https://baus.net/on-tcp_cork/ has some nice
         | details.
        
         | fweimer wrote:
          | There is a socket option, SO_SNDLOWAT. It's not implemented
          | on Linux according to the manual page. The descriptions in
          | UNIX Network Programming and TCP/IP Illustrated conflict,
          | too. So it's probably not useful.
        
         | the8472 wrote:
          | You can buffer in userspace. If you don't do small writes to
          | the socket, no tinygrams will be sent. If you don't do two
          | consecutive small writes, Nagle won't kick in.
        
         | astrange wrote:
         | FreeBSD has accept filters, which let you do something like
         | wait for a complete HTTP header (inaccurate from memory
         | summary.) Not sure about the sending side.
        
         | deathanatos wrote:
         | How is what you're describing not just Nagle's algorithm?
         | 
         | If you mean TCP_NODELAY, you should use it with TCP_CORK, which
         | prevents partial frames. TCP_CORK the socket, do your writes to
         | the kernel via send, and then once you have an application
         | level "message" ready to send out -- i.e., once you're at the
         | point where you're going to go to sleep and wait for the other
         | end to respond, unset TCP_CORK & then go back to your event
         | loop & sleep. The "uncork" at the end + nodelay sends the final
         | partial frame, if there is one.
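          | 
          | A minimal sketch of that cork/uncork pattern in C (Linux-
          | specific; error handling omitted):
          | 
          |     #include <netinet/in.h>
          |     #include <netinet/tcp.h>
          |     #include <sys/socket.h>
          | 
          |     /* Hold partial frames while assembling a message... */
          |     void cork(int fd)
          |     {
          |         int on = 1;
          |         setsockopt(fd, IPPROTO_TCP, TCP_CORK,
          |                    &on, sizeof(on));
          |     }
          | 
          |     /* ...and release them (flushing any partial frame) once
          |        the whole application-level message is written and
          |        we're about to wait for the peer. */
          |     void uncork(int fd)
          |     {
          |         int off = 0;
          |         setsockopt(fd, IPPROTO_TCP, TCP_CORK,
          |                    &off, sizeof(off));
          |     }
          | 
          | Usage is cork(fd), a series of send() calls, then uncork(fd)
          | before going back to the event loop.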
        
       | elhosots wrote:
        | This sounds like the root of the VNC viewer/server interaction
        | bugs I experience with some viewer/server combos between
        | Ubuntu Linux and FreeBSD... (tight/tiger)
        
       | evanelias wrote:
       | John Nagle has posted insightful comments about the historical
       | background for this many times, for example
       | https://news.ycombinator.com/item?id=9048947 referenced in the
       | article. He's a prolific HN commenter (#11 on the leaderboard) so
       | it can be hard to find everything, but some more comments
       | searchable via
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
       | or
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
        
         | Animats wrote:
         | The sending pattern matters. Send/Receive/Send/Receive won't
         | trigger the problem, because the request will go out
         | immediately and the reply will provide an ACK and allow another
         | request. Bulk transfers won't cause the problem, because if you
         | fill the outgoing block size, there's no delay.
         | 
         | But Send/Send/Receive will. This comes up a lot in game
         | systems, where most of the traffic is small events going one
         | way.
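          | 
          | A sketch of that problematic pattern in C: with Nagle on and
          | delayed acks on the other end, the second small write can
          | sit in the kernel until the peer's delayed-ack timer fires.
          | 
          |     #include <unistd.h>
          | 
          |     /* Send/Send/Receive: two small writes back to back,
          |        then wait for the reply. The second write is the one
          |        that gets delayed, because there is unacked data in
          |        flight and the segment isn't full. */
          |     void request(int fd, char *reply, size_t replylen)
          |     {
          |         write(fd, "EVENT 1", 7);   /* goes out immediately */
          |         write(fd, "EVENT 2", 7);   /* held back by Nagle   */
          |         read(fd, reply, replylen);
          |     }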
        
         | EvanAnderson wrote:
          | I love it when Nagle's algorithm comes up on HN. Inevitably
          | someone, not knowing "Animats" is John Nagle, responds to a
          | comment from Animats with a "knowing better" tone. >smile<
          | 
          | (I really like Animats' comments, too.)
        
           | geoelectric wrote:
           | I have to confess that when I saw this post, I quickly
           | skimmed the threads to check if someone was trying to educate
           | Animats on TCP. Think I've only seen that happen in the wild
           | once or twice, but it absolutely made my day when it did.
        
       | pclmulqdq wrote:
       | In a world where bandwidth was limited, and the packet size
       | minimum was 64 bytes plus an inter-frame gap (it still is for
       | most Ethernet networks), sending a TCP packet for literally every
       | byte wasted a huge amount of bandwidth. The same goes for sending
       | empty acks.
       | 
       | On the other hand, my general position is: it's not TCP_NODELAY,
       | it's TCP.
        
         | metadaemon wrote:
          | I'd just love a protocol that has a built-in mechanism for
          | realizing the other side of the pipe has disconnected, for
          | any reason.
        
           | koverstreet wrote:
           | Like TCP keepalives?
        
             | mort96 wrote:
             | If the feature already technically exists in TCP, it's
             | either broken or disabled by default, which is pretty much
             | the same as not having it.
        
               | voxic11 wrote:
                | Keepalives are an optional TCP feature, so they are not
                | necessarily supported by all TCP implementations and
                | therefore default to off even when supported.
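                | 
                | On Linux they're also off per socket until you enable
                | them; a sketch (SO_KEEPALIVE is portable, the
                | TCP_KEEP* tunables are Linux-specific):
                | 
                |     #include <netinet/in.h>
                |     #include <netinet/tcp.h>
                |     #include <sys/socket.h>
                | 
                |     /* Probe an idle connection after 60s, every 10s,
                |        and give up after 5 failed probes. */
                |     void enable_keepalive(int fd)
                |     {
                |         int on = 1, idle = 60, intvl = 10, cnt = 5;
                |         setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
                |                    &on, sizeof(on));
                |         setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                |                    &idle, sizeof(idle));
                |         setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                |                    &intvl, sizeof(intvl));
                |         setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                |                    &cnt, sizeof(cnt));
                |     }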
        
           | the8472 wrote:
           | If a socket is closed properly there'll be a FIN and the
           | other side can learn about it by polling the socket.
           | 
           | If the network connection is lost due to external
           | circumstances (say your modem crashes) then how would that
           | information propagate from the point of failure to the remote
           | end _on an idle connection_? Either you actively probe
           | (keepalives) and risk false positives or you wait until you
           | hear again from the other side, risking false negatives.
        
             | sophacles wrote:
             | It gets even worse - routing changes causing traffic to
             | blackhole would still be undetectable without a timeout
             | mechanism, since probes and responses would be lost.
        
           | toast0 wrote:
           | That's possible in circuit switched networking with various
           | types of supervision, but packet switched networking has
           | taken over because it's much less expensive to implement.
           | 
            | Attempts to add connection monitoring usually make things
            | worse --- if you need to reroute a cable, and one or both
            | ends of the cable detect the disconnection and close user
            | sockets, that's not great: instead of a quick change with a
            | small period of data loss but otherwise minor interruption,
            | all of the established connections will be dropped.
        
           | noselasd wrote:
            | SCTP has heartbeats to detect that.
        
           | sophacles wrote:
           | That's really really hard. For a full, guaranteed way to do
           | this we'd need circuit switching (or circuit switching
           | emulation). It's pretty expensive to do in packet networks -
           | each flow would need to be tracked by each middle box, so a
           | lot more RAM at every hop, and probably a lot more processing
            | power. If we go with circuit establishment, it's also kind of
           | expensive and breaks the whole "distributed, decentralized,
           | self-healing network" property of the Internet.
           | 
           | It's possible to do better than TCP these days, bandwidth is
           | much much less constrained than it was when TCP was designed,
           | but it's still a hard problem to do detection of pipe
           | disconnected for _any_ reason other than timeouts (which we
           | already have).
        
           | pclmulqdq wrote:
           | Several of the "reliable UDP" protocols I have worked on in
           | the past have had a heartbeat mechanism that is specifically
           | for detecting this. If you haven't sent a packet down the
           | wire in 10-100 milliseconds, you will send an extra packet
           | just to say you're still there.
           | 
           | It's very useful to do this in intra-datacenter protocols.
        
           | 01HNNWZ0MV43FF wrote:
           | To re-word everyone else's comments - "Disconnected" is not
           | well-defined in any network.
        
           | jallmann wrote:
           | These types of keepalives are usually best handled at the
           | application protocol layer where you can design in more knobs
           | and respond in different ways. Otherwise you may see
           | unexpected interactions between different keepalive
           | mechanisms in different parts of the protocol stack.
        
         | niutech wrote:
         | Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the
         | TCP issues like latency?
        
       | zengid wrote:
       | Relevant Oxide and Friends podcast episode
       | https://www.youtube.com/watch?v=mqvVmYhclAg
        
         | matthavener wrote:
          | This was a great episode, and it really drove home the
          | importance of visualization.
        
       | rsc wrote:
       | Not if you use a modern language that enables TCP_NODELAY by
       | default, like Go. :-)
        
         | andrewfromx wrote:
         | https://news.ycombinator.com/item?id=34179426
         | 
         | https://github.com/golang/go/issues/57530
         | 
         | huh, TIL.
        
       | ironman1478 wrote:
        | I've fixed latency issues due to Nagle's algorithm multiple
        | times in my career. It's the first thing I jump to. I feel like
        | the logic behind it is sound, but it just doesn't work for some
        | workloads. It should be something that an engineer is forced to
        | set while creating a socket, instead of letting the OS choose a
        | default. I think that's the main issue. Not that it's a good /
        | bad option, but that there is a setting people might not know
        | about that manipulates how data is sent over the wire so
        | aggressively.
        
         | hinkley wrote:
         | What you really want is for the delay to be n microseconds, but
         | there's no good way to do that except putting your own user
         | space buffering in front of the system calls (user space works
         | better, unless you have something like io_uring amortizing
         | system call times)
        
         | Bluecobra wrote:
         | I agree, it has been fairly well known to disable Nagle's
         | Algorithm in HFT/low latency trading circles for quite some
         | time now (like > 15 years). It's one of the first things I look
         | for.
        
           | Scubabear68 wrote:
           | I was setting TCP_NODELAY at Bear Stearns for custom
           | networking code circa 1994 or so.
        
           | mcoliver wrote:
           | Same in M&E / vfx
        
         | nsguy wrote:
         | The logic is really for things like Telnet sessions. IIRC that
         | was the whole motivation.
        
         | nailer wrote:
          | You're right re: making the delay explicit, but also the
          | crappy user space networking tools don't show whether
          | TCP_NODELAY is enabled on sockets.
          | 
          | Last time I had to do some Linux stuff, maybe 10 years ago,
          | you had to write a systemtap program. I guess it's eBPF now.
          | But I bet the userspace tools still suck.
        
         | Sebb767 wrote:
         | > It should be something that an engineer needs to be forced to
         | set while creating a socket, instead of letting the OS choose a
         | default.
         | 
         | If the intention is mostly to fix applications with bad
         | `write`-behavior, this would make setting TCP_DELAY a pretty
         | exotic option - you would need a software engineer to be both
         | smart enough to know to set this option, but not smart enough
         | to distribute their write-calls well and/or not go for writing
         | their own (probably better fitted) application-specific version
         | of Nagles.
        
       | JoshTriplett wrote:
       | I do wish that TCP_NODELAY was the default, and there was a
       | TCP_DELAY option instead. That'd be a world in which people who
       | _want_ the batch-style behavior (optimizing for throughput and
       | fewer packets at the expense of latency) could still opt into it.
        
         | mzs wrote:
          | So do I, but I wish there was a new one, TCP_RTTDELAY. It
          | would take a byte specifying how many 128ths of the RTT you
          | want to use for Nagle's delay, instead of one RTT or a full*
          | buffer. 0 would be the default, behaving as you and I prefer.
         | 
         | * "Given the vast amount of work a modern server can do in even
         | a few hundred microseconds, delaying sending data for even one
         | RTT isn't clearly a win."
         | 
          | I don't think that's such an issue anymore either: if the
          | server produces so much data that it fills the output buffer
          | quickly, the data is sent immediately, before the delay runs
          | its course.
        
       | stonemetal12 wrote:
       | >To make a clearer case, let's turn back to the justification
       | behind Nagle's algorithm: amortizing the cost of headers and
       | avoiding that 40x overhead on single-byte packets. But does
       | anybody send single byte packets anymore?
       | 
        | That is a bit of a strawman there. While he uses single-byte
        | packets as the worst-case example, the issue as stated is any
        | packet that isn't full.
        
       | somat wrote:
        | What about the opposite: disabling delayed acks?
        | 
        | The problem is the pathological behavior when tinygram
        | prevention interacts with delayed acks. There is an exposed
        | option to turn off tinygram prevention (TCP_NODELAY); how would
        | you turn off delayed acks instead? Say you wanted to benchmark
        | all four combinations and see what works best.
        | 
        | Doing a little research I found:
        | 
        | Linux has the TCP_QUICKACK socket option, but you have to set
        | it every time you receive. There is also
        | /proc/sys/net/ipv4/tcp_delack_min and
        | /proc/sys/net/ipv4/tcp_ato_min.
        | 
        | FreeBSD has net.inet.tcp.delayed_ack and
        | net.inet.tcp.delacktime.
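        | 
        | A sketch of the Linux per-receive dance in C (re-arm
        | TCP_QUICKACK after every read, since the kernel can fall back
        | into delayed-ack mode on its own):
        | 
        |     #include <netinet/in.h>
        |     #include <netinet/tcp.h>
        |     #include <sys/socket.h>
        |     #include <unistd.h>
        | 
        |     /* Read, then immediately re-enable quick acks so the
        |        next segment isn't ack-delayed. */
        |     ssize_t read_quickack(int fd, void *buf, size_t len)
        |     {
        |         ssize_t n = read(fd, buf, len);
        |         int one = 1;
        |         setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK,
        |                    &one, sizeof(one));
        |         return n;
        |     }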
        
         | Animats wrote:
         | > linux has the TCP_QUICKACK socket option but you have to set
         | it every time you receive
         | 
         | Right. What were they thinking? Why would you want it off only
         | some of the time?
        
         | batmanthehorse wrote:
         | In CentOS/RedHat you can add `quickack 1` to the end of a route
         | to tell it to disable delayed acks for that route.
        
         | mjb wrote:
         | TCP_QUICKACK does fix the worst version of the problem, but
          | doesn't fix the entire problem. Nagle's algorithm will still
         | wait for up to one round-trip time before sending data (at
         | least as specified in the RFC), which is extra latency with
         | nearly no added value.
        
       | benreesman wrote:
        | Nagle and no delay are like 90+% of the latency bugs I've
       | dealt with.
       | 
       | Two reasonable ideas that mix terribly in practice.
        
       | tempaskhn wrote:
       | Wow, never would have thought of that.
        
       | jedberg wrote:
       | This is an interesting thing that points out why abstraction
       | layers can be bad without proper message passing mechanisms.
       | 
       | This could be fixed if there was a way for the application at L7
       | to tell the TCP stack at L4 "hey, I'm an interactive shell so I
       | expect to have a lot of tiny packets, you should leave
       | TCP_NODELAY on for these packets" so that it can be off by
       | default but on for that application to reduce overhead.
       | 
       | Of course nowadays it's probably an unnecessary optimization
       | anyway, but back in '84 it would have been super handy.
        
       | landswipe wrote:
       | Use UDP ;)
        
         | gafferongames wrote:
         | Bingo
        
       | epolanski wrote:
        | I was curious whether I had to change anything in my
        | applications after reading that, so I did a bit of research.
        | 
        | Both Node.js and curl have used TCP_NODELAY by default for a
        | long time.
        
       | gafferongames wrote:
       | It's always TCP_NODELAY. Except when it's head of line blocking,
       | then it's not.
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:00 UTC)