[HN Gopher] It's always TCP_NODELAY
       ___________________________________________________________________
        
       It's always TCP_NODELAY
        
       Author : todsacerdoti
       Score  : 813 points
       Date   : 2024-05-09 17:54 UTC (1 days ago)
        
 (HTM) web link (brooker.co.za)
 (TXT) w3m dump (brooker.co.za)
        
       | theamk wrote:
        | I don't buy the reasoning for never needing Nagle anymore.
        | Sure, telnet isn't a thing today, but I bet there are still
        | plenty of apps which do the equivalent of:
        |     write(fd, "Host: ")
        |     write(fd, hostname)
        |     write(fd, "\r\n")
        |     write(fd, "Content-type: ")
        |     etc...
        | 
        | This may not be 40x overhead, but it'd still be 5x or so.
        
         | otterley wrote:
         | Marc addresses that: "That's going to make some "write every
         | byte" code slower than it would otherwise be, but those
         | applications should be fixed anyway if we care about
         | efficiency."
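          | 
          | A minimal sketch of that fix (illustrative only, not from
          | any particular codebase), assuming fd is a connected TCP
          | socket: build the fragments in userspace and hand the
          | kernel one write instead of one per field.
          |     #include <stdio.h>
          |     #include <unistd.h>
          |     
          |     /* One buffer, one syscall, one (likely) packet. */
          |     void send_host_header(int fd, const char *hostname)
          |     {
          |         char buf[512];
          |         int n = snprintf(buf, sizeof(buf),
          |                          "Host: %s\r\n", hostname);
          |         if (n > 0 && (size_t)n < sizeof(buf))
          |             write(fd, buf, (size_t)n);
          |     }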
        
         | Arnt wrote:
         | Those aren't the ones you debug, so they won't be seen by OP.
         | Those are the ones you don't need to debug because Nagle saves
         | you.
        
         | rwmj wrote:
         | The comment about telnet had me wondering what openssh does,
         | and it sets TCP_NODELAY on every connection, even for
         | interactive sessions. (Confirmed by both reading the code and
         | observing behaviour in 'strace').
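          | 
          | For reference, a minimal sketch of setting that option on a
          | connected socket (illustrative only, not OpenSSH's actual
          | code):
          |     #include <netinet/in.h>
          |     #include <netinet/tcp.h>
          |     #include <sys/socket.h>
          |     
          |     /* Disable Nagle on this socket. */
          |     void set_nodelay(int fd)
          |     {
          |         int one = 1;
          |         setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
          |                    &one, sizeof(one));
          |     }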
        
           | c0l0 wrote:
           | _Especially_ for interactive sessions, it absolutely should!
           | :)
        
             | syncsynchalt wrote:
             | Ironic since Nagle's Algorithm (which TCP_NODELAY disables)
             | was invented for interactive sessions.
             | 
             | It's hard to imagine interactive sessions making more than
             | the tiniest of blips on a modern network.
        
               | eru wrote:
               | Isn't video calling an interactive session?
        
               | semi wrote:
                | I think that's more like two independent byte
                | streams. You want low latency, but what is
                | transferred doesn't really impact the other side;
                | you just constantly want to push the next frame.
        
               | eru wrote:
               | Thanks, that makes sense!
               | 
               | It's interesting that it's very much an interactive
               | experience for the end-user. But for the logic of the
               | computer, it's not interactive at all.
               | 
                | You can make the contrast even stronger: if both
                | video streams are transmitted over UDP, you don't
                | even need to send ACKs etc., making it truly one-
                | directional from a technical point of view.
               | 
               | Then compare that to transferring a file via TCP. For the
               | user this is as one-directional and non-interactive as it
               | gets, but the computers constantly talk back and forth.
        
               | TorKlingberg wrote:
               | Video calls indeed almost always use UDP. TCP
               | retransmission isn't really useful since by the time a
               | retransmitted packet arrives it's too old to display.
               | Worse, a single lost packet will block a TCP stream.
               | Sometimes TCP is the only way to get through a firewall,
               | but the experience is bad if there's any packet loss at
               | all.
               | 
               | VC systems do constantly send back packet loss statistics
               | and adjust the video quality to avoid saturating a link.
               | Any buffering in routers along the way will add delay, so
               | you want to keep the bitrate low enough to keep buffers
               | empty.
        
               | syncsynchalt wrote:
               | You're right! (I'm ignoring the reply thread).
               | 
               | I'm so used to a world where "interactive" was synonymous
               | with "telnet" and "person on keyboard".
        
         | temac wrote:
          | Fix the apps. Nobody expects magical perf if you do that
          | when writing to files, even though the OS also has its own
          | buffers. There is no reason to expect otherwise when
          | writing to a socket, and Nagle already doesn't save you
          | from syscall overhead anyway.
        
           | toast0 wrote:
           | Nagle doesn't save the derpy side from syscall overhead, but
           | it would save the other side.
           | 
           | It's not just apps doing this stuff, it also lives in system
           | libraries. I'm still mad at the Android HTTPS library for
           | sending chunked uploads as so many tinygrams. I don't
           | remember exactly, but I think it's reasonable packetization
           | for the data chunk (if it picked a reasonable size anyway),
           | then one packet for \r\n, one for the size, and another for
           | another \r\n. There's no reason for that, but it doesn't hurt
           | the client enough that I can convince them to avoid the
           | system library so they can fix it and the server can manage
           | more throughput. Ugh. (It might be that it's just the TLS
           | packetization that was this bogus and the TCP packetization
           | was fine, it's been a while)
           | 
           | If you take a pcap for some specific issue, there's always so
           | many of these other terrible things in there. </rant>
        
           | meinersbur wrote:
            | Those are the apps that are quickly written and do not
            | care if they unnecessarily congest the network. The ones
            | that do get properly maintained can set TCP_NODELAY.
            | Seems like a reasonable default to me.
        
           | ale42 wrote:
           | Apps can always misbehave, you never know what people
           | implement, and you don't always have source code to patch. I
           | don't think the role of the OS is to let the apps do whatever
           | they wish, but it should give the possibility of doing it if
            | it's needed. So I'd rather say: if you know you're doing
            | things properly and you're latency sensitive, just set
            | TCP_NODELAY on all your sockets and you're fine, and
            | nobody will blame you for doing it.
        
           | bjourne wrote:
            | > Fix the apps. Nobody expects magical perf if you do
            | that when writing to files,
           | 
           | We write to files line-by-line or even character-by-character
           | and expect the library or OS to "magically" buffer it into
           | fast file writes. Same with memory. We expect multiple small
           | mallocs to be smartly coalesced by the platform.
        
             | eru wrote:
             | Yes, your libraries should fix that. The OS (as in the
             | kernel) should not try to do any abstraction.
             | 
             | Alas, kernels really like to offer abstractions.
        
             | _carbyau_ wrote:
             | True to a degree. But that is a singular platform wholly
             | controlled by the OS.
             | 
             | Once you put packets out into the world you're in a shared
             | space.
             | 
             | I assume every conceivable variation of argument has been
              | made both for and against Nagle's at this point, but it
             | essentially revolves around a shared networking resource
             | and what policy is in place for fair use.
             | 
              | Nagle's fixes a particular case but interferes overall. If
             | you fix the "particular case app" the issue goes away.
        
             | PaulDavisThe1st wrote:
              | If you _expect_ a POSIX-y OS to buffer write(2) calls,
              | you're sadly misguided. Whether or not that happens
              | depends on the nature of the device file you're writing
              | to.
              | 
              | OTOH, if you're using fwrite(3), as you likely should be
              | for actual file I/O, then your expectation is entirely
              | reasonable.
             | 
             | Similarly with memory. If you expect brk(2) to handle
             | multiple small allocations "sensibly" you're going to be
             | disappointed. If you use malloc(3) then your expectation is
             | entirely reasonable.
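              | 
              | A minimal sketch of that distinction (illustrative
              | only): raw write(2) issues one syscall per call,
              | while stdio's fwrite/fputc buffer in userspace and
              | flush in larger chunks.
              |     #include <stdio.h>
              |     #include <unistd.h>
              |     
              |     /* One syscall per byte. */
              |     void raw(int fd, const char *s, size_t n)
              |     {
              |         for (size_t i = 0; i < n; i++)
              |             write(fd, &s[i], 1);
              |     }
              |     
              |     /* Buffered by libc; few syscalls. */
              |     void buffered(FILE *f, const char *s, size_t n)
              |     {
              |         for (size_t i = 0; i < n; i++)
              |             fputc(s[i], f);
              |         fflush(f);
              |     }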
        
               | bjourne wrote:
               | Whether buffering is part of POSIX or not is beside the
               | point. Any modern OS you'll find will buffer write calls
                | in one way or the other. Similarly with memory: Linux
                | waits until an access page-faults before reserving any
                | memory pages for you. My point is that various forms of
                | buffering are everywhere, and in practice we rely on
                | them a whole lot.
        
               | PaulDavisThe1st wrote:
               | > Any modern OS you'll find will buffer write calls in
               | one way or the other.
               | 
               | This is simply not true as a general rule. It depends on
               | the nature of the file descriptor. Yes, if the file
               | descriptor refers to the file system, it will in all
               | likelihood be buffered by the OS (not with O_DIRECT,
               | however). But on "any modern OS", file descriptors can
               | refer to things that are not files, and the buffering
               | situation there will vary from case to case.
        
               | bjourne wrote:
               | You're right, Linux does not buffer writes to file
               | descriptors for which buffering has no performance
               | benefit...
        
           | blahgeek wrote:
            | We actually have similar behavior when writing to files:
            | contents are buffered in the page cache and written to
            | disk later in batches, unless the user explicitly calls
            | "sync".
        
           | jeroenhd wrote:
           | Everybody expects magical perf if you do that when writing
           | files. We have RAM buffers and write caches for a reason,
           | even on fast SSDs. We expect it so much that macOS doesn't
           | flush to disk even when you call fsync() (files get flushed
           | to the disk's write buffer instead).
           | 
           | There's some overhead to calling write() in a loop, but it's
           | certainly not as bad as when a call to write() would actually
           | make the data traverse whatever output stream you call it on.
        
           | citrin_ru wrote:
            | I agree that such code should be fixed, but I have a hard
            | time persuading developers to fix their code. Many of them
            | don't know what a syscall is, how making a syscall
            | triggers sending of an IP packet, how a library call
            | translates to a syscall, etc. Worse, they don't want to
            | know; they write, say, Java code (or some other high-level
            | language) and argue that libraries/JDK/kernel should
            | handle all the 'low level' stuff.
            | 
            | To get optimal performance for request-response protocols
            | like HTTP one should send a full request, which includes
            | the request line, all headers and the POST body, using a
            | single write syscall (unless the POST body is large and it
            | makes sense to write it in chunks). Unfortunately not all
            | HTTP libraries work this way, and a library user cannot
            | fix this problem without switching libraries, which is (1)
            | not always easy and (2) it is not widely known which
            | libraries are efficient and which are not. Even if you
            | have your own HTTP library it's not always trivial to fix:
            | e.g. in Java a way to fix this problem while keeping the
            | code readable and idiomatic is to wrap the socket in a
            | BufferedOutputStream, which adds one more memory-to-memory
            | copy for all the data you send, on top of the at least one
            | memory-to-memory copy you already have without a buffered
            | stream; so it's not an obvious performance win for an
            | application which already saturates memory bandwidth.
        
           | josefx wrote:
            | I would love to fix the apps. Can you point me to the
            | GitHub repo with all the code written in the last 30 years
            | so I can get started?
        
         | grishka wrote:
         | And they really shouldn't do this. Even disregarding the
         | network aspect of it, this is still bad for performance because
         | syscalls are kinda expensive.
        
         | jrockway wrote:
         | Does this matter? Yes, there's a lot of waste. But you also
         | have a 1Gbps link. Every second that you don't use the full
         | 1Gbps is also waste, right?
        
           | tedunangst wrote:
           | This is why I always pad out the end of my html files with a
           | megabyte of &nbsp;. A half empty pipe is a half wasted pipe.
        
             | dessimus wrote:
             | Just be sure HTTP Compression is off though, or you're
             | still half-wasting the pipe.
             | 
             | Better to just dump randomized uncompressible data into
             | html comments.
        
             | arp242 wrote:
             | I am finally starting to understand some of these
             | OpenOffice/LibreOffice commit messages like
             | https://github.com/LibreOffice/core/commit/a0b6744d3d77
        
         | eatonphil wrote:
         | I imagine the write calls show up pretty easily as a bottleneck
         | in a flamegraph.
        
           | wbl wrote:
            | They don't. Maybe if you're really good you notice the
            | higher overhead, but you expect to be spending time
            | writing to the network. The actual impact shows up when
            | the bandwidth consumption is way up because of packet and
            | TCP headers, which won't show up on a flamegraph that
            | easily.
        
         | silisili wrote:
         | We shouldn't penalize the internet at large because some
         | developers write terrible code.
        
           | littlestymaar wrote:
            | Isn't that how SMTP works, though?
        
             | leni536 wrote:
             | No?
        
         | loopdoend wrote:
         | Ah yeah I fixed this exact bug in net-http in Ruby core a
         | decade ago.
        
         | tptacek wrote:
         | The discussion here mostly seems to miss the point. The
          | argument is to _change the default_, not to eliminate the
         | behavior altogether.
        
         | the8472 wrote:
          | Shouldn't autocorking help even without Nagle?
        
         | asveikau wrote:
          | I don't think that's actually super common anymore when you
          | consider that, with asynchronous I/O, the only sane way to
          | do it is to put the data into a buffer rather than blocking
          | on every small write(2).
          | 
          | Then consider that asynchronous I/O is usually necessary
          | both on the server (otherwise you don't scale well) and the
          | client (because blocking on network calls is a terrible
          | experience, especially in today's world of frequent network
          | changes, falling out of network range, etc.)
        
         | sophacles wrote:
          | TCP_CORK handles this better than Nagle tho.
        
         | jabl wrote:
          | Even if you do nothing 'fancy' like Nagle, corking, or
          | building up the complete buffer in userspace before writing
          | etc., at the very least the above should be using a
          | vectored write (writev()).
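          | 
          | A minimal sketch of that suggestion (illustrative only),
          | assuming fd is a connected TCP socket: one writev(2) call
          | hands the kernel all the fragments at once.
          |     #include <string.h>
          |     #include <sys/uio.h>
          |     
          |     void send_host_header_v(int fd, const char *hostname)
          |     {
          |         struct iovec iov[3] = {
          |             { "Host: ", 6 },
          |             { (void *)hostname, strlen(hostname) },
          |             { "\r\n", 2 },
          |         };
          |         /* One syscall, one buffer chain. */
          |         writev(fd, iov, 3);
          |     }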
        
         | Too wrote:
         | Shouldn't that go through some buffer? Unless you fflush()
         | between each write?
        
       | mannyv wrote:
       | We used to call them "packlets."
       | 
       | His "tinygrams" is pretty good too, but that sort of implies UDP
       | (D -> datagrams)
        
         | chuckadams wrote:
         | > We used to call them "packlets."
         | 
         | setsockopt(fd, IPPROTO_TCP, TCP_MAKE_IT_GO, &go, sizeof(go));
        
       | obelos wrote:
       | Not every time. Sometimes it's DNS.
        
         | jeffrallen wrote:
         | Once every 50 years and 2 billion kilometers, it's a failing
         | memory chip. But you can usually just patch around them, so no
         | big deal.
        
         | skunkworker wrote:
         | Don't forget BGP or running out of disk space without an alert.
        
         | p_l wrote:
          | Once it was a failing line card in a router zeroing the
          | last bit of IPv4 addresses, resulting in a ticket about
          | "only even IPv4 addresses are accessible" ...
        
           | jcgrillo wrote:
           | For some reason this reminded me of the "500mi email" bug
           | [1], maybe a similar level of initial apparent absurdity?
           | 
           | [1] https://www.ibiblio.org/harris/500milemail.html
        
             | chuckadams wrote:
             | The most absurd thing to me about the 500 mile email
             | situation is that sendmail just happily started up and
             | soldiered on after being given a completely alien config
             | file. Could be read as another example of "be liberal in
             | what you accept" going awry, but sendmail's wretched config
             | format is really a volume of war stories all its own...
        
               | jcgrillo wrote:
               | Configuration changes are one of those areas where having
               | some kind of "are you sure? (y/n)" check can really pay
               | off. It wouldn't have helped in this case, because there
               | wasn't really any change management process to speak of,
               | but we haven't fully learned the lesson yet.
        
               | unconed wrote:
               | Confirmations are mostly useless unless you explicitly
               | spell out the implications of the change. They are also
               | inferior to being able to undo changes.
               | 
               | That's a lesson many don't know.
        
               | lanstin wrote:
                | Your time from commit to live is proportional to your
                | time to roll back to a known good state. Maybe to a
                | power of the rollback time.
        
               | rincebrain wrote:
               | My favorite example of that was a while ago, "vixie-cron
               | will read a cron stanza from a core dump written to
               | /etc/cron.d" when you could convince it to write a core
               | dump there. The other crons wouldn't touch that, but
               | vixie-cron happily chomped through the core dump for "* *
               | * * * root chmod u+s /tmp/uhoh" etc.
        
             | p_l wrote:
              | I can definitely confirm our initial reaction was "WTF",
              | followed by the idea that the dev team was making fun of
              | us... but we went in and ran traceroutes and there it
              | was :O
              | 
              | It was fixed in an incredibly coincidental manner, too -
              | the _CTO_ of the network link provider was in their
              | offices (in the same building as me) and felt bored.
              | Apparently having gone up through all the levels, from
              | hauling cables in the datacenter to CTO, after a short
              | look at the traceroutes he just picked up a phone,
              | called the NOC, and ordered a line card replacement on
              | the router :D
        
         | marcosdumay wrote:
         | When it fails, it's DNS. When it just stops moving, it's either
         | TCP_NODELAY or stream buffering.
         | 
         | Really complex systems (the Web) also fail because of caching.
        
         | drivers99 wrote:
         | Or SELinux
        
         | rickydroll wrote:
         | Not every time. Sometimes, the power cord is only connected at
         | one end.
        
         | sophacles wrote:
         | One time for me it was: the glass was dirty.
         | 
         | Some router near a construction site had dust settle into the
         | gap between the laser and the fiber, and it attenuated the
         | signal enough to see 40-50% packet loss.
         | 
         | We figured out where the loss was and had our NOC email the
         | relevant transit provider. A day later we got an email back
         | from the tech they dispatched with the story.
        
         | Sohcahtoa82 wrote:
         | I chuckle whenever I see this meme, because in my experience,
         | the issue is usually DHCP.
        
           | anilakar wrote:
           | But it's usually DHCP that sets the wrong DNS servers.
           | 
            | It's funny that some folks claim a DNS outage is a
            | legitimate issue in systems where they control both ends.
            | I get it; reimplementing functionality is rarely a good
            | sign, but since you already know your own addresses in the
            | first place, you should also have an internal mechanism
            | for sharing them.
        
       | batmanthehorse wrote:
       | Does anyone know of a good way to enable TCP_NODELAY on sockets
       | when you don't have access to the source for that application? I
       | can't find any kernel settings to make it permanent, or commands
       | to change it after the fact.
       | 
       | I've been able to disable delayed acks using `quickack 1` in the
       | routing table, but it seems particularly hard to enable
       | TCP_NODELAY from outside the application.
       | 
       | I've been having exactly the problem described here lately, when
       | communicating between an application I own and a closed source
       | application it interacts with.
        
         | tedunangst wrote:
         | LD_PRELOAD.
        
           | batmanthehorse wrote:
           | Thank you, found this: https://github.com/sschroe/libnodelay
        
         | coldpie wrote:
         | Would some kind of LD_PRELOAD interception for socket(2) work?
         | Call the real function, then do setsockopt or whatever, and
         | return the modified socket.
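          | 
          | A rough sketch of that idea (illustrative only; the file
          | name nodelay.c is made up). Wrapping connect() is enough
          | here, since the fd already exists at that point. Build with
          | "cc -shared -fPIC -o nodelay.so nodelay.c -ldl" and run the
          | target with LD_PRELOAD=./nodelay.so.
          |     #define _GNU_SOURCE
          |     #include <dlfcn.h>
          |     #include <netinet/in.h>
          |     #include <netinet/tcp.h>
          |     #include <sys/socket.h>
          | 
          |     typedef int (*connect_fn)(int,
          |         const struct sockaddr *, socklen_t);
          | 
          |     int connect(int fd, const struct sockaddr *sa,
          |                 socklen_t len)
          |     {
          |         static connect_fn real;
          |         int type = 0, one = 1;
          |         socklen_t tl = sizeof(type);
          | 
          |         if (!real)
          |             real = (connect_fn)dlsym(RTLD_NEXT,
          |                                      "connect");
          |         /* Only touch TCP (stream) sockets. */
          |         if (!getsockopt(fd, SOL_SOCKET, SO_TYPE,
          |                         &type, &tl)
          |             && type == SOCK_STREAM)
          |             setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
          |                        &one, sizeof(one));
          |         return real(fd, sa, len);
          |     }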
        
           | cesarb wrote:
           | > Would some kind of LD_PRELOAD interception for socket(2)
           | work?
           | 
           | That would only work if the call goes through libc, and it's
           | not statically linked. However, it's becoming more and more
           | common to do system calls directly, bypassing libc; the Go
           | language is infamous for doing that, but there's also things
           | like the rustix crate for Rust
           | (https://crates.io/crates/rustix), which does direct system
           | calls by default.
        
             | zbowling wrote:
              | And Go is wrong for doing that, at least on Linux. It
              | bypasses optimizations in the vDSO in some cases. On
              | Fuchsia, we made direct syscalls that don't go through
              | the vDSO illegal, and it was funny to see the hacks that
              | required in Go. The system ABI of Linux really isn't the
              | syscall interface, it's the system libc. That's because
              | the C ABI (and the behaviors of the triple it was
              | compiled for) and its isms for that platform are the
              | lingua franca of that system. Going around that to call
              | syscalls directly, at least for the 90% of useful
              | syscalls on the system that are wrapped by libc, is
              | asinine and creates odd bugs, and makes crash reporters,
              | heuristic unwinders, debuggers, etc. all more painful to
              | write. It also prevents the system vendor from
              | implementing user-mode optimizations that avoid mode and
              | context switches where possible. We tried to solve these
              | issues in Fuchsia, but for Linux, Darwin, and hell, even
              | Windows, if you are making direct syscalls and it's not
              | for something really special and bespoke, you are just
              | flat-out wrong.
        
               | JoshTriplett wrote:
               | > The system ABI of Linux really isn't the syscall
               | interface, its the system libc.
               | 
               | You might have reasons to prefer to use libc; some
               | software has reason to not use libc. Those preferences
               | are in conflict, but one of them is not automatically
               | right and the other wrong in all circumstances.
               | 
               | Many UNIX systems _did_ follow the premise that you
               | _must_ use libc and the syscall interface is unstable.
               | Linux pointedly did not, and decided to have a stable
                | syscall ABI instead. This means it's possible to have
               | multiple C libraries, as well as other libraries, which
               | have different needs or goals and interface with the
               | system differently. That's a _useful_ property of Linux.
               | 
                | There are a couple of established mechanisms on Linux for
               | intercepting syscalls: ptrace, and BPF. If you want to
               | intercept all uses of a syscall, intercept the syscall.
               | If you want to intercept a particular glibc function _in
               | programs using glibc_ , or for that matter a musl
               | function in a program using musl, go ahead and use
               | LD_PRELOAD. But the Linux syscall interface is a valid
               | and stable interface to the system, and that's why
               | LD_PRELOAD is not a complete solution.
        
               | zbowling wrote:
                | It's true that Linux has a stable-ish syscall table.
                | What is funny is that this caused a whole series of
                | Samsung Android phones to reboot randomly with some
                | apps, because Samsung added a syscall at the same
                | position someone else did in upstream Linux, and folks
                | statically linking their own libc to avoid Bionic libc
                | were rebooting phones when calling certain functions,
                | because the Samsung syscall caused kernel panics when
                | called wrong. Goes back to it being a bad idea to
                | subvert your system libc. Now, distro vendors do ship
                | multiple versions of a libc that all work with your
                | kernel. This generally works; when we had to fix ABI
                | issues this happened a few times. But I wouldn't trust
                | building my own libc and assuming that libc is
                | portable to any Linux machine I copy it to.
        
               | cesarb wrote:
               | > It's true that Linux has a stable-ish syscall table.
               | 
               | It's not "stable-ish", it's fully stable. Once a syscall
               | is added to the syscall table on a released version of
               | the official Linux kernel, it might later be replaced by
               | a "not implemented" stub (which always returns -ENOSYS),
               | but it will never be reused for anything else. There's
               | even reserved space on some architectures for the STREAMS
               | syscalls, which were AFAIK never on any released version
               | of the Linux kernel.
               | 
                | The exception is when creating a new architecture; for
                | instance, the syscall tables for 32-bit x86 and 64-bit
                | x86 have a completely different order.
        
               | withinboredom wrote:
               | I think what they meant (judging by the example you
               | ignored) is that the table changes (even if append-only)
               | and you don't know which version you actually have when
                | you statically compile your own version. Thus, your
                | syscalls might be using a newer version of the table,
                | but be a) not actually implemented, or b) implemented
                | with something bespoke.
        
               | cesarb wrote:
               | > Thus, your syscalls might be using a newer version of
               | the table but it a) not actually be implemented,
               | 
               | That's the same case as when a syscall is later removed:
               | it returns -ENOSYS. The correct way is to do the call
               | normally as if it were implemented, and if it returns
               | -ENOSYS, you know that this syscall does not exist in the
               | currently running kernel, and you should try something
               | else. That is the same no matter whether it's compiled
               | statically or dynamically; even a dynamic glibc has
               | fallback paths for some missing syscalls (glibc has a
               | minimum required kernel version, so it does not need to
               | have fallback paths for features introduced a long time
               | ago).
               | 
               | > or b) implemented with something bespoke.
               | 
               | There's nothing you can do to protect against a modified
               | kernel which does something different from the upstream
               | Linux kernel. Even going through libc doesn't help, since
               | whoever modified the Linux kernel to do something
               | unexpected could also have modified the C library to do
               | something unexpected, or libc could trip over the
               | unexpected kernel changes.
               | 
               | One example of this happening is with seccomp filters.
               | They can be used to make a syscall fail with an
               | unexpected error code, and this can confuse the C
               | library. More specifically, a seccomp filter which forces
               | the clone3 syscall to always return -EPERM breaks newer
               | libc versions which try the clone3 syscall first, and
               | then fallback to the older clone syscall if clone3
               | returned -ENOSYS (which indicates an older kernel that
               | does not have the clone3 syscall); this breaks for
               | instance running newer Linux distributions within older
               | Docker versions.
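                | 
                | A minimal sketch of that -ENOSYS fallback pattern
                | (illustrative only, using getrandom(2) rather than
                | clone3, which is fiddlier to call raw):
                |     #include <errno.h>
                |     #include <fcntl.h>
                |     #include <sys/syscall.h>
                |     #include <unistd.h>
                | 
                |     ssize_t get_random(void *buf, size_t len)
                |     {
                |         long r = syscall(SYS_getrandom,
                |                          buf, len, 0);
                |         if (r >= 0 || errno != ENOSYS)
                |             return r;
                |         /* Kernel too old for this syscall:
                |            fall back to /dev/urandom. */
                |         int fd = open("/dev/urandom", O_RDONLY);
                |         if (fd < 0)
                |             return -1;
                |         r = read(fd, buf, len);
                |         close(fd);
                |         return r;
                |     }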
        
               | withinboredom wrote:
               | Every kernel I've ever used has been different from an
               | upstream kernel, with custom patches applied. It's
               | literally open source, anyone can do anything to it that
               | they want. If you are using libc, you'd have a reasonable
               | expectation not to need to know the details of those
               | changes. If you call the kernel directly via syscall,
               | then yeah, there is nothing you can do about someone
               | making modifications to open source software.
        
               | tedunangst wrote:
                | The complication with the Linux syscall interface is
                | that it turns the "worse is better" up to 11. Like how
                | setuid works on a per-thread basis, which is seriously
                | not what you want, so every program/runtime must do
                | this fun little thread stop-and-start-and-thunk dance.
        
               | JoshTriplett wrote:
               | Yeah, agreed. One of the items on my _long_ TODO list is
               | adding `setuid_process` and `setgid_process` and similar,
               | so that perhaps a decade later when new runtimes can
               | count on the presence of those syscalls, they can stop
               | duplicating that mechanism in userspace.
        
               | toast0 wrote:
               | > The system ABI of Linux really isn't the syscall
               | interface, its the system libc.
               | 
               | Which one? The Linux Kernel doesn't provide a libc. What
               | if you're a static executable?
               | 
               | Even on Operating Systems with a libc provided by the
               | kernel, it's almost always allowed to upgrade the kernel
               | without upgrading the userland (including libc); that
               | works because the interface between userland and kernel
               | is syscalls.
               | 
               | That certainly ties something that makes syscalls to a
               | narrow range of kernel versions, but it's not as if
               | dynamically linking libc means your program will be
               | compatible forever either.
        
               | jimmaswell wrote:
               | > That certainly ties something that makes syscalls to a
               | narrow range of kernel versions
               | 
               | I don't think that's right, wouldn't it be the earliest
               | kernel supporting that call and onwards? The Linux ABI
               | intentionally never breaks userland.
        
               | toast0 wrote:
               | In the case where you're running an Operating System that
               | provides a libc and is OK with removing older syscalls,
               | there's a beginning and an end to support.
               | 
               | Looking at FreeBSD under /usr/include/sys/syscall.h,
               | there's a good number of retired syscalls.
               | 
               | On Linux under /usr/include/x86_64-linux-
               | gnu/asm/unistd_32.h I see a fair number of missing
               | numbers --- not sure what those are about, but 222, 223,
               | 251, 285, and 387-392 are missing. (on Debian 12.1 with
               | linux-image-6.1.0-12-amd64 version 6.1.52-1, if it
               | matters)
        
               | assassinator42 wrote:
               | The proliferation of Docker containers seems to go
               | against that. Those really only work well since the
               | kernel has a stable syscall ABI. So much so that you see
               | Microsoft switching to a stable syscall ABI with Windows
               | 11.
        
               | sophacles wrote:
               | Linux is also weird because there are syscalls not
               | supported in most (any?) libc - things like io_uring, and
               | netlink fall into this.
        
               | gpderetta wrote:
               | Futex for a very long time was only accessible via
               | syscall.
        
               | Thaxll wrote:
               | Those are very strong words...
        
               | leni536 wrote:
               | It should be possible to use vDSO without libc, although
               | probably a lot of work.
        
               | LegionMammal978 wrote:
               | It's not that much work; after all, every libc needs to
               | have its own implementation. The kernel maps the vDSO
               | into memory for you, and gives you the base address as an
               | entry in the auxiliary vector.
               | 
               | But using it does require some basic knowledge of the ELF
               | format on the current platform, in order to parse the
               | symbol table. (Alongside knowledge of which functions are
               | available in the first place.)
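                | 
                | A minimal sketch of that first step (illustrative
                | only): the auxiliary vector hands you the vDSO's
                | load address, and the ELF parsing is up to you.
                |     #include <elf.h>
                |     #include <stdio.h>
                |     #include <sys/auxv.h>
                | 
                |     int main(void)
                |     {
                |         /* Base address of the kernel-mapped
                |            vDSO ELF image. */
                |         unsigned long base =
                |             getauxval(AT_SYSINFO_EHDR);
                |         printf("vDSO at %#lx\n", base);
                |         return 0;
                |     }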
        
               | intelVISA wrote:
               | It's hard work to NOT have the damn vDSO invade your
               | address space. Only kludge part of Linux, well, apart
               | from Nagle's, dlopen, and that weird zero copy kernel
               | patch that mmap'd -each- socket recv(!) for a while.
        
               | pie_flavor wrote:
               | You seem to be saying 'it was incorrect on Fuchsia, so
               | it's incorrect on Linux'. No, it's correct on Linux, and
               | incorrect on every other platform, as each platform's
               | documentation is very clear on. Go did it incorrectly on
               | FreeBSD, but that's Go being Go; they did it in the first
               | place because it's a Linux-first system and it's correct
               | on Linux. And glibc does not have any special privilege,
               | the vdso optimizations it takes advantage of are just as
               | easily taken advantage of by the Go compiler. There's no
               | reason to bucket Linux with Windows on the subject of
               | syscalls when the Linux manpages are very clear that
               | syscalls are there to be used and exhaustively documents
               | them, while MSDN is very clear that the system interface
               | is kernel32.dll and ntdll.dll, and shuffles the syscall
               | numbers every so often so you don't get any funny ideas.
        
               | asveikau wrote:
               | Linux doesn't even have consensus on what libc to use,
               | and ABI breakage between glibc and musl is not unheard
               | of. (Probably not for syscalls but for other things.)
        
               | LegionMammal978 wrote:
               | > And go is wrong for doing that, at least on Linux. It
               | bypasses optimizations in the vDSO in some cases.
               | 
                | Go's runtime _does_ go through the vDSO for syscalls
                | that support it, though (e.g., [0]). Of course, it
                | won't magically adapt to new functions added in later
                | kernel versions, but neither will a statically-linked
                | libc. And it's not like it's a regular occurrence for
                | Linux to add new functions to the vDSO, in any case.
               | 
               | [0] https://github.com/golang/go/blob/master/src/runtime/
               | time_li...
        
         | praptak wrote:
         | Attach debugger (ptrace), call setsockopt?
        
         | the8472 wrote:
         | opening `/proc/<pid>/fd/<fd number>` and setting the socket
         | option may work (not tested)
        
         | tuetuopay wrote:
          | You could try eBPF and hook the socket syscall. Might be
          | harder than LD_PRELOAD as suggested by other commenters,
          | though.
        
         | jdadj wrote:
         | Depending on the specifics, you might be able to add socat in
         | the middle.
         | 
         | Instead of: your_app --> server
         | 
         | you'd have: your_app -> localhost_socat -> server
         | 
         | socat has command line options for setting tcp_nodelay. You'd
         | need to convince your closed source app to connect to
          | localhost, though. But if it's doing a DNS lookup, you could
          | probably convince it to connect to localhost with an
          | /etc/hosts entry.
         | 
         | Since your app would be talking to socat over a local socket,
         | the app's tcp_nodelay wouldn't have any effect.
        
         | Too wrote:
         | Is it possible to set it as a global OS setting, inside a
         | container?
        
       | mirekrusin wrote:
       | Can't it have "if payload is 1 byte (or less than X) then wait,
       | otherwise don't" condition?
        
         | chuckadams wrote:
         | Some network stacks like those in Solaris and HP/UX let you
         | tune the "Nagle limit" in just such a fashion, up to disabling
         | it entirely by setting it to 1. I'm not aware of it being
         | tunable on Linux, though you can manually control the buffering
         | using TCP_CORK. https://baus.net/on-tcp_cork/ has some nice
         | details.
        
         | fweimer wrote:
          | There is a socket option, SO_SNDLOWAT. It's not implemented
          | on Linux according to the manual page. The descriptions in
          | UNIX Network Programming and TCP/IP Illustrated conflict,
          | too. So it's probably not useful.
        
         | the8472 wrote:
         | You can buffer in userspace. Don't do small writes to the
         | socket and no bytes will be sent. Don't do two consecutive
          | small writes and Nagle won't kick in.
        
         | astrange wrote:
         | FreeBSD has accept filters, which let you do something like
         | wait for a complete HTTP header (inaccurate from memory
         | summary.) Not sure about the sending side.
        
         | deathanatos wrote:
         | How is what you're describing not just Nagle's algorithm?
         | 
         | If you mean TCP_NODELAY, you should use it with TCP_CORK, which
         | prevents partial frames. TCP_CORK the socket, do your writes to
         | the kernel via send, and then once you have an application
         | level "message" ready to send out -- i.e., once you're at the
         | point where you're going to go to sleep and wait for the other
         | end to respond, unset TCP_CORK & then go back to your event
         | loop & sleep. The "uncork" at the end + nodelay sends the final
         | partial frame, if there is one.
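          | 
          | A minimal sketch of that cork/uncork dance (Linux-specific,
          | illustrative only), assuming fd already has TCP_NODELAY
          | set:
          |     #include <netinet/in.h>
          |     #include <netinet/tcp.h>
          |     #include <sys/socket.h>
          | 
          |     void cork(int fd, int on)
          |     {
          |         setsockopt(fd, IPPROTO_TCP, TCP_CORK,
          |                    &on, sizeof(on));
          |     }
          | 
          |     /* cork(fd, 1);
          |        ...several send()/write() calls building one
          |        application-level message...
          |        cork(fd, 0);  -- uncork: flush the final
          |        partial frame, then go back to the event
          |        loop and wait for the reply. */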
        
       | elhosots wrote:
        | This sounds like the root of my vncviewer/server interaction
        | bugs I experience with some VNC viewer/server combos between
        | Ubuntu Linux and FreeBSD... (tight/tiger)
        
       | evanelias wrote:
       | John Nagle has posted insightful comments about the historical
       | background for this many times, for example
       | https://news.ycombinator.com/item?id=9048947 referenced in the
       | article. He's a prolific HN commenter (#11 on the leaderboard) so
       | it can be hard to find everything, but some more comments
       | searchable via
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
       | or
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
        
         | Animats wrote:
         | The sending pattern matters. Send/Receive/Send/Receive won't
         | trigger the problem, because the request will go out
         | immediately and the reply will provide an ACK and allow another
         | request. Bulk transfers won't cause the problem, because if you
         | fill the outgoing block size, there's no delay.
         | 
         | But Send/Send/Receive will. This comes up a lot in game
         | systems, where most of the traffic is small events going one
         | way.
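          | 
          | A client-side sketch of that shape, with made-up game-style
          | messages (illustrative only): with Nagle on and delayed
          | ACKs on the other end, the second small send can sit in
          | the kernel until the peer's ACK arrives.
          |     #include <sys/socket.h>
          | 
          |     void do_request(int fd, char *reply, size_t cap)
          |     {
          |         send(fd, "MOVE ", 5, 0);   /* goes out now   */
          |         send(fd, "e2e4\n", 5, 0);  /* waits for ACK  */
          |         /* Peer replies only after reading both. */
          |         recv(fd, reply, cap, 0);
          |     }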
        
           | pipe01 wrote:
           | I would imagine that games that require exotic sending
           | patterns would use UDP, giving them more control over the
           | protocol
        
             | codr7 wrote:
             | Size prefixed messages are pretty common, perfectly
             | possible to perform as one send but takes more work.
        
         | EvanAnderson wrote:
          | I love it when Nagle's algorithm comes up on HN. Inevitably
          | someone, not knowing "Animats" is John Nagle, responds to a
          | comment from Animats with a "knowing better" tone. >smile<
          | 
          | (I really like Animats' comments, too.)
        
           | geoelectric wrote:
           | I have to confess that when I saw this post, I quickly
           | skimmed the threads to check if someone was trying to educate
           | Animats on TCP. Think I've only seen that happen in the wild
           | once or twice, but it absolutely made my day when it did.
        
             | ryandrake wrote:
             | It's always the highlight of my day when it happens, almost
             | as nice as when someone chimes in to educate John Carmack
             | on 3D graphics and VR technology.
        
           | userbinator wrote:
           | I always check if the man himself makes an appearance every
           | time I see that. He has posted a few comments in here
           | already.
        
           | jeltz wrote:
           | It is like when someone here accused Andres Freund
           | (PostgreSQL core dev who recently became famous due to the xz
           | backdoor) of Dunning-Kruger when he had commented on
           | something related to PostgreSQL's architecture which he had
           | spent many many hours working on personally (I think it was
           | pluggable storage).
           | 
           | Maybe you just tried to educate the leading expert in the
           | world on his own expertise. :D
        
         | SushiHippie wrote:
         | FYI the best way to filter by author is 'author:Animats' this
         | will only show results from the user Animats and won't match
         | animats inside the comment text.
         | 
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
        
       | pclmulqdq wrote:
       | In a world where bandwidth was limited, and the packet size
       | minimum was 64 bytes plus an inter-frame gap (it still is for
       | most Ethernet networks), sending a TCP packet for literally every
       | byte wasted a huge amount of bandwidth. The same goes for sending
       | empty acks.
       | 
       | On the other hand, my general position is: it's not TCP_NODELAY,
       | it's TCP.
        
         | metadaemon wrote:
         | I'd just love a protocol that has a built in mechanism for
         | realizing the other side of the pipe disconnected for any
         | reason.
        
           | koverstreet wrote:
           | Like TCP keepalives?
        
             | mort96 wrote:
             | If the feature already technically exists in TCP, it's
             | either broken or disabled by default, which is pretty much
             | the same as not having it.
        
               | voxic11 wrote:
                | Keepalives are an optional TCP feature, so they are
                | not necessarily supported by all TCP implementations
                | and therefore default to off even when supported.
        
               | dilyevsky wrote:
                | Where is it off? Most Linux distros have it on, it's
                | just that the default kickoff timer is ridiculously
                | long (like 2 hours IIRC). Besides, TCP keepalives
                | won't help with the issue at hand and were put in for
                | a totally different purpose (gc'ing idle connections).
                | Most of the time you don't even need them, because the
                | other side will send an RST packet if it already
                | closed the socket.
        
               | halter73 wrote:
                | AFAIK, all Linux distros plus Windows and macOS have
                | TCP keepalives off by default, as mandated by RFC
                | 1122. Even when they are optionally turned on using
                | SO_KEEPALIVE, the interval defaults to two hours
                | because that is the minimum default interval allowed
                | by the spec. That can then be optionally reduced with
                | something like /proc/sys/net/ipv4/tcp_keepalive_time
                | (system wide) or TCP_KEEPIDLE (per socket).
                | 
                | By default, completely idle TCP connections will stay
                | alive indefinitely from the perspective of both peers,
                | even if their physical connection is severed.
                | 
                |     Implementors MAY include "keep-alives" in their
                |     TCP implementations, although this practice is
                |     not universally accepted.  If keep-alives are
                |     included, the application MUST be able to turn
                |     them on or off for each TCP connection, and they
                |     MUST default to off.
                | 
                |     Keep-alive packets MUST only be sent when no data
                |     or acknowledgement packets have been received for
                |     the connection within an interval.  This interval
                |     MUST be configurable and MUST default to no less
                |     than two hours.
                | 
                | [0]:
                | https://datatracker.ietf.org/doc/html/rfc1122#page-101
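                | 
                | A minimal sketch of those per-socket knobs on Linux
                | (illustrative only; the timer values are arbitrary):
                |     #include <netinet/in.h>
                |     #include <netinet/tcp.h>
                |     #include <sys/socket.h>
                | 
                |     static void ka_opt(int fd, int opt, int v)
                |     {
                |         setsockopt(fd, IPPROTO_TCP, opt,
                |                    &v, sizeof(v));
                |     }
                | 
                |     void enable_keepalive(int fd)
                |     {
                |         int on = 1;
                |         setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
                |                    &on, sizeof(on));
                |         ka_opt(fd, TCP_KEEPIDLE, 60);
                |         ka_opt(fd, TCP_KEEPINTVL, 10);
                |         ka_opt(fd, TCP_KEEPCNT, 3);
                |     }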
        
               | dilyevsky wrote:
               | OK you're right - it's coming back to me now. I've been
               | spoiled by software that enables keep-alive on sockets.
        
               | mort96 wrote:
               | So we need a protocol with some kind of non-optional
               | default-enabled keepalive.
        
               | josefx wrote:
               | Now your connections start to randomly fail in production
               | because the implementation defaults to 20ms and your
               | local tests never caught that.
        
               | mort96 wrote:
               | I'm sure there's some middle ground between "never time
               | out" and "time out after 20ms" that works reasonably well
               | for most use cases
        
               | hi-v-rocknroll wrote:
               | You're conflating all optional TCP features of all
               | operating systems, network devices, and RFCs together.
               | This lack of nuance fails to appreciate that different
               | applications have different needs for how they use TCP: (
               | server | client ) x ( one way | chatty bidirectional |
               | idle tinygram | mixed ). If a feature needs to be used on
               | a particular connection, then use it. ;)
        
           | the8472 wrote:
           | If a socket is closed properly there'll be a FIN and the
           | other side can learn about it by polling the socket.
           | 
           | If the network connection is lost due to external
           | circumstances (say your modem crashes) then how would that
           | information propagate from the point of failure to the remote
           | end _on an idle connection_? Either you actively probe
           | (keepalives) and risk false positives or you wait until you
           | hear again from the other side, risking false negatives.
        
             | sophacles wrote:
             | It gets even worse - routing changes causing traffic to
             | blackhole would still be undetectable without a timeout
             | mechanism, since probes and responses would be lost.
        
             | dataflow wrote:
             | > If the network connection is lost due to external
             | circumstances (say your modem crashes) then how would that
             | information propagate from the point of failure to the
             | remote end _on an idle connection_?
             | 
             | Observe the line voltage? If it gets cut then you have a
             | problem...
             | 
             | > Either you actively probe (keepalives) and risk false
             | positives
             | 
             | What false positives? Are you thinking there's an adversary
             | on the other side?
        
               | pjc50 wrote:
               | This is a L2 vs L3 thing.
               | 
               | Most network links absolutely will detect that the link
               | has gone away; the little LED will turn off and the OS
               | will be informed on both ends of that link.
               | 
               | But one of the link ends is a router, and these are
               | (except for NAT) _stateless_. The router _does not know_
               | what TCP connections are currently running through it, so
               | it cannot notify them - until a packet for that link
               | arrives, at which point it can send back an ICMP packet.
               | 
               | A TCP link with no traffic on it _does not exist_ on the
               | intermediate routers.
               | 
               | (Direct contrast to the old telecom ATM protocol, which
               | was circuit switched and required "reservation" of a full
               | set of end-to-end links).
        
               | ncruces wrote:
                | For a given connection, (most) packets might go
                | through (e.g.) 10 links. If one link goes down (or is
                | saturated and dropping packets) the connection is
                | supposed to route around it.
                | 
                | So, except for the links on either end going down (one
                | end really; if the other is in a "data center" the TCP
                | connection is likely terminated at a "server" with
                | redundant networking), you wouldn't want to have a
                | connection terminated just because a link died.
                | 
                | That's explicitly against the goal of a packet
                | switched network.
        
           | toast0 wrote:
           | That's possible in circuit switched networking with various
           | types of supervision, but packet switched networking has
           | taken over because it's much less expensive to implement.
           | 
            | Attempts to add connection monitoring usually make things
            | worse --- if you need to reroute a cable, and one or both
            | ends of the cable detect the disconnection and close user
            | sockets, that's not great: instead of a quick change with
            | a small period of data loss but an otherwise minor
            | interruption, all of the established connections get
            | dropped.
        
           | noselasd wrote:
            | SCTP has heartbeats to detect that.
        
           | sophacles wrote:
           | That's really really hard. For a full, guaranteed way to do
           | this we'd need circuit switching (or circuit switching
           | emulation). It's pretty expensive to do in packet networks -
           | each flow would need to be tracked by each middle box, so a
           | lot more RAM at every hop, and probably a lot more processing
           | power. If we go with circuit establishment, its also kind of
           | expensive and breaks the whole "distributed, decentralized,
           | self-healing network" property of the Internet.
           | 
           | It's possible to do better than TCP these days, bandwidth is
           | much much less constrained than it was when TCP was designed,
           | but it's still a hard problem to do detection of pipe
           | disconnected for _any_ reason other than timeouts (which we
           | already have).
        
           | pclmulqdq wrote:
           | Several of the "reliable UDP" protocols I have worked on in
           | the past have had a heartbeat mechanism that is specifically
           | for detecting this. If you haven't sent a packet down the
           | wire in 10-100 milliseconds, you will send an extra packet
           | just to say you're still there.
           | 
           | It's very useful to do this in intra-datacenter protocols.
        
           | 01HNNWZ0MV43FF wrote:
           | To re-word everyone else's comments - "Disconnected" is not
           | well-defined in any network.
        
             | dataflow wrote:
             | > To re-word everyone else's comments - "Disconnected" is
             | not well-defined in any network.
             | 
             | Parent said disconnected pipe, not network. It's
             | sufficiently well-definable there.
        
               | Spivak wrote:
               | I think it's a distinction without a difference in this
               | case. You can't know if the reason your water stopped is
               | because the water is shut off, the pipe broke, or it's
               | just slow.
               | 
               | When all you have to go on is "I stopped getting packets"
               | the best you can do is give up after a bit. TCP
                | keepalives do kinda suck and are full of interesting
               | choices that don't seem to have passed the test of time.
               | But they are there and if you control both sides of the
               | connection you can be sure they work.
        
               | dataflow wrote:
               | There's a crucial difference in fact, which is that the
               | peer you're defining connectedness to is a single well-
               | defined peer that is directly connected to you, which
               | "The Network" is not.
               | 
               | As for the analogy, uh, this ain't water. Monitor the
               | line voltage or the fiber brightness or something, it'll
               | tell you very quickly if the other endpoint is
               | disconnected. It's up to the physical layer to provide a
               | mechanism to detect disconnection, but it's not somehow
               | impossible or rocket science...
        
               | umanwizard wrote:
               | Well, isn't that already how it works? If I physically
               | unplug my ethernet cable, won't TCP-related syscalls
               | start failing immediately?
        
               | dataflow wrote:
               | Probably, but I don't know how the physical layers work
               | underneath. But regardless, it's trivial to just monitor
               | _something_ constantly to ensure the connection is still
               | there, you just need the hardware and protocol support.
        
               | pjc50 wrote:
               | Modern Ethernet has what it calls "fast link pulses";
               | every 16ms there's some traffic to check. It's telephones
               | that use voltage for hook detection.
               | 
               | However, that only applies to the two ends of that cable,
               | not between you and the datacentre on the other side of
               | the world.
        
               | remram wrote:
               | > I don't know how it works... it's trivial
               | 
               | Come on now...
               | 
               | And it is easy to monitor, it is just an application
               | concern not a L3-L4 one.
        
               | pjc50 wrote:
               | Last time I looked the behavior differed; some OSs will
               | _immediately_ reset TCP connections which were using an
               | interface when it goes away, others will wait until a
               | packet is attempted.
        
               | iudqnolq wrote:
               | I ran into this with websockets. At least under certain
               | browser/os pairs you won't ever receive a close event if
               | you disconnect from wifi. I guess you need to manually
               | monitor ping/pong messages and close it yourself after a
               | timeout?
        
               | Spivak wrote:
               | In a packet-switched network there isn't one connection
               | between you and your peer. Even if you had line
               | monitoring that wouldn't be enough on its own to tell you
               | that your packet can't get there -- "routing around the
               | problem" isn't just a turn of phrase. On the opposite end
               | networks are best-effort so even if the line is up you
               | might get stuck in congestion which to you might as well
               | be dropped.
               | 
                | You can get the guarantees you want with a circuit-
                | switched network, but there are a lot of trade-offs,
                | namely bandwidth and self-healing.
        
               | pjc50 wrote:
               | See https://news.ycombinator.com/item?id=40316922 :
               | "pipe" is L3, the network links are L2.
        
           | jallmann wrote:
           | These types of keepalives are usually best handled at the
           | application protocol layer where you can design in more knobs
           | and respond in different ways. Otherwise you may see
           | unexpected interactions between different keepalive
           | mechanisms in different parts of the protocol stack.
        
           | drb999 wrote:
           | What you're looking for is:
           | https://datatracker.ietf.org/doc/html/rfc5880
           | 
           | BFD, it's used for millisecond failure detection and
           | typically combined with BGP sessions (tcp based) to ensure
           | seamless failover without packet drops.
        
         | niutech wrote:
         | Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the
         | TCP issues like latency?
        
           | djha-skin wrote:
           | Quic is mostly used between client and data center, but not
           | between two datacenter computers. TCP is the better choice
           | once inside the datacenter.
           | 
           | Reasons:
           | 
           |  _Security Updates_
           | 
           | Phones run old kernels and new apps. So it makes a lot of
           | sense to put something that needs to be updated a lot, like
           | the network stack, into user space, and QUIC does well here.
           | 
           | Data center computers run older apps on newer kernels, so it
           | makes sense to put the network stack into the kernel where
           | updates and operational tweaks can happen independent of the
           | app release cycle.
           | 
           |  _Encryption Overhead_
           | 
           | The overhead of TLS is not always needed inside a data
           | center, whereas it is always needed on a phone.
           | 
           |  _Head of Line Blocking_
           | 
           | Super important on a throttled or bad phone connection, not a
           | big deal when all of your datacenter servers have 10G
           | connections to everything else.
           | 
           | In my opinion TCP is a battle hardened technology that just
           | works even when things go bad. That it contains a setting
           | with perhaps a poor default is a small thing in comparison to
           | its good record for stability in most situations. It's also
           | comforting to know I can tweak kernel parameters if I need
           | something special for my particular use case.
        
             | mjb wrote:
             | Many performance-sensitive in-datacenter applications have
             | moved away from TCP to reliable datagram protocols. Here's
             | what that looks like at AWS:
             | https://ieeexplore.ieee.org/document/9167399
        
           | jallmann wrote:
           | The specific issues that this article discusses (eg Nagle's
           | algorithm) will be present in most packet-switched transport
           | protocols, especially ones that rely on acknowledgements for
           | reliability. The QUIC RFC mentions this:
           | https://datatracker.ietf.org/doc/html/rfc9000#section-13
           | 
           | Packet overhead, ack frequency, etc are the tip of the
           | iceberg though. QUIC addresses some of the biggest issues
           | with TCP such as head-of-line blocking but still shares the
           | more finicky issues, such as different flow and congestion
           | control algorithms interacting poorly.
        
           | klabb3 wrote:
           | As someone who needed high throughput and looked to QUIC
           | because of control of buffers, I recommend against it at this
           | time. It's got tons of performance problems depending on impl
           | and the API is different.
           | 
           | I don't think QUIC is bad, or even overengineered, really. It
           | delivers useful features, in theory, that are quite well
           | designed for the modern web centric world. Instead I got a
           | much larger appreciation for TCP, and how well it works
           | everywhere: on commodity hardware, middleboxes, autotuning,
           | NIC offloading etc etc. Never underestimate battletested
           | tech.
           | 
           | In that sense, the lack of TCP_NODELAY is an exception to the
           | rule that TCP performs well out of the box (golang is already
           | doing this by default). As such, I think it's time to change
           | the default. Not using buffers correctly is a programming
           | error, imo, and can be patched.
        
             | supriyo-biswas wrote:
             | Was this ever implemented though? I found [1] but it was
             | frozen due to age and was never worked on, it seems.
             | 
             | (Edit: doing some more reading, it seems TCP_NODELAY was
             | always the default in Golang. Enable TCP_NODELAY =>
             | "disable Nagle's algorithm")
             | 
             | [1] https://github.com/golang/go/issues/57530
        
               | bboreham wrote:
               | Yes. That issue is confusingly titled, but consists
               | solely of a quote from the author of the code talking
               | about what they were thinking at the time they did it.
        
       | zengid wrote:
       | Relevant Oxide and Friends podcast episode
       | https://www.youtube.com/watch?v=mqvVmYhclAg
        
         | matthavener wrote:
         | This was a great episode, and it really drove home the
         | importance of visualization.
        
       | rsc wrote:
       | Not if you use a modern language that enables TCP_NODELAY by
       | default, like Go. :-)
        
         | andrewfromx wrote:
         | https://news.ycombinator.com/item?id=34179426
         | 
         | https://github.com/golang/go/issues/57530
         | 
         | huh, TIL.
        
         | silverwind wrote:
         | Node.js also does this since at least 2020.
        
           | Sammi wrote:
           | Since 2022 v.18.
           | 
           | PR: https://github.com/nodejs/node/pull/42163
           | 
           | Changelog entry: https://github.com/nodejs/node/blob/main/doc
           | /changelogs/CHAN...
        
         | eru wrote:
         | Why do you need a whole language for that? Couldn't you just
         | use a 'modern' networking library?
        
           | rsc wrote:
           | Sure, like the one in https://9fans.github.io/plan9port/. :-)
        
       | ironman1478 wrote:
       | I've fixed latency issues caused by Nagle's algorithm multiple
       | times in my career. It's the first thing I jump to. I feel like
       | the logic behind it is sound, but it just doesn't work for some
       | workloads. It should be something that an engineer needs to be
       | forced to set while creating a socket, instead of letting the OS
       | choose a default. I think that's the main issue. Not that it's a
       | good / bad option but that there is a setting that people might
       | not know about that manipulates how data is sent over the wire so
       | aggressively.
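       |
       | For reference, disabling Nagle is a single setsockopt call on an
       | already-created TCP socket; a minimal sketch in C (error handling
       | omitted):
       |
       |     /* Disable Nagle's algorithm: send small segments right
       |      * away instead of waiting for outstanding data to be
       |      * ACKed or for a full-sized segment to accumulate. */
       |     #include <netinet/in.h>
       |     #include <netinet/tcp.h>
       |     #include <sys/socket.h>
       |
       |     int set_nodelay(int fd) {
       |         int one = 1;
       |         return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
       |                           &one, sizeof one);
       |     }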
        
         | hinkley wrote:
         | What you really want is for the delay to be n microseconds, but
         | there's no good way to do that except putting your own user
         | space buffering in front of the system calls (user space works
         | better, unless you have something like io_uring amortizing
         | system call times)
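         |
         | A minimal sketch of that kind of userspace coalescing (the
         | 16 KB size and the names are made up; a timer or event-loop
         | tick would call wbuf_flush() every n microseconds):
         |
         |     #include <string.h>
         |     #include <sys/socket.h>
         |
         |     struct wbuf { int fd; size_t len; char data[16 * 1024]; };
         |
         |     void wbuf_flush(struct wbuf *b) {
         |         if (b->len > 0) {
         |             /* one syscall per batch, not per message */
         |             send(b->fd, b->data, b->len, 0);
         |             b->len = 0;
         |         }
         |     }
         |
         |     void wbuf_write(struct wbuf *b, const void *p, size_t n) {
         |         /* assumes n <= sizeof b->data */
         |         if (b->len + n > sizeof b->data)
         |             wbuf_flush(b);
         |         memcpy(b->data + b->len, p, n);
         |         b->len += n;
         |     }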
        
           | bobmcnamara wrote:
           | I'd rather have portable TCP_CORK
        
             | hinkley wrote:
             | Cork is probably how you'd implement this in userspace so
             | why not both?
        
           | mjevans wrote:
           | It'd probably be amazing how many poorly coded games would
           | work better if something like...
           | 
           | TCP_60FPSBUFFER
           | 
           | would wait for ~16 ms after the first packet is queued and
           | batch up the data stream.
        
             | dishsoap wrote:
             | Most games use UDP.
        
             | Chaosvex wrote:
             | Adding delay to multiplayer games? That's worse.
        
           | jnordwick wrote:
           | linux has auto-corking (and I know of no way to disable it)
           | that will do these short delays on small packets even if the
           | dev doesn't want it
        
         | Bluecobra wrote:
         | I agree. Disabling Nagle's algorithm has been fairly well known
         | in HFT/low-latency trading circles for quite some time now
         | (like > 15 years). It's one of the first things I look for.
        
           | Scubabear68 wrote:
           | I was setting TCP_NODELAY at Bear Stearns for custom
           | networking code circa 1994 or so.
        
             | kristjansson wrote:
             | This is why I love this place
        
           | mcoliver wrote:
           | Same in M&E / vfx
        
           | Reason077 wrote:
           | Surely serious HFT systems bypass TCP altogether nowadays. In
           | that world, every millisecond of latency can potentially cost
           | a lot of money.
           | 
           | These are the guys that use microwave links to connect to
           | exchanges because fibre-optics have too much latency.
        
         | nsguy wrote:
         | The logic is really for things like Telnet sessions. IIRC that
         | was the whole motivation.
        
           | bobmcnamara wrote:
           | And for block writes!
           | 
           | The Nagler turns a series of 4KB pages over TCP into a stream
           | of MTU sized packets, rather than a short packet aligned to
           | the end of each page.
        
         | nailer wrote:
         | You're right re: making delay explicit, but also crappy use the
         | space networking tools don't show whether no_delay is enabled
         | on sockets.
         | 
         | Last time I had to do some Linux stuff, maybe 10 years ago you
         | had to write a systemtap program. I guess it's eBPF now. But I
         | bet the userspace tools still suck.
        
           | nailer wrote:
           | > use the space
           | 
           | Userspace. Sorry, was using voice dictation.
        
         | Sebb767 wrote:
         | > It should be something that an engineer needs to be forced to
         | set while creating a socket, instead of letting the OS choose a
         | default.
         | 
         | If the intention is mostly to fix applications with bad
         | `write`-behavior, this would make setting TCP_DELAY a pretty
         | exotic option - you would need a software engineer smart enough
         | to know to set this option, yet not smart enough to distribute
         | their write-calls well and/or to write their own (probably
         | better fitting) application-specific version of Nagle's.
        
         | nh2 wrote:
         | Same here. I have a hobby that on any RPC framework I
         | encounter, I file a Github issue "did you think of TCP_NODELAY
         | or can this framework do only 20 calls per second?".
         | 
         | So far, it's found a bug every single time.
         | 
         | Some examples: https://cloud-
         | haskell.atlassian.net/browse/DP-108 or
         | https://github.com/agentm/curryer/issues/3
         | 
         | I disagree on the "not a good / bad option" though.
         | 
         | It's a kernel-side heuristic for "magically fixing" badly
         | behaved applications.
         | 
         | As the article states, no sensible application does 1-byte
         | network write() syscalls. Software that does that should be
         | fixed.
         | 
         | It makes sense only in the case when you are the kernel
         | sysadmin and somehow cannot fix the software that runs on the
         | machine, maybe for team-political reasons. I claim that's
         | pretty rare.
         | 
         | For all other cases, it makes sane software extra complicated:
         | You need to explicitly opt-out of odd magic that makes poorly-
         | written software have slightly more throughput, and that makes
         | correctly-written software have huge, surprising latency.
         | 
         | John Nagle says here and in linked threads that Delayed Acks
         | are even worse. I agree. But the Send/Send/Receive pattern
         | that Nagle's algorithm degrades is a totally valid and common
         | use case, including anything that does pipelined RPC over TCP.
         | 
         | Both Delayed Acks and Nagle's Algorithm should be opt-in, in my
         | opinion. It should be called TCP_DELAY, which you can opt into
         | if you can't be bothered to implement basic userspace buffering.
         | 
         | People shouldn't /need/ to know about these. Make the default
         | case be the unsurprising one.
        
           | carterschonwald wrote:
           | Oh hey! It's been a while how're you?!
        
           | a_t48 wrote:
           | Thanks for the reminder to set this on the new framework I'm
           | working on. :)
        
           | jandrese wrote:
           | The problem with making it opt in is that the point of the
           | protocol was to fix apps that, while they perform fine for
           | the developer on his LAN, would be hell on internet routers.
           | So the people who benefit are the ones who don't know what
           | they are doing and only use the defaults.
        
           | klabb3 wrote:
           | > As the article states, no sensible application does 1-byte
           | network write() syscalls. Software that does that should be
           | fixed.
           | 
           | Yes! And worse, those that _do_ are not gonna be "fixed" by
           | delays either. In this day and age with fast internets, a
           | syscall per byte will bottleneck the CPU way before it'll
           | saturate the network path. The CPU limit when I've been
           | tuning buffers has been somewhere in the 4k-32k range for
           | ~10 Gbps.
           | 
           | > Both Delayed Acks and Nagle's Algorithm should be opt-in,
           | in my opinion.
           | 
           | Agreed, it causes more problems than it solves and is very
           | outdated. Now, the challenge is rolling out such a change as
           | smoothly as possible, which requires coordination and a lot
           | of trivia knowledge of legacy systems. Migrations are never
           | trivial.
        
             | oefrha wrote:
             | I doubt the libc default in established systems can change
             | now, but newer languages and libraries can learn the lesson
             | and do the right thing. For instance, Go sets TCP_NODELAY
             | by default: https://news.ycombinator.com/item?id=34181846
        
           | pzs wrote:
           | "As the article states, no sensible application does 1-byte
           | network write() syscalls." - the problem that this flag was
           | meant to solve was that when a user was typing at a remote
           | terminal, which used to be a pretty common use case in the
           | 80's (think telnet), there was one byte available to send at
           | a time over a network with a bandwidth (and latency) severely
           | limited compared to today's networks. The user was happy to
           | see that the typed character arrived at the other side. This
           | problem is no longer significant, and the world has changed
           | so that this flag has become a common issue in many current
           | use cases.
           | 
           | Was terminal software poorly written? I don't feel
           | comfortable making such a judgement. It was designed for a
           | constrained environment with different priorities.
           | 
           | Anyway, I agree with the rest of your comment.
        
             | SoftTalker wrote:
             | > when a user was typing at a remote terminal, which used
             | to be a pretty common use case in the 80's
             | 
             | Still is for some. I'm probably working in a terminal on an
             | ssh connection to a remote system for 80% of my work day.
        
               | dgoldstein0 wrote:
               | sure, but we do so with much better networks than in the
               | 80s. The extra overhead is not going to matter when even
               | a bad network nowadays is measured in megabits per second
               | per user. The 80s had no such luxury.
        
               | underdeserver wrote:
               | If you're working on a distributed system, most of the
               | traffic is not going to be your SSH session though.
        
               | adgjlsfhk1 wrote:
               | the difference is that with kb/s speed, 40x of 10
               | characters per second overhead mattered. now, humans
               | aren't nearly fast enough to contest a network.
        
           | hgomersall wrote:
           | Would one not also get clobbered by all the sys calls for
           | doing many small packets? It feels like coalescing in
           | userspace is a much better strategy all round if that's
           | desired, but I'm not super experienced.
        
           | imp0cat wrote:
           | > I have a hobby that on any RPC framework I encounter, I
           | > file a Github issue "did you think of TCP_NODELAY or can
           | > this framework do only 20 calls per second?".
           | 
           | So true. Just last month we had to apply the TCP_NODELAY fix
           | to one of our libraries. :)
        
           | kazinator wrote:
           | It's very easy to end up with small writes. E.g.:
           |
           |     1. Write four bytes (the length of the frame)
           |     2. Write the frame itself
           |
           | The easiest fix in C code, with the least chance of
           | introducing a buffer overflow or bad performance, is to keep
           | these two pieces of information in separate buffers and use
           | writev. (How portable is that compared to send?)
           | 
           | If you have to combine the two into one flat frame, you're
           | looking at allocating and copying memory.
           | 
           | Linux has something called corking: you can "cork" a socket
           | (so that it doesn't transmit), write some stuff to it
           | multiple times and "uncork". It's extra syscalls though,
           | yuck.
           | 
           | You could use a buffered stream where you control flushes:
           | basically another copying layer.
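           |
           | A minimal sketch of the writev route for the length-prefixed
           | frame above (POSIX; a short write is still possible and would
           | need handling in real code):
           |
           |     #include <arpa/inet.h>
           |     #include <stdint.h>
           |     #include <sys/types.h>
           |     #include <sys/uio.h>
           |
           |     /* Header and body leave userspace in one syscall, so
           |      * the kernel never sees a lone 4-byte write. */
           |     ssize_t send_frame(int fd, const void *frame,
           |                        uint32_t len) {
           |         uint32_t hdr = htonl(len);  /* network byte order */
           |         struct iovec iov[2] = {
           |             { .iov_base = &hdr, .iov_len = sizeof hdr },
           |             { .iov_base = (void *)frame, .iov_len = len },
           |         };
           |         return writev(fd, iov, 2);
           |     }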
        
         | inopinatus wrote:
         | With some vendors you have to solve it as a policy problem,
         | via an LD_PRELOAD shim.
        
         | ww520 wrote:
         | Same here. My first job out of college was at a database
         | company. Queries at the client side of the client-server based
         | database were slow. It was thought the database server was slow
         | as hardware back then was pretty pathetic. I traced it down to
         | the network driver and found out the default setting of
         | TCP_NODELAY was off. I looked like a hero when turning on that
         | option and the db benchmarks jumped up.
        
         | pjc50 wrote:
         | > I feel like the logic behind it is sound, but it just doesn't
         | work for some workloads.
         | 
         | The logic is _only_ sound for interactive plaintext typing
         | workloads. It should have been turned off by default 20 years
         | ago, let alone now.
        
           | p_l wrote:
           | Remember that IPv4's original "target replacement date" (as
           | it was only an "experimental" protocol) was 1990...
           | 
           | And a common thing in many more complex/advanced protocols
           | was to explicitly delineate "messages", which avoids the
           | issue of Nagle's algorithm altogether.
        
         | immibis wrote:
         | Not when creating a socket - when sending data. When sending
         | data, you should indicate whether this data block prefers high
         | throughput or low latency.
        
         | kazinator wrote:
         | > _that an engineer needs to be forced to set while creating a
         | socket_
         | 
         | Because there aren't enough steps in setting up sockets! Haha.
         | 
         | I suspect that what would happen is that many of the
         | programming language run-times in the world which have easier-
         | to-use socket abstractions would pick a default and hide it
         | from the programmer, so as not to expose an extra step.
        
       | JoshTriplett wrote:
       | I do wish that TCP_NODELAY was the default, and there was a
       | TCP_DELAY option instead. That'd be a world in which people who
       | _want_ the batch-style behavior (optimizing for throughput and
       | fewer packets at the expense of latency) could still opt into it.
        
         | mzs wrote:
         | So do I, but I wish there was a new one, TCP_RTTDELAY. It would
         | take a byte specifying how many 128ths of the RTT you want to
         | use for Nagle, instead of one RTT or a full* buffer. 0 would be
         | the default, behaving as you and I prefer.
         | 
         | * "Given the vast amount of work a modern server can do in even
         | a few hundred microseconds, delaying sending data for even one
         | RTT isn't clearly a win."
         | 
         | I don't think that's such an issue anymore either: if the
         | server produces so much data that it fills the output buffer
         | quickly anyway, the data is sent immediately, before the delay
         | runs its course.
        
       | stonemetal12 wrote:
       | >To make a clearer case, let's turn back to the justification
       | behind Nagle's algorithm: amortizing the cost of headers and
       | avoiding that 40x overhead on single-byte packets. But does
       | anybody send single byte packets anymore?
       | 
       | That is a bit of a strawman. While he uses single-byte packets
       | as the worst-case example, the issue as stated is any packet
       | that isn't full.
        
       | somat wrote:
       | What about the opposite: disabling delayed acks?
       |
       | The problem is the pathological behavior when tinygram prevention
       | interacts with delayed acks. There is an exposed option to turn
       | off tinygram prevention (TCP_NODELAY); how would you turn off
       | delayed acks instead? Say, if you wanted to benchmark all four
       | combinations and see what works best.
       | 
       | doing a little research I found:
       | 
       | linux has the TCP_QUICKACK socket option but you have to set it
       | every time you receive. there is also
       | /proc/sys/net/ipv4/tcp_delack_min and
       | /proc/sys/net/ipv4/tcp_ato_min
       | 
       | freebsd has net.inet.tcp.delayed_ack and net.inet.tcp.delacktime
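       |
       | On Linux the re-arming looks something like this minimal sketch
       | (the option is not sticky, so it is set again after each
       | receive):
       |
       |     #include <netinet/in.h>
       |     #include <netinet/tcp.h>
       |     #include <sys/socket.h>
       |
       |     ssize_t recv_quickack(int fd, void *buf, size_t len) {
       |         ssize_t n = recv(fd, buf, len, 0);
       |         int one = 1;
       |         /* Re-enable TCP_QUICKACK; the kernel may have
       |          * cleared it, so set it on every receive. */
       |         setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK,
       |                    &one, sizeof one);
       |         return n;
       |     }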
        
         | Animats wrote:
         | > linux has the TCP_QUICKACK socket option but you have to set
         | it every time you receive
         | 
         | Right. What were they thinking? Why would you want it off only
         | some of the time?
        
         | batmanthehorse wrote:
         | In CentOS/RedHat you can add `quickack 1` to the end of a route
         | to tell it to disable delayed acks for that route.
        
           | rbjorklin wrote:
           | And with systemd >= 253 you can set it as part of the network
           | config to have it be applied automatically.
           | https://github.com/systemd/systemd/issues/25906
        
         | mjb wrote:
         | TCP_QUICKACK does fix the worst version of the problem, but
         | doesn't fix the entire problem. Nagle's algorithm will still
         | wait for up to one round-trip time before sending data (at
         | least as specified in the RFC), which is extra latency with
         | nearly no added value.
        
         | Culonavirus wrote:
         | Apparently you have time to "do a little research" but not to
         | read the entire article you're reacting to? It specifically
         | mentions TCP_QUICKACK.
        
       | benreesman wrote:
       | Nagle and no delay are like 90+% of the latency bugs I've
       | dealt with.
       | 
       | Two reasonable ideas that mix terribly in practice.
        
       | tempaskhn wrote:
       | Wow, never would have thought of that.
        
       | jedberg wrote:
       | This is an interesting thing that points out why abstraction
       | layers can be bad without proper message passing mechanisms.
       | 
       | This could be fixed if there was a way for the application at L7
       | to tell the TCP stack at L4 "hey, I'm an interactive shell so I
       | expect to have a lot of tiny packets, you should leave
       | TCP_NODELAY on for these packets" so that it can be off by
       | default but on for that application to reduce overhead.
       | 
       | Of course nowadays it's probably an unnecessary optimization
       | anyway, but back in '84 it would have been super handy.
        
         | Dylan16807 wrote:
         | "I'm an interactive shell so I expect to have a lot of tiny
         | packets" is what the delay is _for_. If you want to turn it off
         | for those, you should turn it off for everything.
         | 
         | (If you're worried about programs that buffer badly, then you
         | could compensate with a 1ms delay. But not this round trip
         | stuff.)
        
         | eru wrote:
         | The take-away I get is that abstraction layers (in the kernel)
         | can be bad.
         | 
         | Operating system kernels should enable secure multiplexing of
         | resources. Abstraction and portability should be done via
         | libraries.
         | 
         | See https://en.wikipedia.org/wiki/Exokernel
        
       | landswipe wrote:
       | Use UDP ;)
        
         | gafferongames wrote:
         | Bingo
        
         | hi-v-rocknroll wrote:
         | Too many applications end up reinventing TCP or SCTP in user-
         | space. Also, network-level QoS applied to unrecognized UDP
         | protocols typically means it gets throttled before TCP. Use UDP
         | when nothing else will work, when the use-case doesn't need a
         | persistent connection, and when no other messaging or transport
         | library is suitable.
        
       | epolanski wrote:
       | I was curious whether I had to change anything in my applications
       | after reading that so did a bit of research.
       | 
       | Both Node.js and curl have used TCP_NODELAY by default for a
       | long time.
        
         | Sammi wrote:
         | Nodejs enabled TCP_NODELAY by default in 2022 v.18.
         | 
         | PR: https://github.com/nodejs/node/pull/42163
         | 
         | Changelog entry:
         | https://github.com/nodejs/node/blob/main/doc/changelogs/CHAN...
        
           | epolanski wrote:
           | That's the HTTP module, but it was already moved to NODELAY
           | for the 'net' module in 2015.
        
       | gafferongames wrote:
       | It's always TCP_NODELAY. Except when it's head of line blocking,
       | then it's not.
        
       | kaoD wrote:
       | As a counterpoint, here's the story of how for me it wasn't
       | TCP_NODELAY: for some reason my Nodejs TCP service was taking a
       | few seconds to reply to my requests on localhost (Windows
       | machine). After the connection was established everything was
       | pretty normal but it consistently took a few seconds to establish
       | the connection.
       | 
       | I even downloaded netcat for Windows to go as bare-bones as
       | possible... and the exact same thing happened.
       | 
       | I rewrote a POC service in Rust and... oh wow, the same thing
       | happens.
       | 
       | It took me a very long time of not finding anything on the
       | internet (and getting yelled at in Stack Overflow, or rather one
       | of its sister sites) and painstakingly debugging (including
       | writing my own tiny client with tons of debug statements) until I
       | realized "localhost" was resolving first to IPv6 loopback in
       | Windows and, only after quietly timing out there (because I was
       | only listening on IPv4 loopback), it did try and instantly
       | connect through IPv4.
        
         | littlestymaar wrote:
         | I've seen this too, but luckily someone on the internet gave
         | me a pointer to the exact problem, so I didn't have to dig deep
         | to figure it out.
        
       | pandemicsyn wrote:
       | I was gonna say it's always LRO offload, but my experience is
       | dated.
        
       | 0xbadcafebee wrote:
       | The takeaway is odd. Clearly Nagle's Algorithm was an attempt at
       | batched writes. It doesn't matter what your hardware or network
       | or application or use-case or anything is; in some cases, batched
       | writes are better.
       | 
       | Lots of computing today uses batched writes. Network applications
       | benefit from it too. Newer higher-level protocols like QUIC do
       | batching of writes, effectively moving all of TCP's independent
       | connection and error handling into userspace, so the protocol can
       | move as much data into the application as fast as it can, and let
       | the application (rather than a host tcp/ip stack, router, etc)
       | worry about the connection and error handling of individual
       | streams.
       | 
       | Once our networks become saturated the way they were in the old
       | days, Nagle's algorithm will return in the form of a QUIC
       | modification, probably deeper in the application code, to wait to
       | send a QUIC packet until some criterion is reached. Everything in
       | technology is re-invented once either hardware or software
       | reaches a bottleneck (and they always will as their capabilities
       | don't grow at the same rate).
       | 
       | (the other case besides bandwidth where Nagle's algorithm is
       | useful is if you're saturating Packets Per Second (PPS) from tiny
       | packets)
        
         | Spivak wrote:
         | Yes but it seems this particular implementation is using a
         | heuristic for how to batch that made some assumptions that
         | didn't pan out.
        
         | p_l wrote:
         | The difference between QUIC and TCP is the original sin of TCP
         | (and its predecessor) - that of emulating an async serial port
         | connection, with no visible messaging layer.
         | 
         | It meant that you could use a physical teletypewriter to
         | connect to services (simplified description - slap a modem on a
         | serial port, dial into a TIP, write the host address and port
         | number, voila), but it also means that TCP has no idea of
         | message boundaries, and while you can push some of that
         | knowledge in now, the early software didn't.
         | 
         | In comparison, QUIC and many other non-TCP protocols (SCTP,
         | TP4) explicitly provide for messaging boundaries - your
         | interface to the system isn't based on emulated serial ports
         | but on _messages_ that might at most get reassembled.
        
           | utensil4778 wrote:
           | It's kind of incredible to think how many things in computers
           | and electronics turn out to just be a serial port.
           | 
           | One day, some future engineer is going to ask why their warp
           | core diagnostic port runs at 9600 8n1.
        
         | adgjlsfhk1 wrote:
         | batching needs to be application controlled rather than
         | protocol controlled. the protocol doesn't have enough context
         | to batch correctly.
        
       | ryjo wrote:
       | I just ran into this this week implementing a socket library in
       | CLIPS. I used Berkeley sockets, and before that I had only worked
       | with higher-level languages/frameworks that abstract a lot of
       | these concerns away. I was quite confused when Firefox would show
       | a "connection reset by peer." It didn't occur to me it could be
       | an issue "lower" in the stack. `tcpdump` helped me to observe the
       | port and I saw that the server never sent anything before my
       | application closed the connection.
        
       | AtNightWeCode wrote:
       | Agreed. Another thing along the same path is expect. Needs to be
       | disabled in many cloud services.
        
       | meisel wrote:
       | Is this something I should also adjust on my personal Ubuntu
       | machine for better network performance?
        
       | projectileboy wrote:
       | The real issue in modern data centers is TCP. Of course at
       | present, we need to know about these little annoyances at the
       | application layer, but what we really need is innovation in the
       | data center at level 4. And yes I know that many people are
       | looking into this and have been for years, but the economic
       | motivation clearly has not yet been strong enough. But that may
       | change if the public's appetite for LLM-based tooling causes data
       | centers to increase 10x (which seems likely).
        
       | maple3142 wrote:
       | Not sure if this is a bit off topic or not, but I recently
       | encountered a problem where my program was repeatedly calling
       | write on a socket in a loop that runs N times, each iteration
       | sending a few hundred bytes of data representing an
       | application-level message. The loop can be understood as sending
       | some "batched messages" to the server. After that, the program
       | tries to receive data from the server and do some processing.
       |
       | The problem is that if N is above a certain limit (e.g. 4), the
       | server reports an error saying that the data is truncated
       | somehow. I want to make N larger because the round-trip latency
       | is already high enough, so being blocked by this is pretty
       | annoying. Eventually, I found an answer on Stack Overflow saying
       | that setting TCP_NODELAY can fix this, and it actually, almost
       | magically, enabled me to increase N to a larger number like 64 or
       | 128 without causing issues. I'm still not sure why TCP_NODELAY
       | fixes this issue or why the problem happens in the first place.
        
         | blahgeek wrote:
         | > The problem is that if N is above a certain limit (e.g. 4),
         | > the server reports an error saying that the data is truncated
         | > somehow.
         | 
         | Maybe your server expects full application-level messages from
         | a single "recv" call? That is not correct. A message may be
         | split across multiple recv buffers.
        
         | gizmo686 wrote:
         | My guess would be that the server assumes that every call to
         | recv() terminates on a message boundary.
         | 
         | With TCP_NODELAY and small messages, this works out fine. Every
         | message is contained in a single packet, and the userspace
         | buffer being read into is large enough to contain it. As such,
         | whenever the kernel has any data to give to userspace, it has
         | an integer number of messages to give. Nothing requires the
         | kernel to respect that, but it will not go out of its way to
         | break it.
         | 
         | In contrast, without TCP_NODELAY, messages get concatenated and
         | then fragmented based on where packet boundaries occur. Now,
         | the natural end point for a call to recv() is not the message
         | boundary, but the packet boundary.
         | 
         | The server is supposed to see that it is in the middle of a
         | message, and make another call to recv() to get the rest of it;
         | but clearly it does not do that.
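         |
         | A minimal sketch of a receive loop that treats TCP as the byte
         | stream it is (a 4-byte length prefix and the names are
         | assumptions for illustration):
         |
         |     #include <arpa/inet.h>
         |     #include <stdint.h>
         |     #include <sys/socket.h>
         |
         |     /* Keep calling recv() until exactly n bytes arrive. */
         |     static int recv_full(int fd, void *buf, size_t n) {
         |         char *p = buf;
         |         while (n > 0) {
         |             ssize_t got = recv(fd, p, n, 0);
         |             if (got <= 0)
         |                 return -1;  /* error or peer closed */
         |             p += got;
         |             n -= (size_t)got;
         |         }
         |         return 0;
         |     }
         |
         |     /* Read one length-prefixed message into buf (cap bytes). */
         |     int recv_message(int fd, void *buf, uint32_t cap,
         |                      uint32_t *out_len) {
         |         uint32_t len;
         |         if (recv_full(fd, &len, sizeof len) < 0) return -1;
         |         len = ntohl(len);
         |         if (len > cap) return -1;  /* message too large */
         |         if (recv_full(fd, buf, len) < 0) return -1;
         |         *out_len = len;
         |         return 0;
         |     }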
        
           | caf wrote:
           | Otherwise known as the "TCP is a stream-based abstraction,
           | not a packet-based abstraction" bug.
           | 
           | A related one is failing to process the second of two
           | complete commands that happen to arrive in the same recv()
           | call.
        
             | lanstin wrote:
             | I find these bugs to be a sign that the app is not using a
             | good wrapper but just mostly gets lucky that the packet
             | isn't split randomly on the way.
        
       | ramblemonkey wrote:
       | What if we changed the kernel or tcp stack to hold on to the
       | packet for only a short time before sending it out. This could
       | allow you to balance the latency against the network cost of many
       | small packets. The tcp stack could even do it dynamically if
       | needed.
        
         | tucnak wrote:
         | Genius
        
       | Ono-Sendai wrote:
       | From my blog > 10 years ago but sadly still relevant: "Sockets
       | should have a flushHint() API call.":
       | https://forwardscattering.org/post/3
        
       | hi-v-rocknroll wrote:
       | Apropos repost from 2015:
       | 
       | > That still irks me. The real problem is not tinygram
       | prevention. It's ACK delays, and that stupid fixed timer. They
       | both went into TCP around the same time, but independently. I did
       | tinygram prevention (the Nagle algorithm) and Berkeley did
       | delayed ACKs, both in the early 1980s. The combination of the two
       | is awful. Unfortunately by the time I found about delayed ACKs, I
       | had changed jobs, was out of networking, and doing a product for
       | Autodesk on non-networked PCs.
       | 
       | > Delayed ACKs are a win only in certain circumstances - mostly
       | character echo for Telnet. (When Berkeley installed delayed ACKs,
       | they were doing a lot of Telnet from terminal concentrators in
       | student terminal rooms to host VAX machines doing the work. For
       | that particular situation, it made sense.) The delayed ACK timer
       | is scaled to expected human response time. A delayed ACK is a bet
       | that the other end will reply to what you just sent almost
       | immediately. Except for some RPC protocols, this is unlikely. So
       | the ACK delay mechanism loses the bet, over and over, delaying
       | the ACK, waiting for a packet on which the ACK can be
       | piggybacked, not getting it, and then sending the ACK, delayed.
       | There's nothing in TCP to automatically turn this off. However,
       | Linux (and I think Windows) now have a TCP_QUICKACK socket
       | option. Turn that on unless you have a very unusual application.
       | 
       | > Turning on TCP_NODELAY has similar effects, but can make
       | throughput worse for small writes. If you write a loop which
       | sends just a few bytes (worst case, one byte) to a socket with
       | "write()", and the Nagle algorithm is disabled with TCP_NODELAY,
       | each write becomes one IP packet. This increases traffic by a
       | factor of 40, with IP and TCP headers for each payload. Tinygram
       | prevention won't let you send a second packet if you have one in
       | flight, unless you have enough data to fill the maximum sized
       | packet. It accumulates bytes for one round trip time, then sends
       | everything in the queue. That's almost always what you want. If
       | you have TCP_NODELAY set, you need to be much more aware of
       | buffering and flushing issues.
       | 
       | > None of this matters for bulk one-way transfers, which is most
       | HTTP today. (I've never looked at the impact of this on the SSL
       | handshake, where it might matter.)
       | 
       | > Short version: set TCP_QUICKACK. If you find a case where that
       | makes things worse, let me know.
       | 
       | > John Nagle
       | 
       | (2015)
       | 
       | https://news.ycombinator.com/item?id=10608356
       | 
       | ---
       | 
       | Support platform survey:
       | 
       | TCP_QUICKACK: Linux (must be set again after every recv())
       | 
       | TCP_NODELAY: Linux, Apple, Windows, Solaris, FreeBSD, OpenBSD,
       | and NetBSD
       | 
       | References:
       | 
       | https://www.man7.org/linux/man-pages/man7/tcp.7.html
       | 
       | https://opensource.apple.com/source/xnu/xnu-1504.9.17/bsd/ne...
       | 
       | https://learn.microsoft.com/en-us/windows/win32/winsock/ippr...
       | 
       | https://docs.oracle.com/cd/E88353_01/html/E37851/esc-tcp-4p....
       | 
       | https://man.freebsd.org/cgi/man.cgi?query=tcp
       | 
       | https://man.openbsd.org/tcp
       | 
       | https://man.netbsd.org/NetBSD-8.0/tcp.4
        
       | resonious wrote:
       | ~15 years ago I played an MMO that was very real-time, and yet
       | all of the communication was TCP. Literally you'd click a button,
       | and you would not even see your action play out until a response
       | packet came back.
       | 
       | All of the kids playing this game (me included) eventually
       | figured out you could turn on TCP_NODELAY to make the game
       | buttery smooth - especially for those in California close to the
       | game servers.
        
         | jonathanlydall wrote:
         | Not sure if you're talking about WoW, but around that time ago
         | an update to the game did exactly this change (and possibly
         | more).
         | 
         | An interesting side-effect of this was that before the change
         | if something stalled the TCP stream, the game would hang for a
         | while then very quickly replay all the missed incoming events
         | (which was very often you being killed). After the change you'd
         | instead just be disconnected.
        
           | mst wrote:
           | I think I have a very vague memory of the "hang, hang, hang,
           | SURPRISE! You're dead" thing happening in Diablo II but it's
           | been so long I wouldn't bet on having remembered correctly.
        
       | trollied wrote:
       | Brings back memories. This was a big Sybase performance win back
       | in the day.
        
       | f1shy wrote:
       | If you are sooo worried about latency, maybe TCP is a bad choice
       | to start with... I hate to see people using TCP for everything,
       | without a minimal understanding of what problems TCP is meant to
       | solve, and especially which ones it doesn't.
        
         | pcai wrote:
         | TCP solves for "when i send a message i want the other side to
         | actually receive it" which is...fairly common
        
           | adgjlsfhk1 wrote:
           | tcp enforces a much stricter ordering than desirable (head of
           | line blocking). quic does a much better job of emulating a
           | stream of independent tasks.
        
       | kazinator wrote:
       | Here is the thing. Nagle's and the delayed ACK may suck for
       | individual app performance, but fewer packets on the network is
       | better for the entire network.
        
       ___________________________________________________________________
       (page generated 2024-05-10 23:01 UTC)